Collaborative Differential Evolution Filtering for Tracking Hand-Object Interactions

Human hands engage in interactive activities in many practical working scenarios, among which the interactions between human hands and objects are the most common. Tracking the movement of the human hand during hand-object interactions is an important research task that is also challenging due to the high-dimensionality and occlusions. In this paper, we track hand-object interactions from depth observations with a model-based method. To overcome the difficulties of optimum searching in the hand-object high-dimensional space, we propose a new algorithm — collaborative differential evolution filtering (CoDEF) — for tracking hand-object interactions. The proposed CoDEF algorithm integrates the differential evolution (DE) algorithm into a particle filtering (PF) framework to accelerate the convergence of particles. Particles are driven to the regions with a high probability by optimizing the matching error under the current observation with DE. To decompose the state space and decrease the complexity of optimum searching, CoDEF tracks the movement of the hand and object by using two collaborative trackers. Based on the proposed CoDEF algorithm, we develop a model-based tracking system with 3D graphic techniques. According to the experimental results, the proposed CoDEF algorithm can achieve robust tracking of hand-object interactions using fewer particles.


I. INTRODUCTION
Tracking the movement of the human hand is an important task in many applications, such as the perception of human grasping, movement capture for animation, and human-machine interfacing. In many practical working scenes, the human hand engages in interactive activities. Interactions between the human hand and objects are the most common. Therefore, it is important to track the movement of the human hand during hand-object interactions. Nevertheless, tracking hand-object interactions is limited by several complicated factors. First, it is a high-dimensional problem. Next, occlusions occur frequently during hand-object interactions, including hand-object mutual occlusions and self-occlusions of the hand. However, useful contextual information with the manipulated object can promote the recognition and estimation of human hand movement.
Currently, hand-object tracking methods based on vision can generally be divided into two types: appearance-based methods and model-based methods. Appearance-based methods [1]-[11] estimate hand-object poses directly from image features via a learned mapping. They require no initialization and have a quick tracking speed. However, accurate estimations of poses need a well-trained mapping. Kjellström et al. [1] proposed a method for recognizing the movement of the hand and the manipulated object by expressing their relationship with a conditional random field model. However, this method does not provide detailed information about the movement of the human hand. Romero et al. [2], [3] reconstructed the 3D gestures of the human hand that interacted with objects using a real-time nonparametric appearance-based method. The method searches for the hand pose that best matches the input image from a large template database with nearest-neighbor searching. Gupta et al. [4] proposed a Bayesian approach to integrate multiple perception tasks in human-object interactions. The method searches for consistent semantic expressions by applying space limitations to perception elements. This method not only allows for the recognition of the object and corresponding actions when their appearance cannot be completely distinguished, but it also allows for the recognition of the actions of the human body from static images. However, this method does not produce detailed information about body gestures. Yao and Fei-Fei [5], [6] applied a new random field model for the modeling of objects and body gestures. They estimated the degree of connection among objects, body gestures, and different parts of the human body through a structure learning method. The method calculates the parameters of the model using a new max-margin algorithm. Under this mode, object detection provides strong prior knowledge for the estimation of body gestures, and the estimation of body gestures helps the system conduct more accurate detections of objects interacting with the human body. However, this method only produces 2D estimates for body poses.
The associate editor coordinating the review of this manuscript and approving it for publication was Tomasz Trzcinski.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, some researchers [12]- [28] have introduced deep learning methods to estimate hand poses. Tompson et al. [12] trained a convolutional network to extract hand heat-map features from depth images. Then, they recovered hand poses from the heat-map representation with inverse kinematics. Ge et al. [13] acquired volumetric representations of hands from depth images. By using the volumetric representations as the input, they regressed the 3D hand joint locations by using a trained 3D convolutional neural network (3D CNN). However, these methods assume an isolated free-moving hand that is not interacting with objects.
Model-based methods [29]-[39] use prebuilt models to generate hypotheses. These methods compare the features extracted from the models with those extracted from visual observation and evaluate the similarity between them. They search for a set of hand-object state parameters that best matches the visual observation in the model state space using an optimization method. However, the tracking process involves a search task in a high-dimensional space, which is challenging. Moreover, the tracking needs to be initialized. Hamer et al. [30] searched for the optimal configuration of the hand states through belief propagation (BP). They connected different parts of the multijointed human hand through pairwise Markov random fields. However, they did not construct a model for the manipulated object. Oikonomidis et al. [31] regarded the hand-object tracking problem as a sequential optimization problem. They used particle swarm optimization (PSO) to search for the solution. Their system uses multiview RGB image sequences as the input. Kyriazis and Argyros [32] acquired the observation input using a depth camera and only searched for hand pose parameters. They deduced the object pose according to the hand pose and the hand-object interaction model. Zhang and Seah [33] performed a hybrid particle-based search that derives from PSO and differential evolution (DE) to track human body poses. They used a voxel model for the human body. Some researchers [40]-[42] have combined learning-based methods with model fitting for estimating hand poses. Sharp et al. [40] used a multilayered random forest to predict a hand pose distribution. Using the hand pose hypotheses sampled from the distribution for the initialization, they performed a model fitting process by minimizing the error between the hand model and the observation with PSO. Their method focused on tracking a single hand and failed when the hand was manipulating an object. Sridhar et al. [41] modeled the hand and object with Gaussian mixtures. They performed object segmentation using color information and then carried out hand part classification from the depth input with a multilayered random forest. By using the hand part classification for guidance, they tracked the hand manipulating an object with a 3D Gaussian mixture alignment method. However, the hand part classification did not perform well under situations of severe hand-object occlusions.
Many researchers have performed model-based tracking of the movement of the hand [43]- [46] or body [47], [48] by using a particle filtering (PF) framework. PF has the ability to express a multipeak distribution through the propagation of multiple samples along time. Nevertheless, the standard PF requires a large sample size, especially for high-dimensional problems such as hand movement tracking. A small sample set will lead to particle divergence and tracking failures. For this problem, many researchers [43]- [46] have tracked hand movement by combining optimization methods with PF. Based on a PF framework, the particles predicted by a dynamic model are used as the initial values, and an optimization method is then applied to optimize the particles and accelerate the convergence of the particle set. In a related work [45], Gaussian PSO is combined with PF to track hand-object interactions. However, since the segmented images include a small amount of forearm pixels adjacent to hand pixels, the estimated hand pose slides up and down the arm from frame to frame. Another related work [46] integrates DE into PF for tracking. However, the method considered an isolated hand that was not interacting with objects.
As in [45], we track hand-object interactions from depth observations with a model-based method under a PF framework. However, in this paper, the constructed 3D hand model includes a part of the forearm that can be scaled, enabling the observation model to explain the forearm pixels adjacent to hand pixels in segmented depth images. To accelerate the convergence of particles and improve the distribution of particle samples, we integrate DE into the PF framework to track hand-object interactions. By optimizing the matching error under the current observation with DE, particles are moved towards the regions with a high likelihood. However, due to the high dimensionality of the problem and the occlusions during hand-object interactions, there are many local optima around the global optimum in the hand-object space, so the optimum search remains challenging. To decrease the complexity of optimum searching, we track the movement of the human hand and the object using two collaborative trackers. The resulting new algorithm, collaborative differential evolution filtering (CoDEF), assigns one tracker to the hand and another tracker to the object. The two trackers exchange information frequently during the tracking process. Such a collaborative tracking scheme decomposes the state space with multiple trackers, decreasing the complexity of optimum searching. We develop a model-based tracking system based on the proposed CoDEF algorithm with 3D graphic techniques. The experiments demonstrate that CoDEF can achieve robust tracking of hand-object movement by using fewer particles. The main contributions of this paper are as follows:
• We propose a new algorithm, CoDEF, for tracking hand-object interactions. To overcome the difficulties of searching in the hand-object high-dimensional space, CoDEF integrates DE into the PF framework and applies two collaborative trackers for the hand and object.
• We construct a 3D hand model including a part of the forearm that can be scaled. In this way, we make the observation model able to explain the forearm pixels adjacent to hand pixels in segmented depth images.
• We develop a model-based prototype system for tracking hand-object interactions based on the proposed CoDEF algorithm with 3D graphic techniques.
The remainder of this paper is organized as follows. Since we track hand-object interactions with a model-based method, we first introduce the constructed hand-object models in Section II. Then, we describe the matching error function and observation model in Section III. In Section IV, we describe the proposed tracking algorithm, CoDEF. In Section V, we describe the model-based tracking system that we have developed based on CoDEF with 3D graphic techniques. Section VI provides the experimental results on real and synthetic data. Section VII presents the conclusion of this paper.

II. HAND-OBJECT MODEL
We track the hand-object interactions by using a model-based method. The human hand is an articulated object, and each joint of the hand has one or more degrees of freedom (DOFs) in rotation. From the application perspective, it is not necessary to capture the movement of all bones in the hand. Therefore, the kinematic modeling of some joints is usually simplified by some approximations. Lee and Kunii [49] introduced a 27-DOF model, which has been widely used. In this paper, we build a hand kinematics model that is similar to [49], which is shown in Fig. 1. However, different from [49], we model the MCP joint of the thumb with only 1 revolute DOF. In addition, since our model includes a part of the forearm, we add a wrist joint to the hand kinematics model. The resulting hand state vector $x_h$ covers 29 DOFs, including 6 DOFs for global hand motion, 20 DOFs for local finger motion, and 3 DOFs for the wrist joint. The CMC joints of all fingers are fixed. The movement of the palm corresponds to the 6 global DOFs of the human hand. Each finger is connected to the palm by a 2-DOF joint (1 flexion-extension DOF and 1 abduction-adduction DOF). In addition, each finger consists of three parts that are connected by two 1-DOF joints. These 1-DOF joints are only capable of flexion-extension motion. The wrist joint has 1 flexion-extension DOF, 1 abduction-adduction DOF, and 1 scaling DOF. We use human anatomy to establish the movement constraints of the finger joints and the wrist joint. The object state vector $x_o$ covers the 6 DOFs of the manipulated object. Using PTC Pro/Engineer and Multigen-Paradigm Creator, we build a unified 3D model for the human hand and the manipulated object with parametric geometric primitives. The model has local coordinates and DOF nodes for hand-object pose updating.
Moreover, the 3D hand model built in this paper involves a part of the forearm of the human body, which makes the model able to describe the forearm pixels adjacent to hand pixels in segmented depth images. The wrist joint has 1 scaling DOF, which makes the forearm model able to extend or retract. This paper mainly focuses on the interactions of the human hand with a sphere and the interactions with a cylinder. Fig. 2 shows the corresponding models. However, this method can also be used to track the interactions between the human hand and more complex shapes of objects.
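As a minimal illustration of the state layout described in this section, the following Python sketch assigns index ranges to the 29 hand DOFs and the 6 object DOFs. The ordering of the blocks inside the vector, the finger numbering, and all names (`GLOBAL`, `FINGERS`, `WRIST`, `finger_dofs`) are assumptions made for illustration only; the text specifies the DOF counts but not the storage order.

```python
from typing import List

# Index layout of the 29-DOF hand state vector described in the text.
# The ordering of the blocks is an assumption made for illustration.
GLOBAL = slice(0, 6)     # 6 DOFs: global hand (palm) motion
FINGERS = slice(6, 26)   # 20 DOFs: 5 fingers x (2-DOF base joint + two 1-DOF joints)
WRIST = slice(26, 29)    # 3 DOFs: flexion-extension, abduction-adduction, forearm scaling

HAND_DOFS = 29
OBJECT_DOFS = 6          # DOFs of the manipulated object


def finger_dofs(hand_state: List[float], finger: int) -> List[float]:
    """Return the 4 local DOFs of one finger (index 0..4, numbering is hypothetical)."""
    if len(hand_state) != HAND_DOFS:
        raise ValueError("hand state must have 29 entries")
    if not 0 <= finger < 5:
        raise ValueError("finger index must be in 0..4")
    start = FINGERS.start + 4 * finger
    return hand_state[start:start + 4]
```

Keeping the hand and object states in separate vectors mirrors the decomposition that the two collaborative trackers rely on in Section IV.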

III. OBSERVATION MODEL
In this paper, we construct a matching error function and an observation likelihood function to evaluate the hand-object hypotheses. The hand-object foreground regions are segmented by a simple threshold from the current depth observation $z$, generating a depth image $z_d(z)$. Given a hand-object pose hypothesis $x_{h-o}$, a depth image $r_d(x_{h-o})$ is generated correspondingly with graphic rendering techniques by a virtual depth camera under the given calibration. Then, two silhouette images, $z_s(z)$ and $r_s(x_{h-o})$, are obtained for the observation and the hypothesis, respectively. By comparing the features extracted from the hypotheses with those extracted from visual observation, a matching error function $E(z, x_{h-o})$ is defined that combines a depth term $E_d$, a silhouette term $E_s$ and a penalty term $E_h$, with $\lambda_d$, $\lambda_s$ and $\lambda_h$ as the normalization factors. $E_d$ measures the depth differences between the pose hypothesis $x_{h-o}$ and the observation $z$. The pixelwise depth differences are calculated and accumulated over the whole image. The accumulated sum is normalized by dividing by the total pixel area of the hand and the manipulated object. A large depth difference at a few pixels would cause large changes in the function value and thus degrade the performance of the search method. For this reason, a maximum constant $T_d$ for depth differences is introduced, and the depth differences of all pixels are limited to the range $[0, T_d]$. $E_s$ describes the incompatibility of the silhouette images based on the area of the nonoverlapping regions between $z_s(z)$ and $r_s(x_{h-o})$. The first part of $E_s$ calculates the pixel area that belongs to $z_s(z)$ but does not belong to $r_s(x_{h-o})$, whereas the second part calculates the pixel area that belongs to $r_s(x_{h-o})$ but does not belong to $z_s(z)$. Both parts are normalized independently.
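One plausible form of the matching error that is consistent with the description above can be written as follows. This is a sketch under stated assumptions, not necessarily the exact definition used in this paper: the weighted-sum combination, the pixel index $p$, the normalizing area $A$ (the total pixel area of the hand and the object, whose precise definition is left open here), and the treatment of the silhouettes as binary masks are all assumptions.

```latex
E(z, x_{h-o}) = \lambda_d \, E_d(z, x_{h-o}) + \lambda_s \, E_s(z, x_{h-o}) + \lambda_h \, E_h(x_h)

E_d(z, x_{h-o}) = \frac{1}{A} \sum_{p} \min\left( \left| z_d(p) - r_d(p) \right|,\; T_d \right)

E_s(z, x_{h-o}) = \frac{\sum_{p} z_s(p)\,\bigl(1 - r_s(p)\bigr)}{\sum_{p} z_s(p)}
                + \frac{\sum_{p} r_s(p)\,\bigl(1 - z_s(p)\bigr)}{\sum_{p} r_s(p)}
```

Here $z_d(p)$ and $r_d(p)$ abbreviate the observed and rendered depth at pixel $p$, and $z_s(p), r_s(p) \in \{0, 1\}$ the corresponding silhouette values. The clamp by $T_d$ keeps every per-pixel contribution within $[0, T_d]$, and the two fractions in $E_s$ are the independently normalized nonoverlapping areas.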
To penalize the mutual penetration of adjacent fingers, the matching error function involves an additional prior part, the penalty term $E_h(x_h)$. It is computed over the set $J$ of three pairs of adjacent fingers, excluding the thumb, where $\phi$ refers to the difference between the abduction-adduction angles of the MCP joints of a pair of adjacent fingers in the hand pose hypothesis $x_h$. The observation likelihood function is then defined from the matching error, with $\lambda_e$ as a normalization factor.
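The depth and silhouette terms described in this section can be sketched in plain Python on small depth images. The sketch below is illustrative only. It assumes that background pixels are marked with `None`, that the depth term is normalized by the union of the two silhouettes, that a pixel covered by only one image contributes the maximum difference `t_d`, and that the likelihood decays exponentially with the matching error. None of these choices is confirmed by the text, and the finger penalty term is omitted.

```python
import math
from typing import List, Optional

Depth = List[List[Optional[float]]]  # None marks a background pixel


def matching_error(z_d: Depth, r_d: Depth, t_d: float = 40.0,
                   lam_d: float = 1.0, lam_s: float = 1.0) -> float:
    """Clamped depth term plus silhouette term (finger penalty omitted)."""
    depth_sum = 0.0
    union = obs_area = ren_area = obs_only = ren_only = 0
    for z_row, r_row in zip(z_d, r_d):
        for z, r in zip(z_row, r_row):
            in_obs, in_ren = z is not None, r is not None
            obs_area += in_obs
            ren_area += in_ren
            union += in_obs or in_ren
            obs_only += in_obs and not in_ren
            ren_only += in_ren and not in_obs
            if in_obs and in_ren:
                # Depth difference limited to the range [0, t_d].
                depth_sum += min(abs(z - r), t_d)
            elif in_obs or in_ren:
                # Assumption: a pixel covered by one image only counts as t_d.
                depth_sum += t_d
    e_d = depth_sum / max(union, 1)
    # Two nonoverlapping areas, each normalized independently.
    e_s = obs_only / max(obs_area, 1) + ren_only / max(ren_area, 1)
    return lam_d * e_d + lam_s * e_s


def likelihood(error: float, lam_e: float = 1.0) -> float:
    """Assumed exponential form: a smaller matching error gives a larger likelihood."""
    return math.exp(-lam_e * error)
```

A hypothesis that reproduces the observation exactly yields a matching error of zero and therefore the maximum likelihood under this assumed form.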

IV. THE TRACKING ALGORITHM
We propose a new tracking algorithm, collaborative differential evolution filtering (CoDEF), for tracking hand-object interactions. CoDEF integrates the differential evolution (DE) algorithm into a particle filtering (PF) framework. The distribution of the PF samples is improved by optimizing the matching error under the current observation with DE. In addition, CoDEF uses two collaborative trackers to track the movement of the hand and object. In this way, the hand-object space is decomposed and the complexity of the optimum searching is decreased.

A. PARTICLE FILTERING
Particle filtering (PF) can express a multipeak distribution through the propagation of multiple samples along time [50]. The basic idea of PF can be summarized as follows.
According to the particle samples $\{(x^i_{t-1}, w^i_{t-1})\}_{i=1}^N$ that represent the posterior probability distribution of time $t-1$, PF generates a new weighted particle set $\{(x^i_t, w^i_t)\}_{i=1}^N$ to represent the posterior probability distribution of time $t$, by using the transition prior $p(x_t | x_{t-1})$ and the observation likelihood $p(z_t | x_t)$. $x^i_t$ denotes the $i$-th sampled state particle at time $t$, and $w^i_t$ denotes its weight. However, the transition prior, which ignores the latest observation value $z_t$, is used as the importance distribution. Therefore, the importance sampling process of particles is suboptimal. For PF, a small sample set will lead to particle divergence and tracking failures. To address this problem, an optimization method is often introduced into the PF framework to accelerate the convergence of the particles.
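The basic PF step summarized above (resample, predict with the transition prior, reweight with the observation likelihood) can be sketched as follows. This is a generic bootstrap filter step written for illustration, not the implementation used in this paper. The function name `pf_step`, the multinomial resampling through `random.choices`, and the uniform fallback for vanishing likelihoods are assumptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = List[float]


def pf_step(particles: Sequence[State], weights: Sequence[float],
            transition: Callable[[State, random.Random], State],
            likelihood: Callable[[State], float],
            rng: random.Random) -> Tuple[List[State], List[float]]:
    """One bootstrap PF step: resample, predict with the prior, reweight."""
    n = len(particles)
    # Resample particle indices in proportion to the previous weights.
    picked = rng.choices(range(n), weights=weights, k=n)
    # Predict: draw each new particle from the transition prior p(x_t | x_{t-1}).
    predicted = [transition(list(particles[i]), rng) for i in picked]
    # Update: weights are proportional to the observation likelihood p(z_t | x_t).
    raw = [likelihood(x) for x in predicted]
    total = sum(raw)
    if total <= 0.0:
        new_weights = [1.0 / n] * n  # degenerate case: fall back to uniform weights
    else:
        new_weights = [w / total for w in raw]
    return predicted, new_weights
```

Because the prediction step never looks at the current observation, many predicted particles can land in low-likelihood regions when the sample set is small, which is the weakness that the DE optimization step is meant to address.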

B. OPTIMIZATION WITH DIFFERENTIAL EVOLUTION
In this paper, we use the differential evolution (DE) algorithm to optimize the matching error. DE is an efficient swarm intelligence optimization algorithm for nonlinear and nondifferentiable objective functions [51]. After initialization, DE searches for the global optimal solution in a continuous space through iterative evolutions of a population of $N$ $D$-dimensional individuals $\{x^i_g\}_{i=1}^N$, where $g$ denotes the generation index. Population evolution is completed through mutation, crossover, and selection. Mutation and crossover are used to generate new candidates, whereas selection is used to determine whether a new candidate survives into the next generation.
During mutation, DE selects three different individuals randomly from the previous generation for each individual index $i$ of the population, which are combined to generate a mutant individual:
$$v^i_{g+1} = x^{r_1}_g + F \cdot (x^{r_2}_g - x^{r_3}_g) \tag{6}$$
where the individual indexes $r_1$, $r_2$ and $r_3$ are selected randomly from $\{1, 2, \dots, N\}$. These three individual indexes are different from each other and different from $i$. $F$ is the scaling factor of the differential vector $(x^{r_2}_g - x^{r_3}_g)$, and it controls the convergence speed during the search process. The scaling factor $F$ of the standard DE algorithm is constant. To improve the convergence of the algorithm, in this paper, $F$ is adjusted on each dimension by using a "jitter" [52] factor. Therefore, $F = F_C \cdot N(0, 1)$, where $F_C$ is a constant and $N(0, 1)$ is a Gaussian random number that has a mean of 0 and a variance of 1. In this paper, $F_C$ is set to 0.5.
Then, a candidate $u^i_{g+1} = \{u^{j,i}_{g+1}\}_{j=1}^D$ is generated by combining the mutant individual $v^i_{g+1}$ and the old individual $x^i_g$ through the crossover operation:
$$u^{j,i}_{g+1} = \begin{cases} v^{j,i}_{g+1}, & \text{if } rand_j \le CR \text{ or } j = r^i_{g+1} \\ x^{j,i}_g, & \text{otherwise} \end{cases} \tag{7}$$
where $rand_j \sim U(0, 1)$ is a random number that follows a uniform distribution over the interval $[0, 1]$. The crossover parameter $CR$ determines the probability for each element in a candidate to inherit from the mutant individual. In this paper, $CR$ is set to 0.9. $r^i_{g+1}$ is a random index drawn from $\{1, 2, \dots, D\}$, which ensures that the candidate takes at least one element from the mutant individual.
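After the mutation and crossover operations are completed, a one-to-one greedy selection operation is conducted:
$$x^i_{g+1} = \begin{cases} u^i_{g+1}, & \text{if } f(u^i_{g+1}) < f(x^i_g) \\ x^i_g, & \text{otherwise} \end{cases} \tag{8}$$
where $f$ denotes the objective function to be minimized. The generated candidate $u^i_{g+1}$ and the old individual $x^i_g$ are compared to determine which one should be retained in the next generation. If $u^i_{g+1}$ has a better objective function value than $x^i_g$, it will replace $x^i_g$ in the next generation. Otherwise, $x^i_g$ is retained. In summary, the DE algorithm initializes the population and then repeats the mutation, crossover, and selection operations for a given number of generations.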
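The mutation, crossover, and selection operations described in this subsection can be put together in a short Python sketch. It is a generic DE/rand/1/bin loop with the per-dimension jitter $F = F_C \cdot N(0, 1)$ and the parameter values stated in the text ($F_C = 0.5$, $CR = 0.9$). The function name `de_minimize`, the synchronous generation update, and the absence of bound handling are assumptions made for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]


def de_minimize(objective: Callable[[Vector], float], population: List[Vector],
                generations: int, rng: random.Random,
                f_c: float = 0.5, cr: float = 0.9) -> List[Vector]:
    """DE/rand/1/bin with per-dimension jitter on F, minimizing `objective`."""
    pop = [list(x) for x in population]
    cost = [objective(x) for x in pop]
    n, d = len(pop), len(pop[0])
    if n < 4:
        raise ValueError("DE/rand/1 needs at least 4 individuals")
    for _ in range(generations):
        new_pop, new_cost = [], []
        for i in range(n):
            # Mutation: three distinct indices, all different from i.
            r1, r2, r3 = rng.sample([k for k in range(n) if k != i], 3)
            mutant = [pop[r1][j]
                      + f_c * rng.gauss(0.0, 1.0) * (pop[r2][j] - pop[r3][j])
                      for j in range(d)]
            # Binomial crossover: element j_rand always comes from the mutant.
            j_rand = rng.randrange(d)
            trial = [mutant[j] if (rng.random() <= cr or j == j_rand) else pop[i][j]
                     for j in range(d)]
            # Greedy one-to-one selection between the trial and the old individual.
            trial_cost = objective(trial)
            if trial_cost < cost[i]:
                new_pop.append(trial)
                new_cost.append(trial_cost)
            else:
                new_pop.append(pop[i])
                new_cost.append(cost[i])
        pop, cost = new_pop, new_cost
    return pop
```

Because selection is greedy, the objective value of each individual can never get worse from one generation to the next, which is what makes DE usable as a refinement step on top of predicted particles.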

C. COLLABORATIVE DIFFERENTIAL EVOLUTION FILTERING
We integrate the DE algorithm into the PF framework for tracking hand-object interactions. After the new positions of the particles are predicted, the DE algorithm is carried out to conduct the iterative evolution of the particles, by using the matching error function under the latest observation z t as the objective function. Particles are moved to regions with higher observation likelihoods in the state space via DE. The particle optimization process can be regarded as an importance sampling process, whereas the new particle swarm after optimization can be regarded as an approximation of the optimal importance distribution p(x t |x t−1 , z t ) [50]. The optimization process based on DE improves the distribution of PF samples and accelerates the convergence of the particle set, thus enabling robust hand-object tracking using fewer particles.
As Equation (9) shows, the transition prior $p(x_t | x_{t-1})$ is defined as a first-order dynamics model to propagate particles along time:
$$x^i_{t,0} = x^i_{t-1,G} + r^i_{t-1} \tag{9}$$
where $r^i_{t-1}$ is a Gaussian random number. $\{x^i_{t-1,G}\}_{i=1}^N$ are the final positions gained from particle convergence after $G$ generations of iterative optimization via DE at time $t-1$. The newly obtained particle set $\{x^i_{t,0}\}_{i=1}^N$ is used to initialize the DE population at time $t$. The improved algorithm, differential evolution filtering (DEF), is summarized as follows.
For time $t > 0$:
(1) Resampling: Particles are resampled from the weighted particle set of time $t-1$.
(2) Prediction: According to Equation (9), the position of each particle at time $t$ is predicted from its position at time $t-1$, thus obtaining a new particle set $\{(x^i_{t,0}, \frac{1}{N})\}_{i=1}^N$.
(3) Optimization: Using the matching error function under the latest observation $z_t$ as the objective function, run the DE algorithm to optimize $\{(x^i_{t,0}, \frac{1}{N})\}_{i=1}^N$.
(4) Weight updating: The particle weight $w^i_t \propto p(z_t | x^i_t)$ is updated according to the observation likelihood $p(z_t | x^i_t)$, and a weighted particle set $\{(x^i_t, w^i_t)\}_{i=1}^N$ is obtained.
(5) State estimation: Output the estimate of the system state by using the maximum a posteriori criterion.
In this paper, two collaborative DEF trackers are applied for hand-object movement tracking, which yields a new algorithm, collaborative differential evolution filtering (CoDEF). The proposed CoDEF algorithm assigns one tracker to the hand and one to the object to track the hand pose $x_h$ and the object pose $x_o$. The two trackers are not independent of each other, and they exchange information frequently during the tracking process. The hand tracker regards the object pose $x_o$ as static during the iterative optimization of the hand pose $x_h$ at the current frame, while $x_o$ is determined by the tracking result of the object tracker for the previous frame. The object tracker regards the hand pose $x_h$ as static during the iterative optimization of the object pose $x_o$ at the current frame, while $x_h$ is determined by the tracking result of the hand tracker for the previous frame. As soon as one tracker gains the solution for the current frame, the solution is transmitted to the other tracker, and the corresponding pose values are kept static during the iterative optimization for the next frame by the other tracker. Such a collaborative tracking scheme not only models occlusions between the hand and object, but it also decomposes the unified state space with multiple trackers, decreasing the complexity of optimum searching.
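The collaborative scheme can be illustrated with a simplified Python sketch in which each tracker refines its own pose with DE while the other pose is frozen at the previous frame's estimate. This is a sketch under assumptions, not the authors' implementation. It omits resampling and weight updating, seeds each population with Gaussian noise around the previous estimate, and uses hypothetical names (`de_refine`, `codef_frame`, `error`, `sigma`). The particle counts and the number of generations default to the values reported in the experiments.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def de_refine(cost: Callable[[Vector], float], pop: List[Vector],
              generations: int, rng: random.Random,
              f_c: float = 0.5, cr: float = 0.9) -> Vector:
    """Compact DE/rand/1/bin with jitter; returns the best individual found."""
    pop = [list(x) for x in pop]
    vals = [cost(x) for x in pop]
    n, d = len(pop), len(pop[0])
    for _ in range(generations):
        nxt, nxt_vals = [], []
        for i in range(n):
            r1, r2, r3 = rng.sample([k for k in range(n) if k != i], 3)
            j_rand = rng.randrange(d)
            trial = list(pop[i])
            for j in range(d):
                if rng.random() <= cr or j == j_rand:
                    f = f_c * rng.gauss(0.0, 1.0)  # per-dimension jitter
                    trial[j] = pop[r1][j] + f * (pop[r2][j] - pop[r3][j])
            v = cost(trial)
            keep_trial = v < vals[i]  # greedy one-to-one selection
            nxt.append(trial if keep_trial else pop[i])
            nxt_vals.append(v if keep_trial else vals[i])
        pop, vals = nxt, nxt_vals
    return pop[vals.index(min(vals))]


def codef_frame(error: Callable[[Vector, Vector], float],
                prev_hand: Vector, prev_obj: Vector, rng: random.Random,
                n_hand: int = 32, n_obj: int = 8,
                generations: int = 60, sigma: float = 0.1) -> Tuple[Vector, Vector]:
    """Track one frame: each tracker optimizes its own pose while the other
    pose is held fixed at the previous frame's estimate."""
    def spread(center: Vector, n: int) -> List[Vector]:
        # Prediction: previous estimate plus Gaussian noise (first-order model).
        return [[c + rng.gauss(0.0, sigma) for c in center] for _ in range(n)]

    hand = de_refine(lambda h: error(h, prev_obj), spread(prev_hand, n_hand),
                     generations, rng)
    obj = de_refine(lambda o: error(prev_hand, o), spread(prev_obj, n_obj),
                    generations, rng)
    return hand, obj
```

The hand search runs in the 29-dimensional hand space and the object search in the 6-dimensional object space, instead of one search in the joint 35-dimensional space.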

V. DEVELOPMENT OF THE TRACKING SYSTEM
We develop a prototype system for tracking hand-object interactions using the proposed CoDEF algorithm with the graphic rendering engine OpenSceneGraph (OSG, http://www.openscenegraph.org/). A prebuilt 3D hand-object model with DOF nodes is loaded into OSG. During the tracking process, the movement of the hand and the object is controlled by using osgSim::DOFTransform nodes. The depth images of the hand-object model are generated by OSG off-screen rendering and are then compared with the observed images to calculate the matching errors and observation likelihood values for different particles. The state parameters for the minimum matching error are searched for within the hand space and the object space using the CoDEF algorithm.
OSG organizes spatial data in a scene graph tree for efficient graphic rendering. Headed by a root node on the top, the scene graph tree is composed of many group nodes and leaf nodes. The group nodes organize the geometries and their rendering states in a scene, whereas leaf nodes contain the actual geometric data for rendering. As an object-oriented rendering engine, OSG provides various group node types by using inheritance, such as transform nodes and camera nodes, allowing for many different functionalities. In our system, we create a camera node to render the hand-object pose hypotheses into depth images for matching error calculations. The camera node has a child, the hand-object model node, which is created by reading the corresponding model file.
In addition, to allow for off-screen rendering, we connect a buffer object with the camera. Then, the hand-object model will be rendered onto the buffer object by the virtual camera per OSG frame. For each rendered frame, OSG performs three traversals: the update, cull and draw traversals. In the update traversal step, updates are made to the scene graph to enable dynamic scenes. Our system updates the model poses with a callback object (NodeCallback) assigned to the model node in this traversal. In the cull traversal step, OSG tests the bounding volumes of all nodes and culls the nodes that are not in the view. For our system, no special operations are added to this traversal. In the draw traversal step, OSG traverses the list of geometries created by the cull traversal and invokes drawing commands to render the geometries. In our system, for each OSG frame, after the pose-updated model is rendered into a depth image by the virtual camera, the matching error for the new pose hypothesis is calculated in this traversal through a callback object (DrawCallback) assigned to the camera.
This system calculates new hand-object pose parameters iteratively using the CoDEF algorithm. As shown in Fig. 3, each iteration renders an OSG frame to calculate the matching error of the new candidate. By default, OSG frames are rendered in multithreaded mode. In this mode, a thread is assigned to each camera and each graphics context. The cull and draw traversals are conducted in the threads of the cameras and graphics contexts, respectively. Before the current frame finishes drawing in the graphics context threads, the update traversal and cull traversal of the next frame will be started. To avoid data conflicts among different threads, our system uses the Win32 SetEvent() and WaitForSingleObject() functions for synchronization and communication among threads. When the matching error has been calculated, a signal is sent to the main thread by an event object. When this event signal is received, the system calculates the weight of the new candidate particle based on its matching error in the main thread. Then, a selection operation is conducted to decide whether the old individual or the new candidate will be retained. After a fixed number of iterations, the system combines the best hand and object poses attained respectively by the two populations as the solution.
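The event-based handshake between the main thread and the rendering thread can be illustrated with a small Python analogue. Here `threading.Event` stands in for the Win32 event objects (`set()` plays the role of SetEvent() and `wait()` the role of WaitForSingleObject()). The function `evaluate_with_handshake`, the shared `box` dictionary, and the `render_error` callback are hypothetical and do not correspond to the system's actual code.

```python
import threading
from typing import Callable, List


def evaluate_with_handshake(candidates: List[float],
                            render_error: Callable[[float], float]) -> List[float]:
    """Main thread posts a candidate; a render thread computes its error and signals back."""
    request = threading.Event()   # main -> render: a new candidate is ready
    done = threading.Event()      # render -> main: the matching error is ready
    box = {"candidate": None, "error": None, "stop": False}

    def render_loop() -> None:
        while True:
            request.wait()        # block until the main thread signals
            request.clear()
            if box["stop"]:
                return
            box["error"] = render_error(box["candidate"])
            done.set()            # signal that the matching error is available

    worker = threading.Thread(target=render_loop)
    worker.start()
    errors = []
    for c in candidates:
        box["candidate"] = c
        request.set()
        done.wait()               # wait for the render thread's result
        done.clear()
        errors.append(box["error"])
    box["stop"] = True
    request.set()                 # wake the render thread so it can exit
    worker.join()
    return errors
```

The main thread only reads the shared result after the completion event has been signaled, which is the property that prevents the data conflicts mentioned in the text.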

VI. EXPERIMENTS
The effectiveness of the tracking method is verified by experiments on real sequences and synthetic sequences. The tracking is initialized manually by putting the real hand and object in their initial positions at the first input frame. In all experiments, the proposed CoDEF algorithm applies 32 particles for the hand tracker and 8 particles for the object tracker. For each input frame, the DE algorithm of each tracker conducts 60 generations of iterative optimization. The experiments are carried out on a PC with a quad-core Core i5 2.9 GHz CPU, 8 GB of memory, and an Nvidia GTX 950M GPU. Tracking one input frame takes 5 s on average.

A. EXPERIMENTS ON REAL IMAGES
We use depth images, which are captured from a Kinect 1.0 sensor with the Microsoft Kinect 1.0 Beta2 SDK, as the observation input. The image resolution and frame rate are 640 × 480 and 30 fps, respectively. Two depth image sequences have been acquired. The first one shows a hand grasping and manipulating a sphere, whereas the second   one shows a hand grasping and manipulating a cylinder. Both of the sequences consist of 270 frames. Experiments are conducted on the two real sequences to evaluate the proposed CoDEF algorithm. We compare CoDEF with two algorithms: another improved PF algorithm with DE operators (DEPF) [46] and a hybrid particle-based search (HPS) algorithm that derives from PSO and DE [33]. Both DEPF and HPS track in the hand-object joint pose space. In all experiments, both DEPF and HPS apply 40 particles and run for 60 generations for each input frame. The configuration of these parameters allows for a fair comparison among the three algorithms, since for each input frame, the three algorithms calculate the same numbers of matching errors.
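The fairness argument can be checked with a short calculation. Assuming that every particle is evaluated once per generation and ignoring the evaluation of the initial population (both are simplifying assumptions), the number of matching error evaluations per input frame is:

```latex
\underbrace{(32 + 8)}_{\text{CoDEF: hand + object particles}} \times 60 = 2400
\qquad \text{and} \qquad
\underbrace{40}_{\text{DEPF or HPS particles}} \times 60 = 2400
```

Under these assumptions, the three algorithms spend the same rendering budget per frame, so differences in accuracy can be attributed to the search strategy rather than to the number of evaluations.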
The matching error values attained by the three tracking algorithms are plotted in Fig. 4. It can be seen from Fig. 4 that both CoDEF and DEPF outperform HPS on the two real sequences. CoDEF and DEPF have nearly equal performance in terms of matching errors. Then, we compare the validity of the estimates attained by CoDEF and DEPF by reconstructing the hand-object poses with the estimates. Fig. 5 and Fig. 6 show the 3D reconstruction of the results of CoDEF and DEPF on the real sequences. In Fig. 5 and Fig. 6, the first and second rows show the RGB images and depth images, respectively, which are acquired by the Kinect sensor. Here, the depth images have been segmented by a simple depth threshold. The third row shows the tracking results of DEPF. The fourth row shows the tracking results of the proposed CoDEF algorithm. Although CoDEF and DEPF have nearly equal performance in terms of matching errors, the 3D reconstruction of the tracking results shows that CoDEF actually performs better than DEPF. Especially for the real hand-cylinder sequence, when severe occlusions happen, DEPF cannot achieve accurate tracking, whereas CoDEF still tracks hand-object movement correctly.

B. EXPERIMENTS ON SYNTHETIC DATA
We conduct a quantitative evaluation of the proposed CoDEF algorithm based on synthetic depth images, since ground truth pose data are hard to acquire from real images. The synthetic images are rendered using the 3D hand-object models. In addition, the movement of the hand-object models is defined by the tracking results of CoDEF on the two real sequences. Therefore, for these synthetic sequences, the CoDEF tracking results on the real sequences are actually the ground truth values. The resulting two synthetic sequences both consist of 270 frames. By using synthetic data as the observation, experiments are carried out to evaluate the CoDEF tracking algorithm. Table 1 shows the tracking results. Comparisons between the estimates of the proposed CoDEF algorithm and the corresponding ground truth values on some parameters are shown in Fig. 7 and Fig. 8. Table 2 and Table 3 show the mean errors of the estimated parameters on the sequence and the corresponding standard deviations. The results show that the parameters estimated by CoDEF can follow the changes of the ground truth values along the sequence.

VII. CONCLUSION
In this paper, we propose an improved PF algorithm, CoDEF, to track hand-object interactions. We construct hand-object models with geometric primitives and establish an observation model with depth observation. The proposed CoDEF algorithm integrates the DE algorithm into the PF framework. By optimizing the matching error with DE under the current observation, the PF sampling process is improved and the particles are moved towards the areas with a high probability. In addition, CoDEF tracks the movement of the hand and object by using two collaborative trackers. In this way, the hand-object space is decomposed and the complexity of optimum searching is decreased. We develop a prototype system using the proposed CoDEF algorithm with 3D graphic techniques. Experiments demonstrate that the proposed algorithm can achieve robust tracking of hand-object movement using fewer particles.
Since the proposed method is model-based, the tracking needs to be initialized, which is performed manually by putting the real hand and object in their initial positions at the first input frame. To make the method able to initialize automatically and enhance its capability to recover from tracking failures, our future research will combine some kind of learning-based method with model fitting for tracking hand-object interactions. We will use the learning-based method to predict a distribution for the hand-object poses. Then, using the hand and object hypotheses sampled from the distribution for initializing, the model-based tracking will be performed to estimate the hand and object poses. In this paper, CoDEF cannot track hand-object movement in real time. According to the parallel computing characteristics of the proposed CoDEF algorithm and matching error calculation, in the future, we will speed up the system by using CUDA programming.