Robust Hand Gesture Recognition Based on RGB-D Data for Natural Human-Computer Interaction

To enable natural interaction with virtual environments through hand gestures, this paper presents a robust recognition method for static and dynamic hand gestures based on RGB-D data. First, for static hand gesture recognition, starting from hand gesture contour extraction, the palm center is identified by the Distance Transform (DT) algorithm, and the fingertips are localized by the K-Curvature-Convex Defects Detection (K-CCD) algorithm. On this basis, the distances from the pixels on the hand gesture contour to the palm center and the angles between fingertips are used as auxiliary features to construct a multimodal feature vector, and a recognition algorithm is presented to robustly recognize static hand gestures. Second, combining the Euclidean distances between the hand joints and the shoulder center joint with the modulus ratios of skeleton features, this paper generates a unifying feature descriptor for each dynamic hand gesture and proposes an improved dynamic time warping (IDTW) algorithm to recognize dynamic hand gestures. Finally, we conduct extensive experiments to verify the static and dynamic hand gesture recognition algorithms and realize a low-cost, real-time application of natural interaction with a virtual environment by hand gestures.


I. INTRODUCTION
Among different human body parts, the hand is the most effective interaction tool because of its dexterity. Adopting hand gestures as an interface in Human Computer Interaction (HCI) allows users to interact with computers in more natural and intuitive ways, which enables a wide range of applications such as virtual reality, computer games, and sign language recognition. Consequently, it is no surprise that hand gesture recognition has become one of the active research areas in natural HCI [1,2].
The principal components of hand gesture recognition are data acquisition, hand localization (e.g., segmentation and tracking), hand feature extraction, and gesture recognition based on the identified features. Various approaches have been designed for hand gesture recognition, including Dynamic Time Warping (DTW), Hidden Markov Models (HMM), Support Vector Machines (SVM), and so on [1][2][3][4][5]. Recently, popular deep neural networks (DNN) such as Convolutional Neural Networks (CNN) have been applied to recognize human hand gestures and have achieved better recognition performance [4][5][6]. However, deep neural networks require enormous amounts of training data. Training a deep network requires careful tuning of the hyper-parameters and often suffers from convergence to a local optimum. In addition, their implementation complexity and high hardware requirements are typical limitations in real-life applications. Therefore, traditional feature-modeling based approaches still attract a lot of attention and are widely used in real-life applications.
Classic color cameras have already been employed for data acquisition in hand gesture recognition tasks [1][2][3]. These solutions are, however, sensitive to clutter, lighting conditions, and skin color. Video capture has the extra challenge related to the speed of movement. In terms of 3-dimensional (3D) motion capture at the level of the fingers, possible solutions include optical marker systems, accelerometers, magnetic trackers, and data gloves. These require extensive calibration, limit the natural movement of the fingers, and are generally very expensive. The recent development of depth sensors (e.g., the Kinect sensor) provides a robust solution to hand gesture recognition [7,8]. Data captured by Kinect in RGB-D (Red, Green, Blue, and Depth information) form are often used as a source for hand gesture recognition. In spite of many recent successes in applying the Kinect sensor to face recognition, human body tracking, and human action recognition, it is still an open problem to use Kinect for hand gesture recognition in natural HCI. Due to the low resolution and inaccuracy of the Kinect depth map, it is difficult to detect and segment a hand gesture from an image with this resolution. In such a case, the segmentation of the hand is usually inaccurate, which may significantly affect the recognition step [7].
In this paper, we aim to perform static and dynamic gesture recognition using RGB-D data from Kinect. For static hand gesture recognition, the depth and joint information collected by Kinect is used to locate and detect the hands. Starting from the hand gesture contour extraction, the center of the palm is identified by the Distance Transform (DT) algorithm. The fingertips are localized by employing a K-Curvature-Convex Defects Detection algorithm (K-CCD). On this basis, the distances between the pixels on the gesture contour and the palm center and the angles between fingertips are used as auxiliary features to construct a multimodal feature vector. To recognize dynamic hand gestures, we proposed an improved dynamic time warping algorithm (IDTW) in [8]. This paper extends the IDTW algorithm; in particular, the restricted search path and the weight optimization are discussed in detail. On this basis, we demonstrate an application of static and dynamic gesture recognition to interaction with a virtual environment for underground coalmine simulation (i.e., a virtual coalmine).
The motivation of this work is to perform robust static and dynamic gesture recognition using RGB-D data, with the aim of realizing a low-cost, real-time application of natural interaction with a virtual environment by hand gesture recognition. The main contributions of this paper are summarized as follows: (1) A K-Curvature-Convex Defects Detection algorithm (K-CCD) and a multimodal feature vector are proposed for static hand gesture recognition. (2) A unifying feature descriptor is constructed for dynamic hand gestures. (3) A low-cost, real-time application is presented for natural interaction with a virtual coalmine by hand gesture recognition.
The remainder of this paper is organized as follows. Section II reviews related work on the state of the art of hand gesture recognition. Section III describes the framework of static gesture recognition and the methods of hand detection, feature extraction, and gesture recognition. Section IV introduces the methods of hand tracking, feature extraction, and the improved DTW gesture recognition algorithm for dynamic hand gesture recognition. Section V presents experimental results of static and dynamic gesture recognition. Section VI demonstrates an application to interaction with a virtual coalmine environment. Section VII draws final conclusions and discusses future work.

II. RELATED WORK
In recent years, somatosensory interaction technology has attracted wide attention in many areas. Hand gestures are an essential part of somatosensory interaction technology and have been applied to various scenarios such as somatic games, military training, and virtual environments. The motivation of this paper is to perform hand gesture recognition based on RGB-D data and traditional feature-modeling approaches. Good surveys of gesture recognition are available in [1,4,5]. The following literature review mainly summarizes the related work on human hand gesture recognition with depth information and skeleton information for natural human-computer interaction with virtual environment applications.
Different approaches can be employed for hand gesture feature extraction. One commonly used technique is based on RGB-D data [2,7,9]. Specifically, depth data can be used to extract the hand region as the area of the body closest to the camera [9], then to identify the fingers, palm, and wrist by geometric size, and finally to extract a set of feature descriptors that characterize the shape and pose of hand gestures. Although this method is insensitive to lighting conditions and cluttered backgrounds, it still has limitations, such as the assumption that the hand is the closest object to the camera [7]. Another popular technique for hand feature extraction is based on the skeleton. Skeleton features can be divided into two categories: position features and orientation features [5,10,11]. In contrast to the depth-data methods, the majority of the skeleton-based methods model temporal dynamics explicitly. Bhattacharya et al. [12] applied 20 joints to recognize three simple body gestures and used a Z-score normalization to deal with parameters of different units and scales of body-joint points. However, skeleton-based feature extraction methods may increase the computation amount and time complexity. Saponaro et al. [13] used a geometric transformation to set hand coordinates to a reference system centered on the human torso, instead of the default sensor-centered reference frame. This transformation provides invariance to the starting point of a physical gesture. Using the sequences of joint coordinates, Du et al. [14] applied a Kalman filter to estimate the hand position for the precise localization of hand movement in a human manipulation interface for robot tele-operation. Slama et al. [15] first placed the hip center joint at the origin of the coordinate system to make the skeletons scale invariant, and then took a skeleton template as a reference to normalize all the other skeletons.
Orientation features based on the angular information between joint vectors can maximize the invariance of the skeletal representation. Angles between specific pairs of direction vectors are computed to obtain the corresponding joint angles in [16][17]. Raptis et al. [18] used an angular skeleton representation to map the skeleton motion data into a smaller set of features, which reduces the overall entropy of the signals and removes the dependence on camera position. Euler angles have been widely used to describe the orientation of a rigid body in a 3D Euclidean space [19]. Another way to model orientation information is by means of unit quaternions, which belong to a number system that extends the complex numbers [20].
After feature extraction, generating the recognition model is another important step. The conventional recognition methods mainly include probability-based and distance-based approaches. The most common probability-based method is the HMM [3], which is a statistical model. The HMM relies on two hypotheses: output independence and the Markov assumption. However, most sequence data cannot in fact be expressed as a series of independent events. In addition, defining states for gestures is not an easy task, since hand gestures can be formed by a complex interaction of different joints. The distance-based method [10] is an earlier method applied in classifier learning for real-time detection. Dynamic time warping (DTW) [3,4,10] is the most widely used technique to find the optimal alignment of two signals. The conventional DTW is basically a dynamic programming algorithm, which iteratively updates the DTW cost by adding the distance between mapped elements of the two sequences at each iteration step. The distance between two elements is often the Euclidean distance, which gives equal weights to all dimensions of a sequence sample. However, a weighted distance might perform better in assessing the similarity between test sequences and reference sequences. In [10], a weighted DTW algorithm was proposed to maximize a discriminant ratio based on DTW costs. The weights were obtained from a parametric model that depends on how active a joint was in a gesture class. Chaaraoui et al. [21] constructed gesture sets from sequences of key poses and then defined a DTW distance between two sequences by combining the Euclidean distance between couples of key poses in all the possible alignments of the test and reference sequences. As far as static gesture recognition is concerned, SVM [8,9] and multiclass SVM approaches are used broadly [17]. Wang et al. [8] used an adaptive square to extract the hand region based on the depth information and applied an SVM to classify the static hand gestures.
There are other prominent works reviewed in [1,3,5]. However, most of the existing solutions for hand gesture recognition are designed around hand properties (hand contour, palm center, fingers, and hand trajectory). Overall, there are only very few solutions for static and dynamic hand gesture recognition that work on the hand, wrist, elbow, arm, and shoulder for natural HCI applications. The objective of this paper is to develop an improved, low-complexity, and real-time solution for the recognition of static and dynamic hand gestures from the Kinect depth sensor. Experimental results show that our hand gesture recognition system not only operates accurately and efficiently, but is also robust to uncontrolled environments and to hand gesture variations in orientation, scale, articulation, and shape distortions.

III. STATIC GESTURE RECOGNITION
The framework for static hand gesture recognition is illustrated in Fig. 1, which mainly includes hand detection, feature extraction, and gesture recognition. We use Kinect to obtain joint positions. Raw data collected with the Kinect are used to recover the depth information for all the pixels of an image. The depth and joint information are then used to locate and detect the hands within the digital skeleton. Starting from the hand gesture contour extraction, the center of the palm is identified by the DT algorithm. The fingertips are localized by employing the K-CCD algorithm, which is based on the change in the slope angle of the tangent line at selected points over the contour and on convex defects detection for filtering the noise points. On this basis, we use the distances from the pixels on the gesture contour to the palm center and the angles between fingertips as auxiliary features to construct a multimodal feature vector. Finally, the gesture recognition algorithm is built to classify the feature parameters.

A. HAND DETECTION
Hand detection mainly includes localization of the hand, gesture segmentation, and processing of the gesture image. First, we use the digital skeleton provided by Kinect to identify the hand position and to locate the hand, elbow, and wrist joints. Then, combining the gray value distribution of the pixels in each depth frame, we segment the hand gestures, as shown in Fig. 2. The hand gesture images in Fig. 2 still contain an arm and some unwanted noise, such as rough edges, which affect the accuracy of subsequent contour extraction, feature extraction, and gesture recognition. Because the wrist is the narrowest part of the whole arm except the fingers, the wrist contains the fewest pixels in the segmented hand image except the fingers. Therefore, we can count the number of pixels in the segmented hand gesture image to localize the wrist and remove the section below the wrist. Referring to the method in [22], we first rotate the gesture image so that its fingers point horizontally to the right, as shown in Fig. 3(b). Then, the morphological erosion operation is employed to erode the fingers so that they do not interfere with wrist localization, as shown in Fig. 3(c). Next, we count the number of pixels in the vertical direction in Fig. 3(c) to generate the pixel waveform of the hand gesture, as shown in Fig. 4. The blue curve is the pixel waveform of the original hand gesture, and it contains some burrs. With the aid of the least-squares method, the original curve is fitted to generate a new depth pixel waveform, the green curve in Fig. 4. Analyzing the green fitted curve, we find that the first minimum point on the curve is the wrist position. We then remove the part below the wrist to get the hand gesture image without the arm, as shown in Fig. 3(d). Finally, we employ a median filter to smooth the edge region of the hand gesture image. The final gesture image is shown in Fig. 3(e).
As we can see, the edge of the hand gesture in Fig. 3(e) is smoother and the hand gesture features are clearer compared with those of the hand gesture in Fig. 3
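The wrist localization step can be illustrated with a minimal sketch. It assumes a binary mask stored as a list of rows, with the arm on the left and the fingers pointing right, and it swaps the least-squares curve fit for a plain moving average to stay short. All function names are hypothetical and are not taken from the paper's implementation.

```python
def column_profile(mask):
    """Count foreground (non-zero) pixels in every column of a binary mask."""
    rows, cols = len(mask), len(mask[0])
    return [sum(1 for r in range(rows) if mask[r][c]) for c in range(cols)]


def smooth(values, radius=1):
    """Moving average; a simple stand-in for the least-squares fit in the text."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def first_local_minimum(values):
    """Index of the first interior point that is lower than its left neighbour
    and not higher than its right neighbour, or None if there is none."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return i
    return None


def remove_arm(mask, wrist_col):
    """Zero every column left of the wrist (arm assumed to be on the left)."""
    return [[0 if c < wrist_col else v for c, v in enumerate(row)] for row in mask]
```

In this sketch the wrist column is `first_local_minimum(smooth(column_profile(mask)))`. A polynomial least-squares fit could replace `smooth` without changing the rest.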

B. FEATURE EXTRACTION
The extraction of hand gesture features mainly includes the contour detection and tracking, the localization of the palm center, and the detection of fingertips.

1) Contour Detection and Tracking
To obtain a complete hand gesture contour, we first scan the segmented hand gesture image to identify the initial pixel. In order to optimize the search, Plouffe et al. proposed checking only one pixel out of every five (that is, scanning with a $5 \times 5$ square search window) [2]. In this paper, we adjust the search window to $10 \times 10$ to improve the search algorithm. Similar to the approach in [2], if a pixel is valid and does not possess any valid neighboring pixel, the search continues pixel by pixel toward an invalid neighboring pixel until a contour pixel is found, as illustrated in Fig. 5. We verify whether this pixel is part of an already found contour and, if not, a new potential hand is found. If a pixel is valid and none of the neighboring pixels are invalid, the search continues with the next block. Once the initial contour pixel is found, a directional search is performed to identify the full contour of the hand. The directions considered in this paper are upper left, upper right, bottom left, and bottom right. A search direction is favored if it was also used for the previous pixel. For each potential hand contour, the algorithm traces each contour point in order from the initial point and stops when the next point is already present in the contour list. The algorithm also includes a solution to backtrack if an unknown valid configuration is encountered. To further improve this approach, a constraint is added to validate only closed contours. For the Five sign in Fig. 6(a), the detected contour using this procedure is shown in Fig. 6.

2) Localization of Palm Center
According to the characteristics of the hand structure, if the distance from an interior palm pixel $n(x, y)$ to the contour pixels on the edge of the hand has the maximal value, the pixel $n(x, y)$ is considered to be the center of the palm. In this paper, the Distance Transform (DT) is adopted to obtain the coordinates of the palm center.
The DT algorithm calculates the distance between each non-zero pixel and the nearest zero pixel in a digital image, which gives the minimum distance from that pixel to the contour edge. According to the features of the hand gesture image, the DT algorithm is defined as follows [23]:

$$D(u) = \min_{v \in O} d(u, v), \quad u \in P_d \qquad (1)$$

where $P_d$ is the detection result of the gesture contour, $D$ is the distance image of the gesture image $P$, $d(u, v)$ is the distance measurement from the pixel $v$ to the target pixel $u$, and $O$ is the background target of the gesture image. The transform results differ with the distance metric. In this paper, $d(u, v)$ is taken as the Euclidean distance:

$$d(u, v) = \sqrt{(u_x - v_x)^2 + (u_y - v_y)^2} \qquad (2)$$

After the DT, each pixel value in the image is the minimum distance from that target pixel to the image contour. The closer a pixel is to the target center, the larger its DT value. The result of the localization of the palm center for the Five gesture is shown in Fig. 6(c).
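As a rough illustration of how a Euclidean distance transform yields the palm center, the following brute-force sketch scans every foreground pixel. It assumes a small binary mask (list of rows) that contains at least one background pixel. The function names are hypothetical, and a real-time system would use an optimized DT routine instead of this quadratic loop.

```python
import math


def distance_transform(mask):
    """For every foreground pixel, the Euclidean distance to the nearest
    background pixel; background pixels keep the value 0.0."""
    rows, cols = len(mask), len(mask[0])
    background = [(r, c) for r in range(rows) for c in range(cols) if not mask[r][c]]
    dist = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                dist[r][c] = min(math.hypot(r - br, c - bc) for br, bc in background)
    return dist


def palm_center(mask):
    """Return ((row, col), radius) of the pixel with the largest DT value.
    Ties are resolved in favour of the first pixel in scan order."""
    dist = distance_transform(mask)
    best_pos, best_val = None, -1.0
    for r, row in enumerate(dist):
        for c, value in enumerate(row):
            if value > best_val:
                best_pos, best_val = (r, c), value
    return best_pos, best_val
```

The returned radius is the largest inscribed-circle radius of the hand region, which is why the maximum of the distance image is a natural palm-center estimate.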

3) Detection of Fingertips
Based on the results of contour detection and localization of the palm center, the K-CCD algorithm is presented to identify the fingertips over the hand gesture contour. The proposed K-CCD method combines the K-curvature method with convex defects detection. First, the K-curvature algorithm is employed to detect the candidate fingertips of the gesture contour, which are also called like-fingertips [24]. Then, concave points detection and convex defect detection are used to filter out the concave points and noise points. The K-CCD method can avoid the false fingertip detections of the traditional K-curvature method. In addition, it provides better results in terms of overall success rate and supports the widest range of hand rotations within which it performs reliably.
As illustrated in Fig. 7(a), the K-curvature algorithm relates each contour point $P(i)$ to its neighbor points $P(i-k)$ and $P(i+k)$ at a distance of $K$, where $K$ is a constant. According to our experiments, the final K value is 20 pixels, which is suitable for almost all situations. The angle $\alpha$ between the vectors $\overrightarrow{P(i)P(i-k)}$ and $\overrightarrow{P(i)P(i+k)}$ is calculated over the contour of the hand. According to [2] and our tests, if the angle has a value between 25° and 55°, a fingertip is identified at that point. The fingertips detected by K-CCD for the sign Five are shown as green dots in Fig. 7. In Fig. 7(b), there are still some valley points between fingers. In this paper, the cross product of vectors is employed to remove the concave points in the valleys between fingers. For any like-fingertip $P(i)$ on the contour, if the cross product of the vectors $\overrightarrow{P(i)P(i-k)}$ and $\overrightarrow{P(i)P(i+k)}$ is negative, then $P(i)$ is a fingertip. The detected concave points for the Five sign are shown as blue dots in Fig. 7(c). After the concave points are filtered out, the remaining noise points are mainly gathered around the wrist. Therefore, the convex defect detection method is employed to eliminate these noise points, as shown in Fig. 7(e).
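The K-curvature test and the cross-product filter described above can be sketched as follows. This is an illustrative sketch only: the contour is assumed to be an ordered, closed list of (x, y) points, and the function names are hypothetical. The sign that separates fingertips from valleys depends on the traversal direction of the contour and on the image axis orientation; the sketch adopts the rule from the text (negative cross product means fingertip) as an assumption. The convex defect step is not covered.

```python
import math


def classify_point(prev_pt, pt, next_pt, min_deg=25.0, max_deg=55.0):
    """Return 'fingertip', 'valley', or None for one contour point, given its
    neighbours P(i-k) and P(i+k)."""
    v1 = (prev_pt[0] - pt[0], prev_pt[1] - pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return None
    cos_a = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / norm))
    angle = math.degrees(math.acos(cos_a))
    if not (min_deg <= angle <= max_deg):
        return None  # too flat or too sharp to be a like-fingertip
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # Assumed sign convention: negative cross product marks a fingertip.
    return "fingertip" if cross < 0 else "valley"


def k_curvature(contour, k):
    """Scan a closed contour and label every point that passes the angle test."""
    n = len(contour)
    labels = []
    for i in range(n):
        label = classify_point(contour[(i - k) % n], contour[i], contour[(i + k) % n])
        if label is not None:
            labels.append((i, label))
    return labels
```

With the value reported in the text, the scan would be called as `k_curvature(contour, 20)`.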

C. GESTURE RECOGNITION
On the basis of the contour detection, the localization of the palm center, and the detection of fingertips, a multimodal feature vector is constructed for the purpose of recognizing the hand gestures more robustly. In this paper, the distance S from selected pixels on the gesture contour (one pixel out of every four) to the palm center is considered as an auxiliary feature. Moreover, the angle $\beta$ between two adjacent fingertips is also added to the auxiliary features. According to [22] and American Sign Language, we collected data for the 10 Chinese sign gestures from Zero to Nine, as shown in Fig. 8. The calculated distance S curves of the 10 digit gestures are illustrated in Fig. 9. As shown in the S curves, the features of the 10 digit gestures have similarities. For instance, both gesture Two and gesture Six have two fingertips, and their S curves are also hard to tell apart. The specific features of the 10 Chinese digit gestures are described in TABLE I.

TABLE I. Features of the 10 Chinese digit gestures (surviving entries: "..., no convex point between two fingertips in S curve"; "Nine: 1 fingertip, the peak value of S curve less than ...")

According to the features of the digit gestures, the gesture recognition algorithm starts by localizing the hand region using the obtained depth data and skeleton data, and then calculates the number of fingertips and the feature parameters.
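A minimal sketch of the two auxiliary features follows. It assumes that contour points and fingertips are (x, y) tuples and that the angle between adjacent fingertips is measured at the palm center, which the text does not state explicitly. The function names are hypothetical.

```python
import math


def contour_distances(contour, center, step=4):
    """Distance S from every `step`-th contour pixel to the palm centre."""
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for x, y in contour[::step]]


def fingertip_angles(fingertips, center):
    """Angles in degrees, measured at the palm centre (an assumption),
    between each pair of consecutive fingertips."""
    cx, cy = center
    angles = []
    for (ax, ay), (bx, by) in zip(fingertips, fingertips[1:]):
        v1 = (ax - cx, ay - cy)
        v2 = (bx - cx, by - cy)
        norm = math.hypot(*v1) * math.hypot(*v2)
        cos_a = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / norm))
        angles.append(math.degrees(math.acos(cos_a)))
    return angles
```

The S curve of a gesture is then the list returned by `contour_distances`, and the number of fingertips plus the list from `fingertip_angles` complete the multimodal feature vector in this sketch.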

IV. DYNAMIC GESTURE RECOGNITION
Dynamic gesture recognition includes hand tracking, feature extraction, and gesture recognition. We first use the Microsoft Kinect sensor to obtain the depth data and the 3D coordinate information of the hand-related joints (including hands, elbows, shoulders, etc.). The depth and joint information are used to generate a 3D motion trajectory of hand gestures to realize hand tracking and localization. The acquired joint information (3D coordinate sequence) of hand gestures is then used to extract the geometric features of dynamic hand gestures by calculating the Euclidean distance between hand joints. Meanwhile, in order to further describe the position of the hand gesture relative to the body, we create an auxiliary modulus ratio feature vector based on the human skeleton structure. From the Euclidean distances between hand joints and the modulus ratios of feature vectors, we generate a unifying feature vector descriptor to represent each dynamic hand gesture. Finally, the IDTW is built to obtain the final recognition results by calculating the similarity between the test sequence and the template sequence. The proposed approach allows the user to train a reference (template) sequence of a dynamic hand gesture. In order to ensure real-time behavior, a reference gesture sequence is limited to 40 images. When the training is finished, these images are saved in an XML file. During recognition, once a sequence of new images representing a dynamic hand gesture is made available by the Kinect, the IDTW algorithm is activated to recognize it based on the similarity between the observed gesture and each reference gesture.

A. 3D TRACING AND LOCATION OF HAND RECOGNITION
In order to ensure a real-time, natural experience in an HCI system, hand tracking and positioning methods should be robust to changes in illumination and color and to complex backgrounds. In this paper, the hand tracking and localization algorithm makes full use of the depth and joint information to describe the real-time coordinates of the hand node and to generate a 3D motion trajectory. To transform the coordinate system of the depth and skeleton images to that of the color image, some calibration parameters are adjusted so that the depth pixels match the color pixels. The tracking results of a 3D dynamic hand gesture and the 3D trajectory of the acquired hand gesture are shown in Fig. 10.

B. DYNAMIC FEATURE EXTRACTION
Dynamic hand gestures contain not only three-dimensional position information but also time information. Therefore, the joint coordinate sequence of a dynamic hand gesture should be transformed into a feature vector that can be used in the training and recognition of a classification model. In most previous research, direction, position, and speed are the most commonly used gesture features in dynamic gesture recognition systems. In the existing research on dynamic gesture recognition based on Kinect, the authors of [2] proposed using 15 skeleton nodes as the feature vector. For a higher gesture recognition rate, we adopt the Euclidean distance of hand joints and the modulus ratio of the human skeleton structure feature vector as the main features of the dynamic gesture recognition algorithm. In order to ensure the translation invariance and scale invariance of the feature vector for dynamic hand gesture recognition, differences between users, such as body shape, height, build, and position relative to the Kinect, should be effectively eliminated. In this paper, the 3D coordinate information of each joint is first normalized by calculating the Euclidean distance between hand joints. Meanwhile, the skeleton structure feature vector of the human hand is also constructed using the coordinate information of the hand joints. On this basis, the modulus ratio of the skeleton structure feature vector is calculated. Finally, the unified feature vector for dynamic hand gesture recognition is built based on the Euclidean distance of hand joint points and the modulus ratio of the skeleton structure feature vector.
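The Euclidean distances between the hand, elbow, and shoulder joint points and the shoulder center are used to represent the geometric features of dynamic hand gestures. Let the three-dimensional coordinates of the shoulder center point $s$ and hand joint $j$ in the Kinect coordinate system at time $t$ be $(x_s^t, y_s^t, z_s^t)$ and $(x_j^t, y_j^t, z_j^t)$, respectively. The Euclidean distance between them is

$$d_j^t = \sqrt{(x_j^t - x_s^t)^2 + (y_j^t - y_s^t)^2 + (z_j^t - z_s^t)^2} \qquad (3)$$

According to the traits of human arms, the skeleton structure feature vector of hand gestures can be built with the corresponding joint data.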
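The distance part of the per-frame feature can be sketched in a few lines. The frame layout (a dictionary from joint name to an (x, y, z) tuple) and the joint names are assumptions made for illustration; they are not the identifiers used by the Kinect SDK or by the paper.

```python
import math

# Hypothetical joint names for the four joints measured against the shoulder centre.
JOINTS = ("hand_left", "hand_right", "elbow_left", "elbow_right")


def joint_distance(joint, shoulder_center):
    """Euclidean distance between one joint and the shoulder centre."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(joint, shoulder_center)))


def distance_features(frame, names=JOINTS):
    """Distances of the selected joints to the shoulder centre for one frame."""
    center = frame["shoulder_center"]
    return [joint_distance(frame[name], center) for name in names]
```

Because only differences to the shoulder center enter the feature, moving the user as a whole in front of the sensor leaves these values unchanged, which is the translation invariance the text asks for.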

FIGURE 11. Auxiliary feature vector for modulus ratio information
In order to further describe the position features of the hand gesture relative to the body, four auxiliary feature vectors are constructed as shown in Fig. 11. From the Euclidean distances between hand joints and the modulus ratios of the feature vectors, we can analyze the normalization effect on the three-dimensional coordinates of the hand joint points and the dynamic process of the hand gesture. Once the Euclidean distances of the hand joint points and the modulus ratios of the auxiliary feature vectors are obtained, we can construct a unifying gesture feature vector to represent the dynamic hand gesture by combining the two different gesture feature vectors. For the hand gesture feature of the i-th frame image in the dynamic hand gesture sequence, the unifying feature vector $Z_i$ can be described as

$$Z_i = (z_{i,1}, z_{i,2}, \ldots, z_{i,8}) \qquad (5)$$

where each component $z_{i,k}$ is a Euclidean distance between hand joints in the image or a modulus ratio of an auxiliary feature vector as defined above. $Z_i$ is an eight-dimensional vector of hand gesture features, including the Euclidean distances between four joints (i.e., hand left, hand right, elbow left, and elbow right) and the shoulder center, and the four modulus ratios of the auxiliary skeleton structure feature vectors. In order to ensure real-time behavior of the system, the length of a dynamic hand gesture sequence is limited to 40 images. Therefore, the unifying feature vector for a dynamic hand gesture is described as

$$Z = (Z_1, Z_2, \ldots, Z_n) \qquad (6)$$

where $n$ is the total number of frames of a dynamic hand gesture sequence, $n = 1, 2, \ldots, 40$, and $Z_i$ is the gesture feature vector of the i-th frame image defined in (5).

C. IMPROVED DTW ALGORITHM
In previous research on dynamic gestures, the most frequently used methods are HMM and DTW [3,4,10]. However, HMM not only needs a large amount of training data but also demands cumbersome and complex computation. Therefore, we choose the DTW algorithm as the gesture recognition algorithm of the system. DTW is a template matching algorithm that finds the best match for a test pattern among the reference patterns, where the patterns are represented as time sequences of features. In our case, let the template gesture sequence be $L = (l_1, l_2, \ldots, l_M)$ and the test gesture sequence be $S = (s_1, s_2, \ldots, s_N)$. The basic idea of the DTW algorithm is to align the two sequences L and S in time via a best path that minimizes the sum of costs; this path must pass through all the points of sequence S. The computation and time complexity of the conventional DTW algorithm increase greatly with the length of the gesture sequence in the iteration process. Moreover, in a typical hand gesture recognition problem, the hand joints used in a hand gesture can vary from gesture class to gesture class. Hence, not all joints are equally important in recognizing a hand gesture. In this paper, we present an improved DTW algorithm (IDTW) that restricts the warping path and uses a weighted distance in the cost computation [8].
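For reference, the conventional DTW recursion that the IDTW builds on can be sketched as below. This is the textbook dynamic programming form with an unrestricted path and a caller-supplied frame distance, not the improved algorithm. The function names and the nearest-template rule in `classify` are illustrative assumptions.

```python
import math


def dtw_distance(template, test, dist):
    """Classic DTW cost between two sequences using frame distance `dist`."""
    m, n = len(template), len(test)
    cost = [[math.inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = dist(template[i - 1], test[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[m][n]


def classify(test, templates, dist):
    """Name of the template with the smallest DTW cost to the test sequence."""
    return min(templates, key=lambda name: dtw_distance(templates[name], test, dist))
```

The full cost table has M x N cells, which is the growth in computation that motivates restricting the search path.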
First, in order to reduce the DTW computational complexity and increase the reliability of DTW's dissimilarity measure, global constraints have been imposed on the warping path [25,26]. In this paper, we use a well-known global constraint, the parallelogram band, to constrain the search path [26], which effectively limits the warping amount, i.e., the slowing down or speeding up of a sequence in time. In the parallelogram, the maximum slope is 2 and the minimum slope is 0.5, as shown in Fig. 12. According to the features of the parallelogram in Fig. 12, the lengths M and N of the two hand gesture sequences are limited by (7). If the lengths of the template gesture sequence L and the unknown gesture sequence S do not satisfy the constraints in (7), there is no need to compare every frame along the X axis and the Y axis; we only need to compare them within the interval $Q = [y_{min}, y_{max}]$, where $y_{min}$ and $y_{max}$ are computed by (8) and (9), respectively.
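The effect of a parallelogram band with slopes between 0.5 and 2 can be illustrated with a generic sketch. The bounds below are derived directly from the two slope limits for a path running from cell (1, 1) to cell (n_x, n_y) with 1-based frame indices. They are an assumption made for illustration and are not claimed to reproduce the paper's equations (7) to (9). All names are hypothetical.

```python
def ceil_half(a):
    """Ceiling of a / 2 for a non-negative integer a."""
    return (a + 1) // 2


def lengths_compatible(n_x, n_y):
    """True if a path with slope in [0.5, 2] can join (1, 1) and (n_x, n_y)."""
    return 2 * (n_x - 1) >= n_y - 1 and 2 * (n_y - 1) >= n_x - 1


def band_limits(x, n_x, n_y):
    """Admissible range [y_min, y_max] of frame indices y for column x."""
    # Lower bound: slope 0.5 from the start, slope 2 into the end point.
    y_min = max(1 + ceil_half(x - 1), n_y - 2 * (n_x - x))
    # Upper bound: slope 2 from the start, slope 0.5 into the end point.
    y_max = min(1 + 2 * (x - 1), n_y - ceil_half(n_x - x))
    return y_min, y_max


def band_size(n_x, n_y):
    """Number of cost-table cells that lie inside the parallelogram."""
    total = 0
    for x in range(1, n_x + 1):
        lo, hi = band_limits(x, n_x, n_y)
        total += max(0, hi - lo + 1)
    return total
```

For two sequences of five frames each, `band_size(5, 5)` counts 9 cells instead of the 25 cells of the full table, which shows where the saving in computation comes from.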
On the other hand, the conventional DTW algorithm gives equal weights to all dimensions of a sequence sample. However, in typical dynamic hand gesture recognition, the hand joints involved can vary from one gesture class to another. Therefore, we propose a weighted DTW algorithm that uses a weighted distance in the cost computation. Different from the weighted DTW algorithms in [27], the weights in this paper are obtained based on a joint's displacement in a dynamic hand gesture. To infer a joint's weight in a trained template gesture, we compute its total displacement by

$$D_j^w = \sum_{k=2}^{K} \left\| p_j^w(k) - p_j^w(k-1) \right\| \qquad (10)$$

where $p_j^w(k)$ denotes the position of joint $j$ in frame $k$ of gesture $w$, $K$ is the total number of frames of the template gesture sequence, $w$ is the gesture index, and $j$ is the joint index.
According to (11), if joint $j$ remains static while performing hand gesture $w$, its weight $\omega_j^w$ is zero. On this basis, to incorporate these weights, the final DTW distance between the template gesture sequence L and the test gesture sequence S is transformed into a weighted form in which $\omega_r^w$ is the weight value of the r-th joint and $r$ ranges over the joints in hand gesture $w$. The parameter $\beta$ in (11) can be calculated by minimizing the within-class variation while maximizing the between-class variation [21]. The average weighted DTW distance cost between all samples of hand gesture sequence N and gesture sequence M is defined first, and the discriminant ratio of a given $\beta$, $R(\beta)$, is then obtained from it. The optimal value $\beta^*$ is chosen as the one that maximizes $R(\beta)$:

$$\beta^* = \arg\max_{\beta} R(\beta)$$
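A possible realization of displacement-based weights is sketched below. The saturating exponential form `1 - exp(-beta * D)` followed by normalization is an assumption, chosen because it gives a static joint a weight of zero as the text requires; the paper's exact expression in (11) is not reproduced here. The function names and the data layout (one list of (x, y, z) positions per joint) are likewise hypothetical.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def total_displacement(track):
    """Sum of frame-to-frame displacements of one joint over a template."""
    return sum(euclid(track[k], track[k - 1]) for k in range(1, len(track)))


def joint_weights(tracks, beta):
    """Normalized weights, one per joint; a static joint receives weight 0.
    The exponential form is an assumed stand-in for the paper's (11)."""
    raw = [1.0 - math.exp(-beta * total_displacement(t)) for t in tracks]
    total = sum(raw)
    if total == 0:
        return [0.0] * len(raw)
    return [r / total for r in raw]


def weighted_frame_distance(frame_a, frame_b, weights):
    """Weighted sum of per-joint distances between two frames."""
    return sum(w * euclid(p, q) for w, p, q in zip(weights, frame_a, frame_b))
```

Passing `weighted_frame_distance` as the frame distance of a DTW routine gives a weighted DTW cost, and `beta` could then be selected by evaluating the discriminant ratio over a grid of candidate values.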

V. EXPERIMENTAL RESULTS
Several experiments were performed to test the proposed methods for static and dynamic hand gesture recognition. All experiments were carried out on an Intel Core(TM) i7-4790 3.60 GHz CPU with 8 GB of RAM. A Kinect 3D sensor was used as the data acquisition device. Visual Studio 2010, Kinect SDK v1.8, and the C# programming language were used as the programming tools.

A. Results of hand detection and tracking
We tested the performance of hand detection and tracking using the sign digits from Zero to Nine shown in Fig. 8.
For a normal scene, a few test samples of gesture segmentation and of contour, palm, and fingertip detection are shown in Fig. 13. Statistical results for the 10 sign digits are shown in TABLE II, where the average accuracy of each gesture in the Numbers scenario is calculated. For the gestures Five and Four, the accuracy is nearly 100%. The worst case is Nine, with an accuracy of 74%. Three, Six, Seven, Eight, and Nine each consist of three fingers, which makes it difficult for the system to distinguish among them. From TABLE II, we can conclude that the proposed solution is able to correctly locate the points of interest over the hand surface as well as its contour. The average accuracy of fingertip detection for the 10 sign digits is 97.8%, and that of gesture segmentation is 98.4%. Compared with the K-curvature method [24], the proposed K-CCD method for fingertip detection is more robust and can also effectively eliminate noise near the wrist.

B. Results of static hand gesture recognition
To test the performance of the proposed solution for static hand gesture recognition, we use the sign digits from Zero to Nine and perform the experiments under normal lighting conditions and in complex scenes. Five volunteers are invited to perform each of the 10 gestures 100 times: 50 times with the left hand and 50 times with the right hand. The distance between the camera and the hands is about 1000 mm. The test user is given a sequence of images corresponding to different sign digits and is asked to reproduce them. Each test user is allowed to practice each gesture once or twice before the tests. The recognition results can be seen on screen in real time. Some samples of hand gesture recognition under normal illumination are shown in Fig. 14. The confusion matrix of hand gesture recognition for the 10 sign digits is given in Fig. 15(a). The confusion matrix of the 10 sign digits under weak illumination is also shown in Fig. 15. The average recognition rates of the 10 sign digits under the two lighting conditions are 97.4% and 97.6%, respectively. Therefore, illumination has little effect on the recognition results.
In traditional hand gesture recognition using RGB images, the complexity of the background is also one of the crucial factors that affect recognition performance. To further verify the performance of the proposed recognition methods under complex background conditions, we test the ten gestures in different scenes, i.e., a multi-user scene, a multi-user and weak light scene, and a strong light scene. A few recognition samples are shown in Fig. 16. Fig. 16(a) is the test result of gesture Five in the multi-user and normal light scene (Scene 1). Fig. 16(b) is the test result of gesture Eight in the multi-user and weak light scene (Scene 2). The test result of gesture Zero in the strong light scene (Scene 3) is shown in Fig. 16(c). Statistical results of the 10 sign digits in the different scenes are illustrated in Fig. 17.
The average recognition rate of the 10 sign digits in complex background scenes is more than 95.5%. Analyzing the experimental results of the 10 sign digits in the different scenes, we observe that the proposed method of static hand gesture recognition, which combines depth data, fingertip features, and S curve features, can accurately identify the hand gesture and is strongly robust to complex conditions such as varying illumination and multi-user interference.
Finally, we compared the static hand gesture recognition algorithm with state-of-the-art systems related to sign language, including FEMD (Finger-Earth Mover's Distance) [7], DTW [2], K-Curvature [24], Random Forest [29], and RBF-kernel (Radial Basis Function kernel) [30]. The results are given in TABLE III. Because the definitions of the 10 digit gestures are not exactly the same across these works, we only list the recognition results of very similar individual gestures. In addition, K-Curvature in [24] and RBF-kernel in [29] only report the average recognition accuracy, and the recognition accuracy in [30] is the average value over all sign language gestures.

C. Results of dynamic hand gesture recognition
(1) Dataset
To validate the proposed algorithm for dynamic gesture recognition, we used an experimental hand gesture dataset trained and generated by our volunteers according to the UDLR-8 datasets in [28], as illustrated in Fig. 18.
(2) Experimental results of dynamic hand gesture
For each gesture, we randomly select ten samples from the fifty trained templates and invite another ten volunteers to test the proposed recognition algorithm. Each gesture is tested ten times by each of the ten volunteers, so the total number of tests for each gesture is one hundred. A gesture is considered unrecognized if the IDTW algorithm displays a wrong recognition result within a predefined interval after the test user has finished his/her performance. We carried out the test experiments under normal light and weak light conditions to verify the performance of the recognition algorithm.
The test results for our UDLR-8 gestures using the IDTW algorithm and the DTW algorithm are shown as confusion matrices in Fig. 19(a) and Fig. 19(b), respectively. The average recognition rate of the proposed IDTW algorithm is 96.5%, and the average recognition rate of the DTW algorithm is 91.4%. According to our test results, the gesture recognition rates are almost the same under different lighting conditions. In addition, the IDTW algorithm for dynamic gesture recognition is mainly based on the human joint information and skeleton structure features obtained by Kinect; therefore, the recognition results are also independent of body differences among test users. More results and details can be found in [8]. Fig. 20 compares the response time of the IDTW algorithm with that of the DTW algorithm. As one can observe, the average response time of the IDTW algorithm, compared with that of the DTW algorithm, is less than 500 ms. According to the test results shown in Fig. 19 and Fig. 20, the proposed IDTW algorithm not only improves the overall recognition rate but also decreases the recognition response time.
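For readers who wish to reproduce such statistics, the per-gesture and average recognition rates can be derived from a confusion matrix as in the following Python sketch. The function names are hypothetical, rows are assumed to be the performed gesture and columns the recognized gesture, and the toy matrix contains made-up counts that are not taken from Fig. 15 or Fig. 19.

```python
def recognition_rates(confusion):
    """Per-class recognition rate: correct count divided by the row total.
    confusion[i][j] = number of times gesture i was recognized as gesture j."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(row[i] / total if total else 0.0)
    return rates


def average_rate(confusion):
    """Unweighted mean of the per-class recognition rates."""
    rates = recognition_rates(confusion)
    return sum(rates) / len(rates)


# Toy two-gesture example with 100 trials per gesture (illustrative counts only).
toy_confusion = [
    [98, 2],
    [5, 95],
]
toy_rates = recognition_rates(toy_confusion)
toy_average = average_rate(toy_confusion)
```

Because every gesture is tested the same number of times in the experiments above, the unweighted mean of the per-gesture rates equals the overall fraction of correctly recognized trials.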

VI. APPLICATION TO INTERACTION WITH VIRTUAL COALMINE
To verify the proposed solutions for hand gesture recognition, we have developed an interactive application that controls a virtual coalmine by hand gestures. The virtual coalmine is a specific application of virtual environment technology to coal mining that simulates the underground production environment, including devices, environments, and miners. The interactive system obtains the depth data and joint information of the human hand using the Kinect sensor and runs the proposed algorithms for static and dynamic hand gesture recognition. According to the recognition results of the predefined hand gestures, the interactive system sends real-time control commands to the virtual environment engine and realizes the interaction with the virtual coalmine. Considering the principles of human ergonomics and the daily habits of human communication, we defined the 16 hand gestures illustrated in Fig. 21. The corresponding interactive semantics of the predefined hand gestures are listed in TABLE IV. Using these predefined hand gestures, we can control the virtual miner's motion, the virtual devices' status, and the view angle of the camera to realize typical interaction with the virtual mine. Fig. 22 shows a typical interactive scenario with the virtual coalmine based on hand gesture recognition.
Fig. 22(a) and Fig. 22(d) show the interface to the virtual mine, which displays the user's hand gesture, the depth image from Kinect, the recognized hand gesture, the command semantics of the hand gesture, and the system response to the user's hand gestures. In Fig. 22(a), the user performs gesture 02, i.e., his left hand swipes up and his right hand swipes down. Fig. 22(c) shows that the virtual miner is walking. In Fig. 22(d), the user performs gesture 09, i.e., his right hand swipes left while his left hand is kept straight forward, and the view angle of the virtual coalmine turns left, as shown in Fig. 22(f). Fig. 23 shows the real-time frame rate test when users interactively control the virtual mine with the defined interactive gestures.
We ran 20 interactive tests. It can be seen that when the system is in the initial state or triggers another state, the frame rate is high. When the user makes two gestures at the same time, the frame rate decreases. Currently, the system frame rate is basically stable at 38~40 FPS and meets the real-time requirements of a virtual environment system.

VII. CONCLUSION AND FUTURE WORK
This paper proposed an RGB-D based recognition method for static and dynamic hand gestures, aimed at natural human-computer interaction with virtual environments. For static hand gesture recognition, we built a novel K-CCD method, which adds the S curve features and the angle between two fingertips and the palm center to reduce confusion among similar-looking fingertips and improve the recognition rate. For dynamic hand gesture recognition, we combined the Euclidean distance between hand joints and the shoulder center with the modulus ratios of skeleton features to generate a unifying feature descriptor for each dynamic hand gesture, and proposed an IDTW algorithm to obtain the recognition results. In our experimental evaluation, the system achieved average recognition rates of 97.4% and 96% for static and dynamic gestures, respectively. Finally, we realized a low-cost real-time application of natural interaction with a virtual coalmine by hand gesture recognition. However, although we defined 16 static and dynamic gestures to interact with the virtual coalmine, the current interactive gestures mainly focus on the scene and view of the virtual environment and are relatively simple. Moreover, people with hand deformities cannot use our methods; other HCI methods such as speech and brain-computer interface technology may be more suitable for them. In addition, this paper used Kinect to obtain the RGB-D data for the current application; other depth sensors such as Intel RealSense, Leap Motion Controller, and ASUS Xtion will be tested with the proposed algorithms for static and dynamic hand gesture recognition.
In the future, we will improve the overall performance of hand gesture recognition by extracting more robust and discriminative features and optimizing the recognition algorithm. Moreover, we will further enrich the HCI functions for the virtual coalmine by designing more effective interaction hand gestures so as to improve its practicability. In addition, the current hand gesture recognition algorithm needs to be trained for each specific application. We therefore plan to explore popular deep learning based approaches for hand gesture recognition with smaller datasets and lightweight algorithms to enhance its learning ability and further improve its adaptability and extensibility.