A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar

In this article, a novel method for continuous hand gesture detection and recognition is proposed based on a frequency modulated continuous wave (FMCW) radar. Firstly, we adopt the 2-Dimensional Fast Fourier Transform (2D-FFT) to estimate the range and Doppler parameters of the hand gesture raw data, and construct the range-time map (RTM) and Doppler-time map (DTM). Meanwhile, we apply the Multiple Signal Classification (MUSIC) algorithm to calculate the angle and construct the angle-time map (ATM). Secondly, a hand gesture detection method is proposed to segment the continuous hand gestures using a decision threshold. Thirdly, the central time-frequency trajectory of each hand gesture spectrogram is clustered using the $k$ -means algorithm, and then the Fusion Dynamic Time Warping (FDTW) algorithm is presented to recognize the hand gestures. Finally, experiments show that the accuracy of the proposed hand gesture detection method can reach 96.17%. The hand gesture average recognition accuracy of the proposed FDTW algorithm is 95.83%, while its time complexity is reduced by more than 50%.


I. INTRODUCTION
With the rapid development of 5G communications [1] and artificial intelligence, human-computer interaction (HCI) has become an indispensable technology in our daily life. As an important branch, hand gesture recognition [2] is of great significance for promoting the development of HCI. Meanwhile, the development of hand gesture recognition also promotes the progress of many fields, such as smart home [3], industrial internet of things [4]- [6], sign language interaction [7], radio frequency identification [8], [9], etc. As a result, hand gesture recognition has become a research hotspot recently [9]- [13].
Research on hand gesture recognition mainly focuses on designing recognition algorithms for different data sources. Hand gesture recognition methods fall into three types: sensor-based [10], [11], vision-based [12] and radar-based [13]. Specifically, the sensor-based method adopts sensors to collect the hand gesture data. However, since the tester needs to wear a device for a long time, the sensor-based method is usually inconvenient in terms of user experience. The vision-based method achieves very high recognition accuracy because it uses camera images; however, it is not friendly in terms of privacy exposure. The radar-based method applies radar to collect the hand gesture signal, extracts the hand gesture information through signal processing, and then classifies the hand gestures. Compared with the two aforementioned methods, the radar-based method works in a non-contact way and brings a good user experience; as a result, it attracts extensive attention in both industry [14] and academia [15], [16]. In [15], the authors adopt radar to measure the range, speed and angle information of the hand gestures. In the vehicle-mounted auxiliary control system [16], the authors apply a time-frequency analysis method to the beat signals of different hand gestures using a frequency modulated continuous wave (FMCW) millimeter-wave radar. Moreover, the Google Soli project [17] designs a hand gesture recognition system using a 7 GHz bandwidth radar.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wang.
Nowadays, hand gesture recognition methods are usually tested using a single hand gesture [18], [19]. However, in real-time applications of a hand gesture recognition system, the hand gestures are usually continuous and dynamic. To this end, some researchers focus on recognizing continuous and dynamic hand gestures [20]-[23]. In [20], the authors obtain the start and end times of each single gesture by a near-zero speed detection method. However, the continuous hand gesture considered in [20] is periodical, and such a simple hand gesture type limits its use in practical applications with complex hand gestures. In [21], the authors judge the start and end of the hand gesture based on thresholds of the calculated trigger value and trigger ratio. However, as the number of hand gestures increases, the instability of the trigger ratio makes it difficult to detect hand gestures correctly. For the hand gesture recognition algorithms, deep learning [24]-[28] is usually used for feature extraction and hand gesture classification. However, deep learning requires large training datasets, and both the hand gesture collection time and the training time increase sharply with the number of hand gesture types. More importantly, the duration of each hand gesture usually differs, making it difficult to directly use deep learning for hand gesture recognition. Therefore, the design of hand gesture recognition algorithms based on non-deep-learning methods such as dynamic time warping (DTW) [29]-[31], Adaptive Boosting (AdaBoost) [32], [33] and the hidden Markov model (HMM) [34], [35] has received much attention. The data imbalance of AdaBoost decreases the classification accuracy, and the calculation of the HMM is complex. Since the DTW algorithm can match the similarity of two sequences with different time lengths, it is widely used in the field of hand gesture recognition.
In [36], the authors propose an RGB video-based hand gesture recognition method for posture extraction, and apply the DTW algorithm for posture classification. Reference [20] adopts the DTW algorithm with the combination of the range and Doppler information for hand gesture classification, and achieves a recognition accuracy of 91.5%. In [37], a novel DTW algorithm for hand gesture recognition with wearable gloves is proposed. In [38], the efficiency of different DTW algorithms is compared on a simple gesture database. However, the hand gesture data used in the above studies are mostly collected by cameras or sensor devices. Moreover, the above works focus on recognition algorithms for a single hand gesture, and the problem of continuous hand gesture recognition is not discussed. More importantly, the time complexity of the DTW algorithm [39] is usually high because all pixels of the hand gesture spectrogram are used for matching, and the angle information of the hand gesture is not considered in the above studies. Therefore, this article proposes a hand gesture detection and recognition method for continuous and dynamic hand gestures using an FMCW radar. Specifically, we use the hand gesture raw data to obtain the intermediate frequency (IF) signal, and estimate the range-time map (RTM) and Doppler-time map (DTM). Besides, we adopt the Multiple Signal Classification (MUSIC) [40] algorithm to estimate the angle-time map (ATM) of different hand gestures. Then, the amplitude is obtained by normalizing the hand gesture spectrogram, and a threshold is set to segment the continuous hand gestures. Moreover, we present a hand gesture detection range method to evaluate the segmentation accuracy of the hand gestures. The central time-frequency trajectory of each hand gesture is clustered by the k-means algorithm, and the Fusion Dynamic Time Warping (FDTW) algorithm is proposed to classify the hand gestures.
The experimental results show that the accuracy of the proposed gesture detection method is 96.17%, and compared to the existing alternatives, the recognition accuracy of the proposed FDTW algorithm is improved by more than 5%, and the time complexity is reduced by about 50%.
The remainder of this article is organized as follows. In Section II, we describe the FMCW radar signal processing, including the generation of the IF signal and the construction of the spectrograms. In Section III, we present the details of the hand gesture detection and recognition methods. Section IV presents the experiments and discussions. Finally, the conclusions are given in Section V.

II. FMCW RADAR SIGNAL PROCESSING

A. IF SIGNAL EXTRACTION AND RDM CONSTRUCTION
In this article, the FMCW radar is used for hand gesture data acquisition. The FMCW radar consists of a waveform generator, an antenna array with two transmitters and four receivers, a signal demodulator and an analog-to-digital converter (ADC). The waveform generator transmits the chirp signal through the transmitting antenna. Then the IF signal is obtained using a low-pass filter (LPF). FIGURE 1 shows the IF signal processing and the storage format of the radar data. For the receiving antennas, two Low Voltage Differential Signaling (LVDS) channels are used: the first channel sends odd samples while the other sends even samples, and each LVDS channel sends one real-part and one imaginary-part signal. For each chirp signal, the data is stored starting from the lowest receiver to the highest one. Knowing this storage principle, the data of the I/Q signals can be obtained.
VOLUME 8, 2020
Since we use the sawtooth wave as the transmitted signal, the received signal is also a sawtooth wave, and the frequencies of the transmitted and received signals are shown in FIGURE 2. The transmitted signal is expressed as
$$s_T(t) = A_T \cos\left(2\pi f_c t + 2\pi \int_0^t f_T(\tau)\,d\tau\right) \tag{1}$$
where $A_T$ is the amplitude of the transmitted signal, $f_c$ is the central frequency of the carrier signal, $f_T(\tau) = \tau B/T$ is the frequency of the transmitted signal, which increases linearly with time $\tau$ during the period $T$, $T$ is the width of the pulse signal, and $B$ is the bandwidth of the signal. The received signal can be expressed as
$$s_R(t) = A_R \cos\left(2\pi f_c (t - t_d) + 2\pi \int_0^t f_R(\tau)\,d\tau\right) \tag{2}$$
where $A_R$ is the amplitude of the received signal, $t_d$ is the delay time, and $f_R(\tau)$ is the frequency of the received signal. By mixing the transmitted and the received signals we get the mixed signal $s_M(t) = s_T(t) \cdot s_R(t)$, and an LPF is used to filter out the high-frequency part and obtain the IF signal
$$s_{IF}(t) = \frac{1}{2} A_T A_R \cos\left(2\pi f_c t_d + 2\pi \int_0^t \left[f_T(\tau) - f_R(\tau)\right] d\tau\right) \tag{3}$$
where the frequency difference $f_T(\tau) - f_R(\tau)$ contains the beat frequency caused by the delay $t_d$ and the Doppler shift $f_{doppler}$ caused by the movement of the hand gesture. Assume that the range between the hand gesture and the radar is $R$. The relationship between $t_d$ and $R$ is
$$t_d = \frac{2R}{c} \tag{4}$$
where $c$ is the speed of light. Then, we can measure the delay time $t_d$ to obtain the range $R$. However, since $t_d$ is very small, it is difficult to measure in practice. The frequency $f_{IF}$ of the IF signal is therefore generally used for the estimation of $t_d$, and the relationship between $f_{IF}$ and $t_d$ is shown in FIGURE 3.
From FIGURE 3, we learn that
$$f_{IF} = \frac{B}{T}\, t_d \tag{5}$$
Then, substituting Eq. (4) into Eq. (5), we have
$$R = \frac{c\,T f_{IF}}{2B} \tag{6}$$
For the dynamic hand gestures, the phase of the signal changes with the range of the hand gesture, and the phase difference between adjacent chirps can be expressed as
$$\Delta\varphi = \frac{4\pi f_c\, \Delta R}{c} \tag{7}$$
where $\Delta R$ is the change in range between adjacent chirps. The Doppler frequency shift of the IF signal on each chirp, $f_{doppler} = \Delta\varphi / (2\pi T)$, can be obtained by the 2D-FFT. The speed $v = \Delta R / T$ of the hand gesture can then be obtained using the frequency of the IF signal and Eq. (7):
$$v = \frac{c\,\Delta\varphi}{4\pi f_c T} = \frac{c\, f_{doppler}}{2 f_c} \tag{8}$$
Combining Eq. (6) and Eq. (8), we know that the range is proportional to the frequency of the IF signal, and the speed of the hand gesture is proportional to the Doppler shift. In the following, a range-Doppler map (RDM) of the hand gesture is generated, and the construction process of the RDM is shown in FIGURE 4. It can be seen from FIGURE 4 that in the fast-time domain (each single chirp), the frequency spectrum of the IF signal is obtained by performing an FFT on each chirp signal, and the frequency corresponding to the hand gesture can be obtained by a spectral peak search. The speed (Doppler) estimation needs to accumulate the estimated spectrum results and perform an FFT across multiple chirps (namely a frame), that is, in the slow-time domain. The coupled RDM of range and speed is thus obtained for each frame. Since the range and Doppler frequency shift of the hand gesture can be extracted from the RDM of each frame, the RTM and DTM are obtained by accumulating the results over multiple frames.
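The 2D-FFT processing described above can be illustrated with a minimal NumPy sketch. It assumes one frame of complex IF samples arranged as a (chirps x samples) array; the function names, the array sizes and the synthetic single-target frame are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def range_doppler_map(frame: np.ndarray) -> np.ndarray:
    """Return the magnitude RDM of one frame.

    frame: complex IF samples, shape (num_chirps, num_samples).
    Rows of the result are Doppler bins (zero Doppler centered),
    columns are range bins.
    """
    # Fast-time FFT along each chirp gives the range spectrum.
    range_fft = np.fft.fft(frame, axis=1)
    # Slow-time FFT across the chirps of the frame gives the Doppler spectrum.
    doppler_fft = np.fft.fft(range_fft, axis=0)
    return np.abs(np.fft.fftshift(doppler_fft, axes=0))

def synth_frame(num_chirps=128, num_samples=64, range_bin=10, doppler_bin=5):
    """Hypothetical single-target frame: one tone in fast time and slow time."""
    n = np.arange(num_samples)[None, :]   # fast-time sample index
    m = np.arange(num_chirps)[:, None]    # slow-time chirp index
    phase = range_bin * n / num_samples + doppler_bin * m / num_chirps
    return np.exp(2j * np.pi * phase)

rdm = range_doppler_map(synth_frame())
# Spectral peak search: the strongest cell gives the Doppler and range bins.
d_idx, r_idx = np.unravel_index(np.argmax(rdm), rdm.shape)
print("Doppler bin (shifted):", d_idx, "range bin:", r_idx)
```

Repeating this for every frame and stacking the per-frame range and Doppler profiles along time would give arrays that play the role of the RTM and DTM.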

B. MUSIC BASED ATM
Although RTM and DTM represent the features of the hand gestures, the angle parameter provides additional features of the hand gestures. In FMCW radar, the phase difference of the echo signal across multiple antennas is used to measure the angle information, as shown in FIGURE 5. In the following, the mathematical model is established to calculate the angle of hand gestures. The phase difference of the signals received by two adjacent receiving antennas is
$$\Delta\psi = \frac{2\pi f_c\, d}{c} = \frac{2\pi f_c\, l \sin\theta}{c}$$
where $d = l \cdot \sin\theta$ is the distance difference, $l$ is the spacing between the two antennas, and $\theta$ is the arrival angle of the signal. Then, the angle $\theta$ can be obtained as
$$\theta = \arcsin\left(\frac{c\,\Delta\psi}{2\pi f_c\, l}\right)$$
Based on the above analysis, this article adopts the Multiple Signal Classification (MUSIC) algorithm [40] to estimate the angle of the hand gestures. The MUSIC algorithm is based on signal subspace decomposition: the signal and noise subspaces are obtained by eigendecomposition of the covariance matrix, the spatial spectrum function is constructed, and the angle is then detected by a spectral peak search. For the $k$-th target with arrival angle $\theta_k$, the phase difference between the first receiving antenna and the $m$-th receiving antenna is
$$\Delta\psi_{m,k} = \frac{2\pi f_c (m-1)\, l \sin\theta_k}{c}$$
Therefore, the received signal of the $m$-th antenna can be expressed as
$$x_m(t) = \sum_{k=1}^{K} a_k(t)\, e^{-j\Delta\psi_{m,k}} + n_m(t)$$
where $a_k(t)$ is the amplitude of the $k$-th target and $n_m(t)$ is the noise received by the $m$-th antenna. We define the guide vector element $s_m(\theta_k) = e^{-j\Delta\psi_{m,k}}$. Thus, the signal of the $m$-th receiving antenna can be further expressed as
$$x_m(t) = \sum_{k=1}^{K} a_k(t)\, s_m(\theta_k) + n_m(t)$$
The received signals of all $M$ antennas are
$$X = S O + J$$
where $O$ is the amplitude matrix, $S$ is the guide vector matrix, and $J$ is the noise matrix. These variables can be further expressed as
$$X = [x_1(t), \ldots, x_M(t)]^T,\quad O = [a_1(t), \ldots, a_K(t)]^T,\quad S = [s(\theta_1), \ldots, s(\theta_K)],\quad J = [n_1(t), \ldots, n_M(t)]^T$$
with $s(\theta_k) = [s_1(\theta_k), \ldots, s_M(\theta_k)]^T$. Then, the covariance matrix of the radar signal is
$$R_X = E[X X^H] = R_S + R_J$$
where $R_S = S\,E[O O^H]\,S^H$ is the correlation matrix of the signal and $R_J = \sigma^2 I$ is the noise covariance matrix, where $\sigma^2$ is the noise power and $I$ is the identity matrix.
By applying the eigendecomposition of the covariance matrix, $M$ eigenvalues can be obtained and sorted as
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_M$$
Assuming that there are $K$ hand gestures in front of the radar, the signal corresponds to the $K$ larger eigenvalues, while the noise corresponds to the $M-K$ smaller eigenvalues. We also assume that the eigenvector of the $i$-th eigenvalue $\lambda_i$ is $\zeta_i$; then the noise matrix can be constructed from the $M-K$ eigenvectors
$$E_J = [\zeta_{K+1}, \zeta_{K+2}, \ldots, \zeta_M]$$
The spatial spectral function can be constructed by using the noise matrix $E_J$ and the guide vector $s(\theta)$
$$P_{MUSIC}(\theta) = \frac{1}{s^H(\theta)\, E_J E_J^H\, s(\theta)}$$
When the inner product of the guide vector $s(\theta)$ and the noise matrix $E_J$ approaches 0, the spectral function reaches a peak. Therefore, the angle of the hand gesture can be found by searching the spatial spectrum for its peak. Finally, we can obtain an ATM by accumulating the estimated angles over multiple frames.
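The MUSIC steps above (covariance estimation, eigendecomposition, noise subspace, spectral peak search) can be sketched as follows. This is a minimal illustration under stated assumptions: a uniform linear array with half-wavelength spacing, a single target, a 1-degree search grid and synthetic snapshots; all names and values are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def steering_vector(theta_rad, num_rx, spacing_wavelengths=0.5):
    """Guide (steering) vector of a uniform linear array (assumed geometry)."""
    m = np.arange(num_rx)
    return np.exp(-2j * np.pi * spacing_wavelengths * m * np.sin(theta_rad))

def music_spectrum(snapshots, num_targets, angles_rad, spacing_wavelengths=0.5):
    """snapshots: complex array of shape (num_rx, num_snapshots)."""
    num_rx, num_snap = snapshots.shape
    # Sample covariance matrix of the received signals.
    cov = snapshots @ snapshots.conj().T / num_snap
    # eigh returns eigenvalues in ascending order, so the first
    # num_rx - num_targets eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise_subspace = eigvecs[:, : num_rx - num_targets]
    spectrum = np.empty(len(angles_rad))
    for i, theta in enumerate(angles_rad):
        s = steering_vector(theta, num_rx, spacing_wavelengths)
        proj = noise_subspace.conj().T @ s
        # The spectrum peaks where s(theta) is orthogonal to the noise subspace.
        spectrum[i] = 1.0 / np.vdot(proj, proj).real
    return spectrum

# Synthetic example: one target at 20 degrees, 4 receivers, 200 snapshots.
rng = np.random.default_rng(0)
num_rx, num_snap = 4, 200
amplitudes = rng.standard_normal(num_snap) + 1j * rng.standard_normal(num_snap)
noise_samples = 0.05 * (rng.standard_normal((num_rx, num_snap))
                        + 1j * rng.standard_normal((num_rx, num_snap)))
x = np.outer(steering_vector(np.deg2rad(20.0), num_rx), amplitudes) + noise_samples

grid = np.deg2rad(np.arange(-90.0, 91.0, 1.0))
est_deg = float(np.rad2deg(grid[np.argmax(music_spectrum(x, 1, grid))]))
print("estimated angle (deg):", est_deg)
```

Running the estimate once per frame and stacking the results over time would give an angle-versus-time array analogous to the ATM.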

III. PROPOSED DETECTION AND RECOGNITION METHOD
In order to recognize dynamic and continuous hand gestures, this article proposes a two-step scheme, illustrated in FIGURE 6. In step one, the continuous hand gestures are detected and segmented using the proposed amplitude-based detection method, and in step two the detected hand gesture is recognized by the proposed FDTW method. The FDTW method contains an offline stage and an online stage. In the offline stage, the k-means algorithm is employed to cluster the central trajectory of all hand gesture spectrograms, and the FDTW is applied for matching distance training. In the online stage, the testing data of the hand gestures are clustered and input into the FDTW model for matching distance calculation and hand gesture recognition.

A. HAND GESTURE DETECTION METHOD
In real-time applications, the hand gesture is usually continuous and dynamic. Therefore, the first step for hand gesture recognition is how to detect and segment each gesture from the continuous and dynamic real-time hand gesture data. In this article, we propose a simple but effective continuous hand gesture detection method based on the amplitude of hand gesture spectrograms. The detailed steps are as follows.
Firstly, the hand gesture spectrograms of RTM, DTM and ATM are respectively normalized. Here, $A_{ij}$ denotes the normalized spectrogram, $R_{ij}$ is the value of the original hand gesture spectrogram, and $m$ and $n$ are the numbers of columns and rows of the hand gesture spectrogram, respectively. Then, we sum the amplitude of each frame of the normalized hand gesture spectrogram
$$Y_{sum_i} = \sum_{j} A_{ij}$$
where $Y_{sum_i}$ is the summation amplitude of the normalized spectrogram on the $i$-th frame and $j$ runs over the bins of that frame. Finally, we set a threshold: when the amplitude first becomes larger than the threshold, we mark that frame as the starting frame of the hand gesture, and the hand gesture ends when $Y_{sum_i}$ is no larger than the threshold. The hand gesture area is
$$G_{area} = \{\, i \mid Y_{sum_i} > \text{threshold} \,\}$$
where $G_{area}$ is the gesture frame area and $i$ is the frame number.
In this article, we found that the value of $Y_{sum_i}$ is usually larger than zero when there is a hand gesture, and less than zero otherwise. Therefore, the threshold used for hand gesture segmentation is zero. The RTM of two continuous push hand gestures before amplitude calculation is shown in FIGURE 7, and the corresponding results of the amplitude-based hand gesture detection method are given in FIGURE 8. We learn from FIGURE 7 that the starting and ending frames of the gestures are not clear. In FIGURE 8, by applying the amplitude-based detection method, the duration of the first hand gesture is 16 frames (from the 10th frame to the 25th frame), and the second hand gesture lasts from the 39th frame to the 54th frame.
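The amplitude-based segmentation can be sketched in a few lines of NumPy. The normalization used here (mean-centering followed by scaling with the value range) is an assumption chosen only so that frames without a gesture sum to a negative value; it is not the article's normalization formula, and the function names and the synthetic spectrogram are hypothetical.

```python
import numpy as np

def normalize(spec: np.ndarray) -> np.ndarray:
    # ASSUMED normalization: zero-mean, scaled by the value range.
    return (spec - spec.mean()) / (spec.max() - spec.min())

def frame_amplitude(spec: np.ndarray) -> np.ndarray:
    """spec has shape (num_bins, num_frames); returns one sum per frame."""
    return normalize(spec).sum(axis=0)

def segment(y: np.ndarray, threshold: float = 0.0):
    """Return (start_frame, end_frame) pairs where y exceeds the threshold."""
    active = y > threshold
    # Pad with False so gestures touching the borders are still closed.
    padded = np.concatenate(([False], active, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1) - 1
    return list(zip(starts.tolist(), ends.tolist()))

# Synthetic spectrogram: weak background plus two gestures of 16 frames each.
spec = np.full((32, 64), 0.1)
spec[:, 10:26] += 1.0
spec[:, 39:55] += 1.0

segments = segment(frame_amplitude(spec))
print(segments)
```

With a zero threshold, the two synthetic gestures are recovered as frame intervals 10 to 25 and 39 to 54, mirroring the example discussed for FIGURE 8.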

B. HAND GESTURE RECOGNITION ALGORITHM
In this subsection, we propose an FDTW hand gesture recognition algorithm, which applies RTM, DTM and ATM to improve hand gesture recognition accuracy. Since the spectrograms of the three parameters contain too many pixels, calculating the cosine similarity point by point may cost much time. Therefore, in the offline stage, to reduce the redundant input data of hand gesture spectrograms, the k-means algorithm is used to cluster the hand gesture central trajectory. Then, the cosine similarity of each of the three parameters is calculated as the distance. Finally, the three distances are fused and classified by the FDTW algorithm. The block diagram of the proposed algorithm is shown in FIGURE 9.

1) HAND GESTURE CLUSTERING
The k-means algorithm is applied to extract the center trajectory of the gesture spectrogram, and the clustered points describe the features of the hand gestures. In this article, the spectrograms of the three parameters (RTM, DTM and ATM) are two-dimensional matrices. We assume that there are $n$ gesture features in each spectrogram, and we cluster the hand gesture features of each spectrogram into a hand gesture central trajectory containing $k$ ($k \le n$) points. The specific steps of the k-means algorithm for hand gesture spectrogram clustering are given as follows.
Step 1, we randomly select $k$ hand gesture feature samples of RTM, DTM and ATM, respectively, as the initial central trajectory points, recorded as $\mu_1, \mu_2, \ldots, \mu_k$ for each hand gesture spectrogram. In this article, we choose the value of $k$ as 10.
Step 2, we define the cost function as
$$J(c, \mu) = \sum_{i=1}^{M} \lVert x_i - \mu_{c_i} \rVert^2$$
where $x_i$ is the $i$-th hand gesture sample, $c_i$ is the cluster of $x_i$, $\mu_{c_i}$ is the central trajectory point of that cluster, and $M$ is the total number of hand gesture samples.
Step 3, we define $t = 0$ as the iteration number, and repeat the following process until the value of the cost function $J_t(c, \mu) \le 10^{-5}$:
a) Assign $x_i$ to the nearest cluster, $c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2$.
b) Recompute each central trajectory point $\mu_j$ as the mean of the samples assigned to it, and set $t = t + 1$.
Therefore, the features in the hand gesture spectrograms of RTM, DTM and ATM are, respectively, extracted and represented by the $k$ clustered points.
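The clustering steps can be illustrated with a small self-contained k-means sketch. It assumes each gesture feature is a (frame, bin) point taken from a spectrogram and that iteration stops when the cost changes by no more than the tolerance; both choices, the function name and the synthetic trajectory are assumptions made for illustration.

```python
import numpy as np

def kmeans_trajectory(points, k=10, tol=1e-5, max_iter=100, seed=0):
    """Cluster (frame, bin) points and return k centers sorted by frame."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k samples at random as the initial trajectory points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    prev_cost = np.inf
    for _ in range(max_iter):
        # Squared distance from every point to every center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)                 # a) nearest cluster
        cost = dists[np.arange(len(points)), labels].sum()
        for j in range(k):                            # b) update the centers
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
        # ASSUMED stopping rule: change of the cost below the tolerance.
        if abs(prev_cost - cost) <= tol:
            break
        prev_cost = cost
    return centers[np.argsort(centers[:, 0])]

# Synthetic "push" trajectory: the bin index decreases linearly with the frame.
frames = np.arange(32, dtype=float)
points = np.column_stack([frames, 40.0 - frames])
trajectory = kmeans_trajectory(points, k=10)
print(trajectory.shape)
```

Sorting the centers by their frame coordinate turns the unordered cluster centers into a time-ordered sequence, which is the form a DTW-style matcher expects.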

2) FDTW ALGORITHM FOR GESTURE RECOGNITION
The DTW model for hand gesture recognition is shown in FIGURE 10, which gives the relationship between the test template and the matching template, each containing $m_1 = n_1 = 10$ points. The horizontal and vertical axes represent the hand gesture frames, and each small cell in the grid represents the intersection of the test template and the matching template. The similarity between two templates of different lengths can be measured by finding the shortest path.
To apply the DTW algorithm for hand gesture recognition, the following three conditions should be satisfied.
a) Monotonicity: if the current point of the shortest path is $W_k = (i, j)_k$, then the next point $W_{k+1}$ in the path must be one of $(i+1, j)_{k+1}$, $(i, j+1)_{k+1}$, or $(i+1, j+1)_{k+1}$.
b) Continuity: continuity ensures that every coordinate in sequences $N$ and $M$ appears in path $W$.
c) Starting and ending constraints: the boundary condition guarantees that the path starts from $W(1, 1)$ and ends at the end of the gesture $W(m_1, n_1)$.
Fortunately, the three conditions are satisfied since the clustered central trajectory points are sequences. In order to improve the accuracy of hand gesture recognition, this article proposes the FDTW recognition algorithm based on the different hand gesture features of RTM, DTM and ATM.
First, we calculate the matching distance between the RTM, DTM and ATM of hand gestures by using the cosine similarity, that is, the inner product of the matching hand gesture template and the test hand gesture template taken over their points $l$, divided by the product of their norms. Second, we define $\alpha(c, v)$ as the matching distance of RTM, which is the shortest distance. Similarly, $\beta(g, h)$ and $\gamma(e, j)$ are respectively the shortest matching distances of DTM and ATM, where $c$ and $v$ are the training and test samples of RTM, $g$ and $h$ are the training and test samples of DTM, and $e$ and $j$ are the training and test samples of ATM. Then, the fusion matching distance $d_f$ is obtained by combining these three matching distances, each weighted by a coefficient for the range, speed and angle, respectively. In our experiments we found that the highest recognition accuracy is obtained when all three coefficients are equal to one. In the online stage, the k-means algorithm is applied to the tested hand gesture data, and the fusion matching distance $d_f$ is calculated for each hand gesture type $\eta$. Finally, the hand gesture type with the shortest fusion distance is recorded as the recognized hand gesture type.
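A compact sketch of the fusion idea is given below. It makes two assumptions that the text does not pin down: the local DTW cost between two trajectory points is taken as one minus their cosine similarity, and the three per-feature DTW distances are fused by a weighted sum with unit coefficients. The function names and the toy templates are hypothetical.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """ASSUMED local cost: 1 - cosine similarity of two trajectory points."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / denom if denom > 0 else 1.0

def dtw_distance(template: np.ndarray, test: np.ndarray) -> float:
    """Classic DTW with steps (i-1, j), (i, j-1) and (i-1, j-1)."""
    m, n = len(template), len(test)
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = cosine_distance(template[i - 1], test[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[m, n])

def fdtw_classify(templates, test, weights=(1.0, 1.0, 1.0)):
    """templates: {label: (rtm, dtm, atm)}; test: (rtm, dtm, atm) trajectories."""
    fused = {
        label: sum(w * dtw_distance(t, s) for w, t, s in zip(weights, feats, test))
        for label, feats in templates.items()
    }
    # The label with the shortest fused distance is the recognized gesture.
    return min(fused, key=fused.get), fused

# Toy 10-point trajectories of (frame, value) pairs.
t = np.arange(1.0, 11.0)
push = np.column_stack([t, 11.0 - t])   # value decreases over time
pull = np.column_stack([t, t])          # value increases over time
templates = {"push": (push, push, push), "pull": (pull, pull, pull)}
sample = (push + 0.1, push + 0.1, push + 0.1)

label, fused = fdtw_classify(templates, sample)
print(label)
```

Because each trajectory has only 10 clustered points, the DTW grid has 100 cells per feature, which is the source of the running-time saving compared with matching full spectrograms.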

IV. EXPERIMENTS AND DISCUSSIONS

A. PLATFORM AND RADAR CONFIGURATION
The experimental environment and the radar equipment chosen for data acquisition are shown in FIGURE 11. The millimeter-wave radar is the Texas Instruments (TI) AWR1642 radar board, and the high-speed data acquisition card is the TSW1400. The radar system parameter configuration used in this article is shown in TABLE 1. The starting frequency of the radar is 77 GHz, the bandwidth is 4 GHz, and the radar has 2 transmitting and 4 receiving antennas. Each frame of radar data contains 128 chirps, and the period of each chirp is 38 µs. The tester sits in front of the radar, with the dynamic hand gesture range from 10 cm to 70 cm. We first collect the radar signal and construct the spectrograms. Then, the continuous hand gestures are segmented, and the FDTW algorithm is applied to recognize the hand gestures.
There are 200 samples for each type of hand gesture, and the total number of hand gesture samples is 1200. In FIGURE 12, we give the diagrams of the completed hand gestures and the corresponding three spectrograms (RTM, DTM and ATM) of each hand gesture. It should be noted that the duration of a completed hand gesture is less than one second, and FIGURE 12 shows the results of 32 frames (equal to 1.28 seconds). We learn from FIGURE 12 that different hand gestures have different RTM, DTM and ATM. FIGURE 12 (a) shows a sample of a push hand gesture: the range of the hand gesture becomes smaller, while the speed of the gesture (see FIGURE 12 (b)) first increases and then reduces to zero when the gesture finally stops in front of the radar. It should be noted that, in order to show the dynamic and continuous hand gestures, each collected data sample contains multiple hand gestures.
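As a quick sanity check on the radar configuration given in this subsection, the short calculation below uses the standard FMCW relations: range resolution $c/(2B)$ and IF frequency $f_{IF} = 2RB/(cT)$. It assumes that the 38 µs chirp period is also the sweep time $T$, which the text does not state explicitly; the helper names are hypothetical.

```python
C = 299_792_458.0        # speed of light in m/s
BANDWIDTH_HZ = 4e9       # radar bandwidth from the configuration
CHIRP_PERIOD_S = 38e-6   # ASSUMED to equal the sweep time T

def range_resolution(bandwidth_hz: float) -> float:
    """Smallest resolvable range difference, c / (2B)."""
    return C / (2.0 * bandwidth_hz)

def if_frequency(range_m: float, bandwidth_hz: float, sweep_time_s: float) -> float:
    """IF (beat) frequency of a target at the given range, 2RB / (cT)."""
    return 2.0 * range_m * bandwidth_hz / (C * sweep_time_s)

def range_from_if(f_if_hz: float, bandwidth_hz: float, sweep_time_s: float) -> float:
    """Inverse relation: range is proportional to the IF frequency."""
    return C * sweep_time_s * f_if_hz / (2.0 * bandwidth_hz)

res_m = range_resolution(BANDWIDTH_HZ)
f_near = if_frequency(0.10, BANDWIDTH_HZ, CHIRP_PERIOD_S)
f_far = if_frequency(0.70, BANDWIDTH_HZ, CHIRP_PERIOD_S)
print(f"range resolution: {res_m * 100:.2f} cm")
print(f"IF frequency at 10 cm: {f_near / 1e3:.1f} kHz, at 70 cm: {f_far / 1e3:.1f} kHz")
```

Under these assumptions the range resolution is a few centimeters, so the 10 cm to 70 cm gesture zone spans roughly a dozen or more range bins.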

C. ANALYSIS OF HAND GESTURE DETECTION RESULTS
Before investigating the recognition performance of continuous hand gesture, we first analyze the segmentation accuracy of the proposed amplitude-based detection method.
To describe hand gesture segmentation accuracy, the detection range is applied in this article.

1) HAND GESTURE DETECTION RANGE
Through the experimental test, the detection offset distribution map of each hand gesture can be obtained. The period with the highest occurrence probability in the distribution map is defined as the detection range. In an actual commercial system, the accuracy of the recognition system is more than 95%, which means that an error (drop) rate of up to 5% is tolerated [21]. As a result, the detection range built in this article allows a performance drop of up to 5%. FIGURE 13 shows the detection offset for the six types of hand gestures. The detection offset indicates that the detection range increases with the appearance of error frames and, as a result, the hand gesture recognition performance decreases. We define the error frame as 1 frame unit. Within 14 to 20 frames of the PS gesture, no error frames appear and the hand gesture detection performance is 100%. When the performance drops by 1%, the hand gesture detection range is expanded to 13 to 21 frames. Similarly, when the performance drops by 3%, the gesture detection range is expanded to 12 to 22 frames, and when the performance drops by 5%, the hand gesture detection range becomes larger still. In other words, a wider gesture detection range comes with a larger drop in hand gesture detection performance.

2) DETECTION PROBABILITY
In this part, we evaluate the accuracy of the segmented hand gestures and obtain the detection probability of the hand gestures. FIGURE 14 shows the detection probability of hand gestures after segmentation using the trigger value (TV) [21], the trigger ratio (TR) [21] and the amplitude-based detection method proposed in this article under different performance drops. We can see from FIGURE 14 that the detection probability of the proposed amplitude-based detection method is higher than that of TV and TR. As the allowed performance drop grows, the detection range of the hand gesture increases, which improves the detection probability of the hand gestures. The average detection probability is shown in TABLE 3. We learn from TABLE 3 that the detection probability of the amplitude-based detection method is higher than that of TV and TR. The average detection probability of the proposed method is 83.50% when the performance drops by 1%, which is about 10% higher than the other two alternatives. When the performance drops by 5%, the range of hand gesture detection is enlarged, and the detection probability increases to 96.17%, which is 3% higher than the existing alternatives. Therefore, we apply the 5% performance drop in the subsequent experiments.

D. HAND GESTURE CLUSTERING RESULTS
The k-means algorithm is used to cluster the central time-frequency trajectory of the hand gesture spectrogram. Since the duration of a single hand gesture after continuous gesture segmentation varies, the observation duration of the segmented hand gesture is uniformly expanded to the original 32 frames. FIGURE 15 shows the extended segmented spectrograms of RTM, DTM and ATM of PS, and their corresponding central trajectories after clustering. It is observed from FIGURE 15 that the time-frequency trajectories represent the main features of the gesture spectrograms.

E. HAND GESTURE RECOGNITION RESULTS
To prove the effectiveness of the proposed FDTW algorithm, we use several different datasets: RTM, DTM, ATM and a multi-feature dataset (containing RTM, DTM and ATM). The K-Nearest Neighbor (KNN) algorithm and the traditional DTW algorithm are taken for comparison. The recognition accuracies are shown in TABLE 4. As can be seen from TABLE 4, the average recognition accuracy of the DTW algorithm is higher than that of the KNN algorithm. This is attributed to the fact that the KNN algorithm uses the simple Euclidean distance, while DTW applies the cosine similarity, which has proved to be more efficient [41]. Moreover, the recognition accuracies on the single-parameter datasets for both the KNN algorithm and the DTW algorithm are always less than 90%. However, the proposed FDTW algorithm makes full use of the RTM, DTM and ATM features, making the hand gesture recognition rate (95.83%) higher than that of the existing alternatives. The accuracy of the proposed FDTW algorithm is even 1.33% higher than that of the DTW algorithm with RTM and DTM, which proves the effectiveness of using the fusion data.

F. TIME COMPLEXITY ANALYSIS
Since the central time-frequency trajectories of RTM, DTM and ATM are clustered, we will show the time efficiency in this subsection. We test the running time of different methods before clustering (BC) and after clustering (AC) respectively.
The running time is shown in TABLE 5. We learn from TABLE 5 that the running time of the KNN algorithm on the single-parameter datasets is 2.12 ms before clustering and 1.13 ms after clustering. Moreover, the running time of the proposed FDTW algorithm after clustering is about 5 ms, which is reduced by more than 50% compared to the FDTW algorithm without time-frequency trajectory clustering.

V. CONCLUSION
In this article, a continuous hand gesture detection and recognition method was proposed. Firstly, we collected the raw data of the radar to obtain the IF signal, and used the IF signal to estimate the RTM, DTM and ATM of hand gestures. Then, we proposed an amplitude-based detection method, which applied the amplitude of the normalized hand gesture spectrogram in conjunction with a threshold. Finally, the k-means algorithm was adopted to cluster the center time-frequency trajectories of RTM, DTM and ATM, and FDTW algorithm was proposed to recognize the hand gestures. Experimental results showed that the segmentation accuracy of the proposed hand gesture detection method could reach 96.17%, and the average recognition accuracy of the proposed FDTW algorithm for six types of hand gestures was improved by more than 5% while saving half of the running time.