Research on sEMG-Based Gesture Recognition by Dual-View Deep Learning

In the field of human-machine interaction, gesture recognition using sparse multichannel surface electromyography (sEMG) remains a challenge. In this paper, a dual-view multi-scale convolutional neural network (DVMSCNN) based on the Hilbert filling curve is designed to enhance gesture recognition performance. The network consists of two parts. In the first part, the sEMG signal is filled using the Hilbert filling curve, and the resulting time-domain and electrode-domain images are used as inputs to the two blocks. In the second part, the depth features learned by the blocks are fused and classified by a view aggregation network based on "layer fusion". Evaluation of the architecture on the four databases NinaPro-DB1, DB2, DB3, and DB4 shows that DVMSCNN is more than 7% more accurate than other state-of-the-art methods. When validated on a home-grown dataset, DVMSCNN achieved a recognition rate of 0.8848.


I. INTRODUCTION
As human-machine interaction plays an increasingly important role in modern life, how to interact with computers in an efficient and natural way has become an important research topic. Hand gestures, which are simple and natural, are an essential part of body language. Thus, gesture recognition is also a key technology in human-machine interaction [1].
For classification, traditional machine learning (ML) has been widely used in gesture recognition [17], [18]. Lu Z. et al. [19] adopted a Bayesian linear classifier and an improved dynamic time warping algorithm for the recognition of 19 gestures, reporting an average accuracy of 89.6% in user-independent testing. In addition, comparisons of different sEMG features and classifiers on the 52 gestures of the NinaPro reference dataset found that a random forest classifier combined with statistical and frequency-domain features, i.e., MAV, histogram, wavelet, and Fourier transform features, yields the best performance [20]-[24]. With the continuous development of deep learning (DL), it has gradually been applied to gesture recognition in recent years [25]-[28]. Panwar M. et al. [29] presented a deep learning framework, Rehab-Net, which classifies three movements of stroke survivors without any feature engineering, achieving an overall accuracy of 88.87%. Zhai X. et al. [30] used short-latency, dimension-reduced sEMG spectrograms as input to a convolutional neural network (CNN) and a support vector machine (SVM); the results showed that the CNN consistently exhibited better performance.
CNNs have made breakthroughs in feature extraction and image classification for 2D problems. Thus, it makes sense to find a suitable method to convert sEMG into an image that can be used as CNN input [31], [32]. Hilbert filling curves can be applied to organize or compress data, such as biomedical signals, by providing a locality-preserving mapping between one-dimensional and D-dimensional spaces [33]. Chen L. et al. [34] proposed a feature extraction method based on the Hilbert-Huang transform and used an extreme learning machine for classification; the experiments showed a classification accuracy of 88%. Kurek J. et al. [35] used Hilbert curves to represent mammograms as 1-dimensional vectors and extracted features from them to detect breast cancer, with a final accuracy of 85.83%. Accordingly, for the CNN input-image problem, filling the sEMG with a Hilbert curve helps to enhance the classification performance of gesture recognition.
In this paper, we design a dual-view multi-scale convolutional neural network (DVMSCNN) to improve sEMG-based gesture recognition performance. The network consists of two parallel multi-scale CNNs and a view aggregation network based on "layer fusion". The inputs of the two views are 2D time-domain and electrode-domain images obtained by filling the sEMG with the Hilbert curve.
The rest of this paper is organized as follows. Section 2 introduces the process of data collection and processing. Section 3 proposes the classification model based on DVMSCNN. Section 4 presents the experimental results of the proposed algorithm. Section 5 summarizes the study and outlines future work.

II. DATA COLLECTION AND PROCESSING

A. Dataset and preprocessing
The evaluations in this work were performed offline using multi-channel sEMG signals from the publicly available NinaPro databases [41]. We chose 4 sub-databases of NinaPro, the details of which are as follows. The first sub-database (denoted as NinaPro-DB1) contains 10-channel sparse multi-channel sEMG signals recorded from 27 intact subjects. Each gesture was recorded in 10 trials at a sampling rate of 100 Hz. Each subject was asked to perform 53 gestures, including 12 finger movements, 17 wrist movements and hand postures, and 23 grasping and functional movements. The relaxation state between repetitions was the resting gesture.
The second sub-database (denoted as NinaPro-DB2) contains 12-channel sparse multi-channel sEMG signals recorded from 40 intact subjects. Each gesture was recorded in 6 trials at a sampling rate of 2000 Hz. Each subject was asked to perform 50 gestures, including 9 force patterns, 17 wrist movements and hand postures, 23 grasping and functional movements, and the rest movement. The third sub-database (denoted as NinaPro-DB3) contains 12-channel sparse multi-channel sEMG signals recorded from 11 transradial amputees; other information is exactly the same as in NinaPro-DB2. According to the authors of the NinaPro database, three amputated subjects performed only part of the gestures due to fatigue or pain, and in two amputated subjects the number of electrodes was reduced to ten due to insufficient space. To ensure that training and testing of the model could be completed, we omitted the data from these subjects, following the experimental configuration used in [36].
The fourth sub-database (denoted as NinaPro-DB4) contains 12-channel sparse multi-channel sEMG signals recorded from 10 intact subjects. Each gesture was recorded in 6 trials at a sampling rate of 2000 Hz. Each subject was asked to perform exactly the same 53 gestures as in NinaPro-DB1. Because two subjects (i.e., subjects 4 and 6) did not complete all hand movements, their data were omitted from our experiment.

B. Data processing
Due to the memory limitations of the hardware, for the experiments on NinaPro DB2-DB4 we down-sampled the sEMG signals from 2000 Hz to 100 Hz, following the experimental configuration used in [37]. The data processing is divided into two parts, as shown in Fig.2. The first part is data preprocessing. First, the sEMG signals are filtered with a 1st-order 1 Hz low-pass Butterworth filter. Then, the data are Min-Max normalized. As the last step, the data are segmented into overlapping windows of length 640 ms with a step of 10 ms, using the sliding-window method. Fig.1 shows the sEMG signal of NinaPro-DB1 before and after data preprocessing. The second part is to fill the preprocessed data with the Hilbert curve, a continuous fractal space-filling curve. Space-filling curves have been widely applied to data organization and compression tasks, and the Hilbert curve is known to be superior in preserving locality compared to alternatives such as the z-order and Peano curves [38], [39]. The rule is to recursively rearrange the sequence data into a D-dimensional space, where D = 2. The sEMG signal is transformed in two ways: (1) across the time dimension, i.e., for each sEMG channel, the time series is mapped into a 2D image, or (2) across the sEMG channels, i.e., for each time instant, the values of the channels are mapped into a 2D image. If H_i denotes a Hilbert curve of order i, the construction is as follows: 1) H_0 is a single point; 2) H_1 consists of four copies of (the point) H_0, connected by three straight segments of length h at right angles to each other. Four orientations of this curve, labeled 1, 2, 3, and 4, are shown in Fig.3.
3) H_2 is constructed by connecting four copies of H_1 in different orientations with three straight segments of length h_2. There are four possible directions, and the construction rules are summarized in Table Ⅰ. Fig.3 shows a 2nd-order Hilbert curve in orientation #2, i.e., its H_1 copies follow the 1-2-2-3 orientation sequence. 4) H_n is constructed by connecting four copies of H_{n-1} in different orientations with three straight segments of length h_n. Higher-order Hilbert curves can therefore be generated by this recursive approach.
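The recursive construction above can equivalently be computed iteratively with the standard index-to-coordinate conversion for Hilbert curves. The following is a minimal sketch, not the authors' code; the function names `hilbert_d2xy` and `fill_image` are ours:

```python
def hilbert_d2xy(order, d):
    """Convert distance d along an order-`order` Hilbert curve
    into (x, y) coordinates on a 2^order x 2^order grid."""
    x = y = 0
    s = 1
    t = d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:               # rotate the quadrant if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x           # swap x and y
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def fill_image(signal, order):
    """Fill a 1-D signal of length 4**order into a 2^order x 2^order
    image along the Hilbert curve, preserving locality."""
    n = 1 << order
    image = [[0.0] * n for _ in range(n)]
    for d, value in enumerate(signal):
        x, y = hilbert_d2xy(order, d)
        image[y][x] = value
    return image
```

With order 3, a 64-sample segment fills an 8×8 image; consecutive samples always land in adjacent pixels, which is the locality property motivating the Hilbert curve here.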
For a given M×N sEMG signal, M represents the length of the time series and N the number of electrode channels. When the signal is transformed in the electrode dimension, Fig.4 shows the resulting image representation of the Hilbert electrode dimension at time instants 0, 20, 40, and 60. When it is transformed in the time dimension, the signal eventually becomes an 8×8×10 image, as shown in Fig.5 for the image representation of the Hilbert time dimension on the electrodes. Note that when the segment length is smaller than the capacity of the curve, rows and columns containing only zeros can be padded or cropped to obtain the final image.
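The preprocessing chain described above (1st-order 1 Hz low-pass Butterworth filter, Min-Max normalization, 640 ms windows with a 10 ms step at 100 Hz) can be sketched as follows. This is an illustrative sketch assuming SciPy's standard filter API; the function name `preprocess` and the array layout are ours:

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 100    # sampling rate after down-sampling (Hz)
WIN = 64    # 640 ms window at 100 Hz
STEP = 1    # 10 ms step at 100 Hz

def preprocess(emg):
    """emg: (samples, channels) raw sEMG array -> (windows, WIN, channels)."""
    # 1st-order 1 Hz low-pass Butterworth filter, applied per channel
    b, a = butter(N=1, Wn=1, btype="low", fs=FS)
    filtered = lfilter(b, a, emg, axis=0)
    # Min-Max normalization per channel
    lo, hi = filtered.min(axis=0), filtered.max(axis=0)
    normed = (filtered - lo) / (hi - lo + 1e-12)
    # Sliding-window segmentation into overlapping windows
    starts = range(0, normed.shape[0] - WIN + 1, STEP)
    return np.stack([normed[s:s + WIN] for s in starts])
```

Each 64-sample window per channel can then be filled into an 8×8 image as in Fig.5.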

III. CLASSIFICATION MODEL BASED ON DVMSCNN

A. Multi-view learning
Multi-view learning is an emerging direction in machine learning that refers to learning from multi-view data, i.e., multiple feature sets that reflect different attributes or views of the data. Compared with single-view learning, multi-view learning can achieve higher performance by making full use of the information in the different views. The multi-view CNN is an important instantiation of multi-view learning in practice, and it consists of two parts. The first half is a multi-stream CNN composed of multiple branches; each branch models the data of one view separately so that every view is fully exploited during learning. The second half uses a multi-view aggregation network to aggregate the multi-view features learned in the first half. Assume the data of the N views are v_1, v_2, ..., v_N. In the first half, the multi-view convolutional neural network is modeled by N CNN branches f_1, f_2, ..., f_N with parameters θ_1, θ_2, ..., θ_N:

H_i = f_i(v_i; θ_i), i = 1, 2, ..., N,

where H_i is the hidden-layer output feature of the i-th network branch, which can be understood as the feature extracted from the data v_i. The aggregation network g then fuses these features, and the final gesture recognition label y is obtained:

y = g(H_1, H_2, ..., H_N).

B. Proposed deep learning framework
Inspired by multi-view learning, we propose the dual-view multi-scale convolutional neural network (DVMSCNN) architecture illustrated in Fig.6. First, the two views v_i (i = 1, 2) of the sEMG signal after the Hilbert transformation are expressed as:

v_i = f_v^(i)(x), i = 1, 2,

where x denotes the sEMG signal, f_v denotes the Hilbert filling transform, and v_1 and v_2 are the image representations of the time domain and electrode domain, respectively.
Then, the two views are modeled in parallel by the two blocks. This process can be formulated as:

H_i = Block_i(v_i; θ_i), i = 1, 2.

C. The block architecture
Traditional neural networks learn fine-scale features in their early layers and coarse-scale features in later layers (through repeated convolution, pooling, and strided convolution). The coarse-scale features of the final layers are important for classifying the content of the whole image into a single class. Early layers lack such coarse-level features, so early-exit classifiers attached to these layers are likely to yield unsatisfactorily high error rates. To address this issue, we propose an architecture similar to the Multi-Scale Dense Network (MSDNet) [40]; the two blocks are shown in Fig.7 and Fig.8, respectively. The horizontal direction corresponds to the layer direction (depth) of the network; it preserves and propagates high-resolution information, which facilitates the construction of high-quality coarse features in later layers. The vertical direction represents the scale of the feature maps and produces coarse features throughout the network that are amenable to classification. The n×n on a feature denotes the size of the feature map, and the top right of Block1 explains the meaning of each icon and arrow.
The detailed steps of the block input part are shown in Fig.9(a). First, batch normalization is applied to the data before the first convolutional layer, which consists of 64 3×3 filters, to prevent overfitting; batch normalization and a ReLU activation are then applied to the result, which serves as the input feature of the block. Fig.9(c)(d) shows the detailed steps of regular convolution and strided convolution. Among them, regular convolution increases the depth of the architecture along the horizontal direction, and strided convolution changes the scale along the vertical path, transferring information from higher to lower resolutions and learning a more comprehensive range of depth features.
Since the fusion of regular and strided convolution is a concatenation along the channel dimension, their outputs must have feature maps of the same size. Therefore, there is no zero padding in the second convolutional layer of the strided convolution, while all other convolutional layers use zero padding. In addition, batch normalization and ReLU activation functions are applied to each layer to prevent overfitting. Finally, the output features of layer L at scale s in Block1 and Block2 are listed in Tables Ⅱ and Ⅲ.
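The size constraint for this channel-wise concatenation can be checked with the standard convolution output-size formula. A small sketch (the helper name `conv_out` and the concrete sizes are illustrative, not from the paper):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# A regular 3x3 convolution with stride 1 and zero padding keeps the scale:
assert conv_out(8, 3, stride=1, padding=1) == 8
# A 3x3 strided convolution (stride 2) with zero padding halves it,
# so features arriving at the coarser scale match in spatial size and
# can be concatenated along the channel dimension:
assert conv_out(8, 3, stride=2, padding=1) == 4
```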

D. View aggregation network
To aggregate the depth features and improve gesture recognition accuracy, DVMSCNN embeds a view aggregation network based on "layer fusion", as shown in Fig.10. The network fuses the depth features from the different layers (1 to L), which are listed in Table Ⅳ. The fused features are then passed through a classifier consisting of an FC layer with 256 hidden units, an FC layer with 128 hidden units, and a softmax activation, yielding a single output label. Batch normalization and a ReLU nonlinearity are applied to each layer, while Dropout is applied to each FC layer to prevent overfitting.
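The classifier head described above can be sketched in NumPy as a forward pass. This is a minimal illustration, not the trained model: the random weight initialization, the fused-feature width, and the assumption of 53 output classes (the NinaPro-DB1 gesture count) are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

def classify(fused, n_classes=53, seed=0):
    """fused: (batch, d) concatenated dual-view depth features.
    FC(256) -> FC(128) -> softmax, as in the view aggregation network."""
    rng = np.random.default_rng(seed)
    d = fused.shape[1]
    w1 = rng.standard_normal((d, 256)) * 0.01    # FC layer, 256 hidden units
    w2 = rng.standard_normal((256, 128)) * 0.01  # FC layer, 128 hidden units
    w3 = rng.standard_normal((128, n_classes)) * 0.01
    h1 = relu(fused @ w1)
    h2 = relu(h1 @ w2)
    return softmax(h2 @ w3)                      # per-class probabilities
```

In the full network, batch normalization and Dropout would also be applied to these layers, as stated above.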

IV. EXPERIMENTS AND RESULTS

A. Performance Metrics
In this paper, five-fold cross-validation is used when experimenting with the NinaPro database. Specifically, for each subject, eight out of ten repetitions are used as training data and the remaining two as testing data. This process is repeated five times, and the results are averaged to obtain the final test performance. Accuracy is used as the performance metric.
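The repetition-wise split can be sketched as follows; the function name `repetition_folds` and the particular assignment of repetitions to folds are illustrative assumptions, since the paper does not specify which two repetitions form each test fold:

```python
def repetition_folds(n_reps=10, n_folds=5):
    """Five-fold split over 10 repetitions: each fold holds out 2
    repetitions for testing and trains on the remaining 8."""
    reps = list(range(n_reps))
    per_fold = n_reps // n_folds
    folds = []
    for k in range(n_folds):
        test = reps[k * per_fold:(k + 1) * per_fold]
        train = [r for r in reps if r not in test]
        folds.append((train, test))
    return folds
```

Each repetition appears in exactly one test fold, so the five accuracies can be averaged without overlap.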

B. Experimental results of DVMSCNN
The influence of different numbers of network layers (L = 2, 3, 4, 5, 6) on gesture recognition is shown in Fig.11, which shows that performance grows stepwise as the number of layers increases (L = 2, 3, 4). However, when L > 4, additional layers do not bring further performance improvement. Thus, DVMSCNN is set to a four-layer structure as shown in Fig.12, where Block1 and Block2 are shown in Fig.13 and Fig.14.
With the number of DVMSCNN layers set to L = 4, four layer-fusion schemes can be obtained, the first being fusion of layer 4 only.

C. Hyper-parameter selection
A hyper-parameter is an important concept in deep learning, referring to a parameter that must be set manually before model training begins. Hyper-parameter settings have a great impact on the performance of a network model, so finding a suitable set of hyper-parameters is one of the key steps in building a deep model. The hyper-parameters considered here are the pooling type, the Dropout rate, and the initial learning rate. In summary, the network was trained using SGD for 90 epochs with an initial learning rate of 0.1, halved every 10 epochs, and a batch size of 1024. Dropout layers were appended after the convolutional layers with a forget rate of 0.25 to avoid overfitting the networks on the small training set. In addition, ℓ2 weight decay regularization with a value of 0.0005 was applied to all convolutional layers.

D. Effectiveness of Hilbert filling
This section verifies whether Hilbert filling helps to enhance gesture recognition performance. For comparison, DVMSCNN is fed sEMG without Hilbert filling, sEMG with only time-domain filling, and sEMG with only electrode-domain filling. From Fig.15-18, it can be concluded that Hilbert filling significantly improves gesture recognition performance, and that the combination of the time domain and the electrode domain outperforms either single domain.
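The step learning-rate schedule used in training (initial rate 0.1, halved every 10 epochs over 90 epochs) can be written as a small helper; this is an illustrative sketch, not the authors' training code:

```python
def learning_rate(epoch, base_lr=0.1, halve_every=10):
    """Step schedule: the initial rate is halved every `halve_every` epochs."""
    return base_lr * (0.5 ** (epoch // halve_every))

# Epochs 0-9 train at 0.1, epochs 10-19 at 0.05, and so on;
# by the final epoch (89) the rate has been halved eight times.
```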

E. Comparison with the state-of-the-art gesture recognition approaches
To evaluate the performance of DVMSCNN, a comparative study is conducted with other state-of-the-art sEMG-based models. Their introduction and data processing are described in Section II, and the performance metrics and hyper-parameter selection are described in Sections IV.A and IV.C. Notably, for the random forest and SVM baselines, training was performed according to the original papers' procedures, since their data processing involves manual feature extraction.

F. Comparison results between actual gestures and the standard datasets
This section presents the results of comparing actual gestures with the standard datasets. A total of 8 channels of sparse multichannel sEMG signals recorded from 8 healthy subjects (3 females, 5 males) were acquired using the Delsys Trigno wireless acquisition system. Each gesture was recorded in 5 trials at a sampling rate of 200 Hz, with a 10-second rest between trials to avoid muscle fatigue. Each subject was asked to perform 12 gestures, including 5 basic finger movements, 4 isotonic and isometric hand configurations, and 3 grasping hand gestures. The specific gestures are shown in Fig.19. The actual gesture dataset uses the same data processing as the NinaPro databases. Table IX shows the final accuracy results, and Fig.20 shows the confusion matrix for the actual gesture data, where the numbers 1-12 correspond to the gestures in Fig.19. Compared with the standard databases, the home-grown dataset is limited in the number of subjects and gestures, but overall it verifies that DVMSCNN also has good recognition performance on the actual gesture dataset.

V. Conclusions
This paper designs a dual-view multi-scale Hilbert convolutional neural network for effectively classifying hand gestures from NinaPro-DB1, NinaPro-DB2, NinaPro-DB3, and NinaPro-DB4. First, the sEMG is converted into time-domain and electrode-domain datasets using the Hilbert filling curve's locality-preserving property. Second, to improve the classification performance, the two views are used as the inputs to the network, and a view aggregation network based on "layer fusion" is embedded to aggregate the features from each layer of the two views. In conclusion, the DVMSCNN framework achieved accuracies of 0.8672, 0.8329, 0.7058, and 0.7332 on the four databases, respectively. When validated on a home-grown dataset, the recognition rate reaches 0.8848. In addition, better overall performance is reported compared with the state-of-the-art models. In the future, following recent