Small-Data-Driven Temporal Convolutional Capsule Network for Locomotion Mode Recognition of Robotic Prostheses

Locomotion mode recognition has been shown to contribute substantially to the precise control of robotic lower-limb prostheses under different walking conditions. In this study, we propose a temporal convolutional capsule network (TCCN) that integrates spatial-temporal, dilated-convolution, dynamic-routing, and vector-based features to recognize locomotion modes from small data, rather than relying on big-data-based neural networks, for robotic prostheses. TCCN has four characteristics: it (1) extracts the spatial-temporal information in the data, (2) applies dilated convolution to cope with small data, (3) uses dynamic routing, which bears some similarity to how the human brain processes information, and (4) treats the data as vectors, unlike scalar-based networks such as the convolutional neural network (CNN). We compared TCCN with a traditional machine learning method, the support vector machine (SVM), and with big-data-driven neural networks, namely the CNN, recurrent neural network (RNN), temporal convolutional network (TCN), and capsule network (CN). The accuracy of TCCN is 4.1% higher than that of CNN under 5-fold cross-validation with three locomotion modes and 5.2% higher under 5-fold cross-validation with five locomotion modes. The main confusion appears in the transition state. The results indicate that TCCN can handle small data by balancing global and local information, which is closer to the way the human brain works, and that the capsule layer processes vector information better because it retains not only magnitude but also direction.


I. INTRODUCTION
Robotic prostheses could assist people with lower-limb injuries in performing basic movements [1]. Human locomotion […], and based on these signals, versatile algorithms have been proposed to obtain more accurate recognition, including model-based computational methods, e.g., fuzzy logic [7] and a slope gradient estimator [8], and model-free computational methods, i.e., traditional machine learning, e.g., support vector machines (SVMs) [9], [10], and big-data-driven neural networks, e.g., the convolutional neural network (CNN) [6], [11].

Model-based computational methods have been proposed to address the issue of accurate locomotion mode recognition [7], [8]. With one inertial measurement unit (IMU) in the backpack and encoders to measure lower-limb joint angles, a slope gradient estimator based on sensor data fusion was proposed to construct an adaptive gait planning approach for sloped terrains; the performance of the approach is limited by the sensor accuracy, the tracking errors of the controllers, and the kinematics computation [8]. The fuzzy logic-based method can apply fuzzy sets and fuzzy rules to reason about transitional boundaries or qualitative knowledge and experience when describing systems, which is suitable for terrain identification for robotic prostheses; the average identification accuracy reached 98.74%, with an average identification delay of 9.06% of one gait cycle [7]. If human locomotion modes were finite, they could be established accurately and completely, and human locomotion could be recognized more easily and rapidly. However, the varied and individualized differences in human movement make it challenging to recognize human movement modes accurately [7].

Model-free methods have gradually attracted wide attention with the development of machine learning and neural networks. Strides can be divided into three separate feature sets (sensed, translational, and expanded), and cross-validation is then performed using linear discriminant analysis to enhance walking task prediction in robotic prostheses [12]. A support vector machine (SVM) classifier has also been used to recognize the locomotion mode [10]: raw data are collected from onboard sensors to calculate feature values, which are concatenated into feature vectors with their locomotion mode labels to train the classifier; after training, the SVM classifier is used for real-time recognition [10]. To cope with the time-varying nature of surface electromyography (sEMG) signals, adaptive intent recognition algorithms have been proposed and evaluated on transfemoral amputees […]. In this study, we propose a temporal convolutional capsule network (TCCN) that integrates spatial-temporal, dilated-convolution, dynamic-routing, and vector-based features for recognizing locomotion modes with small data rather than big-data-based neural networks for robotic prostheses.

Three subjects with transtibial amputation participated in the experiment, the same as in [11]. We collected the strain gauge signal while they walked wearing the robotic prosthesis […]. The experiments were approved by the Local Ethics Committee of Peking University. The raw data in this study come from one strain gauge inside a robotic prosthesis, the same as in [11]. The prosthesis comes from a previous study [11], and its detailed parameters can be found in [11]. The prosthesis uses two kinds of sensors: one full bridge of strain gauges and one angle sensor. The angle sensor was used to measure the ankle angle. The strain gauge, which reflects the deformation caused by the contact force between the carbon-fibre footplate and the ground, contains motion mode information while walking on different terrains. To reduce weight and volume, the battery is embedded in the prosthesis. In this study, five locomotion modes (i.e., level ground, ramp ascent, ramp descent, stair ascent, and stair descent) were studied and analyzed. Each subject was asked to perform level ground, ramp ascent, level ground, and stair descent (forward walking), turn around, and then perform stair ascent, level ground, ramp descent, and level ground (reverse walking) in order, as seen in Fig. 1. There were 35 repetitions of forward/reverse walking for each subject. All subjects volunteered to participate in the experiments and signed informed consent; the possible risks were explained to them in advance. More details can be found in [11]. The data used in this study were derived from a previous study on locomotion mode recognition [11]; the participants, data collection, and data pre-processing are described in detail in [11].
The original signals from the strain gauge bridge were amplified 100 times and processed by a low-pass digital filter with a 10 Hz cutoff frequency [11]. The raw signal of each trial was divided into 10 strides at the heel strike time [11]. Linear interpolation was performed on the signal of each step, transforming the signal of each step into a fixed length (1000) [11]. All data were max-min normalized [11]. In this study, several different neural networks and an SVM algorithm were used to identify locomotion patterns. Different types of networks accordingly have different data processing methods: CNN processes data with spatial characteristics, RNN processes data with temporal characteristics, and TCN, TCCN, and CN process data with spatial-temporal characteristics. After pre-processing, the data format from the strain gauge is 1000 × 1, where 1000 represents the length of the signal of each stride. CNN and SVM can directly use this data, while the other neural networks, which deal with data with temporal or spatial-temporal characteristics, cannot. Therefore, the data of each stride (1000 × 1) was further converted into (20 × 50), as shown in Fig. […].

[…] is proposed and applied to the mode recognition of robotic prostheses, which extracts the spatial-temporal information in […] from the lower layer to the higher layer and adds the input, which may solve the problem of network degradation to a certain extent. In particular, the convolutional layer used in the residual block uses dilated convolution. By sampling the input data exponentially layer by layer, dilated convolution can obtain a larger receptive field with fewer network layers, which reduces the depth of the network. Dilated convolution has demonstrated the capability to improve the performance of deep convolutional neural networks on small data sets [24].
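The pre-processing steps described above (fixed-length resampling, max-min normalization, and the 1000 × 1 to 20 × 50 reshape) can be sketched in a few lines. The function name and the synthetic test signal are illustrative assumptions, not the authors' code:

```python
import numpy as np

def preprocess_stride(raw, target_len=1000, shape=(20, 50)):
    """Resample one stride to a fixed length, max-min normalize,
    and reshape it for the spatial-temporal networks (sketch)."""
    raw = np.asarray(raw, dtype=float)
    # Linear interpolation to a fixed length of 1000 samples.
    x_old = np.linspace(0.0, 1.0, len(raw))
    x_new = np.linspace(0.0, 1.0, target_len)
    resampled = np.interp(x_new, x_old, raw)
    # Max-min normalization to [0, 1].
    lo, hi = resampled.min(), resampled.max()
    normalized = (resampled - lo) / (hi - lo)
    # Convert the 1000 x 1 stride into a 20 x 50 array.
    return normalized.reshape(shape)

# A synthetic stride of arbitrary length stands in for a real strain gauge stride.
stride = preprocess_stride(np.sin(np.linspace(0, 4 * np.pi, 837)))
print(stride.shape)  # (20, 50)
```

Networks expecting spatial-temporal input (TCN, CN, TCCN) would consume the 20 × 50 array, while CNN and SVM can take the flat 1000-sample stride directly.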
Therefore, we use dilated convolution in the proposed TCCN model. The TCCN model is shown in Fig. 4, and the details of its structure are shown in Table I.

In this study, we identified three motion patterns and five motion patterns. Because the control parameters of the prosthesis are similar in similar movement terrains, such as ramp ascent and stair ascent [11], misclassification between similar states has a relatively small impact on the control of robotic prostheses. Therefore, we grouped similar terrains into the same category: stair ascent and ramp ascent were classified as the ascent mode, and stair descent and ramp descent were classified as the descent mode. We randomly divided the data of each subject into three groups: one for training, one for validation, and one for testing. The training set contains 60% of the data, and the validation and test sets each contain 20%.
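The dilated convolutions used in the residual blocks can be illustrated with a minimal 1D sketch; the helper names are hypothetical, and a real model would use a deep learning framework:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1D convolution whose kernel taps are spaced
    `dilation` steps apart (zero-padded on the left)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    out = np.empty(len(x))
    for t in range(len(x)):
        # Taps x[t], x[t - d], x[t - 2d], ... in kernel order.
        taps = xp[t + pad - dilation * np.arange(k)]
        out[t] = taps @ kernel
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated convolution layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(dilated_conv1d([1, 2, 3, 4], [1, 1], 2))  # [1. 2. 4. 6.]
print(receptive_field(3, [1, 2, 4, 8]))         # 31
```

With kernel size 3 and dilations 1, 2, 4, 8, four layers already cover 31 time steps, which is why an exponentially dilated stack needs far fewer layers than ordinary convolutions for the same receptive field.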

A loss function is used to calculate the difference between the predicted value of the neural network and the real value. Based on the loss value, the coefficients of the neural network are optimized through specific algorithms (e.g., the back-propagation algorithm). In the CNN, RNN, and TCN models, we used the cross-entropy loss function, while in the CN and TCCN models, we used the margin loss function, which can be found in [25]. The cross-entropy loss function used in this study is defined as

Loss = -(1/S) Σ_{i=1}^{S} Σ_{j=1}^{5} y_ij log(ŷ_ij),

where Loss represents the total loss, S is the total number of samples, i represents the sample number, j represents the locomotion mode, and ŷ_ij represents the probability that the i-th sample is predicted to be class j (j = 1, 2, …, 5). If the i-th sample belongs to class k (k = 1, 2, …, 5), then y_ik = 1 and y_ij = 0 for j ≠ k.

Fig. 5. Recognition accuracy using 5-fold cross-validation (separate). The data of three subjects were trained separately for locomotion mode recognition, and the average accuracy and standard deviation were obtained. Fig. 6. Recognition accuracy using 5-fold cross-validation (total). The data of three subjects were mixed together for locomotion mode recognition, and the accuracy was obtained.
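Both loss functions can be written compactly. The sketch below assumes one-hot labels; the margin-loss constants m+ = 0.9, m- = 0.1, and lambda = 0.5 follow the formulation in [25]:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred):
    """Loss = -(1/S) * sum_i sum_j y_ij * log(yhat_ij),
    with one-hot labels y and predicted probabilities yhat."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1))

def margin_loss(y_true, v_norm, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss on capsule output lengths, after [25]:
    L_k = T_k max(0, m+ - ||v_k||)^2 + lam (1 - T_k) max(0, ||v_k|| - m-)^2."""
    y_true = np.asarray(y_true, dtype=float)
    v_norm = np.asarray(v_norm, dtype=float)
    present = y_true * np.maximum(0.0, m_pos - v_norm) ** 2
    absent = lam * (1.0 - y_true) * np.maximum(0.0, v_norm - m_neg) ** 2
    return np.mean(np.sum(present + absent, axis=1))

print(cross_entropy_loss([[0, 1]], [[0.5, 0.5]]))  # ~0.6931 (= log 2)
print(margin_loss([[1, 0]], [[0.9, 0.1]]))         # 0.0
```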
2) 5-Fold Cross-Validation (Total): To better illustrate the performance of TCCN, we also mixed the data of the three subjects to evaluate these algorithms. Fig. 6 shows the results of the different neural networks and the SVM algorithm under 5-fold cross-validation (total). The accuracy of TCCN is the highest among these algorithms, for both three-locomotion-mode and five-locomotion-mode recognition. The average accuracy of TCCN in the 3-class task using 5-fold cross-validation (total) is 97.2%, and its average accuracy in the 5-class task is 93.6%. The average accuracy of CNN in the 3-class task using 5-fold cross-validation (total) is 95.8%, and its accuracy in the 5-class task is 91.0%. In the 5-class task, the average accuracy of TCCN is therefore 2.6% higher than that of CNN; in the 3-class task, it is 1.4% higher.
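The 5-fold protocol itself is straightforward to reproduce. The sketch below only illustrates the index bookkeeping; the generator name and seeding are assumptions:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index arrays for shuffled k-fold
    cross-validation; each sample is tested exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(100, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 80 20
```

In the "total" variant, the strides of all three subjects are pooled before the indices are drawn; in the "separate" variant, the split is done per subject.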

B. Confusion Matrix in 5-Fold Cross-Validation
To better understand the recognition results, we also provide the confusion matrices of each neural network and the SVM algorithm.

Fig. 7 shows the confusion matrices of the neural networks and SVM algorithm in 5-fold cross-validation (separate) for 3-locomotion-mode recognition. The confusion matrices in 5-fold cross-validation (separate) for 5-locomotion-mode recognition are shown in Fig. 8. Fig. 9 and Fig. 10 show the confusion matrices of the neural networks and SVM algorithm in 5-fold cross-validation (total) for the three-locomotion and five-locomotion modes, respectively.

[…] The results from Fig. 8 are similar to those obtained from Fig. 7. The misclassification mainly occurred in the transition state, and the least misclassification occurred between the ascent mode and the descent mode. In the CNN and TCCN models, the misclassification rate between ascending and descending terrain is close to zero; in the RNN, TCN, CN, and SVM models, it is zero.

a) 5-fold cross-validation (total) for three-locomotion modes: The confusion matrices of the neural networks and SVM algorithm in 5-fold cross-validation (total) are shown in Fig. 9. The main misclassification occurred between level ground and the ascending environment, and between level ground and the descending environment.
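For reference, a confusion matrix of the kind shown in Figs. 7-10 can be accumulated as follows; the label encoding (0 = level ground, 1 = ascent, 2 = descent) is a hypothetical choice for illustration:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are the true locomotion mode, columns the predicted mode."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical encoding: 0 = level ground (LG), 1 = ascent (AS), 2 = descent (DS).
cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], n_classes=3)
print(cm)
print(np.trace(cm) / cm.sum())  # overall accuracy: 0.6
```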

b) 5-fold cross-validation (total) for five-locomotion modes:
The confusion matrices of the neural networks and SVM algorithm in 5-fold cross-validation (total) are shown in Fig. 10. The main misclassification occurred between similar motion trends, e.g., stair ascent and ramp ascent. The misclassification between ascent and descent terrains is relatively small compared with that between level ground and the other locomotion modes.

The main contribution of this study is that we proposed a temporal convolutional capsule network (TCCN) and compared the performance of TCCN and other algorithms on the same small dataset. We used several different methods to compare and evaluate the performance of the algorithms. We organized the data into three dimensions (the original data dimension, the temporal dimension, and the spatial-temporal dimension) to be fed into neural networks with different characteristics. The results suggest that the performance of TCCN was better than that of CNN, RNN, TCN, CN, and SVM.

Limited by the number of subjects, obtaining big data on locomotion patterns is not realistic in the field of robotic prostheses, and how to extract sufficient information from small data and recognize multiple locomotion modes is a challenge. Batch normalization and dilated convolution have demonstrated the capability to improve the performance of deep convolutional neural networks on small data sets [24]. Bringing these features into the proposed TCCN model, dilated convolution was used and the data were max-min normalized in pre-processing. Since all the neural networks in this study used the same data, normalization may not be the factor that improved the locomotion mode recognition performance.

Fig. 9. Confusion matrices using 5-fold cross-validation (total) for the three locomotion modes. The data of three subjects were mixed together for locomotion mode recognition.
Then, the confusion matrix was obtained. The abbreviations LG, AS, and DS stand for Level Ground, Ascent, and Descent, respectively. AS includes the ramp ascent and stair ascent terrains; DS includes the ramp descent and stair descent terrains.
[…] small data sets in deep convolutional neural networks [24]. The results indicate that dilated convolution may improve the recognition performance on small data. […]

The dynamic routing algorithm in the capsule layer is used to update the weight coefficients; a detailed introduction to the algorithm can be found in [25]. The human brain tends to classify objects according to global features, while CNN, RNN, and TCN classify objects according to local features [26]. The temporal convolutional capsule network (TCCN) proposed in this study and the human brain have certain similarities in object recognition. The eye receives an object's information and transmits it to the brain; the brain analyzes the hierarchical relationships in the information and tries to match them against relationships already stored in the brain. When recognizing objects, the hierarchical pose relationship between object components is important. In the TCCN model, the capsule layer extracts the pose information in the data, and different objects can then be recognized. From this point of view, TCCN and the human brain have certain similarities in object recognition. Each neuron in the human brain has a special function: when recognizing a feature, the feature must be transmitted to the neuron that is best at processing it. A similar idea is reflected in the dynamic routing of the capsule layer. If the similarity between low-level features and high-level features is high, the coefficient between them is large; otherwise, the coefficient is relatively small. Through dynamic routing, a lower capsule is connected with a certain higher capsule. Through […] between object components, which is important for correctly classifying objects.
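The routing-by-agreement loop described above can be sketched in a few lines of numpy, following the formulation in [25]; the shapes and the squash nonlinearity are standard, but this is an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep a vector's direction but compress its length into [0, 1)."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement between lower and higher capsules, after [25].
    u_hat: (n_lower, n_higher, dim) prediction vectors."""
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))                # routing logits
    for _ in range(iterations):
        # Softmax over higher capsules gives the coupling coefficients.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)        # weighted sum per higher capsule
        v = squash(s)                                # higher-capsule output vectors
        # Agreement: similar low- and high-level features grow the coefficient.
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v, c

v, c = dynamic_routing(np.random.default_rng(0).normal(size=(8, 3, 4)))
print(v.shape, c.shape)  # (3, 4) (8, 3)
```

Because the output of each higher capsule is a vector, its length can encode the probability that a class is present while its direction encodes pose, which is the "magnitude and direction" property discussed above.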
On the contrary, the other networks in this study, e.g., CNN, RNN, TCN, and SVM, process the data as scalars and ignore the "direction" information between features, which may not be conducive to identifying different locomotion modes. In addition, TCCN treats data as spatial-temporal information rather than only spatial or temporal information, which may explain the higher recognition accuracy of TCCN compared with CNN and RNN. Fang et al. compared the recognition results of different methods on the HuGaDB data [16] and reported that a GNN yields the highest accuracy of 98.04%, compared with 79.24% for CNN and 92.78% for LSTM [16]. The GNN extracts spatial-temporal information from the data, which also indicates that spatial-temporal information from small data is conducive to locomotion pattern recognition.

The strain gauge sensor can be fused with other sensors for motion pattern recognition, which may reduce the number of sensors and the classification errors in both transition and static states [27], [28]. In this study, we aimed to investigate which features of the algorithms are conducive to locomotion mode recognition, and only one sensor was utilized. Future research could focus on experimental data obtained by fusing the strain gauge sensor with other sensors and then applying the TCCN model to the fused data. Due to the limitation of the experimental data, we only verified the performance of the proposed TCCN on our small data set and found that TCCN could improve the accuracy of motion pattern recognition. More small datasets from more experiments and sensors are needed to strengthen the evidence that TCCN can improve motion recognition performance on small datasets.
Future research is also needed to visualize what features the different dimensions of the vectors represent and what role these features play in locomotion mode recognition. The input signal, namely the strain gauge signal, may be reconstructed from the capsule layer values in the TCCN model: from the output values of the capsule layer, the output values of the previous layers can be calculated backward layer by layer until the value of the input signal is obtained. The image of the input signal can then be reconstructed.