A Discriminative Dual-Stream Model With a Novel Sustained Attention Mechanism for Skeleton-Based Human Action Recognition



I. INTRODUCTION
Human action recognition has been widely applied in the areas of entertainment games, health care, remote video surveillance, smart homes, and educational assistance [1]-[4]. In the past decades, it has become a research hotspot in the field of machine vision and has developed greatly [5].
The emergence of RGB cameras has enhanced the collection of human motion data, which improves human action classification models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), by supplying rich training data [6]. For example, Simonyan et al. proposed two convolutional neural networks for large-scale image classification based on a study of ImageNet [7]. Similarly, another deep convolutional neural network architecture was proposed by extending the ImageNet work, which achieved new state-of-the-art performance for image classification and detection [8]. As network depth increases, model training becomes a challenge. A residual learning framework was proposed by He et al. to effectively simplify the network architecture and achieve efficient training [9]. To extract high-performance spatiotemporal features, 3D convolutional neural networks have been developed. For example, Ji et al. proposed an effective 3D-CNN model to capture the action information encoded in multiple adjacent frames for complex human action recognition [10]. An effective 3D ConvNet was proposed to model spatiotemporal features for large-scale supervised video recognition [11]. Compared with 2D convolutional neural networks, 3D convolution-based models require higher computational costs. To address this problem, Chen et al. proposed a novel Multi-Fiber architecture that slices the complex neural structure into several lightweight subnetworks [12]. Similarly, a novel MicroNets was proposed by designing an enhanced multi-source input [13].
However, since RGB images lack depth information, RGB-based human action recognition is greatly affected by lighting and occlusion and does not perform well in real applications. Fortunately, the development of RGB-D sensors (such as the Microsoft Kinect) allows the easy collection of depth data and human skeleton joint data. Compared to RGB data, both modalities are far less disturbed by changes in complex environments. For depth-based human action recognition, there are many approaches to learning high-performance features, such as the ScTPM-based model [14], the HONV-based model [15], and the real-time detection model [16]. However, because of the redundancy of information in depth maps, these require heavy computation and complex model structures. For skeleton-based data, many excellent works have also attracted great attention in recent years. For example, Yang et al. proposed a novel motion feature descriptor that computes differences between body joint points for skeleton-based human action recognition [17]. Li et al. designed a new graph-based network to model the complex spatial structure, in which the proposed RVJRDs select the key joint pairs in the resulting image [18]. To eliminate the influence of the instability of skeleton joints, a globally optimal matching algorithm was proposed to model temporal misalignment without pre-segmentation [19]. For skeleton-based motion recognition, the input sequence of consecutive frames and the role of each frame need further study. To improve model performance in real applications, high-level feature extraction and modeling need to be further enhanced, because complex, similar, and interactive actions often occur in daily life. Motivated by the recent success of the attention mechanism, whose basic idea is to assign weights to different motion parts, most works introduce and extend attention models for effective global long-term feature modeling.
However, these existing works do not take into consideration the fact that only a subset of body joints moves during an action, and they also neglect that different joints play different roles within the same motion. These limitations lead to poor skeleton feature extraction and less robust, less accurate motion recognition.
Our study is partially motivated by the recent success of attention models and fusion models, such as those in [20]-[22]. In this work, we study the priority of each body joint in the display of an action to achieve discriminative spatiotemporal representations. Compared with the traditional attention mechanism, in which uniform weights are assigned across different actions, our proposed attention model is more discriminative, assigning corresponding weights to the different key stages of various actions. Our basic idea is to select the salient feature area that best represents the input action sequence and model the joint points in this area. Our main contributions are summarized as follows.
1) We propose a sustained attention mechanism (SA) that adaptively assigns a corresponding weight to each body joint point to facilitate better human motion recognition from skeleton-based data, enabling the model to focus on modeling skeleton-specific features. This method avoids hand-crafted skeleton representations and saves human effort.
2) We design a two-stream neural network consisting of a recurrent neural subnetwork with the sustained attention model (SA-LSTM) and a convolutional neural subnetwork with the sustained attention model (SA-CNN). For SA-LSTM, we integrate the SA into an LSTM-based classification network to learn the weights of body joints and input frames. For SA-CNN, we integrate the SA into a CNN-based classification network by re-designing the structure of the convolutional layer for attention weight learning.
3) To improve weight learning in the global representation modeling process, we propose a data enrichment scheme that randomly translates and rotates the body skeletons during training, which allows our model to ''see'' more action samples.
4) For comprehensive analysis and comparison, we conduct a set of ablation studies and test our proposed model on four popular benchmark databases under multiple evaluation standards. The results show that our model achieves state-of-the-art performance.
The remainder of this article is organized as follows. Section II briefly reviews related work on human motion recognition. Section III describes the proposed two-stream model (including SA, SA-CNN, SA-LSTM, and the training scheme) in detail. Section IV presents the experimental settings and discusses the results. Section V concludes our work and outlines future work.

II. RELATED WORK
In this section, we present a brief review of the recent LSTM-based works and the attention model and thus summarize the existing challenges. Compared to CNN-based networks, LSTM-based models have a stronger ability for temporal feature learning. Benefiting from the structure of ''gate'' in the LSTM network, the temporal information can be easily recalled.
With the development of deep learning, many recurrent neural networks have been proposed to learn temporal dynamic features. For example, a novel RNN-based model with end-to-end training was proposed that divides all the skeleton joints into multiple parts as the input for skeleton-based human action recognition [23]. An RNN-tree-based adaptive learning network was proposed to determine the structure of skeleton representations for large-scale motion recognition [24]. To improve the ability to model global long-term features, the attention model has been introduced and extended in recent years. Attention models have been widely applied in many fields, such as language processing, data mining, and pattern recognition, with good results [25]. For human action recognition, the basic idea is to imitate human visual processing: quickly browse the target motion, locate the key part that affects the motion, and focus on that part. For example, an excellent soft attention mechanism was proposed to imitate human intuition, in which a soft search is automatically executed to predict the weights by observing a small range of samples [26]. The soft attention model was introduced into RNNs and LSTMs to learn spatiotemporal features by focusing on the key frames of the input sequence [27]. Similarly, a novel deep model was proposed based on the soft attention model for RGB-based and depth-based human action recognition [28]. An LSTM-based spatiotemporal attention network was proposed to obtain more discriminative spatiotemporal features for skeleton-based human motion detection and recognition [29]. A novel VideoLSTM model with end-to-end training was proposed based on the soft attention mechanism, in which a hardwired convolution operation was also introduced [30]. However, existing skeleton-based learning approaches rarely consider the different contributions of the joints in the input sequence.
These deep models analyze the entire image, that is, all joints participate in the attention equation. In fact, the motion is only determined by a certain subset of body joints. Representation generated by unrelated body joints would reduce the accuracy of human motion recognition.
The main limitations and challenges of the traditional attention model are as follows. 1) The scheme for assigning attention weights is contrary to human prior expectation, which makes targeted training a challenge. For example, the movement of the leg and foot joints is key to performing the ''walking'' action, so these joints should receive higher weights instead of all joints sharing the same weight. 2) The weighting is inconsistent with the way humans observe actions, which results in incorrect localization of key skeleton parts and makes it a challenge to distinguish the contribution of each joint. Specifically, human beings identify the key body parts that affect an activity after observing the complete sequence, rather than analyzing only a certain moment within the action. 3) The dynamic representation between frames of the input sequence is not fully studied, which makes it a challenge to distinguish similar actions.
The proposed model in this work differs from traditional attention mechanism-based models. First, in SA-LSTM, we design two global attention models: one part for modeling the key body joints, and another for modeling the key frames. Second, in SA-CNN, we introduce the attention model into the convolutional layer, which assigns a corresponding weight to each body joint to enhance spatial feature modeling. Next, our proposed model calculates weights over the global input sequence instead of over local frames. Finally, we integrate the two subnetworks with a relatively effective and simple fusion method, and data enrichment is utilized in the end-to-end training process.

III. PROPOSED MODEL
The key skeleton representations of the input sequences of different actions are very different. Assigning a weight to each input frame means that all the joint points in that frame share the same weight, which can lead to recognition bias and poor recognition performance. To obtain more discriminative representations, we propose an end-to-end two-stream deep model that can assign different weights to all the joint points according to the influence of each joint point on the action; the architecture of the proposed model is shown in Fig. 1. It consists of an LSTM-based subnetwork and a CNN-based subnetwork. In each subnetwork, we re-design the corresponding network structure by introducing our proposed sustained attention mechanism. The deep model is trained end-to-end, and a data enrichment scheme is utilized to enhance overall robustness.

A. LSTM WITH THE SUSTAINED ATTENTION MECHANISM (SA-LSTM)
Our proposed SA-LSTM subnetwork consists of two parts (as shown in Fig. 2): an attention model for joint points that focuses on the role of each joint point in the input sequence, and another attention model for the frames of the input sequence that focuses on the connection between frames.
First, we consider that each joint in the body skeleton has different importance for the same action; that is, the movement of several joints can effectively describe an action. The structure is shown in part 1 of Fig. 2. Therefore, unlike the traditional method, we re-define the input sequence so that it can more conveniently represent the function of the joint points, as shown in (1):

S_t = (S_1, S_2, ..., S_N), S_n = (s_{n,1}, s_{n,2}, ..., s_{n,K})   (1)

where S_t denotes the input sequence at time t, S_n denotes the nth frame in this sequence, and s_{n,k} denotes the kth joint point in frame n. Note that in traditional works the input sequence is usually divided into a frame-level representation rather than considered as a joint-point-level representation, in which case the attention model can only assign a weight to each frame.
In our work, we assign a weight to each joint by designing a novel sustained attention model, which achieves a deeper analysis of the representation of an action; the basic idea is shown in (2):

S_t^{SA} = W_t^{SA} ⊙ S_t,  W_t^{SA} = (w_{t,1}^{SA}, w_{t,2}^{SA}, ..., w_{t,K}^{SA})   (2)

where S_t^{SA} denotes the new output sequence from our proposed attention model at time t, and all frames of the output sequence include the attention weights of the joint points; W_t^{SA} denotes the set of attention weights at time t, in which w_{t,k}^{SA} denotes the attention weight of the kth joint point. Note that the same joint point shares the same attention weight throughout an action because our proposed model trains on and models the entire input sequence.
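As a rough illustration of the joint-level weighting idea in Eq. (2), the following NumPy sketch applies one shared weight per joint across every frame of a sequence. The function name and array shapes are our own illustrative assumptions; in the paper these weights are learned by an LSTM-based model rather than set by hand.

```python
import numpy as np

def apply_joint_attention(sequence, joint_weights):
    """Apply per-joint sustained-attention weights to a skeleton sequence.

    sequence:      array of shape (n_frames, n_joints, 3) -- 3D joint coordinates
    joint_weights: array of shape (n_joints,) -- one weight per joint, shared
                   across all frames, mirroring Eq. (2)
    """
    # Broadcast the per-joint weights over frames and (x, y, z) channels.
    return sequence * joint_weights[None, :, None]

# Toy example: 4 frames, 20 joints, 3D coordinates.
seq = np.ones((4, 20, 3))
w = np.zeros(20)
w[5] = 1.0  # suppose only joint 5 (e.g., a wrist) matters for this action
out = apply_joint_attention(seq, w)
```

Because the weight vector is shared across frames, a joint that is irrelevant to the action is suppressed in every frame, which matches the paper's statement that the same joint point keeps the same weight throughout an action.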
W_t^{SA} can be obtained by applying an LSTM-based model that consists of four parts connected in series: a set of traditional LSTM networks for training on the raw input sequence, a fully connected layer for transforming the dimension of the LSTM output vectors, an activation layer (the ReLU function) for enhancing the nonlinearity of the structure, and a normalization layer for preventing data dispersion. The process is shown in (3):

W_t^{SA} = Norm(ReLU(FC(LSTM(S_t))))   (3)

Additionally, we select the key frames that best represent the action, which improves the ability to model global long-term temporal features. Therefore, in this article, we utilize a temporal attention model to assign corresponding weights to each frame. Note that these frames have already been modeled by the above LSTM-based model and carry the attention weights of the joint points, that is, S_t^{SA} = (s_1^{SA}, s_2^{SA}, ..., s_N^{SA}). Based on this idea, we use the LSTM-based model to train the different weights; namely, frames that are more important for the action description obtain higher weights. The process is shown in (4) and the structure is shown in part 2 of Fig. 2:

X_t^{SA} = (α_1^{SA} s_1^{SA}, α_2^{SA} s_2^{SA}, ..., α_N^{SA} s_N^{SA})   (4)

where X_t^{SA} is the output sequence at time t, α_n^{SA} is the attention weight of the nth frame, and s_n^{SA} is the nth frame of the input sequence S_t^{SA}. α_n^{SA} is a dynamic weight determined by the previous frame and the current frame. The corresponding weights are obtained by model training, in which the tanh activation function and a normalization layer are utilized.

B. CNN WITH THE SUSTAINED ATTENTION MECHANISM (SA-CNN)
In the SA-CNN subnetwork, we re-design the structure of the CNN-based model according to the idea of the attention mechanism. We focus on salient areas of the feature maps to extract the discriminative joint points that describe the entire movement. Our main purpose is likewise to assign a weight to each joint point so as to extract the key features that have the greatest influence on human motion, improving recognition accuracy.
Specifically, we select AlexNet as the basic model, in which the third conv-layer is re-designed as a two-channel convolution layer. One channel is the traditional convolution layer that computes the feature maps. The other channel consists of three convolution operations that produce prediction feature maps with the key spatiotemporal information; this channel analyzes the connection between each joint point's movement and the entire skeleton motion to predict which joint points are more important for the action. At the end of this layer, the feature maps of both channels are merged by element-wise multiplication. For the input, we map the 3D skeleton-based data to 2D image-based data. Specifically, the input skeleton-based sequence consists of continuous frames, the skeleton in each frame includes the same joints, and each joint carries 3D coordinate information (x, y, z); the matrix corresponding to the image data contains rows, columns, and channels. Therefore, the following transformations are conducted: 1) Columns denote the different input frames; 2) Rows denote the different joints; 3) The three channels denote the 3D coordinate information (x, y, z).
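The 3D-skeleton-to-2D-image mapping described above reduces to a simple axis transpose. A minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def skeleton_to_image(sequence):
    """Map a 3D skeleton sequence to a 2D image-like tensor.

    sequence: array of shape (n_frames, n_joints, 3)
    returns:  array of shape (n_joints, n_frames, 3), where
              rows = joints, columns = frames, channels = (x, y, z)
    """
    return np.transpose(sequence, (1, 0, 2))

seq = np.random.rand(30, 20, 3)   # 30 frames, 20 joints, 3D coordinates
img = skeleton_to_image(seq)
```

The resulting tensor has the row/column/channel layout a standard 2D convolution expects, so the skeleton sequence can be fed directly into the SA-CNN.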
The structure of the SA-CNN subnetwork is shown in Fig. 3. The pipeline is as follows: 1) The first conv-layer is used to train on the input skeleton data and the feature maps are obtained; the kernel size is 5 × 5 and the stride is 2; 2) A pooling layer is used to reduce the size of the feature maps; the kernel size is 5 × 5 and the stride is 2; 3) The second conv-layer is applied with the same parameters as the first; 4) The second pooling layer is used; the kernel size is 3 × 3 and the stride is 2, with padding to keep the feature map sizes consistent; 5) In the third conv-layer, both channels are used, with the same parameters in each channel: the kernel size is 3 × 3, the stride is 2, and the padding is 1; 6) Two further conv-layers follow, with the same parameters as the third conv-layer; 7) A fully connected layer is used for human action classification, with the number of neurons equal to the number of action categories.
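Given the kernel/stride/padding values listed above, the spatial size of the feature maps at each stage can be checked with the standard convolution output-size formula. This is only a sanity-check sketch: the input resolution of 227 is an AlexNet-style assumption, since the paper does not state the exact input size.

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a conv/pool layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

# The SA-CNN pipeline described above: (name, kernel, stride, padding).
pipeline = [
    ("conv1", 5, 2, 0),
    ("pool1", 5, 2, 0),
    ("conv2", 5, 2, 0),
    ("pool2", 3, 2, 1),
    ("conv3_two_channel", 3, 2, 1),  # both channels share these parameters
    ("conv4", 3, 2, 1),
    ("conv5", 3, 2, 1),
]

size = 227  # hypothetical input resolution
for name, k, s, p in pipeline:
    size = conv_out(size, k, s, p)
    print(name, size)
```

Running through the formula like this is a quick way to confirm that the stride-2 stages shrink the feature maps fast enough for the final fully connected classifier.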

C. TWO-STREAM FUSION
Similar to most recent successful studies, we utilize a weighted fusion algorithm to calculate the final results, that is, assigning a weight to the output of each subnetwork and computing the weighted average. To obtain the optimal weight values, we conduct a group of comparative experiments in which we test some of the most common values used in past works. From these, we determine the final weight values: the weight of SA-LSTM is set to 1, and the weight of SA-CNN is set to 3. Some automatic weight-training methods have been proposed and applied in specialized research fields; however, these methods did not achieve better results for our work and exhibit problems with training speed and over-fitting. In this work, our purpose is to study a novel way of integrating the attention model with CNNs and RNNs, so independent subnetworks are more suitable for our study. In the experiment section, we mainly discuss the performance of CNN-based and RNN-based models and compare our results with other state-of-the-art methods.
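The weighted fusion can be sketched in a few lines. The class scores below are toy values; with the weights 1 (SA-LSTM) and 3 (SA-CNN) stated above, the final score is simply the weighted average of the two subnetwork outputs.

```python
import numpy as np

def fuse(scores_lstm, scores_cnn, w_lstm=1.0, w_cnn=3.0):
    """Weighted average of the two subnetwork score vectors.

    The default weights 1 (SA-LSTM) and 3 (SA-CNN) follow the
    values chosen in the comparative experiments above.
    """
    return (w_lstm * scores_lstm + w_cnn * scores_cnn) / (w_lstm + w_cnn)

# Toy class scores for a 3-class problem.
lstm = np.array([0.6, 0.3, 0.1])
cnn = np.array([0.2, 0.7, 0.1])
fused = fuse(lstm, cnn)
pred = int(np.argmax(fused))  # SA-CNN's opinion dominates due to its weight
```

Because the weights sum is normalized out, only the ratio 1:3 matters; the same prediction would result from weights 0.25 and 0.75.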

D. MODEL TRAINING AND DATA ENRICHMENT
In our proposed model, we deeply integrate the attention mechanism into two subnetworks, so both spatial features and temporal features are modeled at the same time. Training our proposed model is therefore still a challenge. For SA-CNN, we utilize a common training scheme, as in AlexNet, which achieves satisfactory results. For SA-LSTM, due to its more complex structure (it consists of two parts, both integrated with the attention mechanism), overfitting may occur during the training process. We apply a regularization term to the loss function (we select the cross-entropy function) to alleviate overfitting, as shown in (5):

L = CE(Y_c, Ŷ_c) + λ_1 ||φ_part1||_2^2 + λ_2 ||φ_part2||_2^2   (5)

where Y_c is the set of real action categories and Ŷ_c is the set of predicted action categories; φ_part1 denotes the first part of the SA-LSTM subnetwork, which assigns the attention weights of the joint points, and φ_part2 denotes the second part, which assigns the attention weights of the temporal frames.
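A minimal NumPy sketch of a cross-entropy loss with L2 penalties on the two attention sub-modules, in the spirit of Eq. (5). The coefficients lam1/lam2 are hypothetical, as the paper does not state its regularization strengths.

```python
import numpy as np

def regularized_loss(y_true, y_pred, phi1, phi2, lam1=1e-4, lam2=1e-4):
    """Cross-entropy loss plus squared-L2 penalties on both SA-LSTM parts.

    y_true: one-hot labels, shape (batch, n_classes)
    y_pred: predicted class probabilities, same shape
    phi1/phi2: parameter arrays of the two attention sub-modules
    lam1/lam2: hypothetical regularization coefficients
    """
    eps = 1e-12  # numerical guard against log(0)
    ce = -np.sum(y_true * np.log(y_pred + eps)) / y_true.shape[0]
    reg = lam1 * np.sum(phi1 ** 2) + lam2 * np.sum(phi2 ** 2)
    return ce + reg
```

Penalizing both attention parts separately lets each regularization strength be tuned to how overfitting-prone that sub-module is.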
In the training process, we found that the proposed model achieved the best recognition rate when the number of training epochs was set to 20. Additionally, the proposed model does not include a large number of parameters because there are only three channels. The two subnetworks are trained and tested simultaneously, and the model simply performs a weighted average of the sub-results to obtain the final classification result. Therefore, a large amount of computation is not required.
On the other hand, to improve generalization, we utilize the data enrichment method to increase the number of skeleton samples in the training process. Our purpose is to let the model ''see'' more training samples resembling real applications. Specifically, we randomly set rotation and translation parameters and transform the skeleton-based data along the spatial coordinate axes to generate more samples. Experimental results show that this data augmentation method can significantly improve recognition accuracy, which also improves robustness.
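A minimal sketch of this enrichment step, assuming a random rotation about the vertical (y) axis plus a random translation; the bounds max_angle and max_shift are our own illustrative choices, since the paper only states that the parameters are set randomly.

```python
import numpy as np

def enrich(sequence, max_angle=np.pi / 18, max_shift=0.05, rng=None):
    """Generate an augmented copy of a skeleton sequence by a random
    rigid transform: rotation about the y-axis plus a translation.

    sequence: array of shape (n_frames, n_joints, 3)
    max_angle/max_shift: hypothetical bounds on the random parameters
    """
    rng = rng if rng is not None else np.random.default_rng()
    theta = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    # Rotation matrix about the vertical (y) axis.
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return sequence @ rot.T + shift
```

Because the transform is rigid, it changes the viewpoint of the skeleton without distorting limb lengths, so the augmented samples remain valid human poses.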

IV. EXPERIMENTS AND DISCUSSION
In this section, we conduct extensive experiments on four popular benchmark databases to verify the superiority of our proposed model and conduct a set of ablation studies to verify its effectiveness; the databases include the MSR-Action-3D dataset, the SYSU 3D Human-Object Interaction Set, the SBU dataset, and the Northwestern-UCLA dataset. To evaluate its performance more comprehensively, we consider visualization, effect enhancement, comparison, and several popular experimental settings.

A. EXPERIMENTAL SETTINGS
In our work, the test platform consists of an Intel i7-10700F CPU, an RX 5700 XT 8G GPU, and the Ubuntu 16.04 operating system; the framework includes Python 3.6 and TensorFlow 1.14. SGD is used to optimize the model; the learning rate is 0.001 and the number of training epochs is 20; other parameters are initialized randomly.
MSR-Action-3D dataset [31]. This dataset consists of 20 action classes (as shown in Table 1) divided into three groups: similar actions in AS1 and AS2, and complex actions in AS3. Each action includes multiple versions performed by 10 participants. For a fair comparison, we conduct the more challenging Cross-Subject test, that is, half of the samples are used for model training.
SBU dataset [32]. In this database, 282 action sequences performed by 7 participants have been collected and divided into 21 groups. This dataset consists of three types of data: RGB data, depth data, and skeleton-based data. For a fair comparison, we also conduct the common cross-validation test, that is, 50% of the samples are used to train the model.

SYSU 3D Human-Object Interaction Set [33]. This dataset consists of 12 action categories collected by an RGB-D sensor, and the total number of video sequences is 480 (daily actions: sweeping, mopping, taking from wallet, taking out wallet, moving chair, sitting on chair, packing backpacks, wearing backpacks, playing with phone, calling phone, pouring, and drinking). Note that the developers provide open-source files for data preprocessing and visualization, which is convenient for research. For a fair comparison, we utilize the challenging test scheme, that is, 50% of the samples are for training and 50% for testing, and 30-fold cross-validation is conducted.
Northwestern-UCLA dataset [34]. This dataset consists of 10 daily activities collected from three different viewpoints, which is a challenge for skeleton-based human action recognition (daily actions: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry). 1475 video samples were performed by 10 participants. For a fair comparison, the common test scheme is conducted, that is, samples from views 2 and 3 are used for training and those from the remaining view for testing.

B. ABLATION STUDY
In this subsection, our purpose is to discuss the ideas behind our work and verify the correctness of our proposed model. The ideas are as follows: 1) In SA-LSTM, we design two methods of combining the attention mechanism with the LSTM, that is, the attention model is introduced to assign a weight to each joint point and to assign a weight to each input frame; 2) In SA-CNN, we re-design the structure of the CNN by introducing the attention mechanism to assign a weight to each joint point; 3) In both subnetworks, we utilize the data enrichment method to generate more samples in the training process; 4) Finally, we fuse the results of both networks via the weighted average algorithm to obtain the final recognition rate. After analysis, the following conclusions can be drawn: 1) In most cases, data enrichment improves the recognition accuracy by about 3%; 2) Our proposed SA-LSTM method brings gains of about 4% on most datasets; 3) Our proposed SA-CNN method also improves the recognition rate by about 3%; 4) The dual-stream fusion scheme is effective for improving model performance. Note that the traditional LSTM baseline is the combination of the basic LSTM and the basic soft attention mechanism; the traditional CNN baseline is a basic variant of the AlexNet network.
We randomly select several frames of two action sequences (''put on glasses'' and ''kicking something'') and show their visualization results (as shown in Fig. 4); the key joint points are marked in red. The results demonstrate that our proposed model can accurately locate the joint points that are key to performing the action. For example, ''put on glasses'' is mainly performed by the hands and arms, so the elbows and wrists are the key joint points; ''kicking something'' is mainly performed by the legs and feet, so the joint points of the knees and ankles are important. As shown, the key joint points of the elbows and wrists are learned by our proposed model throughout the sequence. Because we re-designed the attention model, that is, a new scheme combining a CNN and an LSTM, our proposed model can continuously extract key joint points over a continuous action, even for long input sequences containing many frames.

C. COMPARISON WITH STATE-OF-THE-ART WORKS
In this subsection, we compare our proposed model with recent state-of-the-art works on the four public datasets. We select different results from current publications for the different datasets and conduct different experimental settings. In the confusion matrices, the rows are predicted labels and the columns are true labels.
MSR-Action-3D dataset. For a comprehensive comparison, we select advanced works based on depth data and skeleton data as competitors; the comparison results are shown in Table 3. Our proposed model, using more discriminative features with the sustained attention mechanism, achieves an average accuracy of 95.6%. Our recognition rate for the three action groups has improved; in particular, the recognition of similar activities has increased by 2%, and that of complex actions is as high as 98.7%. Even though we use only a single modality (skeleton-based), ours achieves the best performance. However, our model shows only a modest improvement on the AS2 similar-action group. The recognition rate of each action is shown in Fig. 5. The recognition rates of most actions are over 90%, and those of half of the actions approach 100%.
SYSU 3D Human-Object Interaction Set. For a comprehensive comparison, we select advanced works based on mixed data and skeleton data as competitors; the results are shown in Table 4 and the confusion matrices are shown in Fig. 6. This dataset is challenging for feature fusion because many of the same objects appear in similar and complex actions against the same background, which causes recognition bias. Due to the effective global spatiotemporal feature extraction with the sustained attention model, our proposed model achieves a significant improvement over most recent works. It is noted that the view-adaptive-based model achieves the best performance, thanks to its high-level adaptive skeleton representation transformation scheme.
Northwestern-UCLA dataset. The results are shown in Table 5. Our proposed model improves the recognition rate by 5% compared to most current studies. Although the Enhanced-CNN-based model is relatively robust to changes in collection viewpoint, its average recognition accuracy is 1% lower than ours. These recent works have the following limitations: 1) Local feature extraction without considering the global input sequence is not enough to distinguish subtle differences between similar motions; 2) The methods are greatly influenced by environmental factors, such as changes in lighting and occlusion conditions; 3) The recognition of input skeleton sequences from different perspectives of the same action varies considerably.
SBU dataset. For a comprehensive comparison, we select advanced works based on mixed data and skeleton data as competitors; the results are shown in Table 6 and the confusion matrices are shown in Fig. 8. Our proposed model achieves 93.2% in the cross-validation test, which outperforms most recent works and increases the recognition rate by 2%. However, our proposed model does not achieve the best performance; its recognition rate is not as high as those of GBSWC and PCST-LSTM. The main reasons and their limitations are as follows: 1) In GBSWC, a structured skeleton representation is designed based on human prior knowledge, which can obtain high-level representations but requires more time and labor; 2) In PCST-LSTM, a two-stream network with CNNs and RNNs is proposed to simultaneously model mixed data, which improves the ability to model spatiotemporal features but requires more data and a more complex training scheme.
By comparing the misclassifications of each action in the confusion matrices, we draw the following conclusions. First, it is difficult for the model to classify ''two-step'' actions, such as ''taking from wallet'' and ''drop trash''. The main reason is that these activities last a long time and can be divided into multiple sub-actions, which leads to misclassification. Next, there are differences in the recognition rates of certain actions judged similar by human prior knowledge, such as ''draw X'' and ''draw circle'', or ''sweeping'' and ''mopping''. The main reason is that certain activities are often misclassified by the model based on particular features, even though the activities seem to have no obvious connection. The model solves a classification problem: each action has its own label during training and testing, and the recognition result only includes correct classification or misclassification into another category. Therefore, the ability to distinguish similar features is still a challenge for human action recognition. Finally, due to the fixed perspective of skeleton-based data collection, the model only needs to overcome the challenges that long-term actions bring to temporal feature learning, which is also the main contribution of our model. Therefore, the method achieves a high recognition rate for most actions in the MSR-Action-3D dataset, especially for ''high throw'', ''forward kick'', and ''pickup throw''. On the other hand, because of the simple data collection setting, most state-of-the-art methods can achieve satisfactory results on this dataset.

V. CONCLUSION AND FUTURE WORK
In this article, we design a two-stream network consisting of SA-LSTM and SA-CNN that integrates the proposed novel sustained attention mechanism for skeleton-based human action recognition. The proposed sustained attention model is motivated by the fact that the contribution of each body joint to the completion of an action is different. For the SA-LSTM subnetwork, we assign attention weights to all the joint points to locate the key parts and then assign a weight to each frame to learn their dynamic relationships. For the SA-CNN subnetwork, we re-design the structure of the convolutional layer and integrate the sustained attention model to predict the weight of each body joint. Next, an end-to-end training scheme is utilized, and a data enrichment method is also proposed that randomly transforms the skeleton coordinate system to enhance robustness. Finally, ablation studies and visual analysis are conducted to verify the correctness and accuracy of our proposed model. Extensive experiments on four popular benchmark datasets show the superiority of our proposed model over recent state-of-the-art works. In future study, we will focus on high-level skeleton representation pre-processing and collect a large-scale skeleton-based interaction database.
ZHIHONG LIANG received the B.S. and M.S. degrees in mechanical engineering from Shenyang Ligong University, China. He is currently working as an Associate Professor at Shenyang Ligong University. His current research interests include automatic detection system design research, mechanical automation equipment design, hydraulic system design, electrohydraulic systems, and computer control systems.

BO LIU was born in 1981. He received the B.S. degree in mechanical engineering and automation from the Ordnance Engineering College, China. He is currently an Engineer with the Army Equipment Department in Shenyang, China. His research interests include mechanical and electrical automation testing equipment planning and product factory inspection.
BO LIU was born in 1990. He received the B.S. degree in automation from Shenyang Ligong University, China, where he is currently pursuing the master's degree. His research interests include visual image processing and pattern recognition detection technology and application, intelligent non-destructive testing technology and application, advanced control theory and application, and robot intelligent control technology and application.