3D Graph Convolutional Feature Selection and Dense Pre-Estimation for Skeleton Action Recognition

Action recognition plays an important role in promoting various applications in healthcare and smart education. However, unclear target actions, similar actions, and occluded characters may be encountered in some special scenarios. To solve these issues, a 3D Graph Convolutional Feature Selection and Dense Pre-estimation for Skeleton Action Recognition (3D-GSD) method is proposed to analyze and recognize the motion trajectory of the human skeleton. First, 3DSKNet is designed to adaptively learn and select important features in the skeleton sequence, so that skeleton parts of different importance are identified more accurately according to the input image resolution. This helps the model focus on key skeletal parts, improving the accuracy and robustness of skeleton recognition. Then, the DensePose algorithm is used to detect the complex key points of the human body posture and optimize the accuracy and interpretability of action recognition for different key points, key channels, and key frames of the action. The proposed method achieves the best performance on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets, with accuracy improvements of 0.02%, 0.06%, and 0.1%, respectively, over state-of-the-art methods.


I. INTRODUCTION
Skeleton features are widely used in human action recognition and human-computer interaction. Skeleton extraction refers to detecting and tracking key points of a human skeleton from a given image or video. This technology requires depth cameras, sensors, and other equipment to capture the movement trajectory of human bones, which is then analyzed and identified through computer vision and machine learning techniques. Skeleton behavior recognition technology can be applied to games, virtual reality, healthcare, and security. Different types of skeleton data in research and application scenarios increase the difficulty of skeleton recognition tasks. The main difficulties are:
1. The videos in the dataset contain multiple characters, and the postures of the characters may interfere with each other, making the extraction of skeletons difficult.
2. The videos in the dataset come from different perspectives and different cameras, so the expression of the same action may differ, and actions under different perspectives need to be expressed uniformly.
3. The actions in the dataset involve interactions between characters, including occlusion and changes in spatial position.
These factors may affect skeleton extraction and action recognition.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenbing Zhao.
Early skeleton-based action recognition methods were based on hand-designed feature extraction and spatiotemporal modeling. The former uses specially designed feature extraction algorithms to extract features representing actions from skeletal joint data [1], [2]. Common features include joint angles, joint distances, and joint speeds. The latter treats skeletal joint data as sequences that vary in time and space [3]. Action recognition is achieved by establishing action models, such as dynamic time warping (DTW), hidden Markov models (HMM), or conditional random fields (CRF), to perform spatiotemporal modeling and matching of skeletal sequences. However, these methods ignore the intrinsic relationships between human joints. The iDT algorithm [4] is known as the best-performing method that does not rely on deep learning, and several methods have been developed based on it [5], [6].
In recent years, deep learning-based skeleton action recognition has been roughly classified into two categories: methods based on skeleton key points [7] and methods based on spatio-temporal feature analysis [8], [9], [10], [11]. The former use key points to describe the movement of the whole human body and output an action category label; they have the advantages of not being disturbed by the environment and requiring a small amount of data. However, due to the limited information contained in the skeleton points, it is difficult for these algorithms to effectively recognize actions that are closely related to objects or scenes. Methods based on spatio-temporal feature analysis mainly include Two-Stream [12], C3D [7], and convolutional neural network-long short-term memory network (CNN-LSTM) methods [13].
The Two-Stream algorithm inputs the RGB image and the optical flow image into two separate CNNs and then fuses the results of the two networks to obtain the final classification result. By using the optical flow information, it better captures the motion information of the action and improves the accuracy of action recognition. However, it requires additional GPU computing time and storage space, which has become its bottleneck. C3D extends the mature network structures of 2D CNNs to the time domain and then adopts a decomposition strategy for the 3D convolution kernel, which is decomposed into a 2D convolution and a 1D convolution that are combined in different serial and parallel arrangements to obtain the final classification result. C3D can accept the frames of the whole video without splitting the video into segments, and it is fast and effective. However, the algorithm is not robust to changes in camera viewpoint, noise, and local occlusion, which affect the acquisition of interest points. For example, Qiu et al. [14] propose a deep neural network architecture called P3D, which aims to better learn the spatiotemporal features in videos by using pseudo-3D convolution operations and residual connections to capture the spatiotemporal information of videos. Extensive experiments on multiple video classification datasets demonstrate its superior performance over state-of-the-art techniques. Yan et al. [15] propose a three-dimensional gesture and action recognition framework called PA3D, which is mainly aimed at video recognition tasks. This method converts human postures and actions in the video into key point sequences in a three-dimensional coordinate system and then inputs them into a neural network for recognition.
The CNN-LSTM algorithm inputs the video sequence into a convolutional neural network, feeds the output of that network into an LSTM, and finally classifies the output of the LSTM to obtain the final classification result [16], [17]. Liu et al. [49] propose an end-to-end multilevel long short-term memory (LSTM) network with spatial and temporal attention mechanisms. The network automatically selects key information from each frame to determine actions, and it uses spatial and temporal attention modules to assign different importance levels to each frame. Ke et al. [18] propose Global Contextual Attention LSTM (GCA-LSTM), which can selectively focus on discriminative joints. Ke et al. [18] divide the action sequence into several short video clips, use a 2D convolutional neural network to extract features from each clip, and then input these feature sequences into an LSTM network for sequence modeling to predict the action category. The CNN-LSTM algorithm works well for long sequences and can capture long-term dependencies in time series. However, it is slower and requires more computing resources. In addition, it is not robust to factors such as camera viewpoint, noise, and partial occlusion.
Traditional skeletal action recognition methods generally require manual design and feature extraction, and often require multiple steps and the participation of domain experts, making it difficult to adapt them to different scenarios and tasks. In addition, most deep learning-based methods perform poorly under pose changes and high motion complexity in skeletal sequences. The MS-G3D network [19] does not need manually designed and extracted features; it automatically learns the features of the skeleton sequence through convolution and pooling operations, which improves the generalization ability and adaptability of the model. At the same time, the network adopts 3D convolution and an attention mechanism to process the spatio-temporal information in the skeleton sequence, effectively capturing the key features of the action while reducing the number of model parameters and the amount of computation, and improving the efficiency and accuracy of the model. However, one of its main drawbacks is its sensitivity to occluded motion, which may cause the model to fail to capture the key information of the motion correctly and lead to a decline in performance. Skeleton joints are the key information in the skeleton sequence, but in the MS-G3D network each skeleton joint is represented only as a coordinate point. This representation cannot fully express the morphological and dynamic information of the skeletal joints. Therefore, it is necessary to find better ways to strengthen the expressive ability of skeleton joints to further improve performance.
Action recognition methods based on deep learning play a significant role in promoting various applications in healthcare and smart education. However, some special scenarios may involve unclear target actions, similar actions, and occluded characters. To solve the problems of pose change, scaling, and sequence loss in skeleton sequences, we propose a 3D graph convolutional feature selection and dense pre-estimation (3D-GSD) method for skeleton action recognition. Introducing spatial and temporal attention mechanisms and a human pose prediction model allows the model to adaptively focus on key poses and skeletal joints and to consider their changes over time. Therefore, our method not only better captures the local and global information of actions but also analyzes human poses more comprehensively. The main contributions are as follows:
• We design a 3DSKNet to adaptively adjust the model. It can more accurately identify key points of different importance according to the input image resolution. Moreover, it greatly improves the estimation accuracy of missing skeleton key points, reduces the difficulty of skeletal action recognition, and increases the recognition accuracy for skeletal actions occluded by objects.
• We introduce a DensePose algorithm to detect the complex key points of human poses and integrate them into the 3D-GSD network model. The 3DSKNet attention mechanism focuses on key skeletal parts, while DensePose provides more detailed pose and shape information. By combining them, more accurate and complete human motion analysis results can be obtained.
• Extensive quantitative and qualitative experiments are conducted to verify the accuracy of 3D-GSD. The method is evaluated on three different human action recognition datasets.
The rest of the paper is structured as follows: Section II provides a brief review of related work, including skeletal action recognition, attention mechanisms, and CNN-based human pose estimation algorithms. Section III presents the details of the proposed method. Section IV shows the experimental results. Section V concludes the paper.

II. RELATED WORKS
A. SKELETAL ACTION RECOGNITION
Traditional algorithms for skeleton-based action recognition are implemented using hand-designed feature extractors, whose features can include joint angles, accelerations, velocities, and energies. These features are then fed into machine learning models for classification or regression, such as support vector machines (SVM) and hidden Markov models (HMM). With advanced deep learning techniques, models for skeleton-based action recognition have been developed and can be divided into two categories: sequence-based models and graph-based models.
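As an illustration of the hand-designed features mentioned above, the following minimal sketch computes a joint angle and a joint speed from 3D joint coordinates. The joint names, coordinates, and the 30 fps frame interval are hypothetical values chosen for the example; they do not come from any of the cited methods.

```python
import math

def joint_angle(a, b, c):
    """Angle in radians at joint b formed by the segments b->a and b->c."""
    v1 = [a[k] - b[k] for k in range(3)]
    v2 = [c[k] - b[k] for k in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def joint_speed(p_prev, p_next, dt):
    """Finite-difference speed of one joint between two frames."""
    return math.dist(p_prev, p_next) / dt

# Hypothetical shoulder, elbow, and wrist positions (metres).
shoulder, elbow, wrist = (0.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.3, 0.25, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
# Hypothetical wrist positions in two consecutive frames at 30 fps.
wrist_speed = joint_speed((0.3, 0.25, 0.0), (0.3, 0.28, 0.0), dt=1 / 30)
```

Features of this kind would then be stacked per frame and passed to a classifier such as an SVM or HMM.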
Sequence-based models typically use recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) to model sequence data. These models can handle variable-length skeleton sequence data and consider the temporal relationship between joints. Liu et al. [20] propose a new gating mechanism to deal with noise in skeleton data by learning the reliability of sequential data and accordingly adjusting the input data's contribution to the long-term contextual representation stored in the cell's memory unit. Wang et al. [21] propose a novel hierarchical attention network with pseudo-meta-paths for skeletal action recognition, which learns discriminative features by capturing the long-range dependencies of skeletal joints. Zhang et al. [9] propose a Two-Stream Transform Encoder (TSTE) network utilizing motion spatiotemporal feature embedding and shape transformation. San et al. [22] provide a comprehensive review of deep learning techniques for human activity recognition (HAR) and a resource guide for researchers and practitioners. Zhang et al. [23] introduce deep learning-based methods for human action recognition, including methods based on color videos, skeleton sequences, and depth maps. Li et al. [24] propose a new CNN-based action classification and detection framework. Although sequence-based models perform well in skeletal action recognition tasks, some problems and challenges remain: the collected skeletal sequence data may contain noise or missing data, for example due to sensor failure or human body occlusion.
Graph-based models aim to address the limitations of sequence-based models, mainly utilizing graph convolutional networks (GCNs) to model the relationships in skeletal sequences. Yan et al. [25] are the first to use ST-GCN to model skeleton-based action recognition. The AS-GCN network proposed by Li et al. [26] can effectively share information between different actions, and its graph structure can be adaptively optimized through network learning to obtain more behavior details and improve the recognition effect. Shi et al. [27] propose a two-stream network (2s-AGCN) that simultaneously uses joint (key point) information and bone information to obtain more skeleton features for action recognition, significantly improving recognition performance. Shi et al. [28] further propose directed graph neural networks (DGNN), which can dynamically construct graph connections end-to-end, surpass other methods on all indicators, and effectively identify complex skeletal motions. Graph-based models can naturally capture complex relationships among skeletal sequences and can better handle multi-person actions and object interactions. However, graph models have higher time complexity than sequence-based models since computations need to be performed on all nodes and edges. This can lead to increased training and inference time, limiting the usefulness of these models in real applications.
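To make the idea of convolving over a skeleton graph concrete, the sketch below implements one generic graph convolution layer, ReLU(Â H W), where Â is the symmetrically normalized adjacency matrix with self-loops. This is a textbook formulation given only for illustration, not the exact layer of ST-GCN, AS-GCN, 2s-AGCN, or DGNN; the three-joint chain, the features, and the weights are hypothetical.

```python
import math

def normalized_adjacency(edges, num_joints):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def matmul(x, y):
    """Plain matrix product of two nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def graph_conv(a_hat, features, weight):
    """One layer: ReLU(A_hat @ features @ weight)."""
    out = matmul(matmul(a_hat, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical 3-joint chain (shoulder, elbow, wrist) with 2 input channels.
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(edges, 3)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0], [1.0]]  # maps 2 channels to 1 output channel
out = graph_conv(a_hat, features, weight)
```

Each output row mixes a joint's own features with those of its neighbors, which is how such layers propagate information along bones. The cost grows with the number of nodes and edges, in line with the complexity remark above.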

B. ATTENTION MECHANISM
In recent years, some new attention mechanisms [29], [30] have been proposed. For example, interactive attention can learn to find key features and salient parts in the input data to achieve better task performance [31], and multi-head attention [32] runs multiple attention mechanisms on the same input data and merges the results. Channel attention [33], [34], [35], [36], which is the main focus of this paper, can locate specific information in complex data and improves the accuracy and efficiency of the model by learning how to adjust the weight of each channel in the input data.
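The channel reweighting described above can be sketched with a squeeze-and-excitation style gate: average each channel, pass the averages through a small bottleneck, and scale each channel by the resulting gate. This is one common form of channel attention and is assumed here purely for illustration; the input values and weights are hypothetical and hand-picked.

```python
import math

def channel_attention(feature, w_down, w_up):
    """SE-style gating. feature: list of C channels, each a list of values."""
    # Squeeze: one average per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature]
    # Excite: bottleneck layer with ReLU, then a layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]
    # Scale: reweight each channel by its gate, which lies in (0, 1).
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature)]
    return gates, scaled

# Hypothetical input: 2 channels with 3 values each.
feature = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
w_down = [[1.0, 1.0]]   # 2 channels -> 1 hidden unit
w_up = [[1.0], [-1.0]]  # 1 hidden unit -> 2 channel gates
gates, scaled = channel_attention(feature, w_down, w_up)
```

With these weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the informative channel is emphasized.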

C. HUMAN POSE ESTIMATION BASED ON CNN
Human Pose Estimation (HPE) [37] obtains human motion information from visual data, including the positions of key points, attitude angles, and other information. With the rapid development of CNNs, more CNN models are used for human pose estimation, such as the Hourglass model [38], the Integral Human Pose Regression model [39], and the Simple Baseline model [40]. Mask R-CNN [41] is a CNN-based object detection and semantic segmentation algorithm; it does not directly output the positions of the key points of the human body but outputs the rectangular frame where the human body is located and the position of each key point within that frame. Subsequently, the DensePose algorithm [42] appeared. Its key idea is to train on a large amount of data with pose annotations so that the model learns the mapping between the human body surface and image pixels and can predict the position of each pixel on the human body surface. However, in practical applications, different scenarios and tasks require different loss functions, which also need to be designed and optimized for the specific problem.

III. PROPOSED METHOD
MS-G3D [19] is a skeleton recognition method based on a 3D CNN. It can analyze and predict the input 3D skeleton sequence, but it cannot fully express the shape and dynamic information of skeleton joints when the motion is occluded. The proposed 3D-GSD is built on this basis: it retains the MS-G3D module to extract the spatio-temporal feature representation of the skeleton sequence and adds a new feature selection module, 3DSKNet, and a dense pre-estimation module, DensePose, as shown in Figure 1. 3DSKNet is an attention mechanism for 3D convolutional neural networks. It adaptively learns important features in skeleton sequences, focuses on key skeleton parts and action sequences, and ignores unimportant parts such as noise, interference, and irrelevant joints, which helps to improve the accuracy and robustness of skeleton recognition. After skeleton recognition, it is necessary to estimate the pose and shape of the human body in three-dimensional space. Using DensePose to estimate the human body pose on the input image allows the pose to be analyzed more comprehensively. Specifically, 3DSKNet provides position and motion information of key points of interest, while DensePose estimates more detailed pose and shape information. Combining them yields more accurate and complete human motion analysis results, which is of great significance to many application fields, such as motion analysis and medical diagnosis.

A. 3DSKNET MODULE
SKNet [52] is a lightweight attention mechanism that can enhance the representation ability of the network. However, SKNet cannot be directly applied to a 3D network structure because its attention is computed only over space, whereas a 3D structure has a time dimension (t) in addition to the spatial dimensions (x, y, z), so the attention mechanism must also be designed in the time dimension. In addition, the attention mechanism of SKNet needs to operate on feature maps of different scales, and the 3D network structure requires a more complex design to deal with feature maps of different scales because the range of scale changes is larger. Therefore, a 3DSKNet mechanism that can handle joint information in both spatial and temporal dimensions is proposed. The 3DSKNet mechanism adopts a 3D convolution operation and an attention mechanism, which adaptively learn the spatiotemporal features of each joint point and perform a weighted fusion of the features of different time steps to capture the spatiotemporal relationship. In 3DSKNet, the feature learning and feature selection of each joint point are carried out in three-dimensional space; the formula is as follows:
where $y_i$ represents the feature vector of the i-th joint point, $X_i$ is the feature input of the i-th joint point, $W_1$ and $W_2$ are learning parameters, and $s_i$ is the average feature of the i-th joint point in the time dimension. In 3DSKNet, the attention coefficient can be expressed as:
where $W_3$ is the learning parameter and $b_3$ is the bias parameter; $z_{i,t}$ represents the attention coefficient of the i-th joint point at time t, and the attention-weighted feature of each joint point at time t can be calculated:
where $a_{i,t}$ is the attention-weighted feature of the i-th joint point at time t, and $y_{i,t}$ is the feature vector of the i-th joint point at time t. Finally, the attention-weighted features of each joint point at different moments can be fused to obtain the final feature representation:
where $V$ is the number of joint points, $T$ is the number of time steps, and $F$ is the final feature representation, which can be used for subsequent tasks, such as skeleton recognition and human pose estimation. Figure 2 shows in detail how 3DSKNet is extended to 3D skeleton data; the mechanism mainly includes three stages, namely the split stage, the fuse stage, and the select stage.
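The attention-weighted fusion over joints and time steps can be illustrated with a small sketch. The exact formulas are not reproduced in this text, so the code makes three labeled assumptions: the attention coefficients of a joint are normalized with a softmax over time, the weighted feature is the product of the coefficient and the feature vector, and the final representation is the mean over all V joints and T time steps. It is a reading of the description above, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(y, scores):
    """y[i][t] is a feature vector; scores[i][t] is a raw attention score."""
    num_joints, num_steps, dim = len(y), len(y[0]), len(y[0][0])
    fused = [0.0] * dim
    for i in range(num_joints):
        z = softmax(scores[i])  # assumed: normalise coefficients over time
        for t in range(num_steps):
            for d in range(dim):
                fused[d] += z[t] * y[i][t][d]  # assumed: a_{i,t} = z_{i,t} * y_{i,t}
    # Assumed: the final representation is the mean over V joints and T steps.
    return [v / (num_joints * num_steps) for v in fused]

# Hypothetical toy input: V = 2 joints, T = 2 time steps, 1-dimensional features.
y = [[[1.0], [3.0]], [[2.0], [2.0]]]
scores = [[0.0, 0.0], [5.0, 5.0]]
F = attention_fuse(y, scores)
```

Because both joints have equal scores across time in this toy input, each time step receives weight 0.5; unequal scores would shift the fused feature toward the higher-scoring frames.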
The split stage mainly performs scale-invariant processing on the input feature map and adds convolution operations with different kernels according to the number of branches. First, the input feature map X is divided into multiple sub-feature maps, each of which corresponds to a convolution kernel. According to the number of branches M, the input feature map is divided into M parts as input. Specifically, for the i-th branch, a convolution kernel of size (2i + 1) × (2i + 1) × (2i + 1) × 3 is used for the convolution operation, with a stride of 1 and a padding of 1. After the convolution operation, the feature maps of all branches are stitched together to obtain a feature map of size M × T × V × H × H × out_channels, where T represents the number of time steps, V represents the number of joint points, H represents the size of the spatial dimension (height and width), and out_channels represents the number of output channels.
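The relationship between branch index, kernel size, and output size can be checked with the standard convolution output-length formula. The sketch below is illustrative only: the input length of 64 and the helper names are hypothetical, and the "padding = i" variant is an assumption added for comparison, not something stated in the text.

```python
def branch_kernel_size(i):
    """Extent of the i-th branch kernel along one axis, per the text: 2i + 1."""
    return 2 * i + 1

def conv_output_length(n, kernel, stride=1, padding=1):
    """Standard output length of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

n = 64  # hypothetical input length along one axis
for i in (1, 2, 3):
    k = branch_kernel_size(i)
    same = conv_output_length(n, k, padding=i)   # assumed variant: padding = i
    fixed = conv_output_length(n, k, padding=1)  # padding = 1 as stated in the text
    print(f"branch {i}: kernel {k}, padding=i -> {same}, padding=1 -> {fixed}")
```

Under this formula, a padding of 1 preserves the length only for the 3-wide kernel, while a padding of i preserves it for every branch. How the branch outputs are brought to a common size before stitching is not specified in the text, so the padding-equals-i variant is only one possibility.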
where $W_4$ represents the fully connected layer, and Squeeze represents compressing the dimension of the feature to 2. After dimension reduction, dimension increase, and softmax operations are applied to the feature, the correlation between the features is learned, and the weights of different positions are obtained for selecting the appropriate subset of features.
In the selection stage, the feature U output by the fusion stage is first divided into two features a and b through the split operation. Next, the second dimension (the number of channels) is compressed to obtain two vectors whose length is half the number of channels, denoted as a′ and b′, respectively. Then, a′ and b′ are each multiplied element-wise by the weight vector, so that the weighted feature V represents the entire feature U more accurately. The weight vector is produced by the softmax function, which maps each element to the [0, 1] range and ensures that the elements sum to 1.
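The softmax-based weighting in the selection stage can be sketched as follows. The code follows the general selective-kernel pattern, in which a softmax across branches yields per-channel weights that sum to 1 and the selected feature is the weighted sum of the branch features. The two-branch layout and all numbers are hypothetical, and the sketch simplifies the a′/b′ splitting described above.

```python
import math

def select(branch_feats, branch_logits):
    """branch_feats[m][c] and branch_logits[m][c]: M branches, C channels."""
    num_branches, num_channels = len(branch_feats), len(branch_feats[0])
    out, weights = [], []
    for c in range(num_channels):
        logits = [branch_logits[m][c] for m in range(num_branches)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        w = [e / total for e in exps]  # softmax over branches: in [0, 1], sums to 1
        weights.append(w)
        out.append(sum(w[m] * branch_feats[m][c] for m in range(num_branches)))
    return out, weights

# Hypothetical values: M = 2 branches, C = 2 channels.
feats = [[1.0, 10.0], [3.0, 20.0]]
logits = [[0.0, 2.0], [0.0, -2.0]]
V, weights = select(feats, logits)
```

For the first channel the two branches are weighted equally, while for the second channel the first branch dominates, so the output stays close to that branch's value.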
In general, the design idea of the 3DSKNet mechanism is to fuse the output features of the skeleton recognition module with the global features with spatial relationships, to improve the performance of skeleton recognition.

B. DENSEPOSE MODULE
To better capture the characteristics of human motion, we use the DensePose module at the end of 3D-GSD to improve the accuracy of skeletal behavior recognition. By predicting the positions of the key points of the human skeleton, the module provides richer posture information and enables more detailed posture estimation, such as the specific position of the hand, the degree of flexion and extension of the fingers, and the posture of the body, thereby improving the accuracy. Specifically, the DensePose human pose prediction module is divided into three stages: feature extraction, fusion of the DensePose feature and the skeleton feature, and pose estimation, as shown in Figure 3.
First, the output feature map of the previous stage is used as input, and a series of convolutional layers (including 3 Conv2d and 3 ConvTranspose2d) is used for feature extraction and dimensionality reduction. In the recognition of skeletal sequences, 384 feature points are extracted from the output of the skeleton network and used as the representation of skeletal sequences. Specifically, a Conv2d operation reduces the number of feature channels from 384 to 256, and two downsampling operations (Conv2d with stride=2) then reduce the feature size to 1/4 of the original. Subsequently, two more Conv2d operations reduce the number of channels of the feature map to 128 and 64, respectively. Finally, a Conv2d operation reduces the number of channels of the feature map to 32, and ConvTranspose2d is applied three times to increase the dimension and obtain the final feature map. The purpose is to reduce the number of feature channels while keeping the size of the feature map constant, thereby improving the abstraction ability of the features. By fusing the DensePose feature with the skeleton feature, more comprehensive human pose information can be obtained. The fusion formula is as follows:
where $F_{DP,t}$ is the DensePose feature of the t-th frame, $F_{ske,t}$ is the skeleton feature of the t-th frame, $[\cdot\,;\cdot]$ represents the splicing operation in the feature channel dimension, $T$ is the total number of frames in the video, and $F_{fuse}$ is the fused feature vector. Then, global average pooling is applied to $F_{fuse}$ to obtain the final feature vector $f$:
where $F_{fuse,t}$ is the fused feature vector of the t-th frame, and the fused features are input into two fully connected layers for classifying actions. The final output is the probability value for each category, which is obtained by the softmax:
where $y_k$ represents the probability of belonging to the k-th category, and $h$ is the output of the two fully connected layers.
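The three formulas referred to in this paragraph are not reproduced in the text. One reading that is consistent with the symbol definitions given there is sketched below. It assumes that the fusion is a per-frame channel concatenation, that the global average pooling is a mean over the $T$ frames, and that the softmax is taken over the components $h_k$ of the fully connected output; these are assumptions and may differ from the authors' exact equations.

```latex
\begin{align}
F_{\mathrm{fuse},t} &= \left[\, F_{\mathrm{DP},t} \,;\, F_{\mathrm{ske},t} \,\right],
  \qquad t = 1, \dots, T, \\
f &= \frac{1}{T} \sum_{t=1}^{T} F_{\mathrm{fuse},t}, \\
y_k &= \frac{\exp(h_k)}{\sum_{j} \exp(h_j)} .
\end{align}
```

Under this reading, $F_{\mathrm{fuse}}$ is the collection of the $T$ per-frame concatenations, $f$ has as many channels as the DensePose and skeleton features combined, and the probabilities $y_k$ sum to 1 over the action categories.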

IV. EXPERIMENTS
Experiments are conducted on a Windows system equipped with an Intel Xeon(R) 4210R CPU and an NVIDIA RTX 3090 GPU. The network is implemented on the PyTorch platform. The full source code is available at https://github.com/wizardbo/3D-GSD.
We quantitatively compare our method with other competing deep learning-based methods on the NTU RGB+D 60 Skeleton, NTU RGB+D 120 Skeleton, and Kinetics Skeleton 400 datasets. Table 1 and Table 2 display the X-Sub and X-Set results for all the competing methods. It can be seen that the proposed method achieves the best X-Sub and X-Set results on all datasets. Moreover, the X-Sub and X-Set values of our method are 0.02% to 0.06% higher than those of the baseline MS-G3D. Table 3 shows the Top-1 and Top-5 accuracy on the Kinetics Skeleton 400 dataset, where our method achieves the best results.
These results indicate that our proposed method achieves better performance for various datasets and improves the action recognition performance of the model by focusing on key parts and action details.
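For readers unfamiliar with the Top-1 and Top-5 metrics, the following sketch shows how top-k accuracy is typically computed from per-class scores. The scores and labels are hypothetical, and k = 2 is used in place of 5 only because the toy example has four classes.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

# Hypothetical scores for 3 samples over 4 classes.
scores = [[0.1, 0.6, 0.2, 0.1],
          [0.5, 0.1, 0.3, 0.1],
          [0.3, 0.1, 0.1, 0.5]]
labels = [1, 2, 0]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Here only the first sample is correct at k = 1, while all three are correct at k = 2, which illustrates why top-k accuracy never decreases as k grows.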
In terms of complexity, the proposed 3D-GSD contains 5,012,643 parameters and MS-G3D contains 3,194,595 parameters. In terms of training time, both MS-G3D and 3D-GSD took around 1 week, and the difference is only a few hours. This is understandable because the number of parameters of the proposed model is larger. However, the difference in training time is acceptable. Moreover, the difference in testing time is minor: the proposed 3D-GSD took 13.668 seconds and MS-G3D took 12.918 seconds on the NTU RGB+D 120 Skeleton dataset.
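The reported figures can be put in relative terms with a short calculation. The numbers below are taken directly from the paragraph above; only the variable names are introduced here.

```python
params_3dgsd = 5_012_643
params_msg3d = 3_194_595
test_3dgsd_s = 13.668
test_msg3d_s = 12.918

extra_params = params_3dgsd - params_msg3d
param_ratio = params_3dgsd / params_msg3d
extra_time_s = test_3dgsd_s - test_msg3d_s
time_increase_pct = 100 * extra_time_s / test_msg3d_s

print(f"extra parameters: {extra_params:,} ({param_ratio:.2f}x)")
print(f"extra test time: {extra_time_s:.3f} s ({time_increase_pct:.1f}%)")
```

In other words, 3D-GSD has about 1.8 million more parameters than MS-G3D (roughly 1.57 times as many), while its test time is about 0.75 seconds, or roughly 5.8%, longer.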

C. ABLATION STUDY
To further validate the proposed 3D-GSD, we analyze the contribution of each module by removing different network modules, namely 3DSKNet and DensePose, respectively. We also perform joint-only and bone-only tests on the NTU RGB+D 60 Skeleton dataset, as shown in Table 4.
The joint-only 3D-GSD network with only 3DSKNet has a recognition rate 0.07% lower than the joint-only 3D-GSD network with both the 3DSKNet attention mechanism and the DensePose pre-estimation module. For the bone-only 3D-GSD network, the recognition rate with both the 3DSKNet attention mechanism and the DensePose pre-estimation module is 0.03% higher than with 3DSKNet alone. The DensePose pre-estimation module estimates the key points and pose information of the human body in the input image, thereby improving the recognition accuracy of occluded parts. It also maps two-dimensional points on the image to the surface of the three-dimensional human body and labels them, so that the model understands the posture and shape of the human body more accurately and handles occlusion effectively. The rising curves of X-Sub (%) and X-Set (%) for each part of the NTU RGB+D 60 dataset are shown in Figure 4.
In the experiments of NTU RGB+D 60 (Joint Only) and NTU RGB+D 60 (Bone Only), since the network only considers joint points and bone information, adding the 3DSKNet mechanism can improve the connection between joint points and bones so that the model can better understand the skeleton information and better distinguish different actions.At the same time, the 3DSKNet mechanism can effectively reduce noise interference and improve the robustness of the model, thereby improving the accuracy of the model.

V. CONCLUSION
This paper proposes a 3D graph convolutional feature selection and dense pre-estimation (3D-GSD) method for skeleton action recognition. The method adds the 3DSKNet attention mechanism to the MS-G3D skeleton recognition network and introduces the DensePose human pose estimation module. Specifically, the designed 3DSKNet attention mechanism makes the network pay more attention to important features, improving the accuracy while keeping the computational cost small. In addition, the DensePose module provides pose information on the skeletal sequence, further enhancing the performance of skeletal behavior recognition. The 3D-GSD network has advantages in processing spatiotemporal sequence data, so it can better handle skeleton sequence data. Finally, the paper conducts extensive experimental validation on several commonly used action recognition datasets. The results show that the proposed method achieves the best performance on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets, with accuracy gains of 0.02%, 0.06%, and 0.1% over the best-performing methods. The 3D-GSD network model may be more effective for specific datasets and tasks, while its generalization performance on other datasets or tasks may be degraded. This is because features and fusion strategies for multimodal data are usually designed for specific problems and may not be applicable to other scenarios.
In the fusion stage, the features of different scales obtained by all branches are first added element by element to generate a mixed feature U with dimensions [N, C′, T, V, M]; then, three-dimensional adaptive pooling is performed on U to compress the features to the specified dimension 1 and obtain S with dimensions [N, C′, 1, 1, M]. Next, the result of the previous step is squeezed into [N, C′], and a fully connected layer is used to reduce the dimension to L to get a scalar d. The formula is:
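The fusion steps described above (element-wise sum over branches, global pooling, fully connected reduction) can be sketched for a single sample as follows. The sketch makes simplifying assumptions that are not confirmed by the text: the pooling is an average over time and joints, the branch dimension M is removed by the sum, and the fully connected layer is followed by a ReLU. All sizes and values are hypothetical.

```python
def fuse(branches, w_fc):
    """branches[m][c][t][v]: M branch outputs for one sample; w_fc: L x C weights."""
    num_c = len(branches[0])
    num_t = len(branches[0][0])
    num_v = len(branches[0][0][0])
    # Element-wise sum over branches gives the mixed feature U[c][t][v].
    u = [[[sum(b[c][t][v] for b in branches) for v in range(num_v)]
          for t in range(num_t)] for c in range(num_c)]
    # Assumed average pooling over time and joints: one value per channel.
    s = [sum(sum(row) for row in u[c]) / (num_t * num_v) for c in range(num_c)]
    # Fully connected layer reduces C channels to L values (ReLU assumed).
    d = [max(0.0, sum(w * x for w, x in zip(row, s))) for row in w_fc]
    return u, s, d

# Hypothetical sizes: M = 2 branches, C = 2 channels, T = 2 frames, V = 2 joints, L = 1.
branch_a = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
branch_b = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
u, s, d = fuse([branch_a, branch_b], w_fc=[[0.5, 0.5]])
```

The compact descriptor d summarizes all branches at once, which is what allows a later stage to decide how much weight each branch should receive.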
JUNXIAN ZHANG is currently a Lecturer with the School of Health, Jiangsu Vocational Institute of Commerce, China. Her research interests include smart elderly care technology and elderly care technology.
AIPING YANG is currently a Professor with the School of Health, Jiangsu Vocational Institute of Commerce, China. Her research interests include artificial intelligence, smart elderly care technology, and nutrition allocation technology.

TABLE 1. Quantitative comparison (X-Sub and X-Set) on the NTU RGB+D 60 Skeleton dataset. The top-performing result is highlighted in bold, while the second-best result is underlined.

TABLE 2. Quantitative comparison (X-Sub and X-Set) on the NTU RGB+D 120 Skeleton dataset. The top-performing result is highlighted in bold, while the second-best result is underlined.

TABLE 4. Ablation study of 3D-GSD for different modules on the NTU RGB+D 60 Skeleton dataset.