Skeleton-Based Action Recognition With Low-Level Features of Adaptive Graph Convolutional Networks

Skeleton-based action recognition is a typical classification problem that plays a significant role in human-computer interaction and video understanding. Since the human skeleton has a natural graph structure, methods based on graph convolutional networks (GCN) are widely applied to skeleton-based action recognition. Previous studies mainly focus on structural links in GCN to generate high-level features of the human skeleton. However, low-level features are also important in many applications; for instance, low-level edge gradients and color information are important for image classification. This paper introduces a multi-branch structure to capture different low-level features of the human skeleton, and we combine both high-level and low-level features to recognize human actions. We validate our method on two skeleton datasets, NTU-RGB+D and Kinetics. Experimental results indicate that the proposed method achieves considerable improvement over several state-of-the-art methods.


I. INTRODUCTION
Human action recognition is a typical classification problem that plays a key role in computer vision. Action is an important dimension for human beings to express their feelings and intentions, just like language and facial expressions. Many applications based on action recognition have received extensive study, such as video surveillance [1], human-computer interaction [2], game control [3], virtual reality [4] and video retrieval [5]. These applications have developed rapidly in the past twenty years. Data modalities for action recognition have expanded from mainly RGB-based data [6] to a variety of modalities, including skeleton [7], point cloud [8], radar [9], WiFi [10], etc. Recently, methods based on skeleton data have attracted increasing attention due to the development of accurate and affordable sensors such as Leap Motion and Kinect [11]. Nowadays, skeleton data can be captured more easily than ever before. A human skeleton is a topological representation of the human body given by the locations of key joints in 3-dimensional space. Compared with other modalities, the human skeleton retains most human action information with less data. For example, an image may have a complex background, whereas a skeleton contains only joint information. Therefore, skeleton-based approaches are less computationally demanding and more robust to variations in viewpoint, motion speed, body scale, etc. [12]. (The associate editor coordinating the review of this manuscript and approving it for publication was Juntao Fei.)
In the early days, RGB videos, which contain the temporal dynamics of human motions, were easier to obtain than other modalities, so most human action recognition work was RGB-based. As in many other tasks, human action features for RGB videos gradually changed from hand-crafted features [13] to deep features [14]. Even with deep learning methods, RGB-based action recognition remains challenging, because RGB videos are sensitive to viewpoint variations, illumination conditions and background. Besides, action data have a much larger file size in RGB video format than in skeleton format, which leads to higher computational cost. Naturally, researchers turned to skeleton-based methods. For skeleton data, hand-crafted spatial and temporal features [15] were applied first. At present, deep learning methods have become the mainstream in this field because of their powerful feature learning ability. Skeleton data can be seen as a sequence of static skeleton frames. RNNs are suitable for learning dynamic dependencies in sequential data, so various RNN-based methods [7], [16] are applied to model the temporal information of skeleton data. Some researchers use CNNs to model the spatio-temporal information of skeletons; the main idea is to treat a 3D skeleton sequence as a sequence of pseudo-images [17]. However, due to their network characteristics, neither CNNs nor RNNs can effectively capture the complex spatio-temporal information and the correlations between joints in a skeleton.
As mentioned, the human skeleton is a natural topological graph of key body joints. It is difficult to use proven models like CNN or RNN on a graph structure directly. GCN can be seen as a generalization of CNN to arbitrary graph structures. Therefore, some researchers utilize GCN to model skeleton data. ST-GCN [18], as shown in Fig.1(a), is the first work that successfully applies GCN to model dynamic graphs over large-scale human skeleton sequences. ST-GCN uses the natural connections of joints in a human body to construct a graph structure. Meanwhile, it adds temporal edges to connect the same joint across consecutive time steps. Multiple layers of GCN and temporal convolutional networks (TCN) are stacked alternately thereon: spatial features are extracted by GCN layers and temporal features by TCN layers. The main problem of ST-GCN is that the skeleton graph is predefined and the adjacency matrix represents only the physical structure of a human body. Some indirectly connected joints have semantic relationships in actions like ''walking'' and ''clapping''. ST-GCN cannot capture action features that require long-range dependencies between joints.

FIGURE 1. Illustration of deep learning frameworks for skeleton-based action recognition. From top to bottom: (a) GCN is used to capture connections within one skeleton in one frame and TCN is used to capture connections of the same joint across frames. The alternating placement of GCN and TCN blocks has been adopted by subsequent methods. (b) AS-GCN constructs A-links and S-links to improve the adaptability of the adjacency matrix. A-links are generated by an encoder-decoder structure to get the predicted action. (c) 2s-AGCN constructs a parameterized adjacency matrix ($C_k$) to capture links between indirectly connected joints dynamically and generates a bone stream (the line skeleton) to capture the second-order information of a skeleton.
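The alternating GCN-TCN scheme of Fig.1(a) can be illustrated with a minimal sketch of the temporal half: a TCN step is, at its core, a 1D convolution slid along the frame axis independently for each joint and channel. All shapes and the smoothing kernel below are illustrative, not the paper's configuration.

```python
import numpy as np

def tcn_step(X, kernel):
    """Temporal convolution sketch: slide a 1D kernel over the frame axis,
    applied independently to every channel and joint. X has shape (C, T, V)."""
    C, T, V = X.shape
    k = len(kernel)
    out = np.zeros((C, T - k + 1, V))
    for t in range(T - k + 1):
        # contract the kernel with a window of k consecutive frames
        out[:, t, :] = np.tensordot(kernel, X[:, t:t + k, :], axes=(0, 1))
    return out

# Toy sequence: 3 channels, 10 frames, 18 joints (Kinetics-style skeleton).
X = np.random.randn(3, 10, 18)
Y = tcn_step(X, np.array([0.25, 0.5, 0.25]))   # smooth each joint over time
print(Y.shape)                                  # (3, 8, 18)
```

A full GCN-TCN block would interleave this with the spatial graph convolution described in the method section, one pair per layer.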
Establishing connections between indirectly connected joints is the main direction of improvement. Li et al. [19] propose the Actional-Structural GCN (AS-GCN), as shown in Fig.1(b), which connects long-distance joints using two types of links, named actional links and structural links. In AS-GCN, actional links generated by an encoder-decoder structure capture latent dependencies between arbitrary joints, while structural links generated by a high-order adjacency matrix represent high-order relationships. Both types of links are essentially fixed during classification. Based on [19], Li et al. propose Sym-GNN [42] to capture links between body parts. The main idea of Sym-GNN and AS-GCN is to determine the adjacency matrix by segmentation, and this segmentation is still based on the physical structure of the body. Almost at the same time, an adaptive graph convolutional network named 2s-AGCN is proposed in [20]. 2s-AGCN takes advantage of a feature integration strategy: it generates a bone stream from the joints to improve recognition performance, as shown in Fig.1(c). Besides, it designs an adaptive adjacency matrix to capture links between indirectly connected joints dynamically. Based on [20], Shi et al. [21] add an attention mechanism to further improve recognition performance. After that, Shi et al. [43] propose DSTA-Net, which looks closely into the spatial-temporal structure of human action sequences and uses the transformer idea [44] to decouple spatial and temporal features. MS-G3D [41] utilizes a high-order adjacency matrix to enhance adaptability and proposes a unified spatial-temporal graph convolution to capture cross-spacetime correlations, but it has high computational complexity because of the high-order adjacency matrix. Cheng et al. [45] propose a shift graph convolutional network (Shift-GCN) based on Shift CNN [46] to reduce computational complexity.
As seen in Fig.1, most deep learning frameworks are linear stacks of GCN and TCN blocks, and final features are extracted from the last layer of the network. Based on graph theory, information propagated through a multi-layer graph convolutional network eventually reaches a convergent state on the joints of a human body. Features in the last layer are high-level; low-level features are gradually integrated into high-level features during propagation. We try to utilize low-level features for two reasons. The first is that experience with CNNs shows low-level information is also critical for classification [22]. In [18]-[21], all frameworks apply a local residual structure to improve performance. In some cases, global low-level features are also helpful for classification; for example, in [23], different levels of features represent different areas of a vehicle. The second reason is to mitigate the degradation problem: as network depth increases, accuracy degrades rapidly in CNNs. In [22], He et al. introduce a residual structure to solve this problem, whose core idea is utilizing low-level features directly. Based on both considerations, we propose a novel framework that takes advantage of global low-level features, built on [20]. Fig.2 presents the pipeline of our framework. We directly capture features from different low-level layers and concatenate them before the last layer. Besides, we apply a preprocessing strategy [17] to improve action recognition performance.

VOLUME 9, 2021

FIGURE 2. The pipeline of the proposed framework. In each branch, we apply a GCN-TCN block to capture different low-level features directly. The input dimension of each branch is different, and the output dimension is the same. Before classification, we concatenate different low-level features and backbone features together.
The core idea of this strategy is to determine whether a frame is valid by calculating the variance of joints, so as to delete invalid frames. The details of this strategy are introduced in the experiment section. To verify the effectiveness of the proposed model, named the low-level adaptive graph convolutional network (LAGCN), we conduct extensive experiments on two large-scale datasets: NTU-RGB+D [24] and Kinetics-Skeleton [25]. The experiments demonstrate that LAGCN achieves state-of-the-art performance.
The main contributions of our work lie in three aspects: (1) a human action recognition framework with a multi-branch structure is proposed to learn low-level features of skeleton data; (2) a better preprocessing strategy is used to improve action recognition performance; (3) on two large-scale datasets for skeleton-based action recognition, the proposed LAGCN exceeds the state of the art.

II. RELATED WORK
A. NEURAL NETWORKS ON GRAPHS
Classical neural networks have achieved great success in processing structured data; for example, images can be seen as grids and texts can be embedded into fixed-length vectors. Recently, researchers have begun to pay attention to unstructured data, and graph-based methods are a hot topic in deep learning research. The GNN [26] is the first work to combine graphs and recurrent neural networks for graph representation learning. Scarselli et al. [26] prove mathematically that graph representation learning using a recurrent neural network and the Almeida-Pineda algorithm [27] converges. After that, GGNN [28], based on the GRU, was proposed. Although GGNN cannot guarantee convergence from an arbitrary initial state, it is more flexible and practical in applications. Subsequently, spectral and spatial GCNs appeared. Spectral GCN [29], based on spectral graph theory, transforms graph signals with the Laplacian in the graph spectral domain. Because of its simplicity as a mean neighborhood aggregator, many subsequent spatial GCN frameworks have been developed from it. Spatial GCNs [15], [30] apply a convolution operation directly on each node and its neighbors to compute a new feature vector. References [18]-[21] and this work all adopt the layer-wise update rule in [29].

B. SKELETON-BASED ACTION RECOGNITION
Feature extraction methods for skeleton-based action recognition have gradually changed from early hand-crafted features [13], [15] to deep learning, as in many other applications. Although traditional manual methods have good interpretability, their performance is hardly satisfactory. As large amounts of data became easier to acquire, data-driven methods based on deep learning became the mainstream. Since skeleton videos can be seen as sequences of frames, RNN-based methods [7], [30], [31] were introduced into action recognition; most of them convert the human action video classification problem into a sequence classification problem. CNN-based methods [32]-[34] treat human action classification as a 2D or 3D pseudo-image classification problem. None of the methods above takes into account the graph characteristics of the human skeleton. Since GCN can better extract neighborhood features between joints, GCN-based methods have become a hot research direction [18]-[21].

III. METHOD
A. GRAPH CONVOLUTIONAL NETWORKS
It is natural to use graph convolutional networks on skeleton data from an intuitive point of view. The most essential reason is that GCN can extract irregular neighborhood features: the number of neighboring vertices is variable, as shown in Fig.3. We can use the Laplacian to characterize the degree of difference between vertices with

$$(Lf)(v_i) = \sum_{v_j \in N_i} \big( f(v_i) - f(v_j) \big), \quad (1)$$

where $N_i$ is the neighborhood of vertex $v_i$. If we apply (1) to all vertices in the graph, then we have

$$Lf = (D - A)f, \quad (2)$$

where $D = \mathrm{diag}(d_1, \ldots, d_N)$, $d_i = \sum_{j \in N_i} A_{ij}$ is the degree of vertex $v_i$ and $A$ is the adjacency matrix. Then, we have

$$f_{out} = H(L) f_{in}, \quad (3)$$

where $H$ is called the filter function. Kipf and Welling [29] replace $H$ with

$$H = \theta \big( I_N + D^{-1/2} A D^{-1/2} \big). \quad (4)$$

In practice, $\theta$ is set to 1 and $I_N + D^{-1/2} A D^{-1/2}$ is replaced by $\tilde{L} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.

B. ST-GCN

There are two main contributions of ST-GCN [18]. The first is that they utilize OpenPose [37] to process action videos to obtain skeleton data. Each skeleton video consists of a 4-dimensional data block, whose dimensions are the number of joint feature channels, frames, joints and subjects. For example, a (3, 150, 18, 1) block means one subject with 18 joints, each carrying 3 coordinate features, over 150 frames. From this data form, we can see that it is a normalized matrix form of the data. Unlike the general graph convolution problem, the graph structure of the human skeleton is fixed, so the process of ST-GCN is very similar to CNN. As shown in Fig.1(a), the spatial graph convolution of ST-GCN is

$$f_{out}(v_{ti}) = \sum_{v_{tj} \in B(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} f_{in}(v_{tj}) \, w\big(l_{ti}(v_{tj})\big),$$

where $f_{in}$ is the feature map on vertex $v_{tj}$, $B(v_{ti})$ is the set of 1-distance neighbors of $v_{ti}$, $w$ is a weighting function indexed by the partition label $l_{ti}(v_{tj})$ and $Z_{ti}$ is a normalizing term.
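The layer-wise update rule of [29] can be sketched in a few lines of NumPy. This is a minimal illustration on a toy 3-joint graph, not the paper's implementation; the ReLU and random weights are assumptions for the example.

```python
import numpy as np

def normalized_adjacency(A):
    """Renormalized adjacency D~^(-1/2) (A + I) D~^(-1/2) of Kipf and Welling."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                    # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(X, A, W):
    """One layer-wise update relu(L~ X W); X is (V, C_in), W is (C_in, C_out)."""
    return np.maximum(normalized_adjacency(A) @ X @ W, 0.0)

# Toy 3-joint "skeleton": joint 0 connected to joints 1 and 2.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
X = np.random.randn(3, 3)      # V=3 joints, C=3 coordinate channels
W = np.random.randn(3, 64)     # lift 3 channels to 64
H = gcn_layer(X, A, W)
print(H.shape)                 # (3, 64)
```

Stacking such layers, each followed by a temporal convolution, gives the GCN-TCN blocks used throughout the frameworks above.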

C. ADAPTIVE GCN
The skeleton graph used in ST-GCN is just the physical structure of the human body; there are no links between long-distance joints. Shi et al. [20] try to solve this problem with an adaptive adjacency matrix. In ST-GCN, the graph convolution is implemented as

$$f_{out} = \sum_{k=1}^{K_v} W_k (f_{in} A_k) \odot M_k, \quad (8)$$

where $K_v$ is the number of subsets, $A_k$ is the adjacency matrix of the corresponding subset, $W_k$ is a $C_{out} \times C_{in} \times 1 \times 1$ weight tensor, $C$ denotes the number of features of a joint, $M_k$ is an $N \times N$ weight matrix that indicates the importance of each vertex, and $\odot$ is the element-wise product. In (8), $A_k$ is just the skeleton adjacency matrix, which represents the physical structure of the human body. To make the adjacency matrix adaptive, they introduce another two types of adjacency matrices, as shown in

$$f_{out} = \sum_{k=1}^{K_v} W_k f_{in} (A_k + B_k + C_k), \quad (9)$$

where $A_k$ is the same as the one in (8), and $B_k$ is a parameterized adjacency matrix that indicates the existence of connections between arbitrary pairs of joints. Because the values in $B_k$ can be arbitrary, they can also indicate the importance of connections, like $M_k$ in (8). $C_k$ is a data-dependent graph that learns a unique graph for each sample.
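A small NumPy sketch of the adaptive graph convolution in (9). In 2s-AGCN the data-dependent graph $C_k$ comes from learned 1x1-convolution embeddings; here random linear maps stand in for them, and all sizes, seeds and magnitudes are illustrative assumptions, not the paper's values.

```python
import numpy as np

V, C_in, C_out, K = 18, 3, 16, 3   # joints, in/out channels, number of subsets

rng = np.random.default_rng(0)
A = [np.eye(V) for _ in range(K)]                       # stand-in for physical-structure subsets
B = [rng.normal(size=(V, V)) * 0.01 for _ in range(K)]  # learned global graph B_k
W = [rng.normal(size=(C_in, C_out)) for _ in range(K)]  # per-subset weights W_k

def data_dependent_graph(X, d_e=8):
    """C_k of (9): soft adjacency from embedded feature similarity.
    Random linear maps replace the learned embeddings of 2s-AGCN."""
    theta = rng.normal(size=(X.shape[1], d_e))
    phi = rng.normal(size=(X.shape[1], d_e))
    S = (X @ theta) @ (X @ phi).T
    S = np.exp(S - S.max(axis=1, keepdims=True))
    return S / S.sum(axis=1, keepdims=True)             # row-wise softmax

def adaptive_gcn(X):
    """f_out = sum_k W_k f_in (A_k + B_k + C_k), with X of shape (V, C_in)."""
    out = np.zeros((V, C_out))
    for k in range(K):
        G = A[k] + B[k] + data_dependent_graph(X)
        out += G @ X @ W[k]
    return out

X = rng.normal(size=(V, C_in))
print(adaptive_gcn(X).shape)       # (18, 16)
```

Because $B_k$ and the embeddings behind $C_k$ are trained end to end, the effective graph can connect joints that are far apart on the physical skeleton.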

D. LOW-LEVEL FEATURES OF ADAPTIVE GRAPH CONVOLUTIONAL NETWORKS
The pipeline of the proposed work is illustrated in Fig.2. The workflow of our framework is mainly based on [20]. The main difference between our work and [20] is that we take advantage of low-level features. There are similar structures in the blocks of the networks in [20] and [18], but they only utilize local low-level features, like ResNet [22]. Some works [23], [35] have shown that low-level features are helpful in many applications. Kong et al. [36] have shown that a well-trained convolutional neural network can produce well-organized features carrying both abundant semantic and fine-grained information. Although skeleton data have removed a lot of irrelevant information, they still contain multi-view information, different kinds of adjacency and the semantics of an action. According to classic convolutional neural network theory, shallow layers are more effective at capturing subtle, fine features that represent delicate structures, while deep layers extract high-level semantic features. Utilizing both high-level and low-level features is essential for human action recognition. In human action recognition, information reaches a stable state after multi-level propagation, but in some cases the original information is more discriminative. For example, for ''taking off shoes'' and ''putting on shoes'', the final features encode the entire process, yet the main difference between the two actions is the start position. Therefore, we argue that information such as the ''start position'' exists in low-level features. Based on this idea, we propose a multi-level feature extraction framework to fully exploit the complementary information in the skeleton. First, we feed different low-level features into branches before they are passed deeper into the backbone network. Then, we aggregate features from different low-level layers with GCN-TCN blocks of different sizes. More specifically, the feature dimension of a skeleton frame gradually increases from 3 to 64, 128 and 256.
We tap low-level features into branches before each stride structure increases the feature dimension; thus, we preserve low-level features from multiple stages. We utilize a GCN-TCN block in each branch for two purposes: first, to capture low-level spatial and temporal features in the skeleton, as the backbone network does, and second, to enlarge the low-level features to the output dimension of the backbone network, which is why the kernel sizes of the different branches differ. At last, we add all enlarged features directly, as shown in

$$f = f_{out} + \sum_{i=1}^{n_b} \alpha_i f_i,$$

where $f_i$ is the output feature of the $i$th branch, $n_b$ is the number of branches, and $\alpha_i$ is the weight of the corresponding low-level features. $f_{out}$ is the output in (9). The remaining part is a normal fully connected network.
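The fusion step described above can be sketched as a weighted sum of the enlarged branch outputs with the backbone output. The per-branch weights and the channel width below are illustrative; the text does not fix the exact values of the weights.

```python
import numpy as np

def fuse_features(backbone_feat, branch_feats, alphas):
    """Add weighted, already-enlarged low-level branch features to the
    backbone output: f = f_out + sum_i alpha_i * f_i."""
    fused = backbone_feat.copy()
    for a, f in zip(alphas, branch_feats):
        fused += a * f
    return fused

C = 256                                 # all branches are enlarged to the backbone width
backbone = np.random.randn(C)           # f_out from the last backbone block
branches = [np.random.randn(C) for _ in range(3)]   # three GCN-TCN branch outputs
out = fuse_features(backbone, branches, alphas=[0.5, 0.5, 0.5])
print(out.shape)                        # (256,)
```

Since every branch is mapped to the same output dimension, the fusion needs no reshaping and adds negligible cost before the classifier.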

IV. EXPERIMENTS
A. DATASETS
In this section, we evaluate the performance of our method with two large-scale action recognition datasets: Kinetics [25] and NTU-RGB+D [24]. Both of them are benchmark datasets in human action recognition.

1) NTU-RGB+D
This is one of the largest in-house captured action recognition datasets. It contains 56,880 clips in 60 action classes performed by 40 subjects, with multiple modalities such as color, depth and 3D skeletons. This dataset provides both video and skeleton data for human action recognition; in this paper, we use the skeleton data.

2) KINETICS
The DeepMind Kinetics human action dataset has around 300,000 video clips in 400 classes retrieved from YouTube. Yan et al. [18] use the OpenPose [37] toolbox to generate the locations of 18 joints in each frame. We use the skeleton data generated by [18] directly. The dataset is split into a training set with 240,436 clips and a validation set with 19,796 clips. Following the evaluation method in [18], we report the top-1 and top-5 accuracies on the validation set.

B. NETWORK ARCHITECTURE AND TRAINING DETAILS
The backbone network is based on [20]. The whole model is composed of 10 layers of GCN-TCN blocks. The first layer has 3 input channels and 64 output channels, and the next three layers have the same output channels as the first. The stride in layer 5 is set to 2, acting as a pooling layer, and the output channels change to 128; the following two layers also have 128 output channels. The stride in layer 8 is 2, changing the output channels to 256, and the last two layers also have 256 output channels. Before layers 1, 2 and 6, we tap three branches; their input channels are 3, 64 and 128 respectively, and their output channels are all 256. All experiments are conducted with the PyTorch deep learning toolbox on 2 Tesla V100 GPUs. Stochastic gradient descent with Nesterov momentum 0.9 is used for optimization, and the learning rate is set to 0.0001. Cross-entropy is selected as the loss function to backpropagate gradients. For the NTU-RGB+D dataset, the batch size is 32 and we decay the learning rate by 0.1 at epochs 30, 40 and 50. The maximum number of frames in each sample is 300; we pad a video to 300 frames if it has fewer. For the Kinetics dataset, the batch size is 128 and we decay the learning rate at epochs 45, 55 and 65. The input tensor is the same as in [20], with 150 frames and 2 subjects in each frame. Preprocessing is also a critical factor for performance. In the NTU-RGB+D dataset, clips of the first 49 action classes have only one subject and clips of the last 11 action classes have two subjects. The body tracker of Kinect is prone to detecting more than 2 bodies, so we need to filter out the incorrect ones. The preprocessing strategy in [17] is used to process clips with more than two detected subjects. First, if the number of valid frames in a subject's raw skeleton sequence is less than a predefined threshold, we delete the subject. Then, if the spread along the y-axis of a frame is not greater than that along the x-axis, that is, if the body is not taller than it is wide, the frame is considered invalid.
If the percentage of invalid frames is greater than a predefined threshold, we delete the subject too. At last, we sort the remaining subjects according to joint variance and select the two with the lowest variance. For data consistency, if only one subject remains, the other subject is padded with zeros. This preprocessing is denoted as VA. Another preprocessing step is normalization and translation, following [18], [20].
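The VA preprocessing steps above can be sketched as follows. The validity test keeps a frame only when the body is taller than it is wide, subjects failing either threshold are dropped, and the survivors are ranked by joint variance as described in the text. The thresholds and array shapes are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def valid_frame(frame, eps=1e-6):
    """A frame is valid when the joints' y-spread exceeds their x-spread
    (standing body: height > width). frame has shape (V, 3)."""
    joints = frame[np.abs(frame).sum(axis=1) > eps]   # ignore all-zero joints
    if len(joints) == 0:
        return False
    x_spread = joints[:, 0].max() - joints[:, 0].min()
    y_spread = joints[:, 1].max() - joints[:, 1].min()
    return y_spread > x_spread

def va_select(subjects, min_valid=11, max_invalid_ratio=0.5):
    """VA sketch: drop subjects with too few valid frames or too many invalid
    ones, then keep the two subjects with the lowest joint variance,
    zero-padding if fewer than two remain. Thresholds are illustrative."""
    kept = []
    for skel in subjects:                             # skel: (T, V, 3)
        valid = [valid_frame(f) for f in skel]
        if sum(valid) < min_valid:
            continue
        if 1 - sum(valid) / len(valid) > max_invalid_ratio:
            continue
        kept.append((skel.reshape(-1, 3).var(axis=0).sum(), skel))
    kept.sort(key=lambda t: t[0])                     # ascending joint variance
    chosen = [s for _, s in kept[:2]]
    while len(chosen) < 2:                            # pad the missing subject with zeros
        chosen.append(np.zeros_like(subjects[0]))
    return chosen
```

In practice this runs once, offline, over each raw clip before normalization and translation.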

C. COMPARISON WITH THE STATE-OF-THE-ART
Because the 3-dimensional feature in the Kinetics dataset consists of a 2-dimensional position vector and a confidence score, we cannot preprocess its skeleton data with VA. First, we compare our best configuration, denoted VA+LAGCN, with the state of the art on the NTU-RGB+D dataset. We only compare performance using joint data, which is the most basic feature. Results are shown in Table 1. In Table 1, reference [15] is a hand-crafted method; references [24], [30]-[32] and the RNN method in [17] are RNN-based; references [38]-[40] and the CNN method in [17] are CNN-based; references [18]-[20], [41]-[43], [45] and our method are GCN-based. First, we validate the effectiveness of the VA preprocessing: the result denoted VA+2s-AGCN shows an improvement over the existing algorithm. Then, adding low-level features to 2s-AGCN (denoted VA+LAGCN) further improves performance. Our method achieves state-of-the-art performance on the X-Sub evaluation of the NTU-RGB+D dataset.

D. EFFECTIVENESS OF THE LOW-LEVEL FEATURES
To verify the effectiveness of low-level features, we perform a head-to-head comparison with 2s-AGCN on both the Kinetics and NTU-RGB+D datasets; results are shown in Table 2 and Table 3. Integrating multiple features is a common strategy in machine learning: 2s-AGCN calculates bone features from the original skeleton data and integrates joint and bone features to improve performance. We also conduct experiments with joints, bones and both. Using low-level features shows an obvious improvement on X-Sub with joint data alone, consistent with Table 1, which indicates that low-level features are especially effective for extracting joint features. In most cases on the NTU-RGB+D dataset, our method improves accuracy. On the Kinetics dataset in particular, the top-5 accuracies show an obvious improvement, which suggests that low-level features can widen the gap between similar classes. To examine the proposed algorithm further, we generate two confusion matrices for the CS and CV evaluations of the NTU-RGB+D dataset, as shown in Fig.4 and Fig.5. The accuracy of most categories is high, but there are obvious misclassifications among classes 10, 11 and 29, whose corresponding actions are reading, writing and playing with a phone/tablet. Intuitively, the hand motions of these classes are similar. In future work, we should focus on modeling fine hand motions.

V. CONCLUSION
In this work, we review the development of human action recognition based on skeleton data. After analyzing the network structures of previous GCN-based methods, we propose a novel low-level adaptive graph convolutional network (LAGCN) for skeleton-based action recognition. It constructs multiple branches from a global view to improve classification performance; these branches extract the network's low-level features. The final network is evaluated on two large-scale action recognition datasets, NTU-RGB+D and Kinetics, and achieves state-of-the-art performance on both.