Rank-GCN for Robust Action Recognition

We present Rank-GCN, a robust skeleton-based action recognition method built on a graph convolutional network (GCN) with a new adjacency matrix. The biggest change from previous approaches is how the adjacency matrix is generated to accumulate features from neighboring nodes, by redefining “adjacency.” The new adjacency matrix, which we call the rank adjacency matrix, is generated by ranking all the nodes according to metrics such as the Euclidean distance from the node of interest, whereas previous GCN methods used only 1-hop neighboring nodes to construct adjacency. By adopting the rank adjacency matrix, we achieve not only performance improvements but also robustness against swapping, location shifting, and dropping of certain nodes. The fact that the human-designed rank adjacency matrix outperforms the deep-learning-based matrix implies that some parts of these models still benefit from a human touch. We expect Rank-GCN to yield performance improvements especially when the predicted human joints are inaccurate and unstable.


I. INTRODUCTION
Action recognition has become an important task in computer vision and artificial intelligence because it is widely used in various applications, such as human-computer interaction, gaming, video surveillance, and video understanding. As the spread of infectious diseases such as COVID-19 increases the amount of time spent at home, action recognition is increasingly required in at-home workout training systems. In addition, its application scope is expanding to behavior recognition for companion animals.
According to the types of input data utilized, action recognition methods are roughly categorized into image-based, skeleton-based, and hybrid approaches. In image-based approaches, optical flows, which refer to the point correspondences across pairs of images, have been commonly used to represent the apparent motions of subjects of interest [1]. However, these methods often require time-consuming and storage-demanding subprocesses. Additionally, the performance of image-based methods can be affected by optical noise such as illumination changes. Even if these issues are mitigated, image-based approaches are not free from personally identifiable information (PII) issues. In real situations, such as hospital services for elderly patients, these approaches are limited. (The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.)
In this context, the advantages of skeleton-based approaches are clear. Just as optical flows are extracted in image-based approaches, skeleton-based approaches extract skeletons, which refer to sets of connected coordinates describing the poses of the subject of interest, from videos. However, this type of method is relatively lightweight, because its representations are compact and privacy-free. The prevalence of cost-effective depth sensors such as Microsoft Kinect [2] and decent pose estimators such as OpenPose [3] has made it easier to obtain skeleton data for the action recognition methods in this paper.
Fig. 1. The picture on the left illustrates a person sharing a candy with one arm, and the picture on the right shows a person stretching his arms in front of a desk. In both cases, the dotted circles and lines are inevitably shifted and show inaccurate skeleton information. Our method achieves increased robustness by focusing on where actions truly occur (yellow circles and lines) in a dynamic and physically meaningful way.
In early-stage skeleton-based action recognition studies, pseudo images were generated from skeleton sequences, or heatmaps were obtained from pose prediction models, e.g., convolutional neural networks (CNNs); these approaches are similar to image-based methods. However, creating an intermediate form of data such as pseudo images conflicts with the compact use of skeleton data and hinders the training of deeper neural networks on low-end computers. Therefore, graph convolutional networks (GCNs), which generalize CNNs to more generic graph structures, have been adopted for skeleton-based action recognition.
GCNs learn the feature at a vertex (i.e., a skeleton joint) by aggregating the features over neighboring vertices on top of an irregular graph, constructed with 2D or 3D joint coordinates as nodes and their connections (i.e., bones) as edges, with respect to both the spatial and temporal dimensions of the input data. Various methods can be distinguished by their aggregation strategies. For simplicity, physical connectivity between body joints has been used, but an ideal feature aggregation strategy should look beyond the local neighborhoods and reflect the long-range dependencies between nodes that are strongly correlated even if they are structurally far apart. Hence, explicit methods such as [4] predefine the neighboring vertices heuristically, and implicit methods such as [5] learn adjacency information from data.
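As background, the neighborhood aggregation described above can be sketched in a few lines. The following is a minimal NumPy illustration of a single spatial graph convolution step with symmetric normalization, in the style of the original GCN [18]; the function and variable names are our own, not the authors' code.

```python
import numpy as np

def spatial_graph_conv(X, A, W):
    """One spatial graph convolution step: aggregate 1-hop neighbor
    features with a symmetrically normalized adjacency matrix.

    X: (V, C_in) node features, A: (V, V) adjacency with self-loops,
    W: (C_in, C_out) learnable weights.
    """
    deg = A.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt      # normalized adjacency
    return A_norm @ X @ W                     # aggregate, then project

# Toy 3-joint chain (joints 0-1-2 connected like bones), self-loops included.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
X = np.eye(3)  # one-hot node features
W = np.eye(3)  # identity projection for illustration
out = spatial_graph_conv(X, A, W)
print(out.shape)  # (3, 3)
```

Note that joint 0 receives no signal from joint 2: with physical 1-hop adjacency, structurally distant joints never mix in a single layer, which is exactly the limitation discussed above.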
However, even if global neighbors are used, the adjacency information is usually fixed over the temporal dimension of the input video, and skeleton-based methods are sensitive to noise in the joint coordinates, just as image-based methods are sensitive to optical noise. Fig. 1 shows two example situations in which inaccurate skeleton information can affect action recognition performance. In this paper, we introduce an effective yet robust framework, Rank-GCN, which calculates an adjacency matrix dynamically along the temporal dimension. The main contributions of this work are as follows.
• We propose a novel method that uses global information in both the spatial and temporal dimensions. Compared to previous methods that use learnable parameters for dynamic adjacency matrix generation, our method has fewer parameters, is easier to implement, and produces more interpretable results. Human-designed methods have been regarded as weaker than deep-learning-based methods, but our approach not only outperforms previous methods but also remains interpretable.
• We address the issue of calculating adjacency matrices by using a geometric distance measure and introducing the rank graph convolution algorithm. We use distance rankings instead of using a distance threshold directly as in [6]. By using ranks to determine adjacent groups of joints, neighboring nodes are better utilized, yielding better performance and robustness in activity recognition than the state-of-the-art methods.
• We verify our method via thorough experiments that test not only accuracy but also robustness. The experiments include test cases with the multi-stream ensembles that are frequently used to achieve high algorithmic performance.
The paper is organized as follows. Section II presents related work, and we elaborate on the proposed method in Section III. Section IV presents action recognition experiments focusing especially on robustness, and we conclude in Section V.

II. RELATED WORK
A. CNN-BASED SKELETON ACTION RECOGNITION
In early skeleton-based action recognition studies, traditional CNN models were mostly adopted. To utilize CNN models, some researchers generated pseudo images [7], [8], [9], [10] by preprocessing a sequence of skeletons into three-channel images. In [7], the authors built color maps of joint trajectories from three different views (front, top, and side) and then fused the prediction scores of these three views. The body pose evolution image (BPI) and body shape evolution image (BSI) approaches were used in [8] by applying rank pooling [11] along the temporal axis of joints and by concatenating the normalized coordinates of 3D joints, respectively.
The weakness of pseudo-image-based action recognition with skeletons is that convolutional operations are only applied to neighboring joints as they are represented on an image grid. That is, although many plausible combinations of joints should be jointly considered, only three joints are considered by a convolution with a kernel size of three. To resolve this problem, the BSI method places duplicated joints while traversing along the human body. On the other hand, HCN [12] was formed as a modified version of VGG-19 [13] by implementing additional layers that swap the joint and channel axes (from T ×V ×C to T ×C ×V ). These swapping layers lead to significant performance improvements without additional costs, showing that non-local operations performed over a wide range of neighboring joints are important for action recognition.
Fig. 2. The top figure shows the whole model structure. As in previous GCN-based models, we use a similar configuration of ten blocks of interleaving spatial graph convolutional layers and temporal 1D convolutional layers with three channel stages. We use a Rank-GCN layer for spatial convolution; its modules are illustrated in the figures at the bottom. The bottom-left figure shows how we generate the rank adjacency matrices used by the Rank-GCN layer, and the bottom-right figure illustrates how we use a given rank adjacency matrix to aggregate vertices and apply an attention mask that is shared across frames.
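The HCN axis swap mentioned above is a simple tensor transpose; the following toy NumPy sketch (our own illustration, not the HCN implementation) shows why it matters: after the swap, a convolution sliding over the last axis mixes information across all V joints rather than only grid neighbors.

```python
import numpy as np

# Toy skeleton feature tensor: T frames, V joints, C channels.
T, V, C = 4, 25, 3
x = np.random.rand(T, V, C)

# HCN-style axis swap: move joints to the last axis so that a
# subsequent convolution over that axis sees ALL joints as its
# "spatial" dimension instead of only adjacent grid positions.
x_swapped = np.transpose(x, (0, 2, 1))  # (T, V, C) -> (T, C, V)

print(x_swapped.shape)  # (4, 3, 25)
```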
A heatmap-based 3D CNN action recognition model, PoseC3D [14], was also introduced. PoseC3D is a variant of 3D CNN [15], [16], [17] models that uses 3D or (2+1)D convolutional layers to extract spatiotemporal features. A heatmap of each joint, generated from the 2D skeleton inputs, is also used in PoseC3D. PoseC3D resolves the locality issue by stacking deep blocks of 3D layers to extract spatiotemporal features. However, because PoseC3D is deep, it incurs a higher computational cost than GCN-based models. The authors of PoseC3D also showed that their model is more robust against joint detection failures than GCN-based models.

B. GCNs FOR SKELETON ACTION RECOGNITION
From the above observations regarding the variants of CNN-based methods, we can infer that action recognition with GCNs performs better than traditional CNN-based methods by using the concept of “adjacency of neighboring joints.”
The original GCN [18], initially developed for the semi-supervised classification of author-citation networks, was modified into a spatiotemporal GCN (ST-GCN) [4] and used for action recognition for the first time. After the ST-GCN was proposed, many other similar methods, such as [5], [19], [20], [21], were explored. The ST-GCN, an extension of the GCN [18], was developed with a subset partitioning method that divides neighboring joints into groups in the spatial domain, and a 1D convolutional layer is used to capture the dynamics of each joint in the temporal domain. In the adaptive GCN (AGCN) [19] and the adjacency-aware GCN (AAGCN) [5], learnable adjacency matrices are produced by applying the outer product of the intermediate features and combining them.
In MS-G3D [20], the authors extended the adjacency matrix with an additional temporal dimension to capture more comprehensive ranges of spatiotemporal nodes than spatial relations alone.
In Efficient-GCN [25], to use fewer parameters, the authors embedded separable convolutional layers [26], [27] and adopted an early fusion method for the input data streams. In particular, by adopting early fusion, they dramatically decreased the number of model parameters required for the multistream ensemble.
Unlike other GCN models, Distance-GCN [6] builds a new adjacency matrix based on the Euclidean distances between joints. The authors showed that utilizing the pairwise distances between joints yields better action recognition performance than using simple, physically connected adjacency.
From previous research on GCNs for action recognition, we can say that the design of the adjacency matrix has a critical effect on performance. Our work can be understood as an improved version of Distance-GCN that adopts actual metrics to partition neighboring joints: we sort neighboring joints according to their distances from a joint of interest.
VOLUME 10, 2022

III. PROPOSED METHOD
Similar to many other GCN-based action recognition models [5], [19], [21], which follow the basic configuration of ten blocks of spatial and temporal layers (starting from the ST-GCN [4]), Rank-GCN has a similar structure, as shown in the top figure of Fig. 2. Given a graph represented by an adjacency matrix A, which could be a predefined fixed graph, possibly modified with an attention mechanism or constructed experimentally, multiple blocks of a spatial graph convolutional layer and a 1D temporal convolutional layer are applied to the input data to extract high-dimensional spatiotemporal features. The input data size is P×T×V×C, where P is the number of people in a sequence, T is the number of frames, V is the number of joints, and C is the dimension of the 2D or 3D coordinates.
Then, global average pooling (GAP) is applied over the penultimate features to merge all the people and their vertices in the input frames. In the last stage, the softmax is applied for the classification task. The spatial graph convolution operation is formulated as follows, where $v$, $A$, $N$, and $FC(\cdot)$ represent a vertex, the adjacency matrix, the neighboring node set, and a fully connected layer, respectively:
$$f_{out}(v_i) = \sum_{v_j \in N(v_i)} A_{ij}\, FC(f_{in}(v_j)).$$
Our Rank-GCN model also comprises ten interleaving rank graph convolutional layers for spatial features and 1D convolutional layers for temporal features. In addition to spatial features, depending on the input streams and the adjacency matrix, our rank graph convolutional layer also extracts spatiotemporal features to obtain more complex representations of body gestures. In the following subsections, we introduce the rank graph convolutional layer modules, which are the main modules in Rank-GCN for graph-based action recognition, as shown in the bottom figures of Fig. 2.
Fig. 3 illustrates our major contribution, the concept of “rank,” by comparing how previous models defined adjacent neighbors. Comparing (a) with (b) and (c), the coverage of local neighbors is very limited in (a), since this method only considers 1-hop neighbors that are physically connected. On the other hand, wide ranges of neighbors are covered in (b) and (c). Comparing (b) with (c), the green solid and green dotted circles in (b) show two possible ranges for the green subset. Two out of three joints will be excluded if the green subset has a slightly smaller range learned in the training process, such as the dotted circle, and this could affect performance, since the number of elements (joints) changes. Our method, shown in (c), adopts a ranking strategy: it is not affected by slight distance changes of the neighboring nodes and keeps a stable number of elements in each subset.

A. RANK ADJACENCY MATRIX
To capture the action dynamics more effectively, we propose a new adjacency matrix generation method: the rank adjacency matrix. Given a frame at time $t$ and a center node $v_t^i$ of interest, we calculate the distance between nodes with a metric function $M$ that outputs scalar values. The distance matrix $D_t^i$ is obtained by iterating over all nodes in the input frame as
$$D_t^i(j) = M(v_t^i, v_t^j), \quad j = 1, \dots, V,$$
where $v_t^i$ can be the coordinate, speed, or acceleration of a vertex.
Based on the distance matrix, the rank matrix $A_t^i \in \mathbb{Z}^{R \times V \times 1}$ is built by ranking the distances and filtering them with the rank ranges $\{\gamma_r = (s_r, e_r) \mid r = 1, \dots, R\}$, where $s_r$ and $e_r$ are the start and end of the range, respectively, and $R$ is the number of rank ranges. Hence, the rank matrix at a frame $t$, for a vertex $i$ and a rank $r$, is
$$A_t^i(r, j) = \begin{cases} 1 & \text{if } s_r \le \operatorname{rank}(D_t^i(j)) \le e_r, \\ 0 & \text{otherwise.} \end{cases}$$
Given skeleton inputs, we denote the frames of the skeletons as $S = \{S_t \in \mathbb{R}^{V \times C} \mid t = 1, \dots, T\}$. The rank graph convolutional layer works as shown in Algorithm 1.
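The rank adjacency construction for a single frame can be sketched as follows. This is a minimal NumPy illustration using Euclidean distance as the metric function; the function name and the toy rank ranges are our own choices for illustration.

```python
import numpy as np

def rank_adjacency(joints, ranges):
    """Build a rank adjacency matrix for one frame.

    joints: (V, C) joint coordinates for a single frame.
    ranges: list of (start, end) rank intervals (inclusive, 0-based).
    Returns a binary array of shape (R, V, V): entry (r, i, j) is 1
    if node j falls inside rank range r when all nodes are sorted by
    Euclidean distance from node i.
    """
    V = joints.shape[0]
    R = len(ranges)
    # Pairwise Euclidean distances D[i, j] = ||v_i - v_j||.
    diff = joints[:, None, :] - joints[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    # rank[i, j] = position of node j when the distances from node i
    # are sorted ascending (rank 0 is the node itself).
    rank = np.argsort(np.argsort(D, axis=1), axis=1)
    A = np.zeros((R, V, V), dtype=np.int64)
    for r, (s, e) in enumerate(ranges):
        A[r] = (rank >= s) & (rank <= e)
    return A

# Four collinear joints; three rank subsets: self, nearest, the rest.
joints = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
A = rank_adjacency(joints, [(0, 0), (1, 1), (2, 3)])
print(A.shape)  # (3, 4, 4)
```

Note that every row of each subset contains exactly $e_r - s_r + 1$ entries, which is the stable-subset-size property that distinguishes ranking from direct distance thresholding.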

B. NODE INDEXING
A conventional GCN for action recognition aggregates joint features with fixed rules, so we can predict which nodes will be aggregated at a given point. However, because our aggregation is performed dynamically, we do not have a fixed set of aggregated joints. To address this problem, we append an embedding of one-hot vertex indices along the feature channel axis.
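Concretely, the one-hot index embedding amounts to concatenating an identity matrix along the channel axis, so each joint carries its own identity through dynamic aggregation. A minimal sketch (our own illustration of the idea, with toy dimensions):

```python
import numpy as np

# Append a one-hot node-index embedding along the channel axis so the
# network can still tell joints apart after dynamic aggregation.
T, V, C = 2, 5, 3
features = np.random.rand(T, V, C)

one_hot = np.eye(V)                                      # (V, V) index embedding
one_hot = np.broadcast_to(one_hot, (T, V, V))            # shared over frames
indexed = np.concatenate([features, one_hot], axis=-1)   # (T, V, C + V)

print(indexed.shape)  # (2, 5, 8)
```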

C. RANK-WISE ATTENTION MASK
Motivated by the many performance improvements yielded by attention mechanisms [5], [28], [29], [30], [31], [32], we also devise an attention module. While previous studies applied attention mechanisms to the adjacency matrices or the aggregated features, we apply attention at the pre-aggregation stage, which makes it possible to learn the optimal mask for each rank subset. To keep the attention module as light as possible, we adopt a simple static multiplication mask, denoted M in Alg. 1.

Algorithm 1 Algorithm for Rank-GCN layer
Input: joints of a skeleton sequence $S \in \mathbb{R}^{T_1 \times V \times C_1}$; features from the previous layer.
Our mask module is a rank-, vertex-, and channel-wise multiplication mask, which yields consistent performance improvements with only small increases in the computational complexity and the number of weight parameters.

D. RANK-GCN LAYER
By joining the aforementioned modules together, we propose a Rank-GCN layer that performs aggregation differently for every frame depending on the input skeleton data. The complete process of the Rank-GCN layer is shown in Alg. 1.
Here, we can utilize various kinds of rank matrices A by changing the metric function M. In the experiments section, we verify that the resulting models are mutually complementary by combining different metric functions with different input streams. Further implementation details regarding the metric functions are given in the following sections.

IV. EXPERIMENTS AND ANALYSIS
A. DATASETS
We used the following three datasets to verify the superiority of our algorithm. The datasets and experimental protocols are summarized in Table 1.

1) NTU RGB+D 60 AND NTU RGB+D 120
NTU RGB+D 60 [33] is a dataset with four different modalities: RGB, depth map, infrared, and 3D skeleton data. It contains 56,880 samples with 40 subjects, three camera views, and 60 actions. Two official benchmark training-testing splits are used: (1) cross-subject (CS) and (2) cross-view (CV). The dataset also includes 2D skeletons projected from 3D onto the RGB, depth-map, and infrared modalities. The 3D skeleton data are represented in meters. All the modalities are captured by the Kinect v2 sensor, and the 3D skeletons are inferred from the depth maps [2]. Due to the limitations of depth maps and ToF sensors, some samples have considerable noise in their skeleton coordinates. The samples are at most 300 frames long, and the number of people in a view is at most four. We choose the two most active people and 300 frames. Samples with fewer than 300 frames or fewer than two people are preprocessed with the method used in the AAGCN [5].
NTU RGB+D 120 [34] is an incremental extension of NTU RGB+D 60 [33] with 60 new action classes added. It uses two official benchmark training-testing splits: (1) cross-set (XSet) and (2) cross-subject (XSub). The same preprocessing method is applied as for NTU RGB+D 60.

2) SKELETICS-152
Skeletics-152 [1] is a skeleton action dataset extracted from the Kinetics-700 [35] dataset with the VIBE [36] pose estimator. Because Kinetics-700 contains some activities without people and some that can only be classified from the context of what humans interact with, 152 of the 700 total classes were chosen to build Skeletics-152. Due to the accurate pose estimation of the VIBE pose estimator, Skeletics-152 has much less noise than the NTU-60 skeletons. The number of people appearing in the samples ranges from 1 to 10, with an average of 2.97 and a standard deviation of 2.8. The lengths of the samples range from 25 to 300 frames, with an average of 237.8 and a standard deviation of 74.72. We choose at most two people per sample for all experiments performed in this paper. While NTU-60 contains joint coordinates in meters, Skeletics-152 has normalized values in the range [0, 1]. Samples with fewer than 300 frames or fewer than three people are padded with zeros, and no further preprocessing is performed for training and testing.

B. IMPLEMENTATION DETAILS
All the experiments are performed using PyTorch with four RTX 2080 Ti GPUs. Training is conducted for a maximum of 300 epochs. The learning rate starts from $10^{-1}$ and is divided by 10 at epochs 60 and 120. $L_2$ regularization is applied with a weight of $10^{-4}$. The following combinations of input streams and rank adjacency matrices share the same model architecture, as shown in Table 2.
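The step schedule above amounts to the following simple rule (equivalent to PyTorch's `MultiStepLR` with milestones `[60, 120]` and `gamma=0.1`); the helper name is ours, for illustration only:

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120), gamma=0.1):
    """Step schedule: start at 1e-1 and divide by 10 at epochs 60 and 120."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate at the start, after the first drop, and after the second.
print(learning_rate(0), learning_rate(60), learning_rate(120))
```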
For NTU-60 and NTU-120, there is no class imbalance problem; the authors of these datasets controlled the number of samples in each class while capturing them. In Skeletics-152, classes are approximately balanced, with between 800 and 900 instances each. We did not address class imbalance in our experiments because it is not dominant in these datasets.

1) INPUT STREAMS
We follow the same method used in [19] for augmenting the skeleton coordinates into four different input streams, as shown in Fig. 4. The joint stream contains the raw coordinates. The bone stream consists of the vectors between two adjacent joints. The temporal (motion) counterpart of each stream contains the frame-to-frame differences of the corresponding values.
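The four streams can be derived from raw joint coordinates as sketched below. This follows the standard joint/bone/motion construction from [19] as we understand it; the `(child, parent)` bone list here is a hypothetical toy skeleton, not the NTU-60 topology.

```python
import numpy as np

def make_streams(joints, bones):
    """Derive the four input streams from raw coordinates.

    joints: (T, V, C) coordinates; bones: list of (child, parent)
    joint-index pairs. Returns (joint, bone, joint-motion, bone-motion).
    """
    bone = np.zeros_like(joints)
    for child, parent in bones:
        bone[:, child] = joints[:, child] - joints[:, parent]
    joint_motion = np.zeros_like(joints)
    joint_motion[1:] = joints[1:] - joints[:-1]   # frame-to-frame difference
    bone_motion = np.zeros_like(bone)
    bone_motion[1:] = bone[1:] - bone[:-1]
    return joints, bone, joint_motion, bone_motion

# Toy chain skeleton: 3 frames, 4 joints, 2D coordinates.
T, V, C = 3, 4, 2
joints = np.arange(T * V * C, dtype=float).reshape(T, V, C)
streams = make_streams(joints, [(1, 0), (2, 1), (3, 2)])
print([s.shape for s in streams])
```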

2) RANK ADJACENCY MATRICES
We devised four different metric functions to make four types of rank adjacency matrices, as shown in Fig. 5. From left to right in Fig. 5, the raw skeleton coordinates, the four streams extracted from the coordinates, and the rank adjacency matrices derived from the streams are shown. For the rank adjacency matrices, we use positions, speeds, and accelerations: positions are the distances between intraframe and interframe joints, speeds are the first-order derivatives of the distances along the temporal axis, and accelerations are the second-order derivatives of the distances along the temporal axis. These values are calculated, sorted, and divided into three subsets according to their sorted indices. We use a temporal window of size 3, which includes 75 spatiotemporal joints in total.
Fig. 5. Four types of rank adjacency matrices are generated from the input coordinates. From top to bottom, the spatial coordinates, first-order temporal differences (speeds), second-order temporal differences, and spatiotemporal coordinates are illustrated. The shape of the input is (T, V, C), and the shape of each matrix is (R, T, V) or (R, T, WV), where W is the size of the temporal window.
One training case out of the sixteen combinations of rank adjacency matrices and input streams is illustrated in Fig. 6 and Fig. 7, which show the changes in accuracy and loss, respectively. From these, we can see that for NTU-120, the extended version of NTU-60 with 60 additional classes, the training process is more stable, with less fluctuation.

C. ABLATION STUDY
1) SUBMODULES
We show the effectiveness of two different factors and their influence on two different adjacency matrix generation methods, the physical connectivity method and the rank connectivity method, as shown in Table 3. By comparing the two rows, we can verify that adopting rank connectivity is always better than or equal to physical connectivity in accuracy. Additionally, node indexing and pre-aggregation attention each boost the performance of both adjacency methods. Finally, combining node indexing and pre-aggregation attention yields the best result. It is worth noting that while the rank method obtains its best result when node indexing and the attention mask are both applied, the physical connectivity method does not.

2) STREAMS AND ADJACENCY MATRICES
To check whether ensembles of streams and adjacency matrices improve performance, we experimented thoroughly, as shown in Tables 4 and 5. For the experiments on adjacency matrices and input streams, three out of five cases and four out of five cases, respectively, show the best results when all combinations are ensembled.

D. PERFORMANCE COMPARISONS
To further improve the performance, we ensemble the sixteen combinations of four input streams and four adjacency matrices in all the experiments involving Rank-GCN. Table 6 shows a comparison with the state-of-the-art methods. Overall, under five different settings with three different datasets, Rank-GCN produces competitive results. While Efficient-GCN obtains the best results on NTU60-CS and NTU120-XSub, our Rank-GCN shows the best results on the other datasets. For the Skeletics-152 dataset, we partly re-ran the experiments with our preprocessing method and obtained better results.
From the results, we believe that the appropriateness of the constructed adjacency matrix is important for human action recognition. In particular, based on the Skeletics-152 results relative to those obtained on the other datasets, we infer that although our model is robust against noise, Rank-GCN benefits more from precisely predicted skeletal joints than from the error-prone joints predicted by Kinect v2.

E. ROBUSTNESS COMPARISONS
To demonstrate the robustness of our method, we organize three different experimental settings, as shown in Fig. 8. We experiment on the CS split of the NTU RGB+D 60 dataset. Although pose estimation has improved, misalignment among the inferred joints may still occur, and Kinect v2, the capture device for the following experiments, has frequent jittering issues. For these experiments, we use only the joint stream, not the ensemble of streams. For comparison, we choose MS-G3D [20] as a baseline that outperforms our model and the AAGCN [5] as one that underperforms it; we include the AAGCN because we suspect that stronger models may tend to be vulnerable to various types of errors.

1) RANDOM TRANSLATION
For this experiment, we translate all the joints by randomly directed vectors. For every joint in every frame, a translation vector whose length is uniformly sampled in the range [0, l] is applied, where l is the maximum length among the translation vectors. Fig. 9 and Fig. 10 show the results of the experiment.
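The perturbation described above can be sketched as follows; this is our own NumPy illustration of the protocol (a uniformly sampled length in [0, l] with a random direction per joint per frame), not the authors' test harness.

```python
import numpy as np

def random_translate(joints, l, rng=None):
    """Perturb every joint in every frame by a translation vector with
    a uniformly sampled length in [0, l] and a random direction.

    joints: (T, V, C) coordinates; l: maximum translation length.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, V, C = joints.shape
    direction = rng.standard_normal((T, V, C))
    direction /= np.linalg.norm(direction, axis=-1, keepdims=True)
    length = rng.uniform(0.0, l, size=(T, V, 1))
    return joints + direction * length

rng = np.random.default_rng(0)
x = np.zeros((2, 25, 3))          # 2 frames, 25 joints, 3D coordinates
y = random_translate(x, l=0.1, rng=rng)
print(y.shape)  # (2, 25, 3)
```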

2) RANDOM DROPPING OF VERTICES
This test assumes that the inference of a subset of joints fails due to occlusion or pose prediction system failure. We choose d joints among the 25 total joints and set each chosen joint to (0, 0, 0) with a probability of 0.5. As shown in Figs. 11 and 12, we experiment with d = 0, 1, 2, 3, 4, 5.
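A sketch of this dropping protocol is given below. Whether the zeroing is applied per frame or once per sequence is not stated above; this illustration assumes once per sequence, and the function name is ours.

```python
import numpy as np

def random_drop(joints, d, p=0.5, rng=None):
    """Simulate joint-detection failure: pick d of the V joints and
    zero each chosen joint's coordinates with probability p.

    joints: (T, V, C). Assumption: zeroing is applied to a chosen
    joint across all frames of the sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = joints.copy()
    V = joints.shape[1]
    chosen = rng.choice(V, size=d, replace=False)
    for j in chosen:
        if rng.random() < p:
            out[:, j, :] = 0.0
    return out

rng = np.random.default_rng(0)
x = np.ones((2, 25, 3))
y = random_drop(x, d=5, rng=rng)
print(y.shape)  # (2, 25, 3)
```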
We verify that our model surpasses the other models when random joints are dropped, and the results suggest that our model is more robust in harsh environments, where the action recognition model has no access to a subset of the joints.

3) RANDOM SWAP OF VERTICES
Here, we permute all joints in a random order. We change the length of the permuted subsequence l from 0 to 300 frames, with a random start point for each test instance. The results in Fig. 13 show that, unlike in the other two robustness tests, the performance of the AAGCN drops rapidly. This implies that AAGCN's instance-wise adjacency matrix generation produces harmful consequences, while our model's dynamic rank adjacency matrix approach handles permuted joints very well. Interestingly, the performance drops exhibited by the two other models are aligned. Fig. 14 shows that Efficient-GCN is better than Rank-GCN in the multistream cases. We assume that the architecture of Efficient-GCN is more appropriate for this experiment than our Rank-GCN method due to Efficient-GCN's preprocessing strategies.

F. DISCUSSION
The aforementioned experiments showed the superior performance of our method relative to the state of the art, the detailed impact of each module (e.g., node indexing and the attention mask), and the robustness of our method. Despite its strong performance gain over existing models, our model has pros and cons, as shown in Table 7. Both Distance-GCN and our model are based on a distance metric, but our model adds a ranking operation over the pairwise distances between vertices, which costs $O(n^2)$. Regarding graph interpretability, the authors of MS-G3D revealed the relationships between vertices by showing the learned weights of the adjacency matrix, and Distance-GCN used the distance metric directly for learning, whereas our model interprets distances in the more understandable form of rankings. Compared to MS-G3D, our model utilizes a dynamic graph along the time axis. In addition, by using four types of the more interpretable rank adjacency matrix, our model showed better performance than Distance-GCN.

V. CONCLUSION AND FUTURE WORK
In this paper, we introduced a new way to apply the GCN architecture to action recognition: Rank-GCN. In Rank-GCN, we define a rank adjacency graph based on the pairwise distance between vertices and accumulate vertex features according to the closest and farthest ranks. By adopting this method, we are able to attain not only state-of-the-art performance but also robustness in more practical scenarios.
In future work, we will extend the rank graph convolutional layer of our Rank-GCN from fixed rank ranges to learnable ranges and try to optimize the computationally expensive operations in the network. In addition to the technical perspectives, we expect this rank adjacency based method to be adopted for new application domains such as object recognition or hand gesture recognition.