Fusing Higher-order Features in Graph Neural Networks for Skeleton-based Action Recognition

Skeleton sequences are lightweight and compact, and thus are ideal candidates for action recognition on edge devices. Recent skeleton-based action recognition methods extract features from 3D joint coordinates as spatial-temporal cues, using these representations in a graph neural network for feature fusion to boost recognition performance. The use of first- and second-order features, i.e., joint and bone representations, has led to high accuracy. Nonetheless, many models are still confused by actions that have similar motion trajectories. To address this issue, we propose fusing higher-order features in the form of angular encoding into modern architectures to robustly capture the relationships between joints and body parts. This simple fusion with popular spatial-temporal graph neural networks achieves new state-of-the-art accuracy on two large benchmarks, NTU60 and NTU120, while employing fewer parameters and reduced run time. Our source code is publicly available at: https://github.com/ZhenyueQin/Angular-Skeleton-Encoding.


I. INTRODUCTION
Skeleton-based action recognition is robust to background information and easy to process, and has attracted increasing attention in the community [26]. Recently, deep graph neural networks have fueled a surge in accuracy for skeleton-based action recognition [41]. By leveraging graph neural networks, action recognizers extract the topological information within skeleton sequences more thoroughly.
To make graph neural networks applicable to skeleton-based action recognition, skeletons are treated as graphs, with each vertex representing a body joint and each edge a bone. Initially, only first-order features were employed, representing the coordinates of the joints [41]. Subsequently, [27] introduced a second-order feature: each bone is expressed as the vector difference between one joint's coordinate and that of its nearest neighbor in the direction of the body center. Their experiments show that these second-order features improve the recognition accuracy of skeleton-based action recognizers.
However, existing methods perform poorly at discriminating actions with similar motion trajectories (see Figure 1). Since the joint coordinates in each frame are similar in these actions, it is challenging to identify the cause of the small differences between coordinates: they can be due to various body sizes, motion speeds, or actually performing different actions. To robustly capture the relative movements between body parts while maintaining invariance to the different body sizes of human subjects, we propose in this paper the use of higher-order representations in the form of angles. We refer to the newly proposed feature as angular encoding, which can be applied to both the static and velocity domains of human body joints. The proposed encoding thus allows the model to recognize actions more precisely. Experimental results reveal that by fusing angular information into existing modern action recognition architectures, such as the Spatio-Temporal Graph Convolutional Network (ST-GCN) [41] and Decoupling GCN [4], confusing action sequences can be classified more accurately, especially when the actions have very similar motion trajectories.
It is worth considering whether a neural network could be designed to learn angular features implicitly. However, such a design would be challenging for current graph convolutional networks (GCNs) [37], [31], mainly for two reasons. (a) Conflict between more layers and higher performance of GCNs: GCNs are currently the best-performing models for classifying skeleton-based actions. To model the relationships among all the joints, a graph network requires many layers. However, recent work implies that the performance of a GCN can be compromised when it goes deeper due to over-smoothing problems [22]. (b) Limitation of adjacency matrices: recent graph networks for action recognition learn the relationships among nodes via an adjacency matrix, which only captures pairwise relevance, whereas angles are third-order relationships involving three related joints.
We summarize our contributions as follows:
1) We propose a rich collection of higher-order representations in the form of the angular encoding, defined in both the static and velocity domains. The encoding captures relative motion between body parts while maintaining invariance to different human body sizes.
2) The angular features can be easily fused into existing action recognition architectures to further boost performance. Our experiments show that angular features provide information complementary to the existing features, i.e., the joint and bone representations.
3) We are the first to incorporate multiple categories of angular features into modern spatial-temporal GCNs and achieve state-of-the-art results on several benchmarks, including NTU60 and NTU120. Meanwhile, a simple model (employing fewer training parameters and requiring less inference time) becomes powerful once it is equipped with the proposed angular encoding. Thus, the proposed angular encoding supports real-time action recognition on edge devices.

II. RELATED WORK
Many of the earliest attempts at skeleton-based action recognition encoded all human body joint coordinates in each frame into a feature vector for pattern learning [33], [34]. These models rarely explored the internal dependencies between body joints, and therefore missed rich information about actions. Kernel-based methods have also been proposed for action recognition [9], [10].
Later, as deep learning became a standard choice in video processing [18], [1] and understanding [13], [12], methods based on RGB videos started to tackle action recognition. However, they suffer from problems in domain adaptation [44], [7], [46], since the videos have varying backgrounds and subjects with different textures. Skeleton data, on the other hand, has relatively fewer issues with domain adaptation. Convolutional neural networks (CNNs) were introduced to tackle skeleton-based action recognition and achieved an improvement [35]. However, CNNs are designed for grid-based data and are not suitable for graph data since they cannot leverage the topology of a graph.
Recently, deep graph neural networks have been attracting attention [21], [42], [16], [36], including in skeleton-based recognition. In GCN-based models, a skeleton is treated as a graph, with joints as nodes and bones as edges. An early application was ST-GCN [41], which uses graph convolution to aggregate joint features spatially and convolves consecutive frames along the temporal axis. Subsequently, AS-GCN [15] was proposed to further improve the spatial feature aggregation via a learnable adjacency matrix instead of using the skeleton as a fixed graph. AGC-LSTM [30] learned long-range temporal dependencies, using an LSTM as a backbone, and changed every gate operation from the original fully connected layer to a graph convolution layer, making better use of the skeleton's topological information. 2s-AGCN [27] made two major contributions: (a) applying a learnable residual mask to the adjacency matrix of the graph convolution, making the skeleton's topology more flexible; (b) proposing a second-order feature, the difference between the coordinates of two adjacent joints, to act as the bone information. An ensemble of two models, trained with the joint and bone features, substantially improved the classification accuracy. More graph convolution techniques have been proposed for skeleton-based action recognition, such as SGN [43] and Shift-GCN [5], employing self-attention and shift convolution respectively. Recently, MS-G3D [19] achieved high accuracy by proposing graph 3D convolutions to aggregate features within a window of consecutive frames. However, 3D convolutions demand a long running time.
More recently, Qin et al. proposed self-attention models that dynamically optimize the graph structure [23]. Xu et al. designed a pure CNN architecture that more effectively captures the topological information [39]. Memmesheimer et al. studied the one-shot problem of skeleton-based action recognition [20]. They applied a metric learning setting and mapped the problem to a nearest-neighbor search in a set of activity reference samples. Wang et al. studied the adversarial attack problem in skeleton-based action recognition [32]. They investigated a perceptual loss that ensures the imperceptibility of the attack. Diao et al. investigated black-box attacks on skeleton-based action recognition [6]. They proposed an attack mechanism called BASKR and showed that adversarial attacks are a threat and that on-manifold adversarial samples are common for skeletal motions.
All the existing methods suffer from low accuracy in discriminating actions that share similar motion trajectories. This motivates us to seek a new encoding that helps the model differentiate confusing actions. Some works use angle features similar to the local feature presented in this paper [8], [40]. In contrast, we propose a collection of angular encoding forms. Each category consists of further subcategories. Different categories of angular encoding are designed to capture motion features of distinct kinematic body parts.

III. ANGULAR FEATURE REPRESENTATION

A. Angular Encoding
We propose using third-order features, which measure the angle between three body joints, to depict the relative movements between body parts in skeleton-based action recognition. Given three joints $u$, $w_1$ and $w_2$, where $u$ is the target joint for which the angular features are calculated and $w_1$ and $w_2$ are endpoints in the skeleton, let $\vec{b}_{uw_i}$ denote the vector from joint $u$ to $w_i$ ($i = 1, 2$). We have

$$\vec{b}_{uw_i} = (x_{w_i} - x_u,\; y_{w_i} - y_u,\; z_{w_i} - z_u),$$

where $(x_k, y_k, z_k)$ represent the coordinates of joint $k$ ($k = u, w_1, w_2$). We define two kinds of angular features.
Static Angular Encoding: suppose $\theta$ is the angle between $\vec{b}_{uw_1}$ and $\vec{b}_{uw_2}$; we define the static angular encoding $d_a(u)$ for joint $u$ as

$$d_a(u) = \begin{cases} 1 - \cos\theta = 1 - \dfrac{\vec{b}_{uw_1} \cdot \vec{b}_{uw_2}}{|\vec{b}_{uw_1}|\,|\vec{b}_{uw_2}|}, & \text{if } u \neq w_1 \text{ and } u \neq w_2, \\ 0, & \text{if } u = w_1 \text{ or } u = w_2. \end{cases}$$

Note that $w_1$ and $w_2$ do not need to be adjacent nodes of $u$. The feature value increases monotonically as $\theta$ goes from 0 to $\pi$ radians. In contrast to the first-order features, representing the coordinate of a joint, and the second-order features, representing the lengths and directions of bones, these third-order features focus more on motions and are invariant to the scale of human subjects.
Velocity Angular Encoding: the temporal differences of the angular features between consecutive frames, i.e.,

$$v_a^{(t+1)}(u) = d_a^{(t+1)}(u) - d_a^{(t)}(u),$$

where $v_a^{(t+1)}(u)$ is the angular velocity of joint $u$ at frame $(t + 1)$, describing the dynamic changes of angles. The angular encoding is a third-order feature. Taking the velocity of these third-order features further increases the order. Hence, these velocity angular features enable an action recognizer to capture fourth-order information of motion sequences.
However, we face a computational challenge when we attempt to exploit these angular features: if we use all possible angles, i.e., all possible combinations of $u$, $w_1$ and $w_2$, the computational complexity is $O(N^3 T)$, where $N$ and $T$ respectively represent the number of joints and frames. Instead, we manually define sets of angles that seem likely to facilitate distinguishing actions without drastically increasing computational cost. In the rest of this section, we present the four categories of angles considered in this work.
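To make the two encodings concrete, the following is a minimal Python sketch rather than the released implementation. It assumes that the static encoding has the form 1 - cos θ (which increases monotonically as θ goes from 0 to π); the function names, the `eps` guard for zero-length vectors, the clamping of the cosine, and the toy coordinates are all illustrative assumptions.

```python
import math


def static_angular_encoding(u, w1, w2, eps=1e-8):
    """Return 1 - cos(theta), where theta is the angle at joint u
    between the vectors u->w1 and u->w2 (assumed form of the encoding)."""
    b1 = [p - q for p, q in zip(w1, u)]
    b2 = [p - q for p, q in zip(w2, u)]
    n1 = math.sqrt(sum(c * c for c in b1))
    n2 = math.sqrt(sum(c * c for c in b2))
    if n1 < eps or n2 < eps:
        # The target joint coincides with an endpoint: feature set to zero.
        return 0.0
    cos_theta = sum(p * q for p, q in zip(b1, b2)) / (n1 * n2)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return 1.0 - cos_theta


def velocity_angular_encoding(static_values):
    """Differences between consecutive frames of a static encoding sequence.
    The result has one entry fewer than the input."""
    return [nxt - cur for cur, nxt in zip(static_values, static_values[1:])]


# Toy elbow joint with the shoulder and wrist as endpoints (hypothetical data).
elbow, shoulder = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
wrists = [(1.0, 0.0, 0.0), (1.0, -1.0, 0.0), (0.0, -1.0, 0.0)]  # arm straightens

static = [static_angular_encoding(elbow, shoulder, w) for w in wrists]
velocity = velocity_angular_encoding(static)

# Scaling the whole skeleton leaves the encoding unchanged.
scaled = static_angular_encoding(
    elbow, tuple(2 * c for c in shoulder), tuple(2 * c for c in wrists[1])
)
print(static, velocity, scaled)
```

In this toy sequence the static value grows as the arm straightens, and doubling all coordinates gives the same value, which illustrates the invariance to subject scale claimed for the third-order features.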
TABLE I: Comparison of recognition performance on four settings of two benchmark datasets. We compare not only the recognition accuracy but also the total number of parameters (#params) in the networks. #Ens is the number of models used in an ensemble. BSL means using the original feature without employing angular encoding. AGE-S and AGE-V stand for concatenating the original representation with angular encoding in the static and velocity domains respectively. Joint/J and Bone/B denote the use of joint and bone features respectively. The top accuracy is highlighted in red bold, and the second-best performance is highlighted in blue. The symbol & indicates ensembling models trained with the different input features given in the parentheses. GFlops denotes the number of floating-point operations, counted as multiply-add operations, that a model performs.

(a) Locally-Defined Angles. As illustrated in Figure 2(a), a locally-defined angle is measured between a joint and its two adjacent neighbors. If the target joint has only one adjacent joint, we set its angular feature to zero. When a joint has more than two adjacent joints, we choose the most active two. For example, we use the two shoulders instead of the head and belly for the neck joint since the latter rarely move. These angles can capture relative motions between two bones.
(b) Center-Oriented Angles. A center-oriented angle measures the angular distance between a target joint and two body center joints representing the neck and pelvis. As in Figure 2(b), given a target joint, we use two center-oriented angles: 1) neck-target-pelvis, dubbed unfixed-axis, and 2) neck-pelvis-target, dubbed fixed-axis. For the joints representing the neck and pelvis, we set their angular features to zero. Center-oriented angles measure the relative position between a target joint and the body center joints. For example, given an elbow as a target joint moving away horizontally from the body center, the unfixed-axis angle decreases while the fixed-axis angle increases.
(c) Pair-Based Angles. Pair-based angles measure the angle between a target joint and four pairs of endpoints: hands, elbows, knees, and feet, as illustrated in Figure 2(c). If the target joint is one of the endpoints, we set the feature value to zero. We select these four pairs due to their importance in performing actions. The pair-based angles are beneficial for recognizing object-related actions. For example, when a person is holding a box, the angle between a target joint and […]
(d) Finger-Based Angles. Fingers are actively involved in human actions. When the skeleton of each hand has finger joints, we include more detailed finger-based angles to incorporate them. As demonstrated in Figure 2(d), the two joints corresponding to fingers are selected as the anchor endpoints of an angle. The finger-based angles can indirectly depict gestures. For instance, an angle with a wrist as the root and a hand tip as well as a thumb as two endpoints can reflect the degree of hand opening.
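The elbow example given for the center-oriented angles in item (b) can be checked numerically. The sketch below is illustrative only: it assumes the encoding 1 - cos θ, assumes that the middle joint in each name (neck-target-pelvis, neck-pelvis-target) is the vertex of the angle, and uses made-up coordinates. `one_minus_cos` and `center_oriented_angles` are hypothetical helper names.

```python
import math


def one_minus_cos(vertex, end_a, end_b, eps=1e-8):
    """1 - cos of the angle at `vertex` between vertex->end_a and vertex->end_b."""
    a = [p - q for p, q in zip(end_a, vertex)]
    b = [p - q for p, q in zip(end_b, vertex)]
    na = math.sqrt(sum(c * c for c in a))
    nb = math.sqrt(sum(c * c for c in b))
    if na < eps or nb < eps:
        return 0.0  # degenerate angle: feature set to zero
    cos_theta = sum(p * q for p, q in zip(a, b)) / (na * nb)
    return 1.0 - max(-1.0, min(1.0, cos_theta))


def center_oriented_angles(target, neck, pelvis):
    """Return (unfixed-axis, fixed-axis) features for one target joint."""
    unfixed = one_minus_cos(target, neck, pelvis)  # neck-target-pelvis
    fixed = one_minus_cos(pelvis, neck, target)    # neck-pelvis-target
    return unfixed, fixed


# Made-up upright torso: pelvis at the origin, neck one unit above it.
neck, pelvis = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0)
elbow_near = (0.3, 0.5, 0.0)   # elbow close to the torso
elbow_far = (0.6, 0.5, 0.0)    # elbow moved horizontally away

near = center_oriented_angles(elbow_near, neck, pelvis)
far = center_oriented_angles(elbow_far, neck, pelvis)
print("unfixed-axis:", near[0], "->", far[0])
print("fixed-axis:  ", near[1], "->", far[1])
```

Under these assumptions the unfixed-axis feature shrinks and the fixed-axis feature grows as the elbow moves outward, matching the behaviour described in the text. Both features also evaluate to zero when the target is the neck or the pelvis itself.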

B. Our Backbone Architecture
The overall network architecture is illustrated in Figure 3. Three different features are extracted from the skeleton and input into the stack of three spatial-temporal blocks (STBs). Then, the output passes sequentially to a global average pooling layer, a fully connected layer, and a softmax layer for action classification. We use a simplified version of MS-G3D [19] as the backbone of our model. For simplification, we remove its heavy graph 3D convolution (G3D) modules, weighing the performance gain against the computational cost. We call the resulting system MSGCN. Note that our proposed angular features are independent of the choice of the backbone.
We extract the joint, bone, and angular features from every action video. For the bone feature, if a joint has more than one adjacent node, we choose the joint closer to the body's center. So, given an elbow joint, we use the vector from the elbow to the shoulder rather than the vector from the elbow to the wrist. For the angle, we extract seven or nine angular features (without/with finger-based angles) for every joint, constituting seven or nine channels of features. Eventually, for each action, we construct a feature tensor $X \in \mathbb{R}^{C \times T \times N \times M}$, where $C$, $T$, $N$ and $M$ respectively correspond to the numbers of channels, frames, joints, and participants (the persons conducting actions). We test various combinations of the joint, bone, and angular features in the experiments.
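As an illustration of how the per-joint channels might be assembled, the sketch below computes bone vectors for a toy three-joint arm and concatenates joint, bone, and angular channels for a single frame and person. It is a simplified assumption-laden example: the `inward` neighbour map, the helper names, the nested-list layout (instead of a tensor), and the zero-valued angular placeholders are all hypothetical.

```python
def bone_features(joints, inward):
    """joints: list of (x, y, z); inward[j]: index of the neighbour of joint j
    that is closer to the body centre (the centre joint maps to itself).
    Each bone is the vector from joint j to that inward neighbour."""
    return [
        tuple(p - c for p, c in zip(joints[inward[j]], joints[j]))
        for j in range(len(joints))
    ]


def concat_channels(*feature_sets):
    """Concatenate per-joint feature tuples along the channel axis."""
    num_joints = len(feature_sets[0])
    return [
        sum((tuple(fs[j]) for fs in feature_sets), ())
        for j in range(num_joints)
    ]


# Toy arm for a single frame and person: shoulder (0), elbow (1), wrist (2).
joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0)]
inward = [0, 0, 1]  # hypothetical inward-neighbour map
bones = bone_features(joints, inward)

# Placeholder for nine angular channels per joint (values not computed here).
angles = [(0.0,) * 9 for _ in joints]

frame = concat_channels(joints, bones, angles)
print(len(frame), len(frame[0]))  # number of joints, number of channels
```

With three joint channels, three bone channels, and nine angular channels, each joint carries 15 channels in this sketch; stacking such frames over time and over participants would give the four-dimensional layout described above.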
Each STB, as exhibited in Figure 3(b), comprises a spatial multiscale graph convolution (SMGC) unit and three temporal multiscale convolution (TMC) units. The details of these components are as follows.
The SMGC unit, as shown in Figure 3(c), consists of a parallel combination of graph convolutional layers. The adjacency matrix of the graph convolutions results from the summation of a powered adjacency matrix $A^k$ and a learnable mask $A^k_{\text{mask}}$.
Powered adjacency matrices: To prevent over-smoothing, we avoid sequentially stacking multiple graph convolutional layers to make the network deep. Following [19], to create graph convolutional layers with different sizes of receptive fields, we directly use the powers of the adjacency matrix $A^k$ instead of $A$ itself to aggregate the multi-hop neighbor information. Thus, $A^k_{i,j} = 1$ indicates the existence of a path between joints $i$ and $j$ within $k$ hops. We feed the input into $K$ graph convolution branches with different receptive fields. $K$ is no more than the longest path within the skeleton graph.
Learnable masks: Using the skeleton as a fixed graph cannot capture the nonphysical dependencies among joints. For example, two hands may always perform actions in conjunction, whereas they are not physically connected in a skeleton. To infer the latent dependencies among joints, following [27], we apply learnable masks to the adjacency matrices.
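The idea of using powers of the adjacency matrix to reach multi-hop neighbours can be sketched on a toy graph. The example below is a hedged illustration, not the backbone's actual code: adding self-loops before taking the power and binarising the result are assumptions made here so that entry (i, j) equals 1 exactly when joint j is reachable from joint i within k hops, and the four-joint chain is made up.

```python
def matmul(a, b):
    """Plain square-matrix product on nested lists."""
    n = len(a)
    return [
        [sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
        for i in range(n)
    ]


def k_hop_adjacency(adj, k):
    """Binary matrix whose (i, j) entry is 1 if j is reachable from i
    within k hops. Self-loops are added so that a walk of length k
    also covers every shorter path (an assumption of this sketch)."""
    n = len(adj)
    a_hat = [[1 if (i == j or adj[i][j]) else 0 for j in range(n)]
             for i in range(n)]
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        power = matmul(power, a_hat)
    return [[1 if v > 0 else 0 for v in row] for row in power]


# Made-up chain skeleton: 0 - 1 - 2 - 3.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

for k in (1, 2, 3):
    print(k, k_hop_adjacency(chain, k)[0])  # joints reachable from joint 0
```

Each value of k would correspond to one parallel branch with its own receptive field; on this chain, k = 3 (the longest path) already connects every pair of joints, which is why larger K brings no further reach.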
The TMC unit, shown in Figure 3(d), consists of seven parallel temporal convolutional branches. Each branch starts with a 1 × 1 convolution to aggregate features between different channels. The functions of the branches diverge as the input passes forward, and the branches can be divided into four groups:
(a) Extracting multiscale temporal features: this group contains four 3 × 1 temporal convolutions, applying four different dilations to obtain multiscale temporal receptive fields.
(b) Processing features within the current frame: this group only has one 1 × 1 convolution to concentrate features within a single frame.
(c) Emphasizing the most salient information within consecutive frames: this group ends with a 3 × 1 max-pooling layer to draw out the most important features.
(d) Preserving gradients: the final group incorporates a residual path to preserve gradients during back-propagation [2].
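To see how different dilations yield multiscale temporal receptive fields in group (a), the short sketch below computes the span covered by a 3 × 1 temporal kernel for several dilation rates. The dilation values 1 to 4 are hypothetical (the text only states that four different dilations are used), and the helper names are illustrative.

```python
def temporal_receptive_field(kernel_size, dilation):
    """Number of frames spanned by one dilated temporal convolution."""
    return dilation * (kernel_size - 1) + 1


def dilated_taps(t, kernel_size, dilation):
    """Frame indices read by a centred dilated kernel at output frame t."""
    half = kernel_size // 2
    return [t + dilation * (i - half) for i in range(kernel_size)]


hypothetical_dilations = (1, 2, 3, 4)  # assumed values, not from the paper
fields = [temporal_receptive_field(3, d) for d in hypothetical_dilations]
taps = [dilated_taps(10, 3, d) for d in hypothetical_dilations]

for d, f, tp in zip(hypothetical_dilations, fields, taps):
    print(f"dilation={d}: spans {f} frames, reads frames {tp}")
```

A kernel of size 3 always reads three frames, but a larger dilation spreads them further apart, so parallel branches can cover short and long temporal contexts with the same number of weights.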

IV. EXPERIMENTS

A. Datasets
NTU60 [25]. NTU60 is a widely-used benchmark dataset for skeleton-based action recognition, incorporating 56,000 videos. The action videos were collected in a laboratory environment, resulting in accurately extracted skeletons. Nonetheless, recognizing actions from these skeletons is still challenging due to five aspects: (1) the skeletons are captured from different viewpoints; (2) the skeleton sizes of subjects vary; (3) so do their speeds of action; (4) different actions can have similar motion trajectories; (5) there are limited joints to portray hand actions in detail.
NTU120 [17]. NTU120 is an extension of NTU60. It uses more camera positions and angles, as well as a larger number of performing subjects, leading to 113,945 videos.

B. Experimental Setups
We train deep learning models on four NVIDIA 2080-Ti GPUs and use PyTorch as our deep learning framework to compute the angular encoding. Furthermore, we apply stochastic gradient descent (SGD) with momentum 0.9 as the optimizer. The training epochs for NTU60 and NTU120 are set to 55 and 60, respectively, with learning rates decaying to […] We follow [26] in normalizing and translating each skeleton, and in padding all clips to 300 frames by repeating the action sequences. The training loss function is cross-entropy [24].

TABLE III: A comparison of with/without angular features on the most confusing actions that may share similar motion trajectories. The 'Action' column shows the ground truth labels, and the 'Similar Action' column shows the predictions from the model (with/without angular features). The similar actions highlighted in orange demonstrate the change of predictions after employing angular features. The accuracy improvements highlighted in red are the substantially increased ones (Acc↑ ≥ 10%) due to using our angular features.
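The padding step mentioned in the experimental setup, repeating an action sequence until the clip reaches 300 frames, can be sketched as follows. This is a minimal illustration under assumptions: the function name `pad_by_repeating` is hypothetical, frames are represented by arbitrary Python objects, and truncating clips that are already longer than the target is a choice made here, not something stated in the text.

```python
def pad_by_repeating(frames, target_len=300):
    """Repeat the sequence from its start until it has `target_len` frames.
    Clips longer than `target_len` are truncated (assumption of this sketch)."""
    if not frames:
        raise ValueError("cannot pad an empty clip")
    if len(frames) >= target_len:
        return list(frames[:target_len])
    repeats = -(-target_len // len(frames))  # ceiling division
    return (list(frames) * repeats)[:target_len]


# A made-up clip of 70 frames, each frame represented by its index.
clip = list(range(70))
padded = pad_by_repeating(clip)
print(len(padded), padded[68:73])
```

After padding, frame 70 of the output is again the first frame of the action, so every clip has the same temporal length without introducing all-zero frames.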

C. Ablation Studies
There are two possible approaches for using angular features: (a) simply concatenate our proposed angular features with the existing joint features, bone features, or both, and then train the model; (b) feed the angular features into our model and ensemble it with other models that are trained using joint features, bone features, or both to predict the action label. We study the differences between these approaches. We report the results in Table I, covering the different settings of both NTU60 and NTU120. To reduce clutter, we use the results of the cross-subject setting of NTU120 for ablation studies. We denote the accuracy without angular encoding as the baseline (BSL). AGE means concatenating the original feature with angular encoding. The suffixes -S (in BSL-S and AGE-S) and -V (in BSL-V and AGE-V) represent feeding the static and velocity features, respectively.
Concatenating with Angular Features. Here, we study the effects of concatenating angular features with others. We first obtain the accuracy of three models trained with three feature types, i.e., the joint, the bone, and a concatenation of both, as our baselines. Then, we concatenate angular features to each of these three to compare the performance. We evaluate the accuracy with two data streams, i.e., static and velocity. We observe that all the feature types in both data streams gain accuracy from incorporating angular features. For the static stream, concatenating angular features with the concatenation of joint and bone features leads to the most significant enhancement. As to the velocity stream, although the accuracy is lower than that of the static one, the improvement resulting from angular features is more substantial. In sum, concatenating all three features using the static data stream results in the highest accuracy.
Training Solely with Angular Encoding. We are interested in the performance of the network when only the angular encoding is fed, i.e., no joint and bone features are used. The outcome is shown in the first row of Table II, denoted as Ang. We see that training merely with angular encoding even outperforms training with the joint feature, indicating the completeness of angular encoding for depicting human skeleton motion trajectories.
Ensembling with Angular Encoding. We also study the change in accuracy when ensembling a network trained solely with angular features (Ang) with networks trained with joint and bone features, respectively, as well as with their ensemble. The results are reported in Table II. We obtain the accuracy of the above three models as the baseline results for each stream and compare them against the accuracy of ensembling the baseline models with Ang. We note that ensembling with Ang consistently leads to an increase in accuracy. As with the concatenation studies, angular features are more beneficial for the velocity stream. However, unlike the case with concatenation, the accuracy of the two streams is similar. We also observe that ensembling with Bon achieves a considerable accuracy gain. An ensemble of Jnt, Bon and Ang results in the highest accuracy in the static stream.
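The ensembling step can be illustrated with a small score-fusion sketch. The text does not specify how the model outputs are combined, so the example assumes a common choice, summing (optionally weighted) softmax scores and taking the arg-max; the function names and the three-class logits are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Fuse models by a weighted sum of their softmax scores
    (an assumed fusion rule) and return (predicted class, fused scores)."""
    if weights is None:
        weights = [1.0] * len(per_model_logits)
    num_classes = len(per_model_logits[0])
    fused = [0.0] * num_classes
    for w, logits in zip(weights, per_model_logits):
        for c, p in enumerate(softmax(logits)):
            fused[c] += w * p
    best = max(range(num_classes), key=fused.__getitem__)
    return best, fused


# Made-up logits for one clip and three classes.
jnt_logits = [2.0, 1.9, 0.1]   # joint model: barely prefers class 0
ang_logits = [0.2, 2.0, 0.1]   # angular model: clearly prefers class 1

jnt_only, _ = ensemble_predict([jnt_logits])
fused_pred, fused_scores = ensemble_predict([jnt_logits, ang_logits])
print(jnt_only, fused_pred, fused_scores)
```

In this toy case the joint model alone narrowly picks class 0, while the fused scores favour class 1, showing how a model trained on angular features could overturn a low-margin decision between two similar actions.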
Evaluating Angular Encoding of Each Category. We independently evaluate the boost from the angular encoding of each of the four categories, i.e., local, center-oriented, pair-based, and finger-based. The utilized model is the BSL architecture. We discover that all four categories can individually boost the recognition accuracy, as shown in Table VI. Furthermore, the proposed angular encoding has been leveraged in an open challenge and shown to be effective¹.

TABLE IV: A comparison of the effect on action recognition of concatenating certain angular features to the joint representation. Each subtable is sorted by the increase in accuracy. The 'Action' column shows the ground truth labels, and the 'Similar Action' column shows the predictions from the model (with/without angular encoding).

D. Comparison with State of the Art Models
The ablation studies indicate that fusing angular features in both concatenating and ensembling forms can boost accuracy. Hence, we include the results of both approaches as well as their combination in Table I. In practice, storage and run time may become bottlenecks. Thus, we consider not only the recognition accuracy but also the number of parameters (in millions) and the inference cost (in GFlops). Unavailable results are marked with a dash.
We achieve new state-of-the-art accuracies for recognizing skeleton actions on both datasets, i.e., NTU60 and NTU120. For NTU120, MSGCN outperforms the existing state-of-the-art model by a wide margin. Apart from the higher accuracy, MSGCN requires fewer parameters and a shorter inference time. We evaluate the inference time of processing a single NTU120 action video for all the methods. Compared with the most accurate existing model, MSGCN requires fewer than 70% of the parameters and less than 70% of the run time while achieving higher skeleton-based recognition accuracy.
Of note, the proposed angular features are compatible with the listed competing models. If one seeks even higher accuracy, the simple GCN employed here can be replaced with a more sophisticated model. For example, if we employ the more complicated MS-G3D [19] instead of our MSGCN, the accuracy can be further improved, as Table V shows. Nonetheless, both the number of parameters and the GFlops increase correspondingly.

V. ANALYSIS OF ANGULAR ENCODING
We want to provide an intuitive understanding of how angular features help in differentiating actions. To this end, we compare the results from two models trained with the joint features and with the concatenation of joint and angular features, respectively.

A. Utilizing All Types of Angular Encoding
First, we concatenate all kinds of angular encoding with joint features and train the baseline network. The results are given in Table III. We observe two phenomena: (a) The majority of the action categories receiving a substantial accuracy boost from angular features are hand-related, such as making a victory sign vs thumbs up. We hypothesize that the enhancement may result from our explicit design of angles for hands and fingers, so that the gestures can be portrayed more comprehensively. (b) For some actions, after the angular features have been introduced, the most similar actions change. This suggests that the angles provide information complementary to the coordinate-based representations. The new actions that still confuse the network after using the angular encoding are also challenging for humans to differentiate from their corresponding ground-truth actions by just observing skeletons. For better understanding, we provide in Figure 4 some visual examples of the confusing actions whose most-confused counterparts change after using angular encoding. Among them, folding paper and counting money are easily confused, and reading and writing are also likely to be mixed up. These confusing pairs of skeletons also look similar to human observers.

B. Contributions from Different Angle Types
Next, we conduct ablation studies on how the different types of the proposed angular encoding improve the accuracy of recognizing skeleton-based actions. The baseline accuracy is obtained using only the joint feature. Then, we concatenate different types of angular encoding with the joint feature to evaluate the effectiveness of each encoding type.
The results are depicted in Figure 5. We observe: i) The center-oriented angular encoding boosts the accuracy by the largest margin for both static and velocity input features; the increases are 1.01% and 2.02% respectively. Since the center-oriented encoding reflects the distance from the joint to the body center, the results imply that knowing such a distance is greatly beneficial for recognizing skeleton-based actions. This is consistent with our daily experience. To illustrate, people normally pose the hand farther away from the body center for the victory sign than for the ok sign. ii) Angular encoding improves accuracy more for the velocity input features than for the static joint coordinates. The average improvements are 0.58% for the static input and 1.42% for the velocity input. This difference indicates that angular encoding provides more additional information for capturing the dynamic motion trajectories of actions than for depicting the spatial structural information. iii) The part-based angular encoding only marginally raises the accuracy when using the static features, by only 0.22%, whereas the increase grows substantially to 1.47% for the velocity input. We conjecture this is because the actions performed by arms and legs involve a lot of dynamics. Thus, when using the velocity input, angular encoding provides complementary dynamic information for these actions.
We investigate how each kind of angular encoding improves accuracy. To this end, we collect the top seven actions whose accuracy is improved the most by the angular encoding. The results are given in Table IV. We see: i) Equipping the velocity features with angular encoding substantially boosts accuracy for long-lasting actions, such as 'staple book'. In contrast, for the static input, most actions whose accuracy is significantly improved are those that last for a short time, such as 'thumb up'. ii) The majority of actions whose accuracy is improved by a type of angular encoding are those performed by the anchor joints corresponding to that encoding. To illustrate, the finger-based encoding increases accuracy for the hand-related actions, while the part-based encoding benefits the actions that heavily use arms and legs.

VI. GENERALISABILITY OF ANGULAR ENCODING
A possible concern is the generalisability of the proposed angular encoding. That is, will fusing angular encoding improve the accuracy of other backbone architectures? To answer this, we conduct experiments fusing angular encoding with the joint feature and feed the concatenated input to three recently-proposed backbone networks: ShiftGCN [5], DecoupleGCN [4] and MSG3D [19]. The utilized dataset is the cross-subject setting of NTU120.
We display the results in Figure 6. We not only report the accuracy of fusing all kinds of proposed angular encoding, but also separately concatenate every type of encoding with the joint feature and report the corresponding accuracy. We see that fusing angular encoding with the original features consistently improves the accuracy of all three backbones. On the other hand, the effectiveness of the different angular encodings in boosting accuracy varies. We observe that the center-oriented angular encoding increases accuracy by the largest magnitude. Furthermore, angular encoding improves accuracy more when deployed in the velocity domain than in the static domain. These two observations are consistent with those on our simple backbone network. For DecoupleGCN, the part- and finger-based angular encodings improve accuracy more substantially than they do for our simple backbone. Specifically, although feeding the velocity input to DecoupleGCN initially leads to lower accuracy than using the static feature, the situation is reversed after fusing with these two types of angular encoding. These results suggest that, once fused with these angular encodings, features in the velocity domain can surpass the static joint features.

VII. DISCUSSION
As we have described in the introduction, current GCNs are designed to extract features between two adjacent nodes. On the other hand, the angular features are higher-order ones that go beyond two adjacent vertices. We can theoretically view every angle as a hyperedge $(v_1, v_2, v_3)$, where $v_1$, $v_2$ and $v_3$ are the constituent joints of an angle. The angular encoding is their associated feature. The angular encoding extends the capability of existing GNNs to capture features of hyperedges.
From the perspective of treating a skeleton as a hypergraph, we have proposed four categories of hyperedges. In contrast, existing work that also makes use of angle features only contains one type of hyperedge.

VIII. CONCLUSION
To extend the capacity of GCNs in extracting body structural information, we propose higher-order representations in the form of angular features. The proposed angular features comprehensively capture the relative motion between different body parts while maintaining robustness against variations of subjects. Hence, they are able to discriminate between challenging actions having similar motion trajectories, which cause problems for existing models. Our experimental results show that the angular features are complementary to existing features, i.e., the joint and bone representations. By incorporating our angular features into a simple action recognition GCN, we achieve new state-of-the-art accuracy on several benchmarks while maintaining lower computational cost, thus supporting real-time action recognition on edge devices.

Fig. 1: Sample skeletons with similar motion trajectories: (left) taking off glasses vs (right) taking off headphones. The angles formed by red dashed lines (i.e., the fore- and upper arms) are distinctive, which are informative in distinguishing these two similar motions.

Fig. 3: Our backbone architecture is composed of three spatial-temporal blocks, each consisting of a spatial multiscale graph convolution and a temporal multiscale convolution unit. The spatial multiscale unit extracts structural skeleton information with parallel graph convolutional layers. The temporal multiscale unit draws correlations with four functional groups. See Section III-B for more details.

Fig. 4: Visualization examples of confusing actions. The action that the network gets most confused about has changed after employing angular encoding as a part of input features.

Fig. 5: Accuracy of recognizing skeleton-based actions using the multi-scale GCN with different types of angular encoding. Both static and velocity domains are considered. The best accuracy of each domain is highlighted in red.

Fig. 6: Accuracy of recognizing skeleton-based actions using DecoupleGCN (left) and ShiftGCN (right) with different types of angular encoding. Both static and velocity domains are considered. The column All represents concatenating all types of angular encoding.

Fig. 2 (panel titles): (a) Local, (b) Center-Oriented, (c) Pair-Based.

TABLE II: Evaluation results on ensembling with angular features. Ens is the ensembling. Jnt and Bon represent the joint and bone features respectively. The red bold number highlights the highest prediction accuracy. Acc↑ is the improvement in accuracy.

TABLE V: Comparison of recognition performance between MSGCN and MSG3D. MSG3D has higher accuracy, more parameters, and a longer running time. GFlops denotes the number of floating-point operations, counted as multiply-add operations, that a model performs.

TABLE VI: Independent evaluation of angular encoding for each category. XSub and XView represent cross-subject and cross-view. XSet means cross-setup.