Improving Skeleton-Based Action Recognition Using Part-Aware Graphs in a Multi-Stream Fusion Context

Skeleton-based human action recognition with Graph Convolutional Networks is an active research field that has gained increased popularity over the last few years. A challenge in skeleton-based action recognition is designing a model that captures fine-grained motions and the relations between the movements of different parts of the skeleton towards the recognition of specific actions. In this paper, the use of a set of part-aware graphs for the skeleton representation is proposed, aiming to enhance discrimination between actions in the recognition task, since each action puts emphasis on specific parts of the skeleton. Extensive experimental work has been carried out in a consistent evaluation framework, taking into account different combinations of part-aware graphs and feature representations and leading to a configuration that achieves the optimal balance. Based upon two well-established datasets, namely NTU RGB+D and NTU RGB+D 120, we demonstrate that the proposed methodology compares favourably with the state-of-the-art. Code is publicly available at: https://github.com/joyios1/Improving-skeleton-based-action-recognition-using-part-aware-graphs-in-a-multi-stream-fusion-context.


I. INTRODUCTION
The importance of recognizing human actions has led to the development of the research field of Human Action Recognition (HAR), which has applications in many real-world problems, such as recognizing suspicious and dangerous human actions with video surveillance systems [1], real-time human-robot interaction [2] and entertainment [3].
The first approaches that used deep neural networks [4], [5] involved RGB images as input, which poses many challenges due to background variability, illumination conditions, viewpoints and scales of humans. Over the recent years, a new field of HAR has emerged, called skeleton-based action recognition, due to easy access to skeleton data with the use of depth cameras such as Microsoft Kinect [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Berdakh Abibullaev.
Early methods that used the skeleton modality did not take into account the physical topology of the human body, because the human joints were processed with Long Short-Term Memory (LSTM) networks [7] or Convolutional Neural Networks (CNNs) [8] as sequences and images, respectively. However, the human body has a specific topology that needs to be considered when exchanging information between human joints, which regular CNNs fail to achieve. Over the recent years, Graph Convolutional Networks (GCNs) have gained increased popularity in multiple and diverse contexts [9], [10], [11]. In the context of action recognition, most recent methodologies utilize GCNs, where the topology of the human body is represented as a graph of interconnected nodes [12].
A common practice in skeleton-based action recognition with GCNs is the extraction of different modalities from the spatiotemporal data and the fusion of independently trained models operating on these modalities. The most common case is the bone modality, originally proposed in [13], calculated as the difference between the coordinates of neighboring joints. Other common modalities are the motion data of joints and bones [14], [15], [16], [17], [18], [19], [20], calculated as the difference between the positions of a node in consecutive frames. In [21], a new modality was introduced, concerning the angles between specific bones.
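As a concrete illustration, the bone and motion modalities described above amount to simple coordinate differences. The following is a minimal NumPy sketch; the parent list below is a toy three-joint chain, not the actual Kinect v2 joint hierarchy:

```python
import numpy as np

def bone_modality(joints, parent):
    """Bone features: difference between each joint and its parent joint.

    joints: array of shape (C, T, V) -- channels, frames, joints.
    parent: list of length V giving the parent index of each joint
            (a root joint may point to itself, yielding a zero bone).
    """
    return joints - joints[:, :, parent]

def motion_modality(feats):
    """Motion features: difference between consecutive frames,
    zero-padded at the last frame to keep the shape (C, T, V)."""
    motion = np.zeros_like(feats)
    motion[:, :-1] = feats[:, 1:] - feats[:, :-1]
    return motion

# Toy skeleton: 3 joints in a chain, 2 frames, 3 coordinate channels.
parent = [0, 0, 1]                      # joint 0 is the root
joints = np.arange(3 * 2 * 3, dtype=float).reshape(3, 2, 3)
bones = bone_modality(joints, parent)
motion = motion_modality(joints)
```

Either transform preserves the (C, T, V) layout, so the same network architecture can be trained on any of the resulting modalities.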
Another common practice among state-of-the-art methodologies is the segmentation of the human skeleton into parts or part groups [22], [23], [24], [25], [26]. The different graphs produced by this approach can be processed either by a single network trained in an end-to-end fashion or by individually trained models in a late fusion context. It can be argued that the isolation of specific parts can be beneficial for actions where the corresponding part is dominant. The contributions of this work are summarized as follows: • Investigation of novel part-aware graphs for skeleton-based action recognition which emphasize parts of the skeleton that are dominant during specific actions.
• Extensive experimental work on well-established datasets that point out the optimal balance between part-aware graphs and feature representations in a multi-stream fusion context.

II. RELATED WORK
The use of graph convolutional networks in skeleton-based action recognition demonstrates high effectiveness because the human body can be represented by a graph, where the nodes and the edges denote the joints and the bones of the body, respectively. The graph can represent either the whole body or specific parts of it, which is the choice of many approaches in skeleton-based action recognition (part-based approaches). Furthermore, the majority of state-of-the-art methodologies employ the fusion of multiple networks trained independently with different feature modalities that are derived from the joint coordinates (stream fusion approaches). In the stream fusion approaches the same network is trained for different feature modalities, whereas in the part-based approaches the networks are modified to focus on sub-graphs of the human body. Last but not least, the combination of the aforementioned approaches has also been addressed.

A. PART BASED APPROACHES
Early approaches in which the human body is split into body parts utilized Recurrent Neural Networks (RNNs). In [22] the human skeleton was divided into five parts according to the human physical structure, and each part was fed into a Bidirectional Recurrent Neural Network (BRNN). Then, in the next layers, the different learned representations from the parts were fused to obtain the final representation. In [27] a hierarchical spatial reasoning network was proposed with the use of LSTM networks, for effective capturing of the body-level structural information between body parts. However, LSTM networks achieve inferior performance compared to GCNs because the topology of the human body is not taken into consideration. In [23] a part-based graph convolutional network (PB-GCN) was proposed in which the skeleton graph is divided into four subgraphs with joints shared across them. Convolutions for each part of the part-based graph are performed separately and the results are combined using an aggregation function. In [24], structure-induced part-graphs were proposed based on the natural division of the human body for the input human skeleton, and then intra-part graph convolution was performed for better analysis of specific human actions. In [25] part-aware graph convolutional networks were introduced, in which the joints are divided into several parts so that convolutions can focus on these specific parts of the human body. In [28] a graph-based reasoning approach was introduced, projecting each part of the human body into a node to form a fully-connected similarity graph, in order to capture relative information among the disjoint and distant joints of the human body. Due to the inherent complexity of combining multiple part-based representations in an end-to-end training scheme, the requirement for large amounts of training data becomes an obstacle towards effective classification.

B. STREAM FUSION APPROACHES
Two milestone methodologies called ST-GCN [29] and AS-GCN [30] utilized only the joint coordinates.In 2s-AGCN [13] a two-stream network was proposed for the first time.
The model was trained twice, once for the joint coordinates and once for the differences between adjacent joint coordinates (bones). Another approach utilizing the joint-bone fusion framework is [31], in which cross-spacetime connections were introduced for direct information exchange across the spatial-temporal graph. In [32] the joint and joint motion features are fused into a single input. More recent methodologies, such as [14], [15], [16], [17], [20], [33], and [34], utilized a four (4) stream fusion adding the motion data of joints and bones, achieving even higher recognition accuracy than the two (2) stream fusion. In [35] higher-order features were proposed by introducing relative distances between joints and accelerations of joints and bones, leading to a five- and six-stream fusion, respectively. In AngNet [21] angular features were proposed between specific pairs of bones, which were concatenated with the coordinate and motion data in a four-stream fusion. In Info-GCN [36] a six (6) stream fusion was proposed, where the first four streams are the same as in [15] and the two extra streams are introduced with a multi-modal representation of the skeleton in which new spatial features for each joint are produced. Then, the motion data of these new features is calculated for the final six-stream ensemble. In our method, we further explore the concatenation of joints and bones (cat) modality, which our experiments show greatly increases the performance of the fusion streams.
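The cat modality can be illustrated with a channel-wise concatenation. This is a sketch under the assumption that joints and bones are stacked along the channel axis; the shapes follow the (C, T, V) convention used later in the paper and the values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
joints = rng.standard_normal((3, 4, 25))   # (C, T, V): channels, frames, joints
bones = rng.standard_normal((3, 4, 25))    # bone features (joint differences)

# The concatenated (cat) modality stacks joints and bones along the
# channel axis, so a single network sees both representations at once.
cat = np.concatenate([joints, bones], axis=0)
```

Only the first network layer needs to accept the doubled channel count; the rest of the architecture is unchanged.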

C. COMBINED STREAM FUSION AND PART-BASED APPROACHES
In [26] a Part-Level Graph Convolutional Network (PL-GCN) was proposed to capture part-level information of skeletons, in which the partition of body parts is learned in a data-driven way. Firstly, a graph pooling operation is employed for learning high-level characteristics between body parts and then a graph unpooling operation is employed to give emphasis to important body parts in each action.
The input for the model was the concatenation of the joint coordinates and the temporal displacement. In [37], a part-wise attention mechanism was proposed, in which five parts of the human body are manually selected and processed both individually and as a whole, in order to obtain the final attention map for each body part. The input modalities are the joint, bone, bone motion and joint motion, with which the network is trained in an end-to-end manner, and fusion of all the modalities takes place in an intermediate layer of the network via concatenation. In [38] the full body, the arms and the legs of the human body are trained independently for four concatenated modalities, i.e., joint, bone, joint velocity and bone velocity. Then the part streams are fused by performing a weighted average of the prediction scores.
Contrary to the aforementioned approaches, in our work we build part-aware single networks where we use three distinct modalities, namely, bones, joints and concatenated bones with joints.Each modality is used as an input in the proposed network for each selected part-aware graph.The final outcome is the result of a multi-stream fusion of all considered networks, resulting in a methodology that compares favorably with the state-of-the-art.

III. METHODOLOGY
In this section, we present an in-depth analysis of the proposed part-aware graphs along with the network architecture of the employed model.

A. PREREQUISITES
Let the human skeleton be considered as a graph G = (V, E), where V = {u_1, ..., u_N} is the set of N human joints in 3D space and E is the collection of edges between the human joints. The edges are described by an adjacency matrix A ∈ R^(N×N), where each element a_ij represents the weight of the connection between joints u_i and u_j. Each action is represented by a tensor X ∈ R^(C×T×V), where C, T and V denote the number of channels, frames and joints, respectively. The graph convolution operation is defined as:

x'_i = Σ_{u_j ∈ N(u_i)} a_ij W x_j,    (1)

where N(u_i) = {u_j | a_ij ≠ 0} denotes the neighboring nodes of u_i, x_j ∈ R^C denotes the feature vector of u_j and W denotes a learnable weight matrix.
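A minimal sketch of the operation in (1), with the sum over N(u_i) expressed as a matrix product; the adjacency, features and weights below are toy values for illustration only:

```python
import numpy as np

def graph_conv(X, A, W):
    """Spatial graph convolution for one frame.

    X: (V, C) joint features, A: (V, V) adjacency (the a_ij weights),
    W: (C, C_out) learnable weights.  Each output joint aggregates the
    weighted features of its neighbors N(u_i) = {u_j | a_ij != 0}.
    """
    return A @ X @ W

# 3-joint chain: each joint connects to itself and its physical neighbors.
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
X = np.array([[1., 0.],
              [0., 1.],
              [2., 2.]])
W = np.eye(2)          # identity weights: output is a plain neighborhood sum
out = graph_conv(X, A, W)
```

With identity weights, each row of `out` is simply the sum of the features of that joint's neighborhood, which makes the aggregation in (1) easy to verify by hand.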

B. MODEL ARCHITECTURE
The core of our methodology is the CTR-GCN [15]. Firstly, spatial modeling is addressed, in which joints exchange feature information, considering each frame independently and taking into account the graph topology in the form of an adjacency matrix, as defined in (1). Specifically, the spatial modeling of the model comprises three (3) CTR-GC blocks (FIGURE 1), each one concerning a different topology, which is shared for all the actions. The topologies of the CTR-GC blocks are initialized with different types of connections along the human skeleton graph, namely the 'self', the 'outwards' and the 'inwards' connections, respectively, as shown in FIGURE 2. However, these topologies are subject to change during training, allowing the model to determine the optimal learned connections. For a better understanding of the learned topologies the reader is referred to section IV-C3. Additionally, apart from the learnable topologies, the network calculates a specific topology for every feature channel in the input skeleton sequence. Firstly, two 1 × 1 convolutional layers are used in parallel to extract intermediate feature maps from the input to a CTR-GC block. For each feature map, temporal pooling is applied to aggregate the features from all the frames in the sequence. Then, the resulting features are subtracted in a pairwise manner, determining a connection weight between all the joint pairs. A final 1 × 1 convolutional layer is used to restore the channel dimension to match the input channels, resulting in a different adjacency matrix for each channel. Finally, the shared topology is added to the channel-wise topologies as a refinement. The next step in the pipeline of the model is temporal modeling, where adjacent frames of the same joint exchange information. Firstly, 1 × 1 convolution layers are used as a bottleneck design to reduce the channel dimension and avoid a very large number of parameters when the 5 × 1 convolution layers are used to perform the temporal modeling. The dimensionality of a skeleton sequence is denoted as R^(C×T×V), where C is the number of channels, T is the number of frames and V is the number of joints of the human body; therefore, when a 5 × 1 kernel is used, information is exchanged between frames, exploiting the temporal context. Additionally, by using a dilation factor several frames can be skipped, resulting in the consideration of larger time spans.
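The channel-wise topology computation described above can be sketched as follows. This is a NumPy approximation of the idea, not the CTR-GCN implementation; the 1 × 1 convolutions and the final channel-restoring convolution are abstracted away:

```python
import numpy as np

def channelwise_topology(f1, f2):
    # f1, f2: (C', T, V) intermediate feature maps from the two parallel
    # 1x1 convolutions.  Temporal pooling collapses the frame axis; a
    # pairwise subtraction then yields one V x V connection-weight
    # matrix per reduced channel.
    p1 = f1.mean(axis=1)                    # (C', V) temporal pooling
    p2 = f2.mean(axis=1)
    # Entry [c, i, j] = p1[c, i] - p2[c, j]: a weight for every joint pair.
    return p1[:, :, None] - p2[:, None, :]  # (C', V, V)

# Deterministic toy inputs: 2 reduced channels, 3 frames, 4 joints.
topo = channelwise_topology(np.ones((2, 3, 4)), np.zeros((2, 3, 4)))
```

The resulting (C', V, V) tensor is what the shared topology is added to as a refinement, giving each channel its own adjacency matrix.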
An overview of the model architecture is shown in FIGURE 2. The CTR-GCN model contains ten (10) consecutive blocks, in which the first four blocks have 64 output channels, the next three have 128 and the last three have 256; the temporal dimension is halved at the 5th and 8th blocks.
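The block schedule above can be written down explicitly as a sketch; the 64-frame input length used here is only illustrative:

```python
# Channel widths of the ten consecutive blocks and the temporal strides
# that halve the frame dimension at the 5th and 8th blocks.
channels = [64] * 4 + [128] * 3 + [256] * 3
strides = [2 if block in (5, 8) else 1 for block in range(1, 11)]

frames = 64                 # illustrative input sequence length
for s in strides:
    frames //= s            # temporal dimension after each block
```

After both halvings, a 64-frame input is reduced to 16 frames while the channel width grows from 64 to 256.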

C. PART-AWARE GRAPHS
In this work, we have encountered skeletons with 25 joints, as shown in FIGURE 3, which have been produced by the Microsoft Kinect v2. Motivated by the emphasis put on different parts of the skeleton during an action, we have considered not only global information of the skeleton but also local counterparts that can be expressed by a set of part-aware graphs. In our approach, the proposed set of graphs comprises the following: a) the ''Full graph'' consists of all twenty-five (25) joints, b) the ''Arms'' graph consists of thirteen (13) joints and can capture arm-dependent actions (e.g. 'type on keyboard'), c)-d) the ''Right Hand & Left Leg'' and ''Left Hand & Right Leg'' graphs can capture actions in which opposite arm-leg movement takes place ('walking apart', 'run on the spot') and e) the ''Legs'' graph can capture leg-dependent actions ('kicking', 'side kick', 'kicking something').
Furthermore, for each node of a graph, we denote three distinct connections that build the corresponding topology of the graph. As shown in FIGURE 4, the blue arrows denote the ''self'' connections, which connect a node with itself. The green arrows denote the ''inwards'' connections, directed toward the center of mass. The red arrows denote the ''outwards'' connections, which are directed away from the center of mass. The topologies for the parts (a)-(e) are shown in FIGURE 5.
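Extracting a part-aware graph from the full skeleton amounts to index-selecting joints in both the feature tensor and the adjacency matrix. The following is a sketch; the toy indices below are hypothetical and do not correspond to the actual Kinect v2 arm joints:

```python
import numpy as np

def extract_part(x, A, part_joints):
    # x: (C, T, V) skeleton features; A: (V, V) adjacency matrix;
    # part_joints: indices of the joints kept in the part-aware graph.
    idx = np.asarray(part_joints)
    return x[:, :, idx], A[np.ix_(idx, idx)]

# Toy 5-joint skeleton; keep a hypothetical 3-joint "part".
x = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)
A = np.eye(5)
part_x, part_A = extract_part(x, A, [0, 2, 4])
```

Because the adjacency is sliced with the same indices, any connection between two kept joints survives, while connections to removed joints are dropped along with them.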

D. STREAM FUSION POLICY
In the state-of-the-art approaches [14], [15], [16], [17], a four-stream fusion is used, in which the scores produced by different networks are combined. Each network is trained with a different modality, namely the joints, the bones, the joint motions and the bone motions. However, in [18] an early fusion of the aforementioned modalities is presented, where the network is trained in an end-to-end manner for all the modalities and fusion takes place in an intermediate layer of the network via concatenation. In our methodology, we investigate multiple-stream fusion schemes taking into account different combinations of part-aware graphs and modalities as well. In particular, the employed modalities comprise (i) joints, (ii) bones and (iii) concatenated joints and bones (cat), and the employed part-aware graphs comprise (i) the ''full graph'' and (ii) the ''arms''. The full pipeline of the proposed model is shown in FIGURE 6.
The fusion of the different modalities and part-aware graphs is computed as a weighted sum of the final scores, as shown in Algorithm 1. However, the mechanisms to determine the actual values of the weights are not addressed in previous methodologies. Towards this end, we employ a grid search where all the different combinations of weights in a range from 0.1 to 1 with a step of 0.1 are considered and evaluated on a hold-out set, as shown in Algorithm 2. The hold-out set is chosen from the training set by randomly selecting 15% of the actions. Then, the weights with which the highest accuracy on the hold-out set is achieved are used to determine the final fusion accuracy on the testing set. For a detailed comparison between the different feature and graph representations, the reader is referred to section IV-C1.
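The weighted-fusion grid search can be sketched as follows. Two toy streams are used instead of the six of Algorithm 2 to keep the search small, and the logits and labels are synthetic:

```python
import itertools
import numpy as np

def fuse(logits, weights):
    # Weighted sum of per-stream class logits, as in Algorithm 1.
    return sum(w * l for w, l in zip(weights, logits))

def grid_search(stream_logits, labels, grid=None):
    # Try every combination of weights in {0.1, ..., 1.0} on the
    # hold-out set and keep the combination with the highest accuracy.
    if grid is None:
        grid = [round(0.1 * k, 1) for k in range(1, 11)]
    best_w, best_acc = None, -1.0
    for ws in itertools.product(grid, repeat=len(stream_logits)):
        preds = fuse(stream_logits, ws).argmax(axis=1)
        acc = float((preds == labels).mean())
        if acc > best_acc:
            best_w, best_acc = ws, acc
    return best_w, best_acc

# Two toy streams over 4 hold-out samples and 3 classes.
rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 1])
s1, s2 = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
weights, acc = grid_search([s1, s2], labels)
```

With six streams the grid has 10^6 combinations, which remains tractable because each evaluation is only a weighted sum of cached logits rather than a forward pass.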

IV. EXPERIMENTAL WORK
A. DATASETS
1) NTU RGB+D
NTU RGB+D [40] is the first large-scale dataset in skeleton-based action recognition that was produced with the Microsoft Kinect v2. It contains a total of 56,880 actions that are categorized into 60 classes and were performed by 40 different subjects. Out of the 60 classes, 49 contain a single subject (e.g. 'drink water', 'take off glasses') and the rest (11) contain two subjects (e.g. 'hugging', 'shaking hands'). The authors recommend two benchmarks for training and evaluation: (1) Cross-Subject (CS), where 20 of the 40 subjects are used for training and the rest for evaluation, and (2) Cross-View (CV), where the training set consists of the actions from camera views 2 and 3 and the testing set of those from camera view 1.
2) NTU RGB+D 120
NTU RGB+D 120 [39] is an extension of NTU RGB+D [40] containing 60 more action classes, resulting in 120 in total. It consists of 114,480 skeleton sequences that were performed by 106 distinct subjects. The dataset contains thirty-two (32) camera setups according to the cameras' height and distance from the subjects. The authors propose two evaluation benchmarks: 1) Cross-Subject (CS), where the actions from fifty-three (53) subjects are used for training and the remaining 53 for testing, and 2) Cross-Setup (CSet), where the actions from sixteen (16) camera setups are used for training and the remaining are used for testing.
Both datasets are labeled according to the action performed, the subject, the camera id and the corresponding camera configuration. The CS evaluation category shows the model's ability to recognize actions performed by different subjects, which can be an indicator of performance in real-time action recognition scenarios.

B. EXPERIMENTAL SETTINGS
For backpropagation, the SGD optimizer is utilized with a momentum of 0.9 and a weight decay of 0.0004. The model is trained for a maximum of ninety (90) epochs with a warm-up strategy of five (5) epochs. The starting learning rate is set to 0.01. After the warm-up phase, it becomes 0.1 and then drops to 0.00001 with the cosine annealing strategy [41]. As a loss function, the cross-entropy loss is used. The batch size is set to 64 for both the NTU RGB+D and NTU RGB+D 120 datasets. The data preprocessing for both datasets is the same as in [15]. The model is implemented with PyTorch version 1.12.0 [42] and was trained with two NVIDIA GeForce RTX 3090 GPUs.
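The learning-rate schedule above can be written out explicitly. This is a sketch; whether the warm-up rate is held constant at 0.01 or ramped linearly is an assumption, since the text only states the start and end values:

```python
import math

def learning_rate(epoch, warmup=5, total=90, base_lr=0.1,
                  warmup_lr=0.01, eta_min=1e-5):
    # Warm-up rate for the first epochs, then cosine annealing from
    # base_lr down to eta_min over the remaining epochs.
    if epoch < warmup:
        return warmup_lr
    t = (epoch - warmup) / (total - warmup)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t))
```

In PyTorch, the annealing part corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=85` and `eta_min=1e-5`.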

C. ABLATION STUDY
In this section, the results of the model are reported for the categories CS and CV of the NTU RGB+D dataset and for the categories CS and CSet of the NTU RGB+D 120 dataset. The model was trained for different configurations of the part-aware graphs and the feature modalities under consideration.

1) CONFIGURATIONS IN THE FUSION SCHEME
In Table 1 we present the results of our comparative study for both NTU RGB+D and NTU RGB+D 120 datasets.
It is shown that the cat modality achieves better accuracy than the bone and joint modalities in almost all the experiments, which shows that the network can effectively combine the two main modalities when they are provided together. Furthermore, when the cat modality is used in the context of a three-stream fusion, the recognition accuracy further increases in all the cases.
In TABLE 1, 'rhll' and 'lhrl' denote the 'right hand & left leg' and the 'left hand & right leg' pairs, respectively, as defined in Section III-C. In rows a-l of TABLE 1, different types relate to different combinations. In particular, types a-e relate to the use of a single part-aware graph. First of all, the highest accuracy for all the different modalities is achieved with the full graph (type a) representation. When combined with the arms graph (type f), it leads to comparatively higher results. The reason is that NTU RGB+D and especially NTU RGB+D 120 contain many actions that capture fine-grained hand and finger motions and object-related actions, in which the rest of the human body stays still. Furthermore, with the 'rhll' graph (type c) the accuracy for all modalities is significantly higher than with the 'lhrl' graph (type d). Although the hand preference of the subjects is not provided, this accuracy gap suggests that most subjects are right-handed. With the legs representation, the model achieves the lowest accuracy compared to the other graphs, because the datasets contain very few actions in which the legs are the dominant moving body part. In the cases of types f, g, h, i, j, k and l, the fusion accuracies of multiple combinations of different part-aware graphs and modalities are presented. The proposed stream is the combination of the full graph and the arms graph (type f) for all three (3) modalities, as it achieves the best accuracy for all the categories of both datasets.

2) CLASS-SPECIFIC PERFORMANCE STUDY
In TABLES 2 and 3, the three-stream fusion accuracy of the ''full graph'' is compared with the three-stream fusion accuracy of the ''arms'' and ''legs'' graphs, respectively. The chosen actions mostly involve the ''arms'' and the ''legs'' of the human body, to highlight the importance of the networks with part-aware graphs. Additionally, the six-stream fusion accuracy is reported, which combines both graphs. As shown in TABLE 2, the accuracy of the model for some actions such as 'staple book', 'play magic cube', 'make victory sign', 'point finger' and 'play with phone' is higher on the three-stream of the arms, which shows that the rest of the joints are redundant and the network can focus on the ''arms'' in a more effective way when the remaining joints are excluded. However, in actions like 'take off glasses' and 'put on glasses' the model does not perform well on the arms stream, because the position of the head may become crucial for the recognition. The six-stream fusion is the best-performing stream, achieving higher accuracy than the three-stream configurations in the majority of the classes.
For a better analysis of the arm-dependent actions, FIGURES 7, 8 and 9 present the confusion matrices for the arm-dependent actions of TABLE 2 for the 'full graph & arms 6s', 'full graph 3s' and 'arms 3s' configurations, respectively. It is worth noting that the action 'staple book' is mostly confused with the 'cutting paper' action due to their similarity, the action 'counting money' is often confused with the actions 'play magic cube', 'cutting nails' and 'play with phone', and the action 'writing' is often confused with the actions 'reading' and 'type on keyboard'. The fact that arm-dependent actions cannot be easily distinguished shows the need for an analysis focused on the arms of the human body.
In TABLE 3, the performance of part-aware graphs for leg-dominant actions is presented. Specifically, the ''legs'' graph performs worse than the ''full graph'' three-stream. However, considering that the ''legs'' graph contains only the nine joints of the legs, the accuracy gap is very small. The six-stream fusion accuracy surpasses both individual configurations, which highlights the importance of incorporating part-aware representations; however, it remains inferior to the cases focusing on the arms.

3) LEARNED TOPOLOGIES VISUALIZATION
As mentioned in section III-B, the shared topologies are initialized with the different part-aware graphs (section III-C) and are subject to change during training, allowing the model to obtain the optimal topologies. Several examples of the learned shared topologies of three CTR-GC blocks, initialized with different graphs and connection types, are shown in FIGURE 10. Although the initial topologies adhere to the physical topology of the human body, the topologies learned by the model contain connections between all joints, highlighting the need for non-adjacent connections between joints for a more accurate analysis of the human body. This is a sound example of the need for the capability to learn new connections beyond the standard connections of the human body skeleton. Although the learned shared topologies cannot be tied to specific actions, since they are used for the complete set of actions considered, the stroke width of each connection emphasizes its importance towards optimizing performance.

4) COMPARISON WITH STATE-OF-THE-ART
In this section, we compare the proposed approach with the state-of-the-art methods. In TABLE 4 the accuracy of several state-of-the-art methodologies is reported for the 'CV' and 'CS' categories of the NTU RGB+D dataset. The performance of the proposed six-stream approach surpasses the four-stream accuracy of CTR-GCN in both categories but does not perform better than Info-GCN. In TABLE 5 the accuracy of state-of-the-art methods is reported for the 'CSet' and 'CS' categories of the NTU RGB+D 120 dataset. Because NTU RGB+D 120 is a superset of NTU RGB+D with a variety of extra arm-dominant actions, the proposed configuration outperforms both CTR-GCN and Info-GCN in both categories, achieving state-of-the-art performance.

V. CONCLUSION
In this paper, a novel skeleton-based action recognition approach has been proposed. The first contribution of the proposed approach is the use of a set of part-aware graphs that emphasize different parts of the human skeleton, aiming to improve the performance for specific classes. The effectiveness of our method lies in the fact that a high number of actions depending on the arms of the human body is present in the datasets; thus, the arms graph in combination with the full graph proves to be the best-performing configuration. The second contribution is a novel feature representation based on the concatenation of the joints and bones, which further enhances the recognition accuracy. The proposed method effectively improves the capability of the network in capturing fine-grained hand and finger motions and fine-grained object-related individual actions, while it achieves competitive results on the NTU RGB+D dataset and compares favorably with the state-of-the-art on the NTU RGB+D 120 dataset.

FIGURE 1. The architecture of the CTR-GC block.

FIGURE 2. The architecture of the basic block of the CTR-GCN model.

FIGURE 4. The three different neighbors of each node. The yellow node is the target node. The blue, green and red arrows denote the ''self'', ''inwards'' and ''outwards'' connections, respectively.

FIGURE 6. Block diagram of the proposed weighted fusion pipeline using the best-performing combination of part-aware graphs.

FIGURE 7. Confusion matrix for arm-dependent actions of TABLE 2 for the 'full graph & arms 6s' fusion configuration.

FIGURE 8. Confusion matrix for arm-dependent actions of TABLE 2 for the 'full graph 3s' fusion configuration.

Algorithm 1. Multi-Stream Fusion of the Different Models. The Indices 'a-j', 'a-b', 'a-c', 'f-j', 'f-b', 'f-c' Denote the Arms-Joint, Arms-Bone, Arms-Cat, Full Graph-Joint, Full Graph-Bone and Full Graph-Cat Representations, Respectively. x Denotes an Action. M Is the Set of the Trained Models and M(x) Denotes the Predicted Logits of Model M for Each Class in the Dataset. W Is the Set of Weights Used for Each Model. y Denotes the Final Predicted Label for the Action x.

Algorithm 2. The Grid Search Procedure to Determine the Best Weights for the Weighted Fusion. X and Y Denote the Set of the Actions Used for Grid Search and the Corresponding True Labels, Respectively. The Total Number of Actions Is Denoted as N. M Is the Set of Trained Models for the Different Graph Representations and Features. M_a-j, M_a-b, M_a-c, M_f-j, M_f-b, M_f-c Denote the Trained Model for the Arms-Joint, Arms-Bone, Arms-Cat, Full Graph-Joint, Full Graph-Bone and Full Graph-Cat Representation, Respectively. W Denotes the Set of the Determined Best Weight for Each Model. Input:

1: for w_a-j ∈ {0.1, 0.2, ..., 1.0} do
2:   for w_a-b ∈ {0.1, 0.2, ..., 1.0} do
3:     for w_a-c ∈ {0.1, 0.2, ..., 1.0} do
4:       for w_f-j ∈ {0.1, 0.2, ..., 1.0} do
5:         for w_f-b ∈ {0.1, 0.2, ..., 1.0} do
6:           for w_f-c ∈ {0.1, 0.2, ..., 1.0} do
...
end for

FIGURE 9. Confusion matrix for arm-dependent actions of TABLE 2 for the 'arms 3s' fusion configuration.

FIGURE 10. Learned shared topologies of (a) the full graph, (b) arms, (c) right hand & left leg, (d) left hand & right leg, (e) legs. Note that the stroke width of the red lines denotes the strength of the connection.

TABLE 1. Comparative results of the performance of different fusion scenarios for the categories 'CS' and 'CV' of NTU RGB+D and the categories 'CS' and 'CSet' of NTU RGB+D 120.

TABLE 2. Accuracy for arm-dependent actions using the full graph, the arms graph and the fusion of both for the 'CS' category of the NTU RGB+D 120 dataset.

TABLE 3. Accuracy for leg-dependent actions using the full graph, the legs graph and the fusion of both for the 'CS' category of the NTU RGB+D 120 dataset.

TABLE 5. Performance comparison between the proposed method and several state-of-the-art methods on the NTU RGB+D 120 dataset.