Spatial Temporal Graph Deconvolutional Network for Skeleton-Based Human Action Recognition

Benefiting from the powerful ability of spatial temporal Graph Convolutional Networks (ST-GCNs), skeleton-based human action recognition has achieved promising success. However, node interaction through message propagation does not always provide complementary information. Instead, it may even produce destructive noise and thus make the learned representations indistinguishable. Inevitably, the graph representation also becomes over-smoothed, especially when multiple GCN layers are stacked. This paper proposes spatial temporal graph deconvolutional networks (ST-GDNs), a novel and flexible graph deconvolution technique, to alleviate this issue. At its core, the method provides better message aggregation by removing the embedding redundancy of the input graphs at the node-wise, frame-wise, or element-wise level at different network layers. Extensive experiments on three of the most challenging current benchmarks verify that ST-GDN consistently improves performance and substantially reduces model size on these datasets.


I. INTRODUCTION
In recent years, skeleton-based action recognition has attracted great attention since skeleton data is more compact and more robust to complex backgrounds than RGB inputs [1]-[6]. Formerly, deep neural models, including both Convolutional Neural Networks (CNNs) [1], [7]-[10] and Recurrent Neural Networks (RNNs) [4], [11], [12], achieved promising results and became the mainstream methods since they are able to automatically learn more distinguishable features from data. Nevertheless, as with other irregular data, conventional neural networks such as CNNs and RNNs are designed for Euclidean space, so skeleton-based action recognition does not benefit significantly from them. Fortunately, remarkable improvements have been witnessed since GCNs were introduced into this task [13]-[19]. Yan et al. first proposed to use a spatial-temporal GCN for this task [14], and it has become one of the most common frameworks for skeleton-based action recognition. Derived from ST-GCN, Shi et al. proposed to add a virtual topology to involve more semantic information [18]. Likewise, Peng et al. [19] turned to neural architecture search (NAS) [20] to automatically construct the ST-GCN module for this task. Nevertheless, as mentioned before, ST-GCNs capture and extract graph embeddings via a message passing paradigm, which makes the representations of different nodes indistinguishable from each other. Message propagation can enhance the interactions between correlated nodes, especially those linked by the topology structure. However, interaction with unrelated nodes may bring not complementary information but noise, which may even harm the original node representation. Especially at a high semantic level, node interactions based on topology connections may lead to very similar embeddings, which is unreasonable.
In this letter, we propose a novel graph neural architecture, referred to as the spatial temporal graph deconvolutional network (ST-GDN), to deal with the aforementioned issue. As shown in Fig. 1, the deconvolution operation provides a filter that reshapes and transforms the graph features before the filtering. By changing the coordinates to a new feature space, the feature embeddings are standardized and decorrelated from each other. With the deconvolutional operations, we remove the correlations and redundancy of the graph representations at the node-wise, frame-wise, or element-wise level. Finally, we evaluate our model on three of the most challenging current skeleton-based human action recognition tasks. Our contributions can be summarized as follows:
• We present a novel and flexible Graph Deconvolution Network (GDN), which is designed to address the graph over-smoothing problem and can be easily plugged into various graph neural networks.
• We apply this model to skeleton-based human action recognition tasks. The results on three of the most challenging current datasets show that our model achieves the best performance on all reported evaluation metrics in an efficient manner.

II. PROPOSED APPROACH
In this section, we detail the theory of our approach and its four blocks, referred to as ST-GDNs.
1) ST-GCNs:
The basic GCN model here is designed based on the Chebyshev polynomial approximation [21], [22], in which the normalized Laplacian matrix $L$ has a complete set of orthonormal eigenvectors $U = [u_1, u_2, \ldots, u_n]$, which are the Laplacian eigenvectors associated with non-negative eigenvalues. Then, taking the eigenvectors of the normalized Laplacian matrix as a set of bases, the graph convolution operator is defined by the Fourier transformation. Current GCNs approximate this convolution operation with a $(K-1)$-th order polynomial expansion. Since our inputs are sequences of skeletons, we introduce temporal filters to capture the dynamic information in the inputs. Inspired by (2+1)D convolution networks [23], we let the above GCN be followed by a temporal filter, so that the output representation is
$$Y = \Theta^{t}_{\tau} * \Big( \sum_{k=0}^{K-1} \theta_k T_k(L) X \Big), \tag{1}$$
where $X$ is the input representation, $\Theta^{t}_{\tau}$ is a 1D temporal convolutional filter with kernel size $\tau$, $T_k(\cdot)$ is the $k$-th order Chebyshev polynomial, and $\theta_k$ is the polynomial coefficient for the $k$-th order.
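As a rough illustration of the spatial-then-temporal filtering described above, the following NumPy sketch evaluates a Chebyshev polynomial graph filter and then applies a 1D temporal kernel. The function names, the rescaling $\tilde{L} = L - I$ (which assumes the largest Laplacian eigenvalue is about 2), the zero padding, and the single temporal kernel shared across nodes and channels are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def chebyshev_graph_conv(x, lap, theta):
    """x: (N, T, F) features, lap: (N, N) Laplacian, theta: K coefficients.

    Returns sum_k theta[k] * T_k(L_tilde) x, with L_tilde = L - I
    (assumption: lambda_max is approximately 2).
    """
    l_tilde = lap - np.eye(lap.shape[0])
    t_prev = x                                   # T_0(L_tilde) x = x
    out = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = np.einsum("ij,jtf->itf", l_tilde, x)   # T_1(L_tilde) x
        out = out + theta[1] * t_curr
        for k in range(2, len(theta)):
            # Chebyshev recurrence: T_k = 2 * L_tilde * T_{k-1} - T_{k-2}
            t_next = 2.0 * np.einsum("ij,jtf->itf", l_tilde, t_curr) - t_prev
            out = out + theta[k] * t_next
            t_prev, t_curr = t_curr, t_next
    return out

def temporal_conv(x, kernel):
    """1D filter of size tau along the frame axis with zero padding."""
    tau = len(kernel)
    pad = tau // 2
    xp = np.pad(x, ((0, 0), (pad, tau - 1 - pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for s in range(tau):
        out += kernel[s] * xp[:, s:s + x.shape[1], :]
    return out

# Toy example: a 3-joint chain observed over 5 frames with 2 channels.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
lap = normalized_laplacian(adj)
x = np.random.default_rng(0).normal(size=(3, 5, 2))
y = temporal_conv(chebyshev_graph_conv(x, lap, [0.5, 0.3, 0.2]), [0.25, 0.5, 0.25])
```

In a trained network the coefficients and the temporal kernel would be learnable and channel-mixing; the fixed scalars here only show the order of operations.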

2) ST-GDNs:
The basic architecture in the network is a spatial graph model(s) followed by a temporal model, as illustrated in Fig. 1. The spatial model can be either a GCN, a node-wise GDN, or their combinations. The temporal model can be a temporal convolutional filter (TCN) or a temporal deconvolutional network (TDN). Given a sequence of human skeletons or skeleton feature embeddings, the ST-GDNs output higher level representations (Output Graph Features in Fig. 1) and a dynamic graph embedding matrix (Matrix A in Fig. 1). Based on the representations, a dynamic graph embedding matrix is also provided for each GCN layer. We stack multiple ST-GDNs to learn the graph embedding.
Over-smoothing leads to very similar representations for each node or even each feature element. Our ST-GDNs address this problem by filtering the graph with representation standardization. By changing the coordinates to a new feature space, the feature embeddings are standardized and decorrelated from each other, which relieves the issue. We first introduce the proposed node-wise deconvolution; it generalizes easily to the frame-wise and element-wise models. Assume $X \in \mathbb{R}^{N \times T \times F}$ is the $N$-node graph representation for a $T$-frame skeleton sequence, where the feature dimension of each node is $F$. From Eq. (1), a convolutional filter is introduced to capture the node embeddings of the tensor $X$. Assume that the kernel size of the convolutional filter is $k_1 \times k_2$. In practice, the filter is multiplied with an unfolded tensor of $X$, which contains much redundancy and unavoidably leads to the over-smoothing issue. Suppose we unfold the embedding $X$ into $m$ feature blocks, each containing $N \times k_1 \times k_2$ elements. The number of feature blocks $m$ is
$$m = \prod_{i} \left\lfloor \frac{S_i + 2 p_i - d_i (k_i - 1) - 1}{t_i} + 1 \right\rfloor. \tag{2}$$
The index $i$ belongs to the set $\{1, 2\}$ since the GCN is a 2D operation, and $S_i \in \{T, F\}$ is the feature length along each dimension. The parameters $p_i$, $d_i$, $k_i$ and $t_i$ are the padding value, dilation value, kernel size and stride value for the corresponding dimension, respectively. Thus, we obtain the unfolded feature embedding $X \in \mathbb{R}^{(N \times k_1 \times k_2) \times m}$. Instead of directly filtering this tensor, we perform the deconvolution on it. Once we have this unfolded embedding, we first calculate the mean node embedding $\mu \in \mathbb{R}^{N \times k_1 \times k_2}$ over the $m$ feature blocks. Then, the covariance matrix can be written as
$$C = \frac{1}{m} (X - \mu)^{\top} (X - \mu). \tag{3}$$
To make it more stable, a very small value is added to the diagonal of the matrix.
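The unfolding and covariance steps above can be sketched in NumPy as follows. The block count uses the standard sliding-window output-size formula (the same one documented for PyTorch's `torch.nn.Unfold`). The helper names, the restriction of `unfold_blocks` to stride 1 without padding or dilation, the value `eps=1e-5`, and the choice to store each block as a row (so that a later right-multiplication by $C^{-1/2}$ is well defined) are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def num_blocks(sizes, kernel, padding=(0, 0), dilation=(1, 1), stride=(1, 1)):
    """Number of sliding blocks m over the (T, F) dimensions."""
    m = 1
    for s, k, p, d, t in zip(sizes, kernel, padding, dilation, stride):
        m *= (s + 2 * p - d * (k - 1) - 1) // t + 1
    return m

def unfold_blocks(x, kernel):
    """Unfold x of shape (N, T, F) into m rows of N * k1 * k2 elements.

    Simplification: stride 1, no padding, no dilation.
    """
    k1, k2 = kernel
    windows = sliding_window_view(x, (k1, k2), axis=(1, 2))  # (N, T', F', k1, k2)
    windows = windows.transpose(1, 2, 0, 3, 4)               # (T', F', N, k1, k2)
    return windows.reshape(-1, x.shape[0] * k1 * k2)

def block_covariance(blocks, eps=1e-5):
    """Mean over the m blocks and the regularized covariance matrix."""
    mu = blocks.mean(axis=0)
    centered = blocks - mu
    cov = centered.T @ centered / blocks.shape[0]
    return mu, cov + eps * np.eye(cov.shape[0])   # eps on the diagonal for stability

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 6, 4))        # N=3 nodes, T=6 frames, F=4 features
blocks = unfold_blocks(x, (3, 3))     # m = 4 * 2 = 8 blocks of 27 elements
mu, cov = block_covariance(blocks)
```

For a skeleton sequence with T = 300 and F = 64 and a 3 × 3 kernel, `num_blocks((300, 64), (3, 3))` gives 298 × 62 = 18476 blocks, which shows how much overlap, and hence redundancy, the unfolded tensor carries.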
The graph deconvolution operation can be a GCN applied to the input feature transformed by the covariance matrix, that is:
$$Y = \sum_{k=0}^{K-1} \theta_k T_k(L) (X - \mu) C^{-\frac{1}{2}}. \tag{4}$$
Here, the transformed feature representation $(X - \mu) C^{-\frac{1}{2}}$ has an identity covariance matrix, since its covariance is
$$\frac{1}{m} C^{-\frac{1}{2}} (X - \mu)^{\top} (X - \mu) C^{-\frac{1}{2}} = C^{-\frac{1}{2}} C \, C^{-\frac{1}{2}} = I. \tag{5}$$
By changing the coordinates to a new feature space, the feature embeddings are standardized and decorrelated from each other. This operation can be considered a deconvolution since it negates the process of convolution. It means that for a delta kernel $\delta$, the transformed feature $X C^{-\frac{1}{2}}$ will not be changed by using the kernel $C^{\frac{1}{2}} \delta$. In this case, the deconvolution kernel is $C^{-\frac{1}{2}} \cdot \mathrm{vec}(\delta)$, where $\mathrm{vec}(\delta)$ is equal to slicing the middle row/column of $C^{-\frac{1}{2}}$ and reshaping it to the kernel size. However, calculating the inverse square root of a matrix is still computationally expensive and unstable. Instead of computing it directly, as in [24], we use coupled Newton-Schulz iterations to reduce the cost. The iteration starts with $Y_0 = C$, $Z_0 = I$, and is executed by
$$Y_{k+1} = \frac{1}{2} Y_k (3I - Z_k Y_k), \qquad Z_{k+1} = \frac{1}{2} (3I - Z_k Y_k) Z_k. \tag{6}$$
As the iteration proceeds, $Z_k$ converges to $C^{-\frac{1}{2}}$. The value of $Z_k$ after a few iterations is therefore used to approximate $C^{-\frac{1}{2}}$ as a trade-off between efficiency and accuracy. The deconvolutional operation in Eq. (4) can be further simplified by considering only the first-order polynomial approximation and setting the polynomial coefficients to $\theta = \theta_0 = -\theta_1$. The learnable $\theta$ is then expected to make the approximation more robust, and higher-order node connections can be captured by stacking multiple layers. Thus, the output of a single GDN layer can be represented as
$$Y = \theta (I - L)(X - \mu) C^{-\frac{1}{2}}. \tag{7}$$
To break the limitation on the higher-level graph embedding, as in [19], instead of providing a predefined correlation embedding matrix, we introduce a self-attention mechanism to automatically compute a dynamic embedding matrix based on the representation similarity:
$$A_{i,j} = \frac{\exp\big(\phi(x_i)^{\top} \psi(x_j)\big)}{\sum_{n=1}^{N} \exp\big(\phi(x_i)^{\top} \psi(x_n)\big)}. \tag{8}$$
Here, $A_{i,j}$ is the correlation between node $i$ and node $j$, and $x_i$ denotes the representation of node $i$.
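A minimal NumPy sketch of the coupled Newton-Schulz iteration for the inverse square root, followed by the whitening transform, is given below. The text starts the iteration from $Y_0 = C$; the sketch additionally divides $C$ by its Frobenius norm before iterating and rescales afterwards, which is a common way to keep the iteration inside its convergence region and is an assumption here rather than something stated in the paper. The function name and the toy data are also hypothetical.

```python
import numpy as np

def inv_sqrt_newton_schulz(c, num_iters=5):
    """Approximate C^{-1/2} for a symmetric positive definite matrix C."""
    n = c.shape[0]
    eye = np.eye(n)
    norm = np.linalg.norm(c)          # Frobenius norm (assumed pre-scaling)
    y = c / norm                      # Y_0, scaled so that eigenvalues lie in (0, 1]
    z = eye.copy()                    # Z_0 = I
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y = y @ t                     # Y_{k+1} = 1/2 * Y_k (3I - Z_k Y_k)
        z = t @ z                     # Z_{k+1} = 1/2 * (3I - Z_k Y_k) Z_k
    return z / np.sqrt(norm)          # undo the scaling: (C / s)^{-1/2} = sqrt(s) * C^{-1/2}

# Toy data: 200 samples (rows) with 4 correlated features.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.3, 0.0, 0.0],
                [0.0, 1.0, 0.3, 0.0],
                [0.0, 0.0, 1.0, 0.3],
                [0.0, 0.0, 0.0, 1.0]])
feats = rng.normal(size=(200, 4)) @ mix
mu = feats.mean(axis=0)
centered = feats - mu
c = centered.T @ centered / feats.shape[0] + 1e-5 * np.eye(4)

z = inv_sqrt_newton_schulz(c, num_iters=20)
whitened = centered @ z               # (X - mu) C^{-1/2}
whitened_cov = whitened.T @ whitened / feats.shape[0]
```

With enough iterations `whitened_cov` is close to the identity matrix; with only a handful of iterations (the experiments use five) the result is a cheaper approximation, which is the efficiency and accuracy trade-off mentioned above.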
The two projection functions φ(·) and ψ(·) map the features to another feature space, where the Gaussian similarity is used to measure the node correlation strength. Based on the above structure, we construct four kinds of deconvolution models. The first is the node-wise graph deconvolutional network ST-GDN 2, which combines the graph model described above with a TCN. To further benefit from the original GCN, we design ST-GDCN, which replaces the graph model of ST-GDN 2 with a union of GCN and GDN, so that both convolutional and deconvolutional feature embeddings are involved to promote graph representation learning. Next, we extend deconvolution to the frame-wise level, in which we first rearrange the feature shape along the temporal dimension, so that the feature representation X ∈ R N ×T ×F is reshaped to X ∈ R T ×N ×F. ST-GDN-T is then built from a normal GCN followed by a TDN, where the TDN is a temporal filter based on Eq. (7) in which the matrix (I − L) is replaced by an identity matrix. Finally, building on the above two models (ST-GDN 2 and ST-GDN-T), we combine GDN and TDN to build an element-wise ST-GDN, which is referred to as ST-GDN-E.
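The dynamic embedding matrix built from the two projection functions can be sketched as follows. This is only an illustration under stated assumptions: the projections φ and ψ are replaced by plain linear maps (the experiments describe channel-wise convolutional filters), the similarity is the exponentiated inner product of the projected features, and each row is normalized with a softmax. The function and parameter names are hypothetical.

```python
import numpy as np

def dynamic_adjacency(x, w_phi, w_psi):
    """Row-normalized similarity matrix A of shape (N, N).

    x:     (N, F) node features (for example, averaged over frames)
    w_phi: (F, E) weights standing in for the projection phi
    w_psi: (F, E) weights standing in for the projection psi
    """
    phi = x @ w_phi                                # phi(x_i) for every node i
    psi = x @ w_psi                                # psi(x_j) for every node j
    scores = phi @ psi.T                           # scores[i, j] = phi(x_i) . psi(x_j)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)   # softmax over j

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 16))          # 25 joints, 16 channels
w_phi = rng.normal(size=(16, 8)) * 0.1
w_psi = rng.normal(size=(16, 8)) * 0.1
adj = dynamic_adjacency(x, w_phi, w_psi)
```

Because the matrix depends on the current features, it can connect joints that are not linked in the skeleton topology, which is the motivation for computing it per layer rather than fixing it in advance.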

III. EXPERIMENTS
A. Experiment Settings
Our model has seven graph neural layers. As in previous works [14], [18], [19], a residual skip connection is applied to each graph convolutional block. The projection functions in Eq. (8) are implemented by two channel-wise convolutional filters. The number of channels at each level is consistent with the current state-of-the-art methods [14], [18], [19] for fair comparison. The last output feature maps are averaged into a vector, and a fully connected layer is then used for the final class prediction. All experiments are implemented in PyTorch [26], and we train our models for 50 epochs with the cross-entropy loss. SGD with Nesterov momentum (0.9) is used as the optimizer. The weight decay is set to 0.0005. The learning rate is initialized to 0.1 and decayed following a cosine function. We execute five coupled Newton-Schulz iterations.
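The text does not give the exact form of the cosine decay. One plausible reading, assumed here, is standard cosine annealing from the initial rate 0.1 down to 0 over the 50 training epochs, which is also the curve computed by PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR`. The helper below is a hypothetical pure-Python sketch of that schedule.

```python
import math

def cosine_lr(epoch, base_lr=0.1, total_epochs=50, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (assumed form)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start of each of the 50 epochs.
schedule = [cosine_lr(e) for e in range(50)]
```

The rate starts at 0.1, passes 0.05 at the midpoint, and approaches 0 at the end of training.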

B. Visualization
We first visualize the features of each node at the last graph layer to observe whether the proposed block can distinguish the embeddings. To this end, we compare the ST-GCN network with the node-wise ST-GDN 2 on the NTU RGB+D dataset under the Cross-Subject (CS) evaluation.
After training the two networks for 50 epochs, we choose all the samples from the same class for evaluation. Since there are 25 nodes in each graph, we assign a different color to each node. We average the features along the temporal dimension to obtain a 256-D representation for each node, and then visualize it using t-SNE [27]. Features from ST-GCN are shown in Fig. 2(a), while Fig. 2(b) presents the features obtained by ST-GDN 2. In Fig. 2(a), nodes from the same graph are located at nearly the same position, since different nodes have very similar feature representations. This is clearly an over-smoothing problem caused by the GCN. In contrast, as shown in Fig. 2(b), the node representations from ST-GDN 2 are clearly distinguishable, which shows that our method can alleviate this problem.

C. Ablation Experiments
Here, we evaluate the effectiveness of our method on the NTU RGB+D dataset under the CS evaluation. The current state-of-the-art 2S-AGCN [18] is used as the baseline. We also implement a seven-layer network, 2S-AGCN-7l, based on the block from 2S-AGCN. 2S-AGCN-7l has the same architecture settings as the ST-GDNs and is much smaller than 2S-AGCN. In this way, we can examine how well our model performs compared with a GCN under the same settings. Finally, we combine all our ST-GDNs with the expectation of obtaining a better graph network. We empirically design our ST-GDN as follows: for the first four layers, we use four ST-GDCN blocks to capture a richer representation of the input. For the fifth layer, we insert a node-wise deconvolution, ST-GDN 2. Inspired by [19], we use temporal-wise and element-wise deconvolution at the higher layers (layers six and seven) to enhance the importance of temporal information. In this way, we build a more powerful network (i.e., ST-GDN) for this task.
It can be seen from Table I that all our networks achieve better results than the baseline method 2S-AGCN. If we reduce 2S-AGCN to seven layers, which is the depth of our networks, our ST-GDN outperforms it by 6.4% and 9.1% on joint and bone data, respectively. This shows the effectiveness of our method. Besides, the results show that the network benefits more from frame-level deconvolution (ST-GDN-T), which is reasonable since the over-smoothing caused by 300 frames could be more serious than that caused by 25 nodes. Comparing ST-GDN-E with ST-GDN-T, we also find that directly adding deconvolution to all elements does not guarantee an improvement. This also supports the architecture design of ST-GDN.

NTU RGB+D dataset
Here, we compare with 14 state-of-the-art skeleton-based action recognition approaches under two evaluation metrics, i.e., the CS and Cross-View (CV) metrics. All the comparison results are listed in Table II. In this task, as in [18], [19], we build two-stream networks and report the best result after performing score-level fusion on joint and bone data. It can be seen from Table II that our model achieves the best performance under both evaluation metrics. Specifically, our model obtains the current best results of 89.7% and 95.9% on the CS and CV evaluations, respectively. Note that our model even outperforms the NAS-based GCN method [19]. Besides, the model size of the proposed method is about three times smaller than that of [19].
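Score-level fusion of the joint and bone streams can be illustrated with the short sketch below. The details are assumptions: each stream's class scores are passed through a softmax and the two are averaged with equal weight before taking the arg-max, a common choice in two-stream skeleton models. The function names, the weighting parameter, and the toy logits are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def fuse_streams(joint_logits, bone_logits, joint_weight=0.5):
    """Weighted sum of per-stream class probabilities, then arg-max."""
    scores = (joint_weight * softmax(joint_logits)
              + (1.0 - joint_weight) * softmax(bone_logits))
    return scores.argmax(axis=1)

# Two samples, three classes (toy values).
joint_logits = np.array([[2.0, 1.0, 0.0], [0.2, 0.1, 0.0]])
bone_logits = np.array([[1.5, 1.0, 0.0], [0.0, 3.0, 0.0]])
pred = fuse_streams(joint_logits, bone_logits)
```

In the second toy sample the joint stream is nearly undecided while the bone stream is confident, so the fused prediction follows the bone stream.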
NTU RGB+D 120 dataset
We compare with 14 skeleton-based action recognition approaches under the CS and Cross-Setup (CST) evaluation metrics. Here, we report the best result on joint data. All the comparison results are listed in Table III. We can see from Table III that our model outperforms the other compared approaches under both the CS and CST metrics. Compared with the current best CNN-based method [31], GCN-based methods obtain more than 10% improvement on average. This indicates that graph convolutional networks are much more suitable for this task. Comparison among the GCN-based methods also shows our superiority. For instance, when compared to the AS-GCN [17], which is the previous best model for this task, we can still get
Kinetics-skeleton dataset
We compare our method with seven different approaches. As in [18], [19], we report both top-1 and top-5 accuracy since this task is very challenging. All the comparison results are listed in Table IV. It can be seen from Table IV that our model achieves the best performance on both metrics. Specifically, we obtain the best Top-1 (37.3%) and Top-5 (60.5%) performance on the Kinetics-Skeleton dataset with score-level fusion. When using either joint or bone data alone, we also obtain the best results among methods using the same data. Also, the proposed model is about three times smaller than the current best one [19].

IV. CONCLUSION
In this letter, we present a novel and flexible spatial temporal graph deconvolutional network, ST-GDN, to address the graph over-smoothing issue in skeleton-based action recognition. ST-GDN provides a new graph deconvolutional operation that not only performs feature extraction but also transforms the graph representation so that it is standardized. Based on this model, we build four different kinds of ST-GDNs and empirically insert them at different levels of the network. In this way, we construct our ST-GDN, which captures more powerful graph embeddings for graph sequences. Compared with many state-of-the-art methods, the proposed model demonstrates its efficiency and accuracy on the corresponding metrics.