An Expert-Knowledge-Based Graph Convolutional Network for Skeleton-Based Physical Rehabilitation Exercises Assessment

Physical therapists play a crucial role in guiding patients through effective and safe rehabilitation processes according to medical guidelines. However, due to the therapist-patient imbalance, it is neither economical nor feasible for therapists to provide guidance to every patient during recovery sessions. Automated assessment of physical rehabilitation can help with this problem, but accurately quantifying patients’ training movements and providing meaningful feedback poses a challenge. In this paper, an Expert-knowledge-based Graph Convolutional approach is proposed to automate the assessment of the quality of physical rehabilitation exercises. This approach utilizes experts’ knowledge to improve the spatial feature extraction ability of the Graph Convolutional module and a Gated pooling module for feature aggregation. Additionally, a Transformer module is employed to capture long-range temporal dependencies in the movements. The attention scores and weight matrix obtained through this approach can serve as interpretability tools to help therapists understand the assessment model and assist patients in improving their exercises. The effectiveness of the proposed method is verified on the KIMORE dataset, achieving state-of-the-art performance compared to existing models. Experimental results also illustrate the interpretability of the method in both spatial and temporal dimensions.


Tian He, Yang Chen, Ling Wang, Member, IEEE, and Hong Cheng, Senior Member, IEEE

I. INTRODUCTION
The prevalence of chronic disease among aging populations profoundly impacts individuals' quality of life [1]. Proper rehabilitation support can alleviate resulting limitations in motor activity and social participation [2]. Current rehabilitation programs involve an initial inpatient phase followed by an outpatient phase where patients perform prescribed exercises independently [3]. However, studies have shown low patient compliance and incorrect execution of exercises without therapists [4], [5], [6], [7], resulting in longer treatment times and increased healthcare costs. Since it is unlikely for therapists to observe all exercise trials of a patient [8], the development of a quantitative model for automatically evaluating and guiding rehabilitation exercises is necessary. Such a model could improve patient compliance and exercise accuracy, and reduce healthcare costs, ultimately enhancing the effectiveness of rehabilitation programs.
In recent years, there has been a growing interest in the automated assessment of rehabilitation exercises [9], [10], [11], driven by the increasing demand for social rehabilitation. With the advancements in hardware technology, researchers have focused on developing computational approaches that can automatically evaluate human motion. These approaches can be broadly categorized into rule-based, template-based, and end-to-end machine learning methods.
Rule-based methods involve manually defining explicit rules [9], and the actions to be evaluated are compared with the rule sets. For example, in [12], [13], the authors used trunk flexion angles and the distance traveled by a set of joints to formulate rules for postural control, and trunk tilt angles for gait retraining. In [14], the authors proposed three types of rules, including dynamic rules, static rules, and invariance rules, to define different exercises. Manual rule formation is a time-consuming and labor-intensive task, as it necessitates the involvement of human experts to identify key action features and create corresponding rules. Moreover, these rules may not adequately account for individual differences or variations in rehabilitation exercises, resulting in limited accuracy and generalizability [15].

Fig. 1. Overview of the EGCN framework, architecture, and interpretability.
In the template-based approach, standard action templates are recorded, and similarities between actions and templates are calculated [10]. In [16], [17], the authors used Dynamic Time Warping (DTW) to calculate similarities between the joints' movement trajectories of the standard and observed actions. In [18], the authors defined a set of time-stamped characteristic points for each bone in each rotation axis and calculated the difference between standard and observed performances. In [19], the authors utilized a graph-based approach to align two dynamic skeleton sequences for a recognition task. These studies have demonstrated the feasibility of template-based rehabilitation assessment methods. However, accurately representing action performance through similarities between templates and observed actions is challenging due to patients' varying physical functional abilities, which can result in dynamic and temporal differences [20]. The accuracy and reliability of template-based assessment methods can fluctuate, as they rely on standard templates that may not fully capture the intricacies of individual patient movements.
Recently, end-to-end machine learning methods have gained more attention for their effectiveness in action assessment. These methods can be divided into manual feature extraction models and automatic feature extraction models. Manual feature extraction models require experts to identify motion descriptors from videos or skeleton data sequences as input [11], [15], [21], [22]. The performance of these methods heavily relies on manually extracted features, which imposes limitations on their predictive and generalization capabilities. Automatic feature extraction models use original data directly. In [23], the authors first proposed a deep-learning-based framework using Convolutional and Long Short-Term Memory networks to extract motion features from human skeleton data. However, the human skeleton essentially represents a graph in a non-Euclidean space, where joint coordinates serve as vertices and the bones between joints represent the edges of the graph, so using Convolutional Neural Networks (CNN) may result in the loss of important spatial information. In the domain of human action classification, which is closely related to human action assessment, Graph Convolutional Networks (GCN) [24] have demonstrated greater potential compared to CNN [24], [25], [26], [27], [28]. GCN can effectively leverage skeletal data from the human body, capturing movement dynamics across different parts of the body through the graph formed by the human skeleton structure. However, GCN-based approaches are not widely used for rehabilitation exercise assessment.
End-to-end assessment methods usually exhibit higher accuracy and stability but lack the ability to generate corresponding feedback or explanations, which is crucial for a model in the clinical field [29]. If the model cannot explain its result to support decision-making, therapists may lose trust in the model [30]. Moreover, due to the lack of proper feedback or instruction, patients cannot achieve the best recovery efficiency [31]. Therefore, it is important to develop an assessment method that not only provides accurate results but also generates feedback and explanations to guide therapists' decision-making and improve patients' recovery efficiency.
In this paper, we propose a novel end-to-end framework, as shown in Fig. 1, for accurately assessing motion quality and providing helpful feedback. We utilize skeleton sequences as model inputs due to their robustness to illumination and scene changes. The skeleton sequences are fed into two streams. The first stream consists of basic GCN for capturing common action information. The second stream incorporates Expert-knowledge-based GCN for rehabilitation-related spatial feature extraction. The extracted feature sequences are then aggregated by a Gated Attention Pooling module and further processed by a Transformer module [32]. The Expert-knowledge-based GCN can model the relationships between joints regardless of their topological information. The gated pooling module aggregates spatial features effectively and identifies important joint pairs for spatial explainability. The transformer encoder module enhances long-term spatial-temporal feature learning and indicates the importance of specific time steps for temporal interpretability. The global feature generated by the Transformer and the feature extracted by the basic GCN are added for final score prediction. The main contributions of this work are summarized as follows:
1) Expert's knowledge of rehabilitation exercises is introduced to construct the GCN module. The adjacency matrices of the GCN are built based on the comprehensive understanding of the rehabilitation actions, which results in improved performance on spatial feature extraction.
2) A novel gated-attention pooling mechanism is proposed, and a gated attention score regularization term is introduced in the loss function. This mechanism improves model accuracy and provides users with expert-like ...

II. METHODS
In our framework, skeleton sequences are fed into two streams. The first stream is used to extract general action spatio-temporal information. The second stream aims to effectively extract rehabilitation-related features. The outputs from both streams are combined to make the final prediction of assessment scores. The overall structure of our framework is shown in Fig. 2.

A. Preliminaries
1) Problem Formulation: A human skeleton graph can be denoted as $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_N\}$ is the set of $N$ joints and $E$ is the edge set representing bones, captured by an adjacency matrix $A \in \mathbb{R}^{N \times N}$. $A_{ij} = 1$ if an edge directs from $v_i$ to $v_j$, and $A_{ij} = 0$ otherwise. The specific exercise performed by a patient is captured in the form of a sequence of graphs $X = \{x_{t,i} \in \mathbb{R}^{C} \mid t = 1, \ldots, T;\ i = 1, \ldots, N\}$, where $x_{t,i}$ is a $C$-dimensional feature vector for node $v_i$ at time $t$. For each recorded exercise, a ground-truth score $y \in [0, 1]$ is given. This score is used to evaluate the efficacy of the rehabilitation training. A higher $y$ indicates superior patient performance, while a lower $y$ denotes suboptimal outcomes. The whole data set consists of tuples $\{(x_j, y_j) \mid j \in [1, D]\}$, where $D$ is the total number of recorded sequences. Our goal is to train a model $F$ to predict an exercise score $F(x)$ that minimizes the root mean square error to the ground-truth score $y$, which can be formulated as:

$$F^{*} = \arg\min_{F} \sqrt{\frac{1}{D} \sum_{j=1}^{D} \left( F(x_j) - y_j \right)^{2}} \tag{1}$$

2) Graph Convolutional Networks (GCN): Given the graph sequence defined above, with skeleton input features $X$ and graph structure $A$, the vanilla graph convolution for a single layer can be expressed as:

$$X_{out} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X W \right) \tag{2}$$

where $\sigma$ is a nonlinear activation function. $\tilde{A} = A + I$ is the skeleton adjacency matrix with an added identity matrix, which indicates that all nodes are self-connected. To deal with the vanishing or exploding problem in the neural network, $\tilde{A}$ is normalized to maintain the scale of the output feature vectors by multiplying by $\tilde{D}^{-\frac{1}{2}}$ on both the left and right sides. $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, which measures the degree of each node. $W \in \mathbb{R}^{C_{in} \times C_{out}}$ denotes a trainable weight matrix of the network. It is worth noting that this formulation can resemble the standard 2D convolution with a specific adjacency matrix $A$.
The vanilla GCN approach aggregates the 1-hop neighbors' information in the skeleton graph. Each layer in GCN only combines information in a small range, which may lose information over long-range linkages. In [33], the authors employed higher-order polynomials of the adjacency matrix to aggregate multi-scale structural information. With the $L$-order polynomial, the graph convolution network can reach further neighbors to increase the receptive field. The $L$-order polynomial can be expressed as:

$$X_{out} = \sigma\left( \sum_{l=0}^{L} \hat{A}^{l} X W_{l} \right) \tag{3}$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, and $l$ represents the power of the adjacency matrix.
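The normalization and the multi-order aggregation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of ReLU as the nonlinearity, and the choice of one weight matrix per power (starting at power 0) are assumptions made for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    deg = A_tilde.sum(axis=1)             # node degrees (>= 1 thanks to self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def poly_graph_conv(X, A, weights):
    """Polynomial graph convolution (assumed form: one weight matrix per power).

    X: (T, N, C_in) joint features, A: (N, N) skeleton adjacency,
    weights: list of L + 1 arrays of shape (C_in, C_out) for powers 0..L.
    """
    A_hat = normalize_adjacency(A)
    A_pow = np.eye(A.shape[0])            # A_hat to the power 0
    out = 0.0
    for W_l in weights:
        # mix joints with the current power of A_hat, then project the channels
        out = out + np.einsum("ij,tjc,cd->tid", A_pow, X, W_l)
        A_pow = A_pow @ A_hat             # move on to the next power
    return np.maximum(out, 0.0)           # ReLU stands in for sigma
```

On a three-joint chain, a first-order polynomial leaves the two end joints unaware of each other, while a second-order polynomial lets information travel between them, which is the receptive-field effect the paragraph describes.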

B. Expert-Knowledge-Based GCN
While higher-order polynomial GCNs have the advantage of increasing the receptive field to a larger scale, this approach tends to prioritize closer nodes, potentially limiting its effectiveness in capturing critical long-range joint dependencies during rehabilitation exercises. For example, in the case of the rehabilitation exercise "Bring a Cup to the Mouth," the connections between the hands and the head are particularly important. As shown in Fig. 3, the higher-order polynomial scheme employed in this method may result in less emphasis on the informative connection between the hands and head. This issue becomes more pronounced as the complexity of the exercise increases.
To address the aforementioned issue, we used expert knowledge to improve the efficiency of gathering spatial information in GCN for rehabilitation exercises. Unlike traditional action classification, rehabilitation exercises have stricter standards. For instance, the physical exercise "Lifting of the arms" described in [34] demands, from the experts' perspective, raising the arms above the head, maintaining elbow extension to stretch the trunk muscles, and avoiding anterior or posterior pelvic tilt. According to these expert requirements, accurately evaluating this rehabilitation action requires considering the relationships between the wrist joints and the head, the elbow joints and the spine, and the pelvis and the spine. We created specific adjacency matrices for this exercise based on the expert requirements, as illustrated in Fig. 4. Similarly, adjacency matrices for the other exercises can be constructed. Based on professional knowledge of the exercise, we define the adjacency matrix $\tilde{A}^{(k)}$ for the specific exercise $k$ as:

$$\left[ \tilde{A}^{(k)} \right]_{ij} = \begin{cases} 1, & \text{if } i = j \text{ or joints } v_i \text{ and } v_j \text{ are related according to the expert requirements of exercise } k \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

To improve the flexibility of graph convolutions, we add a learnable, unconstrained adaptive matrix $\tilde{A}_{adpt}$ to every $\tilde{A}^{(k)}$. $\tilde{A}_{adpt}$ is initialized as an all-zero matrix. Let $\hat{A}_k = \tilde{D}^{(k)-\frac{1}{2}} \tilde{A}^{(k)} \tilde{D}^{(k)-\frac{1}{2}} + \tilde{A}_{adpt}$, where $\tilde{D}^{(k)}$ is the diagonal node degree matrix of $\tilde{A}^{(k)}$, which measures the degree of each node. We then get the expression for the Expert-knowledge-based GCN:

$$X_{out,k} = \sigma\left( \hat{A}_k X W_k \right) \tag{5}$$

where $X_{out,k} \in \mathbb{R}^{T \times N \times C_{out}}$ and $k$ indexes the types of rehabilitation exercises.
In contrast to multi-order GCN, the expert-knowledge-based method accurately depicts joint relationships, leading to sparser adjacency matrices and improved efficiency in capturing motion features. On this basis, a relatively simple graph convolutional network is enough to capture spatial information. Consequently, our approach can be particularly suitable for small-scale rehabilitation data sets. Fig. 5 shows the structure of our expert-knowledge-based method.
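As a rough illustration of how an expert-knowledge-based adjacency matrix might be assembled, the sketch below turns the joint relationships listed for "Lifting of the arms" into a matrix. The joint subset, the index order, the mapping of "spine" to `SpineMid` and "pelvis" to `SpineBase`, and the choice to add the adaptive matrix after normalization are all assumptions, not details confirmed by the paper.

```python
import numpy as np

# Hypothetical joint subset and ordering, for illustration only.
JOINTS = ["Head", "SpineMid", "SpineBase",
          "ElbowLeft", "ElbowRight", "WristLeft", "WristRight"]
IDX = {name: i for i, name in enumerate(JOINTS)}

# Pairs read off the textual requirements: wrists-head, elbows-spine, pelvis-spine.
EXPERT_PAIRS = [
    ("WristLeft", "Head"), ("WristRight", "Head"),
    ("ElbowLeft", "SpineMid"), ("ElbowRight", "SpineMid"),
    ("SpineBase", "SpineMid"),
]

def expert_adjacency(pairs, idx):
    """Binary symmetric adjacency with self-connections from expert joint pairs."""
    A = np.eye(len(idx))
    for a, b in pairs:
        A[idx[a], idx[b]] = 1.0
        A[idx[b], idx[a]] = 1.0
    return A

def normalized_expert_adjacency(A_tilde, A_adpt=None):
    """Symmetric normalization plus an adaptive matrix (placement is an assumption)."""
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    if A_adpt is None:
        A_adpt = np.zeros_like(A_tilde)   # all-zero initialization, as in the text
    return A_hat + A_adpt
```

Only 5 of the 21 possible joint pairs are connected here, which reflects the sparsity argument made above.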

C. Gated Attention Pooling
1) Gated Attention Pooling Method: In the vanilla Graph Convolutional Network (GCN) framework, the feature representations obtained from adjacency matrices of various orders are treated equally. Their summation is then used to derive the final output, as shown in Eqn. (3). However, the importance of the representations should vary for different actions. For example, single-order representations might be effective in capturing the dynamics of actions like sitting down or standing up. On the other hand, multi-order representations are more suitable for actions such as drinking water or eating food. This discrepancy becomes more apparent in our expert-knowledge-based GCN framework. In this framework, the model should prioritize the features extracted through the corresponding expert-knowledge-based adjacency matrix, $\hat{A}_k$. Inspired by this idea, we propose a more rational pooling approach that allows for the weighted summation of different features, as shown in Fig. 6. Denote the features extracted with the adjacency matrix $\hat{A}_k$ as $X_{out,k} \in \mathbb{R}^{T \times C_{out} \times N}$. The attention score of each $X_{out,k}$ is defined as:

$$a_k = \frac{\exp\left\{ W \left( \tanh\left( U_1 X_{out,k} \right) \odot \mathrm{sigm}\left( U_2 X_{out,k} \right) \right) \right\}}{\sum_{j=1}^{K} \exp\left\{ W \left( \tanh\left( U_1 X_{out,j} \right) \odot \mathrm{sigm}\left( U_2 X_{out,j} \right) \right) \right\}} \tag{6}$$

where $W \in \mathbb{R}^{1 \times D}$ and $U_1, U_2 \in \mathbb{R}^{D \times C_{out}}$ are trainable parameters, $\odot$ is the Hadamard product, and $K$ is the number of exercise-specific adjacency matrices. The $\mathrm{sigm}(\cdot)$ function is introduced to provide non-linearity for learning complex relationships and to regulate the final expression of learned relationships [35]. $a_k$ is the attention score of the $k$th exercise adjacency matrix. The final feature $Z \in \mathbb{R}^{T \times C_{out} \times N}$ after aggregation for the rehabilitation exercise assessment is calculated as:

$$Z = \sum_{k=1}^{K} a_k X_{out,k} \tag{7}$$

2) Spatial Explainable Ability: Intuitively, a sample of exercise type $k$ should receive a higher attention weight $a_k$. The gated attention mechanism does not directly yield the classification probability for exercise $k$, but $a_k$ can serve as an approximation. Consequently, the attention mechanism facilitates the interpretation of the assessment results in the context of joint connection significance. We define the importance matrix $M$ as:

$$M = \sum_{k=1}^{K} a_k \hat{A}_k \tag{8}$$

where $[M]_{ij}$ represents the importance of the connection between joints $v_i$ and $v_j$ in the model assessment. By combining the inherent topological structure of the human skeleton with the critical joint connections during physical exercise, we can pinpoint the body regions that require increased attention and focus. This offers valuable insights to both therapists and users.
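The pooling step and the importance matrix can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' code: the tanh branch follows the gated attention of [35], and averaging each feature tensor over time and joints before scoring is an assumption made here so that every adjacency matrix receives a single scalar score.

```python
import numpy as np

def gated_attention_scores(feats, W, U1, U2):
    """Softmax-normalized gated attention scores, one per adjacency matrix.

    feats: (K, T, N, C) features from K expert adjacency matrices,
    W: (1, D), U1 and U2: (D, C) trainable parameters.
    """
    h = feats.mean(axis=(1, 2))                      # (K, C), assumed pooling
    gate = np.tanh(h @ U1.T) * (1.0 / (1.0 + np.exp(-(h @ U2.T))))  # (K, D)
    logits = (gate @ W.T).ravel()                    # (K,)
    logits = logits - logits.max()                   # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def aggregate(feats, A_hats, scores):
    """Weighted sums of the features (Z) and of the adjacency matrices (M)."""
    Z = np.einsum("k,ktnc->tnc", scores, feats)
    M = np.einsum("k,kij->ij", scores, A_hats)
    return Z, M
```

With all-zero `W` the scores become uniform and `Z` reduces to the plain average of the features, which is the equal-treatment baseline that the gated pooling is meant to improve on.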
To mimic the therapist's assessment as closely as possible by assigning high attention scores to the adjacency matrix corresponding to the training exercise, we add an attention score regulation term to the loss function during the training process. The regulation term is defined using an indicator variable $y^{(k)}$, which marks whether a sample belongs to exercise $k$. In the full loss function used to train our model, $n$ is the training batch size, $y_i$ is the ground truth score for sample $i$, $\hat{y}_i$ is the predicted score for sample $i$, and $\alpha$ is a hyperparameter that controls the attention score regulation. We set $\alpha$ to decrease monotonically as the training epoch increases to ensure the final prediction effect. $\alpha$ is defined as a function of $e_1$, the number of epochs during the training process.
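One plausible way to combine a prediction loss with an attention score regulation term is sketched below. The exact formulas are not recoverable from this text, so every form here is an assumption: a cross-entropy penalty on the attention score of the true exercise, an RMSE prediction term, and a linear decay for $\alpha$. The function names are hypothetical.

```python
import math

def attention_regulation(scores, exercise_idx, eps=1e-12):
    """Assumed cross-entropy form: -log of the score given to the true exercise."""
    return -math.log(scores[exercise_idx] + eps)

def alpha_schedule(epoch, total_epochs, alpha0=1.0):
    """Assumed linear decay from alpha0 at epoch 0 to 0 at the final epoch."""
    return alpha0 * max(0.0, 1.0 - epoch / total_epochs)

def total_loss(y_true, y_pred, batch_scores, exercise_ids, alpha):
    """RMSE over the batch plus alpha times the mean regulation term."""
    n = len(y_true)
    rmse = math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)
    reg = sum(attention_regulation(s, k)
              for s, k in zip(batch_scores, exercise_ids)) / n
    return rmse + alpha * reg
```

Under these assumptions, the regulation term dominates early in training, pushing the attention toward the adjacency matrix of the performed exercise, and fades out later so that the score prediction drives the final epochs.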
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

D. Transformer for Temporal Information Fusion
1) Transformer for Temporal Modeling: In temporal information modeling, previous works [26] used temporal convolutions with fixed kernel sizes. However, this approach has limitations when applied to data involving individuals with varying action speeds. To address this, researchers have used a multi-scale temporal convolution framework (MSTCN) with different kernel sizes [36], [37]. To enhance the receptive field of MSTCN, larger kernels or dilation techniques have been employed [38]. While MSTCNs are flexible and robust, they can only integrate temporal features from adjacent time steps, which may not be suitable for rehabilitation exercises characterized by repetitive actions. To address this problem, we utilize a Transformer [32] for temporal information fusion, enabling the capture of long-range dependencies and repetitive patterns in rehabilitation exercises.
The main idea behind the Transformer is self-attention. Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. To produce the output vectors, the self-attention operation takes a weighted average over all the input vectors. Every input vector is used in three different ways in the self-attention mechanism: Query (Q), Key (K), and Value (V). Denoting the input sequence as $H_{in}$, three trainable weight matrices $W_q$, $W_k$, $W_v$ are used to compute three linear transformations:

$$Q = W_q H_{in}, \quad K = W_k H_{in}, \quad V = W_v H_{in} \tag{13}$$

The attention weight matrix $\hat{A}_{attn}$, which represents connections among input vectors, is defined as:

$$\hat{A}_{attn} = \mathrm{softmax}\left( \frac{K^{T} Q}{\sqrt{d_k}} \right) \tag{14}$$

To normalize the value of $K^{T} Q$, it is divided by the square root of $d_k$, the dimension of Query (Q) and Key (K). Then, the Value (V) is multiplied by $\hat{A}_{attn}$ to get $H_{out}$:

$$H_{out} = V \hat{A}_{attn} \tag{15}$$

2) Temporal Guidance Ability: As previously stated, the attention weight matrix $\hat{A}_{attn}$ captures the connections between input vectors [39], [40]. In the context of rehabilitation, our regression model focuses solely on a global feature, which is the weighted summation of all input features. As illustrated in Fig. 7, the row highlighted by the red rectangle represents the overall influence of input vectors from different timestamps on the final global features. This enables therapists to gain insights into how the model evaluates over time, while users can identify specific periods of their past workouts that require more attention and focus.
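A single-head version of the self-attention computation can be sketched as follows. The column-per-time-step layout is an assumption chosen so that the products $K^{T} Q$ and $V \hat{A}_{attn}$ from the text are well defined; `self_attention` and the matrix shapes are illustrative names and choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H_in, W_q, W_k, W_v):
    """Single-head self-attention with one column per time step.

    H_in: (d_model, T); W_q, W_k: (d_k, d_model); W_v: (d_v, d_model).
    Returns H_out of shape (d_v, T) and the attention matrix of shape (T, T).
    """
    Q, K, V = W_q @ H_in, W_k @ H_in, W_v @ H_in
    d_k = Q.shape[0]
    # column j holds the weights that output step j places on every input step
    A_attn = softmax(K.T @ Q / np.sqrt(d_k), axis=0)
    H_out = V @ A_attn                    # weighted average of the value vectors
    return H_out, A_attn
```

If a global token were appended as the last time step, `A_attn[:, -1]` would hold the weight of each input time step in the global feature under this layout. Whether those weights appear as a row or a column depends only on the layout convention.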

III. EXPERIMENT

A. Experiment Setup
1) Dataset: The KIMORE dataset [34] is collected by the Microsoft Kinect v2 sensor and consists of three synchronized data typologies, which include RGB, Depth videos, skeleton positions, and orientations in the format produced by the skeleton tracking system. Two physicians, specialized in Physical and Rehabilitation Medicine, selected five exercises usually adopted in rehabilitation programs for low back pain. These exercises, namely Exercise 1 (EX1) to Exercise 5 (EX5), are "Lifting of the Arms," "Lateral Tilt of the Trunk with Arms in Extension," "Trunk Rotation," "Pelvis Rotations on the Transverse Plane" and "Squatting." Each exercise has detailed action requirements from experts, and the dataset contains recordings of 44 healthy and 34 unhealthy subjects performing each exercise. For the healthy subjects, the average age was 36 years (mean (SD) = 36.7 (16.8) years), with 15 females and 29 males. For the unhealthy subjects, the average age was 60 years (mean (SD) = 60.44 (14.2) years), with 19 females and 15 males. Among the unhealthy subjects, 10 were suffering from stroke, 16 were suffering from Parkinson's disease, and 8 were suffering from back pain due to spondylosis. Ground truth assessments of each person's performance, determined by experts through a questionnaire designed to measure accuracy, are also provided. Our model uses the skeleton joint positions captured in each video as input data.
2) Assessment Metrics: In line with established methods [41], [42], [43], we evaluate the regression performance using Pearson's correlation ($\rho$), mean absolute error (MAE), and root mean squared error (RMSE). Since different therapists may assign slightly different scores to individual rehabilitation movements, the evaluation process should focus on measuring the consistency between the model's predictions and the ground truth. The correlation coefficient ranges between −1 and +1, with values closer to 1 indicating greater consistency between the model and the experts' assessment results. MAE quantifies the average absolute deviation between predicted scores and true scores, with lower values indicating better performance. RMSE is another commonly used assessment metric, which computes the square root of the average squared difference between predicted and actual values. RMSE is more sensitive to larger errors due to the squared difference. The assessment metrics can be formulated as:

$$\rho = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}} \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^{2}}}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}}$$

where $y_i$ and $\hat{y}_i$ are the ground truth and predicted scores for sample $i$, $\bar{y}$ and $\bar{\hat{y}}$ are their means, and $n$ is the number of evaluated samples.
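The three metrics are standard and can be computed with the Python standard library alone. The sketch below uses their textbook definitions; the function names are illustrative, and the inputs are assumed to be equal-length sequences of scores with non-zero variance.

```python
import math

def pearson(y_true, y_pred):
    """Pearson correlation between ground-truth and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

A model that is consistently 0.1 below the expert score on every sample still reaches a correlation of 1 while its MAE and RMSE are both 0.1, which is why the correlation is reported alongside the two error metrics.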
3) Implementation Details: The Adam optimizer was used to train all models with a batch size of 16 and an initial learning rate of 0.05. A step learning rate decay with a factor of 0.1 at [50, 300, 700] epochs was employed to adjust the learning rate. In order to remove systematic small jitters, convolution was applied to all skeleton sequences to smooth joint trajectories. The skeleton sequences were downsampled, selecting 1 frame every 5 frames, and resized to a length of 200 frames. Normalization was performed by subtracting the position of the pelvis at frame 0 for all skeleton sequences. To ensure fair performance comparison, we conducted filtering to remove abnormal data, resulting in a dataset comprising 389 instances. We also abstained from utilizing any data augmentation techniques. We first randomly partitioned the dataset into training (80%) and validation (20%) sets. An exhaustive grid search was then conducted based on the training and validation sets to determine the optimal hyperparameter configuration. To ensure result reliability, we employed a 5-fold training and testing approach based on the optimal hyperparameter configuration. The performance metrics from all folds were averaged to obtain the final results. The implementation of our model was carried out using PyTorch 1.13.
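The preprocessing steps listed above (smoothing, downsampling, resizing to 200 frames, and pelvis normalization) could look roughly like the following NumPy sketch. The moving-average kernel of size 5, the edge padding, the linear interpolation used for resizing, and the pelvis index are assumptions, since the text does not specify them.

```python
import numpy as np

def preprocess(seq, pelvis_idx=0, kernel=5, step=5, target_len=200):
    """Smooth, downsample, resize, and normalize one skeleton sequence.

    seq: (T, N, 3) joint positions. Returns an array of shape (target_len, N, 3).
    kernel must be odd so that the smoothed sequence keeps its length.
    """
    T, N, C = seq.shape
    flat = seq.reshape(T, N * C).astype(float)

    # 1) moving-average convolution along time (kernel size is an assumption)
    pad = kernel // 2
    padded = np.pad(flat, ((pad, pad), (0, 0)), mode="edge")
    w = np.ones(kernel) / kernel
    smooth = np.stack(
        [np.convolve(padded[:, j], w, mode="valid") for j in range(N * C)], axis=1)

    # 2) keep 1 frame out of every `step` frames
    down = smooth[::step]

    # 3) resize to a fixed length by linear interpolation (method is an assumption)
    src = np.linspace(0.0, 1.0, len(down))
    dst = np.linspace(0.0, 1.0, target_len)
    resized = np.stack(
        [np.interp(dst, src, down[:, j]) for j in range(N * C)], axis=1)
    resized = resized.reshape(target_len, N, C)

    # 4) subtract the pelvis position at frame 0 from every joint and frame
    return resized - resized[0, pelvis_idx]
```

After this step every sequence has the same length and starts with the pelvis at the origin, so sequences from subjects standing at different distances from the sensor become comparable.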

B. Comparison Against the State-of-the-Art
In this study, we compared our model's assessment performance on each exercise in the KIMORE dataset with state-of-the-art methods. Our framework outperformed all these models in the rehabilitation exercise assessment task in terms of ρ, MAE, and RMSE, as shown in Table I. HSMM [15] employs a Hidden Semi-Markov model approach based on handcrafted features for assessing exercise scores. Compared with the GCN-based methods, this model performs suboptimally. STGCN [26] is a well-established GCN model designed to capture the spatio-temporal dynamics of human motions from skeleton data. 2S-AAGCN [37] is an enhanced version of STGCN, introducing an adaptive graph convolution method that optimizes the graph's topology and other network parameters in a learning manner. GCN+LSTM [43] is a GCN-based model specifically designed for rehabilitation exercise assessment. It combines GCN with a trainable self-attention map to identify the importance of different joints and utilizes LSTM to handle temporal information. MS-G3D [38] utilizes disentangled multi-scale graph convolutions to better capture long-range spatial-temporal dependencies in human motions. In MS-SGN [44], a semantics-guided GCN with multi-scale TCN was designed to enhance joint feature representation. Particularly noteworthy is that our model consistently delivered accurate results across all exercises, while the performance of other models exhibited significant variations (see Fig. 8). These findings highlight the feature extraction capability of our proposed method.

C. Component Studies
We conducted component studies to evaluate the effectiveness of the expert-knowledge-based GCN (EGCN), gated attention pooling, and the transformer module in our overall architecture. The performance evaluation was performed exclusively on the joint position data of the five different exercises from the KIMORE dataset. Component models were trained using the root mean square error loss function. For a clearer component analysis, we only utilized root mean square error (RMSE) as the performance metric.
1) Expert-Knowledge-Based GCN: We compared the proposed EGCN method with different L-order Graph Convolutional Networks (GCN-L) with varying adjacency powers from 1 to 5. To demonstrate the feature extraction capabilities, we used a single stack of EGCN and GCN-L, followed ...

... Table III. The baseline model consisted of the EGCN with a multi-scale Temporal Convolutional network. We then added the Gated Attention Mechanism to EGCN (EGCN-att) and applied attention score regulation (EGCN-att-α) with varying values of the hyperparameter α. The results demonstrated that the integration of Gated Attention Pooling with the EGCN method led to a decrease in the average RMSE from 0.187 to 0.178. However, the RMSE slightly increased when employing a small α value and deteriorated significantly when using larger values of α. As mentioned in Section II-C.2, the attention weight $a_k$ can approximate the classification probability of class $k$. Intuitively, a higher classification accuracy derived from $a_k$ represents higher interpretability. To make the model more explainable, we proposed a varying α that achieved comparable performance to the EGCN-att model. Fig. 9 illustrates that a small α value results in lower interpretability but higher prediction accuracy, while a large α value yields higher interpretability but lower prediction accuracy. Our proposed α improves interpretability while maintaining prediction accuracy. Therefore, incorporating the proposed α for the regulation term effectively enhances the model's explainability.
3) Transformer: To compare the Transformer model and multi-scale temporal convolutions (MSTCN), we analyzed them while keeping the EGCN module fixed. We used a single-layer MSTCN and a single-head Transformer to show the Transformer's superiority. The assessment results are presented in Table V. Clearly, the Transformer exhibits a greater ability to aggregate temporal features compared to the MSTCN. This emphasizes the significance of incorporating the Transformer, especially when dealing with rehabilitation exercises that encompass repetitive movements and intricate patterns.

D. Ablation Study
In our model, skeleton sequences are fed into two streams. The first stream contains a stack of L-order GCNs followed by multi-scale TCNs. The second stream contains an EGCN module, a Gated Attention Pooling module (ATTN), and a Transformer module (TRANS). As the first stream is a commonly used framework, we focus on demonstrating the predictive capability of the second stream. We performed an ablation study by selectively removing the EGCN, ATTN, and TRANS modules from our architecture. The results are presented in Table IV, where we report ρ, MAE, and RMSE for each model. Our experiment indicates that the full model, incorporating all modules, achieves the highest performance across all evaluation metrics. This demonstrates that each module contributes to the overall predictive power of the model, and the removal of any one of them leads to a decrease in performance.

E. Model Interpretability
Our model offers both spatial and temporal interpretability through the importance matrix M defined in Eqn. (8) and the attention weight matrix Âattn defined in Eqn. (14). Each exercise record yields a specific importance matrix M, which aids in identifying the most critical joint connections. For each exercise, we also have an expert-knowledge-based adjacency matrix, whose joint connections correspond to the key points specified by professional rehabilitation movement requirements. By selecting the top 5 connections with the highest weights in the importance matrix M (excluding self-connected pairs), we can gain insights into the joint connections that the model prioritizes when assessing rehabilitation movements. Comparing the joint connections prioritized by the model with the key connections of the expert-knowledge-based adjacency matrix reveals differences that represent the specific movement requirements patients need to pay attention to during exercises. This analysis also provides clinicians with insights into why the model made a particular assessment. As shown in Fig. 10(a), for EX3, the high-scoring user's importance matrix closely resembles the expert-knowledge-based adjacency matrix. Conversely, for a low-scoring user, the importance matrix highlights the HandLeft-HipLeft and SpineShoulder-HipRight connections, indicating a need for greater attention to the trunk plane and avoidance of bending. Additionally, the high importance of the arms suggests an incorrect pose of the arms during training. Furthermore, the attention weights assigned to the global feature from each timestamp in the attention weight matrix Âattn enable the evaluation of different time periods, providing temporal interpretability. As shown in Fig. 10(b), higher attention weights correspond to more significant timestamps. Analyzing the actions at these highlighted timestamps reveals that the user was performing trunk rotations, indicating the need for increased attention to such movements in EX3.
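The selection steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our released code: the helper names `top_k_connections` and `top_k_timestamps` are hypothetical, M is assumed to be a square array indexed by joint, and the attention weights are assumed to be available as a one-dimensional array with one weight per timestamp. The `undirected` flag is an added convenience for the case where M is symmetric, so that each joint pair is reported once.

```python
import numpy as np

def top_k_connections(M, k=5, undirected=False):
    """Return the k highest-weight off-diagonal entries of M as (i, j, weight)."""
    M = np.asarray(M, dtype=float)
    scores = M.copy()
    np.fill_diagonal(scores, -np.inf)  # exclude self-connected pairs
    if undirected:
        # Keep only the upper triangle so each pair (i, j) appears once.
        scores[np.tril_indices(scores.shape[0])] = -np.inf
    flat = np.argsort(scores, axis=None)[::-1][:k]  # largest scores first
    rows, cols = np.unravel_index(flat, scores.shape)
    return [(int(i), int(j), float(M[i, j])) for i, j in zip(rows, cols)]

def top_k_timestamps(attn, k=3):
    """Return the indices of the k timestamps with the largest attention weight."""
    attn = np.asarray(attn, dtype=float)
    return [int(t) for t in np.argsort(attn)[::-1][:k]]

# Toy 4-joint importance matrix and attention weights (made-up values).
M = np.array([[9.0, 0.1, 0.5, 0.2],
              [0.3, 9.0, 0.8, 0.05],
              [0.4, 0.7, 9.0, 0.6],
              [0.15, 0.25, 0.35, 9.0]])
print(top_k_connections(M, k=3))
print(top_k_timestamps([0.1, 0.4, 0.2, 0.3], k=2))
```

The returned joint pairs can then be compared against the nonzero entries of the expert-knowledge-based adjacency matrix, and the returned timestamps can be used to pick the frames to visualize.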

IV. CONCLUSION
Physical rehabilitation programs often require the continuous presence of a therapist, which can be both costly and unfeasible. Automated assessment of physical rehabilitation can address this challenge by providing an objective assessment and offering improvement suggestions without the need for a therapist. In this paper, we propose an end-to-end machine learning approach based on the graph convolutional network. The proposed approach effectively extracts action features by leveraging expert knowledge of rehabilitative movements, thus achieving good performance even when the dataset is small. Our framework incorporates attention pooling and transformer mechanisms, providing attention scores and weight matrices. These mechanisms can aid therapists in understanding the model's judgments and assist patients in enhancing their training movements. Experimental results on the KIMORE dataset demonstrate that our framework achieved state-of-the-art performance compared to existing models, showcasing its strong potential for automated assessment in physical rehabilitation programs. For future work, we aim to further enhance the interpretability of our model, transforming implicit reminders into explicit feedback. Additionally, we plan to leverage large-scale action recognition datasets as pre-training resources to improve the accuracy of rehabilitation assessment.

Fig. 2. The overall structure of our framework. Skeleton sequences are input into two streams. The stream above employs a stack of L-order GCNs followed by Multi-Scale Temporal Convolution Networks (MSTCN) to extract general spatio-temporal information for actions. The stream below incorporates the expert-knowledge-based GCN for rehabilitation-related spatial feature extraction. The feature sequences are then aggregated by the Gated Attention Pooling module and processed by the MSTCN and the Transformer module. The outputs of the two streams are combined for the final assessment score prediction.

Fig. 3. Illustration of the L-order adjacency matrix. A darker color indicates stronger joint connections. The red squares indicate the association of the left hand and right hand with the head. As L increases from 1 to 5, more joint connections are explored. However, despite increased exploration, the hand-to-head connections receive low weights, reducing the model's effectiveness. For visualization, we reduce the number of joints considered in our analysis, where joint 2 represents the head, joint 4 represents the left hand, and joint 7 represents the right hand.
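To make the effect of increasing L concrete, the following sketch builds a toy skeleton and checks which joint pairs become connected as L grows. It rests on several assumptions that are not taken from this section: the L-order adjacency is approximated here as plain reachability within L hops (no edge weighting or normalization is modeled), and the 8-joint bone list is hypothetical, chosen only so that joint 2 is the head, joint 4 the left hand, and joint 7 the right hand, as in the figure.

```python
import numpy as np

# Hypothetical toy skeleton: 0 torso, 1 neck, 2 head, 3 left elbow,
# 4 left hand, 5 hip, 6 right elbow, 7 right hand.
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (0, 5), (1, 6), (6, 7)]
NUM_JOINTS = 8

def bone_adjacency(edges, num_joints):
    """Symmetric 0/1 adjacency matrix built from a list of bones."""
    A = np.zeros((num_joints, num_joints), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1
    return A

def l_order_adjacency(A, L):
    """1 where joint j is reachable from joint i in at most L hops (assumed definition)."""
    n = A.shape[0]
    # (A + I)^L counts walks of length <= L; any positive count means reachable.
    walks = np.linalg.matrix_power(A + np.eye(n, dtype=int), L)
    return (walks > 0).astype(int)

A = bone_adjacency(EDGES, NUM_JOINTS)
for L in range(1, 6):
    A_L = l_order_adjacency(A, L)
    # Report the number of connected pairs and whether each hand reaches the head.
    print(L, int(A_L.sum()), bool(A_L[4, 2]), bool(A_L[7, 2]))
```

In this toy graph each hand is three bones away from the head, so the hand-to-head entries only become nonzero once L reaches 3. An expert-knowledge-based adjacency matrix can instead connect such pairs directly.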

Fig. 4. Illustration of the expert-knowledge-based adjacency matrix for "Lifting of the arms." Left: adjacency matrix. Right: human skeleton framework; red dotted lines represent important connections from experts' knowledge described in [34]. The joint numbers in the figure follow the skeleton numbers of the Kinect V2.

Fig. 10. Illustration of interpretability on both spatial and temporal dimensions. (a) Right: expert-knowledge-based adjacency matrix for EX3 with standard action graph. Middle: importance matrix and skeleton graph for a high-score user. Left: importance matrix and skeleton graph for a low-score user. (b) Bottom: attention weights from each timestamp for the global vector. Top: skeleton graphs at the timestamps when the weights are high.

TABLE I PERFORMANCE COMPARISONS ON KIMORE DATASET IN TERMS OF ρ, RMSE, AND MAE

Fig. 8. The evaluation results of different assessment methods for each exercise. It can be observed that our approach not only exhibits the lowest RMSE but also demonstrates stable assessment performance across different exercises.

TABLE II RESULTS OF EGCN AND GCN-L METHODS FOR FIVE EXERCISES (EX) ON KIMORE DATASET. AVG DENOTES THE AVERAGE RMSE ACROSS ALL EXERCISES

by an identical multi-scale Temporal Convolutional Network. Experimental results are presented in Table II. The EGCN model outperforms the best GCN-L method with an average RMSE of 0.187 vs 0.200, corresponding to a 6.5% improvement. The prediction accuracy for EX1 and EX5 with GCN-L was lower than for the other exercises, but this was alleviated as L increased, likely due to the enlarged receptive field. Our proposed EGCN method consistently demonstrated favorable results across all exercises, highlighting its robust feature extraction capabilities.

2) Gated Attention Pooling: To test the effectiveness of gated attention pooling in capturing spatial features, we built a model by adding individual components, as presented in Table

TABLE III RESULTS OF EGCN WITH GATED ATTENTION POOLING. AVG DENOTES THE AVERAGE RMSE ACROSS ALL EXERCISES

TABLE IV THE ABLATION STUDY RESULTS ON THE KIMORE DATASET

TABLE V COMPARISON BETWEEN TRANSFORMER AND TEMPORAL CONVOLUTIONAL NETWORK