A Contrastive Learning Network for Performance Metric and Assessment of Physical Rehabilitation Exercises

Human activity analysis in long-term monitoring environments plays an important role in the physical rehabilitation field, as it helps patients with physical injuries improve their postoperative conditions and reduce their medical costs. Recently, several deep learning-based action quality assessment (AQA) frameworks have been proposed to evaluate physical rehabilitation exercises. However, most of them treat this problem as a simple regression task, which requires both the action instance and its score label as input. This approach is limited by the fact that the annotations in this field usually consist of healthy or unhealthy labels rather than quality scores provided by professional physicians. Additionally, most of these methods cannot provide informative feedback on a patient's motion defects, which weakens their practical application. To address these problems, we propose a multi-task contrastive learning framework that learns subtle and critical differences from skeleton sequences to deal with both the performance metric and the AQA problems of physical rehabilitation exercises. Specifically, we propose a performance metric network that takes triplets of training samples as input for score generation. For the AQA task, the same contrastive learning strategy is used, but pairwise training samples are fed into the action quality assessment network for score prediction. Notably, we propose quantifying the deviation of the joint attention matrix between different skeleton sequences and introducing it into the loss function of our learning network. We show that considering both the score prediction loss and the joint attention deviation loss improves the AQA performance on physical exercises. Furthermore, it helps provide informative feedback for patients to improve their motion defects by visualizing the differences in the joint attention matrix. The proposed method is verified on the UI-PRMD and KIMORE datasets.
Experimental results show that the proposed method achieves state-of-the-art performance.

In rehabilitation medicine, physical rehabilitation training is often compulsory and critical for patients' recovery from various musculoskeletal diseases [24], [25], [26]. Initial rehabilitation training is usually organized and supervised by clinicians, who guide patients in performing rehabilitation programs. However, it is expensive to equip each patient with a professional clinician. Therefore, the most important follow-up training is usually carried out by patients at home. However, it has been reported that, due to the absence of professional guidance and continuous feedback, only a few patients can persist in completing home-based rehabilitation training. This leads to prolonged treatment time and increased healthcare costs [27], [28], [29], [30].
With ongoing research, new methods for evaluating physical exercises are emerging. Preliminary studies approached the task as a classification problem, where the goal is to determine whether an exercise is performed correctly or not, rather than to evaluate the continuous quality of the exercise [21], [31]. Furthermore, they cannot provide incremental feedback to patients. Recently, several deep learning methods have been presented to target the evaluation task. Liao et al. [26] combined a convolutional neural network (CNN) and a recurrent neural network (RNN) to extract features from skeleton sequences. However, CNNs ignore the topological structure and the relationships between human joints, which leads to the loss of important information and reduces the assessment accuracy. Although graph convolution networks have been introduced to extract holistic body configuration features between joints, such as in [18] and [32], the problem of unavailable score labels remains unsolved. In addition, most existing methods cannot provide informative feedback about motion defects, so it is difficult for patients to obtain effective guidance to improve their exercise performance [26], [27], [28], [29], [30], [31], [32], [33], [34].
Based on the aforementioned research motivation, this paper proposes a multi-task contrastive learning framework that extracts subtle and critical differences by comparing healthy and unhealthy skeleton sequences to solve these problems. In the proposed contrastive learning strategy, triplets of training samples are generated as input. The classification annotation denoting whether a sample is healthy or unhealthy is used as the label, and a performance metric network is designed to learn and map the critical differences between various samples to a reasonable quality score. The obtained score can be used as the label for the action quality assessment task, in which the same contrastive learning strategy is employed but pairwise training samples are constructed and fed into the proposed AQA network for score prediction. To solve the second problem, that most existing methods cannot provide feedback about motion defects, we design the joint attention matrix to represent and capture the importance of different joints for diverse exercises. Different from existing approaches, we quantize the deviation of the joint attention matrix and introduce it as one of the optimization objectives into the loss function of our learning framework. The intuitive motivation is that we not only expect the learning network to obtain accurate score predictions, but also to concentrate on certain joints that may play an important role in completing each physical exercise. When we compare a healthy skeleton with an unhealthy skeleton and visualize them by attaching the learned joint attention weights, effective guidance on improving motion defects can be easily obtained.
The contributions of this paper can be summarized as follows: (1) We propose a new multi-task contrastive learning framework to solve the problems of the performance metric and action quality assessment of physical rehabilitation exercises. It addresses the difficulties that score labels are usually absent in real applications and that informative feedback cannot be provided by most existing approaches.
(2) A joint attention matrix is designed to distinguish the significance and performance of different joints for physical exercise. In addition, we define the deviation loss of the joint attention matrix between healthy and unhealthy skeletons, and combine it with the score prediction loss to enhance the feature learning process.
(3) Ablation experiments on the UI-PRMD dataset [35] and the KIMORE dataset [36] are conducted to verify the effectiveness of the proposed method. The experimental results show that better performance can be achieved in comparison with existing approaches.

II. RELATED WORK

A. Physical Exercise Assessment
In physical exercise assessment, the goal is to evaluate the quality of a patient's movement automatically and provide feedback about motion defects more efficiently and intelligently. Early studies extracted handcrafted features from motion data to represent human motion and employed traditional machine learning algorithms to develop evaluation models [37], [38], [39], [40], [41]. Recently, Rahman et al. [42] presented a comprehensive and systematic review of automatic stroke rehabilitation systems. These methods cannot solve the problem that motion evaluation not only requires predicting category labels for classification, but also needs to estimate a quality score.
In this paper, we focus on a skeleton-based graph convolution network method to solve the problem of accurately assessing physical exercises. The most relevant works are [18], [26], and [32]. Liao et al. [26] made an early attempt to introduce deep learning methods into assessing physical rehabilitation exercises. They proposed a CNN- and RNN-based deep learning framework to extract deep pose features and regress the quality score. By comparison, the significant difference is that a graph convolution network (GCN) is employed in our work to model the topological structure of human joints and the interactions between body joints. In [18], a self-attention GCN was proposed and combined with an LSTM for physical exercise evaluation. In [32], self-supervised regularization was designed and combined with a GCN to improve the performance of feature extraction. Compared with these methods, the novelty of our approach is that we propose a contrastive learning strategy and a more effective feature learning method that extracts subtle and critical differences between different skeleton sequences to obtain accurate score predictions and provide informative feedback for patients.
Yu et al. [13] proposed a group-aware contrastive regression model to estimate diving scores from RGB video. Jain et al. [50] proposed a two-stage scoring system, including a deep metric learning module and a score evaluation module, to obtain score predictions by comparing a query video with a reference video. Xu et al. [43] proposed a procedure-aware AQA method for diving score estimation. They extracted comparative features from pairwise action instances consisting of a query instance and an exemplar instance. Li et al. [44] proposed a comparative learning network structure to learn the relative score between two different videos, and defined a consistency constraint loss to guide the model to learn subtle differences between videos. Bai et al. [45] proposed to generalize and capture fine-grained intra-class changes. They designed two new cross-attention losses to make the model learn better part representations. Doughty et al. [46] attempted to extract shared features from a pair of videos and proposed a supervised deep ranking method to accurately rank the skill grade of different videos. Furthermore, they trained a level-specific temporal attention module [47] to address high-skill and low-skill part representations. Wang et al. [48] designed a novel framework for fine-grained motion recognition with few training samples. Contrastive training was used to fully learn the potential gap between different samples. Song et al. [49] proposed a mask-guided contrastive attention model and defined a region-level triplet loss to push the features learned from the body region away from those of the background region.
To the best of our knowledge, we are the first to apply a contrastive learning strategy to the task of assessing physical rehabilitation exercises. We summarize two main differences between the proposed method and existing contrastive learning-based AQA methods as follows. The first difference is that most existing contrastive learning-based AQA methods are presented for sports activity scoring or skill training rather than physical rehabilitation, and they design their learning frameworks to extract features directly from the RGB data of videos. The proposed method is based on skeleton data and models human action from the perspective of the joints' layout and relations, then extracts effective deep pose features from the skeleton sequence. The second difference is that, to obtain informative feedback about patients' motion defects, which has not been addressed so far by most current methods, we propose to quantize the deviation of the joint attention matrix and introduce it into the loss function of our learning network. We show that the proposed method not only improves score prediction, but also provides better feedback about motion defects by visualizing the joint attention differences.

III. METHOD
In this section, the proposed multi-task contrastive learning framework, which consists of the performance metric network and the AQA network, is introduced in detail. The performance metric network is designed to address the challenge of score label absence; its input is formalized as a skeleton sequence and its corresponding class label, and its output is the score value. Furthermore, the AQA network is proposed for accurately assessing the quality of physical exercises and providing informative feedback for patients to improve their motion defects. The input of the AQA network is a skeleton sequence and its corresponding score annotation, while the output is the predicted score and the feedback obtained by joint attention matrix comparison.

A. Performance Metric Network
In the real-world application of physical exercise assessment, skeleton data rather than RGB video are usually provided out of consideration for patient privacy. Given a set of skeleton sequences with category labels indicating healthy or unhealthy, the target of the proposed performance metric network is to extract critical motion patterns to represent the action category and to quantize the difference in motion patterns between healthy and unhealthy skeleton sequences into score values. The skeleton-based performance metric of physical exercise can be formalized as follows.
1) Problem Definition: Given a set of skeleton sequences S = {X, Y} with category labels, where X = {x_i}, i = 1∼m, and Y = {y_j}, j = 1∼n, represent the healthy and unhealthy sets of skeleton sequences, x_i denotes the joint position coordinates of the ith healthy skeleton sequence, and y_j denotes the joint position coordinates of the jth unhealthy skeleton sequence, the skeleton-based performance metric problem can be solved by two mapping functions.
The first mapping function models the feature extraction process that transforms the original skeleton data into an effective feature space, in which the feature distance between samples of the same category is as small as possible, whereas the feature distance between samples of different categories is as large as possible. The optimal parameters of the feature extraction process are determined by solving the following formula:

ŵ_F = argmin(loss_pm(F(S); w_F)) (1)

where F(·) encodes the mapping of S into the feature space with the parameter setting of w_F, and loss_pm(·) represents the loss function of the performance metric network, which measures the feature distance between healthy and unhealthy samples.
The second mapping function models the score generation process that maps a feature vector into a quality score, subject to maximizing the separation degree that differentiates healthy samples from unhealthy samples. Finding the optimal parameters of score generation can be formulated as:

ŵ_S = argmax(SD(Z(F_X; w_S), Z(F_Y; w_S))) (2)

where Z(·) encodes a feature vector into a quality score with the parameter setting of w_S, F_X and F_Y respectively represent the feature vectors of healthy and unhealthy samples obtained by F(·) in Eq. (1), and SD(·) denotes the measurement of the separation degree between healthy and unhealthy samples.
After training, we can obtain the quality score of given skeleton data through these two mapping functions with the learned optimal parameters ŵ_F and ŵ_S. The architecture of the proposed performance metric network is shown in Fig. 1. The pipeline consists of three main parts: triplet training construction, feature extraction, and score generation. The specific details of each part are described as follows.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

2) Triplet Training Construction: Motion pattern extraction is crucial for physical exercise assessment, since different action types involve varying dynamic changes in human body movement. Moreover, to accurately quantify physical exercise performance, discriminative feature representations must be extracted from both healthy and unhealthy skeleton sequences, even when only subtle differences exist between their movements. In view of these two aspects, a contrastive training strategy is employed, and triplets of training samples, each consisting of an anchor sample, a positive sample, and a negative sample, are constructed as input and fed into the feature extraction network for comparative learning.
Since the UI-PRMD dataset contains skeleton data of ten subjects (S_1∼S_10) performing ten different rehabilitation exercises (E_1∼E_10) captured by Vicon and Kinect sensors, and each subject performs each rehabilitation exercise correctly for 10 repetitions and incorrectly for another 10 repetitions, 200 samples are collected for each rehabilitation exercise. Therefore, there are 2000 samples in total for all 10 exercises in this dataset. In the experiments on the UI-PRMD dataset, for a fair comparison with the existing method of Liao et al. [26], we keep the dataset consistent with theirs and use a reduced version of the UI-PRMD dataset. In the reduced dataset, inconsistent data (associated with measurement errors or subjects performing the exercise with their left arm/leg in a set of mostly right arm/leg exercises) were manually removed from the original dataset, resulting in fewer than 100 repetitions per subject. For example, there are 90 correct and incorrect movements for E_1, and 55 correct and incorrect movements for E_2, as described in [26]. The reduced dataset contains 1326 skeleton sequences in total, and each skeleton sequence is composed of frame-by-frame three-dimensional angular displacements of 39 human body joints collected from the Vicon sensor.
For the triplet construction strategy, we take the first correct repetition of every subject for each exercise as the anchor sample, all correct repetitions of the same subject as positive samples, and all incorrect repetitions of the same subject as negative samples to construct the triplets. A triplet can be denoted as (E_i S_j P_1, E_i S_j P_k, E_i S_j N_k), where E_i means the ith exercise, S_j represents the jth subject, P_k denotes the kth correct sample, and N_k means the kth incorrect sample. Then we explore three ways of partitioning the training/test set to train our feature learning network. In the first training construction method, we take all triplets from the first eight subjects (S_1∼S_8) as the training set and all triplets of the last two subjects (S_9∼S_10) as the test set to train our contrastive feature learning network. In the second training construction method, we choose the first eighty percent of triplets from all subjects (S_1∼S_10) as the training set, and the remaining twenty percent of triplets from each subject as the test set. In the third method, we randomly select eighty percent of all the constructed triplets as the training set, and the remaining twenty percent as the test set.
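For illustration, the triplet construction strategy above can be sketched in a few lines of Python. The string sample identifiers and the function name are hypothetical; the actual inputs are skeleton sequences.

```python
from itertools import product

def build_triplets(correct, incorrect):
    """Build (anchor, positive, negative) triplets for one subject and
    exercise, following the strategy described above: the first correct
    repetition is the anchor, every correct repetition serves as a
    positive, and every incorrect repetition serves as a negative."""
    anchor = correct[0]
    return [(anchor, p, n) for p, n in product(correct, incorrect)]

# Example: 3 correct and 2 incorrect repetitions -> 3 * 2 = 6 triplets.
triplets = build_triplets(["P1", "P2", "P3"], ["N1", "N2"])
```

The number of triplets per subject and exercise thus grows with the product of correct and incorrect repetition counts.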
3) Feature Extraction: Effective feature extraction from skeleton data is important for obtaining an advanced performance metric network. The original skeleton data is often composed of the joints' position coordinates in a 2-D or 3-D Cartesian coordinate system. The representation of the dynamic changes in human body movement and the capture of critical motion features from the holistic body configuration and the temporal relations among joints directly impact the effectiveness of the performance metric network. In this work, we use the MS-AAGCN proposed by Shi et al. [51] as the backbone network to extract deep semantic features from skeleton sequences. We stack 10 basic blocks to form our feature extractor and encode the motion pattern of joints and the relationship between joints from the skeleton sequence. The basic structure of the backbone network is illustrated in Fig. 2(a). Each block consists of a spatial convolution layer (Convs), a spatial-temporal-channel attention layer (STC), a temporal convolution layer (Convt), and a dropout layer. The output of each spatial convolution layer and each temporal convolution layer is processed by batch normalization (BN) and ReLU activation operations. The structure of the STC attention layer, which consists of a spatial attention module (SAM), a temporal attention module (TAM), and a channel attention module (CAM), is shown in Fig. 2(b).
In our triplet contrastive training strategy, three backbone networks with the same structure are constructed to implement the feature extraction of the anchor sample, the positive sample, and the negative sample. We employ a weight-sharing strategy to train the feature extraction network. Inspired by [52], we define the constraint that minimizes the feature distance between samples of the same action category while maximizing the feature distance between samples of different action categories for contrastive learning. Accordingly, the loss function of feature extraction is defined as follows:

loss_c = max(||F(x_a) − F(x_p)||^2 − ||F(x_a) − F(x_n)||^2 + α_c, 0) (3)

where x_a, x_p, and x_n represent an anchor sample, a positive sample of x_a, and a negative sample of x_a, which together compose a training triplet, F(·) denotes the mapping of the feature extraction network, and α_c is set as the margin between positive and negative samples.

4) Score Generation: Score generation methods can be divided into model-free methods and model-based methods [26]. The model-free method uses distance-based functions (such as the Euclidean distance, the Mahalanobis distance, and dynamic time warping) to measure the difference between skeleton sequences. The score value can then be generated by quantizing the deviation of a given skeleton sequence from a reference skeleton sequence. The model-based method constructs probabilistic models to learn the feature distribution of all skeleton sequences and generates the score value by computing the feature log-likelihood of a given skeleton sequence under the learned distribution.
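The triplet constraint used for contrastive feature extraction earlier in this subsection can be sketched with numpy as follows. The squared-distance form and the margin value are assumptions in the spirit of the triplet loss of [52], not necessarily the paper's exact formulation.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha_c=0.2):
    """Pull the anchor feature towards the positive and push it away
    from the negative by at least the margin alpha_c."""
    d_pos = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(d_pos - d_neg + alpha_c, 0.0)

f_a = np.array([1.0, 0.0])
f_p = np.array([1.0, 0.1])   # close to the anchor -> small d_pos
f_n = np.array([-1.0, 0.0])  # far from the anchor -> large d_neg
loss = triplet_loss(f_a, f_p, f_n)  # margin satisfied, so the loss is 0
```

When the negative is already far enough away, the hinge clips the loss to zero, so well-separated triplets contribute no gradient.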
To verify the effectiveness of the proposed contrastive training and feature extraction method, we employ the score generation approach used in [26]. Specifically, a Gaussian mixture model (GMM) is employed to model the feature distribution of all trained skeleton sequences. The negative log-likelihood is computed and used to generate score values. In contrast to [26], where additional processing steps such as dimensionality reduction by principal component analysis or an auto-encoder are required, we implement the score generation module directly on the output of our feature extraction network.
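A minimal sketch of this model-based score generation idea, simplified from a GMM to a single diagonal Gaussian so the negative log-likelihood computation is visible without extra dependencies (the feature dimensions and data are illustrative):

```python
import numpy as np

def gaussian_nll_scores(train_feats, test_feats, eps=1e-6):
    """Fit a diagonal Gaussian to training features (the paper uses a
    GMM) and score a sequence by its negative log-likelihood, so
    sequences far from the learned distribution get larger values."""
    mu = train_feats.mean(axis=0)
    var = train_feats.var(axis=0) + eps
    # per-sample negative log-likelihood under N(mu, diag(var))
    nll = 0.5 * np.sum(np.log(2 * np.pi * var)
                       + (test_feats - mu) ** 2 / var, axis=1)
    return nll

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(200, 4))
scores = gaussian_nll_scores(healthy, np.array([[0.0, 0.0, 0.0, 0.0],
                                                [5.0, 5.0, 5.0, 5.0]]))
# the outlying sequence receives a larger negative log-likelihood
```

In practice the negative log-likelihood would still be mapped onto the quality score range, as done in [26].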

B. Action Quality Assessment Network
For the task of automatically assessing physical exercises, the AQA network based on a pairwise contrastive training strategy is proposed. The objective of the proposed AQA network is not merely to obtain accurate score estimation, but also to provide informative feedback for patients to improve their physical exercise performance. The architecture is shown in Fig. 3. The pipeline consists of three main parts: pairwise training construction, feature extraction, and score prediction.
1) Problem Definition: Given a set of skeleton sequences S = {X, G} with score annotations, where X = {x_i}, i = 1∼N, and G = {g_i}, i = 1∼N, x_i denotes the joint position coordinates of the ith skeleton sequence, and g_i is its ground-truth score annotation, the problem of skeleton-based action quality assessment for physical exercises can be defined as a mapping function. It can be formulated as follows:

(ŵ_F, ŵ_S) = argmin(loss_AQA(R(F(X; w_F); w_S), θ, G)) (4)

where F(·) represents the feature extraction stage with the parameter setting of w_F, and R(·) denotes the score prediction stage with the parameter setting of w_S. θ is the joint attention matrix, and its element values represent the learned importance weights assigned to each human body joint. loss_AQA(·) is the loss function of the AQA network, defined by three consistency constraints based on the pairwise contrastive training. The specific details of each part are described as follows.
2) Pairwise Training Construction: In the proposed AQA network, a pairwise learning strategy is employed to train the learning model. Specifically, pairs of skeleton sequences <x_p, x_q> are generated from the original skeleton sequence set X. They are used as input and fed into the feature extraction network to acquire important comparative differences between skeleton sequences. Therefore, the total number of pairs can reach C(N, 2) = N(N − 1)/2, where N is the total number of skeleton sequences in the training set.
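The pairwise construction and its C(N, 2) count can be sketched as:

```python
from itertools import combinations

def build_pairs(sequences):
    """All unordered pairs <x_p, x_q> from the training set, matching
    the C(N, 2) = N * (N - 1) / 2 count given above."""
    return list(combinations(sequences, 2))

pairs = build_pairs(list(range(5)))  # N = 5 -> 10 pairs
```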
3) Feature Extraction: We construct two feature extraction networks with the same architecture as the performance metric network to process the input pairwise skeleton sequences x_p and x_q, respectively. Ten spatio-temporal convolution blocks are stacked, and the weight-sharing strategy is employed to train the comparative feature extraction model. It extracts hierarchical feature representations at different levels, from local context to high-level action patterns, so that deep pose features can be learned from both the spatial and temporal dimensions.
The spatio-temporal convolution block of the lth layer can be expressed by:

x^(l+1) = Σ_{k=1}^{K} W_k x^l (A_k^s + A_k^a) (5)

where A_k^s is the adjacency matrix that represents the topological structure of joint connections from a global perspective. It is initialized based on the natural human body structure and iteratively updated during training. A_k^a represents the adaptive adjacency matrix generated for each skeleton sequence from a local perspective. x^l and x^(l+1) represent the input and output features of the lth layer, respectively. W_k represents the learnable weights, whose values differ for each layer.
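A minimal numpy sketch of this spatial aggregation step, following the layer description above. The tensor shapes and the plain matrix products are illustrative simplifications of the actual convolutional implementation.

```python
import numpy as np

def st_gcn_layer(x, A_s, A_a, W):
    """One spatial graph-convolution step: features are aggregated over
    the K graph topologies (A_s: fixed skeleton adjacency, A_a: adaptive
    adjacency) and mixed by the learnable weights W.

    x: (C_in, V) features per joint; A_s, A_a: (K, V, V); W: (K, C_out, C_in).
    """
    K = A_s.shape[0]
    out = 0.0
    for k in range(K):
        # W_k x (A_s_k + A_a_k): channel mixing, then joint aggregation
        out = out + W[k] @ x @ (A_s[k] + A_a[k])
    return out

x = np.ones((2, 3))                 # 2 input channels, 3 joints
A_s = np.stack([np.eye(3)] * 3)     # K = 3 identity topologies (toy)
A_a = np.zeros((3, 3, 3))           # no adaptive connections in the toy
W = np.ones((3, 4, 2))              # map 2 -> 4 channels per topology
out = st_gcn_layer(x, A_s, A_a, W)  # shape (4, 3)
```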
We combine the adjacency matrices from the two views and define the joint attention matrix as:

M = Σ_{k=1}^{K} (λ A_k^s + (1 − λ) A_k^a) (6)

where λ is used to balance the global adjacency matrix and the local adaptive matrix, and K is set to 3, which represents three different graph topological connections.

4) Score Prediction: In this module, a regression function is established to map the extracted feature vectors into a predicted quality score. Inspired by the better results achieved by spindle-style channel assignment in Pose-GCN [53], a score regression module implemented by a spindle-structured fully connected (FC) network is constructed for score estimation.
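As an illustration of a spindle-structured FC regression head, the following sketch expands and then contracts the channel widths down to a single score. The specific widths and the random weights are hypothetical, not taken from the paper or [53].

```python
import numpy as np

def spindle_fc_head(f, widths=(64, 128, 64, 1), seed=0):
    """Spindle-shaped fully connected head: hidden widths grow towards
    the middle and shrink towards the single output score."""
    rng = np.random.default_rng(seed)
    h = f
    for i, w in enumerate(widths):
        W = rng.normal(0.0, 0.1, size=(w, h.shape[0]))
        h = W @ h
        if i < len(widths) - 1:
            h = np.maximum(h, 0.0)  # ReLU between hidden layers
    return float(h[0])

score = spindle_fc_head(np.ones(32))  # one scalar score per feature vector
```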
As a solution to the problem that informative feedback about motion defects needs to be provided by the learning model, we formulate three constraints and accordingly design three loss functions to train the proposed AQA network. The intuitive motivation is that it is not sufficient to only consider the deviation between the ground-truth score and the predicted score. It is also crucial to capture the differences in feature representation and in joint attention weights, which indicate the significance and performance of different joints in various physical exercise samples. This ensures that the model maintains consistency during its learning process.
The first constraint requires the predicted score of the input skeleton sequence to be consistent with its ground-truth score annotation. To achieve this, the Huber loss function, which measures the deviation of the predicted score from the true score, is employed. The score value loss is defined as:

loss_sv = (1/2)(g_i − ĝ_i)^2, if |g_i − ĝ_i| ≤ δ; δ|g_i − ĝ_i| − (1/2)δ^2, otherwise (7)

where loss_sv represents the score error loss, which measures the difference between the predicted score of a given skeleton sequence and its corresponding ground-truth score, g_i is the ground-truth score value of the ith sample, and ĝ_i is the predicted score value of the ith sample. δ is set as a learned threshold value that represents the evaluation accuracy. Second, we propose a constraint based on our pairwise training strategy that aims to reduce the feature distance between skeleton sequences with similar score values, while increasing the feature distance between skeleton sequences with large score differences. The loss function used to measure feature similarity is formulated as follows:

loss_fs = |σ(sim(f_p, f_q)) − |g_p − g_q|| (8)

where loss_fs represents the feature similarity loss, and g_p and g_q represent the ground-truth score values of the input pair of skeleton sequences <x_p, x_q>. Similarly, f_p and f_q are the feature vectors of <x_p, x_q> output by the feature extraction module. sim(·) denotes the cosine similarity, and σ(·) denotes the fractional mapping of sim(·), computed by |(sim(f_p, f_q) − 1)/2| · s_max, where s_max is the highest score value in the training set.
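The first two constraints can be sketched as follows. The Huber form is standard; the exact combination inside the feature similarity loss is our hedged reading of the description of σ(·) and may differ from the paper's formula.

```python
import numpy as np

def huber(g, g_hat, delta=1.0):
    """Score-value loss: quadratic near zero, linear for large errors."""
    r = abs(g - g_hat)
    return 0.5 * r ** 2 if r <= delta else delta * r - 0.5 * delta ** 2

def feature_similarity_loss(f_p, f_q, g_p, g_q, s_max):
    """Map the cosine similarity of the pair's features onto the score
    scale via |(sim - 1) / 2| * s_max and penalise its mismatch with the
    ground-truth score gap (an assumed combination)."""
    sim = np.dot(f_p, f_q) / (np.linalg.norm(f_p) * np.linalg.norm(f_q))
    sigma = abs((sim - 1.0) / 2.0) * s_max
    return abs(sigma - abs(g_p - g_q))

f = np.array([1.0, 2.0])
# identical features with equal scores -> (near) zero similarity loss
zero_loss = feature_similarity_loss(f, f, 3.0, 3.0, 100.0)
```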
The third consistency constraint aims to minimize the deviation of joint attention weights between skeleton sequences with similar score values. This is achieved by quantizing the attention weight deviation of human body joints and introducing it into the optimization objective. It can also provide helpful feedback about motion defects for performance improvement. The loss function for joint attention weights is defined as:

loss_at = (1 − |g_p − g_q|/s_max) · ||M_p − M_q|| (9)

where loss_at represents the loss of the joints' attention weights, and g_p and g_q are the ground-truth score values of the input pair of skeleton sequences <x_p, x_q>. Similarly, M_p and M_q denote the joint attention matrices of <x_p, x_q> learned by the feature extraction module and calculated according to Eq. (6). Finally, the overall loss function of the proposed AQA network is summarized as follows:

loss_AQA = loss_sv + loss_fs + loss_at (10)

Once the training stage is completed, the optimal parameters of the feature extraction module and the score prediction module can be determined. During the testing stage, only one branch of the AQA network needs to be used for inference.
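A hedged sketch of the attention-deviation constraint and the overall loss. The score-proximity weighting is one plausible reading of the constraint described above, not a confirmed formula from the paper.

```python
import numpy as np

def attention_deviation_loss(M_p, M_q, g_p, g_q, s_max):
    """Weight the Frobenius deviation between the two joint attention
    matrices by score proximity, so pairs with similar ground-truth
    scores are pushed towards similar attention patterns."""
    weight = 1.0 - abs(g_p - g_q) / s_max
    return weight * np.linalg.norm(M_p - M_q)

def total_aqa_loss(loss_sv, loss_fs, loss_at):
    # unweighted sum of the three consistency losses, as in the text
    return loss_sv + loss_fs + loss_at

M = np.eye(4)
# identical attention matrices -> zero deviation regardless of weight
l_at = attention_deviation_loss(M, M, 80.0, 80.0, 100.0)
```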

IV. EXPERIMENT

A. Dataset and Evaluation Method
We evaluate the proposed method on two datasets, UI-PRMD [35] and KIMORE [36].An ablation study is conducted to verify the effectiveness of each component of our method.
1) UI-PRMD Dataset: This dataset was created for physical exercise assessment and contains skeleton sequences from 10 subjects performing 10 different rehabilitation exercises. It was collected using a Vicon optical tracking system and a Kinect sensor, with each subject performing each exercise correctly for 10 repetitions and incorrectly for another 10 repetitions. Each skeleton sequence includes multi-frame skeleton data composed of the position coordinates and the angular displacements of body joints. A binary class label indicating healthy or unhealthy, rather than a quality score annotation, is given for each skeleton sequence.
To fairly verify the effectiveness of our approach, we remain consistent with the compared methods and use in our experiments a reduced version of this dataset collected by the Vicon motion capture system. The reduced version of the UI-PRMD dataset contains 1326 skeleton sequences, and each skeleton sequence is composed of frame-by-frame three-dimensional angular displacements of 39 human body joints collected by Vicon.
2) KIMORE Dataset: This dataset includes RGB videos, depth videos, and skeleton sequences extracted by a Kinect V2 RGB-D sensor, recording physical exercises performed by 78 subjects across 5 action categories. The subjects are divided into two groups: the control group, which consists of 12 experts and 32 non-experts, and the pain and postural disorder group, which contains 34 subjects suffering from chronic motor disabilities. Each subject performs each exercise once, so 390 skeleton sequences are captured in total. The skeleton data of each frame includes the three-dimensional position coordinates of 25 joints. The ground-truth score annotation of each skeleton sequence is provided.
3) Evaluation Metric: To validate the effectiveness of our performance metric network, we utilize the separation degree (SD) employed by [26] as the evaluation metric. It is computed by:

SD = (1/(m·n)) Σ_{i=1}^{m} Σ_{j=1}^{n} ∇(x_i, y_j) (11)

where X = {x_i}, i = 1∼m, and Y = {y_j}, j = 1∼n, represent the healthy and unhealthy skeleton sequences, respectively, and ∇(x_i, y_j) = (s_{x_i} − s_{y_j})/(s_{x_i} + s_{y_j}) measures the quality score deviation between x_i and y_j.
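The separation degree can be computed directly from the generated scores; a numpy sketch, assuming positive score values as implied by the normalized deviation:

```python
import numpy as np

def separation_degree(scores_healthy, scores_unhealthy):
    """Average normalized score gap over all healthy/unhealthy pairs."""
    sx = np.asarray(scores_healthy, dtype=float)
    sy = np.asarray(scores_unhealthy, dtype=float)
    # pairwise (s_x - s_y) / (s_x + s_y) via broadcasting
    grad = (sx[:, None] - sy[None, :]) / (sx[:, None] + sy[None, :])
    return grad.mean()

# healthy scores near 1 and unhealthy scores near 0 separate well
sd = separation_degree([0.9, 0.8], [0.1, 0.2])
```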
To evaluate the performance of the proposed AQA network, we employ three evaluation metrics: the mean absolute deviation (MAD), the mean absolute percentage error (MAPE), and the root mean square error (RMSE) [18]. They are computed as follows:

MAD = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (12)

MAPE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|/y_i × 100% (13)

RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2) (14)

where n is the total number of skeleton sequences, y_i is the ground-truth quality score of the ith skeleton sequence, and ŷ_i denotes the predicted score of the ith skeleton sequence.
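These three metrics are standard and can be sketched as:

```python
import numpy as np

def mad(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    # expressed as a percentage; assumes nonzero ground-truth scores
    return np.mean(np.abs(y - y_hat) / np.abs(y)) * 100.0

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

y = np.array([50.0, 80.0])      # toy ground-truth scores
y_hat = np.array([45.0, 84.0])  # toy predictions
# mad -> 4.5, mape -> 7.5, rmse -> sqrt((25 + 16) / 2)
```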

B. Implementation Details

1) Contrastive Training Sample Construction:
We propose a contrastive learning strategy for the performance metric task. Our approach generates triplets, each consisting of an anchor sample, a positive sample, and a negative sample, as input.
In our experiments on the UI-PRMD dataset, we explore three ways of partitioning the training/test set to train our feature learning network, as described in Section III-A. In addition, considering the issue of inconsistent lengths of skeleton sequences, we employ two alignment methods, zero-value padding and linear interpolation, to align them. In summary, we design and conduct four different experiments on the reduced UI-PRMD dataset to verify the effectiveness of the proposed performance metric network.
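The two alignment methods can be sketched as follows; the frame-major (T, J) array layout is an assumption about the data format.

```python
import numpy as np

def zero_pad(seq, target_len):
    """Align by appending zero-valued frames (the zero-value option).
    seq: (T, J) array of T frames and J joint channels."""
    pad = np.zeros((target_len - len(seq), seq.shape[1]))
    return np.concatenate([seq, pad], axis=0)

def linear_resample(seq, target_len):
    """Align by linear interpolation along the time axis."""
    t_old = np.linspace(0.0, 1.0, len(seq))
    t_new = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(t_new, t_old, seq[:, j])
                     for j in range(seq.shape[1])], axis=1)

seq = np.array([[0.0], [1.0]])       # 2 frames, 1 channel
padded = zero_pad(seq, 4)            # -> [[0], [1], [0], [0]]
resampled = linear_resample(seq, 3)  # -> [[0], [0.5], [1]]
```

Zero padding preserves the original frame values but changes the motion's effective duration, whereas interpolation preserves the motion's shape at a new temporal resolution.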
Our(A): It adopts the first training construction method, in which all triplets from the first eight subjects (S_1∼S_8) are chosen as the training set, and the zero-value padding method to align the skeleton sequences. Our(B): It employs the second training construction method, in which the first eighty percent of triplets from all subjects (S_1∼S_10) are taken as the training set, and the zero-value padding alignment method. Our(C): It uses the second training construction method and the linear interpolation alignment method. Our(D): It employs the third training construction method, which randomly selects eighty percent of triplets from all constructed triplets as the training set, and the linear interpolation alignment method.

TABLE I
THE MEAN VALUE AND STANDARD DEVIATION OF SEPARATION DEGREE (SD) ON THE UI-PRMD DATASET (THE HIGHER THE VALUE, THE BETTER THE PERFORMANCE)

We repeat five random experiments over the 10 exercises of the UI-PRMD dataset and report the average values. It is important to note that although different training set construction methods are used to train the feature learning network, the evaluation criterion of separation degree (SD) defined in Eq. (11) is computed over all the samples in the UI-PRMD dataset.
The AQA network takes pairwise skeleton sequences as input. However, neither the UI-PRMD dataset nor the KIMORE dataset is partitioned into training and test sets by their creators. To compare the performance of our method with existing methods, we follow the partition rule used by most existing methods. Specifically, for the UI-PRMD dataset we randomly select seventy percent of the dataset as the training set and the remaining thirty percent as the test set. For the KIMORE dataset we use the predefined training and test sets from [18]. Pairwise training samples are constructed from the training set and fed into the AQA network to train the learning model. During the testing stage, only one branch of the AQA network is used to infer the scores for the test set.
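A minimal sketch of the pairwise training-sample construction, assuming every unordered pair of training samples is supervised by its ground-truth score difference (the paper's exact sampling scheme may differ):

```python
import itertools

def build_pairs(samples, scores):
    """Pairwise training construction sketch: every unordered pair of
    training samples, labeled with the ground-truth score difference."""
    pairs = []
    for i, j in itertools.combinations(range(len(samples)), 2):
        pairs.append(((samples[i], samples[j]), scores[i] - scores[j]))
    return pairs
```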
2) Parameter Setting: We implement the proposed framework in PyTorch and train the models on one NVIDIA RTX 3090 GPU. The parameters are optimized with the Adam [54] algorithm. For the performance metric task on the UI-PRMD dataset, we set the parameter α_c in Eq. (3) to 6, which acts as a boundary between positive and negative samples. The learning rate is initially set to 0.005 and decayed at the 30th and 40th epochs with a decay rate of 0.0001. The model is trained for 200 epochs with a batch size of 5. For the action quality assessment task, the AQA network is trained for 1500 epochs on the UI-PRMD dataset. The learning rate is initialized to 0.001 and reduced at the 30th and 40th epochs with a rate of 0.0001. The batch size is set to 4. On the KIMORE dataset, the AQA network is also trained for score prediction. The learning rate is initially set to 0.01 and decayed at the 100th and 200th epochs with a decay rate of 0.0001. The model is trained for 3000 epochs with a batch size of 16. The parameter δ, which acts as a learned threshold in Eqs. (7), (8), and (9), is set to 0.1.
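The milestone-style learning-rate decay described above can be sketched as follows; interpreting the decay as a multiplicative factor applied at each milestone (in the style of PyTorch's MultiStepLR) is an assumption, and the gamma value here is chosen only for illustration:

```python
def lr_at_epoch(epoch, base_lr, milestones, gamma):
    """MultiStepLR-style schedule: the learning rate is multiplied by
    gamma at each milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Sketch of the UI-PRMD metric-task schedule (milestones at epochs 30 and 40).
schedule = [lr_at_epoch(e, 0.005, (30, 40), 0.1) for e in (0, 35, 45)]
```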

C. Results
1) Results of Performance Metric on UI-PRMD: Table I presents the results of our performance metric network on the UI-PRMD dataset. The separation degree (SD) defined in Eq. (11) is used to evaluate the effectiveness; the higher the SD value, the better the performance. It is important to note that although different triplet training construction methods are used to train the feature learning network, the evaluation criterion of SD is computed over all the samples in the UI-PRMD dataset.
To analyze the impact of different subjects, we use two SD calculation schemes: between-subject and within-subject. The former does not distinguish between the skeleton sequences of different subjects and computes the results over all skeleton sequences. The latter computes SD over each subject's skeleton sequences and averages the SD over all subjects. Compared to the results reported in [26], our approach achieves significant improvement in both the between-subject and within-subject scenarios. In [26], the learning network uses a single skeleton sequence as input to train the feature extraction network. Our approach uses a contrastive learning strategy, and the results demonstrate its superiority in capturing subtle and critical features and obtaining a better score generation model.
To examine the impact of GMM modeling parameters, we conduct an extensive experiment with different numbers of GMM components (G_C): 1, 4, 6, 8, 10, and 12. The experimental results indicate that increasing the number of GMM components positively affects the performance. From Table I, it can be observed that in the between-subject scenario, Our (A), (C), and (D), which use the first, second, and third triplet training construction strategies respectively, achieve their best SDs of 0.543, 0.561, and 0.589 when G_C is set to 6. Our (B), which employs the second triplet construction strategy, achieves its best SD of 0.541 when G_C is set to 8. Both Our (A) and (B) use zero-value padding to the maximum length to address the length inconsistency of skeleton sequences. Our (D), which employs the third (random selection) triplet training construction strategy and linear interpolation for length alignment, achieves the best SD of 0.589 on the UI-PRMD dataset when G_C is set to 6. In the within-subject scenario, the third (random selection) strategy performs better than the other training strategies. Furthermore, the proposed method achieves the best SD of 0.659 on the UI-PRMD dataset when G_C is set to 6, which is 0.056 higher than the result of [26].
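The component-count sweep can be sketched with scikit-learn's GaussianMixture as a stand-in for the paper's GMM over learned features; the random feature vectors here are purely illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))  # stand-in for learned feature vectors

# Compare mean log-likelihood for the component counts explored in the paper.
scores = {}
for n_components in (1, 4, 6, 8, 10, 12):
    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(features)
    scores[n_components] = gmm.score(features)  # mean log-likelihood per sample
```

In practice the fitted model's log-likelihood on a test sequence's features would then be mapped to a quality score, as described earlier for the performance metric network.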
To further examine our results, we plot the GMM log-likelihood values and quality scores of the skeleton sequences of the E1 exercise in the UI-PRMD dataset in Fig. 4. To compare our approach with existing methods, we adopt the settings of [26] and visualize the quality scores obtained by Our (A) for the between-subject scenario with G_C set to 6. It can be observed that the quality scores of healthy skeleton sequences are close to the maximum value of 1, while most of the quality scores of unhealthy skeleton sequences range from 0.4 to 0.8. This shows that a clear separation between healthy and unhealthy samples has been achieved. In addition, an increased score difference among unhealthy samples has also been obtained by the proposed performance metric network.
2) Results of AQA Task on UI-PRMD and KIMORE: The results of our AQA network on the UI-PRMD and KIMORE datasets are presented in Tables II and III, respectively. In this task, the MAD, MAPE, and RMSE defined in Eqs. (12)–(14) are used to evaluate the effectiveness; the lower the values, the better the performance. Liao et al. [26] were the first to attempt to solve the problem of assessing physical rehabilitation exercises using a deep neural network. They use a single skeleton sequence as input to train the feature extraction network. Deb et al. [18] propose a variant model based on the spatio-temporal graph convolution network for assessing rehabilitation movements. They replace the average pooling with LSTM units to learn global information of the sequence, and introduce a self-attention mechanism to learn the contributions of different joints to different movements. Du et al. [32] propose a self-supervised regularized graph convolution network that predicts the future human body skeleton from extracted graph features. Song et al. [33] propose a GCN-based multi-stream model to explore sufficient discriminative and complementary features over all skeleton joints. Yan et al. [34] build a spatio-temporal graph neural network and use global pooling to fuse the feature information from different frames.
According to Table II, the proposed AQA model achieves the lowest MAD for exercise 7. The best results for most exercises, except exercises 3 and 10, are achieved by [18]. Our method, which uses only the single motion modality of angular displacement, achieves an average MAD of 0.015 over all exercises, an improvement over the results of 0.025 and 0.021 reported in [26] and [32], respectively. Compared with these methods, the proposed method shows a better feature learning ability to extract more discriminative motion patterns.
The result of Ours(ang+pos) is obtained on all ten exercises of the reduced Vicon version of the dataset by concatenating the normalized 117-dimensional position coordinates with the normalized 117-dimensional angular displacements to form 234-dimensional skeleton data per frame for each exercise sample. From Table II, it can be observed that utilizing both angular displacements and position coordinates to train the AQA network brings a performance improvement: Ours(ang+pos) achieves an average MAD of 0.014, compared with the average MAD of 0.015 obtained when only the angular displacement modality is used.
On the KIMORE dataset, the results of the three evaluation metrics (MAD, RMSE, and MAPE) on the five exercises are presented in Table III. We compare the results of our proposed method with the skeleton-based methods [18], [26], [33], and [34], and the RGB-based methods [13], [44], [45], and [47]. Li et al. [44] extend score regression to relative score prediction and propose an end-to-end PCLN model to improve the performance of sports activity scoring. We apply their approach to the KIMORE dataset: the ResNet-50 network pre-trained on ImageNet is used as the feature extractor to perform frame-by-frame feature extraction on videos, and the 2048-dimensional feature vectors of a video pair are then fed into the learning network to train the PCLN for score estimation. In [45], Bai et al. adopt a contrastive regression framework and propose a temporal parsing transformer (TPT) to decode each video into a fixed number of part representations. In their decoder, a ranking loss and a sparsity loss are defined to supervise effective part feature learning. Following their work, we segment the frames of a KIMORE video into overlapping clips and use the I3D network pre-trained on Kinetics to extract clip-level features.
The clip-level features of an input video and an exemplar video are then fed into their temporal parsing transformer to learn the difference and estimate a relative score. Doughty et al. [47] propose a rank-aware attention network (RAAN) to assess the relative overall level of skill. They first extract I3D features from a ranked pair of videos, and then input these I3D features into an attention module consisting of multiple attention filters to learn weighted video-level features. We apply their method to the KIMORE dataset: each video is uniformly split into segments, and I3D is used to extract segment-level features. The segment-level features are then passed into a pair of attention modules to produce video-level representations. Since their downstream task is to determine whether the relative skill level of one of two compared videos is higher or lower, their classification head is implemented by an FC network and a Tanh activation function (the output ranges from −1 to 1). To make it applicable to the score prediction task of the AQA problem, we change the activation function to Sigmoid (the output ranges from 0 to 1) and then map it to the final score.
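The head replacement can be sketched as follows; mapping the Sigmoid output to a final score via a maximum score of 50 is an assumption made for illustration:

```python
import numpy as np

def tanh_to_score(logit):
    """Original RAAN-style head: Tanh output in [-1, 1] for relative ranking."""
    return np.tanh(logit)

def sigmoid_score(logit, max_score=50.0):
    """Adapted head: Sigmoid output in [0, 1], scaled to the score range.
    The maximum score of 50 is a hypothetical choice for illustration."""
    return max_score / (1.0 + np.exp(-logit))
```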
In [13], Yu et al. were among the first to propose learning relative scores with reference to an exemplar video instead of regressing an unreferenced score. They design a group-aware regression tree to perform coarse-to-fine classification and regress the score difference within small intervals. We follow their experimental settings on the AQA-7 dataset and conduct experiments on the KIMORE dataset. 103 frames are first extracted from each video clip and segmented into 10 overlapping snippets, each containing 16 continuous frames. I3D is used to extract features from an input video and an exemplar video, which are then fed into the group-aware regression tree to predict the score difference. It can be seen from Table III that the proposed AQA method achieves the best performance on all three metrics for this dataset. We obtain a significant improvement, with the lowest average MAD of 0.260, RMSE of 0.333, and MAPE of 0.711 over the five exercises. These results represent a significant reduction compared to the existing skeleton-based methods. It can also be observed that the skeleton-based methods outperform the RGB-based methods on the task of physical exercise assessment according to their results on the KIMORE dataset. The best result among the RGB-based methods is obtained by CoRe [13], with the lowest average MAD of 4.612, RMSE of 6.135, and MAPE of 15.841 over the five exercises. The compared RGB-based methods were designed for the AQA problem of simple actions in short videos, such as diving and vault. They extract rich spatial and temporal features from RGB data sequences. Affected by variations in human appearance, scale, and illumination, and by complex scene factors including camera motion and background changes, feature extraction based on RGB data often captures not only human movement information but also much scene information that is unrelated to action quality evaluation. The performance decreases significantly because the detailed information of human movement cannot be accurately captured by their feature representations. This demonstrates that skeleton-based methods have obvious advantages for the physical exercise AQA problem.
3) Ablation Study of Each Component: To verify the effectiveness of each component of the proposed AQA network, ablation experiments are conducted on the KIMORE dataset. Unlike existing methods, the AQA network employs a contrastive training strategy (CS) and introduces two new consistency constraints, with corresponding loss calculations, to optimize the parameters of the graph convolution network-based feature extractor and the assessment module. Accordingly, the feature distance loss (FL) and the joint attention weights difference loss (JL) are defined to impose these constraints. The ablation results are shown in Table IV. We compare the results with the baseline method (BN), which employs a single skeleton sequence as input rather than contrastive training and implements a score regressor based only on the score error loss. It can be observed that both the contrastive learning strategy and the proposed constraints contribute to the performance improvement.

4) Results of Different Skeleton Sources: To investigate the performance of the proposed contrastive learning framework on the RGB videos provided in the KIMORE dataset, we conduct experiments using two pose estimation algorithms to extract skeleton sequences from RGB videos, as discussed in [18]. Pre-trained BlazePose [56] and VideoPose3D [57] are employed to extract the 3D joint coordinates of the human body, and the resulting pairwise skeleton sequences are fed into the proposed contrastive learning framework for training. We compare our results on Exercise 5 using three skeleton sources: two types of skeleton sequences extracted from RGB videos using BlazePose and VideoPose3D, and the skeleton sequences obtained by the Microsoft Kinect V2 sensor [58] provided by the KIMORE dataset. We also compare with the results reported in [18]. The experimental results are shown in Table V.
Note that different skeleton models are defined by these three skeleton sources: 33-keypoint, 17-keypoint, and 25-keypoint models are employed by BlazePose, VideoPose3D, and Kinect V2, respectively. In our experiment with one NVIDIA RTX 3090 GPU, the batch size of 16 used to train the learning network on the Kinect V2 skeleton sequences could not be used on the BlazePose skeleton sequences, since more human body joints are detected and the amount of skeleton data is increased. To fairly compare the performance of the three skeleton sources, a unified batch size of 12 is set in the experiments for all three. It can be seen that the best performance (MAD of 0.382, RMSE of 0.492, and MAPE of 1.089) is achieved by the proposed method on the Kinect V2 skeleton sequences. Of the two types of skeleton sequences extracted from RGB videos by pose estimation, better performance is achieved on the VideoPose3D skeleton sequences (MAD of 0.863, RMSE of 1.113, and MAPE of 2.421) than on the BlazePose skeleton sequences (MAD of 0.950, RMSE of 1.190, and MAPE of 2.524). A similar conclusion can be drawn from the results of Liao et al. [26], Yan et al. [34], and Du et al. [55]. It is worth mentioning that with the batch size of 16, the best performance (MAD of 0.292, RMSE of 0.378, and MAPE of 0.808) is achieved by the proposed method using the Kinect V2 skeleton sequences, as shown in Table III.

D. Visualization of the Learned Joints' Attention Weights
To further qualitatively analyze the performance of the proposed AQA method, we visualize the learned joint attention weights through one example for each exercise on the KIMORE dataset, following [18]. Fig. 5 compares an expert sample x_e with two patient samples x_p and x_q; these two patient skeleton sequences have different quality scores. The first to fifth rows show the five exercises of this dataset. By introducing the loss on the joint attention weights into the final loss function, the difference in joint performance between the expert and the patients can be captured and quantized into informative feedback for the patient. This feedback can help the patient concentrate on improving the performance of the joints marked with larger red circles, which indicate prominent defects in their motions.
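The conversion of the attention difference into per-joint feedback can be sketched as follows (the function and joint names are illustrative, not from the original implementation):

```python
import numpy as np

def joint_feedback(M_e, M_p, joint_names, top_k=3):
    """Rank joints by the attention deviation |M_e - M_p| between an expert
    and a patient sample; the largest deviations flag likely motion defects."""
    d = np.abs(np.asarray(M_e) - np.asarray(M_p))
    order = np.argsort(d)[::-1][:top_k]  # indices of the largest deviations
    return [(joint_names[i], float(d[i])) for i in order]
```

The joints returned first are those a visualization would mark with the largest red circles.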

V. CONCLUSION
In this paper, we propose a multi-task contrastive learning framework to solve the problems of skeleton-based performance metric and action quality assessment of physical exercises. The proposed performance metric network is designed to tackle the absence of score labels, and the AQA network is proposed to accurately assess physical exercise quality and provide informative feedback that helps patients improve motion defects. Experimental results show that the contrastive training strategy brings benefits in learning subtle and critical features for accurately assessing physical rehabilitation exercises. By quantizing the joint attention weights and introducing them into the optimization target of our learning network, accurate score prediction can be achieved and effective feedback for improving motion defects can be provided. In future studies, we plan to preprocess the skeleton data to filter out noise interference and to optimize the structure of the feature extraction network.

Fig. 1. The architecture of the proposed performance metric network.

Fig. 2. The components of the feature extractor.

Fig. 3. The architecture of the proposed AQA network.

Fig. 4. Visualization of the performance quantization results of the skeleton sequences of the E1 exercise in UI-PRMD.

Fig. 5. Visualization of the joint attention weights (red circles) of an expert sample and two patient samples for the five exercises on the KIMORE dataset. M_e denotes the learned joint attention weights of the expert sample x_e; a larger red circle represents higher importance of the joint. d_s^p and d_s^q indicate the joint attention differences |M_e − M_p| and |M_e − M_q| between the expert sample x_e and the two patient samples x_p and x_q, which have different quality scores.
In Fig. 5, the first column represents the learned joint attention matrix M_e of an expert sample x_e. The second column illustrates x_e highlighted by red circles according to the values of M_e, which represent the different joint importances; a larger circle denotes higher importance of the joint. The third and fourth columns show the joint attention differences |M_e − M_p| and |M_e − M_q|.

TABLE II
THE RESULTS OF MAD FOR TEN EXERCISES ON THE UI-PRMD DATASET (THE LOWER THE VALUE, THE BETTER THE PERFORMANCE)

TABLE III
THE RESULTS OF MAD, RMSE, AND MAPE ON THE KIMORE DATASET (THE LOWER THE VALUE, THE BETTER THE PERFORMANCE. THE SUPERSCRIPTS S AND V DENOTE THE SKELETON-BASED AND RGB-BASED METHODS, RESPECTIVELY)

TABLE IV
ABLATION STUDY ON THE KIMORE DATASET (BN DENOTES THE BASELINE METHOD, CS THE CONTRASTIVE PAIRWISE TRAINING STRATEGY, AND JL AND FL THE JOINT ATTENTION WEIGHT LOSS AND THE FEATURE SIMILARITY LOSS)

TABLE V
THE RESULTS OF MAD, RMSE, AND MAPE ON EX5 OF THE KIMORE DATASET WITH DIFFERENT SKELETON SOURCES