BiLSTM-Based Joint Torque Prediction From Mechanomyogram During Isometric Contractions: A Proof of Concept Study

Quantifying muscle strength is an important measure in clinical settings; however, there is a lack of practical tools that can be deployed for routine assessment. The purpose of this study is to propose a deep learning model for ankle plantar flexion torque prediction from time-series mechanomyogram (MMG) signals recorded during isometric contractions (i.e., a similar form to manual muscle testing procedure in clinical practice) and to evaluate its performance. Four different deep learning models in terms of model architecture (based on a stacked bidirectional long short-term memory and dense layers) were designed with different combinations of the number of units (from 32 to 512) and dropout ratio (from 0.0 to 0.8), and then evaluated for prediction performance by conducting the leave-one-subject-out cross-validation method from the 10-subject dataset. As a result, the models explained more variance in the untrained test dataset as the error metrics (e.g., root-mean-square error) decreased and as the slope of the relationship between the measured and predicted joint torques became closer to 1.0. Although the slope estimates appear to be sensitive to an individual dataset, >70% of the variance in nine out of 10 datasets was explained by the optimal model. These results demonstrated the feasibility of the proposed model as a potential tool to quantify average joint torque during a sustained isometric contraction.


BiLSTM-Based Joint Torque Prediction From Mechanomyogram During Isometric Contractions: A Proof of Concept Study
Jongsang Son , Member, IEEE, Fandi Shi , and William Zev Rymer , Life Member, IEEE Abstract-Quantifying muscle strength is an important measure in clinical settings; however, there is a lack of practical tools that can be deployed for routine assessment.The purpose of this study is to propose a deep learning model for ankle plantar flexion torque prediction from time-series mechanomyogram (MMG) signals recorded during isometric contractions (i.e., a similar form to manual muscle testing procedure in clinical practice) and to evaluate its performance.Four different deep learning models in terms of model architecture (based on a stacked bidirectional long short-term memory and dense layers) were designed with different combinations of the number of units (from 32 to 512) and dropout ratio (from 0.0 to 0.8), and then evaluated for prediction performance by conducting the leave-one-subject-out cross-validation method from the 10-subject dataset.As a result, the models explained more variance in the untrained test dataset as the error metrics (e.g., root-mean-square error) decreased and as the slope of the relationship between the measured and predicted joint torques became closer to 1.0.Although the slope estimates appear to be sensitive to an individual dataset, > 70% of the variance in nine out of 10 datasets was explained by the optimal model.These results demonstrated the feasibility of the proposed model as a potential tool to quantify average joint torque during a sustained isometric contraction.
Index Terms-Manual muscle testing, muscle/joint strength, machine/deep learning model, clinical assessment.

I. INTRODUCTION
M USCLE strength is considered an important measure in our clinics, as significantly reduced muscle strength is closely associated with severe motor impairments across a wide range of clinical populations [1], [2], [3], [4].In fact, previous studies have demonstrated a strong correlation between muscle strength and reduced motor function in a variety of activities such as balance, walking, reaching, and grasping [5].This underscores the significance of assessing and quantifying muscle strength in routine clinical practice so that clinicians can not only monitor the progress of rehabilitation over time but also tailor rehabilitation strategies based on the individual's function.
Manual muscle testing (MMT), using a 6-point grading scale, is widely considered to be the gold standard to assess muscle strength in real clinical settings.However, the MMT scale is quite subjective and less reliable probably due to differences in experience and strength of examiners [6], [7].
To address these limitations, some investigators have attempted to use a dynamometer (e.g., an isokinetic device) or a handheld device, which is available to better quantify muscle strength [8], [9].Unfortunately, such approaches appear to be impractical for a routine assessment in real-life clinical settings.Thus, there is a need of an alternative, more practical approach for quantifying muscle strength-related measures.
It has been widely accepted that the electromyogram (EMG) is closely related to muscle force [10].Indeed, several approaches have leveraged the EMG signals to estimate muscle force [11], [12].However, the acquisition of EMG signals can be challenging due to technical constraints; for example, proper placement of the EMG electrodes, careful skin preparation and management of electrode-skin contact, and environmental noise (for review, see [13]).In addition to these technical challenges, other barriers can limit its routine application in real clinical settings [14], [15].
For the past several decades, the mechanomyogram (MMG) has been used to record small vibration signals from skeletal muscle.Skeletal muscle is known to generate low-frequency mechanical vibrations (< 100 Hz) during muscle contractions, which are presumably linked to changes in the shape of muscle fibers (e.g., cross-sectional area and length) in response to neural excitation of active motor units (for review, see [16]).Given that such mechanical signals propagate through soft tissues, MMG can provide some advantages over EMG, in terms of the sensor location, sensor-skin contact, and environmental noise [17], [18], [19].
Along with these advantages, a large number of studies have demonstrated significant relationships between several features (e.g., root-mean-square (RMS) MMG, median/mean frequency, or zero-crossing of the MMG signals) and mean muscle force/joint torque during a sustained isometric contraction [20], [21], [22], [23], [24].Moreover, there have been attempts to estimate joint torque from some selected features of the MMG signals based on the machine learning models, demonstrating a reasonably good performance [25], [26], [27], [28].However, in most of the previous studies, data collected from the same participant were used for both training and test datasets, and it is thus uncertain whether the models would perform well with untrained test data from a new individual.Given that there is no consistent relationship established, it is plausible that such features may not always contain key necessary kinetic information required for predicting muscle force/joint torque accurately.In addition, since a slope of the relationships between observed and predicted values has not been often reported, there is a lack of information on whether the previous models underpredict or overpredict the outcomes.
Recently, deep learning models have been extensively deployed in many applications, likely due to their ability to automatically extract features from raw sensor signals.
For example, some studies demonstrated a deep learning model application with features from a few seconds of raw (or processed) MMG signals.However, most studies sought to address human movement recognition or muscle action recognition [29], [30].
Thus, the purpose of this proof-of-concept study is to propose a deep learning model for ankle plantar flexion torque prediction from MMG signals and to evaluate its performance.Considering that an isometric contraction protocol in research is a similar form to the MMT procedure in clinical practice, our deep learning model was designed to predict a single numeric value corresponding to an average ankle plantar flexion torque for a given time-series MMG signals during isometric plantar flexion contraction.

A. Dataset
This study used the dataset collected from our previous study with 10 healthy young individuals (age: 25.8 ± 2.2 years; height: 170.0 ± 10.3 cm; body mass: 66.9 ± 12.5 kg; F/M: 4/6) without any relevant medical history of neuromuscular diseases in their lower limb [24].Briefly, the participants performed three isometric plantar flexion contractions at each of different contraction intensities, i.e., 10%, 20%, 30%, 50%, 70%, and 100% of maximum voluntary isometric contraction (MVIC), and at each of different ankle joint angles, i.e., plantar flexion of 26 • , plantar flexion of 10 • , and dorsiflexion of 3 • based on the neutral position (0 • defined as perpendicular between the shank and sole).Given that the average maximum ankle plantar flexion torque was observed at plantar flexion of 10 • , only the data collected at plantar flexion of 10 • were used for this study.During the contractions, ankle torque signals (in N m) were recorded, using a 6-axis force measuring device (Omega160, ATI Industrial Automation, Apex, NC, USA), and MMG signals (in mm/s 2 ) Fig. 1.
Architecture of the proposed bidirectional long short-term memory (BiLSTM)-based models for a joint torque prediction from timeseries 3-axis MMG signals during a sustained isometric contraction.Four different models are based on a stacked BiLSTM and dense layers.The hyperparameter values for the number of units of the BiLSTM and dense layers and the dropout ratio of the dropout layer are summarized in Table I.
were collected from the medial gastrocnemius (MG) muscle, using a 3-axis accelerometer with z-direction perpendicular to the skin surface over MG muscle (ADXL335, Analog Devices, Wilmington, MA, USA; sensitivity: 300 mV/g).These signals were collected at a sampling frequency of 2 kHz (NI USB-6259 BNC, National Instrument, Austin, TX, USA).All procedures described above were approved by the Northwestern University's Institutional Review Board, and the subjects gave informed consent forms before participating in this study.

B. Deep Learning Model Architecture
Since our application is to predict an average ankle plantar flexion torque value from time-series MMG signals during isometric plantar flexion contraction for typically 3-10 seconds (i.e., many-to-one regression problem), our deep learning model architecture was based on a stacked bidirectional long short-term memory (BiLSTM) and dense layers (Figure 1).Typically, input sequential data for BiLSTM are supposed to have the same data length (i.e., the same time step) across trials, and were thus zero-padded to the maximum length among all of the trials.A masking layer was added to use original sequential data for training a model.Afterward, as summarized in Table I, different combinations of BiLSTM, dense, and dropout layers were implemented to explore the effects of the model architecture and hyperparameters on the performance of the ankle plantar flexion torque prediction.A rectified linear unit activation layer and a dense layer with one unit were then added to output a single, positive plantar flexion torque value.

C. Data Preprocessing
The torque signals were lowpass filtered using a zerophase 4 th order Butterworth filter with a cutoff frequency of 6 Hz.For each trial, a 4-s segment in the middle of the sustained contraction was automatically selected with the minimum standard deviation of the segment, and the mean ankle plantar flexion torque value of the segment was then calculated.These mean values were normalized to body mass, and the normalized mean torque data (i.e., N m/kg) were used as output data of the deep learning models.
The 3-axis MMG signals were bandpass filtered using a zero-phase 4 th order Butterworth filter with a passband of 4-40 Hz.In order to improve the performance of the trained models, data augmentation was applied to the filtered signals.Nine random durations ranging between 0 and 3 s with an interval of 0.1 s were chosen, and the data points corresponding to each of the durations from the beginning and end of the signals were removed (e.g., for a given 1-s duration, the first 1-s data and the last 1-s data were removed), leading to 10 different filtered signals (i.e., original plus nine random durations).A moving RMS filter with a 100-ms window size was applied to each axis of the filtered MMG signals.Principal component analysis was conducted to determine the scores of the moving RMS-filtered signals, and the three time-series scores were then downsampled 10 Hz to ensure that the maximum data length is less than 300 (i.e., up to 30-s long data).Finally, these processed signals were used as input data for the deep learning models.This preprocessing process was performed in MATLAB (R2022b, The MathWorks, Inc., Natick, MA, USA).

D. Model Training and Evaluation
The proposed models were implemented, trained, and tested in Python 3 using Keras (v2.10.0)[31] and TensorFlow (v2.10.0)[32].Most of the layer arguments were not defined, so default arguments were used.Only the Dense1 and Dense2 layers in Figure 1 had an L2 kernel regularizer with a regularization factor of 0.01.The adaptive momentum (Adam) optimization algorithm was used to minimize the loss function (i.e., mean squared error) during the training of the proposed models.The batch size and maximum number of epochs were set to 32 and 500, respectively.During the training process, the validation loss was calculated at the end of each epoch, and the model with the best validation loss out of 500 epochs was saved as the final model.
Given that the purpose of this study is to develop a deep learning model that predicts an average ankle plantar flexion torque value for given time-series 3-axis MMG signals from a new individual, the prediction performance of each model was evaluated by conducting a leave-one-subject-out crossvalidation (LOSOCV) method.The data from all subjects except one (i.e., nine subjects in this study) were randomly separated into training set (80%) and validation set (20%) and were then used to train a model.The remaining data from the left-out subject were used only to test the model prediction performance on untrained test data.This procedure was applied to each subject, resulting in 10 different subject models for each combination of the hyperparameters.
The mean absolute error (MAE), RMS error (RMSE), and RMSE normalized to interquartile range (RMSE IQR ) between the measured and predicted normalized mean torque data were calculated as performance metrics to assess the prediction performance of each trained model.Moreover, linear regression analyses were conducted to determine the slope and coefficient of determination (R 2 ) of relationships between the measured and predicted normalized mean torque data.Briefly, as the predicted value is closer to the measured value, both slope and R 2 values are closer to 1.0.Considering that these performance metrics may vary according to different datasets randomly separated into training and validation, the LOSOCV procedure for each subject model was repeated 10 times (i.e., a repeated random sub-sampling method) and the average value of the performance metrics from the 10 models was then used as a representative prediction performance for each subject model.Finally, the prediction performance for each combination of the hyperparameters was assessed by the overall mean and standard deviation of the representative performance metrics across the 10 different subject models for each combination of the hyperparameters.These model evaluations were done in MATLAB (R2022b, The MathWorks, Inc., Natick, MA, USA).
Three-way analysis of variance (ANOVA) was conducted to evaluate the main effects of the model architecture, the number of units, and the dropout ratio as well as the three two-way interaction effects on each of the aforementioned performance metrics.Multiple comparisons were done with Tukey's HSD test.These statistical analyses were performed using JMP Pro (16, JMP Statistical Discovery LLC., Cary, NC, USA) with a significance level (α) of 0.05.
In general, the ANOVA results demonstrated a significant interaction between the model architecture and the number of units and between the number of units and the dropout ratio for all the performance metrics ( p < 0.05).The interaction between the model architecture and the dropout ratio was significant for MAE ( p = 0.013) and slope ( p = 0.016) but Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.3) yielded the best slope of 0.849 ± 0.037.Model 3 and the models with 2 BiLSTM and 2 Dense layers (Model 4) yielded a significantly better slope value than the models with 1 BiLSTM and 1 Dense layer (Model 1) and Model 2 ( p < 0.05) as well as a significantly higher R 2 value than Model 1 ( p < 0.05).

B. Effects of Number of Units on Prediction Performance
Table III summarizes performance metrics for the models with different numbers of units.There was a significant difference in the average values of MAE, RMSE, RMSE IQR , slope, and R 2 among the models with five different numbers Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.  of units ( p < 0.0001).There was a general trend of improved performance metrics with an increase in number of units.On average, the models with the number of units of 512 yielded the smallest values of MAE, RMSE, and RMSE IQR ( p < 0.05), demonstrating the best slope ± 0.030) and R 2 (0.707 ± 0.018) values.

C. Effects of Dropout Ratio on Performance
IV summarizes performance metrics for the models with different dropout ratios.The models with each of the four different dropout ratios revealed a significant difference in the average values of MAE, RMSE, RMSE IQR , slope, and R 2 ( p < 0.0001).The models with the dropout ratio of 0.8 (DR 0.8) yielded the smallest average values of MAE, RMSE, and RMSE IQR compared to the others ( p < 0.05), but the average values of the slope (0.771 ± 0.083) and R 2 (0.668 ± 0.028) were also significantly smaller ( p < 0.05).On average, the models with the dropout ratio of 0.0 predicted the best slope (0.857 ± 0.043) and best explained the variance in the normalized mean torque data by ∼70% ( p < 0.05), yielding comparable average values of MAE, RMSE, and RMSE IQR to DR 0.8.

D. Summary of Prediction Performance Across Models
The correlation matrix revealed that across the models, there are strong correlations (r > 0.9; p < 0.001) between error metrics (i.e., MAE, RMSE, and RMSE IQR ) and moderate negative correlation (r < -0.3; p < 0.01) of R 2 with the error metrics (Figure 2).The slope-R 2 relationship was significant ( p < 0.001), but the slope did not significantly correlate with the error metrics ( p > 0.05).These correlations indicate that R 2 increases likely as the error metrics decrease and as the slope becomes closer to 1.0.Interestingly, the R 2 -slope plot demonstrates that most models predicted the mean torque value for the untrained test dataset with a reasonable slope (i.e., close to 1) and a moderate to strong effect size (i.e., R 2 > 0.65).
Based on the performance metrics, the model with 2 BiLSTM and 1 Dense layers with the number of units of 512 and the dropout ratio of 0.0 was selected as an optimal model for this study.Although the overall mean slope was close to 1.0, the slope for each subject dataset was not always close to 1.0, ranging between 0.35 and 1.99 (Figure 3A).However, the prediction from the selected optimal model had a strong effect size for the untrained test dataset (Figure 3B), explaining >70% of the variance in the normalized plantar flexion torque except one subject dataset (ID 9, R 2 = 0.293).

IV. DISCUSSION
The purpose of this proof-of-concept study was to propose a BiLSTM-based model that predicts an average ankle plantar flexion torque value for a given time-series 3-axis MMG signal recorded during a sustained isometric contraction.We designed the different deep learning models in terms of model architecture, the number of units, and dropout ratio, and then evaluated its prediction performance by conducting the LOSOCV method.Across the models, there was a general trend showing that the models tend to better account for the variance in untrained test datasets as the error metrics decrease and as the slope becomes closer to 1.0.From the optimal model prediction for the untrained test dataset, the slope estimates appear to be sensitive to an individual dataset, but >70% of the variance in nine out of 10 datasets was explained by the model.These results demonstrate the feasibility of the proposed model as a potential tool to quantify average joint torque during a sustained isometric contraction.
There have been attempts to estimate joint torque from some features of the MMG signals (e.g., RMS MMG, mean/median power frequency, zero-crossing, etc.) based on machine learning models, demonstrating a good performance assessed by the coefficient of determination (R 2 ).For example, Youn and Kim [26] introduced a multiple linear regression model that predicted the time-series elbow joint torque with an average R 2 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply. of 0.84 for the same-subject validation and of 0.82 for crosssubject validation.Ibitoye, et al. [27] developed a support vector regression model that predicted an average knee extensor torque evoked by neuromuscular electrical stimulation with R 2 of 0.89 for stratified random sampling validation.Other studies also demonstrated promising results (R 2 range: 0.77-0.87)based on artificial neural network models [25], [26], [28].However, in most of these studies, data collected from the same participant were used for both training and test datasets, likely yielding a better prediction performance with a higher R 2 value compared to the LOSOCV method that was used in this study.Thus, it is uncertain whether the previous models would perform well with untrained test data from a new individual.In addition, as these studies did not report a slope of the relationships between observed and predicted values, there is a lack of information on whether the previous models underpredict or overpredict the outcomes.
Given that our primary goal of the proposed model was to predict an average joint torque value for given time-series MMG signals during a sustained isometric contraction from a new individual (i.e., a similar form to the MMT procedure in clinical practice), the model prediction performance was evaluated by conducting the LOSOCV method.In general, the overall mean slope was reasonably close to 1.0 regardless of the tested models, whereas the slope estimates for each untrained test subject ranged between 0.35 (i.e., underpredict) and 1.99 (i.e., overpredict).We confirmed that this wide range of the slope estimates was unlikely due to the proposed model issues, by conducting a repeated random sub-sampling method (i.e., split the whole dataset into training (80%), validation (10%), and test (10%) dataset randomly 10 times) with the optimal model (for the test dataset, slope: 0.975 ± 0.033 ranging between 0.889 and 1.008; R 2 : 0.978 ± 0.020 ranging between 0.936 and 0.997).Given this superior performance of the proposed model over the previous models, it is plausible that the MMG signals contain subject-specific features that make muscle force/joint torque prediction from MMG signals more challenging.
Indeed, MMG signals are known to be affected by many factors.For example, the MMG amplitude tends to increase as the muscle contraction increases [21], [24], [33], [34].Regarding the mean/median frequency of the MMG signals during contraction at different intensities, there was mixed evidence demonstrating an increase (e.g., first dorsal interosseus [22], biceps brachii [20], [21]) or decrease (e.g., tibialis anterior [34]).These changes in the mechanical response appear to be associated with neuromuscular factors such as global motor unit behavior (e.g., recruitment and rate coding) [20], muscle architecture [34], muscle fiber type composition [35], muscle length [24], muscle stiffness [36], intramuscular pressure [37], subcutaneous thickness [38], skinfold thickness [39], and arm circumference [40].Given these uncertainties, widely accepted variables that feature the amplitude and frequency components of MMG signals may not likely contain all the necessary information to predict kinetic variables such as joint torque.In this regard, we proposed a BiLSTM-based model that predicts an average joint torque from time-series 3-axis MMG signals during a sustained isometric contraction, rather than features of the signals.The proposed model outperformed the previous machine learning models when evaluated with a comparable cross-validation method (an average R 2 of 0.97), promising that the proposed deep learning model may be more capable of learning subjectspecific features, especially from a larger training dataset.
It is also worth considering the possible effects of other factors on the relationship between the MMG signals and ankle plantar flexion torques across participants.Since this study only used the MMG signals collected from MG (i.e., one of the plantar flexors), it is plausible that the contribution of the other plantar flexors to the net joint torque (i.e., loadsharing) may be different across trials and/or participants.Moreover, co-contraction of other muscles (e.g., dorsiflexors) could also contribute to the net joint torque slightly differently across trials and/or participants.In addition, these neighboring muscle contractions may likely influence the MMG signals collected from MG (i.e., crosstalk), as observed in the forearm muscles [41], [42], elbow flexors [43], and quadriceps femoris muscles [44].Considering that multiple sensors may improve the prediction performance [45], [46], it would be interesting to test whether the proposed model prediction better performs when using MMG signals from more muscles (e.g., agonists and antagonists) and/or from multiple measurement sites.
There are several other important considerations in this study.First, the MMG signals were collected from the distal part of the MG muscle rather than the belly part.Given the previous findings that the MMG signals may be different in amplitude and frequency according to the measurement site [17], [18], [34], [66], it will be interesting to test the sensitivity of the proposed model using data collected from different measurement sites.Second, the current study was tested only with a dataset collected from MG at the ankle joint.Future studies are required to test other muscles in terms of location, muscle architecture, muscle fiber type composition, etc. Lastly, the RMS-filtered MMG signals were used as an input to the proposed model, which may lose the frequency information of the MMG signals to some extent.It would be interesting to determine whether adding frequency information can improve model prediction performance.

Fig. 2 .
Fig. 2. Correlation matrix of performance metrics including mean average error (MAE), root-mean-square error (RMSE), RMSE normalized to interquartile range (RMSE IQR ), and slope (Slope) and coefficient of determination (R 2 ) of the regression between predicted and measured ankle plantar flexion torque.A significance level of the correlation is presented as a different marker color: red (p < 0.05) or green (p > 0.05).The diagonal elements present the histogram of the performance metrics.

Fig. 3 .
Fig. 3. Slope and coefficient of determination (R 2 ) of the regression between predicted and measured ankle plantar flexion torque by conducting a leave-one-subject-out cross-validation method.

TABLE II
Table II summarizes performance metrics for the model architectures.There was a significant difference in the average values of MAE, RMSE, RMSE IQR , slope, and R 2 among four different model architectures ( p < 0.0001).The models with 1 BiLSTM and 2 Dense layers (Model 2) yielded the smallest average values of MAE, RMSE, and RMSE IQR ( p < 0.05).The models with 2 BiLSTM and 1 Dense layer (Model

TABLE III PERFORMANCE
METRICS BY NUMBER OF UNITS

TABLE IV PERFORMANCE
METRICS BY DROPOUT RATIO