Deep Reinforcement Learning for Control of Time-Varying Musculoskeletal Systems With High Fatigability: A Feasibility Study

Functional electrical stimulation (FES) can be used to restore motor function to people with paralysis caused by spinal cord injuries (SCIs). However, chronically-paralyzed FES-stimulated muscles can fatigue quickly, which may decrease FES controller performance. In this work, we explored the feasibility of using deep neural network (DNN) controllers trained with reinforcement learning (RL) to control FES of upper-limb muscles after SCI. We developed upper-limb biomechanical models that exhibited increased muscle fatigability, decreased muscle recovery, and decreased muscle strength, as observed in people with chronic SCIs. Simulations confirmed that controller training time and controller performance are impaired to varying degrees by muscle fatigability. Also, the simulations showed that large muscle strength asymmetries between opposing muscles can substantially impair controller performance. However, the results of this study suggest that controller performance for highly-fatigable musculoskeletal systems can be preserved by allowing for rest between movements. Overall, the results suggest that RL can be used to successfully train FES controllers, even for highly-fatigable musculoskeletal systems.

Many studies have demonstrated that FES can restore some degree of upper-limb motor function to people with paralysis [6], [7], [8], [9]. People with SCIs have been able to complete functional tasks by controlling FES systems with few degrees of freedom and pre-programmed hand grasps [10]. Yet, the FES systems that have been demonstrated to date provide low-dimensional control, which falls short of approximating natural upper-limb function. Also, the implementation of FES systems requires continuous intervention from highly-skilled clinicians and engineers. Therefore, FES systems to restore upper-limb motor function to individuals with SCIs have not been widely translated into clinical practice.

If FES neuroprostheses are to be more widely used, they will require controllers that can address the aforementioned limitations. FES controllers should be easy to train, and they should maintain performance with minimal intervention from experts. Ideally, FES controllers should be effective in coordinating multiple actuators, providing multidimensional control that approximates natural movement. Since chronic paralysis causes muscles to become substantially weaker and more fatigable [11], [12], [13], [14], FES controllers should be effective even for highly fatigable and atrophied muscles. While many of these needs have been addressed individually by previous studies [15], [16], [17], there are no FES controller architectures, to the best of our knowledge, that can meet these requirements simultaneously.

Recently, deep reinforcement learning (DRL) has been used to train FES controllers that can meet many of these needs. By emulating natural learning, RL automatically adjusts controller parameters to maximize performance. Therefore, RL may prevent labor-intensive manual adjustments of controller parameters.
RL has been used to control a multi-actuator biomechanical model [18], [19], an upper-limb FES system [20], and robotic arms performing complex motor tasks [21]. More recently, an RL technique called Hindsight Experience Replay (HER) [22] has been used to train FES controllers in as little as 15 minutes [23]. DRL controllers were able to effectively control a multi-input, multi-output biomechanical model of the arm in a large workspace, and trained controllers required minimal retraining for considerable changes in biomechanical properties [24]. However, previous studies did not consider musculoskeletal systems with highly fatigable and atrophied muscles, such as those observed

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

[23]. Figure 1A shows a diagram of the model. The model contained two segments representing the forearm and the upper arm. The segments were connected by 2 pin joints representing the shoulder and the elbow. The arm model included two degrees of freedom: horizontal flexion and extension of the shoulder, and flexion and extension of the elbow. The movement of the arm was constrained to a horizontal plane and the weight of the arm was supported, as if moving on a tabletop with no friction. The model included a total of 6 actuators represented by Hill muscle models [28]. Hill muscle parameters were extracted from [29], [30], and limb segment dimensions were calculated from [31] for a male subject with a height of 177 cm and weighing 80 kg, as in similar RL studies that demonstrated suitable motor behavior [19], [23] and robustness to changes in biomechanical properties [24]. See Supplementary Tables I and II for an overview of key musculoskeletal parameters. As shown in Figure 1A, 4 actuators acted on only one joint, roughly approximating the functions of the anterior deltoid (a) and the posterior deltoid (b) on the shoulder, and the functions of the brachialis (c) and the short head of the triceps (d) on the elbow. Two actuators acted on both joints, approximating the functions of the biceps (e) and the long head of the triceps (f). Simulations were performed using forward Euler approximation with model states updated every 20 ms, which has been found to provide accurate control in previous studies [15], [18], [19], [23], [24], and in preliminary simulations in the current study.

To investigate the impact of fatigability on controller performance, we implemented a previously-validated fatigue model [27], [32], [33]. The fatigue model was incorporated into the musculoskeletal model of the arm, enabling the continuous estimation of fatigue levels during arm control for each muscle. Since the added computational burden was proportional to the number of controlled muscles, we prioritized fatigue models that were reasonably accurate and computationally efficient. In spite of its simplicity, the model proposed by [27] could accurately predict fatigue levels for isometric [32] and intermittent motor tasks [33], and it did not affect simulation times in any noticeable manner in this study. Figure 1B shows a visual representation of the fatigue model, adapted from [27]. The fatigue model included three compartments representing three possible states for motor units:

Eqs. 1 describe the mathematical representation of the compartment model. C(t) is a bidirectional muscle activation-deactivation drive function, as described previously [27]. R and F are the recovery and fatigue coefficients, respectively.
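As an illustration, the compartment dynamics can be sketched in a few lines of Python. The sketch assumes the commonly published form of the three-compartment model (resting, active, and fatigued motor units, each expressed as a fraction of the total pool) and integrates it with forward Euler at the 20 ms step used in this study. The compartment names, the drive gains `l_dev` and `l_rel`, and the R and F values are illustrative assumptions, not the parameters fitted in this work.

```python
def drive(target, m_active, m_rest, l_dev=10.0, l_rel=10.0):
    """Bidirectional activation-deactivation drive C(t) (assumed form).

    l_dev and l_rel are hypothetical gains for recruitment and relaxation.
    """
    if m_active < target:
        # Recruit resting units, limited by how many are still available.
        return l_dev * min(target - m_active, m_rest)
    # Release active units back toward the commanded level.
    return l_rel * (target - m_active)


def fatigue_step(state, target, F, R, dt=0.02):
    """Advance (resting, active, fatigued) fractions by one forward Euler step."""
    m_rest, m_active, m_fatigued = state
    c = drive(target, m_active, m_rest)
    d_rest = -c + R * m_fatigued              # recovered units return to rest
    d_active = c - F * m_active               # active units fatigue at rate F
    d_fatigued = F * m_active - R * m_fatigued
    return (m_rest + dt * d_rest,
            m_active + dt * d_active,
            m_fatigued + dt * d_fatigued)


def simulate(duration_s, target, F=0.01, R=0.002, dt=0.02):
    """Simulate a constant commanded activation; F and R are placeholder values."""
    state = (1.0, 0.0, 0.0)  # all motor units start in the resting compartment
    for _ in range(int(round(duration_s / dt))):
        state = fatigue_step(state, target, F, R, dt)
    return state


if __name__ == "__main__":
    m_rest, m_active, m_fatigued = simulate(60.0, target=1.0)
    print(f"resting={m_rest:.3f} active={m_active:.3f} fatigued={m_fatigued:.3f}")
```

Because the three derivatives sum to zero, the total motor unit pool is conserved at every step, which is a convenient sanity check when embedding such a model in a larger simulation.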
R determines the rate at which fatigued motor units become available to perform contractions, and F determines the rate at which activated motor units become fatigued.

Figure 9 [34], and fit to the fatigue compartment model using a least squares regression to extract R and F. To estimate the R and F coefficients for people with chronic

Exercise leads to increased blood flow to the muscles, which is likely to result in faster recovery from fatigue [33]. Also, previous studies demonstrated that FES-induced exercise can substantially reduce the fatigability of paralyzed muscles [35].

with SCI after long-term FES exercise [35], and the soleus torques produced by average males [36]. Atrophy levels that were estimated using quantitative plantarflexion torque data agree with our own qualitative observations of upper-limb atrophy after FES exercise in people with tetraplegia [10], [37].

We used reinforcement learning (RL) [38] to train a DNN to control the muscle activations in a musculoskeletal arm model. Figure 2 shows the implemented controller training paradigm. At each time step, the controller received the kinematic state of the system, described by the joint angular positions and joint angular velocities of the arm, as well as the target posture in angular coordinates. The action space was a 6-dimensional vector containing commanded muscle activations over a range of [0, 1] for each of the 6 actuators in the musculoskeletal model. The reinforcement learning agent was given a reward at each step according to Equation 2, where I_at was a boolean that was 1 if the endpoint of the arm was inside the target region T (see Figure 1A) and 0 otherwise, and a was a 6-dimensional vector containing the muscle activations of the arm.
The first term rewarded the controller for moving the arm into the target region, the second term penalized movement duration, and the third term penalized higher muscle activations to encourage lower levels of muscle fatigue. The second and the third terms were included to promote controller training convergence, as described in [23] and [24].
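The structure of a three-term reward of this kind can be sketched as follows. This is a minimal illustration, not Equation 2 itself: the additive form, the per-step time penalty, and the weights `time_penalty` and `activation_weight` are assumptions chosen only to show how the three terms interact.

```python
def reward(in_target, activations, time_penalty=1.0, activation_weight=0.01):
    """Per-step reward with target, duration, and activation terms.

    in_target: True if the arm endpoint is inside the target region.
    activations: the 6 commanded muscle activations, each in [0, 1].
    time_penalty, activation_weight: hypothetical weights for illustration.
    """
    if len(activations) != 6:
        raise ValueError("expected one activation per actuator (6 values)")
    if any(a < 0.0 or a > 1.0 for a in activations):
        raise ValueError("activations must lie in [0, 1]")
    target_term = 1.0 if in_target else 0.0                 # rewards reaching the target
    duration_term = time_penalty                            # constant cost per time step
    activation_term = activation_weight * sum(activations)  # cost of muscle effort
    return target_term - duration_term - activation_term


if __name__ == "__main__":
    print(reward(True, [0.0] * 6))   # at the target with relaxed muscles
    print(reward(False, [1.0] * 6))  # off target with fully activated muscles
```

With these placeholder weights, the activation term is small relative to the other two, so it shapes the choice between otherwise similar movements rather than dominating the objective.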
To train the arm controller, we used an actor-critic RL algorithm incorporating Twin-Delayed Deep Deterministic Policy Gradients (TD3) [39] and Hindsight Experience Replay (HER) [22], as described previously [23], [24]. Briefly, the actor observed kinematic state variables and target kinematic state variables and chose muscle activation values, as shown in Figure 2. The critic mapped action-state pairs to expected rewards. The expected rewards were used to update the actor in order to maximize rewards. Both the actor and the critic were feedforward DNNs containing 2 layers and 64 nodes per layer.

[19], [23], [24]. Here, the goal was to assess the feasibility of

were not sufficient to move the arm to the target location and/or (2) the RL algorithm had difficulty with directionally asymmetrical actuator strengths. To test these hypotheses, we implemented models with non-fatigable (time-invariant) muscles, but with varying levels of maximum muscle forces. We tested 3 conditions:

• Symmetrical muscle strength: all maximum muscle forces were scaled by the same factor.

• Flexor force decrease: flexor muscles were weaker than extensor muscles, and all maximum flexor muscle forces were scaled by the same factor.

• Extensor force decrease: extensor muscles were weaker than flexor muscles, and all maximum extensor muscle forces were scaled by the same factor.

Muscle Strengths: To demonstrate that muscle strength asymmetries can impair controller performance, we controlled for muscle strength asymmetries by artificially linking muscle fatigue levels between opposing muscles, as illustrated in Figure 1A. Since flexor muscles displayed higher levels of fatigue than extensor muscles in preliminary simulations, we chose to have extensor fatigue track flexor fatigue. We used the same musculoskeletal models described in Simulation 1.
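The three strength conditions and the fatigue-linking manipulation can be sketched as simple operations on per-muscle values. The muscle names, the grouping into flexors and extensors, the opposing pairs, and the normalized maximum forces below are assumptions made for illustration, based on the six actuators described for the arm model.

```python
# Assumed grouping of the six actuators (a-f) into flexors and extensors.
FLEXORS = ("anterior_deltoid", "brachialis", "biceps")
EXTENSORS = ("posterior_deltoid", "triceps_short_head", "triceps_long_head")
# Assumed opposing pairs: each extensor mapped to the flexor it opposes.
OPPOSING_FLEXOR = dict(zip(EXTENSORS, FLEXORS))


def scale_max_forces(max_forces, condition, atrophy):
    """Return a copy of max_forces with the selected muscles weakened.

    atrophy is the fractional loss of maximum force (0.3 means 30% weaker).
    """
    if not 0.0 <= atrophy <= 1.0:
        raise ValueError("atrophy must lie in [0, 1]")
    groups = {
        "symmetrical": FLEXORS + EXTENSORS,
        "flexor_decrease": FLEXORS,
        "extensor_decrease": EXTENSORS,
    }
    if condition not in groups:
        raise ValueError(f"unknown condition: {condition}")
    weakened = groups[condition]
    factor = 1.0 - atrophy
    return {m: f * factor if m in weakened else f for m, f in max_forces.items()}


def link_fatigue(fatigue):
    """Make each extensor's fatigue level track its opposing flexor's level."""
    linked = dict(fatigue)
    for extensor, flexor in OPPOSING_FLEXOR.items():
        linked[extensor] = fatigue[flexor]
    return linked


if __name__ == "__main__":
    baseline = {m: 1.0 for m in FLEXORS + EXTENSORS}  # normalized forces
    print(scale_max_forces(baseline, "flexor_decrease", 0.25))
```

Keeping the scaling separate from the fatigue linking mirrors the logic of the simulations: the first isolates static asymmetry in non-fatigable muscles, while the second removes fatigue-driven asymmetry in fatigable ones.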

4) Rest-Reach Controller Training Protocol:
In this simulation, we implemented a controller training protocol that included rests of 3 to 6 seconds between reaches to reduce overall levels of fatigue. We assessed the impact of rest duration on controller training times and controller performance. We evaluated the rest-reach protocol using the SCI Exercised model described above.
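The expected effect of rest duration can be illustrated with a short derivation under a simplifying assumption. Suppose that, during a rest period, no motor units are active and the fatigued fraction recovers at rate R, as in the usual statement of the compartment model. The symbols $M_F$ (fatigued fraction) and $T_\mathrm{rest}$ (rest duration) are introduced here for illustration only.

```latex
\begin{align}
\frac{dM_F}{dt} &= -R\,M_F(t)
  && \text{(rest: no active motor units)} \\
M_F(t) &= M_F(0)\,e^{-R t} \\
\frac{M_F(0) - M_F(T_\mathrm{rest})}{M_F(0)} &= 1 - e^{-R\,T_\mathrm{rest}}
\end{align}
```

Under this idealization, the fraction of fatigue recovered during a rest depends only on the product of R and the rest duration, so longer rests and higher recovery coefficients both reduce the fatigue carried into the next reach.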

G. Controller Evaluation
For each condition, the controller training process shown in Figure 2 was repeated 32 times, as was shown to provide an accurate assessment of controller performance in previous studies implementing the same motor task [23], [24]. Controller performance during training was measured every 5 minutes of simulated time for all simulations, except for the simulation where opposing muscles were artificially kept at the same levels of fatigue. For that simulation, we performed evaluations every minute to allow for a more thorough assessment of transient changes in controller performance due to time-varying muscle forces. During evaluations, DNN parameters were held constant while the controller performed 100 reaches. The success rate for an evaluation of a single controller was calculated as the number of successful reaches divided by the total number of reaches. The success rate for a condition was estimated as the median success rate of all controllers trained in that condition. Controller performance plots display the median success rate and the interquartile range during controller training. Training time was defined as the time taken for success rates to reach the maximum median performance for each condition. In workspace performance plots, evaluations included 2,000 reaches. The workspace was discretized into pixels of 0.3 × 0.3 cm, and performance was measured as the number of successful reaches divided by the total number of targets spawned within each pixel.

to a non-fatigable model that was included as a control.

In Figure 4A, all muscle forces were scaled by the same factor, whereas in Figure 4B, only the extensor muscles were weakened, and in Figure 4C, only the flexor muscles were weakened. In Figures 4A and 4B, the DRL controllers achieved success rates above 95% for all conditions representing atrophy levels of up to 75%.
In Figure 4C, the DRL controllers achieved success rates above 90% only when flexor atrophy was below 15%. An imbalance where the flexors were 30% weaker than the extensors (purple curve in Figure 4C) caused success rates in this task to drop to approximately 67%. No targets were acquired when flexor forces were decreased by 90% or more.

Figure 6A shows success rates during controller training for fatigable musculoskeletal models where muscle strengths from opposing muscles were artificially forced to be equal. When muscle strengths for opposing muscles were forced to be equal, controller performance was improved compared to models where opposing muscle strengths were allowed to be asymmetrical (compare Figure

Muscle strength was decreased only for the flexor muscles. The controller was impaired by asymmetries where the flexor muscles were weaker than the extensors. Success rates for flexor atrophy levels above 75% were nearly zero; notice that the learning curves representing 90% and 95% atrophy coincide with the x-axis.

Taken together, these results demonstrate that muscle strength asymmetries can considerably impair controller performance, even in the absence of muscle fatigue during training.

included rest periods between reaches. The musculoskeletal model used in this simulation represented a subject with SCI after FES exercise. Success rates were approximately 95% after 120 minutes of training for rest periods of 5 seconds after every reach. After introducing a pause of at least 3 seconds between reaches, muscle strengths remained above 70% for all muscles included in this model.

Since the arm model was supported against gravity, we encourage caution when predicting results for other tasks. Nonetheless, these results are relevant to people with SCIs who use mobile arm supports [10], [37].

The results from Figure 6A suggest that muscle strength time-variance due to muscle fatigue is unlikely to impair DRL control if the remaining muscle strength is sufficient to perform the task. Note that for the Chronic SCI model, combined levels of atrophy and fatigue cause muscle strength to drop below 4% of the muscle strength available to non-disabled people. Considering the results in Figure 4A, 4% of non-disabled muscle strength is likely to be insufficient to perform the task. This would explain why the Chronic SCI condition, compared to other conditions, did not benefit as much from forcing fatigue levels in opposing muscles to be equal. These results highlight the importance of adequate FES-exercise to increase muscle strength in chronically paralyzed muscles prior to controller deployment.

As discussed above, fatigue-induced performance decreases were likely caused in large part by muscle strength

In this study, as in previous implementations of DRL controllers [23], [24], we observed that commanded muscle activations were frequently either below 5% or above 95%.

Higher muscle activation penalties caused minor shifts in the muscle activation histogram, and they resulted in lower success rates, as can be seen in Supplementary Figures 2 and 3. Qualitatively, the motor behavior observed for successfully trained controllers could be described as smooth and fast with deceleration as the endpoint of the arm approached the target region. The arm was generally stable at the target region for all conditions (see Supplementary Figure 7). The duration of the reaches for the fatigable models was approximately 0.4 seconds, which is slightly longer than the duration reported in a similar study [23]. Yet, this is not surprising, since the workspace used in this study was considerably larger, and muscle strength was lower due to the inclusion of atrophy and fatigue.

Chronic paralysis causes muscle atrophy [13] and more rapid muscle fatigue [11]. Additionally, SCI is likely to cause denervation in certain muscles [13], and FES exercise cannot reverse atrophy caused by denervation. Furthermore, muscle strength asymmetries between opposing muscles are common even for non-disabled individuals [41]. Therefore, patients with SCIs are likely to exhibit muscle strength asymmetries, and it is possible that these asymmetries are characterized by large imbalances where flexors are weaker than extensors. Since these results indicate that such imbalances impair DRL controller performance, it may be necessary to scale electrode stimulation ranges to compensate for flexor weakness.
Therefore, muscle strength asymmetries may decrease the reachable workspace, but they do not seem to substantially impair controller performance within the reachable workspace.

Taken together, these findings suggest that the task workspace could be optimized by considering the anatomy of the patient.

The results in this study (see Figures 6A and 7A) indicate that time-variance due to muscle fatigue is unlikely to affect controller performance in reaching tasks with the arm supported against gravity, as long as rest periods are included between reaches. However, we evaluated DRL controller performance in a motor task that did not require

practice in neuroprosthesis implementation [10], [37]. We highlight the importance of exercising chronically paralyzed muscles using FES to partially reverse disuse atrophy prior to controller deployment. Also, controller use should include frequent periods of inactivity. When high levels of muscle atrophy were combined with muscle fatigue, the resulting muscle strengths were not sufficient to perform the relatively simple task studied in this work. However, since upper-limb use naturally includes periods of low activity, we do not anticipate that this will affect the use of the controllers, or cause inconvenience for the users.

[23], [24], [42], [43], [44], to limit the impact of confounding variables. Here, the goal was to characterize the impact of fatigue and time-varying muscle strength on DRL controllers. Future studies should implement DRL control of more complex musculoskeletal models that can more accurately represent the biomechanics of people with chronic paralysis, as well as more meaningful functional tasks. A recent study demonstrated that similar RL controllers were robust to changes in many biomechanical parameters [24], [45], suggesting that these controllers can generalize across different users with minimal retraining. However, we anticipate that motor tasks involving object interaction will require major adjustments to the reward function.

We used data from the literature to estimate levels of atrophy, as well as the R and F coefficients for the musculoskeletal models in this work.
However, these data, particularly the fatigue curves used to estimate the R and F coefficients, were recorded during motor tasks that differed in important ways. Specifically, suitable data to extract R and F coefficients were available either during MVC [34], or intermittent tasks involving full activation and complete rest [12]. Yet, previous studies argue that R and F coefficients are likely to change depending on the level of muscle contraction, due to physiological factors such as increased blood flow in muscles that are relaxed [33]. Therefore, we anticipate that more accurate estimations of R and F coefficients could lead to shifts in the expected success rates (see Supplementary Figures 4 and 5) and fatigue levels for the conditions studied in this work, although we expect that the main conclusions would remain unchanged.

Finally, the R and F coefficients representing people with SCIs were estimated using soleus data. The non-paralyzed soleus is substantially less fatigable than upper-limb muscles due to its higher proportion of type I muscle fibers [14]. However, chronically-paralyzed muscles become primarily composed of type II fibers [14], which fatigue more quickly. Therefore, we assumed that different muscle groups displayed similar fatigability after SCI. Yet, it is possible that small differences in muscle composition remain after paralysis-driven muscle adaptation.

In this study, we have implemented a musculoskeletal model of the arm that allows for continuous estimation of muscle fatigue during motor tasks. We have adapted this model to represent patients with SCI before and after FES-exercise, and we have used this model to investigate the feasibility of DRL control of FES for people with chronic paralysis.
The results suggest that DRL controllers should provide good performance if muscles are exercised to reverse disuse atrophy, and if there are realistic rest periods between movements. However, the simulations indicate that large muscle strength asymmetries may impair DRL controllers. Muscle strength asymmetries that occur due to fatigue may be alleviated by introducing frequent rests during controller use. We recommend that electrodes are carefully profiled and that stimulation ranges are scaled to compensate for large asymmetries due to disuse atrophy or denervation, and to ensure that movements are safe and comfortable for the users. These results support the feasibility of using DRL to control FES systems to reverse paralysis due to SCI.