Learning Skill Training Schedules From Domain Experts for a Multi-Patient Multi-Robot Rehabilitation Gym

A robotic gym with multiple rehabilitation robots allows multiple patients to exercise simultaneously under the supervision of a single therapist. The multi-patient training outcome can potentially be improved by dynamically assigning patients to robots based on monitored patient data. In this paper, we present an approach to learn dynamic patient-robot assignment from a domain expert via supervised learning. The dynamic assignment algorithm uses a neural network model to predict assignment priorities between patients. This neural network was trained using a synthetic dataset created in a simulated rehabilitation gym to imitate a domain expert’s assignment behavior. The approach is evaluated in three simulated scenarios with different complexities and different expert behaviors meant to achieve different training objectives. Evaluation results show that our assignment algorithm imitates the expert’s behavior with mean accuracies ranging from 75.4% to 84.5% across scenarios and significantly outperforms three baseline assignment methods with respect to mean skill gain. Our approach solves simplified patient training scheduling problems without complete knowledge about the patient skill acquisition dynamics and leverages human knowledge to learn automated assignment policies.

While rehabilitation robots were originally experimental and expensive, there has been a push for more affordable devices [6], and it is now not uncommon to see robots that cost less than $5000. Though such robots are relatively simple and often train only a single motor function (e.g., only the elbow or hand [6]), a rehabilitation clinic could potentially purchase multiple simple robots for less than the price of one larger robot.
Due to this shift toward cheaper devices, some rehabilitation centers now have "robot gyms": rehabilitation facilities where multiple patients can exercise with multiple robots (or passive sensorized devices) simultaneously under the supervision of a single therapist [3], [7], [8], [9]. Though clinical results from such robot gyms are still limited, initial studies indicate that such gyms do result in functional improvements [7] and that one therapist can effectively supervise up to 4 patients [9].
In such robot gyms, each robot (or passive rehabilitation device) currently acts independently of the others, with no knowledge of what alternative devices might be available. As the next step, it may be useful to connect all robots to a single central system that could monitor patients as a group, suggest changes to individual patients' regimens (e.g., increasing difficulty), point out patients who may benefit from manual therapist intervention, or even suggest that patients switch to a different robot if they are not benefitting from their current robot. As a first example of such software, Hocoma AG, a major robot manufacturer, recently created the HocoNet software portal, which allows therapists to create a patient database and gather data from multiple robots in a centralized fashion [10]. Our research team, however, aims to go further: monitor multiple patients and robots as a group and dynamically suggest what robot a patient should be assigned to throughout the session. In existing robot gym studies, patients either train with the same robot throughout the session [3] or switch robots halfway through it [6]. We believe that session outcome can potentially be improved if the patient group is monitored throughout the session and dynamic patient-robot assignments are made during the session based on monitored data.
In our recent study [11], we presented a fully automated patient-robot assignment algorithm for a very simplified simulated robot gym. Though the simulated environment was completely deterministic and transparent, we showed that dynamically assigning patients to robots throughout the session resulted in higher simulated skill improvement than training with the same robot throughout the session or switching halfway. However, such deterministic and transparent environments are not realistic, and a real-world patient-robot assignment algorithm would need to make decisions based on imperfect measurements from robot sensors. We have begun work on a stochastic expansion of that approach [12], but such a fully automated approach ignores a rich source of information about how to assign patients to robots: human therapists, who generally supervise rehabilitation sessions and should be able to demonstrate appropriate patient-robot assignments.
In this paper, we thus present a patient-robot assignment system that has access to imperfect measurements from robot sensors and learns assignment decisions from a human expert by demonstration. The human may make these decisions directly based on the same measurements or may make decisions based on qualitative patient observations. Notably, our approach does not require therapists to consciously encode their decision-making procedure into a step-by-step heuristic. As data from actual rehabilitation gyms are not yet available, we evaluate the approach in a simulated rehabilitation gym.

B. Dynamic Task Allocation and Scheduling
Scheduling patients' skill training tasks in a rehabilitation gym can be framed as a dynamic multi-robot task allocation problem where a set of patients' skill training tasks are allocated to multiple robot trainers dynamically in a way that optimizes overall training outcomes. Dynamic task allocation is an essential problem in multi-agent systems and has been studied in many applications, including coordinated assembly by a team of human and industrial robots [13], health and social care services via multiple assistive robots [14], and collective package delivery by a team of drones [15].
Classical methods for multi-agent task assignment and scheduling typically involve formulating the problem as a mathematical program (e.g., combinatorial optimization) and leveraging optimization algorithms to find exact or approximate solutions [16]. These algorithms rely on complete knowledge of the problem or on problem-specific heuristics supplied by a human expert. Thus, they are not suitable for nondeterministic problems such as the one in this paper. Moreover, developing problem-specific heuristics requires substantial domain knowledge and trial-and-error [17].
As an alternative, prior studies involving multi-agent dynamic task assignment for rehabilitation have used genetic algorithms [18] for task allocation. These algorithms iteratively develop optimal solutions by creating "generations" of solutions, gauging their quality, and using the highest-quality solutions to generate a new generation. Though such algorithms can produce good results, they need a way to determine the quality of possible solutions. This can be difficult in patient-robot assignment scenarios involving uncertainties that influence outcomes: as patient skill improvement and assessment are not knowable before training, it is difficult to determine solution quality in advance.
To overcome limitations of classical methods, recent works have begun automating the discovery of task allocation, scheduling heuristics, and policies using machine learning [19], [20], [21]. Task allocation and scheduling problems were framed as sequential decision-making problems and solved by policy learning, a subfield of machine learning that aims to learn computational models for sequential decision-making. Gombolay et al. [19] proposed an apprenticeship scheduling method to learn domain expert heuristics for a class of scheduling problems. Ingimundardottir and Runarsson [21] used imitation learning to learn dispatching rules based on hand-crafted features describing scheduling states.
Inspired by recent work on learning from demonstration [19], [21], the current study proposes to learn a domain expert's patient-robot scheduling behavior from demonstration and build a computational model that can be used for later automated scheduling. Such learning from demonstration is suitable for the considered problem since complete knowledge of the environment is unavailable (precluding the use of classical methods) and the quality of solutions is not known beforehand (precluding the use of genetic algorithms), but a human domain expert should be able to demonstrate their scheduling behavior. Learning from demonstration can be used to encode this behavior into computational models without the expert needing to explicitly codify the decision-making. Such codifying would be needed for approaches involving hand-crafted heuristics or features, but is often impossible since, e.g., a human expert may base their heuristics on qualitative patient observation rather than quantitative sensor data.

C. Contribution of Current Paper
This paper presents a patient-robot assignment approach for a robotic rehabilitation gym that learns scheduling strategies from human experts. Patient skill acquisition is formulated as a stochastic process, and the model describing the skill improvement is unknown to the scheduler. The problem considered in this paper differs from scheduling problems in previous studies [19], [21] in that neither the precedence nor duration of tasks are known a priori; instead, both must be determined by the scheduler. Unlike our related work [11], the objective function to be optimized is not known explicitly and can only be evaluated numerically. An early version of the current work was presented at a conference, but involved a much simpler gym environment where a patient's success rate was the only performance indicator and all patients had the same skill acquisition patterns [22]. In this paper, we:
• Developed a stochastic sequential decision-making formulation for the patients' skill training scheduling problem in a multi-patient multi-robot rehabilitation gym.
• Proposed a two-stage patient-robot assignment algorithm and trained neural network models to encapsulate the assignment behavior of a domain expert.
• Verified the proposed method in three different simulated gym environments with different complexities.

A. Scenario Description
A robotic gym in our study comprises m patients and n rehabilitation robots. Each patient has k skills that can be trained, each of which represents a different functional ability such as hand function, shoulder function, etc. The duration of a training session spans T discrete time steps. Before each time step, a domain expert (e.g., a therapist) assigns each patient to a robot, and the patient then trains on that robot for one time step. We make the following assumptions and constraints for the gym scenario (introduced in our previous work [11]):
• All patients start and finish the training session simultaneously at the first and last time step, respectively.
• Each robot can only train one skill, and each skill is only trained on a single robot.That is, n = k.
• At any given time step, a robot can only be used by one patient, and a patient can only train on a single robot.
• Assignments are made before every time step, and each patient uses their assigned robot for the entire time step.
Training improves the patient's skill levels, which are not measurable directly but can be inferred from the patient's measurable performance. Common performance metrics in the real world include task success rate, exercise intensity, and motion quality [23], [24]. In our specific scenario, it is assumed that the patient's task success rate and (optionally) workload are measured at the end of each time step and are correlated to the corresponding skill level in a stochastic way. Additionally, the robot's difficulty level and the patient's diagnosis (hereafter referred to as robot and patient characteristics) are considered in our scenario. Rehabilitation robots commonly feature adjustable difficulty levels [3], [25], which must be set by the assignment agent when the patient is assigned to the robot. Different diagnoses (e.g., stroke in different locations, spinal cord injury, cerebral palsy) may result in different relationships between the patient's skill level and performance [26].

B. Dynamic Patient-Robot Assignment Problem
Given the scenario described in section II-A, we present the mathematical formulation of the dynamic patient-robot assignment problem. Similarly to our previous study [11], let R = {r_1, r_2, ..., r_n} be a set of robots, P = {p_1, p_2, ..., p_m} be a group of patients, and S = {s_1, s_2, ..., s_k} be a set of motor skills. Each patient p_i ∈ P has a set of real-valued features, p_i(t) = {γ_{p_i,r_1}(t), γ_{p_i,r_2}(t), ..., γ_{p_i,r_n}(t)}, that are measured at every time step t ∈ [1, T]. Each element of the feature set, γ_{p_i,r_j}(t), is a vector that describes the training performance and (optionally) robot/patient characteristics with respect to robot r_j. We consider three scenarios with increasing complexity.
1) Scenario 1: Feature vector γ_{p_i,r_j}(t) only includes the success rate of patient p_i on robot r_j, denoted by g_{p_i,r_j}(t), ∀r_j ∈ R. A success rate exists for each robot and is obtained during the most recent time step when the patient trained with that robot. This is the simplest scenario: there is only one performance metric, all patients behave similarly, and no further choices need to be made after patient-robot assignment.
2) Scenario 2: The feature vector γ_{p_i,r_j}(t) includes the success rate of patient p_i on robot r_j and the difficulty level of training on robot r_j, denoted by d_{p_i,r_j}(t), ∀r_j ∈ R. Both variables exist for each robot and are recorded during the most recent time step when the patient trained with that robot. This situation represents a more complex scenario where different robot settings must be chosen and consequently influence the patient's performance [3]. The most recent difficulty level is included in the feature vector to account for its influence on patient performance, and a new difficulty level is set for each patient-robot assignment in the current time step.
3) Scenario 3: The feature vector γ_{p_i,r_j}(t) includes the success rate of patient p_i on robot r_j, the workload of patient p_i on robot r_j, denoted by w_{p_i,r_j}(t), and the patient's diagnosis, denoted by c_{p_i,r_j}(t). This represents an alternative expansion of the first scenario: there are now two performance metrics (success and workload, which could represent, e.g., the amount of movement, mean force applied, or mean muscle activation [24]), and different patients have different diagnoses. In this scenario, we consider two different diagnoses, and a patient's diagnosis applies to all skills of that patient.
A hidden variable, h_{p_i,r_j}(t), ∀p_i ∈ P, r_j ∈ R, is defined as the skill level of patient p_i on robot r_j at time step t. As we assume that each robot only trains one skill and each skill is only trained on one robot, the hidden variable h_{p_i,r_j}(t) is associated with the skill, s_j ∈ S, of patient p_i trained on robot r_j. Note that h_{p_i,r_j}(t) is unobservable but could be estimated from the patient's measurable features γ_{p_i,r_j}(t) associated with robot r_j.
The dynamic patient-robot assignment process represents a stochastic sequential decision-making problem where patients are assigned to robots at every time step based on performance features and robot/patient characteristics in a way that optimizes the overall training outcome. In this paper, we design an automated system that dynamically assigns patients to robots by learning assignment behaviors from a domain expert's demonstrations. The learned behavior is encapsulated as a neural network policy and is intended to achieve training outcomes similar to those obtained by the domain expert.

C. Robotic Gym Simulator and Synthetic Data
1) Robotic Gym Simulator: As real-world data from robotic rehabilitation gyms are not available, we developed a simulator that reproduces the dynamic process of the robotic gym scenarios described in section II-A. This simulator is based on the one used in our previous work [11]; unlike the previous one, it accepts human input. It works as follows.
a) Constraints: Skill level, success rate, and workload are all constrained to the range 0-100. If any value ever falls outside this range due to random factors, it is clipped to 0 or 100.
b) Impairment and diagnosis: The simulation starts by initializing each patient's impairment level, denoted as L_{p_i} for patient p_i. This value is analogous to, e.g., a real-world Fugl-Meyer score [27] and ranges from 0 (no impairment) to 100 (no ability, e.g., complete paralysis). In this study, a patient's impairment level applies to all their skills.
In scenario 3, each patient is assigned a diagnosis of 1 or 2 at the start of the simulation. In scenarios 1 and 2, all patients have diagnosis 1. In this study, diagnoses affect the relationship between skill level, success rate, and workload.
c) Difficulty level: In scenario 2, all robots have difficulty level settings that can be manually changed by the domain expert (options: 1, 2, 3) and are initially 1 for all robots. In scenarios 1 and 3, all robots are always set to difficulty 2.
d) Initial skill level, success rate, workload: Each patient's initial skill levels at time t = 0 are determined by adding random noise to the impairment value. For patient p_i on robot r_j, this is given as h_{p_i,r_j}(0) = (100 − L_{p_i}) + ω_{h0}. Thus, a patient's initial skill levels are inversely proportional to their impairment. ω_{h0} ∼ N(0, 3²) is a random number sampled from a Gaussian distribution with zero mean and a standard deviation of 3; it ensures that not all skills start at the same value. Realistically, some diagnoses may result in large differences between a patient's individual skills [26], but such intraindividual variability was not modeled in this study.
Initial success rate and workload for each patient-robot pair are calculated at time t = 0 using equations that relate skill level to success and workload (see below). It is assumed that these initial values are available from, e.g., when a patient trained in a previous session.
e) Summary of step-by-step process: At the start of each time step, the domain expert manually assigns patients to robots. The expert cannot see impairments or skill levels but can see success rate and (in the relevant scenarios) diagnosis, difficulty level, and workload from when each patient last trained with each robot. In scenario 2, after assigning patients to robots, the expert also sets difficulty levels for all robots. After each time step, the success rate and workload are updated, and skill level is updated based on the updated success rate and workload. Visible variables are then returned for assignment at the next time step. The process is repeated until the simulation ends after T time steps.
f) Skill improvement: Whenever patient p_i trains on robot r_j, the corresponding skill level is updated with

h_{p_i,r_j}(t+1) = h_{p_i,r_j}(t) + max(I_{p_i,r_j}(t) + ω_{ht}, 0). (2)

Eq. 2 models an additive increase: the new skill level is the old skill level h_{p_i,r_j}(t) plus deterministic element I_{p_i,r_j}(t) plus stochastic element ω_{ht}. The sum of the two added elements is constrained to be nonnegative so that training cannot worsen skills. ω_{ht} is sampled from the uniform distribution U(−1, 1) and represents the contribution of random factors such as patient mood. Realistically, the two elements might not be additive, and improvement might not always be positive (as patients may, e.g., practice unwanted movements that must be unlearned [28]).
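The per-step skill update can be sketched as follows. This is a minimal illustration, not the authors' implementation; the deterministic improvement I (the scenario-specific piecewise functions of success rate) is supplied by the caller, and the function name is hypothetical.

```python
import random

_rng = random.Random(0)  # fixed seed for reproducibility of this sketch

def update_skill(h, improvement, rng=_rng):
    """One training step: h(t+1) = h(t) + max(0, I(t) + omega_h).

    `improvement` is the deterministic element I from the skill
    improvement curve (supplied by the caller); omega_h ~ U(-1, 1)
    models random factors such as patient mood. The nonnegativity
    constraint means training can never worsen a skill, and the
    result is clipped to the allowed range [0, 100].
    """
    omega_h = rng.uniform(-1.0, 1.0)
    h_new = h + max(0.0, improvement + omega_h)
    return min(100.0, max(0.0, h_new))
```

Note that with a strongly negative `improvement`, the `max(0, ...)` term simply leaves the skill unchanged rather than decreasing it.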
Element I_{p_i,r_j}(t) in Eq. 2 is defined as a piecewise function of success rate and is shown in Fig. 1a for both diagnoses. It is designed to produce diminishing returns as skill levels increase, a common pattern in motor learning that was also used in our previous work [11]. Furthermore, it is based on the motor learning concept that patients achieve optimal improvement when exercising at a moderate intensity that is neither too easy nor too hard [28]. For diagnosis 1, the function is given by Eq. 3: its optimal success rate is 60, and improvement decreases relatively rapidly at higher success rates to emphasize diminishing returns. Our preliminary conference paper used an optimum of 70 based on real-world literature recommendations [22], but that value did not adequately produce diminishing returns. For patients with diagnosis 2, the function is given by Eq. 4: its optimal success rate is 31, and it was designed to have a shape similar to Eq. 3 but with a lower optimal success rate and a slower decrease above it. The lower optimal success rate reflects the fact that success rate in diagnosis 2 has a different relationship with skill level.
g) Success rate and workload, overall: Success rate and workload are functions of skill level, since skill levels affect exercise performance [29]. Success rate is 0 at skill level 0 and 100 at skill level 100, but the detailed relationship between the two depends on diagnosis and difficulty level. Conversely, workload is never zero and peaks at a moderate skill level, modeling the real-world phenomenon where severely impaired patients can barely move their limb [26] while mildly impaired patients need less effort for the same task than moderately impaired patients (due to, e.g., higher motion efficiency).
h) Success rate: At difficulty level 2, success rate for diagnosis 1 is equal to skill level plus a random variable, representing a linear relationship between skill and success. At difficulty 1, success rate is higher than at difficulty 2 since training is easier, and the relationship is no longer linear. Conversely, at difficulty 3, success rate is lower since training is harder. Since skill improvement is highest at success rate 60, patients with low skill levels thus achieve the highest improvement at difficulty 1 while patients with high skill levels achieve the highest improvement at difficulty 3; as a patient's skill improves, they may need to increase the difficulty level. The function for success rate for diagnosis 1 is given by Eq. 5, where ω_g is sampled from the uniform distribution U(−4, 4). Fig. 1b shows plots of the deterministic component of Eq. 5.
As diagnosis 2 is not used in the scenario with three difficulty levels, its success rate is a function of skill level only. It increases linearly with skill level at low skill levels, but the rate of increase is very low compared to diagnosis 1 until the patient reaches skill level 60; above that, the rate of increase is much higher. This mimics patients who can move their limb but have poor motor control (e.g., due to involuntary movements); success is thus a poor indicator of skill level until motor control improves. As with diagnosis 1, the function includes a noise term ω_g sampled from a uniform distribution U(−4, 4).
i) Workload: Workload is a function of skill level only, since it is not used in the scenario with three difficulty levels. Since success rate in diagnosis 1 is already a good indicator of skill level, workload in diagnosis 1 does provide information but can practically be ignored by the domain expert. It is determined by

w_{p_i,r_j}(t) = −0.02 h_{p_i,r_j}(t)² + 2.3 h_{p_i,r_j}(t) + 28.5 + ω_w,

where ω_w is sampled from a uniform distribution U(−4, 4). Conversely, diagnosis 2 was intentionally designed so that workload is a better indicator of skill level than success rate at low skill levels: workload increases rapidly with skill level from skill level 0 to 20, then rapidly decreases from skill level 20 to 40, then decreases asymptotically above skill level 40. This is determined by a piecewise function, where ω_w ∼ U(−4, 4). Deterministic components of the workload vs. skill level curves for both diagnoses are shown in Fig. 2.
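The diagnosis 1 workload curve is fully specified above, so it can be sketched directly; the function name is hypothetical, and the clipping follows the constraint that all simulator values stay in 0-100. (The diagnosis 2 piecewise curve is not reproduced, as its coefficients are given only graphically.)

```python
import random

_rng = random.Random(0)  # fixed seed for reproducibility of this sketch

def workload_diag1(h, rng=_rng):
    """Workload for diagnosis 1 as a function of skill level h:
    w = -0.02*h^2 + 2.3*h + 28.5 + omega_w, omega_w ~ U(-4, 4),
    clipped to the allowed range [0, 100]. The deterministic part
    peaks at a moderate skill level (h = 57.5) and is never zero.
    """
    w = -0.02 * h ** 2 + 2.3 * h + 28.5 + rng.uniform(-4.0, 4.0)
    return min(100.0, max(0.0, w))
```

For example, the deterministic workload is 28.5 at skill level 0 and 58.5 at skill level 100, matching the description that both severely and mildly impaired patients exert less effort than moderately impaired ones.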
2) Synthetic Dataset: The synthetic dataset is constructed from data points of the domain expert's assignment decisions and the associated patients' feature sets, p_i(t), from all time steps.
a) Domain expert: In learning from demonstration, a domain expert is defined as a person with extensive experience in the application area. In our case, this would be a clinician who works in robotic gyms. However, as real-world robotic gyms are still uncommon and a clinician would likely not have equal insights into a simulated gym, co-author Novak was the stand-in domain expert for this study. Novak is an electrical engineer who has worked with rehabilitation robots since 2008 and specifically studied two-patient two-robot systems. She was involved in our previous deterministic optimization work [11] and helped develop the current simulation model.
Once the simulator was completed, Novak was given the simulator and all underlying mathematical equations. She repeatedly ran simulations and input training assignments until she identified consistent decision-making procedures for each scenario. The goal of the decision-making procedure was not prescribed in advance as long as it was consistent. She was originally asked to manually perform patient-robot assignments to generate the full dataset, but we found this to be prohibitively time-consuming given the target size of the dataset. Thus, after practicing for approximately 10 runs per scenario, Novak created four heuristics. The first three were considered the primary heuristics (one per scenario) and aimed to maximize mean skill gain across all patients and skills. The fourth was a secondary heuristic for scenario 2.
b) Heuristics for scenario 1: These heuristics use success rate as the sole basis for assignment. For each patient, the robots are sorted from most preferred to least preferred, with the most preferred being the robot for which the patient has the lowest success rate. The patient with the largest difference in success rate between their most preferred and second most preferred robot is assigned to their most preferred robot first. This rule is repeated with the remaining patients and robots until all patients are assigned. Finally, the heuristic checks whether switching any two patients' assignments would result in "improvement": whether one patient's success rate on the new robot would decrease more than the other patient's would increase. In this case, the patients' assignments are switched, and the process is repeated until no possible switch results in "improvement".
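The scenario 1 heuristic can be sketched as below. This is an illustrative reconstruction under stated assumptions (the function name is hypothetical, the number of patients does not exceed the number of robots, and ties are broken by index order); the paper's expert applied the rules manually.

```python
def assign_scenario1(success):
    """Greedy assignment from a success-rate table.

    `success[p][r]` is patient p's most recent success rate on robot r
    (lower success = more room to improve = more preferred).
    Returns a dict {patient: robot}.
    """
    patients = list(range(len(success)))
    robots = list(range(len(success[0])))
    assignment = {}
    remaining_p, remaining_r = patients[:], robots[:]
    while remaining_p:
        if len(remaining_r) == 1:
            best_p = remaining_p[0]
        else:
            # Patient with the largest gap between their most and second
            # most preferred remaining robot is assigned first.
            def gap(p):
                prefs = sorted(remaining_r, key=lambda r: success[p][r])
                return success[p][prefs[1]] - success[p][prefs[0]]
            best_p = max(remaining_p, key=gap)
        best_r = min(remaining_r, key=lambda r: success[best_p][r])
        assignment[best_p] = best_r
        remaining_p.remove(best_p)
        remaining_r.remove(best_r)
    # Pairwise swap pass: swap two patients if it lowers the summed
    # success rate, i.e., moves patients toward robots they are worse at.
    improved = True
    while improved:
        improved = False
        for a in patients:
            for b in patients:
                if a >= b:
                    continue
                ra, rb = assignment[a], assignment[b]
                if success[a][rb] + success[b][ra] < success[a][ra] + success[b][rb]:
                    assignment[a], assignment[b] = rb, ra
                    improved = True
    return assignment
```

Here the "improvement" check from the text is implemented as a net decrease in total success rate after the swap, which captures the stated condition that one patient's decrease must exceed the other's increase.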
c) Heuristics for scenario 2: These heuristics account for difficulty level. First, the heuristics for scenario 1 are applied to all patients whose most preferred robot was last set to difficulty level 1. Then, the same heuristics are applied to all remaining robots and patients whose remaining most preferred robot was last set to difficulty 2. Finally, the same heuristics are applied to all remaining patients. After each patient-robot assignment, difficulty is kept unchanged if the patient's previous success rate is below 60; difficulty is increased by 1 if success rate is greater than or equal to 60, unless difficulty is already at level 3.
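The difficulty update at the end of this heuristic is a simple threshold rule, sketched here (function name hypothetical):

```python
def update_difficulty(current, last_success):
    """Difficulty rule from the scenario 2 heuristic: keep the current
    difficulty if the patient's previous success rate is below 60;
    otherwise raise it by one level, capped at the maximum level 3."""
    if last_success >= 60 and current < 3:
        return current + 1
    return current
```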
d) Heuristics for scenario 3: These heuristics consider patients' diagnoses. First, in each time step, a skill estimate for each patient and each robot is obtained according to the following rules: for patients with diagnosis 1, estimate their skill level based on their success rate and the deterministic part of the skill improvement curve of diagnosis 1; for patients with diagnosis 2 and success rate below 20, estimate their skill level based on their workload; for patients with diagnosis 2 and success rate above 20, estimate their skill level based on their success rate and the skill improvement curve of diagnosis 2. It is assumed that the domain expert has prior knowledge of the deterministic component of the skill improvement equation. Once skill estimates are obtained, the heuristics for scenario 1 are applied to skill estimates rather than to success rates.
e) Secondary heuristic for scenario 2: The above heuristics all aim to maximize mean skill improvement across all patients by prioritizing each patient's lowest skill. Realistically, a domain expert could have other objectives. While this is beyond the scope of the main paper, section I of the Supplementary Materials presents a secondary heuristic for scenario 2 with a different objective: maximizing only one skill per patient.

D. Learning Scheduling Policy From Demonstrations
We propose a two-stage algorithm that automatically makes patient-robot assignments at each time step. The first stage predicts the highest-priority patient for each robot. The second stage resolves conflicts when multiple robots have the same highest-priority patient. It should be noted that this two-stage algorithm does not rely on the heuristic rules used by the domain expert in any way, as such rules would realistically not be known a priori and may differ between domain experts.
1) Priority Prediction: The key to learning a patient-robot assignment policy from demonstration data is to identify the relationship between demonstrated assignments and observed patients' performance features. For example, if patient p_i was assigned to robot r_j, we would like to find the patient features that had the greatest effect on the decision. These features should be evaluated with respect to those of other patients, since the decision may have been based on feature differences between patients. Therefore, we use the pairwise comparison method [19] for the classifiers to determine the priority of a robot between any two patients. Such pairwise comparison learns the reasoning about the relative importance of scheduled and unscheduled patients from the difference in their features. For each time step, a scheduled patient with regard to a robot is the one assigned to that robot by the expert, and all other patients are unscheduled with regard to that robot. The classifier, denoted as f(γ_{p_i,r_j}, γ_{p_x,r_j}) ∈ {0, 1}, takes as input the robot-dependent features, γ_{p_i,r_j} and γ_{p_x,r_j}, of two patients, p_i and p_x, and returns a continuous value between 0 and 1. This value represents the probability of the classifier's final binary output. The classifier then outputs 1 if this value is over 0.5 (patient p_i has priority) and 0 otherwise (p_x has priority).
a) Data preparation: Classifier training data consist of positive and negative pairwise comparison samples. Each positive sample, (φ_{p_i,p_x}, y_{p_i,p_x}), includes an input element φ_{p_i,p_x} and an output label y_{p_i,p_x}. The input element is defined as the difference in robot-dependent features between a scheduled patient p_i and an unscheduled patient p_x at time step t, i.e., φ_{p_i,p_x} = γ_{p_i,r_j}(t) − γ_{p_x,r_j}(t), ∀p_i ∈ P scheduled for robot r_j, ∀p_x ∈ P \ {p_i}. The associated output label is y_{p_i,p_x} = 1. Each negative sample, (φ_{p_x,p_i}, y_{p_x,p_i}), has an input element defined as the difference in robot-dependent features between an unscheduled patient p_x and a scheduled patient p_i at time step t, i.e., φ_{p_x,p_i} = γ_{p_x,r_j}(t) − γ_{p_i,r_j}(t), ∀p_i ∈ P scheduled for robot r_j, ∀p_x ∈ P \ {p_i}. The output label is y_{p_x,p_i} = 0.
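The sample construction for one robot at one time step can be sketched as follows (function name hypothetical, feature vectors represented as plain lists):

```python
def make_pairwise_samples(features, scheduled):
    """Build positive/negative pairwise comparison samples for one
    robot at one time step.

    `features[p]` is patient p's robot-dependent feature vector
    gamma_{p, r_j}(t); `scheduled` is the index of the patient the
    expert assigned to this robot. Positive samples are
    (scheduled - other, label 1); negative samples are the mirrored
    differences with label 0.
    """
    samples = []
    for p, feat in enumerate(features):
        if p == scheduled:
            continue
        diff = [a - b for a, b in zip(features[scheduled], feat)]
        samples.append((diff, 1))                 # phi_{p_i, p_x}, y = 1
        samples.append(([-d for d in diff], 0))   # phi_{p_x, p_i}, y = 0
    return samples
```

Each of the m − 1 unscheduled patients thus contributes one positive and one negative sample, keeping the training set balanced by construction.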
b) Classifier structure: The classifier was a multilayer perceptron neural network with an input layer, several hidden layers, and an output layer. The size of the input layer was the dimension of the feature vector, γ_{p_i,r_j}, associated with a patient-robot pairing. The size of the output layer was 1. Input and hidden layers were activated by a hyperbolic tangent activation function. The output layer was activated by a sigmoid function whose output represents the probability of the binary label predicting a patient's priority. Neural network architectures were different in each scenario (scenario 1: 1 hidden layer, 32 neurons; scenario 2: 2 hidden layers, 64 neurons/layer; scenario 3: 3 hidden layers, 64 neurons/layer) to improve performance.
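The forward pass of this architecture can be sketched in NumPy; this is an untrained illustration of the layer structure (tanh hidden layers, sigmoid output), not the paper's trained model, and the weights below are random placeholders sized for scenario 1 (input dimension 1, one hidden layer of 32 units).

```python
import numpy as np

def mlp_priority(x, weights, biases):
    """Forward pass of the priority classifier: tanh-activated hidden
    layers followed by a single sigmoid output unit giving the
    probability that the first patient of the pair has priority."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)          # hidden layer, tanh activation
    z = a @ weights[-1] + biases[-1]    # output layer pre-activation
    return 1.0 / (1.0 + np.exp(-z))     # sigmoid output in (0, 1)

# Scenario 1 sizing: 1 input (success-rate difference), 32 hidden units.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(1, 32)), rng.normal(size=(32, 1))]
bs = [np.zeros(32), np.zeros(1)]
prob = mlp_priority([0.4], Ws, bs)
```

In training (not shown), these weights would be fit with binary cross-entropy loss and the Adam optimizer, as described below; the final binary label is obtained by thresholding the sigmoid output at 0.5.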
c) Classifier training: A separate classifier was trained for each scenario using the pairwise comparison samples (φ_{p_i,p_x}, y_{p_i,p_x}) and (φ_{p_x,p_i}, y_{p_x,p_i}) created for that scenario. Each classifier was used for all robots, since all skills have the same skill improvement patterns. Classifiers were trained with the binary cross-entropy loss function [30] and the Adam optimizer [31].
d) Prediction procedure: Once the classifiers are trained, the prediction stage at each time step repeats the following procedure for every robot $r_j \in R$: (i) Choose a patient $p_i \in P$, perform a pairwise comparison with every other patient $p_x \in P \setminus \{p_i\}$ (i.e., $m - 1$ comparisons), and obtain a priority prediction from the classifier for each comparison. (ii) Count how many times patient $p_i$ is given priority by the classifier among the $m - 1$ labels and store the count in the variable $C_{p_i,r_j}$, i.e., $C_{p_i,r_j} = \sum_{p_x \in P \setminus \{p_i\}} f(\gamma_{p_i,r_j}, \gamma_{p_x,r_j})$. (iii) Obtain the probability (i.e., the continuous output of the last layer) returned by the classifier for each label where patient $p_i$ is given priority, and calculate the mean probability, denoted by $\overline{Pr}_{p_i,r_j}$. (iv) Repeat the above steps for every other patient. (v) After completing priority prediction for all patients, rank the patients based on $C_{p_i,r_j}$. The patient with the largest $C_{p_i,r_j}$, denoted by $p_i^*$, gets the highest priority for robot $r_j$, i.e., $p_i^* \leftarrow \arg\max_{p_i \in P} C_{p_i,r_j}$. For patients with the same $C_{p_i,r_j}$, use the mean probability $\overline{Pr}_{p_i,r_j}$ to rank the patients and choose the patient $p_i^*$ with the highest mean probability, i.e., $p_i^* \leftarrow \arg\max_{p_i \in P_{r_j}} \overline{Pr}_{p_i,r_j}$, where $P_{r_j} \subseteq P$ is the set of patients who have the same $C_{p_i,r_j}$ for robot $r_j$. At the end of this procedure in each time step, each robot $r_j \in R$ has a ranking of patients from highest to lowest priority.
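Steps (i)-(v) can be sketched as follows, with `clf` standing in for the trained classifier applied to a feature difference; all names and the data layout are illustrative assumptions.

```python
import numpy as np

def rank_patients_for_robot(clf, features, patients, robot):
    """Rank patients for one robot via pairwise comparisons.

    clf(diff) -> probability in (0, 1) that the first patient has priority.
    features: dict mapping (patient, robot) -> feature vector.
    Returns patients sorted from highest to lowest priority: primary key is
    the win count C, ties broken by the mean win probability Pr.
    """
    wins, mean_prob = {}, {}
    for p_i in patients:
        probs = []
        for p_x in patients:
            if p_x == p_i:
                continue
            # One pairwise comparison: classifier on the feature difference.
            probs.append(clf(features[(p_i, robot)] - features[(p_x, robot)]))
        win_probs = [pr for pr in probs if pr > 0.5]   # comparisons p_i wins
        wins[p_i] = len(win_probs)                     # C_{p_i, r_j}
        mean_prob[p_i] = float(np.mean(win_probs)) if win_probs else 0.0
    return sorted(patients, key=lambda p: (wins[p], mean_prob[p]), reverse=True)
```

Sorting on the `(win count, mean probability)` tuple implements the two-level ranking described above in a single pass.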
2) Conflict Resolution: Any conflict is resolved by choosing the robot for which a patient $p_i$ has the highest mean probability $\overline{Pr}_{p_i,r_j}$, i.e., $r_j^* \leftarrow \arg\max_{r_j \in R'_{p_i}} \overline{Pr}_{p_i,r_j}$, where $R'_{p_i} \subseteq R$ is the set of robots for which patient $p_i$ gets the highest priority. For every remaining robot $r_j \in R'_{p_i}$, choose the patient with the second-highest mean probability $\overline{Pr}_{p_i,r_j}$ if they have not been selected for any other robot; otherwise, check the patient with the next-highest mean probability, and so on, until all patients are assigned a robot.
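A simplified greedy sketch of this conflict resolution, assuming the per-robot rankings and mean probabilities have already been computed; the exact order of processing contested robots is our simplification of the rule above.

```python
def resolve_conflicts(rankings, mean_prob):
    """Assign each robot a distinct patient from its priority ranking.

    rankings: dict robot -> list of patients, highest priority first.
    mean_prob: dict (patient, robot) -> mean win probability.
    A patient topping several rankings keeps the robot with the highest
    mean probability; other robots fall through to their next candidate.
    """
    assignment = {}
    taken = set()
    # Process robots so that the robot with the highest mean probability
    # for its top-ranked patient claims that patient first.
    robots = sorted(rankings, key=lambda r: -mean_prob[(rankings[r][0], r)])
    for r in robots:
        for p in rankings[r]:
            if p not in taken:       # fall through to the next-ranked patient
                assignment[r] = p
                taken.add(p)
                break
    return assignment
```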

E. Evaluation Methodology
Our proposed method was tested in the simulator using the three primary heuristics described in section II-C for the three scenarios. Multiple patient groups with varied initial skill distributions were simulated for each scenario. All simulation runs involved 5 patients, 5 robots, and 12 time steps.
1) Simulated Patient Groups: To test whether our method is effective for patient groups with varied initial skill distributions, we created 10 uniform distributions from which each patient's impairment $L_{p_i}$, as in Eq. (1), was randomly selected to determine their initial skills. Table I summarizes these distributions. Distributions 6-9 used a mixture of two uniform distributions to account for situations where patients with both high and low impairment train together.
Two patient groups were created for each of the 10 impairment distributions. Thus, 20 patient groups were created and used for all three evaluation scenarios. In scenario 3, each patient was also assigned a diagnosis. For the two patient groups of each impairment distribution, one group had 2 patients with diagnosis 1 and 3 patients with diagnosis 2, whereas the other group had 3 patients with diagnosis 1 and 2 patients with diagnosis 2. Considering the stochasticity in patient skill improvement, each group of patients was simulated 5 times, with each simulation run reinitialized (i.e., the final result of a simulation run is not carried over to another run as its initial condition). There was therefore a total of 100 evaluation runs for each scenario.
2) Outcome Metrics: Since the proposed method aims to learn the domain expert's assignment behavior, our first outcome metric for each scenario was the accuracy of our algorithm relative to the domain expert. In a time step, a patient is considered correctly assigned if the robot chosen by our algorithm matches the domain expert's choice for that patient. The accuracy of a simulation run is defined as the percentage of correctly assigned patients across all time steps in that run. For each patient group, mean accuracy was calculated over the 5 runs.
Additionally, we wished to know whether our method achieves the same outcome as the expert even if it does not fully match the expert's behavior. We thus ran all simulations again with the same initial conditions, but with patient-robot assignments made by our trained method instead of the expert. For each simulation run, we calculated mean skill gain as

$$\overline{\Delta h} = \frac{1}{m} \sum_{p_i \in P} \sum_{t=1}^{T} \Delta h_{p_i,r_j}(t),$$

which represents the mean of all patients' total skill gains over the entire simulation run, $[1, T]$. Here, $\Delta h_{p_i,r_j}(t) = h_{p_i,r_j}(t) - h_{p_i,r_j}(t-1)$ represents the skill gain of patient $p_i$ on robot $r_j$, obtained as the difference in skill levels between time step $t$ and the previous time step $t-1$. The domain expert's primary heuristics were intended to maximize this outcome metric. As our proposed method aims to learn the expert's behavior, it does not directly aim to maximize mean skill gain, but should indirectly achieve a skill gain similar to the expert's if properly trained. To verify this, mean skill gain was calculated both for our method and for the domain expert.
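Because the per-step gains sum to the difference between final and initial skill levels, the metric can be computed directly from logged skill trajectories. A hedged sketch, where the array layout (one row per patient, one column per time step on that patient's assigned robot) is our assumption:

```python
import numpy as np

def mean_skill_gain(skill_levels):
    """Mean skill gain across patients over a simulation run.

    skill_levels: array of shape (m_patients, T + 1), where column t holds
    each patient's skill level at step t on the robot they trained with.
    Per-step gains Delta h(t) = h(t) - h(t - 1) are summed over t = 1..T
    and then averaged over patients.
    """
    gains = np.diff(skill_levels, axis=1)   # Delta h for each patient and step
    return float(gains.sum(axis=1).mean())  # mean over patients of total gain
```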
For each patient group, mean skill gain for each method was averaged over 5 runs.
3) Baselines: To contextualize our method's performance, we also created and evaluated three "baseline" methods:
• Random assignment: At each time step, each patient was randomly assigned to a robot, with equal probability for each robot. In scenario 2, this method used the same strategy from section II-C.2 to set the difficulty level at each time step (i.e., increase the difficulty if the success rate was over 60 and the difficulty was not at maximum). Though such random behavior would not be used in actual rehabilitation, it was included as a "least intelligent" dynamic baseline strategy.
• Static assignment: For the first time step, patient-robot assignments were made using the domain expert's heuristics. These assignments were then kept unchanged for all 12 time steps. In scenario 2, this method used the same strategy from section II-C.2 to set difficulty levels. This strategy mimics real-world robot gym studies where patients train with the same robot for the entire session [3].
• Switch halfway: For time steps 1-6, assignments were the same as in static assignment. After step 6, assignments were made again by applying the same expert heuristics to the partially trained patients, and those assignments were used for steps 7-12. This strategy mimics real-world robot gym studies where patients switch robots once mid-session [6]. We considered using the optimization method from our previous work as an additional baseline [11], but decided not to do so since that method is only suitable for deterministic environments. While we have begun working on a stochastic expansion of the method [12], it is not yet ready for use.
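The three baselines can be sketched as a single hypothetical generator function; the `heuristic` callable stands in for the domain expert's assignment rules, and its signature is our assumption.

```python
import random

def baseline_assignments(patients, robots, heuristic, T=12, mode="static", seed=0):
    """Generate per-time-step assignments for the three baseline schedulers.

    heuristic(patients) -> dict patient -> robot, standing in for the
    domain expert's rules (illustrative signature).
    Returns a list of T assignment dicts, one per time step.
    """
    rng = random.Random(seed)
    if mode == "random":
        # Each patient gets a uniformly random robot at every time step.
        return [{p: rng.choice(robots) for p in patients} for _ in range(T)]
    first = heuristic(patients)
    if mode == "static":
        # Initial expert assignment kept for the whole session.
        return [dict(first) for _ in range(T)]
    if mode == "switch_halfway":
        # Expert heuristic re-applied once after the session midpoint.
        second = heuristic(patients)
        half = T // 2
        return ([dict(first) for _ in range(half)] +
                [dict(second) for _ in range(T - half)])
    raise ValueError(f"unknown mode: {mode}")
```

Note that in the actual "switch halfway" baseline, the second heuristic call operates on the partially trained patients' updated state, which this self-contained sketch does not model.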
4) Statistical Analysis: Mean skill gain was analyzed using a two-way mixed analysis of variance (ANOVA) with one between-subjects factor (scenario: 1-3), one within-subjects factor (scheduler: our method, domain expert, three baselines), and 20 samples (patient groups) per cell. Post-hoc Sidak tests were used to compare scenarios 1-3 and to compare schedulers. Effect size was reported as partial eta-squared. Significant differences between schedulers were expected a priori, since our algorithm and the domain expert should achieve better skill gain than the baseline methods. Differences between scenarios were also expected due to the different scenario complexities.
5) Secondary Evaluation: Section I of the Supplementary Materials includes an evaluation similar to the one above, but for the secondary heuristic introduced in section II-C.2.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Fig. 3 shows the mean skill gain in the three scenarios for the five assignment methods (our algorithm, the domain expert, and the 3 baselines) using violin plots. Each violin plot shows the distribution of results for the 20 patient groups: each point represents one group's mean over 5 simulation runs. Points are randomly jittered horizontally to avoid overlap, and the contour curves represent the data distribution density. Inside each density curve is a box plot in which the rectangle shows the 25th and 75th percentiles and the central dot shows the median. Whiskers extend 1.5 times the interquartile range from the 25th and 75th percentiles.
Fig. 4 shows mean skill level across 5 patients and 5 skills over 12 time steps (rather than only the difference between skill levels at the start and end of the simulation run) obtained with five assignment methods in all three scenarios for two patient groups: group 3 and group 10 from Table I.
The Supplementary Materials include results of the secondary evaluation as well as accuracies and mean skill gains for individual patient groups with all methods and scenarios.

IV. DISCUSSION
The accuracies of our algorithm indicate that our method could learn, albeit imperfectly, patient-robot assignment behavior from a domain expert. The mean accuracies of scenarios 2 and 3 are slightly lower than that of scenario 1, likely due to the higher complexity of scenarios 2 and 3. With respect to mean skill gain, our algorithm performed worse than the domain expert but outperformed all baselines in all scenarios, as shown by the ANOVA and Fig. 3. This indicates the potential of intelligent dynamic scheduling compared to static assignment, nearly static assignment, or dynamic but unplanned scheduling.
Different patient groups behave differently due to systematic differences in initial skill levels: a patient with better initial skills has less potential for improvement. In Fig. 4, for example, group 3 (left) has higher initial impairment and thus more potential for improvement than group 10 (right). Additionally, mean skill gains were higher in scenario 1 than in scenarios 2 and 3. In scenario 2, this is likely because skill improvement becomes slower as difficulty increases. In scenario 3, it is likely because skill improvement in diagnosis 2 (only seen in this scenario) starts to diminish at a lower skill level (above skill 20.9) than in diagnosis 1 (above skill 60), resulting in less improvement within the same simulation duration.
In the secondary evaluation (see Supplementary Materials), static assignment achieves higher skill gain than our method. This is because the expert there pursues a different objective, one that we consider less relevant for rehabilitation; it is discussed further in the Supplementary Materials. This is not a weakness of our method, which still successfully learns from the expert; it simply indicates that the benefit of dynamic assignment depends on the training objective.
Overall, our method can learn different assignment behaviors from human experts with accuracies of 75-85%, and these learned behaviors result in skill gains similar to those of the actual expert behaviors. However, the method was only tested in simplified simulations, and we next address its limitations and future steps.

A. Scalability to Different Gym Sizes
As our method relies on pairwise comparison, it is scalable with regard to the number of patients: if only this number changes, the neural network does not need to be retrained. However, it does need to be retrained if the number of robots changes, since this number dictates the number of neurons in the network's output layer. This could be addressed by modifying the neural network to instead accept the robot type as part of the input and to have only one output neuron. This architecture would require the scheduler to run multiple input vectors through the neural network separately for each patient (one input vector per robot), but it would also ensure scalability to different numbers of robots. However, in practice, the number of robots in a rehabilitation gym is unlikely to change often, and occasionally retraining the model would likely be feasible and affordable.

B. More Complex Robots and Related Skills
The current study assumes that each robot trains only one skill, that all skills are independent of each other, and that patients can start or stop training with a robot instantly. Realistically, a robot may train multiple skills (e.g., both proximal and distal arm function), the skills trained by different robots may overlap (e.g., one robot trains only skill A while another mainly trains skill B with smaller increases in skill A), and skills may be correlated (training one skill with any robot also increases another skill). Each robot also has some setup time (e.g., attaching cuffs to a patient), and some robots allow two patients to use them simultaneously (e.g., in a competitive scenario). These aspects would likely not be difficult to implement in simulation and could be immediately tested with a new heuristic to evaluate whether a more complex environment still allows effective learning.

C. Multimodal Uncertainties
In this study, skill improvement consists of an additive deterministic component (a function of success rate) and a stochastic component with a uniform distribution. Our method would generalize to other relationships between skill improvement, success rate, and workload as long as they also involve a single input and additive components. Realistically, however, multiple factors introduce multimodal uncertainties into the scenario; these could initially be modeled in simulation.
As one example, unmotivated or fatigued patients exhibit lower performance and skill improvement than motivated, unfatigued patients [29]. Fatigue could be modeled as a variable that increases when workload is high; high fatigue would reduce success rate and workload. Motivation could be modeled as a variable that decreases when success is low (poor performance) or when patients have spent several time steps on the same robot (monotony); low motivation would again reduce performance. If motivation dropped too low, patients could even leave the gym (though, in the current simulator, patients cannot leave or enter separately). The scheduler would not have direct access to motivation and fatigue values but could potentially estimate them via additional measurements. However, such modeling would likely be much more complex than the expansions in sections IV-A and IV-B.
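A toy version of such a fatigue/motivation model could look as follows; all thresholds and rates are illustrative assumptions, not values from the paper or from [29]:

```python
def update_state(fatigue, motivation, workload, success_rate,
                 fatigue_rate=0.1, recovery=0.05, boredom=0.02):
    """Toy per-time-step update for hypothetical fatigue/motivation
    variables in [0, 1]; all rates and thresholds are illustrative."""
    # Fatigue rises under high workload and recovers otherwise.
    if workload > 0.7:
        fatigue = min(1.0, fatigue + fatigue_rate)
    else:
        fatigue = max(0.0, fatigue - recovery)
    # Motivation drops with poor performance and decays with monotony.
    if success_rate < 0.4:
        motivation = max(0.0, motivation - 0.1)
    motivation = max(0.0, motivation - boredom)
    return fatigue, motivation
```

In a simulation, these hidden variables would then feed back into the success rate and workload models, while remaining unobservable to the scheduler.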

D. Additional Measurements, Diagnoses and Heuristics
Our study involved 3 scenarios, 2 performance metrics, one robot setting (difficulty), and 4 heuristics. Future simulations could add further performance metrics, robot settings, and diagnoses. More complex heuristics with different training goals could then be created, and our method could be evaluated with regard to its ability to learn these heuristics, allowing a robust study of the method's advantages and limitations.
We particularly note that, in this study, expert behavior was encoded as if-then heuristics. Realistically, such heuristics may not be encodable, as experts such as therapists may not be able to describe their decision-making using if-then rules. Even if they can, these rules may not match their actual behavior or may be based on information not available to scheduling algorithms (e.g., motivation, which therapists can ask patients about). Still, some complex or unreliable heuristics could be studied in simulation. For example, hidden variables (motivation, fatigue) could be made visible to the expert but not to the assignment scheduler, allowing the expert to make decisions based on these variables. Unpredictable behavior could be simulated by having the expert randomly deviate from the predefined heuristic.

E. Human-Machine Collaborative Scheduling
After training, the current scheduler made patient-robot assignments with no further learning and no option for humans to intervene. This would likely not be the case in reality, where a human therapist could override the scheduling system if it attempted to make unpleasant or harmful decisions. Additionally, patients themselves may not agree with a robot assignment. Thus, we foresee two expansions of the system to enable human-machine collaborative scheduling.
First, the scheduling system could be expanded with the ability to quantify its confidence in assignment decisions. An example of such confidence-aware models is Bayesian neural networks [32]. If the system had low confidence in an assignment, it could query the therapist, who could then accept or reject the assignment. Second, the scheduling system could be equipped with a learning algorithm that leverages the therapist's input to modify future decision-making. The system would thus gradually learn and improve its accuracy while reducing the frequency of therapist queries.

F. Real-World Robot Gym Evaluations
We have largely discussed how our method could be improved in simulation. However, not all of the above steps need to be completed before testing our method in real-world environments. If data from real robot gyms become available, our method could be tested on them as long as the data include patient-robot assignments together with the corresponding measurements and as long as the assignment strategy was consistent. The patient-robot assignments would not even need to be dynamic; assignment behavior could also be learned from initial assignments performed only at the start of the session.
Alternatively, an experimental platform could be set up with a simplified robot gym and unimpaired participants: for
example, five exercise devices that all train different motions. Participants could be assigned to different devices by an experimenter based on their exercise performance, and our method could be used to try to learn the assignment behavior. This would require a large sample size, but would allow a first test of our method without recruiting patients.

V. CONCLUSION
Our study presented a learning-based approach that trains neural network models to imitate a domain expert's patient-robot assignment behavior in a multi-patient multi-robot rehabilitation gym. Neural network models were trained via supervised learning on a synthetic dataset created with our robotic gym simulator. A two-stage assignment algorithm then automatically made dynamic patient-robot assignments throughout a training session. We evaluated our approach in three scenarios with different complexities and found that our algorithm imitated the domain expert with accuracies of 84.5% ± 4.6%, 77.4% ± 9.5%, and 78.0% ± 12.8% in the three scenarios. Our algorithm resulted in lower skill gains than the domain expert but was superior to the three baseline methods.
Our approach provides a way to learn computational models that enable automated patient-robot assignment and scheduling for a rehabilitation gym. Though it was only tested in relatively simple scenarios, the approach leverages human knowledge to solve patient training scheduling problems without complete knowledge of patient skill acquisition dynamics [11]. Our future work will explore more complex environments as well as human-machine collaborative scheduling systems that continuously learn from human experts.

Fig. 1. Deterministic components of (a) skill improvement as a function of success rate for different diagnoses and (b) success rate with diagnosis 1 as a function of skill level for different difficulty levels.

Fig. 2. Deterministic components of success rate and workload as functions of skill level for diagnoses 1 and 2. Since success rate increases linearly with skill in diagnosis 1, it makes sense to rely primarily on success rate when estimating skill of patients with diagnosis 1; conversely, for diagnosis 2, workload is more useful than success rate at low skill levels.

Fig. 3. Violin plots of mean skill gain for all 20 patient groups with different assignment methods in the three scenarios.