A Machine Learning-Based Initial Difficulty Level Adjustment Method for Balance Exercise on a Trunk Rehabilitation Robot

Trunk rehabilitation exercises such as those for remediating core stability can help improve the seated balance of patients with weakness or loss of proprioception caused by diseases such as stroke, and aid the recovery of other functions such as gait. However, there has not yet been any reported method for automatically determining the parameters that define exercise difficulty on a trunk rehabilitation robot (TRR) based on data such as the patient's demographic information, balancing ability, and training sequence. We propose a machine learning (ML)-based difficulty adjustment method to determine an appropriate virtual damping gain ($D_{\textit{virtual}}$) of the controller for the TRR's unstable training mode. Training data for the proposed system were obtained from 37 healthy young adults, and the trained ML model was tested through experiments with a separate population of 25 healthy young adults. Leave-one-out cross validation on the training group (37 subjects) showed 80.90% average accuracy ($R^2$ score) in using the given information to predict the desired difficulty levels, which are represented by the level of balance performance quantified as the Mean Velocity Displacement (MVD) of the center of pressure. Statistical analysis (repeated measures analysis of variance) of subject performance also showed that the ground truth difficulty levels from the training data and the predicted difficulty levels did not differ significantly under any of the three exercise modes used in this study (Hard, Medium, and Easy), and the standard deviations were reduced by 16.39%, 41.39%, and 25.68%, respectively. Moreover, the Planar Deviation (PD) of the center of pressure, which was not the target parameter here, showed results similar to the MVD, which indicates that the predicted $D_{\textit{virtual}}$ affected the difficulty level of balance performance. Therefore, the proposed ML model-based difficulty adjustment method has potential for use with people who have varied balancing abilities.


I. INTRODUCTION
TRUNK rehabilitation exercises, such as those for remediating core stability, can help improve the seated balance of patients with weakness or loss of proprioception caused by diseases such as stroke and aid the recovery of other functions such as gait [1], [2]. Therefore, trunk rehabilitation exercises are extensively prescribed during stroke recovery [3].
However, since such rehabilitation exercises require extensive therapist input, they can benefit greatly from the use of robotic devices that can reduce the therapists' workload [4]. Furthermore, rehabilitation robots have the advantages of providing quantitative data acquisition and training [4], [5]. Thus, a number of trunk rehabilitation robots (TRR) have been developed for the evaluation and rehabilitation of seated balance [6], [7], [8], [9]. It has also been reported that, as compared to conventional rehabilitation protocols, the use of robotically generated unstable seating conditions for the training of chronic stroke survivors resulted in greater improvements in their proprioceptive and postural control, and reactive balance [8], [9]. Inspired by these benefits, we have developed a TRR that can be used to provide core stability and strength training to patients with seated balance deficiencies caused by factors such as stroke [10].
In previous TRR studies with healthy young people, the unstable seat mode was used with a fixed control parameter (virtual damping gain) under various biofeedback conditions [10], [11]. The results of these studies showed large standard deviations (SD) of the balance outcomes. This indicates that although the control parameter was the same, different participants experienced different amounts of balance difficulty due to their different balancing abilities. Such differences in difficulty may be amplified in elderly or patient populations due to the higher variability in their balancing abilities [12].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Selecting an appropriate difficulty level is essential for maximizing therapy engagement and preventing frustration [13], [14]. Furthermore, an inappropriately high level of difficulty may reduce balance so much that it can increase the fear of falling, fall risk, and mobility limitations that can result in reduced independence in performing the activities of daily living [15]. Conversely, increasing the level of robotic assistance in case of poor performance/participation can lead to slacking, where patients gradually become passive and start relying on the robotic assistance [16]. Therefore, it is imperative to provide training on an appropriate level of difficulty for each patient undergoing trunk rehabilitation training. The trial-and-error method to determine the difficulty level is time-consuming and can cause tiredness and reduction in the patient's concentration levels, which may affect the outcomes of the difficulty adjustment process. Additionally, in clinical rehabilitation practice, selecting the exercise difficulty level and adapting it over the therapeutic course is a challenging task that is often left to the therapists' subjective perception of a patient's abilities [16].
Research on adaptive control of rehabilitation robots has focused on minimizing the tracking error or the supporting force provided by assist-as-needed (AAN) systems [5], [17], [18], [19], [20]. Variable control gains of adaptive or fuzzy controllers are tuned in real-time to determine the exact amount of supporting force or torque, in order to increase the training efficiency or overcome uncertainties in the human-robot interaction model. However, due to complex control logic, it takes time for the adaptive value to converge progressively for each user [16].
Andrade et al. suggested an evolutionary algorithm (EA) based dynamic difficulty adjustment (DDA) for games that adjusted the moving distance or speed of the game character using the EA integrated with the user model, in order to obtain the desired score in the game [21]. Sekhavat suggested a Multiple-Periodic Reinforcement Learning (MPRL) method that makes it possible to evaluate different objectives of difficulty adjustment during separate periods of an arm movement tracking game [22]. These works considered the individual variability in the deficits and behavior of patients in order to optimize the impact of rehabilitation. However, since their focus was on the immersion aspect of the game interface rather than the rehabilitation robot or movement performance, they seem more suited to simple rehabilitation movements and long-term rehabilitation training.
Shirzad explored the usefulness of participants' motor performance (visual distortion) and physiological signals (skin conductance rate, temperature, etc.) during a typical upper-arm reaching task with 24 healthy people for predicting their desirable difficulty levels; however, the system evaluation was limited to validation on the training data (hold-out cross validation) [23]. Yan et al. suggested an assistive force training control strategy and a corresponding participation model based on the support vector machine for seated and reclining training with a lower limb rehabilitation robot [24]. They divided the difficulty into three stages (overchallenge, challenge, less challenge) and carried out system evaluation with a group of participants (ten for training, two for verification). Metzger et al. clinically applied long-term rehabilitation training through difficulty adjustment to six stroke patients using an upper-limb rehabilitation robot [16]. They were able to maintain the participants' performance within 70% of the target level, demonstrating the effectiveness of their proposed training method. This work confirmed the importance of setting the initial difficulty level. However, until now, no method has been reported that automatically determines the initial difficulty setting of a trunk rehabilitation robot from data such as the patient's demographic information, balance performance (reference information), and training sequence, in a setting similar to the clinical training environment.
Therefore, in this work, we propose a machine learning (ML) based difficulty adjustment method that determines the appropriate virtual damping gain ($D_{\textit{virtual}}$) of the controller for the TRR's unstable training mode. This work was done to evaluate two hypotheses. First, there will be no significant difference between the training and evaluation experiments in the balance performance measure, the Mean Velocity Displacement (MVD) of the user's Center of Pressure (COP), but the standard deviation will be reduced in the evaluation experiment. Second, when the error is defined relative to the average MVD value of the 37 training participants, the mean error of the evaluation experiment results will be significantly lower than that of the training experiment results. This would mean that the proposed ML model, which takes into account the participant's demographic information, balance ability, etc., can accurately predict the $D_{\textit{virtual}}$ parameter needed to obtain test values close to the desired MVD. Achieving the desired MVD value is of interest to us because mean COP velocity is a reliable measure for assessing postural steadiness [25], [26], which is a goal of balance rehabilitation.

II. METHOD
A. Trunk Rehabilitation Robot (TRR)
Fig. 1 (a) shows the experimental setup with the TRR used in this study. The TRR can move with 4 degrees of freedom (pitch, roll, yaw, and heave) using servomotors. The roll and pitch movements correspond to the Mediolateral (M) and Anteroposterior (A) movements of the body, and are used to challenge the user's balance. The seat has load cells to measure the user's COP position, which is shown as a point on the visual display provided through a 27-inch LED monitor placed in front of the user at eye level.
Seat movements are controlled by software running in LabVIEW (National Instruments, USA) that, in the unstable seat mode, takes in the user's COP position and moves the seat accordingly based on commands from an admittance controller [27]. The COP data is also stored for later balance performance evaluations. In the controller used, inertia is calculated based on the distance and direction of the COP position from the origin, in order to determine the movement angle and direction. The seat movement speed is calculated based on the virtual damping gain ($D_{\textit{virtual}}$), which determines the seat's motion sensitivity. In addition, to avoid overshoot or fluctuating movement of the seat during balance training, we have excluded the virtual stiffness term [27]. As shown in Fig. 1 (b), the user must maintain their COP within the 50 × 50 mm target square centered at the (0, 0) position. This target size was set based on the range of projection of the center of mass of an average person for a maximum trunk tilt of ±5° [28].
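To illustrate how the damping gain sets the seat's sensitivity, the following is a minimal single-axis sketch of a damping-only admittance law. It is an assumption-laden illustration, not the controller of [27]: the inertia term is omitted, the driving input is taken to be seat load times COP offset, the function and parameter names are hypothetical, and the unit scaling is purely illustrative. The speed and tilt limits are the ones stated in the protocol.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def seat_step(angle_deg, cop_offset_mm, load_n, d_virtual,
              dt=0.01, max_speed=15.0, max_tilt=15.0):
    """Advance one control step for one axis (roll or pitch).

    Assumption: speed = driving input / d_virtual, with no stiffness term
    and no inertia term. Units are illustrative only.
    """
    driving_input = load_n * cop_offset_mm         # hypothetical input signal
    speed = clamp(driving_input / d_virtual, -max_speed, max_speed)
    new_angle = clamp(angle_deg + speed * dt, -max_tilt, max_tilt)
    return new_angle, speed


# Same COP offset, two gains: the larger gain yields a slower, less sensitive seat.
_, speed_low_gain = seat_step(0.0, 20.0, 600.0, 5000.0)
_, speed_high_gain = seat_step(0.0, 20.0, 600.0, 15000.0)
```

Under these assumptions, a lower gain produces faster seat motion for the same COP offset, which is consistent with the lower gain values being the harder conditions.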

B. Participants (Training Data Set Experiment)
We carried out an experiment with 37 healthy people (21 male and 16 female) to obtain the training data set for the ML model (Age: 22.8 ± 3.5 years, Height: 168.7 ± 7.6 cm, Weight: 63.3 ± 10.9 kg). This study was approved by the Institutional Review Board at Gwangju Institute of Science and Technology, Gwangju, South Korea (20220216-HR-65-04-04) and was performed in accordance with the Declaration of Helsinki. None of the participants suffered from any neurological, musculoskeletal, or vestibular disorders. All subjects gave written informed consent before participation.

C. Protocol (Training Data Set Experiment)
All participants of the training data set collection experiment performed seated balancing tasks under four $D_{\textit{virtual}}$ values (5,000, 10,000, 15,000, and 20,000; see Table I), presented in pseudo-random order. In order to determine the appropriate damping values for this study, we carried out preliminary experiments with subjects not included in the main study. These experiments revealed that for damping gain values of less than 5,000 Ns/mm, the system becomes too sensitive for the subject to use confidently, so we kept this as the lowest gain value. Similarly, we found that for damping gain values greater than 20,000 Ns/mm, the system becomes too insensitive, so we kept this as the highest gain value. To allow ease of testing, we divided the range defined by these two gain values into equal parts to obtain the 4 test values used in our experiment. In order to obtain reliable test data, we carried out two trials under each of the four gain conditions (8 trials in total) with a break of 1 min between trials. The order in which conditions were presented to each subject was randomized using an online randomization tool [29]. During all trials, participants were asked to sit with their arms crossed across their chest and try to keep their COP inside the target region mentioned earlier. Before each trial, the system was calibrated to make the subject's balanced COP position coincide with the origin.
Each trial lasted 70 sec and data from the middle 60 sec was used for analysis. The seat movement was limited to a maximum speed of 15 deg/sec and a maximum tilt of 15 deg. The COP data during all trials were recorded at 100 Hz and used for analysis. Furthermore, as shown in Fig. 1 (a), all subjects wore an IMU (Inertial Measurement Unit) (Noraxon, USA) at the lower thoracic level to record their trunk accelerations at 100 Hz, which were also analyzed after the experiments. The total size of one participant's raw data for the training data set was 48,000 rows (6,000 rows per trial) × 34 columns. Fig. 2 shows the training and testing process of the ML model, with its inputs and outputs, which was designed with the testing environment in mind. To adjust the initial difficulty level, a new participant performs only two trials, which serve as the reference for evaluating their balance ability (see Fig. 2). Then, the virtual damping gain suitable for the participant is predicted based on their demographic information, reference results, and desired balance performance value. The value of $D_{\textit{virtual}}$ = 10,000 is used to obtain the reference results because, in the training data set collected from 37 participants, the standard deviation of the outcome measure was the highest at this value among the 4 conditions tested, meaning that this value had the widest data distribution.
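The raw data sizes quoted above follow directly from the sampling rate and the analysis window. A minimal sketch, assuming the middle 60 sec is obtained by trimming 5 sec from each end of a 70 sec trial (the function and constant names are hypothetical):

```python
SAMPLE_RATE_HZ = 100   # COP and IMU recording rate
TRIAL_SECONDS = 70     # full trial length
TRIM_SECONDS = 5       # discarded at the start and at the end of each trial
TRIALS_PER_PARTICIPANT = 8


def middle_window(samples, rate_hz=SAMPLE_RATE_HZ, trim_s=TRIM_SECONDS):
    """Return the samples with trim_s seconds removed from both ends."""
    n_trim = trim_s * rate_hz
    return samples[n_trim:len(samples) - n_trim]


# Placeholder trial: sample indices stand in for the recorded values.
trial = list(range(TRIAL_SECONDS * SAMPLE_RATE_HZ))
kept = middle_window(trial)
rows_per_participant = TRIALS_PER_PARTICIPANT * len(kept)
```

Here `len(kept)` is the 6,000 rows per trial and `rows_per_participant` is the 48,000 rows per participant.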

D. Machine Learning Models
All the parameters obtained from the subject, TRR, and IMU are shown in Table II. The data set was divided into the participant's demographic information, COP from the TRR, angle of seat from the TRR, and trunk acceleration from the IMU. Among the demographic information, height and weight were expected to affect balance performance through the height of the center of mass, and the trial order was included because a learning effect from adapting to the unstable mode was expected over the course of the experiment. In addition, the COP and angle of seat data are related to balance performance. The Mean Velocity Displacement (MVD) (1) and Planar Deviation (PD) (2) of COP movements, and the RMS (Root Mean Square) of M and A directed trunk accelerations are all commonly used parameters for evaluating postural stability, and higher values mean higher levels of postural instability [10], [11]. Linear accelerations of the lower thorax, used to check the participants' activity and measured by the IMU with respect to the earth frame of reference, were acquired using the MyoMotion software (MR 3.16, Noraxon, USA). Trunk accelerations can be correlated with trunk muscle activity, as higher trunk accelerations have been reported to be accompanied by greater trunk muscle activations [30].
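As an illustration of the balance measures named above, the sketch below computes MVD, PD, and RMS from COP samples. It assumes the commonly used definitions (total COP path length divided by trial duration for MVD, and the square root of the summed M and A variances for PD); the exact forms of (1) and (2) may differ, and the function names are hypothetical.

```python
import math
from statistics import pvariance


def mean_velocity_displacement(cop_m, cop_a, rate_hz=100):
    """Total COP path length divided by the duration of the recording."""
    path = sum(
        math.hypot(cop_m[i + 1] - cop_m[i], cop_a[i + 1] - cop_a[i])
        for i in range(len(cop_m) - 1)
    )
    duration_s = (len(cop_m) - 1) / rate_hz
    return path / duration_s


def planar_deviation(cop_m, cop_a):
    """Square root of the summed variances in the M and A directions."""
    return math.sqrt(pvariance(cop_m) + pvariance(cop_a))


def rms(values):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Tiny synthetic COP trace (mm) sampled at 1 Hz for easy hand-checking.
m = [0.0, 3.0, 3.0, 0.0]
a = [0.0, 0.0, 4.0, 4.0]
mvd = mean_velocity_displacement(m, a, rate_hz=1)
pd_value = planar_deviation(m, a)
```

With all three measures, larger values correspond to greater postural instability, as stated in the text.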
Using correlation analysis between data variables together with trial and error, we selected the most important variables for ML training according to the following strategy. All demographic data, except age, was used as part of the training input data to calculate the desired $D_{\textit{virtual}}$. Age was excluded to avoid overfitting to a specific age group, as the subject group that we were able to recruit had a very small age range (SD was only ± 3.5 years). Since the sensor data contains a large number of variables, we selected only representative variables to avoid unnecessary inputs. Four distinct variable clusters were found: IMU sensor values, RMS_A (COP), seat sensing, and COP sensing. However, while validating on the training data set, the IMU sensor data and the RMS of A directional movement (COP) decreased the ML model's accuracy. We therefore excluded the first two clusters (IMU, RMS_A (COP)) from the input data and picked representative variables from only the 2 remaining clusters. Thus, MVD (COP) and MVD (Seat) were set as input data, one from each cluster. RMS M velocity was chosen as additional input data to include at least one velocity variable in the model. Finally, the desired MVD (COP) is used as the last input to the ML model.
Additionally, the best performance was obtained when the MVD values calculated for the 1st and 2nd reference trials (done with $D_{\textit{virtual}}$ = 10,000) were input separately as measures of balance performance. The final input data set thus obtained is as follows:
- Demographic information: Gender, Height, Weight, Mean of Trial orders
- Reference result ($D_{\textit{virtual}}$: 10,000): 10,000 ($D_{\textit{virtual}}$), Trial order at 1st trial, MVD at 1st trial (COP), RMS M velocity at 1st trial (COP), MVD at 1st trial (angle of seat), Trial order at 2nd trial, MVD at 2nd trial (COP), RMS M velocity at 2nd trial (COP), MVD at 2nd trial (angle of seat)
- Balance performance: MVD (COP)
Thus, the total size of the training data set is 148 rows × 14 columns.
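The stated data set size can be checked against the input list above. The sketch below assumes the 14 columns are the listed inputs and that each of the 37 participants contributes one row per gain condition (4 conditions, giving 148 rows); the column and function names are hypothetical.

```python
FEATURES = [
    # Demographic information
    "gender", "height_cm", "weight_kg", "mean_trial_order",
    # Reference result (Try mode)
    "ref_d_virtual",
    "t1_order", "t1_mvd_cop", "t1_rms_m_velocity", "t1_mvd_seat",
    "t2_order", "t2_mvd_cop", "t2_rms_m_velocity", "t2_mvd_seat",
    # Balance performance (desired value at prediction time)
    "desired_mvd_cop",
]

N_PARTICIPANTS = 37
N_CONDITIONS = 4
N_ROWS = N_PARTICIPANTS * N_CONDITIONS


def build_row(demographics, trial1, trial2, desired_mvd, ref_gain=10_000):
    """Assemble one model input row in the fixed FEATURES order."""
    row = dict(demographics)
    row["ref_d_virtual"] = ref_gain
    row["desired_mvd_cop"] = desired_mvd
    for prefix, trial in (("t1", trial1), ("t2", trial2)):
        for key, value in trial.items():
            row[f"{prefix}_{key}"] = value
    return [row[name] for name in FEATURES]


# Illustrative values only; none of these numbers are measured data.
example = build_row(
    {"gender": 0, "height_cm": 170.0, "weight_kg": 65.0, "mean_trial_order": 4.5},
    {"order": 1, "mvd_cop": 120.0, "rms_m_velocity": 80.0, "mvd_seat": 3.0},
    {"order": 2, "mvd_cop": 110.0, "rms_m_velocity": 75.0, "mvd_seat": 2.8},
    desired_mvd=118.54,
)
```

Keeping the two reference trials as separate columns, rather than averaging them, mirrors the choice described in the text.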

E. ML Model Selection and Validation
The performance of the various ML regression models was evaluated using the data of four participants from the end of the training data set whose trial order and first trial condition did not match. The hyperparameters of each model were found using the grid search method [31]. In this study, we tested a total of seven ML models for training difficulty adjustment: a simple linear model and linear models applying lasso and ridge regularization to prevent overfitting [32], a decision tree regressor model (a tree-based model that partitions the independent variable space by sequentially applying rules) [33], a k-neighbors regressor model (which predicts a value from the nearest k samples in the vicinity) [34], and RandomForest [35] and XGBRegressor [36] models (which are ensemble techniques that combine multiple decision trees). Among these, the k-neighbors regressor, RandomForest, and XGBRegressor showed the best performance ($R^2$ score accuracy). The accuracy was computed by holding out the data of one participant as validation data, repeating this 37 times, and taking the average (leave-one-out cross validation (LOOCV); K-fold cross validation, K = 37 (number of participants)) [37] (see Table III for accuracy values and hyperparameters). XGBRegressor showed the highest mean accuracy of 80.90%. LOOCV is one of the most used approaches for reducing the differences between training and test accuracy and creating a more generalized model [38]. Shirzad showed 78% predictive accuracy of performance features with hold-out cross validation within the same 24 subjects' data for a study of healthy subjects with upper extremity training [23]. In comparison, our LOOCV validation result, with the data split by subject, is 80.90%, which was considered to show that the developed method is sufficiently accurate. This is further supported by the results of the training data set experiment, where the errors in MVD were 18 (3)).
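The leave-one-participant-out procedure described above can be sketched in plain Python. To keep the example self-contained, a small k-nearest-neighbors regressor stands in for the models compared in the study (it is not XGBRegressor), and the data are synthetic; all names are hypothetical.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination for one held-out participant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, x, k=3):
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def leave_one_participant_out(rows, k=3):
    """rows: (participant_id, features, target). Returns the mean R^2."""
    scores = []
    for held_out in sorted({pid for pid, _, _ in rows}):
        train = [(x, y) for pid, x, y in rows if pid != held_out]
        test = [(x, y) for pid, x, y in rows if pid == held_out]
        train_x, train_y = zip(*train)
        preds = [knn_predict(train_x, train_y, x, k) for x, _ in test]
        scores.append(r2_score([y for _, y in test], preds))
    return sum(scores) / len(scores)


# Synthetic rows: one feature (an MVD-like value) and a gain-like target.
rows = [
    (pid, (mvd,), gain)
    for pid in ("p1", "p2", "p3")
    for mvd, gain in ((230.0, 5000.0), (120.0, 10000.0), (60.0, 15000.0))
]
mean_r2 = leave_one_participant_out(rows, k=1)
```

Splitting by participant rather than by row keeps all of a held-out person's trials out of the training folds, which is the property that makes the score a per-subject generalization estimate.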

III. EVALUATION EXPERIMENT
A. Participants (Evaluation Data Set Experiment)
To evaluate the trained ML model, we carried out an experiment with 25 healthy people (13 male and 12 female) who had no prior experience with the TRR (Age: 25.3 ± 5.6 years, Height: 166.8 ± 9.2 cm, Weight: 62.7 ± 12.9 kg). This study was approved by the same Institutional Review Board as the training data set experiment, and was performed in accordance with the Declaration of Helsinki. None of the participants suffered from any neurological, musculoskeletal, or vestibular disorders. All subjects gave written informed consent before participation.

B. Protocol (Evaluation Data Set Experiment)
The purpose of the evaluation experiment was to compare the balance results obtained using the $D_{\textit{virtual}}$ predicted by the trained ML model with the balance results obtained from the training data set experiment, in order to confirm that there is no difference in the desired MVD and that the SD is reduced.
Based on the training data set experiment with 37 people, for this experiment, we defined 233.15, 118.54 and 60.09 cm/s as the desired MVD values for the Hard, Medium, and Easy modes, since they were the resultant values obtained with $D_{\textit{virtual}}$ of 5000, 10000, and 15000 Ns/cm, respectively. Then, as shown in Fig. 2, for each new participant, two trials with $D_{\textit{virtual}}$ = 10000 (Try mode) are first performed. Then, using the gathered data, the ML model outputs $D_{\textit{virtual}}$ values corresponding to the Hard, Medium and Easy modes of the desired MVD. Finally, these conditions are presented to the subject in random order and two trials are done under each condition.
Thus, all participants performed trials under four conditions (Try (10,000), Hard, Medium, and Easy mode), as shown in Table I. Other experimental details were the same as in the training data set experiment. The participants were not aware of the experimental condition during the experiment. After the experiment, they were asked to rate the conditions presented to them as Hard, Medium, and Easy in order to determine their perception of the three difficulty levels and see how it compared with the average MVD value based predictions made by the ML model.

C. Data Processing and Analysis
The synchronized COP and trunk acceleration data recorded during all the trials were used to determine the participants' level of balance performance [10]. Data recorded after the first and before the last 5 seconds of each trial were used for analysis. Mean values of the data collected during the two trials under each condition were used for further analysis. The Mean Velocity Displacement (MVD), the Planar Deviation (PD), the RMS of the COP in the M and A directions, and the RMS of trunk accelerations in the M and A directions were calculated using MATLAB (Mathworks, USA).
We first performed a paired t-test to observe the difference between the two groups (training and evaluation) with respect to the participants' demographic information. Then, to evaluate the effect of the ML prediction ($D_{\textit{virtual}}$) on balance performance (MVD), results obtained from the two groups under different test conditions were statistically analyzed using a one-way repeated measures analysis of variance (one-way RMANOVA) carried out using SPSS 20 (IBM Corp., USA). Since the average values of the 37 participants' MVD results for $D_{\textit{virtual}}$ values of 5,000, 10,000, and 15,000 were defined as the Hard, Medium and Easy modes of the desired MVD, the average outcome values of the 37 and 25 participants were compared for each of these three conditions. Additionally, the absolute value of the difference between each participant's MVD result and the desired MVD value was defined as the error, and compared under the Hard, Medium, and Easy conditions, respectively. The coefficient of variation (CV), which is the standard deviation divided by the mean, was calculated to evaluate how much the standard deviation of the evaluation results is reduced compared to the standard deviation of the training results.
A Q-Q plot was used to inspect the distribution of all data, which was found to be within the acceptable range of a normal distribution. The Bonferroni correction method was used for the post hoc tests.
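The error and CV defined above can be written out as follows. This is a minimal sketch using the sample standard deviation (an assumption, since the text does not say whether the sample or population form was used); the function names and the input values are illustrative only.

```python
from statistics import mean, stdev


def absolute_errors(mvd_values, desired_mvd):
    """Per-participant error: |measured MVD - desired MVD|."""
    return [abs(value - desired_mvd) for value in mvd_values]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (sample SD assumed)."""
    return stdev(values) / mean(values)


# Illustrative MVD values for three participants under one condition.
mvd_values = [90.0, 100.0, 110.0]
cv = coefficient_of_variation(mvd_values)
errors = absolute_errors(mvd_values, desired_mvd=118.54)
```

Because the CV is normalized by the mean, it allows the spread of the two groups to be compared even when their mean MVD values differ slightly.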

IV. RESULT
The outcomes of subjects under all conditions are presented in Table III. Fig. 4 shows the results of one subject's test. Figs. 5 and 6 show the results of one-way RMANOVA for the MVD and the error under all conditions for the Fixed Gain (FG) (training) and Predicted Gain (PG) (evaluation) groups.
In addition, Fig. 7 shows the results of one-way RMANOVA of the PD and the error under all conditions for the FG and PG groups.
The t-tests revealed that there was no significant difference between the FG and PG groups with respect to the participants' demographic information. The CV of the SD under each condition of the PG group was lower than that of the FG group, with the difference being 16.39%, 41.39% and 25.68% under the Hard, Medium and Easy conditions, respectively. This indicates that the $D_{\textit{virtual}}$ value predicted by the trained machine learning model (for the desired MVD) resulted in a narrower distribution of the resultant MVD than the fixed gain approach. In addition, as shown in Fig. 6, RMANOVA results of the calculated error in MVD showed that significantly lower values were obtained with the ML model under the Medium and Easy modes (FG vs PG under Medium mode: F(1, 24) = 11.812, p < .01, $\eta_p^2$ = .330; FG vs PG under Easy mode: F(1, 24) = 6.281, p < .05, $\eta_p^2$ = .207). Therefore, according to the hypotheses of this study, it can be said that the ML model accurately predicted the $D_{\textit{virtual}}$ values needed to obtain the desired balance performance under the Medium and Easy modes.
The PD, which was not the value of interest here, also showed results similar to the MVD results (see Fig. 7). The PD value indicates how far away from the origin the COP movement had spread. Like the MVD, it is an indicator of balance performance. There was no significant difference in PD between the FG and PG groups under each condition, and the CV of SD under each condition was reduced in the PG group as compared to the FG group (H: 14
In order to investigate whether or not the generated difficulty levels provided perceivable differences in difficulty, we conducted a post-experiment survey with each participant. In this survey, each participant was asked to provide a difficulty ranking for the three randomized trial conditions (easy, medium, hard; blinded). As shown in Fig. 8, the overall distinction accuracy was 78%, which is similar to our ML model's $R^2$ accuracy score. The distinction accuracy for each condition was 88%, 69% and 77% for easy, medium, and hard, respectively. Additionally, analysis of the trained ML model according to feature importance revealed that the desired difficulty level and the reference test results (Try mode) were the most important information needed for machine learning prediction (see Fig. 9).

V. DISCUSSION
This study proposed an ML based method for initial difficulty adjustment of the unstable training mode of the TRR. The ML model was trained using data from 37 participants, and a test to evaluate the $D_{\textit{virtual}}$ values predicted by the ML model was performed with 25 new participants.
Greater improvements in proprioceptive control, reactive balance and postural control of chronic stroke patients have been reported with rehabilitation training using robotically generated unstable conditions, as compared to conventional rehabilitation protocols [8], [9]. Difficulty adjustment is an important part of rehabilitation training. Maintaining a challenging level of difficulty is a very important factor in motivating training and increasing training effectiveness [13], [14]. In clinical studies on difficulty adjustment, it has been reported that the initial difficulty adjustment plays an important part in increasing the training effect [16]. Moreover, if the difficulty level during balance training is set too high, it can cause the balance to reduce so much that it increases the fear of falling, fall risk and mobility limitations [15]. However, studies on automatic adjustment of the initial difficulty in robotic seated balance rehabilitation training had not been reported prior to this work.
In this study, the trained ML model predicted the appropriate system control parameter so that the participants could achieve the desired MVD performance. It was observed that participant performance outcomes with the predicted virtual damping gain were more similar in difficulty level, and had lower CVs of SD, than those with the fixed gain value. Moreover, the average error rate of MVD under the Hard, Medium, and Easy modes decreased from 37.28 ± 26.33% to 24.11 ± 19.25%. Shirzad showed 78% predictive accuracy of performance features with hold-out cross validation within the same 24 subjects' data in a study on prediction of desired difficulties of healthy subjects performing an upper-arm reaching task [23]. Bao et al. trained an ML model to learn the mapping between the trunk sway data from a single IMU and a physical therapist's assessment of balance performance [39]. They showed that the model achieved an accuracy of 82% during evaluation with a leave-one-participant-out scheme (not with various physical therapists). In the current study, the average validation result through LOOCV (37 subjects) is 80.90% when the data are divided by subject. Furthermore, Yan et al. suggested an assistive force training control strategy and corresponding task difficulty based on the support vector machine for seated and reclining training on a lower limb rehabilitation robot [24]. They reported 80% accuracy of task difficulty with two participants' evaluation (through survey). Our evaluation accuracy result with 25 participants was 78% and the accuracy based on the error rate of MVD was 76%.
However, as shown in Fig. 6, the error in MVD was significantly reduced only under the Medium and Easy conditions and not the Hard condition. It is believed that this may be because the 37 participants felt similar difficulty under the Hard condition (5,000, error = 18.64 ± 15.72%). It may also be because the $D_{\textit{virtual}}$ value corresponding to the Hard condition (5,000) was at the lower end of the range tested for training data set collection. Since our ML algorithm has been used to map the participants' conditions, including balancing ability, to difficulty levels, it may have limitations in predicting values in specific areas where there is a lack of data [40]. Thus, it is expected that the accuracy can be improved by increasing the range of $D_{\textit{virtual}}$ used for data collection to include values lower than 5,000. Along the same lines, if we collect training data from more participants using a greater variety of difficulties, then the machine learning algorithm is expected to adjust the difficulty levels more accurately. An interesting observation is that the PD values, which were not the target values, showed results similar to the MVD. Since both MVD and PD are balance parameters representing the degree of difficulty [10], [41], these results show that the proposed method actually adjusts the overall balancing difficulty instead of just predicting the MVD.
Another important feature of this work is that the difficulty level adjustment is done based on only the participant's demographic information and 140 seconds (2 trials) of reference training results. Choi et al. showed that on average 30 trials with the ADAPT system (an end-effector presenting different real-life objects to manipulate against various resistance levels with fast adapting difficulty modulation algorithms) were needed for chronic stroke patients to reach a challenging difficulty level [13], [16]. Metzger et al. determined the initial assessment-based difficulty selection and the cognitive processing of perceived sensory information in as few as 20 trials per exercise and therapy session [42]. Compared to these works, the proposed ML model based method is much quicker in finding the desired level of difficulty.
The evaluation results show that our ML model has been successful in reflecting the desired difficulty levels through the predicted virtual damping gains while using only a relatively limited set of input data. As shown in Fig 9, the desired MVD value was the most important variable for machine learning. After that, the reference test results were significant for predicting the proper virtual damping gain, with the first reference test result (1st trial of try mode) having the greater influence on the model's predictions. Interestingly, the experiment order had greater influence than the second reference test result (2nd trial of try mode). This shows that for the ML model, the participants' learning effect due to the order of trials used in the experiment is more important than the second reference test result. However, since the 4 trial conditions used while acquiring the training data can be presented in 24 distinct orders, each order was repeated only about 1.5 times on average across the 37-person subject group. We believe that a greater number of participants, resulting in a greater amount of trial order repetition, is needed to fully learn the time-dependent learning effect of the randomized experimental sequence. Therefore, we expect that if the training data is gathered from a larger subject population, the ML model can achieve more accurate prediction results. Each person's response to a rehabilitation protocol is different since the learning behavior of each person is different. Therefore, in future studies, it is necessary to find the minimum number of persons required in the training dataset in order to fully represent the learning effect. In addition, since the experiments in this study were performed with healthy young participants, the effect of age on the ML model was small. However, in the case of stroke patients, age is expected to have a greater effect on model training because the stroke patient population may have a wide age variation. Therefore, it seems that age should be included in the training set parameters.
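The order-repetition figure quoted above follows from simple counting, which the short sketch below reproduces. The condition labels are placeholders introduced only for illustration; the text states only that there were 4 trial conditions and 37 participants.

```python
from itertools import permutations

# Placeholder labels (assumption): the text only says there were 4 trial conditions.
conditions = ["cond_1", "cond_2", "cond_3", "cond_4"]

# Every distinct presentation order of the 4 conditions.
orders = list(permutations(conditions))
n_orders = len(orders)  # 4! = 24

# Average number of participants per order in the training group.
n_participants = 37
mean_repetition = n_participants / n_orders

print(n_orders, round(mean_repetition, 2))  # prints: 24 1.54
```

With 37 participants spread over 24 possible orders, at least 11 orders can have been seen at most once, which is consistent with the concern that the order-dependent learning effect is only sparsely sampled.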
Clinical studies with stroke patients require more demographic information [43]. This includes parameters commonly used in clinical rehabilitation training, such as date of onset, Modified Barthel Index (MBI) score, Mini-Mental State Examination (MMSE) score, cause of stroke, side of hemiplegia, etc., which can be grouped and quantified as integers. Therefore, we believe that the proposed method can potentially be applied to stroke patient studies, and we intend to conduct such studies in the future.
The purpose of this study was not to find desirable values of MVD for each participant; rather, it was to provide the same three initial difficulty conditions to each participant. In order to confirm that the desired MVD-based difficulty settings were appropriately perceived by the participants, they were blinded to the three experimental conditions and asked to rate them after the experiment. Their responses show that our model successfully provided individually perceivable difficulty levels. The participants successfully guessed each test's difficulty level with 78% accuracy. This shows that the proposed method is able to provide a real feeling of discrete difficulty levels to the subject. However, when looking at each level separately, the participants showed lower classification accuracy for the Medium and Hard difficulty levels. This may be because the desired MVD value under the Medium condition (118.54 mm/s) was defined as an intermediate value between the Hard (233.15 mm/s) and Easy (60.09 mm/s) modes, which may be confusing for the participants because the differences between the actual difficulty levels may not be linear. Furthermore, the reduced perception of both the Hard and Medium levels suggests that there may be other factors involved that influence difficulty perception and subject performance. Özkul showed that combined feedback adjustment (CFA), which combined the performance score and mean skin conductance to determine the difficulty level, was able to keep the subjects more active, focused, and excited when compared to only score or physiological feedback adjustments [44]. Therefore, we believe that in order to reliably use the proposed method to determine the appropriate actual and perceived difficulty levels in future clinical studies, it is necessary to include physiological factors in the ML training.
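As a rough check on the spacing of the three targets, the sketch below compares the Medium value with the arithmetic and geometric midpoints of the Hard and Easy values. It uses only the three MVD figures quoted above. How the Medium target was actually derived is not stated in this section, so the midpoint comparison is an illustrative assumption rather than the authors' procedure.

```python
import math

# Desired MVD targets in mm/s, as quoted in the text.
hard, medium, easy = 233.15, 118.54, 60.09

# Two candidate notions of "intermediate" (assumption: used here for illustration only).
arithmetic_mid = (hard + easy) / 2      # evenly spaced in mm/s
geometric_mid = math.sqrt(hard * easy)  # evenly spaced in ratio terms

# Step sizes between adjacent levels, as differences and as ratios.
diff_hard_medium = hard - medium
diff_medium_easy = medium - easy
ratio_hard_medium = hard / medium
ratio_medium_easy = medium / easy

print(f"arithmetic midpoint: {arithmetic_mid:.2f} mm/s")
print(f"geometric midpoint:  {geometric_mid:.2f} mm/s")
print(f"differences: {diff_hard_medium:.2f} vs {diff_medium_easy:.2f} mm/s")
print(f"ratios:      {ratio_hard_medium:.3f} vs {ratio_medium_easy:.3f}")
```

With these inputs the arithmetic midpoint is 146.62 mm/s and the geometric midpoint is about 118.4 mm/s, so the Medium target lies well below the linear midpoint. The Hard-to-Medium gap in mm/s is roughly twice the Medium-to-Easy gap, while the two ratios are nearly equal. This is consistent with the remark that the levels may not be linearly spaced as perceived by the participants.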
Additionally, the overall effect of difficulty level adjustment may be improved by using the proposed method to adjust the difficulty level periodically over the course of a trial instead of determining a single difficulty value that is used throughout the entire trial. Furthermore, the input-output relationship defined in this study can be applied to other methodologies, such as dynamic difficulty adjustment (DDA) [21], [45] or recurrent neural networks (RNNs). Thus, in future research, we plan to compare the performance of the single-prediction application of the ML model presented in this work with a recurring-prediction application of the same model. We also plan to compare the proposed method with other methods, such as RNNs, to find the method best suited to difficulty level adjustment in the given scenario.

VI. CONCLUSION
In this study, we proposed an ML-based method for adjusting the initial difficulty of TRR-based balance training. The ML model was trained using data collected from 37 participants, and the method's performance was evaluated with 25 new participants. The evaluation showed that the proposed method was able to generate clearly distinguishable difficulty levels that posed similar levels of difficulty for each participant, thus reducing inter-subject variability in performance outcomes. This was achieved by utilizing only 2 reference training trials (140 sec), the participants' demographic information, and the sequence of training trials. This study also revealed the necessity of using a wider range and variety of difficulty levels for collecting the training data and of using a larger subject population for training in order to sufficiently capture the learning effect.
Since the proposed method, including its input and output data sets and the structure of the ML model, has been optimized for our current subject population, i.e., healthy young subjects, it may not be optimized for use with a patient population. However, this was a necessary step because, due to issues such as the quantity and quality of sensor data, the application of ML-based systems to patient populations requires rigorous computational modelling to achieve proper estimation of the required parameters and the desired results [46]. Therefore, before applying our proposed method to a patient population, we have defined the ML model's input and output, found its optimal hyperparameters, and tested it with a new healthy subject population. In order to apply it to an elderly or patient population, we believe that it is necessary to include physiological information and the users' perception of the difficulty levels to appropriately define the difficulty levels. Furthermore, to increase the accuracy of the ML model for these populations, it is necessary to include demographic information representing the disease characteristics of the population in the training data set, which will be done in future work.