Offline Evaluation Matters: Investigation of the Influence of Offline Performance on Real-Time Operation of Electromyography-Based Neural-Machine Interfaces

There has been a debate on the most appropriate way to evaluate electromyography (EMG)-based neural-machine interfaces (NMIs). Accordingly, this study examined whether a relationship between offline kinematic predictive accuracy (R²) and user real-time task performance while using the interface could be identified. A virtual posture-matching task was developed to evaluate motion capture-based control and myoelectric control with artificial neural networks (ANNs) trained to low (R² ≈ 0.4), moderate (R² ≈ 0.6), and high (R² ≈ 0.8) offline performance levels. Twelve non-disabled subjects trained with each offline performance level decoder before evaluating final real-time posture matching performance. Moderate to strong relationships were detected between offline performance and all real-time task performance metrics: task completion percentage (r = 0.66, p < 0.001), normalized task completion time (r = −0.51, p = 0.001), path efficiency (r = 0.74, p < 0.001), and target overshoots (r = −0.79, p < 0.001). Significant improvements in each real-time task evaluation metric were also observed between the different offline performance levels. Additionally, subjects rated myoelectric controllers with higher offline performance more favorably. The results of this study support the use and validity of offline analyses for optimization of NMIs in myoelectric control research and development.

The key to EMG-based NMIs is the decoding algorithm that can recognize the user's movement intent. Various decoding algorithms have been developed and implemented. One commonly applied decoding concept is EMG pattern recognition [24], which involves the extraction of features from surface EMG (sEMG) signals to train a classifier, in order to identify a discrete set of desired motions such as hand open/close, forearm pronation/supination, wrist flexion/extension, and various hand grasping patterns. The classification algorithms have evolved from simple linear discriminant analysis [25] to current deep learning approaches [26]. The drawback of pattern recognition is that the user can only produce one motion at a time, resulting in unnatural movements in applications such as virtual arm or prosthesis control. Another relatively new approach is to decipher EMG signals to predict the continuous motion of multiple joints simultaneously. A diverse array of methods and designs for these controllers can be found in the literature, such as musculoskeletal models [4], [6], [14], [17], [27], state-space models [28], linear regression [9], [12], non-negative matrix factorization [11], and artificial neural networks (ANNs) [10], [19], [28], to name a few. These algorithms have been used in virtual and prosthetic arm control to produce natural, multijoint coordinated motions.
Development of novel EMG decoding algorithms most often involves offline design/evaluation followed by real-time implementation/evaluation. During offline design/evaluation, synchronized recordings of EMG data and kinematics are acquired first. The recorded kinematics serve as the ground truth for training or evaluating the EMG decoder. Iterative design and evaluation processes are used to improve EMG decoder performance, quantified by classification accuracy rate and confusion matrix for EMG pattern recognition [25], [26], [29], [30], [31] and by coefficient of determination (R²) or root mean square error (RMSE) for continuous EMG decoders [32]. Once the offline evaluation reaches optimal performance, the researcher implements the algorithm for real-time external machine control and evaluates the interface with a human in the loop. The real-time evaluation is usually performed in a task context, quantified by task performance metrics such as task completion time [32].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, however, the value of offline training and evaluation of the EMG decoding algorithm to enhance real-time performance has been questioned. Studies examining offline performance metrics have concluded that offline performance accuracy (i.e., of predicted kinematics) does not correlate with users' real-time performance capabilities [3], [33], [34], [35]. For example, one study examined user performance during a virtual target reaching task using decoders with offline R² values ranging from near 0 to above 0.9 and found no relationship between the R² value of the decoder used and the task performance [35]. The authors concluded that optimizing offline evaluation metrics does not necessarily benefit the online use of the EMG-based NMI. One possible explanation is the capacity for human users to adapt to and compensate for the shortcomings of the decoding algorithm while using the NMI in real-time. For this reason, researchers have been encouraged to place less emphasis on offline performance metrics and their optimization, instead focusing on new real-time EMG decoding methods emphasizing the adaptability of the human in the control loop [5], [36].
While these results are thought-provoking, the number of supporting studies is somewhat limited, and those studies may not have systematically controlled for human variation. Limited practice with the controller prior to real-time use and limited real-time task performance feedback may have constrained performance and led to more disparate responses [33], [35]. Evaluation of online performance was focused on task performance, but user perception and experience while using the EMG-based NMI for machine control (e.g., perceived difficulty level in using the NMI) may be equally important in determining actual usage of the system.
A paradigm shift away from offline development does represent significant challenges for creation of EMG-based NMIs. Real-time testing can be more arduous and consume additional time and resources compared to offline analyses. Additionally, real-time performance of the human-machine system becomes more difficult to interpret as many factors, beyond the performance of EMG decoding algorithms, can influence the overall online system performance, such as human variability, adaptation, and hardware limitations [33]. These confounding factors make the development process of EMG-based NMI more difficult because engineers cannot easily identify the system problems and therefore their potential solutions. Altogether, iteratively optimizing real-time performance of an EMG-based NMI during its development phase is costly and difficult.
In light of the potential costs associated with solely online development, further study of the merits of offline development of NMIs seems warranted. Thus, we manipulated the training process of an EMG-decoding algorithm based on artificial neural networks (ANNs) and produced three distinct decoders, each with a different offline performance accuracy level, as quantified by the coefficient of determination (R²). For each ANN, subjects were trained to use the EMG-based NMI to perform a virtual hand postural matching task before a final real-time evaluation was made. We hypothesized that real-time task performance and user subjective perception would improve in association with improvement in offline performance of the myoelectric controllers. The study results provide new evidence to the research community in understanding the value of offline evaluation and optimization for EMG-based NMI design and application.

A. Subjects
The experimental protocol was approved by the University of North Carolina at Chapel Hill Institutional Review Board (Protocol #16-0798; approved March 11, 2022). Twelve subjects who did not have a disability (AB) (6 male, 6 female, ages 22-31, right hand dominant) were recruited to participate in the study. Informed consent was obtained from all subjects to participate.

B. Data Acquisition
Nine reflective markers were placed on anatomical landmarks of each subject's dominant hand and forearm to track metacarpophalangeal (MCP) and wrist flexion/extension (Fig. 1) [4], [10]. Marker trajectories were recorded at 100 Hz with a motion capture system (Vicon Motion Systems Ltd., UK).
Following marker placement, the extensor carpi radialis longus (ECRL), extensor digitorum communis (EDC), flexor carpi radialis (FCR), and flexor digitorum superficialis (FDS) muscles were identified via anatomical reference and palpation. The skin over each identified muscle location was prepped with an alcohol wipe to reduce impedance and a bipolar surface EMG electrode (Biometrics, Newport, UK) was placed over each identified muscle (Fig. 1). The EMG data were recorded at 1000 Hz, synchronously with the marker data.
Following setup, subjects were comfortably seated at a table with a clear view of the computer screen on which the virtual posture matching task would be displayed. Subjects rested the elbow of their dominant arm on the table in front of them, with the forearm held approximately 45° to the table surface during trials. Subjects then performed a maximum voluntary contraction (MVC) of each of the four muscles while visual feedback of EMG magnitude was provided. Five 1-minute trials of data for training ANN decoders were collected. The first trial consisted of isolated, cyclic MCP motion, alternating between full flexion, relaxation, and full extension at a rate of 0.25 Hz, using a metronome to maintain the appropriate speed. The second trial consisted of isolated, random MCP motion, with subjects instructed to cover their full range of motion, but speed and motion direction were self-selected. The third and fourth trials were identical to the first and second trials, respectively, except isolated wrist motion was performed instead of MCP motion. The final trial consisted of random, self-selected motion of both the wrist and MCP joints simultaneously.

C. Data Processing
Finger, hand, and forearm segment coordinate systems were defined [37] and wrist and MCP flexion/extension angles were calculated from the marker data via inverse kinematics. The EMG data were processed by rectifying the signals and then computing the mean value across a sliding 260-ms window centered at each time point. The resulting magnitude envelope was then normalized by the maximum envelope value obtained from the MVC trials. The envelopes were then down-sampled to 100 Hz to match the joint angle data.
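As a minimal sketch, the EMG processing steps described above (rectification, a centered 260-ms moving average, MVC normalization, and down-sampling from 1000 Hz to 100 Hz) could be written in Python with NumPy as shown below. The function name `emg_envelope`, its parameter names, the use of an odd-length window, and down-sampling by keeping every tenth sample are illustrative assumptions, not details reported in the study.

```python
import numpy as np

def emg_envelope(raw_emg, mvc_envelope_max, fs_in=1000, fs_out=100, window_ms=260):
    """Rectify, smooth, normalize, and down-sample one EMG channel (illustrative)."""
    # Full-wave rectification
    rectified = np.abs(np.asarray(raw_emg, dtype=float))

    # 260 ms at 1000 Hz is 260 samples; an odd length is assumed here so that
    # the window is centered exactly on each time point.
    win = int(round(window_ms * fs_in / 1000))
    if win % 2 == 0:
        win += 1
    kernel = np.ones(win) / win

    # mode="same" keeps the output aligned with the input. Samples near the
    # edges are averaged with implicit zeros and are therefore attenuated.
    envelope = np.convolve(rectified, kernel, mode="same")

    # Normalize by the maximum envelope value from the MVC trials
    normalized = envelope / mvc_envelope_max

    # Assumed down-sampling scheme: keep every (fs_in // fs_out)-th sample
    step = fs_in // fs_out
    return normalized[::step]
```

Calling `emg_envelope(channel, mvc_max)` on a 1000-Hz recording returns an envelope sampled at 100 Hz, matching the rate of the joint angle data.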

D. ANN Training
The ANNs used in this experiment were Non-linear Autoregressive Neural Networks with External Inputs (NARX networks) from the Deep Learning Toolbox in MATLAB (2019a, MathWorks, Natick, MA), each consisting of a single hidden layer and an output layer. NARX networks (NNs) are a type of recurrent neural network used in time series prediction and have previously been applied to the prediction of kinematics from EMG data [28]. The NNs were configured to receive the EMG envelope values from the current timestep and the joint angles from the previous timestep as inputs. They were first trained in an open-loop configuration, receiving the measured joint angle from the previous timestep as an input, and then trained in a closed-loop configuration, receiving the estimated angle from the previous timestep as an input. The hidden layer for each NN trained in this study consisted of 7 neurons. This number of neurons was selected through pilot testing by comparing outcomes obtained with 3-10 neurons in the hidden layer.
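The difference between the open-loop and closed-loop configurations can be illustrated with the short Python sketch below. It does not reproduce the MATLAB NARX networks used in the study: `toy_step` is a hypothetical linear stand-in for a trained network, and all function names are assumptions chosen for illustration. Only the feedback structure is the point: in open loop each step sees the measured previous angle, whereas in closed loop each step sees its own previous estimate.

```python
import numpy as np

def predict_open_loop(step_fn, emg, measured_angles):
    """Each step receives the measured joint angle from the previous timestep."""
    est = np.empty(len(emg))
    est[0] = measured_angles[0]
    for t in range(1, len(emg)):
        est[t] = step_fn(emg[t], measured_angles[t - 1])
    return est

def predict_closed_loop(step_fn, emg, initial_angle):
    """Each step receives the network's own estimate from the previous timestep."""
    est = np.empty(len(emg))
    est[0] = initial_angle
    for t in range(1, len(emg)):
        est[t] = step_fn(emg[t], est[t - 1])
    return est

def toy_step(emg_t, prev_angle):
    """Hypothetical stand-in for a trained network: a fixed linear update."""
    return 0.9 * prev_angle + 0.1 * float(np.sum(emg_t))
```

In closed loop, prediction errors can accumulate because each estimate feeds the next one, which is also the situation the decoder faces during real-time control, when no measured angle is available.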
An arbitrary, contiguous 20-s window of data from each trial of training data was used to train each NN to the desired coefficient of determination (R²), defined in Eqn. 1:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (\hat{\theta}_i - \theta_i)^2}{\sum_{i=1}^{N} (\theta_i - \bar{\theta})^2} \tag{1}$$

where $\hat{\theta}_i$, $\theta_i$, and $\bar{\theta}$ represent the estimated joint angle at timestep i, the measured joint angle at timestep i, and the average joint angle of the data, respectively. The coefficient of determination was controlled by rearranging Equation 1 to find the desired mean-square error (MSE), the stopping criterion of the NN training, as shown in Eqn. 2:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_i - \theta_i)^2 = \frac{(1 - R^2) \sum_{i=1}^{N} (\theta_i - \bar{\theta})^2}{N} \tag{2}$$

where N is the number of training data points. The NN's R² value was determined using a 5-fold cross-validation with 20% of each trial reserved for testing and the remaining 80% of each trial used for training. Training was repeated until performance was within ±0.05 of the desired coefficient of determination. The coefficient of determination levels assessed were R² ≈ 0.4, R² ≈ 0.6, and R² ≈ 0.8. For each experimental session, two NNs were trained: one to predict wrist flexion/extension angles and a second to predict MCP flexion/extension angles [10]. Representative examples of the predictions at each level of offline performance for the 5 training trials are shown in Figure 2.
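To make the relationship between the desired R² and the MSE stopping criterion concrete, the following Python sketch computes both quantities. It assumes that Eqn. 1 is the standard coefficient of determination (one minus the ratio of the residual to the total sum of squares); the function names `r_squared` and `target_mse` are illustrative and do not come from the study.

```python
import numpy as np

def r_squared(theta_hat, theta):
    """Coefficient of determination between estimated and measured joint angles."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta = np.asarray(theta, dtype=float)
    ss_res = np.sum((theta_hat - theta) ** 2)      # residual sum of squares
    ss_tot = np.sum((theta - theta.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

def target_mse(theta, desired_r2):
    """MSE that corresponds to a desired R² for the given measured angles."""
    theta = np.asarray(theta, dtype=float)
    ss_tot = np.sum((theta - theta.mean()) ** 2)
    return (1.0 - desired_r2) * ss_tot / theta.size
```

For example, with measured angles whose total sum of squares is 10 over N = 5 samples, a desired R² of 0.6 corresponds to a stopping MSE of (1 − 0.6) × 10 / 5 = 0.8.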

E. Virtual Posture Matching Task
The virtual posture matching task involved controlling a planar stick figure hand (Fig. 3) from a neutral, starting posture to match the displayed target posture. A target posture was successfully matched when users held the virtual hand within ±5° of the desired angle at both joints for 0.5 consecutive seconds. The displayed target posture turned green while subjects were within the target range for both joints. Subjects were given 20 seconds to match each target before the system advanced to the next target posture. The display of the virtual hand was updated at 20 Hz.
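The match criterion described above can be sketched in Python as follows. The sketch assumes that the 0.5-s hold is evaluated as 10 consecutive display frames at the 20-Hz update rate and that joint-angle errors are available once per frame; the function name `first_match_time` and its parameters are hypothetical.

```python
def first_match_time(wrist_err_deg, mcp_err_deg, update_hz=20,
                     tol_deg=5.0, hold_s=0.5, timeout_s=20.0):
    """Return the time (s) at which the hold criterion is first met, else None."""
    needed = int(round(hold_s * update_hz))         # frames that must be in range
    max_frames = int(round(timeout_s * update_hz))  # frames before the target times out
    run = 0                                         # current count of consecutive in-range frames
    for k, (w, m) in enumerate(zip(wrist_err_deg, mcp_err_deg)):
        if k >= max_frames:
            break
        if abs(w) <= tol_deg and abs(m) <= tol_deg:
            run += 1
            if run >= needed:
                return (k + 1) / update_hz
        else:
            run = 0                                 # leaving the tolerance band resets the hold
    return None
```

Under these assumptions, both joints must stay inside the tolerance band for the whole hold period; leaving the band at either joint restarts the count.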
Each subject completed testing over 4 sessions. The first session consisted of establishing a performance baseline for the posture matching task. This was accomplished by allowing users to control the virtual hand using joint angles calculated in real-time from the marker data via inverse kinematics (IK), thereby allowing the virtual hand to perfectly mirror the desired motion of the user. A 130-ms delay between actual movement and movement of the virtual hand was added in order to mirror the delay associated with the EMG control due to the processing time for the EMG signals. Subjects were instructed to try to match as many target postures as they could, and to do so as quickly and accurately as possible. Ten "practice" trials were completed, with 9 targets (Fig. 3) per trial. Following completion of the practice trials, 5 "evaluation" trials were completed with 36 targets (Fig. 3) per trial. The order in which target postures were presented to subjects was randomized each trial. Performing the posture matching task with the IK controller provided a valuable performance baseline, allowing for the evaluation of performance in the ideal case that the controller replicates the desired motion exactly.
Subsequent experimental sessions began with 3 practice trials using the inverse kinematic control scheme to reacquaint users with the interface and task. Following this reintroduction, training data for the NN controllers were collected, and the controllers were trained to the desired coefficient of determination values of R² ≈ 0.4 (low offline performance), R² ≈ 0.6 (moderate offline performance), and R² ≈ 0.8 (high offline performance), as described above. Each session, new training data were collected and used to train a new set of NARX network controllers, but only one NARX network trained to one of the R² values was evaluated each session. We trained networks for each R² value in each session to avoid potentially indirectly revealing which R² value decoder was being used, as higher R² values often took longer to train. Subjects then used the selected controller to perform the 10 practice trials followed by the 5 evaluation trials. The order in which subjects received the 3 EMG controllers was counterbalanced to account for potential task performance improvement across sessions, and subjects were blinded to the R² value of each decoder they received. The sequence of experimental sessions is summarized in Figure 4.
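One way such counterbalancing could be arranged is sketched below in Python. The study does not report the exact assignment scheme, so the full-permutation design shown here, the function name `counterbalanced_orders`, and the level labels are assumptions for illustration only. With 3 controllers there are 6 possible orders, so 12 subjects would receive each order twice.

```python
from itertools import permutations

# Labels for the three offline performance levels (illustrative names)
LEVELS = ("low", "moderate", "high")

def counterbalanced_orders(n_subjects, levels=LEVELS):
    """Assign each subject one session order, cycling through all permutations."""
    orders = list(permutations(levels))  # 6 possible orders for 3 levels
    return [orders[i % len(orders)] for i in range(n_subjects)]
```

With this scheme and 12 subjects, each performance level appears in each session position for exactly 4 subjects, so any improvement across sessions is spread evenly over the three levels.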

F. Subjective Rating of Control Quality of EMG-Based NMI
Following each experimental session involving EMG control, subjects were asked to subjectively rate the perceived quality of the EMG-based NMI for virtual arm postural control. Subjects were asked to rate the interface on a scale of 1 to 10, with a 1 being described as "the virtual hand does not follow my intended motion, and it produces seemingly random motions" and a 10 being described as "the virtual hand follows my intended joint motion and feels equivalent to the marker-based or inverse kinematic control."

G. Performance Metrics
Task performance metrics were calculated from the evaluation trials of each controller. The performance metrics studied were task completion percentage, normalized task completion time, path efficiency, and number of target overshoots. These metrics and their definitions are provided in Table I [32], [35]. Target posture repeatability with each level of controller accuracy was assessed by observing users' ability to repeatedly match each individual target. For each user, we counted the number of the 5 evaluation trials in which each of the 36 target postures was matched. The percentage of targets successfully matched in at least a given number of trials was then calculated for all thresholds from 1 trial (i.e., the percentage of targets successfully matched in at least 1 of the 5 evaluation trials) to 5 trials (i.e., the percentage of targets successfully matched in all 5 evaluation trials).
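The repeatability calculation described above can be sketched in Python as follows. The input layout (a boolean array with one row per evaluation trial and one column per target) and the function name `repeatability_curve` are assumptions made for illustration.

```python
import numpy as np

def repeatability_curve(matched):
    """Percentage of targets matched in at least k trials, for k = 1..n_trials.

    matched: boolean array of shape (n_trials, n_targets); True where the
    target was successfully matched in that trial.
    """
    matched = np.asarray(matched, dtype=bool)
    n_trials = matched.shape[0]
    counts = matched.sum(axis=0)  # number of trials in which each target was matched
    return {k: 100.0 * float(np.mean(counts >= k)) for k in range(1, n_trials + 1)}
```

For one subject and one controller, `matched` would have shape (5, 36); the entry for k = 1 gives the percentage of targets matched at least once, and the entry for k = 5 gives the percentage matched in all 5 evaluation trials.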

H. Statistical Analysis
The relationship between offline controller accuracy (as determined by R² values) and posture matching task performance metrics was analyzed using linear regression and one-way ANOVA with subjects included as random effects. Controller reliability metrics were also analyzed using one-way ANOVA with subjects included as random effects. For the regression analysis, the R² value assigned to each subject was the average of the wrist and MCP decoders for that trial, and strength of correlations was classified according to a previously established system [38]. The subjective rating data were analyzed using the Kruskal-Wallis non-parametric test. Tukey's honestly significant difference was applied between the levels of coefficient of determination in the ANOVA and Kruskal-Wallis analyses. Results were considered significant at the 0.05 level. All results are presented as the mean ± standard deviation unless specified otherwise.
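The regression step can be illustrated with the Python sketch below, which averages the wrist and MCP decoder R² values and then computes the Pearson correlation coefficient and the least-squares line relating a task metric to R². It covers only this simple regression; the ANOVA with random effects, the Kruskal-Wallis test, and the post-hoc comparisons are not shown. The function names are hypothetical.

```python
import numpy as np

def session_r2(wrist_r2, mcp_r2):
    """Average the wrist and MCP decoder R² values into one value."""
    return 0.5 * (wrist_r2 + mcp_r2)

def regress_metric_on_r2(r2_values, metric_values):
    """Return (Pearson r, slope, intercept) for a task metric against decoder R²."""
    x = np.asarray(r2_values, dtype=float)
    y = np.asarray(metric_values, dtype=float)
    r = float(np.corrcoef(x, y)[0, 1])      # Pearson correlation coefficient
    slope, intercept = np.polyfit(x, y, 1)  # least-squares line y = slope * x + intercept
    return r, float(slope), float(intercept)
```

In an analysis of this kind, each data point would pair one subject's averaged decoder R² with that subject's value of a performance metric for the corresponding controller.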

A. Virtual Posture Matching Task
Improvement in offline decoder performance was associated with improved online control of the virtual hand. All performance metrics of the virtual postural matching task demonstrated moderate to high correlation with the coefficient of determination (i.e., decoder R² value) (Fig. 5). Task completion percentage showed a moderate positive correlation with decoder R² value (r=0.66, p<0.001), while normalized task completion time showed a moderate negative correlation (r=−0.51, p=0.001). Additionally, path efficiency and number of overshoots were strongly correlated with decoder R² values, with path efficiency showing a strong positive correlation (r=0.74, p<0.001) and overshoots showing a strong negative correlation (r=−0.79, p<0.001). Only the number of overshoots showed performance similar to the motion capture/inverse kinematics controller (p=0.06).
Comparisons of the online performance metrics across the three ANN decoders with different offline performance (low, moderate, and high) are summarized in Fig. 6. The offline performance level of the decoder had a significant effect on task completion percentage (F=16.7, p<0.001). Post-hoc Tukey tests revealed that the task completion percentage for the high performance level decoder (74.7 ± 13.5%) was significantly higher than those for the decoders with the moderate (59.1 ± 16.6%, p=0.034) and low (41.2 ± 17.7%, p<0.001) offline performance levels. Completion percentage for the moderate performance level decoder was significantly higher (p=0.015) than that for the low performance decoder. Decoder type also significantly impacted normalized task completion time (F=6.27, p=0.007). The normalized task completion time for the high performance level decoder (0.21 ± 0.06 s/rad) was significantly lower than that for the low performance level (0.35 ± 0.15 s/rad, p=0.006) but not significantly lower than the middle offline performance level (0.25 ± 0.07 s/rad, p=0.584). The normalized completion time for the moderate performance level decoder was not significantly different [...] than the moderate (5.6 ± 1.6, p=0.03) and low (11.7 ± 4.5, p<0.001) performance controllers. Moderate performance level decoders also showed significantly fewer overshoots than the low performance level (p<0.001).

B. User's Perceived Control Quality of EMG-Based NMI
Average subjective ratings on quality of NMI are summarized in Fig. 7; each subject's ratings are shown in Table II. Offline performance significantly impacted subject ratings (p=0.002). Subjects rated the EMG-based NMI with higher offline performance better. The controllers with the lowest offline performance received an average rating of 3.3±1.5, while the controllers with the middle and highest offline performance received average ratings of 4.8±1.6 and 6.3±1.8, respectively. The average rating of the high performance level decoders was significantly higher than the low performance level decoders (p=0.001) but was not significantly higher than the moderate level (p=0.21). Additionally, the moderate offline performance level average rating was not significantly different from the low performance decoder average rating (p=0.14).

C. Reliability of EMG-Based NMI
As the EMG decoder's R² values increased, the area of the 2-DoF task space (see Fig. 3, right) in which users could successfully match targets increased. This can be seen qualitatively in the representative task-space heat maps shown in Fig. 8A-C. The repeatability (reliability) of individual targets also increased as R² increased (Fig. 8D), as seen by the increase in the percentage of target postures that could be repeatedly matched. With the low performance decoder, users successfully matched 54.6±18.9% of target postures in at least 1 of the 5 evaluation trials. The middle offline performance level allowed for significantly more target postures to be matched (73.1±16.4%) compared to the lowest offline performance level (p=0.019). The high performance decoder showed a significant increase in percentage of target postures matched at least once (86.6±10.3%) compared to the low (p<0.001) but not the moderate performance level decoders (p=0.104). This was also observed with respect to the percentage of target postures successfully matched in all 5 evaluation trials. On average, the low performing decoder allowed for 27.1±14.9% of target postures to be matched in all 5 trials. The moderate offline performance level allowed users to match a significantly higher percentage of target postures in all 5 trials (42.6±18.0%) compared to the low performing decoder (p=0.028). Finally, the high performing decoder allowed users to match 58.1±18.5% of target postures in all 5 trials, which is a significant increase compared to the low (p<0.001) and moderate (p=0.028) performing decoders.

IV. DISCUSSION
In this study, we systematically evaluated the influence of offline training of artificial neural network (ANN) EMG-based NMIs, as evaluated by the coefficient of determination (R²), on users' task performance, subjective perception, and repeatability in a real-time 2-DoF virtual posture control task. Each subject used EMG decoders trained to offline performance levels of R² ≈ 0.4 (low), R² ≈ 0.6 (moderate), and R² ≈ 0.8 (high offline performance). The R² values selected for this study were consistent with values observed and previously studied in the literature [35].
The most significant result from this study was the clear relationship between EMG decoding accuracy obtained offline and real-time task performance with the human in the loop. As offline decoding accuracy improved (as evidenced by higher R² values), task completion percentage increased, normalized task completion time decreased, path efficiency increased, and the number of overshoots decreased when human subjects used the EMG-based NMI to perform the virtual arm postural matching task. Performing linear regression on the relationship between R² and these metrics also revealed moderate to strong correlation. These results reaffirm the importance of using offline kinematic prediction accuracy to inform real-time control capabilities. Continuous kinematic predictions from EMG are used in various NMI applications, and offline performance metrics such as R² can help streamline and guide research and development of promising algorithms and methods. It is worth noting that real-time testing and functional tasks remain integral in evaluating novel myoelectric controllers due to the ability of individuals to adapt to the controller and develop compensatory strategies. However, offline analyses play a convenient and informative role as well, especially when testing large numbers of a clinical population is impractical or the clinical population is very heterogeneous.
Furthermore, as real-time performance improved with higher offline performance, the percentage of the task space (i.e., targets) available to users increased as well. As shown in Fig. 8, increasing R² values led to an increase in the percentage of targets users could hit, indicating a higher percentage of the task space was available to them. Users were also able to match target postures more reliably (match the same target posture in multiple trials) with increasing R² values. This is consistent with the conceptual meaning of R², which is the percent of variance a predictor explains in the observed data. As R² increases, the controller is explaining a higher percentage of the variance in the training data. Since the type of training data used to train the NARX networks was consistent across all subjects (i.e., users were instructed to cover the full range of motion of both joints in all training data collections), the increased R² value manifests as explaining more of the possible task space, allowing for more targets to be successfully matched.
In addition to the objective improvement in real-time task performance as R² increased, subjects subjectively rated EMG-based NMIs with higher R² values more highly in terms of control quality. While the average ratings of the highest offline performance level controllers and the lowest offline performance level decoders were significantly different, neither was significantly different from the middle offline performance level average rating. This is consistent with the individual subject breakdown shown in Table II: subjects often provided relative ratings that tracked R², with higher ratings as R² increased. Some subjects rated the middle offline performance level the lowest of the 3 and others rated it the highest, but no subject gave the highest offline performance decoder the lowest rating or the lowest offline performance decoder the highest rating. Thus, on average, subjects' subjective perception of each level of offline performance tended to match their objective performance metrics.
Additionally, inclusion of trials using the IK-based controller provided a baseline for all performance metrics by demonstrating what task performance would look like if a controller were able to perfectly predict users' motor intent from EMG. While performance metrics with EMG-based control approached the performance levels of the IK-based controller as R² increased, only number of overshoots achieved comparable performance. However, even the number of overshoots seen using the highest offline performance level controllers barely reached the 5th percentile level of performance seen by the IK-based controller. This indicates EMG-based controllers with higher levels of offline performance than were examined in this study are needed to achieve comparable results to those seen using the IK controller.
The findings of this study are contrary to previous studies which concluded that offline continuous kinematic prediction accuracy of myoelectric controllers had either a nonexistent or a weak relationship with users' real-time performance. The lack of relationship observed in previous work may be primarily due to experimental differences and limitations. For example, Krasoulis et al. utilized a posture matching task, in which users controlled a physical robotic hand to match postures shown in images displayed on a screen and the mean absolute error was measured [33]. In this task, users were given 3.5 s to achieve the desired posture and 1.5 s to maintain the posture [33]. It is possible that the short task duration and lack of real-time feedback limited performance, regardless of the offline performance accuracy of the controller used. Thus, for this study we employed a longer time (20 s) to complete each task and provided simple, intuitive real-time feedback of the 2-DoF hand with the color change when users were within the target tolerance. Similarly, the task used by Jiang et al. did not have any physiological analogue or meaning and thus required manual scaling of prediction outputs from the controllers [35]. This manual scaling via output gains effectively altered the R² value as it would change the variance accounted for by the decoder. This study avoided this confounding factor by designing a task with direct physiological meaning (i.e., the predicted joint angles map directly to the joint angles of the virtual hand without any gains) and allowed users to meaningfully perform the task with unaltered outputs from the controllers. By addressing these limitations in the previous work, we were able to observe a significant relationship between offline kinematic prediction accuracy and real-time task performance.
This study is not without its limitations. This study only explored a limited range of R² values. The highest level of offline performance studied was set to R² ≈ 0.8 because pilot testing demonstrated that higher offline accuracy could not be achieved consistently across multiple subjects. While this level of offline performance is consistent with EMG-based NMIs observed in the literature, higher performance levels have been demonstrated and additional work exploring these higher R² values would be enlightening. Exploring higher values using this systematic approach may elucidate at what level real-time performance reliably matches that of the IK-based controller and could provide a specific goal for research and development of novel EMG decoding algorithms to reach. This study made use of one algorithm (NARXNET) for decoding motor intent. Exploration with other algorithms such as linear regression or musculoskeletal modeling in the future could prove enlightening; however, we hypothesize these results would be consistent across algorithms if clean, robust, and representative training data are used for algorithm optimization. This study was limited to exploring the relationship between offline decoding accuracy and real-time performance in a specific task. Real-time evaluation in other task contexts and evaluation of other aspects of users' experience and perception, such as sense of effort, embodiment, cognitive burden, and ease of adaptation and learning, are also worthy of exploration in the future.
In conclusion, we established a clear relationship between the offline kinematic prediction accuracy of continuous myoelectric controllers and the real-time performance of a virtual posture matching task using ANN EMG decoders trained to varying degrees of offline predictive accuracy. As offline performance improved, users consistently improved task completion percentage, normalized task completion time, path efficiency, and number of target overshoots. Furthermore, as offline performance improved, subjects displayed the ability to successfully match target postures in a larger portion of the task space, as well as more reliably match target postures. In addition to the objective improvements seen with improved offline performance, subjects subjectively rated the quality of control of decoders with higher offline accuracy more highly.
These results suggest offline analyses are informative in guiding research and development of novel NMI algorithms that continuously predict motion, allowing researchers to develop algorithms more intelligently and efficiently by providing insight without resource-intensive real-time testing for each iteration of algorithm development.