Prediction of Voluntary Motion Using Decomposition-and-Ensemble Framework With Deep Neural Networks

Seamlessly delivering intended hand motion to surgical robots while actively suppressing undesired hand tremor is essential during microsurgery. To achieve this goal, we propose a novel method for predicting voluntary motion based on deep learning with a signal decomposition and ensemble approach. This approach can deal with various forms of voluntary signals, ranging from highly stationary to highly cyclic at a low range of frequencies. The proposed method comprises a series of signal blocks that use deep neural networks to decompose complex hand motion into multiple sub-signals according to their signal characteristics. Each signal block yields a parameterized sub-signal and a prediction of voluntary motion. In addition, an ensemble layer accurately predicts future voluntary motion by combining the predicted motion from each signal block with the optimal weight. The signal blocks are connected in series by a decomposition flow and in parallel by a forecast flow that ensembles the prediction output of each block. Using real data sets, we evaluated the prediction performance of the proposed algorithm against other data-driven deep learning models. The generalizability of the proposed algorithm was also investigated by applying the trained models to new data sets from new tasks and a different subject, none of which had been involved in the training procedures. The proposed algorithm outperforms the baseline models in terms of prediction error and accuracy. Furthermore, we explored via spectral analysis whether the proposed method could suppress tremor; the analysis shows substantial tremor attenuation of more than 10 dB in the frequency band of interest, 6-14 Hz. The proposed method also has the predictive power to deal with inevitable control delays: the prediction error increased by only 2% per time sample (approximately 4.2 ms).


I. INTRODUCTION
Analyzing and interpreting human hand motion has attracted great interest from a variety of research areas, such as prosthesis [1]-[3], rehabilitation [4], and surgical robotics [5]-[10]. For example, the estimation of unintended hand motion, such as pathological tremor, is required to analyze the degree of patients' disorders or to suppress the tremor through prosthetic devices. Pathological tremor that affects everyday life, as in Parkinson's disease, is an involuntary and pseudo-rhythmic movement arising from several age-related neurological disorders, and it occurs in a broad frequency range of 3-14 Hz [11]. On the other hand, identifying the voluntary motion and physiological tremor inherent in normal hand motion is a focus of research in surgical robotics. Voluntary motion in microsurgical operation is described as signals at frequencies below 2 Hz. In contrast, physiological tremor originates from the combination of mechanical-reflex components and oscillation in the central nervous system [12], and is known to have an RMS (root-mean-square) amplitude on the order of 50-200 µm at frequencies commonly in the 6-14 Hz band [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Shagufta Henna.
Specifically, teleoperated surgical robots, such as the well-known Da Vinci Surgical System (Intuitive Surgical, USA) [6], collect hand motion data from a master device and then deliver filtered and/or scaled-down motion to slave manipulators. Alternatively, handheld robots, such as Micron [9], [10] and iTrem [14], first sense their own motion and then selectively filter out erroneous motion such as hand tremor. For such active tremor cancellation, the handheld robots use as control signals either the counter-motion of the tremor signal or the estimated voluntary motion at the tool tip. Since hand tremor must be canceled as soon as the robot senses its own motion, a control loop without latency is preferred. However, inevitable delays in controlling the robots degrade the performance of active tremor cancellation because the motion estimated at present is out of date by the instant of cancellation. Therefore, hand motion must be predicted to compensate for the time delay in control and to accomplish active tremor cancellation seamlessly with a high level of accuracy.
Despite the need for hand motion prediction in active tremor compensation, no algorithm that generalizes to various users and operations in microsurgery has been introduced yet. A common approach in the literature is to estimate hand tremor via online learning. This approach regards spectro-temporal hand tremor as a combination of sinusoidal motions. The learning model either adaptively updates the time-varying frequencies and amplitudes [15] or updates the amplitudes of a fixed set of multiple frequencies [16]. However, such online learning algorithms are prone to convergence failure because they are sensitive to the initial parameters used to search for the frequencies and amplitudes. Data-driven approaches have recently been proposed to address the issues raised by the stringent modeling of hand tremor [17], [18]. Although such deep-learning-based methods offer a certain level of generalized performance, extensive training procedures and high computational demands hinder their application to real-time active tremor cancellation.
To overcome those problems in estimation and prediction of hand motion, we propose a novel method for voluntary motion prediction with deep neural networks based on decomposition and ensemble learning. In the proposed model, we take advantage of signal decomposition that separates a raw signal into voluntary and tremor signals while mitigating the time delay, which is inherently yielded in a real-time lowpass filter. Hence, time-series data of hand motion is decomposed by a set of deep neural networks representing either voluntary motion or hand tremor. The prediction of future voluntary motion is then accomplished by the ensemble of prediction outputs from the decomposed signal blocks, as represented in Fig. 1. Given this end-to-end learning model, we evaluate the accuracy of the voluntary motion prediction on real data sets while comparing it with other data-driven machine learning algorithms. The generalized performance of the proposed algorithm is also investigated by applying the model to new tasks and different subjects.
Finally, we explore how efficiently the proposed algorithm could suppress physiological hand tremor while maintaining the voluntary motion for active tremor cancellation.

II. RELATED WORK
The proposed model for voluntary motion prediction is closely related to two research topics: (II.A) hand tremor estimation and (II.B) decomposition and ensemble learning.

A. HAND TREMOR ESTIMATION
To accurately and effectively estimate hand tremor, various machine learning techniques have been introduced. The weighted-frequency Fourier linear combiner (WFLC) utilizes a multilayer perceptron (MLP) to learn the time-varying frequencies and amplitudes of hand tremor [15]. However, the WFLC algorithm may have difficulty distinguishing adjacent frequencies and assigning appropriate parameters for convergence [16]. The band-limited multiple Fourier linear combiner (BMFLC) was proposed to overcome these issues by adapting linear combinations of harmonic signals over a fixed set of multiple frequencies [16]. The coefficients of the linear combiner can be found by a stochastic gradient descent algorithm or by the Kalman filter, which offers an optimal solution to continuous Markov chain models assuming linear Gaussian noise models. However, the computational load rises as more frequencies are combined to improve the accuracy of estimation. A support vector machine (SVM) with nonlinear kernels was also adopted to estimate hand tremor [19]. Although it outperforms the MLP algorithms, its heavy computational load limits its application to real-time operation. Furthermore, these algorithms, which estimate hand tremor primarily at the present time step, are prone to failure in predicting future tremor because of the non-stationary nature of hand tremor. To overcome this limitation, tremor prediction models have also been proposed, including autoregressive (AR) and autoregressive moving-average (ARMA) models [20]. Tremor prediction methods based on the least-squares support vector machine (LS-SVM) [19] and the extreme learning machine [21] were also introduced. Recently, Shahtalebi et al. proposed a deep-learning-based methodology to both estimate and predict pathological hand tremor [17]. However, that model did not generalize well to various scenarios of motion.
To address this limitation, a generalizable model adopting a deep recurrent architecture trained with a sizeable dataset was also proposed [18]. However, that model entails a high computational load owing to its complex learning architecture and extensive training.

B. DECOMPOSITION AND ENSEMBLE LEARNING
A hybrid approach that embeds appropriate preprocessing models in deep learning architectures can improve learning performance because it makes learnable features easier to find [22]. Decomposition-and-ensemble is one such hybrid model, and its main ideas are as follows. First, it decomposes a raw signal into sub-signals with a decomposition method. Each sub-signal block then outputs its own prediction of the target signal. Finally, it ensembles the multiple predictions from the processing blocks to obtain a final output.
The Fourier transform is one of the common decomposition algorithms used to analyze hand motion in time series [23], [24]. It decomposes a raw signal into sinusoidal sub-signals corresponding to frequencies of interest. Empirical mode decomposition (EMD), which is a part of the Hilbert-Huang transform (HHT), is another decomposition method suitable for forecasting time-series data [25]. EMD is also used in tremor decomposition to overcome the limitations imposed by the prior assumptions of Fourier transform-based methods: linearity and stationarity [26]. Although signals can be decomposed using these existing methods, the decomposed signals are not optimized for signal prediction. Recently, Wang et al. introduced a neural network-based decomposition adopting a wavelet decomposition network (WDN) [27]. It decomposes time-series data into a group of sub-signals organized by frequency, which is crucial for taking frequency factors into account in prediction. Consequently, neural network-based decomposition can take advantage of both time-series decomposition and the learning ability of neural networks.
VOLUME 8, 2020

III. PROPOSED METHOD

A. OVERALL ARCHITECTURE
We propose a deep learning model incorporating neural network structures that decompose complex signals into sub-signals and ensemble them to accurately and efficiently predict future voluntary motion. The overall structure of the model comprises sequential signal blocks corresponding to decomposed sub-signals. Each signal block estimates sub-signal and predicts future voluntary motion. The signal blocks used in our model can be classified as a voluntary signal block and a tremor signal block, depending on the characteristics of signals to be estimated. Signal blocks are connected by a decomposition flow in series and also by a forecast flow in parallel to ensemble the prediction output of each block. The overall architecture is represented in Fig. 2. The algorithm details are as follows.

B. SIGNAL BLOCK
Each signal block is responsible for estimating a specific model of sub-signal and predicting the voluntary motion that would occur at the next time step. For a historical time window of a specific size ending at the present time, the input of the $b$th block is denoted by $X_b$, and the dual outputs of the block are denoted by $\hat{y}_b$ and $\hat{O}_b$. We define the input $X_b$ as in (1).
$$X_b = \left[ s_{b,t_{i-F_s/2+1}}, \ldots, s_{b,t_i} \right] \tag{1}$$
where $s_{b,t_i}$ is the input signal of the $b$th block at the $i$th timestamp $t_i$, $F_s$ is the sampling frequency, and the size of the historical window is set to 0.5 s in our model. The output $\hat{y}_b$ represents the prediction of voluntary motion at the next time step $t_{i+1}$. The other output, $\hat{O}_b$, is the sub-signal decomposed by the $b$th signal block. The input to the next block, $X_{b+1}$, is given by subtracting $\hat{O}_b$ from $X_b$, and this is repeated until the last signal block. The details of the decomposition flow are described in III.C.
Depending on the type of sub-signal to be decomposed, the signal block is modeled differently to take the signal characteristics into account. Herein, we introduce two types of signal blocks to decompose the raw signal into voluntary motion and hand tremor. For the voluntary signal block, we approximate the signal as a low-order polynomial of small degree $P$, since voluntary motion appears fairly monotonic, or cyclic with frequencies below 2 Hz, within the specified time window. The predicted voluntary signal is then obtained by extrapolating the polynomial as in (2).
$$\hat{y}_b = \sum_{p=0}^{P} \theta^{F}_{b,p}\, t_{i+1}^{\,p} \tag{2}$$
where $\theta^{F}_{b,p}$ is the $p$th coefficient of the polynomial for forecast. The elements of the block output $\hat{O}_b$, which describes the voluntary motion in the given time window, are also obtained by a parameterized function, as in (3), since we assume the predicted voluntary motion to be a low-order polynomial.
$$\hat{o}_{b,k} = \sum_{p=0}^{P} \theta^{D}_{b,p}\, t_{k}^{\,p} \tag{3}$$
where $\theta^{D}_{b,p}$ is the $p$th coefficient of the polynomial for decomposition and $k = i - F_s/2 + 1, \ldots, i$. The output $\hat{O}_b$ is then constructed as in (4).
$$\hat{O}_b = \left[ \hat{o}_{b,i-F_s/2+1}, \ldots, \hat{o}_{b,i} \right] \tag{4}$$
On the other hand, the outputs of the tremor signal block, $\hat{y}_b$ and $\hat{O}_b$, are approximated by sums of sinusoidal signals in consideration of the rhythmic and oscillatory characteristics of tremor. Accordingly, the output $\hat{y}_b$ of the tremor block is described by the Fourier series as in (5).
$$\hat{y}_b = \sum_{m} \left[ \theta^{F}_{b,2m} \cos\left(2\pi f_m t_{i+1}\right) + \theta^{F}_{b,2m+1} \sin\left(2\pi f_m t_{i+1}\right) \right] \tag{5}$$
where $\theta^{F}_{b,2m}$ and $\theta^{F}_{b,2m+1}$ are Fourier coefficients for a set of multiple frequencies, $f_m = 1, 2, \ldots, F_s/2$. $\hat{O}_b$, corresponding to the decomposed tremor signal, is then constructed using (4) and (6) in the same manner.
$$\hat{o}_{b,k} = \sum_{m} \left[ \theta^{D}_{b,2m} \cos\left(2\pi f_m t_{k}\right) + \theta^{D}_{b,2m+1} \sin\left(2\pi f_m t_{k}\right) \right] \tag{6}$$
where $\theta^{D}_{b,2m}$ and $\theta^{D}_{b,2m+1}$ are Fourier coefficients that describe the decomposed tremor signal. We obtain those parameters for each signal block using a neural network with four hidden layers. Each hidden layer has an activation function for nonlinear mapping (a rectified linear unit, ReLU [28]), and the last layer is a fully connected layer without an activation function.
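The two parameterizations can be illustrated with a minimal Python sketch. It assumes that the network has already produced the coefficient vectors, that time is measured in seconds with the present sample at $t_i = 0$, and that cosine terms pair with even-indexed coefficients (zero-based here). The function names and these conventions are illustrative assumptions rather than the authors' implementation.

```python
import math

FS = 240            # sampling frequency in Hz (from the dataset description)
WINDOW = FS // 2    # 0.5 s historical window -> 120 samples


def window_times(fs=FS, window=WINDOW):
    """Timestamps of the window in seconds; the present sample sits at t = 0 (assumed)."""
    return [(k - (window - 1)) / fs for k in range(window)]


def polynomial_block(theta_d, theta_f, fs=FS, window=WINDOW):
    """Voluntary block: polynomial over the window plus a one-step-ahead forecast."""
    def poly(theta, t):
        return sum(c * t ** p for p, c in enumerate(theta))

    sub_signal = [poly(theta_d, t) for t in window_times(fs, window)]
    forecast = poly(theta_f, 1.0 / fs)   # t_{i+1} is one sample after the present
    return forecast, sub_signal


def fourier_block(theta_d, theta_f, freqs, fs=FS, window=WINDOW):
    """Tremor block: sum of cosine/sine pairs at the given frequencies (Hz)."""
    def series(theta, t):
        return sum(
            theta[2 * m] * math.cos(2 * math.pi * f * t)
            + theta[2 * m + 1] * math.sin(2 * math.pi * f * t)
            for m, f in enumerate(freqs)
        )

    sub_signal = [series(theta_d, t) for t in window_times(fs, window)]
    forecast = series(theta_f, 1.0 / fs)
    return forecast, sub_signal
```

For instance, `polynomial_block([1.0, 2.0], [1.0, 2.0])` evaluates the line $1 + 2t$ over the past 0.5 s and one sample into the future. In the model itself the two coefficient sets are produced separately by the network and need not be equal.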

C. FORECAST AND DECOMPOSITION FLOWS
The two types of outputs, $\hat{y}_b$ and $\hat{O}_b$, from each signal block are used for a forecast flow and a decomposition flow, respectively.
In the forecast flow, the ensemble of the prediction results from the multiple signal blocks allows the voluntary motion at the next time step to be predicted accurately. The formula for the forecast flow is as follows:
$$\hat{y} = \sum_{b=1}^{n} w_b\, \hat{y}_b \tag{7}$$
where $n$ is the number of blocks used and $w_b$ is the weight of the prediction outcome from the $b$th block. Consequently, the final prediction $\hat{y}$ is attained by the weighted sum of the prediction results from the multiple signal blocks. The decomposition flow serves to decompose the entire signal into multiple sub-signals while proceeding through the cascaded model. Each block estimates the sub-signal modeled by its specific type of signal block. The formula for the decomposition flow is as follows:
$$X_{b+1} = X_b - \hat{O}_b, \qquad X_1 = X \tag{8}$$
where $X$ is the raw signal input and $X_1$ is equal to $X$ at the first signal block. As seen in (8), the input of each block is connected to the output of the previous block with a residual connection [29]. Therefore, subtracting the output of each block from its input signal excludes the sub-signal found in the current block from the next estimation. By proceeding with this process, the entire signal is decomposed into the individual sub-signals.
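The two flows can be sketched in a few lines of Python. In this illustration a block is any callable that maps a window to a forecast and a sub-signal. The stand-in blocks `mean_block` and `line_block`, which fit a constant and a least-squares line, and the unit weights used in the usage note are hypothetical placeholders for the trained networks and the learned ensemble weights.

```python
def run_cascade(x, blocks, weights):
    """Run blocks in series on the residual and ensemble their forecasts.

    x       : list of floats, the raw window X (so X_1 = X)
    blocks  : callables returning (y_hat_b, o_hat_b) for a given window
    weights : ensemble weights w_b, one per block
    """
    residual = list(x)
    y_hat = 0.0
    sub_signals = []
    for block, w in zip(blocks, weights):
        y_b, o_b = block(residual)
        y_hat += w * y_b                                    # forecast flow: weighted sum
        residual = [r - o for r, o in zip(residual, o_b)]   # decomposition flow: X_{b+1} = X_b - O_b
        sub_signals.append(o_b)
    return y_hat, sub_signals, residual


def mean_block(window):
    """Hypothetical stand-in: models the window as a constant (its mean)."""
    m = sum(window) / len(window)
    return m, [m] * len(window)


def line_block(window):
    """Hypothetical stand-in: least-squares line, extrapolated one step ahead."""
    n = len(window)
    t_mean = (n - 1) / 2
    x_mean = sum(window) / n
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * x for t, x in enumerate(window)) / denom
    intercept = x_mean - slope * t_mean
    return intercept + slope * n, [intercept + slope * t for t in range(n)]
```

Calling `run_cascade([1.0, 2.0, 3.0, 4.0], [mean_block, line_block], [1.0, 1.0])` removes the mean in the first block and the remaining ramp in the second, so the final residual is zero and the sub-signals add back up to the input window.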

IV. EXPERIMENTAL RESULTS
We evaluated the proposed prediction algorithm on real hand motion data by comparing its prediction error and accuracy with those of other machine learning algorithms: 1) an artificial neural network (ANN) [30], 2) a recurrent neural network (RNN) [31], 3) a long short-term memory (LSTM) model [32], and 4) PHTNet, which employs bidirectional gated recurrent units (bi-GRU) [18]. Given the trained models, the first step was to validate the prediction performance of the proposed algorithm on trained data. Next, to investigate the generalizability of the prediction algorithm, we tested it on new data sets that were not involved in the training procedures: new tasks and a different subject.

A. HAND MOTION DATASET
We collected hand motion data using a magnetic tracker (Liberty™ with Micro Sensor 1.8™, Polhemus Ltd., USA), which provides six-degrees-of-freedom (6-DOF) motion with micron-level precision at a sampling frequency of 240 Hz; we thus aimed to predict voluntary motion at the next time step, approximately 4.167 ms ahead. To collect hand motion data of the kind that could occur during surgical operation, the sensor was fixed on one end of a handpiece, and a 23G hypodermic needle was attached to the other end of the handpiece to mimic a surgical tool. The position of the tool tip was then retrieved by a homogeneous transformation given the 6-DOF motion of the sensor. Two types of tasks were designed for the experiments: point/line and circle tracing. These tasks are regarded as relatively static and dynamic, respectively. For the static task, the subject was instructed to hold the tool tip above a printed target surface during data collection. The task was performed under four different scenarios: blind, bare eye, low magnification (1.5X), and high magnification (6X) using a stereo-microscope (SZX7, Olympus Corp., Japan). For the dynamic task, the subject was instructed to trace circles of various sizes (1, 2, and 10 mm) above the printed target surface. Each task under the various settings was repeated five times. A single trial was logged for 30 seconds, which yielded 7200 samples and, with a historical time window of 120 samples, 7080 data sets for training and testing. Consequently, we collected about 310,000 data sets in total: nine of the 10 task sets were used for training a model and the rest of the data for testing.
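The per-trial count follows from sliding a 120-sample window over a 7200-sample trial. A minimal sketch, assuming one input-target pair per window position; the helper name `make_pairs` is hypothetical, and the target here is simply the next sample of a placeholder series, whereas the model is trained against the filtered ground truth:

```python
def make_pairs(signal, window=120):
    """Pair each window of past samples with the sample that follows it."""
    return [(signal[i - window:i], signal[i]) for i in range(window, len(signal))]


fs_hz = 240                      # tracker sampling frequency
trial = list(range(30 * fs_hz))  # placeholder for one 30 s trial: 7200 samples
pairs = make_pairs(trial)
print(len(trial), len(pairs))    # 7200 7080
```

The first 120 samples only ever serve as history, which is why each trial contributes 7200 - 120 = 7080 pairs.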

B. MODEL TRAINING
We applied a zero-phase, seventh-order Butterworth filter with a cut-off frequency of 2 Hz to the raw motion signal to provide the ground truth of the predicted voluntary motion. In addition, the mean squared error (MSE) between the predicted motion and the ground truth was used as a loss function for training the proposed model. The model was trained for 1000 epochs with an initial learning rate of $10^{-3}$. To prevent overfitting, the learning rate was halved if the loss did not decrease for five epochs. An early stopping criterion was also invoked if the loss did not decrease for 40 epochs.
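The learning-rate decay and early stopping rules can be sketched as a plain loop over per-epoch losses. This is an illustration only: the function name, the strict-improvement test, and resetting the decay counter after each halving are assumptions. The text specifies only the patience values (5 and 40 epochs), the decay factor of 0.5, and the initial rate.

```python
def training_schedule(losses, lr=1e-3, decay=0.5, decay_patience=5, stop_patience=40):
    """Return (learning rate used at each epoch, index of the last epoch run)."""
    best = float("inf")
    since_best = 0     # epochs since the loss last improved (early stopping)
    since_decay = 0    # epochs without improvement since the last decay
    rates = []
    for epoch, loss in enumerate(losses):
        rates.append(lr)
        if loss < best:
            best, since_best, since_decay = loss, 0, 0
        else:
            since_best += 1
            since_decay += 1
            if since_decay >= decay_patience:
                lr *= decay            # halve the learning rate
                since_decay = 0
            if since_best >= stop_patience:
                return rates, epoch    # early stopping
    return rates, len(losses) - 1
```

With a loss that plateaus after the second epoch, the rate is halved every five epochs and training stops 40 epochs after the last improvement.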
We set a total of eight signal blocks for the model: the first four blocks for the decomposition of the voluntary signal and the remaining four blocks for the tremor signal. Two of the four voluntary blocks were parameterized as linear functions for estimating voluntary motion, and the other two blocks were defined as cubic functions. The four tremor blocks were modeled as combinations of cyclic functions. For example, the DC offsets of the input signal were captured by the first two voluntary blocks. The next sub-signals were then fitted with cubic functions for the estimation of the voluntary signal, approximating voluntary motion with a frequency of 2 Hz as a third-order polynomial within a time-series window of 0.5 s.
Learning was accomplished by applying a backpropagation algorithm to minimize the root-mean-square error (RMSE) in prediction. We used adaptive moment estimation (Adam) as the optimizer and performed 9-fold cross-validation to test the performance of the proposed model. The same learning settings were applied to the baseline models: ANN, RNN, LSTM, and PHTNet.

C. DESCRIPTION OF BASELINE MODELS
In this section, we describe the four baseline models in detail. The ANN model consists of two hidden layers and a single output layer. The first layer of the ANN is a dense layer with 480 hidden units that uses ReLU as the activation function for nonlinear mapping. It is followed by another dense layer with the same number of hidden units and the same activation function. The second layer is followed by the output layer, which contains a single unit for the prediction output without any activation function. The RNN model consists of an RNN layer and an output layer. The RNN layer includes eight hidden units with a hyperbolic tangent (tanh) activation function. The cell output of the RNN layer is connected to the single unit of the output layer without any activation function. Similarly, the LSTM model incorporates an LSTM layer and an output layer. The LSTM layer also consists of eight hidden units with the tanh activation function, and its cell output is connected to an output layer that has a single unit for the prediction output without any activation function. The architecture of PHTNet comprises four layers of bidirectional GRU (bi-GRU) [30] and one output layer. The first layer of PHTNet is a bi-GRU layer with four hidden units that uses ReLU as the activation function. It is followed by another bi-GRU layer with the same setting, which is repeated for the third and fourth bi-GRU layers. The fourth layer is then followed by an output layer that has one unit without any activation function.

D. PERFORMANCE MEASURE
For quantitative analysis of the performance, we measured the root-mean-square error (RMSE) between the actual voluntary motion signal and the predicted signal as in (10). In addition, we investigated the accuracy, which indicates how accurately an algorithm predicts the target signal with respect to the ground truth.
$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( y(t) - \hat{y}(t) \right)^2} \tag{10}$$
where $T$ is the total length of the signal, $y(t)$ is the actual voluntary motion, and $\hat{y}(t)$ is the predicted voluntary motion signal at the $t$th timestamp. Accuracy is defined as below:
$$\mathrm{Accuracy} = \frac{\mathrm{RMS}(s_x) - \mathrm{RMSE}}{\mathrm{RMS}(s_x)} \times 100\% \tag{11}$$
Herein, $s_x$ is equal to $y(t)$ and $\mathrm{RMS}(s_x)$ is the root-mean-square amplitude of the voluntary signal.
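Both measures are straightforward to compute from paired samples. The sketch below assumes the relative form of accuracy, $(\mathrm{RMS}(s_x) - \mathrm{RMSE}) / \mathrm{RMS}(s_x) \times 100\%$, which is common in tremor estimation work; the function names are illustrative.

```python
import math


def rms(values):
    """Root-mean-square amplitude of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return rms([a - p for a, p in zip(actual, predicted)])


def accuracy_percent(actual, predicted):
    """Assumed relative accuracy: 100 * (RMS(actual) - RMSE) / RMS(actual)."""
    reference = rms(actual)
    return 100.0 * (reference - rmse(actual, predicted)) / reference
```

Under this definition a perfect prediction scores 100%, and a prediction whose error is as large as the signal itself scores 0%.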

E. VOLUNTARY MOTION PREDICTION
We first validated our trained models by comparing them with the other algorithms. For the comparison of prediction performance, we trained two types of models. Model I was trained on nine of the 10 task sets, excluding one of the static tasks. Model II was trained on nine task sets, excluding one of the dynamic tasks. With both models, the proposed method yielded the smallest RMSE and the highest accuracy among the five algorithms. Interestingly, PHTNet shows considerably lower performance than the proposed method with Model I (static), but its performance improved with Model II (dynamic), reaching a level similar to that of our algorithm. On the other hand, the LSTM model performs similarly to our algorithm with Model I but shows the worst performance with Model II. Fig. 3 shows the resulting trajectories of predicted voluntary motion from the five algorithms: ANN, RNN, LSTM, PHTNet, and the proposed method. The RMSE and accuracy are summarized in Table I.

F. GENERALIZED PERFORMANCE
We also evaluated whether the proposed model would generalize to data sets with new distributions by testing the prediction model on new tasks and a different subject. The first experiment verified the generalized performance on new tasks that had not been involved in the training procedures. As summarized in Table II, the generalized performance of the proposed method is comparable with that of PHTNet. The resulting trajectories are also shown in Fig. 4. The second test examined the generalizability of the algorithm to new data from a different subject. In this experiment, the proposed method still shows its generalizability on the new data set, with results similar to those obtained from PHTNet. [...] Table III. In this experiment, it is noted that the different subject shows a smaller level of hand tremor than the subject who participated in model training, which [...] tremor compensation. As shown in Fig. 6, hand tremor was drastically suppressed within a frequency range of 6-14 Hz over the time of operation. Furthermore, we analyzed the power spectral density of the predicted signals. As a result, hand tremor was attenuated by 10.9 and 13.1 dB in the frequency band of interest, 6-14 Hz, when the voluntary signals predicted for the new task and the different subject, respectively, were applied to control.

H. PREDICTABILITY
The proposed model can predict voluntary motion one time step ahead as well as further over a time horizon because the motion is parameterized with respect to time. Hence, we explored the prediction capability for voluntary motion multiple time steps ahead. The RMSE at each future time step is depicted in Fig. 8. As the time step advances, the RMSE gradually increases. In particular, the prediction error at 10 time steps ahead of the present time step is 25% higher than the error obtained at one time step ahead.
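Because the forecast is a parametric function of time, covering a control delay amounts to evaluating that function further ahead. The sketch below is a hypothetical illustration: it converts a delay into a number of 240 Hz time steps and extrapolates an example polynomial, with time measured in seconds from the present sample (an assumed convention). The function names are not from the original work.

```python
import math


def steps_for_delay(delay_ms, fs_hz=240):
    """Smallest whole number of time steps that covers the given control delay."""
    return math.ceil(delay_ms * fs_hz / 1000)


def extrapolate(theta, steps, fs_hz=240):
    """Evaluate the polynomial sum(theta[p] * t**p) at t = steps / fs_hz seconds ahead."""
    t = steps / fs_hz
    return sum(c * t ** p for p, c in enumerate(theta))
```

A 10 ms delay, for example, spans 2.4 sample periods at 240 Hz, so the prediction would be requested three time steps ahead.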

V. DISCUSSION
Compared with the previous studies to predict voluntary motion [17], [18], [31], the proposed algorithm addresses the issues raised in the existing models [18] by adopting a hybrid [...] Moreover, the proposed algorithm allows for predicting future voluntary motion multiple time steps ahead via the parametric modeling of voluntary motion rather than simply adopting a black-box model. Hence, the proposed model can also provide the motion prediction at any time step requested, for example under an inconsistent time delay in control. For instance, the prediction error increased by only 2.2% per time step (4.17 ms) in the experiments.
Lastly, the proposed algorithm is substantially more time-efficient in training since it mitigates the issues raised by the high complexity of deep learning-based approaches. For comparison, the proposed algorithm took 40 seconds to complete one epoch, whereas PHTNet took about 5 minutes.

VI. CONCLUSION AND FUTURE WORK
In summary, this study aimed to develop a new deep learning model that can accurately predict voluntary motion from complex and nonstationary hand motion. Such prediction is essential for actively suppressing undesired hand tremor and seamlessly delivering voluntary motion to surgical robots during microsurgery. To achieve this goal, we proposed a deep neural network with a decomposition and ensemble approach that can predict future voluntary motion with minimal error. The model decomposes complex hand motion into multiple sub-signals and ensembles the parameterized outputs to predict future voluntary motion. As a result, the proposed model outperforms the baseline algorithms in terms of prediction error (RMSE) and accuracy. Moreover, the algorithm generalizes to new data from new tasks and from a different subject. The proposed algorithm was also tested for active tremor suppression, which resulted in substantial tremor attenuation in the frequency band of interest, 6-14 Hz. Furthermore, the proposed algorithm is capable of predicting future voluntary motion within a certain time window with a low error increment since the output is parameterized with respect to the time step.
To improve the proposed algorithm, future work involves the optimization of the preset historical time window and the number and types of signal blocks. In addition, reinforcement learning to evolve a trained model according to new data would be preferable to enhance overall performance as well as adaptability to various circumstances.
We also plan to apply this algorithm to a surgical robot platform to explore its capability for active tremor compensation. We believe the proposed decomposition and ensemble method also suggests a direction for addressing the time delay caused by the linear filters commonly used for active tremor compensation. Finally, the approach and structure of the proposed algorithm could also be utilized in other research areas, such as the estimation of pathological tremor for rehabilitation.