Kernel Reinforcement Learning-Assisted Adaptive Decoder Facilitates Stable and Continuous Brain Control Tasks

Brain-Machine Interfaces (BMIs) assist paralyzed people in brain controlling (BC) a neuro-prosthesis that moves continuously in space. During the BC process, the subject imagines moving the real limb and adapts the brain activity according to the sensory feedback. The neural adaptation in closed-loop control results in complex and changing brain signals. Simultaneously, the decoder must interpret the time-varying functional mapping between neural activity and the continuous trajectory. It is crucial and challenging to track this mapping accurately and adaptively so that the subject can accomplish the BC task with stable performance. Existing Kalman Filter (KF) based decoders achieve continuous trajectory control by linearly interpreting neural firing observations into self-evolving prosthetic states. However, the linear neural-state mapping might not accurately reflect the movement intention of the subject. In this paper, we propose a novel method that allows subjects to achieve continuous brain control efficiently and stably. The proposed method incorporates a kernel reinforcement learning method into a state-observation model to decode the nonlinear neural observation into a continuous trajectory state. The state transition function ensures the continuity of the prosthetic state, while the kernel reinforcement learning allows quick adaptation of the nonlinear neural-movement mapping during the BC process. The proposed method is tested in an online brain control reaching task for rats. Compared with KF, our method achieved more successful trials, faster response times, and shorter inter-trial times, and remained stable over days. These results demonstrate that the proposed method is an efficient tool for assisting subjects in brain control tasks.


I. INTRODUCTION
Brain-Machine Interface (BMI) [1], [2], [3] allows paralyzed people to restore their motor functions by decoding brain signals into the movement intention. The movement intention is then used to control an external device, which serves as a substitute for the subject's real limb. To achieve neuro-prosthesis control without real limb movement, the brain control (BC) paradigm has been proposed in BMI. During the training of a BC task, the subjects need to imagine using their real limbs. A decoder establishes the mapping between the neural signals and the movement intention and gives commands to a prosthesis. Subjects must adapt their neural signals according to the sensory feedback to accomplish a movement task.
BC tasks are in general more difficult for the subjects to accomplish without actual limb movement, which brings challenges for the decoder. First, since there is no real limb movement, the neural signals are different and tend to be more complex and noisier. The brain signals are high-dimensional, and the neural-behavior mapping is non-linear [4], [5], [6], [7]. The decoder in BC needs to accurately interpret the movement intention under an arbitrary functional mapping. Second, the subject receives sensory feedback at every time instance while the neuro-prosthesis is continuously moving in space. This closed-loop interaction makes the subjects adapt their neural signals frequently, at the level of hundreds of milliseconds [8], [9], [10]. The decoder in BC needs to follow the changes during the subject's adaptation. Third, because the movement of the neuro-prosthesis is continuous in space, decoding for BC is not a simple classification (e.g., a one-step reaching task) but continuous trajectory control. In this case, the decoding error at every time instance may accumulate and lead to a failure of the task. Thus, it is important to correct small errors at every possible time step. Finally, the neural signals are non-stationary over days even for the same BC task [11], [12], [13], [14], [15]. The decoder needs to maintain stable BC performance over days [16].
A common approach for continuous brain control in BMI is the Kalman Filter (KF) [17], [18], [19]. KF treats the prosthetic movement as a linearly evolving state. The neural activities of the subject serve as the observation and are linearly associated with the prosthetic state. However, the linear neural-state mapping might not accurately reflect the subject's movement intention. Due to the limited decoding accuracy of KF, the subject needs to try and adjust the brain activities many times to control the prosthesis to the target position. In addition, KF's parameters remain fixed after the model is established, which might lead to a performance drop when the subject has a significant change in neural activities.
To build the neural-state mapping more accurately in BC tasks, the Extended Kalman Filter (EKF) [20], [21] has been proposed for BMI decoding. EKF employs a Taylor series expansion to approximate the nonlinearity of the neural-state mapping. In most cases, EKF uses a first-order approximation, capturing only the most prominent linear characteristics of the underlying non-linear system. As a result, EKF may not provide sufficient accuracy for highly non-linear neural systems, which can lead to suboptimal performance during the decoding process in BC tasks.
To track the non-stationary neural signals during BC tasks, ReFIT-KF [22] was proposed to recalibrate the parameters of the KF decoder when the subject has difficulties completing the task. The technique uses the principal components of the neural data as the observation input and rotates the cursor's velocity towards the target as the state output. The modified state and neural data are then used to retrain the parameters. ReFIT-KF has been applied successfully in several BMI applications [22], [23], [24]. However, there are still some limitations. First, ReFIT-KF uses the same linear model as KF to establish the relationship between the neural data and the prosthetic kinematics, which is less accurate than a nonlinear model. Second, when using ReFIT-KF over days, dedicated ReFIT sessions are needed at the beginning of each day's training when performance drops, which is less efficient than adapting the decoding parameters during the BC tasks. In addition, during the calibration process of ReFIT-KF, the velocity is directly rotated towards the target. If the subjects get distracted or do not prefer to move to the target directly, the velocity rotation might over-dominate the subject's true intention. The prosthesis can only approach the target in a pre-defined manner, which narrows the flexibility of the decoder for general BC tasks.
Alternatively, reinforcement learning (RL) [25], [26] provides accurate and adaptive decoding with more flexibility. An RL decoder non-linearly reinforces the neural-action mapping that drives the neuro-prosthesis towards the target, which has worked well in discrete action selection tasks in BMI [27]. One advantage of the RL decoder is adaptation during usage. When the subject is performing the task, the RL decoder updates its parameters by reward and explores the neural-action space through trial and error. To enhance the exploration efficiency, the authors in [28], [29], and [30] proposed a Clustering-based Kernel Reinforcement Learning (CKRL) model. The neural activities are projected into a Reproducing Kernel Hilbert Space (RKHS) [31], where the non-linear neural patterns of different movement intentions can be distinguished linearly [32], and a globally optimal solution can be found according to the representer theorem [33]. However, the RL decoder has a limitation in the BC task. For example, in [26], rats brain-controlled a robotic arm with a limited set of actions to perform a reaching task. Even though the robotic arm could arrive at the target region, the resulting trajectory is either rigidly straight or jagged, which would be unnatural for a patient to use. Thus, the existing RL decoder is not appropriate for continuous BC tasks due to its discrete action output.
In this paper, we propose a Clustering-based Kernel Reinforcement Learning assisted Adaptive Decoder (CKRLAD) that helps the subject achieve continuous brain control efficiently and stably. The proposed method models the prosthetic movement and the neural signals as the evolving state and observation, respectively. The prosthetic movement is characterized by a state transition function to ensure the continuity of the brain-controlled trajectory. The neural signals are interpreted as the observation of the prosthetic movement by CKRL, which inherits the advantages of non-linear mapping and quick adaptation, so that the observation model can follow the change of neural signals online during the closed-loop BC task. The state transition serves as the prior estimation and the neural signal interpretation further updates the posterior estimation of the subject's movement intention, which merges the merits of both RL and KF. In this way, during the BC task, the movement intention of the subject can be decoded accurately, continuously, and adaptively. The method also addresses the daily change of neural activities, which contributes to stable BC performance over days. The proposed algorithm is tested in an experiment where two rats learned a brain control reaching task. We compare the proposed algorithm with KF, EKF, and ReFIT KF to demonstrate the advantages of online adaptation and accurate decoding. We evaluate the algorithms in terms of total trial numbers, response time, and inter-trial interval over six days. The goal is to determine whether the proposed algorithm can decode the neural activity to accomplish more trials in less time, which makes the BC task easier for the subjects.
The rest of the paper is organized as follows. Section II gives a detailed description of the experimental design and the mathematics of the proposed algorithm. In Section III, the evaluations of the algorithms on the experiment are presented. Finally, the discussion and conclusion are given in Section IV.

A. Signal Acquisition and Behavioral Task
The one-lever pressing brain control experiment was conducted at The Hong Kong University of Science and Technology (HKUST).All animal handling procedures were approved by the Animal Care Committee of HKUST, which strictly complied with the Guide for Care and Use of Laboratory Animals.
The subjects used in this study were two male Sprague Dawley (SD) rats.For each rat, we implanted two 16-channel electrode arrays into the left hemisphere of the brain, one in the primary motor cortex (M1), and another in the medial prefrontal cortex (mPFC).The rats were trained to perform the task with their right paws.
After the surgery, the rats were given one week to recover. They were first trained to perform a one-lever pressing Manual Control (MC) task as a prerequisite for the brain control reaching task, as shown in Fig. 1(a). The rats needed to wait for an audio cue inside a behavior box near a water tap. When the rats heard the cue, they needed to press and hold a lever for 0.5 seconds. Then the rats needed to go back to the tap for the water reward. The physical movement of the rats formed a trajectory between the tap and the lever. During the MC task, the neural data were recorded from the rats. The raw signals were passed through a high-pass filter at 500 Hz, and spikes were detected at an amplitude of −3.5σ_s, where σ_s is the standard deviation of the high-passed signal's amplitude. The spikes were then counted every 0.1 s for all 32 channels, and 500 ms of historical spike counts were concatenated to the current time bin. The final input to the decoder was a spike count vector. After the rats were proficient in the MC task (success ratio greater than 80%), we transformed the physical movement of the rats into a smoothed trajectory for the BC stage. The spike counts and the trajectory were then used to train the parameters of KF. The rats then began to learn to brain control their intended movement position through KF.
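As a concrete illustration, the preprocessing pipeline above (threshold crossing at −3.5σ_s, 100 ms binning, 500 ms of concatenated history) can be sketched as follows. The function name and the sampling rate are our own assumptions for illustration, not details from the paper.

```python
import numpy as np

def build_decoder_input(raw, fs=1000, bin_ms=100, history_bins=5):
    # raw: (channels, samples) array, already high-pass filtered above 500 Hz
    sigma = np.std(raw, axis=1, keepdims=True)
    below = raw < -3.5 * sigma                     # threshold crossing at -3.5*sigma_s
    onsets = below[:, 1:] & ~below[:, :-1]         # count each crossing once, at its onset
    samples_per_bin = int(fs * bin_ms / 1000)
    n_bins = onsets.shape[1] // samples_per_bin
    counts = (onsets[:, :n_bins * samples_per_bin]
              .reshape(raw.shape[0], n_bins, samples_per_bin).sum(axis=2))
    # concatenate 5 history bins (500 ms) with the current bin: C*(H+1) entries
    return np.asarray([counts[:, t - history_bins:t + 1].ravel()
                       for t in range(history_bins, n_bins)])
```

With 32 channels and 5 history bins, each decoder input vector has 32 × 6 = 192 entries, matching the C(H+1)-dimensional observation used later in the model.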
At the transition stage, the rats could still press the lever in the behavior box. However, the reward was based on the position decoded from the rats' neural activities. Initially, the rats would still move to press the lever. As the rats became proficient, they found that they could get the water reward without lever pressing. The lever was then removed in the pure BC task, and the rats completed the reaching task by imagining that they were pressing the lever as in MC. The flow diagram shows the control process of the BC task using the Kalman filter (black arrows, Fig. 1b). The rats needed to hold the intended position within the start zone (blue region, 0-0.75) for a predefined threshold (3 s) to trigger the start of one trial. When the intended position is out of this range, the trial cannot start, which ensures that the rat has a relatively stable state, analogous to the preparation position in MC. When the trial started, the rats would hear an audio cue (10 kHz, duration 0.9 s). The rats then needed to continuously move their intended position into the press zone (gray region, 0.75-1.5) and stay within the press zone for a pre-defined time (0.5 s). This also ensures that the rats hold a stable state, like holding the lever in MC. They would then hear a success cue (10 kHz, duration 0.09 s) and get a drop of water as a reward. If the rats could not control the intended position to reach the press zone within the trial time (8 s), or did not hold the intended position within the range for the required period, the trial was considered a failure, and the rats needed to hold the intended position within the rest zone to trigger another trial.
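The trial logic above can be sketched as a small state machine. The zone boundaries and hold times mirror the values in the text, while the function itself is our illustrative reading of the protocol, not the authors' code.

```python
START_ZONE = (0.0, 0.75)   # blue region: hold here to trigger a trial
PRESS_ZONE = (0.75, 1.5)   # gray region: hold here to earn the reward

def run_trial(positions, dt=0.1, start_hold=3.0, press_hold=0.5, timeout=8.0):
    """positions: decoded intended position per 100 ms bin."""
    need_start = round(start_hold / dt)    # 30 consecutive bins in the start zone
    need_press = round(press_hold / dt)    # 5 consecutive bins in the press zone
    max_steps = round(timeout / dt)        # 8 s trial limit
    held = 0
    # phase 1: hold inside the start zone to trigger the trial
    for t, x in enumerate(positions):
        held = held + 1 if START_ZONE[0] <= x < START_ZONE[1] else 0
        if held == need_start:
            break
    else:
        return "no_start"
    # phase 2: reach and hold the press zone before the timeout
    held = 0
    for step, x in enumerate(positions[t + 1:], start=1):
        if step > max_steps:
            return "failure"
        held = held + 1 if PRESS_ZONE[0] <= x <= PRESS_ZONE[1] else 0
        if held == need_press:
            return "success"
    return "failure"
```

Counting bins as integers (rather than accumulating 0.1 s floats) keeps the hold-time comparisons exact.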

B. Online Decoding Paradigm
Here we design an online decoding paradigm to evaluate the performance of the proposed method. The audio feedback cues received by the rats are tied only to the KF decoder. The timing of the audio cues is shared among the decoders to keep the online decoding process fair. We did not let the animal perform the actual trials with different models in sequence, because the subject could carry over the previous learning experience when switching to another model, and it is hard to eliminate the effect of the previously trained model. In our online decoding paradigm, at each time instance, we feed the same neural data into the different decoders, including ReFIT KF, EKF, and CKRLAD, to generate individual trajectories in an online parallel fashion. Note that all the behavioral criteria (staying in the press or start zone for the required holding time) remain the same to evaluate decoder performance.
The online decoding scenario is shown by the red arrows in Fig. 1(b). At each time instance, each decoder receives the same neural input but generates its own output (intended position). To emulate the BC task in a more realistic way, the intended positions for all decoders were reset to match the KF decoder's position whenever a start or success cue was issued by the KF decoder, as shown in Fig. 1(b). The rationale is that the rats would adjust their neural activities in response to the same received feedback. Thus, the current decoding position is aligned with the KF decoder to represent the audio-triggered intention.
After a trial starts (within-trial scenario), all algorithms concurrently decode their respective intended positions. When the proposed algorithm decodes the intended position within the press zone for a sufficient duration (0.5 s), we assign a pseudo success. Its decoding then pauses until the KF success cue emerges, at which point all algorithms resume from the same animal intended position. This configuration facilitates a fair comparison of reaching speeds within trials generated by the KF decoder, as all algorithms start from an identical intended position.
In the absence of a KF-triggered trial (inter-trial scenario), we observe that the rats might still attempt to control the intended position into the press zone for the water reward. We therefore continue the online decoding of each algorithm and evaluate whether the intended position meets the trial start criterion: if the intended position was held within the start zone for 3 s, we designated this time instance as a pseudo start. No actual feedback was provided to the rats, and all algorithms continued decoding in parallel. If an algorithm then held the animal's intended position within the press zone for the required holding time (0.5 s), the trial was considered a pseudo success. In this case, the algorithm could potentially trigger additional trials for the rats, enabling them to complete more tasks within the same period.
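A minimal sketch of this parallel-evaluation loop, assuming a simple `step`/`reset_state` decoder interface of our own design (the toy integrator below merely stands in for KF/EKF/ReFIT KF/CKRLAD):

```python
class IntegratorDecoder:
    """Toy stand-in for a real decoder, exposing an assumed step/reset API."""
    def __init__(self, gain):
        self.gain, self.x = gain, 0.0
    def reset_state(self, x):
        self.x = x                        # re-align to the KF position at a cue
    def step(self, u):
        self.x += self.gain * u           # decode one neural sample
        return self.x

def parallel_decode(neural_stream, decoders, kf_cues):
    """All decoders consume the same neural stream; at every KF start/success
    cue (a time index -> KF position map), every decoder is reset to the KF
    position, mirroring the alignment rule described above."""
    positions = {name: [] for name in decoders}
    for t, u in enumerate(neural_stream):
        if t in kf_cues:
            for d in decoders.values():
                d.reset_state(kf_cues[t])
        for name, d in decoders.items():
            positions[name].append(d.step(u))
    return positions
```

Because every decoder sees identical inputs and identical reset points, differences between the resulting trajectories are attributable to the decoders alone.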

C. CKRL-Based Adaptive Decoder
The structure of the CKRL-based adaptive decoder is shown in Fig. 2. The top row represents the temporal evolution of the prosthetic state, where the current state is influenced by the previous state (e.g., the continuity of the animal's intended movement). On the other hand, the prosthetic state (top row) is also associated with the subject's neural activities (bottom row). The current neural activity (red dot) is nonlinearly translated into the RKHS feature space by CKRL [29]. CKRL generates the posterior prosthetic state (animal's intended position) representation from the neural activity, which updates the prior estimation from the state transition function. After each time step, based on the relative distance to the target, a reward signal is given to reinforce the parameters of CKRL as a time-variant observation model. The neural activity is accumulatively stored in a neural pattern reservoir, which can adaptively follow the time-varying neural activities of the subject. The mathematical details of the proposed algorithm are given in the following subsections.

1) State-Observation Decoding Model:
The state-observation model of the proposed method consists of two parts. The first part is the state transition function, which describes the relationship between the previous and current prosthetic states, as shown in (1). The second part is the observation function, which captures the association between the neural activity and the prosthetic state, as shown in (2).
x_t = F_t(x_{t−1}) + q_x (1)

The prosthetic state at the current time step t is denoted as a vector x_t ∈ R^{D×1}. In this BC experiment, we define x_t as the animal's intended position with one dimension (D = 1). Since the prosthesis moves continuously, the current state x_t is associated with the previous state x_{t−1} as shown in (1). F_t(·) is the state transition function, which is approximated by a static linear matrix F ∈ R^{D×D}. F is estimated by the linear least mean square method [17], [34]. q_x ∈ R^{D×1} denotes the zero-mean Gaussian noise with a covariance matrix Q_x ∈ R^{D×D}, which is estimated from the residual of the state transition function.

u_t = H_t(x_t) + q_o (2)

In (2), H_t(·) represents the mapping from the prosthetic state to the current neural activity u_t ∈ N^{C(H+1)×1}, where C is the total number of channels of the subject's neural signals and H is the total number of concatenated historical bins. q_o is the zero-mean Gaussian noise. Since the dimension of the neural activity u_t is usually much larger than the dimension of the prosthetic state x_t, it is more accurate to convert u_t into x_t, so that the estimation noise has a low dimension. Taking the inverse of (2), the observation equation becomes H_t^{−1}(u_t) = x_t + q_u, where q_u ∈ R^{D×1} denotes the transformation error, approximated as zero-mean Gaussian with a covariance matrix Q_u ∈ R^{D×D}. This inverse function H_t^{−1}(·) translates high-dimensional neural activities into a low-dimensional continuous trajectory in the BC task, which provides constraints and estimates that could potentially enhance decoding accuracy. The next question is how to model H_t^{−1}(·) accurately and adaptively.
Since CKRL [29] has the good property of being a universal approximator for arbitrary nonlinearity, and its optimization can reach a global optimum with higher training efficiency than neural-network-based RL [35], [36], [37], in the continuous BC task we utilize CKRL to numerically approximate the inverse mapping

φ_t(u_t) = H_t^{−1}(u_t) = x_t + q_u, (3)

where φ_t(u_t) transforms the high-dimensional neural activity u_t into the low-dimensional space of the prosthetic state x_t. Equation (3) can also be seen as a new observation model in which the observation matrix is the identity matrix. CKRL provides an estimate of the prosthetic state, which serves as a new observation to correct the prior estimation.
2) Adaptive Decoding During the Brain Control Process: In this subsection, we introduce the adaptive decoding procedure of the proposed method during the BC process. The first step is to generate a prior estimation of the prosthetic state. Since the initial condition of the state is uncertain in BC, to avoid unstable decoding, the prior state x_{t|t−1} is estimated in the form of an information filter:

x_{t|t−1} = F x_{t−1|t−1}, (4)

I_{t|t−1} = (F I_{t−1|t−1}^{−1} F^T + Q_x)^{−1}, (5)

where x_{t−1|t−1} denotes the posterior prosthetic state output at the previous time step t−1, and I_{t|t−1} ∈ R^{D×D} is the information matrix of the prior state, derived from the posterior information matrix I_{t−1|t−1} at the previous time [38]. The next step is to approximate the transformation φ_t(u_t) in (3). The value Q_π(x) of state x is approximated by CKRL with the kernel trick [33]:

Q_π(x) = Σ_i α_i^x K(u_i, u_t), (6)

where the input of CKRL is the neural activity u_t, u_i is a neural activity stored prior to time t, and α_i^x is the weight coefficient that associates the neural activity at time i with the state x. K(u_i, u_t) is the kernel function; we choose the commonly used Gaussian kernel K(u_i, u_t) = exp(−‖u_i − u_t‖² / (2σ²)), where σ is the kernel width. The probability π(x) of the state x is then estimated by a softmax policy:

π(x) = exp(Q_π(x)) / Σ_{x′} exp(Q_π(x′)). (7)

The output of CKRL is the expectation of the state:

φ̂_t(u_t; θ_t) = Σ_{x′} x′ π(x′). (8)
Here x′ ranges over all the possible states. φ̂_t(u_t; θ_t) is the approximation of the neural-state transformation φ_t(·) given θ_t, where θ_t denotes the parameters of CKRL prior to the current time t, containing the previous neural activities u_i and the corresponding weights α_i^x.
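Under our reading of (6)-(8), one CKRL observation step can be sketched as below; the array names and shapes are our own conventions, not the authors'.

```python
import numpy as np

def ckrl_state_estimate(u_t, reservoir_U, alpha, states, sigma=35.0):
    """u_t: current neural vector; reservoir_U: (n_samples, dim) stored neural
    patterns; alpha: (n_states, n_samples) kernel weights; states: candidate
    state values."""
    # (6): Q_pi(x) = sum_i alpha_i^x K(u_i, u_t), Gaussian kernel of width sigma
    k = np.exp(-np.sum((reservoir_U - u_t) ** 2, axis=1) / (2.0 * sigma ** 2))
    q = alpha @ k                                # one Q-value per candidate state
    # (7): softmax policy over the candidate states (shifted for stability)
    pi = np.exp(q - q.max())
    pi /= pi.sum()
    # (8): output the expectation of the state
    return float(states @ pi)
```

By the representer theorem, the estimate depends on the current neural activity only through its kernel similarities to the stored patterns, which is what makes the update on the weights α tractable.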
Given the state approximation from CKRL, the next step is to update the posterior prosthetic state in the form of an information filter [39]:

K = (I_{t|t−1} + Q_u^{−1})^{−1} Q_u^{−1}, (9)

x_{t|t} = x_{t|t−1} + K(φ̂_t(u_t; θ_t) − x_{t|t−1}), (10)

I_{t|t} = I_{t|t−1} + Q_u^{−1}, (11)

where K is the coefficient that determines how much we should trust the output of CKRL relative to the output of the state transition function. The posterior prosthetic state x_{t|t} and information matrix I_{t|t} are updated in (10) and (11), respectively, and x_{t|t} is the final output of the proposed method at the current time step t. The final step is to adaptively update the decoding parameters θ_t, which follows the change of the neural activity in the BC process through trial and error. For every trial in the BC task, at each time step t, if the prosthesis gets closer to the target, we set the reward r_t = 1; otherwise r_t = 0. This reward signal r_t is then used to update the weight α_t^{x*_t} that associates the current neural activity u_t with the probabilistically chosen state x*_t [29], where β is the learning rate and g(x*_t) is a function that increases the learning efficiency for an unexpected reward, maintains the parameters for an expected reward, and punishes non-rewarded neural-action mappings. After the current prosthetic state is decoded, the neural activity at the current time t is stored in the neural pattern reservoir (C ← C ∪ {u_t}). In this way, the proposed method retains the subject's most recent neural activities and their corresponding weights to the prosthetic states.
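For a one-dimensional state (D = 1), the prior prediction and the information-filter correction in (10) and (11) reduce to the scalar sketch below. The default F, Q_x, Q_u match the values reported in Section III; the function is our illustrative reading, not the authors' code.

```python
def ckrlad_filter_step(x_post, I_post, phi_u, F=0.99, Qx=0.01, Qu=0.1):
    """x_post, I_post: previous posterior state and information; phi_u: CKRL's
    state estimate from the current neural activity."""
    # prior from the state transition (1): x_{t|t-1} = F x_{t-1|t-1}
    x_prior = F * x_post
    I_prior = 1.0 / (F * (1.0 / I_post) * F + Qx)
    # information-filter correction with CKRL's estimate as the observation
    I_new = I_prior + 1.0 / Qu                    # posterior information
    K = (1.0 / I_new) / Qu                        # trust placed in CKRL's output
    x_new = x_prior + K * (phi_u - x_prior)       # posterior intended position
    return x_new, I_new
```

Since 0 < K < 1, the posterior always lies between the state-transition prior and the CKRL estimate, which is how the continuity constraint and the nonlinear observation are merged.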
The entire procedure of the proposed method is summarized by the pseudo code shown in Table I.

III. RESULTS
In this section, we show the experimental results of the proposed algorithm on the brain control data collected from the SD rats. We first present the scenarios where the trajectory generated by the proposed method reaches the success area faster and triggers more starts. Then we show the statistical performance over multiple days of recording. Finally, we examine the neural pattern development and the algorithm adaptation over days.
We collected the data of the BC task over 6 days when the two rats were well-trained (success ratio greater than 80%). The first 3 days and last 3 days are each consecutive, but the two groups of days are one month apart. The statistical results are calculated over about 200 trials on each day. These data allowed us to evaluate the performance and adaptation of the decoding models under long-term changes of neural activities. Both rats had 32 channels of neural recordings. The rats performed the BC task using only the KF decoder, and the audio feedback was generated based on the movement position decoded by KF. After the completion of the BC tasks, we trained the other three methods (ReFIT KF, EKF, CKRLAD) with the collected data. To train CKRLAD, we selected 80% of the recorded neural data and the corresponding actions in BC as the training data, and used a five-fold cross validation to train and test the algorithm. The state transition function F is obtained by the least square method from the intended movement trajectory. The values of F and Q_x in (1) are 0.99 and 0.01 respectively, the same as for KF. The weights in the observation function (CKRL) are obtained from the reward information. For within-trial neural data, CKRLAD sets the reward r_t = 1 when the intended movement goes towards the press zone (gray rectangle in Fig. 3a), and r_t = 0 otherwise. For inter-trial neural data (3 seconds after trial success), the rats need to trigger the start of the next trial, so CKRLAD sets the reward r_t = 1 when the intended movement goes towards the rest zone (blue rectangle in Fig. 3), and r_t = 0 otherwise. After exploration, the parameters of CKRL are β = 0.1, σ = 35, Q_u = 0.1. To train ReFIT KF, we also use 80% of the data to run a five-fold cross validation. ReFIT KF sets all movement velocity directions to point towards the target during the training stage; the parameters of ReFIT KF are then trained by the least square method [22]. To train EKF, we select 80% of the neural-trajectory data on day 1 to train its parameters. For the following days, the parameters of EKF remain the same.
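The reward labeling described above can be sketched as follows. This is an assumed helper of our own: it labels r_t = 1 whenever the decoded intended position moves closer to the center of the current target zone (press zone within trials, rest zone between trials).

```python
def assign_rewards(traj, zone_center):
    """r_t = 1 if step t moves the intended position toward zone_center."""
    return [1 if abs(cur - zone_center) < abs(prev - zone_center) else 0
            for prev, cur in zip(traj[:-1], traj[1:])]
```

A trajectory of T positions yields T − 1 reward labels, one per decoding step, which are then consumed by the CKRL weight update.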

A. Representative Reconstructed Trajectories
In this subsection, we present intended-movement trajectory reconstructions illustrating two scenarios of the BC task. The first scenario highlights the benefit of rapidly reaching the target, while the second emphasizes the ability to initiate more trials. The reconstructed trajectories are shown in Fig. 3. The horizontal axis represents time and the vertical axis (ranging from 0 to 1.5) indicates the intended movement position at each time point. The start and press zones are denoted by blue and gray regions, respectively. The heatmap displays the normalized spike count from the top seven significant channels, ranked by the correlation coefficient between the spike count and the intended movement position. The heatmap's resolution is the same as the decoding time window (100 ms). Trajectories of KF, EKF, ReFIT KF, and the proposed CKRLAD are depicted by black, green, blue, and red curves, respectively.
In Fig. 3(a), all algorithms initiated the trial with the same intended position at the same time, as indicated by the black bar in the start zone. When the rat heard the start cue, during the first half of the trial it tried to move the intended position into the press zone (gray) for the water reward. The neural firing rates increased at the beginning of the trial. However, KF failed to drive the intended position towards the press zone in the first half of the trial; the linear decoding produced only a slight upward movement. Success was not achieved until the rat increased its neural firing rates once more before the success cue (black bar in the press zone). The reconstructed trajectories of EKF and ReFIT KF (green, blue) exhibit little difference from KF (black), and the success times for all algorithms are similar. For EKF, although it established a nonlinear neural-position mapping, the linear Taylor expansion made the mapping similar to that of KF. In the case of ReFIT KF, refitting the velocity did not provide much benefit when the velocities decoded from KF were already aimed in the right direction. For the proposed CKRLAD method (red), the intended position moved to the press zone faster than with the linear methods, capturing the increased neural firing rates. After remaining in the press zone for 0.5 seconds, the trial achieved a pseudo success (dashed red bar). The decoding process of CKRLAD was halted until the success cue from KF emerged, then the decoded intended positions of all algorithms restarted from the same position. The above results indicate that the proposed method can accomplish successful trials more rapidly than the linear methods, thereby reducing response time and enhancing the efficiency of the BC task.
Fig. 3(b) demonstrates the scenario where CKRLAD could trigger more trials and complete them successfully. In this case, the rat must maintain the intended position within the start zone to trigger the next trial. However, the linear KF and ReFIT KF trajectories oscillated out of the start zone (0-0.75), failing to remain in it long enough to initiate the next trial. Although the rat attempted to move the intended position to the success zone by increasing its firing rates (dark green blocks in the heatmap), no reward could be obtained since the trial had not yet started. Both EKF (green) and CKRLAD (red) held the intended position within the start zone for more than three seconds, leading to pseudo trial initiations, as shown by the green and red dashed bars in the start zone, respectively. Following the pseudo start, EKF decoded the intended position towards the press zone, reflecting the rat's reaching intention (dark green blocks in the heatmap), but the intended position could not reach the press zone due to the limited nonlinearity of EKF. On the other hand, CKRLAD not only captured the rat's intent to move to the press zone but also decoded the intended position within the success zone for over 0.5 seconds, achieving a pseudo success (dashed red bar in the press zone). Consequently, CKRLAD demonstrates a greater potential to accurately discern the subject's movement intentions, potentially enabling the completion of more trials within the same time frame and further augmenting the efficiency of the BC task.

B. Statistical Performance Over Days
In this subsection, we show the statistical performance of the algorithms on the two rats over 6 days of the BC task in Fig. 4. The performance of KF, EKF, ReFIT KF, and CKRLAD on the test segments is shown in gray, green, blue, and red bars respectively. Fig. 4(a) shows the total number of successful trials of the different methods on Rat A. The horizontal axis represents different days. We employ a five-fold cross validation to calculate the number of successful trials on each day. By setting the KF success trials at 100, we scale the results of the other methods accordingly. The bar and the whisker represent the mean and standard deviation of the success trial number over the cross validation, respectively.
On days 1 and 3, ReFIT KF (blue bar) and KF (gray bar) show similar performance, suggesting minimal impact from the velocity refit as most velocities already target the goal. From day 4 to 6, ReFIT KF outperforms KF, indicating the necessity of algorithm adaptation due to the changes in the rat's neural activities. Comparing EKF (green bar) to KF (gray bar), EKF consistently surpasses KF from day 2 to 6, suggesting that the non-linear neural-position mapping enhances performance. The proposed CKRLAD method (red bar) consistently outperforms all the other methods. These results indicate that the proposed algorithm effectively tracks non-linear and non-stationary neural signals over time, achieving superior performance.
In Fig. 4(d) for Rat B, EKF demonstrates superior performance compared to the other methods on day 1. However, this comes with an overfitting problem, leading to a decline in EKF's performance, particularly from day 4 to 6. The proposed method exhibits no substantial performance improvement during the first three days, suggesting that the rat's neural activities remain stable and similar to the training data. From day 4 to 6, our method consistently outperforms the others, highlighting the benefits of the adaptation based on the RL framework.
Fig. 4(b) and 4(e) display the response times for the different methods on the two rats. The response time, defined as the interval between the start cue and the success cue, measures the efficiency of the BC task under each decoding algorithm. The bar and whisker represent the mean and standard deviation of response times across trials on each day, respectively. For both rats, the response times of EKF and ReFIT KF are similar (t-test, p > 0.05). EKF and ReFIT KF exhibit shorter mean response times than KF (p < 0.01), highlighting the impact of the non-linear neural-position mapping and the velocity refit. The proposed CKRLAD (red) consistently demonstrates significantly shorter mean response times and the smallest variances (p < 0.001), indicating better and more stable performance due to the combination of non-linearity and adaptation.
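A pairwise comparison of per-trial response times of the kind reported above can be run with a standard two-sample t-test, e.g. SciPy's `ttest_ind` (here with `equal_var=False`, i.e. Welch's variant, since the methods have different variances; the paper does not specify which variant it used). The per-trial response times below are simulated, not the experimental data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical per-trial response times (seconds) for one day, 80 trials each
rt = {
    "KF":     rng.normal(3.0, 0.6, 80),
    "EKF":    rng.normal(2.6, 0.5, 80),
    "CKRLAD": rng.normal(2.2, 0.3, 80),
}
for m in ("EKF", "CKRLAD"):
    t, p = ttest_ind(rt[m], rt["KF"], equal_var=False)  # Welch's t-test
    print(f"{m} vs KF: t={t:.2f}, p={p:.4f}")
```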
Fig. 4(c) and 4(f) illustrate the brain-triggered inter-trial time for the algorithms on both rats. The brain-triggered inter-trial time is defined as the time interval between a success and the next trial's start. A smaller inter-trial time indicates faster next-trial triggering and more trial completions within

C. Adaptation to Time-Variant Neural Patterns
In this subsection, we show how the proposed method co-adapts with the rat's evolving neural patterns during the training of the continuous BC task, and why the linear methods with fixed parameters did not work well in this scenario. Fig. 5 shows the neural patterns after Principal Component Analysis (PCA) on three different days for Rat A. The PCA was applied to the top 25% of channels ranked by the absolute values of the weights in KF. The PCA subspace is generated only from the neural data on Day 1, and the neural data on the following days are projected into this same Day-1 subspace. In each subplot, the x and y axes are the first and second principal components, respectively. The green dots represent the neural patterns within the two-second period before a trial succeeds, reflecting the reaching process. The purple dots represent the neural patterns within the two-second period after a trial succeeds, reflecting the process of triggering the start of the next trial. The dashed curves represent the decision boundaries.
On Day 1, the neural patterns of Start and Success could be linearly separated with an accuracy of 82%. At the same time, CKRLAD finds a non-linear decision boundary with a comparable discrimination accuracy of 81%. During the subsequent two days (Day 2 and 3), which are temporally close to Day 1, the neural patterns are relatively stable; the average accuracy of the linear discrimination is 82%. Concurrently, our proposed non-linear method achieves a better accuracy over the first three days (85% on average). In the later phase of the experiment (Day 4 to 6), about one month after Day 1, the neural patterns have drifted. The fixed linear discrimination method established on Day 1 shows decreased separability (average accuracy of 67% from Day 4 to 6). In contrast, CKRLAD's parameters were adapted according to the reward during the training of the BC task, so the decision boundary co-adapts with the neural patterns (drift of the red curve). Consequently, the non-linear discrimination maintains a good performance (average accuracy of 78% from Day 4 to 6). These results demonstrate that our proposed method can match the neural patterns non-linearly and co-adapt with the subject's changing neural patterns, achieving a more accurate and stable decoding performance over multiple days.
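The projection procedure described above, fitting the PCA subspace on Day 1 only and then projecting later days into the same axes, can be sketched with a plain SVD. The channel count and the synthetic "drift" below are assumptions for illustration, not the recorded data:

```python
import numpy as np

rng = np.random.default_rng(1)
day1 = rng.normal(size=(500, 16))       # Day-1 features (top channels; counts assumed)
day_late = day1 * rng.uniform(0.5, 1.5, 16) + 0.8   # hypothetical drifted later-day data

# Fit the 2-D PCA subspace on Day 1 only
mu = day1.mean(axis=0)
_, _, Vt = np.linalg.svd(day1 - mu, full_matrices=False)
W = Vt[:2].T                            # principal axes derived from Day 1 alone

z1 = (day1 - mu) @ W                    # Day-1 patterns in the subspace
z2 = (day_late - mu) @ W                # later days projected into the SAME subspace
print(z1.shape, z2.shape)
```

Keeping the axes fixed to Day 1 is what makes the drift of later-day clusters in Fig. 5 directly visible rather than being absorbed by a refit.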
IV. DISCUSSION AND CONCLUSION
BMI aims to help disabled people restore their motor function by interpreting their movement intention from neural signals. The brain control process requires the subject to adapt the neural activity in the closed-loop control. It is therefore important to design a decoder that interprets the time-variant functional mapping between neural activity and continuous trajectory, helping the subject accomplish the BC task with stable performance. In this work, we proposed a Clustering-based Kernel Reinforcement Learning-assisted Adaptive Decoder (CKRLAD) to facilitate stable and continuous brain control tasks in BMI. CKRLAD utilizes a state transition function to ensure the continuity of the prosthetic state and incorporates the subject's movement intention into the state prediction. The movement intention is translated non-linearly by clustering-based kernel reinforcement learning, which is more accurate and efficient than the linear methods. During the multi-day BC process, CKRLAD co-adapts with the non-stationary neural signals through a reward signal, which maintains stable performance over days. Our proposed method outperforms the existing decoding methods (KF, EKF, ReFIT KF) in terms of response time, inter-trial time, and total number of successful trials over six days of BC tasks on two SD rats. The results are tested on diverse data sets with sufficient segments, thereby validating the applicability of the proposed method in brain control tasks.
The performance improvement of our proposed method over KF and ReFIT KF highlights the advantage of incorporating non-linearity into the decoding process, which allows a more accurate interpretation of the subject's movement intention. At the same time, the improvement over the Extended KF shows the benefit of model adaptation, which enables the decoder to track changes in neural signals during the BC task. By combining the merits of both non-linearity and adaptation, our method improves decoding efficiency, accuracy, and stability. These advantages demonstrate that the proposed method is a promising tool for the clinical brain control of intracortical brain-machine interfaces.
While our study demonstrates promising results, one limitation should be considered. The proposed method is evaluated in an emulated scenario using neural signals collected under KF control. In the pseudo trials generated by the algorithm, the dynamics of the neural signals did not change with the pseudo feedback. Even though the proposed algorithm shows the potential to capture the rat's movement intention, as shown in Fig. 3, the actual response of the subject remains unclear. In future work, the proposed model should be tested in a real-time BC task with online feedback and more subjects. By providing real-time feedback, subjects can adapt their neural activity based on the decoded output, enabling a more accurate assessment of the algorithm's efficacy in practical BMI applications.
Another promising future research direction could be incorporating reward information directly from the subject's brain. For example, in a free robotic arm control system without predefined targets, the reward signals could be generated from the subject's own evaluation of the robotic arm's action. By detecting reward information directly from neural activity patterns [27], [40], [41], the system can autonomously learn to associate different neural patterns with successful or unsuccessful actions. This approach will allow for a more natural and adaptive learning process for the subjects in BC tasks.

Fig. 3. Intended Movement Trajectories Comparison: Kalman Filter (black), EKF (green), ReFIT KF (blue), and CKRLAD (red). (a) Within-Trial Scenario: All algorithms initiate trials simultaneously with an identical position (black bar). (b) Inter-Trial Scenario: KF and ReFIT KF do not trigger a trial start; EKF triggers a trial but fails; CKRLAD triggers a new trial and succeeds.

Fig. 4. Statistical performance metrics over days. (a) Total number of successful trials. (b) Response time. (c) Brain-triggered inter-trial interval. (d)-(f) The corresponding results for Rat B.

a given time frame. The average inter-trial intervals for KF and ReFIT KF are similar (p > 0.05), indicating that the intended velocity frequently points toward the correct target and that the velocity refit has little impact. EKF's inter-trial time is shorter (p < 0.05) than that of the linear methods (KF and ReFIT KF), suggesting that non-linear methods trigger the next trial more quickly. CKRLAD exhibits shorter inter-trial times than EKF (p < 0.05) and a smaller variance, indicating the contribution of model adaptation. These results demonstrate that CKRLAD effectively tracks neural activities during inter-trial intervals, providing a more efficient decoding performance.
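Given the definition above (time from a success cue to the next trial's start cue), the brain-triggered inter-trial times could be computed from an event log as in this minimal sketch; the timestamps below are hypothetical:

```python
import numpy as np

def brain_triggered_iti(success_t, start_t):
    """Inter-trial time: each trial start minus the most recent preceding
    success cue. success_t, start_t: sorted event timestamps in seconds."""
    itis = []
    for s in start_t:
        prev = [t for t in success_t if t < s]
        if prev:                      # skip starts with no preceding success
            itis.append(s - prev[-1])
    return np.array(itis)

# Hypothetical event log: the first start (0.0 s) has no preceding success
print(brain_triggered_iti([5.0, 12.4], [0.0, 6.2, 14.0]))
```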

Fig. 5. The neural patterns after Principal Component Analysis (PCA) across different training days and the decision boundaries.

TABLE I
THE PSEUDO CODE OF THE PROPOSED CLUSTERING-BASED KERNEL REINFORCEMENT LEARNING-ASSISTED ADAPTIVE DECODER

The average success trial numbers on each day are 204.8 and 212.5 for Rat A and B, respectively. The average trial lengths of Rat A and B are 2.7 and 2.2 seconds, respectively.