A Biologically Constrained Cerebellar Model With Reinforcement Learning for Robotic Limb Control

The cerebellum is known to be critical for accurate adaptive control and motor learning. It has long been recognized that the cerebellum acts as a supervised learning machine. However, recent evidence shows that the cerebellum is also integral to reinforcement learning. This paper proposes a biologically plausible cerebellar model with reinforcement learning, based on the cerebellar neural circuitry, that eliminates the need for explicit teacher signals. The learning capacity of cerebellar reinforcement learning is first demonstrated by constructing a simulated cerebellar neural network agent and a detailed model of the human arm and muscle system in the Emergent virtual environment. Next, the cerebellar model is incorporated in both a simulated arm and a Geomagic Touch device to further verify its effectiveness in reaching tasks. Results from these experiments indicate that the cerebellar simulation is capable of driving the "arm plant" to the target positions accurately. Moreover, by examining the effect of the number of basic units, we find that the results are consistent with previous findings that the central nervous system may recruit muscle synergies to realize motor control. The study described here prompts several hypotheses about the relationship between motor control and learning and may be useful in the development of general-purpose motor learning systems for machines.


I. INTRODUCTION
As an important part of the central nervous system, the cerebellum plays a key role in adaptive movement control [1]. It has often been likened to a neural machine or computer, with a precise geometrical array of intrinsic cell types that allows for the integration and organization of movement-related information through both of its afferent systems. Therefore, intensive research on the neurophysiology and modeling of the cerebellum has been carried out to pave the way for artificially intelligent control systems [2].
Since Eccles et al. [3] proposed a comprehensive theory of the internal neuron types, connections, and function of the cerebellar cortex, the development of functional models of the cerebellum has entered a new stage. Marr [4] and Albus [5] recognized that the cerebellum plays an essential role in error-based supervised learning (SL) and developed a non-fully connected, perceptron-like associative memory network called the cerebellar model articulation controller (CMAC). Due to its rapid learning convergence and simple structure, the CMAC has been intensively studied [6], and many variants, such as the fuzzy CMAC (FCMAC) [7]-[9] and the recurrent interval type-2 petri CMAC (RIT2PC) [10], have been proposed to address nonlinear problems. Fujita [11] expanded their models by incorporating a dynamic viewpoint and proposed an adaptive-filter model of the cerebellum. Ito [12] proposed a comprehensive functional model in which a cerebellar microcomplex, composed of a cortical microzone and a small cerebellar cell group, acts as an adaptive controller. Alternative theories of cerebellar function have also been proposed, such as the internal model [13], the adjustable pattern generator [14], and the tensor geometrization theory [15].
The associate editor coordinating the review of this manuscript and approving it for publication was Nishant Unnikrishnan.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Along this line, various cerebellar models have been developed in the context of robotic control. The key idea in all of these models is that the climbing fiber inputs onto the Purkinje cells provide an error signal that drives learning in the Purkinje cells, such that the resulting modified output of the Purkinje cells reduces or eliminates the error in the future. These models assume that the cerebellum is responsible for supervised learning. However, recent studies have provided evidence against this prominent idea. Anatomical studies indicate that the cerebellum receives sensory information about stimuli impinging on the organism as well as information about the organism's effect. Swain et al. [16] posited that this organization makes the cerebellum a mediator of reinforcement learning (RL). They reported evidence from a variety of learning paradigms showing that the cerebellum is integral to reinforcement learning. Therrien et al. [17] used a 15° gradual visuomotor rotation task to examine the learning and retention of a reaching skill under error-based and reinforcement paradigms. The results showed that reinforcement schedules produced better learning and retention than the error-based counterpart. However, there was a significant difference in learning between the cerebellar group and the control group, and cerebellar patients also varied in their learning ability. Using a mechanistic model, they predicted that cerebellar damage indirectly impairs RL capability by increasing motor noise. Recently, Yamazaki et al. [18] extended the Marr-Albus-Ito model and successfully implemented the cerebellar circuit with an RL algorithm. The major limitation of this approach is that there is no direct experimental evidence or rigorous mathematical justification for the feedback inhibition term, which is assumed to be fed by the molecular layer interneurons (MLIs).
However, these cerebellar models oversimplify the structure of the cerebellum. To develop neurorobotics, it is necessary to adopt new discoveries from neuroscience. Moreover, the significance of a cerebellum model lies not only in achieving better control of robot movement, but also in simulating cerebellar damage and establishing the mapping between that damage and motor dysfunction, so as to guide the rehabilitation training of patients with cerebellar disease.
As a result, the modeling of the cerebellum should also take the interpretability of the control process into consideration, e.g., the efficacy of neurotransmission among different layers and the learning mechanisms. Therefore, this paper proposes a biologically constrained cerebellum model with RL, inspired by results from biology and physiology, which may provide a computational basis for cerebellar learning, memory, and movement coordination.
The rest of the paper is organized as follows. Section II presents the cerebellar RL model architecture after a brief introduction of the cerebellar cortex as an RL machine. Section III describes the modeling method, which is followed by the simulation experiments in Section IV. Section V gives the system test results. Finally, the conclusion and future work are given in Section VI.

II. CEREBELLAR REINFORCEMENT LEARNING MODEL ARCHITECTURE

A. CEREBELLAR MODEL ARCHITECTURE
The cerebellum has a very well-defined anatomy, with the same basic circuit replicated throughout. Each basic circuit has a homogeneous structure with two inputs and one output, as shown in Fig. 1. The principal cell type of the cerebellum is the Purkinje cell (PC), whose nearly two-dimensional dendritic trees are arrayed perpendicular to the long axis of the lobule. Each set of PCs receives a convergent input from an array of parallel fibers (PFs), which are the axons of the granule cells (GCs), and a private climbing fiber (CF) input originating from the inferior olive (IO) in the brainstem. GCs encode the sensory information received from the mossy fiber (MF) axons. The PC-PF synapses are believed to be related to cerebellar motor learning because of their long-term depression (LTD). The CF is also important for the learning process of the cerebellum: although a given PC receives input from only one climbing fiber, that fiber is considered a very potent exciter of the cell. It is through this strong excitation of PCs in specific lobules that particular skeletal motor movements are selected. The output of the PC is inhibitory. As the sole output of the cerebellar cortex, PCs inhibit the deep cerebellar nuclei (DCN), which form a link in a regeneratively active cerebro-cerebellar loop to form the motor command. The interneurons that connect to PCs, including stellate cells (SCs) and basket cells (BCs), are also inhibitory. Golgi cells (GgCs), which receive excitatory input from mossy fibers and synapse on granule cells, are inhibitory as well.
The computational power of the cerebellar cortex as an SL machine has been extensively examined. In this kind of model, the CFs need to provide explicit teacher signals. However, the work by Hausknecht et al. [19] showed that a cerebellum using an SL paradigm fails to perform well on complex tasks in which explicit teacher signals are not provided. A number of previous studies also showed that CF activities contain information about directional errors in reaching, but they do not tell an agent explicitly how to correct the errors in terms of joint torques or muscle contractions [18], [20]. In this sense, the CF is believed to convey evaluative feedback, and the cerebellar cortex is believed to act as an RL machine.

B. CEREBELLAR MODEL IN A RL FRAMEWORK
The essential difference between RL and SL is that explicit "teacher" or "error" information is not required in RL. RL can be defined as the ability to map situations into actions that interact with the environment to maximize a reward [21]. The so-called reward or punishment is the evaluation of the actions taken by the agent. An RL system can be subdivided into five basic elements, namely, an agent, a policy, a reward or punishment function, a value function, and an environment model.
The agent takes an action through a certain policy, and a reward or punishment guiding its behavior is then obtained through its interaction with the environment. The rewards of a series of actions constitute a value function. There are generally two basic tasks in RL algorithms: policy evaluation and policy improvement [22]. The former calculates the value function according to the current policy, and the latter evaluates the obtained value function in order to update the current policy. The goal of RL is to find an optimal policy for a specific problem, so that the value function obtained under the policy is maximized. According to the basic framework of RL, we designed a robotic limb control system with the cerebellum model as the agent, as shown in Fig. 2.
Fig. 2. Model architecture of a robotic limb control system based on the cerebellar model. The cerebellar module produces signals sent to the spinal/muscle system. These signals are then transformed by the spinal/muscle system into torque commands to drive the arm plant. Movement errors are detected by the inferior olive, which updates the cerebellar control policy. The delay modules τ_1 and τ_2 simulate conduction delays.
The environment mainly includes the spinal/muscle system, the robotic arm plant, and the inferior olive. The agent, including the basic units and the state encoder, perceives the state of the environment s_t and receives the reward signal r_t at time t, and then updates the policy π_t to select the action a_t. Afterwards, the environment changes from s_t to a new state s_{t+1} and evaluates the action a_t to generate a new reward r_{t+1}, which is then sent to the agent again. This process repeats until the goal of maximizing the reward signals is achieved.
According to Fig. 2, the cerebellar module consists of the state encoder and the basic unit array. The state encoder in the cerebellar module represents the GCs, which provide the cerebellum with a sparse expansive encoding of the inputs from the spinal cord and motor cortex. Considering the homogeneous structure of the cerebellum, we use a basic unit array to model its structure. In this model, every basic unit receives the same inputs, and the outputs of all basic units are assembled to generate the output of the model. Activation of a basic unit leads to the movement of the arm plant in a direction that is specific to that basic unit, and the magnitude of a basic unit's activity determines the velocity of that movement. Simultaneous activation of selected basic units determines the arm trajectory as the superposition of these movements. Learning processes adjust the subsets of basic units that are selected, as well as the characteristics of their activity, in order to achieve the desired movements.
The basic units generate motor commands that are fed to the spinal/muscle system to bring the arm to the specified position. The spinal/muscle system transforms muscle space signals into joint torques.

III. DERIVATION OF THE CEREBELLAR REINFORCEMENT LEARNING MODEL

A. BASIC UNITS ARRAY
Each basic unit has the same structure, which contains an artificial neural network (ANN) that receives inputs at each time step and produces an output of nucleus cell firings. Fig. 3 shows the structure of a single basic unit in more detail. In a single basic unit, each brown sphere represents one PC; the vertical line above it denotes the PC dendrites, while the vertical line below it stands for the PC axon. The blue horizontal lines connected with the PC dendrites are PFs. These fibers synapse upon the PCs of the model, irrespective of the basic unit in which they are located. The yellow and pink lines represent the CF and SC, respectively. All the PCs of a single basic unit share the same CF and SC inputs. The horizontal purple line connected to the PC axon represents the BC. Synapses are formed between the different neurons. The inputs of the basic unit first determine the status of the PFs. The activated PFs then transmit the information of the corresponding PC-PF synapses to the PCs. The state of the PCs is determined through the modulation of the intermediate neurons such as the BC, SC, and CF. Finally, the output of each basic unit is determined by the states of all of its PCs as well as by its feedback loop.
Different cells feature inhibitory as well as excitatory connections and allow the network to exhibit dynamic temporal behavior. The model parameters can be divided into two classes, one representing the state of the neurons and the other denoting the synapse information. Different neurons communicate with each other through synapses. In turn, synapses produce different effects on neurons according to their activities, which are expressed by the parameters listed in Table 1. Suppose each basic unit consists of m PFs and n PCs (see also Fig. 3); the states of the PFs and PCs at time t are x_1(t), ..., x_m(t) and y_1(t), ..., y_n(t), respectively. The ith PC in a basic unit at time t receives the same input signals via multiple excitatory PFs x_1(t), ..., x_m(t), a single inhibitory BC input b_i(t), a single inhibitory SC input s(t), and a single CF input c(t). These inputs are all binary, with 1 and 0 indicating activity and inactivity, respectively.
The learned parameter of the network is the synaptic weight w_ij(t) between the ith PC and the jth PF. Because of synaptic plasticity and the property of long-term depression (LTD), we set w_ij(t) to be a continuous adjustable variable. ϑ_i, υ_i, and ς_i represent the synapses formed between the ith PC and the BC, CF, and SC, respectively. These weights are set to fixed values according to whether they inhibit or potentiate the activities of the PCs. The model thus mainly involves two types of parameters: the states and the weights.
Assuming that there are d basic units in the cerebellar module, the membrane potential q_ki(t) of the ith PC in the kth (k = 1, 2, ..., d) unit is determined by the total input to the PC at time t, that is, by the PF, BC, SC, and CF inputs acting through their respective synapses. The basic unit number is denoted by the subscript k in the notation; for example, w_kij(t) represents the weight of the synapse formed by the ith PC and the jth PF in the kth unit.
The state of the ith PC in the kth basic unit, denoted by y_ki(t), is given by a two-threshold rule, where φ and η are respectively the on-threshold and off-threshold values of the PC, and φ > η. The PCs are relatively insensitive to changes in their input except near the on-threshold φ, where transitions from the off-state to the on-state occur, and the off-threshold η, where transitions from the on-state to the off-state occur. If the PC is in the off-state, the net input q_ki(t) must exceed φ to drive it to the on-state. On the other hand, if the PC is in the on-state, q_ki(t) must drop below η to force the cell to switch to the off-state. The effect of the PC-PF synapse w_kij(t) on the PC membrane potential changes with time, and the trend of w_kij(t) is inclined to make the cerebellar module stop the command output. The update formula of w_kij(t) depends on a positive constant κ. The cerebellum produces commands to control arm movement through the spinal/muscle system. If the error is within the allowed range, the commands that the cerebellum produces are appropriate. Otherwise, the relevant parameters in the model need to be adjusted through the reinforcement learning model in the inferior olive.
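The two-threshold switching of a PC state described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `update_pc_state` and the numerical values of φ, η, and the input sequence are hypothetical, and the net input q is taken as given rather than computed from the synaptic weights.

```python
def update_pc_state(prev_state: int, q: float, phi: float, eta: float) -> int:
    """Two-threshold (hysteresis) update of a single Purkinje cell state.

    prev_state: 1 for the on-state, 0 for the off-state.
    q: net input (membrane potential) of the cell at the current step.
    phi, eta: on-threshold and off-threshold, with phi > eta.
    """
    if phi <= eta:
        raise ValueError("the on-threshold phi must exceed the off-threshold eta")
    if prev_state == 0:
        # Off-state: the cell switches on only if the input exceeds phi.
        return 1 if q > phi else 0
    # On-state: the cell switches off only if the input drops below eta.
    return 0 if q < eta else 1


# Hypothetical thresholds and inputs, chosen only to show the hysteresis.
phi, eta = 0.6, 0.3
state = 0
trace = []
for q in [0.2, 0.5, 0.7, 0.5, 0.4, 0.2, 0.5]:
    state = update_pc_state(state, q, phi, eta)
    trace.append(state)
print(trace)
```

Note that the input 0.5 leaves the cell off before it has been driven above φ but leaves it on afterwards, which is the insensitivity between the two thresholds that the text describes.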
A basic unit's motor command is generated through the activities of its PCs, which inhibit and modulate the buildup of activity in the feedback loop. When the feedback loop in a unit is in its off-state, the output of the unit is zero. The transition to the on-state is caused by a trigger signal, and the output of a unit then corresponds to the proportion of its PCs that are in the off-state. In other words, once the feedback loop is triggered, the output of the basic unit depends on the states of the PCs. The opposite transition, however, depends only on the PCs. Due to the inhibitory effect of the PCs on the deep cerebellar nucleus, the feedback loop is turned off when a large fraction of the PCs in the basic unit are excited. At this time, the basic unit ceases to produce an output.
The output of the kth basic unit, O_k(t), is defined from the state of its feedback loop and the proportion of its PCs in the off-state, with ε as a constant adjustment factor.
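The behavior of a basic unit's output described above can be sketched in a few lines. This is only an illustration under assumptions: ε is treated here as a multiplicative scale on the off-state proportion, and the cutoff `off_fraction` at which the feedback loop shuts down is a hypothetical parameter, since the text only says "a large fraction".

```python
def unit_output(pc_states, loop_on, trigger=False, epsilon=1.0, off_fraction=0.8):
    """Return (output, loop_on) for one basic unit at one time step.

    pc_states: list of binary PC states (1 = on-state, 0 = off-state).
    loop_on: whether the feedback loop was on at the previous step.
    trigger: external trigger signal that switches the loop on.
    epsilon: adjustment factor (assumed to scale the output).
    off_fraction: assumed fraction of excited PCs that shuts the loop down.
    """
    if not pc_states:
        raise ValueError("a basic unit needs at least one PC")
    frac_on = sum(pc_states) / len(pc_states)
    if trigger:
        loop_on = True
    if frac_on >= off_fraction:
        # Strong PC inhibition of the nucleus turns the feedback loop off.
        loop_on = False
    # With the loop on, the output follows the proportion of PCs in the off-state.
    output = epsilon * (1.0 - frac_on) if loop_on else 0.0
    return output, loop_on


print(unit_output([0, 0, 1, 1], loop_on=False, trigger=True, epsilon=2.0))
print(unit_output([1, 1, 1, 1], loop_on=True))
```

In this sketch a unit that has not been triggered stays silent regardless of its PC states, which mirrors the statement that the off-to-on transition needs a trigger while the on-to-off transition depends only on the PCs.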

B. STATE ENCODER
By receiving inputs from multiple classes of MFs and local inhibitory interneurons, the GCs are thought to provide a sparse expansive encoding of the incoming state information.
In this model, the sparse expansive encoding is realized by the state encoder. This coding scheme makes use of multiple tiles over the state space. A single tiling partitions the arm end position space into discrete but non-overlapping tiles.
Each PF corresponds to a single tile. When the system state falls into a particular tile, the corresponding PF x_j is given an activation level of 1, and all others are set to 0. In this way, the values of x_1, ..., x_m are determined. This corresponds to the quantization layer in a CMAC network, in which each input activates certain fields. If we use M(t) = [x(t), y(t), z(t)]^T to denote a 3-dimensional input state, the output of the state encoder can be formulated using a mapping f as [x_1(t), ..., x_m(t)]^T = f(M(t)).
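A single-tiling version of this encoding can be sketched as below. The workspace bounds, the number of bins per axis, and the function name `encode_state` are assumptions made for illustration; the source does not give the tile layout.

```python
def encode_state(position, lower, upper, bins):
    """Map a 3-D end position to a one-hot PF activation vector (single tiling).

    position: (x, y, z) of the arm end point.
    lower, upper: assumed workspace bounds per axis.
    bins: assumed number of tiles per axis; the number of PFs is their product.
    """
    index = 0
    for value, lo, hi, n in zip(position, lower, upper, bins):
        # Clamp to the workspace, then locate the tile along this axis.
        clamped = min(max(value, lo), hi)
        cell = min(int((clamped - lo) / (hi - lo) * n), n - 1)
        index = index * n + cell
    total = 1
    for n in bins:
        total *= n
    x = [0] * total
    x[index] = 1  # exactly one PF is active for a given state
    return x


# Hypothetical workspace of [-50, 50] on each axis with 4 tiles per axis.
x = encode_state((30, 40, 20), (-50, -50, -50), (50, 50, 50), (4, 4, 4))
print(len(x), x.index(1))
```

Because the tiles do not overlap, exactly one entry of the returned vector is 1, matching the binary PF states used by the basic units.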

C. DELAY MODULE
There are different nerve conduction delays when nerve impulses are transmitted away from the neuron along the nerve fiber or axon. In this model, there is a delay τ 1 from the spinal cord system to the arm plant and another delay τ 2 from the arm plant to the IO.
To account for the conduction delay, an eligibility trace is used, which simulates a short-term memory process [23]. The concept of the eligibility trace was originally proposed by Klopf from the perspective of cognitive science. Recently, Yagishita [24], He [25] and Brzosko [26] have confirmed the existence of the eligibility trace in the striatum, cortex, and hippocampus. When a state is visited, the eligibility trace records the credit of the state, and the value of the credit decreases over time. When the reward value is returned, if the eligibility trace corresponding to a state is not zero, the state is given a certain degree of credit according to its eligibility trace.
In our model, a replacing trace is used, where ζ (0 ≤ ζ ≤ 1) is the discount-rate parameter and λ (0 ≤ λ ≤ 1) is the decay parameter. s_t represents the actual state at time t, and e_s(t) and e_s(t − 1) are the traces for state s at times t and t − 1; e_s(t) is calculated from e_s(t − 1).
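A minimal sketch of a replacing trace follows, assuming the standard form in which the trace of the visited state is reset to 1 and every other trace decays by the product ζλ. The function name `update_traces`, the state labels, and the parameter values are illustrative and not taken from the source.

```python
def update_traces(traces, current_state, zeta, lam):
    """Replacing-trace update: decay all traces, then reset the visited state.

    traces: dict mapping each previously visited state to its trace e_s(t - 1).
    current_state: the actual state s_t at time t.
    zeta, lam: discount-rate and decay parameters, both in [0, 1].
    """
    if not (0.0 <= zeta <= 1.0 and 0.0 <= lam <= 1.0):
        raise ValueError("zeta and lam must lie in [0, 1]")
    # States other than s_t keep a decayed share of their earlier credit.
    updated = {s: zeta * lam * e for s, e in traces.items()}
    # The visited state has its trace replaced by 1 rather than accumulated.
    updated[current_state] = 1.0
    return updated


traces = {}
for s in ["A", "B", "C"]:
    traces = update_traces(traces, s, zeta=0.9, lam=0.5)
print(traces)
```

With ζ = 0.9 and λ = 0.5, a state visited two steps ago retains a credit of about 0.2, so a delayed reward can still be assigned to it in proportion to that trace.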

D. SPINAL/MUSCLE SYSTEM
To carry out a movement, it is necessary to identify what each joint actuator should do at each time instant under a certain set of conditions. The spinal/muscle system model is used to convert the motor command from the cerebellum into arm movements. Each basic unit acts alone to generate a specific movement of the arm in the whole workspace. The arm plant trajectory is therefore the result of the simultaneous execution of movements generated by multiple basic units. Let M(t) and ΔM(t) denote the position of the end-effector at time t and its adjustment from t to t + 1. As a simplifying assumption, basic unit activity is taken to have an instantaneous and linear effect on changes in joint angles. Therefore, ΔM(t) is obtained by applying a weighting matrix F to the outputs of the basic units, where F has 3 rows and d columns and summarizes the influence of the d basic units on the end-effector. It represents the mapping between the motor command space and the arm's Cartesian position space. Each column is a weight vector that describes how the corresponding basic unit influences the end-effector.
The basic unit outputs act on the end-effector after a constant time delay τ.
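The linear mapping from basic unit outputs to end-effector displacement can be sketched as follows, assuming the update M(t + 1) = M(t) + F O(t − τ) with τ counted in whole time steps. The six-column F below (one unit per signed Cartesian axis, each column with 2-norm 0.2 mm) follows the unit directions reported later for the d = 6 simulation; the helper names and the command sequence are hypothetical.

```python
from collections import deque


def step_end_effector(position, F, outputs):
    """Advance the end-effector by Delta M = F @ O (F is 3 x d, O has length d)."""
    delta = [sum(F[r][k] * outputs[k] for k in range(len(outputs))) for r in range(3)]
    return [p + dm for p, dm in zip(position, delta)]


def simulate(start, F, commands, tau):
    """Apply each command tau steps after it is issued and return the path."""
    d = len(F[0])
    pending = deque([[0.0] * d for _ in range(tau)])  # nothing arrives before tau
    position = list(start)
    path = [position]
    for command in commands:
        pending.append(command)
        delayed = pending.popleft()  # the command issued tau steps earlier
        position = step_end_effector(position, F, delayed)
        path.append(position)
    return path


# Columns: #1 +X, #2 +Y, #3 -X, #4 -Y, #5 +Z, #6 -Z (column norm 0.2 mm).
F = [
    [0.2, 0.0, -0.2, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, -0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.2, -0.2],
]
# Units #3, #4 and #6 fully active for five steps, with a two-step delay.
commands = [[0.0, 0.0, 1.0, 1.0, 0.0, 1.0]] * 5
path = simulate([30.0, 40.0, 20.0], F, commands, tau=2)
print(path[-1])
```

The arm stays at its start position for the first τ steps and then moves by the superposition of the three active columns at each step.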

E. ROBOTIC ARM PLANT
The robotic arm platform is a six degree-of-freedom (DOF) haptic device, the Geomagic Touch (3D Systems Inc., USA) (Fig. 4). The first three joints are equipped with driving electrodes and sensors, while the last three joints only have sensing devices. Data communication is realized through the Ethernet adapter/USB port, and the device allows real-time programming through the class ToolKit 3D Touch, which works with Visual Studio 2017 [27]. The problem of solving for the relationship between the joint coordinates θ(t) = [θ_1(t), θ_2(t), θ_3(t)]^T and the position of the end-effector M(t) can be summarized as the inverse kinematics problem [28], [29].
The inverse position kinematics model (IPKM) is used to obtain the joint coordinates θ(t) as a function of the operational position M(t), with the link parameters L_1 = L_2 = 135 mm, L_3 = 25 mm, and L_4 = 170 mm. In the driving-force model of the Geomagic Touch, F represents the force vector generated by the device, M and V represent the displacement and velocity vectors of the arm, respectively, S is the stiffness coefficient, and C is the damping coefficient.

F. INFERIOR OLIVARY MODULE
The CF originating from the inferior olive forms an excitatory synapse with the PC. It provides feedback information for the cerebellar model to correct the arm movement. The feedback information is used to adjust the memory information, i.e., the weight of the PC-PF synapse. Many experimental results have shown that the inferior olive-climbing fiber system projecting to the cerebellum plays a critical role in basic associative learning and memory [30]-[33]. For example, anatomical studies on multi-joint limb movements in monkeys by Thach [34] showed that it might provide the "reinforcing" input to the cerebellum [35]. As a result, the function of the inferior olive module is implemented by means of RL as follows.
(1) The state s_t of the cerebellar model is determined based on the arm position. The initial value x_j(0) of all PFs in each basic unit is set to 0. The behavior selection policy under the initial conditions, π_0, is uniform and is expressed in terms of m, the total number of PFs; that is, the initial probability of choosing each behavior a is the same.
(2) Then, if the action a_t is selected at time t, the next state s_{t+1} and the next reward r_{t+1} are obtained. The value function V(s_t) is updated using a constant coefficient α and the discount factor γ. The reward r_{t+1} is determined by D, which is the distance from the current arm position to the target position compared with that of the previous moment. If it is smaller, the reward r_{t+1} is set to 0; otherwise it is set to −1. The value function V(s_t) at time t and the estimated value function V(s_{t+1}) at time t + 1 are defined through E_{π_t(a)}{·}, the expected value of the return under the policy π_t when behavior a is selected.
(3) Next, the temporal difference (TD) error δ_t is calculated using (19), and the behavior selection probability and policy are updated, where χ is a step parameter, Z represents the number of optional actions, and p(s_t, a_t) denotes the propensity to choose behavior a under the state s at time t.
(4) Afterwards, the evaluative information c(t) that the CF transfers is calculated based on δ_t.
(5) Finally, the weight w_kij(t) is adjusted according to c(t) via (23), where µ is a positive constant.
The above five steps are repeated until the outputs O(t) of all the basic units are 0, which marks the end of the cerebellar command generation. After that, D is calculated and compared with a threshold. If D is larger, the cerebellum continues producing commands to control the arm according to (1)-(5); otherwise, the whole control process terminates.
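Steps (2) and (3) can be illustrated with a small actor-critic sketch. The update forms used here, a TD(0) value update and a softmax policy over propensities, are standard choices assumed for illustration; they are not necessarily the exact equations of the paper. The reward rule assumes that "smaller" means the arm moved closer to the target than at the previous moment. All function names and numerical values are hypothetical, and the CF signal c(t) and the weight update of steps (4) and (5) are not modeled.

```python
import math


def softmax_policy(propensities):
    """Selection probabilities over the Z actions from propensities p(s, a)."""
    peak = max(propensities)  # subtract the maximum for numerical stability
    exps = [math.exp(p - peak) for p in propensities]
    total = sum(exps)
    return [e / total for e in exps]


def reward_from_distance(prev_dist, dist):
    """Reward 0 if the arm moved closer to the target, otherwise -1."""
    return 0.0 if dist < prev_dist else -1.0


def td_step(V, p, s, a, r, s_next, alpha, gamma, chi):
    """One TD(0) value update and one propensity update; returns the TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta      # critic: move V(s_t) toward the TD target
    p[s][a] += chi * delta     # actor: raise or lower the chosen propensity
    return delta


# Two states, two actions, uniform initial policy (all propensities equal).
V = {0: 0.0, 1: 0.0}
p = {0: [0.0, 0.0], 1: [0.0, 0.0]}
r = reward_from_distance(prev_dist=10.0, dist=12.0)  # the arm moved away
delta = td_step(V, p, s=0, a=1, r=r, s_next=1, alpha=0.1, gamma=0.9, chi=0.5)
policy = softmax_policy(p[0])
print(delta, V[0], policy)
```

After one penalized step, the propensity of the chosen action drops, so the policy in that state shifts toward the other action; this is the sense in which the evaluative signal guides learning without an explicit teacher.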

IV. SIMULATION RESULTS
The validity and efficiency of the proposed cerebellar model were tested with several simulation experiments.
To give an intuitive understanding of the control process, Emergent 8.5.6 (University of Colorado, USA), a comprehensive neural network simulator, is first used to illustrate how the cerebellar neural network controls a biophysically realistic arm [36]. Then, the proposed cerebellar model is applied to a two-link arm tracking control task to compare its performance with that of the CMAC approach. Finally, a navigation task is included in this section to demonstrate the performance of the cerebellar model under disturbance.

A. EMERGENT SIMULATION EXPERIMENT
To provide a realistic test of our model, Emergent is used to construct a detailed model of the cerebellum and the human arm. The arm has 4 degrees of freedom (3 for the shoulder joint and 1 for the elbow joint) and includes 12 major muscle groups that control arm movements, as shown in Fig. 5. The red, cyan, and green balls represent the shoulder joint, the muscle insertion points, and the desired position, respectively. The blue parts stand for the hand, forearm, upper arm, and torso. The muscle groups are attached to different positions of the shoulder joint, upper arm, forearm, and hand. The arm can be seen reaching toward the green target. On the first reach, the hand overshoots the target slightly (Fig. 5(a)) and then comes back down closer to the target (Fig. 5(b)). Fig. 6 shows the activation of neurons when the arm exceeds or reaches the target position, where the colored squares represent neurons. The neural network implemented in our model consists of three parts: the input layers, the hidden layers, and the output layer. The input layers receive sensory input, typically via the desired length (TL), current length (L), and velocity (V) of each muscle, and the target position (TP), current position (HP), and velocity (HV) of the hand. The hidden layers are so called because they neither directly receive sensory input nor directly drive motor output. They play an important role in the classification and processing of information from the input layers, and mainly contain MFs, GCs, GOs, PFs, IOs, BCs, and SCs. The output layer of PCs has neurons that synapse directly onto muscle control areas and is capable of causing physical movement. The basic cell is a neuron-like unit whose output is usually a time-continuous value ranging from 0 to 1. The higher the activation of a neuron is, the brighter its color. If the output of a neuron is 0, it is not activated and is colored white. On the other hand, an output of 1 represents a highly activated neuron, which is colored yellow.
From Fig. 6(a), we can see that as the hand overshoots the target, a subset of the units in the IO layer becomes activated, enabled by the blue hand position. After correction, the arm is brought to the target position, as shown in Fig. 6(b). At this time, the activations of each muscle in the L and TL layers are consistent. The same applies to the HP and TP layers. However, when the arm exceeds the target, these layers differ considerably from the target layers, as shown in Fig. 6(a).

B. TWO LINK ARM TRACKING CONTROL SIMULATION
A simulated two-link planar arm is used in this simulation experiment to compare the performance of our proposed bionic cerebellum model (BCM) and the conventional CMAC in the tracking task. The weight and muscle model of the two-link arm are neglected to simplify the calculation in the simulation. To obtain a reliable conclusion, 30 sets of repeated tracking control experiments were conducted. The two joint trajectories of the two-link arm and the tracking time of the two control methods were recorded. The methods were developed in MATLAB 2018a (MathWorks, USA) on a computer with a 1.6 GHz Intel Core i5-8250U processor and 4 GB of RAM.
In the simulation, the objective is to have the arm joints follow a particular trajectory. The tracking reference for the joints of the two-link arm was set to a prescribed sinusoidal trajectory. Simulation results of the joint angle tracking responses are shown in Fig. 7. The solid lines are the reference tracking trajectories of the arm joint angles. The dashed lines and the dotted lines are the joint angles produced by our BCM model and the traditional CMAC model, respectively. From this figure, we can see that both control methods are stable, but the BCM achieves better tracking accuracy than the CMAC by reducing the tracking errors. Moreover, the average running time decreases from 25.06 s to 13.64 s when the CMAC model is replaced with our BCM model. We can reasonably conclude that our BCM has a faster response speed and higher tracking accuracy.

C. ROBOTIC ARM NAVIGATION CONTROL SIMULATION
To further verify the control effect of the cerebellar model, a simulation experiment was designed to control the simulated two-link arm to move from an initial position to a target position. The control performance with different basic units (d = 6, d = 10) were measured to identify the effect of the number of basic unit. The 2-norm of each column of F was set to be 0.2 mm. First, the initial position of the endpoint (30,40,20) and target position (−30, −40, −20) are set to be same and thirty trials were conducted. According to Eq. (11), the movement trajectory of the endpoint and the curve of the position difference when the cerebellar model implemented with 6 basic units (d = 6) in one trial are shown in Fig. 8. The state of CF in each basic unit computed by Eq. (22) with execution times is shown in Fig.9. For the case of 10 basic units, the results are shown in Fig.10 and Fig. 11 respectively. From Fig. 8 and Fig. 10, we can see that during learning, both cerebellar models can constantly adjust the direction of the simulated robotic arm to make it move toward the target position, and finally reach the predetermined target position. All the errors are controlled within 1 mm which proves the As the number of basic units in the cerebellum increases, there are more strategic directions for the cerebellum to choose, so the number of execution times will increase. The results show that the average execution times for the cerebellar model with 6 and 10 basic units are 352±31 and 368±63. Unilateral t-test result showed that there was significant difference for the execution times (t = 1.19 <1.65 = t 0.05 ( 58)). The # 3 basic unit (negative X-axis), #4 basic unit (negative Y-axis) and # 6 basic unit (negative Z-axis) are more active throughout the learning process since the movement direction of the arm is from the first quadrant to the seventh quadrant as we can see in Fig. 9. 
The mapping directions of these basic units are aligned with the movement direction, so they are activated and their state values are 1. Other basic units, such as #1 (positive X-axis), #2 (positive Y-axis) and #5 (positive Z-axis), are scarcely activated because their mapping directions are opposite to the movement direction. As shown in Fig. 11, in addition to the #3, #4 and #6 basic units, the #8 basic unit is also active because it represents the direction vector (−0.14; −0.14; 0), which plays an important role in the motion.
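The relationship between mapping direction and movement direction described above can be illustrated with a short sketch. It uses the start and target positions, the 0.2 mm column norm, and the unit numbering given in the text; the dot-product test and the name `aligned_units` are assumptions made for illustration and do not reproduce the CF state update of Eq. (22). Only the basic units whose directions are stated in the text are listed.

```python
# Illustrative sketch only: flags basic units whose mapping direction points
# along the required displacement. The dot-product rule is an assumption for
# illustration; it is not the CF state computation of Eq. (22).

STEP_MM = 0.2  # 2-norm of each column of F, as stated in the text

# Mapping directions for the basic units described in the text.
UNIT_DIRECTIONS = {
    1: (STEP_MM, 0.0, 0.0),    # positive X-axis
    2: (0.0, STEP_MM, 0.0),    # positive Y-axis
    3: (-STEP_MM, 0.0, 0.0),   # negative X-axis
    4: (0.0, -STEP_MM, 0.0),   # negative Y-axis
    5: (0.0, 0.0, STEP_MM),    # positive Z-axis
    6: (0.0, 0.0, -STEP_MM),   # negative Z-axis
    8: (-0.14, -0.14, 0.0),    # diagonal unit reported for d = 10
}


def aligned_units(start, target, directions):
    """Return the unit numbers whose direction has a positive projection
    onto the displacement from start to target (hypothetical helper)."""
    displacement = tuple(t - s for s, t in zip(start, target))
    return sorted(
        unit
        for unit, direction in directions.items()
        if sum(a * b for a, b in zip(direction, displacement)) > 0.0
    )


start = (30.0, 40.0, 20.0)
target = (-30.0, -40.0, -20.0)
active = aligned_units(start, target, UNIT_DIRECTIONS)
print(active)  # [3, 4, 6, 8]
```

Under this simplified rule, the units flagged as aligned are the same ones reported as active in Fig. 9 and Fig. 11, while #1, #2 and #5 have negative projections onto the displacement.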
To further investigate the model's behavior, we evaluated its performance under an unexpected disturbance. At a certain point, an unexpected external force is applied, which perturbs the executed trajectory. We evaluated how the tuning of the basic units affected performance in a noisy environment and how the BCM model deals with the perturbation. For ease of observation, the arm endpoint moves along a trajectory in the Y-Z plane (Fig. 12, solid line). When an unexpected perturbation is virtually applied to the arm, the trajectory deviates from the desired one and is deformed toward the left (Fig. 12, dotted line). Note that the trajectory of the arm endpoint was changed when it reached (0, 10, 20) mm.
A closer examination of Fig. 12 reveals that our cerebellar model reacts immediately to the errors when the perturbation appears. It rapidly reduces the tracking bias and successfully guides the arm to the target through its online tuning ability.

V. EXPERIMENT RESULTS
In this study, the proposed cerebellar model was implemented with 10 basic units in a robotic arm, the Geomagic Touch. The bionic arm motion (BCM) control platform is shown in Fig. 13. Four sets of trials with different ending points were carried out. In each experiment, the end-effector path was traced, and the resulting trajectory in one trial is depicted in Fig. 14. The solid line is the end-effector trajectory obtained with the BCM control model proposed in this paper, and the dotted line is the result from the CMAC. The star and prism denote the final positions for the BCM and the CMAC, respectively, and the starting position is the origin. The results indicate that the end-effector of the robotic arm can accurately reach the target position, and all the BCM trajectories are much smoother than those from the CMAC, which supports the validity of the proposed cerebellar model.
In addition, each set of trials was repeated 20 times to calculate the average distance error and target position error. The performance metrics evaluated for this test are summarized in Table 2. A paired t-test was applied to investigate the influence of the control method on task performance. The results are significant with p < 0.05. The results on the position error revealed that the BCM can control the end-effector to the target position more accurately than the CMAC (t = 8.94 > t0.05(79) = 1.65). Regarding the trajectory distance, there was also a statistically significant difference between the BCM and the CMAC (t = 22.6 > t0.05(79) = 1.65).
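For reference, the paired t statistic used in such a comparison can be written out as follows. This is the standard textbook form, not a formula taken from the paper. The symbols d_i, d-bar, s_d, n and nu are notation introduced here, and the pairing of the 4 × 20 = 80 trials is an assumption that is consistent with the 79 degrees of freedom quoted above.

```latex
% d_i: difference between the CMAC and BCM values of one metric in trial i
% (hypothetical notation; the paper does not define these symbols).
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^{2}}, \qquad
t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad
\nu = n - 1 .
% Assuming n = 4 \times 20 = 80 pairs gives \nu = 79, matching t_{0.05}(79).
```

The difference is judged significant at the 0.05 level when t exceeds the one-tailed critical value for 79 degrees of freedom, which the text quotes as 1.65.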

VI. CONCLUSION AND FUTURE WORK
The integration of brain science and intelligent technologies will promote breakthroughs in brain-like intelligence research, and understanding the cognitive neural mechanisms of the human brain can bring new insights to research on artificial intelligence algorithms. This paper takes the cerebellum as its research object and studies the microscopic structure and mechanisms of the neurons in the cerebellar cortex. Based on neural computational methods, a cerebellar learning model with reinforcement learning was proposed at the neuron level for the control and regulation of arm movement. In our model, the Purkinje cells are assumed to generate arm-motion commands, and the climbing fibers provide the reward or punishment information. The neural computing simulation software Emergent was then used to build the cerebellar neural network and the simulated arm model. The simulation results show that the system can complete the predetermined control tasks and that the motion trajectory is smooth. As the number of learning iterations increases, the position error between the arm and the target position is gradually reduced, indicating the effectiveness of the proposed cerebellar model. Finally, the cerebellar model was ported to the control system of the Geomagic Touch bionic arm robot. Experimental tests also showed that the cerebellar model can achieve precise and stable control of the robotic arm.
In this study, we compared the control performance of the cerebellar model using 6 and 10 basic units. The results showed that both could achieve the target-reaching tasks. Using more basic units can improve the accuracy, but the stability becomes worse and the learning time is longer. This is in accordance with the phenomenon of redundancy in motion control. It suggests that the central nervous system may reduce the computational complexity of motor control by driving muscle synergies rather than individual muscles [37]-[39]. Along this line of consideration, it is supposed that each basic unit generates a specific motor output by selecting a specific pattern of muscle activations. In future work, we will build a more physiologically plausible cerebellar model based on the theory of muscle synergy to cover movement planning with more degrees of freedom, which may be useful in the development of general-purpose motor learning systems for machines.
Moreover, combining the cerebellar model with other simulated brain regions could yield a more complex and capable model. Of particular interest are the basal ganglia, which are assumed to perform reinforcement learning. Therefore, another important line of work will involve incorporating a wider range of brain regions into computer simulations, as this will not only constrain certain aspects of these models but will also allow for the development of more sophisticated cerebellar models.