Efficient Reward-Based Structural Plasticity on a SpiNNaker 2 Prototype

Advances in neuroscience uncover the mechanisms employed by the brain to efficiently solve complex learning tasks with very limited resources. However, the efficiency is often lost when one tries to port these findings to a silicon substrate, since brain-inspired algorithms often make extensive use of complex functions, such as random number generators, which are expensive to compute on standard general purpose hardware. The prototype chip of the 2nd generation SpiNNaker system is designed to overcome this problem. Low-power ARM processors equipped with a random number generator and an exponential function accelerator enable the efficient execution of brain-inspired algorithms. We implement the recently introduced reward-based synaptic sampling model that employs structural plasticity to learn a function or task. The numerical simulation of the model requires updating the synapse variables in each time step, including an explorative random term. To the best of our knowledge, this is the most complex synapse model implemented so far on the SpiNNaker system. By making efficient use of the hardware accelerators and numerical optimizations, the computation time of one plasticity update is reduced by a factor of 2. This, combined with fitting the model into the local SRAM, leads to a 62% energy reduction compared to the case without accelerators and with the use of external DRAM. The model implementation is integrated into the SpiNNaker software framework, allowing for scalability onto larger systems. The hardware-software system presented in this work paves the way for power-efficient mobile and biomedical applications with biologically plausible brain-inspired algorithms.

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Neurophysiological data suggest that brain networks are sparsely connected, highly dynamic and noisy [1], [2]. A single neuron is only connected to a fraction of potential postsynaptic partners and this sparse connectivity changes even in the adult brain on the timescale of hours to days [3], [4]. The dynamics that underlie the process of synaptic rewiring were found to be dominated by noise [5]. It has been further suggested that the permanently ongoing dynamics of synapses lead to a random walk that is well described by a stochastic drift-diffusion process that gives rise to a stationary distribution over synaptic strengths. Therefore, synapses are permanently changing and randomly rewiring while the overall statistics of the connectivity remain stable [6]-[9]. Theoretical considerations suggest that the brain is not suppressing these noise sources since they can be exploited as a computational resource to drive exploration of parameter spaces, and several models have been proposed to capture this feature of brain circuits (see [10] and [11] for reviews).
The synaptic sampling model that has been proposed in [12], [13] employs this approach for rewiring and synaptic plasticity. The noisy learning rules drive a sampling process which mimics the drift-diffusion dynamics of synapses in the brain. Although the network is permanently rewired, this process provably leads to a stationary distribution of the connectivity. This distribution over the network connectivity can be shaped by reward signals, to incorporate reinforcement learning, and can be constrained to enforce sparsity [14]. The synaptic sampling model reproduces a number of experimental observations, such as the dynamics of synaptic decay under stimulus deprivation or the long-tailed distribution over synaptic weights [12], [14]. Furthermore, when equipped with standard error back-propagation this method was found to perform on a par with classical fully connected machine learning networks, at a fraction of the memory requirement [15].
However, the gain in efficiency of biology-inspired algorithms such as synaptic sampling can often not be fully realized on either dedicated neuromorphic hardware or standard digital compute hardware, since these models require complex operations such as random number generation or exponential functions. The former hardware usually has very narrowly configurable plasticity functions unsuitable for this kind of exploration [16]-[19]. Thus, synaptic weights that experience complex plasticity functions are usually precomputed in software and then run statically on mixed-signal [20], [21] or on digital neuromorphic hardware [22]. On the other hand, standard digital compute hardware is in principle flexible enough, but the functions required by the plasticity models are very expensive to compute on standard hardware, which significantly narrows down the gain in efficiency. Despite recent efforts to simulate spiking neural networks on GPUs [23], there is, to the best of our knowledge, no hardware support available in GPUs for random number generation, especially true random number generation, or for exponential functions. A common workaround on digital hardware is to store large amounts of random numbers and look-up tables for the exponential function before the simulation starts [24]. This reduces computation time at the cost of increasing the requirements for the already limited memory of embedded applications. The 2nd generation SpiNNaker system strives to break the trade-off between computation time and memory by employing dedicated hardware components for these time- (and energy-) consuming operations. Standard ARM processors are augmented with hardware accelerators for random numbers [25] and exponential functions [26]. We show that this allows us to implement complex learning algorithms in a compact, power-efficient package. In addition, by fitting the model into the local SRAM, DRAM can be switched off, further reducing the power consumption. This potentially offers a new compute substrate, especially for mobile and biomedical applications such as neural implants that are strictly limited by the power budget, computation speed and memory capacity of the silicon chip on which they are executed.
In this article we present the main features of the prototype chip of the 2nd generation SpiNNaker system in detail and showcase the benefits of the architecture for experiments on reward-based synaptic sampling [14]. We show that the architecture allows us to exploit the advantage of the synaptic sampling algorithm. The model is efficiently implemented thanks to the hardware accelerators, the software optimizations and the floating point unit available in ARM M4F. We show a speedup of more than 2 due to the use of hardware accelerators. Our hardware-software system optimizes the implementation of reward-based synaptic sampling with respect to the memory footprint, computation and power and energy consumption. We built a scalable distributed real-time online learning system and demonstrate its usability in a closed-loop reinforcement learning task. Furthermore, we study a modified rewiring scheme called random reallocation that recycles the memory of synapses by immediately reconnecting them to a new post-synaptic target. We show that this more efficient version of synaptic sampling also leads to faster learning.
In Section II we give an overview of the prototype chip, focusing on the random number generator and the exponential function accelerator. Section III shows the reward-based synaptic sampling model implemented in this work. Section IV presents the software implementation and experimental results are presented in Section V.

A. System Overview
SpiNNaker [27] is a digital neuromorphic hardware system based on low-power ARM processors built for the real-time simulation of spiking neural networks (SNNs). On the basis of the first-generation SpiNNaker architecture and our previous work in power-efficient multi-processor systems on chip [28], [29], the second generation SpiNNaker system (SpiNNaker 2) is currently being developed in the Human Brain Project [30]. By employing a state-of-the-art CMOS technology and advanced features such as per-core power management, more processors can be integrated per chip at significantly increased energy efficiency. In this article we use the first SpiNNaker 2 prototype chip, with architecture as shown in Fig. 1. Table I provides a brief summary of the new hardware features which are relevant for this work, in contrast to the first generation SpiNNaker [31] system. Furthermore, the table includes an outlook on the final SpiNNaker 2 chip (tape-out 2020).
The processing element (PE) is based on an ARM M4F processor core with 128 KB local SRAM, an exponential function accelerator [26], neuromorphic power management [33] and a hardware pseudo random number generator (PRNG). The SpiNNaker router [34] handles on-chip and off-chip spike communication. Furthermore the chip provides a dedicated true random number generator (TRNG). The various components are interconnected via Network-on-Chip (NoC). The chip has been fabricated in 28 nm SLP CMOS technology by GLOBALFOUNDRIES (Fig. 2). The next two Sections (II-B, II-C) will give an introduction of the hardware accelerators, i.e., the random number generator and the exponential function accelerator.
Fig. 1. Overview of the SpiNNaker 2 prototype including 4 processing elements (PE) with ARM core, power management controller (PMC) and exponential function accelerator (EXP), True Random Number Generator (TRNG), Network-on-Chip (NoC), SpiNNaker router, shared on-chip SRAM (not used in this work) and off-chip DRAM.

B. Random Number Generator
The hardware PRNG is a specific implementation of Marsaglia's KISS [35] random number generator. The generated sequence depends only on the initial seed. The main advantage of a PRNG over a TRNG is reproducibility, which simplifies debugging. However, due to the properties of a PRNG, not all effects of the randomness might be seen, since the entropy of the sequence is limited to that of the seed of the generator. To make it easy to run an experiment with different random inputs and higher entropy, the prototype offers the possibility to scramble the seed of the PRNG with a value generated by the TRNG. From a software point of view only the initial configuration differs, and no further changes to the code are necessary. The entropy source of the TRNG is the jitter of the different clock generators of the chip [36]. In conventional clock generators, this unwanted noise would be cancelled by the control loop [37]. However, in this case the noise provides us with an entropy source at minimal cost in terms of power and area, since the clock generators have to run anyway, for the PE itself as well as for the SpiNNaker links. The principle is described in detail in [25] and has been submitted as a patent [38]. The entropy of the individual clock generators is combined into a true random bus, which is sampled by the PRNG in order to realize the scrambling.
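The interplay of a seeded PRNG and TRNG-based seed scrambling can be illustrated with a small software sketch. The generator below follows the widely circulated 1999 formulation of Marsaglia's KISS (two multiply-with-carry generators, a xorshift register and a linear congruential generator); whether the hardware uses exactly this variant and these constants is an assumption, since the text only refers to [35]. The helper `scramble_seed`, the XOR combination and the use of `os.urandom` as a stand-in for the true random bus are hypothetical and serve only to show that nothing but the initial configuration changes.

```python
import os

MASK32 = 0xFFFFFFFF


class Kiss32:
    """Software sketch of a KISS-style PRNG (assumed variant, not the hardware design)."""

    def __init__(self, z=362436069, w=521288629, jsr=123456789, jcong=380116160):
        self.z, self.w, self.jsr, self.jcong = z, w, jsr, jcong

    def next_u32(self):
        # Two 16-bit multiply-with-carry generators combined into one 32-bit word.
        self.z = (36969 * (self.z & 0xFFFF) + (self.z >> 16)) & MASK32
        self.w = (18000 * (self.w & 0xFFFF) + (self.w >> 16)) & MASK32
        mwc = ((self.z << 16) + self.w) & MASK32
        # Three-shift xorshift register.
        self.jsr ^= (self.jsr << 17) & MASK32
        self.jsr ^= self.jsr >> 13
        self.jsr ^= (self.jsr << 5) & MASK32
        # Linear congruential generator.
        self.jcong = (69069 * self.jcong + 1234567) & MASK32
        return ((mwc ^ self.jcong) + self.jsr) & MASK32

    def next_float(self):
        # Convert the integer to a float in the range [0, 1).
        return self.next_u32() / 2.0**32


def scramble_seed(seed_words, trng_words):
    """Hypothetical scrambling: XOR each seed word with one TRNG word."""
    z, w, jsr, jcong = [(s ^ t) & MASK32 for s, t in zip(seed_words, trng_words)]
    return z, w, jsr or 1, jcong  # keep the xorshift state non-zero


# Reproducible run: fixed seed. Scrambled run: same code, different configuration.
fixed = Kiss32()
trng_words = [int.from_bytes(os.urandom(4), "little") for _ in range(4)]
scrambled = Kiss32(*scramble_seed((362436069, 521288629, 123456789, 380116160), trng_words))
```

With a fixed seed two generator instances produce identical sequences, which is the reproducibility property used for debugging; the scrambled instance differs from run to run.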

C. Exponential Function Accelerator
The exponential function accelerator calculates an exponential function with the signed fixed-point s16.15 data type. In the implementation, the operand $x$ is divided into three parts, $x = n + p + q$, such that
$$\exp(x) = \exp(n) \cdot \exp(p) \cdot \exp(q) = f_{\mathrm{int}}(n) \cdot f_{\mathrm{frac}}(p) \cdot f_{\mathrm{poly}}(q), \tag{1}$$
where $n$ is the integer part, and $p$ and $q$ are the upper and lower fractional parts, respectively. $f_{\mathrm{int}}(n)$ and $f_{\mathrm{frac}}(p)$ are calculated with two separate look-up tables (LUTs), and $f_{\mathrm{poly}}(q)$ is a polynomial. The split into two separate LUTs considerably reduces the memory size and thus the silicon area compared to one combined LUT, by taking advantage of the properties of the exponential function. The split of the evaluation of the fractional part into a LUT and a polynomial reduces the computational complexity of the polynomial with minimum memory overhead. The overall implementation achieves single-LSB precision in the employed fixed-point format [26]. The exponential accelerator is included in each PE and makes up approx. 2% of the silicon area of each PE. The look-up and the polynomial calculation are parallelized, resulting in a latency of four clock cycles for each exponential function. Writing the operand to the accelerator and reading the result from it via the AHB bus adds two more clock cycles, resulting in 6 clock cycles in total. In pipelined operation the processor writes one operand in one clock cycle and reads the result of a previous exponential function in another clock cycle, resulting in two clock cycles per exponential function [26].
Footnote 1: All clock cycle numbers in this paper are measured on the ARM core of the prototype chip.
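The decomposition into an integer LUT, a fractional LUT and a polynomial can be sketched in software as follows. This is only an illustration of the principle: the split of the 15 fractional bits into 6 upper and 9 lower bits, the cubic Taylor polynomial and the floating-point table entries are assumptions made for this example and are not taken from the accelerator design in [26]. The result is returned as a float rather than being quantized to s16.15.

```python
import math

FRAC_BITS = 15                      # s16.15: 15 fractional bits
UPPER_BITS = 6                      # assumed split of the fractional part
LOWER_BITS = FRAC_BITS - UPPER_BITS

# LUT for the integer part; outside roughly [-11, 10] the result under- or overflows s16.15.
F_INT = {n: math.exp(n) for n in range(-11, 11)}
# LUT for the upper fractional bits.
F_FRAC = [math.exp(k / 2**UPPER_BITS) for k in range(2**UPPER_BITS)]


def f_poly(q):
    # Low-order Taylor polynomial; sufficient because 0 <= q < 2**-UPPER_BITS.
    return 1.0 + q + q * q / 2.0 + q * q * q / 6.0


def exp_s16_15(x_fixed):
    """Approximate exp(x) for an integer x_fixed that encodes x = x_fixed / 2**15."""
    n = x_fixed >> FRAC_BITS                         # integer part (floor, also for negative x)
    frac = x_fixed & (2**FRAC_BITS - 1)              # fractional bits, always non-negative
    p_index = frac >> LOWER_BITS                     # upper fractional bits index the LUT
    q = (frac & (2**LOWER_BITS - 1)) / 2**FRAC_BITS  # lower fractional bits as a real number
    return F_INT[n] * F_FRAC[p_index] * f_poly(q)
```

With this split the two tables hold 22 + 64 entries, whereas a single table over the same integer range and all 15 fractional bits would need 22 × 32768 entries, which illustrates the memory saving the text refers to.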

III. SPIKING NETWORK MODEL
To demonstrate the performance gain of the SpiNNaker 2 hardware for simulations of spiking neural networks, we implemented the synaptic sampling model introduced in [14]. In this section we briefly review this model for stochastic synaptic plasticity and rewiring. The model combines insights from experimental results on synaptic rewiring in the brain with a model for online reward maximization through policy gradient (see Section III-C for details). The network has a large number of potential synaptic connections, only a fraction of which is functional at any moment in time, whereas most others are non-functional (disconnected). The network connectivity is permanently modified through rewiring. Synaptic weight changes and rewiring are guided by stochastic learning rules that probe different network configurations. Hence, unlike the usually considered deterministic learning rules, which converge to some (local) optimum of the parameters, learning in the synaptic sampling framework converges to a target distribution $p^*(\theta)$ over synaptic parameters $\theta$. The learning rules are designed in such a way that maxima of the distribution $p^*(\theta)$ coincide with maxima of the expected reward. We first summarize the general synaptic sampling framework in Sections III-A and III-B and then provide additional details on its application to reinforcement learning in Section III-C. All parameter values are summarized in Table II. In Section III-D we discuss random reallocation of synapses, a modified rewiring scheme that is more memory efficient.

A. Synapse model
In our model for synaptic rewiring we consider a neural network scaffold with a large number of potential synaptic connections between neurons. For each functional synaptic connection, we introduce a real-valued parameter $\theta_i$ that determines the strength $w_i$ of connection $i$ through the exponential mapping
$$w_i = \exp(\theta_i - \theta_0), \tag{2}$$
with a positive offset parameter $\theta_0$ that scales the minimum strength of synaptic connections. The mapping in Eq. (2) accounts for the experimentally found multiplicative synaptic dynamics in the cortex (cf. [7], [8], [39], see [14] for details).
For simplicity we assume that only excitatory connections (with $w_i \ge 0$) are plastic, but the model can be easily generalized to inhibitory synapses. The functional goal of network learning is determined by the dynamics of the synaptic parameters $\theta_i$. It was shown in [14] that for some target distribution $p^*(\theta)$ over synaptic parameters the stochastic drift-diffusion processes
$$d\theta_i = \beta \left. \frac{\partial}{\partial \theta_i} \log p^*(\theta) \right|_t dt + \sqrt{2 \beta T} \, dW_i, \tag{3}$$
with the partial derivative $\left. \frac{\partial}{\partial \theta_i} \log p^*(\theta) \right|_t$ of the log-distribution with respect to parameter $\theta_i$ evaluated at time $t$, give rise to a stationary distribution over $\theta$ that is proportional to $p^*(\theta)^{1/T}$. In Eq. (3) $\beta$ plays the role of a learning rate and $dW_i$ are stochastic increments and decrements of Wiener processes, which are scaled by the temperature parameter $T$.
This result suggests that a rule for reward-based synaptic plasticity should be designed in a way that $p^*(\theta)$ has most of its mass on highly rewarded parameter vectors $\theta$. We use target distributions $p^*(\theta)$ of the form $p^*(\theta) \propto p_S(\theta) \times V(\theta)$, where $\propto$ denotes proportionality up to a positive normalizing constant. $p_S(\theta)$ can encode structural priors of the network scaffold, e.g. to enforce sparsity. This happens when $p_S(\theta)$ has most of its mass near 0. In our experiments we have used a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ for the prior $p_S(\theta)$, such that $\frac{\partial}{\partial \theta_i} \log p_S(\theta) = \frac{1}{\sigma^2} (\mu - \theta_i(t))$. The function $V(\theta)$ denotes the expected discounted reward associated with a given parameter vector $\theta$. In Section III-C we will discuss in detail how the term $\frac{\partial}{\partial \theta_i} \log V(\theta)$ can be computed using reward-modulated plasticity rules.
Synaptic rewiring is included in this model by interpreting each synapse $i$ for which $\theta_i \le 0$ as disconnected. To reconnect synapses we tested two approaches. In the first approach we continued to simulate the dynamics of the prior distribution, i.e. a process of the form of Eq. (3) with $p^*(\theta) = p_S(\theta)$, until the synapse reconnects ($\theta_i > 0$). This is the algorithm that was proposed in [14]. In Section III-D we introduce another approach for rewiring called random reallocation of synapses that makes more effective use of memory resources. The two approaches are compared in the results below.

B. Neuron model
We considered a general network of $K$ stochastic spiking neurons and we denote the output spike train of a neuron $k$ by $z_k(t)$, defined as the sum of Dirac delta pulses positioned at the spike times $t_k^{(1)}, t_k^{(2)}, \dots$, i.e. $z_k(t) = \sum_l \delta(t - t_k^{(l)})$. We denote by $\mathrm{PRE}_i$ and $\mathrm{POST}_i$ the index of the pre- and postsynaptic neuron of synapse $i$, respectively, which unambiguously specifies the connectivity in the network. Further, we define $\mathrm{SYN}_k$ to be the index set of synapses that project to neuron $k$. Note that this indexing scheme allows us to include multiple (potential) synaptic connections between a given pair of neurons. In all simulations we allow multiple synapses between neuron pairs. Network neurons were modeled by a standard stochastic variant of the spike response model [40]. We denote by $w_i(t)$ the synaptic efficacy of the $i$-th synapse in the network at time $t$, determined by Eq. (2). The membrane potential of neuron $k$ at time $t$ is then given by
$$u_k(t) = \sum_{i \in \mathrm{SYN}_k} y_{\mathrm{PRE}_i}(t) \, w_i(t) + \vartheta_k(t), \tag{4}$$
where $\vartheta_k(t)$ denotes the slowly adapting bias potential of neuron $k$, and $y_{\mathrm{PRE}_i}(t)$ denotes the trace of the (unweighted) postsynaptic potentials (PSPs) that neuron $\mathrm{PRE}_i$ leaves in its postsynaptic synapses at time $t$. It is defined as $y_{\mathrm{PRE}_i}(t) = z_{\mathrm{PRE}_i}(t) * \epsilon(t)$, given by spike trains filtered with a PSP kernel of the form $\epsilon(t) = \Theta(t) \frac{\tau_r}{\tau_m - \tau_r} \left( e^{-t/\tau_m} - e^{-t/\tau_r} \right)$, with time constants $\tau_m$ and $\tau_r$. Here $*$ denotes convolution and $\Theta(\cdot)$ is the Heaviside step function, i.e. $\Theta(x) = 1$ for $x \ge 0$ and 0 otherwise.
Spike trains were generated using the following method. We used an exponential dependence between the membrane potential and firing rate, such that the instantaneous rate of neuron $k$ at time $t$ is given by $f_k(t) = \exp(u_k(t))$. Spike events were drawn from a Poisson process with rate $f_k(t)$. After each spike, neurons were refractory for a fixed time window of length $t_{\mathrm{ref}}$.
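The spike generation procedure can be sketched in discrete time as follows. This is a minimal illustration, assuming a 1 ms time step and a placeholder refractory period of 5 ms (the actual value of the refractory period is listed in Table II and is not reproduced here); the function name `simulate_spikes` is hypothetical. In each step the probability of at least one Poisson event is $1 - \exp(-f_k \, dt)$.

```python
import math
import random


def simulate_spikes(u_trace, dt=1e-3, t_ref=5e-3, rng=None):
    """Draw a spike train for one neuron from membrane potentials (one value per step)."""
    rng = rng or random.Random(0)
    ref_steps = round(t_ref / dt)   # refractory period in whole time steps
    wait = 0                        # remaining refractory steps
    spikes = []
    for u in u_trace:
        if wait > 0:                # neuron is refractory: no spike possible
            wait -= 1
            spikes.append(0)
            continue
        rate = math.exp(u)                    # instantaneous rate f_k(t) = exp(u_k(t))
        p_spike = 1.0 - math.exp(-rate * dt)  # P(at least one Poisson event within dt)
        fired = rng.random() < p_spike
        spikes.append(1 if fired else 0)
        if fired:
            wait = ref_steps
    return spikes
```

For very large membrane potentials the neuron fires as fast as the refractory period allows, and for very negative potentials it stays silent, which gives two easy sanity checks for the sketch.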
The bias potential $\vartheta_k(t)$ in Eq. (4) implements a slow rate adaptation mechanism which was updated according to
$$\tau_\vartheta \frac{d\vartheta_k(t)}{dt} = \nu_0 - f_k(t), \tag{5}$$
where $\tau_\vartheta$ is the time constant of the adaptation mechanism and $\nu_0$ is the desired output rate of the neuron. In our simulations, the bias potential $\vartheta_k(t)$ was initialized at -3 and then followed the dynamics given in Eq. (5) (see [14] for details).

C. Reward-based synaptic sampling
In a reward-based learning framework we assume that the network is exposed to a real-valued scalar function $r(t)$ that denotes the reward at any moment in time in response to the network behavior. The value function $V(\theta)$ determines the expectation of $r(t)$ over all possible network states while discounting future rewards, i.e. $V(\theta) = \left\langle \int_0^\infty e^{-\tau/\tau_e} \, r(\tau) \, d\tau \right\rangle$, with discounting time constant $\tau_e$, where $\langle \cdot \rangle$ denotes the expectation over all possible network responses. The gradient $\frac{\partial}{\partial \theta_i} \log V(\theta)$ can be estimated for the network model outlined above using standard reward-modulated learning rules with an eligibility trace (see [14] for details)
$$\frac{de_i(t)}{dt} = -\frac{1}{\tau_e} e_i(t) + w_i(t) \, y_{\mathrm{PRE}_i}(t) \left( z_{\mathrm{POST}_i}(t) - f_{\mathrm{POST}_i}(t) \right). \tag{6}$$
This eligibility trace Eq. (6) is multiplied by the reward $r(t)$ and integrated in each synapse $i$ using a second dynamic variable
$$\frac{dg_i(t)}{dt} = -\frac{1}{\tau_g} g_i(t) + \left( \frac{r(t)}{\hat{r}(t)} + \alpha \right) e_i(t), \tag{7}$$
where $\hat{r}(t)$ is a low-pass filtered version of $r(t)$ with time constant $\tau_g$. The variable $g_i(t)$ combines the eligibility trace $e_i(t)$ and the reward $r(t)$ in a temporal average. $\alpha$ is a constant offset on the reward signal. This parameter can be set to an arbitrary value without changing the stationary dynamics of the model [14]. In our simulations, this offset $\alpha$ was chosen slightly above 0 ($\alpha = 0.02$) such that small parameter changes were also present without any reward. The variable $g_i(t)$ realizes an online estimator for $\frac{\partial}{\partial \theta_i} \log V(\theta)$ [14]. Putting it all together, by plugging Eq. (7) into Eq. (3) the synaptic parameter changes at time $t$ are given by
$$d\theta_i = \beta \left( \frac{1}{\sigma^2} (\mu - \theta_i(t)) + g_i(t) \right) dt + \sqrt{2 \beta T} \, dW_i. \tag{8}$$
Eqs. (2) and (4)-(8) conclude the neuron and synapse dynamics used in our simulations. The parameter values are given in Table II.
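To make the per-synapse update concrete, the following sketch applies one Euler-Maruyama step to the eligibility trace, the gradient estimator and the synaptic parameter. It assumes the forms of the dynamics used in [14]: a leaky eligibility trace driven by the weighted PSP trace and the difference between postsynaptic spikes and rate, a leaky gradient estimator driven by the normalized reward plus offset, and a parameter update with Gaussian prior drift and diffusion term. Apart from $\alpha = 0.02$, all default parameter values below are placeholders rather than the values of Table II, and the names `Synapse` and `plasticity_step` are hypothetical. A Dirac spike contributes a unit increment per time step, which is why `spike_post` is not multiplied by `dt`.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Synapse:
    theta: float      # synaptic parameter
    e: float = 0.0    # eligibility trace
    g: float = 0.0    # estimated gradient


def plasticity_step(syn, y_pre, spike_post, f_post, r, r_hat, rng,
                    dt=1e-3, beta=1e-2, T=0.1, mu=0.0, sigma=2.0,
                    tau_e=1.0, tau_g=50.0, alpha=0.02, theta_0=3.0):
    """One discrete update of (e, g, theta); r_hat must be non-zero."""
    # Exponential weight mapping; disconnected synapses (theta <= 0) carry no weight.
    w = math.exp(syn.theta - theta_0) if syn.theta > 0.0 else 0.0
    # Eligibility trace: leak plus weighted PSP trace times (spike - expected spikes).
    syn.e += -dt / tau_e * syn.e + w * y_pre * (spike_post - f_post * dt)
    # Gradient estimator: leak plus reward-modulated eligibility trace.
    syn.g += dt * (-syn.g / tau_g + (r / r_hat + alpha) * syn.e)
    # Parameter update: prior drift + reward gradient + scaled Wiener increment.
    drift = beta * ((mu - syn.theta) / sigma**2 + syn.g) * dt
    noise = math.sqrt(2.0 * beta * T * dt) * rng.gauss(0.0, 1.0)
    syn.theta += drift + noise
    return w
```

A call such as `plasticity_step(Synapse(theta=1.0), y_pre=0.0, spike_post=0, f_post=0.0, r=0.0, r_hat=1.0, rng=random.Random(0))` performs one 1 ms update in which only the prior and the noise term act on the parameter.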

D. Random Reallocation of Synapse Memory
In the original synaptic sampling model, outlined above, whenever a synapse $i$ is disconnected (when $\theta_i \le 0$), it undergoes a random walk according to Eq. (3) until $\theta_i$ again becomes larger than zero and the synapse reappears. The dynamics of synapses that are disconnected also become independent of the network activity and are therefore not influenced by the pre- and postsynaptic spike trains, since the eligibility trace Eq. (6) vanishes. Nevertheless, synapses need to be updated even when they are not used, which wastes memory and CPU time. In a typical simulation of synaptic sampling, where the majority of synapses are non-functional most of the time, this overhead may even dominate the simulation. Here, we discuss a more efficient approach for synaptic rewiring called random reallocation of synapse memory.
It has been previously noted that the synaptic sampling dynamics can be replaced by a more efficient approach for online rewiring of neural networks [15]. The theoretical analysis there has shown that the original synaptic sampling formulation, with convergence to a stationary distribution $p^*(\theta)$, can be combined with a hard constraint on the network connectivity such that at any moment in time a fixed number of connections $M$ is functional, i.e. $|\theta > 0| = M$. In this modified version of network rewiring, whenever a connection becomes non-functional another synapse is randomly reintroduced to keep the total number of synapses constant. Thus, non-functional synapses do not need to be simulated and therefore do not waste memory or CPU time. It has been shown that this more efficient rewiring approach also leads to a stationary distribution of network configurations that is identical to the original posterior $p^*(\theta)$ confined to the manifold of the parameter space that fulfills the constraint $|\theta > 0| = M$ (see [15] for details). This rewiring strategy has already been successfully applied to deep learning [15] and implemented on the SpiNNaker 2 prototype chip [41].
Here, we used a similar rewiring approach to the one in [15]. However, an additional limitation on the rewiring scheme comes from the memory model of the software framework. In our implementation, each neuron maintains a table of its postsynaptic targets (see Section IV-C for details). Therefore, the free space of synapses that become disconnected can most efficiently be reassigned to another postsynaptic target of the same presynaptic neuron. Consequently, we decided to use a connectivity constraint that assures that the fanout of each neuron is constant throughout the simulation. This is simply achieved by immediately reconnecting each synapse that becomes non-functional to a new randomly chosen postsynaptic target. Since drawing random numbers becomes efficient due to the random number generator (Section II-B), this approach has little computational overhead.
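The fixed-fanout reallocation rule can be sketched as follows for the synapses of a single presynaptic neuron. The sketch assumes that a reconnected synapse restarts from a small positive parameter value and with cleared eligibility trace and gradient estimate; the restart value `theta_init` and the function name `reallocate_row` are hypothetical, since the text does not specify how the state of a reallocated synapse is initialized.

```python
import random


def reallocate_row(theta, e, g, post_ids, n_post, rng, theta_init=1e-3):
    """Reconnect every non-functional synapse (theta <= 0) of one presynaptic neuron.

    The lists are modified in place; their length (the fanout) never changes.
    Returns the number of synapses that were rewired.
    """
    n_rewired = 0
    for i in range(len(theta)):
        if theta[i] <= 0.0:
            post_ids[i] = rng.randrange(n_post)  # new random postsynaptic target
            theta[i] = theta_init                # assumed restart value (hypothetical)
            e[i] = 0.0                           # assumed: traces of the old connection are dropped
            g[i] = 0.0
            n_rewired += 1
    return n_rewired
```

Because the memory slot of a pruned synapse is reused immediately, every stored synapse is functional after the call and no time is spent on updating disconnected synapses.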
Our results from the prototype chip presented in Section V-C suggest that random reallocation increases the effective usage of the hardware and the number of active synapses in the network, and also accelerates the exploration of the parameter space, leading to faster convergence to the stationary distribution. Interestingly, the connectivity constraint used here is somewhat similar to analog neuromorphic systems which contain synaptic matrices fixedly assigned to postsynaptic neurons with only the presynaptic sources flexible to some degree [42]. Rewiring in such a setup has to operate 'postsynaptic-centric' and, similar to our approach, has a fixed number of synapses per postsynaptic neuron [43].

IV. IMPLEMENTATION OF SYNAPTIC SAMPLING ON THE SPINNAKER 2 PROTOTYPE
The software implementation of this model is optimized regarding computation time, memory, power consumption and scalability, in order to bridge the gap between state-of-the-art biologically plausible neural models and efficient execution of the model in hardware. This is explained in more detail in the following.

A. Numerical Optimizations
a) Reducing computation time with hardware generated uniform random numbers: The synaptic sampling model draws one random number for each synapse in each simulation time step (1 ms). Since thousands of synapses are simulated in each core, random number generation could dominate the computation time. As described in Section III, the Wiener process requires Gaussian random numbers to be generated. But as described in Section II-B, only uniform random numbers can be generated by the accelerator. As shown in Table III, the generation of a pseudo Gaussian random number with the Box-Muller transform [44] in software requires 172 clock cycles. One option could be to convert the hardware generated uniform random number into a Gaussian random number with the inverse CDF method [45] and a look-up table, which reduces the computation time to 21 clock cycles. However, analytical and numerical studies have found that for the simulation of a Wiener process, Gaussian random numbers can be replaced by uniform random numbers without affecting model performance [46]. The generation of a uniform random number in software with the Marsaglia RNG [35], [47] requires 42 clock cycles, whereas with hardware it takes only 5 clock cycles, including fetching the integer random number from the accelerator and converting it to floating point type in the range of 0 to 1.
b) Reducing computation time with exponential function accelerator: In the synapse model, the parameter θ of each synapse accumulates small changes in each time step. The exponential function accelerator, which calculates the exponential function within 6 clock cycles (Section II-C), uses a fixed-point data type whose precision is not enough for this model, because the change of θ would be rounded to zero. Calculating a floating point exponential function with software libraries like Newlib takes 163 clock cycles.
Since high precision is only necessary for storing the small change of θ, but not necessary for calculating intermediate variables like w, θ can be stored as floating point in memory, and when calculating w with exponential function, θ can be converted to fixed point and calculated with the exponential function accelerator. The result is then converted back to floating point. Simulations show that the performance of the model is not affected. This reduces the computation time to 15 cycles with 6 cycles required by the hardware accelerator and 9 additional cycles for the conversion of data type. For the sake of comparison, emulation of exponential accelerator in software takes 95 cycles instead of 6 [26]. Thus, with conversion of data type, this approach would take 104 cycles with software (Table III).
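The mixed-precision scheme can be illustrated with a short sketch in which θ is kept in floating point and only the argument of the exponential is converted to s16.15. The accelerator is modeled here by an exact exponential whose result is quantized to s16.15; this stand-in, the saturation behavior and all function names are assumptions for illustration and do not describe the hardware interface.

```python
import math

FRAC_BITS = 15
S16_15_MAX = 2**31 - 1
S16_15_MIN = -2**31


def float_to_s16_15(x):
    """Round a float to the nearest s16.15 value, saturating at the format limits."""
    raw = int(round(x * 2**FRAC_BITS))
    return max(S16_15_MIN, min(S16_15_MAX, raw))


def s16_15_to_float(raw):
    return raw / 2**FRAC_BITS


def exp_accelerator_model(raw):
    # Software stand-in for the hardware unit (assumption): exp with s16.15 in- and output.
    x = s16_15_to_float(raw)
    if x > 12.0:                 # exp(12) already exceeds the s16.15 range
        return S16_15_MAX
    return float_to_s16_15(math.exp(x))


def weight_from_theta(theta, theta_0):
    """theta is stored as float; only the exp argument is converted to fixed point."""
    raw = float_to_s16_15(theta - theta_0)
    return s16_15_to_float(exp_accelerator_model(raw))


# A change of 1e-6 is below the s16.15 resolution of 2**-15 and would be lost
# if theta itself were stored in fixed point, but it survives in floating point.
theta = 1.0
theta_updated = theta + 1e-6
lost_in_fixed_point = float_to_s16_15(theta_updated) == float_to_s16_15(theta)
```

The weight computed this way differs from the floating-point result only by the quantization error of the fixed-point format, while the slow accumulation of small changes in θ is unaffected.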
c) Reducing memory footprint with 16-bit floating point data type: In order to simulate more synapses with limited memory, which is the case when the synapse parameters are stored in SRAM (see Section IV-B), the single precision floating point with 32 bits can be converted into half precision floating point with 16 bits. For each synapse i, three parameters need to be stored in memory: eligibility trace e i , estimated gradient g i and synaptic parameter θ i . Simulations show that converting e i and g i to half precision does not affect the model performance.
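The effect of the mixed 16-bit and 32-bit storage can be tried out with Python's `struct` module, whose format character `e` denotes IEEE 754 half precision and `f` single precision. The sketch packs the three per-synapse variables into 8 bytes (2 bytes for the eligibility trace, 2 bytes for the estimated gradient, 4 bytes for θ); the byte order and the function names are assumptions for illustration.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def pack_synapse(e, g, theta):
    # Half precision for e_i and g_i, single precision for theta_i: 8 bytes in total.
    # Note: struct raises OverflowError for magnitudes beyond the half-precision range.
    return struct.pack("<eef", e, g, theta)


def unpack_synapse(block):
    return struct.unpack("<eef", block)


block = pack_synapse(0.1, -0.2, 1.5)
e, g, theta = unpack_synapse(block)
```

Half precision keeps roughly three significant decimal digits, so `e` and `g` come back slightly rounded, whereas θ, which must accumulate very small changes, keeps its 32-bit representation.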

B. Local Computation
By avoiding external DRAM access and instead storing all parameters and state variables of the model locally in SRAM, both energy and computation time can be saved. To read (write) data from (to) the off-chip DRAM, the core sends a read (write) request which is first stored in a DMA (Direct Memory Access) queue in software, then sent to the DMA unit, and finally sent to the DRAM. When the read (write) process is complete, an interrupt is triggered and an interrupt handler is called, which, in case of a read request, processes the data from DRAM. Then the next read/write request in the queue is sent to the DMA unit (Fig. 3). Since the DRAM access is time consuming, the software can let the DMA run in the background and continue with other tasks. When the read/write process is complete, the core stops the current task, handles the interrupt and then resumes the stopped task after the interrupt handler is complete. Although this saves computation time compared to waiting for the read/write process to complete, it still has the following drawbacks:
1) Retrieving all synapse parameters in each time step, which is necessary in this model, could easily saturate the DRAM bandwidth, especially in the scaled-up case with tens of cores per chip [31], [48].
2) The energy consumption of DRAM access can be two orders of magnitude higher than that of SRAM access [49].
3) This only works if the other tasks are independent of the data being fetched.
4) Managing the DMA queue and calling the interrupt handler still consumes computation time, which becomes a problem when memory is frequently accessed.
The drawback when not using external DRAM is the limited memory space available in SRAM. This is not a problem for this model, since on the one hand the required memory is reduced with 16-bit floating point (Section IV-A), and on the other hand, due to the complexity of the model, the number of synapses per core is limited by computation, as is shown in Section V-B.

C. Memory Model
The memory model (Fig. 4) of this work is based on the software for the first generation SpiNNaker system [50]. The spike packet contains the ID of the presynaptic neuron. The master population table contains keys which are presynaptic neuron IDs. Each key is 4 bytes long and is stored together with the 4-byte starting address of the synapse parameters for the presynaptic neuron. These synapse parameters are stored in a contiguous memory block called synapse row. Each row is composed of 4-byte words. For each presynaptic neuron, the first word is the length of the plastic synapse region. In our implementation, the plastic synapse region consists of 8-byte blocks with 2 bytes for e_i, 2 bytes for g_i and 4 bytes for θ_i. After the plastic synapse region there is one word for the length of the fixed synapse region. The next word is the length of the plastic control region, which stores special parameters needed by the plasticity rules. In this work this region is used to store the parameters for the PSP kernel of the input spike, e.g. h_r and h_f (corresponding to the time constants τ_m and τ_f). Since the PSP kernel of the incoming spike is the same for all synapses of the same presynaptic neuron, the parameters for the PSP kernel are shared in order to reduce the memory footprint. After the word for the length of the plastic control region follow the parameters for fixed synapses.
The synapse parameters should also include the index of the postsynaptic neuron. One way to implement this is to add a 4-byte word for each postsynaptic neuron in addition to the 8 bytes for e i , g i and θ i , which is the case in the original SpiNNaker software framework. Alternatively, since in this network all input neurons have the same fanout, the indexes are stored in a 2-d array (Post-syn. Neuron ID in Fig. 4), where the column index stands for the presynaptic neuron ID and the entries represent the postsynaptic neuron IDs. Each entry represents a synapse and occupies one byte, supporting maximum 256 target neurons per core. Since multiple synapses are allowed between a pair of neurons, the ID of a postsynaptic neuron can appear multiple times in each column of the 2-d array. In general, depending on application, one of the two approaches can be chosen.
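The two options for storing the postsynaptic index can be compared with a small sketch. The class below models the 2-d array with one byte per synapse, where the column selects the presynaptic neuron and the rows enumerate the synapses of its fanout; the storage order inside the byte array and all names are assumptions made for this illustration. The helper `bytes_per_synapse` simply adds the index size to the 8 bytes used for the three synapse variables.

```python
class PostSynapticTable:
    """2-d byte array: column = presynaptic neuron, row = synapse slot within its fanout."""

    def __init__(self, n_pre, fanout):
        self.n_pre = n_pre
        self.fanout = fanout
        self.data = bytearray(n_pre * fanout)   # one byte per synapse

    def set_target(self, pre, slot, post):
        if not 0 <= post < 256:
            raise ValueError("postsynaptic ID must fit into one byte")
        self.data[slot * self.n_pre + pre] = post

    def targets(self, pre):
        # The same postsynaptic ID may appear several times (multiple synapses per pair).
        return [self.data[slot * self.n_pre + pre] for slot in range(self.fanout)]


def bytes_per_synapse(separate_index_word):
    """8 bytes of synapse state plus a 4-byte index word or a 1-byte table entry."""
    return 8 + (4 if separate_index_word else 1)


table = PostSynapticTable(n_pre=3, fanout=2)
table.set_target(pre=1, slot=0, post=7)
table.set_target(pre=1, slot=1, post=7)   # second synapse onto the same neuron
```

Under these assumptions the per-synapse footprint drops from 12 bytes to 9 bytes, at the price of requiring the same fanout for all presynaptic neurons and at most 256 target neurons per core.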
The master population table, synapse rows and postsynaptic neuron IDs are arrays generated by each core after the network configuration is specified. Each core generates its own data in a distributed way instead of having a centralized host PC generate the data for all cores. This, combined with local computation (Section IV-B), drastically reduces the time for data generation and for transmission of data from the host PC to the chip, which could make up a significant amount of the total simulation time, especially in the case of large systems [51], [52].
The SpiNNaker system employs parallel computation to run large-scale neural simulations in real time. Although the prototype chip consists of only 4 cores, the software implementation of the synaptic sampling model is integrated into the SpiNNaker software framework, allowing for scaling up onto larger systems. The design of the program flow is based on [50].

D. Program Flow and SpiNNaker Software Framework Integration
The timer tick signal of the ARM core is used to trigger each time step in real time. The length of a time step can be arbitrarily chosen. For this implementation, one time step is one millisecond. The timer tick signal triggers an interrupt. Then the handler of the interrupt is called and processes the incoming spikes from the last time step, which are stored in a hardware buffer in SRAM. In this step, for each incoming spike, first the starting memory address of its corresponding synapse parameters is found in the master population table, then the synaptic weights of the activated synapses in the synapse row are added to the synaptic input buffers of the target neurons.
For the network model implemented in this work (Section V-B), one of the cores, the "master core", then simulates the environment that computes the global reward signal. All cores continue with the synapse update and neuron update, which integrate the synaptic weight onto the membrane potential of the postsynaptic neuron. Next, the synaptic plasticity update is performed, as now all required information is available, i.e. incoming spikes, neuron states and global reward.
Finally, the spikes of the neurons in each core are sent to the SpiNNaker router, which then multicasts the spikes to the cores containing the corresponding postsynaptic neurons. The SpiNNaker router [34] allows for fast multicast of small packets, which is key to efficient spike communication for many-core neuromorphic systems like SpiNNaker. The distributed computation, the synchronization with the timer tick and the communication via the SpiNNaker router allow for scaling up the model implementation onto large systems consisting of millions of cores.
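The order of operations within one time step can be summarized in a minimal sketch. It only mirrors the sequence described above; the data structures (a dictionary per core, rows as lists of (postsynaptic ID, weight) pairs), the leaky threshold neuron and the `plasticity` callback are hypothetical placeholders, not the neuron or synapse model of this work. The global reward is passed in as an argument; on the chip it is computed by the master core.

```python
def run_time_step(core, incoming_spikes, global_reward, plasticity=None):
    """One timer-tick handler, sketched for a single core."""
    # 1) Spike processing: look up the synapse row of each presynaptic
    #    neuron and accumulate the weights in the input buffers.
    for pre_id in incoming_spikes:
        row_index = core["population_table"][pre_id]
        for post_id, weight in core["synapse_rows"][row_index]:
            core["input_buffer"][post_id] += weight

    # 2) Synapse and neuron update (placeholder neuron model).
    out_spikes = []
    for n in range(len(core["v"])):
        core["v"][n] = core["decay"] * core["v"][n] + core["input_buffer"][n]
        core["input_buffer"][n] = 0.0
        if core["v"][n] >= core["threshold"]:
            out_spikes.append(n)
            core["v"][n] = 0.0

    # 3) Plasticity update: incoming spikes, neuron states and the global
    #    reward are all available at this point.
    if plasticity is not None:
        plasticity(core, incoming_spikes, out_spikes, global_reward)

    # 4) The returned spikes would be handed to the router for multicast.
    return out_spikes
```

Because multiple synapses between a pair of neurons are allowed, the same postsynaptic ID may appear several times in a row, and its weights simply add up in step 1.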

V. RESULTS
In the following we show how the hardware accelerators and numerical optimizations reduce the computation time for one plasticity update of the synaptic sampling model. Then, we implement a network model that performs reward-based synaptic sampling on the SpiNNaker 2 prototype, for which we also provide power and energy measurements.
As shown in Section IV-A, the generation of a uniformly distributed random number takes 5 clock cycles with the hardware accelerator and 42 clock cycles in software. The floating point exponential function with the exponential accelerator and conversion of data type takes 15 clock cycles, whereas the same algorithm in software takes 104 clock cycles. The rest of the plasticity update of a synapse takes 90 clock cycles. In total, the plasticity update takes 110 clock cycles with hardware accelerators, and the equivalent software-only implementation takes 236 clock cycles (Table IV). For this application, the hardware accelerators result in a speedup of 2 regarding the number of clock cycles. Considering the increase of clock frequency from 200 MHz in SpiNNaker 1 to 500 MHz in the current prototype chip, a total speedup factor of 5 is achieved. In the plasticity update, the share of computation time spent on random number generation and the exponential function is reduced from 62% to 18%.
Fig. 7). The 20 hidden neurons are divided into two populations (A and B). The output spikes of the hidden neurons are sent to the environment (Env), which evaluates the global reward. A high reward is obtained if input pattern 1 (2) is present and the mean firing rate of population A (B) is higher than that of population B (A). The global reward is sent back to the network and shapes the plastic synapses between the input neurons and the two populations. The goal is to let the two populations 'know' which spike pattern they represent and signal this with a high firing rate when their pattern is present.
In addition to the feedforward input, hidden neurons receive lateral inhibitory synapses between each pair of hidden neurons, which are initialized to fixed random weights.
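The reward rule stated above can be written as a small sketch. The text only specifies when the reward is high; the binary values 1.0 and 0.0 and the zero reward when no pattern is present are assumptions made for illustration, and the function name is hypothetical.

```python
def global_reward(pattern, rate_a, rate_b):
    """Reward evaluated by the environment.

    pattern -- 1 or 2 while a spike pattern is shown, anything else otherwise
    rate_a  -- mean firing rate of population A
    rate_b  -- mean firing rate of population B
    """
    if pattern == 1:
        return 1.0 if rate_a > rate_b else 0.0   # A should win for pattern 1
    if pattern == 2:
        return 1.0 if rate_b > rate_a else 0.0   # B should win for pattern 2
    return 0.0                                   # assumed: no reward at rest
```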

B. Network Description
The network is mapped to the prototype chip with each core simulating 5 neurons from the two populations (see Fig. 6). The first core ("master core") also generates the input spikes and evaluates the reward. The 200 input neurons lead to 200 × 5 = 1000 pairs of neurons in each core.
The profiling results in Section V-A determine the computational limit on the number of synapses that can be simulated on each core. The ARM Cortex M4F core used in this prototype chip is configured to run at 500 MHz, which means 500 000 clock cycles are available in each time step (1 ms). The computation for one time step without plasticity update takes ca. 45 000 clock cycles for core 0 and 40 000 clock cycles for the other cores. Since each plasticity update takes 110 cycles with hardware accelerators and 236 cycles without hardware accelerators, the theoretical upper limit for the number of synapses per core is ca. 4 100 with hardware accelerators and ca. 1 900 without hardware accelerators.
In terms of memory, the prototype chip has 64 kB of Data Tightly Coupled Memory (DTCM) per core, for all initialized data, uninitialized data, heap and stack. By checking the binary file size after compilation, the maximum number of synapses is estimated as 4 700. Thus, this model is limited by computation rather than memory (see Table V). In the implementation, 3 000 plastic synapses per core are simulated in order to ensure the stability of the software. With 3 000 plastic synapses and 1000 pairs of neurons in each core, each pair of neurons has 3 plastic synapses. Note that this is only the initial configuration. Due to random reallocation of synapse memory, the postsynaptic neuron of a synapse can change, so that not every pair of neurons keeps exactly 3 plastic synapses.
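The cycle budget above can be reproduced with a few lines of arithmetic. The sketch uses only the cycle counts reported in Section V-A and the per-step overhead of core 0; the variable and function names are illustrative.

```python
CLOCK_HZ = 500_000_000             # 500 MHz
CYCLES_PER_STEP = CLOCK_HZ // 1000  # 1 ms time step -> 500 000 cycles
BASE_CYCLES_CORE0 = 45_000         # one time step without plasticity (core 0)

# Cycles per plasticity update: random number + exponential + remainder.
HW_UPDATE = 5 + 15 + 90            # with hardware accelerators
SW_UPDATE = 42 + 104 + 90          # software only


def max_synapses(cycles_per_update):
    """Upper limit of plastic synapses per core for real-time operation."""
    return (CYCLES_PER_STEP - BASE_CYCLES_CORE0) // cycles_per_update


speedup = SW_UPDATE / HW_UPDATE        # about 2.1
sw_fraction = (42 + 104) / SW_UPDATE   # about 0.62
hw_fraction = (5 + 15) / HW_UPDATE     # about 0.18

print(HW_UPDATE, SW_UPDATE, round(speedup, 2))
print(max_synapses(HW_UPDATE), max_synapses(SW_UPDATE))
```

The limits evaluate to 4 136 and 1 927 synapses, which correspond to the rounded values of ca. 4 100 and ca. 1 900 given above.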

C. Implementation Results
The usability of the network is demonstrated in a closed-loop reinforcement learning task implemented with 4 ARM cores. The generation of input spikes and the evaluation of output spikes are also implemented on chip.
As shown in Fig. 7, the 200 input neurons send two spike patterns in random order. Each spike pattern lasts for 500 ms. Resting periods of 500 ms are inserted between two pattern presentations, during which the input neurons only send random spikes with a low firing rate representing background noise. The numbers at the top of Fig. 7 and the shaded colored areas indicate which pattern is present. As discussed above, the 20 neurons are divided into 2 populations (A and B), each representing one of the two patterns. Neuron 1 to neuron 10 belong to population A, neuron 11 to neuron 20 belong to population B. In the second row of Fig. 7, blue and green curves represent the population firing rates of A and B, respectively. The firing rates were obtained with a Gaussian filter (σ = 20 ms) applied to the raw spike trains. The goal of learning is to let population A fire at a higher rate when pattern 1 is present and let population B fire at a higher rate when pattern 2 is present.
Fig. 8 shows the evolution of the mean reward with and without random reallocation of synapse memory (see Section III-D). The mean reward in each minute is low-pass filtered with a Gaussian kernel with σ = 2 min. Averages over 5 independent trial runs using the true random number generator are shown with solid lines; shaded areas indicate standard deviations. The reward is normalized to the theoretical maximum reachable reward. At learning onset the two populations respond randomly to input spike patterns and the reward is low. The synaptic weights explore the parameter space with the random process guided by the global reward as described in Section III-A. Over time, the network learns the desired input/output mapping and the reward increases. After ca. 10 minutes of training, the two populations learn to respond correctly to the two spike patterns, with the firing rate of one population higher than the other when the corresponding spike pattern is present, and the reward becomes high.
Our results show that the reward increases much faster with reallocation due to the accelerated exploration of the parameter space. After the reward reaches a high value, the network continues exploration and the reward might fluctuate while the network searches for equally good network configurations.
Fig. 8. Time-averaged reward throughout learning for networks with (red) and without (green) random reallocation of synapse memory.
The optimizations described in Section IV result in a considerable reduction of power and energy consumption. To show the benefit of the optimizations, power and energy consumption are measured in three cases. First, the synapse rows are stored in the external DRAM, and the exponential function and random number generation are done only in software running on the ARM core. Second, the synapse rows are stored in the local SRAM, and the exponential function and random number generation are still done only in software running on the ARM core. Third, the synapse rows are stored in the local SRAM, and the exponential function and random number generation are done with the hardware accelerators. For this measurement, the software is run without random reallocation of synapse memory. As summarized in Table VI, the power and energy consumption is reduced by local computation without external DRAM and by the reduction of computation time.
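The Gaussian filtering used to obtain the population firing rates in Fig. 7 can be sketched as follows. Only the kernel width σ = 20 ms is taken from the text; the normalization to a rate in Hz per neuron, the millisecond time base and the function name are assumptions for illustration.

```python
import math


def population_rate(spike_times_ms, n_neurons, t_ms, sigma_ms=20.0):
    """Mean firing rate (Hz per neuron) of a population at time t_ms.

    Each spike contributes a unit-area Gaussian, so integrating the
    returned rate over time recovers the spike count per neuron.
    """
    norm = 1.0 / (sigma_ms * math.sqrt(2.0 * math.pi))
    density = sum(
        norm * math.exp(-0.5 * ((t_ms - s) / sigma_ms) ** 2)
        for s in spike_times_ms
    )                                    # spikes per ms, whole population
    return 1000.0 * density / n_neurons  # convert to Hz per neuron
```

For a population of 10 neurons, `population_rate(spikes_a, 10, t)` and `population_rate(spikes_b, 10, t)` would give the two curves compared by the environment at time `t`.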

D. Power and Energy Measurement Results
First, the memory footprint is optimized by employing the 16-bit floating point data type and the compact memory layout described in Sections IV-A and IV-C. The random reallocation described in Section III-D increases the effective number of synapses, which is otherwise only achievable with external memory like DRAM. The reduction of the memory footprint allows for local computation with SRAM, as described in Section IV-B. Switching off the DRAM allows for a reduction of power consumption by 21%, from 285 mW to 225 mW.
In addition, as summarized in Section V-A, the computation time for each plasticity update is reduced by 53.4%. Without the hardware accelerators, simulating the network with 3 000 plastic synapses per core for one time step (1 ms) takes 1.58 ms, losing the real-time capability. With the hardware accelerators, the simulation of one time step is finished within 0.76 ms. To measure the energy consumption, the length of the time step is chosen to be the minimum required for each time step to finish, i.e. 1.58 ms without accelerators and 0.76 ms with accelerators. The reduction of computation time for the plasticity update reduces the energy consumption for one time step by 51.9%, from 355.5 µJ to 171 µJ.
In total, the energy consumption for the simulation of the network for one time step is reduced by 62%, from 450.3 µJ to 171 µJ, making the system attractive for mobile and embedded applications.
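The reported energy figures are consistent with energy per time step being the product of measured power and time-step length, as the following sketch shows. That relation is an inference from the numbers above rather than a statement of the measurement procedure; names are illustrative.

```python
def energy_uj(power_mw, step_ms):
    """Energy per time step in microjoules (mW x ms = uJ)."""
    return power_mw * step_ms


def reduction(before, after):
    """Relative reduction as a fraction."""
    return 1.0 - after / before


e_dram_sw = energy_uj(285.0, 1.58)   # DRAM, software only
e_sram_sw = energy_uj(225.0, 1.58)   # SRAM, software only
e_sram_hw = energy_uj(225.0, 0.76)   # SRAM, hardware accelerators

print(round(e_dram_sw, 1), round(e_sram_sw, 1), round(e_sram_hw, 1))
print(round(100 * reduction(285.0, 225.0)),          # power, DRAM off
      round(100 * reduction(e_sram_sw, e_sram_hw), 1),
      round(100 * reduction(e_dram_sw, e_sram_hw)))
```

The three cases give 450.3 µJ, 355.5 µJ and 171 µJ, with reductions of 21% (power), 51.9% and 62%, matching the values in the text.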

VI. DISCUSSION
In the following we discuss how the implementation of the reward-based synaptic sampling model would scale for larger networks on the final SpiNNaker 2 system. Finally, we discuss the possibility of realizing this network model on SpiNNaker 1 and other neuromorphic platforms with learning capabilities.

A. Scalability
The SpiNNaker architecture was designed for the scalable real-time simulation of spiking neural networks with up to a million cores [27]. SpiNNaker's scalability is based on the multicast network for routing of spike events [34] and a software framework for mapping network models onto the system that has been shown to support the simulation of large-scale neural networks [52]. Building on this, the reward-based synaptic sampling model can be scaled to future SpiNNaker 2 systems without major restrictions: as our implementation is integrated into the SpiNNaker software framework, the automatic mapping of larger networks onto many cores and the configuration of routing tables come for free. In principle, with more than 100 cores per chip in SpiNNaker 2 (cf. Table I), DRAM bandwidth may become a bottleneck for some applications, but not in our case, as synapse variables are stored and processed locally in each core and DRAM is not used. Furthermore, a many-chip implementation should not be limited by the communication bandwidth for spike packets between chips, as the reward-based synaptic sampling model is mainly limited by the computation of the synapse updates and has rather moderate spike rates (Section V-B). Still, we remark that, as in any large-scale neuromorphic hardware system, the fraction of energy consumed for communication will increase with network size [53], demanding optimized routing architectures [54].
Future work will include simulating larger networks of this type on the full-scale SpiNNaker 2 system with many cores. Such a scaled-up, real-time version of the synaptic sampling framework will enable us to explore reward-based learning on high-dimensional input such as dynamic vision sensors [55] or conventional high-density image sensors [56].

B. Comparison with SpiNNaker 1
Reward-based learning and structural plasticity have been implemented on the SpiNNaker system before [48], [57]. The reward-based synaptic sampling model implemented in this work is more complex because it requires random number generation and an exponential function for each plastic synapse in each time step. In addition, due to the lack of floating point arithmetic, this synapse model would be very hard, if possible at all, to implement on the first generation SpiNNaker system, since the change of synaptic weight in each time step is very small and cannot be captured by the precision of the fixed point format.

C. Comparison with other neuromorphic platforms
To the best of our knowledge, there exists today no neuromorphic hardware platform, except SpiNNaker 2, that would be able to directly simulate complex learning rules such as synaptic sampling. Most other approaches have traded off accessible model complexity for a more direct implementation of the neuron dynamics. We discuss here how synaptic sampling could still be emulated on other architectures.
Clearly, since synaptic sampling is inherently an online learning model, it cannot be directly implemented on neuromorphic hardware with only static synapses, such as TrueNorth [58], NeuroGrid [59], HiAER-IFAT [54], DYNAPs [60] and DeepSouth [61]. However, the network dynamics could be approximated by alternating short time windows of network simulation and reprogramming synaptic weights by an external device.
Architectures that do support synaptic plasticity on chip, such as Loihi [62] and the BrainScaleS 2 system [63], have so far quite limited weight resolutions (9-bit signed integer on Loihi and 12-bit on BrainScaleS 2). Since the 32-bit fixed-point format was found to be insufficient for this model (cf. Section IV-A), it is questionable, even with stochastic rounding, whether synaptic sampling can be implemented with such low weight resolution, and at what cost in performance. Also, in the case of Loihi, the size of the microcode that is allowed for computing synaptic updates is quite limited (e.g. 16 32-bit words). Besides, hardware accelerators for complex functions like the exponential function are not available on these two platforms, which makes the implementation more challenging, especially in the case of BrainScaleS 2, because the high data rate caused by accelerated operation requires fast execution of learning rules. These restrictions cast some doubt on whether complex learning mechanisms, such as the one considered here, can be implemented exactly. Also, an exact implementation of the synaptic sampling model seems infeasible on neuromorphic hardware with configurable (but not programmable) plasticity, like ROLLS [64], ODIN [65] and TITAN [66] (see [67] and [68] for reviews). However, it might be possible to realize simplified, approximate versions of synaptic sampling on these neuromorphic platforms.

VII. CONCLUSION
In this work, a reward-based synaptic sampling model is implemented on the prototype chip of the second generation SpiNNaker system. This real-time online learning system is demonstrated in a closed-loop online reinforcement learning task. While hardware features of the future SpiNNaker 2 and its prototypes have already been published, this is the first demonstration of learning synapses in a spiking network on SpiNNaker 2. As shown in Sections I and VI-C, this is also one of the most complex synaptic learning models ever implemented in neuromorphic hardware. The hardware accelerators and the software optimizations allow for efficient neural simulation with regard to computation time, memory, and power and energy consumption, while at the same time the SpiNNaker 2 system keeps the full flexibility of being processor based. For this application, we show slightly more than a factor of 2 speedup of the algorithm compared to a pure software implementation. Coupled with the 2.5-fold increase in clock frequency, we can theoretically simulate 5 times as many synapses of this type in SpiNNaker 2 as in SpiNNaker 1 in the same time span. In addition, we show a reduction of energy consumption by 62% compared to an implementation without the use of hardware accelerators and with external DRAM.
Yexin Yan received the Dipl.-Ing. (M.Sc.) in Electrical Engineering from Technische Universität Dresden, Germany, in 2016. He is currently pursuing the Ph.D. at the Chair of Highly-Parallel VLSI-Systems and Neuromorphic Circuits at Technische Universität Dresden. His research interests include hardware-software co-design for applications of brain-inspired algorithms on neuromorphic systems.
David Kappel received the PhD degree in computer science from the Graz University of Technology in 2018. He is currently a postdoctoral researcher with the TU Dresden and the University of Göttingen. His research interest focuses on models for synaptic plasticity, neural dynamics, Bayesian inference and hierarchical learning networks. His research interests include circuits for low-power systems-on-chip in advanced technology nodes, with special focus on clocking, data transmission and power management. He has experience in designing full-custom circuits for multi-processor systems-on-chip (MPSoCs), like ADPLLs, register files and high-speed on-chip and off-chip links, in academic and industrial research projects. He has been managing the full-custom circuit design and SoC integration for more than 12 MPSoC chips in 65nm, 28nm and 22nm CMOS technology. Currently he leads the chip design of the SpiNNaker2 neuromorphic computing system within the Human Brain Project (HBP). He is author or co-author of more than 56 publications and 10 patents (5 issued, 5 pending) in the above fields.

Steve Furber CBE FRS FREng is ICL Professor of Computer Engineering in the School of Computer Science at the University of Manchester, UK. After completing a BA in mathematics and a PhD in aerodynamics at the University of Cambridge, UK, he spent the 1980s at Acorn Computers, where he was a principal designer of the BBC Microcomputer and the ARM 32-bit RISC microprocessor. Over 120 billion variants of the ARM processor have since been manufactured, powering much of the world's mobile and embedded computing. He moved to the ICL Chair at Manchester in 1990 where he leads research into asynchronous and low-power systems and, more recently, neural systems engineering, where the SpiNNaker project is delivering a computer incorporating a million ARM processors optimised for brain modelling applications.