Training Spiking Neural Networks Using Lessons From Deep Learning

The brain is the perfect place to look for inspiration to develop more efficient neural networks. The inner workings of our synapses and neurons provide a glimpse at what the future of deep learning might look like. This article serves as a tutorial and perspective showing how to apply the lessons learned from several decades of research in deep learning, gradient descent, backpropagation, and neuroscience to biologically plausible spiking neural networks (SNNs). We also explore the delicate interplay between encoding data as spikes and the learning process; the challenges and solutions of applying gradient-based learning to SNNs; the subtle link between temporal backpropagation and spike timing-dependent plasticity; and how deep learning might move toward biologically plausible online learning. Some ideas are well accepted and commonly used among the neuromorphic engineering community, while others are presented or justified for the first time here. A series of companion interactive tutorials complementary to this article using our Python package, snnTorch, are also made available: https://snntorch.readthedocs.io/en/latest/tutorials/index.html.


Introduction
Deep learning has solved numerous problems in computer vision [1][2][3][4][5][6], speech recognition [7][8][9], and natural language processing [10][11][12][13][14]. Neural networks have been instrumental in outperforming world champions in a diverse range of games, from Go to Starcraft [15,16]. They are now surpassing the diagnostic capability of clinical specialists in numerous medical tasks [17][18][19][20]. But for all the state-of-the-art models designed every day, a Kaggle [21] contest for state-of-the-art energy efficiency would go to the brain, every time. A new generation of brain-inspired spiking neural networks (SNNs) is poised to bridge this efficiency gap.
The amount of computational power required to run top performing deep learning models has increased at a rate of 10× per year from 2012 to 2019 [22,23]. The rate of data generation is likewise increasing at an exponential rate. The backbone of OpenAI's ChatGPT language model, GPT-3, contains 175 billion learnable parameters, estimated to consume roughly 190,000 kWh to train [24][25][26]. Meanwhile, our brains operate within ∼12-20 W of power, all while churning through a multitude of sensory input and ensuring our involuntary biological processes do not shut down [27]. If our brains dissipated as much heat as state-of-the-art deep learning models, then natural selection would have wiped humanity out long before we could have invented machine learning. To be fair, none of the authors can emulate the style of Shakespeare, or write up musical guitar tabs with the same artistic flair as GPT-4.

Figure 1: Spiking neural networks (SNNs) have pervaded many streams of deep learning that are in need of low-power, resource-constrained, and often portable operation. The utility of SNNs even extends to the modeling of neural dynamics across individual neurons and higher-level neural systems.

Neuromorphic Computing: A Quick Snapshot
Neuromorphic ('brain-like') engineering strives to imitate the computational principles of the brain to drive down the energy cost of artificial intelligence systems. To replicate a biological system, we build on three parts:
1. Neuromorphic sensors that take inspiration from biological sensors, such as the retina or cochlea, and typically record changes in a signal instead of sampling it at regular intervals. Signals are only generated when a change occurs, and each signal is referred to as a 'spike'.
2. Neuromorphic algorithms that learn to make sense of spikes, known as spiking neural networks (SNNs). Instead of floating-point values, SNNs work with single-bit, binary activations (spikes) that encode information over time, rather than in an intensity. As such, SNNs take advantage of low-precision parameters and high spatial and temporal sparsity.
3. Neuromorphic hardware: these models are designed with power-efficient execution on specialized neuromorphic hardware in mind. Sparse activations reduce data movement both on and off a chip to accelerate neuromorphic workloads, which can lead to large power and latency gains when compared to the same task on conventional hardware.
Armed with these three components, neuromorphic systems are equipped to bridge the efficiency gap between today's and future intelligent systems.
What lessons can be learnt from the brain to build more efficient neural networks? Should we replicate the genetic makeup of a neuron right down to the molecular level [28,29]? Do we look at the way memory and processing coalesce within neurons and synapses [30,31]? Or should we aim to extract the learning algorithms that underpin the brain [32]? This paper hones in on the intricacies of training brain-inspired neuromorphic algorithms, ultimately moving towards the goal of harnessing natural intelligence to further improve our use of artificial intelligence. SNNs can already be optimized using the tools available to the deep learning community. But the brain-inspired nature of these emerging sensors, neuron models, and training methods are different enough to warrant a deep dive into biologically-inspired neural networks.

Neuromorphic Systems in the Wild
The overarching aim is to combine artificial neural networks (ANNs), which have already proven their worth in a broad range of domains, with the potential efficiency of SNNs [33]. So far, SNNs have staked their claim to a range of applications where power efficiency is of utmost importance. Figure 1 offers a small window into the uses of SNNs, and their domain only continues to expand. Spiking algorithms have been used to implement low-power artificial intelligence algorithms across the medical, robotics, and mixed-reality domains, amongst many other fields. Given their power efficiency, initial commercial products often target Edge computing applications, close to where the data is recorded.
In biosignal monitoring, nerve implants for brain-machine or biosignal interfaces have to pre-process information locally at minimum power, and lack the bandwidth to transmit data for cloud computation. Work done in that direction using SNNs includes on-chip spike sorting [34,35], biosignal anomaly detection [36][37][38][39] and brain-machine interfaces [40,41]. Beyond biomedical intervention, SNN models are also used in robotics in an effort to make robots more human-like and to drive down the cost of operation [42][43][44]. Unmanned aerial vehicles must also operate in low-power environments to extract as much value as possible from lightweight batteries, and have benefited from using neuromorphic processors [45]. Audio signals can be processed with sub-mW power consumption and low latency on neuromorphic hardware, as SNNs provide an efficient computational mechanism for temporal signal processing [46].
A plethora of efficient computer vision applications using spiking neural networks are reviewed in Ref. [47]. SNNs are equally suited to tracking objects such as satellites in the sky for space situational awareness [48,49] and to scientific computing [50], and they have been researched to promote sustainable uses of artificial intelligence, such as in monitoring material strain in smart buildings [51] and wind power forecasting in remote areas that face power delivery challenges [52]. At the 2018-19 Telluride Neuromorphic and Cognition Workshops, a neuromorphic robot was even built to play foosball! [53] Beyond neuromorphic applications, SNNs are also used to test theories about how natural intelligence may arise, from the higher-level learning rules of the brain [54] and how memories are formed [55], down to lower-level neuronal and synaptic dynamics [56].

Overview of Paper
The brain's neural circuitry is a physical manifestation of its neural algorithm; understanding one will likely lead to an understanding of the other. This paper will hone in on one particular aspect of neural models: those that are compatible with modern deep learning. Figure 2 provides an illustrated overview of the structure of this paper, and we will start from the ground up:
• In Section 2, we will rationalise the commonly accepted advantages of using spikes, and derive a spiking neuron model from basic principles.
• These spikes will be assigned meaning in Section 3 by exploring various spike encoding strategies, how they impact the learning process, and how objective and regularisation functions can be used to sway the spiking patterns of an SNN.
• In Section 4, the challenges of training SNNs using gradient-based optimisation will be explored, and several solutions will be derived. These include defining derivatives at spike times and using approximations of the gradient.
• In doing so, a subtle link between the backpropagation algorithm and the spike-timing-dependent plasticity (STDP) learning rule will emerge, and be used in the subsequent section to derive online variants of backprop that move towards biologically plausible learning mechanisms (Figure 2).

From Artificial to Spiking Neural Networks
The neural code refers to how the brain represents information, and while many theories exist, the code is yet to be cracked. There are several persistent themes across these theories, which can be distilled down to 'the three S's': spikes, sparsity, and static suppression. These traits are a good starting point to show why the neural code might improve the efficiency of ANNs. Our first observation is:

1. Spikes: Biological neurons interact via spikes
Neurons primarily process and communicate with action potentials, or "spikes", which are electrical impulses of approximately 100 mV in amplitude. In most neurons, the occurrence of an action potential is far more important than its subtle variations in shape [57]. Many computational models of neurons simplify the representation of a spike to a discrete, single-bit, all-or-nothing event (Figure 3(a-c)). Communicating high-precision activations between layers, and routing them around and between chips, is an expensive undertaking. Multiplying a high-precision activation with a high-precision weight requires converting the operands into integers and decomposing the multiplication into multiple additions, which introduces a carry propagation delay. On the other hand, a spike-based approach only requires a weight to be multiplied by a spike ('1'). This trades the cumbersome multiplication process for a simple memory read-out of the weight value.
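The trade described above can be sketched in a few lines of pure Python (the weights and spike vector are hypothetical values for illustration): a spike simply selects which weights are read out and accumulated, so no multiplication is needed.

```python
# Hypothetical weights and single-bit spike vector for illustration.
weights = [0.5, -0.2, 0.8, 0.1]
spikes = [1, 0, 1, 0]  # single-bit, all-or-nothing activations

# Dense approach: one multiplication per connection
dense = sum(w * s for w, s in zip(weights, spikes))

# Event-driven approach: a spike ('1') selects the weight directly,
# replacing the multiplication with a memory read-out and an addition
event_driven = sum(weights[i] for i, s in enumerate(spikes) if s == 1)

print(dense, event_driven)  # both sums select the same weights
```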
Despite the activation being constrained to a single bit, spiking networks are vastly different from binarised neural networks. What actually matters is the timing of the spike. Time is not a binarised quantity, and can be implemented using clock signals that are already distributed across a digital circuit. After all, why not use what is already available?
2. Sparsity: Biological neurons spend most of their time at rest, silencing a majority of activations to zero at any given time

Sparse tensors are cheap to store. The space that a simple data structure requires to store a matrix grows with the number of entries to store. In contrast, a data structure that stores a sparse matrix only consumes memory in proportion to the number of non-zero elements. Take the following list as an example:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Since most of the entries are zero, we could save space by writing out only the non-zero elements, as would occur in run-length encoding (indexing from zero): "1 at position 10; 1 at position 20". For example, Figure 3(c) shows how a single action potential can be represented by a sparsely populated vector. The sparser the list, the more space can be saved.
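The storage argument above can be made concrete with a minimal sketch: only the positions of non-zero entries are kept, so memory grows with the number of spikes rather than with the vector length.

```python
# The example spike vector from the text: two spikes in 21 entries.
dense = [0] * 21
dense[10] = 1
dense[20] = 1

# Sparse storage: keep (index, value) pairs for non-zero elements only.
sparse = [(i, v) for i, v in enumerate(dense) if v != 0]
print(sparse)  # [(10, 1), (20, 1)]

# The dense vector can be losslessly reconstructed from the sparse form.
rebuilt = [0] * len(dense)
for i, v in sparse:
    rebuilt[i] = v
assert rebuilt == dense
```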
3. Static Suppression (a.k.a., event-driven processing): The sensory system is more responsive to changes than to static input

The sensory periphery features several mechanisms that promote neuron excitability when subject to dynamic, changing stimuli, while suppressing responses to static, unchanging information. In retinal ganglion cells and the primary visual cortex, the spatiotemporal receptive fields of neurons promote excitable responses to regions of spatial contrast (or edges) over regions of spatial invariance [58]. Analogous mechanisms in early auditory processing include spectrotemporal receptive fields, which cause neurons to respond more favourably to changing frequencies in sound over static frequencies [59]. These processes occur on short timescales (milliseconds), while perceptual adaptation has also been observed on longer timescales (seconds), causing neurons to become less responsive to prolonged exposure to fixed stimuli [60][61][62].
A real-world engineering example of event-driven processing is the dynamic vision sensor (DVS), or the 'silicon retina', which is a camera that reports changes in brightness and stays silent otherwise (Figure 3(d-e)) [63][64][65][66][67]. This also means that each pixel activates independently of all other pixels, as opposed to waiting for a global shutter to produce a still frame. The reduction of active pixels leads to huge energy savings when compared to conventional CMOS image sensors. This mix of low-power and asynchronous pixels allows for fast clock speeds, giving commercially available DVS cameras a microsecond temporal resolution without breaking a sweat [68]. The difference between a conventional frame-based camera and an event-based camera is illustrated in Figure 4.

Figure 4: Functional difference between a conventional frame-based camera (above) and an event-based camera/silicon retina (below). The former records the scene as a sequence of images at a fixed frame rate. It operates independently of activity in the scene and can result in motion blur due to the global shutter. The silicon retina's output is directly driven by visual activity in the scene, as every pixel reacts to a change in illuminance.

Figure 5 (caption excerpt): The capacitive membrane and resistive ion channels form an RC circuit. When the membrane potential exceeds a threshold θ, a spike is generated. (c) Input spikes generated by I_in are passed to the neuron body via the dendritic tree. Sufficient excitation will cause spike emission at the output. (d) A simulation depicting the membrane potential U(t) reaching the threshold, arbitrarily set to θ = 0.5 V, which generates output spikes.

Spiking Neurons
ANNs and SNNs can model the same types of network topologies, but SNNs swap the artificial neuron model for a spiking neuron model instead (Figure 5). Much like the artificial neuron model [69], spiking neurons operate on a weighted sum of inputs. Rather than passing the result through a sigmoid or ReLU nonlinearity, the weighted sum contributes to the membrane potential U(t) of the neuron. If the neuron is sufficiently excited by this weighted sum, and the membrane potential reaches a threshold θ, then the neuron will emit a spike to its subsequent connections. But most neuronal inputs are spikes: very short bursts of electrical activity. It is quite unlikely for all input spikes to arrive at the neuron body in unison (Figure 5(c)). This indicates the presence of temporal dynamics that 'sustain' the membrane potential over time.
These dynamics were quantified back in 1907 [70]. Louis Lapicque stimulated the nerve fiber of a frog leg using a hacked-together current source, and observed how long it took the frog leg to twitch based on the amplitude and duration of the driving current I_in [71]. He concluded that a spiking neuron coarsely resembles a low-pass filter circuit consisting of a resistor R and a capacitor C, later dubbed the leaky integrate-and-fire (LIF) neuron (Figure 5(b)). This holds up a century later: physiologically, the capacitance arises from the insulating lipid bilayer forming the membrane of a neuron. The resistance arises from gated ion channels that open and close, modulating charge carrier diffusion across the membrane (Figure 5(a-b)) [72]. The dynamics of the passive membrane modeled using an RC circuit can be represented as:

τ dU(t)/dt = −U(t) + I_in(t)R        (1)

where τ = RC is the time constant of the circuit. Typical values of τ fall on the order of 1-100 milliseconds. The solution of (1) for a constant current input is:

U(t) = I_in R + (U_0 − I_in R) e^(−t/τ)        (2)

which shows how exponential relaxation of U(t) to a steady-state value follows current injection, where U_0 is the initial membrane potential at t = 0. To make this time-varying solution compatible with a sequence-based neural network, the forward Euler method is used in the simplest case to find an approximate solution to Equation (1):

U[t] = βU[t−1] + (1 − β) I_in[t]R        (3)

where time is explicitly discretised, β = e^(−1/τ) is the decay rate (or 'inverse time constant') of U[t], and the full derivation is provided in Appendix A.1.
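As a quick numerical check (the values of τ and U_0 are assumptions for illustration), the discrete recursion with β = e^(−1/τ) reproduces the exponential decay of the membrane in the absence of input current:

```python
import math

# Assumed values: membrane time constant of 10 time steps, unit initial potential.
tau = 10.0
beta = math.exp(-1 / tau)  # decay rate per unit time step
U0 = 1.0                   # initial membrane potential

# With no input current, Equation (3) reduces to U[t] = beta * U[t-1].
U = U0
for t in range(1, 6):
    U = beta * U  # one step of passive membrane decay

# After 5 steps the recursion matches the continuous solution U(t) = U0 * exp(-t/tau).
assert abs(U - U0 * math.exp(-5 / tau)) < 1e-12
```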
In deep learning, the weighting factor of an input is typically a learnable parameter. Relaxing the physically viable assumptions made thus far, the coefficient of input current in Equation (3) is subsumed into a learnable weight W, and the input is simplified to WX[t], where X[t] is an input to the neuron. In general, W would be a matrix, but it is treated here as a single input to a single neuron for simplicity. Finally, accounting for spiking and membrane potential reset gives:

U[t] = βU[t−1] + WX[t] − S_out[t−1]θ        (4)

where S_out[t] ∈ {0, 1} is the output spike generated by the neuron: if activated (S_out = 1), the reset term subtracts the threshold θ from the membrane potential. Otherwise, the reset term has no effect (S_out = 0). A complete derivation of Equation (4) with all simplifying assumptions is provided in Appendix A.1. A spike is generated if the membrane potential exceeds the threshold:

S_out[t] = 1 if U[t] > θ, and 0 otherwise        (5)

An example of how this might be coded in Python is as follows:

1 def lif(X, U):
2     beta = 0.9    # membrane potential decay rate
3     W = 0.5       # input weight (a learnable parameter)
4     theta = 1.0   # spike threshold
5
6     # Equation (4): decay the membrane and integrate the input
7     U = beta * U + W * X
8     S = int(U > theta)  # Equation (5): spike on threshold crossing
9     U = U - S * theta   # reset-by-subtraction
10    return S, U

In your exploration of LIF neurons, you may come across many slight variants. Some variations include:
• The spike threshold (Line 8) might be applied before updating the membrane potential (Line 7). This induces a one-step delay between the input signal X and when it can trigger a spike.
• The above derivations use a 'reset-by-subtraction' (or soft reset) mechanism. But an alternative shown in Appendix A.1 is a 'reset-to-zero' mechanism (or hard reset).
• The factor (1 − β) from Equation (3) may be included as a coefficient to the input term, WX. This will allow you to simulate a neuron model with realistic time constants, but does not offer any advantages when ultimately applied to deep learning.
Alternatively to the above code block, the neuron can be executed continuously in snnTorch over discrete time steps, by instantiating a neuron and calling it once per time step with the input data and hidden state. An extensive list of alternative neuron types (both LIF and otherwise) is detailed in Section 2.2, along with a brief overview of their use cases.
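A minimal sketch of this discrete-time loop, using a pure-Python stand-in for the lif function (the constant input drive and parameter values are assumptions), shows the neuron settling into a regular firing pattern:

```python
def lif(X, U, beta=0.9, W=0.5, theta=1.0):
    # One discrete time step of a leaky integrate-and-fire neuron.
    U = beta * U + W * X   # decay the membrane and integrate the input
    S = int(U > theta)     # spike if the threshold is exceeded
    U = U - S * theta      # reset-by-subtraction
    return S, U

# Drive the neuron with a constant input over discrete time steps.
U = 0.0
spikes = []
for step in range(20):
    S, U = lif(1.0, U)
    spikes.append(S)

print(spikes)  # periodic spiking once the membrane charges up
```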
A graphical depiction of the LIF neuron is provided in Figure 6. The recurrent neuron in (a) is 'unrolled' across time steps in (b), where the reset mechanism is included via S_out θ, and explicit recurrence S_out V is omitted for brevity. The unrolled computational graph is a good way to think about leaky integrate-and-fire neurons. As you will soon see in Section 4, this unrolled representation allows us to borrow many of the tricks developed by deep learning researchers in training networks of LIF neurons.
Practical Note: Soft Reset Enables Better Performance

In our experience, neuron models that apply a variant of the soft reset to Equation (4) can lead to networks with higher performance (e.g., accuracy), shown in the expression below, where the reset mechanism in Equation (4) is scaled by the decay rate β:

U[t] = βU[t−1] + WX[t] − βS_out[t−1]θ

Why this improves performance is an open question.

Figure 6: Computational steps in solving the LIF neuron model. (a) A recurrent representation of a spiking neuron. Hidden state decay is referred to as 'implicit recurrence', and external feedback from the spike is 'explicit recurrence', where V is the recurrent weight (omitted from Equation (4)) [73]. (b) An unrolled computational graph of the neuron where time flows from left to right. −θ represents the reset term from Equation (4), while β is the decay rate of U over time. Explicit recurrence is omitted for clarity. Note that kernel-based neuron models replace implicit recurrence with a time-varying filter [74][75][76].
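The two reset variants from the practical note can be compared directly in a short sketch (the parameter values and post-spike membrane value are assumptions for illustration):

```python
beta, W, theta = 0.9, 0.5, 1.0  # assumed decay rate, weight, threshold

def step_soft(X, U, S):
    # Equation (4): reset-by-subtraction with the full threshold.
    return beta * U + W * X - S * theta

def step_soft_scaled(X, U, S):
    # Variant: the reset term is scaled by the decay rate beta.
    return beta * U + W * X - beta * S * theta

U = 1.2  # membrane potential at the step a spike was emitted (S = 1)
print(step_soft(0.0, U, 1))         # 0.9*1.2 - 1.0, approx. 0.08
print(step_soft_scaled(0.0, U, 1))  # 0.9*1.2 - 0.9, approx. 0.18
```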

Alternative Spiking Neuron Models
The LIF neuron is but one of many spiking neuron models. Some other models you might encounter are listed below:
• Integrate-and-Fire (IF): The leakage mechanism is removed; β = 1 in Equation (4).
• Current-based: Often referred to as CuBa neuron models, these incorporate synaptic conductance variation into leaky integrate-and-fire neurons. If the default LIF neuron is a first-order low-pass filter, then CuBa neurons are a second-order low-pass filter. The input spike train undergoes two rounds of 'smoothing', which means the membrane potential has a finite rise time rather than experiencing discontinuous jumps in response to incoming spikes [77][78][79]. A depiction of such a neuron with a finite rise time of membrane potential is shown in Figure 5(d).
• Recurrent Neurons: The output spikes of a neuron are routed back to the input, labelled in Figure 6(a) with explicit recurrence. Rather than an alternative model, recurrence is a topology that can be applied to any other neuron, and can be implemented in different ways: i) one-to-one recurrence, where each neuron routes its own spike to itself, or ii) all-to-all recurrence, where the output spikes of a full layer are weighted and summed (e.g., via a dense or convolutional layer), before being fed back to the full layer [80].
• Kernel-based Models: Also known as the spike-response model, where a pre-defined kernel (such as the 'alpha function': see Appendix C.1) is convolved with input spikes [74][75][76]. Having the option to define the kernel to be any shape offers significant flexibility.
• Deep learning inspired spiking neurons: Rather than drawing upon neuroscience, it is just as possible to start with primitives from deep learning and apply spiking thresholds. This helps with extending the short-term memory capacity of basic recurrent neurons. A couple of examples include spiking LSTMs [38], and Legendre Memory Units [81]. More recently, transformers have been used to further improve long-range memory dependencies in data. In a similar manner, SpikeGPT approximated self-attention into a recurrent model, providing the first demonstration of natural language generation with SNNs [82].
• Higher-complexity neuroscience-inspired models: A large variety of more detailed neuron models are out there. These account for biophysical realism and/or morphological details that are not represented in simple leaky integrators. The most renowned models include the Hodgkin-Huxley model [72], and the Izhikevich (or Resonator) model [83], which can reproduce electrophysiological results with better accuracy.
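As an illustration of the CuBa neuron described above, the two rounds of 'smoothing' can be sketched in pure Python (the parameter names and values are assumptions, loosely mirroring a synaptic-current model): an input spike first charges a synaptic current, which in turn charges the membrane, giving a finite rise time.

```python
alpha, beta = 0.8, 0.9  # assumed synaptic and membrane decay rates

def cuba_step(X, I, U):
    I = alpha * I + X  # first low-pass filter: synaptic current
    U = beta * U + I   # second low-pass filter: membrane potential
    return I, U

I, U = 0.0, 0.0
trace = []
for X in [1] + [0] * 9:  # a single input spike, then silence
    I, U = cuba_step(X, I, U)
    trace.append(U)

# The membrane keeps rising for several steps after the input spike,
# instead of jumping discontinuously and immediately decaying.
print(trace)
```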
The main takeaway is: use the neuron model that suits your task. Power efficient deep learning will call for LIF models. Improving performance may call for using recurrent SNNs. Driving performance even further (often at the expense of efficiency) may demand methods derived from deep learning, such as spiking LSTMs, and recurrent spiking transformers [82,84]. Or perhaps deep learning is not your goal. If you are aiming to construct a brain model, or are tasked with an exploration of linking low-level dynamics (ionic, conductance-driven, or otherwise) with higher-order brain function, then perhaps more detailed, biophysically accurate models will be your friend. The following code snippets show how some of these neurons can be instantiated in snnTorch:

import snntorch as snn

# all thresholds default to threshold=1
lif = snn.Leaky(beta=0.9)  # vanilla leaky integrate-and-fire neuron
int_fire = snn.Leaky(beta=1.0)  # integrate-and-fire neuron
cuba = snn.Synaptic(beta=0.9, alpha=0.8)  # current-based neuron
rlif_1 = snn.RLeaky(beta=0.9, all_to_all=True)  # all-to-all recurrent lif neuron
rlif_2 = snn.RLeaky(beta=0.9, all_to_all=False)  # one-to-one recurrent lif neuron
slstm = snn.SLSTM(input_size=10, hidden_size=1000)  # spiking LSTM: 10 inputs, 1000 outputs

With the neurons instantiated, it is just a matter of passing in two arguments: i) input data, and ii) their hidden state(s).
The hidden states will be updated recursively. Having formulated a spiking neuron in a discrete-time, recursive form, we can now 'borrow' the developments in training RNNs and sequence-based models. This recursion is illustrated using an 'implicit' recurrent connection for the decay of the membrane potential, and is distinguished from 'explicit' recurrence where the output spikes S out are fed back to the input as in recurrent SNNs ( Figure 6).
While there are plenty of more physiologically accurate neuron models [72], the leaky integrate-and-fire model is the most prevalent in gradient-based learning due to its computational efficiency and ease of training. Before moving on to training SNNs in Section 4, let us gain some insight into what spikes actually mean, and how they might represent information, in the next section.

Practical Note: A Comparison of Various Neuron Models
With all of these neuron variants, what is the 'correct' approach? This ultimately depends on your goals, and is an actively researched question. In our experience:
• LIF neurons are the most commonly used in the deep learning context. They sit in the sweet spot between biological plausibility and practicality. It is relatively straightforward to adopt techniques used in training recurrent neural networks (RNNs) to SNNs, as will be explained in detail in Section 4.
• IF models are commonly used in low-power neuromorphic hardware (e.g., SynSense's Speck and Xylo systems; BrainChip's Akida). They remove the 'multiply' step by β when updating the membrane potential. This simplification comes at the cost of treating all historical data in the same way; i.e., an event that occurred 10 seconds in the past is weighted equally to an event that occurred 1 second in the past.
• CuBa neuron models add additional complexity without improving learning capacity, and they tend to be less common as a result. On the plus side, by smoothing out the membrane potential evolution (see Figure 5(d)), the spike time becomes differentiable with respect to the membrane potential. More on this in Section 4.
• Recurrent spiking neurons have been shown to improve learning convergence when training SNNs on time-varying tasks when using all-to-all connections [85]. One-to-one connections may not appear to help on the face of it, but by applying a fixed, negative recurrent weight, they can be used as an alternative way to model an adaptive leaky integrate-and-fire neuron: for each output spike, rather than updating the threshold, the output spike can be used to inhibit the membrane potential instead.
• Kernel-based Models are, in theory, equally performant as all other first-order neuron models. However, they demand more memory, as solving the state of the neuron at each time step depends on knowledge of the full history of spikes. In contrast, the formulation in Equation (4) only requires information from the previous time step. SLAYER [75] popularized their use in multi-layer deep learning, and its implementation was parallelized to improve training time in EXODUS [86].
• Deep learning inspired neurons deviate from biology and adopt models from deep learning instead. For example, LSTMs are known to have better memory capacity than basic neurons with recurrent connections. Adding a firing threshold to the state of the LSTM can reduce inter-layer spike traffic. However, some of these neurons may introduce higher-dimensional states which can be more expensive to compute. Where leaky integrate-and-fire neurons and recurrent SNNs are insufficient, deep learning-derived models may be a good alternative.
• Higher-complexity neuroscience-inspired neurons are less often used in deep learning as they involve solving highly stiff ordinary differential equations; i.e., a tiny perturbation may lead to negligible influence on the membrane potential, but the very same perturbation may skyrocket into an action potential. Harnessing gradients from such an unstable signal is not a pleasant experience. As a glimmer of hope, the neuromorphic folk at Intel Labs have been aiming to bridge the gap between low-level neuronal detail in Resonator neurons and higher-level functionality. They have recently discovered emergent properties of such models that allow for efficient signal processing (e.g., spectral decomposition) and factorization of high-dimensional vectors [87].
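The one-to-one recurrence trick mentioned above can be sketched in pure Python (the fixed negative weight V and input drive are assumed values): each output spike is fed back to inhibit the neuron's own membrane potential, lowering its firing rate much like an adaptive threshold would.

```python
def run_lif(V, num_steps=20, beta=0.9, W=0.5, theta=1.0):
    # LIF neuron with one-to-one recurrence: the previous output spike
    # is weighted by V and added back to the membrane potential.
    U, S = 0.0, 0
    spikes = []
    for _ in range(num_steps):
        U = beta * U + W * 1.0 + V * S  # constant input plus spike feedback
        S = int(U > theta)
        U = U - S * theta               # reset-by-subtraction
        spikes.append(S)
    return spikes

spikes_plain = run_lif(V=0.0)        # no recurrence
spikes_inhibited = run_lif(V=-0.3)   # negative feedback suppresses firing
print(sum(spikes_plain), sum(spikes_inhibited))
```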

The Neural Code
Light is what we see when the retina converts photons into spikes. Odors are what we smell when the nose processes volatilised molecules into spikes. Tactile perceptions are what we feel when our nerve endings turn pressure into spikes. The brain trades in the global currency of the spike. If all spikes are treated identically, then how do they carry meaning? With respect to spike encoding, there are two parts of a neural network that must be treated separately (Figure 7):
1. Input encoding: conversion of input data into spikes, which are then passed to a neural network.
2. Output decoding: training the output of a network to spike in a way that is meaningful and informative.

Input encoding
Input data to an SNN does not necessarily have to be encoded into spikes. It is acceptable to pass continuous values as input, much like how the perception of light starts with a continuous stream of photons impinging upon our photoreceptor cells.
Static data, such as an image, can be treated as a direct current (DC) input, with the same features passed to the input layer of the SNN at every time step. But this does not exploit the way SNNs extract meaning from temporal data. In general, three encoding mechanisms have been popularised with respect to input data:
1. Rate coding converts input intensity into a firing rate or spike count.
2. Latency (or temporal) coding converts input intensity to a spike time.
3. Delta modulation converts a temporal change of input intensity into spikes, and otherwise remains silent.
This is a non-exhaustive list, and these codes are not necessarily independent of each other.

Figure 7: Input data to an SNN may be converted into a firing rate, a firing time, or the data can be delta modulated. Alternatively, the input to the network can also be passed in without conversion, which experimentally represents a direct or variable current source applied to the input layer of neurons. The network itself may be trained to enable the correct class to have the highest firing rate or to fire first, amongst many other encoding strategies.

Rate Coded Inputs
How does the sensory periphery encode information about the world into spikes? When bright light is incident upon our photoreceptor cells, the retina triggers a spike train to the visual cortex. Hubel and Wiesel's Nobel prize-winning research on visual processing indicates that a brighter input or a favourable orientation of light corresponds to a higher firing rate [58]. As a rudimentary example, a bright pixel is encoded into a high firing rate, whereas a dark pixel would result in low-frequency firing. Measuring the firing rate of a neuron can become quite nuanced. The simplest approach is to apply an input stimulus to a neuron, count up the total number of action potentials it generates, and divide that by the duration of the trial. Although straightforward, the problem here is that the dynamics of a neuron vary across time.
There is no guarantee the firing rate at the start of the trial is anything near the rate at the end.
An alternative method counts the spikes over a very short time interval ∆t. For a small enough ∆t, the spike count can be constrained to either 0 or 1, limiting the total number of possible outcomes to only two. By repeating this experiment multiple times, the average number of spikes (over trials) occurring within ∆t can be found. This average must be equal to or less than 1, interpreted as the observed probability that a neuron will fire within the brief time interval. To convert it into a time-dependent firing rate, the trial average is divided by the duration of the interval. This probabilistic interpretation of the rate code can be distributed across multiple neurons, where counting up the spikes from a collection of neurons gives rise to a population code [88]. This representation is quite convenient for sequential neural networks. Each discrete time step in an RNN can be thought of as lasting for a brief duration ∆t in which a spike either occurs or does not occur. A formal example of how this takes place is provided in Appendix B.1. Data can be rate-coded using the spikegen module within snnTorch:

from snntorch import spikegen
import torch
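The probabilistic interpretation above can be sketched in pure Python (a stand-in for a rate encoder, assuming intensities normalised to [0, 1]): at each time step, a pixel of intensity p emits a spike with probability p, so the trial-averaged spike count recovers the underlying intensity.

```python
import random

def rate_encode(intensity, num_steps, seed=0):
    # Bernoulli rate code: spike with probability equal to the
    # normalised input intensity at each discrete time step.
    rng = random.Random(seed)
    return [int(rng.random() < intensity) for _ in range(num_steps)]

bright = rate_encode(0.9, 1000)  # bright pixel: high firing rate
dark = rate_encode(0.1, 1000)    # dark pixel: low firing rate

# Averaging the spike train over many steps approximates the intensity.
print(sum(bright) / 1000, sum(dark) / 1000)
```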

Latency Coded Inputs
A latency, or temporal, code is concerned with the timing of a spike. The total number of spikes is no longer consequential. Rather, when the spike occurs is what matters. For example, a time-to-first-spike mechanism encodes a bright pixel as an early spike, whereas a dark input will spike last, or simply never spike at all. When compared to the rate code, latency-encoding mechanisms assign much more meaning to each individual spike.
Neurons can respond to sensory stimuli over an enormous dynamic range. In the retina, neurons can detect individual photons to an influx of millions of photons [89][90][91][92][93]. To handle such widely varying stimuli, sensory transduction systems likely compress stimulus intensity with a logarithmic dependency. For this reason, a logarithmic relation between spike times and input feature intensity is ubiquitous in the literature (Appendix B.2) [94,95].
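One common form of this logarithmic relation (a minimal sketch under our own naming, mirroring the RC-membrane derivation detailed in Appendix B.2) takes the spike time to be the time at which a membrane driven by a constant input x reaches its firing threshold:

```python
import math

def latency_code(x, tau=5.0, threshold=0.01):
    """Time for an RC membrane driven by constant input x to reach threshold.
    Larger x -> earlier spike; subthreshold x never spikes (returns None)."""
    if x <= threshold:
        return None  # input too weak to ever cross threshold
    return tau * math.log(x / (x - threshold))

print(latency_code(1.0))   # bright input: spikes early
print(latency_code(0.02))  # dim input: spikes late
```

Note how the logarithm compresses the enormous dynamic range of the input into a modest range of spike times, as the sensory-transduction argument above suggests.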
Although sensory pathways appear to transmit rate-coded spike trains to our brains, it is likely that temporal codes dominate the actual processing that goes on within the brain. More on this in Section 3.2.3. Data can be latency-coded using the spikegen module:

```python
from snntorch import spikegen
import torch

S = spikegen.latency(torch.rand(128, 1, 28, 28), num_steps=100, tau=5, threshold=0.01)
```

Delta Modulated Inputs
Delta modulation is based on the notion that neurons thrive on change, which underpins the operation of the silicon retina camera that only generates an input when there has been a sufficient change of input intensity over time. If there is no change in your field of view, then your photoreceptor cells are much less prone to firing. Computationally, this would take a time-series input and feed a thresholded matrix difference to the network. While the precise implementation may vary, a common approach requires the difference to be both positive and greater than some pre-defined threshold for a spike to be generated. This encoding technique is also referred to as 'threshold crossing'. Alternatively, changes in intensity can be tracked over multiple time steps, and other approaches account for negative changes. For an illustration, see Figure 4, where the 'background' is not captured over time. Only the moving blocks are recorded, as it is those pixels that are changing.
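A minimal sketch of this thresholded-difference scheme on a 1-D signal follows (function names are ours; a real DVS pixel operates per pixel and in continuous time):

```python
def delta_modulate(trace, threshold=0.1, off_spike=False):
    """Emit a spike whenever the signal increases by more than `threshold`
    since the previous sample; optionally emit -1 for large decreases."""
    spikes = []
    for prev, curr in zip(trace[:-1], trace[1:]):
        diff = curr - prev
        if diff > threshold:
            spikes.append(1)
        elif off_spike and diff < -threshold:
            spikes.append(-1)
        else:
            spikes.append(0)
    return spikes

signal = [0.0, 0.0, 0.5, 0.5, 0.2, 0.2]
print(delta_modulate(signal))                  # spikes only where the input rises
print(delta_modulate(signal, off_spike=True))  # also marks the drop with -1
```

The static portions of the signal generate no events at all, which is precisely why delta modulation yields such sparse data on unchanging scenes.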
Assuming the variable X stores a batch of videos, say with 100 frames in each sample, delta modulation can be applied using the spikegen module:

```python
from snntorch import spikegen
import torch

print(X.size())
# >> torch.Size([100, 128, 1, 28, 28])  # time x batch-size x channel x x-dim x y-dim

S = spikegen.delta(X, threshold=0.1)  # convert X to delta modulated spikes in S
```

The previous techniques tend to 'convert' data into spikes. But it is more efficient to natively capture data in 'pre-encoded', spiking form. Each pixel in a DVS camera and each channel in a silicon cochlea uses delta modulation to record changes in the visual or audio scene. Some examples of neuromorphic benchmark datasets are described in Table 1. A comprehensive series of neuromorphic-relevant datasets is accounted for in NeuroBench [96].
Table 1: Examples of neuromorphic datasets recorded with event-based cameras and cochlear models.

Vision datasets

• ASL-DVS [97]: 100,800 samples of American Sign Language recorded with a DAVIS camera.
• DAVIS Dataset [98]: Includes spikes, frames, and inertial measurement unit recordings of indoor and outdoor scenes.
• DVS Gestures [99]: 11 different hand gestures recorded under 3 different lighting conditions.
• DVS Benchmark [100]: DVS benchmark datasets for object tracking, action recognition, and object recognition.
• MVSEC [101]: Spikes, frames, and optical flow from stereo cameras for indoor and outdoor scenarios.
• N-MNIST [102]: Spiking version of the classic MNIST dataset, generated by converting digits from a screen using saccadic motion.
• POKER DVS [103]: 4 classes of playing cards flipped in rapid succession in front of a DVS.
• DSEC [104]: A stereo event camera dataset for driving scenarios.

Audio datasets

• N-TIDIGITS [105]: Speech recordings from the TIDIGITS dataset converted to spikes with a silicon cochlea.
• SHD [106]: Spiking version of the Heidelberg Digits dataset converted using a simulated cochlear model.
• SSC [106]: Spiking version of the Speech Commands dataset converted using a simulated cochlear model.

Practical Note: Input Encoding
What if you start off with a non-spiking dataset? Applying these input encoding mechanisms will almost always lead to accuracy/performance degradation. Information loss is guaranteed. It is often wiser to apply the preprocessing techniques that are prevalent in the literature; e.g., if you have EEG data, a Fourier transform is often appropriate. Alternatively, SpikeGPT introduced binarized embeddings that 'learn' the optimal spike-based encoding of a sequence of input tokens [82]. If you must encode data into spikes, we have found that rate codes do less harm to accuracy/loss minimization than latency codes. In the ideal scenario, your sensor would naturally capture data as spikes rather than having to go through conversion and compression steps.

Output Decoding
Encoding input data into spikes can be thought of as how the sensory periphery transmits signals to the brain. On the other side of the same coin, decoding these spikes provides insight on how the brain handles these encoded signals. In the context of training an SNN, the encoding mechanism does not constrain the decoding mechanism. Shifting our attention from the input of an SNN, how might we interpret the firing behavior of output neurons?
1. Rate coding chooses the output neuron with the highest firing rate, or spike count, as the predicted class
2. Latency (or temporal) coding chooses the output neuron that fires first as the predicted class
3. Population coding applies the above coding schemes (typically a rate code) with multiple neurons per class
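The first two decoding schemes can be sketched directly on raw output spike trains (a toy illustration with our own function names; a population code simply applies `rate_decode` to pooled groups of neurons):

```python
def rate_decode(spike_trains):
    """Predicted class = output neuron with the highest spike count."""
    counts = [sum(train) for train in spike_trains]
    return counts.index(max(counts))

def latency_decode(spike_trains):
    """Predicted class = output neuron that fires first (ties broken by index)."""
    first = [train.index(1) if 1 in train else len(train) for train in spike_trains]
    return first.index(min(first))

# Three output neurons observed over five time steps:
out = [[0, 1, 0, 1, 1],   # class 0: 3 spikes, first spike at t=1
       [1, 0, 0, 0, 0],   # class 1: 1 spike, first spike at t=0
       [0, 0, 1, 0, 0]]   # class 2: 1 spike, first spike at t=2
print(rate_decode(out), latency_decode(out))
```

Note that the two schemes can disagree on the same spike raster: here the rate code picks class 0, while the latency code picks class 1.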

Rate Coded Outputs
Consider a multi-class classification problem, where N_C is the number of classes. A non-spiking neural network would select the neuron with the largest output activation as the predicted class. For a rate-coded spiking network, the predicted class is the neuron that fires with the highest frequency. As each neuron is simulated for the same number of time steps, simply choose the neuron with the highest spike count (Appendix B.3).

Latency Coded Outputs
There are numerous ways a neuron might encode data in the timing of a spike. As in the case with latency-coded inputs, it could be that a neuron representing the correct class fires first. This addresses the energy burden that arises from the multiple spikes needed in rate codes. In hardware, the need for fewer spikes reduces the frequency of memory accesses which is another computational burden in deep learning accelerators.
Biologically, does it make sense for neurons to operate on a time-to-first-spike principle? How might we define 'first' if our brains are not constantly resetting to some initial, default state? This is quite easy to address conceptually. The idea of a latency or temporal code is motivated by our response to a sudden input stimulus. For example, when viewing a static, unchanging visual scene, the retina undergoes rapid, yet subtle, saccadic motion. The scene projected onto the retina changes every few hundred milliseconds. It could very well be the case that the first spike is measured with respect to the reference signal generated by this saccade.

Rate vs. Latency Code
Whether neurons encode information as a rate, as latency, or as something wholly different, is a topic of much controversy. We do not seek to crack the neural code here, but instead aim to provide intuition on when SNNs might benefit from one code over the other.

Advantages of Rate Codes
• Error tolerance: if a neuron fails to fire, there are ideally many more spikes to reduce the burden of this error.
• More spiking promotes more learning: additional spikes provide a stronger gradient signal for learning via error backpropagation. As will be described in Section 4, the absence of spiking can impede learning convergence (more commonly referred to as the 'dead neuron problem').

Advantages of Latency Codes
• Power consumption: generating and communicating fewer spikes means less dynamic power dissipation in tailored hardware. It also reduces memory access frequency due to sparsity, as a vector-matrix product for an all-zero input vector returns a zero output.
• Speed: the reaction time of a human is roughly in the ballpark of 250 ms. If the average firing rate of a neuron in the human brain is on the order of 10 Hz (which is likely an overestimation [107]), then one can only process about 2-3 spikes in this reaction time window. In contrast, latency codes rely on a single spike to represent information. This issue with rate codes may be addressed by coupling it with a population code: if a single neuron is limited in its spike count within a brief time window, then just use more neurons [88]. This comes at the expense of further exacerbating the power consumption problem of rate codes.
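The population-code remedy in the last bullet can be made concrete: pooling the spikes of many identically tuned neurons gives a usable rate estimate even within a single time step. The sketch below assumes ideal Bernoulli neurons and is our own illustration:

```python
import random

def population_rate(intensity, num_neurons, num_steps, seed=0):
    """Empirical firing rate pooled over a population of identically
    tuned Bernoulli neurons observed for num_steps time steps."""
    rng = random.Random(seed)
    trials = num_neurons * num_steps
    count = sum(1 for _ in range(trials) if rng.random() < intensity)
    return count / trials

# A single neuron observed for one step is all-or-nothing...
print(population_rate(0.3, num_neurons=1, num_steps=1))
# ...but pooling 10,000 neurons over that same step recovers the rate.
print(population_rate(0.3, num_neurons=10_000, num_steps=1))
```

The trade-off stated above is visible in the code: the accuracy of the estimate is bought with `num_neurons` times more spiking activity, and hence more power.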
The power consumption benefit of latency codes is also supported by observations in biology, where nature optimises for efficiency. Olshausen and Field's work in 'What is the other 85% of V1 doing?' methodically demonstrates that rate-coding can only explain, at most, the activity of 15% of neurons in the primary visual cortex (V1) [107]. If our neurons indiscriminately defaulted to a rate code, this would consume an order of magnitude more energy than a temporal code. The mean firing rate of our cortical neurons must necessarily be rather low, which is supported by temporal codes.
Lesser explored encoding mechanisms in gradient-based SNNs include using spikes to represent a prediction or reconstruction error [108]. The brain may be perceived as an anticipatory machine that takes action based on its predictions. When these predictions do not match reality, spikes are triggered to update the system.
Some assert the true code must lie between rate and temporal codes [109], while others argue that the two may co-exist and only differ based on the timescale of observation: rates are observed for long timescales, latency for short timescales [110]. Some reject rate codes entirely [111]. This is one of those instances where a deep learning practitioner might be less concerned with what the brain does, and prefers to focus on what is most useful.

Objective Functions
While it is unlikely that our brains use something as explicit as a cross-entropy loss function, it is fair to say that humans and animals have baseline objectives [112]. Biological variables, such as dopamine release, have been meaningfully related to objective functions from reinforcement learning [113]. Predictive coding models often aim to minimise the information entropy of sensory encodings, such that the brain can actively predict incoming signals and inhibit what it already expects [114]. The multi-faceted nature of the brain's function likely calls for the existence of multiple objectives [115]. How the brain can be optimised using these objectives remains a mystery, though we might be able to gain insight from multi-objective optimisation [116].
A variety of loss functions can be used to encourage the output layer of a network to fire as a rate or temporal code. The optimal choice is largely unsettled, and tends to be a function of the network hyperparameters and complexity of the task at hand. All objective functions described below have successfully trained networks to competitive results on a variety of datasets, though come with their own trade-offs.

Spike Rate Objective Functions
A summary of approaches commonly adopted in supervised learning classification tasks with SNNs to promote the correct neuron class to fire with the highest frequency is provided in Table 2. In general, either the cross-entropy loss or the mean square error is applied to the spike count or the membrane potential of the output layer of neurons.
Cross-Entropy Spike Rate: The total number of spikes for each neuron in the output layer is accumulated over time into a spike count vector c⃗ ∈ ℕ^{N_C} (Equation (27) in Appendix B.3), for N_C classes. A multi-class categorical probability distribution is obtained by treating the spike counts as logits in the softmax function. Cross-entropy minimisation is used to increase the spike count of the correct class, while suppressing the counts of the incorrect classes [99,117] (Appendix B.4).
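In plain terms, this loss is a softmax over spike counts followed by the negative log-likelihood of the correct class. A from-scratch sketch (our own minimal implementation; snnTorch wraps the equivalent up in its functional module):

```python
import math

def cross_entropy_spike_rate(spike_counts, target_class):
    """Treat spike counts as logits: stable softmax, then negative log-likelihood."""
    m = max(spike_counts)
    exps = [math.exp(c - m) for c in spike_counts]  # subtract max for stability
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target_class])

counts = [12, 3, 5]  # output-layer spike counts over the full simulation
print(cross_entropy_spike_rate(counts, target_class=0))  # small: correct class dominates
print(cross_entropy_spike_rate(counts, target_class=1))  # large: target class is quiet
```

Minimising this loss pushes the correct class's count up and all other counts down, which is exactly the suppression effect discussed later in this section.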
Mean Square Spike Rate: The spike counts of both correct and incorrect classes are specified as targets.
The mean square errors between the actual and target spike counts for all output classes are summed together. In practice, the target is typically represented as a proportion of the total number of time steps: e.g., the correct class should fire at 80% of all time steps, while incorrect classes should fire 20% of the time [75,[118][119][120] (Appendix B.5).
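This target-proportion loss can be sketched as follows (a toy version with our own names and default target rates matching the 80%/20% example above):

```python
def mse_spike_rate(spike_counts, target_class, num_steps,
                   on_rate=0.8, off_rate=0.2):
    """Sum of squared errors between actual and target spike counts,
    with targets given as a proportion of the simulated time steps."""
    loss = 0.0
    for idx, count in enumerate(spike_counts):
        target = (on_rate if idx == target_class else off_rate) * num_steps
        loss += (count - target) ** 2
    return loss

print(mse_spike_rate([8, 2, 2], target_class=0, num_steps=10))  # on target: loss is 0
print(mse_spike_rate([2, 8, 2], target_class=0, num_steps=10))  # wrong neuron dominant: penalised
```

Unlike the cross-entropy variant, incorrect classes are driven toward a non-zero firing target rather than complete silence, which keeps neurons on the cusp of firing.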

Membrane Potential
Maximum Membrane: The logits are obtained by taking the maximum value of the membrane potential over time, which are then applied to a softmax cross entropy function. By encouraging the membrane potential of the correct class to increase, it is expected to encourage more regular spiking [121][122][123]. A variant is to simply sum the membrane potential across all time steps to obtain the logits [123] (Appendix B.6).
Mean Square Membrane: Each output neuron has a target membrane potential specified for each time step, and the losses are summed across both time and outputs. To implement a rate code, a superthreshold target should be assigned to the correct class across time steps (Appendix B.7).
With a sufficient number of time steps, passing the spike count to the objective function is the more widely adopted approach, as it operates directly on spikes. The membrane potential acts only as a proxy for increasing the spike count, and it is not an observable spiking variable, which may partially offset the computational benefits of using spikes.
Cross-entropy approaches aim to suppress the spikes from incorrect classes, which may drive weights in a network to zero. This could cause neurons to go quiet in absence of additional regularisation. By using the mean square spike rate, which specifies a target number of spikes for each class, output neurons can be placed on the cusp of firing. Therefore, the network is expected to adapt to changing inputs with a faster response time than neurons that have their firing completely suppressed.
In networks that simulate a constrained number of time steps, a small change in weights is unlikely to cause a change in the spike count of the output. It might be preferable to apply the loss function directly to a more 'continuous' signal, such as the membrane potential, instead. This comes at the expense of operating on a full-precision hidden state, rather than on spikes. Alternatively, using population coding can distribute the cost burden over multiple neurons to increase the probability that a weight update will alter the spiking behavior of the output layer. It also increases the number of pathways through which error backpropagation may take place, and improves the chance that a weight update will generate a change in the global loss.

Spike Time Objectives
Loss functions that implement spike timing objectives are less commonly used than rate-coded objectives. Two possible reasons may explain why: (1) error rates are typically perceived to be the most important metric in the deep learning literature, and rate codes are more tolerant to noise; and (2) temporal codes are considerably more difficult to implement. A summary of approaches is provided in Table 3, with their snnTorch implementation below.
Cross-Entropy Spike Time: The timing of the first spike of each neuron in the output layer is taken, f⃗ ∈ ℝ^{N_C}. As cross-entropy minimisation involves maximising the likelihood of the correct class, a monotonically decreasing function must be applied to f⃗ such that early spike times are converted to large numerical values, while late spikes become comparatively smaller. These 'inverted' values are then used as logits in the softmax function [76] (Appendix B.8).
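A simple monotonically decreasing choice is plain negation, which turns early spike times into large logits. This sketch is our own minimal illustration of the idea:

```python
import math

def cross_entropy_spike_time(first_spike_times, target_class):
    """Negate first-spike times so earlier spikes become larger logits,
    then apply softmax cross-entropy."""
    logits = [-t for t in first_spike_times]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # stable softmax
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target_class])

times = [2.0, 9.0, 7.0]  # the correct class (index 0) fires earliest
print(cross_entropy_spike_time(times, target_class=0))
```

Gradient descent on this loss pulls the correct class's spike earlier while pushing incorrect classes later, mirroring what cross-entropy does to spike counts in the rate-coded case.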
Mean Square Spike Time: The spike time of all neurons are specified as targets. The mean square errors between the actual and target spike times of all output classes are summed together. This can be generalised to multiple spikes as well [75,124] (Appendix B.9).
Mean Square Relative Spike Time: A similar approach to above, but rather than specifying the precise time of each spike, only the relative time of the correct class must be specified. If the correct neuron fires a specified number of time steps earlier than incorrect neurons, the loss is fixed at zero [125] (Appendix B.10).
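The relative-timing idea can be sketched as a margin loss on spike times (our own toy version; the referenced work operates on full spike trains):

```python
def mse_relative_spike_time(first_spike_times, target_class, gap=2.0):
    """Zero loss once the correct neuron fires at least `gap` steps before
    every incorrect neuron; otherwise penalise the shortfall quadratically."""
    t_correct = first_spike_times[target_class]
    loss = 0.0
    for idx, t in enumerate(first_spike_times):
        if idx == target_class:
            continue
        shortfall = (t_correct + gap) - t  # positive if neuron idx fires too soon
        if shortfall > 0:
            loss += shortfall ** 2
    return loss

print(mse_relative_spike_time([1.0, 5.0, 9.0], target_class=0))  # margin met: loss 0
print(mse_relative_spike_time([1.0, 2.0, 9.0], target_class=0))  # margin violated
```

Fixing the loss at zero once the margin is met avoids over-constraining the absolute firing times of incorrect neurons.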

Membrane Potential
Maximum Membrane: Unreported in the literature.
Mean Square Membrane: Analogous to the rate-coded case, each output neuron has a target membrane potential specified for each time step, and the losses are summed across both time and outputs. To implement a temporal code, the correct class should specify a target membrane greater than the threshold of the neuron at an early time (Appendix B.7).
```python
from snntorch import functional as SF

loss_fn = SF.mse_temporal_loss()  # mean square spike time loss (Appendix B.9)
```

The use cases of these objectives are analogous to those of the spike rate objectives. A subtle challenge with using spike times is that the default implementation assumes each neuron spikes at least once, which is not necessarily the case. This can be handled by forcing a spike at the final time step in the event a neuron does not fire [125].

Practical Note: Output Decoding with a Read-Out Layer
Several state-of-the-art models wholly abandon spiking neurons at the output and train their models using a 'read-out layer'. This often consists of an integrate-and-fire layer with infinitely high thresholds (i.e., they will never fire), or with typical artificial neurons that use standard activation functions (ReLU, sigmoid, softmax, etc.). While this often improves accuracy, this may not qualify as a fully spiking network. Does this actually matter? If one can still achieve power efficiency, then engineers will be happy and that's often all that matters.

Spatial and Temporal Credit Assignment
Once a loss has been determined, it must somehow be used to update the network parameters with the hope that the network will iteratively improve at the trained task. Each weight takes some blame for its contribution to the total loss, and this is known as 'credit assignment'. This can be split into the spatial and temporal credit assignment problems. Spatial credit assignment aims to find the spatial location of the weight contributing to the error, while the temporal credit assignment problem aims to find the time at which the weight contributes to the error. Backpropagation has proven to be an extremely robust way to address credit assignment, but the brain is far more constrained in developing solutions to these challenges.
Backpropagation solves spatial credit assignment by applying a distinct backward pass after a forward pass during the learning process [126]. The backward pass mirrors the forward pass, such that the computational pathway of the forward pass must be recalled. In contrast, action potential propagation along an axon is considered to be unidirectional, which casts doubt on the plausibility of backprop taking place in the brain. Spatial credit assignment is not only concerned with calculating the weight's contribution to an error, but also with assigning the error back to the weight. Even if the brain could somehow calculate the gradient (or an approximation of it), a major challenge would be projecting that gradient back to the synapse, and knowing which gradient belongs to which synapse.
This constraint of neurons acting as directed edges is increasingly being relaxed, which could be a mechanism by which errors are assigned to synapses [127]. Numerous bi-directional, non-linear phenomena occur within individual neurons which may contribute towards helping errors find their way to the right synapse. For example, feedback connections are observed in most places where there are feedforward connections [128].

Biologically Motivated Learning Rules
With a plethora of neuronal dynamics that might embed variants of backpropagation, what options are there for modifying backprop to relax some of the challenges associated with biologically plausible spatial credit assignment? In general, the more broadly adopted approaches rely on either trading parts of the gradient calculation for stochasticity, or otherwise swapping a global error signal for localised errors (Figure 8). Conjuring alternative methods for credit assignment that a real-time machine such as the brain can implement is not only useful for developing insight into biological learning [129], but also reduces the cost of data communication in hardware [130]. For example, using local errors can reduce the length a signal must travel across a chip. Stochastic approaches can trade computation with naturally arising circuit noise [131][132][133]. A brief summary of several common approaches to ameliorating the spatial credit assignment problem is provided below:

• Perturbation Learning: A random perturbation of network weights is used to measure the change in error. If the error is reduced, the change is accepted; otherwise, it is rejected [134][135][136]. The difficulty of learning scales with the number of weights, where the effect of a single weight change is dominated by the noise from all other weight changes. In practice, it may take a huge number of trials to average this noise away [54].
• Random Feedback: Backpropagation requires sequentially transporting the error signal through multiple layers, scaled by the forward weights of each layer. Random feedback replaces the forward weight matrices with random matrices, reducing the dependence of each weight update on distributed components of the network. While this does not fully solve the spatial credit assignment problem, it quells the weight transport problem [137], which is specifically concerned with a weight update in one layer depending upon the weights of far-away layers. In backpropagation, forward- and backward-propagating data are scaled by symmetric weight matrices, a mechanism that is absent in the brain. Random feedback has shown similar performance to backpropagation on simple networks and tasks, which gives hope that a precise gradient may not be necessary for good performance [137]. Random feedback has struggled with more complex tasks, though variants have been proposed that reduce the gap [138][139][140][141]. Nonetheless, the mere fact that such a core piece of the backpropagation algorithm can be replaced with random noise and yet somehow still work is a marvel. It is indicative that we still have much left to understand about gradient backpropagation.

• Local Losses: It could be that the six layers of the cortex are each supplied with their own cost function, rather than a global signal that governs a unified goal for the brain [115]. Early visual regions may try to minimise the prediction error in constituent visual features, such as orientations, while higher areas use cost functions that target abstractions and concepts. For example, a baby learns how to interpret receptive fields before consolidating them into facial recognition. In deep learning, greedy layer-wise training assigns a cost function to each layer independently [142]. Each layer is sequentially assigned a cost function so as to ensure a shallow network is only ever trained.
Target propagation is similarly motivated, assigning a reconstruction criterion to each layer [108]. Such approaches exploit the fact that training a shallow network is easier than training a deep one, and aim to address spatial credit assignment by ensuring the error signal does not need to propagate far [127,143].

• Forward-Forward Error Propagation: The backward pass of a model is replaced with a second forward pass, where the input signal is altered based on error, or some related metric. Initially proposed by Dellaferrera et al. [144], the approach gained more traction soon after with Hinton's Forward-Forward algorithm [145]. These methods had not been ported to SNNs at the time of writing, though someone is bound to take up the mantle soon.
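The first of these alternatives, perturbation learning, is simple enough to sketch in a few lines: propose a random weight change and keep it only if the error drops. The toy 1-parameter model below is our own illustration, not a method from the referenced works:

```python
import random

rng = random.Random(0)

def loss(w):
    """Squared error of a 1-parameter model y = w * x against the target y = 0.5 * x."""
    xs = [-1.0, -0.5, 0.5, 1.0]
    return sum((w * x - 0.5 * x) ** 2 for x in xs)

w = rng.uniform(-1, 1)                  # random starting weight
for trial in range(500):
    candidate = w + rng.gauss(0, 0.1)   # propose a random perturbation
    if loss(candidate) < loss(w):       # accept only if the error is reduced...
        w = candidate                   # ...otherwise reject it

print(round(w, 2))  # w drifts toward the target weight of 0.5
```

With one parameter this converges quickly; with millions of weights, a single perturbation's effect is buried in the noise of all the others, which is exactly the scaling problem noted in the bullet above.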
These approaches to learning are illustrated in Figure 8(a). While they are described in the context of supervised learning, many theories of learning place emphasis on self-organisation and unsupervised approaches. Hebbian plasticity is a prominent example [146]. But an intersection may exist in self-supervised learning, where the target of the network is a direct function of the data itself. Some types of neurons may be representative of facts, features, or concepts, only firing when exposed to the right type of stimuli. Other neurons may fire with the purpose of reducing a reconstruction error [147][148][149]. By accounting for spatial and temporal correlations that naturally exist around us, such neurons may fire with the intent to predict what happens next. A more rigorous treatment of biological plausibility in objective functions can be found in [115].

Practical Note: The Effectiveness of Credit Assignment Solutions
Gradient-based error backpropagation reigns supreme where final accuracy/performance is the key metric. Perturbation learning is effective with an infinitely large number of trials which is not practical, and the other methods struggle to scale to deep layers. At the same time, the brain is not considered 'deep', so perhaps the focus should not be on how to make these algorithms scale to deep models, but rather, how shallow networks can perform to the same effectiveness as deeper models. This may call for evolutionary algorithms, where it has been shown that smaller networks can be optimized using a fitness function to outperform larger networks [150,151].

Activity Regularisation
A huge motivator behind using SNNs comes from the power efficiency when processed on appropriately tailored hardware. This benefit is not only from single-bit inter-layer communication via spikes, but also the sparse occurrence of spikes. Some of the loss functions above, in particular those that promote rate codes, will indiscriminately increase the membrane potential and/or firing frequency without an upper bound, if left unchecked. Regularisation of the loss can be used to penalise excessive spiking (or alternatively, penalise insufficient spiking which is great for discouraging dead neurons). Conventionally, regularisation is used to constrain the solution space of loss minimisation, thus leading to a reduction in variance at the cost of increasing bias. Care must be taken, as too much activity regularisation can lead to excessively high bias. Activity regularisation can be applied to alter the behavior of individual neurons or populations of neurons, as depicted in Figure 8(b).
• Population level regularisation: this is useful when the metric to optimise is a function of aggregate behavior. For example, the metric may be power efficiency, which is strongly linked to the total number of spikes from an entire network. L1-regularisation can be applied to the total number of spikes emitted at the output layer to penalise excessive firing, which encourages sparse activity at the output [152]. Alternatively, for more fine-grained control over the network, an upper-activity threshold can be applied. If the total number of spikes for all neurons in a layer exceeds the threshold, only then does the regularisation penalty kick in [120,123] (Appendix B.11).

• Neuron level regularisation: If neurons completely cease to fire, then learning may become significantly more difficult. Regularisation may also be applied at the individual neuron level by adding a penalty for each neuron. A lower-activity threshold specifies the lower permissible limit of firing for each neuron before the regularisation penalty is applied (Appendix B.12).
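These two penalty terms can be sketched as follows (our own minimal implementations of the upper- and lower-threshold ideas; snnTorch provides equivalents, per Appendices B.11 and B.12):

```python
def upper_rate_penalty(layer_spike_counts, upper, weight=1.0):
    """Population-level penalty: only kicks in once the layer's total
    spike count exceeds `upper` (otherwise contributes nothing)."""
    total = sum(layer_spike_counts)
    return weight * max(0, total - upper) ** 2

def lower_rate_penalty(layer_spike_counts, lower, weight=1.0):
    """Neuron-level penalty: each neuron firing fewer than `lower` spikes
    is penalised, which discourages dead neurons."""
    return weight * sum(max(0, lower - c) ** 2 for c in layer_spike_counts)

counts = [12, 0, 4, 7]                       # per-neuron spike counts in one layer
print(upper_rate_penalty(counts, upper=30))  # under budget: no penalty
print(upper_rate_penalty(counts, upper=20))  # 3 spikes over: penalised
print(lower_rate_penalty(counts, lower=2))   # only the silent neuron is penalised
```

Both terms would simply be added to the task loss, weighted to balance accuracy against activity.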
Recent experiments have shown that rate-coded networks (at the output) are robust to sparsity-promoting regularisation terms [120,121,123]. However, networks that rely on time-to-first-spike schemes have had less success, which is unsurprising given that temporal outputs are already sparse.
Encouraging each neuron to have a baseline spike count helps with the backpropagation of errors through pathways that would otherwise be inactive. Together, the upper and lower-limit regularisation terms can be used to find the sweet spot of firing activity at each layer. As explained in detail in [153], the variance of activations should be as close as possible to '1' to avoid vanishing and exploding gradients. While modern deep learning practices rely on appropriate parameter initialization to achieve this, these approaches were not designed for non-differentiable activation functions, such as spikes. By monitoring and appropriately compensating for neuron activity, this may turn out to be a key ingredient to successfully training deep SNNs.
Practical Note: Avoiding Dead Neurons by Varying Thresholds

One of the most common reasons your network will not train is that neurons are not firing. Probing the spiking activity to identify the layer at which activity dies out is important, and can be used to guide how to regularise your objective function. An easier approach is to simply reduce the threshold of the at-risk layers. We do this threshold hack all the time!

Training Spiking Neural Networks
The rich temporal dynamics of SNNs give rise to a variety of ways in which a neuron's firing pattern can be interpreted. Naturally, this means there are several methods for training SNNs. They can generally be classified into the following methods:

• Shadow training: A non-spiking ANN is trained and converted into an SNN by interpreting the activations as a firing rate or spike time
• Backpropagation using spikes: The SNN is natively trained using error backpropagation, typically through time as is done with sequential models
• Local learning rules: Weight updates are a function of signals that are spatially and temporally local to the weight, rather than from a global signal as in error backpropagation

Each approach has a time and place where it outshines the others. We will focus on approaches that apply backprop directly to an SNN, but useful insights can be attained by exploring shadow training and various local learning rules.
The goal of the backpropagation algorithm is loss minimisation. To achieve this, the gradient of the loss is computed with respect to each learnable parameter by applying the chain rule from the final layer back to each weight [154][155][156].
The gradient is then used to update the weights such that the error is ideally always decreased. If this gradient is '0', there is no weight update. This has been one of the main roadblocks to training SNNs using error backpropagation, due to the non-differentiability of spikes. This is also known as the dreaded 'dead neuron' problem. There is a subtle, but important, difference between 'vanishing gradients' and 'dead neurons', which will be explained in Section 4.3.
To gain deeper insight into the non-differentiability of spikes, recall the discretised solution of the membrane potential of the leaky integrate-and-fire neuron from Equation (4): U[t] = βU[t−1] + WX[t], where the first term represents the decay of the membrane potential U (with decay rate β), and the second term is the weighted input WX. The reset term and subscripts have been omitted for simplicity. Now imagine a weight update ∆W is applied to the weight W (Equation (4)). This update causes the membrane potential to change by ∆U, but this change in potential fails to precipitate a further change to the spiking presence of the neuron (Equation (5)). That is to say, dS/dU = 0 for all U, other than at the threshold θ, where dS/dU → ∞. This drives the term we are actually interested in, dL/dW, the gradient of the loss in weight space, to either '0' or '∞'. In either case, there is no adequate learning signal when backpropagating through a spiking neuron (Figure 9(a)).
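This can be verified numerically: a finite-difference estimate of dS/dU around a Heaviside spike function is zero everywhere except at the threshold, where it blows up (a small sketch of our own):

```python
def spike(u, threshold=1.0):
    """Heaviside spike generation: S = 1 if U exceeds the threshold, else 0."""
    return 1.0 if u > threshold else 0.0

def numerical_dS_dU(u, eps=1e-6, threshold=1.0):
    """Central finite-difference estimate of dS/dU at membrane potential u."""
    return (spike(u + eps, threshold) - spike(u - eps, threshold)) / (2 * eps)

print(numerical_dS_dU(0.5))   # below threshold: gradient is 0
print(numerical_dS_dU(1.5))   # above threshold: gradient is 0
print(numerical_dS_dU(1.0))   # at the threshold: explodes (1 / (2 * eps))
```

Since dL/dW contains dS/dU as a factor via the chain rule, every weight behind this neuron receives either no learning signal or an unusable one.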

Shadow Training
The dead neuron problem can be completely circumvented by instead training on a shadow ANN and converting it into an SNN (Figure 9(b)). The high precision activation function of each neuron is converted into either a spike rate [157][158][159][160][161] or a latency code [162]. One of the most compelling reasons to use shadow training is that advances in conventional deep learning can be directly applied to SNNs. For this reason, ANN-to-SNN conversion currently takes the crown for static image classification tasks on complex datasets, such as CIFAR-10 and ImageNet. Where inference efficiency is more important than training efficiency, and if input data is not time-varying, then shadow training could be the optimal way to go.
In addition to the inefficient training process, there are several drawbacks. Firstly, the types of tasks that are most commonly benchmarked do not make use of the temporal dynamics of SNNs, and the conversion of sequential neural networks to SNNs is an under-explored area [159]. Secondly, converting high-precision activations into spikes typically requires a large number of simulation time steps, which may offset the power/latency benefits initially sought from SNNs. But what really motivates doing away with ANNs is that the conversion process is necessarily an approximation. Therefore, a shadow-trained SNN is very unlikely to reach the performance of the original network.
The issue of long time sequences can be partially addressed by using a hybrid approach: start with a shadow-trained SNN, and then perform backpropagation on the converted SNN [163]. Although this appears to degrade accuracy (reported on CIFAR-10 and ImageNet), it is possible to reduce the required number of steps by an order of magnitude. A more rigorous treatment of shadow training techniques and challenges can be found in [164].

Backpropagation Using Spike Times
An alternative method to sidestep the dead neuron problem is to instead take the derivative at spike times. In fact, this was the first proposed method for training multi-layer SNNs using backpropagation [124]. The original approach in SpikeProp observes that while spikes may be discontinuous, time is continuous. Therefore, taking the derivative of spike timing with respect to the weights achieves functional results. A thorough description is provided in Appendix C.1.
[Figure 9, caption continued: (c) Spike-time gradients: the derivative is taken at spike times instead of through the spike generation mechanism, and is a continuous function as long as a spike necessarily occurs [124]. (d) Surrogate gradients: the spike generation function is approximated by a continuous function during the backward pass [123]. The left arrow (←) indicates function substitution. This is the most broadly adopted solution to the dead neuron problem.]

Practical Note: Gradients at Spike Times
This approach continues to be widely researched, but has taken a back seat to backpropagation through time using surrogate gradient descent, as it does not perform as well at optimizing a loss function. Why is this? Our best guess is that the lower performance occurs because there are fewer edges to backpropagate errors through, making credit assignment more challenging. More details on surrogate gradient descent are given in the following section.
Intuitively, SpikeProp calculates the gradient of the error with respect to the spike time. A change to the weight by ∆W causes a change of the membrane potential by ∆U, which ultimately results in a change of spike timing by ∆f, where f is the firing time of the neuron. In essence, the non-differentiable term ∂S/∂U has been traded for ∂f/∂U. This also means that each neuron must emit a spike for a gradient to be calculable. This approach is illustrated in Figure 9(c). Extensions of SpikeProp have made it compatible with multiple spikes [165], and these are highly performant on data-driven tasks, some having surpassed human-level performance on MNIST and N-MNIST [76,[166][167][168]. Several drawbacks arise. Once neurons become inactive, their weights become frozen. In most instances, no closed-form solution exists for the gradient if there is no spiking [169]. SpikeProp tackles this by modifying parameter initialization (i.e., increasing weights until a spike is triggered). But since the inception of SpikeProp in 2002, the deep learning community's understanding of weight initialization has gradually matured. We now know initialization aims to set a constant activation variance between layers, the absence of which can lead to vanishing and exploding gradients through space and time [153,170]. Modifying weights to promote spiking may detract from this. Instead, a more effective way to overcome the lack of firing is to lower the firing thresholds of the neurons. One may also consider applying activity regularization to encourage firing in hidden layers, though this has been found to degrade classification accuracy when taking the derivative at spike times. This result is unsurprising, as regularization can only be applied at the spike time rather than when the neuron is quiet.
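As a sketch of how the non-differentiable term is traded away, SpikeProp linearises the membrane potential around the firing time. The reconstruction below is illustrative and uses the notation of the surrounding text; the full derivation is in Appendix C.1:

```latex
% Around the firing time f, assume U(t) crosses the threshold \theta linearly.
% A small perturbation \Delta U then shifts the crossing time by
%   \Delta f \approx -\frac{\Delta U}{\left.\partial U/\partial t\right|_{t=f}},
% giving the exchange used in place of the non-differentiable \partial S/\partial U:
\frac{\partial f}{\partial U} = -\left(\left.\frac{\partial U}{\partial t}\right|_{t=f}\right)^{-1}
```

The negative sign reflects that raising the membrane potential causes the neuron to fire earlier; the expression is only defined when a spike actually occurs.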
Another challenge is that it enforces stringent priors upon the network (e.g., each neuron must fire only once) that are incompatible with dynamically changing input data. This may be addressed by using periodic temporal codes that refresh at given intervals, in a similar manner to how visual saccades may set a reference time. But it is the only approach that enables the calculation of an unbiased gradient without any approximations in multi-layer SNNs. Whether this precision is necessary is a matter of further exploration on a broader range of tasks.

Backpropagation Using Spikes
Instead of computing the gradient with respect to spike times, the most commonly adopted approach over the past several years is to apply the generalised backpropagation algorithm to the unrolled computational graph (Figure 6(b)) [75,117,158,171,172], i.e., backpropagation through time (BPTT). Working backwards from the final output of the network, the gradient flows from the loss to all of its descendants. In this way, computing the gradient through an SNN is mostly the same as that of an RNN by iterative application of the chain rule. Figure 10(a) depicts the various pathways of the gradient ∂L/∂W from the parent (L) to its leaf nodes (W). In contrast to backprop using spike times, which only follows the gradient pathway whenever a neuron fires, this approach takes every pathway regardless of whether the neuron fires. The final loss is the sum of instantaneous losses, L = Σ_t L[t], though the loss calculation can take a variety of other forms as described in Section 3.3.
Finding the derivative of the total loss with respect to the parameters allows the use of gradient descent to train the network, so the goal is to find ∂L/∂W. The parameter W is applied at every time step, and the application of the weight at a particular step is denoted W[s]. Assume an instantaneous loss L[t] can be calculated at each time step (taking caution that some objective functions, such as the mean square spike rate loss (Section 3.3.1), must wait until the end of the sequence to accumulate all spikes and generate a loss). As the forward pass requires moving data through a directed acyclic graph, each application of the weight will only affect present and future losses. The influence of W[s] on L[t] at s = t is labelled the immediate influence in Figure 10.

Practical Note: Do I need to remember all of this to use SNNs?
Thankfully, gradients rarely need to be calculated by hand, as most deep learning packages come with an automatic differentiation engine. Using SNNs does not require an intricate knowledge of the internals. But advancing SNN research most certainly does!

Isolating the immediate influence at a single time step as in Figure 9(d) makes it clear that we run into the spike non-differentiability problem in the term ∂S/∂U ∈ {0, ∞}. The act of thresholding the membrane potential is functionally equivalent to applying a shifted Heaviside operator, which is non-differentiable.
The solution is actually quite simple. During the forward pass, as per usual, apply the Heaviside operator to U[t] in order to determine whether the neuron spikes. But during the backward pass, substitute the Heaviside operator with a continuous function, S̃ (e.g., sigmoid). The derivative of the continuous function is used as a substitute, ∂S/∂U ← ∂S̃/∂U, and is known as the surrogate gradient approach (Figure 9(d)).
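This forward/backward substitution can be sketched in a few lines of PyTorch using a custom autograd function. The snippet is an illustrative reimplementation rather than snnTorch's internal code; snnTorch users would simply pass a `spike_grad` argument to a neuron model instead:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside on the forward pass; sigmoid derivative on the backward pass."""

    @staticmethod
    def forward(ctx, u):
        ctx.save_for_backward(u)
        return (u > 0.0).float()            # true spike generation

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        sig = torch.sigmoid(u)              # threshold-shifted sigmoid
        return grad_output * sig * (1.0 - sig)  # dS~/dU substitutes dS/dU

W = torch.tensor(1.2, requires_grad=True)
X = torch.tensor(1.0)
theta = 1.0

U = W * X
S = SurrogateSpike.apply(U - theta)
S.backward()
print(W.grad)  # nonzero: the surrogate lets a learning signal flow
```

The forward output is still a binary spike; only the backward pass is smoothed.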

Surrogate Gradients
A major advantage of surrogate gradients is that they help overcome the dead neuron problem. To make the dead neuron problem more concrete, consider a neuron with a threshold of θ, where one of the following cases occurs:
1. The membrane potential is below the threshold: U < θ
2. The membrane potential is above the threshold: U > θ
3. The membrane potential is exactly at the threshold: U = θ
In Case 1, no spike is elicited, and the derivative is ∂S/∂U|U<θ = 0. In Case 2, a spike fires, but the derivative remains ∂S/∂U|U>θ = 0. Applying either of these to the chain of equations in Figure 9(a) nulls out ∂L/∂W = 0. In the improbable event of Case 3, ∂S/∂U|U=θ = ∞, which swamps out any meaningful gradient when applied to the chain rule 4 . But approximating the gradient, ∂S/∂U, solves this.
One example is to replace the non-differentiable term with the threshold-shifted sigmoid function, but only during the backward pass. This is illustrated in Figure 9(d). More formally:

S̃ = σ(U − θ)

and therefore,

∂S/∂U ← ∂S̃/∂U = σ(U − θ)(1 − σ(U − θ))

This means learning only takes place if there is spiking activity. Consider a synaptic weight attached to the input of a spiking neuron, W_in, and another weight at the output of the same neuron, W_out. Say the following sequence of events occurs:
1. An input spike, S_in, is scaled by W_in
2. The weighted spike is added as an input current injection to the spiking neuron (Equation (4))
3. This may cause the neuron to trigger a spike, S_out
4. The output spike is weighted by the output weight, W_out
5. This weighted output spike varies some arbitrary loss function, L
Figure 11: Sequence of steps during the forward pass.
Let the loss function be the Manhattan distance between a target value y and the weighted spike:

L = |y − W_out S_out|

where updating W_out requires:

∂L/∂W_out = −sign(y − W_out S_out) · S_out

More generally: a spike must be triggered for a weight to be updated, since ∂L/∂W_out = 0 whenever S_out = 0. The surrogate gradient does not change this.
Now consider the case for updating W_in, where the following derivative must be calculated:

∂L/∂W_in = (∂L/∂S_out) × (∂S_out/∂U) × (∂U/∂W_in), with Term A = ∂L/∂S_out and Term B = ∂S_out/∂U

• Term A is simply proportional to W_out, based on the above equation for L
• Term B would almost always be 0, unless substituted with a surrogate gradient
To summarize: the surrogate gradient enables errors to propagate to earlier layers, regardless of spiking. But spiking is still needed to trigger a weight update.
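This asymmetry can be checked numerically. The sketch below is an illustrative example (the names `W_in`/`W_out` follow the text, and a sigmoid surrogate is assumed): when the hidden neuron stays silent, W_out receives a zero gradient while W_in still receives an error signal through the surrogate:

```python
import torch

class SigmoidSurrogate(torch.autograd.Function):
    @staticmethod
    def forward(ctx, u):
        ctx.save_for_backward(u)
        return (u > 0.0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        sig = torch.sigmoid(u)
        return grad_output * sig * (1.0 - sig)

W_in = torch.tensor(0.3, requires_grad=True)
W_out = torch.tensor(0.8, requires_grad=True)
S_in = torch.tensor(1.0)   # an input spike arrives
theta, y = 1.0, 1.0

U = W_in * S_in                            # steps 1-2: weighted input current
S_out = SigmoidSurrogate.apply(U - theta)  # step 3: U < theta, so no spike
L = torch.abs(y - W_out * S_out)           # steps 4-5: Manhattan-distance loss
L.backward()

print(W_out.grad)  # tensor(0.): no spike, so no update to W_out
print(W_in.grad)   # nonzero: error still reaches W_in via the surrogate
```

Changing `W_in` to a larger value so the neuron fires makes both gradients nonzero.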
Practical Note: The 'Best' Surrogate Gradient?
Various works empirically explore different surrogate gradients. These include triangular functions, fast sigmoid and sigmoid functions, straight-through estimators, and various other weird shapes. Is there a best surrogate gradient? In our experience, we have found the following function to be the best starting point:

∂S̃/∂U = 1 / (1 + (πU)²)

You might see this referred to as the 'arctan' surrogate gradient, first proposed in Ref. [173]. This is because the integral of this function is S̃ = (1/π) arctan(πU). As of 2023, this is the default surrogate gradient in snnTorch. We have no idea why it works so well.
To reiterate, surrogate gradients will not enable learning in the absence of spiking. This provokes an important distinction between the dead neuron problem and the vanishing gradient problem. A dead neuron is one that does not fire, and therefore does not contribute to the loss. This means the weights attached to that neuron have no 'credit' in the credit assignment problem. The relevant gradient terms during the training process will remain at zero. Therefore, the neuron cannot learn to fire later on and so is stuck forever, not contributing to learning.
On the other hand, vanishing gradients can arise in ANNs as well as SNNs. For deep networks, the gradients of the loss function can become vanishingly small as they are successively scaled by values less than '1' when using several common activation functions (e.g., a sigmoid unit). In much the same way, RNNs are highly susceptible to vanishing gradients because they introduce an additional layer to the unrolled computational graph at each time step. Each layer adds another multiplicative factor in calculating the gradient, which makes it susceptible to vanishing if the factor is less than '1', or exploding if greater than '1'. The ReLU activation became broadly adopted to reduce the impact of vanishing gradients, but remains underutilised in surrogate gradient implementations [153].
Surrogate gradients do not need to be explicitly defined in snnTorch, as the arctan surrogate is applied by default. But the following code snippet shows how you might use alternatives with a leaky integrate-and-fire neuron:

```python
import snntorch as snn
from snntorch import surrogate

lif_1 = snn.Leaky(beta=0.9, spike_grad=surrogate.fast_sigmoid())
lif_2 = snn.Leaky(beta=0.9, spike_grad=surrogate.sigmoid())
lif_3 = snn.Leaky(beta=0.9, spike_grad=surrogate.straight_through_estimator())
lif_4 = snn.Leaky(beta=0.9, spike_grad=surrogate.triangular())
```

Surrogate gradients have been used in most state-of-the-art experiments that natively train an SNN [75,117,158,171,172]. A variety of surrogate gradient functions have been used with varying degrees of success, and the choice of function can be treated as a hyperparameter. While several studies have explored the impact of various surrogates on the learning process [123,174], our understanding tends to be limited to what is known about biased gradient estimators. There is a lot left unanswered here. For example, if we can get away with approximating gradients, then perhaps surrogate gradients can be used in tandem with random feedback alignment, which involves replacing weights with random matrices during the backward pass. Rather than pure randomness, perhaps local approximations can be made that follow the same spirit as a surrogate gradient.
In summary, taking the gradient only at spike times provides an unbiased estimator of the gradient, at the expense of losing the ability to train dead neurons. Surrogate gradient descent flips this around, enabling dead neurons to backpropagate error signals by introducing a biased estimator of the gradient. There is a tug-of-war between bringing dead neurons back to life and introducing bias. Given how prevalent surrogate gradients have become, we will linger a little longer on the topic by describing their relation to model quantization. Understanding how approximations in gradient descent impact learning will very likely lead to a deeper understanding of why surrogate gradients are so effective, how they might be improved, and how backpropagation can be simplified by making approximations that reduce the cost of training without harming the objective.

The Link Between Surrogate Gradients and Quantized Neural Networks
Surrogate gradients have been around under a few disguises for over a decade now. Hinton overcame the challenge of thresholding activations and weights in binarized neural networks by simply ignoring them during the backward pass [175]. He coined the term, 'straight-through-estimator', as the gradient passes 'straight through' the nondifferentiable operator. Equivalently, this is like setting the surrogate gradient to be ∂S/∂U = 1. The exact same methodology is applied when training quantized neural networks [176,177].

A Brief History of Surrogate Gradients
A couple of years after Hinton introduced the straight-through estimator, Hunsberger and Eliasmith used 'softened' functions for the firing rate of a leaky integrator neuron, making them amenable to backpropagation. Lee, Delbruck and Pfeiffer subsequently provided the first demonstration of approximating the backward path of spike signals in 2016 [178], which is akin to what we presently do today. Shrestha and Orchard were the first to release PyTorch-compatible code with SLAYER in 2018, where a fast sigmoid gradient was used on the backward pass [75]. The term 'surrogate gradient' was coined in 2019 by Zenke, Mostafa and Neftci [174], and Zenke's codebase SpyTorch showed how features in PyTorch could be used to implement gradient approximations very easily [152]. This spawned a proliferation of applications, projects and libraries, including snnTorch, that continue to extend these techniques.
Training quantized neural networks involves adjusting the weights and activations to use lower-precision fixed-point representations while maintaining acceptable performance and accuracy [179,180]. Quantized fixed-point arithmetic requires fewer computational resources and less memory storage than floating-point arithmetic, and is commonly used in accelerators, both neuromorphic and otherwise, as a result. Several methods have been proposed to construct quantized neural networks, and they can be broadly categorized into the following approaches:
1. Post-training quantization: A neural network is first trained using standard floating-point arithmetic. Once training is complete, the weights and activations are quantized to lower-precision fixed-point representations.
Post-training quantization is simple to implement and computationally efficient, but it may result in a significant loss of accuracy for certain models or tasks.
2. Quantization-aware training: This method involves training the neural network with quantization built into the forward pass during the training process. The quantization process is non-differentiable, and is thus ignored during the gradient calculation step by applying Hinton's straight-through-estimator. The weight update is applied to the full precision weight, which is quantized only during the forward-pass. This allows the model to learn how to compensate for the quantization errors during training, leading to better performance and accuracy compared to post-training quantization. However, quantization-aware training is more computationally intensive and may require modifications to the training algorithm [181].
3. Mixed-precision training: Different parts of the neural network use different levels of numerical precision. For example, the forward pass may use lower-precision fixed-point arithmetic, while the backward pass and weight updates use higher-precision floating-point arithmetic. This approach can help maintain the benefits of reduced computational complexity and memory requirements while minimizing the impact on model accuracy [182].
4. Binary and ternary neural networks: These are extreme cases of quantized neural networks, where the weights and activations are quantized to binary or ternary values, typically {-1, 0, 1} for ternary networks and {-1, 1} for binary networks. Training such networks often involves learning a real-valued scaling factor alongside the binary or ternary weights to improve the model's expressive power. These ultra-low precision networks can significantly reduce computational requirements and power consumption, but they may suffer from reduced accuracy or increased model complexity.
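The quantization-aware training approach above can be sketched in plain PyTorch (the helper name `quantize_ste` is hypothetical; real frameworks such as Brevitas wrap this up for you). The weight is quantized only in the forward pass, while the straight-through estimator routes the gradient to the full-precision copy:

```python
import torch

def quantize_ste(w, num_bits=8, w_max=1.0):
    """Fake-quantize w in the forward pass; straight-through gradient in the backward."""
    scale = w_max / (2 ** (num_bits - 1) - 1)
    w_q = torch.round(torch.clamp(w, -w_max, w_max) / scale) * scale
    # (w_q - w).detach() + w: forward value is w_q, gradient flows as if identity
    return (w_q - w).detach() + w

W = torch.tensor([0.123, -0.456], requires_grad=True)
W_q = quantize_ste(W, num_bits=4)
loss = W_q.sum()
loss.backward()
print(W.grad)  # tensor([1., 1.]): gradient passes straight through the rounding
```

The optimizer updates the full-precision W, and the quantized copy is regenerated on each forward pass.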
Several studies have shown that SNNs are extremely robust to quantization when quantization-aware training is used.
In the extreme case, binarized weights appeared to have a far weaker impact on a range of classification problems than in an equivalent non-spiking neural network. Our working theory is that approximations and truncation errors are likely to be absorbed by the sub-threshold dynamics of the neuron [176,177].
The following code sample shows the use of the Python library Brevitas in constructing quantized SNNs. Brevitas already accounts for the straight-through estimator gradient during training, so the developer does not need to make any modifications to the backward pass [183]. It should be noted that this approach models a reduced-precision network, but does not represent variables in a reduced-precision format.

```python
import snntorch as snn
import torch.nn as nn
import brevitas.nn as qnn

# full precision model
net = nn.Sequential(nn.Linear(784, 10),
                    snn.Leaky(beta=0.9, init_hidden=True))

# quantized model
num_bits = 8
quant_net = nn.Sequential(qnn.QuantLinear(784, 10, weight_bit_width=num_bits),
                          snn.Leaky(beta=0.9, init_hidden=True))
```

The above code snippets quantize weights and activations. The membrane potential and hidden states are often neglected, and post-quantized outside of the training process. State-based quantization-aware training is also possible, where the membrane potential is discretized during the forward pass. This is straightforward to account for in snnTorch by passing an argument to a neuron model, which triggers quantization of the membrane potential during the forward pass:

```python
from snntorch.functional import quant

# quantize the membrane potential during the forward pass
q_lif = quant.state_quant(num_bits=4)
lif = snn.Leaky(beta=0.9, state_quant=q_lif)
```

Practical Note: Quantization Ranges
Say you have N bits available to represent weights and states. How do you determine the range these N bits should span? Weights can use the minimum and maximum full-precision weight values as the range. How states should best be represented is less explored, but our early experimental results indicate that, for a normalized resting potential of 0 V, negative values can be safely clipped. This gives positive states a finer resolution, rather than wasting representable values on ranges that do not trigger spikes. Naturally, this does not hold where inhibitory spikes can be triggered by negative thresholds.

A Bag of Tricks in BPTT with SNNs
Many advances in deep learning stem from a series of incremental techniques that bolster the learning capacity of models. These techniques are applied in conjunction to boost model performance. For example, He et al.'s work in 'Bag of tricks for image classification with convolutional neural networks' not only captures the honest state of deep learning in the title alone, but also performs an ablation study of 'hacks' that can be combined to improve optimization during training [184]. Some of these techniques can be ported straight from deep learning to SNNs, while others are SNN-specific. A non-exhaustive list of these techniques is provided in this section. These techniques are quite empirical, and each bullet could have its own 'Practical Note' text box, but then this paper would just turn into a bunch of boxes.
• The reset mechanism in Equation (4) is a function of the spike, and is also non-differentiable. It is important to ensure the surrogate gradient is not cloned into the reset function as it has been empirically shown to degrade network performance [123]. Quite simply, we ignore it during the backward pass. snnTorch does this automatically by detaching the reset term in Equation (4) from the computational graph by calling the '.detach()' function.
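The detached reset above can be sketched in one line of a LIF forward pass (an illustrative reconstruction of the idea, not snnTorch's exact internals):

```python
import torch

def lif_step(W, X, U, beta=0.9, theta=1.0):
    """One leaky integrate-and-fire step with the reset detached from the graph."""
    S = (U > theta).float()          # spike from the previous step's potential
    # .detach() stops any (surrogate) gradient from flowing through the reset term
    U_next = beta * U + W * X - (S * theta).detach()
    return S, U_next

S, U = lif_step(torch.tensor(0.5), torch.tensor(1.0), torch.tensor(1.2))
print(S, U)  # spike emitted, and the membrane potential after decay, input, and reset
```

The forward dynamics are unchanged; only the backward graph is pruned of the reset pathway.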
• Residual connections work remarkably well for non-spiking nets and spiking models alike. Direct paths between layers are created by allowing the output of an earlier layer to be added to the output of a later layer, effectively skipping one or more layers in between. They are used to address the vanishing gradient problem and improve the flow of information during both forward and backward propagation, which enabled the neural network community to construct far deeper architectures, starting with the ResNet family of models and now commonly used in Transformers [185]. Unsurprisingly, they work extremely well for SNNs, too [173].
• Learnable decay: Rather than treating the decay rates of neurons as hyperparameters, it is also common practice to make them learnable parameters. This makes SNNs resemble conventional RNNs much more closely. Doing so has shown to improve testing performance on datasets with time-varying features [56].
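A learnable decay can be sketched in plain PyTorch by registering β as a parameter (snnTorch exposes this via a `learn_beta=True` flag on its neuron models; the module below is an illustrative stand-in):

```python
import torch
import torch.nn as nn

class LearnableLeaky(nn.Module):
    """Leaky integrator whose decay rate is trained alongside the weights."""

    def __init__(self, beta_init=0.9):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))  # learnable decay

    def forward(self, x, U):
        # clamp keeps the decay a valid leak factor in [0, 1]
        return torch.clamp(self.beta, 0.0, 1.0) * U + x

lif = LearnableLeaky()
U = lif(torch.tensor(0.5), torch.tensor(1.0))
U.sum().backward()
print(lif.beta.grad)  # nonzero: the decay rate receives its own gradient
```

The optimizer then tunes the membrane time constant per neuron, much like any other parameter.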
• Graded Spikes: Passive dendritic properties can attenuate action potentials, as can the cable-like properties of the axon. This feature can be coarsely accounted for with graded spikes, where each neuron has an additional learnable parameter that determines how to scale an output spike. Neuronal activations are no longer constrained to {0, 1}. Can this still be thought of as an SNN? From an engineering standpoint, if a spike must be broadcast to a variety of downstream neurons with an 8- or 16-bit destination address, then adding another several bits to the payload can be worth it. The 2nd-generation Loihi chip from Intel Labs incorporates graded spikes in such a way that sparsity is preserved. Furthermore, the vector of learnt values scales linearly with the number of neurons in a network, rather than quadratically as weights do. It therefore contributes a minor cost in comparison to other components of an SNN.
• Learnable Thresholds have not been shown to help the training process. This is likely due to the discrete nature of thresholds, giving rise to non-differentiable operators in a computational graph. On the other hand, normalizing the values that are passed into a threshold significantly helps. Adopting batch-normalization in convolutional networks helps boost performance, and learnable normalization approaches may act as an effective surrogate for learnable thresholds [186][187][188].
• Pooling is effective for downsampling large spatial dimensions in convolutional networks, and for achieving translational invariance. If max-pooling is applied to a sparse, spiking tensor, then tie-breaking between 1's and 0's does not make much sense. One might expect we can borrow ideas from training binarized neural networks, where pooling is applied to the activations before they are thresholded to binarized quantities. This corresponds to applying pooling to the membrane potential, in a manner that resembles a form of 'local lateral inhibition'. But this does not necessarily lead to optimal performance in SNNs. Interestingly, Yu et al. applied pooling to the spikes instead: where multiple spikes occurred in a pooling window, a tie-break would occur randomly among them [173]. While no reason was given for doing this, it nonetheless achieved state-of-the-art (at the time) performance on a series of computer vision problems. Our best guess is that this randomness acted as a type of regularization. Whether max-pooling or average-pooling is used can be treated as a hyperparameter. As an alternative, SynSense's neuromorphic hardware adopts sum-pooling, where spatial dimensions are reduced by re-routing the spikes in a receptive field to a common post-synaptic neuron.
• Optimizer: Most SNNs default to the Adam optimizer as they have classically been shown to be robust when used with sequential models [189]. As SNNs become deeper, stochastic gradient descent with momentum seems to increase in prevalence over the Adam optimizer. The reader is referred to Godbole et al.'s Deep Learning Tuning Playbook for a systematic approach to hyperparameter optimization that applies generally [190].

The Intersection Between Backprop and Local Learning
An interesting result arises when comparing backpropagation pathways that traverse varying durations of time. The derivative of the hidden state over time is ∂U[t]/∂U[t−1] = β, as per Equation (4). A gradient that backpropagates through n time steps is scaled by β^n. For a leaky neuron, β < 1, which causes the magnitude of a weight update to diminish exponentially with the time between a pair of spikes. This proportionality is illustrated in Figure 10(b). This result shows how the strength of a synaptic update decays exponentially with the spike time difference between a pre- and post-synaptic neuron. In other words, weight updates from BPTT closely resemble weight updates from spike-timing-dependent plasticity (STDP) learning curves (Appendix C.2) [32].
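The β^n scaling can be verified directly by unrolling a leaky state for n steps and reading off the gradient with respect to the initial state (a minimal sketch using the same notation as the text):

```python
import torch

beta, n = 0.9, 10
U = torch.tensor(1.0, requires_grad=True)

state = U
for _ in range(n):       # unroll U[t] = beta * U[t-1] for n steps (no input)
    state = beta * state

state.backward()
print(U.grad)            # equals beta ** n, i.e. ~0.3487 for these values
```

A spike pair separated by more time steps therefore contributes an exponentially weaker weight update, mirroring the shape of an STDP window.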
Is this link just a coincidence? BPTT was derived from function optimization. STDP is a model of a biological observation. Despite being developed via completely independent means, they converge upon an identical result. This could have immediate practical implications, where hardware accelerators that train models could excise a chunk of BPTT and replace it with the significantly cheaper, local STDP rule. Adopting such an approach might be thought of as an online variant of BPTT, or as a gradient-modulated form of STDP.

Long-Term Temporal Dependencies
Neural and synaptic time constants span timescales typically on the order of 1-100s of milliseconds. With such time scales, it is difficult to solve problems that require long-range associations longer than the slowest neuronal or synaptic time constant. Such problems are common in natural language processing and reinforcement learning, and are key to understanding behavior and decision making in humans. This challenge places a huge burden on the learning process, where vanishing gradients drastically slow the convergence of the neural network. LSTMs [191] and, later, GRUs [192] introduced slow dynamics designed to overcome memory and vanishing gradient problems in RNNs. Thus, a natural solution for networks of spiking neurons is to complement the fast timescales of neural dynamics with a variety of slower dynamics. Mixing discrete and continuous dynamics may enable SNNs to learn features that occur on a vast range of timescales. Examples of slower dynamics include:
• Adaptive thresholds: After a neuron fires, it enters a refractory period during which it is more difficult to elicit further spikes from the neuron. This can be modeled by increasing the firing threshold θ every time the neuron emits a spike. After a sufficient time in which the neuron has not spiked, the threshold relaxes back to a steady-state value. Homeostatic thresholds are known to promote neuronal stability in correlated learning rules, such as STDP, which favours long-term potentiation at high frequencies regardless of spike timing [193,194]. More recently, they have been found to benefit gradient-based learning in SNNs as well [171] (Appendix C.3).
• Recurrent attention: Hugely popularized by natural language generation, self-attention finds correlations between tokens across vast sequence lengths by feeding a model with all sequential inputs at once. This representation of data is not quite how the brain processes it. Several approaches have approximated self-attention as a sequence of recurrent operations, of which SpikeGPT is the first application in the spiking domain, successfully achieving language generation [82]. In addition to more complex state-based computation, SpikeGPT additionally employs dynamical weights that vary over time.
• Axonal delays: The wide variety of axon lengths means there is a wide range of spike propagation delays. Some neurons have axons as short as 1 mm, whereas those in the sciatic nerve can extend up to a meter in length. The axonal delay can be a learned parameter spanning multiple time steps [75,195,196]. A lesser explored approach accounts for the varying delays not only in axons, but also across the dendritic tree of a neuron. Coupling axonal and dendritic delays together allows for a fixed delay per synapse.
• Membrane Dynamics: We already know how the membrane potential can trigger spiking, but how does spiking impact the membrane? Rapid changes in voltage cause an electric field build-up that leads to temperature changes in cells. Joule heating scales quadratically with voltage changes, which affects the geometric structure of neurons and cascades into a change in membrane capacitance (and thus, time constants). Decay rate modulation as a function of spike emission can act as a second-order mechanism to generate neuron-specific refractory dynamics.
• Multistable Neural Activity: Strong recurrent connections in biological neural networks can support multistable dynamics [197], which facilitate stable information storage over time. Such dynamics, often called attractor neural networks [198], are believed to underpin working memory in the brain [199,200], and are often attributed to the prefrontal cortex. Training such networks using gradient descent is challenging, and has not been attempted with SNNs as of yet [201].
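The adaptive threshold dynamic described above can be sketched as a small extension of the LIF update (an illustrative model; the decay constants and names below are assumptions, not a specific published parameterization):

```python
import torch

def adaptive_lif_step(x, U, theta_adapt, beta=0.9, rho=0.95,
                      theta0=1.0, alpha=0.2):
    """LIF step with a threshold that jumps on each spike and decays back."""
    theta = theta0 + theta_adapt                 # effective firing threshold
    S = (U > theta).float()                      # spike if potential exceeds it
    U = beta * U + x - S * theta                 # leaky integration with reset
    theta_adapt = rho * theta_adapt + alpha * S  # homeostatic adaptation
    return S, U, theta_adapt

U, theta_adapt = torch.tensor(0.0), torch.tensor(0.0)
spikes = []
for t in range(5):
    S, U, theta_adapt = adaptive_lif_step(torch.tensor(0.6), U, theta_adapt)
    spikes.append(S.item())
print(spikes)  # [0.0, 0.0, 1.0, 0.0, 0.0]: the raised threshold suppresses later firing
```

The slow variable theta_adapt persists across many fast membrane time constants, which is exactly the kind of longer-timescale state the bullet list argues for.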
Several rudimentary slow timescale dynamics have been tested in gradient-based approaches to training SNNs with a good deal of success [75,171], but there are several neuronal dynamics that are yet to be explored. LSTMs showed us the importance of temporal regulation of information, and effectively cured the short-term memory problem that plagued RNNs. Translating more nuanced neuronal features into gradient-based learning frameworks can undoubtedly strengthen the ability of SNNs to represent dynamical data in an efficient manner.

Temporal Locality
As incredible as our brains are, sadly, they are not time machines. It is highly unlikely that our neurons breach the space-time continuum to explicitly reference historical states and run the BPTT algorithm. As with all computers, brains operate on a physical substrate which dictates the operations they can handle and where memory is located. Conventional computers operate on an abstraction layer in which memory is delocalised and communicated on demand, paying a considerable price in latency and energy. Brains are believed to operate on local information, which means the best performing approaches in temporal deep learning, namely BPTT, are biologically implausible. This is because BPTT requires the storage of past inputs and states in memory. As a result, the required memory scales with time, a property which limits BPTT to small temporal dependencies. To work around this, BPTT assumes a finite sequence length before making an update, truncating the gradients in time. This, however, severely restricts the temporal dependencies that can be learned.
The constraint imposed on brain-inspired learning algorithms is that the calculation of a gradient should, much like the forward pass, be temporally local, i.e., it should only depend on values available at the present time t or the previous step t − 1.
To address this, we turn to online algorithms that adhere to temporal locality. Real-time recurrent learning (RTRL), proposed back in 1989, is one prominent example.

Real-Time Recurrent Learning
RTRL estimates the same gradients as BPTT, but relies on a different set of computations that make it temporally, but not spatially, local [202]. Since RTRL's memory requirement does not grow with time, why is it not used in favour of BPTT? BPTT's memory usage scales with the product of time and the number of neurons; it is O(nT). For RTRL, an additional set of computations must be introduced to enable the network to keep track of a gradient that evolves with time. These additional computations result in an O(n³) memory requirement, which often exceeds the demands of BPTT. But the push for continuously-learning systems that can run indefinitely long has cast a spotlight back on RTRL (and variants [203][204][205][206][207]), with a focus on improving computational and memory efficiency.
Let us derive what new information needs to be propagated forward to enable real-time gradient calculation for an SNN. As in Equation (7), the goal is to compute the gradient of the loss with respect to the weight. First we define the influence of the parameter W on the membrane potential U[t] as m[t] ≜ ∂U[t]/∂W, which serves to track the derivative of the present-time membrane potential with respect to the weight. We then unpack it by one time step:

m[t] = (∂U[t]/∂U[t − 1]) (∂U[t − 1]/∂W) + ∂U[t]/∂W   (10)

where the first term is the prior influence and the second term is the immediate influence. The immediate and prior influence components are graphically illustrated in Figure 10(a). The immediate influence is natural to calculate online, and evaluates to the unweighted input to the neuron X[t]. The prior influence relies on historical components of the network: based on Equation (4), in the absence of explicitly recurrent connections, the temporal term ∂U[t]/∂U[t − 1] evaluates to β. From Equation (10), the second factor ∂U[t − 1]/∂W is by definition m[t − 1]. Substituting these back into Equation (10) gives:

m[t] = βm[t − 1] + X[t]   (11)

This recursive formula is updated by passing the unweighted input directly to m[t], and decaying the influence term by the membrane potential decay rate β. The gradient that is ultimately used with the optimizer can then be derived with the chain rule:

∂L[t]/∂W = (∂L[t]/∂U[t]) (∂U[t]/∂W) = c̄[t] m[t]   (12)

where c̄[t] ≜ ∂L[t]/∂U[t] is the immediate credit assignment value obtained by backpropagating the instantaneous loss to the hidden state of the neuron, for example, by using a surrogate gradient approach. The calculation of m[t] only ever depends on present-time inputs and the influence at t − 1, thus enabling the loss to be calculated in an online manner. The input spike now plays a role in not only modulating the membrane potential of the neuron, but also the influence m[t]. The general flow of gradients is depicted in Figure 12.
An intuitive, though incomplete, way to think about RTRL is as follows. By reference to Figure 12, at each time step a backward pass that does not account for the history of weight updates is applied: ∂L[0]/∂W[0] (the immediate influence). Rather than directing gradients backwards through time, the partial derivative ∂U[0]/∂W[0] is 'pushed' forward in time. In doing so, it is scaled by the temporal term, giving X[0]β. This term modulates the immediate influence at the next time step. This can be thought of as a gradient term that 'snowballs' forward in time by modulating and accumulating with the immediate influence term, but also loses a bit of 'momentum' every time the temporal term β decays it.
In the example above, the RTRL approach to training SNNs was only derived for a single neuron and a single parameter. A full-scale neural network replaces the influence value with an influence matrix M[t] ∈ R^{n×P}, where n is the number of neurons and P is the number of parameters (approximately O(n²)). Therefore, the memory requirement of the influence matrix scales as O(n³).
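The single-neuron derivation above can be tested numerically. The toy below is our own illustrative sketch (it ignores spiking and reset so the model stays differentiable): the influence m[t] = βm[t − 1] + X[t] is pushed forward in time, and c̄[t]·m[t] is accumulated at every step, matching a finite-difference estimate of the gradient.

```python
def rtrl_grad(w, x_seq, beta=0.9, target=0.5):
    """Online (RTRL-style) gradient for a single leaky neuron u[t] = beta*u[t-1] + w*x[t].

    The influence m[t] = du[t]/dw is propagated forward in time as
    m[t] = beta*m[t-1] + x[t], and combined with the instantaneous credit
    assignment dL[t]/du[t] at every step, so no history is stored.
    """
    u, m, grad, loss = 0.0, 0.0, 0.0, 0.0
    for x in x_seq:
        u = beta * u + w * x
        m = beta * m + x          # influence update (recursive formula above)
        err = u - target          # dL[t]/du[t] for L[t] = 0.5*(u[t]-target)^2
        loss += 0.5 * err ** 2
        grad += err * m           # accumulate c_bar[t] * m[t]
    return loss, grad
```

Because the gradient is accumulated step by step, an optimizer could consume it at any time rather than waiting for the end of the sequence.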
Recent focus in online learning aims to reduce the memory and computational demands of RTRL. This is generally achieved by decomposing the influence matrix into simpler parts, approximating the calculation of M[t] by either removing terms entirely or trading them for stochastic noise [203][204][205][206]. Marschall et al. provide a systematic treatment of approximations to RTRL in RNNs [207], and variations of online learning have been applied specifically to SNNs in [119,120,208].

RTRL Variants in SNNs
Since 2020, a flurry of forward-mode learning algorithms have been tailored to SNNs. All such works either modify, re-derive, or approximate RTRL:
• e-prop (Bellec et al., 2020 [119]): RTRL is combined with surrogate gradient descent. Recurrent spiking neurons are used, where output spikes are linearly transformed and then fed back to the input of the same neurons. The computational graph is detached at the explicit recurrent operation, but retained for implicit recurrence (i.e., where the membrane potential evolves over time). Projecting output spikes into a higher-dimensional recurrent space acts like a reservoir, though it leads to biased gradient estimators that underperform compared to BPTT.
• decolle (Kaiser et al., 2020 [120]): 'Deep continuous online learning' also combines RTRL with surrogate gradient descent. This time, greedy local losses are applied at every layer [142]. As such, errors only need to be propagated back a single layer at a time. This means that errors do not need to traverse a huge network, which reduces the burden of the spatial credit assignment problem. This brings about two challenges: 1) not many problems can be cast into a form with definable local losses, and 2) greedy local learning prioritizes immediate gains without considering an overall objective.
• OSTL (Bohnstingl et al., 2022 [209]): 'Online spatio-temporal learning' re-derives RTRL. The spatial and temporal components of backpropagation are factored into two separate terms; e.g., one that tracks the 'immediate' influence, and one that tracks the 'prior' influence from Equation (10).
• ETLP (Quintana et al., 2023 [210]): 'Event-based three-factor local plasticity' combines e-prop with direct random target projection (DRTP) [141]. In other words, the weights in the final layer are updated based on an approximation of RTRL. Earlier layers are updated based on partial derivatives that do not rely on a global loss, and are spatially 'local' to the layer. Instead, the target output is used to modulate these gradients. This addresses spatial credit assignment by using signals from a target, rather than backpropagating gradients in the immediate influence term of Equation (10). The cost is that it inherits drawbacks from both e-prop and DRTP: like greedy local learning, DRTP prioritizes immediate gains without considering an overall objective.
• OSTTP (Ortner and Pes et al., 2023 [211]): 'Online Spatiotemporal Learning with Target Projection' combines OSTL (functionally equivalent to RTRL) with DRTP. It inherits the drawbacks of DRTP, while addressing the spatial credit assignment problem.
• FPTT (Kag et al., 2021 [212]): 'Forward Propagation Through Time' considers RTRL for sequence-to-sequence models with time-varying losses. A regularization term is applied to the loss at each step to ensure stability during the training process. Yin et al. subsequently applied FPTT to SNNs with more complex neuron models with richer dynamics [213].
This is a non-exhaustive list of RTRL alternatives, and it can appear quite daunting at first. But all approaches effectively stem from RTRL. The dominant trends include:
1. Approximating RTRL to test how much approximation the training procedure can tolerate without completely failing [119]
2. Replacing the immediate influence with global modulation by a loss or target to address spatial credit assignment [120,210,211]
3. Modifying the objective to promote stable training dynamics [212]
4. Identifying similarities to biology by factorizing RTRL into eligibility traces and/or three-factor learning rules [119,210,213]
Several RTRL variants claim to outperform BPTT in terms of loss minimization, though we take caution with such claims, as these variants effectively become identical to BPTT when weight updates are deferred to the end of a sequence. We also caution against claims of improvements over RTRL, as RTRL can be thought of as the most general case of forward-mode learning applied to any generic architecture. Most reductions in computational complexity arise because they are narrowly derived for specific architectures, or otherwise introduce approximations into their models. In contrast, Tallec and Ollivier developed an 'unbiased online recurrent optimization' scheme where stochastic noise is introduced and ultimately cancelled out, leading to quadratic (rather than cubic) computational complexity with network size [203].

Practical Considerations with RTRL
Several practical considerations should be accounted for when implementing online learning algorithms. For an approach that closely resembles BPTT, the gradient accumulated at the end of the sequence can be used to update the network, which is referred to as a 'deferred' update. Alternatively, it is possible to update the network more frequently, as a gradient is consistently available. While the latter option is a more accurate reflection of biological learning (i.e., training and inference are not decoupled processes), there are two issues that must be treated with care. Firstly, adaptive optimizers such as Adam naturally reduce the learning rate as parameters approach optimal values [189]. When applying frequent updates on a given batch of data, future batches will have less influence on weight updates. The result is a learning procedure that assigns a higher weighting to early data than to later data. If the sampled data does not satisfy the i.i.d. assumption, which is the case when a system experiences data in an online fashion, learning may not perform well. Secondly, the reverse problem is catastrophic forgetting, where new information causes the network to forget what it has previously learnt [214]. This is especially problematic in real-time systems because the "real-world batch size" is equal to 1. Several approaches to overcome catastrophic forgetting in continual learning have been proposed, including using higher-dimensional synapses [215], ensembles of networks [216], pseudo-replay [217], and penalizing weights that change excessively fast [218].
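To make the 'deferred' versus per-step distinction concrete, the toy sketch below (with illustrative names and values) applies a fixed sequence of per-step gradients either all at once or one at a time. With vanilla SGD and precomputed gradients the two schedules coincide, since the final weight is linear in the gradients; it is precisely adaptive optimizers like Adam that break this equivalence by rescaling later updates, which is the source of the early-data bias discussed above.

```python
def train(grads, lr=0.1, deferred=True):
    """Toy comparison of 'deferred' vs per-step weight updates.

    `grads` stands in for the per-time-step gradients an online learning
    rule (e.g., RTRL) makes available. A deferred update applies their sum
    once at the end of the sequence, mimicking BPTT's single update; the
    per-step schedule applies each gradient as soon as it is available.
    Note: in real training the per-step gradients would themselves change
    after each update, so this equivalence only holds for fixed gradients.
    """
    w = 0.0
    if deferred:
        w -= lr * sum(grads)      # one update at the end of the sequence
    else:
        for g in grads:
            w -= lr * g           # update whenever a gradient is available
    return w
```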

Spatial Locality
While temporal locality relies on a learning rule that depends only on the present state of the network, spatial locality requires each update to be derived from a node immediately adjacent to the parameter. The biologically motivated learning rules described in Section 3.4 address the spatial credit assignment problem by either replacing the global error signal with local errors, or replacing analytical/numerical derivatives with random noise [137].
The more 'natural' approach to online learning is perceived to be via unsupervised learning with synaptic plasticity rules, such as STDP [32,219] and variants of STDP (Appendix C.2) [220][221][222][223]. These approaches are directly inspired by experimental relationships between spike times and changes to synaptic conductance. Input data is fed to a network, and weights are updated based on the order and firing times of each pair of connected neurons (Figure 10(b)). The interpretation is that if a neuron causes another neuron to fire, then their synaptic strength should be increased. If a pair of neurons appear uncorrelated, their synaptic strength should be decreased. It follows the Hebbian mantra of 'neurons that fire together wire together' [146].
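The pairwise rule described above can be written compactly. The sketch below is a standard pair-based STDP curve with illustrative (not experimentally fitted) parameter values: causal orderings potentiate the synapse, anti-causal orderings depress it, and both effects decay exponentially with the spike-time difference.

```python
import math

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: weight change as a function of spike-time difference.

    If the presynaptic spike precedes the postsynaptic spike (t_pre < t_post),
    the synapse is potentiated; the reverse order depresses it. Magnitudes
    decay exponentially with the time difference, with time constant `tau` (ms).
    """
    dt = t_post - t_pre
    if dt >= 0:
        return a_plus * math.exp(-dt / tau)     # long-term potentiation
    return -a_minus * math.exp(dt / tau)        # long-term depression
```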
There is a common misconception that backprop and STDP-like learning rules are at odds with one another, competing to be the long-term solution for training connectionist networks. On the one hand, it is thought that STDP deserves more attention as it scales with less complexity than backprop. STDP adheres to temporal and spatial locality, as each synaptic update only relies on information from immediately adjacent nodes. However, this relationship necessarily arises because STDP was characterised using data from 'immediately adjacent' neurons. On the other hand, STDP fails to compete with backprop on even remotely challenging datasets. But backprop was designed with function optimization in mind, while STDP emerged as a physiological observation. The mere fact that STDP is capable at all of obtaining competitive results on tasks originally intended for supervised learning (such as classifying the MNIST dataset), no matter how simple, is quite a wonder. Rather than focusing on what divides backprop and STDP, the pursuit of more effective learning rules will more likely benefit from understanding how the two intersect.
We demonstrated in Section 4.3.4 how surrogate gradient descent via BPTT subsumes the effect of STDP. Spike time differences result in exponentially decaying weight update magnitudes, such that half of the learning window of STDP is already accounted for within the BPTT algorithm (Figure 10(b)). Bengio et al. previously made the case that STDP resembles stochastic gradient descent, provided that STDP is supplemented with gradient feedback [224,225]. This specifically relates to the case where a neuron's firing rate is interpreted as its activation. Here, we have demonstrated that no modification needs to be made to the BPTT algorithm for it to account for STDP-like effects, and this is not limited to any specific neural code, such as the firing rate. The common theme is that STDP may benefit from integrating error-triggered plasticity to provide meaningful feedback when training a network [226].

Outlook
Designing a neural network was once thought to be strictly an engineering problem, whereas mapping the brain was a scientific curiosity [227]. With the intersection between deep learning and neuroscience broadening, and with brains able to solve complex problems far more efficiently, this view is poised to change. From the scientist's view, deep learning and brain activity have shown many correlates, which leads us to believe that there is much untapped insight that ANNs can offer in the ambitious quest of understanding biological learning. For example, the activity across layers of a neural network has repeatedly shown similarities to experimental activity in the brain. This includes links between convolutional neural networks and measured activity from the visual cortex [228][229][230], and auditory processing regions [231]. Activity levels across populations of neurons have been quantified in many studies, but SNNs might inform us of the specific nature of such activity.
From the engineer's perspective, neuron models derived from experimental results have allowed us to design extremely energy-efficient networks when running on hardware tailored to SNNs [232][233][234][235][236][237][238]. Improvements in energy consumption of up to 2-3 orders of magnitude have been reported when compared to conventional ANN acceleration on embedded hardware, which provides empirical validation of the benefits available from the three S's: spikes, sparsity and static data suppression (or event-driven processing) [20,80,[239][240][241]. These energy and latency benefits are derived from simply applying neuron models to connectionist networks, but there is so much more left to explore.
It is safe to say the energy benefits afforded by spikes are uncontroversial. But a more challenging question to address is: are spikes actually good for computation? It could be that years of evolution determined spikes solved the long-range signal transmission problem in living organisms, and everything else had to adapt to fit this constraint. If this were true, then spike-based computation would be Pareto optimal with a proclivity towards energy efficiency and latency. But until we amass more evidence of a spike's purpose, we have some intuition as to where spikes shine in computation:
• Hybrid dynamical systems: SNNs can model a broad class of dynamical systems by coupling discrete and continuous time dynamics into one system. Discontinuities are present in many physical systems, and spiking neuron models are a natural fit to model such dynamics.
• Discrete function approximators: Neural networks are universal function approximators, where discrete functions are considered to be modelled sufficiently well by continuous approximations. Spikes are capable of precisely defining discrete functions without approximation.
• Multiplexing: Spikes can encode different information in spike rate, times, or burst counts. Re-purposing the same spikes offers a sensible way to condense the amount of computation required by a system.
• Message packets: By compressing the representation of information, spikes can be thought of as packets of messages that are unlikely to collide as they travel across a network. In contrast, a digital system requires a synchronous clock to signal that a communication channel is available for a message to pass through (even when modelling asynchronous systems).
• Coincidence detection: Neural information can be encoded based on spatially disparate but temporally proximate input spikes on a target neuron. It may be the case that isolated input spikes are insufficient to elicit a spike from the output neuron, but if two incident spikes occur on a timescale faster than the target neuron's membrane potential decay rate, this could push the potential beyond the threshold and trigger an output spike. In such a case, associative learning is taking place across neurons that are not directly connected. Although coincidence detection can be programmed in a continuous-time system without spikes, a theoretical analysis has shown that the processing rate of a coincidence detector neuron is faster than the rate at which information is passed to a neuron [242,243].
• Noise robustness: While analog signals are highly susceptible to noise, digital signals are far more robust in long-range communication. Neurons seem to have figured this out by performing analog computation via integration at the soma, and digital communication along the axon. It is possible that any noise incident during analog computation at the soma is subsumed into the subthreshold dynamics of the neuron, and therefore eliminated. In terms of neural coding, a similar analogy can be made to spike rates and spike times. Pathways that are susceptible to adversarial attacks or timing perturbations could learn to be represented as a rate, which mitigates timing disturbances in temporal codes.
• Modality normalisation: A unified representation of sensory input (e.g., vision, auditory) as spikes is nature's way of normalising data. While this benefit is not exclusive to spikes (i.e., continuous data streams in non-spiking networks may also be normalised), early empirical evidence has shown instances where multi-modal SNNs outperform convolutional neural networks on equivalent tasks [20,239].
• Mixed-mode differentiation: While most modern deep learning frameworks rely on reverse-mode autodifferentiation [244], it is in stark contrast to how the spatial credit assignment problem is treated in biological organisms. If we are to draw parallels between backpropagation and the brain, it is far more likely that approximations of forward-mode autodifferentiation are being used instead. Equation (12) in Section 5 describes how to propagate gradient-related terms forward in time to implement online learning, where such terms could be approximated by eligibility traces that keep track of pre-synaptic neuron activity in the form of calcium ions, and fade over time [119,245]. SNNs offer a natural way to use mixed-mode differentiation by projecting temporal terms in the gradient calculation from Equation (11) into the future via forward-mode differentiation, while taking advantage of the computational efficiency of reverse-mode autodifferentiation for spatial terms [73,120].
A better understanding of the types of problems spikes are best suited for, beyond addressing just energy efficiency, will be important in directing SNNs to meaningful tasks. The above list is a non-exhaustive start to intuit where that might be. Thus far, we have primarily viewed the benefits of SNNs by examining individual spikes. For example, the advantages derived from sparsity and single-bit communication arise at the level of an individual spiking neuron: how a spike promotes sparsity, how it contributes to a neural encoding strategy, and how it can be used in conjunction with modern deep learning, backprop, and gradient descent. Despite the advances yielded by this spike-centric view, it is important not to develop tunnel vision. New advances are likely to come from a deeper understanding of spikes acting collectively, much like the progression from atoms to waves in physics.
Designing learning rules that operate with brain-like performance is far less trivial than substituting a set of artificial neurons with spiking neurons. It would be incredibly elegant if a unified principle governed how the brain learns. But the diversity of neurons, functions, and brain regions imply that a heterogeneous system rich in objectives and synaptic update rules is more likely, and might require us to use all of the weapons in our arsenal of machine learning tools. It is likely that a better understanding of biological learning will be amassed by observing the behavior of a collection of spikes distributed across brain regions. Ongoing advances in procuring large-scale electrophysiological recordings at the neuron level can give us a window into observing how populations of spikes are orchestrated to handle credit assignment so efficiently, and at the very least, give us a more refined toolkit for developing theories that may advance deep learning [246,247]. After all, it was not a single atom that led to the silicon revolution, but rather, a mass of particles, and their collective fields. A stronger understanding of the computational benefits of spikes may require us to think at a larger scale, in terms of the 'fields' of spikes.
As the known benefits of SNNs manifest in the physical quantities of energy and latency, it will take more than just a machine learning mind to navigate the tangled highways of 100 trillion synapses. It will take a concerted effort between machine learning engineers, neuroscientists, and circuit designers to put spikes in the front seat of deep learning.

A.1 Forward Euler Method to Solving Spiking Neuron Models
The time derivative dU(t)/dt is substituted into Equation (1) without taking the limit ∆t → 0:

τ (U(t + ∆t) − U(t))/∆t = −U(t) + I_in(t)R   (13)

For small enough values of ∆t, this provides a sufficient approximation of continuous-time integration. Isolating the membrane potential at the next time step on the left side of the equation gives:

U(t + ∆t) = (1 − ∆t/τ)U(t) + (∆t/τ)I_in(t)R   (14)

To single out the leaky membrane potential dynamics, assume there is no input current, I_in(t) = 0 A:

U(t + ∆t) = (1 − ∆t/τ)U(t)   (15)

Let the ratio of subsequent values of U, i.e., U(t + ∆t)/U(t), be the decay rate of the membrane potential, also known as the inverse time constant. From Equation (15), this implies that β = (1 − ∆t/τ).
Assume t is discretised into sequential time steps, such that ∆t = 1. To further reduce the number of hyperparameters from Equation (15), assume R = 1 Ω. This leads to the result in Equation (3), where the following representation is shifted by one time step:

U[t] = βU[t − 1] + (1 − β)I_in[t]   (17)

The input current is weighted by (1 − β) and time-shifted by one step such that it can instantaneously contribute to the membrane potential. While this is not a physiologically precise assumption, it casts the neuron model into a form that better resembles an RNN. β can be solved using the continuous-time solution from Equation (2). In the absence of current injection:

U(t) = U_0 e^{−t/τ}   (18)

where U_0 is the initial membrane potential at t = 0. Assuming Equation (18) is computed at discrete steps of t, (t + ∆t), (t + 2∆t), . . ., then the ratio of membrane potential across two subsequent steps can be calculated using:

β = U(t + ∆t)/U(t) = U_0 e^{−(t+∆t)/τ} / (U_0 e^{−t/τ}) = e^{−∆t/τ}   (19)

It is preferable to calculate β using Equation (19) rather than β = (1 − ∆t/τ), as the latter is only precise for ∆t ≪ τ. This result for β can then be used in Equation (17).
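The two expressions for β can be compared directly. A small sketch, using arbitrary illustrative values of ∆t and τ:

```python
import math

def beta_exact(dt, tau):
    """Decay rate from the continuous-time solution: beta = exp(-dt/tau)."""
    return math.exp(-dt / tau)

def beta_euler(dt, tau):
    """Forward-Euler approximation: beta = 1 - dt/tau (accurate for dt << tau)."""
    return 1.0 - dt / tau
```

For ∆t ≪ τ the two agree closely, but as ∆t approaches τ the Euler approximation degrades badly (at ∆t = τ it gives 0 instead of e⁻¹ ≈ 0.368), which is why the exponential form is preferred.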
A second non-physiological assumption is made, where the effect of (1 − β) is absorbed by a learnable weight W:

WX[t] = (1 − β)I_in[t]   (20)

This can be interpreted in the following way. X[t] is an input voltage, spike, or unweighted current, and is scaled by the synaptic conductance W to generate a current injection to the neuron. This leads to the following result:

U[t] = βU[t − 1] + WX[t]   (21)

where the effects of W and β are decoupled, thus favouring simplicity over biological precision.
To arrive at Equation (4), a reset function is appended, which activates every time an output spike is triggered. The reset mechanism can be implemented by either subtracting the threshold at the onset of a spike, as in Equation (4), or by forcing the membrane potential to zero:

U[t] = βU[t − 1](1 − S[t − 1]) + WX[t]   (reset-to-zero)   (22)

In general, reset-by-subtraction is thought to be better for performance as it retains residual superthreshold information, while reset-to-zero is more efficient as U[t] will always be forced to zero when a spike is triggered. This has been formally demonstrated in ANN-SNN conversion approaches (Section 4.1), though it has not yet been characterised for natively trained SNNs. The two approaches converge for a small enough time window, where the amount by which U[t] exceeds the threshold within one step becomes negligible.
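The two reset mechanisms can be compared with a single update step. The sketch below follows the reset-by-subtraction form of Equation (4); the reset-to-zero variant and all parameter defaults are illustrative choices of ours:

```python
def lif_step(u, x, w=0.5, beta=0.9, theta=1.0, reset="subtract"):
    """One step of a LIF neuron with either reset mechanism.

    Reset-by-subtraction keeps residual superthreshold information;
    reset-to-zero clears the membrane entirely after a spike. Below
    threshold, the two mechanisms produce identical dynamics.
    """
    s_prev = 1 if u >= theta else 0      # spike from the previous step
    if reset == "subtract":
        u_next = beta * u + w * x - s_prev * theta
    else:                                # reset-to-zero
        u_next = beta * (1 - s_prev) * u + w * x
    return u_next
```

Starting from a superthreshold potential (e.g., u = 1.5 with θ = 1), reset-by-subtraction carries the residual forward while reset-to-zero discards it, which is exactly the performance-versus-efficiency trade-off described above.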

B Appendix B: Spike Encoding
The following spike encoding mechanisms and loss functions are described with respect to a single sample of data. They can be generalised to multiple samples as is common practice in deep learning to process data in batches.

B.1 Rate Coded Input Conversion
An example of the conversion of an input sample to a rate-coded spike train follows. Let X ∈ R^{m×n} be a sample from the MNIST dataset, where m = n = 28. We wish to convert X to a rate-coded 3-D tensor R ∈ R^{m×n×t}, where t is the number of time steps. Each feature of the original sample X_{ij} is encoded separately, where the normalised pixel intensity (between 0 and 1) is the probability a spike occurs at any given time step. This can be treated as a Bernoulli trial, a special case of the binomial distribution R_{ijk} ∼ B(n, p), where the number of trials is n = 1 and the probability of success (spiking) is p = X_{ij}. Explicitly, the probability a spike occurs is:

P(R_{ijk} = 1) = X_{ij} = 1 − P(R_{ijk} = 0)

Sampling from the Bernoulli distribution for every feature at each time step will populate the 3-D tensor R with 1's and 0's. For an MNIST image, a pure white pixel X_{ij} = 1 corresponds to a 100% probability of spiking. A pure black pixel X_{ij} = 0 will never generate a spike. A gray pixel of value X_{ij} = 0.5 will have an equal probability of sampling either a '1' or a '0'. As the number of time steps t → ∞, the proportion of spikes is expected to approach 0.5.
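A minimal sketch of this Bernoulli conversion in pure Python (snnTorch provides similar functionality in its `spikegen` module; the helper below is our own illustrative version operating on a nested list):

```python
import random

def rate_encode(image, num_steps, seed=0):
    """Bernoulli rate coding: each pixel intensity in [0, 1] becomes the
    per-step spike probability, yielding a [num_steps][rows][cols] tensor
    of 1's and 0's."""
    rng = random.Random(seed)
    return [[[1 if rng.random() < p else 0 for p in row] for row in image]
            for _ in range(num_steps)]
```

A black pixel (0.0) never spikes, a white pixel (1.0) spikes at every step, and a gray pixel (0.5) spikes roughly half of the time, matching the limiting behaviour described above.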

B.2 Latency Coded Input Conversion
The logarithmic dependence between input feature intensity and spike timing can be derived using an RC circuit model. Starting with the general solution of the membrane potential with respect to the input current in Equation (2), and nulling out the initial conditions U_0 = 0, we obtain:

U(t) = I_in R(1 − e^{−t/τ})   (24)

For a constant current injection, U(t) will exponentially relax towards a steady-state value of I_in R. Say a spike is emitted when U(t) reaches a threshold θ. We solve for the time at which U(t) = θ:

t = τ ln ( I_in R / (I_in R − θ) )   (25)

The larger the input current, the faster U(t) charges up to θ, and the earlier a spike occurs. The steady-state potential I_in R is set to the input feature x:

t(x) = τ ln ( x / (x − θ) )   (26)
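A sketch of the latency code derived above (θ and τ values are illustrative; inputs at or below threshold never charge past θ, which we represent as an infinite spike time):

```python
import math

def latency_encode(x, tau=5.0, theta=0.1):
    """First-spike latency from an RC charging curve: t = tau * ln(x / (x - theta)).

    Brighter inputs (larger x) charge the membrane faster and spike sooner;
    inputs at or below the threshold never spike (returned as infinity).
    """
    if x <= theta:
        return math.inf
    return tau * math.log(x / (x - theta))
```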

B.3 Rate Coded Outputs
A vectorised implementation of determining the predicted class from rate-coded output spike trains is described. Let ⃗S[t] ∈ R^{N_C} be a time-varying vector that represents the spikes emitted from each output neuron across time, where N_C is the number of output classes. Let ⃗c ∈ R^{N_C} be the spike count from each output neuron, which can be obtained by summing ⃗S[t] over T time steps:

⃗c = Σ_{t=1}^{T} ⃗S[t]   (27)

The index of ⃗c with the maximum count corresponds to the predicted class:

ŷ = arg max_i c_i   (28)

Figure S.4: Rate coded outputs. ⃗c ∈ R^{N_C} is the spike count from each output neuron, where the example above shows the first neuron firing a total of 8 times. ŷ represents the index of the predicted output neuron, here indicating that the first neuron is the correct class.
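The count-and-argmax pipeline can be sketched as follows (an illustrative helper of ours, taking a list of per-time-step spike vectors):

```python
def predict_rate(spike_record):
    """Predicted class from rate-coded outputs: sum spikes over time for
    each output neuron, then take the argmax of the resulting counts."""
    num_classes = len(spike_record[0])
    counts = [sum(step[i] for step in spike_record) for i in range(num_classes)]
    return max(range(num_classes), key=counts.__getitem__), counts
```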

B.4 Cross Entropy Spike Rate
The spike count of the output layer ⃗c ∈ R^{N_C} is obtained as in Equation (27). c_i is the i-th element of ⃗c, treated as the logits in the softmax function:

p_i = e^{c_i} / Σ_{j=1}^{N_C} e^{c_j}   (29)

The cross entropy between p_i and the one-hot target vector ⃗y ∈ {0, 1}^{N_C} is obtained using:

L_{CE} = − Σ_{i=1}^{N_C} y_i log p_i   (30)
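This loss can be sketched in a numerically stable form (the helper name is our own; subtracting the maximum count before exponentiating avoids overflow without changing the softmax):

```python
import math

def cross_entropy_spike_rate(counts, target_idx):
    """Softmax over spike counts treated as logits, then cross entropy
    against a one-hot target: L = -log softmax(counts)[target_idx]."""
    m = max(counts)                           # stabilise the exponentials
    exps = [math.exp(c - m) for c in counts]
    z = sum(exps)
    return -math.log(exps[target_idx] / z)
```

The loss is smallest when the correct class has the highest spike count, which is what drives the correct neuron to fire more often during training.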

B.5 Mean Square Spike Rate
As in Equation (27), the spike count of the output layer ⃗c ∈ R^{N_C} is obtained. c_i is the i-th element of ⃗c, and let y_i ∈ R be the target spike count over a period of time T for the i-th output neuron. The target for the correct class should be greater than that of incorrect classes:

L_{MSE} = Σ_{i=1}^{N_C} (y_i − c_i)²   (31)

B.6 Maximum Membrane
The logits ⃗m ∈ R^{N_C} are obtained by taking the maximum value of the membrane potential of the output layer ⃗U[t] ∈ R^{N_C} over time:

⃗m = max_t ⃗U[t]   (32)

The elements of ⃗m replace c_i in the softmax function from Equation (29), and the peak membrane potential for each neuron is used in the cross entropy loss function. This encourages the peak of the correct class to grow, while that of the incorrect classes is suppressed. The effect of this is to promote more firing from the correct class and less from the incorrect classes.
Alternatively, the membrane potential is summed over time to obtain the logits:

⃗m = Σ_{t=1}^{T} ⃗U[t]   (33)

B.7 Mean Square Membrane
Let y_i[t] be a time-varying value that specifies the target membrane potential of the i-th neuron at each time step. The total mean square error is calculated by summing the loss for all T time steps and for all N_C output layer neurons:

L_{MSE} = Σ_{t=1}^{T} Σ_{i=1}^{N_C} (y_i[t] − U_i[t])²   (34)

Alternatively, the time-varying target y_i[t] can be replaced with a time-static target to drive the membrane potential of all neurons to a constant value. This can be an efficient implementation of a rate code, where the target of the correct class exceeds the threshold and all other targets are subthreshold values. The membrane potential at each time step is applied to the mean square error loss function, which allows the membrane target to be defined at every step; for example, the target can be set at the firing threshold for all time steps for the correct class, and at zero for incorrect classes.

B.8 Cross Entropy Latency Code
Let ⃗f ∈ ℝ^{N_C} be a vector containing the first spike time of each neuron in the output layer. Cross entropy minimisation aims to maximise the logit of the correct class and reduce the logits of the incorrect classes. However, we wish for the correct class to spike first, which corresponds to a smaller value, so a monotonically decreasing function must first be applied to ⃗f. Many options are available. The work in [76] simply negates the spike times:

$$\tilde{f}_i = -f_i$$

Taking the inverse of each element f_i of ⃗f is also a valid option:

$$\tilde{f}_i = \frac{1}{f_i}$$

The new values \tilde{f}_i then replace c_i in the softmax function from Equation (29). The inverse in Equation (36) must be treated with care, as it precludes spikes from occurring at t = 0; otherwise, 1/f_i → ∞.
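Both decreasing transforms can be sketched together (plain NumPy; the function name and the explicit guard against zero spike times are our additions):

```python
import numpy as np

def latency_logits(first_spike_times, mode="negate"):
    """Monotonically decreasing transform of first-spike times so that
    earlier spikes yield larger logits for the softmax."""
    f = np.asarray(first_spike_times, dtype=float)
    if mode == "negate":
        return -f
    if mode == "inverse":
        # The inverse diverges at t = 0, so spike times must be positive.
        if np.any(f == 0):
            raise ValueError("inverse mode requires spike times > 0")
        return 1.0 / f
    raise ValueError(mode)
```

In both modes the neuron that fires earliest receives the largest logit, so minimising the cross entropy pushes the correct class to fire first.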

B.9 Mean Square Spike Time
The spike time(s) of all neurons are specified as targets. In the case where only the first spike matters, ⃗f ∈ ℝ^{N_C} contains the first spike time of each neuron in the output layer, and y_i ∈ ℝ is the target spike time for the i-th output neuron. The mean square errors between the actual and target spike times of all output classes are summed together:

$$\mathcal{L}_{MSE} = \sum_{i=1}^{N_C} (y_i - f_i)^2$$

Figure S.9: Cross Entropy Latency Code. Applying the inverse (or negated) spike time to the cross entropy loss pushes the correct class to fire first, and incorrect classes to fire later.
This can be generalised to account for multiple spikes [75]. In this case, ⃗f_i becomes a list of emitted spike times and ⃗y_i a vector of desired spike times for the i-th neuron. The k-th spike is sequentially taken from ⃗f_i and ⃗y_i, and the mean square error between the two is calculated. This process is repeated n times, where n is the number of spike times that have been specified, and the errors are summed together across spikes and classes:

$$\mathcal{L}_{MSE} = \sum_{i=1}^{N_C}\sum_{k=1}^{n} \big(y_i^{(k)} - f_i^{(k)}\big)^2$$

All specified spike times are iterated over and sequentially applied to the mean square error loss, enabling the timing of multiple spikes to be precisely defined.
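A sketch of the multi-spike form (pure Python; the function name is ours, and the first n specified spikes per neuron are assumed to be matched pairwise):

```python
def mse_spike_time_loss(spike_times, target_times):
    """Sum of squared errors between emitted and target spike times.

    spike_times / target_times: lists of per-neuron lists; the k-th
    entries of each neuron's lists are compared pairwise.
    """
    loss = 0.0
    for f_i, y_i in zip(spike_times, target_times):
        for f_k, y_k in zip(f_i, y_i):
            loss += (y_k - f_k) ** 2
    return loss
```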

B.10 Mean Square Relative Spike Time
The difference between the spike time of correct and incorrect neurons is specified as a target. As in Appendix B.9, y i is the desired spike time for the i th neuron and f i is the actual emitted spike time. The key difference is that y i can change throughout the training process.
Let the minimum possible spike time be f_0 ∈ ℝ. This sets the target firing time of the correct class. The target firing time y_i of each incorrect neuron class is set to:

$$y_i = \begin{cases} f_0 + \gamma, & \text{if } f_i < f_0 + \gamma \\ f_i, & \text{if } f_i \geq f_0 + \gamma \end{cases}$$

where γ is a pre-defined latency, treated as a hyperparameter. In the first case, if an incorrect neuron fires within the latency period γ of the correct neuron, a penalty is applied. In the second case, where the incorrect neuron fires at least γ steps after the correct neuron, the target is simply set to the actual spike time, so the corresponding term contributes zero to the loss. This target y_i is then applied to the mean square error loss (Equation (38)).
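The target construction can be sketched as follows (plain NumPy; the function name is ours, and the correct class is assumed to keep its own earliest spike time f_0 as its target):

```python
import numpy as np

def relative_targets(first_spike_times, correct_idx, gamma):
    """Targets for the mean square relative spike time loss.

    An incorrect neuron is only penalised if it fires within `gamma`
    steps of the correct neuron's spike time f0; otherwise its target
    equals its actual spike time (zero loss contribution).
    """
    f = np.asarray(first_spike_times, dtype=float)
    f0 = f[correct_idx]
    # Either f0 + gamma (if the neuron fired too early), or its own time.
    y = np.maximum(f, f0 + gamma)
    y[correct_idx] = f0
    return y
```

Here a neuron firing at t = 3 with f_0 = 2 and γ = 5 is pushed back toward t = 7, while a neuron already firing at t = 10 is left untouched.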

B.11 Population Level Regularisation
L1-regularisation can be applied to the total number of spikes emitted at the output layer to penalise excessive firing [152], thus encouraging sparse activity at the output:

$$\mathcal{L}_{L1} = \lambda_1 \sum_{t}\sum_{i} S_i[t]$$

where λ_1 is a hyperparameter controlling the influence of the regularisation term, and S_i[t] is the spike of the i-th class at time t.
Alternatively, an upper-activity threshold θ_U can be applied: only if the spiking activity of layer l exceeds this threshold does the regularisation penalty apply:

$$\mathcal{L}_{U} = \lambda_U \left(\left[\frac{1}{N}\sum_{i=1}^{N} c_i - \theta_U\right]_+\right)^L$$

where c_i is the total spike count over time for the i-th neuron in layer l, and N is the total number of neurons in layer l. λ_U is a hyperparameter influencing the strength of the upper-activity regularisation, and [·]_+ is a linear rectification: if the average number of spikes from the layer is less than θ_U, the rectifier clips the negative result to zero such that a penalty is not added. L is typically chosen to be either 1 or 2 [123]. It is possible to swap out the spike count for a time-averaged membrane potential as well, if using hidden-state variables is permissible [120].
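Both population-level penalties can be sketched in plain NumPy (function names are ours; averaging the spike counts over the N neurons of the layer is an assumption consistent with the definitions above):

```python
import numpy as np

def l1_spike_penalty(spikes, lam1):
    """L1 penalty on total output spiking. spikes: array (T, N_C) of {0, 1}."""
    return lam1 * float(np.sum(spikes))

def upper_threshold_penalty(spike_counts, theta_u, lam_u, L=2):
    """Penalty applied only when the layer's mean spike count exceeds
    theta_u; the rectifier clips sub-threshold activity to zero."""
    avg = float(np.mean(spike_counts))
    return lam_u * max(avg - theta_u, 0.0) ** L
```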

B.12 Neuron Level Regularisation
A lower-activity threshold θ_L specifies the lower permissible limit of firing for each neuron before the regularisation penalty is applied:

$$\mathcal{L}_{L} = \frac{\lambda_L}{N}\sum_{i=1}^{N} \left([\theta_L - c_i]_+\right)^2$$

The rectification [·]_+ now falls within the summation, and is applied to the firing activity of each individual neuron rather than to a population of neurons, where λ_L is a hyperparameter that influences the strength of the lower-activity regularisation [123]. As with population-level regularisation, the spike count can also be substituted for a time-averaged membrane potential [120].
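A sketch of the neuron-level penalty (plain NumPy; the function name is ours, and the squared rectified deficit with a 1/N normalisation is one consistent reading of the definition above):

```python
import numpy as np

def lower_threshold_penalty(spike_counts, theta_l, lam_l):
    """Per-neuron penalty when a neuron's spike count falls below theta_l.
    The rectification is applied inside the sum, neuron by neuron."""
    c = np.asarray(spike_counts, dtype=float)
    n = len(c)
    return lam_l / n * float(np.sum(np.maximum(theta_l - c, 0.0) ** 2))
```

Unlike the population-level version, a single silent neuron is penalised even if the rest of the layer is highly active.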
C Appendix C: Training Spiking Neural Networks

C.1 Backpropagation Using Spike Times
In the original description of SpikeProp from [124], a spike response model is used:

$$U_j(t) = \sum_{i}\sum_{k} W_{i,j}\,\varepsilon\big(t - f_i^{(k)}\big)$$

where W_{i,j} is the weight between the i-th presynaptic and j-th postsynaptic neurons, f_i^{(k)} is the firing time of the k-th spike from the i-th presynaptic neuron, and U_j(t) is the membrane potential of the j-th neuron. For simplicity, the 'alpha function' defined below is frequently used for the kernel:

$$\varepsilon(t) = \frac{t}{\tau}\, e^{1 - t/\tau}\,\Theta(t)$$

where τ and Θ are the time constant of the kernel and the Heaviside step function, respectively.
Consider an SNN where each target specifies the timing of the output spike emitted from the j-th output neuron (y_j). This is used in the mean square spike time loss (Equation (37), Appendix B.9), where f_j is the actual spike time. Rather than backpropagating in time through the entire history of the simulation, only the gradient pathway through the spike time of each neuron is taken. The gradient of the loss in weight space is then:

$$\frac{\partial \mathcal{L}}{\partial W_{i,j}} = \frac{\partial \mathcal{L}}{\partial f_j}\frac{\partial f_j}{\partial U_j}\frac{\partial U_j}{\partial W_{i,j}}$$

The first term on the right side evaluates to:

$$\frac{\partial \mathcal{L}}{\partial f_j} = -2(y_j - f_j)$$

The third term can be derived from Equation (43):

$$\frac{\partial U_j}{\partial W_{i,j}} = \sum_{k}\varepsilon\big(f_j - f_i^{(k)}\big)$$

The second term in Equation (45) can be calculated by instead computing $-\partial U_j/\partial t\,\big|_{t=f_j}$ and then taking the inverse, i.e., by linearising the membrane potential around the threshold-crossing time.
In [124], the evolution of U_j(t) can be analytically solved using Equations (43) and (44), so the derivative of the membrane potential at the postsynaptic firing time follows directly from differentiating the kernel. Note that the input current is triggered at the onset of the pre-synaptic spike t = f_i, but is evaluated at the time of the post-synaptic spike t = f_j. For a single spike per presynaptic neuron, the results can be combined to give:

$$\frac{\partial \mathcal{L}}{\partial W_{i,j}} = \frac{2(y_j - f_j)\,\varepsilon(f_j - f_i)}{\partial U_j/\partial t\,\big|_{t=f_j}}$$

This approach can be generalized to handle deeper layers, and the original formulation also includes delayed response kernels that are not included above for simplicity.
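The alpha kernel and the resulting membrane potential can be sketched numerically (plain NumPy; function names are ours, and a single spike per presynaptic neuron is assumed as in the simplification above):

```python
import numpy as np

def alpha_kernel(t, tau):
    """'Alpha function' kernel: eps(t) = (t/tau) * exp(1 - t/tau) for t >= 0,
    zero otherwise (Heaviside step). Peaks at t = tau with value 1."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0, (t / tau) * np.exp(1 - t / tau), 0.0)

def membrane(t, weights, pre_spike_times, tau):
    """Spike response model: U_j(t) as a weighted sum of kernels, each
    triggered at the corresponding presynaptic firing time."""
    return sum(w * alpha_kernel(t - f, tau)
               for w, f in zip(weights, pre_spike_times))
```

The unit peak at t = τ makes the kernel convenient for SpikeProp-style derivations, since its derivative changes sign exactly at the peak.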

C.2 Backpropagation Using Spikes
Spike Timing Dependent Plasticity. The connection between a pair of neurons can be altered by the spikes emitted by both neurons. Several experiments have shown that the relative timing of spikes between pre- and post-synaptic neurons can be used to define a learning rule for updating the synaptic weight [32]. Let t_pre and t_post represent the timing of the pre- and post-synaptic spikes, respectively. The difference in spike time is:

$$\Delta t = t_{pre} - t_{post}$$

When the pre-synaptic neuron emits a spike before the post-synaptic neuron, such that the pre-synaptic spike may have caused the post-synaptic spike, the synaptic strength is expected to increase ('potentiation'). When reversed, i.e., the post-synaptic neuron spikes before the pre-synaptic neuron, the synaptic strength decreases ('depression'). This rule is known as spike timing dependent plasticity (STDP), and has been shown to exist in various brain regions, including the visual cortex, the somatosensory cortex, and the hippocampus. Curves fit to experimental measurements take the following form [32]:

$$\Delta W = \begin{cases} A_+\, e^{\Delta t/\tau_+}, & \text{if } t_{pre} < t_{post} \\ -A_-\, e^{-\Delta t/\tau_-}, & \text{if } t_{pre} > t_{post} \end{cases}$$

where ΔW is the change in synaptic weight, A_+ and A_− represent the maximum amount of synaptic modulation that takes place as the difference between spike times approaches zero, and τ_+ and τ_− are the time constants that determine the strength of the update over a given interspike interval. This mechanism is illustrated in Figure .

For a strong, excitatory synaptic connection, a pre-synaptic spike will trigger a large post-synaptic potential (refer to U in Equation (4)). As the membrane potential approaches the threshold of neuronal firing, such an excitatory case suggests that a post-synaptic spike will likely follow a pre-synaptic spike. This leads to a positive change of the synaptic weight, thus increasing the chance that a post-synaptic spike will follow a pre-synaptic spike in the future. This is a form of causal spiking, and STDP reinforces causal spiking by continuing to increase the strength of the synaptic connection.
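The exponential STDP window can be sketched as follows (pure Python apart from NumPy's exponential; the function name and constants are illustrative, not fitted values from [32]):

```python
import numpy as np

def stdp_weight_update(t_pre, t_post, a_plus=0.01, a_minus=0.01,
                       tau_plus=20.0, tau_minus=20.0):
    """Exponential STDP window with delta_t = t_pre - t_post.

    Pre-before-post (delta_t < 0) potentiates the synapse;
    post-before-pre (delta_t > 0) depresses it.
    """
    delta_t = t_pre - t_post
    if delta_t < 0:
        return a_plus * np.exp(delta_t / tau_plus)
    return -a_minus * np.exp(-delta_t / tau_minus)
```

The update magnitude decays with the interspike interval, so tightly timed spike pairs dominate the weight change.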
Input sensory data is typically correlated in both space and time, so a network's response to a correlated spike train will be to increase the weights much faster than for uncorrelated spike trains. This is a direct result of causal spiking. Intuitively, a group of correlated spikes from multiple pre-synaptic neurons will arrive at a post-synaptic neuron within a close time interval, causing stronger depolarization of the neuron's membrane potential and a higher probability of a post-synaptic spike being triggered.
However, without an upper bound, this will lead to unstable and indefinitely large growth of the synaptic weight. In practice, an upper limit should be applied to constrain potentiation. Alternatively, homeostatic mechanisms can also be used to offset this unbounded growth, such as an adaptive threshold that increases each time a spike is triggered from the neuron (Appendix C.3).

C.3 Long-Term Temporal Dependencies
One of the simplest implementations of an adaptive threshold is to choose a steady-state threshold θ_0 and a decay rate α:

$$\theta[t] = \theta_0 + b[t]$$

$$b[t+1] = \alpha\, b[t] + (1 - \alpha)\, S_{out}[t]$$

Each time a spike is triggered from the neuron, S_out[t] = 1, and the threshold jumps by (1 − α). This jump is added to the threshold through an intermediary state variable, b[t], and decays at a rate of α at each subsequent step, causing the threshold to tend back to θ_0 in the absence of further spikes. The above form is loosely based on [171], though the decay rate α and the threshold jump factor (1 − α) can be decoupled from each other. α can be treated as either a hyperparameter or a learnable parameter.
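This update can be sketched in a few lines of pure Python (the function name and default values θ_0 = 1.0, α = 0.9 are illustrative choices):

```python
def adaptive_threshold(spikes, theta_0=1.0, alpha=0.9):
    """Evolve theta[t] = theta_0 + b[t], where b decays by alpha each step
    and jumps by (1 - alpha) whenever the neuron spikes (s = 1)."""
    b = 0.0
    thresholds = []
    for s in spikes:
        b = alpha * b + (1 - alpha) * s
        thresholds.append(theta_0 + b)
    return thresholds
```

After a spike, the threshold rises above θ_0 and then relaxes back geometrically, making it harder for the neuron to fire again immediately, which is the homeostatic effect described above.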