Spiking Neural Networks: A Survey

The field of Deep Learning (DL) has seen a remarkable series of developments with increasingly accurate and robust algorithms. However, the increase in performance has been accompanied by an increase in the parameters, complexity, and training and inference time of the models, which means that we are rapidly reaching a point where DL may no longer be feasible. On the other hand, some specific applications need to be carefully considered when developing DL models due to hardware limitations or power requirements. In this context, there is a growing interest in efficient DL algorithms, with Spiking Neural Networks (SNNs) being one of the most promising paradigms. Due to the inherent asynchrony and sparseness of spike trains, these types of networks have the potential to reduce power consumption while maintaining relatively good performance. This is attractive for efficient DL and if successful, could replace traditional Artificial Neural Networks (ANNs) in many applications. However, despite significant progress, the performance of SNNs on benchmark datasets is often lower than that of traditional ANNs. Moreover, due to the non-differentiable nature of their activation functions, it is difficult to train SNNs with direct backpropagation, so appropriate training strategies must be found. Nevertheless, significant efforts have been made to develop competitive models. This survey covers the main ideas behind SNNs and reviews recent trends in learning rules and network architectures, with a particular focus on biologically inspired strategies. It also provides some practical considerations of state-of-the-art SNNs and discusses relevant research opportunities.


I. INTRODUCTION
In the past decade, the field of Deep Learning (DL) has seen a remarkable series of developments, with ever more accurate and robust algorithms that have revolutionized many areas, such as computer vision or natural language processing. Considerable advances in hardware and the introduction of GPU-enabled DL have permitted significant boosts in processing speed and efficiency. Consequently, more complex and robust models have been developed, but the computing power used in cutting-edge algorithms grew even faster.
Often, the starting point of this accelerated growth in DL is marked by the 2012 ImageNet competition, where a team of scientists from the University of Toronto proposed a deep Convolutional Neural Network (CNN), AlexNet [1], an algorithm with a top-5 error 10.8% lower than the second-best solution. As a result, deeper networks have been investigated, with an ever-increasing number of parameters and complexity. In the subsequent years, we witnessed the rapid emergence of several well-known deep model architectures (e.g., VGGNet [2], ResNet [3], GPipe [4], BERT [5], GPT-3 [6]). However, such outstanding improvements in model capabilities have been accompanied by an increase in the models' parameters, complexity, prediction latency, training time, etc. Nowadays, training a single top-performing model requires substantial hardware and energy resources, which results in a large carbon footprint [7]. Moreover, costs scale at roughly the same rate as the demand for computing power, meaning we are fast reaching a point where DL might become unsustainable [8]. Furthermore, some applications might require thorough consideration of models' parameters and efficiency.
A recent survey [9] highlights some key ideas. For instance, specific applications, such as mobile, robotics, or critical systems, might require the models to be optimized for the device on which they will be deployed. Furthermore, special attention must be given when integrating multiple models in the same infrastructure, since resources might be exhausted. Therefore, in the face of these challenges, there is a growing interest in developing efficient DL algorithms.
Several strategies can be adopted, at multiple levels, to approach the challenge of efficient DL. For instance, whereas compression techniques target the representational efficiency of the model, learning techniques focus on the training stage. Alternatively, one could choose to handle the problem at the level of the model's architecture. Among these efforts, a growing paradigm in efficient DL is that of Spiking Neural Networks (SNNs) [10].
Although significant advances have been achieved, the performance of SNNs on benchmark datasets such as MNIST [11], CIFAR-10 [12], or Fashion-MNIST [13] is often lower than that of conventional Artificial Neural Networks (ANNs). In part, this can be explained by the fact that the images in those datasets were acquired with traditional sensors instead of event-driven cameras. However, there are other major drawbacks to SNNs, such as the non-differentiable nature of activation functions due to discrete spike trains, the difficulty of propagating spike information in multilayer unsupervised SNNs, or the local character of biologically inspired learning rules, such as Spike Timing Dependent Plasticity (STDP).
This work covers the main ideas behind SNN models, addressing recent trends in learning rules, network architectures, and biologically inspired strategies. It will also address some practical considerations of state-of-the-art SNNs, further discussing relevant research opportunities. Previous works [14,15,16,17,18] have already investigated the main developments in the field of SNNs, but our work differs from these since it focuses particularly on the working principles of biological neural networks. In general, most SNN research only addresses a subset of biological mechanisms, and different works present distinct strategies; we argue that maximizing the extent of biological neural network properties exploited in SNNs is a very promising approach that could lead to significant improvements in the performance of SNN models. Moreover, there exist in the literature many promising ideas and conclusions regarding bio-inspired strategies to train SNN models; however, many works are predominantly focused on demonstrating the feasibility and viability of particular components (e.g., learning rules, connection types, etc.), which means there is still a lack of a unifying strategy to develop and train SNN algorithms. Therefore, this work intends to contribute towards bridging that gap by aggregating most of the existing knowledge regarding SNNs that simulate the working mechanisms of biological neural circuits. To that end, it summarizes the leading ideas and conclusions of recent works, emphasizing the most promising findings and possible future directions in bio-inspired strategies.
Besides this introduction, this work is structured as follows: Section II presents the fundamental principles of SNNs, including neuron models and synapses. Next, Section III details the main properties of biological neural networks and summarizes the results of the main works inspired by their working mechanisms. Section IV highlights the main information encoding schemes in biological neuronal networks and how SNNs can mimic those mechanisms, further summarizing the relevant findings. In Section V, we discuss different learning strategies and how the SNN community is addressing the problem. Section VI presents the results of both the main papers reviewed and of our own work. Finally, we conclude and summarize our main findings in Section VII.

II. FUNDAMENTALS OF SPIKING NEURAL NETWORKS
A SNN architecture consists of neurons interconnected by synapses that determine how information is propagated from a presynaptic (source) to a postsynaptic (target) neuron. The activity of the presynaptic neurons modulates the activation of the corresponding postsynaptic neurons and, unlike in conventional ANNs, the information in SNNs is encoded and transmitted in the form of spikes. Each input is presented for a prespecified amount of time (T), meaning that instead of a single forward propagation, SNNs typically perform multiple ($T/\delta t$) forward passes, one per simulation time step $\delta t$. As in its biological counterpart, once a presynaptic neuron activates, it sends a signal to its postsynaptic equivalent, in the form of a synaptic current, that is proportional to the weight, or conductance, of the synapse.
In general, as the synaptic current reaches the target neuron, it alters its membrane potential ($v_{mem}$) by a certain amount, $\delta v$. If $v_{mem}$ reaches a predefined threshold ($v_{thresh}$), the postsynaptic neuron emits a spike and resets its membrane voltage to the resting potential ($v_{rest}$). Nevertheless, many strategies can be adopted to model the neuron and synapse dynamics. On top of that, different network architectures and applications might require specific combinations of learning rule, $v_{mem}$ dynamics, and neuron model.
There is a broad pool of neuron models, and the literature often reports networks that adopt the McCulloch-Pitts [19], Izhikevich [20], CSRM, or SRM0 [10] neuron models as their basic units. However, the Leaky Integrate and Fire (LIF) model and its variants are among the most popular. In LIF models, the neuron is modelled as a parallel Resistor-Capacitor (RC) circuit with a "leaky" resistor [21], as represented in Figure 1a. The output voltage $V(t)$ of this circuit (analogous to $v_{mem}$) is mathematically defined as:

$$C \frac{dV(t)}{dt} = -g_L \left( V(t) - E_L \right) + I(t) \quad (1)$$

From Equation 1, we see that $V(t)$ depends on the conductance $g_L$ of the resistor, the capacitance $C$ of the capacitor, the resting voltage $E_L$, and a current source $I(t)$. If we multiply Equation 1 by the membrane resistance $R := 1/g_L$, we obtain $\frac{dv_{mem}}{dt}$ in terms of the membrane time constant $\tau_m = RC$:

$$\tau_m \frac{dv_{mem}(t)}{dt} = -\left( v_{mem}(t) - E_L \right) + R \, I(t) \quad (2)$$

We observe that, due to the leaky behaviour of the model, $v_{mem}$ constantly decays towards its resting value. Another important consideration is that LIF neurons can endure a refractory period, i.e., a period after reset during which the neuron cannot fire again, regardless of its input. Figure 1b illustrates the behaviour of a LIF node receiving a spike train input.

FIGURE 1: If there is no input current, $V(t)$ decays to its resting potential $E_L$; otherwise it increases by $\delta v$, as established by Equation 2. The capacitor discharges (comparable to a neuron emitting a spike) when $V(t)$ reaches the predefined threshold. The synaptic conductance is here represented by the Synaptic Weight (W) vector. If the input is not sufficient, the neuron does not fire. Also, in the first moments after the neuron has emitted a spike ($t_{refr}$), it cannot fire again, regardless of the input it is receiving.
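To make these dynamics concrete, the following minimal sketch integrates Equation 2 with a forward-Euler scheme; all parameter values (membrane time constant, thresholds, refractory period, input current) are illustrative assumptions rather than values taken from any particular work.

```python
import numpy as np

def simulate_lif(input_current, dt=1.0, tau_m=20.0, v_rest=-65.0,
                 v_thresh=-52.0, v_reset=-65.0, R=1.0, t_refrac=5.0):
    """Forward-Euler integration of a single LIF neuron (Equation 2).

    input_current: 1-D array with the synaptic current at each time step.
    Returns the membrane potential trace and the emitted spike train.
    """
    v = v_rest
    refrac_left = 0.0
    v_trace, spikes = [], []
    for I in input_current:
        if refrac_left > 0:               # refractory period: clamp to reset value
            refrac_left -= dt
            v = v_reset
            spikes.append(0)
        else:
            # tau_m * dv/dt = -(v - E_L) + R * I(t)
            v += dt / tau_m * (-(v - v_rest) + R * I)
            if v >= v_thresh:             # threshold crossing -> emit a spike
                spikes.append(1)
                v = v_reset
                refrac_left = t_refrac
            else:
                spikes.append(0)
        v_trace.append(v)
    return np.array(v_trace), np.array(spikes)

# Example: constant input strong enough to make the neuron fire periodically.
v_trace, spikes = simulate_lif(np.full(200, 20.0))
print("number of spikes:", spikes.sum())
```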
The activation function, $A(t)$, of LIF neurons is thus defined as:

$$A(t) = \begin{cases} 1, & \text{if } v_{mem}(t) \geq v_{thresh} \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

A major drawback of SNNs is the non-differentiable nature of this activation function (Equation 3), meaning that Backpropagation (BP), the most widely used learning algorithm in ANNs, cannot be directly employed [22]. For the network to be able to learn, proper strategies must be devised to update the synaptic weights, W. One of the prevailing methods is the biologically inspired STDP. STDP results from a series of neurobiological findings that started in 1949 with Donald Hebb, who proposed a fundamental principle of synaptic plasticity to describe how learning might be accomplished in the brain. In "The Organization of Behaviour: A Neuropsychological Theory" [23] he famously states: "When an axon of Cell A is near enough to excite a Cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." Later, his work was supported by the discovery of Long Term Potentiation (LTP) [24], a neural mechanism describing a persistent increase in synaptic strength when a presynaptic neuron fires briefly before its postsynaptic equivalent [25]. Conversely, Long Term Depression (LTD) refers to the process of decreasing the synaptic effectiveness of a presynaptic neuron when there is no evident causal relationship between its spiking activity and that of the postsynaptic neuron [25]. More specifically, STDP is an asymmetric form of Hebbian learning, and it establishes that the synaptic weight change is proportional to the relative timing between pre- and postsynaptic activations [26]. Equation 4 formally describes this mechanism:

$$\Delta w = \begin{cases} A \, e^{-(t_{post} - t_{pre})/\tau}, & \text{if } t_{post} > t_{pre} \\ B \, e^{(t_{post} - t_{pre})/\tau}, & \text{if } t_{post} \leq t_{pre} \end{cases} \quad (4)$$

where $A > 0$ and $B < 0$ define the learning rates, $\tau$ is the timing window constant, $t_{pre}$ and $t_{post}$ are the absolute timings of pre- and postsynaptic spikes, and $\Delta w$ is the synaptic weight update. From this system of equations, it is observable that the first branch refers to LTP and the second branch to LTD. Notably, STDP is a local learning rule, meaning it does not consider global information for weight updating and, as this work will discuss later, it becomes challenging to train SNNs with direct BP.
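A minimal sketch of the pair-based rule in Equation 4 is given below; the learning rates A and B and the time constant τ are illustrative choices, and practical implementations typically accumulate the contributions of many spike pairs.

```python
import numpy as np

def stdp_update(t_pre, t_post, A=0.01, B=-0.012, tau=20.0):
    """Pair-based STDP (Equation 4): weight change for one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:                       # pre fires before post -> potentiation (LTP)
        return A * np.exp(-dt / tau)
    else:                            # post fires before (or with) pre -> depression (LTD)
        return B * np.exp(dt / tau)

print(stdp_update(t_pre=10.0, t_post=15.0))   # small positive update (LTP)
print(stdp_update(t_pre=15.0, t_post=10.0))   # small negative update (LTD)
```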

III. FROM THE RETINA TO THE VISUAL CORTEX: BIOLOGICALLY INSPIRED SPIKING NEURAL NETWORKS
ANNs are inspired by the mammalian brain, yet their principles are fundamentally different, particularly regarding their structure, learning rules, and computations. For instance, whereas ANNs approximate neurons as nonlinear continuous functions, biological neurons compute asynchronous event streams through discrete and temporally precise action potentials [15]. Although ANNs (e.g., CNNs, Recurrent Neural Networks (RNNs), Transformers, etc.) have achieved outstanding results in many Machine Learning (ML) tasks, they are incredibly inefficient in comparison with biological neural circuits. SNNs, on the other hand, exhibit many properties in common with biological neural networks, namely sparsity of computations, low power consumption, fast inference, or a considerable degree of parallelism [15]. For this reason, there is a growing effort from neuroscientists to better understand the principles behind learning and information processing in the mammalian brain. Notably, one of the most widely studied and understood brain systems is the visual system, particularly the Primary Visual Area (V1). Many correlates of human vision have thus served as inspiration for several SNN strategies and their applications to computer vision tasks.

A. THE MAMMALIAN RETINA: INPUT STIMULI SELECTIVITY
As stated previously, SNNs are promising for event-driven computations. This is primarily encouraged by the biological structure and working mechanisms of the mammalian retina. The mammalian retina is a multilayer system, containing thousands of photoreceptors. Photoreceptors act as transducers, converting light signals to discrete electric signals (spikes) which are further processed by the brain. In turn, the photoreceptors connect to ON-centre and OFF-centre bipolar cells. ON-centre cells are depolarized by stimulating the centre of the Receptive Field (RF), whereas OFF-centre cells are depolarized by lack of stimuli at the centre of their respective RF. Moreover, ON/OFF bipolar cells possess a surrounding region in their RFs with inverse properties [27]. Figure 2 illustrates the response of ON-centre- and OFF-centre-surround cells to different light inputs.
These mechanisms demonstrate the retina's role in preprocessing input stimuli and suggest some form of selectivity. ON-centre/OFF-centre bipolar cells are responsible for coding contrasts in luminance, meaning that they ignore redundant information and instead focus on relevant intensity changes (events). Next, bipolar cells connect to ganglion cells, the output neurons of the retina. Evidence supports that ganglion cells are selective to a plethora of input stimuli features, including colour, luminance contrasts, object size, and stimulus direction and orientation [28]. Notably, at the retina level, the RFs present a more basic structure and encode a broad range of spatial frequencies [29]. Many of such properties have already been explored in the development of bio-inspired SNNs. For example, the work of [30] proposes a 3-stage SNN architecture to process Dynamic Vision Sensor (DVS) data. Their algorithm contains an Unsupervised Learning (UL) layer for spatio-temporal pattern extraction, which mimics ganglion cell selectivity, and a Supervised Learning (SL) layer inspired by the retinotopic organization of the visual cortex to classify the MNIST-DVS dataset. In turn, [31] apply ON-OFF Difference of Gaussian (DoG) filtering before encoding the input images into spike trains and feeding them to a convolutional SNN classifier. Figure 3 illustrates the obtained DoG filters as well as two example outputs after convolution with ON- and OFF-DoG filters.
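As a rough illustration of this kind of ON/OFF pre-processing, the sketch below computes rectified ON- and OFF-centre Difference-of-Gaussians maps of an image; the filter widths are illustrative assumptions, and the code is not the exact pipeline of [31] or [32].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_off_dog(image, sigma_centre=1.0, sigma_surround=2.0):
    """ON-/OFF-centre Difference-of-Gaussians responses of an image.

    ON cells respond where the centre is brighter than the surround,
    OFF cells where it is darker; negative responses are rectified to zero.
    """
    centre = gaussian_filter(image.astype(float), sigma_centre)
    surround = gaussian_filter(image.astype(float), sigma_surround)
    dog = centre - surround
    on_response = np.maximum(dog, 0.0)     # ON-centre channel
    off_response = np.maximum(-dog, 0.0)   # OFF-centre channel
    return on_response, off_response

# Example on a random patch standing in for an MNIST digit.
on_map, off_map = on_off_dog(np.random.rand(28, 28))
```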
Similarly, [32] use ON-centre and OFF-centre DoG convolutions in a hierarchical multilayer SNN for corner detection and, recently, [33] proposed the use of ON-centred input neurons with a 5×5 receptive field to mimic how ganglion cells perceive the visual scene. But despite these developments, it remains challenging to model the behaviour of ganglion cells in ML models, partly due to their diversity, apparent fine-tuning for specific features of the visual scene, distinct functions, subtypes, and connections to downstream visual processing pathways [34,35]. A possible solution could be to model the retina using CNNs. Indeed, there is a certain degree of realism in this approach since the eyes contain very few Feedback (FB) connections, meaning Feedforward (FF) networks, like CNNs, can be used [36]. Nevertheless, modelling the behaviour of the retina and ganglion cells seems to be a promising avenue for the development of high-performing and efficient SNNs. More concretely, by mimicking the ability of these cells to encode the visual scene into a potentially abstract subset of features, multilayer hierarchical SNN models could better discriminate between classes by taking advantage of more relevant and complementary information.

B. THE NEURAL CIRCUITS: INFORMATION TRANSMISSION
Ganglion cells will subsequently send output spikes to visual processing centres in the brain. Information is thus propagated and processed through an intricate network of neurons that form complex circuits. Nevertheless, these networks can be decomposed into much simpler circuits that perform elementary operations. The most widely used neural circuit is FF excitation (Figure 4), and although simple, it permits the flow of information through the network. At each layer, every neuron receives input from multiple presynaptic nodes through converging connections and is itself divergently connected to multiple postsynaptic equivalents. Another advantage of this circuit is that it can increase the Signal to Noise Ratio (SNR), since different neurons process similar inputs but uncorrelated noise [37]. Although FF excitatory circuits fulfil a prominent role in propagating information to deeper brain regions, other mechanisms occur locally that regulate the relationship between input stimulation and output spiking activity. Two of said mechanisms are FF and FB inhibition circuits.
In FB inhibition, a layer of presynaptic excitatory neurons stimulates postsynaptic excitatory neurons as well as inhibitory neurons that project back to the presynaptic excitatory layer. In turn, FF inhibition refers to the case where the presynaptic excitatory population also drives a population of inhibitory neurons that connects to the postsynaptic neuron populations. It is believed that FF/FB inhibition act dynamically to perform a plethora of functions, including controlling synchronous or oscillatory spiking activity, regulating neuron sensitivity to input stimuli through the promotion of a fast response, and preventing quiescence or saturation of postsynaptic firing rates [38,39,40,41,42]. Lateral inhibition circuits also partake in neural networks: some neurons have the capacity to reduce the activity of parallel pathways. Through this, it is possible to exacerbate some activity whilst reducing the transmission of less relevant spikes. Lateral inhibitory circuits enhance contrasts and play a significant role in decorrelating neural responses, therefore promoting the discrimination of similar stimuli [43]. For example, lateral inhibition is believed to occur at the retina, where active photoreceptors, through horizontal cells, inhibit neighbouring photoreceptors [37].

FIGURE 2: On-centre- and Off-centre-surround bipolar cells' response to light stimuli. If there is maximum overlap between light incidence and the cell's RF, it outputs a maximum response pattern, demonstrated by the highest firing rates. However, if there is a weak correspondence, we observe a weak firing response or no spikes at all.

FIGURE 3: On-centre/Off-centre DoG filters. Source: [31]

The work of [44] has demonstrated positive results when using FB inhibition to promote the separability of similar input stimuli. The increase in selectivity is explained by the suppression of non-informative activity. But there are a few considerations with this approach. The solution is supported on the basis that FF connections from uninformative neurons, coding stimuli overlap, will dominate the discriminative connections. Beyond a certain degree of overlap, this can even prevent the informative output neurons from firing at all. To overcome such limitations, a mechanism of dynamic excitation/inhibition was proposed, where FB inhibitory connections are strengthened (with anti-Hebbian learning) every time excitatory connections are potentiated. The results show this strategy promotes the learning of selective responses and increases the discrimination of similar input patterns. This is a biologically realistic approach, since neural circuits of multilayer excitation/inhibition could partake in the suppression of uninformative stimuli. Remarkably, the authors suggest this strategy could be incorporated into hierarchical multilayer networks, with FB inhibition from higher to lower layers, to allow the recognition of more complex patterns.
All things considered, inhibition regulates spatio-temporal dynamics that are of fundamental importance in determining the performance of neural networks, and it has already been explored in the context of SNNs. For instance, [45] use lateral inhibition to promote competition between neurons. With this, they propose a model where each neuron's RF, encoded in its synaptic weights, corresponds to a single data sample or the average of a limited subset of data samples. Then, when a single input is presented, the firing response of each prototype is used to predict the corresponding class. Figure 5 illustrates the presented network.
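A compact way to see how lateral inhibition induces the competition described above is the single-step sketch below, in which every spike suppresses the membrane potentials of all other neurons; the fixed inhibition strength and the reset-to-zero behaviour are simplifying assumptions and not the exact circuit of [45].

```python
import numpy as np

def wta_step(v_mem, v_thresh=1.0, inhibition=0.5):
    """One time step of winner-take-all style lateral inhibition.

    Neurons above threshold spike and are reset; every spike subtracts a
    fixed amount from all other neurons' membrane potentials.
    """
    spikes = (v_mem >= v_thresh).astype(float)
    v_next = np.where(spikes > 0, 0.0, v_mem)            # reset the winners
    total_inhibition = inhibition * spikes.sum()
    v_next = v_next - total_inhibition * (1 - spikes)    # suppress the others
    return np.maximum(v_next, 0.0), spikes

v, s = wta_step(np.array([0.3, 1.2, 0.8, 0.1]))
print(s)   # only the neuron above threshold emits a spike
```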
FIGURE 5: Network architecture of [45]. It shows the input connections to an example excitatory neuron. Excitatory neurons are connected to inhibitory neurons via one-to-one connections, as shown for the example neuron. The red shaded area denotes all connections from one inhibitory neuron to the excitatory neurons. Each inhibitory neuron is then connected to all excitatory neurons, except for the one it receives a spike from. Source: [45]

[33], on the other hand, propose an ensemble of hierarchical unsupervised SNNs. They interactively combine excitatory and inhibitory neurons, a strategy that allows regulating network states and promotes distinguishable activations. Notably, they add lateral inhibition to perform the classification task. [46], in turn, argue that dynamic excitatory and inhibitory neural circuits facilitate convergence and improve the performance of neural networks. The authors propose a deep SNN that resorts to the mechanisms of adaptive self-FB and balanced excitatory and inhibitory neuron circuits, whilst suggesting such strategies could accelerate the training of SNNs. Another work [47] demonstrates the potential of global FB connections and local learning rules for multilayer SNNs. They train the network in a 2-step approach: first, a FF pass obtains the predictions; then, FB is used to obtain the targets of the various hidden layers. In the second step, the loss is computed at the last layer, and the prediction error is propagated to the hidden layers through global FB connections. The weights of the hidden layers are thus updated locally, resorting to STDP. In this way, the proposed architecture sidesteps the BP challenges of SNNs whilst avoiding transmitting the error layer by layer. FB connections are equally suggested by [48], who introduce a novel way of training FB SNNs based on implicit differentiation at the equilibrium state. In this work, a certain similarity to Hebbian learning is suggested, but the weight updating strategy considers both average firing rates and temporal information.
Despite the aforementioned contributions, there exist other relevant circuit designs that could potentially lead to performance gains and increased versatility of SNNs. Often, traditional ANNs perform several steps of dimensionality expansion (feature generation) and dimensionality reduction (feature selection). This is largely inspired by the neural circuits of living organisms. Furthermore, it is firmly established that dimensionality expansion increases the separability of neural representations. More concretely, lower-dimensional representations produce distinct responses to specific features, whereas expansion allows the representation of said separable responses as well as their conjunctions, thus augmenting the expressive basis and subsequent linear separability of classes [49]. For instance, a study [50] highlights that sparse expansion implemented by FF connections increases robustness to the variability of the input signal, i.e., increases the SNR. Furthermore, it can also decrease the dissimilarity between classes by encoding input statistics into the expanding connections' synaptic weights. Besides, increasing the dimensionality of the inputs also augments the performance of the classifier [50]. However, the separability achieved through dimensionality expansion comes at the cost of generalizability; that is, it increases the sensitivity to noise [49]. To overcome such limitations, compression can reduce the dimensionality and extract meaningful representations from the data. Put differently, a balance between expansion and compression must be achieved, depending on the task objectives. Figure 6a illustrates a simplified neural circuit of dimensionality expansion and reduction. There are many well-known examples of ANNs that define dynamic pathways of expansion and compression (e.g., RNNs [51]). Similarly, in SNNs, a few works also revolve around these concepts. For instance, [52] propose reservoir computing techniques, namely Liquid State Machines (LSMs), and optimization strategies to develop networks suitable for neuromorphic computing. With this learning approach, they are able to find an optimum SNN topology (i.e., connections) and corresponding hyperparameters (e.g., number of neurons, leakage constants, etc.) that can serve as the Liquid of an LSM. This model thus incorporates the dynamic (time-varying) behaviour of recurrent SNNs. The readout layer is then trained to classify information from the extracted reservoir state vectors. [53], on the other hand, propose deep LSMs with randomly initialized hidden layers, interleaved with Winner-Take-All (WTA) layers trained with STDP, to achieve a low-dimensional representation of the information captured by high-dimensional hidden layers. The authors argue this leads to hidden layers capable of processing information over multiple time scales. The hidden neurons are set to 10× the dimension of the input space and consist of primary neurons, connected to the input layer, and auxiliary neurons, which only have recurrent connections within the hidden layer. An attention-modulated readout layer is then stacked on top of the Liquid to perform classification on the DogCentric [54] video recognition dataset. In turn, [55] suggest training the readout layer of an LSM with Backpropagation Through Time (BPTT). In their work, the authors propose a Liquid with recurrent connections and static, randomly initialized weights.
They then transfer their GPU-trained network to neuromorphic hardware, suggesting competitive performance on the N-MNIST [56] dataset. Nonetheless, we argue that, despite being challenging, leveraging these strategies in SNNs could lead to performance gains by augmenting the expressive basis of neuron populations.
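The following sketch illustrates the reservoir-plus-readout idea behind LSMs in its simplest form: a fixed, randomly and sparsely connected spiking reservoir is driven by input spike trains, and only a linear readout is fitted on the accumulated reservoir activity. The network sizes, the leaky binary neurons, and the least-squares readout are illustrative assumptions and do not reproduce the setups of [52], [53], or [55].

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_steps = 20, 100, 50

# Fixed, random, sparse input and recurrent weights: the "Liquid" itself is not trained.
w_in = rng.normal(0, 0.5, (n_res, n_in)) * (rng.random((n_res, n_in)) < 0.2)
w_rec = rng.normal(0, 0.1, (n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)

def reservoir_state(input_spikes):
    """Run the reservoir on one input spike train and return its state vector
    (here simply the spike count of each reservoir neuron)."""
    v = np.zeros(n_res)
    prev = np.zeros(n_res)
    counts = np.zeros(n_res)
    for t in range(n_steps):
        v = 0.9 * v + w_in @ input_spikes[t] + w_rec @ prev   # leaky integration
        prev = (v >= 1.0).astype(float)                        # threshold crossing
        v[prev > 0] = 0.0                                      # reset after a spike
        counts += prev
    return counts

# Toy dataset: two "classes" of random input spike trains with different rates.
X, y = [], []
for label, rate in [(0, 0.05), (1, 0.2)]:
    for _ in range(50):
        X.append(reservoir_state((rng.random((n_steps, n_in)) < rate).astype(float)))
        y.append(label)
X, y = np.array(X), np.array(y)

# Only the linear readout is trained (least squares on the reservoir state vectors).
features = np.c_[X, np.ones(len(X))]
readout, *_ = np.linalg.lstsq(features, y, rcond=None)
accuracy = ((features @ readout > 0.5).astype(int) == y).mean()
print("training accuracy:", accuracy)
```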

C. THE VISUAL CORTEX: INFORMATION PROCESSING
We have seen that light stimuli will be pre-processed at the retina and then propagated through an intricate network of several simple circuits. But it remains to discuss how information is processed at the different building blocks of the visual pathway. Indeed, it is supposed that the mammalian brain processes visual information in a bottom-up approach, where each subsequent layer is more specialized than the previous. The goal is to extract salient points, or groupings, of low-level features that can reduce redundancy and complexity of the scene while encoding relevant information. It all starts at the retina, where ON- and OFF-centre-surround cells impose some form of pre-processing on the input stimuli. At this stage, more than a dozen different types of ganglion cells will split visual information into separate input streams, representing unique low-level features of the visual scene.
Although not yet fully understood, evidence supports that these features entail, among others, luminance contrasts, direction selectivity, edge selectivity, motion sensitivity, and colour [34,57]. Next, the axons of the retinal ganglion cells form the optic nerve and project towards the Lateral Geniculate Nucleus (LGN). From here, visual information travels to the visual cortex. Figure 6b demonstrates the visual pathway of the human brain.
As information propagates from the retina to the LGN and through the visual cortex, neurons' RFs become more complex, covering ever bigger regions [58]. More precisely, it has been suggested that these RFs are organized according to their complexity. Neurons with simple RFs respond to stimuli of specific slit width, slant, orientation, and position. Complex RFs, on the other hand, cover a wider range of the visual scene instead of a single position but equally respond to slit-shaped stimuli. In turn, neurons with hypercomplex and higher-order hypercomplex RFs require elaborate inputs to activate [59]. The visual cortex, located at the occipital lobe, is the primary unit responsible for combining the elementary features incoming from the LGN so that we can build a complete representation and understanding of the visual scene. V1, in the visual cortex, is understood as the first stage of cortical processing. V1 neurons are, thus, categorized into simple, complex, and hypercomplex cells. Simple cells are perceived to compute linear combinations of incoming streams of information, whilst complex cells perform operations on the output of simple neurons [60]. Notably, simple neurons can also be categorized into ON-centre and OFF-centre cells. Complex neurons, on the other hand, produce a sole response. Furthermore, it is well established that neurons in V1 are organized in a retinotopic manner, meaning neighbouring neurons represent neighbouring positions in the visual field and that their RFs completely represent the visual scene.

FIGURE 6: (a) Schematic of dimensionality expansion and reduction in neural circuits. Source: [37]. (b) Visual pathway of the human brain. The primary processing centres are the retina and V1. Source: Wikimedia Commons, Visual System, Miquel Nieto.
Nonetheless, these neurons present some form of selectivity for a panoply of characteristics, similarly to retinal ganglion cells. However, V1 neurons are more selective than the former. Two of the major attributes of V1 neurons are their selectivity to orientation and spatial frequency. More precisely, V1 neurons operate at various scales, with little contextual information, and the way they perform salient point detection strikingly resembles 2D Gabor functions [61,62]. 2D Gabor functions are defined as complex sinusoids modulated by a Gaussian envelope. Equation 5 defines the general case, for spatial coordinates $(x, y)$ rotated by the orientation $\theta$ (with $x' = x\cos\theta + y\sin\theta$ and $y' = -x\sin\theta + y\cos\theta$):

$$g(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\left(i\,\frac{2\pi x'}{\lambda}\right) \quad (5)$$

with $\gamma$ the aspect ratio, $\lambda$ the wavelength of the sinusoid, and $\sigma$ the effective width. Due to their nature, Gabor functions are sensitive to edges, orientation, and texture, operating at different frequencies and scales [63], just like V1 neurons' computations. In this way, V1 neurons encode information in a sparse basis set that represents the entire visual scene. This encoding proves efficient and attractive for the development of SNNs, since a linear combination of a limited number of neurons' RFs can be used to reconstruct the whole visual input; that is, only a few neurons are required to be active at a time [64]. V1 output will then stimulate the extrastriate visual areas, V2 and V4, and afterwards visual information is sent to the Inferotemporal (IT) cortex [39]. At each stage, neuron populations combine stimuli from previous layers, therefore forming progressively larger and more complex RFs: from simple oriented bars and edges at V1, to moderately complex features, like corners, in V2 and V4, to complex objects and faces at the IT cortex [65]. Another relevant property that arises from the aforementioned visual processing pathway is invariance to stimulus position and scale, as RFs become larger. Figure 7 summarizes the progression of input stimuli from the retina to the visual cortex.
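To make the connection between Equation 5 and orientation-selective filters explicit, the sketch below samples the real part of a 2D Gabor function on a small grid and builds a tiny filter bank; kernel size, wavelength, and the remaining parameter values are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size=11, wavelength=4.0, theta=0.0, sigma=3.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor function (Equation 5) sampled on a square grid.

    wavelength (lambda), aspect ratio (gamma), and effective width (sigma)
    follow the parameterization in the text; theta is the orientation.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + psi)
    return envelope * carrier

# A small bank of orientation-selective filters, loosely mimicking V1 simple cells.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```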
Although CNNs are not biologically plausible, they are indeed inspired by, and present many similarities with, the biological neural networks of the mammalian brain. For example, CNNs can also present the properties of invariance to scale, position, and slight deformations of the input stimuli [66,67,68,69,70]. Moreover, the first layers of CNNs trained on natural images have been shown to present Gabor-like RFs [1]. Besides, most CNNs are also organized hierarchically, with deeper layers progressively encoding more abstract and complex features.
Similarly, some SNN architectures also simulate the visual information processing of the ventral pathway. For instance, [71] used a biologically plausible multilayer hierarchical SNN to classify the benchmark MNIST [11] and CIFAR-10 [12] datasets. The proposed architecture consists of 6 layers: an encoding layer to simulate the retina and 2 convolution layers that, together with 3 pooling layers, emulate the visual cortex. Promisingly, their work incorporates many known properties of the visual pathway, including shift invariance, edge-like filters at the first convolutional layer (V1), and increasing RFs. In the same way, [33] suggest an ensemble of hierarchical multilayer SNNs, where each model is composed of several convolutional and pooling layers, simulating the working mechanisms of the primate brain. Other authors have proposed spiking convolutional neural networks inspired by V1 computations. Whereas several works propose hand-crafted kernels with fixed parameters [72,73,74,75], [76] suggest a hierarchical unsupervised algorithm, based on STDP and spike time coding, to classify several datasets. The network comprises an encoding layer followed by a sequence of several convolutional and pooling layers. Besides the biologically inspired hierarchical structure, the proposed solution is also translation invariant, due to the pooling operations. Furthermore, to mimic the retinal ganglion cells, the first layer uses DoG filters to detect contrasts; the resulting contrast maps are then encoded in spike latencies. In turn, [77], despite using a shallow SNN, show that localized Gabor-like receptive fields in conjunction with unsupervised STDP offer a promising solution to increase the performance of SNNs. Moreover, the authors argue that biologically plausible DL demonstrates great potential to improve the performance of SNNs.
We have seen that SNNs are biologically plausible. Nonetheless, their performance lags behind that of CNNs, and more work is needed to bridge the gap. Although existing SNNs might be inspired by some of the operating principles and properties of the mammalian brain, more mechanisms could be explored. Markedly, many of the proposed algorithms are shallow and do not explore to the full extent the bottom-up strategy characteristic of visual processing in the mammalian brain. With this in mind, we suggest that integrating more of the known biological mechanisms into SNN architectures could lead to performance gains. Concretely, we identify that concepts such as neural circuits (e.g., dynamic inhibition/excitation, dimensionality expansion, etc.), ON-centre/OFF-centre-surround RFs, and known properties of the visual processing pathway (e.g., V1 Gabor-like selectivity, hierarchical structure, etc.) need to be more extensively studied and integrated with SNNs to enable energy- and data-efficient DL that can compete with state-of-the-art algorithms like CNNs.

IV. INFORMATION ENCODING
As mentioned previously, SNNs require the encoding of information, which constitutes a considerable difference compared to traditional ANNs. In fact, in image-based tasks, traditional ANNs use pixel intensities to extract critical features, while in SNNs, there is the need to map each pixel intensity to a discrete spike domain before feature extraction.
The mammalian brain is extraordinarily efficient when performing complex tasks, and the goal of SNN architectures is to mimic that organ's properties as closely as possible, thus becoming resource-efficient and suitable for applications with hardware constraints. Hence, the employment of biologically plausible information encoding methods has a crucial role in SNN-based architectures. In fact, the mammalian brain is remarkable in the sense that it performs task-specific encoding, suggesting that ML practitioners could also select different encoding schemes depending on the application.
There are several encoding strategies, but broadly speaking, these can be divided into two critical groups: rate and temporal coding. As the names suggest, rate coding relies on spike firing rates to represent information, whilst temporal coding considers the spike times, thus, in general, allowing faster responses. Although for many years it was thought that the mammalian brain relied essentially on rate coding, evidence supports the idea that it must also employ temporal encoding strategies. Rate coding schemes are often categorized into count, density, or population rate methods (Figure 8), as suggested by [78]. In count rate coding, also defined as frequency coding, the mean firing rate ($\upsilon$) is computed over a prespecified time window (T), as formally defined by Equation 6:

$$\upsilon = \frac{N_{spike}}{T} \quad (6)$$
with $N_{spike}$ being the number of spikes in a stimulus time window, T. This is the most common rate coding scheme, and it converts each pixel intensity to a spike train. Typically, the probability of spike occurrence at a given time instant, t, is modelled by a Poisson distribution in which the mean event rate is proportional to the pixel intensity. Several studies evidence the existence of biological rate coding. [79] performed the first known experiment verifying the existence of biological rate coding: they subjected a frog's muscle to different weights and, by measuring nerve action potentials with a capillary electrometer, concluded that the firing rate was proportional to the force exerted by the weights. [80] also describe the crucial role that rate coding plays in human voluntary muscle contraction, mostly for intermediate and higher forces. Furthermore, some experiments corroborated the Poisson-based coding hypothesis through the analysis of biological recordings of the medial temporal and primary visual cortices of macaque monkeys [81]. Density rate coding, on the contrary, is a non-biologically plausible method that averages the neural activity over several simulations. The spike density, p(t), is therefore defined as

$$p(t) = \frac{1}{\Delta t} \, \frac{N_{spike}(t;\, t + \Delta t)}{K} \quad (7)$$

where, first, the number of spikes, $N_{spike}$, is counted over a specified time interval $[t;\, t + \Delta t]$, which defines the duration of the simulation, and, second, this firing rate is averaged over K simulations. This is also known as the Post-Stimulus Time Histogram (PSTH). The main problem with this approach is that it is impossible for biological neural networks: input stimuli must be processed in a single run. Nonetheless, averaging a neuron's response over simulations allows the smoothing of intrinsic and network-related noise spikes that occur both in vivo and in simulated neurons [82]. Another way of using mean firing rates to encode information is the population rate coding scheme, which is in many ways similar to density coding, except that it averages over several neurons instead of over simulations. In this technique, the firing rate A(t) is obtained by averaging the number of spikes $N_{spike}$ in the time interval $[t;\, t + \Delta t]$ over the number of neurons N and the duration $\Delta t$, as described in Equation 8:

$$A(t) = \frac{1}{\Delta t} \, \frac{N_{spike}(t;\, t + \Delta t)}{N} \quad (8)$$
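A minimal sketch of the count-rate scheme described above is shown below: each pixel intensity sets an independent per-step spike probability (a Bernoulli approximation of the Poisson process), so the mean firing rate over the window T is proportional to the intensity. The window length and maximum rate are illustrative assumptions.

```python
import numpy as np

def poisson_rate_encode(image, n_steps=100, max_rate=0.5, rng=None):
    """Count-rate encoding of pixel intensities (Equation 6).

    Each pixel in [0, 1] is mapped to an independent Bernoulli spike
    probability per time step, proportional to its intensity, so that the
    mean firing rate over T = n_steps approximates max_rate * intensity.
    """
    rng = rng or np.random.default_rng()
    probs = np.clip(image, 0.0, 1.0) * max_rate
    # Shape: (n_steps, height, width); one binary frame per simulation step.
    return (rng.random((n_steps,) + image.shape) < probs).astype(np.uint8)

spike_train = poisson_rate_encode(np.random.rand(28, 28))
print(spike_train.mean(axis=0).round(2))  # empirical firing rates per pixel
```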

[83] assessed the N1 response through magnetoencephalography by presenting two unique sounds coming from different locations, referenced as the adaptor and probe locations. N1 attenuation reflects the degree of overlap of the neurons responding to the adaptor and the probe sounds. The participants of the study were initially exposed to the adaptor sound and, after some defined time interval, the probe sound, and the N1 attenuation was measured. It was verified that increasing the spatial separation between the adaptor and the probe led to a decrease in attenuation, confirming the existence of a population rate coding method in the human auditory cortex. More recently, [84] tested the population rate coding hypothesis against the labelled-line one, attempting to explain how sound direction is perceived and whether it depends on sound intensity. In this experiment, individuals were presented with various sound intensities and interaural time differences and asked to indicate the perceived laterality. The labelled-line model states that each receptor only responds to a certain stimulus, implying that the response should be intensity invariant, while in the population rate coding model each receptor responds to several stimuli, depending on the response of a population of neurons. The obtained results validate the population rate coding hypothesis, since a midline bias in the perceived laterality was verified for lower sound intensities. The authors point out that, since human visual perception is also based on interocular disparity, the brain may operate a similar coding method in visual perception tasks. Some evidence shows the presence of population rate coding in both macaque and mouse visual systems [85,86].
Although rate coding methods are widely used in SNNs, the need to average over a certain time interval compromises the responsiveness of the systems, since a sufficiently large time window is required to achieve good performance. On the contrary, biological systems need to respond almost instantly to stimuli, which is only possible through methods that exploit precise timing instead of mean firing rates. Biological evidence of these methods was provided by [87], who, through the measurement of event-related potentials, concluded that the human brain needs less than 150 ms to process a complex visual task. In the experiment, an image was flashed for 20 ms and subjects had to decide on the presence or absence of an animal in the stimulus. This experiment paved the way for the study of more temporally precise encoding strategies, with temporal coding now being extensively studied for computer vision applications. From the available techniques, we highlight Time-To-First-Spike (TTFS), Rank Order Coding (ROC), phase coding, and burst coding methods due to their wider adoption.
TTFS is the simplest case of temporal coding, and it considers the precise time window between the beginning of a stimulus and the first spike emitted by a neuron. With this strategy, information is encoded such that signals with high amplitude trigger an early response, while low-amplitude signals translate to late spikes or no spikes at all. [88] found strong evidence of a correlation between the latency of the first spike and stimulus contrast in the retinal pathway. They also demonstrated that the spike count affects the response to said stimulus. In the inferior visual cortex, experiments showed that the first spike contains more information than the combined spikes [89]. In turn, [90] demonstrated the importance of first spikes in visual processing tasks. By submitting mice to trivial discriminating visual tasks and silencing their primary visual cortex, in well-defined intervals after stimulus onset, they concluded that most neurons emitted as few as one spike or no spikes at all. Besides, only 16% of the neurons spiked more than twice.
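A possible TTFS encoder, under the simple assumption that latency decreases linearly with intensity, could look like the following sketch.

```python
import numpy as np

def ttfs_encode(image, n_steps=100):
    """Time-To-First-Spike encoding: bright pixels spike early, dark ones late.

    Returns, for each pixel, the time step of its single spike; pixels with
    zero intensity are assigned n_steps, meaning they never spike.
    """
    intensity = np.clip(image, 0.0, 1.0)
    spike_times = np.round((1.0 - intensity) * (n_steps - 1)).astype(int)
    return np.where(intensity > 0, spike_times, n_steps)

print(ttfs_encode(np.array([[1.0, 0.5], [0.1, 0.0]])))
# The brightest pixel spikes at t = 0; the zero-intensity pixel never spikes.
```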
Similarly to TTFS, ROC considers the relative timing of spikes, but across neuron populations. Naturally, the amount of information that can be encoded is constrained by the size of the neuronal population. Biological evidence of this method was presented by [91]. Through retinal recordings, using multielectrode arrays, they analyzed the response of mice Retinal Ganglion Cells (RGCs) to several stimuli. Their work concludes that a single pair of those cells did not provide sufficient information, meaning a larger population of RGCs would be necessary to encode all of the information. In turn, phase coding considers a global reference in the form of a periodic oscillatory signal, where spike times are phase-locked in relation to that reference oscillation (Figure 9). There is also biological evidence of this encoding strategy. For instance, [92] recorded local field potentials in macaques' V1 while presenting a colourized movie and demonstrated that both the spike count and the low-frequency local field potentials provided important information regarding the film. A similar study [93] showed that phase coding plays a critical role in swift responses for object identification and categorization tasks.
Regarding burst coding, the information is encoded considering both the number of spikes and the inter-spike interval. This allows controlling the precision in information transmission, since longer bursts (i.e., with more spikes) can carry more information. Burst spikes consist of low-duration, high-frequency spike trains that allow systems to transfer information more accurately. The presence of burst spike trains is well established in biological systems. Their function varies depending on the part of the brain we are considering. For example, in the hippocampus, these bursts play a role in memory maintenance, while in the thalamus they perform a "wake-up call" to prepare the neurons to receive a stimulus [95].
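As a toy illustration of this idea, the sketch below maps a scalar input strength to a burst whose spike count grows and whose inter-spike interval shrinks with the input; the mapping and the constants are illustrative assumptions, not a model taken from the cited studies.

```python
import numpy as np

def burst_encode(intensity, n_steps=100, max_spikes=5, min_isi=2, max_isi=10):
    """Toy burst encoder: stronger inputs produce more spikes with a shorter
    inter-spike interval (ISI) inside the encoding window."""
    intensity = float(np.clip(intensity, 0.0, 1.0))
    n_spikes = int(np.ceil(intensity * max_spikes))
    train = np.zeros(n_steps, dtype=np.uint8)
    if n_spikes == 0:
        return train
    isi = int(round(max_isi - intensity * (max_isi - min_isi)))
    train[np.arange(n_spikes) * isi] = 1   # the burst starts at t = 0
    return train

print(burst_encode(0.9).nonzero()[0])   # dense burst: spikes at t = 0, 3, 6, 9, 12
print(burst_encode(0.2).nonzero()[0])   # weak input: a single early spike
```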
In summary, the mammalian brain uses different information encoding methods depending on the task considered and, in SNN-based architectures, the same principle applies. However, more work is needed since, on the one hand, the exact mechanisms by which biological neurons encode information are not completely known and, on the other hand, it remains to be clarified in which situations each strategy is more suitable. Rate-based coding is common in SNN research, mainly when the focus is not the information coding itself but rather the study of a specific network module or training strategy, such as the learning rule; these methods are not the most biologically plausible, but they are straightforward to implement. Temporal methods are also frequently addressed due to their biological plausibility and fast inference response. For example, [96] propose a modified TTFS encoding strategy (T2FSNN), obtaining an accuracy of 91.43% on the CIFAR-10 [12] dataset. On the other hand, [97] hypothesized that using burst coding in deep SNNs is advantageous, as it increases energy efficiency and reduces latency. Their work proposes a hybrid neural coding scheme and a 2-layered network, with each layer possessing an independent neural coding strategy. The authors observed that burst coding in the hidden layer decreased the latency, regardless of the input layer's neural coding, and they report a 91.41% accuracy on the CIFAR-10 [12] dataset and 99.25% on MNIST [11]. These studies, however, took advantage of ANN-to-SNN conversion methods, meaning little work has been done towards studying and integrating different encoding schemes with SNNs developed from scratch. Since SNNs are still in their infancy, the most effective way to combine the encoding method with the neuron model and network architecture is yet to be discovered. This is of great importance since, if not properly encoded, information might be lost when propagated through the network layers.

FIGURE 9: Schematic representation of phase coding. In this case, it is not possible to distinguish the stimuli through their spike count (since it is the same for both) but rather through the timing of their appearance with respect to the global oscillatory reference. Source: Adapted from [94]

V. LEARNING STRATEGIES
SNNs aim to be a fault-tolerant, energy- and data-efficient, biologically inspired solution, but despite their promising results, there are some barriers to overcome. Namely, whilst it is well established that biological neurons process and transmit information in the form of spikes, the exact mechanisms by which biological neural networks learn remain an open research question. Inevitably, learning strategies in SNNs are coupled with the various elements of a neural network, including how information is encoded, the neuron model, and the general architecture. Learning in SNNs is, thus, a challenging task, and there is a need to find an optimum solution.
Much of the success of ANNs comes from the BP algorithm [98]. In the 1990s, it was thought that learning useful representations from raw data was not feasible, but later, in 2006, a team of researchers demonstrated that BP works remarkably well to train deep ANNs for classification and recognition tasks [99,100,101,102]. Since then, interest in BP grew exponentially and, today, it is undoubtedly the most widely used strategy to train DL algorithms [103]. However, there are strong arguments that BP is not biologically plausible. First, it is thought that biological neurons perform both linear and nonlinear operations, whilst BP consists of only linear mechanisms [104] (i.e., in each iteration, a first-order Taylor approximation of the function to be optimized is applied). Second, the FB path would have to use weights symmetric to those of the forward propagation, which does not occur in biological systems, the so-called weight transport problem [105]. Also, BP would require bidirectional synapses, whereas biological presynaptic neurons connect unidirectionally to their postsynaptic equivalents; besides, learning in the brain occurs continuously, in a single step, whereas BP is a 2-step algorithm. Further, BP would require neurons to store the exact derivatives of the activation functions, but how these derivatives could be computed and stored by neurons is currently unclear [104]. Adding to this, the activation functions of biological neuron models (e.g., LIF) are non-differentiable, and BP is instantaneous, whereas biological neural networks, due to the nature of spike trains, perform asynchronous computations over time. Lastly, despite some evidence confirming the existence of some form of BP in the brain [106,107,108,109], the exact mechanisms by which it occurs are poorly understood [110]. For these reasons, it is challenging to train SNNs using BP, as is normally done with state-of-the-art CNNs.

A. CONVERSION OF CONVENTIONAL ANNS TO SNNS
Broadly speaking, SNNs can be divided into 2 dominant categories: ANNs converted to SNNs and directly trained SNNs. In the first case, conventional ANNs are fully trained, using BP, before being converted to an equivalent model consisting of spiking neurons. This method is often referred to as rate-based learning since, commonly, the analogue outputs of the traditional ANNs are converted to spike trains through rate encoding. Directly trained SNNs are trained resorting to biologically plausible learning rules or use approximations to allow BP, but in either case, they take full advantage of spiking neurons.
Naturally, ANNs converted to SNNs usually achieve performances comparable to state-of-the-art ANNs. Nonetheless, their accuracy is still behind that of non-spiking ANNs of approximate architecture [111]. The conversion itself could explain this gap, since it assumes that the firing rate of the SNN is equal to the activation of the ANN, which is not necessarily true and thus might be a source of error. Other disadvantages of such a learning approach include the lack of biological plausibility, as well as the limited implementation of many ANN operators that are crucial to improving the performance of the networks, like max-pooling, batch normalization, or the softmax activation function [112]. This means that ANNs converted to SNNs make many approximations that reduce the generalizability of conversion methods [112]. Another disadvantage resides in rate-based coding, in which computational costs increase linearly with the firing rate. Ultimately, SNNs' efficiency could be dampened in very deep architectures or in situations where many neurons activate or possess high firing rates [112]. To this end, work has been done towards near-lossless conversion [113,111,114,115,112,116]; nonetheless, direct training of SNNs is often preferred so researchers can take full advantage of SNNs' properties. Concerning direct training of SNNs, ubiquitous ANN training paradigms, like SL, reinforcement learning, and UL, have been explored [17].

B. UNSUPERVISED LEARNING
Initial works propose the use of STDP, a form of UL with a local learning rule that is both biologically plausible and adequate to deal with the non-differentiable discrete binary spikes characteristic of SNNs. As seen previously, STDP arises from Hebbian learning theories, and it states that a synaptic weight is potentiated or depressed depending on tight correlations between pre- and postsynaptic activations. However, some variant of STDP is often preferred over the formal definition (Equation 4), predominantly because it is challenging to assess the precise timing of postsynaptic action potentials. On the other hand, from a biological perspective, although STDP depends on the precise timing of spikes, there exist other properties that change slightly between distinct types of synapses, meaning there is a remarkable diversity of STDP rules [117].
One of the most frequently cited unsupervised STDP algorithms is that of [45]. The authors suggest a fully unsupervised, biologically plausible, shallow algorithm to perform a classification task on the MNIST [11] handwritten digits dataset. Inspired by the work of [118], they use a variant of STDP that resorts to synaptic traces (Equation 9) to compute weight dynamics, whilst arguing it improves simulation speed. Furthermore, they test the proposed architecture with 3 other learning rules: first, they added an exponential weight dependence to the previous strategy of synaptic traces [119,120] (see Equation 10); subsequently, they considered pre- and postsynaptic traces, with independent weight updates for pre- and postsynaptic activity (Equation 11); and, at last, they resorted to the triplet STDP [121] with divisive weight normalization [45].
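The sketch below illustrates, for one simulation step, a trace-based update in the spirit of the rule used by [45] and [118]: presynaptic spikes bump an exponentially decaying trace, and the weights move on each postsynaptic spike with a soft bound at w_max. The constants and the exact form of the soft bound are illustrative assumptions rather than a reproduction of Equations 9-11.

```python
import numpy as np

def trace_stdp_step(w, x_pre, pre_spikes, post_spike,
                    eta=0.01, x_tar=0.1, w_max=1.0, mu=1.0, tau_trace=20.0, dt=1.0):
    """One simulation step of a trace-based STDP rule (in the spirit of [45]).

    x_pre is the presynaptic trace (one entry per input synapse); on each
    postsynaptic spike the weights move according to how recently the
    presynaptic neurons fired, with a soft bound at w_max.
    """
    # Decay the presynaptic trace and register new presynaptic spikes.
    x_pre = x_pre * np.exp(-dt / tau_trace) + pre_spikes
    # Weight update only when the postsynaptic neuron fires.
    if post_spike:
        w = w + eta * (x_pre - x_tar) * (w_max - w) ** mu
        w = np.clip(w, 0.0, w_max)
    return w, x_pre

w = np.full(4, 0.5)
x_pre = np.zeros(4)
w, x_pre = trace_stdp_step(w, x_pre, pre_spikes=np.array([1, 0, 0, 1]), post_spike=True)
print(w)   # synapses with recent presynaptic activity are potentiated
```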
Notably, they added further biologically plausible mechanisms, like inhibitory synapses, to achieve competitive learning, thus promoting separability of the classes and improving the overall performance of their network. Likewise, the work of [122] introduces lattice map spiking neural networks (LM-SNNs), where STDP is used in conjunction with the Self Organizing Map (SOM) [123] algorithm. The proposed architecture is, in many ways, similar to [45], but instead of a fixed inhibition it introduces the notion of relaxed inhibition that increases with inter-neuron distance. The authors argue this encourages neighbouring neurons to learn similar filters, thus conferring on the network the capacity to learn while seeing limited examples. Remarkably, the network is trained with a 2-level inhibition scheme, where after a set of predefined samples the inter-neuron inhibition increases suddenly, favouring competition and, consequently, augmenting model performance.
But UL based on STDP and its variants limits the models to shallow architectures that have limited expressive power and could not scale well to larger real-world problems [124,125,75]. Therefore, proper strategies must be found to permit unsupervised training of deep SNNs. A simple yet powerful solution is to train unsupervised deep SNNs in a layer-wise manner. For example, [126] train an unsupervised 3-layered model based on this paradigm. Interestingly, they combine a weight-dependent variant of STDP with a simplified Bayesian neuron model to classify the MNIST [11] handwritten digits dataset, yielding competitive results. Likewise, [127,33,128] propose a greedy layer-by-layer training scheme. Note that these works perform classification of the extremely simple MNIST [11] handwritten digits dataset, meaning that, despite being promising, there is no guarantee that these models would work with more realistic data. We argue there is an unmet need to develop and validate SNN models on complex datasets like CIFAR-10 [12], CIFAR-100 [12], or ImageNet [129].

C. SUPERVISED LEARNING
SL is an appealing learning scheme that has been demonstrated to be effective for training traditional ANNs. However, due to inherent incompatibilities with spiking neurons, researchers must devise appropriate strategies to apply BP to deep SNNs. To this end, much work has been done, and some strategies have proven successful. However, almost all solutions approximate the activation functions: computations are performed with surrogate gradients, whose approximations inevitably cause some loss of performance. Another disadvantage of training SNNs with BP is the lack of biological plausibility. Nonetheless, BP has allowed unprecedented results of SNNs on various benchmarking datasets. The work of [130] suggests an approximate derivative algorithm that accounts for the leaky behaviour of LIF neurons. Interestingly, they leverage central ideas from well-known CNN architectures, like the dropout strategy, and introduce them in their model. The proposed method permits the training of deep multilayer SNNs, namely VGG and Residual architectures, and the application of spike-based BP. More specifically, the authors set the threshold of the last-layer neurons to a high value so that these neurons do not fire. Then, they define the output of the last layer as the accumulated voltage divided by T time steps. With this strategy, they compute a loss function, defined as the squared error over all output neurons (Equation 13), where the error (Equation 12) is computed as the difference between the target label and the output. Next, to propagate the error to hidden layers, this work approximates each output neuron activation function as equivalent to the total input current received by the neuron over T time steps (Equation 14).
Output error: $e_j = \mathrm{output}_j - \mathrm{label}_j \quad (12)$. Regarding the hidden layers, they propose post-spike trains as neuronal outputs, with a pseudo-derivative, assuming the following approximations. First, they estimate the derivative of an Integrate and Fire (IF) neuron. Next, a leak correctional term is estimated to compensate for the leaky behaviour of the LIF neurons. Finally, they obtain the approximate derivative of the LIF neuron activation function as the combination of the 2 previous estimates. With this strategy, they can train deep multilayer SNNs and achieve competitive performances. Notably, they compare ANN and SNN methods, demonstrating that the considered solution achieves comparable inference efficiency in terms of spikes per inference. Their results underline that directly trained deep SNNs (like VGG9 and ResNet11) are more efficient than ANN-SNN converted networks. More importantly, their work suggests that direct training of SNNs, resorting to spike-based BP, requires less computational energy than ANN-to-SNN conversion, highlighting the potential of SNNs for energy-efficient computations.
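As an illustration of this output-layer formulation, the sketch below accumulates the input current of the non-firing output neurons over T time steps and forms the squared error of Equations 12-13; tensor shapes and names are assumptions made for clarity.

```python
import torch

# Illustrative sketch of the output-layer loss described above (Eqs. 12-13):
# output neurons never fire (high threshold), their input current is integrated
# over T time steps, and the squared error to a one-hot label is minimized.
def output_loss(input_currents, labels, num_classes):
    # input_currents: (T, batch, num_classes) current received by output neurons
    T = input_currents.shape[0]
    accumulated = input_currents.sum(dim=0) / T                # accumulated voltage / T
    target = torch.nn.functional.one_hot(labels, num_classes).float()
    error = accumulated - target                               # Eq. 12: output - label
    return 0.5 * (error ** 2).sum(dim=1).mean()                # Eq. 13: squared error
```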
Another work revolves around SL based on temporal coding [131]. To overcome the limitations of discrete rate encoding schemes, the authors resort to a temporal encoding strategy, where information is encoded in the spike times, thus a continuous representation that is favourable for applying BP. Moreover, it is argued that this strategy is more efficient than rate encoding since power consumption decreases with smaller firing rates [132,131,133,116]. They use non-leaky IF neurons with synaptic current kernels, which means that the current increases instantaneously, at the moment of arrival of an input spike, but decays exponentially afterwards. To ensure a single spike is emitted per neuron after firing, the neurons are set to an infinitely long refractory period. To define a neuron's activation function, the authors establish the set of spikes that triggered a firing as the Causal Set of Input Spikes (C). Resorting to said set, they can subsequently determine a nonlinear relationship that maps input spike times to the spike time of the firing neuron. This way, a differentiable cost function can be imposed and the gradient, with respect to the weights in the shallower layers, computed by backpropagating the errors through the network. [134] introduced Deep Continuous Local Learning (DECOLLE), a learning approach focused on local error functions. To compute the local errors, the authors use intermediate classifiers with fixed random weights and auxiliary random targets, ŷ. The inputs to these classifiers consequently represent the activations of the layer being trained. Moreover, instead of minimizing a global objective function, the algorithm minimizes many local objectives, yet this approach still allows the network to minimize the loss at the top layer. In addition, it puts pressure on deeper layers to learn relevant representations while leading the network to learn a stack of useful hierarchical features [131]. To enforce locality, DECOLLE sets all non-local gradients to zero. Errors are propagated to only update the weights of the connections incoming to the local spiking layer, but the overall approach can be interpreted as a synthetic gradient without an outer loop to mimic a full BP. They trained an SNN with three convolutional layers and Poisson encoding, reporting an error of 4.46 % on the DVS128-Gesture [135] dataset.
In turn, [136] developed a SL strategy based on TTFS coding. This approach allows the authors to derive an analytical expression for the time, T, at which the membrane voltage will first cross the threshold. The expression for T is thus differentiable with respect to synaptic weights and presynaptic spike times, meaning BP can be used. Plus, it allows the exact computation of partial derivatives, and the fact that each neuron only spikes once is extremely desirable from an efficiency standpoint. Another work that takes advantage of temporal coding for computing the exact derivatives of postsynaptic spike times with respect to presynaptic spike times for BP is that of [137]. The adopted neuron model was the Spike Response Model (SRM), with an alpha function [138] to model the synaptic conductance and consequently the membrane potential of the postsynaptic neuron. This allows computing the spike time of an output neuron with respect to the presynaptic spike times. The desired behaviour to predict a class is that the neuron of the correct class should be the first to spike. To obtain the prediction error, the softmax function is computed on the negative values of the output spike times, which minimises the spike time of the target neuron while maximising the spike times of the non-target neurons. Next, the cross-entropy loss is calculated in the usual way.
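The readout of [137] can be illustrated with a few lines of PyTorch: the softmax is computed over the negative output spike times so that earlier spikes receive higher probability, and the cross-entropy loss follows; shapes and names below are our assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the temporal-coding readout described above: the softmax is taken
# over the *negative* output spike times, so earlier spikes get larger
# probabilities; cross-entropy then pushes the target neuron to fire first.
def temporal_ce_loss(spike_times, labels):
    # spike_times: (batch, num_classes) time of the first spike of each output neuron
    logits = -spike_times                 # earlier spike -> larger logit
    return F.cross_entropy(logits, labels)
```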
A major drawback when applying BP to SNNs is the need to compute the membrane voltage of each neuron at each time step, which is computationally prohibitive on current hardware, despite being efficient when SNNs are implemented on neuromorphic hardware. Additionally, there is the problem of the non-differentiable LIF neuron activation function. To circumvent this, the work of [139] suggests an abstract layer response model of the neurons for training deep networks, as illustrated in Figure 10.
Considering a single layer, l, the algorithm starts by defining the layer response model input and output as, respectively, $z_{l-1,i} = e^{t_{l-1,i}/\tau}$ and $z_{l,j} = e^{t_{l,j}/\tau}$, with $t_{l,j}$ the time to a neuron's first and only spike as defined by TTFS coding, and $i$/$j$ indexing the input/output neurons. Then, the output of the L-layered network, $Z_L$, is defined by a nonlinear mapping, $f$, and connection weights, $W$, such that $Z_L = f(Z_0, W)$, which allows obtaining the loss, L, for the target class, C. Finally, considering that the layer response model uses the closed-form input-output response $t_j = f(t_i; w_{ji})$, it becomes possible to apply BP as is traditionally done in ANNs. The learned weights, $w_{ji}$, can then be transferred to IF neurons for inference and hardware implementation.
FIGURE 10: Difference between the neuron model for neuromorphic hardware implementation and the layer response neuron model for training. Source: [139]
This algorithm extends the work of [131], but overcomes some of its limitations. Namely, whereas [131] was limited to shallow networks (2-3 fully connected layers), [139] propose a new training scheme that improves computation speed and convergence, allowing deep SNNs to scale.
The work of [140], on the other hand, explores latency learning to avoid differentiating the thresholding activation function. It considers TTFS coding, with one spike per neuron, and IF neurons. At the output layer, there is only 1 neuron per class, and the predicted class is that of the first spiking neuron. This is a very sparse and energy-efficient coding scheme. The network is trained with a temporal form of BP, with surrogate gradients for the neuron firing time with respect to its membrane potential, where the error is computed as the difference between the actual and target firing times. Interestingly, target firing times are defined dynamically: the authors propose a relative method that takes the actual firing times into account. Assuming an input image of the i-th category, the minimum output firing time, $\tau$, is computed as $\tau = \min\{t^o_j \mid 1 \le j \le C\}$. The target firing time is then set relative to $\tau$, where $\gamma$ is a positive constant term penalizing output neurons with firing times close to $\tau$. In the special case where neurons remain silent during the entire simulation time, the target is defined so as to promote the firing of the output neuron during the simulation. Although a simple solution, the proposed strategy was demonstrated to achieve competitive performance on the MNIST [11] and CALTECH face/motorbike datasets. Given the very sparse nature of this algorithm, the authors suggest it could be particularly energy and memory efficient, especially when combined with neuromorphic hardware. Also, it can make accurate and quick predictions, as a decision can be made before all neurons have fired, well before the entire stimulus presentation time, contrary to rate-based SNN models.
Recently, [141] introduced EventProp, a novel and promising event-based method that employs BP while allowing the computation of exact gradients. In essence, the authors backpropagate errors at spike times to obtain the exact gradient in an event-based, temporally and spatially sparse manner. More specifically, to compute the gradient, the algorithm starts by considering the LIF neuron model as a dynamical system in which the partial derivative of a state variable (V(t)) with respect to a parameter, p (a synaptic weight, w), jumps at discontinuities. Next, the adjoint method [142] is combined with the partial derivative jumps of the LIF neuron to derive the EventProp algorithm, which the authors describe as analogous to BP in ANNs. Since EventProp backpropagates errors at spike times, it only requires storing variables at spike times. It therefore has low memory requirements, which is favourable for neuromorphic computing. In fact, contrary to the previous strategies that require the computation of gradients for every neuron at each time step, EventProp only requires computations for spiking neurons, thus possibly being more energy-efficient than other works.
In contrast to biological neurons, which present dynamic membrane properties (e.g., heterogeneous time constants, adaptive thresholds), most existing learning algorithms require manual tuning of membrane-related parameters and assume that all neuron populations in an SNN share the same values for those constants. To achieve a more biologically plausible algorithm, [143] introduced the Parametric Leaky Integrate-and-Fire (PLIF) neuron, which presents a learnable time constant, $\tau$. This work's training framework is based on a BP strategy. Defining the neurons' output as O and the target as Y, the loss function is defined as the Mean Squared Error (MSE(O, Y)), considering that the neuron that represents a given class should have the maximum excitability, whereas the others should remain silent. The authors then compute the gradients with a surrogate activation function defined as $\frac{1}{\pi}\arctan(\pi x) + \frac{1}{2}$. The results were evaluated on neuromorphic datasets, like CIFAR10-DVS [144] and DVS128-Gesture [135], and suggest that PLIF-based SNNs learn faster and achieve better performance than LIF-based SNNs.
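A minimal PyTorch sketch of a PLIF-style neuron is given below, combining a learnable membrane decay with the arctan surrogate gradient mentioned above; the specific parameterization (an unconstrained parameter mapped through a sigmoid) and the reset rule are our assumptions, not necessarily those of [143].

```python
import math
import torch
import torch.nn as nn

class ArctanSurrogate(torch.autograd.Function):
    """Heaviside spike in the forward pass; the backward pass uses the
    derivative of 1/pi * arctan(pi*x) + 1/2, i.e., 1 / (1 + (pi*x)^2)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / (1.0 + (math.pi * x) ** 2)

class PLIFNeuron(nn.Module):
    """Sketch of a PLIF-style neuron: the membrane decay is a learnable
    parameter shared by the layer, mapped through a sigmoid so it stays in
    (0, 1). This parameterization is an assumption for illustration."""
    def __init__(self, threshold=1.0, init_w=0.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))   # learnable raw decay parameter
        self.threshold = threshold

    def forward(self, current, mem):
        decay = torch.sigmoid(self.w)                 # learnable membrane decay
        mem = decay * mem + current
        spk = ArctanSurrogate.apply(mem - self.threshold)
        mem = mem * (1.0 - spk)                       # hard reset after a spike
        return spk, mem
```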
In the same line of reasoning, [145] introduced an SNN with direct input encoding and leakage and threshold optimization (DIET-SNN). The proposed learning scheme resorts to a hybrid strategy where an ANN is first converted to an SNN before being fine-tuned with surrogate gradients and BPTT. Interestingly, weighted pixel values are directly fed to the network's first layer at each time step, instead of being converted to spikes. The first layer of the network thus works as both feature extractor and spike generator. Moreover, the underlying training strategy, based on surrogate gradients and BPTT, optimizes not only the network parameters (connection weights) but also the LIF neuron parameters, namely the firing threshold and the membrane leak factor. The authors suggest the neuron threshold is an important parameter: if too high, it prevents the neuron from firing (dead neuron problem), and if too low, it affects the ability of the neuron to distinguish between input patterns. In turn, the optimized membrane leak makes the network's firing response less sensitive to irrelevant input and increases the sparseness of the convolutional and dense layers. DIET-SNN was tested on the CIFAR-10 [12], CIFAR-100 [12], and ImageNet [129] datasets and demonstrated a better latency/accuracy tradeoff with 20-500x fewer timesteps.
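The direct input encoding idea can be sketched as follows: the analog pixel values are fed as input current to the first convolutional layer at every time step, and that layer's spiking neurons generate the spike trains for the rest of the network. The LIF dynamics and hyperparameters below are simplified assumptions, not the DIET-SNN implementation.

```python
import torch
import torch.nn as nn

class DirectEncoder(nn.Module):
    """Sketch of direct input encoding: analog pixels act as input current to
    the first convolutional layer at every time step; that layer's LIF neurons
    produce the spike trains consumed by the rest of the network."""
    def __init__(self, threshold=1.0, leak=0.9):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.threshold, self.leak = threshold, leak

    def forward(self, image, num_steps=5):
        mem = torch.zeros_like(self.conv(image))
        spikes = []
        for _ in range(num_steps):
            mem = self.leak * mem + self.conv(image)   # same analog input each step
            spk = (mem >= self.threshold).float()      # fire when the threshold is crossed
            mem = mem - spk * self.threshold           # soft reset by subtraction
            spikes.append(spk)
        return torch.stack(spikes)                     # (T, batch, C, H, W) spike trains
```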
Importantly, BP has also permitted the training of very deep ANNs. In conventional ANNs, deep residual networks, which consist of many stacked "Residual Units", have achieved top performance. Identity mapping is a central idea in residual learning, where the output of layer l, $X_{l+1}$, is given by $F(X_l, W) + X_l$, with $X_l$ the input feature vector and $F(X_l, W)$ the residual mapping to be learned. The operation is realized by skip connections that simply perform identity mapping. With this strategy, ResNet [3] was presented as a robust and reliable solution to the vanishing/exploding gradient problem in deep networks. Similarly, ResNets converted to Spiking ResNets have achieved competitive performance [146,147], although they require more time steps to achieve top results, as conversion is based on rate coding. To overcome the limitations of conversion strategies, [148] introduced the Spike-Element-Wise ResNet (SEW ResNet), a direct training strategy that allows residual learning in SNNs. Much like Spiking ResNet [146], SEW ResNet substitutes the ReLU activation with a Spiking Neuron (SN); however, it also introduces an element-wise function, g, to realize identity mapping. This strategy overcomes the drawbacks of Spiking ResNet, namely the exploding/vanishing gradient problem and the limited applicability of Spiking ResNet to specific neuron models/dynamics. Figure 11 illustrates the main differences between the ResNet, Spiking ResNet, and SEW ResNet modules.
FIGURE 11: Residual modules of, from left to right: ResNet, Spiking ResNet, and SEW ResNet. Source: [148]
Combined with spike-based BP, this strategy has allowed, for the first time, the training of very deep SNNs (with more than 100 layers) and achieved competitive results.
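A schematic SEW-style residual block is sketched below, assuming ADD as the element-wise function g and a generic spiking-neuron module applied one time step at a time; layer sizes and the per-step formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEWBlock(nn.Module):
    """Sketch of a Spike-Element-Wise residual block: conv -> spiking neuron ->
    conv -> spiking neuron, then an element-wise function g combines the block
    output with the (spiking) shortcut. We assume g = ADD; 'neuron_factory'
    stands in for any spiking-neuron module that returns binary spikes."""
    def __init__(self, channels, neuron_factory):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.sn1 = neuron_factory()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.sn2 = neuron_factory()

    def forward(self, spikes_in):
        out = self.sn1(self.bn1(self.conv1(spikes_in)))
        out = self.sn2(self.bn2(self.conv2(out)))
        return out + spikes_in   # element-wise g = ADD realizes the identity mapping
```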
Although reinforcement learning is not the focus of this work, the strategy has also been used to train SNNs [149,150]. It stems from biological evidence that neuromodulators impact learning in the brain; for example, dopamine has been demonstrated to act as a reward signal, affecting synaptic plasticity [151,152,153].
In summary, several strategies have been adopted to train SNNs, some more successful than others. For a complementary and comprehensive overview of learning in SNNs, we refer the reader to the work of [17]. Still, we have seen that learning in SNNs is an open research question that presents many challenges, mainly due to the non-differentiable nature of SNN activation functions, the sparsity of spike trains, and computations over time. Many of the proposed solutions resort to approximations, yet this translates to limited generalizability, meaning more effort is needed from the research community towards a unified strategy for training SNNs.

VI. RESULTS AND DISCUSSION
Many SNN algorithms demonstrate good results on simple datasets like the MNIST [11] handwritten digits dataset, yet most papers do not address more complex data. Moreover, SNNs require extensive fine-tuning of hyperparameters, and this task can have a significant impact on learning and on the subsequent models' accuracy. In addition, most papers do not report the set of hyperparameters used, which hinders the reproducibility of the methodology. Overall, there is a need to understand general practical considerations when implementing SNN models. We argue this knowledge could help researchers achieve better and faster results when developing custom architectures. Hence, we here implement 2 learning algorithms: SL with surrogate gradients and unsupervised STDP. For STDP, we replicate a widely cited algorithm [45]. With respect to SL, we resort to 3 network architectures with varying degrees of depth and complexity.
The goal is to provide the reader with some insights about practical considerations when implementing SNNs and to compare supervised and unsupervised training strategies. We further discuss the advantages and disadvantages of each strategy while exploring the relevance of hyperparameter optimization and model performance on different computer vision datasets. In this case, we opted for the MNIST [11] and CIFAR-10 [12] datasets. Table 1 summarizes some details and the reported performance of the previously reviewed algorithms. We observe that most works use the MNIST [11] handwritten digits dataset, and that performance usually decreases on more complex data. Notably, [33] report a performance of 93.00% on the CIFAR-10 [12] dataset resorting to an UL strategy. In general, SL and ANN-to-SNN conversion demonstrate superior performance, but their lack of biological plausibility and limited ability to generalize make these strategies less attractive.
For replicating the work of [45], we opted for the BindsNET [156] package. Moreover, we introduced the dropout [157] strategy to assess how this regularization technique would impact learning. The choice of SNN simulation software is critical when developing SNN models. Several tools exist, but most of them are not ML-oriented or present steep learning curves for new users [156]. On the other hand, many are biology-oriented, meaning they are computationally expensive to simulate and need extensive hyperparameter tuning [156]. Thus, [156] developed BindsNET, an ML-oriented library in Python, built on top of PyTorch [158]. BindsNET contains a set of software objects and methods to simulate diverse types of neurons (bindsnet.network.nodes), as well as various types of connections between them (bindsnet.network.topology). The bindsnet.network.Network object is responsible for combining the different nodes and connections, while also coordinating the simulation logic of all underlying components [156]. We opted for this package since it allows rapid prototyping and permits GPU acceleration. Furthermore, BindsNET requires less hyperparameter fine-tuning than the Brian [159] simulator.
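A minimal BindsNET pipeline in the spirit of this setup might look as follows; layer sizes and constants are illustrative, and exact argument names may differ between BindsNET releases.

```python
import torch
from bindsnet.network import Network
from bindsnet.network.nodes import Input, LIFNodes
from bindsnet.network.topology import Connection
from bindsnet.learning import PostPre
from bindsnet.encoding import poisson

# Sketch of a minimal BindsNET pipeline: an input layer, an excitatory LIF
# layer, and a trace-based STDP (PostPre) connection. Layer sizes and constants
# are illustrative assumptions.
network = Network(dt=1.0)
network.add_layer(Input(n=784, traces=True), name="X")
network.add_layer(LIFNodes(n=100, traces=True), name="Y")
network.add_connection(
    Connection(
        source=network.layers["X"],
        target=network.layers["Y"],
        update_rule=PostPre,      # pair-based STDP on pre/post traces
        nu=(1e-4, 1e-2),          # pre / post learning rates
        wmax=1.0,
    ),
    source="X",
    target="Y",
)

# Poisson-encode one flattened MNIST-like image and run the simulation.
image = torch.rand(784) * 128.0            # stand-in for pixel intensities (Hz)
spikes = poisson(datum=image, time=250)
network.run(inputs={"X": spikes}, time=250)
```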
For dropout regularization, instead of randomly dropping connections, we randomly forced input pixel intensities to zero during training, with probability $p_{drop}$. The corresponding postsynaptic neurons would not activate, and consequently the connection weights would not be updated. This ensures the same units are dropped during the entire duration of the example presentation and avoids averaging effects [130]. Table 2 presents the obtained performances, allowing a comparison between models trained with different hyperparameters and different datasets. Furthermore, it shows the effects of dropout regularization on the training of SNNs.
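This input-level dropout can be sketched in a few lines; the mask is sampled once per example so the same pixels remain silent for the whole presentation window (the function name is ours).

```python
import torch

# Sketch of the input-level dropout used here: a binary mask is sampled once
# per example and applied to the pixel intensities *before* Poisson encoding,
# so the same input units stay silent for the entire presentation window.
def drop_input_pixels(image, p_drop=0.2):
    mask = (torch.rand_like(image) > p_drop).float()   # one mask per example
    return image * mask
```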
Our results are consistent with the performances reported by [45]. We observe that models with more neurons reach higher accuracy. Besides, introducing the dropout strategy also produces significant performance increases. A challenge was the tuning of the excitatory and inhibitory connection strengths: with the values found by our random search, performances are significantly lower than with the values provided by the BindsNET authors in their examples directory (excitatory strength = 22.50 and inhibitory strength = 120), but significantly better than with the default parameters of the BindsNET implementation of the [45] algorithm (excitatory strength = 22.50 and inhibitory strength = 17.50). The suboptimal parameters detected by our algorithm could be explained by the nature of random search, which, despite being more computationally efficient, might not find the optimum set of hyperparameters. Broadly speaking, random search [160] yields results as good as or better than other strategies, like manual search or grid search; however, given the nature of the addressed SNN architecture, these two hyperparameters impact competition and neuron activation, directly influencing learning, and are thus of great importance when training the model. This underlines the challenges of hyperparameter optimization in SNNs. Furthermore, we observe that the algorithms trained with dropout have, in general, better performances than their no-dropout counterparts, particularly for the architectures with a smaller number of neurons. We argue that, in this situation, dropout regularization prevents a limited subset of neurons from dominating the others (due to competition), thus reducing overfitting. Interestingly, due to the sparse nature of SNNs, a much lower $p_{drop}$ (0.2) is required when compared to conventional ANNs (usually around 0.5) [130]. Finally, we observe a reduced performance (10% ≤ accuracy < 12.89%) when the model was trained on the CIFAR-10 [12] dataset. These results underline the challenges of training SNN models on complex datasets. Figure 12 shows two different sets of weights learned by the network trained on the CIFAR-10 [12] dataset. In Figure 12-A, we show a set of neurons' RFs (weights selected randomly and rearranged to the shape of the input image) of the model with the best training performance (accuracy = 18.80%), and in Figure 12-B, the same neurons' RFs after 1800 training iterations (accuracy = 9.74 %). It clearly demonstrates that in datasets with a considerable degree of variability, the averaging effect of this SNN architecture leads to simple and non-interpretable receptive fields, where a single neuron can respond to a wide range of prototypes. This results in high confusion between classes and poor model performance. To further illustrate the challenge of training on complex data with shallow SNN models, in Figure 13 we show that there is an overlap of learned RFs between 2 distinct classes. The weights corresponding to high-intensity regions (higher firing rates) change the most when compared to smaller firing rate input stimuli. In fact, the most impressive performances reported in the literature for models trained on complex datasets usually resort to SL strategies, like approximate BP or ANN-to-SNN conversion, or use some form of CNN, meaning there is an unmet need to find the best strategies to train UL SNN models that can compete with traditional ANNs.
For the SL experiments, due to ease of use and similarity with typical PyTorch workflows, we opted for the snnTorch [161] Python package, as it is specifically designed for performing gradient-based learning with SNNs. snnTorch can be intuitively used with PyTorch [158] and is agnostic to typical neural network layers, such as fully connected layers, convolutional layers, or residual connections. Interestingly, spiking neurons are designed in such a way that they can easily be stacked on top of other neural network layers as if they were yet another activation function. Furthermore, membrane potentials are computed recursively, meaning the gradient can be computed without storing membrane potential traces for all neurons in a system. Naturally, snnTorch is suitable for GPU-enabled training, which increases the speed of computations.
FIGURE 12: A: Neurons' RFs (selected randomly and rearranged to the input image shape) of the best performing model on the training set. The RFs are well defined; it is even possible to identify some classes (e.g., horse, car). B: The same neurons' RFs after 1800 training iterations. After several iterations, the learned RFs are simple and non-interpretable, which could be explained by the high degree of variability in the CIFAR-10 dataset, where averaging effects lead to neurons that respond to a wide range of prototypes.
FIGURE 13: ... Bottom: corresponding learned RFs. High-intensity image regions dominate the activation of the excitatory layer neurons, meaning that if there is a high degree of overlap between a previously learned RF and a high-intensity region of a new image, that neuron has a high probability of firing again, regardless of the input class. The synaptic weights corresponding to high-intensity regions (higher firing rates) change the most when compared to smaller firing rate input stimuli.
Considering specifically the MNIST [11] handwritten digits dataset, we opted for a 2-layered network, similar to [45]. We considered a fully connected hidden layer, with N neurons, for feature extraction, and an output classification layer with 10 neurons (1 per class). We then converted the static images to spike trains resorting to a Poisson distribution, where each pixel intensity was encoded in the mean firing rate. Each batch of sample images was presented to the network for 25 time steps. To allow BP, we considered the Cross Entropy (CE) and Mean Squared Error (MSE) count losses, L, respectively established by Equations 18 and 19, with N the batch size, C the number of classes, T the batch presentation time, $S_k(t)$ the predicted spike activity of the LIF neurons (in $L_{CE}$, $S_y$ represents the predicted spiking activity of the correct class), and $\hat{S}(t)$ the target spiking activity. To calculate the losses, the spikes are first accumulated over T time steps. The CE Count Loss, $L_{CE}$, promotes the neuron of the correct class to spike at each time step, t, while suppressing the activation of the incorrect classes' neurons. In turn, the MSE Count Loss, $L_{MSE}$, encourages a spiking activity target for the correct class (correct rate) and a target for the incorrect classes (incorrect rate). To make BP possible, we approximate a neuron's activation function as a Sigmoid, similar to [162], which then allows computing surrogate gradients. Considering this training strategy, we assessed the impact of different hyperparameters on network performance, namely the number of hidden neurons, the learning rate, η, and the decay rate of the membrane potential, β. Table 3 presents in detail the obtained performances for different values of the considered hyperparameters on the MNIST [11] handwritten digits dataset. Overall, we observe better performance with SL than with UL. Generally speaking, our results indicate $L_{MSE}$ to be more suitable than $L_{CE}$, while the decay rate, β, is also a very important hyperparameter. We obtained a top performance of 98.53 % with 6400 hidden neurons, trained with $L_{MSE}$ (correct rate = 0.8, incorrect rate = 0.1), learning rate η = 5e−4, and decay rate β = 0.75. In turn, comparing the performance of the 800-hidden-neuron SL network with the best results of our experiments on the Diehl and Cook SNN [45], we observe an improvement of 6.12 %.
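A condensed snnTorch sketch of this 2-layered setup is shown below: Poisson (rate) encoding, LIF layers with a sigmoid surrogate gradient, and the MSE count loss with the correct/incorrect rates stated above. Hyperparameters follow the text where given; everything else (e.g., the dummy data and the single training step) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import snntorch as snn
from snntorch import surrogate, spikegen
from snntorch import functional as SF

# Sketch of the 2-layered supervised setup: Poisson (rate) encoding, a fully
# connected hidden layer of LIF neurons, a 10-neuron output layer, sigmoid
# surrogate gradients, and a spike-count loss.
class TwoLayerSNN(nn.Module):
    def __init__(self, n_hidden=800, beta=0.75):
        super().__init__()
        spike_grad = surrogate.sigmoid()              # sigmoid surrogate, as in [162]
        self.fc1 = nn.Linear(28 * 28, n_hidden)
        self.lif1 = snn.Leaky(beta=beta, spike_grad=spike_grad)
        self.fc2 = nn.Linear(n_hidden, 10)
        self.lif2 = snn.Leaky(beta=beta, spike_grad=spike_grad)

    def forward(self, spike_input):                   # spike_input: (T, batch, 784)
        mem1, mem2 = self.lif1.init_leaky(), self.lif2.init_leaky()
        out_spikes = []
        for step in range(spike_input.shape[0]):      # iterate over the T time steps
            spk1, mem1 = self.lif1(self.fc1(spike_input[step]), mem1)
            spk2, mem2 = self.lif2(self.fc2(spk1), mem2)
            out_spikes.append(spk2)
        return torch.stack(out_spikes)                # (T, batch, 10) output spikes

net = TwoLayerSNN()
loss_fn = SF.mse_count_loss(correct_rate=0.8, incorrect_rate=0.1)  # MSE count loss
optimizer = torch.optim.Adam(net.parameters(), lr=5e-4)

images = torch.rand(128, 28 * 28)                     # stand-in batch of MNIST images
labels = torch.randint(0, 10, (128,))
spikes = spikegen.rate(images, num_steps=25)          # Poisson encoding, 25 time steps
loss = loss_fn(net(spikes), labels)                   # spike counts vs. target rates
optimizer.zero_grad()
loss.backward()
optimizer.step()
```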
We also ran the 2-layered shallow network on the CIFAR-10 [12] dataset. The results are very consistent with the performance of the Diehl and Cook [45] SNN on these data, suggesting deeper networks would be necessary to achieve competitive results. Therefore, we adopted 2 additional SNN architectures. The topology of the first network (CSNN) consisted of 2 5x5 convolutional layers, for feature extraction, with, respectively, 24 and 128 output channels, interleaved with a 2x2 max pooling layer and followed by a LIF neuron layer. On top of these 2 convolutional blocks, we added a fully connected layer for input classification. Figure 14 illustrates the considered network architecture. We then trained various models with different hyperparameters to assess their impact on model performance. We considered both $L_{MSE}$ and $L_{CE}$ as loss functions and approximated the neurons' activation as a fast sigmoid [162] to compute the surrogate gradients. As an alternative approach, in line with the work of [130], we considered a VGG9 SNN architecture (VGG9-C), composed of 8 convolutional layers and a fully connected layer for classification. We used 2x2 average pooling layers to reduce the size of the convolutional feature maps, as suggested in [130]. As a variant of this architecture, we substituted the last convolution with a fully connected layer (VGG9-F). Figure VI illustrates both of these strategies. Regarding the bipolar transform, we considered the 3 RGB channels. Then, we followed the approach of [130] and scaled pixel intensities to the [-1, 1] range so that each channel presents mean = 0 and standard deviation = 1. This ensures we can take maximum advantage of the images' informative content. At last, we separated the positive and negative pixel intensities into bipolar channels before Poisson encoding. For data augmentation, we considered random horizontal flips and random resized crops. We trained each model for 50 epochs, with a batch size of 128 and the Adam optimizer. Additionally, we considered a correct rate of 0.8 and an incorrect rate of 0.1 for $L_{MSE}$. Table 4 summarizes the hyperparameters and corresponding performances.
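The bipolar preprocessing can be sketched as follows: after per-channel standardization, positive and negative intensities are split into separate channels (3 to 6) before Poisson encoding; the function name and the final clamping are our assumptions.

```python
import torch
from snntorch import spikegen

# Sketch of the bipolar transform used here: standardize each RGB channel,
# split positive and negative intensities into separate channels (3 -> 6),
# and only then apply Poisson (rate) encoding.
def bipolar_poisson_encode(images, num_steps=25):
    # images: (batch, 3, H, W), already scaled/standardized per channel
    pos = images.clamp(min=0.0)              # positive part of each channel
    neg = (-images).clamp(min=0.0)           # magnitude of the negative part
    bipolar = torch.cat([pos, neg], dim=1)   # (batch, 6, H, W) bipolar channels
    bipolar = bipolar.clamp(max=1.0)         # simplification: treat values as firing probabilities
    return spikegen.rate(bipolar, num_steps=num_steps)
```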
The results underline that SNNs trained with surrogate gradients present by far the best performance compared with UL. The reported performances are inferior to the top results in the literature; nonetheless, since we mainly intended to underline the advantages of BP with surrogate gradients, we did not perform any hyperparameter optimization. Moreover, we apply the backward pass only once per simulation and did not experiment with other loss functions or surrogate gradients, meaning further accuracy improvements could be achieved. We could also have considered dropout regularization, as the technique is commonly used in SNNs trained with surrogate gradients. However, the reported results are fairly illustrative of the advantages and disadvantages of using BP with surrogate gradients to train SNNs. We experienced a significant performance increase when the RGB input was directly encoded to spike trains instead of first converting the images to grayscale. We also demonstrate that data augmentation can be used in conjunction with SNNs trained with surrogate gradients and that it leads to a significant improvement of the results.
In summary, our results highlight the performance gap between UL and SL in SNNs. We observe that whereas unsupervised local learning with STDP works quite well on simple datasets, the strategy is not currently feasible on complex data, and more work is necessary towards achieving high-performing and biologically plausible UL algorithms. On the contrary, SL with surrogate gradients offers a better efficiency/accuracy trade-off. Although, for the reasons already discussed, BP is not biologically plausible, the solution can scale to more complex data. Moreover, snnTorch is specifically designed for SL tasks, and its implementation of SNNs more closely mimics common PyTorch [158] workflows, where, typically, activation functions are replaced by biological neuron models (e.g., LIF). On the other hand, considering that SNN research is still in its infancy, there are very few large-scale practical applications [163], but Autonomous Driving (AD) is one of the fields that could benefit the most from SNNs. On the one hand, the AD problem space is usually rich in events, suitable for SNNs; on the other, it requires efficient networks for energy-constrained systems. Therefore, some authors have already explored the application of SNNs to AD. For instance, CarSNN, proposed by [164], addresses the problem of distinguishing cars from the background in an event-driven dataset (N-CARS Dataset [165]). They resort to Spatio-Temporal Back-Propagation (STBP) to train the SOEL [166] system in 3 hierarchical stages for varying sizes of input images. Significantly, the model was implemented on the Loihi Neuromorphic Chip, with a power consumption of only 350 mW, suggesting the potential of SNNs to be deployed in real time, in resource-constrained systems. Furthermore, [167] proposed a convolutional SNN model with temporal coding for object recognition in LiDAR temporal pulse data.
However, more effort is required from the research community to make SNNs a reality in AD. Specifically, we observe that most works address classification problems, whereas the AD problem space is more complex, usually involving, among other tasks, object detection, object classification, semantic segmentation, and panoptic segmentation. Notwithstanding, a few works have already focused on these problems. [168] proposed Spiking-YOLO, an ANN converted to an SNN that allows object detection with a performance comparable to traditional models. In turn, [169] introduced an SNN strategy for single object localization. Based on the DECOLLE [134] algorithm, a deep local learning scheme, the authors propose an encoder-decoder strategy with 3 layers in each part to train a convolutional SNN model on the Oxford-IIIT-Pet dataset [170]. They report a mean Intersection over Union (mIoU) of 63.2 % on the test set. However, the applicability of these models in real-world scenarios remains limited. Moreover, these works resort to convolution-based strategies, meaning there is still a need to develop fully neuromorphic SNNs that can deal with these kinds of complex problems, like AD, where many resource constraints are imposed on the models.

VII. CONCLUSION
SNNs are a kind of biologically inspired ANN. Since information is propagated in the form of spikes, SNNs are expected to be more efficient, particularly if combined with neuromorphic hardware. In fact, most state-of-the-art solutions evidence efficiency gains in SNNs when compared with traditional ANNs. More specifically, directly trained SNNs have been demonstrated to be more efficient than ANNs converted to SNNs, although both strategies seem to provide an efficient alternative. However, because most suggested SNNs have not been implemented on neuromorphic hardware, it is difficult to compare these strategies in terms of efficiency and suitability for such devices, meaning future work would require further validation of SNN algorithms on neuromorphic hardware.
UL with STDP is one of the most used learning techniques, and some models have achieved performance comparable to traditional ANNs, particularly on simple datasets. This is a biologically inspired, hence more efficient, learning method. Notwithstanding, it limits the models to shallow architectures and does not work effectively with complex datasets, meaning proper training algorithms still need to be found. A key observation from this survey is that it is still not clear how biologically plausible UL could be used to train multilayer SNN models. However, [171] suggest that meta-learning could be helpful for rediscovering biologically plausible synaptic plasticity rules. The authors were able to derive several known local plasticity rules, requiring only the definition of a loss function and the flexible parameterization of the candidate plasticity rules. They argue such a meta-learning approach could lead to novel methodologies for training ANNs. This "learning how to learn" strategy is also biologically plausible, as it occurred over millions of years of evolution and, for short-term survival reasons, remains fundamental during an individual's lifetime. We suggest this could be more extensively studied for SNNs and hypothesise it could help with the unsupervised training of multilayer networks. On the other hand, since the activation functions of SNNs are non-differentiable, proper strategies must be found to allow the supervised training of SNNs. A typical approach is to use surrogate gradients. However, those approximations limit the generalization of the proposed solutions.
Nonetheless, several developments have been made in recent years in the field of SNNs, including learning methods, encoding strategies, connections, and network structures. Many works are inspired by or mimic biological features such as neural circuits of excitation and inhibition, On-Centre and Off-Centre bipolar cell selectivity, expanding RFs, V1 selectivity, etc. However, we argue that these mechanisms are still poorly developed in SNNs and that more work is necessary towards a unifying strategy to train SNNs and to explore the full extent of the known properties of biological neurons in SNNs. Not only could this bring performance gains, but it could also permit fault-tolerant, energy-efficient, and data-efficient SNNs.

ACKNOWLEDGMENT
We acknowledge the contributions of all other collaborators of the THEIA project for their review and critical perspective on this work. We also acknowledge the contributions of Tiago Gonçalves and Leonardo Capozzi for their support during the initial stages of this research project.

... where he investigates deep learning algorithms to detect and segment nuclei from histopathology microscopic imaging. His other research interests include the fields of machine learning, computer vision, and pattern recognition, in general, as well as their applications to the medical imaging domain.

MARCELO CARVALHO was born in Paredes, Porto, Portugal, in 1999 and is currently finishing his M.Sc. degree in biomedical engineering at the Faculty of Engineering, University of Porto, Porto, Portugal. He is developing his master's thesis on neuromorphic architectures for efficient perception, exploring the potential of spiking neural networks, in collaboration with the THEIA project at the Faculty of Engineering, University of Porto. In 2021, he was a member of a biometric research group with the goal of developing deep learning methods to detect facial presentation attacks and, in the same year, he was also involved in the IEEE VIP Cup 2021 for in-bed pose estimation. His research interests fall within the machine learning and computer vision fields, with particular emphasis on deep learning for biomedical applications.
DIOGO CARNEIRO obtained the M.Sc. degree in Electronic and Telecommunications Engineering from Aveiro University, Aveiro, Portugal, in 2017, with a thesis on generalization and anticipation skills in robot ball catching using supervised learning, which explored the anticipation of ball trajectories based on human motion, as well as the generalization of robot motion based on dynamic movement primitives. Since 2018, he has been working at Bosch Car Multimedia Portugal, S.A., Braga, Portugal, as a Deep Learning and Machine Learning Researcher in several areas, such as computer vision and digital signal processing, in the context of interior vehicle sensing and autonomous driving. His research interests revolve around deep learning areas with practical effect on real-world applications.

JAIME S. CARDOSO is a Full Professor at the Faculty of Engineering of the University of Porto (FEUP). From 2012 to 2015, Cardoso served as President of the Portuguese Association for Pattern Recognition (APRP), affiliated to the IAPR. Cardoso has also been a Senior Member of IEEE since 2011. His research can be summed up in three major topics: computer vision, machine learning, and decision support systems. His image and video processing work focuses on medicine and biometrics. The work on machine learning is mostly concerned with the adaptation of learning to the challenging conditions presented by visual data, with a focus on deep learning and explainable machine learning. The particular emphasis of the work in decision support systems goes to medical applications, always anchored on the automatic analysis of visual data. Cardoso has co-authored 300+ papers, 100+ of which in international journals, which have attracted 6500+ citations, according to Google Scholar.