The Heidelberg Spiking Data Sets for the Systematic Evaluation of Spiking Neural Networks

Spiking neural networks are the basis of versatile and power-efficient information processing in the brain. Although we currently lack a detailed understanding of how these networks compute, recently developed optimization techniques allow us to instantiate increasingly complex functional spiking neural networks in-silico. These methods hold the promise to build more efficient non-von-Neumann computing hardware and will offer new vistas in the quest of unraveling brain circuit function. To accelerate the development of such methods, objective ways to compare their performance are indispensable. Presently, however, there are no widely accepted means for comparing the computational performance of spiking neural networks. To address this issue, we introduce two spike-based classification data sets, broadly applicable to benchmark both software and neuromorphic hardware implementations of spiking neural networks. To accomplish this, we developed a general audio-to-spiking conversion procedure inspired by neurophysiology. Furthermore, we applied this conversion to an existing and a novel speech data set. The latter is the free, high-fidelity, and word-level aligned Heidelberg digit data set that we created specifically for this study. By training a range of conventional and spiking classifiers, we show that leveraging spike timing information within these data sets is essential for good classification accuracy. These results serve as the first reference for future performance comparisons of spiking neural networks.


Introduction
Spiking neural networks (SNNs) are biology's solution for fast and versatile information processing.From a computational point of view, SNNs have several desirable properties: They process information in parallel, they are noise-tolerant, and highly energy efficient [Boahen, 2017].Precisely which computations are carried out in a given biological SNN depends in large part on its connectivity structure.To understand how SNNs operate, it is therefore important to learn how connectivity gives rise to function in such networks.For conventional non-spiking artifical neural networks (ANNs), the problem of finding suitable connectivity structures that solve a specific computational task is usually phrased as an optimization problem.By performing gradient descent on a sensibly chosen objective function, the desired functionality and connectivity are acquired simultaneously for a predefined architecture.To allow gradient-based optimization, ANN models are traditionally constructed using graded neuronal activation functions [Rumelhart et al., 1986], which are in contrast to the binary nature of individual spikes of biological neurons.Importantly, binary spiking largely precludes the use of gradient-based optimization techniques to construct functional SNNs [Neftci et al., 2019].However, a growing number of novel algorithms now promises to translate the success of gradient-based learning from conventional neural networks to the spiking domain [Neftci et al., 2019, Pfeiffer and Pfeil, 2018, Tavanaei et al., 2018] and thus to instantiate functional SNNs in-silico both on conventional computers and neuromorphic hardware [Schemmel et al., 2010, Friedmann et al., 2016, Furber et al., 2013, Davies et al., 2018, Moradi et al., 2018].The emergent diversity of different learning algorithms now urgently calls for principled means of comparison between different methods.Unfortunately, widely accepted benchmark datasets for SNNs that would permit such comparisons are scarce [Davies, 2019].In this article, we seek to fill this gap by introducing two new broadly applicable classification datasets for SNNs.
In the following, we provide a brief motivation for why benchmarks are important before reviewing existing tasks that have been used to assess SNN performance in the past.By analyzing the strengths and shortcomings of these tasks, we motivate our specific choices for the datasets we introduce in this article.Finally, we establish the first set of baselines by testing a range of conventional and SNN classifiers on these datasets.

Why benchmarks?
The ultimate goal of a benchmark is to provide a quantitative unbiased way of comparing different approaches and methods to the same problem.While each modeler usually works with a set of private benchmarks, which are tailored to their specific problem of study, it is equally important to have shared benchmarks, which ideally everybody agrees to use, to foster comparison and constructive competition between researchers [Davies, 2019].
The last decades of machine learning research would be hard to imagine without the ubiquitous MNIST dataset [LeCun et al., 1998], for instance.Despite being applicable to SNNs as well, such ANN benchmarks have to be transformed to spikes.This transformation step puts comparability at risk, by leaving fundamental design decisions to the modeler.Presently, the SNN network community has a shortage of established benchmarks that avoid the conversion step by directly providing spike trains to the end user.By impeding the quantitative comparison between methods, the lack of suitable benchmarks has the potential to slow down the progress of the SNN research community as a whole.
Since community benchmarks are important, then why is there little agreement on which benchmark to use?There are several possible reasons, but the most likely ones are the following: First, an existing benchmark may be unobtainable.For instance, it could be unpublished, behind a paywall, or too difficult to use.Second, a published benchmark might be tailored to a specific problem and therefore not general enough to be of interest to other researchers.Third, a benchmark may be saturated, which means that it is already solved with high precision by an existing method.Naturally, this precludes the characterization of improvements over these approaches.Finally, a benchmark could require extensive preprocessing.Because preprocessing can have a strong impact on performance, leaving too many preprocessing decisions to the user might have adverse effects on comparability.Consider for example the MNIST dataset.Individual pixels can be converted to Poisson spike trains with firing rates proportional to the pixels' gray values, or the pixel values can be converted into a spike onset-latency.Depending on the implementation details both approaches could yield vastly different results.
The question, therefore, is: What would an ideal benchmark dataset for learning in SNNs look like?While this question is difficult, if not impossible, to answer, it is probably fair to say that an ideal benchmark should be at least unsaturated, require minimal preprocessing, be sufficiently general, easy to obtain, and free to use.

Previous work
Numerous studies have measured performance in SNNs differently.In the following, we give a brief overview of how previous work has assessed computational performance of structured SNNs.We limit our scope to benchmarks that have been used to solve supervised learning tasks with SNNs.For an extensive review of studies with a primary focus on unsupervised learning and biologically plausible plasticity models such as spike-timing dependent plasticity (STDP) see Tavanaei et al. [2018].Although supervised learning is not necessarily biologically plausible, it is a useful tool to characterize the computational capabilities of spiking circuits, and to engineer functional circuits for neuromorphic applications.
Supervised learning in SNNs can coarsely be categorized into steady-state rate-coding and sequenceto-sequence mapping, although also hybrids between the two exist.In steady-state rate-coding, SNNs approximate conventional analog neural networks by using an effective firing rate code in which both input and output firing rates remain largely constant during the presentation of a single stimulus [Pfeiffer and Pfeil, 2018, Zylberberg et al., 2011, Neftci et al., 2017].Inputs to the networks are often provided by Poisson distributed spike trains for which the firing rates are chosen proportional to the current input level.Similarly, network outputs are provided as a firing rate or spike count of designated output units.Because of these input-output specifications, steady-state rate-coded networks can often be trained using network translation [Pfeiffer and Pfeil, 2018], and, importantly, they can be directly applied to and compared to standard machine learning datasets (e.g.MNIST [LeCun et al., 1998], CIFAR10 [Krizhevsky et al., 2009], or SVHN [Netzer et al., 2011]).
The capabilities of SNNs, however, go beyond such rate-coding networks.Specifically, spike timing can serve as an extra coding dimension [Bohte et al., 2002, Mostafa, 2018, Comsa et al., 2019].To make use of temporal coding schemes, several studies have looked at sequence-to-sequence mapping problems.In this setting, input and output activity varies during the processing of a single input example.Within this coding scheme, outputs can be either individual spikes [Gütig, 2014], spike trains with predefined firing times [Memmesheimer et al., 2014], or continuously varying quantities derived from the spikes.The latter are typically defined as linear combinations of low-pass filtered spike trains [Eliasmith and Anderson, 2004, Denève and Machens, 2016, Abbott et al., 2016, Nicola and Clopath, 2017, Gilra and Gerstner, 2017].
One of the simplest temporal coding benchmarks is the temporal exclusive-OR (XOR) task, which exists in different variations [Bohte et al., 2002, Abbott et al., 2016, Huh and Sejnowski, 2018].A simple SNN without hidden layers cannot solve this problem, similar to the Perceptron's inability to solve the regular XOR task.Hence, the temporal XOR is commonly used to demonstrate that a specific method supports hidden-layer learning.In the temporal XOR task, a neural network has to solve a Boolean XOR problem in which the logical off and on levels correspond to early and late spike times respectively.While the temporal XOR does require a hidden layer to be solved correctly, its intrinsic low-dimensionality and the low number of input patterns render this benchmark saturated.Therefore, its possibilities for quantitative comparison between training methods are limited.
To asses learning in a more fine-grained way, several studies have focused on SNNs' abilities to generate precisely timed output spike trains in more general scenarios [Memmesheimer et al., 2014, Ponulak and Kasiński, 2009, Pfister et al., 2006, Florian, 2012, Mohemmed et al., 2012, Gardner and Grüning, 2016].To that end, it is customary to use several randomly generated Poisson input spike trains to generate a specific target spike train.Apart from regular output spike trains (e.g.Gardner and Grüning [2016]), also random target spike trains with increasing length and Poisson statistics have been considered [Memmesheimer et al., 2014].Similarly, the Tempotron [Gütig and Sompolinsky, 2006] uses an interesting hybrid approach in which random temporally encoded spike input patterns are classified into binary categories corresponding to spiking versus quiescence of a designated output neuron.In the associated benchmark, task performance is measured as the number of binary patterns that can be classified correctly.While mapping random input spikes to output spikes allows a fine-grained comparison between methods, the aforementioned tasks lack non-random structure.
A common benchmark for studies with a focus on modeling continuous low-dimensional dynamics with recurrently connected spiking neural network (RSNN) is to reproduce the dynamics of classic nonlinear systems such as the two-dimensional vander-Pol oscillator or the chaotic three-dimensional Lorenz system [Nicola and Clopath, 2017, Gilra and Gerstner, 2017, Thalmeier et al., 2016].However, low-dimensional dynamical systems may not be a good model for the processing of high-dimensional sensory data.
Finally, some datasets for n-way classification were born out of practical engineering needs.The majority of these datasets are based on the output of neuromorphic sensors like, for instance, the dynamic vision sensor (DVS) [Lichtsteiner et al., 2008] or the silicon cochlea [Anumula et al., 2018].An early example of such a dataset is Neuromorphic MNIST [Orchard et al., 2015], which was generated by a DVS recording MNIST digits that were projected on a screen.The digits were moved at certain intervals to elicit spiking responses in the DVS.The task is to identify the corresponding digits from the elicited spikes.This benchmark has been used widely in the SNN community.However, being based on the MNIST dataset it is nearing saturation.The DAS-DIGITS dataset [Anumula et al., 2018] was created similarly.The dataset was generated by processing the original TIDIGITS spoken digit dataset with a 64 channel silicon cochlea.Unfortunately, because TIDIGITS is under a proprietary license, the license requirements for the derived dataset are not entirely clear.Moreover, because TIDIGITS contains sequences of spoken digits, the task goes beyond a straight-forward n-way classification problem and therefore is beyond the scope for many current SNN implementations.
More recently, IBM has released the DVS128 Gesture Dataset [Amir et al., 2017] under a Creative Commons license.The dataset consists of numerous DVS recordings of 11 unique hand gestures performed by different persons under varying lighting conditions.The spikes in this dataset are provided as a continuous data stream, which makes extensive cutting and preprocessing necessary.Finally, the 128×128 pixel size renders this dataset computationally expensive unless additional preprocessing steps such as downsampling are applied.
In this article, we sought to generate widely applicable SNN benchmarks with comparatively modest computational requirements.To that end, we developed a processing framework to convert audio data into spikes.Using this framework, we generated two new spike-based datasets for speech classification and keyword spotting that are not saturated by current methods.Moreover, solving these problems with high accuracy requires taking into account spike timing.Finally, to facilitate their use, extension, and future generation of new datasets, we released both datasets1 and software2 under permissive public domain licenses.

Results
To improve the quantitative comparison between SNNs, we have created two large spike-based classification datasets from audio data.We focused on audio signals of spoken words due to their low bandwidth in comparison to video data and because audio possesses a temporal dimension, which can be translated to the spiking domain naturally.To that end, we employed an artificial model of the inner ear and parts of the ascending auditory pathway (Figure 1) to convert audio data into spikes.Directly providing input spikes, allows us to sidestep the issue of user-specific preprocessing that can confound comparability.Taken together, these design decisions provide the basis for comparable benchmark results at a comparatively low computational cost.
The first dataset, the Spiking Heidelberg Digits (SHD), is based on a spoken digit dataset that was recorded in-house at the University of Heidelberg in a soundproofed room with professional recording equipment (Methods Section 4.1.1).This underlying Heidelberg Digits (HD) audio dataset contains 20 classes of spoken digits, namely the English and German digits from zero to nine, spoken by 12 speakers (Figure 2).The second dataset, the Spiking Speech Commands (SSC) was derived from Google's free Speech Commands (SC) dataset [Warden, 2018] that consists of 35 classes of spoken command words.
To separate the data into training and test sets, we have applied two different partitioning strategies for the SHD and the SSC.For the SHD, we held out two speakers exclusively for the test set.The remainder of the test set was filled with samples (5 % of the trials) from all speakers also present in the training set.This division allows to assess a trained network's ability to generalize across speakers.For the SSC, the recordings were divided randomly by a hashing function [Warden, 2018].Importantly, we adhere to the partitioning proposed by the author that also includes a predefined validation data set.
While both datasets were generated using the same overall processing pipeline (cf. Figure 1), the underlying audio data differ in important respects.The HD dataset was optimized for recording quality and precise audio alignment.In contrast, the SC dataset is intended to closely mimic real-world conditions for key-word spotting on mobile devices.Therefore, it has higher noise levels and lower temporal alignment precision.On the other hand, it features additional classes and an about 10-fold larger number of trials.
To facilitate the use of these datasets and to simplify the access to a broader community, we used an event-based representation of spikes in Hierarchical Data Format 5 (HDF5) file format.This choice was to ensure short download times and ease of access through most common programming environments.For each partition and dataset, we provide a single HDF5 file which holds spikes, digit labels, and additional meta information such as the speaker's identity, age, body height, and gender.

Spike timing contains essential information about class identity
To test both spiking datasets to ensure they are not saturated, we trained a range of classifiers on them.We were specifically interested in whether spike timing information is essential to solve the tasks with high accuracy.To test this, we first generated a reduced version of the datasets in which we removed all temporal information.To that end, we computed spike count patterns from both datasets, which, by design, do not contain temporal information about the stimuli.Using these reduced spike count datasets, we then trained different linear and nonlinear support vector machine (SVM) classifiers (Methods Section 4.4.1) and measured their classification performance on the respective test sets.We found that while a linear SVM readily overfitted the data in the case of SHD, its test performance only marginally exceeded the 50 % accuracy mark on the SHD (Figure 3a).In the case of SSC, overfitting was less pronounced, but also the overall test accuracy dropped to 20 % (Figure 3b).Thus linear classifiers provided a low degree of generalization.
To assess whether this situation was different for nonlinear classifiers, we trained SVMs with polynomial kernels up to a degree of 3.These SVMs showed less overfitting and overall comparable results on both the validation (about 70 % on SHD and 25 % on SSC) and test sets (about 50 % on SHD and 25 % on SSC).Slightly better performance of about 60 % on the SHD and 30 % on the SSC was achieved when using a SVM with a radial basis function (RBF) kernel.The performance on the SHD test set which includes speakers that are not part of the training set was noticeably lower compared to the accuracy on the validation data.Especially, for polynomial and RBF kernels the generalization across speakers was worse than for the linear kernel (Figure 3a).Moreover, we found the performance on the SSC test set to be on par with the accuracy on the validation set.This is likley an effect of the uniform speaker distribution (Figure 3b).These results illustrate that both linear and nonlinear classifiers trained on spike count patterns without temporal information were unable to surpass the 60 % mark for SHD and the 30 % mark for the SSC dataset.Therefore, spike counts are not sufficient to achieve high classification accuracy on the studied datasets.
Next, we wanted to assess whether decoding accuracy could be improved when training classifiers that have explicit access to temporal information of the spike times.To that end, we trained long short term memorys (LSTMs) on the full spiking activity.In more detail, we defined the cross entropy loss function on the activity of the last time step of the network (Figure 4a).Later, we will discuss a second case in which the loss function was defined on the maximum output values when comparing to SNNs.Despite the small size of the SHD dataset, LSTMs showed reduced overfitting and were able to solve the classification problem with an accuracy of (82.4 ± 1.7) % (Figure 3a) which is significantly higher than the best performing SVM.All quoted error estimates correspond to the standard deviation.Similarly, for the SSC dataset the LSTM test accuracy was more than twice as high (70.1 ± 0.3) % as the best-performing classifier on the spike count data (SVMs with RBF kernel (29.4 ± 0.9) %).However, the degree of overfitting we observed was slightly higher than on SHD.Thus, a classifier with access to spike timing information (Figure 3) performs better than any of the classifiers trained on spike count data alone.

Training spiking neural networks
Having established that both spiking datasets contain useful temporal information that can be used by a suitable classifier, we sought to train SNNs of leaky integrate-and-fire (LIF) neurons using backpropagation through time (BPTT) and to assess their generalization performance.One problem with training SNNs with gradient descent arises because the derivative of the neural activation function appears in the evaluation of the gradient.Because spiking is an intrinsically binary process, the resulting gradients are ill-defined.To nevertheless train networks of LIF neurons on a supervised loss function, we used a surrogate gradient approach [Neftci et al., 2019], which constitutes a continuous relaxation of the real gradient of a SNN that can be implemented as an in-place replacement when performing BPTT.Importantly, we did not change the neuron model and the associated forward-pass of the model, but used a fast sigmoid as a surrogate activation function when computing gradients (Methods Section 4.3.3).
In contrast to LSTMs, bio-inspired SNNs have fixed, finite time constants of about 10 ms.Because of this constraint, we considered two different loss functions for both LSTMs and SNNs (Figure 4a).The results for LSTMs shown above were obtained by training with a last-time-step loss, where the activation of the last time step of each example and readout unit was used to calculate the cross entropy at the output.In addition, we also considered a max-over-time loss, in which the time step with maximum activation of each readout unit was taken into account to compute the standard cross entropy loss.This loss function is motivated by the Tempotron [Gütig and Sompolinsky, 2006] in which the network signals its decision about the class membership of the applied input pattern by whether a neuron spiked or not.We evaluated the performance of LSTMs and SNNs for different loss functions.Training LSTMs with a cross entropy loss based on the activity of the last time step of every sample was associated with optimal performance in comparison to SNNs (Figure 4b).The drop in performance of feed-forward SNNs trained with last-time-step loss suggests that time constants were too low to provide all necessary information at the last time step.Underlining the prior assumption, RSNNs perform substantially better.This was presumably due to active memory implemented through reverberating activity through the recurrent connections.Overall, SNNs performed better in combination with the max-over-time loss function (Figure 4c).Interestingly, and in contrast to SNNs, LSTMs showed substantial overfitting when trained with the aforementioned loss function.Motivated Overall, SNN performed better when trained with the max-over-time loss.In contrast, LSTMs were prone to overfitting when used with this loss function.
by these results, we used the last-time-step loss in combination with the LSTMs and the max-overtime loss for SNNs throughout the remainder of this manuscript.
Surrogate gradient learning introduces a new hyperparameter β associated with the steepness of the surrogate derivative.We sought to understand the impact of this parameter on our results.Because changes in β may require a different optimal learning rate η, we performed a grid search over β and η based on the RSNN architecture trained on SHD (Section 4.3.3).We found that sensible combinations for both parameters lead to stable performance plateaus over a large range of values (Figure 5a).Only for small values of β the accuracy dropped dramatically.Compared to small β, the performance decreased only slowly for high β.Interestingly, the learning rate had hardly any effect on peak performance for the parameter values we tested.As expected, convergence speed heavily depends on both η and β.Both, the performance on the validation data and the convergence speed motivated us to use β = 100 and η = 1 × 10 −3 for all SNNs architectures presented in this article unless mentioned otherwise.
With the parameter choices discussed above, we trained various SNN architectures on SHD and the SSC.To that end, we trained feed-forward SNNs with multiple layers l and a single-layer RSNN.The RSNN robustly outperformed spiking feed-forward architectures for all tested layer counts l.Interestingly, increasing l did not significantly improve performance above 45 % on SHD (Figure 6a).In addition, all choices of l caused higher levels of overfitting for the feed-forward SNNs.Moreover, they reached slightly lower accuracy levels than the SVMs on the SHD, but performed significantly better on SSC (Figure 3).However, when testing RSNNs, we found consistently higher performance and improved generalization across speakers.In comparison to the accuracy of LSTMs, RSNNs showed only marginally higher overfitting, but generalized less well across speakers.For the larger SSC dataset, the degree of overfitting was much smaller (Figure 6b).Here, increasing the number of layers of the feed-forward SNNs lead to a monotonic increase of performance on the test set from (32.7 ± 0.6) % for a single layer to (43.2 ± 1.1) % in the case of three hidden layers.For the highest layer count of l = 3, the performance was comparable to the one achieved by the RSNN.As for SHD, the RSNN achieved the highest accuracy of (69.6 ± 1.5) % which was still less than the LSTM with (82.4 ± 1.7) %.
Next, we examined the convergence speed of all tested architectures.Initially, the SNNs converged faster than LSTMs trained on SHD (Figure 7).After about 30 epochs, LSTMs were yielding higher accuracy than SNNs.The performance on the vali-  dation set reached its peak after about 150 epochs for all SNNs and the LSTM (Figure 7b).Additional training only increased performance on the training dataset (Figure 7a), but did not impact generalization on the validation dataset (Figure 7b).The gap between testing and validation performance was highest for feed-forward architectures, whereas RSNNs showed larger differences than LSTMs.

Generalization across speakers and datasets
As the last step, we investigated the generalization across speakers in the SHD dataset.Because the test set of SHD contains two held out speakers, generalization can be accessed by evaluating the accuracy per speaker.Specifically, the digits spoken by the Speakers 4 and 5 are only present in the test set.We compared the performance on the digits of the held-out speakers to all other speakers and found a clear performance drop across all classification methods for Speakers four and five (Figure 8).For SVMs, the linear kernel lead to the smallest accuracy drop of about 20 % on the digits spoken by Speaker four and five, whereas we found a decrease of 20 % to 30 % for polynomial and RBF kernels (Figure 8a).LSTMs generalized best with a drop of only 15 % (Figure 8a).Feed- forward SNNs were most strongly affected with a drop of about 30 % to 35 %.RSNNs, however, only saw a decline of 20 % in performance (Figure 8b).This illustrates that the composition of the test set of SHD can provide meaningful information with regard to generalization across speakers.
Because the English digits are part of both, the SHD and the SSC, we were able to test the generalization across datasets by training SNNs and the LSTM on the full SHD dataset while testing on the SSC dataset and vice versa.For testing, the datasets were restricted to the common English digits zero to nine.Perhaps not surprisingly, networks generalized better, when trained on the larger SSC dataset as a reference and tested on SHD.Highest performance was reached by the recurrent architectures.The difference between reference and generalization is much larger, when trained on SHD as a reference and tested on the SSC dataset.Nevertheless, all architectures trained on the SHD and tested on SSC reached performance levels above chance.Here, RSNNs performed best with about 25 % accuracy.

Discussion
In this article, we introduced two new public domain spike-based classification datasets to facilitate the quantitative comparison of SNNs.By training a range of spiking and non-spiking classifiers, we provide the first set of baselines for future comparisons.
Both spiking datasets are based on auditory classification tasks but were derived from data that was acquired in different recording settings.We chose audio data sets as the basis for our benchmarks because audio has a temporal dimension which makes it a natural choice for spiking processing.However, movie data, audio requires fewer input channels for a faithful representation, which renders the derived spiking datasets computationally more tractable.
We did not use one of the other existing audio datasets as a basis for the spiking version for different reasons.For instance, a large body of spoken digits is provided by the TIDIGITS dataset [Leonard and Doddington, 1991].However, this dataset is only available under a commercial license and we were aiming for fully open datasets.As opposed to this, the Free Spoken Digit Dataset [Zohar et al., 2018] is available under Creative Commons BY 4.0 license.Since this dataset only contains 2k recordings with an overall lower recording and alignment quality, we deemed recording HD as a necessary contribution.Other datasets, such as Mozilla's Common Voice [Mozilla, 2019], LibriSpeech [Panayotov et al., 2015], the Spoken Wikipedia Corpora [Köhn et al., 2016], and TED-LIUM [Rousseau et al., 2012] are also publicly available.However, these datasets pose more challenging speech detection problems since they are only aligned at the sentence level.Such more challenging tasks are left for future research on functional SNNs.The only existing dataset with word-level alignment that we found which was both in the public domain and large enough for our purposes was the SC dataset.This is the reason why we chose to base one spiking benchmark on SC while simultaneously providing a separate and smaller dataset HD with higher recording quality and alignment precision.Finally, the high-fidelity recordings of HD also make it suitable for quantitative evaluation of the impact of noise on network performance, because well-characterized levels of noise can be added later.
In our model, the spike conversion step consists of a published physical inner-ear model [Sieroka et al., 2006] followed by an established hair-cell model [Meddis, 1988].The processing chain is completed by a single layer of bushy cells (BCs) to increase phase-locking and to decrease the overall number of spikes.This approach is similar to the publicly available DASDIGIT dataset [Anumula et al., 2018].DASDIGIT is composed of recordings from the TIDIGIT dataset [Leonard and Doddington, 1991] which have been played to a dynamic audio sensor with 2 × 64 frequency selective channels.In contrast to SHD and SSC, the raw audio files of the TIDIGIT dataset are only available under a commercial license.Also, the frequency resolution, measured in frequency selective bands of the basilar membrane (BM) model, is about a factor of 10 lower.As the software used for processing SHD and SSC datasets is publicly available, it is straight forward to extend the present datasets.This step is more difficult for DASDIGIT, because it requires a dynamic audio sensor.
We standardize the conversion step from raw au- dio signals to spikes by generating spikes from the HD and the SC audio datasets.In doing so, we both improve the usability settings and reduce a common source of performance variability due to differences in the preprocessing pipelines of the end-user.To establish the first set of baselines, we trained a range of non-spiking and spiking classifiers on both the SHD as well as the SSC.In comparing the performance on the full datasets with performance obtained on reduced spike count datasets, we found that the temporal information available in the spike times can be leveraged for better classification by suitable classifiers.Moreover, architectures with explicit recurrence, like LSTMs and RSNNs, were the best performing models among all architectures we tested.Most likely, the reverberating activity through recurrent connections implements the required memory, thereby bridging the gap between neural time constants and audio features.Therefore, the inclusion of additional state variables evolving on a slower time scale as in Bellec et al. [2018] will be an interesting extension to improve performance of SNNs.
As for (a), a decrease in performance for the held-out speakers was be observed.
SNNs showed that the choice of loss functions can have a marked effect on classification performance.While LSTMs perform best with a last-time-step loss, in which only the last time step is used to calculate the cross entropy loss, SNNs achieved their highest accuracy for a max-over-time loss, in which the maximum membrane potential of each readout unit is considered.A detailed analysis of suitable cost functions for training SNNs will be an interesting direction for future research.
In summary, we have introduced two versatile and open spiking datasets and performed a first set of performance measurements using SNN classifiers.This constitutes an exciting step forward to a more quantitative comparison of functional SNNs in-silico both on conventional computers and neuromorphic hardware [Schemmel et al., 2010, Friedmann et al., 2016, Furber et al., 2013, Davies et al., 2018, Moradi et al., 2018].Error bars correspond to the standard deviation of 10 repetitions.Accuracy on the SSC digits was significantly lower than on the digits in the SSC test set.(b) Same as in (a), but showing the performance of networks when trained on SSC and tested on the English digits of SHD.As opposed to (a), networks trained on the SSC digits generalize well on the English digits in the SHD test set.

Methods
We start with a description of the auditory datasets in Section 4.1 which served as the basis for our spiking datasets.The transformation associated with the ascending auditory pathway are explained in Section 4.2.Finally, the methods for supervised learning in SNNs are specified in Section 4.3 and for control networks in Section 4.4.

Audio datasets
In addition to the newly recorded HD dataset, we also describe the SC dataset, created and processed by the TensorFlow and AIY teams [Warden, 2018].

Heidelberg Digits
The Heidelberg Digits (HD) consist of approximately 10k high-quality recordings of spoken digits ranging from zero to nine in English and German language3 .In total 12 speakers have been included, six of which were female and six male.
The speaker ages range from 21 yr to 56 yr with a mean of (29 ± 9) yr.We recorded around 40 digit sequences for each language with a total digit count of 10 420 (cf. Figure 2).
Audio recordings were performed in a soundshielded room at the Heidelberg University Hospital.Data was acquired with three microphones, two AudioTechnica Pro37 in different positions and a Beyerdynamic M201 TG (Figure 1).Digitized by a Steinberg MR816 CSX audio interface, recordings were made in WAVE format with a sampling rate of 48 kHz and 24 bit.
To improve the yield of the automated processing, a manual pre-selection and cutting of the raw audio tracks were performed accompanied by conversion to Free Lossless Audio Codec (FLAC) format.The cleaned-up tracks were externally mastered by a mastering studio [Schumann].The cutting times of the mastered files were determined using a gate with speaker-dependent threshold and release time which were optimized by the blackbox-optimizer described in [Knysh and Korkolis, 2016].The loss function was designed to produce 10 single files with the lowest possible threshold and shortest gate opening to prevent unnecessary computation during successive analysis and modeling.Additionally, speaker-specific ramp-in and ramp-out times where determined by visual inspection to shorten single audio file duration.The final digit files differ in duration due to speaker differences (Figure 2).30 ms Hanning windows were applied to the start and end of the peak normalized audio signals as further processing stages involve the computation of fast Fourier transformations (FFTs).
We partitioned the digits into training and testing datasets by assigning the digits of two speakers exclusively to the testing dataset to create space for well-founded statements on generalization.In more detail, all digits spoken by the speakers 4 and 5 were added to the testing dataset.Moreover, 5 % of the recordings of each digit and language of all other speakers were appended to the testing dataset.

Speech Commands
The Speech Commands (SC) dataset is composed of 1 s WAVE-files containing a single English word each [Warden, 2018] 4 .The words are spoken by 1864 speakers and published under the Creative Commons BY 4.0 license.In this study, we considered version 0.02 with 105 829 audio files, in which a total of 24 single word commands (Yes, No, Up, Down, Left, Right, On, Off, Stop, Go, Backward, Forward, Follow, Learn, Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine) were repeated about five times per speaker, whereas ten auxiliary words (Bed, Bird, Cat, Dog, Happy, House, Marvin, Sheila, Tree, and Wow ) were only repeated about once.Partitioning into training, testing and validation dataset is done by using a hashing function [Warden, 2018].All files were recorded in an uncontrolled environment but without background conversations.The files are available as 16 bit littleendian PCM encoded WAVE-format with 16 kHz sample rate.By extracting the loudest part of the time-series, they were trimmed to a duration of 1 s and temporally centered.Also, files were screened for silence by considering the file size and incorrect words by manual review.

Spike conversion
The audio files described in the previous sections served as the basis for our spiking datasets.The acoustic signals of the HD and SC dataset are transformed into neural activity by an inner ear model (Figure 1).To mimic its basic effects, a hydrodynamic basilar membrane (BM) model and a transmitter pool based hair cell (HC) model were applied in succession.Phase-locking is increased by a layer of bushy cells (BCs) integrating the signals from HCs at the same position of the BM.The resulting spikes are saved in event-based form and stored together with the corresponding digit and speaker ID as well as speaker meta information in a HDF5 file.and Fernández-Alonso, 2003] of samples.Each listing of these VLArrays holds a Numpy-Array [Oliphant, 2006] containing the spike times or the spike emitting unit, respectively.The item labels consists of an array of digit IDs for each sample in spikes.The extra entry comprises additional information about the speaker: First, speaker holds an The BM (blue) separates the scala tympani (lower chamber) from the scala vestibuli (upper chamber).At the helicotrema (green), the two scalae are connected.The scala tympani ends in the round window (yellow).A sound wave v sig is penetrating the eardrum, appling pressure at the oval window (red) by moving the ossicles, leading to a compression and slower traveling wave.We have neglected the scala media [Sieroka et al., 2006] and consider a stretched form.
array of speaker IDs for each sample, second, keys contains an array of strings where each element i describes the transformation between the digit ID i and the spoken words.Last, meta_info involves the arrays gender, age, and body_height with the respective information in which entry i corresponds to speaker i.The speaker meta information is only available for SHD and is therefore omitted for SSC.
We make the spiketrains of SHD, as well as SSC available to the public5 .In addition, we provide supplementary information on the general usage, including code snippets.

Basilar membrane model
As a complete consideration of hydrodynamic BM models is beyond the scope of this manuscript, we follow the steps of Sieroka et al. [2006] and in the following highlight the key steps of their derivation.A fundamental aspect of a cochlea model is the interaction between a fluid and a membrane causing spatial frequency dispersion [Sieroka et al., 2006, de Boer, 1980, 1984].Key mechanical features of the cochlea are covered by the simplified geometry of the BM in Figure 10.Here, we assume the fluid to be inviscid and incompressible.Furthermore, we assume the oscillations to be small that the fluid can be described as linear.The BM is expressed in terms of its mechanical impedance ξ(x, ω) which depends on the position in the x-direction and the angular frequency ω = 2πν: with a transversal stiffness S(x) = C 0 e −αx − a, a resistance R(x) = R 0 e −αx/2 and an effective mass m [ de Boer, 1980].The damping of the BM is described by γ = R 0 / √ C 0 m.Variations of the stiffness over several orders of magnitude allow to encompass the entire range of audible frequencies.The corresponding parameters are detailed in Table 2.
Let p(x, ω) be the difference between the pressure in the upper and lower chamber.The following expression fulfills the boundary conditions v y = 0 for y = h, and v z = 0 at z = ±b [de Boer, 1980]: (2) The Laplace equation yields expressions for m 0 = k and m 1 = k 2 + π 2 /b 2 .Only the principal mode of excitation in the z-direction is considered by setting n = 1.With the assumptions made above, the Euler equation reads for the y-component of the velocity in the middle of the BM: where we have dropped the y and z argument for readability.In the following, we consider the limiting case of long waves with kh 1.By combining Equations ( 2) and (3), one gets: where the replacement p(k) → p(x), k → i∂ x and k 2 → ∂ 2 x has been applied.Here, p(k) denotes the Fourier transform of p(x, 0, 0).The solution of this equation is well approximated by: where H (2) 0 is the second Hankel function and g(x, ω) and G(x, ω) are given by: An analytical expression for G(x, ω) can be found in Sieroka et al. [2006].The model is applied to a given stimulus by: where v sig (ω) is the Fourier transformation of the stimulus.The input impedance of the cochlea is modeled by: with the Bessel functions of first J β and second Y β kind of order β and ζ = 2ω/α 2/(hC 0 ).For further analysis, we evaluate v y (x, t) in the range (0, 3.5 cm] in C steps of equal size (Figure 1c).Further, each recording is normalized to 65 dB root mean square (RMS) before application to the BM model.The model parameters are listed in Table 2.

Hair cell model
The transformation of the movement of the BM to spikes is realized by the HC.The velocity of the BM, v y (x, t), is normalized to a RMS displacement of |v y (x, t)| = 1 in the resonance region in response to a 500 Hz sine stimulus at 30 dB to combine the HC model described in [Meddis, 1988[Meddis, , 1986] ] with the BM model depicted in the previous paragraph.
The following description illustrates the key steps of [Meddis, 1986], to which we refer for further details.
In the HC model, one assumes that the cell contains a specific amount of free transmitter molecules q(x, t) which could be released by use of a permeable membrane to the synaptic cleft (Figure 11).The permeability is a function of the velocity of the BM, v y (x, t): (10) The amount c(x, t) of transmitter in the cleft is subject to chemical destruction or loss through diffusion l • c(x, t) as well as re-uptake into the cell r • c(x, t), which is modeled by the following differential equation: A fraction n • w(x, t) of the reuptaken transmitter w(x, t) is continuously transferred to the free transmitter pool: The transmitter originates in a manufacturing base that replenishes the free transmitter pool at a rate y[1 − q(x, t)]:  Parameters taken from Meddis et al. [1990].As in the original publication, the model is formulated using dimensionless transmitter concentrations.

Bushy cell model
Finally, the phase-locking of HC outputs is increased by feeding their spike output to a population of BCs.
In contrast to Rothman et al. [1993], we implement the BCs as LIF neurons, described in Section 4.3.2.
In more detail, we consider a single layer (l = 1) of BCs without recurrent connections (V (l) ij = 0 ∀i, j).A single BC integrates the spiketrains of the 40 HCs for each channel of the BM to increase phase-locking (Figure 1).The parameters used for BCs are listed in Table 4.

Spiking network models
To establish a performance reference on the two spiking datasets we trained networks of LIF neurons with surrogate gradients and BPTT using a supervised loss function.We start with a description of the network architectures (Section 4.3.1),followed by the applied neuron and synapse model (Section 4.3.2).We close with a description of the supervised learning algorithm (Section 4.3.3)and the loss functions (Section 4.3.4).

Network
The spiketrains emitted by the C BCs are used to stimulate the actual classification network.In this manuscript, we apply feed-forward networks with l ∈ {1, 2, 3} and a recurrent network with l = 1.Each layer comprises N = 128 LIF neurons.For all network architectures, the last layer is followed by a linear readout consisting of leaky integrators which do not spike.

Neuron and synapse models
We consider LIF neurons where the membrane potential u (l) i of the i-th neuron in layer l obeys the differential equation: with the membrane time constant τ mem , the input resistance R, the leak potential u leak , and the input current I (l) i (t).Spikes are described by their firing time.The k-th firing time of neuron i in layer l is denoted by k t (l) i and defined by a threshold criterion: Immediately after k t (l) i , the membrane potential is clamped to the reset potential u where the sum runs over all presynaptic partners j and W (l) ij are the corresponding afferent weight from the layer below.The V (l) ij correspond to recurrent connections within each layer.
For neurons subject to supervised learning, the case τ ref = 0 is considered for simplicity.Here, the reset can be incorporated in Equation ( 14) through an extra term:   function Θ.We set u leak = 0, R = 1 and u thres = 1 to obtain for small time steps δt and discrete time: for the synaptic current and: for the membrane potential with κ = exp (−δt/τ syn ) and λ = exp (−δt/τ mem ).In all our experiments Equations ( 18) and ( 19) are evaluated with a time step of δt = 0.5 ms.

Supervised learning
The task of learning is to minimize a cost function L over the entire dataset.To achieve this, gradient descent is applied which modifies the network parameters W ij : with the learning rate η which is set to η = 0.001.
In more detail, we use custom PyTorch [Paszke et al., 2017] code implementing the SNNs with surrogate gradients [Neftci et al., 2019] and applying the BPTT algorithm to compute the gradients.We chose a fast sigmoid to calculate the surrogate gradient: with the steepness parameter β, which is set to β = 100.

Loss functions
Two different loss functions L are considered for the readout layer l = L.These functions differ in the

Figure 1 :
Figure 1: Processing pipeline for the Heidelberg Digits (HD) and the Speech Commands (SC) dataset.(a) The HD are recorded in a sound-shielded room.(b) Afterwards, the resulting audio files are cut and mastered.(c) The HD as well as the SC dataset are fed through a hydrodynamic basilar membrane model.(d) Basilar membrane decompositions are converted to phase-coded spikes by use of a transmitter-pool based hair cell model.(e) The phase-locking is increased by combining multiple spiketrains of hair cells at the same position in a single bushy cell.

Figure 2 :
Figure 2: The Heidelberg Digits (HD) have a balanced class count and variable temporal duration.The HD consist of 10420 recordings of spoken digits ranging from zero to nine in English and German language.(a) Histogram of per-speaker digit counts.The different colors correspond to the German (blue) and English (red) digit counts.Variable numbers of digits are available for each speaker and each language (b) Histogram of per-class digit counts.The dataset is balanced in terms of digits within each language.(c) Histogram of audio recording durations.The HD audio recordings were cut for minimal duration to keep computation time at bay.

Figure 3 :
Figure 3: Temporal information is essential to classify the SHD and SSC datasets with high accuracy.(a) Bar graph of classification accuracy for different SVMs trained on spike count vectors and a LSTM classifier trained on the full SHD dataset.The different colors correspond to training (dark gray), validation (red), and test accuracy (blue).Error bars indicate the standard deviation over 10 repetitions.Classification accuracy on SHD is highest for LSTMs which also show the least degree of overfitting.(b) Same as in (a), but showing performance on SSC.LSTMs with access to temporal information outperform the SVM classifiers by a large margin.

Figure 4 :
Figure 4: Setup and multiple choices of loss functions for SNNs.(a) Schematic of single layer recurrent network with two readout units.For SNNs, a max-over-time loss is applied, where the maximal decomposition of each readout unit over time is used to calculate the cross entropy (marked by colored arrows).In contrast, for LSTMs a last-time-step loss is used where only the last time step of the activation is considered in the calculation of the cross entropy (marked by grey arrow).(b) Bar graph of classification accuracy for different SNNs and a LSTM on the SHD.The different colors indicate training (dark gray), validation (red), and testing accuracy (blue).Error bars correspond to the standard deviation of 10 repetitions.Only the LSTM and RSNN show good generalization when trained with a last-time-step loss.(c) Same as in (b), but showing performance for a max-over-time loss.Overall, SNN performed better when trained with the max-over-time loss.In contrast, LSTMs were prone to overfitting when used with this loss function.

Figure 5 :
Figure 5: Accuracy, but not convergence speed is only mildly affected by the steepness β of the surrogate derivative.(a) Accuracy as a function of β on the validation set of SHD.The different colors indicate different learning rates η.Error bars correspond to the standard deviation of 10 repetitions.Performance is highest for a wide range of β values (β ≥ 100) and depends only slightly on η.(b) Number of epochs needed to reach an accuracy > 0.8, n 0.8 .In contrast to the performance, n 0.8 strongly depends on both β and η.

Figure 6 :
Figure 6: Recurrent SNNs outperform feedforward architectures on both datasets.(a) Bar graph of classification accuracy for different SNN architectures on the SHD.The different colors indicate training (dark gray), validation (red), and testing accuracy (blue).Error bars correspond to the standard deviation of 10 repetitions.The accuracy reached by RSNN is comparable to the performance of LSTMs.Increasing the number of layers in feed-forward architectures hardly affected performance.(b) Same as in (a), but showing performance on the SSC.The performance of SNNs was lower than the one reached by LSTMs.In contrast to (a), an increasing number of layers lead to a monotonic increase of accuracy.

Figure 7 :
Figure 7: The convergence speed and performance levels differs for LSTMs and SNNs.(a) Learning curves of classification accuracy on training set of SHD.The different colors indicate LSTM (darkgray), 1-(red), 2-(blue), and 3-layer (green) SNN, and RSNN (yellow).Errors correspond to the standard deviation of 10 repetitions.All learning curves were obtained from a grid search over learning rate and surrogate gradient stepness values shown in Figure 5 (Methods Section 4).Both LSTMs and RSNNs fit the data with high accuracy.(b) Same as in (a), but showing the performance on the validation data set of SHDs.These results illustrate that RNNs generalize best on held-out data.

Figure 8 :
Figure 8: By reserving two speakers for the test set, SHD allows to assess speaker generalisation performance.(a) Per-speaker classification accuracy on the test set of SHD.The different colors indicate linear (dark gray), poly-2 (red), poly-3 (blue), and RBF (green) SVM, and LSTM (yellow).Errors correspond to the standard deviation of 10 repetitions.A clear decrease in performance was observable for samples spoken by the held-out speakers 4 and 5 (highlighted).(b) Same as in (a), but showing the per-speaker accuracy of SNNs.Here, different colors correspond to 1-(dark gray), 2-(red), and 3-hidden layers (blue), and RSNNs (green).As for (a), a decrease in performance for the held-out speakers was be observed.

Figure 9 :
Figure 9: Trained networks generalize across datasets.(a) Bar graph of the performance of SNNs and LSTM trained on SHD.The different colors indicate the performance on the SHD test set (dark gray), and the performance on the English digits of the SSC test set (red).Chance level is highlighted by a dashed line.Error bars correspond to the standard deviation of 10 repetitions.Accuracy on the SSC digits was significantly lower than on the digits in the SSC test set.(b) Same as in (a), but showing the performance of networks when trained on SSC and tested on the English digits of SHD.As opposed to (a), networks trained on the SSC digits generalize well on the English digits in the SHD test set.

Figure 10 :
Figure10: Schematic view of the basilar membrane (BM) mechanics.The BM (blue) separates the scala tympani (lower chamber) from the scala vestibuli (upper chamber).At the helicotrema (green), the two scalae are connected.The scala tympani ends in the round window (yellow).A sound wave v sig is penetrating the eardrum, appling pressure at the oval window (red) by moving the ossicles, leading to a compression and slower traveling wave.We have neglected the scala media[Sieroka et al., 2006] and consider a stretched form.

FactoryFigure 11 :
Figure 11: Schematic illustration of the transmitter flow within the hair cell (HC) model proposed by Meddis [1988].Figure adapted from Meddis [1986].The model comprises four transmitter pools which allow to describe the transmitter concentration in the synaptic cleft.
) = u reset for t ∈ k t (l) i , k t (l) i + τ ref , with the refractory period τ ref .The synaptic input current onto the i-th neuron in layer l is generated by the arrival of presynaptic spikes from neuron j, S ) = k δ(t − k t (l) j ).A common first-order approximation to model the time course of synaptic currents are exponentially decaying currents which are linearly summed[Gerstner and Kistler, 2002]: du u rest − u thres ) .(17) To formulate the above equations in discrete time for time step n and stepsize δt, the output spiketrain S (l) i [n] of neuron i in layer l at time step n is expressed as a nonlinear function of the membrane volatge S (l) i [n] = Θ(u (l) i − u thres ) with the Heavyside

Table 1 :
Comparison of the performance on the test sets of SHD and SSC of all considered networks, training algorithms and datasets.Errors are given by the standard deviation over 10 repetitions.The column reference holds accuracy values for the regular usage of the datasets, i.e. train and test on the same dataset.The second column (SHD/SSC) holds the generalisation performance for the other dataset.

Table 3 :
List of hair cell (HC) model parameters.

Table 4 :
List of leaky integrate-and-fire (LIF) neuron parameters used for BCs and SNNs.