DTS-SNN: Spiking Neural Networks With Dynamic Time-Surfaces

Convolution helps spiking neural networks (SNNs) capture the spatio-temporal structure of neuromorphic (event) data, as evidenced by convolution-based SNNs (C-SNNs) achieving state-of-the-art classification accuracies on various datasets. However, the efficacy aside, the efficiency of C-SNNs is questionable. In this regard, we propose SNNs with novel trainable dynamic time-surfaces (DTS-SNNs) as efficient alternatives to convolution. The dynamic time-surface proposed in this work features high responsiveness to moving objects owing to a zero-sum temporal kernel that is motivated by the receptive fields of simple cells in the early visual pathway. We evaluated the performance and computational complexity of our DTS-SNNs on three real-world event-based datasets (DVS128 Gesture, Spiking Heidelberg dataset, N-Cars). The results highlight high classification accuracies and significant improvements in computational efficiency, e.g., merely 1.51% behind the state-of-the-art result on DVS128 Gesture but an $\times 18$ improvement in efficiency. The code is available online (https://github.com/dooseokjeong/DTS-SNN).

Convolution is a method that involves a large number of multiply-accumulate operations over 3D feature maps. Therefore, convolution generally results in high computational complexity and high power consumption, which is a daunting challenge, particularly for C-SNNs, because power efficiency is supposed to be one of the key advantages of SNNs over deep neural networks (DNNs).

SNNs are time-dependent hypotheses consisting of spiking units and unidirectional synapses [4]. One of the
advantages of SNNs is their operation based on asynchronous spikes, unlike the layer-wise sequential operations in DNNs, which impose forward-locking constraints [5], [6]. To leverage this advantage, SNNs need to be implemented in dedicated hardware, referred to as neuromorphic hardware [7], [8], [9], [10], [11]. Generally, a neuromorphic processor consists of multiple cores supporting asynchronous event-based operations across them. The consequent power efficiency is the key feature of neuromorphic hardware.

Time-surface (TS) analyses are effective methods to process asynchronous events (spikes) for various tasks [12], [13], [14]. A TS for a given event is a 2D map of the event timestamps prior to the event in its spatial vicinity. Therefore, the TS can capture the spatio-temporal local structure of the events responding to the object. Nevertheless, the previous TSs are not tailored to SNNs and hardly support end-to-end learning.

In this regard, we attempt to use TSs, in place of convolution, to extract the features of event data in a highly operation-efficient manner to leverage this key advantage.

A schematic of the temporal kernel is illustrated in Figure 1.

This type of temporal kernel supports the responsiveness of simple cells to moving objects over their receptive fields.

A comparison between a simple cell's responses to static and moving objects is depicted in Figure 1.

As a workaround, Sironi et al. proposed HATS, which is based on averaged TSs [13]. For a compact representation, they partitioned the input field into grid-cells. For each grid-cell, the TSs for several recent events in the grid-cell are averaged over all timesteps and over the grid-cell to acquire a single smooth TS that is representative of the grid-cell. Unlike HOTS, in HATS, the TS of each event considers several recent timestamps by convolving an event stream with an exponentially decaying temporal kernel.

HOTS uses a set of TSs as a dictionary to compare with the input TS and consequently build a feature map (matching frequency). This comparison is repeated through multiple stages. The feature histogram from the last stage is used to categorize input data based on the histogram similarity between the input and instances of each category. HATS uses grid-cell-wise averaged TSs as a dictionary. Similar to HOTS, the feature map is built based on matching frequency, but a support vector machine is used as the classifier. Notably, both HOTS and HATS use time-invariant (static) TSs as inputs to their classifiers.

With the development of training algorithms for SNNs, there have been attempts to process event data by exploiting the spatio-temporal processing ability of SNNs [1], [2], [18], [19], [20]. Yet, to achieve high classification accuracies, most of them used large C-SNNs with multiple hidden layers, which incur significant computational complexity.

This lets us revisit the initial motivation of SNNs, energy efficiency, and consequently rethink efficient methods to extract the spatio-temporal features of event data using TSs as alternatives to convolution. To this end, the prerequisites include (i) the modification of the conventional time-invariant TSs to time-dependent (dynamic) forms with a noise-robust temporal kernel and (ii) the development of a DTS builder supporting end-to-end batch learning.

A DTS-SNN consists of a DTS builder and an SNN classifier.
The builder constructs DTSs for the events at every timestep, which are subsequently fed into the SNN as inputs. To validate the feature extraction ability of the proposed DTS builder and the importance of well-defined features for SNNs, we used a simple dense SNN with a single hidden layer, which was trained using a surrogate gradient-based backpropagation algorithm [21]. This section elucidates the DTS in comparison with the previous TSs and a method to build DTSs in parallel for the samples in a single batch.

For an event stream from an event camera, the ith event $e_i$ is encoded as $e_i = (p_i, t_i, X_i)$, where $p_i$, $t_i$, and $X_i$ denote its polarity $p_i \in \{-1, 1\}$, timestamp, and location on a 2D pixel array $X_i = (x_i, y_i)$, respectively. The DTS for the ith event $T_{e_i}$ only considers the previous or simultaneous events $e_j$ ($j \le i$) of the same polarity ($p_j = p_i$).

FIGURE 2. TSs for three consecutive events of the same polarity at $t_1$, $t_2$, and $t_3$. We set $r_x$ and $r_y$ to 2 so that the spatial domain of each TS $D_{e_i}$ is a 5 × 5 grid. The function $f$ denotes a timestamp-encoding function.
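To make the data layout concrete, the following is a minimal NumPy sketch, not the authors' implementation, of a naive (per-event, non-parallel) time-surface: events are stored as a structured array, and for the ith event we collect the most recent prior or simultaneous timestamp of the same polarity at each pixel of its $(2r_x+1) \times (2r_y+1)$ neighborhood. The function name `naive_time_surface` and the array layout are illustrative assumptions.

import numpy as np

# Events as a structured array: polarity in {-1, 1}, timestamp (s), pixel coords.
# This layout is an illustrative assumption, not the paper's data format.
event_dtype = np.dtype([("p", np.int8), ("t", np.float32),
                        ("x", np.int16), ("y", np.int16)])

def naive_time_surface(events, i, rx=2, ry=2):
    """Spatial timestamp map around event i, using only prior/simultaneous
    events of the same polarity (p_j = p_i, j <= i)."""
    e = events[i]
    ts = np.full((2 * ry + 1, 2 * rx + 1), -np.inf, dtype=np.float32)
    for ev in events[: i + 1]:                      # j <= i
        if ev["p"] != e["p"]:                       # same polarity only
            continue
        dx, dy = ev["x"] - e["x"], ev["y"] - e["y"]
        if abs(dx) <= rx and abs(dy) <= ry:
            # keep the most recent timestamp per neighborhood pixel
            ts[dy + ry, dx + rx] = max(ts[dy + ry, dx + rx], ev["t"])
    return ts  # a timestamp-encoding function f is then applied to this map

This quadratic per-event loop is only for exposition; the parallel builder described below replaces it with bank updates that cost a constant amount of work per timestep.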

The event stream $\rho$ for each location $X_j$ is described by

$$\rho(t; p_j, X_j) = \sum_{t_k \in \{t_k\}} \delta(t - t_k),$$
where $\delta$ is the Dirac delta function, and $\{t_k\}$ is the set of all previous timestamps, $\{t_k\} = \{t_k \mid k \le j,\ X_k = X_j\}$. The zero-sum temporal kernel $k_{tzs}$ is a causal difference of two exponentials with time constants $\tau_1$ and $\tau_2$ whose integral is zero (hence "zero-sum"),

$$k_{tzs}(t) = b\left(\frac{1}{\tau_1} e^{-t/\tau_1} - \frac{1}{\tau_2} e^{-t/\tau_2}\right) H(t),$$

where $H$ is the Heaviside step function and $b$ is a non-negative scaling constant.

FIGURE 3. (a) Zero-sum temporal kernel $k_{tzs}$ with $\tau_1 = 40$ ms and $\tau_2 = 80$ ms compared with the single-exponential kernel $k_t$ with $\tau_0 = 40$ ms. (b) Encoding the timestamps (gray bars) using $k_{tzs}$ and $k_t$. The events were generated at a constant rate of 100 Hz.

Figure 3 shows an example of timestamps encoded using the zero-sum temporal kernel $k_{tzs}$ compared with encoding using a single-exponential kernel $k_t$ (Figure 3(a)). Figure 3(b) highlights the high responsiveness of the encoding function to events of varying rate: it outputs high responses in the initial encoding phase, whereas the following responses merely fluctuate around zero owing to the constant event rate. This clearly differentiates it from the timestamp encoding using the single-exponential kernel $k_t$, as compared in Figure 3(b).
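As an illustration of this behavior, the sketch below encodes a constant-rate event train with both kernels, mirroring the setting of Figure 3 (100 Hz events, $\tau_1 = 40$ ms, $\tau_2 = 80$ ms, $\tau_0 = 40$ ms). The zero-integral form of $k_{tzs}$ used here is our assumption from the equation above, with $b = 1$; the variable names are illustrative.

import numpy as np

dt = 1e-3                                    # 1 ms bins
t = np.arange(0.0, 0.5, dt)                  # 500 ms window
tau0, tau1, tau2 = 0.040, 0.040, 0.080       # time constants (s)

# Single-exponential kernel and (assumed) zero-sum kernel, both causal.
k_t = np.exp(-t / tau0)
k_tzs = np.exp(-t / tau1) / tau1 - np.exp(-t / tau2) / tau2
assert abs(np.sum(k_tzs) * dt) < 1e-2        # integral ~ 0: "zero-sum"

# Constant-rate event train at 100 Hz (one event every 10 ms).
rho = np.zeros_like(t)
rho[::10] = 1.0

enc_single = np.convolve(rho, k_t)[: len(t)] * dt
enc_zerosum = np.convolve(rho, k_tzs)[: len(t)] * dt

# Once the rate is constant, the zero-sum encoding fluctuates around zero,
# whereas the single-exponential encoding saturates at a high level.
late = slice(300, 500)                       # 300-500 ms
print("single-exp late mean:", enc_single[late].mean())
print("zero-sum   late mean:", enc_zerosum[late].mean())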

Similar to HATS, the input field is partitioned into grid-cells, and a single grid-cell-wise representative DTS is built for each grid-cell. However, unlike HATS, the representative DTS $T_c(t)$ for a given grid-cell $c$ and timestep $t$ is the weighted sum of the DTSs of the simultaneous events $T_{e_i}$,

$$T_c(t) = \sum_{e_i \in e_{t,c}} a_i\, T_{e_i}(t),$$

where $e_{t,c} = \{e_i \mid t_i = t,\ X_i \in c\}$. The weight of each element time-surface $T_{e_i}$ is denoted by $a_i$, which is a trainable parameter. This set of weights is shared among all grid-cells. Note that, in HATS, the representative TS is the simple average of the TSs of simultaneous events, $T_c(t) = (1/|e_{t,c}|) \sum_{e_i \in e_{t,c}} T_{e_i}(t)$.
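A minimal PyTorch sketch of this trainable weighted sum is given below. How the weights $a_i$ are indexed is not fully specified in this excerpt, so we assume one weight per pixel offset within a grid-cell, shared across all grid-cells; the class and tensor names are illustrative.

import torch
import torch.nn as nn

class GridCellDTS(nn.Module):
    """Weighted sum of element time-surfaces per grid-cell (a sketch).

    Assumed layout: `elem_ts` holds one element time-surface per pixel,
    shaped (B, P, H, W, Rx*Ry), and `spikes` is a 0/1 float map of events
    at the current timestep, shaped (B, P, H, W).
    """
    def __init__(self, hc: int, wc: int):
        super().__init__()
        # One trainable weight per pixel offset within an (hc x wc) grid-cell,
        # shared among all grid-cells (our indexing assumption).
        self.a = nn.Parameter(torch.ones(hc, wc))
        self.hc, self.wc = hc, wc

    def forward(self, elem_ts, spikes):
        B, P, H, W, R2 = elem_ts.shape
        # Tile the shared weights over the whole input field.
        a_full = self.a.repeat(H // self.hc, W // self.wc)      # (H, W)
        weighted = elem_ts * (spikes * a_full).unsqueeze(-1)    # mask + weight
        # Sum the weighted element time-surfaces within each grid-cell.
        weighted = weighted.reshape(B, P, H // self.hc, self.hc,
                                    W // self.wc, self.wc, R2)
        return weighted.sum(dim=(3, 5))   # (B, P, H/hc, W/wc, Rx*Ry)

Because the weights enter through a plain elementwise product and sum, gradients flow through the DTS builder, which is what makes the time-surfaces trainable end-to-end.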
The key to training SNNs using DTSs on a given dataset is the parallel calculation of DTSs for all samples in a batch. Additionally, the compatibility of the parallel calculations with readily available deep learning frameworks enhances efficiency. To this end, we propose pixel-wise timestamp-encoding banks $E(t)$ that are updated once every timestep. The timestamp encodings in the bank can readily be retrieved when events at particular pixels occur. The bank is subsequently unfolded to endow each pixel with an element time-surface. At a given timestep, the element time-surfaces for the simultaneous events are retrieved and summed with their weights to calculate the DTS for a given grid-cell.

We consider periodically distributed grid-cells over an $H \times W$ input field for a given polarity; each grid-cell is $h_c \times w_c$ in size so that there exist $H/h_c \times W/w_c$ grid-cells on the input field. As such, the spatial domain of each DTS $D_{e_i}$ is $R_x \times R_y$ in size, where $R_x = 2r_x + 1$ and $R_y = 2r_y + 1$.

FIGURE 4. Procedure for building DTSs for a given sample at a given timestep $t$. $R_x$ and $R_y$ define the size of a DTS such that $R_x = 2r_x + 1$ and $R_y = 2r_y + 1$.

The procedure is detailed in the following subsections. The pseudocode is shown in Appendix D.
The timestamp-encoding bank $E(t)$ holds the timestamp encodings for all pixels so that its dimension is identical to that of the input field for each polarity. Each element $E_{pxy}(t)$ is calculated by convolving the event stream of polarity $p$ at a location $(x, y)$, $\rho(t; p, x, y)$, with the zero-sum temporal kernel $k_{tzs}$,

$$E_{pxy}(t) = \left(k_{tzs} * \rho(\,\cdot\,; p, x, y)\right)(t). \quad (6)$$
For efficient computation, we transform this convolution into a recursive form that is updated once per timestep (see the sketch after this subsection).

The timestamp-encoding bank $E(t) \in \mathbb{R}^{P \times H \times W}$ is subsequently unfolded to build a preliminary time-surface map.

We use a spike function $S_\vartheta(u)$: when the potential in Eq. (10) crosses the threshold for spiking $\vartheta$, a spike is emitted, and the potential is subsequently reset to zero.

We trained the SNN classifier using the spatio-temporal surrogate-gradient backpropagation algorithm [21], implemented in Python using PyTorch's Autograd framework [22]. We trained the networks using Adam [23] without weight decay or learning-rate scheduling.

Table 1 shows the performance and efficiency of DTS-SNN on DVS128 Gesture in comparison with previous methods using CNN-based SNNs. It highlights (i) high classification accuracy (albeit 1.51% lower than the state-of-the-art result) and (ii) extremely high computational efficiency (×18 that of the state-of-the-art result). The high computational efficiency arises from the use of a small FCN instead of a CNN and the high efficiency of the DTS builder. The evolution of test accuracy with training epoch is plotted in Figure 5.

Additionally, we achieved a 3.12% accuracy improvement by using the zero-sum temporal kernel $k_{tzs}$ instead of the single-exponential kernel $k_t$, which indicates the higher temporal responsiveness of the zero-sum temporal kernel compared with the conventional single-exponential kernel. We visualize the DTSs $T_c$ of the two temporal kernels at five timesteps (200 to 240) in Figure 6. A detailed comparison between the two time-surfaces is addressed in Appendix B.

For the Spiking Heidelberg dataset (see Table 4), we considered the original 700-long 1D sample at a given timestep as a 2D sample ($H = 1$, $W = 700$) and mapped it onto a 1 × 35 grid. The learning kinetics for both cases are plotted in Figure 5. The results are summarized in Table 1 and compared with several state-of-the-art methods.

The zero-sum temporal kernel enhances the responsiveness to time-varying events. To show this, we address the distribution of timestamp-encoding values on an $H/h_c \times W/w_c \times R_x \times R_y$ DTS map $T_c(t)$ at given timesteps. We plot the distribution for the zero-sum temporal kernel $k_{tzs}$ and the single-exponential kernel $k_t$ at given timesteps in Figure 7, using a sample from DVS128 Gesture. The comparison evidently indicates a larger dispersion of encoding values, and thus a larger standard deviation, for the single-exponential kernel than for the zero-sum temporal kernel. The large encoding values for the single-exponential kernel are likely attributed to persistent events. The zero-sum temporal kernel filters out such large encoding values and consequently allows the SNN classifier to pay attention to time-varying events.
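As referenced above, the recursive form of Eq. (6) is not reproduced in this excerpt. The following is a minimal sketch of how such a recursion is commonly realized for a difference-of-exponentials kernel, under our assumed form of $k_{tzs}$: each pixel keeps two exponential traces that are decayed once per timestep and incremented by incoming events, and the bank is their difference. The class name, trace names, and per-event increments are assumptions.

import torch

class TimestampEncodingBank:
    """Recursive per-timestep update of E(t) for all pixels at once (a sketch).

    Two exponential traces per pixel emulate the assumed zero-sum kernel
    k_tzs(t) = e^(-t/tau1)/tau1 - e^(-t/tau2)/tau2 (up to the constant b).
    """
    def __init__(self, P, H, W, tau1=40.0, tau2=80.0, dt=1.0):
        self.decay1 = torch.exp(torch.tensor(-dt / tau1))
        self.decay2 = torch.exp(torch.tensor(-dt / tau2))
        self.tau1, self.tau2 = tau1, tau2
        self.trace1 = torch.zeros(P, H, W)
        self.trace2 = torch.zeros(P, H, W)

    def update(self, events):
        """`events` is a (P, H, W) binary map of the events at this timestep."""
        self.trace1 = self.decay1 * self.trace1 + events / self.tau1
        self.trace2 = self.decay2 * self.trace2 + events / self.tau2
        return self.trace1 - self.trace2      # E(t), shape (P, H, W)

Each `update` costs O(P H W) elementwise operations regardless of how many events arrive, and a leading batch dimension can be added to the trace tensors to serve a whole batch at once.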

The dimensions of each time-surface and grid-cell are important hyper-parameters, given by ($R \times R$ and $C \times C$) for 2D data and ($R$ and $C$) for 1D data. The dependency of the classification accuracy on these hyper-parameters is shown in Table 2. The larger the value of $R$, the more distant the events considered when building time-surfaces, capturing the spatio-temporal correlation of likely incoherent long-range events. The larger the value of $C$, the more distant the time-surfaces considered when building a single representative time-surface per grid-cell. Table 2 highlights the presence of optimal dimensions of time-surfaces and grid-cells, which may optimally take coherent events into account by filtering out incoherent long-range events. We chose the optimal values of $R$ and $C$ for each dataset with reference to the data in Table 2.

$$(1 - P_s(t)) \approx P_s(k).$$

The expected value $y$ at timestep $t$ can be calculated considering the nontrivial cases only.

Lemma 3: Consider the convolution of a train of Poisson spikes $\rho$ at a constant firing rate $r$ using the zero-sum temporal kernel $k_{tzs}$, $y(t) = (k_{tzs} * \rho)(t)$. The result converges to zero as time $t$ increases, i.e., $y(\infty) = 0$.
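Before the formal proof, a one-line sanity check, assuming the zero-integral kernel form given earlier (this derivation is ours, not reproduced from the paper): for a stationary Poisson process, the expected output of the convolution is the rate times the kernel integral, which vanishes,

$$\mathbb{E}[y(t)] = \mathbb{E}\!\left[\int_0^{t} k_{tzs}(s)\,\rho(t-s)\,ds\right] = r \int_0^{t} k_{tzs}(s)\,ds \;\xrightarrow{t \to \infty}\; r\,b\left(\frac{1}{\tau_1}\int_0^{\infty} e^{-s/\tau_1}\,ds - \frac{1}{\tau_2}\int_0^{\infty} e^{-s/\tau_2}\,ds\right) = r\,b\,(1 - 1) = 0.$$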
Proof: We consider a Poisson-spike train whose firing rate $r$ is given by a boxcar function with a constant nonzero firing rate $r_0$ in the range $t_0 < t < t_1$,

$$r(t) = r_0 \left[ H(t - t_0) - H(t - t_1) \right],$$

where $H$ is the Heaviside step function. Consequently, we have the result of the convolution in Eq. (15), where $b$ is a non-negative constant that determines the contribution of a single event to the timestamp encoding (Eq. (18)).

We evaluated the classification accuracy on DVS128 Gesture for three different sets of time constants $\tau_1$ and $\tau_2$; the results are shown in Table 3.

TABLE 3. Classification accuracy for three different sets of time constants $\tau_1$ and $\tau_2$. This is the best validation accuracy from a single trial for each case.

The pseudocode for constructing dynamic time-surfaces in parallel is shown in Algorithm 1.
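Algorithm 1 itself is not reproduced in this excerpt. As a stand-in, the following PyTorch sketch combines the pieces described above into one batched step: update the timestamp-encoding bank, unfold it into element time-surfaces, and form the weighted grid-cell DTSs. The tensor layouts, the divisibility of $H$ and $W$ by the cell size, and the weight indexing are our assumptions.

import torch
import torch.nn.functional as F

def build_dts_batch(bank, events, a, rx, ry, hc, wc):
    """One batched DTS-builder step at a single timestep (a sketch).

    bank   : (trace1, trace2, d1, d2, tau1, tau2); traces are (B, P, H, W)
             tensors as in the TimestampEncodingBank sketch above, here
             threaded through functionally.
    events : (B, P, H, W) 0/1 float event map at the current timestep.
    a      : (hc, wc) trainable weights shared among grid-cells.
    Returns the DTS map of shape (B, P, H//hc, W//wc, Rx*Ry) and the bank.
    """
    trace1, trace2, d1, d2, tau1, tau2 = bank
    trace1 = d1 * trace1 + events / tau1          # recursive bank update
    trace2 = d2 * trace2 + events / tau2
    E = trace1 - trace2                           # E(t), (B, P, H, W)

    B, P, H, W = E.shape
    # Unfold: give every pixel its (Rx x Ry) neighborhood of encodings.
    elem = F.unfold(E.reshape(B * P, 1, H, W),
                    kernel_size=(2 * ry + 1, 2 * rx + 1),
                    padding=(ry, rx))             # (B*P, Rx*Ry, H*W)
    elem = elem.transpose(1, 2).reshape(B, P, H, W, -1)

    # Weight and mask the simultaneous events, then sum within grid-cells.
    a_full = a.repeat(H // hc, W // wc)           # (H, W)
    weighted = elem * (events * a_full).unsqueeze(-1)
    weighted = weighted.reshape(B, P, H // hc, hc, W // wc, wc, -1)
    dts = weighted.sum(dim=(3, 5))                # (B, P, H/hc, W/wc, Rx*Ry)
    return dts, (trace1, trace2, d1, d2, tau1, tau2)

Because every step is a dense elementwise, unfold, or reduction operation, the whole builder maps onto standard framework kernels and batches trivially, which is the compatibility-with-deep-learning-frameworks argument made above.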