Memory-Dependent Computation and Learning in Spiking Neural Networks Through Hebbian Plasticity


Abstract:

Spiking neural networks (SNNs) are the basis for many energy-efficient neuromorphic hardware systems. While there has been substantial progress in SNN research, artificial SNNs still lack many capabilities of their biological counterparts. In biological neural systems, memory is a key component that enables the retention of information over a huge range of temporal scales, ranging from hundreds of milliseconds up to years. While Hebbian plasticity is believed to play a pivotal role in biological memory, it has so far been analyzed mostly in the context of pattern completion and unsupervised learning in artificial and SNNs. Here, we propose that Hebbian plasticity is fundamental for computations in biological and artificial spiking neural systems. We introduce a novel memory-augmented SNN architecture that is enriched by Hebbian synaptic plasticity. We show that Hebbian enrichment renders SNNs surprisingly versatile in terms of their computational as well as learning capabilities. It improves their abilities for out-of-distribution generalization, one-shot learning, cross-modal generative association, language processing, and reward-based learning. This suggests that powerful cognitive neuromorphic systems can be built based on this principle.
Published in: IEEE Transactions on Neural Networks and Learning Systems (Volume: 36, Issue: 2, February 2025)
Page(s): 2551 - 2562
Date of Publication: 19 December 2023

PubMed ID: 38113154

SECTION I.

Introduction

Spiking neural networks (SNNs) are a well-established model of neural computation [1]. In contrast to conventional artificial neural networks (ANNs), neurons in an SNN communicate via stereotypical pulses—so-called spikes—and temporally integrate incoming information in their membrane potential. Since these are key features of biological neurons, SNNs are heavily used to model information processing in the brain [2], [3], [4]. Furthermore, SNNs are well-suited for implementation in neuromorphic hardware [5], leading to highly energy-efficient AI applications [6], [7], [8], [9].

For a long time, SNNs have been inferior to ANNs in terms of performance on standard pattern recognition tasks. However, a number of recent advances in SNN research have changed the picture, showing that SNNs can achieve performance similar to that of ANNs [10]. In particular, the use of surrogate gradients for SNN training [11], [12], [13] and the use of longer adaptation time constants in recurrent SNNs have been instrumental in this respect [14]. Nevertheless, SNNs still lack many capabilities of their biological counterparts—for some of which biologically implausible ANN solutions have been proposed.

Since computations in SNNs have—in contrast to computations in feed-forward ANNs—a strong temporal component, they have been proposed to be particularly suited for temporal computing tasks [14]. Here, it has turned out that the ability to retain information on several time scales is crucial. The most basic time constant in spiking neurons is the membrane time constant on the order of tens of milliseconds. In principle, arbitrary time constants can be realized through recurrent connections in SNNs. However, such recurrent retention of information is rather brittle and hard to learn. Instead, there have been several suggestions to utilize longer time constants available in biological neuronal circuits, such as short-term plasticity [15], [16] on the order of hundreds of milliseconds and adaptation time constants of neurons [3], [14] on the order of seconds. The inclusion of such time constants has been shown to extend the computational capabilities of SNNs. However, typical cognitive tasks are frequently situated on a much slower time scale of minutes or longer. For example, when we watch a movie, we have to rapidly memorize facts to follow the story and draw conclusions as the narrative evolves. For such tasks, time constants on the order of seconds are insufficient.

Here, we consider Hebbian synaptic plasticity [17] as a mechanism to extend the range of time constants and therefore the computational capabilities of SNNs. Hebbian synaptic plasticity is abundant in both the neocortex and the hippocampus [18], [19], [20]. While many of its forms, in particular in sensory cortical areas, are believed to shape processing on a very slow developmental time scale, there is also evidence for rapid plasticity that can in principle be utilized for online processing on the behavioral time scale, most prominently in the hippocampus [21], [22].

We present a novel memory-augmented SNN model that is equipped with a hetero-associative memory subject to Hebbian plasticity. Previous work has so far explored the concept of rapidly changing weights for memory only in the context of conventional ANNs [23], [24], [25], [26]. It has been shown that the integration of some type of memory into neural networks can strongly enrich their computational capabilities, which was previously achieved with rather unbiological types of memory components [27], [28], [29], [30], [31], [32], [33], [34] or with heavy weight sharing [35]. In contrast, our model is based on a novel associative memory component implemented by biological Hebbian plasticity for SNNs.

We experimentally show that our novel SNN model enriched by Hebbian plasticity outperforms state-of-the-art long short-term memory (LSTM) networks [36], [37] and long short-term memory spiking neural networks (LSNNs) [3], [14] in a sequential pattern-memorization task, and that it exhibits superior out-of-distribution generalization compared to these models. The exceptional performance of contemporary deep-learning methods relies strictly on the availability of large numbers of training examples, whereas humans are capable of learning new tasks from a single exposure (one-shot learning) [38]. We show that our memory-equipped SNN model provides a novel SNN-based solution to this problem and demonstrate that it can be successfully applied to one-shot learning and classification of handwritten characters, improving over previous SNN models [39], [40]. We also demonstrate the capability of our model to learn associations for audio-to-image synthesis from spoken and handwritten digits. Our SNN model enriched by Hebbian plasticity further presents a novel solution to a variety of cognitive question-answering tasks from a standard benchmark [27], achieving performance comparable to both memory-augmented ANN [41] and SNN-based [9] state-of-the-art solutions to this problem with a simpler architecture. In a final application scenario, we demonstrate that our model can learn from rewards in an episodic reinforcement learning task and attain a near-optimal strategy in a memory-based card game.

In contrast to H-Mem, a previously proposed nonspiking model that utilizes Hebbian plasticity [41], our SNN model is well-suited for implementation in energy-efficient neuromorphic hardware. To demonstrate the potential efficiency advantages of such an implementation, we analyze its energy efficiency and compare it to that of H-Mem.

In summary, our results show that Hebbian plasticity enhances the computational capabilities of SNNs in several directions including out-of-distribution generalization, one-shot learning, cross-modal generative association, answering questions about stories of various types, and memory-dependent reinforcement learning. This suggests that Hebbian plasticity is a central component of information processing in the brain and artificial spike-based computing systems. Since local Hebbian plasticity can easily be implemented in neuromorphic hardware, this also suggests that powerful cognitive neuromorphic systems can be built based on this principle.

SECTION II.

Method

We consider networks of standard leaky integrate-and-fire (LIF) neurons modeled in discrete time steps $\Delta t$. The membrane potential $V_{j}$ of neuron $j$ at time $t$ is given by
\begin{equation*} V_{j}(t + \Delta t) = \alpha V_{j}(t) + (1 - \alpha) I_{j}(t) - \vartheta z_{j}(t) \tag{1}\end{equation*}
where $\alpha$ defines the membrane potential decay per time step. The total synaptic input current $I_{j}(t) = \sum_{i} W_{ji} z_{i}(t)$ is given by the sum over all presynaptic neurons' output spike trains weighted by the corresponding synaptic weights $W_{ji}$. When the neuron's membrane potential $V_{j}(t)$ exceeds the threshold $\vartheta$, the neuron spikes ($z_{j}(t)=1$) and the membrane potential is reset [last term in (1)]. If the neuron does not spike, we define $z_{j}(t)=0$ (for parameter values, see Table I).
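For illustration, (1) can be simulated in a few lines of NumPy; the values of $\alpha$ and $\vartheta$ below are placeholders rather than the parameters of Table I, and the function name is ours.

```python
import numpy as np

def lif_step(V, z, I, alpha=0.9, theta=1.0):
    """One discrete-time LIF update following (1).

    V: membrane potentials V_j(t); z: spikes z_j(t) in {0, 1};
    I: total synaptic input currents I_j(t). All of shape (n,).
    """
    V_next = alpha * V + (1.0 - alpha) * I - theta * z  # leak, input, and reset
    z_next = (V_next > theta).astype(float)             # spike if above threshold
    return V_next, z_next

# Drive five neurons with a constant input current for 100 steps.
V, z = np.zeros(5), np.zeros(5)
for _ in range(100):
    V, z = lif_step(V, z, I=np.full(5, 1.2))
```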

TABLE I Neuron and Plasticity Parameters

The considered network model is shown in Fig. 1. At the core of the network is a heteroassociative memory [42], that is, a single-layer feed-forward SNN. Here, spiking neurons $\boldsymbol{z}^{\mathrm{key}}$ in the key layer project to neurons $\boldsymbol{z}^{\mathrm{value}}$ in the value layer with synaptic weights $W_{ji}^{\mathrm{assoc}}$ that are subject to rapid Hebbian plasticity. We used $l=100$ neurons in both the key and value layers in all simulations. The use of this simple heteroassociative memory architecture is motivated by the hippocampal circuitry, where rapid plasticity has been found in the connections from region CA3 to CA1, resembling a similar single-layer architecture with the key layer corresponding to CA3 and the value layer corresponding to CA1 [21], [22]. From a machine-learning perspective, this architecture can be motivated by memory-augmented neural networks, a class of ANN models that have been shown to outperform standard ANNs in memory-dependent computational tasks. This class includes networks with key-value memory systems such as memory networks [27], [28], Hebbian memory networks [41], and transformers [43]. The latter are particularly powerful for language processing, giving rise to language models such as GPT-3 [44].

Fig. 1.

Schematic of the SNN model. Inputs $\boldsymbol{x}$ are encoded by an input encoder (IE). Memory induction: the encoded input $\boldsymbol{z}^{\mathrm{s,enc}}$ activates some neurons in the key and value layers through weight matrices $W^{\mathrm{s,key}}$ and $W^{\mathrm{s,value}}$, respectively. Since the key neurons $\boldsymbol{z}^{\mathrm{key}}$ and the value neurons $\boldsymbol{z}^{\mathrm{value}}$ are pre- and postsynaptic to the synapses in $W^{\mathrm{assoc}}$, their activity induces weight changes there. Memory recall: the encoded input $\boldsymbol{z}^{\mathrm{r,enc}}$ activates some neurons in the key layer through the weight matrix $W^{\mathrm{r,key}}$. This activity activates some neurons in the value layer through the synapses $W^{\mathrm{assoc}}$, thus potentially recalling information that has been stored previously. Finally, the value neurons $\boldsymbol{z}^{\mathrm{value}}$ project to a layer of output neurons $\boldsymbol{o}$. To allow for a memory recall based on previously recalled information, activity in the value layer is fed back to the key layer with some delay $d_{\mathrm{feedback}}$.

In our model, neurons in the key layer (key neurons) receive input from two spiking neuron populations of size $d$ each, $\boldsymbol{z}^{\mathrm{s,enc}}$ and $\boldsymbol{z}^{\mathrm{r,enc}}$, responsible for storing (s) information to and recalling (r) information from the memory, respectively (see below). Similarly, neurons in the value layer (value neurons) receive input from two sites, namely from $\boldsymbol{z}^{\mathrm{s,enc}}$ and from the key-layer neurons $\boldsymbol{z}^{\mathrm{key}}$. Neurons in the value layer project to an output circuit that generates the final output of the model.

Our model takes a sequence of input tokens $\langle x_{1}, \ldots, x_{M} \rangle$. For example, the sequence could be $x_{1}$ = “Mary went to the bathroom,” $x_{2}$ = “John moved to the hallway,” $x_{3}$ = “Mary traveled to the office,” and $x_{4}$ = “Where is Mary?” (a simple example sequence from the bAbI tasks considered in Results; see Table S2 in the Supplementary). Each input token $x_{m}$ can either be a fact or a query. In the bAbI example, $x_{1}$ to $x_{3}$ are facts and $x_{4}$ is a query. For facts, the network performs a store operation to memorize important information about this input for later use. For example, with input $x_{1}$ it could associate “Mary” with “bathroom.” For a query, the network should respond with an output $\hat{\boldsymbol{o}}$, which should be “office” in our example. Hence, it performs a recall operation to retrieve information from its memory needed to determine the correct output. If the current input $x_{m}$ is a fact, the neurons $\boldsymbol{z}^{\mathrm{s,enc}}$ are activated while the neurons $\boldsymbol{z}^{\mathrm{r,enc}}$ are silent (no spikes produced). This initiates a store operation. On the other hand, if the input is a query, the neurons $\boldsymbol{z}^{\mathrm{r,enc}}$ are activated while the neurons $\boldsymbol{z}^{\mathrm{s,enc}}$ remain silent, initiating a recall operation. The outputs of these neurons are determined by an input encoder (IE in Fig. 1) that converts the input $x_{m}$ into spike trains. Each input is presented for a duration of $\tau_{\mathrm{sim}} = 100\,\mathrm{ms}$. The input encoder is task-specific; see the specific tasks below. In the simplest case, we used a single dense layer $\mathcal{E}$ consisting of $d$ LIF neurons. Note that while we used the same input encoder for both queries and facts for simplicity, different encoders could be utilized as well.

Consider a fact input $x_{m}$ from which some aspects should be stored in memory. As it is a fact input, it elicits a spike response in $\boldsymbol{z}^{\mathrm{s,enc}}$. The synaptic connections to the key layer activate some key neurons, while the synaptic connections to the value layer activate some value neurons. More specifically, the input current to a neuron $j$ in the key layer is given by
\begin{equation*} I^{\mathrm{key}}_{j}(t) = \sum_{i} W^{\mathrm{s,key}}_{ji} z_{i}^{\mathrm{s,enc}}(t) \tag{2}\end{equation*}
where $W^{\mathrm{s,key}}$ is a synaptic weight matrix of size $l \times d$. We denote the spike train of a neuron $j$ in the key layer as $z_{j}^{\mathrm{key}}$. Similarly, the input current to a neuron $k$ in the value layer is given by
\begin{equation*} I^{\mathrm{value}}_{k}(t) = \sum_{i} W^{\mathrm{s,value}}_{ki} z_{i}^{\mathrm{s,enc}}(t) + c \sum_{j} W^{\mathrm{assoc}}_{kj}(t) z_{j}^{\mathrm{key}}(t) \tag{3}\end{equation*}
where $W^{\mathrm{s,value}}$ is a synaptic weight matrix of size $l \times d$, $c = 0.2$ is a constant, and $W^{\mathrm{assoc}}$ are association synapses represented as a square matrix of size $l \times l$. The first term in (3) represents the contribution of the spikes from the input encoder to this current, and the second term represents the contribution of the spikes that travel from the key layer via $W^{\mathrm{assoc}}$ to a neuron in the value layer. The constant $c$ dampens the impact of the associative connections on the value neurons during the store operation. This is necessary to ensure that the currently stored information is not conflated with previously stored associations. In biological circuits, such dampening could be implemented via inhibition of particular parts of the dendritic tree of the neuron (e.g., of the apical dendrites). We denote the spike train of a neuron $k$ in this layer as $z_{k}^{\mathrm{value}}$. The synapses $W^{\mathrm{assoc}}$ are subject to Hebbian plasticity. As neurons in the key and value layers are pre- and postsynaptic to the synapses in $W^{\mathrm{assoc}}$, their activity induces weight changes there. Weight changes in the association synapses $W^{\mathrm{assoc}}$ are given by
\begin{align*} \Delta W_{kj}^{\mathrm{assoc}}(t) =\,& \gamma_{+} \left({w^{\mathrm{max}} - W_{kj}^{\mathrm{assoc}}(t)}\right) \kappa_{k}^{\mathrm{value}}(t)\, \kappa_{j}^{\mathrm{key}}(t) \\ & - \gamma_{-} W_{kj}^{\mathrm{assoc}}(t)\, \kappa_{j}^{\mathrm{key}}(t)^{2} \tag{4}\end{align*}
where $\gamma_{+} > 0$, $\gamma_{-} > 0$, and $w^{\mathrm{max}}$ are constants, and where $\kappa_{j}^{\mathrm{key}}$ and $\kappa_{k}^{\mathrm{value}}$ are exponential activity traces of $z_{j}^{\mathrm{key}}$ and $z_{k}^{\mathrm{value}}$, respectively (for parameter values, see Table I). Traces are incremented by an amount $1 - \exp(-\Delta t/\tau_{\mathrm{trace}})$ at the moment of spike arrival and decay exponentially with time constant $\tau_{\mathrm{trace}}$ in the absence of spikes. The first term in (4) implements a soft upper bound $w^{\mathrm{max}}$ on the weights. The Hebbian term $\kappa_{k}^{\mathrm{value}}(t)\kappa_{j}^{\mathrm{key}}(t)$ strengthens connections between coactive neurons in the key and value layers. Finally, the last term generally weakens connections from the currently active key neurons. Since the Hebbian component strengthens connections to active value neurons, this emphasizes the current association and de-emphasizes old ones. This update is similar to Oja's rule [45]. The association synapses are then updated according to $W^{\mathrm{assoc}}(t + \Delta t) = W^{\mathrm{assoc}}(t) + \Delta W^{\mathrm{assoc}}(t)$.
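For illustration, the trace and weight updates around (4) can be sketched in NumPy as follows; the constants are illustrative stand-ins for the values in Table I, and the function names are ours.

```python
import numpy as np

def update_trace(kappa, z, dt=1.0, tau_trace=20.0):
    """Exponential activity trace: decays with time constant tau_trace and
    is incremented by 1 - exp(-dt/tau_trace) on spike arrival."""
    decay = np.exp(-dt / tau_trace)
    return kappa * decay + (1.0 - decay) * z

def hebbian_step(W, kappa_key, kappa_value, gamma_p=0.01, gamma_m=0.01, w_max=1.0):
    """Weight change of (4): soft-bounded Hebbian potentiation between
    coactive key/value neurons, minus an Oja-like depression term for
    currently active key neurons. W has shape (l, l): rows index value
    neurons k, columns index key neurons j."""
    dW = (gamma_p * (w_max - W) * np.outer(kappa_value, kappa_key)
          - gamma_m * W * (kappa_key ** 2)[None, :])
    return W + dW
```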

Now, consider a query input $x_{m}$ that should trigger a memory recall. In this case, the postencoder neurons $\boldsymbol{z}^{\mathrm{r,enc}}$ are active, while the neurons $\boldsymbol{z}^{\mathrm{s,enc}}$ are silent. This activates the key neurons through the weight matrix $W^{\mathrm{r,key}}$, giving rise to activity in the key layer. For some tasks, it is beneficial to perform a recall based on previously recalled information. For example, when you are asked “Can Tweety fly?,” you may first recall that Tweety is a canary and then remember that canaries can fly to arrive at the correct answer. We therefore included a feedback loop in the model from the value neurons to the key neurons (see the loop on the right side of Fig. 1). In this way, recalled activity can influence the recall itself after some delay. We set the feedback delay $d_{\mathrm{feedback}}$ to 1 ms if not stated otherwise. The input current to a neuron $j$ in the key layer during recall is given by
\begin{align*} I^{\mathrm{key}}_{j}(t) = & \sum_{i \le d} W^{\mathrm{r,key}}_{ji} z_{i}^{\mathrm{r,enc}}(t) \\ & + \sum_{d < k \le d+l} W^{\mathrm{r,key}}_{jk} z_{k-d}^{\mathrm{value}}(t-d_{\mathrm{feedback}}) \tag{5}\end{align*}
where $W^{\mathrm{r,key}}$ is a synaptic weight matrix of size $l \times (d + l)$, and where $d_{\mathrm{feedback}}$ is the synaptic delay of the feedback connections. As before, we denote the spike train of a neuron $j$ in this layer as $z_{j}^{\mathrm{key}}$. This activity activates some neurons in the value layer through the synapses $W^{\mathrm{assoc}}$, thus potentially recalling information that has been stored previously. The synaptic current to a neuron $k$ in the value layer is thus given by
\begin{equation*} I^{\mathrm{value}}_{k}(t) = \sum_{j} W_{kj}^{\mathrm{assoc}}(t) z_{j}^{\mathrm{key}}(t) \tag{6}\end{equation*}
giving rise to a spike train $z_{k}^{\mathrm{value}}$.
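A recall step following (5) and (6) then reads the encoded query together with the delayed value-layer feedback. The sketch below assumes the LIF dynamics of the key and value layers are available as functions and implements the delay with a simple buffer; all names are ours.

```python
import numpy as np
from collections import deque

d, l, delay_steps = 80, 100, 1             # layer sizes and feedback delay in steps

W_r_key = np.zeros((l, d + l))             # first d columns: query; last l: feedback
W_assoc = np.zeros((l, l))                 # Hebbian association matrix
fb = deque([np.zeros(l)] * delay_steps, maxlen=delay_steps)

def recall_step(z_r_enc, key_lif, value_lif):
    """One recall time step. key_lif/value_lif map input currents to spikes
    (stand-ins for the LIF layer dynamics)."""
    I_key = W_r_key @ np.concatenate([z_r_enc, fb[0]])  # (5): query + delayed feedback
    z_key = key_lif(I_key)
    I_value = W_assoc @ z_key                           # (6): recall via associations
    z_value = value_lif(I_value)
    fb.append(z_value)                                  # becomes feedback after delay
    return z_key, z_value
```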

Finally, the value neurons project to an output network. The architecture of the output network depends on the task at hand. In the simplest case, the network output was determined by summing $z_{k}^{\mathrm{value}}$ over the last $\tau_{\mathrm{read}}$ time steps and passing the result through a final weight matrix $W^{\mathrm{out}}$ to a layer of output neurons $\boldsymbol{o}$
\begin{equation*} \hat{o}_{j} = \sum_{k} W^{\mathrm{out}}_{jk} \sum_{t^{\prime} = 0}^{\tau_{\mathrm{read}}} z_{k}^{\mathrm{value}}(M\tau_{\mathrm{sim}} - t^{\prime}). \tag{7}\end{equation*}
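In code, (7) amounts to a spike count over a trailing window followed by a linear projection; a minimal sketch:

```python
import numpy as np

def readout(z_value_history, W_out, tau_read):
    """Output (7): sum value-layer spikes over the last tau_read time steps
    and project through W_out. z_value_history has shape (T, l)."""
    spike_count = z_value_history[-tau_read:].sum(axis=0)
    return W_out @ spike_count
```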

During training for all tasks, the weights $W^{\mathrm{s,key}}$, $W^{\mathrm{s,value}}$, $W^{\mathrm{r,key}}$, and $W^{\mathrm{out}}$ were trained by minimizing the cross-entropy loss between the softmax of $\hat{\boldsymbol{o}}$ and the target output using backpropagation through time (BPTT) together with the Adam optimizer [46]. If not stated otherwise, we included a regularizing term that encourages low firing rates in the network. To overcome the problem of the nonexistent gradient of the spike function, we used the standard surrogate gradient method as in [3]. Note that the association weights $W^{\mathrm{assoc}}$ are plastic during inference: in the training process, the network learns how to use the Hebbian updates during inference to rapidly store task-relevant information in this matrix and retrieve it at a query. See the Supplementary for details on the model and training setup.
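In practice, such a surrogate gradient can be realized as a custom autograd function: a hard threshold in the forward pass and a smooth pseudo-derivative in the backward pass. The PyTorch sketch below uses a piecewise-linear pseudo-derivative in the style of [3] with an illustrative dampening factor of 0.3; the exact training code may differ.

```python
import torch

class SpikeFunction(torch.autograd.Function):
    """Heaviside spike with a surrogate derivative for BPTT."""

    @staticmethod
    def forward(ctx, v_scaled):
        # v_scaled = (V - threshold) / threshold
        ctx.save_for_backward(v_scaled)
        return (v_scaled > 0.0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_scaled,) = ctx.saved_tensors
        # Piecewise-linear pseudo-derivative with dampening factor 0.3
        surrogate = 0.3 * torch.clamp(1.0 - torch.abs(v_scaled), min=0.0)
        return grad_output * surrogate

spike = SpikeFunction.apply  # replaces the hard threshold during training
```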

SECTION III.

Results

A. Memorizing Associations

We first tested the ability of our model to memorize associations in one shot and to use these associations later when needed. To this end, we conducted experiments on a task that requires forming associations between random continuous-valued vectors and integer labels that are sequentially presented to the network.

In each instance of this task, we randomly drew $N$ real vectors where each element of a vector was sampled from a uniform distribution on the interval $[0, 1)$. We then generated input sequences of $N$ pairs, each containing one of the random vectors and an associated label from 1 to $N$ [Fig. 2(a) (right)]. After all those vector-label pairs have been presented to the network, it receives a query vector. The query vector was equal to one of the $N$ vectors and was randomly selected for each input sequence. The network was required to output the label of the query vector.
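Generating one instance of this task is straightforward; the following NumPy sketch follows the stated distributions (the function name is ours, not taken from the paper's code).

```python
import numpy as np

def make_association_example(N, dim=10, rng=None):
    """One task instance: N random vectors paired with labels 1..N,
    followed by a query equal to one of the stored vectors."""
    rng = rng or np.random.default_rng()
    vectors = rng.uniform(0.0, 1.0, size=(N, dim))   # facts: vector-label pairs
    labels = np.arange(1, N + 1)
    q = rng.integers(N)                              # index of the queried vector
    return list(zip(vectors, labels)), vectors[q], labels[q]

facts, query, target = make_association_example(N=8)
```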

Fig. 2.

Association task and out-of-distribution generalization. (a) Schematic of the network model (left), the network input for an input sequence of length $N=8$ (top right), and the network activity after training (bottom right). Eight 10-D random vectors along with an associated label from 1 to 8 are presented sequentially as facts ($x_{1}$–$x_{8}$). After all those vector-label pairs have been presented to the network, it receives a query vector ($x_{9}$). The query vector is equal to one of the vectors presented as facts (randomly selected; the vector with label 2 in the shown example). The network is required to output the label of the query vector. Inputs are encoded as spike trains $\boldsymbol{z}^{\mathrm{s,enc}}$ and $\boldsymbol{z}^{\mathrm{r,enc}}$ (100 ms for each input). Synaptic connections to the key layer activate some key neurons $\boldsymbol{z}^{\mathrm{key}}$, while the synaptic connections to the value layer activate some value neurons $\boldsymbol{z}^{\mathrm{value}}$. Spikes in these layers during storing of the third vector-label pair $x_{3}$ and during recall are shown within red, green, and blue rectangles, respectively. Neurons that are active both during storing of $x_{3}$ and during recall are highlighted with saturated color. (b) Performance comparison of our model with an LSTM network and an LSNN on this task. Shown is the test accuracy for various sequence lengths $N$. While the accuracy of our model stays above 90% for sequences with up to 50 vector-label pairs, the performance of the LSTM network and the LSNN quickly decreases with increasing sequence length. (c) Out-of-distribution generalization capability. We trained models with a sequence length of $N_{\mathrm{train}}=5$ and evaluated each model's generalization to test sets with shorter and longer sequence lengths $N_{\mathrm{test}}$. Comparison to the LSTM network and LSNN: while all models generalize to new test sets with shorter sequences, our model shows superior generalization to longer sequences.

The model was trained for 4250 iterations using BPTT [3], [47]. We used two dense layers as input encoders in this task, each consisting of 80 LIF neurons. One layer was used to encode the random input vectors and the other to encode the integer labels. Inputs were applied to these layers for 100 ms each, giving rise to the spike trains $\boldsymbol{z}^{\mathrm{s,enc}}$ and $\boldsymbol{z}^{\mathrm{r,enc}}$ (see Section Model and Training Details in the Supplementary for more details on the model and the training setup). Fig. 2(a) (bottom right) shows the network activity after training for one test example with a sequence length $N$ of eight vector-label pairs.

In Fig. 2(b), we compare the performance of our model for various sequence lengths (number of vector-label pairs) to standard generic artificial and spiking recurrent network models: the LSTM network [36] and the LSNN [3], [14]. The LSTM network consisted of 100 LSTM units. The LSNN consisted of 200 regular spiking and 200 adaptive neurons. For both models, we used the same architecture for the input encoder and the output layer as in our model. For short sequences, the performance of the LSTM network and the LSNN is comparable to that of our model. While the accuracy of the LSTM and LSNN drastically drops at some point, the accuracy of our model stays above 90% for sequences containing up to 50 vector-label pairs. The training of the network included a firing rate regularization term (see Supplementary), which resulted in a rather low average firing rate of 13 Hz per neuron. Hence, each neuron fired on average 13 spikes/s, or approximately one spike for each 100 ms input token.

We next asked whether the network can generalize from a given sequence length to different sequence lengths. To this end, we trained it with a single sequence length of $N_{\mathrm{train}}=5$ vector-label pairs and tested the model on shorter and longer sequences of up to length 30. Note that the labels in this task were randomly chosen from $\{1, {\dots}, 30\}$ also during training (and the output layer was of size 30), such that the network encountered all possible labels in the training process. The association sequences were generated anew each time by randomly drawing the vectors and associated labels; hence, besides the different lengths, the test sequences also included unseen samples. In Fig. 2(c), we compare its performance to an LSTM network and an LSNN in terms of test accuracy. While all models generalize to shorter sequences, our model shows superior generalization to longer sequences. This constitutes a strong form of generalization—out-of-distribution generalization—since the distribution of input sequences during training (sequences of length 5) differed significantly from the distribution of sequences during testing (sequences of length up to 30). This capability indicates that the network learned an algorithm whereby it stores pattern-label pairs and recalls the label from the pattern at query time. In contrast to the operation of the LSTM/LSNN, such an algorithm is essentially independent of the sequence length.

We further asked whether the network is also able to generalize to different input firing rates. Similar to our previous approach, we trained the network with a single sequence length of 2, utilizing labels randomly selected from the set $\{1, 2\}$. During testing, we kept the sequence length at two vector-label pairs and tested the model for various input firing rates, ranging from 20% to 200% of the firing rate used during training. In Fig. S1 in the Supplementary, we compare its performance to the LSNN in terms of test accuracy on this task. While the performance of both models decreases with decreasing input rate, we observed that our model generalizes better to higher input firing rates.

B. One-Shot Learning

While standard deep-learning approaches need large numbers of training examples, humans can learn new concepts based on a single exposure (one-shot learning) [48]. A large number of few-shot learning approaches have been proposed using artificial neural networks [49], [50], but biologically plausible SNN models are very rare [39]. We wondered whether Hebbian plasticity could endow SNNs with one-shot learning capabilities. To that end, we applied our model to the problem of one-shot classification on the Omniglot dataset [51]. Omniglot consists of 1623 handwritten characters from 50 different alphabets. Each character is considered one class. There are 20 examples of each class, each hand drawn by a different person. Following [52], we resized all images to $28 \times 28$ pixels and split the data into training, test, and validation sets consisting of 1028, 423, and 172 classes, respectively. As in [52], we then increased the number of characters by augmenting the dataset with three rotations (90°, 180°, and 270°), where each rotation was counted as a new character class, resulting in 4112, 1692, and 688 classes in total for the training, test, and validation sets, respectively.

We used a convolutional neural network (CNN) as the input encoder for the Omniglot image. This CNN was first pretrained alone on the training set using the prototypical loss [49] and then converted into a spiking CNN by using a threshold-balancing algorithm [53], [54], which is known to be the state-of-the-art approach in facilitating learning for deep convolutional SNNs [55], [56], [57] (see Section Converting Pre-Trained CNNs in the Supplementary for details on the conversion algorithm).

In each instance of the one-shot learning task, we randomly drew five different Omniglot classes. We then generated an input sequence as follows: From each of these five classes, we randomly drew a character instance and generated a random sequence out of them (i.e., the sequence is a random permutation of the five instances). This sequence was shown to the network together with a sequence of labels from 1 to 5 (again, a random permutation). Hence, each input character was paired with a unique label [see Fig. 3(a) (right)]. After all those image-label pairs have been presented to the network, it receives a query image. The query image showed another randomly drawn sample from one of these five Omniglot classes. The network was required to output the label that appeared together with an image of the same class as the query image (one-shot five-way classification).
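For reference, sampling one such episode can be sketched as follows; the data layout `class_to_images`, mapping each character class to the list of its drawings, is our assumption, not the paper's code.

```python
import random

def sample_one_shot_episode(class_to_images, num_ways=5):
    """One-shot five-way episode: five image-label facts in random order,
    then a query image drawn from one of the same five classes."""
    classes = random.sample(sorted(class_to_images), num_ways)
    labels = random.sample(range(1, num_ways + 1), num_ways)  # random label permutation
    facts = [(random.choice(class_to_images[c]), lbl)
             for c, lbl in zip(classes, labels)]
    random.shuffle(facts)                                     # random presentation order
    query_class, target = random.choice(list(zip(classes, labels)))
    query_image = random.choice(class_to_images[query_class]) # ideally a fresh instance
    return facts, query_image, target
```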

Fig. 3.

Omniglot one-shot task and a visualization of the embeddings learned by the encoder CNN in this task. (a) Schematic of the network model (left) and the network input (right). Five images from the Omniglot dataset along with an associated label from 1 to 5 are presented sequentially as facts ($x_{1}$ to $x_{5}$). After all those image-label pairs have been presented to the network, it receives a query image ($x_{6}$). The network is required to output the label that appeared together with an image of the same class as the query image. (b) A $t$-SNE visualization of the embeddings learned by the encoder CNN. A subset of the Tengwar script is shown (an alphabet in the test set). Misclassified characters are highlighted in red along with arrows pointing to the correct cluster. Inset: a $t$-SNE visualization of the learned key and value representations of the inputs. Colors indicate character class.

We trained our model for 200 epochs with 200 iterations per epoch on the training set. In each iteration, we randomly drew a batch of 256 input sequences. During this training, the whole model was optimized, including the pretrained encoder CNN. We treated the grayscale values of an Omniglot image as a constant input current, applied for 100 ms to 784 LIF neurons, to produce input spikes to the CNN. The final layer of the CNN consisted of 64 LIF neurons. A single dense layer, also consisting of 64 LIF neurons, was used to encode the integer labels as spike trains with a duration of 100 ms per label (see Section Model and Training Details in the Supplementary for more details on the model and the training setup).

The rationale for this network architecture is that, given a suitable generalizing representation of the characters, the Hebbian weight matrix can easily associate characters with labels, thus performing one-shot memorization of previously unseen classes to arbitrary labels. In biology, suitable representations could emerge from evolutionarily optimized networks, potentially fine-tuned by unsupervised plasticity processes. In our setup, these representations are provided by the CNN encoder, where the prototypical loss ensures that similar inputs are mapped to similar representations [49]. The network model can thus be seen as a spiking implementation of a prototypical network. However, the biologically unrealistic nearest-neighbor algorithm used to determine the output of the latter is replaced here by a simple heteroassociative memory. In Fig. 3(b), we show a sample $t$-SNE visualization of the embeddings produced by the spiking CNN that was used as the image encoder in this task. Although the shown characters are rather diverse, the network represents them as well-separated clusters. This clustering can also be observed in the learned key and value representations of the inputs [inset of Fig. 3(b)]. Overall, the SNN achieved an accuracy of 92.2% when tested on one-shot five-way classification of novel character classes, at an average firing rate of 25 Hz.

In a previous work [39], an SNN achieved an accuracy of 83.8% on a variant of the standard Omniglot one-shot learning task considered here. Instead of Hebbian plasticity, the model relied on a more elaborate three-factor learning rule for rapid learning. This model used a convolutional input encoder consisting of nonspiking neurons with batch normalization, which was trained end-to-end together with the SNN with BPTT. Recently, [40] presented a spiking CNN-based implementation of SiameseNets [58] and model-agnostic meta-learning (MAML) [50], which are other classical few-shot learning algorithms. Their spiking CNN implementations with a similar direct input encoding achieved 89.1% (spiking SiameseNets) and 91.5% (spiking MAML) one-shot five-way accuracies on Omniglot. Our model’s accuracy of 92.2% is a slight improvement over these results. Note, however, that these models were specifically designed for few-shot learning, while this is only one application of our versatile SNN model with Hebbian plasticity.

C. Cross-Modal Associations

Humans can imagine features of previously encountered stimuli. For example, when you hear the name of a person, you can imagine a mental image of their face. Here, in contrast to the associations considered above, not just classes are associated but (approximate) mental images. We therefore asked whether Hebbian plasticity can enable SNNs to perform such cross-modal associations. We trained our model in an autoencoder-like fashion, using the Free Spoken Digit Dataset (FSDD) [59] and the MNIST dataset [60] in this task. FSDD is an audio/speech dataset consisting of recordings of spoken digits. The dataset contains 3000 recordings from six speakers (50 of each digit per speaker).

In each instance of the task, we randomly drew three unique digits between 0 and 9. For each digit, we then generated a pair containing a randomly drawn instance of this digit as an audio file from the FSDD dataset, and a randomly drawn image of the same digit from the MNIST dataset [Fig. 4(a) (right)]. After these audio–image pairs have been presented to the network, it received an additional audio query. The audio query was another randomly drawn instance from the FSDD dataset of one of the previously presented digits. The network was required to generate the image of the handwritten digit that appeared together with the spoken digit of the same class as the audio query.

Fig. 4.

Audio-to-image synthesis task and examples of generated images. (a) Schematic of the network model (left) and the network input (right). Three audio–image pairs are presented sequentially as facts. The input pairs ($x_{1}$, $x_{2}$, and $x_{3}$) each contain a spoken digit from the FSDD dataset and an image of the same digit from the MNIST dataset. After all those audio–image pairs have been presented to the network, it receives an audio query ($x_{4}$). The network is required to generate an image of the handwritten digit that appeared together with the spoken digit of the same class as the audio query. (b) Example MNIST images from the test set (top) and the corresponding images reconstructed by the network (bottom). The reconstructed images are not just typical images for the digit classes, but rather images that are very similar to the images presented previously with the audio cues (compare the three rightmost image pairs).

We used one CNN to encode the audio input [more specifically, the mel-frequency cepstral coefficients (MFCCs); see Section Model and Training Details in the Supplementary] and one CNN to encode the MNIST images. The CNNs were pretrained on FSDD/MNIST classification tasks, respectively, and the pretrained models were converted into spiking CNNs using the threshold-balancing algorithm [53], [54] (see Section Converting Pre-Trained CNNs in the Supplementary for details on the conversion algorithm). We removed the final classification layers of the CNNs and used the penultimate layers, each consisting of 64 LIF neurons, as the encoding of the input stimuli. The spiking CNNs were then fine-tuned during end-to-end training (see Section Model and Training Details in the Supplementary for more details on the model and the training setup). Following the value layer of the model, the image reconstruction was produced by a two-layer fully connected network with 256 and 784 LIF neurons, respectively.

In Fig. 4(b), we show example MNIST images from the test set and the corresponding images that were reconstructed by the network. One can see that the network did not just imagine a typical image for the digit class, but rather an image that is very similar to the image presented previously with the audio cue. This shows that the network did not simply memorize the digit class in its associative memory, but rather features that benefit the reconstruction of this specific sample. To quantify the reconstruction performance of our model, we computed the mean squared difference (MSD) between the image produced by the network and all MNIST images in an input sequence. The MSD was 0.03 ± 0.02 (mean ± standard deviation; the median was 0.02 with lower and upper quartiles of 0.01 and 0.03, respectively) between the reconstructed image and the target image, and 0.1 ± 0.04 between the reconstructed image and the two other MNIST images in the input sequence (statistics are over 1000 examples in the test set). Again, the network operated in a sparse-activity regime with an average firing rate of 10 Hz.
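The reconstruction metric is a plain mean squared difference between images; for completeness:

```python
import numpy as np

def msd(img_a, img_b):
    """Mean squared difference between two images with pixel values in [0, 1]."""
    return float(np.mean((np.asarray(img_a) - np.asarray(img_b)) ** 2))

# In the reported experiments, msd(reconstruction, target) was ~0.03 on average,
# versus ~0.1 against the other MNIST images in the same input sequence.
```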

D. Question Answering

The bAbI dataset [27] is a standard benchmark for cognitive architectures with memory. The dataset contains 20 different types of synthetic question-answering (QA) tasks, designed to test a variety of reasoning abilities on stories. Each of these tasks consists of a sequence of sentences followed by a question whose answer is typically a single word (in a few tasks, answers are multiple words; see Table S2 in the Supplementary for example stories and questions). We provided the answer to the model as supervision during training, and it had to predict it at test time on a separate test set. The performance of the model was measured using the average error on the test dataset over all tasks and the number of failed tasks (according to the convention of [27], a model had failed to solve a task if the test error was above 5% for that task).

Each instance of a task consists of a sequence of $M$ sentences $\langle x_{1}, \ldots, x_{M}\rangle$, where the last sentence is a question, and an answer $a$. We represent each word $j$ in a given sentence $x_{m}$ by a one-hot vector $\boldsymbol{w}_{m,j}$ of length $V$ (where $V$ is the vocabulary size). We limited the number of sentences in a story to 50 (similar to previous work [28], [35]).
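The word encoding is a standard one-hot scheme; a small sketch with a toy vocabulary follows (the exact tokenization used in the paper may differ).

```python
import numpy as np

def encode_sentence(sentence, vocab):
    """Return a (num_words, V) array of one-hot word vectors w_{m,j}."""
    words = sentence.lower().rstrip(".?").split()
    onehot = np.zeros((len(words), len(vocab)))
    onehot[np.arange(len(words)), [vocab[w] for w in words]] = 1.0
    return onehot

vocab = {"mary": 0, "went": 1, "to": 2, "the": 3, "bathroom": 4}  # toy vocabulary
x1 = encode_sentence("Mary went to the bathroom.", vocab)          # shape (5, 5)
```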

We used a dense layer consisting of 80 LIF neurons as input encoder in this task. We found it helpful to let the model choose for itself which type of sentence encoding to use. We, therefore, used a learned encoding (see Section Model and Training Details in the Supplementary and [35]). Each sentence was encoded as a spike train with a duration of 100 ms. Similar to previous work [28], [35], we performed three independent runs with different random initializations and reported the results of the model with the highest validation accuracy in these runs. The average firing rate per neuron estimated over all bAbI tasks was 12 Hz, which corresponds to approximately one spike for each 100 ms input token.

In Table II, we compare our model to the spiking RelNet [9] and to the H-Mem model [41], a nonspiking memory network model. Similar to the feedback loop from the value layer to the input of the key layer in our model, the H-Mem model can utilize several memory accesses conditioned on previous memory recalls. The results of H-Mem were reported for a single memory access (1-hop) and three memory accesses (3-hop). It turned out that multiple memory hops were necessary to solve some of the bAbI tasks. Similarly, we found that our model with a near-instantaneous feedback loop (with a delay $d_{\mathrm{feedback}}$ of 1 ms) struggled with a few tasks that could be solved with a delay of 30 ms, corresponding roughly to three hops during the network inference that lasted 100 ms. We compared the performance of these models in terms of their mean error, error on individual tasks, and number of failed tasks. Our model with 1 ms feedback delay solved 13 of the 20 tasks. By increasing the feedback delay $d_{\mathrm{feedback}}$ from 1 to 30 ms, it was able to solve 16 of the 20 tasks. This result indicates that the spiking network can make use of multiple memory accesses asynchronously, that is, simply through a delayed feedback loop without the need for discrete memory-access steps. The spiking RelNet model solved 17 of the 20 tasks. Note, however, that this model is much more complex, makes heavy use of weight sharing, and employs pretrained LSNNs for word embeddings.

TABLE II Test Error Rates (in %) on the 20 bAbI QA Tasks. Comparison of our Model to the Spiking RelNet Model [9] and to the Nonspiking Memory-Based Model H-Mem [41] Performing One Memory Access (1-Hop) and Three Memory Accesses (3-Hop). Shown are Mean Error, Error on Individual Tasks, and the Number of Failed Tasks (According to the Convention of [27], a Model Had Failed to Solve a Task if the Test Error Was Above 5% for That Task; Results of the Alternative Models Were Taken From the Respective Papers). Keys: Mem. Acc. = Number of Memory Accesses; fb. Delay = Synaptic Delay $d_{\mathrm{feedback}}$ in Feedback Loop

E. Reinforcement Learning

While supervisory signals are arguably scarce in nature, it is well established that animals learn from rewards [61]. To test whether memory can also serve SNNs in the reward-based setting, we evaluated our model on an episodic reinforcement learning task. The task is based on the popular children's game Concentration, which requires good memorization skills and is played with a deck of $n$ pairs of cards. The cards in each pair are identical. At the start of each game, the cards are shuffled and laid out face down. A player's turn consists of flipping over any two of the cards. If the cards match, they are removed from the game and the current player moves again. Otherwise, the cards are turned face down again, and the next player proceeds. The player who collects the most pairs wins the game.

Here, we consider a one-player solitaire version of the game [Fig. 5(a)]. In this version, the objective is to find all matching pairs with as few card flips as possible. The cards are arranged on a 1-D grid of cells, each of which may be empty or may contain one card. The grid is just large enough to hold all of the $2n$ cards. Each card may be either face up or face down at any given time step (initially, all cards are face down). The agent's available action is to flip over any one of the cards at a given time step; more precisely, an action is an integer from $\{1, 2, \ldots, 2n\}$. Whenever two cards are face up but do not match, they automatically turn face down in the next time step. Whenever two cards match, they are removed from the grid and the agent is rewarded. The agent receives a small penalty for each card flip. The game continues until all cards are removed.
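To make the game dynamics explicit, the following is a minimal sketch of this solitaire environment; the class name and the reward and penalty magnitudes are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

class ConcentrationSolitaire:
    """Sketch of solitaire Concentration with 10-D random vectors as card faces."""

    def __init__(self, n_pairs=2, dim=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n, self.dim = n_pairs, dim

    def reset(self):
        faces = self.rng.uniform(size=(self.n, self.dim))
        self.cards = np.repeat(faces, 2, axis=0)        # each face appears on two cards
        self.rng.shuffle(self.cards)
        self.removed = np.zeros(2 * self.n, dtype=bool)
        self.face_up = []                               # indices of face-up cards
        return self._observe(action=0, face=np.zeros(self.dim))

    def _observe(self, action, face):
        # Cell states (one-hot: empty / face down / face up), previous action
        # (one-hot), and the face of the card flipped in the previous step.
        cell = np.zeros((2 * self.n, 3))
        for i in range(2 * self.n):
            cell[i, 0 if self.removed[i] else (2 if i in self.face_up else 1)] = 1.0
        prev = np.zeros(2 * self.n)
        prev[action] = 1.0
        return np.concatenate([cell.ravel(), prev, face])

    def step(self, action):
        if len(self.face_up) == 2:                      # mismatched pair turns back down
            self.face_up = []
        reward, face = -0.05, np.zeros(self.dim)        # small penalty per flip
        if not self.removed[action] and action not in self.face_up:
            self.face_up.append(action)
            face = self.cards[action]
        if len(self.face_up) == 2:
            a, b = self.face_up
            if np.allclose(self.cards[a], self.cards[b]):
                self.removed[[a, b]] = True             # match: remove cards and reward
                self.face_up = []
                reward += 1.0
        done = bool(self.removed.all())
        return self._observe(action, face), reward, done

env = ConcentrationSolitaire(n_pairs=2)
obs = env.reset()
obs, reward, done = env.step(0)   # flip the first card
```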

Fig. 5.

SNN learns to play the game Concentration from rewards. (a) Example game moves for a Concentration game with four cards played as a solitaire. The objective of the game is to turn over pairs of matching cards with as few card flips as possible. Shown are, for each time step $t_{i}$, the agent's observation $\boldsymbol{s}_{i}$, the action $a_{i}$ taken by the agent, and the resulting card configuration $c_{i}$ after taking action $a_{i}$. The agent's observation $\boldsymbol{s}$ contains the state of the cells (face-down card, illustrated by a square with a diamond pattern; face-up card, plain square; or empty, grayed-out square), the previous action taken by the agent, and the face of the card the agent had flipped in the previous time step. At time $t_{1}$, all cards are face down. Flipping card 1 (action $a_{1}$) results in configuration $c_{1}$ (i.e., card 1, showing a blue triangle, is face up). Flipping card 2 in time step $t_{2}$ reveals an orange disk (configuration $c_{2}$). The two cards do not match; hence, they are turned face down again in the next time step $t_{3}$. In time step $t_{3}$, card 4, which shows a blue triangle, is flipped. By recalling that the matching card is card 1, the agent flips card 1 in $t_{4}$. Matching face-up cards are then removed from the board. The game continues until the remaining two cards are removed. Board configurations at which the agent receives a reward are indicated by a red rectangle. (b) Evolution of the number of card flips the agent takes to finish a game over the number of games during training. Shown is the evolution of the number of flips for a deck of four cards (blue) and six cards (green). Black dashed lines show the mean number of flips required when the agent has perfect memory and follows an optimal strategy (5.33 for a deck of four cards and 8.65 for a deck of six cards). (c) Histogram of the number of flips an agent takes to finish a game with four cards (blue) and six cards (green). Shown are histograms for a random agent (left), our agent after training (middle), and an agent that has perfect memory and follows an optimal strategy (right). Histograms are computed from 1000 games each and are scaled to the same width.

Instead of using images, we define each card face to be a 10-D random continuous-valued vector. The agent's observation vector $\boldsymbol{s}$ [see Fig. 5(a)] contains three components: 1) a one-hot vector for each cell that encodes the state of the cell (a cell can be empty, contain a face-down card, or contain a face-up card); 2) a one-hot vector encoding the previous action taken by the agent (i.e., which grid position was flipped); and 3) a 10-D real vector for the face of the card the agent had flipped in the previous time step (this is a zero vector if the card is face down after the flip or if the agent's action was to flip an empty cell). In contrast to the other considered tasks, where we sequentially presented facts followed by a query to which the network should respond with an output, here, in each time step, the network had to figure out by itself when to store and when to recall information, given only the current observation vector.

The performance was evaluated in terms of the number of flips performed until all matching pairs had been removed from the grid. Agents were trained with proximal policy optimization (PPO) [62] in actor–critic style using 64 asynchronous vectorized environments (see Section Model and Training Details in the Supplementary for details of the model and the training setup). We evaluated our model on a deck of four cards and a deck of six cards. The evolution of the number of card flips the agent takes to finish a game over the number of games during training is shown in Fig. 5(b). After training, we evaluated the agents on 1000 games and recorded the number of card flips the agent took to finish each game. Fig. 5(c) shows the histogram of the number of card flips in this evaluation for a random agent, our agent, and an agent that has perfect memory and follows an optimal strategy. If an agent has no memory at all and plays by simply flipping cards at random, then the expected number of flips it takes to finish a game with $n$ pairs of cards is $(2n)^{2}$. For an optimal agent, the number of flips lies between $2n$ and $4n-2$, and the expected number of flips approaches $(6-4\ln 2)n + 7/8 - 2\ln 2$ as $n \to \infty$ [63]. For the four-card game, the SNN reached optimal performance (mean number of flips: 5.33; optimal: 5.33). During training, the agent reached an average of eight flips per game within 8235 games. To simplify the meta-parameter search in these quite demanding simulations, we did not include a rate regularizer during training. Hence, the average firing rate in the network was 50 Hz, higher than in the other simulations. The six-card game was harder to train. Still, the final network's performance was again close to optimal [see Fig. 5(c)]. Redrawing the random vectors at the beginning of each game (i.e., using a new deck of cards in each game) marginally decreased the performance (mean number of flips: 5.35 for the four-card game and 10.88 for the six-card game, where the optimum is 8.65 flips).
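
The random-agent baseline can be sanity-checked with a small Monte Carlo simulation. The following toy model rests on explicit assumptions: the agent picks a uniformly random cell at each step, flipping an empty cell is a wasted flip, flipping the currently face-up card turns it back down, and an unmatched pair is turned face down before the next step. Under these assumptions, the estimate for the four-card game ($n=2$) comes out near $(2\cdot 2)^{2}=16$ flips.

    import random

    # Toy simulation of a memoryless agent playing Concentration as a solitaire.
    def random_game(n_pairs):
        deck = list(range(n_pairs)) * 2          # card value at each position
        random.shuffle(deck)
        removed = [False] * (2 * n_pairs)
        flips, first = 0, None                   # 'first' = index of the face-up card
        while not all(removed):
            pos = random.randrange(2 * n_pairs)  # uniformly random cell
            flips += 1
            if removed[pos]:
                continue                         # empty cell: wasted flip
            if pos == first:
                first = None                     # face-up card turned back down
            elif first is None:
                first = pos                      # first card of an attempt
            else:
                if deck[pos] == deck[first]:
                    removed[pos] = removed[first] = True   # match: remove pair
                first = None                     # otherwise both turn face down
        return flips

    trials = 100_000
    print(sum(random_game(2) for _ in range(trials)) / trials)  # close to 16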

F. Ablation Studies

The different weight matrices in our network serve different purposes (see Fig. 1). The matrix $W^{\mathrm{s,key}}$ defines which aspects of the input are used to form the key vector during storage and hence which aspects could serve as the query of a later recall. The matrix $W^{\mathrm{s,value}}$ defines which aspects of the input are associated with this key and can hence be retrieved at recall. During recall, the matrix $W^{\mathrm{r,key}}$ defines which aspects of the query input are used as the key. The relevant aspects for all these matrices depend on the task, and hence, we trained these matrices with backprop. The training procedure could potentially be simplified if some of these matrices could be learned via unsupervised learning, which could implement more generic codes for the key and value layers. To investigate whether supervised end-to-end training is necessary, we asked whether some of these matrices can be replaced by generic random matrices. To this end, we reconsidered the association task (Section III-A and Fig. 1).
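
To make these roles concrete, the following rate-based sketch (a nonspiking caricature of the model; layer sizes, the nonlinearity, and the learning rate are assumptions, and the recurrent part of $W^{\mathrm{r,key}}$ from the value layer is omitted for brevity) shows how the three matrices enter storage and recall through the Hebbian association weights.

    import numpy as np

    # Rate-based caricature of the key-value pathway; the actual model is spiking.
    rng = np.random.default_rng(0)
    d_in, d_key, d_val = 20, 30, 30                  # assumed layer sizes
    W_s_key = rng.normal(0.0, 0.3, (d_key, d_in))    # input -> key layer (storage)
    W_s_val = rng.normal(0.0, 0.3, (d_val, d_in))    # input -> value layer (storage)
    W_r_key = rng.normal(0.0, 0.3, (d_key, d_in))    # query -> key layer (recall)
    A = np.zeros((d_val, d_key))                     # Hebbian association weights

    def relu(x):
        return np.maximum(x, 0.0)

    def store(s, eta=0.5):
        # Associate the value representation of s with its key representation.
        k, v = relu(W_s_key @ s), relu(W_s_val @ s)
        A[:] += eta * np.outer(v, k)                 # local Hebbian update

    def recall(q):
        # Query with q: its key representation reads the value out through A.
        return A @ relu(W_r_key @ q)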

We first generated independent random weight matrices for $W^{\mathrm{s,key}}$ and $W^{\mathrm{r,key}}$, fitting the statistics of these matrices to those of the corresponding matrices in a network fully trained on this task (see Sections I–J in the Supplementary). This procedure ensured that the random matrices had the correct weight-magnitude distribution. We then fixed these matrices and trained the rest of the network on the association task as usual. The resulting network performed poorly on the task. For a sequence length of 8, its accuracy was 21.7%, compared to 98.2% for the fully trained network. However, one can argue that the key matrices for storage and recall should be similar in order to match the keys in these two cases. Indeed, when we used the identical randomly generated matrix for both $W^{\mathrm{s,key}}$ and the feed-forward part of $W^{\mathrm{r,key}}$ (the recurrent connections from the value layer were chosen randomly as before), the accuracy increased to 81.8%, which is, however, still clearly worse than the fully trained network. While this simple approach enables the network to solve bAbI task 1 (error 0%), it turns out not to be sufficient for more complex tasks of this dataset. When performing the same experiment on bAbI tasks 5 and 6, the network failed to learn them (errors of 25.6% and 24.2%, respectively). These results indicate that supervised training of some matrices could be replaced by generic matrices, or potentially by unsupervised training, if the key matrices for storage and recall are coupled, but only for simple association tasks and not for more complex ones.
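
One plausible way to generate such statistics-matched random matrices is to resample entries from the empirical weight distribution of a trained matrix; note that this resampling variant is an assumption, with the exact procedure described in Sections I–J of the Supplementary.

    import numpy as np

    # Draw a random matrix whose entries follow the empirical weight distribution
    # of a trained matrix; one possible reading of "fitting the statistics".
    def statistics_matched_random(W_trained, seed=0):
        rng = np.random.default_rng(seed)
        flat = W_trained.ravel()
        resampled = rng.choice(flat, size=flat.size, replace=True)
        return resampled.reshape(W_trained.shape)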

To accomplish rapid associations during inference, Hebbian plasticity must be able to elevate the membrane potentials of neurons in the value layer above the threshold within the time of an input token presentation (100 ms in our setup). This means that neuron thresholds should be relatively small compared to the maximum weights of the association matrix. In all our simulations, we used a threshold of $\vartheta=0.1$ with a maximum weight of $w^{\mathrm{max}}=1$. To evaluate whether network performance is sensitive to the exact value of $\vartheta$, we performed control simulations in which we trained the model with higher thresholds on the pattern association task from Section III-A with $N=8$ patterns as in Fig. 2(a) (for details, see Sections I–J and Fig. S2 in the Supplementary). For a threshold of $\vartheta=0.3$ (three times the original value), performance remained high at 97% accuracy, and even for a threshold of $\vartheta=1$, an accuracy of 90.4% was achieved. This shows that performance is quite insensitive to the exact threshold value. Fig. S2 also shows that in the latter case ($\vartheta=1$), individual weights stay clearly below the threshold after a store operation has been performed by the network. Nevertheless, low thresholds are beneficial for network performance.
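
The following toy integration illustrates this constraint. Only $\vartheta=0.1$, $w^{\mathrm{max}}=1$, and the 100-ms presentation time are taken from the text; the membrane time constant, input rate, and post-storage weight are assumed numbers.

    import numpy as np

    # Can a Hebbian-strengthened synapse drive a value neuron above threshold
    # within one 100 ms token? Toy leaky integration with Poisson input.
    rng = np.random.default_rng(1)
    dt, T, tau_m = 1e-3, 0.1, 20e-3      # time step, token duration, membrane tau (s)
    theta, w, rate = 0.1, 0.2, 50.0      # threshold, assumed weight, input rate (Hz)
    v, t_cross = 0.0, None
    for step in range(int(T / dt)):
        v *= 1.0 - dt / tau_m            # membrane leak
        if rng.random() < rate * dt:     # Poisson input spike
            v += w                       # one accumulate per spike
        if v >= theta:
            t_cross = (step + 1) * dt
            break
    print(t_cross)                       # typically within a few tens of ms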

SECTION IV.

Efficiency of Neuromorphic Implementation

SNNs can be implemented highly efficiently in analog neuromorphic hardware [64], [65], [66]. Recently, energy-efficient digital implementations have also been introduced [5], [67]. The main advantage of digital SNNs over digital ANNs is the drastic reduction of multiply-accumulate (MAC) operations. Since spikes are binary, for every input spike, the weight simply has to be added to the membrane potential of the neuron, which amounts to a single accumulate (AC) operation. Since MAC operations are much more expensive than AC operations ($31\times$ more expensive in a standard CMOS process [68]), this provides the basis for the efficiency of SNNs in digital hardware [69]. However, since the AC operations are performed per spike, the average firing rates of the neurons in the network are an important factor, with sparser activity resulting in more efficient operation. In the following, we compare our SNN model with the similar but nonspiking H-Mem model [41] in terms of efficiency. The two models can be compared directly, since we can use similar architectures with the same number of neurons $n$ and number of synapses $s$. For each processed input token, the ANN needs $s$ MAC operations, resulting in an energy consumption of $sE_{\mathrm{MAC}}$, where $E_{\mathrm{MAC}}$ denotes the energy consumed by one MAC operation. For the SNN, we have approximately $Tfs$ AC operations, where $T$ is the duration of the token presentation (100 ms in our simulations) and $f$ is the average firing rate in the network. In addition, each neuron reduces its membrane potential by a constant factor in each time step. The network performs $(T/\Delta t)n$ such operations per token presentation, where $\Delta t$ is the simulation time step. Note, however, that this is a multiplication by a constant, which can be implemented more efficiently than a MAC; in fact, some SNN models skip this decay altogether. We therefore assume that its energy consumption is comparable to that of an AC operation. In total, this amounts to an energy consumption of $(Tfs + (T/\Delta t)n)E_{\mathrm{AC}}$, where $E_{\mathrm{AC}}$ denotes the energy consumed by one AC operation.

As an illustrative example, we consider the association task from Section III-A. The core network, without the input encoder and the readout, consists of $n=300$ neurons and $s=44$k synapses. In the SNN, we used a presentation time of $T=100$ ms for each input token, and the average firing rate was measured as $f=13$ Hz. These numbers lead to an energy consumption of $44\cdot 10^{3} E_{\mathrm{MAC}}$ for H-Mem and $87.2\cdot 10^{3} E_{\mathrm{AC}}$ for the SNN. Using the estimated factor $E_{\mathrm{MAC}}=31 E_{\mathrm{AC}}$, the SNN implementation is about 15 times more efficient. We refer to the Supplementary for details. Note that this calculation does not include the Hebbian updates of the association weights (see Section V).
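
The arithmetic behind these numbers can be reproduced with a short back-of-envelope script; the value $\Delta t = 1$ ms is implied by the totals above, and the unit energy is arbitrary.

    # Back-of-envelope energy comparison for the association task example.
    E_AC = 1.0                       # energy of one AC, arbitrary units
    E_MAC = 31 * E_AC                # CMOS estimate from [68]
    n, s = 300, 44_000               # neurons and synapses of the core network
    T, f, dt = 0.1, 13.0, 1e-3       # token duration (s), mean rate (Hz), step (s)

    e_ann = s * E_MAC                          # H-Mem: one MAC per synapse per token
    e_snn = (T * f * s + (T / dt) * n) * E_AC  # spike-driven ACs + per-step decay
    print(e_ann, e_snn, e_ann / e_snn)         # 1364000.0 87200.0 and a ratio of ~15.6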

SECTION V.

Discussion

Local Hebbian plasticity is a key ingredient of biological neural systems. While only local Hebbian plasticity is needed during inference in our model, the model was trained with BPTT. Since BPTT is biologically implausible, we cannot claim that our network is a model for how this functionality could be learned by an organism. Instead, our results provide an existence proof for powerful memory-enhanced SNNs. One can speculate that brain networks were shaped by evolution to make use of Hebbian plasticity processes; in this sense, BPTT can be seen as a stand-in for evolutionary processes. For example, the brain might have evolved networks for one-shot learning (Fig. 3) that are particularly tuned to behaviorally relevant stimuli. In addition to evolutionary optimization, local approximations of BPTT such as the recently introduced e-prop algorithm [70] could then further shape the evolved circuits for specific functionality.

The integration of synaptic plasticity into inference in artificial neural networks was used in [23] and [24] and has recently been adopted in [25] and [26], where input representations were bound to labels with Hebbian plasticity. Memory-augmented neural networks use explicit memory modules, which are a differentiable version of a digital memory [27], [28], [29], [30], [31], [32], [33]. Our model utilizes biological Hebbian plasticity instead. The use of fast Hebbian plasticity in SNNs was studied in [71] and [72]. These models implement a type of variable binding and can explain certain aspects of higher-level cognition in the brain. They were, however, not applied to complex cognitive tasks. In [73], the training of networks with synaptic plasticity was explored, but there the parameters of the plasticity rule were optimized instead of the surrounding control networks. Other uses of Hebbian plasticity for network computations include unsupervised pretraining followed by supervised training of the output layer [74] and the derivation of local plasticity rules for unsupervised learning [75].

The inclusion of Hebbian plasticity can be viewed as the introduction of another long time constant into the network dynamics. Previous work has shown that longer time constants can significantly improve the temporal computing capabilities of SNNs; in this direction, short-term synaptic plasticity [15], [16] and neuronal adaptation [14] have been exploited. One-shot learning in SNNs has been studied in [39]. Instead of Hebbian plasticity, this model relied on a more elaborate three-factor learning rule. Another SNN model, the spiking RelNet, was tested on the bAbI task set [9]. We have compared this model to ours in Table II. The architecture of this model is quite different from our proposal. As a spiking implementation of relational networks, it is rather complex and makes heavy use of weight sharing. Both models perform well on similar tasks of the bAbI task set, but interestingly, there are some differences. For example, our model solves task 2 "two supporting facts," on which the spiking RelNet fails. This might be due to the possibility of multiple memory accesses through the feedback loop in our model. On the other hand, our model fails at task 17 "positional reasoning," which is solved by the spiking RelNet. We suspect that this is due to the more complex network structure of the spiking RelNet. To the best of our knowledge, no previous spiking (or artificial) neural network model performs well on both one-shot learning and the bAbI tasks, as well as on the other tasks we presented.

Our results on bAbI tasks show that a synaptic delay in the feedback connections from the value layer to the key layer is essential for good performance in some tasks. While we used fixed delays of 1 and 30 ms in our simulations, an intriguing option is to optimize this delay during training. In general, optimizing delays together with weights in the network [76] could lead to more efficient networks that can better utilize spike timing information. This option could be investigated in future work.

Our discussion of efficient network implementation in Section IV did not include the Hebbian updates of the association weights. Since a multiplication is performed at every time step for these updates, the vanilla spiking model is not efficient in this respect. To make the Hebbian updates more efficient, one could perform only a single update per input token, based on the product of the numbers of pre- and postsynaptic spikes during the token presentation. This would use computational resources comparable to the nonspiking case. Even more efficient would be the use of analog memories such as memristors [77], for which efficient implementations of Hebbian plasticity have been proposed [78].
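
A minimal sketch of such a per-token update is shown below; the function name and learning rate are assumptions, and clipping to $w^{\mathrm{max}}=1$ follows the value used in our simulations.

    import numpy as np

    # One Hebbian update per input token from accumulated spike counts, instead
    # of a multiplicative update at every time step.
    def per_token_hebbian_update(A, pre_counts, post_counts, eta=0.01, w_max=1.0):
        # pre_counts/post_counts: spike counts of key/value neurons over the token
        A += eta * np.outer(post_counts, pre_counts)  # product of spike counts
        np.clip(A, 0.0, w_max, out=A)                 # keep weights bounded
        return A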

SECTION VI.

Conclusion

We have presented a novel SNN model that integrates Hebbian plasticity in its network dynamics. We found that this memory enrichment renders SNNs surprisingly flexible. In particular, our simulations show that Hebbian plasticity endows SNNs with the capability for one-shot learning, cross-modal pattern association, language processing, and memory-dependent reward-based learning. Hebbian plasticity is spatially and temporally local, i.e., the synaptic weight change depends only on a filtered version of the pre- and postsynaptic spikes and on the current weight value. This is a very desirable feature of any plasticity rule, as it can easily be implemented both in biological synaptic connections and in neuromorphic hardware. In fact, current neuromorphic designs support this type of plasticity [5]. Hence, our results indicate that Hebbian plasticity can serve as a fundamental building block in cognitive architectures based on energy-efficient neuromorphic hardware.

ACKNOWLEDGMENT

The authors would like to thank Wolfgang Maass and Arjun Rao for initial discussions.
