On-Chip Trainable Spiking Neural Networks Using Time-To-First-Spike Encoding

Artificial neural networks (ANNs) have shown remarkable performance in various fields. However, ANNs rely on the von Neumann architecture, which consumes a great deal of power. Hardware-based spiking neural networks (SNNs), inspired by the human brain, have become an alternative with significantly lower power consumption. In this paper, we propose on-chip trainable SNNs using a time-to-first-spike (TTFS) method. We modify the learning rules of conventional TTFS-based SNNs to be suitable for on-chip learning. Vertical NAND flash memory cells fabricated by a device manufacturer are used as synaptic devices. The entire learning process, designed with hardware implementation in mind, is also demonstrated. The performance of the proposed network is evaluated on the MNIST classification task in a system-level simulation using Python. The proposed SNNs show an accuracy of 96% for a network size of 784-400-10. We also investigate the effect of non-ideal cell characteristics (such as pulse-to-pulse and device-to-device variations) on inference accuracy. Our networks demonstrate excellent immunity to various device variations compared with networks using off-chip training.


I. INTRODUCTION
Recently, artificial neural networks (ANNs) have demonstrated superior performance in various fields, such as classification, pattern recognition, and detection tasks [1]-[3]. As the demand for ANNs increases, the limitations of the conventional von Neumann architecture, such as training speed and power consumption, have become a major concern [4]. Various studies have been conducted to overcome these issues, including digital accelerators [5] and efficient learning algorithms [6]. However, the conventional von Neumann architecture has fundamental limitations in terms of power consumption and the time required for memory access [7]. As an alternative, hardware-based spiking neural networks (SNNs) that use analog synaptic devices have emerged, with advantages in power consumption and operation time [8]-[10].
Spiking neural networks mimic the behavior of the human brain, which consists of numerous neurons and synapses [11]. In SNNs, neurons communicate with adjacent neurons by generating spikes and transmitting them via synapses. Each neuron integrates the spikes propagated from the preceding neurons in the form of its membrane potential. When the membrane potential exceeds the neuron's threshold, a spike is generated and transmitted to the post-neuron [12]. Spikes can carry information in two general forms: firing rate and firing time [13].
By using the firing time as the information-carrying quantity, the network can operate with a small number of spikes. Compared to rate coding, in which the spiking rate of a neuron encodes an analog value of an ANN, temporal-based networks can be implemented more power-efficiently on neuromorphic hardware since they reduce the number of spikes [14]. However, temporal-based networks are not suited to the conventional learning rules of ANNs, as they convert the analog values of ANNs into temporal formats.
To implement temporal-based SNNs, various approaches have been proposed, such as methods using derivatives of temporal values [15], an alpha synaptic function [16], and a dynamic target firing time [17]. Although these attempts have achieved remarkable results in terms of network performance, additional software-based computation is required in the learning process due to their complex learning algorithms. Therefore, these methods are unsuitable for on-chip training, which trains the network by applying update pulses to the synaptic devices at the hardware level [18] without ANN-to-SNN conversion. This reveals the limitations of the conventional approaches in terms of power consumption.
In this paper, we propose on-chip trainable temporal-based SNNs. While conventional training methods are complicated to implement in hardware, we apply simplified methods such as a static target firing time and a constant denominator for gradient normalization. The conductance characteristics of the synaptic devices are obtained from measurements of cells in vertical NAND flash memory cell strings. The holistic process of the proposed network consists of five phases: one forward phase, two backpropagation phases, and two update phases. Schematic circuitry to generate the error value between the output spike and the teaching signal is also proposed. The performance of the proposed network is evaluated at the system level by classifying the MNIST dataset, and the power efficiency of the network is estimated from the total amount of synaptic weight updates. Furthermore, the effects of variations caused by non-ideal characteristics of the synaptic devices are evaluated for three types: pulse-to-pulse variation, device-to-device variation, and stuck-at-off ratio [19]. This paper is organized as follows. Section II contains the measured characteristics of cells in a cell string of vertical NAND flash memory and the proposed algorithms to train the network. Section III presents a scheme to implement the proposed network in hardware. Section IV provides simulation results and discussion. Finally, Section V provides a summary and conclusion of our research.

II. DEVICE CHARACTERISTICS AND LEARNING METHODS

A. VERTICAL NAND FLASH MEMORY
In this work, cells in cell strings of vertical NAND (VNAND) flash memories manufactured by a memory company are used as synaptic devices. Each string contains multiple wordline (WL) cells and two select-line transistors. A schematic view of the VNAND flash string and the bias condition is demonstrated in Fig. 1 (a). The center WL cell is erased and programmed to demonstrate the long-term potentiation (LTP) and long-term depression (LTD) characteristics. In the LTP process, the GIDL (Gate Induced Drain Leakage) mechanism initiated by the erase pulse provides holes to the selected WL cell and lowers its threshold voltage [20]. For the LTD process, the program pulse initiating FN-tunneling is applied to the gate of the selected WL cell.
The measured LTP/LTD characteristics of the synapse are presented in Fig. 1 (b). Both LTP and LTD were measured for five update pulse widths. A total of 40 pulses were applied: 20 program pulses and 20 erase pulses. Both the erase and program processes show non-linear conductance behavior. Each inset presents the number of pulses required to obtain the same conductance change as 20 applications of the unit pulse. Since the pulse width and the required number of pulses are inversely proportional, the amount of conductance change is proportional to the update pulse width. Therefore, weight updates can be implemented in hardware by modulating the pulse width in proportion to the delta value stored in each neuron. The detailed process is covered in Section III. The conductance behavior of synaptic devices is generally expressed as [21]

G(x) = a (1 − e^(−βx/x_max)) + c,

where G is the conductance of the device, β is the non-linearity factor, x is the number of applied pulses, x_max is the maximum number of update pulses, and a and c are fitting parameters. The conductance characteristics of the measured device are fitted with a non-linearity factor of 2.434 for the LTP process and 3.504 for the LTD process.
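As an illustration, the fitted conductance model can be written as a short Python function. The exact functional form and the x_max normalization are assumptions of this sketch; a, c, and β are the fitting parameters named above.

```python
import numpy as np

# A minimal sketch of the exponential conductance model described in the
# text. The functional form and the x_max normalization are assumptions;
# a, c, and beta correspond to the fitting parameters named in the text.
def conductance(x, a, c, beta, x_max=20):
    """Conductance after x identical update pulses (LTP direction)."""
    x = np.asarray(x, dtype=float)
    return a * (1.0 - np.exp(-beta * x / x_max)) + c

# Example: normalized LTP curve with the fitted beta = 2.434 from the text.
g_ltp = conductance(np.arange(21), a=1.0, c=0.0, beta=2.434)
```

A larger β bends the curve harder toward its saturation value, which is why the LTD curve (β = 3.504) saturates faster than the LTP curve in this model.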

B. LEARNING METHODS
We employ the time-to-first-spike (TTFS) method as the encoding rule, together with modified versions of the learning methods of previous work [17]. Consider a gray-scale image with pixel values ranging from 0 to I_max. The intensity of each pixel is converted to a single spike emitted at a specific time by the corresponding input neuron. The spike time t_i of the i-th neuron is calculated from the pixel intensity I_i as

t_i = t_max (1 − I_i / I_max),

where I_max is 255 and t_max is 511. Fig. 2 demonstrates the temporal encoding rule schematically: each pixel intensity is converted to a single spike, and a pixel with higher intensity corresponds to a spike that fires earlier. Assuming each neuron fires only once per input image, the spike train S_i(t) of the i-th input neuron is defined as S_i(t) = 1 if t = t_i and S_i(t) = 0 otherwise. Spikes generated from the input layer are multiplied by the synaptic weights and integrated in the neurons of the next layer following the non-leaky I&F (integrate-and-fire) model [22], with adjacent layers fully connected. The membrane potential V_j^l of the j-th neuron in layer l is

V_j^l(t) = Σ_{i=1}^{N^{l−1}} w_ji Σ_{t'≤t} S_i^{l−1}(t'),

where N^{l−1} is the number of neurons in layer l−1, and w_ji is the synaptic weight connecting the j-th neuron in layer l and the i-th neuron in layer l−1. The neuron fires when its membrane potential exceeds its threshold, so the spike train of the j-th neuron in layer l is

S_j^l(t) = h_j^l(t) Θ(V_j^l(t) − θ_j^l),

where θ_j^l is the threshold of the j-th neuron in layer l, Θ is the unit step function, and h_j^l(t) indicates that the neuron has not fired before.
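As a sketch, the encoding and forward pass above can be simulated in a few lines of Python. The rounding to integer time steps and the vectorized loop are assumptions of this sketch, not details from the text.

```python
import numpy as np

T_MAX, I_MAX = 511, 255  # values from the text

def encode_ttfs(pixels):
    # Linear TTFS encoding: higher intensity -> earlier spike.
    # (Rounding to integer time steps is an assumption of this sketch.)
    return np.rint(T_MAX * (1.0 - np.asarray(pixels) / I_MAX)).astype(int)

def forward_layer(spike_times, weights, theta):
    """Non-leaky I&F sketch: a post-neuron fires (once) when the running
    sum of weights from already-fired pre-neurons crosses theta."""
    fire = np.full(weights.shape[0], T_MAX + 1)  # T_MAX+1 = never fired
    v = np.zeros(weights.shape[0])
    for t in range(T_MAX + 1):
        v += weights[:, spike_times == t].sum(axis=1)
        newly = (v >= theta) & (fire > T_MAX)
        fire[newly] = t
    return fire
```

Stacking `forward_layer` calls layer by layer reproduces the inference path of the network; the earliest-firing output neuron then gives the predicted label.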
The neuron that fires earliest in the output layer determines the label of the input image. To train the network so that the output neuron of the correct category fires first, we define the error e_i of the i-th neuron in the output layer and the loss L as

e_i = t_i − t_target,   L = (1/2) Σ_i e_i²,

where t_i is the firing time of the output neuron and t_target is the target firing time. The method for setting the target firing time is presented at the end of this section. The synaptic weight connecting the i-th neuron in layer l−1 and the j-th neuron in layer l is updated as

Δw_ji = −η δ_j^l · 1[t_i^{l−1} < t_j^l],

where η is the learning rate, the indicator 1[·] is 1 only when the pre-neuron fires before the post-neuron, and the approximation ∂t_j^l / ∂V_j^l = −1 is assumed following the previous work [17]. The delta value δ_j^l of a neuron is defined from the derivative of the loss with respect to its firing time t_j^l. If layer l is the output layer, the delta value is obtained as δ_j^l = e_j, and when layer l is a hidden layer, the delta value is calculated by backpropagation as follows [17]:

δ_j^l = Σ_k δ_k^{l+1} w_kj · 1[t_j^l < t_k^{l+1}],

where the indicator restricts the sum to the neurons in layer l+1 that fire after the j-th neuron. To prevent exploding and vanishing gradients, we normalize the gradient by modulating the delta values. The L2-norm is generally used as a normalization tool; however, it is challenging to implement in hardware [23]. Due to this challenge, a novel normalization method is proposed: instead of the L2-norm, a predefined constant parameter r_l serves as the denominator of the normalization term. In the proposed network, r_l is the saturated value of the L0-norm of the delta values during the training process. With the layer-wise hyperparameter r_l, the delta values are normalized as

δ_j^l ← δ_j^l / r_l.

The functionality of the target firing time t_target in the loss above is to encourage the target neuron to fire first: at every training step, the target neuron is pushed to fire earlier and the other neurons to fire later. Previous works implement this update process with various methods, such as a cross-entropy loss [15] and a relative target firing time [17].
Although these attempts train the network effectively, they are not suitable for on-chip training due to the difficulty of their hardware implementation. Obtaining a cross-entropy loss in hardware requires circuitry for computing exponential [24], logarithmic [25], and summation [26] functions. If the target firing time is a dynamic parameter, additional circuits are required to sense the firing time of the output neuron and to generate the target signal according to the output spike. To avoid the circuit complexity of the above methods, a constant target firing time is used in this paper. Given the functionality of the target firing time, the simplest choice is to set the target firing time at the first time step for the target neuron and at the last time step for the others. However, this may lead the target neuron to fire earlier than most neurons in the input and hidden layers, resulting in an information loss that degrades the performance of the network [16]. Therefore, we set the target firing times to be spaced apart from both ends of the time window by a specific interval:

t_target = p · t_max for the target neuron,   t_target = (1 − p) · t_max for the other neurons,

where p is a hyperparameter between 0 and 1. Fig. 3 demonstrates the cases of weight update with the proposed target firing time t_target. Fig. 4 presents the block diagram of the hardware implementation scheme for the proposed network. The entire process consists of five phases: one forward phase, two backpropagation phases, and two update phases.
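The two hardware simplifications proposed in this section can be sketched in Python. The exact form of the constant-target rule is an assumption inferred from the text, but it reproduces the stated p = 0.5 case in which all output neurons share the same target firing time; the value of r_l below is purely illustrative.

```python
import numpy as np

T_MAX = 511  # total number of time steps, from the text

def target_firing_time(is_target, p):
    # Sketch of the constant-target rule: the target neuron's goal is
    # offset p*T_MAX from the start of the time window, and the other
    # neurons' goal is offset p*T_MAX from the end. (Assumed form.)
    return p * T_MAX if is_target else (1.0 - p) * T_MAX

def normalize_deltas(deltas, r_l):
    # The second simplification: divide by a predefined layer-wise
    # constant r_l instead of an L2-norm, avoiding the square/sum/sqrt
    # circuitry an L2-norm would need in hardware. r_l is the saturated
    # L0-norm of the deltas observed during training.
    return np.asarray(deltas) / r_l
```

Note that a fixed denominator needs no multipliers or square-root circuitry, while `target_firing_time` reduces to a single shared constant when p = 0.5.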

III. A SCHEME FOR HARDWARE IMPLEMENTATION
In the forward phase (phase 1 of Fig. 4), input signals encoded in the temporal form are integrated into the neurons of the input layer. When the membrane voltage of an input neuron exceeds its threshold, the neuron fires and the spike propagates to the next layer through the synaptic array. The spikes eventually reach the output layer through this process. The synaptic array is divided into a G+ synaptic array and a G− synaptic array in order to represent weights ranging from negative to positive. The weight value w of a synapse is represented as

w = G+ − G−.

The schematic diagram of the network structure is shown in Fig. 5. The spikes of the output layer are transmitted to the delta-value generating circuit (Fig. 6 (a)). The delta-value generating circuit has two capacitors, each charged during the time difference between the output signal and the teaching signal. At the target firing time, the teaching signal is applied to the circuit. For the target output neuron, the teaching signal is applied to the part outlined with a red dashed line in Fig. 6 (a). The voltage V_delta^+ across the capacitor C_delta^+ then increases only if the teaching signal precedes the output signal. For the other output neurons, the teaching signal is applied to the part outlined with a blue dashed line in Fig. 6 (a). The voltage V_delta^− across the capacitor C_delta^− increases when the teaching signal follows the output signal. The voltage transitions across C_delta^+ and C_delta^− are shown in Fig. 6 (b). The voltages V_delta^+ and V_delta^− represent the positive and negative parts of the delta value defined in Section II-B, respectively.
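The behavior of this circuit can be summarized in a small Python sketch. Only the sign and gating logic follow the description above; the linear `charge_rate` constant (volts per time step) is a hypothetical placeholder for the capacitor charging slope.

```python
def delta_from_times(t_out, t_teach, is_target, charge_rate=1.0):
    """Sketch of the delta-value generating circuit. A capacitor charges
    during the interval between the output spike and the teaching signal:
    for the target neuron, V_delta^+ rises only if the teaching signal
    precedes the output spike; for the other neurons, V_delta^- rises
    only if the teaching signal follows the output spike.
    `charge_rate` is a hypothetical volts-per-time-step constant."""
    if is_target:
        return charge_rate * max(t_out - t_teach, 0)   # positive delta
    return -charge_rate * max(t_teach - t_out, 0)      # negative delta
```

This matches the training objective: a late-firing target neuron accumulates a positive delta (pushing it to fire earlier), while an early-firing non-target neuron accumulates a negative delta (pushing it to fire later).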
The delta values stored in the output layer are propagated to the hidden layer during the backpropagation phases (phases 2 and 3 of Fig. 4). After the forward phase, each delta value is held as a voltage magnitude regardless of its sign. Since the sign of the delta value must be distinguished in the backward weighted sum, the backpropagation phase is performed twice, once for positive and once for negative delta values. To obtain the positive delta values of the hidden layer, the delta values of the target output neuron and of the other neurons, converted through a pulse-width modulator (PWM), are applied to the G+ synaptic array and the G− synaptic array, respectively. The weighted sum of the delta values should be positive in this case, since the target output neuron always has a positive delta value and the others have negative delta values. As described in Section II-B, only the hidden neurons that fire before the output neurons receive the delta values of the output neurons. The pulse scheme that implements this rule in hardware is demonstrated in Fig. 7. At the beginning of this phase, the input signal is applied to the network again. The pulse representing the delta value of an output neuron is emitted through the PWM circuit when that output neuron fires. When a hidden neuron fires, the switch connecting the delta capacitor of the hidden neuron to the synaptic array is closed. The output pulse width of the PWM circuit is 30 µs for the maximum delta value, with a time-step length of 50 µs. With this method, a hidden neuron receives the backpropagated delta value from an output neuron only if it fires earlier than that output neuron. To obtain the negative delta values of the hidden layer, the delta values of the target output neuron and of the other neurons, converted through the PWM circuit, are applied to the G− synaptic array and the G+ synaptic array, respectively.
The rest of the process is the same as that of the positive delta value.
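The PWM conversion used in both backpropagation phases can be sketched as follows; the saturation at the maximum width is an assumption of this sketch.

```python
def pwm_width_us(delta, delta_max, max_width_us=30.0):
    """Sketch of the PWM conversion in the text: a stored delta value is
    emitted as a pulse whose width is proportional to its magnitude, up
    to 30 us for the maximum delta (each time step being 50 us)."""
    return max_width_us * min(abs(delta) / delta_max, 1.0)
```

Because the pulse width only encodes the magnitude, the sign of the delta is carried by which array (G+ or G−) receives the pulse, as described above.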
After all delta values of the hidden layer are obtained, the update phases (phases 4 and 5 of Fig. 4) begin. Each synaptic device receives an update pulse corresponding to the delta value stored in the neuron connected behind it. The update phase is divided into two sub-phases, a weight-increase sub-phase and a weight-decrease sub-phase, since the delta values are separated by sign. As described in Section II-B, a weight update only occurs when the prior neuron fires earlier than the posterior neuron. To exploit this property, the input signal is applied to the network again to check the firing times of the neurons without additional memory.
In the sub-phase for weight increase, the voltage across the capacitors storing the positive delta value of each neuron is converted to a pulse width through the PWM circuit when the neuron fires. The pulse output by the PWM circuit has a magnitude of half the erase bias V_ERS or the program bias V_PGM, and a width (< 30 µs) proportional to the delta value. These pulses are applied to the BLs/SLs of the VNAND flash cell strings in the G+ synaptic array and to the WLs of the G− synaptic array connected in front of the neurons. At the same time, negative pulses with a magnitude of half of V_ERS or V_PGM are applied to the WLs of the G+ synaptic array and to the BLs/SLs of the G− synaptic array connected behind the neurons. In the G+ synaptic array, the voltage across the BLs/SLs and WLs generates GIDL and consequently strengthens the weights of the synapses. In the G− synaptic array, FN tunneling occurs and depresses the weights of the synapses. This scheme allows a synapse to be updated only when the prior neuron fires earlier than the posterior neuron. Fig. 8 demonstrates this pulse scheme. The weight increase in this sub-phase is thus accomplished by a potentiation of the G+ synaptic array and a depression of the G− synaptic array.
The sub-phase for weight decrease uses the opposite mechanism, producing a conductance decrease in the G+ synaptic array and a conductance increase in the G− synaptic array. The pulses generated from the negative delta values are sequentially applied as program pulses to cells in the G+ synaptic array (LTD process) and as erase pulses to cells in the G− synaptic array (LTP process). As a result, the weight-decrease sub-phase ends with G+ decreased and G− increased.
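A hedged sketch of how the two update sub-phases act on a G+/G− pair is given below. The `potentiate` and `depress` response functions are hypothetical placeholders for the non-linear device behavior of Fig. 1; the linear lambdas are only illustrative.

```python
def apply_update(g_plus, g_minus, delta, potentiate, depress):
    """Sketch of the two update sub-phases: a positive delta value
    potentiates G+ and depresses G- (weight increase), while a negative
    delta does the opposite (weight decrease). `potentiate` and `depress`
    are hypothetical device-response functions mapping a conductance and
    a pulse width to a new conductance."""
    width = abs(delta)  # pulse width proportional to |delta|
    if delta >= 0:
        return potentiate(g_plus, width), depress(g_minus, width)
    return depress(g_plus, width), potentiate(g_minus, width)

# Illustrative linear responses (real devices are non-linear; see Fig. 1):
pot = lambda g, w: g + 0.1 * w
dep = lambda g, w: g - 0.1 * w
```

Since the weight is w = G+ − G−, both sub-phases move the effective weight in the same direction twice, once per array.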

IV. RESULTS AND DISCUSSION
The performance of the proposed network is evaluated through MNIST dataset classification. The system-level simulation for this task is conducted using Python. We implement a network with a size of 784-400-10. The threshold of every neuron is set uniformly to 100, and the learning rate is 0.2. The synaptic weights in the network reflect the measured device characteristics presented in Section II-A. The proposed learning rules for hardware implementation are adopted.

A. TARGET FIRING TIME
We evaluated the performance of the network for various values of the hyperparameter p in the target firing time definition of Section II-B. Fig. 9 (a) shows the classification accuracy of the network as p varies from 0.1 to 0.5 in five steps, and the inset of Fig. 9 (a) presents the case where p is 0. The inset shows that the accuracy of the network decreases with increasing training epochs, as discussed in Section II-B. The accuracy of the network saturates to a particular value when p is non-zero, and the performance tends to improve as p increases. If p is greater than 0.5, there is a time interval in which the target firing time of the target neuron overlaps with that of the other neurons, so this case is not simulated. The best classification accuracy of the network is 96% at a p of 0.5. Fig. 9 (b) shows the approximated power consumption of the network for various p. The power consumption is approximated from the total amount of weight updates. As the value of p increases, the network consumes less power during the training process. As demonstrated in Fig. 9 (b), with a larger value of p, the weight updates terminate earlier and the neurons store smaller delta values. Therefore, the network consumes the least power when p is 0.5.
Given the advantages in both classification accuracy and power consumption, we set the value of p to 0.5; Fig. 3 is accordingly modified to Fig. 10. Furthermore, when p is 0.5, the target firing times of the target output neuron and the other output neurons are identical. An identical target firing time for all output neurons allows the same delta-value generation circuit to be used for every neuron, reducing the circuit complexity of the hardware implementation.
The confusion matrix of input indices and the mean firing times of the output neurons at a p of 0.5 is shown in Fig. 11. Each output neuron fires much earlier than the other neurons when an input signal corresponding to its index arrives, and later otherwise. The mean firing time of the output neurons for input signals with the correct index is 84.5.

B. NON-LINEAR CHARACTERISTICS OF SYNAPTIC DEVICES
The measured LTP and LTD characteristics of the synaptic device in Fig. 1 are non-linear. To verify the effect of this non-linearity, we compare two cases of synapse characteristics, linear and non-linear, with all other conditions identical. Fig. 12 presents the simulation results. With the non-linear synapse characteristics, the network shows a slightly higher accuracy of 96%, compared to 95.4% with the linear characteristics. Due to the non-linear LTD behavior, weights in the large-weight range are updated by relatively small amounts. This kind of update can be viewed as a variant of weight-decay regularization, which is well known for improving training accuracy [27].
The conductance change of a non-linear synaptic device becomes smaller as the conductance increases. This property keeps the device from reaching an excessive conductance level. As noted in Section II-B, the performance of the network can be degraded when the target neuron fires earlier than most of the input signals. The non-linear characteristic of the synaptic device may prevent this degradation by regulating the weight increase.
When the non-linear LTP and LTD characteristics of VNAND cells are taken into account for the synaptic weights, the accuracy exhibits large perturbations. According to Section III, for a weight increase, the G+ synaptic device is erased and the G− synaptic device is programmed sequentially. When a weight is sufficiently strengthened during the training process, the G+ synaptic device becomes fully erased and the G− synaptic device fully programmed. If a weight decrease occurs in this situation, the G+ synaptic device must be programmed and the G− synaptic device erased. Due to the LTP and LTD characteristics of VNAND cells, both updates then result in large conductance changes for the fully erased G+ device and the fully programmed G− device. This appears to cause the instability in the training curve demonstrated in Fig. 12.

C. EFFECTS OF DEVICE VARIATIONS
Intrinsic device variation is inevitable in hardware implementations [28]. Three types of device variation are considered: pulse-to-pulse variation [29], device-to-device variation [30], and stuck-at-off variation [31]. We evaluate the tolerance of the proposed network to these variations by comparing it with an off-chip learning scheme that uses ANN-to-SNN conversion. In the off-chip learning scheme, the pre-trained weight values are transferred to the conductances of the devices after the training process.
The effect of pulse-to-pulse variation is shown in Fig. 13 (a). When an update pulse is applied to a synaptic device, the fluctuation of the pulse width is modeled with a Gaussian distribution. The pulses are applied at every training step in our on-chip learning scheme, whereas they are applied only once at the end of the training process in the off-chip learning scheme. As σ/µ increases from 0 to 1, the network with the on-chip learning scheme shows immunity to the variation, whereas the accuracy degrades significantly with the off-chip learning scheme. Fig. 13 (b) presents the effect of device-to-device variation. The synaptic devices in an array all have slightly different characteristics. We model this variation by drawing the non-linearity factor β of each device from a Gaussian distribution. The network with the off-chip learning scheme shows a smaller degradation than in the case of pulse-to-pulse variation, while the network with the on-chip learning scheme still maintains its good performance.
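A minimal sketch of how these two variation types can be injected into a simulation is shown below. The Gaussian models follow the text; the zero-clipping and the σ/µ parameterization of the scale are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_pulse_width(width, sigma_over_mu):
    """Pulse-to-pulse variation sketch: each applied pulse width
    fluctuates with a Gaussian distribution (clipping at zero is an
    assumption of this sketch, not stated in the text)."""
    return max(0.0, rng.normal(width, sigma_over_mu * width))

def sample_device_betas(beta_mean, sigma_over_mu, n_devices):
    """Device-to-device variation sketch: each device draws its own
    non-linearity factor once from a Gaussian around the fitted mean."""
    return rng.normal(beta_mean, sigma_over_mu * beta_mean, size=n_devices)
```

Note the structural difference that drives the results above: `noisy_pulse_width` is resampled at every update pulse, while `sample_device_betas` is drawn once per device and then held fixed.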
We define the stuck-at-off ratio as the proportion of stuck-at-off synaptic devices to the total number of synaptic devices. Note that 10% of the synaptic devices have a conductance of 0 when the stuck-at-off ratio is 0.1. Fig. 13 (c) demonstrates the performance of the networks as the stuck-at-off ratio increases from 0 to 0.5. With the on-chip learning scheme, the accuracy of the network degrades only slightly, by about 1%, at a stuck-at-off ratio of 0.5. The accuracy of the network using the off-chip learning scheme decreases below 50% at the same stuck-at-off ratio. Overall, networks using the on-chip learning scheme are more tolerant of device variations than those using the off-chip learning scheme.
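The stuck-at-off model can be sketched as a simple conductance mask, assuming (per the definition above) that stuck devices read as zero conductance.

```python
import numpy as np

def apply_stuck_at_off(conductances, ratio, rng):
    """Stuck-at-off sketch: force a randomly chosen fraction `ratio` of
    synaptic conductances to zero, following the definition in the text."""
    g = np.asarray(conductances, dtype=float).copy()
    mask = rng.random(g.shape) < ratio
    g[mask] = 0.0
    return g
```

Applying this mask before training (on-chip case) lets the surviving devices compensate during learning, whereas applying it after weight transfer (off-chip case) leaves the errors uncorrected.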

V. CONCLUSION
In this paper, we have proposed hardware-implementable SNNs using the TTFS encoding scheme. Modified learning methods, including a constant target firing time and a delta-normalization method, were proposed with hardware implementation in mind. The entire process of the proposed network consists of five phases: one forward phase, two backpropagation phases, and two update phases. In the forward phase, input signals encoded by the TTFS method are applied to the network and transmitted to the following layers. Each backpropagation phase generates the positive or the negative delta values in all layers except the input layer. In the update phases, the synaptic devices are updated by the delta values of the post-neurons; one update phase handles the weight-increase process and the other the weight-decrease process.
VNAND flash memory cells fabricated by a memory company were used as synaptic devices in this work. The measured characteristics of the device showed that the conductance change is proportional to the update pulse width. Based on this characteristic, we employed PWM to generate the pulses for the synaptic weight updates. The measured LTP and LTD behaviors show non-linear conductance, with a non-linearity factor (β) of 2.434 for the LTP process and 3.504 for the LTD process.
The performance of the proposed training method was evaluated through MNIST dataset classification. The system-level simulation using Python was conducted for five cases of the target firing time. The network has 784 input neurons, 400 hidden neurons, and 10 output neurons. After 100 epochs of training, the network achieved its highest classification accuracy of 96% at a p of 0.5 (the target firing time at the middle of the total time window). As the value of p decreased, the accuracy of the network degraded. Additionally, the approximated power consumption of the network was lowest at a p of 0.5. We estimated the power consumption of the network from the total amount of synaptic weight updates.
We also investigated the effects of three types of non-ideal device variation: pulse-to-pulse, device-to-device, and stuck-at-off variations. The proposed on-chip trainable network was compared with a network using the off-chip learning scheme in the presence of device variation. For all presented types of device variation, the proposed system showed excellent immunity compared to the networks using the off-chip learning scheme.