An 18.7 TOPS/W Mixed-Signal Spiking Neural Network Processor with 8-bit Synaptic Weight On-chip Learning that Operates in the Continuous-Time Domain

We present a mixed-signal spiking neural network processor with 8-bit synaptic weight on-chip learning in 40 nm CMOS, consisting of 10k mixed-signal synapse circuits and 100 analog leaky integrate-and-fire (LIF) neuron circuits. The processor has no clock signal except in peripheral circuits for I/O, so the neuron and synapse circuits operate asynchronously in the continuous-time domain, just like biological neurons. We demonstrate energy efficiency of 6.24–18.7 TOPS/W in a multi-target spike learning task.


I. INTRODUCTION
Transistor shrinking is approaching its physical limits, so three-dimensional (3D) integration technologies are being studied for next-generation semiconductor devices. With 3D integration technologies, it is expected that new applications can be realized by stacking dies fabricated using different technologies, such as complementary metal-oxide-semiconductor (CMOS), micro-electro-mechanical systems (MEMS), and dynamic random-access memory (DRAM) processes. Such technologies will also reduce the delay, power consumption, and system area required for communication with other chips. However, a concern is that thermal problems will be more serious than those encountered with single-die integrated circuits. Excessive heat must be considered because heat is a more serious problem in 3D-integrated stacks than in single thin dies and thus limits the number of stacked layers per volume [1]. To realize highly stacked systems, it is important to develop a highly efficient arithmetic scheme that can avoid thermal problems. In hardware research for machine learning (ML), mixed-signal hardware based on compute-in-memory (CIM) architectures has been proposed to realize high-efficiency application-specific integrated circuits (ASICs) [2]–[11].
CIM architectures are used to reduce the power consumption of multiply-accumulate (MAC) operations. In CIM, MAC operations are carried out using analog currents and voltages, and processors employing CIM architectures have demonstrated high energy efficiency [2]–[8]. Moreover, CIM processors based on resistive random-access memory (ReRAM) have been proposed to achieve even higher energy efficiency [9]–[11]. The CIM approach has been shown to work effectively in the ultra-deep-submicron regime [8]. CIM architectures can potentially allow AI processors to process sensor data directly, without analog-to-digital conversion, thereby realizing extremely high-efficiency 3D-integrated intelligent processors for data output from MEMS sensors. However, CIM circuits are more sensitive to fabrication mismatches than conventional digital circuits are. On-chip learning can potentially reduce the influence of mismatches [12], but most CIM-based hardware is inference-only hardware with synaptic weights of 1–4 bits, chosen to reduce the footprint and energy consumption of the digital-to-analog converter (DAC); implementing CIM hardware with on-chip learning of synaptic weights exceeding 4 bits remains a challenge.
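The CIM-style MAC described above can be sketched numerically: each synapse whose input row receives a spike injects a current proportional to its weight onto a shared column line, and the summed column current is the accumulate step. This is a minimal model, not the chip's circuit; the unit current `i_unit` is an assumed illustrative value.

```python
def cim_mac(weights, spikes, i_unit=1e-9):
    """Word-level model of a compute-in-memory MAC column.

    weights : list of integer synaptic weights, one per row
    spikes  : list of 0/1 input spikes, one per row
    i_unit  : current of one weight LSB in amperes (assumed value)

    The column current is the analog sum of the selected weighted currents.
    """
    return sum(w * s for w, s in zip(weights, spikes)) * i_unit

# Example: three of four synapses receive a spike simultaneously,
# so the column carries 10 + 30 + 40 = 80 LSB currents.
i_col = cim_mac([10, 20, 30, 40], [1, 0, 1, 1])
```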
Besides thermal problems, 3D integration has a global clock distribution problem. Because it is difficult to synchronize global operations among several chips using a common clock signal, it is important to select a configuration that does not require synchronization between chips in a 3D stacked circuit. Spiking neural networks (SNNs) have been proposed as an asynchronous operation model. SNN hardware has already been implemented as digital circuits with a clock signal [13]–[17] and as analog or mixed-signal circuits without any clock signal [18]–[23]. Analog SNN hardware operates in the continuous-time domain without a clock signal, eliminating the power consumption that would otherwise be necessary for clock signal distribution. Furthermore, system scaling by chip stacking becomes easy.
With the aim of realizing a component for scalable ML systems using 3D stacking technology, we propose an SNN processor that satisfies three important criteria: high-efficiency computing with a CIM architecture, asynchronous operation without clock signals, and on-chip learning with synaptic weights exceeding 4 bits. We designed a prototype using a TSMC 40 nm CMOS process that operates in the continuous-time domain, the same as biological neurons. We employed the remote supervised method (ReSuMe) [24] as the supervised learning algorithm.
The remainder of this paper is organized as follows: In Section 2, we describe the learning algorithm implemented in our circuit. Section 3 describes the proposed circuit and implementation of the synapse and neuron circuits. Section 4 presents experimental results for the proposed circuit, and Section 5 concludes.

II. LEARNING ALGORITHM
The remote supervised method (ReSuMe) [24] shown in Fig. 1 is a supervised-learning algorithm for SNNs in which weight updates are based on the ith presynaptic spike train S_pre,i(t), the postsynaptic spike train S_post,j(t) output from the jth neuron, and the target spike train S_tgt,j(t) for the jth neuron. This algorithm can learn multi-target spikes and can also be applied to various spiking neuron models, including the leaky integrate-and-fire (LIF) [25], Hodgkin-Huxley [26], and Izhikevich [27] neuron models. This algorithm is expressed as

\frac{dw_{ij}(t)}{dt} = \left[ S_{tgt,j}(t) - S_{post,j}(t) \right] \left[ a_d + f_{ij}(s_{ij}) \right],

where t is the continuous time, a_d is a non-Hebbian term, and s_ij is the delay between the S_pre,i(t) and S_tgt,j(t) firings (s_ij = t_pre,i − t_tgt,j). The exponential kernel f_ij(s_ij) is

f_{ij}(s_{ij}) = \begin{cases} A_R \exp(s_{ij}/\tau_R) & (s_{ij} \le 0) \\ 0 & (s_{ij} > 0), \end{cases}

where A_R is the amplitude of long-term potentiation and τ_R is the time constant of exponential decay. In our circuit, we set A_R = A_+ = A_−.
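The ReSuMe rule above can be sketched as a small event-based routine: each target spike potentiates the weight by a_d plus the kernel of the pre-to-target delay, and each output spike depresses it by the same amounts. This is a minimal sketch assuming instantaneous (delta) spike trains and the exponential kernel as reconstructed; the numeric arguments in the example are illustrative.

```python
import math

def resume_dw(t_pre, t_tgt, t_post, a_d=0.0, A_R=1.0, tau_R=10e-6):
    """Total ReSuMe weight change for one synapse (sketch).

    t_pre, t_tgt, t_post : lists of spike times (s) for the pre-spike,
    target, and output spike trains. A_R = A+ = A- as in the text.
    """
    def kernel(s):
        # f_ij(s) = A_R * exp(s / tau_R) for s <= 0 (pre precedes), else 0
        return A_R * math.exp(s / tau_R) if s <= 0 else 0.0

    dw = 0.0
    for tt in t_tgt:                      # potentiation toward target spikes
        dw += a_d + sum(kernel(tp - tt) for tp in t_pre)
    for to in t_post:                     # depression for actual output spikes
        dw -= a_d + sum(kernel(tp - to) for tp in t_pre)
    return dw

# A pre-spike 5 us before the target, and a late output spike at 20 us:
# the potentiation term dominates, so the net change is positive.
dw = resume_dw(t_pre=[0.0], t_tgt=[5e-6], t_post=[20e-6], tau_R=10e-6)
```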

III. PROPOSED CIRCUIT

A. CHIP ARCHITECTURE
The proposed circuit is implemented based on the compute-in-memory architecture shown in Fig. 2 to achieve high-efficiency MAC operations. This architecture consists of a mixed-signal synapse circuit and an analog leaky integrate-and-fire (LIF) neuron circuit. The synapse and neuron circuits have no clock signal and operate asynchronously in the continuous-time domain, the same as actual neural systems. The neuron-synapse array macro performs a MAC operation when a pre-spike arrives; thus, processor power consumption depends on the frequency of the pre-spike input and the values of the voltage sources. Synaptic weights are stored in localized flip-flops in the synapse circuit, and each synapse outputs an analog current weighted by its synaptic weight. The macro in the fabricated chip consists of the column circuit shown in Fig. 2.

Figure 3 shows the architecture of the SNN processor, which consists of a 100×100 mixed-signal synapse array and a 100×1 analog LIF circuit array. Input spikes (pre-spikes) and output spikes (post-spikes) are input and output in parallel using a 7-bit decoder and encoder, respectively. Each decoder has 100 output nodes for pre-spike inputs. Target spikes for supervised learning using ReSuMe are input through a serial-to-parallel converter (S2P). By restricting the operating neuron circuits, the processor can select between 100-input mode and 1,000-input mode. The mode is changed by a 1-bit selection signal SL. Subsection III-B describes the method for restricting neuron circuits.

Figure 4 shows details of the neuron circuit, which consists of a pulse generator (PG), a leakage transistor M_lk, a reset switch, transistors M_ip and M_in, and a membrane capacitor, where voltage V_xrst is the reset voltage of the membrane potential. The PG realizes threshold processing for generating a spike pulse using an inverter.
Bias voltages V_bp and V_bn adjust the threshold voltages of the inverters for threshold processing by restricting the current for charging/discharging the gate capacitance in the next stage. Transistors M_ip and M_in supply bias currents to reduce the time variation of V_x,j(t) induced by leak current from the synapse array. Membrane capacitor C_x,j consists of a MOM capacitor and the parasitic capacitance of the synapse array; its design value is 32.7 fF. Note that bias voltages V_ip, V_in, V_lk, V_bp, and V_bn and reset voltage V_xrst are common to all neuron circuits.
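The analog LIF behavior just described (leak through M_lk, threshold detection by the PG inverter, reset to V_xrst) can be approximated with a simple discrete-time model. Only the 32.7 fF membrane capacitance comes from the text; the leak current, threshold, step size, and drive current are assumed illustrative values, not the chip's.

```python
def lif_step(v, i_syn, dt=1e-8, c_x=32.7e-15, i_leak=1e-9,
             v_th=0.5, v_rst=0.0):
    """One Euler step of a leaky integrate-and-fire membrane.

    v      : membrane potential V_x,j (V)
    i_syn  : net synaptic input current (A)
    c_x    : membrane capacitance (32.7 fF design value from the text)
    i_leak : leak current through M_lk (assumed value)
    v_th   : PG inverter threshold (assumed value)
    Returns (new_v, fired).
    """
    v += (i_syn - i_leak) * dt / c_x
    v = max(v, v_rst)            # the leak cannot pull V_x below the reset level
    if v >= v_th:                # threshold crossing detected by the PG inverter
        return v_rst, True       # emit a spike pulse and reset to V_xrst
    return v, False

# Drive the membrane with a constant 100 nA synaptic current:
v, fired = 0.0, False
for _ in range(100):
    v, f = lif_step(v, i_syn=100e-9)
    fired = fired or f
```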

B. NEURON CIRCUIT
Three registers in the neuron circuit change the number of synapse circuits per neuron circuit. The first register sets the neuron circuit to active or inactive. Membrane potential node V_x,j(t) connects to the next membrane potential node V_x,j+1(t) and to the output node of the synapse circuits via switches SW_2 and SW_1, respectively. The ON/OFF states of SW_1 and SW_2 are controlled by the second and third registers, respectively. For example, in the case of 100 synapses per neuron, the values of the first, second, and third registers are 1, 1, and 0, respectively. To increase the number of synapse circuits per neuron, the registers of an inactive neuron are set to 0, 0, and 1, and the metal-line-connected membrane capacitor C_x,j is shared with the next neuron.

Figure 5 shows a block diagram of the synapse circuit. The synapse consists of a delay-line array and update signal generator (DLA&USG), eight toggle flip-flops (T-FFs), and a DAC. The DLA&USG generates update signals for the synaptic weights held in the flip-flops. The DAC outputs an analog current according to the synaptic weight when pre-spike S_pre,i is input. Voltages V_bDAC, V_bUSGA, V_lkn, V_lkp, V_rsn, and V_rsp are analog bias voltages, the roles of which are described below.
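The three-register scheme above can be expressed as a small configuration routine. The register tuples follow the two examples in the text ((1, 1, 0) for an active neuron with 100 synapses, (0, 0, 1) for an inactive neuron sharing its capacitor); extending this pattern to 1,000-input mode (one active neuron per ten columns) is our inference, so treat it as a sketch.

```python
def neuron_config(n_inputs_per_neuron, n_neurons=100, n_cols=100):
    """Return (active, sw1, sw2) register settings for each neuron circuit.

    100 synapses/neuron  -> every neuron set to (1, 1, 0)
    more synapses/neuron -> intervening neurons set to (0, 0, 1),
    sharing their membrane capacitor C_x,j with the next active neuron.
    """
    share = n_inputs_per_neuron // n_cols   # columns merged per active neuron
    cfg = []
    for j in range(n_neurons):
        if j % share == 0:
            cfg.append((1, 1, 0))           # active neuron
        else:
            cfg.append((0, 0, 1))           # inactive, capacitor chained onward
    return cfg

cfg = neuron_config(1000)   # 1,000-input mode: every 10th neuron is active
```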

C. SYNAPSE CIRCUIT
Because synaptic weights are held in flip-flops in our circuit, if the kernel functions were expressed as analog continuous waveforms, an analog-to-digital converter (ADC) would be required because f_ij(s_ij) has an analog value. To avoid using an ADC, we discretize the kernel function into five digital time windows consisting of five digital pulses S_D1(t)–S_D5(t), as shown in Fig. 6(a). With this modification, f_ij(s_ij) and τ_R are expressed as f^D_ij(s_ij) and \sum_{q=1}^{5} T_{wq}, respectively. Pulse widths T_w1–T_w5 are adjusted by V_b1–V_b5, respectively. The non-Hebbian term a_d was not implemented in our circuit.
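The discretization above amounts to finding which of the five digital pulses contains a given delay after the pre-spike. A minimal sketch, assuming the windows are contiguous (τ_R is the sum of the widths) and using illustrative equal widths:

```python
def window_index(delay, widths):
    """Return the time-window index q (1-based) whose pulse S_D,q(t)
    contains `delay` measured from the pre-spike, or None if the delay
    falls outside all five windows (i.e., beyond tau_R = sum(widths)).

    widths : pulse widths [T_w1, ..., T_w5], set by V_b1-V_b5 on the chip
    """
    edge = 0.0
    for q, w in enumerate(widths, start=1):
        edge += w
        if delay < edge:
            return q
    return None

widths = [1e-6] * 5          # illustrative equal widths; tau_R = 5 us
q = window_index(2.5e-6, widths)   # falls inside the third window
```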
Synaptic weights are varied when a spike pulse of S_tgt,j(t) or S_post,j(t) is input. The values of dw_ij/dt and f^D_ij(s_ij) depend on the time-window index q. In the case of a positive update, one is added to the qth flip-flop when S_tgt,j(t) is included in the qth time window. In the case of a negative update, one is added to all flip-flops except the qth, and then one is added to the LSB, when S_post,j(t) is included in the qth time window. Note that negative updates are thus realized in two's-complement form. Figure 6 also illustrates this update operation.

The DLA&USG consists of five DL circuits and one USG. The digital time window S_D,q(t), having pulse width T_wq, is output from each DL. The DL includes a transistor biased by V_bias (see Fig. 7(c)). This transistor suppresses the rising slope of V_A generated at the trailing edge of S_in (see Fig. 7(d)). The suppression results from limiting the current that charges the parasitic capacitor at the drain node of the biased transistor, with V_bias adjusting the slope. Varying the slope changes the time needed to reach the threshold voltage V_invth of an inverter; as a result, T_wq varies.

Figure 8 shows the details of the T-FF, which is inverted at the trailing edge of an update signal. To achieve asynchronous addition, subtraction, and carry, we employed a circuit comprising a T-FF and an XOR gate. By connecting the XOR to the output stage of a T-FF, the T-FF is inverted when an adjacent lower bit switches from High to Low (carry) or when S_UD,n switches from High to Low, where n is the index of the T-FF. Calculation results for the synaptic weight can be unstable during subtraction if the S_UD,n signals are input at almost the same time. To avoid this problem, we shifted the timing of S_UD,n so that the signals arrive in order from the MSB to the LSB.

The DB consists of seven AND gates, seven OR gates, and one NOT gate, and generates switching signals for the current sources in the AB. S_1p–S_7p and S_1n–S_7n are connected to PMOS/NMOS switched-current sources (SCSs).
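The per-window updates described above can be checked at the word level. Adding one to the qth flip-flop is w += 2^(q−1); adding one to every flip-flop except the qth and then one to the LSB is exactly the two's complement of 2^(q−1), i.e., w −= 2^(q−1) modulo 256. A sketch, assuming flip-flops are indexed LSB-first:

```python
def update_weight(w, q, positive, bits=8):
    """Apply one discretized ReSuMe update to an 8-bit stored weight w.

    Positive update: add 1 to the qth flip-flop, i.e. w += 2**(q-1).
    Negative update: add 1 to every flip-flop except the qth, then 1 to
    the LSB -- the two's complement of 2**(q-1) -- i.e. w -= 2**(q-1).
    """
    mask = (1 << bits) - 1
    step = 1 << (q - 1)
    if positive:
        return (w + step) & mask
    # two's-complement subtraction, as performed by the T-FF/XOR chain
    return (w + ((mask ^ step) + 1)) & mask

new_w = update_weight(100, q=3, positive=False)   # 100 - 4 = 96
```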

3) Flip-flops in Detail
The AB consists of a current mirror block (shaded area) and NMOS/PMOS transistors acting as the SCSs. The current mirror block generates the gate voltages for the SCSs. The generated voltages depend on V_DAC, which sets the source-drain current values of M_7p and M_7pb. We can obtain a gate voltage for which the source-drain current of M_6nb is half that of M_7nb by setting the aspect ratio W/L of M_7nb to twice that of M_6nb. The gate voltages corresponding to the current values of the lower bits are generated by the same procedure. Tables 1 and 2 show the W/L ratios of the transistors comprising the AB when the W/L of M_7pb and M_7nb, respectively, are defined as unity. The current ratios in these tables are design values when the source-drain current of M_7pb is defined as unity.

Figure 10 shows waveforms at each node during ReSuMe learning. The process of supervised learning in the designed circuit is summarized in the following subsection.
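The binary-weighted mirror sizing just described (each lower bit carrying half the current of the bit above it) gives a DAC output proportional to the stored code. A word-level sketch with an assumed MSB unit current:

```python
def dac_current(code, i_msb=1.0, bits=8):
    """Binary-weighted DAC output.

    The MSB branch carries i_msb (an assumed unit current), and each
    lower bit carries half the current of the bit above it, as set by
    the W/L ratios of the mirror transistors in the AB.
    """
    total = 0.0
    for n in range(bits):                  # n = 0 is the LSB
        if (code >> n) & 1:
            total += i_msb / 2 ** (bits - 1 - n)
    return total

# Full-scale code 255 sums all eight binary-weighted branches
# (1 + 1/2 + ... + 1/128 = 2 - 2**-7 unit currents):
i_fs = dac_current(255)
```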

D. SUPERVISED LEARNING OPERATION
1) Pre-spike S_pre,i(t) is input, and the membrane potential is charged or discharged according to the synaptic weight, which is positive when its stored value is less than or equal to 127 and negative otherwise.
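The sign convention above (stored values up to 127 positive, larger values negative) is the standard two's-complement reading of the 8-bit weight, and can be made explicit:

```python
def signed_weight(code, bits=8):
    """Interpret an 8-bit stored weight in two's complement:
    codes 0-127 are non-negative, codes 128-255 are negative."""
    return code - (1 << bits) if code >= (1 << (bits - 1)) else code

w = signed_weight(200)   # a stored code above 127 acts as a negative weight
```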

E. FABRICATED CHIP
We designed and fabricated the proposed circuit using TSMC 40-nm (1-poly, 8-metal) CMOS technology. Figures 11(a) and (b) show a whole-chip microphotograph and the single-synapse circuit layout, respectively.

IV. RESULTS OF CIRCUIT EXPERIMENTS
A. SYNAPSE CHARACTERISTICS
The prototype chip has nodes for observation and experiments, through which we can measure the membrane potential. Figure 12 shows measurement results for synaptic weight versus change in membrane potential when the synapse receiving simultaneous input was changed from #1 to #4. Note that this membrane potential waveform is that of the 100th neuron obtained through a source follower. The characteristics shown in Fig. 12 are equivalent to DA conversion characteristics and thus should ideally be linear. However, as the figure shows, the characteristics were sigmoidal, nonlinear, and noisy. Table 3 shows the slopes and intercepts obtained by linear fitting. The slope for #2 is about twice as large as that for #1, but the slopes for #3 and #4 are not three and four times the slope for #1. This is attributed to the transistor characteristic whereby the current value decreases as the drain-source voltage decreases.

B. MULTI-TARGET SPIKE LEARNING TASK
To demonstrate functionality for high-efficiency on-chip learning, we conducted a multi-target spike learning task. In 100-input mode, one learning period was set to 80 µs (pre-spike train input: 60 µs; wait: 20 µs), and all synaptic weights were set to 255 (= (11111111)_2). Pre-spikes were input in order from S_pre,1 to S_pre,100 every 0.6 µs. Triple spikes were used as target spikes, with firing times set to 1,960 ns, 2,930 ns, and 4,420 ns. Target spikes were set to the same spike train for all neurons.

Figure 13 shows the voltage waveforms of S_tgt,100(t), S_post,100(t), and V_x,100(t) during the task. The neuron fires at high frequency when the number of iterations p is unity, as shown in panels (a) and (b). The number of spikes decreases as learning progresses (see panel (b)), but there is nearly no decrease in the case of no learning (see panel (a)). The firing times of S_post,100 almost converged to the firing times of S_tgt,100 after thirty learning cycles.

Figure 14 shows power consumption in the standby, learning, and inference states. Power consumption in the neuron array is low in the standby state because the neuron circuits do not fire. The power consumption in the neuron array and I/O during learning and inference was nearly the same, but that in the synapse array was higher during learning. This difference is likely due to DLA&USG operation and T-FF updating.

Table 4 shows performance results and a comparison of the proposed processor with conventional SNN hardware and artificial neural network (ANN) hardware. We calculated energy efficiency with 1 MAC defined as 2 OP. Energy efficiency was higher in mixed-signal processors with a CIM architecture than in digital processors. Of these, our prototype processor showed the best energy efficiency in learning operations with 8-bit synaptic weights.
These results show that even when using conventional CMOS technology, SNN hardware with on-chip learning can achieve very high energy efficiency when combined with CIM and an asynchronous architecture without a clock signal.

V. DISCUSSION
CMOS ANN hardware for inference has already achieved high energy efficiency of more than 600 TOPS/W with MAC operations based on a CIM architecture [29] by limiting the bit width of the synaptic weights, as shown in Table 4. This energy efficiency is higher than that of ANN hardware using ReRAM [10], [28]. Thus, if we focus only on the efficiency of the MAC operation, there would seem to be no advantage in adopting non-CMOS memory such as ReRAM or phase-change memory (PCM) for synapses in exchange for the risk of higher manufacturing cost and lower yield.
However, the physical characteristics of non-CMOS memory that allow analog information to be input and output as an analog signal can be a basic element of information processing without an ADC. Such an element is suitable for realizing information processing devices in which a sensor is directly connected to the information processing unit. In such a system, the power required for ADC can be reduced, and thus highly efficient information processing can be expected when viewing the system as a whole.
As Table 4 shows, the energy efficiency of ASICs with learning is only a few TOPS/W, which is lower than that of ASICs without learning. This is presumably due to the von Neumann-type architecture, in which separate blocks are used for memory and for weight updates. In this study, we sacrificed integration density and achieved high energy efficiency of up to 18.7 TOPS/W during learning by distributing the memory and weight update circuits in each synapse circuit. This efficiency during learning was higher than the 15.4 TOPS/W during inference because the number of operations per learning step is larger than the 2 OP of the MAC operation (see Note c in Table 4), and these operations are executed efficiently by using the time domain.
The general flow of learning is to calculate the difference between the target value and the output of the neuron, and then to apply a function to the difference to determine the weight update amount. By expanding the information in the time domain, the difference and the function can be calculated simultaneously using a time window function. SNNs can naturally handle temporal information and thus are suitable for implementing efficient on-chip learning hardware.
Network configurations and learning algorithms that can take advantage of the characteristics of SNNs are still in the exploratory stage. Loihi [15] and SpiNNaker 2 [34] are designed to allow flexibility in the learning algorithm and network configuration. In this study, we limited the learning algorithm to ReSuMe and restricted this flexibility, which resulted in high energy efficiency during learning, as shown in Table 4. This result is one example of highly energy-efficient on-chip learning hardware that can be realized even with a manufacturing process as old as 40-nm CMOS by using a circuit configuration that is specific to a particular application.

Notes for Table 4:
a) This value was calculated from "Energy per synaptic spike op (min) = 23.6 pJ," shown in Ref. [15], when 1 MAC is 2 OP.
b) These values were obtained from the macro, which is the synapse circuit array.
c) A learning operation consists of pre-spike × weight and its summation (2 OP), post-spike × time window and its subtraction from the weight (2 OP), and target spike × time window and its addition to the weight (2 OP). One learning operation is thus generally 6 OP, but not always, because the neuron circuit does not necessarily fire on every pre-spike input.
d) To increase the range of conductance available in a synapse, multiple PCMs were used for a given magnitude of conductance update.
e) Executing 8-bit matrix multiplications from local SRAM in the 16×4 MAC accelerator.

VI. CONCLUSION
We fabricated prototype SNN hardware with 8-bit synaptic weight on-chip learning based on CIM in TSMC 40 nm CMOS, and demonstrated high-efficiency on-chip learning operations using the fabricated chip. The prototype operates in the continuous-time domain, the same as biological neurons, because the neuron and synapse circuits have no clock signal. The architecture based on CIM and asynchronous operation without a clock signal showed energy efficiency higher than that of conventional CIM-based SNN hardware. Furthermore, even when the input-output characteristics of the synapses were noisy and nonlinear, the output of the fabricated chip converged to the target signal. This architecture can contribute to the implementation of highly energy-efficient learning in an SNN processor using conventional CMOS technologies, but the integration density of the neuron and synapse circuits remains low because the processing units are physically arranged without reusing a single processing unit in a time-division manner.
In future studies, we will take into account that system scaling using stacked dies depends on the energy efficiency of the stacked chips comprising the system, because heat dissipation will limit the system size. Highly efficient circuit architectures that sacrifice on-chip integration density may therefore be an option for realizing large-scale systems using 3D integration technologies.

APPENDIX A CALCULATIONAL PROCEDURE FOR ENERGY EFFICIENCY
We explain the calculational procedure for the energy efficiency of our fabricated chip during inference and learning. These values were calculated from the power consumption of the synapse array minus the standby power consumption (= 27.02 µW). The energy consumption was calculated from the power consumption when running an 80 µs operation sequence in which spikes were input into 10,000 synapse circuits.

A. INFERENCE
Power consumption over inference, including standby, was 43.25 µW.

B. LEARNING
A learning operation consists of pre-spike × weight and its summation (2 OP), post-spike × time window and its subtraction from the weight (2 OP), and target spike × time window and its addition to the weight (2 OP). One learning operation is thus generally 6 OP.
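The inference figure can be reproduced schematically from the quantities given in the text: 100 pre-spike inputs, each driving a 100-synapse MAC at 2 OP per MAC, over the 80 µs sequence, with the 27.02 µW standby power subtracted from the 43.25 µW inference power. The per-sequence operation count here is our reading of the task description, so treat the sketch as illustrative.

```python
def tops_per_watt(ops, power_w, time_s):
    """Energy efficiency in TOPS/W: operations / (power * time) / 1e12."""
    return ops / (power_w * time_s) / 1e12

# Inference over one 80 us sequence:
# 100 pre-spikes x 100 synapses x 2 OP per MAC = 20,000 OP,
# at the synapse-array power minus the 27.02 uW standby power.
ops = 100 * 100 * 2
eff = tops_per_watt(ops, power_w=43.25e-6 - 27.02e-6, time_s=80e-6)
# eff is close to the 15.4 TOPS/W inference figure quoted in the Discussion
```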