A Survey of Intelligent Chip Design Research Based on Spiking Neural Networks

Traditional neural network intelligent chips suffer from high power consumption because of their classical computing architecture, which limits their development. Stochastic computing (SC) encodes binary numbers into stochastic pulse sequences during operation and offers low power consumption and high performance. Applying SC to intelligent chips for spiking neural networks (SNNs) helps to solve the high power consumption of traditional neural network chips. This article first summarizes the basic elements of SNNs and the basic principles of SC. We then review the development trends of SC-based neural network chips and of existing SNN chips under research at home and abroad, and analyze the current problems. Finally, SNN chips based on SC are reviewed. Through this systematic summary, the paper aims to provide new research directions and ideas for the field of SNN chips.


I. INTRODUCTION
The rapid development of computer hardware has promoted the development of deep learning, which has made great achievements in autonomous driving [1], pattern recognition [2], data classification [3], etc. However, at its current stage deep learning hinders the further development of artificial intelligence because of its high power consumption, long training time, and low brightness [4]. Unlike traditional artificial neural networks (ANNs), biological neural networks communicate through discrete pulses rather than numerical values, and SNNs are built on this principle. In SNNs, neurons are activated only when input pulses are received. Inactive neurons without input pulses can therefore be placed in a low-power mode, which reduces power consumption and simplifies computation. As a result, SNNs can potentially achieve extremely low power consumption compared to ANNs, especially when implemented with analog/mixed-signal (AMS) circuits. In addition, spiking neural network-based brain-like computing is a better way to overcome the shortcomings of the current deep learning stage and solve artificial intelligence problems, because its working mechanism is closer to that of the biological brain [5], [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
SNNs can be implemented in software or in hardware. Software simulation can quickly realize neuron modelling and real-time data analysis of network communication systems, and it plays a vital role in numerical processing and optimization. Hardware implementation focuses more on the design and implementation of neuromorphic hardware architectures. Studies have shown that although software implementation offers strong flexibility and high precision [8], it cannot fully use the high parallelism of neural networks, its processing speed is slow, and its power consumption is high. Hardware implementation can remedy these deficiencies and fully reflects the high parallelism of the neural network [9].
According to their hardware implementation, neural network chips can be divided into analog, digital, and digital-analog hybrid chips [10]. Analog circuits have higher computational accuracy, but because of their design complexity the scale of the neural networks they implement is usually small. These factors also make it impossible to guarantee consistent neural network behavior between chips, so purely analog methods are rarely used to build large-scale brain-like systems. Most large brain-like computers that have emerged are digital or mixed digital-analog. However, current general-purpose CPU, GPU, and TPU chips have two fatal problems. First, convolution over large amounts of data causes a surge in the power consumption of traditional AI chips, which is not conducive to edge-side deployment of AI, i.e., the ''power wall'' problem. Second, because a deep network includes many weight parameters, it places higher requirements on bandwidth and latency, resulting in a computational bottleneck for the whole system, i.e., the ''memory wall'' problem [11].
As an unconventional computing paradigm, stochastic computing is one of the important ways to implement neural networks. It has attracted attention because its low-cost, low-loss circuitry can perform the functions of complex circuits with lower hardware overhead and better fault tolerance. Stochastic computing uses discrete pulse sequences instead of binary numbers and therefore occupies fewer computational resources. At the same time, it pays the price of higher computational latency and lower computational accuracy. Researchers have made preliminary attempts to address these issues. Brown and Card used finite state machines (FSMs) to implement nonlinear functions and thus improve computational accuracy [12].
Based on this idea, Smithson et al. proposed using FSMs to implement the pulse firing process of leaky integrate-and-fire (LIF) neurons, together with a hardware architecture [13]. The deterministic bit stream proposed by Faraji can significantly reduce energy consumption without changing the network performance [14]. Sim et al. proposed an improved probabilistic coding method to optimize computational latency and computational accuracy [15]. Kim et al. proposed using SC in deep neural networks for dynamic energy-accuracy trade-offs to improve computational hardware efficiency [16]. Liu et al. proposed an energy-efficient Deep Belief Network (DBN) based on the online learning ability of SC and improved its computational efficiency by improving the random numbers used [17]. Jenson and Riedel proposed an SC method for deterministic bitstreams, which can achieve better computational accuracy [18]. In 2020, Huang's team investigated the reliability of stochastic logic circuits based on Fin Field-Effect Transistor (FinFET) technology, which provides a good prospect for the application of new nanodevices [19].
However, building neural network computing architectures with SC methods is still challenging. When SC is used in the design of SNN chips, more efficient parallel processing units can be constructed from the characteristics of SC, and the improved computational efficiency enhances the performance of SNN accelerators. With an enhanced SC coding method, a stochastic computation-based spiking neural network chip can offer both low power consumption and high efficiency.
In this paper, starting from the characteristics of SNNs, such as good bionic properties, high efficiency, and low power consumption, we review in turn the neuron models, network topologies, and learning algorithms for brain-like SNN chips. Section 2 describes the basic elements and biological background of SNNs and the basic principles of SC. Section 3 presents the software optimization of SNNs and the communication protocol of the SNN accelerator. Section 4 introduces the research progress of three traditional types of SNN chips and summarizes the existing SNN chips. Section 5 focuses on the application of SC in traditional neural network chips and SNN chips. Section 6 provides an outlook on the future development of SNN chips. Section 7 summarizes the work of this paper.
Specifically, the main contributions of this paper are as follows:
1) We briefly outline some basic structures of SNNs and probabilistic computation, including neuron models, network topologies, encoding methods, and learning algorithms.
2) We summarize the research progress of traditional neural network chips and neural network chips based on SC, to help readers understand the current problems faced by smart chips.
3) We discuss the main challenges faced by traditional SNN chips and introduce SNN chips based on SC.
4) We propose two development directions for spiking neural network smart chips: crossbar switch arrays combined with NVM devices and 3D integration technology, and stochastic computing-based SNN smart chips.

II. BASIC ELEMENTS OF SNN AND STOCHASTIC COMPUTING
A. SPIKING NEURAL NETWORK
The ANN is an abstraction and simulation of the structure and function of the biological nervous system and plays an important role in information processing and pattern recognition. SNNs are a special kind of ANN, also known as third-generation ANNs, in which neuron units communicate using discrete spike trains, as shown in Figure 1. Similar to a biological neuron, a spiking neuron receives discrete spikes as input, and only when the input exceeds a certain threshold is a pulse released to the next neuron. SNNs also incorporate temporal dynamics, which makes them suitable for real-time operation with event- and data-driven updates.
Figure 1(a) depicts a typical ANN with artificial neurons as the computing unit. A continuous function is used as the neuron's activation function to process real-valued inputs and outputs. The calculation can be described by formula (1):
$$y = \varphi\Big(\sum_{j} \omega_j x_j + b\Big) \tag{1}$$
where x, y, ω, and b represent the input, output, synaptic weight, and bias, j is the input neuron index, and φ(.) is the activation function. Neurons in an ANN communicate using high-precision, continuous-valued encodings and propagate information layer by layer only in the spatial domain [20].
Figure 1(b) depicts the SNN with spiking neurons as the computing unit. Spiking neurons have a structure similar to ANN neurons but behave differently: they communicate information through binary, time-coded spike trains, and their dendrites integrate input spikes rather than continuous activations. A spiking neuron can be described as follows [21]:
$$\frac{dX}{dt} = f(X), \qquad X \leftarrow g_i(X) \ \text{upon a spike from synapse } i$$
where X represents the vector consisting of the state variables of the neuron, f(.) represents the differential equation for the evolution of the state variables, and g_i(.) represents the change of the state variables caused by the spike events of synapse i. Neural information in spiking neurons is transmitted and processed by precisely timed spike trains.
Compared with the ANN model, SNN can describe the real biological nervous system more accurately, thus realizing efficient information processing.

1) NEURON MODEL
The spiking neuron is the basic unit of the SNN, and its main function is to integrate and transmit the information encoded in pulse sequences. Whether a spiking neuron releases a pulse is closely related to its membrane potential and activation threshold. When a single spiking neuron in an SNN fires, it has received input pulses through several dendrites and sends its output pulse along its axon. Many neurons form a network and learn systematically [22], as shown in figure 2.
The Hodgkin-Huxley (H-H) model is a biologically interpretable physiological model. It describes the change of the neuron's membrane potential through the dynamic characteristics of the Na and K ion channels. The H-H model is very complex.
The LIF model is the most widely used model at present and is a simplification of the H-H model. It achieves a good balance between complexity and computational accuracy. The higher the frequency of external stimulation of the LIF neuron model, the larger its activation probability [33]. The LIF neuron unifies the expressions of action potentials and reduces the complexity of operations, but it cannot explain the fundamental pulse generation mechanism and does not include rich behavioural properties, so the LIF neuron model can only simulate a small number of neuronal behaviours [34], [35], [36].
The Izhikevich (IZH) neuron model has high computational efficiency and physiological characteristics similar to the H-H model. Different neuron firing patterns can be simulated through the selection of parameters [37], [38]. The IZH neuron model is therefore implemented with analog circuits in some digital-analog hybrid neuromorphic systems, but it is not widely used because its equations are still complex.
The integrate-and-fire (IF) model can only passively accept the external current input. After a long time, the charge on the capacitor is released. Moreover, the IF model does not have any reset method. The IF model already possesses some of the physiological characteristics of neurons and has been widely used in neurocomputing science. However, many simulation experiments show that IF neurons cannot reproduce the charging and discharging behavior of real neurons, so the model is not perfect.
The spike response model (SRM) is a generalization of the IF model, so its simulated biological properties are improved. Because a wide range of kernel functions can be selected, the SRM model has a certain generality compared with the LIF model. However, the SRM model is too simple to simulate many neuron charging and discharging characteristics and has certain limitations. A summary of the five common spiking neuron models is shown in table 1.
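To make the LIF behaviour discussed above concrete, the following minimal sketch integrates a discretized LIF neuron with the forward Euler method. The function name `simulate_lif` and all parameter values (time step, membrane time constant, threshold, reset potential, membrane resistance) are illustrative assumptions and are not taken from any of the chips surveyed here.

```python
import numpy as np

def simulate_lif(current, dt=1e-3, tau_m=20e-3, r_m=1.0,
                 v_rest=0.0, v_reset=0.0, v_thresh=1.0):
    """Forward-Euler simulation of a leaky integrate-and-fire neuron.

    current: 1-D array with one input-current value per time step.
    Returns (membrane potential trace, boolean spike array).
    All default parameter values are illustrative assumptions.
    """
    v = v_rest
    v_trace = np.empty(len(current))
    spikes = np.zeros(len(current), dtype=bool)
    for t, i_in in enumerate(current):
        # Leak toward the resting potential plus integration of the input.
        v += (dt / tau_m) * (-(v - v_rest) + r_m * i_in)
        if v >= v_thresh:
            spikes[t] = True   # emit a spike
            v = v_reset        # reset the membrane potential
        v_trace[t] = v
    return v_trace, spikes

# Three constant inputs: below threshold, weak, and strong stimulation.
n_sub = simulate_lif(np.full(1000, 0.5))[1].sum()
n_weak = simulate_lif(np.full(1000, 1.5))[1].sum()
n_strong = simulate_lif(np.full(1000, 3.0))[1].sum()
print(n_sub, n_weak, n_strong)
```

Under these assumptions, a sub-threshold input never fires, and a stronger input fires more often than a weaker one, which matches the qualitative statement that stronger stimulation increases the activation of the LIF neuron.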

2) ENCODING METHOD
In addition to the difference in neurons, the most significant difference between SNNs and traditional neural networks lies in information encoding and processing. In the SNN, data are input and output as pulse sequences, so analog quantities must be converted into pulse sequences that carry time information. The most widely used coding methods are temporal (time) coding and frequency coding.
Temporal encoding focuses on differences in temporal structure: the time from receiving a stimulus to sending the first pulse, and the temporal relationships between pulses, carry the information. One such scheme [39] uses the time at which a neuron first fires a spike to represent information, emphasizing the time of the first spike and ignoring other spike times or reducing their weight. Chien et al. proposed using the inter-spike interval (ISI) to encode activation strength [40].
Frequency coding is mainly based on the pulse firing frequency, i.e., the average number of pulses fired by the neuron over the corresponding recording time. Because the firing frequency of a neuron is positively correlated with the intensity of external stimulation, the stimulation intensity can be expressed by the firing frequency. Strong stimulation leads to high-frequency pulse sequences, and weak stimulation leads to low-frequency pulse sequences. Frequency coding only pays attention to the number of pulses in the time window and ignores the ISI. It cannot make full use of the temporal and spatial information contained in the pulse sequence, so its efficiency is not high [41]. Still, the non-uniqueness of the pulse sequence gives frequency coding strong noise immunity. Time coding and frequency coding are compared in figure 4.
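The two coding families can be contrasted with a small sketch. It assumes input intensities normalized to [0, 1], uses a per-step Bernoulli draw as a simple stand-in for a Poisson spike train in frequency coding, and uses a single first spike whose latency decreases with intensity for temporal coding. The function names `rate_encode` and `first_spike_encode` and the linear latency mapping are illustrative assumptions, not the schemes of [39] or [40].

```python
import numpy as np

def rate_encode(intensity, n_steps, rng):
    """Frequency coding: each 1-D input fires at every step with
    probability equal to its intensity (Bernoulli approximation)."""
    intensity = np.asarray(intensity, dtype=float)
    return rng.random((n_steps, intensity.size)) < intensity

def first_spike_encode(intensity, n_steps):
    """Temporal coding: one spike per 1-D input; stronger inputs fire earlier."""
    intensity = np.asarray(intensity, dtype=float)
    # Linear map: intensity 1.0 -> step 0, intensity 0.0 -> last step.
    t_fire = np.round((1.0 - intensity) * (n_steps - 1)).astype(int)
    spikes = np.zeros((n_steps, intensity.size), dtype=bool)
    spikes[t_fire, np.arange(intensity.size)] = True
    return spikes

rng = np.random.default_rng(0)
x = np.array([0.1, 0.5, 0.9])
rate_spikes = rate_encode(x, 1000, rng)
ttfs_spikes = first_spike_encode(x, 100)
print("estimated rates:", rate_spikes.mean(axis=0))
print("spikes per input (temporal):", ttfs_spikes.sum(axis=0))
print("first-spike steps:", ttfs_spikes.argmax(axis=0))
```

In this sketch the frequency-coded train needs many spikes to convey an intensity, whereas the temporal code uses exactly one spike per input, which illustrates why the two schemes differ in efficiency and in sensitivity to the timing of individual spikes.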

3) NETWORK TOPOLOGY
The topology of the SNN directly reflects the connections between neurons and synapses. Existing SNN structures can be divided into static and dynamic structures according to whether the network changes. A static structure means that the number of neurons and layers of the SNN remains unchanged, and only parameters such as weights are changed during the training process. Common structures include multi-layer feedforward and recurrent network structures [42]. A dynamic structure refers to the dynamic adjustment of the number and connections of neurons during the training process, typically represented by evolutionary spiking neural networks. The construction idea of the evolutionary SNN comes from the connected evolutionary system of biology, which can dynamically change the structure and function of the system in an adaptive, self-organizing, and online continuous manner. The input samples are encoded, converted into spike sequences, and passed into the network. Then, according to the characteristics of the samples, the evolutionary structure of the spiking network can dynamically generate new neurons and add them to the corresponding neuron reserve category. The order rule is therefore used in the evolutionary SNN to learn and represent the output category: the earliest activated neuron represents the corresponding category [43].
89666 VOLUME 10, 2022

4) SNN LEARNING ALGORITHM
As SNNs are the core model of brain-like computing, their learning algorithms have always been a research focus. Studying SNN learning algorithms is beneficial for realizing higher-level artificial intelligence. With the continuous development of neuromorphic chips, in addition to the performance indicators of traditional learning algorithms, indicators such as storage resource usage and weight update logic complexity are also used to measure the performance of SNN learning algorithms implemented in hardware. Because neuromorphic chips have limited computing and storage resources, the algorithm is subject to many constraints if it is to run in a low-power dedicated computing mode. At present, SNN algorithms can be divided into unsupervised learning, supervised learning, online learning, network conversion, and network compression algorithms.
Unsupervised learning algorithms usually need only local information to adjust the synaptic weights during the learning process and are usually implemented in hardware. Supervised SNN learning algorithms are usually based on gradient descent and directly learn the labels of input patterns. Because spike trains are discontinuous and non-differentiable, it is difficult to extend such algorithms to deep layers and apply them effectively to complex datasets. Zhang et al. proposed the SpiKL-IP algorithm, based on strict information theory, which minimizes the KL divergence between the actual and expected pulse firing frequency to learn the input pulse pattern in real time. The algorithm achieved the highest recognition accuracy on the CityScape image dataset and the TI46 speech corpus, reaching 97.78% and 98.46%, respectively [44]. Building on this work, Kasabov proposed a dynamic eSNN to learn information from more complex spatiotemporal input patterns consisting of multiple spikes. Learning rules for synaptic plasticity were added to update the weights, and 83.33% recognition accuracy was achieved on the EGG dataset [45]. Wysoski et al. proposed an evolutionary spiking neural network (eSNN) based on hierarchical sorting learning. It uses a single-pulse hierarchical time coding algorithm to encode data, can continuously change its structure during real-time learning, and responds better to different input modes. The recognition performance on the visual and audio datasets reaches 60% and 40%, respectively [46].
Online learning algorithms are algorithms that can learn from the externally supplied information flow in real time. Wang et al. proposed an online learning algorithm for SNNs with a dynamic adaptive structure, whose classification accuracy on the dataset reached 91.8% [47]. Courbariaux introduced a novel weight binarization scheme for forward and backward propagation. It can reduce the number of multipliers by 2/3 and is three times faster during training. This method is well suited to the hardware implementation of neural networks, and a classification accuracy of 91.35% is reported on CIFAR-10 [48].
The network conversion algorithm starts from an ANN, whose training algorithms are already very mature, and quickly obtains an SNN with good performance. Cao et al. describe a method for converting a convolutional neural network (CNN) architecture to an SNN architecture that can be directly mapped to certain spike-based neuromorphic hardware with little performance penalty. In software, the classification accuracy of this architecture on the CIFAR-10 image dataset reached 77.43% [49]. Rueckauer et al. address some important shortcomings of existing ANN-to-SNN conversion, derive a theoretical analysis of the error introduced in the previous conversion process, and, based on this theory, convert the VGG-16 architecture to an SNN on the ImageNet dataset, achieving an accuracy of 84.86% [50]. The network compression algorithm mainly reduces the network size through structure pruning and weight quantization to facilitate hardware implementation. A comprehensive analysis and comparison of several common algorithms is carried out, and the results are shown in table 2.
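The intuition behind rate-based network conversion can be sketched as follows: when an integrate-and-fire neuron is driven by a constant input equal to a ReLU activation, its firing rate over a long window approximates that activation. This is only an illustration of the general idea under stated assumptions (activations normalized to [0, 1], threshold 1, reset by subtracting the threshold); it is not the exact procedure of [49] or [50], and the function name `if_firing_rate` is hypothetical.

```python
def relu(z):
    """ReLU activation of a conventional ANN neuron."""
    return max(z, 0.0)

def if_firing_rate(activation, n_steps=1000, v_thresh=1.0):
    """Firing rate of an integrate-and-fire neuron driven by a constant
    input equal to an ANN activation (assumed to lie in [0, v_thresh])."""
    v = 0.0
    n_spikes = 0
    for _ in range(n_steps):
        v += activation        # integrate the constant input
        if v >= v_thresh:
            n_spikes += 1
            v -= v_thresh      # reset by subtraction keeps the residual charge
    return n_spikes / n_steps

pre_activations = (-0.4, 0.25, 0.7)
rates = [if_firing_rate(relu(z)) for z in pre_activations]
for z, r in zip(pre_activations, rates):
    print(f"pre-activation {z:+.2f}: ReLU = {relu(z):.2f}, spike rate = {r:.3f}")
```

The approximation error of the rate shrinks as the simulation window grows, which hints at the latency cost of rate-based conversion: higher accuracy requires more time steps.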

B. BASIS OF SC
SC was first proposed in the 1960s to simplify complex binary computing units [51], [52], [53]. It has attracted much attention due to its fault tolerance and low-cost arithmetic functional units [54], [55], [56]. Because of these characteristics, SC can realize complex operations such as addition, subtraction, and multiplication through simple logic gates [57]. Compared with traditional binary operations, SC can greatly reduce hardware resource overhead [58].

1) STOCHASTIC MULTIPLIER
The multiplier is the basic calculation unit of SC. For the unipolar representation, multiplication of random sequences is realized by an AND gate; for the bipolar representation, it is realized by an XNOR gate. If x and y are two mutually independent random sequences, a single AND gate computes their product with high accuracy [59]. As shown in figure 5(a), in the unipolar representation of SC, the real number x is interpreted as the probability of a single bit being ''1'' in a random binary bit stream X, i.e., x = P(X). For example, the number x = 0.375 is interpreted as P(X) = 3/8 and can be represented by the bitstream X = 01001010, in which the number of 1s and the bitstream length are 3 and 8 [60]. Note that the unipolar representation is in the range [0, 1], whereas in the bipolar representation of SC, as shown in figure 5(b), the real number x is in the range [−1, 1] and is interpreted as x = 2P(X) − 1.
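The two representations and their gate-level multipliers can be checked numerically with a short sketch. It assumes independent software-generated random bitstreams of length 4096; the helper names (`to_bitstream`, `unipolar_value`, `bipolar_value`) are hypothetical, and a hardware design would use a stochastic number generator rather than NumPy. For the bipolar case, an XNOR gate yields P(out) = P_a P_b + (1 − P_a)(1 − P_b), so that 2P(out) − 1 = (2P_a − 1)(2P_b − 1).

```python
import numpy as np

def to_bitstream(p, length, rng):
    """Random bitstream in which each bit is 1 with probability p."""
    return rng.random(length) < p

def unipolar_value(bits):
    """Unipolar decoding: x = P(bit = 1), range [0, 1]."""
    return bits.mean()

def bipolar_value(bits):
    """Bipolar decoding: x = 2 * P(bit = 1) - 1, range [-1, 1]."""
    return 2.0 * bits.mean() - 1.0

rng = np.random.default_rng(0)
n = 4096  # bitstream length (illustrative assumption)

# Unipolar multiplication with a single AND gate.
x, y = 0.375, 0.5
prod_uni = unipolar_value(to_bitstream(x, n, rng) & to_bitstream(y, n, rng))

# Bipolar multiplication with a single XNOR gate.
# A bipolar value v is carried by a stream with P(1) = (v + 1) / 2.
a, b = -0.5, 0.5
stream_a = to_bitstream((a + 1) / 2, n, rng)
stream_b = to_bitstream((b + 1) / 2, n, rng)
prod_bi = bipolar_value(~(stream_a ^ stream_b))

print("unipolar:", prod_uni, "exact:", x * y)
print("bipolar: ", prod_bi, "exact:", a * b)
```

The estimates fluctuate around the exact products, and the fluctuation shrinks as the bitstream grows, which reflects the latency-accuracy trade-off of SC mentioned in the introduction.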

2) STOCHASTIC ADDER
There are many implementations of adders based on SC. Common implementations include three basic structures: the OR gate, the multiplexer (MUX), and the parallel counter (PC). As shown in Figure 6(a), the overhead of the OR gate is the smallest, but when multiple SC sequences are superimposed, the error gradually accumulates and reduces the accuracy. As shown in Figure 6(b), the MUX circuit is a scaled adder. Compared with the OR gate, its accuracy is improved to a certain extent. However, as the number of inputs and hence the scaling factor increases, the result is compressed into a small value, which causes errors. As shown in Figure 6(c), the PC is a probabilistic adder with high precision, but its hardware overhead and delay are also large. The approximate unit and parallel addition tree based on alternating APC and OR gates proposed by Kim et al. has certain limitations: the accuracy is high only when the random sequence length is large [61].
many samples that can be fed into the network in batches. The parallel computing paradigm is used to exploit the inherent parallelism of layers and improve the performance of hardware implementations of neural networks. Among parallel computing solutions, the temporal and spatial architectures are different [63]. Both architectures contain processing elements (PEs) that perform parallel operations on the same or different data. In the temporal architecture, PEs can only access data from central storage under centralized control, and there is no connection between PEs. Conversely, in a spatial architecture, each PE can have its own control logic and one or more local storage locations. Most importantly, in a spatial architecture, PEs are connected to exchange data with each other, creating a processing array. Figure 7 shows the difference between temporal and spatial architectures.

A. TEMPORAL ARCHITECTURE AND SOFTWARE OPTIMIZATION
Common platforms for the temporal architecture are CPUs and GPUs. A CPU can act as a vector processor that processes multiple data elements simultaneously. A vector processor consists of multiple arithmetic logic units (ALUs) that work synchronously and execute one instruction on a vector of data, i.e., it uses Single Instruction Multiple Data (SIMD) technology.
Among the available hardware platforms, CPUs are seldom used for SNN inference or training because they offer lower FLOPS and FLOPS/WATT performance. GPUs are architectures with up to thousands of cores designed for parallel computing (e.g., 5120 cores in the Nvidia V100 GPU [64]). Similar to vector CPUs, GPUs employ the single instruction, multiple threads (SIMT) execution model first introduced by Nvidia. The SIMT model executes a single instruction on multiple cores simultaneously, and each core receives different data belonging to one of the threads running in parallel. GPUs are the real workhorse of SNN training and, in some cases, inference.
Nvidia GPUs are often used for hardware and software optimization of neural networks. Most neural network frameworks, such as PyTorch [65], TensorFlow [66], or Caffe [67], support execution on Nvidia GPUs. A big advantage of Nvidia GPUs is cuDNN [68], a highly optimized library of DNN primitives. In the latest high-end GPUs, Nvidia combines traditional CUDA cores with tensor cores [64], which are optimized for large matrix operations. Tensor cores can also support mixed-precision operations. In the new Nvidia A100, the tensor cores support a new format, TensorFloat-32 (TF32), which provides up to a 10 times performance improvement over the FP32 format on the V100 architecture [69]. In addition, the A100's tensor cores can exploit the sparsity of tensors common in DNNs to achieve up to 2 times performance gains.
A single GPU is composed of multiple stream processors, each of which can process data in parallel. Because of this high parallelism, deep learning currently relies widely on GPUs for accelerated training. In essence, the neural networks used in deep learning are also inspired by the neural networks inside the biological brain, and the two share certain similarities, so SNNs have also been developed on GPUs. Schemmel et al. proposed a GPU-based platform for SNN simulation [70]. Compared with the CPU, its performance is greatly improved. The proposed simulation platform is also scalable and can simulate large-scale SNNs. Although GPUs have unique advantages in parallel computing, there has not yet been a good solution to their power consumption when running event-driven spiking neural networks.

B. COMMUNICATION PROTOCOL OF SNN CHIP
When evaluating ANNs, the main performance constraints in terms of throughput and power consumption come from memory bottlenecks [71], [72], [73]. Inspired by the computing paradigm of the brain, neuromorphic processors aim to distribute memory across the architecture, close to the PEs. This parallelizes storage and computation across the layers of the SNN while reducing the power consumption of the entire operation. The computing fabric is divided into several small neural cores, each containing memory and PEs. The layers of the network are distributed among the cores, each of which stores some of the synaptic weights.
The multi-core parallelism of SNN accelerators relies on a specific network-on-chip (NoC) communication protocol to transmit events among a large number of neural cores with minimal power and latency. At its input and output, a core receives and transmits spikes over the NoC via communication schemes such as the address event representation (AER) protocol, which encodes event times and organizes connections with low communication costs [74], [75]. As shown in figure 8, such a communication method can greatly reduce the communication bandwidth between chips. It involves sending a packet containing the address of the spiking neuron on a digital bus with asynchronous logic. Once the neuron fires, the address is sent to the NoC, and the firing time is encoded in real time on the asynchronous bus [76]. This type of architecture can scale as long as routers and control circuits can manage AER requests. Nonetheless, the number of neurons and synapses per core and the number of cores per chip limit the class of topologies that can be implemented on a chip. This limitation can be overcome by implementing a scalable multi-chip architecture [77]. Compared with the frame-driven approach of ANNs, the AER protocol has many benefits for large-scale SNN computation [74]. Boahen et al. pointed out that it can reduce the size of the network bus while retaining a large connection capacity [75]. The area requirements of an AER NoC are therefore low, enabling large-scale designs. Because each transmitted packet is reduced to a single address, latency and power overhead are also small. Furthermore, the sparsity of neuron activity in the SNN prunes the NoC activity, reducing the number of packets sent on the network.
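The bandwidth argument for AER can be illustrated in software. In the sketch below, a dense spike raster (time steps by neurons) is converted to a list of address events, so that only active neurons generate traffic. The `AEREvent` type, the explicit `timestamp` field, and the function names are assumptions made for simulation purposes; on a real asynchronous bus, only the address is transmitted and the time is given implicitly by when the packet appears.

```python
from typing import List, NamedTuple

import numpy as np

class AEREvent(NamedTuple):
    timestamp: int  # simulation step at which the neuron fired (software-only field)
    address: int    # index of the source neuron

def raster_to_aer(raster: np.ndarray) -> List[AEREvent]:
    """Convert a dense boolean raster (time x neurons) to address events."""
    t_idx, n_idx = np.nonzero(raster)
    # np.nonzero scans in row-major order, so events are already sorted by time.
    return [AEREvent(int(t), int(n)) for t, n in zip(t_idx, n_idx)]

def aer_to_raster(events: List[AEREvent], n_steps: int, n_neurons: int) -> np.ndarray:
    """Rebuild the dense raster from the event list (receiver side)."""
    raster = np.zeros((n_steps, n_neurons), dtype=bool)
    for ev in events:
        raster[ev.timestamp, ev.address] = True
    return raster

rng = np.random.default_rng(0)
raster = rng.random((100, 256)) < 0.01   # sparse activity, about 1% of entries
events = raster_to_aer(raster)
print(len(events), "address events instead of", raster.size, "dense entries")
```

With sparse activity, the number of events is far smaller than the number of entries in the dense raster, which is the property the AER NoC exploits to keep traffic, latency, and power low.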
The NoC communication protocol is easily scalable: any number of units can be connected as long as the router can manage the requests, so it is suitable for multi-chip implementation. However, the power consumption of such a system increases with the number of connected neurons, the average firing frequency of each neuron, and the overall chip size [78], [79]. Two different schemes are adopted to address this problem. One reduces the power and latency overhead of AER by exploiting the locality and clustering of neural network algorithms [74], [80]. The other adopts a hierarchy of routers, which helps reduce the power impact of large-scale systems [80], [81]. Furthermore, rate and time encoding have significantly different effects on this overhead. Rate encoding usually employs a sequence of Poisson spikes, so a single spike timing error has little effect on the accuracy of the network. However, Mostafa et al. pointed out that rate encoding sends many pulses through the network [82], which can drastically increase the power consumption of the whole system. In temporal coding, timing errors or loss of information can occur when the duration of the algorithmic time step does not allow the neural core to complete its operation. In this case, pulses may be lost or removed from the network. Hence there is a trade-off between accuracy and throughput or power [83].

IV. TRADITIONAL SNN CHIP
The neuromorphic computing system takes neuromorphic devices that imitate biological neurons as its basic units. Its main body is an SNN that approximates the neural network in the human brain. Unlike the traditional way of working by following computer instructions, neuromorphic computing systems use parallel operation and distributed processing mechanisms to complete cognitive tasks such as learning, memory, and reasoning [91]. There have been many studies at home and abroad on the development of neural network chips [92], [93], [94]. At present, research on neuromorphic chips is mainly divided into three directions: (1) digital-analog hybrid neuromorphic chips, in which analog CMOS circuits implement the neural computing units and synaptic circuits and digital CMOS circuits implement the routers; (2) purely digital neuromorphic chips designed entirely with digital CMOS circuits; (3) new neuromorphic chips in which new resistive memory devices implement the synapse circuits and part of the computing units and CMOS circuits implement the routing circuits. The three types of chips are described below.

A. DIGITAL-ANALOG HYBRID NEUROMORPHIC CHIP
Neuromorphic engineering based on analog circuits was proposed by Mead et al. [95]. Using large-scale digital-analog hybrid circuits to simulate the electrophysiological behavior of real neurons and synapses is more efficient and energy-saving than simulating it on traditional CPUs [96]. The most famous example is the neuromorphic supercomputer Neurogrid developed by Stanford University in the United States, which exploits the similarities between the dynamic characteristics of neuron ion channels and the electrical characteristics of transistor subthreshold regions to design neuron circuits and synaptic circuits. It is a neuromorphic chip based on AMS circuits. The system consists of software, a driver, and a hardware system. The hardware system is a PCB carrying 16 Neurogrid chips, which are linked through a tree structure [97], [98], [99]. Each chip contains a square matrix of 256 × 256 neurons. The resulting single board can simulate large-scale intracerebral neurons and synaptic connections and can be applied to brain-computer interfaces [100].
The ROLLS chip from the University of Zurich in Switzerland [101], which has only 256 neurons and 128k synapses, is used to emulate the physical activity of biological nervous systems, study computational neuroscience models, and build brain-like computing systems. Unlike traditional von Neumann processors, the ROLLS neuromorphic processor co-locates memory and computation. Its architecture includes a configurable array of synaptic circuits, and its spiking neurons produce biologically realistic response properties that express a wide range of realistic behaviors [102]. The ROLLS chip features online learning algorithms, such as spike-driven synaptic plasticity rules (STDP), that can validate multiple neuromorphic computing modalities. The HICANN chip from Heidelberg University uses a special interconnection technology to connect 352 chips across an entire wafer, realizing a wafer-scale neuromorphic system [103], [104] that runs 10,000 times faster than real time and is used for large-scale parallel computation. It is designed to give scientists a supercomputer-class platform for accelerating large-scale SNN simulations. The cxQuad chip is a novel multicore device that includes analog neuron and synapse circuits and an asynchronous digital routing fabric optimized to minimize memory requirements and maximize scalability and reconfigurability [105].
Hybrid implementations are powerful tools for real-time simulation of large-scale neural networks: computation can be performed by analog circuits while the flexibility of digital programmable devices is retained. A hybrid implementation combines area-saving, power-saving analog circuitry with a scalable digital system, and clockless communication through sparse pulse coding improves computational efficiency. The precise timing correspondence this produces improves the time-sensitive performance of large-scale neural network operations. In hybrid neuromorphic systems, digital circuits implement the network connections, while analog circuits are dedicated to reproducing neuronal dynamics. Because the on-chip topology of an analog circuit is usually fixed after manufacturing, the entire system needs to be prototyped and optimized on an FPGA before the analog circuit is fabricated. Hybrid digital-analog technology represents an advance in neuromorphic systems, integrating large-scale, biorealistic, configurable neural networks on a single chip.

B. PURE DIGITAL NEUROMORPHIC CHIP
The advantage of analog neuromorphic systems is that they allow the interaction between the system and its environment to be studied in real time. However, analog neuron circuits cannot provide the precise, deterministic computation of digitally simulated neurons, which makes them less than ideal for detailed quantitative studies. At the beginning of this century, many purely digital neuromorphic systems appeared. Building on the stability and reliability of digital circuits, these systems can realize large-scale neuromorphic systems with ultra-low power consumption and reproduce algorithms accurately. Many of these chips can already solve real-life problems to a certain extent.
In 2014, IBM Corporation of the United States launched the well-known TrueNorth chip [106], which adopts a fully customized ASIC solution. It implements a specific network model, the LIF neuron model, and connection schemes that support limited programming. Fabricated in Samsung's 28-nm technology, the chip consists of 5.4 billion transistors and occupies an area of only 4.3 cm². Thanks to its asynchronous circuit design, it has extremely low operating power consumption and reaches a computing scale of 1 million neurons and 256 million synapses. TrueNorth scales well, does not rely on a global clock to coordinate its work, and a single chip failure does not affect the operation of the overall system. The chip supports deep feedback algorithms, so it can be applied in practice in fields such as image recognition and speech processing. A vision application system composed of 3 million neurons implemented with TrueNorth chips consumes about 200 mW [107]. The DARPA SyNAPSE system consists of 48 TrueNorth chips interconnected in an array [108] and can realize a neural network with about 50 million neurons. The peak performance of fixed-point computing can reach 266 GB/s. Although IBM uses SRAM, it is a special 8T structure, and the expected scale and power consumption can be achieved only under specific circumstances.
Although the TrueNorth chip is a breakthrough in brain-like computing and has greatly promoted the development of artificial intelligence, it supports only SNN inference. It does not support neuromorphic plasticity, and parameters can be learned only in software, not on the chip, so there is still considerable room for the development of brain-like computing.
The SpiNNaker network architecture was developed at the University of Manchester, UK. In SpiNNaker, the exponential function is implemented through a lookup table, but the memory requirements grow as more complex neural models are developed. To save the limited memory resources of the SpiNNaker chip, a single fixed-point hardware accelerator for the exponential and natural logarithm functions was designed, which improves energy efficiency at the cost of some precision. This fixed-point approach allows custom configurations for other systems with different power, area, accuracy, and delay constraints [109], [110]. Part of the brain function model is implemented on an ARM-based chip, which can imitate the function of brain regions, and its communication mechanism is suitable for real-time modelling [111].
The Loihi neuromorphic chip released by Intel Corporation transmits data between neurons as pulses and incorporates the STDP model. What distinguishes it from other chips is its autonomous learning capability [112]. Loihi not only supports neuromorphic computing with extremely low power consumption [113] but also supports a variety of neuromorphic plasticity mechanisms, integrating multiple learning rules, complex neuron models, and various information encoding protocols, so that multiple algorithms can be emulated. Loihi is fabricated in Intel's 14-nm FinFET process and contains 2.07 billion transistors, which made it Intel's fifth-largest chip at the time. With 128 cores integrated into a single chip and 1024 neural pulse units integrated into each core, Loihi can run and train SNNs with extremely low power consumption. Loihi is the first system to simultaneously support mechanisms such as sparse network compression, inter-core multicasting, variable synaptic formats, and population-based hierarchical interconnection [114]. Intel also released Pohoiki Springs, a neuromorphic system that supports computation with 100 million neurons. The system is built as a data center rack of 768 Loihi brain-like computing chips, and its overall neural capacity is comparable to that of a small mammal. It is the largest brain-like computing system implemented by Intel [115], [116], [117].
Current neuromorphic chips mainly follow the principles of neurodynamics to build brain-like neural networks, but they are not compatible with mature models such as ANNs. To address this, Tsinghua University developed an artificial intelligence computing chip called "Tianji", which measures 3.8 mm × 3.8 mm and is built in a 28-nm process. Although its operating frequency is only 300 MHz, its performance can reach 1278 GOPS/W. The chip consists of more than 150 computing units, which can support the computation of about 40,000 neurons and 10 million synapses. The most distinctive feature of "Tianji" is that it integrates two different artificial intelligence research directions, one based on computer science and one based on neuroscience, into a single platform, so the chip supports both existing machine learning algorithms and brain-like computing algorithms [118]. By combining these two technical routes and adopting a non-von Neumann paradigm with hybrid compatibility, a multi-core architecture, localized memory, and streamlined data flow, it can support cross-paradigm modeling, maximize parallelism, improve power efficiency, and allow seamless communication between models [119].
Zhejiang University and Zhijiang Laboratory jointly developed DarwinMouse, a brain-like computer built from 792 second-generation Darwin chips, whose computing scale is equivalent to that of a small mammalian brain [120]. It was the largest brain-like computing system in the world at that time, reaching a computing scale of 120 million neurons and nearly 100 billion synapses. The chip adopts standard 180-nm CMOS technology and supports applications such as collaborative robot work and EEG signal decoding. The overall power consumption is between 350 and 500 W. Each "Darwin 2" chip contains 576 computing cores, each of which can realize the computation of about 256 neurons and 10 million synapses [121]. BrainScaleS consists of several interconnected chips, each composed of several HICANN neural cores, and accelerates the simulation of brain-like neural networks with accurate biological neural behavior [122].
A purely digital implementation consumes more silicon area and power per function but significantly reduces development time and is not affected by power supply fluctuations, thermal noise, or device mismatch. In addition, high-precision digital computing can realize network communication systems with high dynamic range and higher stability, reliability, and repeatability.

C. MEMRISTOR-BASED NEUROMORPHIC CHIPS
Besides conventional CMOS, memristors with desirable properties have become one of the main device choices for neuromorphic computing [123]. The brain's neurons are connected in three dimensions, allowing very dense and highly parallel networks to be implemented at a minimal scale. Neural networks on silicon, on the other hand, are mainly two-dimensional, so they cannot be integrated at similar densities. Many researchers have therefore proposed a recent parallel implementation: crossbar switch arrays [124], [125], [126], [127], [128]. This design aims to combine the memory and neuron update parts of the neural core in a single unit, yielding higher speed, lower energy consumption, and true non-von Neumann computation [124], [129]. A crossbar consists of two sets of metal wires intersecting orthogonally, as shown in Figure 9. A nanoelectronic device that mimics the behavior of a synapse is placed at each intersection. One direction represents the outputs of the presynaptic neurons, and the other represents the connected postsynaptic neurons. Thus, the operation of an analog crossbar array consists of applying voltages on the input lines and reading currents on the corresponding output lines. The conductance of each device represents the synaptic weight of the connection, the resulting currents are summed according to Kirchhoff's law, and a dot product operation is thereby implemented [126]. To achieve high-density crossbar switches without loss of accuracy, the device must meet three requirements: (1) small size, (2) low power consumption for read and write operations, and (3) stability [126]. Phase-change memory (PCM) [130], [131] and metal oxide resistive devices [132], [133], [134] are good candidates for (1) and (2) because their power dissipation decreases with their size. However, at small scales, they cannot yet meet the third requirement, which greatly limits the use of crossbar switch arrays as accelerators for neural network algorithms.
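The read operation of an analog crossbar amounts to a vector-matrix product. The following minimal Python sketch illustrates this for an idealized array, assuming perfectly linear devices and ignoring wire resistance, sneak paths, and device variability. The function name `crossbar_currents` and all numerical values are hypothetical and chosen only for illustration.

```python
def crossbar_currents(voltages, conductances):
    """Return the current read on each output line of an idealized crossbar.

    voltages:     one voltage per input line (volts).
    conductances: conductances[i][j] is the device conductance (siemens)
                  at the crossing of input line i and output line j.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("one voltage per input line is required")
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Ohm's law per device; Kirchhoff's current law sums the
            # contributions of all devices sharing output line j.
            currents[j] += voltages[i] * conductances[i][j]
    return currents


if __name__ == "__main__":
    # Hypothetical 3 x 2 array: three presynaptic lines, two postsynaptic lines.
    G = [[1e-6, 2e-6],
         [3e-6, 4e-6],
         [5e-6, 6e-6]]
    V = [0.1, 0.0, 0.2]
    print(crossbar_currents(V, G))
```

Each output current is the dot product of the input voltage vector with one column of the conductance matrix, which is why a single read step evaluates a whole layer of synaptic weights.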
Because of this limitation, the realization of crossbar switch arrays is still under investigation. Ankit et al. simulated SNNs at the placement level of fully neuromorphic architectures [124] and obtained large gains in energy and speed with crossbar switch arrays compared with traditional neuromorphic architectures. The gains reach several hundred times for FC networks but only a few tens of times for CNNs, confirming that the crossbar gain depends strongly on the network topology. Recently, Ambrogio et al. showed [129] that a PCM crossbar switch array controlled externally by a computer through the detector achieved accuracy on ANN evaluation equivalent to that of an ideal simulation, with a potential power efficiency one to two orders of magnitude better than a standard CPU or GPU.
One advantage of SNNs over ANNs is that the activations are binary. This also simplifies the surrounding circuitry, since the same voltage is applied to each crossbar input. The neuron update can be implemented in analog or digital circuits. Analog implementations can achieve very high throughput, ideally with a capacitor at the output of each line [135], especially when the memristive element is placed directly between the two access lines. However, because of the lack of control over the intersection design [136], crossbars are usually implemented with access devices, which greatly reduces the design density. A digital implementation of the output circuit requires an analog-to-digital (A/D) converter and memory to store neuron states [137], [138]. Neuron updates can then be time-multiplexed to reduce hardware requirements. In this case, the crossbar array still has an advantage because the computation happens in memory, but it incurs transfer costs to and from the neuron state memory and the A/D converter. ANNs with full-precision activations, however, require either a simplified network topology [139] or improved circuitry on such designs to provide accurate voltage values to each input line [140] and to apply the nonlinearity after the MAC operation.
In conclusion, crossbar arrays with nonvolatile memory (NVM) devices are promising in terms of performance and scalability, especially in a fully analog implementation, where neuron updates are performed in parallel. They have shown good results but still need improvement to guarantee reliable, fully on-chip operation and long lifetimes.
At present, memristor-based synapses have reproduced several functions of biological synapses. Most of the neural networks currently studied are based on CMOS circuits, which require many active components to realize the function of a neuron, so CMOS-based neurons take up much area and suffer from high power consumption. Memristor-based neural networks are therefore constructed according to the principles of ANNs and SNNs. Diffusive memristors can provide a highly desirable dynamic description of synaptic and neural functions in neural networks. In Figure 10(a), synapses are represented by synapse arrays connected through axon terminals to the dendrites of individual neurons, similar to the biological scenario in Figure 10. There are four main types of memristors: redox reaction memristors [141], phase transition mechanism memristors, ferroelectric tunnel junction effect memristors, and magnetoresistive effect memristors. Because of their small footprint and ease of integration, memristors are often used in arrays, such as 3D memristor arrays based on stacking technology and crossbar-type 2D arrays [142]. Besides high integration density, memristor arrays offer multiply-accumulate computing capabilities. By applying voltages to the WL lines of the array and reading the collected currents on the SL lines, multiply-add operations, which have always been among the most resource-intensive parts of neuromorphic computing, can be computed efficiently. Some small networks have demonstrated efficient operations based on memristor arrays [143], [144], [145]. Beyond their advantages in computing and integration, memristors are regarded as devices with synaptic plasticity [146], [147], and memristors with threshold switching properties are considered to enable high-density neurons [148].
In addition, memristor neurons can be used in perceptual systems to convert analog perceptual signals into pulsed signals [149].
Memristor-based neuromorphic computing is still at the stage of exploring neural computing through device principles or verifying small networks with small circuit systems. Limited by the integration difficulty of the memristor system itself and the limitations of biomimetic synaptic learning rules, large-scale memristor spiking neural networks have remained largely unreported.

D. PROBLEMS WITH TRADITIONAL SNNS
Traditional von Neumann computing architectures suffer from scalability limitations regarding computational speed and power consumption. Novel brain-inspired architectures have emerged as alternative computing platforms, especially for cognitive tasks requiring massive parallel data processing. As discussed in Section 3 above, one of the main bottlenecks in the CMOS implementation of these neuromorphic parallel architectures is the physical implementation of large-scale synaptic interconnections between neurons and synaptic adaptation. Implementing adaptive synaptic connections in CMOS technology requires using many circuits for analog memory or digital memory blocks, which are expensive in terms of area and energy requirements. In addition, learning rules that update these synaptic memory devices must be implemented. Developing compact adaptive devices that conform to biological learning rules to achieve synaptic connections has stimulated research into alternative nanotechnology to complement CMOS technology. Memristive devices are novel two-terminal devices capable of changing their conductance depending on the voltage/current applied to their terminals.
The current mainstream SNN chips follow two technical routes. The first uses pure digital circuits to simulate the functions of synapses and neurons, builds neuromorphic cores from them, and connects multiple neuromorphic cores through on-chip routing to form a neuromorphic chip. The second uses digital-analog hybrid circuits to simulate synapses and the dynamics of neurons, builds neuromorphic cores from them, and likewise connects multiple cores through on-chip routing. Although both routes use advanced process technology to simulate neural networks with hundreds of millions of neurons, CMOS circuits are limited by two-dimensional connections, a limited number of interconnect metal layers, and routing protocols, so there are still enormous difficulties in simulating a biological brain with its 3D structure. An ideal hardware deep learning system should have online learning capabilities and be reconfigurable for different applications. Designing a highly scalable and parallel hardware deep learning system with online learning capability is a challenge that spans hardware, algorithms, and applications and calls for a new computing paradigm. Because the measurement units of parameters such as neuron size, number of synapses, number of chip cores, and power consumption are not uniform, they cannot be compared directly, so only the chip area, shown in Figure 11(a), and the manufacturing process, shown in Figure 11(b), of different SNN chips are compared. Table 3 summarizes the current mainstream spiking neural network chips and compares their relevant characteristics.

V. THE SNN CHIP BASED ON PROBABILITY CALCULATION
SC is considered the next frontier of energy-efficient edge computing [150] because of its energy-efficient operation and its fault tolerance in areas such as recognition, vision, and data mining. At the same time, many applications are trying to move challenging workloads from the cloud to edge devices. SC has therefore become a research hotspot.

A. APPLICATION OF PROBABILISTIC COMPUTING IN TRADITIONAL NEURAL NETWORK CHIPS
Deep learning places increasing demands on hardware processing systems for energy efficiency, high computing power, and low power consumption. However, computing systems built on classical computing architectures run into the well-known "von Neumann bottleneck", "memory wall", and "power consumption wall" problems, which severely limit improvements in the energy efficiency of deep neural network processing [151]. The biological brain computes in a completely different way from a von Neumann system: biological neurons transmit information with pulse sequences encoded in time and space rather than with encoded binary data. The SNN is a neural network that simulates the biological brain; it differs fundamentally from traditional neural networks and requires fewer computing resources. Studying pulse-based deep learning architectures is therefore a promising way to break through the computational bottleneck and has new research value.
SC is one of the important methods for realizing neural networks [152], [153], [154]. In 2001, Brown and Card first applied SC to neural network computation [155], replacing traditional binary arithmetic units with SC units. Their results show that, although the accuracy of the SC units decreases, there are clear advantages in hardware area, power consumption, and calculation speed. Ardakani et al. proposed an effective scheme for implementing a DBN using integral stochastic computing. The experimental results show that the system's delay is reduced by 84%, the hardware area of the modified scheme is reduced by 66%, and the power consumption is reduced by 33%, which effectively improves computational efficiency and accuracy [156]. Li et al. proposed an efficient SC-based framework for large-scale deep convolutional neural networks (DCNNs), using approximate parallel counters and optimizing the train multiplier, and proposed an SC-based ReLU activation function for the first time. The results show that the hardware circuit can accurately reproduce the function output when the input range of the function is limited to [−5, 5] [157]. Li et al. also introduced two important techniques, normalization and dropout, into SC-based DCNNs and implemented the corresponding functions in hardware. When the AlexNet model was evaluated on the ImageNet dataset, the Top-1 accuracy improved by 3.26% and the Top-5 accuracy by 3.05% [158].
Ren et al. first proposed a comprehensive design and optimization framework for deep CNNs based on stochastic computing, which achieves extremely low hardware footprint, low power, and low energy consumption while maintaining high network accuracy. Compared with the traditional model, the SC-based design increases the throughput of the hardware circuit by as much as 100 times [159]. Zhang et al. designed a motor controller based on a neural network implemented with stochastic computing and realized its functions on an FPGA. The experimental results show that, compared with a traditional microcontroller or DSP implementation, the motor controller achieves lower cost and higher performance [160].
Hirtzlin and Penkovsky et al. proposed applying SC to binary neural networks; tests on the Fashion-MNIST and CIFAR-10 datasets showed only a 1.4% drop in accuracy, while the circuit area could be reduced by 62% [161]. Sim et al. proposed a matrix-vector probabilistic multiplier [162]. Their results show that the multiplier can balance stochastic sequence length against calculation accuracy and reduce the calculation delay compared with the traditional algorithm, with a significant effect on energy consumption as well. Hojabr et al. proposed SkippyNN [102], a stochastic computing-based convolutional neural network architecture for embedded devices that reduces the computation time of SC-based convolutional layers compared with traditional algorithms. Experimental results show that the SkippyNN architecture achieves a 1.2-fold increase in computational speed and a 2.7-fold energy saving compared to the conventional binary implementation.
Xiong et al. proposed an uncorrelated, independent, non-random encoding method for stochastic sequences [164] and applied it to stochastic multipliers, adjusting the sequence length through an adaptive algorithm. Their results show that the sequence length is significantly reduced to 64 bits, which reduces the overall computational latency. Wang and Zhang et al. designed a non-scaling, high-precision stochastic adder and applied it to a CNN in combination with the Winograd algorithm; the results show that the hardware complexity of the convolution operation is reduced while the computational accuracy of stochastic computing is preserved [165]. Neil et al. proposed Minitaur, an event-driven FPGA-based SNN accelerator for investigating the capability of FPGA platforms to implement real-time, event-driven deep spiking networks; it achieves 92% accuracy on the MNIST dataset [166]. Stromatias et al. implemented a spike-based DBN on the SpiNNaker platform, achieving 95% classification accuracy on the MNIST dataset, only 0.06% lower than the software implementation, while consuming 0.3 W with an average classification latency of 20 ms [167].
Esser et al. were the first to use offline training with backpropagation to create networks that reconcile the incompatibility between the backpropagation algorithm and neuromorphic hardware. The proposed SNN architecture achieved a recognition accuracy of 99.42% on the MNIST dataset and ran in real time on TrueNorth chips [168]. Luo et al. proposed a network of interconnected nodes, each containing computational logic, embedded dynamic random access memory (eDRAM), and a router. This architecture achieves a 450.65-fold speedup and a 150.31-fold energy reduction relative to a GPU [169]. Chen et al. proposed a CNN accelerator named Eyeriss, which supports high-throughput CNN computation and is optimized for the energy efficiency of the whole system, including the accelerator chip and off-chip dynamic random access memory (DRAM) [170]. Han et al. proposed an energy-efficient inference engine (EIE) that uses weight sharing and distributed storage computing to accelerate neural networks, showing clear advantages in energy consumption and hardware area [171]. Ren et al. adopted a top-down approach to design an optimization framework for SC-based DNNs that makes full use of SC's advantages, significantly reducing hardware area and achieving low power consumption while maintaining high network accuracy [172]. Table 4 compares the parameters of SC-based neural networks with those of other software and hardware platforms. It can be seen that there is both a theoretical and a practical basis for applying SC to neural networks. The application of neural networks on the device side is greatly limited, mainly because embedded devices cannot provide enough computing power, storage, and bandwidth. The bit-wise execution of SC can greatly reduce hardware complexity.
On the other hand, neural network models usually involve multiple passes and iterations during inference, so random errors are less likely to have a severe impact on the overall accuracy, which matches the highly error-tolerant nature of stochastic computing. Compared with traditional binary computing, SC significantly improves computing speed and energy efficiency and reduces resource consumption. Because parameters such as accuracy, throughput, area efficiency, and energy efficiency differ too widely to be compared directly, only the chip area, shown in Figure 12(a), and the power consumption, shown in Figure 12(b), are compared.

B. APPLICATION OF SC IN SNN CHIP
The SNN is a new data storage and computing technology based on neural networks. By simulating the working mechanism of the brain, it can break through the von Neumann bottleneck that traditional computers encounter when dealing with large-scale problems, significantly improving the speed of information processing and reducing power consumption, and it has self-learning and adaptive capabilities.
In recent years, some researchers have attempted to reduce power consumption and area overhead while retaining the original advantages of SNNs. Kuang et al. proposed designing the LIF neuron model with the Euler approximation method to solve the problem of non-differentiable pulses [173]. Simplifying the neuron model with the Euler method into an iterative accumulation can significantly reduce the computational complexity of SNNs. However, this approach still suffers from a large area overhead because of the many multipliers involved.
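To make the Euler approximation concrete, the sketch below discretizes a generic textbook LIF neuron, dV/dt = (-(V - V_rest) + R·I) / τ, with a forward Euler step. It is an illustrative assumption and not the circuit of [173]; the function name `simulate_lif` and all parameter values are hypothetical.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, v_rest=0.0,
                 v_thresh=1.0, v_reset=0.0, resistance=1.0):
    """Simulate a leaky integrate-and-fire neuron with forward Euler steps.

    Returns (spikes, trace): a 0/1 spike flag and the membrane potential
    after each time step. All default parameters are illustrative.
    """
    v = v_rest
    spikes, trace = [], []
    for i_t in input_current:
        # Euler update: the differential equation becomes one
        # multiply-accumulate per time step.
        v += (dt / tau) * (-(v - v_rest) + resistance * i_t)
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset  # reset the membrane potential after a spike
        else:
            spikes.append(0)
        trace.append(v)
    return spikes, trace


if __name__ == "__main__":
    spikes, _ = simulate_lif([2.0] * 50)
    print("number of spikes:", sum(spikes))
```

Each update still contains a multiplication by dt/τ, which hints at why multiplier count dominates the area of a direct binary implementation.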
As a distinctive data representation and processing technique, stochastic computing allows many complex arithmetic operations to be implemented with simple logic gates, which provides a huge design space for neuron integration. SC also has strong fault tolerance, because data are processed as bit streams that are interpreted as probabilities. SC therefore enables fully parallel and scalable hardware implementations of large-scale deep learning systems.
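As an illustration of arithmetic with simple logic gates, the following sketch emulates two standard unipolar SC operations in software: multiplication with an AND gate and scaled addition with a 2-to-1 multiplexer. It assumes independent input streams; the helper names, seeds, and stream length are hypothetical choices for the example, and the software random generator stands in for a hardware stochastic number generator.

```python
import random


def encode(value, length, rng):
    """Encode a value in [0, 1] as a unipolar stochastic bit stream."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def decode(stream):
    """Estimate the encoded value as the fraction of ones."""
    return sum(stream) / len(stream)


def sc_multiply(a, b):
    """Bitwise AND: for independent streams, P(out=1) = P(a=1) * P(b=1)."""
    return [x & y for x, y in zip(a, b)]


def sc_scaled_add(a, b, select):
    """2-to-1 multiplexer: with P(select=1) = 0.5 the output encodes (a + b) / 2."""
    return [x if s else y for x, y, s in zip(a, b, select)]


def demo(length=10000):
    # Separate generators keep the three streams independent.
    a = encode(0.5, length, random.Random(1))
    b = encode(0.4, length, random.Random(2))
    select = encode(0.5, length, random.Random(3))
    return decode(sc_multiply(a, b)), decode(sc_scaled_add(a, b, select))


if __name__ == "__main__":
    product, half_sum = demo()
    print("0.5 * 0.4 is estimated as", product)
    print("(0.5 + 0.4) / 2 is estimated as", half_sum)
```

The estimates are approximate and fluctuate with stream length, which reflects the trade-off between sequence length and accuracy discussed in this section.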
The rate coding used in SNNs is similar to the encoding used in SC [174]. However, the results of SC depend on the correlation of the input pulse sequences. When two (or more) pulse trains are fed to an SC circuit, the cross-correlation between them affects the computational accuracy [175]. If the pulse trains involved are highly cross-correlated, the pulse train output by the SC circuit has low randomness, and vice versa. On the other hand, many researchers use finite state machines (FSMs) to implement nonlinear functions with improved accuracy [12]. Following this design concept, Smithson et al. proposed an FSM-based implementation of the LIF neuron model and its hardware architecture [176], [177]. However, this approach still suffers from large area and power overheads.
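The effect of cross-correlation can be seen with a small experiment. In the sketch below, an AND gate multiplies two unipolar streams that are generated either from independent random sources or from one shared random source (the maximally correlated case). The function names, seeds, and stream length are hypothetical and serve only as an illustration.

```python
import random


def encode_independent(values, length, seed):
    """One private random source per stream: the streams are uncorrelated."""
    streams = []
    for k, value in enumerate(values):
        rng = random.Random(seed + k)
        streams.append([1 if rng.random() < value else 0 for _ in range(length)])
    return streams


def encode_shared(values, length, rng):
    """One shared random source for all streams: maximally correlated."""
    streams = [[] for _ in values]
    for _ in range(length):
        r = rng.random()  # the same random number is compared with every value
        for stream, value in zip(streams, values):
            stream.append(1 if r < value else 0)
    return streams


def and_estimate(a, b):
    """Fraction of ones at the output of a bitwise AND gate."""
    return sum(x & y for x, y in zip(a, b)) / len(a)


def compare(p=0.5, q=0.4, length=10000):
    ind_a, ind_b = encode_independent([p, q], length, seed=10)
    cor_a, cor_b = encode_shared([p, q], length, random.Random(20))
    return and_estimate(ind_a, ind_b), and_estimate(cor_a, cor_b)


if __name__ == "__main__":
    independent, correlated = compare()
    print("independent streams:", independent)  # close to p * q
    print("correlated streams: ", correlated)   # close to min(p, q)
```

With a shared source the AND gate outputs a one exactly when the random number is below both values, so it computes min(p, q) instead of the product p·q. This is the accuracy loss that decorrelation techniques aim to avoid.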
To reduce the area overhead, Chen and Kou et al. proposed using SC adders and multipliers to implement low-cost SNN neurons [178]. A large-scale SNN structure is used to further reduce the calculation error caused by bit-stream cross-correlation. In addition, pruning can be used to avoid unnecessary calculations in the SNN, which makes the pulse transmission between SNN neuron layers sparse and also helps to improve computational accuracy. Implemented in 40-nm process technology, the SC-based SNN architecture saves 72.38% to 75.64% of the area overhead and 81.37% to 90.58% of the power consumption compared with the SNN model without SC. Xiao et al. proposed SC-AdEx, an SC-based adaptive exponential integrate-and-fire neuron model that uses probability integrators and AND gates as its basic computing units. Compared with AdEx without SC, it occupies a smaller area, is faster, and has a lower cost [179].
A high-precision SC-SNN hardware design framework proposed by Tang and Han utilizes the cumulative distribution function of the input signal to generate pulse trains and a priority encoder to convert these pulse trains into index-based signals. In this way, the connection between neuron layers is reduced from O(N^2) to O(N log N), which solves the problem of relatively low information density and realizes efficient hardware design. Implemented on an FPGA for classifying the MNIST dataset, the design achieves almost the same accuracy as an ANN in experiments [180]. Liu and Liang et al. proposed an efficient hardware tripartite synapse structure based on SC. SC is used to replace conventional computing components such as DSPs in hardware devices, and extended SC logic is used to scale the data range during the computing process. The results show that the proposed hardware architecture produces the same output as software simulation with lower hardware resource consumption, so it can be applied to large-scale SNNs [181]. Chen and Song et al. designed a probabilistic spiking neuron and realized a reconfigurable computing architecture for the neural network, with a reported factor of 8.82 relative to a conventional binary accumulator [182]. Gao and Chen et al. proposed an asynchronous SC-based SNN architecture that realizes the forward inference operation of an SNN based on LIF neurons with 784 inputs and 10 outputs. SC is used to convert numerical values into pulse sequences, which further reduces the power consumption of circuits and systems [183].
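To make the idea of forward inference with LIF neurons on pulse-coded inputs more tangible, the following Python sketch runs one layer with 784 inputs and 10 outputs. It is not the architecture of [183]: the threshold, leak factor, number of time steps, and toy weights are all assumptions chosen for illustration, and the numerical inputs are converted to pulses by simple Bernoulli rate coding.

```python
import random

def lif_layer_inference(pixels, weights, steps=50, threshold=1.0, leak=0.9, seed=0):
    """Run a rate-coded input through one layer of LIF neurons; return spike counts."""
    rng = random.Random(seed)
    potentials = [0.0] * len(weights)
    spike_counts = [0] * len(weights)
    for _ in range(steps):
        # Rate coding: input i emits a spike with probability pixels[i].
        active = [i for i, p in enumerate(pixels) if rng.random() < p]
        for j, row in enumerate(weights):
            # Leak, then integrate only the inputs that spiked (event-driven).
            potentials[j] = leak * potentials[j] + sum(row[i] for i in active)
            if potentials[j] >= threshold:
                spike_counts[j] += 1
                potentials[j] = 0.0  # reset after firing
    return spike_counts

N_IN, N_OUT = 784, 10
# Toy weights (hypothetical): only output neuron 3 is connected, so it should win.
weights = [[0.01 if j == 3 else 0.0 for _ in range(N_IN)] for j in range(N_OUT)]
pixels = [1.0] * N_IN   # fully bright image: every input spikes at every step

counts = lif_layer_inference(pixels, weights)
predicted = counts.index(max(counts))   # class = output neuron with most spikes
silent = lif_layer_inference([0.0] * N_IN, weights)
print(counts, predicted, silent)
```

Neurons that receive no input spikes perform no accumulation in this sketch, which reflects the event-driven behavior that makes SNN inference attractive for low-power hardware.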
SC has been widely used because it can reduce the energy cost of hardware computing [184]. Two main approximation strategies are used for neural network applications: network compression and classical SC.
Because neural networks have too many parameters, researchers targeting embedded applications began to reduce the precision of weights and activations to shrink the memory footprint of ANNs, a method known as network compression or quantization. Also, due to the fault tolerance of neural networks and their ability to compensate for approximations during training, the reduced bit precision results in only a small loss of accuracy [185], [186], [187], [188]. When implemented in hardware, weight quantization (WQ) shows an energy gain of 1.5 to 2 times with less than a 1% loss in accuracy [189], [190]. Rathi et al. achieved an accuracy loss of about 3% with an energy gain of 2.2 to 3.1 times [191]. There is thus a trade-off between the accuracy of the SNN application and the energy and area requirements of the neural network. SC can also implement computational circuits of neurons, where unimportant cells can be deactivated to reduce the computational cost of evaluating SNNs [192].
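A minimal Python sketch of weight quantization is given below. It assumes symmetric uniform quantization to signed 4-bit codes, which is only one of many possible schemes and is not claimed to be the method of the cited works; the example weights are hypothetical.

```python
def quantize_weights(weights, bits):
    """Symmetric uniform quantization of a weight list to signed `bits`-bit codes."""
    levels = 2 ** (bits - 1) - 1                  # e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]   # integers in [-levels, levels]
    dequantized = [c * scale for c in codes]
    return codes, dequantized, scale

weights = [0.7, -0.32, 0.04, 0.0, -0.58]          # hypothetical trained weights
codes, approx, scale = quantize_weights(weights, bits=4)

# The rounding error of each weight is bounded by half a quantization step.
max_error = max(abs(w - q) for w, q in zip(weights, approx))
memory_reduction = 32 / 4   # 32-bit floats replaced by 4-bit codes

print(codes, max_error, scale, memory_reduction)
```

The memory footprint shrinks by the ratio of the bit widths, while the per-weight error stays below half a quantization step; whether the resulting accuracy loss is acceptable has to be checked on the target network.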
Training ANNs with random synapses leads to better generalization and has already yielded better accuracy on the test set [193], [194]. The same method applies to SNNs. FPGA implementations of spiking neuromorphic systems with synaptic randomness have been shown to improve the accuracy of the network while reducing memory requirements [195]. Nanoelectronic devices with inherent cycle-to-cycle variability, such as memristors [196] or VO2 [197], can reduce the area and power overhead of random number generation. Chen et al. [198] also exploited probabilistic rewiring to increase their throughput, with fewer synapses meaning fewer spike integrations, and thus an increased algorithmic time step. Experimental results show an 8 times speedup and a 7.3 times energy gain, and the accuracy loss in MNIST digit recognition is only 0.25%, from 98.15% to 97.9%. Thus, randomized and quantized synapses can significantly reduce the memory requirements and power consumption of SNN accelerators, and these can be reduced even further by pruning insignificant weights. Another approach is to design PEs that approximate their computations by employing modified arithmetic logic units [201]. Jin et al. have shown that when evaluating SNNs on neuromorphic hardware for character recognition [202], carry-skip adders can achieve a 2.4 times speedup and a 43% energy gain with an accuracy loss of only 0.97%.
Therefore, SC methods at the software and hardware levels can significantly improve power consumption and speed. However, as the complexity of the dataset and the depth of the network topology increase, such as when using ResNet on ImageNet [199], the accuracy loss becomes a non-negligible factor [200].
Table 5 shows the classification results on the MNIST dataset for the random SNN, DNN, SNN, and optimized SNN [212] and also compares their energy consumption. The performance of SNNs on the very simple MNIST image recognition dataset is still marginal, with a test accuracy of less than 95% [213], [214]. Smithson et al. found that spiking neurons perform the same as SC when performing rate encoding. The proposed SNN based on SC can further reduce the power consumption of the hardware and achieve 95% accuracy on the MNIST dataset [215].
The DNN uses the ReLU activation function, and the SNN is obtained by converting the ReLU units into IF neurons. The energy consumption required for each operation is taken from [216]. The results in Table 5 show that the proposed scheme achieves almost the same performance as the original DNN and better performance than the state-of-the-art SNN. Furthermore, random-based SNNs are more energy efficient than the other networks. Figure 13(a) and Figure 13(b) show the recognition accuracy and the energy consumption of the neural networks on the IMDb dataset, and Figure 13(c) and Figure 13(d) show the recognition accuracy and the power consumption of the neural networks on the MNIST dataset.

C. DELAY PROBLEM OF SC
SC uses discrete pulse sequences to replace sequential binary numbers to achieve lower computing resource consumption. However, the resulting sacrifice in latency or computational accuracy becomes a challenge for hardware design. In response to these problems, some researchers have made preliminary attempts. Lu et al. proposed a new architecture to implement the fast Winograd algorithm on FPGA [203], reducing the calculation delay and improving the accuracy. Mathieu et al. used the FFT and the convolution theorem to reduce the arithmetic complexity of convolutional layers [204]. Vasilache et al. improved a fast Fourier transform convolution implementation based on NVIDIA's cuFFT library [205], and an FFT-based convolution is also implemented in the NVIDIA cuDNN library. Strassen's algorithm [206] for fast matrix multiplication was used by Cong et al. [207] to reduce the number of convolutions in a convolutional network layer, thereby reducing its overall arithmetic complexity.
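The convolution theorem mentioned above states that convolution in the signal domain corresponds to element-wise multiplication in the frequency domain. The following Python sketch checks this on a one-dimensional example. It uses a small textbook radix-2 FFT written with the standard library rather than cuFFT or cuDNN, and it is meant only as an illustration of the principle, not of the cited implementations.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_convolve(signal, kernel):
    """Linear convolution via zero-padding, FFT, pointwise product, inverse FFT."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:
        size *= 2
    fa = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    fb = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    product = [p * q for p, q in zip(fa, fb)]
    result = fft(product, inverse=True)
    return [v.real / size for v in result[:out_len]]

def direct_convolve(signal, kernel):
    """Reference implementation with len(signal) * len(kernel) multiplications."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

signal = [1, 2, 3, 4]
kernel = [1, 0, -1]
via_fft = fft_convolve(signal, kernel)
via_direct = direct_convolve(signal, kernel)
print(via_fft, via_direct)
```

Both paths give the same result up to floating-point rounding; the FFT path pays off for large inputs and kernels, where its O(n log n) cost is lower than that of direct convolution.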
Lavin et al. proposed a new fast algorithm for convolutional neural networks [208], which is based on the minimal filtering algorithm discovered by Toom [209] and Cook [210] and popularized by Winograd [211]. Compared to direct convolution, this algorithm can reduce the arithmetic complexity of convolutional layers by up to 4 times. The arithmetic is performed by dense matrix multiplication of sufficient dimensions, and the memory requirements are low compared to traditional FFT convolution algorithms. These factors make practical implementation possible. Their implementation on an NVIDIA Maxwell GPU achieved state-of-the-art throughput for all measured batch sizes from 1 to 64 while using at most 16 MB of workspace memory.
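The smallest instance of the minimal filtering algorithm, usually written F(2, 3), computes two outputs of a 3-tap filter with 4 multiplications instead of 6. The Python sketch below implements this one-dimensional case and counts multiplications for the nested two-dimensional case; the function names are illustrative and the code is not taken from [208].

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over a 4-sample tile: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """The same two outputs with F(2, 3): 4 multiplications."""
    # Filter transform: can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise multiplications on the transformed input tile.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

def multiplication_reduction(m, r):
    """Reduction factor of the nested 2D algorithm F(m x m, r x r)."""
    direct = (m * m) * (r * r)        # m*m outputs, r*r multiplications each
    winograd = (m + r - 1) ** 2       # one multiplication per tile element
    return direct / winograd

tile = [1, 2, 3, 4]
taps = [2, 1, 3]
fast = winograd_f23(tile, taps)
reference = direct_f23(tile, taps)
print(fast, reference, multiplication_reduction(2, 3), multiplication_reduction(4, 3))
```

For 3 x 3 filters the multiplication count drops by a factor of 2.25 with 2 x 2 output tiles and by a factor of 4 with 4 x 4 output tiles, which matches the "up to 4 times" figure quoted above.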

VI. CHALLENGES AND THE ROAD AHEAD
Artificial intelligence and deep learning are already being applied in many different areas, and in the coming years, AI will be a driving force of the economy. The development and popularization of artificial intelligence applications are closely related to technological progress. An algorithm is deployed on a chip consisting of several devices implemented in a certain technology, such as CMOS technology. The growth in the number and complexity of AI applications places increasing performance requirements on hardware (application-driven development). On the other hand, the development of new technologies and hardware improvements allows the development of more complex and, therefore, more accurate applications. The two development directions continue to complement each other and form a virtuous circle. To maintain such a high growth rate, industry and academia will face new challenges in the coming years. Two possible development directions are discussed below.

A. DEVELOPMENT PROSPECTS IN TERMS OF HARDWARE IMPLEMENTATION
The advent of memristors and their synapse-like behavior opened up the possibility of overcoming the limitations of CMOS technology. Memristors can be as small as a few nanometers and can be densely packed in two-dimensional layers with nanoscale spacing, potentially providing higher neuron and synapse densities. Since the manufacturing process is much cheaper than CMOS, the memristor layers can be stacked in 3D. This approach can achieve the neuronal and synaptic densities of the human brain on a single plate. Furthermore, the tight 3D dense packing between the CMOS neural computing unit and the memristive adaptive memory synaptic element can significantly reduce the current consumption of the final system [129].
The 3D integration technology centered on through-silicon via (TSV) interconnection mainly affects the interconnection structure between chips, so it mainly reduces the circuit board area required for interconnection between chips. This technology is generally implemented by vertically stacking multiple memory or logic function chips and connecting the TSVs made in the upper layer of the stacked structure to the bond pads on the top of the lower chips. However, each layer of chips in the stacked structure adopts its own design and is still a traditional two-dimensional structure, so the circuit-level interconnection inside each layer of chips is still a traditional two-dimensional design.
In contrast, in monolithic 3D technology, the 3D integration of the interconnection layers inside the chip is more thorough, so this technology is usually called "true 3D integrated design". Here, each layer in the chip stack is designed as a functional unit of the whole, so that all layers in the stack can use the same internal interconnection structure, which further reduces the length of interconnect lines. Moreover, due to the unified design, the area occupied by signal relay circuits and the like is also smaller, so the overall footprint of the chip can be smaller.
3D integration technology brings the advantages of high bandwidth, shorter interconnects, and potentially high parallelism. Circuits can be interconnected on multiple planes and routed vertically through the planes. 3D techniques can improve neuromorphic computing efficiency by implementing 3D layers layer by layer [217], [218], and it is also possible to separate memory and logic parts on different layers [218], [219]. Zhang et al. proposed that in monolithic 3D [217], the implementation of digital neuromorphic chips for formal processing performs better if both memory and logic are distributed across multiple chips, which can save about 20% power relative to the same 2D implementation. Regarding speed, Kim et al. proposed that stacked memory can greatly improve throughput relative to traditional memory-to-side implementations [219]. In addition, some researchers utilize through-silicon via technology processes to limit interconnect density [220], [221] to design analog neuron models with reduced capacitance footprints, increasing throughput and reducing power consumption. However, 3D circuits are not yet a mature technology [222], [223]. 3D technology not only has unique design constraints but also has high process costs. Furthermore, since the AER protocol already allows low-power communication between neural cores, further work is required to understand to what extent 3D techniques can improve the performance of SNN accelerators.
Apart from crossbar arrays and 3D circuits, Morro et al. proposed replacing part of an ASIC designed with traditional digital gates with neuromorphic hardware [224]. This hybrid neuromorphic integration may be relevant for other applications. Yousefzadeh et al. developed a chip for evaluating formal ANNs that uses event-driven communication between layers [225], a topology they refer to as a hybrid neural network. It requires circuitry to convert asynchronous event-driven information into frames and vice versa. They employ the AER protocol between layers, where the information is encoded in 4 bits, thus guaranteeing minimal AER bus width overhead. For each non-zero activation, a single-word packet is sent through the asynchronous NoC, which is enough to transmit the complete information from one neuron to another. This technique can combine the advantages of SNNs and ANNs while mitigating their disadvantages.

B. DEVELOPMENT PROSPECTS OF SC IN SMART CHIPS
By comparing the application of SC in various neural networks, it can be found that neural networks based on SC have great advantages in area and power consumption. Due to the extreme parallelization of the SC circuit, all data can technically be preloaded into local memory before the start of the SC cycle. At the same time, a random stream can take hundreds or even thousands of clock cycles to complete (one clock cycle for each random bit). SC can pipeline all SNN arithmetic operations from top to bottom, so that in each SC clock cycle the current bit of every stream passes through all SNN layers. Therefore, the memory bandwidth bottleneck is not a problem in SC circuits. The arithmetic circuits in SC allow massive parallelization, which benefits SNN hardware implementation in edge computing applications. This advantage is prominent when noise margin is essential at higher clock speeds and when parallelizing large SNN models with big data.
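The relation between stream length, precision, and latency can be estimated with a short Python calculation. It assumes that the bits of a stream are independent Bernoulli samples, so that the decoded value has a standard error of sqrt(p(1 - p) / N), and that one bit is processed per clock cycle; the 100 MHz clock is a hypothetical figure used only for illustration.

```python
import math

def stream_std_error(p, length):
    """Standard error of the decoded value for an i.i.d. stream of given length."""
    return math.sqrt(p * (1 - p) / length)

def required_length(p, target_std):
    """Shortest stream whose standard error does not exceed target_std."""
    return math.ceil(p * (1 - p) / target_std ** 2)

def latency_seconds(length, clock_hz):
    """Time to process one stream at one bit per clock cycle."""
    return length / clock_hz

worst_case_p = 0.5                                       # variance is largest at p = 0.5
coarse = required_length(worst_case_p, 1 / 128)          # target error of 1/128
fine = required_length(worst_case_p, 1 / 256)            # halving the error...
latency = latency_seconds(fine, clock_hz=100e6)          # assumed 100 MHz clock

print(coarse, fine, fine / coarse, latency)
```

Halving the tolerated error quadruples the stream length, which explains why SC streams run for hundreds or thousands of cycles and why latency grows quickly with the required precision.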
Due to the high degree of parallelism of SC, SC designs can achieve performance similar to that of traditional binary designs. These advantages make SC-based SNNs a potentially competitive candidate in resource-constrained applications. Despite the strong parallelism of SC, the data bandwidth bottleneck remains a significant challenge. To address this, the algorithm can be modified to reduce the number of data items used (for example, by model compression, pruning, or quantization).
One solution is to increase memory bandwidth, which is the purpose of high-bandwidth memory (HBM), a stacked DRAM integrated with the processing elements through a silicon interposer. The bandwidth of a single HBM2 block is 256 GB/s, which is lower than the 616 GB/s bandwidth of more traditional Graphics Double Data Rate 6 (GDDR6) memory. However, a configuration with four HBM2 blocks achieves a bandwidth of 1 TB/s. HBM2 memory is currently used in Nvidia V100 and P100 GPUs.
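The bandwidth figures above can be checked with a few lines of Python. The sketch assumes that the bandwidth of several HBM2 blocks simply adds up and uses only the numbers quoted in the text; the helper name is illustrative.

```python
import math

HBM2_BLOCK_GBPS = 256   # per-block HBM2 bandwidth quoted in the text
GDDR6_GBPS = 616        # GDDR6 bandwidth quoted in the text

def aggregate_bandwidth(blocks, per_block_gbps=HBM2_BLOCK_GBPS):
    """Total bandwidth in GB/s, assuming the blocks operate in parallel."""
    return blocks * per_block_gbps

four_blocks = aggregate_bandwidth(4)                            # GB/s
blocks_to_beat_gddr6 = math.ceil(GDDR6_GBPS / HBM2_BLOCK_GBPS)  # smallest block count reaching 616 GB/s

print(four_blocks, blocks_to_beat_gddr6)
```

Four blocks give 1024 GB/s, that is, about 1 TB/s, and under this additive assumption three blocks already exceed the quoted GDDR6 figure.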
Another option is in-memory computing (IMC), which involves moving logic into memory. IMC enhances SNN acceleration by reducing the latency and power consumption required to access memory hierarchies in traditional von Neumann architectures. In addition, parallelization is increased by processing all memory cells simultaneously.
SC takes hundreds or even thousands of clock cycles to complete, so data transfers can be pipelined and buffered asynchronously. Furthermore, a large amount of data needs to be prepared in addition to the SC elements. Therefore, the limitations of local storage elements such as SRAM (in ASICs) or BRAM and flip-flops (in FPGAs) should be a concern. In any case, memory-centric computing design should be the direction of SC development, especially in SC SNNs, where hundreds of thousands or even millions of operations can be parallelized. Since most modern FPGAs consist of 6-input lookup tables, there is still much room for optimization in implementing SC on FPGAs. ASIC logic may not translate efficiently to FPGA fabric because lookup tables are hardwired. Although FPGAs are flexible in terms of hardware implementation, they are not as customizable as ASICs. Modern FPGAs also include other resources capable of high-performance calculations, such as digital signal processors or arithmetic logic, that remain to be exploited. There are also challenges in overcoming randomness and further improving classification accuracy while maintaining high energy efficiency. However, with the application of SC designs in large SNNs, SC provides an alternative, scalable solution for the hardware implementation of spiking neural networks with the potential for efficient machine learning.

VII. SUMMARY
This paper summarizes five neuron models, coding methods, network topologies, and learning algorithms commonly used in SNNs, and introduces the basic principles and application scenarios of SC. On this basis, three kinds of traditional spiking neural network chips, namely digital-analog hybrid neuromorphic chips, pure digital neuromorphic chips, and memristor-based neuromorphic chips, are reviewed. Their characteristics, advantages, and disadvantages in terms of synapse scale, chip area, and manufacturing process are compared and summarized. It can be seen that SNN chips have opened up a new way for high-performance neural computing platforms to realize low-power neural network computing. However, the research on SNN chips is not mature and is in a stage of rapid development, facing many challenges. At present, the main problems are as follows. First, the parameters of the processor should be highly configurable so that the chip can accurately handle actual tasks. Second, software-hardware correspondence should be realized so that the same program runs on the simulator and on the chip, parameter updates can be implemented in software, and parameters can also be adjusted on the chip; the chip should also support the parallel computing of SNN algorithms and traditional ANN algorithms, making it universal. Third, SNNs should be run and trained with extremely low power consumption to realize low-power neural network edge computing. It is still difficult to design a low-power, highly scalable, and parallel SNN chip.
The advent of memristors opened up the possibility of overcoming the limitations of CMOS technology. Memristors are small in size and can be densely packed in two-dimensional layers with nanoscale spacing, providing greater neuronal and synaptic density. Memristors are made in a much cheaper process than CMOS and can be stacked in 3D. This approach can achieve the neuronal and synaptic densities of the human brain on a single plate. Furthermore, the tight 3D dense packing between the CMOS computing unit and the memristive adaptive memory synaptic element can significantly reduce the current consumption of the final system. 3D integration technology brings the advantages of high bandwidth, shorter interconnects, and potentially high parallelism. Circuits can be interconnected on multiple planes and routed vertically through the planes, and 3D technology can improve neuromorphic computing efficiency by implementing 3D layers layer by layer. However, memristor-based neuromorphic computing is still in the early stage of research, and the main work is still to verify in principle the possibility of realizing neural computing with a single device or to conduct small-scale experiments by building a small-scale non-reconfigurable memristor network. Achieving large-scale multi-core reconfigurable memristor neuromorphic chips remains a challenge.
SC is a logic computation method that converts binary numbers into probability-encoded digital pulse streams, which has the advantages of extremely low area, low power consumption, and high energy efficiency. The application of SC in traditional neural network chips can not only solve the problems of high power consumption and large memory bandwidth of traditional von Neumann architecture processors but also maintain a high accuracy rate. Adding SC to the SNN chip realizes a computing circuit with extremely low power consumption, at the milliwatt level, and high computational efficiency. Based on SC, traditional ANN and SNN algorithms can be implemented, and ANN conversion can also be completed, so the chip is universal for SNN computation. Although some achievements have been made in building neural network chips using SC, there are still shortcomings. First, the accuracy of SC is closely related to the correlation of the input sequences. Second, increasing the sequence length can increase the accuracy of the calculation, but an excessively long sequence leads to long delays and low throughput, making it difficult for the network to operate at high speed and in real-time applications. To improve the computational performance of SC, reducing the sequence length and combining SC with the Winograd algorithm can reduce the computational delay and energy consumption. It is also possible to avoid unnecessary computation during SNN operation by a pruning-based method. Pruning can also alleviate the correlation between input signals and makes the transmission of spikes between spiking neuron layers sparse, which also helps improve calculation accuracy.
The SNN based on SC retains the original advantages of the SNN and provides a new idea for chip design based on SNNs. However, there are still several problems. First, it is not yet possible to exploit the greatest advantages of neuromorphic chips in terms of efficiency and energy consumption for on-chip learning; second, the high configurability of system parameters and the diversification of configurable parameters have not yet been achieved. Nevertheless, with the rapid development of artificial intelligence, the development of SNN chips will further tap the potential of ANNs.