A Software-Circuit-Device Co-Optimization Framework for Neuromorphic Inference Circuits

Neuromorphic circuits, which usually use analog computation for vector-matrix multiplication (VMM) in neural networks (NN), are promising machine learning accelerators with much lower latency and power consumption than digital ones. Analog computation is expected to have a more efficient design space than digital computation since the signals are not digitized. Therefore, it is very suitable for Internet-of-Things (IoT) applications that require ultra-low power consumption at a low cost. For IoT applications, it is sometimes also desirable to eliminate the digital circuits (such as adders, registers, shifters, multiplexers, and Analog-to-Digital Converters) between the VMM arrays to further reduce the power consumption. However, the optimization of a purely analog circuit is more difficult and requires full SPICE circuit simulations. In this paper, we present a software-circuit-device co-optimization framework using a Python wrapper for automatic full-circuit SPICE simulation and analysis of neuromorphic circuits. This framework allows users to experiment with how the NN design (software) affects the performance of the hardware neuromorphic circuits. It takes Verilog-A or SPICE models from calibrations or PDKs in various technologies and emerging memories (such as ReRAM) without further calibration (unlike behavioral models). We show that the simulation time is reasonable even with hundreds of thousands of synapses under limited computation resources. Using ReRAM and a generic 45nm technology as an example, the effects of the feedback network and OpAmp design, software ML architecture, and input data accuracy on the inference accuracy are studied.


I. INTRODUCTION
Machine learning (ML), particularly the neural network (NN), has revolutionized almost every aspect of our daily life. There are two phases in an ML process, namely machine training and data inference [1]. In the machine training phase, a machine is trained, usually with a large amount of data. In the data inference phase, the trained machine is used to infer the properties of the data (e.g. to determine the numerical value of a hand-written digit image [2]). Both phases involve the movement of a large amount of data, resulting in an increase in latency and energy consumption [3]. This creates an almost impassable barrier to the further scaling of the traditional von Neumann computing architecture, in which the memory and computing units are separated by an ever higher "memory wall" [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Mostafa Rahimi Azghadi.
To circumvent the memory wall, Compute-in-Memory (CiM) has been proposed and has attracted a lot of attention in the last decade [5]-[7]. By performing the computation in the memory element, most of the data movement is obviated and power consumption can be reduced substantially. Among various CiM ideas, using emerging memories, such as Resistive Random-Access-Memory (ReRAM) [8], [9], Ferroelectric Random-Access-Memory (FeRAM) [10], Spin Transfer Torque Random-Access-Memory (STT-RAM) [11], Phase Change Memory (PCM) [12], Electro-Chemical Random-Access-Memory (EC-RAM) [13], etc., is the most promising one for NN tasks. This is because 1) they are non-volatile, 2) they naturally form an array that performs analog computation in a single constant time step to replace the Vector-Matrix-Multiplication (VMM), which is the most power- and time-consuming operation in NN [6], [14], [15], and 3) they have very small form factors (e.g. ReRAM and PCM are simple cross-point memories which can be formed using the Back-End-of-Line (BEoL) in a Complementary Metal-Oxide-Semiconductor (CMOS) process [16]). In this neuromorphic computing architecture, the emerging memories mimic the synapses in a biological neural network.
Due to its non-volatility and low energy consumption compared to the traditional architecture, the NN accelerator using emerging memory is expected to be used in Internet-of-Things (IoT) devices [17], [18], which are usually powered by an irreplaceable battery or by energy harvested from the environment. Very often, the NN used in IoT is trained offline using large servers and the weights are uploaded to the IoT device only for inference. Therefore, in this paper, only the inference properties of the emerging memory-based NN are studied.
Since emerging memory-based NNs are mostly analog in nature, their optimization is not trivial. The accuracy, power consumption, area, and speed depend on the NN (number of layers and nodes), the Digital-to-Analog Converter (DAC) accuracy, emerging memory electrical characteristics, amplifier properties, and the rectification unit performance. For IoT applications, to minimize the cost while satisfying the accuracy and power consumption requirements, a co-simulation framework for software-circuit-device optimization is necessary.

A. RELATED WORKS
There have been various simulation studies of neuromorphic circuits, but most of them do not use SPICE simulations and cannot use the technology Process Design Kit (PDK) directly. In [6], a MATLAB framework was built to study the effect of temperature, loading resistance, and input voltage range on the performance, however, without considering their interactions. It only studies the precision of the circuit behavior instead of the accuracy of the final ML outputs. Since neuromorphic circuits are fault-tolerant, the findings might be overly pessimistic. In [19], a system-level simulator is built that is intended to be used with behavioral models for large system simulation. However, behavioral models require calibration against the SPICE model. Despite its lower speed, SPICE simulation provides more insight into circuit design and avoids the use of behavioral models and additional calibrations. In [14], SPICE is used, but only for studying the behavior of the loading resistance, and there is no interaction with the ML algorithm design. Some work has been done to develop system-level simulation frameworks to quantify the performance tradeoffs between area, read/write energy, read/write latency, and leakage power in implementing ReRAM arrays [20]-[24]. However, they do not cover other performance metrics that matter when the arrays are applied in compute-in-memory architectures and neuromorphic circuits, such as classification accuracy and computational speed.
Lammie et al. [23] developed a simulation framework that supports sparse weights and neurons through L1 regularization and dropout during training. The training is also quantization-aware, taking into account the limited resolution of the ReRAM weights. After training, the neuromorphic circuit is generated in SPICE or other RRAM-based DL simulation frameworks to obtain the test accuracy. The framework allows the simulation of various device non-idealities such as device-to-device variability, finite conductance states, stuck R_ON, R_OFF, etc., as well as functional models of the circuit blocks.
A PyTorch-based analog ANN simulation framework was developed with hardware-aware training and a CUDA-capable (GPU-accelerated) C++ simulator [26]. Circuits such as DACs, ADCs, etc., are replaced with functional models, so the framework is not capable of transistor-level accuracy.
A simulation framework of differential-architecture crossbar arrays is developed in [27] to simulate spiking recurrent neural networks with PCM.
Finally, in [28], a comprehensive summary of various simulation frameworks of CiM is provided. The simulation frameworks are classified based on e.g. the programming languages, the capability of training simulations, the inclusion of periphery circuit simulation, the types of device supported, etc.

B. SUMMARY OF THIS WORK
In this paper, we developed and realized a software-circuit-device co-optimization framework for neuromorphic inference circuits using purely SPICE simulation, based on our previous works [29]-[31]. This framework automatically constructs the neuromorphic circuit based on the software ML model. It can take Verilog-A or SPICE models for its 4 major components, namely the DAC, neuromorphic memory, current comparator, and rectification unit. It performs full analog simulations. To minimize the energy consumption and the area for IoT applications, no select transistors are included and no digital circuits are used in the design (although they can be included if needed). This allows us to study the close interaction between the neuromorphic device and various parts of the circuits, which is only possible with a full analog circuit simulation. Since the periphery circuits are an important part of the simulation, using SPICE models from the PDK makes the simulation more accurate and makes co-optimization easier.
The paper is organized as follows. Section II explains the framework and its capabilities with an example that shows the interplay of the resistors in the feedback network of the current comparator. Section III discusses the effect of the DAC and input voltage range on inference accuracy. Section IV discusses the OpAmp design considerations based on the inference accuracy requirement and its interaction with the feedback network. Section V demonstrates the software-circuit co-optimization, followed by conclusions.

II. THE FRAMEWORK
A typical neuromorphic circuit for NN is shown in Fig. 1. As mentioned earlier, to minimize the area and power consumption for IoT applications, only purely analog neuromorphic circuits are studied. Therefore, the circuit does not have inter-layer Analog-to-Digital Converters (ADC), multiplexers, registers, shifters, and adders. It has 4 major components, namely the DAC, emerging memory (ReRAM is shown as an example), current comparator, and rectification unit [32]. The weight of the NN is encoded as the conductance of the ReRAM. Since the conductance is always positive, two ReRAMs are used to encode one weight so that negative weights can be represented, and a current subtractor converts the difference of the two currents into the true value. When the weight is positive, it is applied to the left string and the right string element is set to the maximum gap size to achieve minimal conductance (corresponding to ∼300 kΩ), and vice versa when the weight is negative.
VOLUME 10, 2022
Fig. 2 shows the framework. It consists of a Jupyter and Python (version 3.7.3) wrapper. The Jupyter wrapper is the graphical user interface of the framework, which allows the user to view the plots and results of the simulations. It also serves as a user-friendly interface for setting the simulation and plotting options. The Python wrapper receives these settings and initializes the training of a software-based neural network. The weights of the NN are then automatically mapped into resistances of the memory devices in the crossbar array (Fig. 1 and Fig. 3). In this study, ReRAM is used and its Verilog-A model is developed based on [33]. The temperature dependence of the leakage current is added by calibrating to the experiment in [6]. Moreover, the time integration methodology in the Verilog-A code is improved over the original code so that program and erase are independent of the initial bias before voltage sweeping [29] (a limitation in the original model).
The framework is also compatible with other types of memory devices as long as the SPICE or Verilog-A model and the equation relating the conductance to the parameter of the device are provided. In the case of the ReRAM, the programmed parameter is the gap size between the filament and the top electrode. Fig. 4 shows the electrical characteristic of the ReRAM and its relationship to the gap size. Cadence Spectre is used for the circuit simulation [34] but other EDA software can be integrated using the same setup. A generic 45nm technology available in Cadence is used to design the peripheral circuits.
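The differential weight encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's code: the linear scaling of the weight magnitude onto the conductance window and the value of `G_MAX` are assumptions, and only the ∼300 kΩ minimum-conductance state is taken from the text. The subsequent conversion from conductance to gap size depends on the device model and is not shown.

```python
# Conductance window in siemens. G_MIN follows the ~300 kOhm state quoted in the
# text; G_MAX is a hypothetical value chosen only for illustration.
G_MIN = 1.0 / 300e3
G_MAX = 1.0 / 10e3


def weight_to_conductance_pair(w, w_max):
    """Map one signed weight to (G_plus, G_minus) for the two columns."""
    if w_max <= 0:
        raise ValueError("w_max must be positive")
    # Assumed linear scaling of |w| onto [G_MIN, G_MAX].
    magnitude = min(abs(w), w_max) / w_max
    g_active = G_MIN + magnitude * (G_MAX - G_MIN)
    # The unused device of the pair is parked at minimal conductance.
    if w >= 0:
        return g_active, G_MIN
    return G_MIN, g_active


def map_layer(weights):
    """Map a 2-D list of weights (rows = inputs, columns = neurons)."""
    w_max = max(abs(w) for row in weights for w in row) or 1.0
    return [[weight_to_conductance_pair(w, w_max) for w in row] for row in weights]


layer = [[0.5, -1.0], [0.0, 0.25]]
pairs = map_layer(layer)
for row in pairs:
    print([(f"{gp:.3e}", f"{gm:.3e}") for gp, gm in row])
```

With this encoding a zero weight places both devices at `G_MIN`, so the two column currents cancel in the subtractor.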
Once the weights have been mapped into the parameters of the memory devices, the python wrapper proceeds to generate the netlist of the neuromorphic circuit from the model files and subcircuit templates of the memory devices and other circuit blocks such as the DACs, current subtractor, and the rectification unit as shown in Fig. 1 to Fig. 3. The input test data is also scaled and converted into binary representation to be fed into the DACs.
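As a rough illustration of the netlist generation step, the sketch below fills a subcircuit template for every cross-point of one array. The subcircuit name `reram_cell`, its `gap` parameter, the node naming, and the include file name are all hypothetical; the exact include and instance syntax depends on the simulator, and the framework's own templates are not reproduced here.

```python
from string import Template

# Hypothetical instance template; names are illustrative only.
CELL = Template("X_${name} row${r} col${c} reram_cell gap=${gap}")


def crossbar_netlist(gaps, include="reram_model.sp"):
    """Return SPICE-style netlist text for a 2-D list of gap sizes (metres)."""
    lines = [
        f"* auto-generated crossbar, {len(gaps)} rows x {len(gaps[0])} cols",
        f'.include "{include}"',
    ]
    for r, row in enumerate(gaps):
        for c, gap in enumerate(row):
            lines.append(
                CELL.substitute(name=f"r{r}c{c}", r=r, c=c, gap=f"{gap:.3e}")
            )
    lines.append(".end")
    return "\n".join(lines)


text = crossbar_netlist([[1e-9, 2e-9], [3e-9, 4e-9]])
print(text)
```

Keeping each circuit block as a template means a different memory device or peripheral circuit can be swapped in by changing the template and the model file, without touching the generation loop.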
In order to assess the accuracy of the neuromorphic circuit, multiple simulations are performed using the co-optimization framework. Each of these simulations corresponds to a data point in the test dataset. Multiprocessing is enabled in the framework allowing up to 30 parallel Spectre simulations.
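One way to organize such a batch of independent simulations is sketched below. It is an assumption about structure, not the framework's implementation: the command tuple is a placeholder for the actual simulator invocation, and a thread pool is used because each worker only waits on an external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 30  # upper limit on parallel simulations quoted in the text


def run_simulation(netlist_path, command=("spectre",)):
    """Launch one simulator process; `command` is a placeholder invocation."""
    result = subprocess.run(
        [*command, netlist_path], capture_output=True, text=True
    )
    return netlist_path, result.returncode


def run_all(netlist_paths, runner=run_simulation, workers=MAX_PARALLEL):
    """Run one simulation per test sample, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(runner, netlist_paths))


def fake_runner(path):
    """Stand-in runner so the sketch can be exercised without a simulator."""
    return path, 0


results = run_all([f"sample_{i}.scs" for i in range(5)], runner=fake_runner)
print(results)
```

The inference accuracy is then the fraction of samples whose simulated output matches the label, computed after all runs have returned.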
As an initial illustration, using the framework, a software-based NN of size [50,20,8] (i.e. 3 hidden layers with 50, 20, and 8 nodes, respectively) was trained with 1617 images from the UCI dataset of handwritten digits [35]. Each image is an 8 × 8 matrix. Therefore, the input layer has 64 nodes and the output layer has 10 nodes for the digits from 0 to 9. Fig. 1 illustrates the corresponding circuit with 8966 ReRAMs (note that this includes the bias rows; each ReRAM array corresponds to the VMM from one layer to the next, and therefore there are 4 ReRAM arrays). The Python wrapper generates the netlist of the neuromorphic circuit using Verilog-A models of the ReRAM weights, DAC, Op-Amp, and ReLU circuit blocks. Here only Verilog-A models are used. The effect of the Op-Amp design using SPICE models will be discussed in the following sections. The open-loop gain, A_OL, of the op-amps is set to 80 dB. The resistances of the feedback network in the current subtractor (Fig. 2), R_in, R_1, and R_2, were set to 100 Ω, 1 kΩ, and 100 kΩ, respectively. The neuromorphic circuit and the software NN were then both tested on the remaining 180 images. The accuracy of the neuromorphic circuit and of the software NN are both 96.67%. The MLPClassifier in Scikit-Learn (version 0.20.3) is used with default settings (e.g. the adam solver is used with a regularization parameter of 0.0001, shuffle=True, and batch_size=auto, which is 200 in this study) [36].
The output of the current subtractor in Fig. 2 is given by Eq. (1), where I_col+ and I_col− are the currents flowing through the positive and negative columns (Fig. 1). Ideally, with a very large open-loop gain, the output is given by:

$$V_{out} = \frac{R_{in} R_2}{R_1}\left(I_{col+} - I_{col-}\right)$$

Therefore, the current subtractor also acts as a current-to-voltage converter with a ratio of R_in R_2/R_1, and the additional factor in Eq. (1) is the non-ideality factor. Based on the non-ideal equation (Eq. (1)), it is expected that for a fixed R_2, as R_1 increases, the non-ideality factor and gain error will decrease. However, if R_1 is too large, the closed-loop gain would be too small to amplify the currents to usable voltage levels, which causes errors in the computation. Therefore, to retain a high closed-loop gain while keeping R_1 large enough to reduce the gain error, R_in can be increased as long as R_in is still much less than R_2 + R_1 and the resistances in the crossbar. This is a nontrivial optimization problem, as Fig. 5 shows for the trade-off between R_in and R_1 (for R_2 = 100 kΩ). Based on the area requirement, resistance accuracy, and inference accuracy requirement, one may choose different R_in and R_1 pairs for the application. For example, R_in/R_1 = 10 Ω/100 Ω gives the highest accuracy of 97.22%. However, one may instead choose R_in/R_1 = 1 Ω/0.1 Ω (96.67% inference accuracy) to obtain the smallest layout area, at the cost of potentially worse process variation.
The framework also enables the plotting of the voltage, current, resistance, power, and gap size of the ReRAMs for visualization and debugging. Fig. 6 shows the potential distribution in various layers (layer 0 to layer 3) for an example input (hand-written digit "3") and the input potential. It can be seen that the potential is very uniform across the rows because of the small potential drop across the horizontal lines. The difference in the currents flowing through the ReRAMs in adjacent columns represents a multiplication operation on the input voltage and the conductance of the ReRAM.
The total current flowing in a column is the sum of all these products and represents the accumulation operation. These form the basis of the Multiply-and-Accumulate operation needed in ANNs. In Fig. 7, the total column current in the last layer is maximum in column 3. The neuromorphic circuit correctly predicts the label (3) of the input shown in Fig. 6. Note that the current plot has half the width of other plots because the difference in the currents of two adjacent branches is displayed. Fig. 8 and Fig. 9 plot the ReRAM resistance maps and power consumption maps. The total power consumption is only 69 µW.
FIGURE 5. Training accuracy as a function of R_in and R_1 in the current comparator in Fig. 3. This is for the [50,20,8] NN for the UCI hand-written test set with R_2 = 100 kΩ.
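The multiply-and-accumulate operation and the ideal current-to-voltage conversion can be illustrated numerically. The sketch below assumes ideal conditions (no wire resistance, columns held at virtual ground, infinite open-loop gain) and uses made-up input voltages and conductances; only the resistor values and the ratio R_in R_2/R_1 are taken from the text, and the sign convention is an assumption.

```python
def column_currents(voltages, conductances):
    """I_j = sum_i V_i * G_ij for each column j (Ohm's law plus current summation)."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


def ideal_subtractor_output(i_plus, i_minus, r_in, r1, r2):
    """Ideal current-to-voltage conversion with ratio R_in * R_2 / R_1."""
    return (r_in * r2 / r1) * (i_plus - i_minus)


# Two inputs driving one differential column pair; values are illustrative.
v_in = [0.1, 0.2]          # volts
g = [[50e-6, 5e-6],        # siemens: [positive column, negative column]
     [5e-6, 20e-6]]

i_plus, i_minus = column_currents(v_in, g)
v_out = ideal_subtractor_output(i_plus, i_minus, r_in=100.0, r1=1e3, r2=100e3)
print(f"I+ = {i_plus:.2e} A, I- = {i_minus:.2e} A, V_out = {v_out:.4f} V")
```

With R_in = 100 Ω, R_1 = 1 kΩ, and R_2 = 100 kΩ the conversion ratio is 10 kΩ, so a 1.5 µA current difference becomes 15 mV.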
By enabling parasitic simulation in the simulation options of the Jupyter wrapper, parasitic components can be added to the netlist as well as the post-layout extractions of the subcircuits. This is useful for measuring the speed of the neuromorphic circuit with transient simulations.
The ReRAM may be formed between the silicided poly and M1 (poly/M1) or between M1 and M2 (M1/M2). This framework allows users to incorporate the parasitic resistance and capacitance of the wires in the simulation (Fig. 3) to study the transient response. Table 1 shows the extracted parasitic capacitance and resistance when the ReRAM is formed at the cross-points of poly/M1 or M1/M2. The minimum spacing and line width of the 45nm technology are used.

III. THE EFFECT OF DAC AND VOLTAGE RANGE
In reality, the input data to the neuromorphic circuit in an IoT application can be analog or digital. Here it is assumed that the input signal has been digitized (e.g. a digitized camera image). Depending on the accuracy and power requirements, it can be digitized to different numbers of binary digits. Therefore, a DAC is needed to convert the digital signal to the analog voltage for the neuromorphic circuit (Fig. 1). The UCI images are used in this study. The pixel values in the UCI data set are in the range of p = 0 to 16 and can be represented by 5 bits. To emulate images digitized with a different number of bits, M, the following equation is used to transform the pixel values, p, in the testing data set.
where V_i is the input voltage to the neuromorphic circuits (e.g. i = 0 to 63 in Fig. 1), N = 5, and V_s is the scaling voltage. Int() is a round-off-to-integer function. Note that the training process still uses the original UCI data (i.e. 5 bits). We then study how M and V_s affect the inference accuracy. Two NNs are tested, namely [50,20,8] and [8,8,8]. Fig. 10 and Fig. 11 show the surface plots of the prediction accuracy of these two NNs as a function of M and V_s.
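To make the idea of emulating an M-bit input concrete, the following sketch requantizes an N-bit pixel value and scales it to a voltage. The specific form used here, V = V_s · Int(p · 2^M / 2^N) / 2^M, is an assumption chosen for illustration and should not be read as the transform used in this study; the function and parameter names are likewise hypothetical.

```python
def requantize_pixel(p, m_bits, v_s, n_bits=5):
    """Assumed transform: V = V_s * Int(p * 2^M / 2^N) / 2^M."""
    scaled = p * (2 ** m_bits) / (2 ** n_bits)
    code = int(scaled + 0.5)  # round half up; valid for non-negative values
    return v_s * code / (2 ** m_bits)


# Number of distinct input levels seen by the circuit for pixel values 0..16.
for m in (3, 4, 5):
    levels = sorted({requantize_pixel(p, m, v_s=1.0) for p in range(17)})
    print(f"M = {m}: {len(levels)} distinct levels")

v_example = requantize_pixel(16, 5, v_s=0.1)
print(f"p = 16, M = 5, V_s = 0.1 V -> {v_example:.3f} V")
```

Under this assumed form, lowering M merges neighbouring pixel values into the same voltage level, which is the loss of input precision whose effect on accuracy is studied here.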
The accuracies of the [50,20,8] and [8,8,8] software neural networks are 96.67% and 90%, respectively. As expected, for a reasonable V_s, as the DAC resolution is increased, the neuromorphic circuit accuracy generally increases until it reaches the software neural network accuracy as the limit. However, note that for the [50,20,8] NN, the hardware accuracy when V_s = 0.1 V and M ≥ 5 is 97.22%, which is higher than the software accuracy. This shows the non-trivial nature of neuromorphic circuit optimization.
It can also be seen that the larger NN is more robust than the smaller NN. Firstly, the smaller NN [8,8,8] only has high accuracy (within 10% of the peak accuracy) from V_s = 0.1 V to V_s = 0.4 V, while the larger one [50,20,8] has high accuracy from V_s = 0.04 V to V_s = 0.48 V. Moreover, there is a wide range of V_s (also 0.04 V to 0.48 V) in which the accuracy is high even with M = 3 for the large NN, but this happens only for V_s = 0.16 V in the small NN.

IV. EFFECT OF THE OPAMP DESIGN
As shown in Fig. 1, the OpAmp plays an important role in the neuromorphic circuit. It is an essential part of the current comparator. It also acts as the buffer for the rectification unit (see [30]). Therefore, it is important to study its effect on inference accuracy. Table 2 shows the effect of the OpAmp open-loop gain (A_OL) on the inference accuracy compared to the software one. R_2 is set to 100 kΩ and R_1/R_in = 10. It is found that an open-loop gain of 80 dB is required to attain software accuracy. When R_in is small (e.g. 10 Ω), as discussed earlier, a high A_OL is required so that the gain error in Eq. (1) can be reduced. For example, at A_OL = 60 dB and R_in = 10 Ω, the inference accuracy is reduced substantially, by 47%. Therefore, it might be desirable to use R_1/R_in = 50/500 by using unsilicided poly resistance to relax the design requirement of the OpAmp.
A two-stage amplifier using a folded cascode followed by a common-source amplifier, with a layout area of about 3 µm², is designed using the generic 45nm PDK. Fig. 12 shows the schematic with A_OL = 68 dB. Fig. 13 shows that it is stable with the feedback. It is found that the stability increases as R_2/R_1 increases. This is expected because the feedback factor is R_1/R_2. Therefore, the system is more stable when R_1/R_2 is smaller.
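The interaction between the open-loop gain and the feedback network can be estimated with the textbook single-loop feedback approximation, in which the relative closed-loop gain error is 1/(1 + A_OL·β). The sketch below applies it with the feedback factor β = R_1/R_2 stated above. It is a first-order estimate under that assumption and is not a substitute for Eq. (1) or for the SPICE results in Table 2.

```python
def db_to_linear(gain_db):
    """Convert a voltage gain in dB to a linear ratio."""
    return 10 ** (gain_db / 20)


def relative_gain_error(a_ol_db, r1, r2):
    """First-order closed-loop error 1 / (1 + A_OL * beta), beta = R_1 / R_2."""
    beta = r1 / r2
    return 1.0 / (1.0 + db_to_linear(a_ol_db) * beta)


R2 = 100e3  # ohms, as used in the text
for a_ol_db in (60, 68, 80):
    for r1 in (100.0, 1e3):
        err = relative_gain_error(a_ol_db, r1, R2)
        print(f"A_OL = {a_ol_db} dB, R_1 = {r1:6.0f} ohm -> gain error {err:.1%}")
```

For example, 60 dB corresponds to a linear gain of 1000, so with R_1 = 100 Ω the loop gain is only about 1 and the estimated error is about 50%, whereas 80 dB with R_1 = 1 kΩ gives a loop gain of 100 and an error of about 1%.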

V. SOFTWARE CIRCUIT CO-OPTIMIZATION
An optimal NN is not necessarily the one that gives the highest accuracy. Particularly, in IoT applications, the NN size, power consumption, and circuit areas need to be considered, to minimize the power consumption and cost. This cannot be studied by software ML alone.
As an example, we consider the application of a 1-hidden-layer NN to the UCI and MNIST [37] handwritten digit datasets. MNIST is a larger database of handwritten digits (70000 images), and the neural networks are trained and tested with 60000 and 10000 images, respectively. Moreover, each MNIST image has 28 × 28 pixels with pixel values between 0 and 255.
We study the change of inference accuracy as a function of the number of nodes in the hidden layer. The node number changes from 100 to 13. For the 100-node case, the MNIST circuit uses about 160,000 ReRAM. With 30 CPU cores, the inference simulations are completed within about 70 hours. Fig. 14 shows that by using software ML, one cannot predict the trend and actual performance precisely when it is applied to the neuromorphic circuit. For example, as the number of nodes decreases in the UCI case, the software ML predicts that the accuracy will drop rapidly when the number of nodes is reduced from 100 to 50. However, the actual hardware accuracy does not change. This results in a larger design space than that predicted by the software.
On the other hand, the MNIST study shows that even with 13 nodes, the software ML can still achieve >90% accuracy (equivalent to almost a 10X reduction in the array size and the number of current comparators). However, the accuracy is not acceptable (only 75%) when it is implemented in the neuromorphic circuit. Therefore, it is very important to co-optimize the software architecture and the neuromorphic circuit.
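The size of the simulated circuit can be estimated from the layer sizes alone. The sketch below assumes two ReRAMs per weight and one bias row per array, which reproduces the figure of about 160,000 devices quoted for the 100-node MNIST network; the function name and the counting convention are assumptions and may differ in detail from the circuit actually generated.

```python
def reram_count(layer_sizes, devices_per_weight=2, bias=True):
    """Count devices for a fully connected network given all layer sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        rows = n_in + (1 if bias else 0)  # one extra row for the bias
        total += rows * n_out * devices_per_weight
    return total


mnist_100 = reram_count([28 * 28, 100, 10])
mnist_13 = reram_count([28 * 28, 13, 10])
print(f"100 hidden nodes: {mnist_100} ReRAMs")
print(f" 13 hidden nodes: {mnist_13} ReRAMs")
```

Because the input layer has 784 nodes, the first array dominates the count, so the device count scales almost linearly with the number of hidden nodes.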
The UCI inference accuracy increases again when the NN has only 13 hidden nodes. This illustrates the black-box nature of NNs, whose behavior is difficult to predict and often gives unexpected results. Therefore, it is very important to perform a full SPICE circuit simulation to find the optimal setup.

VI. CONCLUSION
A Software-Circuit-Device co-optimization framework for neural network inference is developed and presented. This framework allows users to perform software machine learning and co-optimize with the corresponding neuromorphic circuit through SPICE simulations. It takes Verilog-A or SPICE models from the PDK without the need for additional calibration to behavior models.
It is shown that this framework is particularly useful for IoT edge inference devices, which have stronger requirements on power and cost and for which fully analog circuits are desired. Fully analog neuromorphic circuits using ReRAM have been simulated as an example. In addition, the framework can handle MNIST data and perform inference accuracy simulation in a reasonable time even with limited computation resources. It is demonstrated that the co-design of the software NN architecture, DAC, OpAmp, and its feedback network is important for optimizing the neuromorphic circuit in a 45nm technology.

VII. LIMITATIONS AND FUTURE WORK
This framework requires accurate SPICE or Verilog-A models of the emerging memories. If they are not available, a Verilog-A model needs to be developed based on the behavioral model, if one exists. Note that the framework supports stochastic simulation. For example, the initial gap size can be randomized as a parameter to the subcircuit of each ReRAM. In this demonstration, the ReRAM weights are not quantized. However, this can be done easily by digitizing the gap size for each ReRAM before calling the subcircuit.
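Both extensions mentioned above amount to transforming the gap-size parameter before it is written into the netlist. The sketch below shows one possible form, assuming evenly spaced quantization levels and Gaussian variation; the gap range, number of levels, and spread are hypothetical values, and the function names are not part of the framework.

```python
import random


def quantize_gap(gap, gap_min, gap_max, levels):
    """Snap a gap size to the nearest of `levels` evenly spaced values."""
    if levels < 2:
        raise ValueError("need at least two levels")
    step = (gap_max - gap_min) / (levels - 1)
    clipped = min(max(gap, gap_min), gap_max)
    index = int((clipped - gap_min) / step + 0.5)
    return gap_min + index * step


def randomized_gap(gap, sigma, rng):
    """Add Gaussian device-to-device variation (illustrative noise model)."""
    return gap + rng.gauss(0.0, sigma)


rng = random.Random(0)  # fixed seed so a run can be repeated
target_gaps_nm = [0.30, 0.90, 1.45]
for gap in target_gaps_nm:
    q = quantize_gap(gap, gap_min=0.0, gap_max=1.5, levels=4)
    noisy = randomized_gap(q, sigma=0.02, rng=rng)
    print(f"target {gap:.2f} nm -> quantized {q:.2f} nm -> with variation {noisy:.3f} nm")
```

Repeating the inference simulation over several seeds would then give a distribution of accuracies rather than a single value.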
In the current framework, modular crossbar tiles and ReRAM selectors are not used. Moreover, only fully connected neural networks with ReLU activation functions have been tested. However, since the activation function is implemented as a subcircuit, other activation functions can be quickly added to the framework as long as the model (Verilog-A or SPICE) is available. The current version of the framework also does not support convolutional layers. There are methods to implement the convolution operation as a vector-matrix product that is compatible with the ReRAM array, such as in [38], and the framework can be modified accordingly.
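A common way to express a convolution as a vector-matrix product is to unroll each image patch into a vector (often called im2col) and flatten the kernel into a column of weights. The sketch below shows this generic technique for stride 1 and no padding; it is not necessarily the method of [38], and the function names are illustrative.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (stride 1, no padding) into a row."""
    h, w = len(image), len(image[0])
    return [
        [image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
        for r in range(h - k + 1)
        for c in range(w - k + 1)
    ]


def conv_as_vmm(image, kernel):
    """Apply the kernel as a dot product between each patch and the flat kernel."""
    k = len(kernel)
    flat_kernel = [x for row in kernel for x in row]
    return [
        sum(p * w for p, w in zip(patch, flat_kernel))
        for patch in im2col(image, k)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
output = conv_as_vmm(image, kernel)
print(output)
```

In a crossbar, each flattened kernel would occupy one differential column pair and each unrolled patch would be applied as one input vector, so the existing fully connected circuit blocks could in principle be reused.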
In this framework, we assumed the circuit is fully analog but noise is not considered. The impact of noise due to having fully analog communication between layers should be studied to ensure the circuit works in non-ideal environments.

DATA AVAILABILITY
The framework and data that support the findings of this study are available from the corresponding author upon reasonable request. The coding of the critical parts of the framework is explained and can be found in [39], [40].