Evaluation Platform of Time-Domain Computing-in-Memory Circuits

Computing-in-memory (CIM) architecture is considered an effective way to improve the energy efficiency of deep neural networks (DNNs). Compared with conventional all-digital implementations, time-domain CIM designs have shown great potential, offering better energy efficiency and lower area cost. However, due to device non-idealities and PVT variations, time-domain CIM may suffer computational errors that reduce network accuracy. In this brief, we propose an evaluation platform based on typical time-domain CIM circuits to guide the design process. Non-idealities in CIM calculation and the impact of key parameters on network accuracy are analyzed. Based on this evaluation platform, a 28nm time-domain CIM test chip with configurable computing channels was fabricated. With 2–128 computing channels, the measurement results show the same trend as the proposed evaluation platform. The chip achieves 66.8 TOPS/W energy efficiency and 80.68% inference accuracy on the CIFAR-10 dataset.


I. INTRODUCTION
Deep neural networks (DNNs) have emerged as a fundamental technology for machine learning (ML). High performance and extreme energy efficiency are critical for the deployment of DNNs, especially in Internet of Things devices. However, with the rapid development of neural networks, the large number of weights and input feature maps consumes considerable storage and memory bandwidth. Because the large number of synaptic weights incurs intensive computation and memory accesses, efficiently processing large-scale neural networks with limited resources remains a challenging problem, known as the "Memory Wall".
Two mainstream approaches address the "Memory Wall" problem. One is to simplify the network to reduce the number of parameters [18], [19]. The other is to reduce memory access and power consumption through an in-memory computing architecture [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [17]. Firstly, reducing computation complexity, as in the binary neural network (BNN) or binary weight network (BWN), is a useful way to decrease energy consumption. BWNs have demonstrated high energy efficiency and recognition accuracy close to that of full-precision networks on many datasets. BNNs save more data movement energy but incur a larger accuracy loss. Secondly, a conventional digital architecture, composed of separate memories and computation units, incurs large latency and energy consumption due to data movement and limited bandwidth [15], [16]. To reduce power, the weight or feature map memory and the computation are integrated into a Computing-in-Memory (CIM) architecture.
Most CIM designs use charge manipulation schemes to realize efficient computation in neural network accelerators [1], [2], [3], [4], [5]. However, voltage-domain CIM designs need sufficient voltage headroom and power- and area-intensive ADCs/DACs, which limits voltage scalability and energy efficiency. Current-domain computation is common with non-volatile memory devices [6], [7], [8], but the large computing current limits the scale and utilization of CIM arrays, so it cannot be applied well to large-scale neural networks. In time-domain signal processing, multi-bit digital data is represented as a pulse width or a circuit path latency. Compared with voltage-domain and current-domain processing, time-domain processing has smaller dynamic capacitance, lower toggle activity and better scalability [11], [12], [13], [14]. It is better suited to higher-precision networks and is a promising computational method for CIM implementation.
However, due to the non-idealities in CIM design, CIM may produce calculation errors compared with traditional digital logic operations, which eventually reduces network accuracy. In this brief, we propose an evaluation platform to guide and accelerate the design process. We further verify the evaluation platform with a 28nm time-domain CIM test chip.
The remainder of this brief is organized as follows: Section II introduces the basic operation and non-idealities of time-domain CIM. Section III presents the proposed evaluation platform for the time-domain CIM architecture. Section IV analyzes the impact of CIM calculation errors on network accuracy. Section V verifies the platform with a fabricated 28nm time-domain test chip. Finally, Section VI concludes this brief.
The basic operation is the multiply-and-accumulate (MAC) operation between input feature maps and kernels.

A. Time-Domain CIM Basic Operations
As shown in Fig. 1, the input feature map size is fs × fs × c and the kernel size is k × k × c × n, where c is the number of channels of the input image, k is the size of the kernel (such as 3 or 5), and n is the number of kernels. In each convolutional layer, each kernel performs MAC operations at different positions of the input feature map according to the stride, and the Os × Os × n output image is passed to the next layer, where Os = (fs − k)/stride + 1. In time-domain CIM design, the most common method is the delay chain architecture, which modulates the edge of the input pulse and accumulates the result through delay units connected in series. This evaluation platform mainly targets time-domain CIM designs that modulate the delay of each delay unit by controlling the pull-down current through different voltages. Fig. 1 also shows a general architecture of time-domain CIM, which includes an input driver, delay units, a pulse generator, a time-to-digital converter (TDC), etc. Different kernels are stored in different delay units by row, and each delay unit stores one weight. The input feature data is converted into an analog quantity (usually a voltage) by the input driver and transmitted to each delay unit by column. During the calculation, the pulse generator sends fixed-width pulses into each calculation chain, and each delay unit modulates the passing pulse according to the input feature and the stored weight. A positive MAC makes the pulse wider, and a negative MAC makes the pulse narrower. The number of delay units is also flexible, which is an important factor considered later in this brief. After the calculation pulse passes through all of the delay units, it is quantized into a digital output by a j-bit TDC as the partial MAC result, where j is the resolution of the TDC, which determines the accuracy of the output result and the quantization power consumption.
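The layer dimensions above fix how many MAC operations the CIM macro must perform. The following minimal sketch assumes an unpadded convolution, so that Os = floor((fs − k)/stride) + 1; the function names `conv_output_size` and `mac_count` are illustrative and are not part of the platform described in this brief.

```python
def conv_output_size(fs: int, k: int, stride: int = 1) -> int:
    """Spatial output size Os of an unpadded convolution (assumption: no padding)."""
    if k > fs:
        raise ValueError("kernel larger than input feature map")
    return (fs - k) // stride + 1


def mac_count(fs: int, k: int, c: int, n: int, stride: int = 1) -> int:
    """Total MAC operations of one layer: Os*Os output positions, n kernels,
    and k*k*c multiply-accumulates per output value."""
    os_ = conv_output_size(fs, k, stride)
    return os_ * os_ * n * (k * k * c)


if __name__ == "__main__":
    # Example: a 32x32x3 input with 64 kernels of size 3x3 and stride 1.
    print(conv_output_size(32, 3, 1), mac_count(32, 3, 3, 64))
```

Each output value requires k × k × c multiply-accumulates, which is the length of the accumulation that must be split across one or more delay chains.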
In this brief, we adopt a typical time-domain CIM circuit [12] as the research object. This design mainly targets BWNs, in which the weights are either +1 or −1. The computation is performed by controlling the discharge slope of the inverters, and the slope only varies when the input is a rising edge. The discharge slope is mainly controlled by the gate voltage (Vgs) of the pull-down NMOS, which is set according to the input feature data to obtain a linear modulation effect. The binary weight controls the direction (positive or negative) of the pulse modulation. In simple terms, the effect of pulse modulation mainly depends on the discharge speed of the pull-down

B. Non-Idealities of Time-Domain CIM Architecture
The precision of the features and weights determines the accuracy of the convolutional neural network, but in a CIM implementation, analog non-idealities reduce the accuracy of the network. There are many kinds of analog non-idealities [17], such as IR drop, RC delay, local mismatch, intrinsic noise, etc. For the time-domain CIM architecture, this brief mainly considers the following key parameters: 1) The voltage fluctuation of the input driver. The voltage (Vgs) generated by the input driver is sensitive to PVT variations, which may cause a DAC error that directly affects the linearity of the pulse modulation. This platform uses a normal distribution to fit the input voltage fluctuation. Furthermore, the input Vgs is connected to the gate of the delay unit, so no DC current flows. However, leakage current still has an effect, especially in advanced technologies, resulting in a voltage drop of Vgs. This IR drop makes Vgs differ between the delay units in the same column, and its magnitude mainly depends on the column length. 2) Cell variation, which makes the pull-down capability of each delay unit fluctuate, so that the pull-down current I0 differs between delay units even when they have the same feature and weight. According to the expression for the pull-down current I0, this variation can be modeled mainly with a normal distribution of the threshold voltage, which is determined by the standard deviation of the threshold voltage and the W/L of the MOSFET [20]. 3) Quantization error. The TDC converts the pulse width into a digital result, and the determining factor is its j-bit resolution. A larger j means better output precision but higher quantization energy consumption. Balancing energy consumption against computing error is the key point in chip implementation.
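As an illustration of the third item, the quantization step can be sketched as a uniform quantizer. This is a hypothetical model, not the circuit in [12]: the function name `tdc_quantize`, the symmetric signed range `full_scale` and the two's-complement style code range are all assumptions made for the example.

```python
import numpy as np


def tdc_quantize(pulse_delta, full_scale: float, j_bits: int):
    """Uniformly quantize a signed pulse-width change to a j-bit code.

    pulse_delta : change of pulse width relative to the unmodulated pulse
    full_scale  : assumed largest |pulse_delta| the TDC must cover
    Returns (integer codes, LSB size in the unit of pulse_delta).
    """
    levels = 2 ** j_bits
    lsb = 2.0 * full_scale / levels            # one code step
    lo, hi = -(levels // 2), levels // 2 - 1   # signed code range
    codes = np.clip(np.round(np.asarray(pulse_delta) / lsb), lo, hi)
    return codes.astype(int), lsb


if __name__ == "__main__":
    codes, lsb = tdc_quantize([0.0, 0.26, 5.0, -5.0], full_scale=1.0, j_bits=3)
    print(codes, lsb)
```

Within the covered range the rounding error is at most half an LSB, and each additional bit halves the LSB, which is the precision side of the trade-off against quantization energy.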

return MAC_OUT, Energy_Consumption
parameters have little effect on the accuracy and are treated as constants in this model.

III. EVALUATION PLATFORM OF TIME-DOMAIN CIM ARCHITECTURE
Based on the analysis of the basic operations and analog non-idealities of time-domain CIM, the evaluation platform is built on the following key parameters: 1) Gate voltage fluctuation, which is generated by the input driver and modeled with a normal distribution (Vgs_sigma); 2) Cell variation, which is defined as the standard deviation of the threshold voltage (Vth_sigma); 3) The number of computing channels. A larger number of channels improves area efficiency but makes the design of the computing circuits more challenging; 4) The number of calculation rows, which determines the Vgs route length. More rows make the IR drop of Vgs worse; and 5) The quantization resolution of the TDC.
The evaluation platform uses Algorithm 1 to simulate the time-domain CIM MAC operations in the convolutional layers. The classic VGG11 network is employed as the reference model, which contains 8 convolutional layers and 3 fully-connected layers. The kernel size in the convolutional layers is 3×3. Each 8-bit input feature is converted into an ideal input voltage by the input driver, and then a random value (Vgs_real) is drawn from a normal distribution with the given standard deviation (Vgs_sigma). Vgs_real is shared by K_num kernels. Lines 6-9 describe the operation of a delay unit. The real threshold voltage of each delay unit is generated according to the standard deviation of the threshold voltage (Vth_sigma). The pull-down current and the pulse modulation time are then calculated, where μnCox is a constant and W/L is set by the circuit design. The modulation time is multiplied by the input weight to give the result of the delay unit, and the results are accumulated. After all delay units in a CIM computing chain have been evaluated, the quantized partial MAC result is obtained according to the resolution of the TDC. Finally, after all channels are calculated, the partial results are accumulated into the final MAC output. The evaluation platform passes the CIM model MAC results of the current convolutional layer to the next layer for post-processing, then sends the input data of the next convolutional layer to the CIM model, and repeats until the inference accuracy on the dataset is obtained.
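The flow described above can be sketched in Python as follows. This is not the authors' Algorithm 1 but a simplified reading of the text under several assumptions: a square-law pull-down current (so the μnCox·W/L factor cancels when real and ideal currents are compared), a modulation time that is linear in the feature in the ideal case and scales inversely with the real current, and a uniform signed TDC. The constants `VTH_NOM`, `VGS_MIN` and `VGS_MAX` and the function name `cim_mac` are hypothetical, and the energy output of Algorithm 1 is omitted.

```python
import numpy as np

VTH_NOM = 0.35                  # hypothetical nominal threshold voltage (V)
VGS_MIN, VGS_MAX = 0.55, 0.90   # hypothetical input-driver output range (V)


def cim_mac(features, weights, channels, vgs_sigma, vth_sigma, tdc_bits, rng):
    """Simplified time-domain CIM MAC model (a sketch, not Algorithm 1 itself).

    features : (n_in,) 8-bit inputs; weights : (k_num, n_in) with values +1/-1
    channels : delay units per calculation chain (per TDC conversion)
    Returns the (k_num,) MAC estimate, expressed in feature units.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    k_num, n_in = weights.shape

    # Input driver: ideal Vgs per feature plus one fluctuation shared by all kernels.
    vgs_ideal = VGS_MIN + features / 255.0 * (VGS_MAX - VGS_MIN)
    vgs_real = vgs_ideal + rng.normal(0.0, vgs_sigma, size=n_in)

    # Delay units: every cell has its own threshold voltage.
    vth_real = VTH_NOM + rng.normal(0.0, vth_sigma, size=(k_num, n_in))

    # Square-law current ratio I0_real / I0_ideal (assumed device model).
    ov_ideal = vgs_ideal - VTH_NOM
    ov_real = np.maximum(vgs_real - vth_real, 1e-3)   # keep the device on
    i_ratio = (ov_real / ov_ideal) ** 2

    # Discharge time scales as 1/I0; the ideal time is taken equal to the feature.
    contrib = weights * (features / i_ratio)

    # TDC: one signed conversion per chain of `channels` delay units.
    lsb = 2.0 * channels * 255.0 / 2 ** tdc_bits
    lo, hi = -(2 ** (tdc_bits - 1)), 2 ** (tdc_bits - 1) - 1
    mac_out = np.zeros(k_num)
    for start in range(0, n_in, channels):
        acc = contrib[:, start:start + channels].sum(axis=1)
        mac_out += np.clip(np.round(acc / lsb), lo, hi) * lsb
    return mac_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(0, 256, size=64)
    w = rng.choice([-1, 1], size=(4, 64))
    print("ideal :", w @ x)
    print("model :", cim_mac(x, w, 16, 0.010, 0.010, 10, rng))
```

With both standard deviations set to zero, the model output differs from the exact MAC only by TDC rounding, which makes it easy to separate the quantization contribution from the variation contribution.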

IV. ANALYSIS OF THE IMPACT OF CIM ERRORS ON FINAL NETWORK INFERENCE ACCURACY
To evaluate the impact on network accuracy, the evaluation platform applies the VGG11 BWN network and obtains a baseline top-1 accuracy of 88.71% on the CIFAR-10 dataset. The time-domain CIM evaluation platform with its error parameters is then inserted into the computation of the convolutional layers. Finally, the influence of the injected error model on the network accuracy is analyzed.
With the number of CIM computing channels set to 64, the number of rows set to 64 and the TDC resolution set to 10 bits, the influence of Vth_sigma and Vgs_sigma on the network accuracy was analyzed. Fig. 2(a) shows the effect of Vgs_sigma on the accuracy with Vth_sigma fixed. Fig. 2(b) shows the relationship between Vth_sigma and the accuracy with Vgs_sigma fixed. Comparing the two figures shows that when Vth_sigma is fixed, changing Vgs_sigma has little impact on accuracy. For example, when Vth_sigma = 10mV and Vgs_sigma changes from 0 to 30mV, the accuracy drops by only 1.36%. When Vgs_sigma is fixed, changing Vth_sigma has a greater impact on accuracy. For example, when Vgs_sigma = 10mV and Vth_sigma changes from 0 to 30mV, the accuracy decreases by 20.72%.
To analyze the effects of Vth_sigma and Vgs_sigma on the accuracy in more detail, both standard deviations were swept from 0 to 30mV in 1mV steps with Channel = 64, Rows = 64 and a TDC resolution of 10 bits. As shown in Fig. 3, the accuracy decreases as the standard deviation becomes larger. The three curves represent three degrees of accuracy degradation. When Vth_sigma <= 9mV and Vgs_sigma <= 22mV, the accuracy reduction is generally less than 2%; when Vth_sigma <= 16mV, the accuracy reduction is less than 5%; when Vth_sigma <= 22mV, the accuracy reduction is less than 10%.
According to the proposed evaluation platform, the level of Vth_sigma is more significant than that of Vgs_sigma in time-domain CIM circuits. This is because most computing cells see the same voltage variation of the input DAC driver, and the TDC circuits can track the voltage variation of the input DAC driver in this design. The number of computing channels in a CIM calculation is also a key parameter in CIM design. More computing channels save computing energy and area cost, but they also generate larger cumulative errors and quantization errors. The number of CIM computing channels is therefore included in the CIM evaluation platform. The effect of the number of computing channels on the accuracy was simulated with Vgs_sigma = 10mV. As shown in Fig. 4, when the TDC resolution is fixed at 10 bits, a larger number of computing channels causes a larger decrease in accuracy. When Vth_sigma = 15mV, the accuracy with Channel = 128 is 6.64% lower than that with Channel = 2.
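The quantization part of this trend can be reproduced with a small Monte Carlo sketch. It is a simplified stand-in for the platform, not its actual model: each delay unit is given an independent relative error `unit_sigma`, the accumulation length `n_in` = 1152 (3 × 3 × 128) is an assumed example, and the TDC is an unclipped uniform quantizer whose full scale grows with the chain length. All names are illustrative.

```python
import numpy as np


def total_mac_error_std(channels, n_in=1152, unit_sigma=0.01,
                        tdc_bits=10, trials=500, seed=0):
    """Std of the error of a full n_in-long MAC when it is split into chains
    of `channels` delay units, each chain read out by a fixed-resolution TDC."""
    if n_in % channels:
        raise ValueError("n_in must be a multiple of channels")
    rng = np.random.default_rng(seed)
    # Ideal signed contributions (feature times +/-1 weight) of every delay unit.
    ideal = rng.integers(-255, 256, size=(trials, n_in)).astype(float)
    # Independent relative error per delay unit (stand-in for cell variation).
    noisy = ideal * (1.0 + rng.normal(0.0, unit_sigma, size=ideal.shape))
    # With a fixed number of bits, the LSB grows with the chain length.
    lsb = 2.0 * channels * 255.0 / 2 ** tdc_bits
    partial = noisy.reshape(trials, n_in // channels, channels).sum(axis=2)
    quantized = np.round(partial / lsb) * lsb
    return float(np.std(quantized.sum(axis=1) - ideal.sum(axis=1)))


if __name__ == "__main__":
    for ch in (2, 8, 32, 64, 128):
        print(ch, round(total_mac_error_std(ch), 2))
```

In this simplified model the summed per-unit error does not depend on how the inputs are split into chains, so the growth of the error with channel count comes from the coarser TDC step at fixed resolution. Accumulation effects of the real delay chain are not modeled here.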

V. VERIFICATION OF THE EVALUATION PLATFORM BY CHIP MEASUREMENT
To further verify the proposed evaluation platform, a test chip (TD CIM) with configurable computing channels was designed based on Sandwich-RAM [12] and fabricated in 28nm CMOS technology. Fig. 5 shows the die photo and the test platform scheme. We first obtain the quantized VGG11 BWN data with PyTorch and then transmit the data to the test chip through the controller. After the TD CIM macro finishes the MAC operations, the test chip returns the convolution results to the PC. Finally, the inference accuracy is obtained.
According to Monte Carlo post-layout simulation with 32k samples, the Vth_sigma of the test chip is 14mV at TT, 0.9V, 25 °C, and the Vgs_sigma of the input DAC circuit is around 16mV at TT, 0.9V, 25 °C. The resolution of the TDC is fixed at 10 bits and the TD CIM macro has 128 computing rows. The TD CIM macro can be configured into several calculation modes with different numbers of computing channels by selecting the number of calculation chains. Table I presents the modeling accuracy, inference accuracy and energy efficiency of the test chip with different numbers of CIM computing channels. The accuracy of the test chip follows the same trend as the proposed evaluation platform: accuracy decreases as the number of channels increases. The energy efficiency, however, shows the opposite trend. When the number of CIM computing channels increases from 2 to 128, the top-1 accuracy drops from 87.18% to 76.69%, while the energy efficiency increases from 4.48 TOPS/W to 85.5 TOPS/W. As shown in Fig. 6, the balance between accuracy and energy efficiency appears with Channel between 32 and 64, which achieves 80.68% accuracy and 66.8 TOPS/W energy efficiency.

VI. CONCLUSION
In this brief, we established an evaluation platform based on typical time-domain CIM circuits and conducted a detailed analysis of the errors that may be generated in time-domain CIM. Applying the time-domain CIM evaluation platform to the modified VGG11 network on the CIFAR-10 dataset shows that the variation of Vth affects network accuracy more than the input driver fluctuation does. Based on this evaluation platform, a 28nm time-domain CIM test chip with configurable computing channels was fabricated. The measurement results show the same trend as the proposed evaluation platform under different numbers of computing channels. The chip achieves 66.8 TOPS/W energy efficiency and 80.68% inference accuracy on the CIFAR-10 dataset.