An Energy Efficient Time-Multiplexing Computing-in-Memory Architecture for Edge Intelligence

The growing data volume and complexity of deep neural networks (DNNs) require new architectures to surpass the limitations of the von-Neumann bottleneck, with computing-in-memory (CIM) as a promising direction for implementing energy-efficient neural networks. However, CIM's peripheral sensing circuits are usually power- and area-hungry components. We propose a time-multiplexing CIM architecture (TM-CIM) based on memristive analog computing that shares the peripheral circuits and processes one column at a time. The memristor array is arranged in a column-wise manner that avoids wasting power/energy on unselected columns. In addition, the digital-to-analog converters (DACs), which turn out to be an even greater power and energy overhead than the analog-to-digital converters (ADCs), can be fine-tuned in TM-CIM for significant improvement. For a 256*256 crossbar array with a typical setting, TM-CIM saves <inline-formula> <tex-math notation="LaTeX">$18.4\times $ </tex-math></inline-formula> in energy with 0.136 pJ/MAC efficiency, and <inline-formula> <tex-math notation="LaTeX">$19.9\times $ </tex-math></inline-formula> in area for the 1T1R case and <inline-formula> <tex-math notation="LaTeX">$15.9\times $ </tex-math></inline-formula> for the 2T2R case. Performance estimation on VGG-16 indicates that TM-CIM can save over <inline-formula> <tex-math notation="LaTeX">$16\times $ </tex-math></inline-formula> in area. A tradeoff among chip area, peak power, and latency is also presented, along with a scheme that further reduces the latency on VGG-16 without significantly increasing chip area or peak power.


I. INTRODUCTION
Deep neural networks (DNNs) have been widely deployed in various fields with unprecedented success, such as autopilot, aerospace, wearables, security, and so on [1]. With the ever-increasing complexity of DNNs, modern computing systems have to cope with massive numbers of parameters and operations. Due to the physical separation between processing units and memory units, conventional von-Neumann architectures suffer from limited on-chip memory size and memory bandwidth, resulting in the ''von-Neumann bottleneck'' [2]. Moreover, conventional processors for DNNs such as GPUs require ultrahigh power consumption, which is unsuitable for applications such as edge intelligence.
Computing-in-memory (CIM) is considered a promising candidate to surpass the ''von-Neumann bottleneck'' with much lower power consumption and much higher energy efficiency. CIM performs in situ computing within the memory, significantly reducing data movement and thus facilitating high energy efficiency. Emerging non-volatile memristors such as phase change memory (PCM), spin-transfer torque magnetic RAM (STT-MRAM), and resistive random access memory (RRAM) [3], [4], [5] have been widely explored as fundamental building blocks of CIM schemes. Fig. 1 shows the diagram of conventional CIM schemes, in which the input circuit for each row is usually composed of a digital-to-analog converter (DAC) with an operational amplifier (OP-AMP)-based voltage-follower output stage [6]. The memristors are usually arranged as a crossbar array. The conductance of each memristor behaves like a synaptic weight, and according to Kirchhoff's law, the combined bitline current of each column corresponds linearly to the weighted sum of the respective neuron. This arrangement corresponds to the weight matrix of a neural-network layer and implements what is referred to as a crossbar in neural-network hardware. In conventional CIM schemes, each input circuit is required to drive multiple devices (e.g., 256), which makes the DAC area- and power-hungry [6]. Some research works use digital input signals to avoid such overhead [7], [8], [9]. However, this requires multiple cycles to compute a high-precision activation, which increases the total energy consumption. Moreover, the shift-and-add operation over different input bits accumulates quantization errors if each cycle requires an ADC conversion, reducing the robustness of the architecture. Finally, in the above two architectures, each column has its dedicated peripheral circuits, including a CIM neuron and an ADC.
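The Kirchhoff-based column readout just described amounts to a vector-matrix product between the row voltages and the conductance matrix. A minimal NumPy sketch with illustrative voltages and conductances (not values from the article):

```python
import numpy as np

# input activation voltages on the rows (V) -- illustrative values
v_in = np.array([0.2, 0.0, 0.1])

# memristor conductances (S): rows = inputs, columns = neurons
g = np.array([[10e-6,  5e-6],
              [20e-6,  0.0 ],
              [30e-6, 15e-6]])

# Kirchhoff's current law: each bitline current is the weighted sum of
# the row voltages with the column's conductances
i_out = v_in @ g
```

Each entry of `i_out` is one column's bitline current, i.e., one neuron's weighted sum before sensing.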
However, both the CIM neuron and the ADC are power-hungry components that occupy significant chip area. In [8], the trans-impedance amplifier (TIA) neuron consumes more than 95% of the power in the scheme. In [9], the ADC consumes more than 92% of the energy and 75% of the area in the system. Since edge intelligence prefers compact and energy-efficient DNN chips over those with high computing throughput, one way to reduce the area overhead of peripheral circuits is to share them via time-multiplexing (TM). However, for a conventional memristor array, forcing simple TM would lead to significant leakage currents on unselected columns, which will be discussed in detail in Section II.
In light of the above limitations, we propose a novel TM-CIM architecture that saves area without incurring additional power consumption. In addition, since TM-CIM computes one column at a time, the input circuit only needs to drive one device at a time. Therefore, the area and power consumption of the input circuits can be significantly reduced. The major contributions of this article are summarized as follows.
1) The TM architecture shares the TIA and ADC among multiple columns to reduce the area overhead of the peripheral circuits. In a typical setting, the area can be reduced by 19.9× for a 256*256 1T1R array and 15.9× for a 256*256 2T2R array, at a latency of 5140 ns versus 210 ns without TM.
2) The memristor array is arranged in column-wise form rather than row-wise form. The cells on the same column are controlled by a column-wise signal. In this way, unselected columns are completely turned off, thus avoiding leakage currents.
3) TM-CIM is flexible and efficient at implementing complex DNNs. Compared with conventional architectures with analog input, the area saving is 16× and the energy saving is 30× on VGG-16 [10]. Compared with the conventional architecture with digital input, the area and energy savings are 14.2× and 5.9×, respectively.
4) A tradeoff analysis among chip area, peak power, and system latency gives the best TM strategy for different DNNs. Under a similar setting as in the previous points, TM-CIM can implement VGG-16 with an area of 118.09 mm², a peak power of 0.797 W, and an energy consumption of 1.968 mJ/image at a latency of 16.056 ms.
The rest of the article is organized as follows. Section II introduces the background and related works. Section III discusses the detailed design of the proposed TM-CIM architecture. Section IV provides performance evaluation of the proposed architecture. Finally, the conclusion is drawn in Section V.

II. BACKGROUND AND RELATED WORKS
In a CIM crossbar array, each memristor is typically connected with a select transistor to form a 1T1R cell. As shown in Fig. 2(a), each word line controls the gates of the 1T1R cells on a row. Generally, the bitline of each 1T1R cell on a row is fed by the same corresponding voltage ($Act_i$), which corresponds to the input activation $x_i$ in a DNN, and all columns are computed and sensed in parallel. Based on Kirchhoff's law, each column's current is therefore the weighted sum $\sum_i x_i w_{ij}$ of the corresponding neuron. The conventional 1T1R array is widely used in recent research [7], [9], [11], with some researchers using the source line to represent the activations and the bitline current to represent the weighted sum [12], [13].
As shown in Fig. 2(b), 2T2R cells have been proposed to represent signed weights [14], [15]. In a 2T2R cell, the first memristor represents the positive portion of a weight and the second represents the negative portion. If the weight is positive, the second memristor is generally programmed to the highest resistance state (HRS) possible so that it represents a weight of zero; the converse arrangement is made if the weight is negative. If the input activation $x_i$ is positive, it is represented by the voltage $Act_{i,p}$, and $Act_{i,n}$ will be zero with respect to the ground seen at $SL_i$ (which could be either true ground or virtual ground depending on the implementation). Conversely, if $x_i$ is negative, the voltage $Act_{i,n}$ will be negative and $Act_{i,p}$ will be zero (again w.r.t. the ground seen at $SL_i$). With this arrangement, the difference of the pair of memristors' currents represents $x_i w_{ij}$. These architectures all have per-column peripheral circuits, including CIM neurons and ADCs, which leads to high power consumption and area overhead. One promising way to reduce the power consumption and area overhead of the peripheral circuits is to share them via TM. Yao et al. [9] share each ADC with four columns to reduce the area overhead. However, their design still requires per-column sample-and-hold (S&H) circuits, which also implies per-column TIAs, and the ADCs still consume most of the energy and area. Yoon et al. [12] use 32-to-1 multiplexers to share a 4-bit ADC among 32 columns. Nevertheless, the ADC precision is so low that additional shift-and-add circuits are required to achieve high precision. However, the effective number of bits (ENOB) of the weighted sum is limited by the ADC resolution, rendering the shift-and-add ineffective in practice. In addition, the binary input mode requires multiple cycles to implement n-bit activations.
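The positive/negative weight split described above can be sketched as a simple mapping from a signed weight to a $(G_p, G_n)$ pair; `G_MAX` and `G_HRS` are assumed device bounds for illustration, not figures from the cited works:

```python
G_MAX, G_HRS = 50e-6, 1e-6  # assumed max conductance and HRS conductance (S)

def to_2t2r(w):
    """Map a normalized signed weight w in [-1, 1] to a (G_p, G_n) pair."""
    span = G_MAX - G_HRS
    if w >= 0:
        return G_HRS + w * span, G_HRS   # negative half parked at HRS
    return G_HRS, G_HRS + (-w) * span    # positive half parked at HRS

gp, gn = to_2t2r(0.5)
# the differential conductance (gp - gn) recovers the signed weight
```

The HRS term cancels in the difference $G_p - G_n$, which is what makes the differential readout represent the signed weight.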
In this article, a novel TM architecture is proposed with analog input and high precision ADC output, which will be discussed in Section III.

III. TM-CIM ARCHITECTURE
TM-CIM is designed to reduce the peripheral-circuit area overhead of CIM architectures while avoiding additional power/energy overhead. Fig. 3 shows the top view of the proposed TM-CIM. For simplicity of illustration, the column-wise array is composed of 1T1R cells, which can be replaced by 2T2R cells to represent signed weights. The neuron is shared among multiple columns to reduce the area overhead. The details of each block are introduced in the rest of this section.

A. COLUMN-WISE ARRAY
The memristor array is the key component of a CIM architecture. In a conventional CIM memristor array (e.g., 1T1R or 2T2R), word line $i$ turns on all transistors in row $i$ during inference; if we force TM while one column is selected for processing, the unselected columns will continue to draw current and hence waste power and energy. In TM-CIM, not only is the array selected and computed on a column-by-column basis, but the array is also designed to be column-wise to avoid wasting energy on unselected columns. As shown in Fig. 4, the activation voltages ($Act_i$) are sent into the array by rows, the cells on the same column are controlled by a column-wise signal ($SEL_j$), and the weighted sum is computed by multiply-and-accumulate (MAC) operations, which can be represented as
$$I_j = \sum_i V_i G_{i,j}$$
where $V_i$ is the voltage corresponding to the $i$th input activation and $G_{i,j}$ is the conductance of the cell representing the $i$th weight of neuron $j$.
To program the cells, $SEL_j$ is set to an on-voltage to turn on the gates of the 1T1R cells of the selected column $j$. For the selected row $i$, in SET operations, the input ($Act_i$) is set to a high voltage (such as $V_{prog}$) and the source line ($SL_j$) is set to a low voltage (such as 0 V); in RESET operations, $SL_j$ is set to $V_{prog}$ and $Act_i$ is set to 0 V. $V_{prog}$ may take on different values depending on the target state (i.e., target conductance value). If multiple cells on the same column need to be programmed, each row can in turn be the selected row with its corresponding suitable $Act_i$. For the unselected rows, the input activation lines are set to floating so as not to alter the states of these unselected cells. For the unselected columns, the SLs can be set to any voltage that does not alter the states of those cells. During inference, all input activation lines are fed with voltages that correspond to the input activations of a DNN.
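The SET/RESET bias conditions above can be condensed into a small helper; `V_PROG` is a placeholder amplitude (the actual value is state-dependent, as noted in the text):

```python
V_PROG = 2.0  # placeholder programming amplitude; depends on target conductance

def bias_1t1r(op, row_selected):
    """Return the (Act_i, SL_j) voltages for a cell on the selected column j.

    SL_j is shared by the whole column; unselected rows float their Act_i
    so their cells are not disturbed.
    """
    sl = 0.0 if op == "SET" else V_PROG          # RESET reverses the polarity
    act = ("float" if not row_selected
           else (V_PROG if op == "SET" else 0.0))
    return act, sl
```

For example, a SET on the selected row drives `(V_PROG, 0.0)`, while an unselected row on the same column sees `("float", 0.0)`.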
The 2T2R cell scheme, proposed in [14], is shown in Fig. 5(a). For positive weights, the weight value is stored as $G_p$, with $G_n$ set to the conductance of the HRS. For negative weights, the weight value is stored in $G_n$, with $G_p$ set to the conductance of the HRS. Therefore, the weight can be represented as the difference of $G_p$ and $G_n$. The gates of the two transistors are connected to the select signal ($SEL_j$), and $V_p$ and $V_n$ are connected to the input activations. The output current can then be expressed as
$$I_j = \sum_i \left[(V_{Act_{i,p}} - V_{SL})\, G_{p,i,j} + (V_{Act_{i,n}} - V_{SL})\, G_{n,i,j}\right].$$
For positive activations, $V_{SL} - V_{Act_{i,n}} = 0$, and for negative activations, $V_{Act_{i,p}} - V_{SL} = 0$. Therefore, the current on the $j$th SL can be represented as
$$I_j = \sum_i V_i G_{i,j}$$
where $G_{i,j}$ is the conductance corresponding to the signed weight. Fig. 5(b) shows the proposed column-wise 2T2R array, which is used to represent signed weights. Programming cells in the proposed column-wise 2T2R array is similar to that of the proposed column-wise 1T1R array, with specialization for 2T2R. When programming a positive weight for SET, $Act_{i,p}$ should be $V_{prog}$ while $Act_{i,n}$ should be either equal to $SL_j$ (which is preferably 0 V) or floating, so as not to SET the negative portion of the weight. Conversely, when programming a negative weight for SET, $Act_{i,n}$ should be $V_{prog}$ while $Act_{i,p}$ should be either equal to $SL_j$ (which is preferably 0 V) or floating. Similarly, for RESET, the activation line of the selected polarity should be 0 V while that of the unselected polarity should be either $V_{prog}$ or floating, and the selected column's $SL_j$ should be $V_{prog}$.
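A numerical sketch of the differential read just described, with SL treated as (virtual) ground and an assumed read amplitude; the conductances are illustrative:

```python
import numpy as np

V_READ, V_SL = 0.2, 0.0  # assumed read amplitude; SL at (virtual) ground

def sl_current(x, g_p, g_n):
    """Current on SL_j for normalized activations x and per-row (G_p, G_n)."""
    v_p = np.where(x > 0, x * V_READ, V_SL)  # Act_i,p drives positive inputs
    v_n = np.where(x < 0, x * V_READ, V_SL)  # Act_i,n (negative V) drives negative inputs
    return np.sum((v_p - V_SL) * g_p + (v_n - V_SL) * g_n)

# row 0: positive weight stored in G_p; row 1: negative weight in G_n (HRS = 1 uS)
i = sl_current(np.array([1.0, -1.0]),
               g_p=np.array([10e-6, 1e-6]),
               g_n=np.array([1e-6, 20e-6]))
```

The positive input contributes through $G_p$ only and the negative input through $G_n$ only, so the summed SL current carries the signed weighted sum.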
Note that in the proposed column-wise array structure, whether 1T1R or 2T2R, if there is a verification procedure after programming (often referred to as write-verify), all cells in the selected column will be turned on. To prevent unselected rows of this column from contributing unwanted read current, their activation lines should be set to floating, guarding against the case where supposedly 0-V activations on unselected rows carry systematic offset voltages.

B. ENERGY-EFFICIENT TIME-MULTIPLEXING NEURON
The proposed TM neuron consists of a series of switches, a TIA, and an ADC, such as a high-precision successive-approximation register (SAR) type [16]. As shown in Fig. 3, the source lines are connected to the TIA via a series of switches (SW) acting as a MUX. The switch on the $j$th column ($SW_j$) is controlled by the signal $SEL_j$, which is also the select signal of the array. The TIA converts the source-line current to a voltage and sends it to the ADC. Finally, the ADC latches the TIA's output and converts it into a digital signal. Fig. 6 shows the workflow of the proposed TM neuron. In the first phase ($P_0$), $SEL_0$ is turned on, and the first (0th) column generates its result as a current signal, which is passed by $SW_0$ to the TIA for current-to-voltage conversion. At the end of $P_0$, the ADC latches the TIA's output voltage to start A/D conversion. In the second phase ($P_1$), the ADC converts the output of the 0th column to a digital signal while $SEL_1$ is turned on and the TIA converts $SL_1$'s current to a voltage. At the end of $P_1$, the ADC completes the conversion and then latches the TIA's output voltage to start the next A/D conversion, and so forth. We assume that the memristor column current and the TIA output voltage stabilize within the same phase. Therefore, if an array shares an ADC among m columns and the latency of each phase is $t_p$, the latency of this array is $(m + 1) \times t_p$.
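The two-stage pipeline in Fig. 6 can be sketched as a phase schedule: in each phase the TIA settles the current column while the ADC converts the previous one, so m columns need m + 1 phases, which is where the $(m + 1) \times t_p$ latency comes from. A minimal sketch:

```python
def tm_schedule(m):
    """Phase-by-phase activity for m columns sharing one TIA/ADC pair."""
    phases = []
    for p in range(m + 1):
        tia = f"TIA settles col {p}" if p < m else "idle"       # settle next column
        adc = f"ADC converts col {p - 1}" if p >= 1 else "idle"  # convert previous one
        phases.append((p, tia, adc))
    return phases

# m columns finish after m + 1 phases -> latency = (m + 1) * t_p
```

For example, `tm_schedule(256)` has 257 phases; at 10 ns per phase that is the 2570-ns multiplexing time used in Section IV.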

C. NETWORK IMPLEMENTATION
When implementing a DNN, such as a convolutional neural network (CNN), with TM-CIM, each column of the array typically stores the synaptic weights of one Conv neuron, and each row corresponds to an input activation of the current convolution window. As illustrated in Fig. 7(a), each neuron has N × k × k synapses, where N is the number of input feature maps, and there are M neurons for M output feature maps. For a practical CNN, the first convolution layer is usually small and fits on a single array. For the later layers, the number of inputs can be much larger than the number of rows in an array. Therefore, as shown in Fig. 7(b), multiple arrays are required to map a layer, and the partial weighted sums of each array can be summed in the digital domain, with synchronized TM across these cores/arrays, to obtain the final weighted sum with negligible extra latency.
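The mapping rule above (N·k·k rows per neuron, one neuron per column) fixes how many crossbars a Conv layer needs. A sketch, assuming 256×256 arrays as in the evaluation:

```python
import math

def crossbars_for_conv(n_in, k, n_out, rows=256, cols=256):
    """Arrays needed when each neuron occupies n_in*k*k rows and each
    column holds one neuron; partial sums are combined digitally."""
    return math.ceil(n_in * k * k / rows) * math.ceil(n_out / cols)
```

For example, the first VGG-16 layer (3 input maps, 3×3 kernels, 64 neurons) fits on a single array, while a 512-to-512 3×3 layer needs 36 arrays.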

IV. PERFORMANCE EVALUATION
In this section, the energy and area of the proposed TM-CIM are evaluated both on a 256 × 256 core and on the VGG-16 CNN. The evaluation is based on known parameters in 65-nm technology. Moreover, a tradeoff strategy among area overhead, peak power, and system latency is also illustrated.

A. CORE-LEVEL EVALUATION
Here, we assume that 4-bit inputs are sufficient for quantization-aware trained DNNs like VGG-16; hence, the core-level evaluation is based on 4-bit inputs and 256*256 crossbars of 1T1R/2T2R cells. The area, peak power, and latency of each 1T1R cell are 0.169 µm², 1 µW, and 10 ns, respectively [17]. By definition, the area of the 2T2R cell is twice that of the 1T1R cell, but the power and latency remain the same because only one of the two RRAM devices in a 2T2R cell is turned on at a time. A 6-bit DAC is used to represent 4-bit inputs. A 6-bit 100-MS/s DAC driving 256 devices consumes 390.6 µm² in area and 60 mW in power, by scaling and deriving from [6, Fig. 7]. In contrast, the input circuit driving one device is assumed to consume only 50 µm² and 0.001 mW for the DAC, and 10 µm² and 0.005 mW for the OP-AMP [6]. To reduce the area overhead of the ADC, [17] uses a single-slope (S/S) ADC to complete the data conversion, whose area and power consumption are 3000 µm² and 0.2 mW. However, its latency is as long as 200 ns. In the proposed TM-CIM, a 9-bit 100-MS/s SAR ADC with an area of 13 000 µm² and a power consumption of 1.2 mW (in 65 nm) is assumed per [16]. A TIA with 2000 µm² and 0.5 mW is assumed to provide the input to the ADC, with a latency of 10 ns. We further assume that the DAC's and the TIA's output stabilization times can be overlapped and merged into 10 ns. Table 1 gives the energy and area comparison among conventional CIM, digital-input CIM, and the proposed TM-CIM architecture. Since the conventional architecture computes all columns simultaneously, its analog input circuits would consume huge power (15 360 mW) and energy (2.343 pJ per MAC), which is extremely unfriendly for edge intelligence implementations. In contrast, the proposed TM-CIM core only consumes 3.492 mW and 0.136 pJ (per MAC).
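The per-MAC energies in Table 1 follow directly from power × phase time ÷ MACs completed per phase. A quick check using the numbers above (treating 3.492 mW as the TM-CIM core power while one column of 256 cells computes, and 15 360 mW as the conventional input-circuit power while all 256 columns compute):

```python
T_PHASE_S = 10e-9      # one 10-ns phase
MACS_PER_COLUMN = 256  # 256 rows accumulate onto one source line

def pj_per_mac(power_mw, macs=MACS_PER_COLUMN):
    # energy = power * time, normalized to the MACs finished in that time
    return power_mw * 1e-3 * T_PHASE_S / macs * 1e12

tm_cim       = pj_per_mac(3.492)             # one column at a time
conventional = pj_per_mac(15360, 256 * 256)  # all 256 columns in parallel
# tm_cim ~= 0.136 pJ/MAC, conventional ~= 2.34 pJ/MAC
```

Both results match the figures quoted from Table 1 to rounding.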
The digital-input architecture removes the DACs to avoid the substantial power consumption of the input circuits, but multiple cycles are required to implement high-precision inputs, incurring additional energy consumption and latency. Furthermore, the digital-input latency is much higher than TM-CIM's. Therefore, the proposed TM-CIM shows the best energy efficiency among the compared architectures.
The 256*256 1T1R array consumes 0.011 mm², while the 2T2R array consumes 0.022 mm². For the conventional analog architecture, the input circuits would consume 0.100 mm² and the ADCs would consume 0.768 mm². More than 98.7% (1T1R) and 97.5% (2T2R) of the area is thus consumed by the peripheral circuits, which would lead to a huge overall chip with uncompetitive capacity. The digital-input architecture removes the input circuits to reduce the area overhead. However, the per-column TIAs and ADCs still consume a considerable area. In the TM architecture, the area consumed by TIAs and ADCs can be reduced significantly by sharing them among multiple columns. Nevertheless, the digital-input architecture incurs extra energy consumption and latency as the input resolution increases. In the proposed TM-CIM, the DAC and OP-AMP can be designed with low power and a small area since only one column is computed at a time. Moreover, if every 256 columns share one ADC, the area consumed by the peripheral circuits can be reduced significantly even though a single SAR ADC consumes more area. In TM-CIM, the 1T1R architecture has a total area of only 0.045 mm², around 19.53 times smaller than the conventional analog-input scheme. The total area of the 2T2R architecture is 0.056 mm², which is 15.89 times smaller than that of the conventional analog-input scheme.
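These savings can be reproduced from the per-block areas quoted above (all values in mm², per 256×256 core; the per-column ADC total follows from 256 single-slope ADCs at 3000 µm² each):

```python
# per-core areas in mm^2, from the 65-nm estimates quoted in the text
ARRAY_1T1R, ARRAY_2T2R = 0.011, 0.022
INPUT_CIRCUITS = 0.100   # 256 analog input drivers (conventional scheme)
ADC_PER_COLUMN = 0.768   # 256 single-slope ADCs at 3000 um^2 each
TM_TOTAL_1T1R, TM_TOTAL_2T2R = 0.045, 0.056  # shared TIA/ADC totals

conv_1t1r = ARRAY_1T1R + INPUT_CIRCUITS + ADC_PER_COLUMN  # 0.879 mm^2
conv_2t2r = ARRAY_2T2R + INPUT_CIRCUITS + ADC_PER_COLUMN  # 0.890 mm^2

peripheral_share_1t1r = (INPUT_CIRCUITS + ADC_PER_COLUMN) / conv_1t1r  # ~98.7%
saving_1t1r = conv_1t1r / TM_TOTAL_1T1R  # ~19.53x
saving_2t2r = conv_2t2r / TM_TOTAL_2T2R  # ~15.89x
```

The computed ratios match the 19.53× and 15.89× figures in the text, and the 98.7% peripheral share for the 1T1R case.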
Because the DAC and OP-AMP in TM-CIM are tuned to drive only one device at 100 MS/s (i.e., 10 ns), and yet during row initialization (before column TM starts) they see the parasitic capacitance of a whole row of (i.e., 256) transistors, we further conservatively assume that the initialization takes as long as multiplexing all 256 columns, which is 2570 ns. Hence, the latency of TM-CIM is 2570*2 = 5140 ns, which is reasonable for edge intelligence since the latency is mostly determined by the slowest layer at the network level. Table 2 gives the core-level comparison between conventional CIM and the proposed TM-CIM. The proposed TM-CIM shows the best energy efficiency and density, albeit with lower throughput. Moreover, at the network level, TM-CIM can achieve a throughput comparable to conventional CIM by simply increasing the number of ADCs in the early layers; this tradeoff strategy will be illustrated in Section IV-C. Table 3 gives the comparison between the proposed TM-CIM and other ADC-shared architectures. Yao et al. [9] share each ADC with four columns to reduce the area overhead. Yoon et al. [12] use 32-to-1 multiplexers to share a 4-bit ADC among 32 columns. However, these architectures still require per-column S&H circuits, which also implies per-column TIAs, and the ADCs still consume most of the energy and area. Therefore, the proposed TM-CIM consumes the lowest area and energy.

B. NETWORK-LEVEL EVALUATION
The network-level energy estimation is based on VGG-16 with the ImageNet dataset, and the accuracy estimation is based on VGG-11 with the CIFAR-10 dataset. Multiple 256*256 2T2R crossbars are required to implement the signed weights of each layer of the network. Fig. 8(a) shows the required number of crossbars for each layer of VGG-16; the total number of required crossbars is 2121. The accuracy simulation is performed on the PyTorch platform using the DoReFa quantization-aware training framework [18], with the core-size limitation taken into account. As shown in Fig. 8(b), the accuracy drops only slightly when activations and weights are quantized to 8 bit, which is suitable for implementation on the proposed TM-CIM.
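The crossbar total in Fig. 8(a) can be reproduced from the standard VGG-16 layer shapes (thirteen 3×3 Conv layers and FC layers 25088-4096-4096-1000), assuming each signed weight occupies one 2T2R cell and layers are tiled onto 256×256 arrays as in Section III-C:

```python
import math

def xbars(rows_needed, n_neurons, size=256):
    # arrays per layer = row tiles x column tiles
    return math.ceil(rows_needed / size) * math.ceil(n_neurons / size)

CONV = [(3, 64), (64, 64), (64, 128), (128, 128),
        (128, 256), (256, 256), (256, 256),
        (256, 512), (512, 512), (512, 512),
        (512, 512), (512, 512), (512, 512)]   # (in maps, out maps), 3x3 kernels
FC = [(25088, 4096), (4096, 4096), (4096, 1000)]  # (inputs, neurons)

total = (sum(xbars(n * 3 * 3, m) for n, m in CONV)
         + sum(xbars(n, m) for n, m in FC))
# total == 2121, matching the number quoted from Fig. 8(a)
```

Notably, the FC layers account for 1888 of the 2121 arrays under this mapping.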
Note that this evaluation focuses on the aggregate power/energy consumption of the CIM cores and the aggregate latency seen by the application. Hence, we do not include the energy overhead of on-chip data traffic, nor of the input control blocks that manage and fetch the input data for each convolutional layer, as their estimates would vary significantly depending on the implementation of the network-on-chip routers and input control blocks. Table 4 gives the performance estimation of implementing VGG-16 with 256*256 2T2R arrays and 8-bit ADCs, and Table 5 gives the performance comparison between the proposed TM-CIM and other architectures. It is assumed that the arrays used to implement the same convolution layer can be computed in parallel. For the conventional analog-input architecture, the total area is 1887.997 mm², while the array area is only 46.983 mm²; the chip area is huge, and most of it is consumed by the peripheral circuits. In fact, due to the size constraints of photomasks, a chip, even if occupying the entire mask, is usually limited to around 800 mm² (exemplified by some of the biggest GPU dies). For the conventional digital-input architecture, the total area is reduced to 1675.911 mm², most of which is consumed by the per-column TIAs and ADCs. The TM digital-input architecture has the minimum area of 85.161 mm²; however, its total energy consumption and latency are still higher than those of the proposed TM-CIM once the number of computing cycles is considered. In TM-CIM, if every 256 columns share one ADC, the area consumed by the peripheral circuits is reduced significantly to 70.757 mm² even though a single faster ADC consumes more area; the 2T2R arrays consume 46.983 mm², and the total area is 117.739 mm². Compared with the conventional analog-input architecture, the area saving is more than 16 times.
In the conventional architecture, up to 256 columns in each array compute at a time. Therefore, the peak power is particularly high, especially for the analog-input circuits. This is evidenced in Table 5, where the total peak power is as high as 2527.996 W, which is in fact impractical. Note that this estimate is already significantly reduced by assuming that the fully connected (FC) layers can be calculated gradually instead of all at once. The key reason is that the DACs here are too power-hungry, despite scaling down their resolution following the benchmark in [6]. The digital-input architecture reduces the peak power significantly, but it requires more computing cycles, which also increases energy consumption; for our evaluation on VGG-16, the proposed TM-CIM is still the most energy efficient.
In TM-CIM, only one column is turned on in each array, so the peak power is efficiently reduced. Since the unselected columns can be completely turned off, TM-CIM does not consume extra energy. Compared with the conventional analog-input architecture, the energy consumption per image is reduced by 7.67 times. The latency of the proposed TM-CIM is acceptable since, in this illustrative example, we adopt a higher-speed SAR ADC at a moderate cost in area.

C. AREA, POWER, AND LATENCY TRADEOFF
The latency can be further reduced by increasing the number of ADCs in some arrays. In VGG-16, the latency is mainly determined by the first two convolutional layers. Therefore, two ADCs can be adopted in the four arrays that implement the first two layers. In this way, these arrays compute two columns simultaneously, and the latency is cut in half with minimal area increase.
As shown in Table 6, to minimize the latency of VGG-16, 32 ADCs are adopted in the first two convolution layers, and in the later layers, the number of TIAs and ADCs can be scaled down. The minimum latency is 2.007 ms with an area consumption of 1276.431 mm 2 .
However, it is not efficient to push to the minimum latency, because increasing the number of ADCs in each array requires the input circuits to drive more devices, which results in a significant increase in peak power and area. The number of TIAs and ADCs increases, resulting in more area and power consumption. The number of OP-AMPs remains the same, but their area and power consumption increase in order to drive more columns at the same time. Therefore, there is a tradeoff among area, peak power, and latency. As shown in Fig. 9(c), the latency can be reduced to 16.056 ms (1/4 of the latency without the additional TIA and ADC overhead) by increasing the area by only 0.297 mm² (0.25%) and the peak power by only 0.055 W (7.41%) on VGG-16. In contrast, the area and peak power increase considerably if a much smaller latency is sought.
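The scaling behind this tradeoff can be sketched with the core-latency model from Section IV-A (10-ns phases; row initialization conservatively assumed to take as long as the multiplexing, as in the text):

```python
import math

T_PHASE_NS = 10  # one TM phase, per the core-level evaluation

def core_latency_ns(n_cols=256, n_adcs=1):
    """Latency of one core when n_adcs columns are converted per phase.

    Row initialization is assumed to take as long as the multiplexing
    itself, hence the factor of 2 (a conservative assumption from the text).
    """
    mux = (math.ceil(n_cols / n_adcs) + 1) * T_PHASE_NS
    return 2 * mux

# 1 ADC reproduces the 5140-ns core latency; 2 ADCs roughly halve it,
# at the cost of extra ADC/TIA area and higher peak power
```

Doubling the ADCs in the latency-critical early layers therefore buys a near-2× speedup per doubling, which is why modest sharing ratios give most of the latency benefit.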

V. CONCLUSION
In this article, an energy-efficient TM memristive analog computing architecture is proposed. A column-wise memory array is designed to reduce peak power and area consumption as well as to avoid wasting energy on unselected columns. The TM neuron is designed to take full advantage of the ADC performance. The core-level evaluation on a 256*256 crossbar has shown that the proposed TM-CIM has a small energy consumption of 0.136 pJ/MAC, and the area is only 0.044 mm² for the 1T1R array and 0.055 mm² for the 2T2R array. When implementing complex DNNs such as VGG-16, TM-CIM saves area and energy consumption significantly. The tradeoff strategy among area, power, and latency is used to find the best way to implement a DNN like VGG-16. The proposed TM-CIM has low energy consumption, small area overhead, and acceptable latency, and is thus well-suited to edge intelligence applications.
Because the DAC and OP-AMP only need to drive one device in TM-CIM, their power and energy are much lower than in the case where they must drive an entire row of (256) devices. However, further optimizations of the input circuit should be possible, which we plan to investigate in future work.