A High-Parallelism RRAM-Based Compute-In-Memory Macro With Intrinsic Impedance Boosting and In-ADC Computing

Resistive random access memory (RRAM) is considered a promising compute-in-memory (CIM) platform; however, RRAM-based CIM macros tend to lose energy efficiency quickly in high-throughput and high-resolution cases. Instead of using access transistors as switches, this work exploits their analog characteristics as common-gate current buffers, so that the cell current is minimized and the output impedance is boosted. The idea of In-ADC Computing (IAC) is also proposed to further decrease the complexity of the peripheral circuits. Benefiting from the proposed ideas, a pretrained VGG-8 network based on the CIFAR-10 dataset can be implemented, and an accuracy of 87.2% is achieved with 8.9 TOPS/W energy efficiency (for 8-bit multiply-and-accumulate (MAC) operation), demonstrating that the proposed techniques enable low-distortion partial sum results while still operating in a power-efficient way.


FIGURE 1. Comparison between (a) conventional RRAM-based CIM macro and (b) proposed RRAM CIM macro.
parallelism to only nine rows. In addition, the RC consumes a large area and requires 32:1 time multiplexing to share among columns, which significantly limits the throughput. Yin et al. [12] and He et al. [13] demonstrate a read scheme using a simple voltage-divider-based network with a voltage-mode flash ADC that can reduce the multiplex ratio. These designs, nevertheless, suffer from large read power and are prone to read disturbance. Input-aware current control [10] and sparsity-aware clamping [14] were later developed to improve the voltage-mode read performance, but they come at the expense of reduced parallelism. In another work [15], a single-slope ADC is directly used as the RC through its embedded V-I conversion. While it achieves good parallelism with high A/D resolution, the throughput and chip area are severely compromised. In addition to the above limitations, most existing RRAM-CIM macros also share a common drawback: the analog MAC results suffer from considerable distortion. To compensate for this, they rely on externally generated and calibrated ADC reference levels, which are impractical for compact and energy-constrained applications, such as sensor nodes. Therefore, further research is needed to resolve the tradeoffs between power, speed, and area.
In this article, we present an innovative RRAM-based CIM macro that unifies accuracy, compactness, and energy efficiency. We propose an intrinsic impedance boosting (IIB) technique, which exploits the access switch's analog property and turns it into a common-gate (CG) current buffer. This technique enables accessing a large number of rows and computing large MAC values with little distortion using simple interface circuits. The In-ADC Computing (IAC) technique is also proposed, which reuses the successive-approximation-register (SAR) ADC's capacitors to rebuild multibit-weight MAC results in the charge domain within the sampling process. This not only reduces both the total number of A/D conversions and the digital shift-and-add overhead but also achieves column-parallel A/D conversion without time multiplexing and an extremely compact layout. Fig. 1 provides a high-level comparison between the conventional RRAM CIM macro and the proposed one. With the proposed techniques, an accuracy of 87.2% is achieved based on the VGG-8 network for CIFAR-10 applications. The simulated energy efficiency is 8.9 TOPS/W (for 8 × 8 bit MAC), and the throughput is 256 GOPS, which indicates that the proposed techniques enable a low-distortion, high-parallelism, and power-efficient RRAM-based CIM macro.
The rest of this article is organized as follows. Section II provides background. In Section III, the idea of RRAM IIB is introduced, and its design considerations are analyzed. In Section IV, the idea of IAC is proposed. Section V presents the system-level simulation results, and Section VI concludes this article.

II. RRAM-BASED CIM BACKGROUNDS
The core idea behind CIM is to leverage a crossbar structure to conduct massively parallel MAC operations. In the context of RRAM-CIM, the weights are represented by the resistance/conductance of the crossbar cells. Ideally, the inputs can be applied across the resistors, and the generated currents represent the computation result. Such an implementation needs only resistors and can support inputs and weights with arbitrary resolution. Nonetheless, to avoid the sneak-path current issue and maintain better retention/reliability, practical RRAM-CIM macros are commonly implemented using a 1T1R array with binary cellwise computation [10], [11], [12], [13], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], as illustrated in Fig. 2(a). In this scheme, RRAM resistances are programmed between only two states, where the high-resistance state (HRS) represents logic "0" and the low-resistance state (LRS) represents logic "1". Binary wordline (WL) voltages serve as inputs to turn the access transistor on or off. A substantial cell current is generated only when the RRAM is in LRS and the access transistor is turned on, which maps to binary multiplication. To implement multibit MAC under this scheme, the inputs and weights need to be decomposed to perform bitwise multiplications, as depicted in Fig. 2(b). The input bits are applied bit-serially, one by one, to the WL, while the weight bits are programmed in parallel across adjacent columns (and unweighted). The complete MAC results are rebuilt by weighted summation of the partial results across the columns and cycles, which is typically done using digital shift-and-add (S&A) after the A/D conversion.
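The bitwise decomposition described above can be illustrated with a short behavioral sketch. It assumes unsigned inputs and weights and ideal (distortion-free) binary partial sums; the function name `bitwise_mac` and the example vectors are hypothetical and are not taken from the macro itself.

```python
def bitwise_mac(inputs, weights, in_bits=4, w_bits=4):
    """Rebuild a multibit MAC from 1-bit partial sums via shift-and-add.

    Assumes unsigned operands that fit in in_bits / w_bits.
    """
    total = 0
    for i in range(in_bits):        # one cycle per input bit (bit-serial WL)
        for j in range(w_bits):     # one column per weight bit
            # Ideal binary partial sum of one column in one cycle:
            # a cell contributes only if input bit AND weight bit are 1.
            partial = sum(((x >> i) & 1) & ((w >> j) & 1)
                          for x, w in zip(inputs, weights))
            total += partial << (i + j)   # digital shift-and-add (S&A)
    return total


inputs = [3, 7, 12, 5]
weights = [9, 2, 15, 4]
# The bitwise reconstruction matches the direct dot product.
assert bitwise_mac(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
```

Each `(i, j)` pair in the loops corresponds to one partial result that must be digitized in the conventional scheme, which is why the ADC workload grows with both resolutions.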
On the other hand, despite providing good robustness, the binary computing RRAM scheme is subject to several shortcomings in terms of energy efficiency. The first issue is the large number of ADC operations required. The number of ADC firings grows proportionally to both the input and weight resolution. For example, if 4-bit inputs and 4-bit weights are utilized, a full MAC result requires 16 ADC firings. The second challenge lies in the power consumption of the RRAM array itself and the I-V interface. The low LRS resistance of RRAM is typically in the range of a few kΩ. With high parallelism (i.e., the number of simultaneously accessed rows), the output current can be as large as tens of mA, and the lumped output resistance can be down to tens of Ω. Assuming that no distortion correction is employed, the I-V circuit must provide very low input impedance to guarantee low distortion during read-out, leading to several mW of power consumption per column. While this can be relaxed by utilizing the ADC reference levels to compensate for the distortion, the overhead is just moved to the ADC end. Large currents can also bring an extra source of errors due to the IR drop along the long array lines. In essence, these drawbacks impose a steep tradeoff between energy efficiency, parallelism, and throughput for RRAM CIM design, limiting the scalability of RRAM CIM to high-resolution DNN applications. This motivates us to develop solutions to improve both the array-level design (see Section III) and interface design (see Section IV).
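A back-of-the-envelope sketch of the two scaling issues above follows. The 16-firing figure comes from the text; the LRS resistance (5 kΩ) and read voltage (0.2 V) are illustrative assumptions, and the switch-mode cells are idealized as resistors in parallel.

```python
def adc_firings(in_bits, w_bits):
    """ADC conversions per full MAC in the conventional binary scheme."""
    return in_bits * w_bits


def lumped_column(r_lrs_ohm, v_read, rows_on):
    """Lumped output resistance and current when rows_on LRS cells conduct.

    Idealized: access transistors act as perfect switches.
    """
    r_out = r_lrs_ohm / rows_on              # parallel combination
    i_col = rows_on * v_read / r_lrs_ohm     # summed cell currents
    return r_out, i_col


assert adc_firings(4, 4) == 16               # example given in the text

# Assumed values, for illustration only.
for rows in (128, 1024):
    r_out, i_col = lumped_column(r_lrs_ohm=5e3, v_read=0.2, rows_on=rows)
    print(f"{rows} rows: R_out = {r_out:.1f} ohm, I_col = {i_col * 1e3:.2f} mA")
```

Under these assumptions the column resistance falls to tens of ohms or below, which is what forces a very low input impedance on the I-V stage.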

III. PROPOSED IMPEDANCE BOOSTED RRAM SUBARRAY
In existing RRAM CIM macros, the access transistors are always fully turned on in the triode mode as simple switches. While this is helpful for fast current development when the RRAM works as memory, it loses the impedance and current regulation capability in the saturation mode that can be useful for analog computing. Motivated by this, we propose the IIB technique, which exploits the access transistor's saturation-mode properties to solve the steep tradeoff between power and parallelism.

A. CIRCUITS AND OPERATIONS
The schematic of the proposed IIB-RRAM is shown in Fig. 3(a). Note that this idea retains the WL-input 1T1R scheme; hence, it is fully compatible with the current RRAM array. In contrast to the common approach in Fig. 3(b), which uses the bit-lines (BLs) for output with the source-lines (SLs) grounded, the proposed design adopts a swapped connection (using the SLs as output and BLs as ground). On top of this, instead of being driven all the way to VDD, the WLs are connected to a bias voltage (V_B), which is slightly above the threshold voltage. This arrangement brings two key benefits. First, the access transistors in the RRAM array are designed as CG current buffers that amplify the output impedance. Second, the saturation mode of the access transistor isolates the SL voltage from the RRAM cell voltage, enabling the current reduction of each cell.
Without loss of generality, we can use the square-law model to gain intuition. Despite being inaccurate for exact current calculation, it represents the trend well. Assuming that an RRAM cell is programmed to LRS and logic "1" is applied to the WL (the WL voltage is V_B), the cell current and the cell output resistance observed from the SL (R_o,cell) can be expressed as in (1) and (2), respectively, where V_TH and K_n are the threshold voltage and the lumped process coefficient of the access transistor, R_L is the LRS resistance, g_m denotes the transistor transconductance, and r_o is the small-signal output resistance of the access transistor. From (1), the drain and source voltages are isolated in the saturation mode. By making V_B close to and only slightly above V_TH, the voltage drop across the RRAM cell is kept very small, thus facilitating a much lower cell current. The choice of V_B is discussed in Section III-B. From (2), the equivalent output impedance from drain to source is amplified by a g_m r_o term, which is usually called the "intrinsic gain." Typically, the intrinsic gain of access transistors can reach a few tens to hundreds due to the extra channel length needed to support high-voltage programming. In short, the IIB technique transforms the RRAM array into a small-value, high-impedance current-mode digital-to-analog converter (DAC) with computation capability. It simplifies the interface design, making it possible to achieve high parallelism and low power simultaneously. Taking advantage of the optimized array current and impedance, this work employs a structure resembling a g_m-boosted CG amplifier to collect the column MAC current and convert it to voltage, as illustrated in Fig. 3(a). Note that, though this transimpedance stage may share similarities with the current interfaces in some existing works [11], the proposed IIB technique makes its specifications less demanding.
In traditional designs without IIB, the read voltage V_SL must be clamped very stably through a strong auxiliary amplifier (which boosts the g_m to reduce the input impedance of the CG amplifier) to prevent creating nonlinear current. Thanks to IIB boosting the array impedance and reducing the cell current, V_SL can be easily stabilized. Therefore, we are able to use a small CG transistor M_1 and a five-transistor OTA to guarantee robust operation.
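Equations (1) and (2) are not reproduced in this text, so the following is only a numerical sketch of the square-law argument, not the authors' expressions. It assumes the RRAM acts as source degeneration, I = K_n (V_B − I·R_L − V_TH)², and uses the textbook degenerated-transistor output resistance R_o = r_o + R_L + g_m·r_o·R_L with r_o = 1/(λI). All device values, including the channel-length modulation coefficient `lam`, are hypothetical.

```python
import math


def iib_cell(v_b, v_th, k_n, r_l, lam):
    """Square-law estimate of IIB cell current and output resistance.

    Assumed model (not from the source): I = k_n * (v_b - I*r_l - v_th)**2.
    """
    v_ov0 = v_b - v_th                       # overdrive with no drop on R_L
    x = k_n * r_l * v_ov0
    # Physical (smaller) root of the quadratic in I.
    i_cell = (2 * x + 1 - math.sqrt(4 * x + 1)) / (2 * k_n * r_l ** 2)
    v_ov = v_ov0 - i_cell * r_l              # actual overdrive voltage
    g_m = 2 * k_n * v_ov                     # square-law transconductance
    r_o = 1.0 / (lam * i_cell)               # small-signal output resistance
    r_out = r_o + r_l + g_m * r_o * r_l      # resistance seen from the SL
    return i_cell, r_out, g_m * r_o


# Hypothetical device values, for illustration only.
I_CELL, R_OUT, GAIN = iib_cell(v_b=0.60, v_th=0.45, k_n=200e-6, r_l=5e3, lam=0.05)
print(f"I_cell = {I_CELL * 1e6:.2f} uA, intrinsic gain = {GAIN:.0f}, "
      f"R_out = {R_OUT / 1e6:.2f} Mohm")
```

With V_B only 150 mV above V_TH in this sketch, the drop across the RRAM stays in the tens of millivolts and the resistance seen from the SL is orders of magnitude above R_L, which is the qualitative behavior the text attributes to IIB.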
To verify the effectiveness of the IIB, we simulate a 128-parallelism XNOR-based [12] RRAM array using Cadence Spectre. The RRAMs are implemented using the ASU RRAM model [29], [30], with the transistor model from TSMC 28-nm CMOS. Based on simulation results, an output voltage versus MAC value transfer curve can be obtained, as shown in Fig. 4(a). To facilitate comparison, we also simulate two baseline RRAM read schemes and plot the results alongside in Fig. 4(a). The first baseline is current-mode-based RCs [11], and we convert the output current to voltage through linear mapping. The second one is divider-based RCs [12]. Note that neither baseline uses the IIB technique, so the cell current is relatively high compared to our work, which results in nonlinearity. The output voltage range is normalized to [0, 1] for a fair comparison. It can be seen that our proposed technique produces a transfer curve that exhibits minimum distortion over a large MAC range compared to the baselines. We further quantify the linearity by calculating the code distance versus MAC value, as shown in Fig. 4(b). It shows that the code distance in our work remains constant, so the DNN computational error due to distortion is minimized. We also examine the read voltage on the SL with IIB enabled and disabled, respectively, as shown in Fig. 4(c). Without IIB, the SL voltage fluctuates with the partial sum results, and the voltage variation can reach up to 0.46 V, leading to large second-order effects on cell current. However, with the help of IIB, the SL voltage is stable over a large range (variation is less than 60 mV). Thus, the channel-length modulation effect of each cell can be minimized. In conclusion, compared to prior switch-based CIM macros, our work keeps a linear transfer curve even with large parallelism.
To explore the linearity performance of the proposed IIB technique, the output voltage is tested under different parallelism of 128, 256, 512, and 1024.
To eliminate the effect of process, voltage, and temperature (PVT) variations, instead of using a fixed V_B, we propose a replica bias voltage generation circuit, as shown in Fig. 5(a). This structure ensures the similarity of working conditions between the bias generation branch and the RRAM cells, thus allowing the RRAM cell current to be well defined by the reference (I_ref) regardless of PVT variation. Based on a 1000-point simulation across different PVT conditions, the standard deviation of cell current is less than 0.05% of the mean value, which demonstrates the robustness of the proposed circuits. It is worthwhile to mention that this bias generation scheme can be extended to support a multibit-per-cycle input scheme. For example, Fig. 5(b) shows a 2-bit-per-cycle example, where four different bias voltages (including GND) are generated by current sources with power-of-two weighted strength. Each 2-bit input serves as a control signal to select among GND and V_B1 to V_B3. Note that the bias generation circuit can be shared by all rows, so the bias overhead is negligible.

B. DESIGN CONSIDERATIONS AND TRADEOFFS
The key design consideration of the proposed technique lies in the choice of the access transistor turn-on voltage V_B or, equivalently, the cell reference current I_ref.
In the ideal case, I_ref can be arbitrarily small, as the access transistor will always operate in either the saturation or subthreshold mode, keeping the IIB effective. In practice, a lower bound for I_ref is set by the following three factors: 1) thermal noise; 2) random mismatch; and 3) output swing and latency.
To illustrate the tradeoff between I_ref and thermal noise, we introduce the concept of peak SNR (SNR_peak), defined as
VOLUME 9, NO. 1, JUNE 2023

FIGURE 5. (a) Bias voltage generation circuit for binary inputs. (b) Bias voltage generation circuit for multibit inputs (2-bit for example).
where I_max[k] and P_n[k] are the maximum column current and electrical noise power under parallelism k, respectively. E_Q represents the quantization noise of the ADC in the current domain [31]. SNR_peak can be viewed as a metric of the accuracy of reading out and digitizing the analog MAC value from the RRAM array. Without loss of generality, we assume the ON-state cell current equals I_ref, and the OFF-state current is zero. In addition, we also assume that the ADC's resolution is log_2(k) bits such that the current-referred quantization step is also I_ref.
Then, expression (4) can be obtained, where BW denotes the bandwidth of the RCs, K is Boltzmann's constant, and T is the temperature in Kelvin. γ is the "excess noise coefficient" [32], g_m is the transconductance of the access transistor, and R_LRS is the LRS resistance of the RRAM. Based on (4), Fig. 6(a) plots the simulated SNR of our testbench design at the maximum MAC value (where the SNR drop is the largest) as a function of I_ref. Therefore, the cell current needs to be large enough to reduce thermal noise effects.
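The defining expression for SNR_peak is not reproduced in this text. As a hedged sketch, the code below assumes SNR_peak = I_max[k]² / (P_n[k] + E_Q), with I_max[k] = k·I_ref and the usual uniform-quantization noise power E_Q = step²/12 for a step of I_ref, consistent with the assumptions stated above. The thermal-noise power is left as a free parameter because its expression, (4), is not available here.

```python
import math


def snr_peak_db(k, i_ref, p_noise=0.0):
    """Peak SNR in dB under the assumed definition I_max^2 / (P_n + E_Q).

    k       : parallelism (rows accessed simultaneously)
    i_ref   : ON-state cell current and ADC step, in amperes
    p_noise : electrical noise power in A^2 (0 gives the quantization limit)
    """
    i_max = k * i_ref
    e_q = i_ref ** 2 / 12.0          # uniform quantization noise power
    return 10.0 * math.log10(i_max ** 2 / (p_noise + e_q))


# Quantization-limited bound for k = 128; independent of i_ref when p_noise = 0.
print(round(snr_peak_db(128, 3.9e-6), 2))
```

Under this assumed definition, the quantization-limited bound for 128-row parallelism is about 52.9 dB, and any nonzero thermal-noise power pulls the value below that bound.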
In addition to thermal noise, random mismatch also plays an important role in determining I_ref. Although the proposed bias generation circuit in Fig. 5 mitigates PVT variation issues, random mismatch effects between the bias branch and RRAM cells, such as threshold voltage mismatch and size mismatch, still induce deviations in the RRAM cell current. In Fig. 6(b), a 1000-point Monte Carlo simulation with mismatch is performed, and the current mismatch statistic is collected as a function of I_ref. It shows that the cell current standard deviation can reach up to 5.21% of I_ref with a 1-µA reference current, and increasing I_ref reduces the ratio of cell current standard deviation to I_ref. The reason is that a higher I_ref results in a higher overdrive voltage, which suppresses the current deviation.
Finally, the output swing and latency are also affected by I_ref. A larger output swing is desired not only because it has better noise rejection characteristics but also because it relaxes the ADC requirements and makes the ADC easier to design. A low I_ref with a high R_L is expected to generate a high output swing while keeping the power consumption low. However, the intrinsic tradeoff between output swing and latency prevents R_L from being too large (see Fig. 6(c)). Based on the discussions above, we select I_ref to be 3.9 µA to balance SNR, random mismatch, and output swing. In this case, SNR_peak reaches 52.6 dB, which means the signal power is more than five orders of magnitude larger than the noise power, so thermal noise will not interfere with the signals. To maximize the output swing and reduce the latency, a load resistance of 1800 Ω is chosen, so a 0.63-V output swing of the RC can be obtained, which is sufficient to drive the ADC and keeps the output voltage linear. In addition, the output latency is also determined by the load resistance and cell current. As shown in Fig. 6(a), as long as I_ref is not too small, the SNR is dominated by quantization error instead of the RC noise. In this region, we do not necessarily need to trade the sampling BW, so sub-ns latency can still be achieved. In this work, 0.43-ns latency is obtained if 3 fF is used as the unit capacitance in the CDAC, which enables a high-throughput design. The cell current variation attributed to transistor mismatch and PVT variations is limited within 3%, which is negligible compared to RRAM device variation [22], [33]. The comparison of the operating cell current is given in Table 1. Compared to the conventional RRAM cell, the proposed IIB-based cell reduces the ON-state and OFF-state currents by 10.3× and 2.6×, respectively. The proposed IIB-based RRAM cell helps to reduce I_ON and I_HRS; however, the ON-OFF ratio becomes smaller, indicating a more severe ambiguity issue. Fortunately, this problem can be solved by implementing XNOR RRAM cells, which encode two cells as one weight [12].

IV. PROPOSED ADC DESIGN WITH IN-ADC COMPUTING
In most existing RRAM-based CIM macros, flash ADCs are commonly employed [10], [11], [12], [13], [17], [18], [19] for their flexibility in individual transition level tuning to compensate for the readout distortion (more details in the Supplementary Material). With the read-out linearity greatly improved through the IIB technique, such distortion compensation can be obviated. This allows us to employ an energy- and area-efficient ADC option: the voltage-mode SAR architecture. It saves the large overhead for tunable transition voltage generation and consumes less energy compared to the flash ADC at 5 b and beyond [34].
We observe that the CDAC capacitors are binary-weighted in SAR ADCs, which inherently turns the voltage across each capacitor into a weighted charge that adds up naturally during the SAR conversion. This characteristic allows weight reconstruction to be done inside the ADC, which we name IAC. Fig. 7(a) demonstrates one possible mode of IAC (Mode A), which performs a one-shot weighted summation of a 1-b-I-4-b-W MAC operation on a 5-b SAR ADC. When the first input signal X[0] drives the WL, it multiplies with the binary weights W[0], W[1], W[2], and W[3], respectively. Then, the readout voltages V_3, V_2, V_1, and V_0 are sampled onto different capacitors whose capacitances scale by a ratio of 2, while the other capacitors are connected to a fixed dc voltage V_CM. Note that this configuration is generally referred to as bottom-plate sampling (BPS) in ADC design terminology. Following the sampling, the comparator side of the CDAC is floated, while the input side merges to V_CM. This initiates charge redistribution and creates the weighted sum of V_3, V_2, V_1, and V_0 as V_x0, where V_x0 is the initial comparator input voltage (V_x) after BPS. Note that the result is a negative weighted sum. This, in fact, is useful because V_0 to V_3 have a negative slope over MAC value (see Fig. 4). The negative weighted sum allows the ADC output to be proportional to the MAC value. In addition, the comparator negative input voltage V_center is chosen to be the average of the largest and smallest V_x0, which removes the dc offset due to V_CM and the inherent dc level of the I-V stage. With Mode A, only one ADC is required for four columns, so the ADC area and power overhead are reduced by 4× compared to the conventional A-D method. It also avoids the use of multiplexers, which eliminates the tradeoff between latency and area consumption.
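The charge redistribution behind Mode A can be sketched with an ideal charge-conservation model. The capacitor sizing below (8C, 4C, 2C, and 1C sampling V_3 to V_0, plus one extra unit capacitor tied to V_CM, for 16C in total) and the choice of holding the comparator-side plate at V_CM during sampling are assumptions made for illustration; the exact expression for V_x0 is not reproduced in this text.

```python
def iac_mode_a_vx0(v_cols, v_cm, extra_units=1):
    """Comparator-side voltage after bottom-plate sampling (ideal caps).

    v_cols      : [V0, V1, V2, V3], sampled on 1C, 2C, 4C, 8C respectively
    v_cm        : common-mode voltage; also the assumed top-plate voltage
                  during sampling
    extra_units : unit caps whose bottom plate stays at v_cm throughout
    """
    caps = [1 << j for j in range(len(v_cols))]       # 1, 2, 4, 8 (units of C)
    c_tot = sum(caps) + extra_units
    # Top-plate charge stored during sampling (top plate held at v_cm);
    # the extra caps have v_cm on both plates and store no charge.
    q_top = sum(c * (v_cm - v) for c, v in zip(caps, v_cols))
    # After sampling: top plate floats, all bottom plates go to v_cm,
    # so q_top = c_tot * (v_x0 - v_cm).
    return v_cm + q_top / c_tot


# Raising V3 (on the 8C capacitor) by 0.1 V lowers V_x0 by 8/16 * 0.1 V.
print(iac_mode_a_vx0([0.5, 0.5, 0.5, 0.6], v_cm=0.5))
```

In this model each column voltage enters V_x0 with a binary weight and a negative sign, matching the negative weighted sum described in the text.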
To further increase the throughput, we propose Mode B IAC, as shown in Fig. 7(b). In this scheme, the DAC is divided into two parts with a size ratio of 2:1. When X[0] is connected to the array, its MAC results are sampled by the smaller part of the DAC. Then, the top plate (i.e., the input side) of the DAC is kept floating. Next, X[1] is fed into the array, and its MAC result is sampled onto the 2× capacitance part of the DAC, which applies a 2× weight in the data reconstruction. With Mode B, a 4 b × 4 b MAC requires only four samplings and two conversions of a 6-bit SAR ADC.
The operation of the different IAC modes is shown and compared in Fig. 7(c). The Mode A scheme collects the MAC results from every column and conducts the S&A operation by utilizing the DAC capacitance ratio. For Mode B, an extra 2× DAC is implemented to help reconstruct the input weight. The advantages of Mode B are as follows: 1) since only two 6-bit results are obtained for the subsequent process, less digital calculation is needed and 2) Mode B also reduces the latency of the CIM macro, as shown in Fig. 7(d). For Mode A, since each A-D conversion takes 4 ns, a total of 16 ns is needed to finish a 4-bit MAC operation. Note that, since the read-out circuits are disconnected from the ADC after sampling, the next input can be fed to the array at the beginning of the ADC conversion phase, as shown in Fig. 7(d). In other words, the next MAC result can be calculated simultaneously with the ADC conversion phase. For Mode B, since the SAR conversion time is reduced by half, the total latency is only 11 ns, which means the throughput is improved by 45%.
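The latency figures quoted above can be cross-checked with simple arithmetic. The sketch below takes the 4-ns conversion time and the 11-ns Mode B latency from the text as given and only derives the ratios; the function name `iac_latency` is hypothetical.

```python
def iac_latency(t_conv_ns=4.0, n_conv_mode_a=4, t_mode_b_ns=11.0):
    """Return (Mode A latency, throughput gain, latency reduction)."""
    t_mode_a_ns = n_conv_mode_a * t_conv_ns           # 4 conversions x 4 ns
    throughput_gain = t_mode_a_ns / t_mode_b_ns - 1.0
    latency_reduction = 1.0 - t_mode_b_ns / t_mode_a_ns
    return t_mode_a_ns, throughput_gain, latency_reduction


t_a, gain, reduction = iac_latency()
print(t_a, round(gain, 4), reduction)   # 16.0 0.4545 0.3125
```

The 45% throughput improvement and the 31.25% latency reduction reported in Section V are therefore two views of the same 16-ns to 11-ns change.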
In essence, the IAC-SAR approach allows ADC sharing across columns without the cost of time multiplexing. This not only retains throughput but also relaxes the area constraint on the ADC design. Still, because state-of-the-art RRAM technology can produce an array pitch as small as approximately 0.25 µm [36], careful design practice is needed to ensure a small ADC area. The detailed ADC schematic and waveform are shown in Fig. 8, where we devise the following strategies.
1) Synchronous timing is adopted for this design so that the clock generation circuits can be shared by all ADCs.
2) We apply the Vcm-based switching technique [37] to eliminate the MSB capacitor and its control logic.
3) Fast SAR logic is utilized, which is based on latches instead of D-FFs, so the SAR logic circuits can be simplified [38].
4) A compact CDAC layout reported in [39] is implemented.
Furthermore, the CDAC capacitors, made on the metal layers, are placed overlapping the transistors to further reduce area consumption. In this work, the ADC's total length is 48.5 µm, and the width is only 3 µm, which makes it easy to match the pitch width of the RRAM array.

V. SYSTEM-LEVEL SIMULATION RESULTS
The performances of our work and other state-of-the-art works are summarized and compared in this section. We benchmark our performance through SPICE simulation and the NeuroSim simulator [42], [43]. We first train a VGG-8 network based on the CIFAR-10 dataset at 8-bit precision and get 90.8% accuracy as the baseline. Then, we use an RRAM-based CIM macro for inference. For the circuit simulation, we use Cadence Virtuoso EDA software and test the performance under TSMC 28-nm CMOS technology. First, we compare the energy performance of the proposed ideas with different configurations, as shown in Table 2. For the simulation setup, an array size of 256 × 128 (with XNOR RRAM cells) and a multiplexing ratio of 1:1 are chosen as the macro structure. The ADC resolution is chosen to be 5 bit to provide enough inference accuracy and avoid consuming too much area and power. With Mode-A IAC, the ADC energy to finish 128 8 × 8 MAC is 81 pJ, and the area consumption is only 1164 µm², which is reduced by 4× compared to the conventional method. Mode B further reduces the latency by 31.25% compared to the no-IAC and Mode A cases. However, the tradeoff is that it needs a 3×-sized CDAC, which consumes 3× the CDAC power. In addition, the ADC resolution should be extended to maintain the same inference accuracy with 2-bit input and 4-bit Mode B IAC, which results in an exponentially increased CDAC power. The detailed normalized power breakdown under different configurations is compared in Fig. 9. In this work, we adopt a 2-bit input scheme with a 4-bit IAC (Mode A) to achieve a balance between energy efficiency, area, and accuracy.
Then, we compare our work with other state-of-the-art works [12], [17], [18], as shown in Table 3. In our work, we adopt 2-bit input and 1-bit weight; then, we use a 5-bit ADC to digitize the result. A voltage-mode SAR ADC with IAC is proposed for A-D conversion and data reconstruction. As discussed in Section III, the output swing can be adjusted by changing the R_L resistance, and 128 cells can be turned on simultaneously. Benefiting from the IIB technique, the cell operating current is reduced by 90%, and thus, the array energy is minimized. The latency in our work comes from input digital delay, read delay, and analog-to-digital conversion delay. However, these delays can be pipelined (the SAR conversion can be processed simultaneously with RRAM array computing), so the final read delay will be 4 ns. For energy efficiency, we normalize all the works to 8-bit MAC. According to the simulation result, the energy efficiency of the proposed CIM macro is 8.9 TOPS/W due to the power-efficient RRAM design and IAC, which is the highest among the compared state-of-the-art works. Benefiting from the low-distortion RRAM array and the RC design techniques, an accuracy of 87.2% is achieved on the CIFAR-10 dataset.

VI. CONCLUSION
This article presents an intrinsic impedance-boosted RRAM array and its peripheral circuit design. This design reuses the access transistors as CG current buffers, which reduces the cell current and enables a linear read voltage with low complexity. In addition, a compact voltage-mode SAR ADC with IAC further reduces the complexity of the peripheral circuits and saves power. The proposed ideas enable our RRAM macro to achieve 87.2% inference accuracy while operating at 8.9 TOPS/W energy efficiency for 8-bit MAC operation.