A Nonvolatile Compute-in-Memory Macro Using Voltage-Controlled MRAM and In Situ Magnetic-to-Digital Converter

Compute-in-memory (CIM) accelerators have become a popular solution for achieving high energy efficiency in deep learning applications on edge devices. Recent works have demonstrated CIM macros using nonvolatile memories [spin transfer torque (STT)-MRAM and resistive random access memory (RRAM)] to take advantage of their nonvolatility and high density. However, their effective computation dynamic range is far lower than that of their static random access memory (SRAM)-CIM counterparts due to the low device ON/OFF ratio. In this work, we combine a nonvolatile memory based on a voltage-controlled magnetic tunneling junction (VC-MTJ) device, called voltage-controlled MRAM or VC-MRAM, and accurate switched-capacitor-based CIM using a novel in situ magnetic-to-digital converter (MDC). The VC-MTJ device has demonstrated $10\times$ lower write energy and switching time compared to the STT-MRAM device and has comparable density, read energy, and read latency. The in situ MDCs embedded inside each VC-MRAM row convert magnetically stored weight information to CMOS logic levels and enable switched-capacitor-based multiply–accumulate (MAC) operation with accuracy comparable to the state-of-the-art SRAM-CIM. This article describes the schematic and layout level design of a VC-MRAM CIM macro in 28 nm. This is the first nonvolatile CIM design to enable analog MAC computation with 256 parallel rows turned ON simultaneously without degradation in dynamic range (< 1 LSB). Detailed circuit simulations including experimentally validated VC-MTJ compact models show $1.5\times$ higher energy efficiency and $2\times$ higher density compared to the state-of-the-art SRAM-based CIM.


I. INTRODUCTION
Deep learning algorithms have been widely used in computer vision, natural language processing, and data analytics [1], [2]. Deep neural networks require many convolutional layers and a huge number of parameters learned from training data to achieve good inference accuracy. The latest image classification neural network has more than 100 layers and several million weight parameters [3]. This poses a big challenge to the current computing architecture: model parameters and/or intermediate results must be moved, repeatedly, between off-chip memory and on-chip memory, or between the on-chip memory and processing elements. The energy cost and latency of data movement are much higher than those of the compute logic and can overwhelm the entire system's energy budget. As such, there is a huge demand for hardware accelerators that can process deep learning algorithms efficiently on edge devices.
Compute-in-memory (CIM) is an emerging solution that reduces data movement by embedding the computing logic inside the memory. In typical CIM, the weight parameters are stored in the rows of a memory array, inputs are converted to analog voltage or pulsewidth-modulated signals and applied on the array's compute word lines, and dot products between the weights and inputs are computed in the current or charge domain by simultaneously enabling multiple rows; column analog-to-digital converters (ADCs) digitize the result. The area and energy cost of the circuitry, especially that of the ADCs, is amortized by computing long dot products, i.e., enabling as many rows as possible for the analog computation.
Several works have demonstrated CIM-based deep learning processors or macros using static random access memory (SRAM) that is available in the standard CMOS process [4], [5]. However, the number of parallelly enabled rows is limited by the large mismatch of the minimum sized transistors. Jia et al. [6] embed charge-based CIM using switched-capacitor circuits. A metal-oxide-metal (MOM) capacitor is added on top of each SRAM cell and, owing to its large size, provides good matching property that has been demonstrated to support over 1000 rows turning on at the same time without degrading the dynamic range. However, SRAM cells occupy a large area and limit the density of the macro.
In addition to high energy and area efficiency, machine learning (ML) accelerators in many edge devices desire nonvolatile storage of model parameters. Many embedded nonvolatile memory (eNVM) technologies, such as embedded flash (eFlash), MRAM, and resistive random access memory (RRAM), also offer higher storage density than SRAM. While eFlash technology has been popular in planar CMOS processes, it is increasingly difficult to scale in fin field-effect transistor (FinFET) technology beyond 22 nm [7]. MRAM technology has demonstrated better compatibility with advanced CMOS technology. It uses magnetic tunneling junction (MTJ) as a storage element that is fabricated between two interconnects in CMOS process backend. The 1T-1MTJ spin transfer torque (STT)-MRAM cell has demonstrated more than two times higher density than SRAM [8].
There have been many recent efforts to embed compute logic in MRAM and RRAM to take advantage of their nonvolatility and high density. However, only limited compute dynamic range is achieved. Hung et al. [9] and Xue et al. [10] demonstrated CIM macros using RRAM; however, they were only able to turn on eight parallel rows simultaneously, arguably due to the large variation of the RRAM resistances. Wan et al. [11] demonstrated current accumulation from 256 rows within the RRAM array during computation to achieve high energy efficiency. However, they employed accurate tuning of each RRAM device to meet the target resistance value during the write operation. Although this helps reduce device mismatches, the calibration process requires off-chip equipment and consumes considerable time and energy. The device resistance also drifts over time.
Deaville et al. [12] and Jung et al. [13] proposed CIM macros using STT-MRAM. However, the STT-MRAM's MTJs exhibit low resistances during both ON and OFF states and a low ON/OFF ratio. The small MTJ resistance value causes large current consumption when many rows accumulate at the same time. The large current causes a substantial IR drop across the wiring parasitic resistance, which makes the design challenging. The low ON/OFF ratio severely degrades the achievable compute dynamic range in the presence of inevitable device mismatches. Note that the CIM macro in [12] achieves only four effective rows, as is evident from the plot in Fig. 5 of [12]. While [13] turns on 64 rows at the same time, the achieved dynamic range is far less: the statistical plot of the measured versus ideal MAC result (Fig. 2(c) in [13]) shows that the ADC input has an effective variation of 7 (out of a maximum of 64), presumably due to MTJ resistance variation and/or mismatch, i.e., the column ADC cannot distinguish MAC results less than 7 (out of a maximum of 64). A comparison between different CIM solutions is shown in Fig. 1.
It is important to note that the primary challenge in realizing CIM in MRAM/RRAM is that of a low ON/OFF ratio. These technologies have an ON/OFF ratio less than 10, whereas each SRAM cell's read port can be completely turned off, providing an ON/OFF ratio in the thousands. We show in Section II that the low ON/OFF ratio makes CIM very sensitive to device mismatch. Furthermore, prior MRAM CIM implementations have poor energy efficiency since the MTJs draw current for as long as they are being read.
In this work, we propose a robust and accurate nonvolatile CIM architecture using voltage-controlled MRAM (VC-MRAM) that addresses the aforementioned challenges. Our work has three main contributions as follows.
1) We demonstrate the feasibility of using new voltage-controlled MRAM technology for highly parallel and accurate in-memory computing. The core device, the VC-MTJ, has been demonstrated to achieve 10× lower write energy and shorter write time than STT-MRAM, thus making it a promising candidate for the next-generation MRAM technology [18]. Due to its large resistance (>10× that of STT-MRAM), the MTJ current is very low and VC-MRAM can achieve very low-power read/CIM operation. A detailed discussion follows in Sections II and III.
2) We propose compact magnetic-to-digital converters (MDCs) that can be embedded inside the VC-MRAM array to overcome the aforementioned challenges posed by a low MTJ ON/OFF ratio. The MDC is a single-ended 8T-1C offset-canceling sense amplifier that translates information from the magnetic domain to CMOS logic HIGH/LOW with high accuracy and is embedded "in situ," i.e., within each VC-MRAM row. Since stored bits are available as CMOS logic levels, high accuracy CIM such as capacitor-based CIM is enabled with high parallelism (>1000 rows) without degrading the signal dynamic range [6], i.e., the low ON/OFF ratio problem is effectively resolved.
3) We propose a new "bit-serial weight" CIM macro architecture that improves reuse and further amortizes the MDC's energy and area overhead. Essentially, since the MDC generates CMOS logic levels, the stored weight information is reused over several compute lines (CLs) that operate on the same weights.
This article presents the circuit- and system-level design of a nonvolatile CIM macro and appropriate simulations that establish the feasibility and utility of the proposed approach. The simulations include an experimentally validated compact model of the VC-MTJ device. Section II introduces the challenges of compute in RRAM/MRAM due to large resistance variation and small ON/OFF ratio. Section III describes our proposed solution, and Section IV discusses the results and performance evaluation.

II. CHALLENGES OF COMPUTE IN NONVOLATILE MEMORY
As mentioned before, CIM macros typically turn on as many rows as possible simultaneously to increase processing parallelism and cut down the area/energy overhead of the column circuitry such as ADCs. To understand the challenge posed by a limited ON/OFF ratio, consider the simplified in-memory compute circuit model shown in Fig. 2. Fig. 2(a) corresponds to CIM scenarios that employ current summation. This is common in RRAM/MRAM-based CIM and early SRAM-based CIM. The model shows a differential implementation, where each weight is stored in two complementary cells, which is common in many such CIM solutions. The LOW/HIGH resistance represents the MTJ or FET resistance during the ON/OFF state. Each cell will draw high current ($I_H$) or low current ($I_L$) from the shared bit line (BL) depending on the multiplication result with the input ($A$). The capacitor-based model of Fig. 2(b) applies to switched-capacitor-based compute. The capacitor stores a weight as charge. During accumulation, charges on all the capacitors are shared.
Whichever model is considered, when $N$ rows are turned on together, in the absence of device mismatches, the sum signal can be one of at most $(N + 1)$ possible levels. Typically, a column ADC is designed to reliably resolve these levels either in the current domain or after conversion into a proportional voltage or time-domain signal. Fig. 2(c) shows the quantization levels assuming that current or charge is converted to a voltage with $\mathrm{LSB} = V_H - V_L$.
Invariably, mismatches between the cells in different rows degrade the effective resolution, and the total variation increases with the number of rows. Assuming that the mismatches are independent zero-mean Gaussian random variables with a normalized standard deviation of $\sigma$, the worst case standard deviation of the multiply-accumulate (MAC) sum is $\sigma_{\mathrm{sum}} = \sqrt{N}\,\sigma$. To reliably achieve no degradation of dynamic range, half the LSB should be greater than $3\sigma_{\mathrm{sum}}$. It can be shown that the effective number of rows falls as the mismatch grows and as the ON/OFF resistance ratio (RT) shrinks, as shown in Fig. 2(d), which plots $N$ versus $\sigma$ for different values of RT. Now, SRAM-based, RRAM-based, and MRAM-based CIM can be compared. STT-MRAM with a tunneling magnetoresistance (TMR) ratio of 200% has been reported and corresponds to RT = 3 [14]. However, due to the larger resistance value of the access transistor compared to the MTJ resistance and MTJ resistance variations, the effective bit cell ON/OFF ratio is much lower. In fact, Chiang et al. [15] claim that the tail bit in a large STT-MRAM array has only a 20% TMR ratio. Since both the access transistor and the MTJ contribute to mismatch, a 3% $\sigma$-mismatch is optimistic and would limit the effective number of rows to eight. RRAM has a much higher ON/OFF ratio (5-10) compared to STT-MRAM, but a 3% device mismatch would limit the number of rows to 20 during compute; state of the art in RRAM-based CIM has demonstrated 16 rows [9], [10]. SRAM-based CIM that sums FET currents has a cell ON/OFF ratio > 100, but the mismatch in the minimum-sized FETs can easily be up to 3%-5%, limiting operation to only about 8-32 simultaneously enabled rows. In contrast, SRAM-based CIM using large MOM capacitors, which owing to their relatively large size achieve much better matching, can compute with more than 1000 parallel rows without reduction in dynamic range [6].
Note that the large MOM capacitor does not require additional die area since it is overlaid on top of the SRAM cell.
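The row limits quoted above for current-summation CIM can be checked with a short numerical sketch. It rests on assumptions that the text does not state explicitly: the LSB is taken as the difference between the high and low cell signals, $\sigma$ is normalized to the high signal level, and the worst case is every row contributing the high level, so that the half-LSB condition becomes $N < ((1 - 1/RT)/(6\sigma))^2$. The function name `max_rows` is illustrative, and the sketch applies only to the current-summation model of Fig. 2(a), not to charge sharing.

```python
def max_rows(rt: float, sigma: float) -> float:
    """Upper bound on simultaneously enabled rows (current-summation model).

    Assumed model (a sketch, not necessarily the authors' exact expression):
      LSB       = S_H - S_L = S_H * (1 - 1/rt)
      sigma_sum = sqrt(N) * sigma * S_H      (worst case: all rows high)
      condition : LSB / 2 > 3 * sigma_sum
    which rearranges to N < ((1 - 1/rt) / (6 * sigma)) ** 2.
    """
    if rt <= 1.0 or sigma <= 0.0:
        raise ValueError("need rt > 1 and sigma > 0")
    return ((1.0 - 1.0 / rt) / (6.0 * sigma)) ** 2


if __name__ == "__main__":
    # Sweep a few ON/OFF ratios at 3% normalized mismatch.
    for rt in (2, 3, 5, 10, 100):
        print(f"RT={rt:>3}, sigma=3%: N < {max_rows(rt, 0.03):.1f}")
```

Under these assumptions, RT = 5 with 3% mismatch gives a bound just under 20 rows, and an effective RT of 2 gives a bound just under 8 rows, in line with the estimates quoted in this section.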
Our proposed solution combines the benefits of the VC-MRAM technology and the high dynamic range and parallelism of charge-based accumulation.

A. VOLTAGE-CONTROLLED MRAM
VC-MTJ is a magnetic storage device that uses two magnetic layers sandwiching an oxide tunneling barrier. A simplified stack structure of the VC-MTJ is shown in Fig. 4. The parallel/anti-parallel state exhibits different resistance values ($R_P$/$R_{AP}$), and the TMR ratio is expressed as $(R_{AP} - R_P)/R_P$. The VC-MTJ device is similar to STT-MRAM but has a thicker MgO layer, and the write operation is based on voltage control of magnetism (VCM). VCM arises from the modulation of the charge carrier density by an applied electric field, which affects the magnetic properties [16]. The thicker MgO barrier leads to much higher resistance and lower current than STT-MRAM. During switching, the applied voltage eliminates the perpendicular magnetic anisotropy field ($H_{\mathrm{PMA}}$) and the free layer starts to precess around the in-plane reference field, which is provided by the stray field from the in situ reference layer [17]. The free layer's magnetic moment starts at $t_1$ and then precesses to the opposite direction, noted as $t_2$ in Fig. 4. If the voltage pulse is removed when the magnetic moment reaches $t_2$, the moment becomes stable as $H_{\mathrm{PMA}}$ is recovered.
The switching of the VC-MTJ is ultrafast, typically taking less than 1 ns [18], which is 10× faster than STT-MRAM and RRAM. Fig. 5 plots the experimental switching probability versus write pulsewidth. VCM-based switching only toggles the VC-MTJ into the opposite state, so a read-verify-write procedure is needed to write a desired state. The write voltage is inversely proportional to the voltage-controlled magnetic anisotropy (VCMA) coefficient; a 0.8-V write voltage and a 115-fJ/V·m VCMA coefficient are measured in the literature [19]. Due to the low write voltage and fast speed, the VCM write operation consumes very low power comparable to STT-MRAM. A summary of the VC-MTJ is provided in Fig. 6.
The larger resistance of the VC-MTJ helps the readout circuitry achieve lower power consumption and smaller area. A resistance-area (RA) product of 600 Ω·µm² has been shown and leads to parallel-state resistances around 100 kΩ [18]. The total resistance of the 1T-1MTJ cell is the sum of the MTJ and access transistor resistances. The typical TMR ratio in VC-MTJ (100%-200%) is comparable to that of STT-MRAM. Unlike in STT-MRAM, the VC-MTJ's resistance is 10× higher than that of the access transistor, therefore achieving a higher effective cell TMR ratio. During read operation, a small voltage of opposite polarity to the write pulse is applied, which can enhance device stability and lower the disturbance rate [20]. Fig. 6 shows a summary of the comparison between VC-MRAM and STT-MRAM.
The proposed VC-MRAM CIM macro includes peripheral circuitry, which supports normal memory read/write operations. To write the VC-MTJ device, a voltage pulse is applied on the source line (SL) shared along the column and the local BL (LBL) is shorted to ground through the write switch, as shown in Fig. 3. To read the VC-MTJ device, the appropriate wordline (WL) is asserted and the MDC within the selected weight-group converts the VC-MTJ state into an electrical bit as described in Section III-B. $\phi_{\mathrm{sense}}$, $\phi_{\mathrm{comp}}$, and the corresponding input line (IL) are set HIGH. If the MDC decision is "1," the compute cell will discharge the CL and the decision is read out through the column peripheral circuit. The focus of this work is the in-memory compute operation in the VC-MRAM CIM macro, so the details of the normal read/write operation are not further elaborated.

B. COMPACT IN SITU MAGNETIC-TO-DIGITAL CONVERTER
As mentioned before, embedding the MDC inside each CIM row allows accurate capacitor-based CIM. Such an MDC needs to be very compact and consume low power. However, since the MDC is essentially a sense amplifier, large devices may be needed, in principle, to keep $V_{th}$ mismatches and other offsets to a minimum. Fig. 7 shows the proposed 8T-1C implementation of the MDC based on a local offset cancellation scheme, thereby eliminating large devices. In the sensing phase ($\phi_{\mathrm{sense}}$), $V_{\mathrm{ref}}$ sets the read voltage at LBL to about 200 mV nominally and this generates the cell current $I_{\mathrm{cell}} = I_P$ or $I_{AP}$ depending on whether the MTJ is in the P or AP state. The WL is asserted, and $M_{p1}$ and $M_{p2}$ form a cascoded current source that pushes a reference current $I_{\mathrm{ref}} = (I_P + I_{AP})/2$ into the sense node $V_{\mathrm{sense}}$. Note that VC-MRAM, by virtue of its higher RA, exhibits a cell resistance comparable to the $r_o$ of minimum-sized FETs in saturation, leading to a large cascode impedance of several hundred kΩ at $V_{\mathrm{sense}}$. This translates to a large transresistance gain at $V_{\mathrm{sense}}$. $I_{\mathrm{ref}}$ and $I_{\mathrm{cell}}$ are compared at $V_{\mathrm{sense}}$ and a large voltage swing of ∼400 mV is obtained. The swing at $V_{\mathrm{sense}}$ far exceeds the variation range of the decision threshold of a subsequent gain stage, eliminating the need for a precise second stage. A simple minimum-sized inverter suffices as the second stage to drive the decision to full rail logic levels. Furthermore, the lower read currents in VC-MRAM ensure that all devices remain in saturation during $\phi_{\mathrm{sense}}$, ensuring reliable operation at a low VDD of 0.8 V.
VOLUME 9, NO. 1, JUNE 2023
The accuracy of the first stage is key to achieving a low read-error-rate (RER). The effect of such errors on the compute accuracy is discussed in Section IV. Previous works [8], [21], [22] generate a precise reference current and mirror it to be compared with the cell current at $V_{\mathrm{sense}}$. The $V_{th}$ mismatch in the mirror transistors as well as the clamping transistors in the two distinct current paths leads to errors in the sense current $I$. Mismatch is typically controlled by upsizing the current conducting devices. In contrast, the MDC generates $I_{\mathrm{ref}}$ locally (as described in the following) by sampling a corresponding $V_{GS}$ on $C_{\mathrm{cal}}$ during $\phi_{\mathrm{cal}}$. The stored $V_{GS}$ is reused in the sense phase ($\phi_{\mathrm{sense}}$) to compare with the cell current $I_{\mathrm{cell}}$. Since $I_{\mathrm{ref}}$ and $I_{\mathrm{cell}}$ see the same current paths, the $V_{th}$ variations of the FETs in the sense path do not generate an error current, leading to an accurate read. The once-sampled reference can be reused for several reads before recalibrating. Furthermore, cascode FET $M_{p2}$ prevents coupling of the swing on $V_{\mathrm{sense}}$ to $C_{\mathrm{cal}}$, allowing better calibration reuse. A small $C_{\mathrm{cal}}$ of 0.5 fF is enough to allow reuse for eight reads. This calibration scheme cancels the circuit offsets and allows minimum-sized FETs to be used for all devices in the compact 8T-1C MDC.
To generate an accurate $I_{\mathrm{ref}}$ over process voltage temperature (PVT) corners, one extra VC-MTJ is added in each weight-group. Ideally, the reference VC-MTJ should present a conductance of $(G_P + G_{AP})/2$ during the calibration phase ($\phi_{\mathrm{cal}}$) to maximize the sense margin. This is practically implemented, as shown in Fig. 8, by combining two MTJs: one in the P state and the other in the AP state. The reference MTJs of adjacent weight-groups store these complementary states. The LBLs of the adjacent weight-groups are connected by a switch controlled by $\phi_{\mathrm{cal}}$. During $\phi_{\mathrm{cal}}$, adjacent LBLs are shorted and the two complementary reference VC-MTJs are connected in parallel and present an equivalent conductance of $(G_P + G_{AP})$. $V_{\mathrm{ref}}$ sets the voltage across the VC-MTJs, and the current is provided by diode-connected $M_{p1}$ within the two identical MDCs. The two MDCs share the generated current and each effectively sees $I_{\mathrm{ref}} = (I_P + I_{AP})/2$. Since the reference VC-MTJ is inside the local array, it closely tracks the on-chip variation and provides a reference current that maximizes the sensing margin.
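The reference-current scheme described above can be illustrated with a small idealized calculation. This is a sketch under simplifying assumptions that are not from the paper: the access-transistor resistance is neglected, $R_{AP} = R_P(1 + \mathrm{TMR})$, and the two MDCs split the shared current exactly equally. The function name `mdc_currents` and the default values (200-mV read voltage, 100-kΩ parallel resistance, 100% TMR, taken from the nominal figures quoted in the text) are illustrative.

```python
def mdc_currents(v_read: float = 0.2, r_p: float = 100e3, tmr: float = 1.0):
    """Return (I_P, I_AP, I_ref) in amperes for an idealized 1T-1MTJ cell.

    Assumptions (illustrative only): access-transistor resistance is
    neglected and R_AP = R_P * (1 + tmr).
    """
    r_ap = r_p * (1.0 + tmr)
    i_p = v_read / r_p
    i_ap = v_read / r_ap
    # During calibration, one P and one AP reference MTJ are shorted together
    # and present G_P + G_AP; two identical MDCs share that current equally.
    i_total = v_read * (1.0 / r_p + 1.0 / r_ap)
    i_ref = i_total / 2.0
    return i_p, i_ap, i_ref


if __name__ == "__main__":
    i_p, i_ap, i_ref = mdc_currents()
    print(f"I_P   = {i_p * 1e6:.2f} uA")
    print(f"I_AP  = {i_ap * 1e6:.2f} uA")
    print(f"I_ref = {i_ref * 1e6:.2f} uA")
    print(f"sense margin = +/-{(i_p - i_ref) * 1e6:.2f} uA")
```

With these idealized values the reference sits midway between the two cell currents, so the sense margin is the same for both stored states.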

C. SWITCH-CAPACITOR-BASED BIT-SERIAL BIT-PARALLEL COMPUTE
While the energy and area costs of the proposed MDC are small, we propose to amortize them further by reusing the MDC's output bit among multiple switched-capacitor-based compute cells. Note that without the MDC, it is difficult to share/reuse the bit stored in the MTJ. Fig. 9 shows the switched-capacitor compute cells and the reuse strategy. Each read-out weight bit is reused eight times using eight switched-capacitor compute cells. Inside each compute cell, the weight bit is AND-ed with an input bit, and according to the result, a small capacitor that is precharged to VDD is either discharged or left alone. During a subsequent "compute" phase set by HIGH $\phi_{\mathrm{comp}}$, the capacitors of multiple parallel rows are connected to a shared CL and CIM summation is achieved by charge sharing. Since advanced back-end processing in modern CMOS technology limits mismatch to 0.8% for a 1.2-fF capacitance [23], a high dynamic range of the summation is achieved.
Note that the input bits of the different compute cells are chosen to be the binary bits of 8-b activations and are brought into the row in parallel. The MAC result on each CL is digitized by column ADCs, binary weighted, and added in the digital domain. This corresponds to a 1 b weight × 8 b input operation. The same sequence repeats for eight compute cycles, one for each weight bit. The partial sum in each cycle is shifted and added in the digital domain to complete the 8 b weight × 8 b input operation. The sequence described above essentially implements a bit-serial weight, bit-parallel input scheme. Although only one of the eight weight bits is used for compute in a cycle, which may seem to lower throughput, the eight-way reuse of the weight bit effectively compensates for the apparent throughput loss. It is important to note that without the in situ MDC, such weight reuse is not feasible in today's STT-MRAM or RRAM CIM solutions.
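The bit-serial weight, bit-parallel input sequence described above can be sketched in software as follows. This is a behavioral illustration only: it assumes unsigned 8-b weights and inputs and an ideal column ADC that returns the exact count on each CL, and the function names are hypothetical.

```python
import random


def bit(value: int, position: int) -> int:
    """Return the bit of `value` at `position` (0 = LSB)."""
    return (value >> position) & 1


def bit_serial_mac(weights, inputs, n_bits: int = 8) -> int:
    """Dot product computed one weight bit per cycle.

    Assumes unsigned operands and an ideal ADC (illustrative model).
    """
    total = 0
    for wb in range(n_bits):          # one compute cycle per weight bit
        partial = 0
        for ib in range(n_bits):      # one compute line (CL) per input bit
            # Each compute cell ANDs the shared weight bit with its input bit;
            # the CL accumulates the number of rows whose product is 1.
            count = sum(bit(w, wb) & bit(x, ib) for w, x in zip(weights, inputs))
            partial += count << ib    # binary weighting across the eight CLs
        total += partial << wb        # shift-and-add across compute cycles
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.randrange(256) for _ in range(256)]
    x = [rng.randrange(256) for _ in range(256)]
    print(bit_serial_mac(w, x) == sum(a * b for a, b in zip(w, x)))
```

The inner loop corresponds to the eight compute cells that reuse one MDC output bit, and the outer loop to the eight compute cycles.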

IV. RESULTS AND DISCUSSION
The proposed VC-MRAM and capacitor-based CIM architecture is designed and evaluated in 28-nm CMOS. The VC-MTJ compact model from [24] is used for evaluating performance. VC-MTJ resistance variation and transistor mismatch are considered in the accuracy evaluation. The 8-b VC-MRAM CIM unit cell is laid out to show the feasibility of the physical design. The area and the throughput of the macro are also estimated for comparison.

A. COMPUTE ACCURACY
We demonstrate, in simulation, a 256-row-tall macro performing 256-way dot product accumulation using a 0.5-fF MOM capacitor in each compute cell. As described later, the MOM capacitor is overlaid on top of the compute cell's transistors. These 0.5-fF MOM capacitors have a mismatch standard deviation of 1.2%, which is still within the 3σ margin of 256-way, i.e., 8-bit, compute, as shown in Fig. 2. Note that using a larger 1.2-fF MOM capacitor reduces mismatch to 0.8%, allowing a dynamic range up to 1000 rows, but would result in larger compute cells.
In the compute phase ($\phi_{\mathrm{comp}}$), the CL parasitic capacitance siphons away signal charge during charge redistribution among the MOM capacitors and leads to a gain error in the MAC versus voltage characteristics, as shown in Fig. 10. Every row adds 0.5 fF of parasitic capacitance on the CL, limiting the full-scale range to VDD/2. However, this does not impact compute accuracy as the resulting shrunk LSB is still an order of magnitude above the kT/C noise limit at the 8-bit level. Furthermore, this confirms that thermal noise does not limit the MAC SNR.
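The full-scale and noise argument above can be checked with a back-of-the-envelope calculation. The sketch below assumes, beyond what the text states, a temperature of 300 K, that the kT/C noise is set by the total capacitance on the CL (MOM plus parasitic), and that the LSB is the full-scale range divided by $2^8$. The function name `lsb_and_ktc` is illustrative.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def lsb_and_ktc(rows: int = 256, c_unit: float = 0.5e-15,
                c_par_per_row: float = 0.5e-15, vdd: float = 0.8,
                bits: int = 8, temp: float = 300.0):
    """Return (full_scale_V, lsb_V, noise_rms_V) for the shared compute line.

    Assumptions (illustrative): noise is kT/C on the total CL capacitance
    and temp = 300 K.
    """
    c_signal = rows * c_unit            # MOM capacitors taking part in sharing
    c_par = rows * c_par_per_row        # parasitic capacitance on the CL
    c_total = c_signal + c_par
    full_scale = vdd * c_signal / c_total   # gain error from charge lost to c_par
    lsb = full_scale / (2 ** bits)
    noise_rms = math.sqrt(K_BOLTZMANN * temp / c_total)
    return full_scale, lsb, noise_rms


if __name__ == "__main__":
    fs, lsb, noise = lsb_and_ktc()
    print(f"full scale = {fs * 1e3:.1f} mV")
    print(f"LSB        = {lsb * 1e3:.3f} mV")
    print(f"kT/C noise = {noise * 1e6:.1f} uV rms")
    print(f"LSB / noise = {lsb / noise:.1f}")
```

Under these assumptions the LSB-to-noise ratio comes out above ten, consistent with the order-of-magnitude margin stated in the text.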

B. MDC READ ACCURACY
To ensure that the compact MDC does not limit the MAC accuracy, we evaluated 1) the MDC's RER and 2) the effect of MDC RER on MAC accuracy. To evaluate the former, we simulated the MDC's read error rate as a function of the VC-MTJ's TMR for a typical 5% mismatch with and without the proposed offset cancellation scheme. We observe that for a typical 100% cell TMR of the VC-MTJ, the offset-canceling sensing scheme achieves an RER better than $10^{-4}$, offering at least two orders of magnitude of improvement, as shown in Fig. 11.
To evaluate the effect of read errors on the compute accuracy, we model each of the MOM capacitors ($C_k$) as a Gaussian distributed random variable with a mean of 0.5 fF and the mismatch standard deviation given in Section IV-A; the input bits ($X_k$) and weight bits ($W_k$) are Bernoulli distributed random variables with equal chances for 0 and 1. In a charge-based accumulation scheme, this corresponds to the worst case for the compute error as pointed out in [25]. We further introduce random errors on the weight bits to model the MDC error rate. The weight bits with injected errors are represented as $\tilde{W}_k$. The analog MAC value on the CL is then evaluated using these capacitor values and the error-injected weight bits and is compared with the ideal MAC value to estimate the compute error. Fig. 12 shows the compute error standard deviation normalized to an LSB at the 8-bit level as a function of MDC read error rate, using 1-million-point Monte-Carlo sampling. We also observe that at the $10^{-4}$ RER level, we add an excess compute error standard deviation of just 5% LSB at the 8-bit level.
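A minimal Monte Carlo sketch of this kind of evaluation is given below. It is not the authors' simulation: the charge-sharing expression used here (MAC proportional to $\sum_k C_k X_k \tilde{W}_k / \sum_k C_k$, scaled so that one count equals one LSB for 256 rows), the 1.2% capacitor mismatch, the neglect of parasitic capacitance, and the function name `mac_error_std` are all assumptions made for illustration, and the default trial count is far below the one million points used in the paper.

```python
import random
import statistics


def mac_error_std(rer: float, rows: int = 256, c_sigma_rel: float = 0.012,
                  trials: int = 2000, seed: int = 0) -> float:
    """Standard deviation of (analog MAC - ideal MAC), in LSBs at `rows` levels.

    Assumed model: ideal charge sharing among mismatched unit capacitors,
    with each weight bit flipped independently with probability `rer`.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        num = 0.0   # sum of C_k * X_k * W~_k
        den = 0.0   # sum of C_k
        ideal = 0   # sum of X_k * W_k
        for _ in range(rows):
            c = rng.gauss(1.0, c_sigma_rel)   # capacitor normalized to its mean
            x = rng.getrandbits(1)            # Bernoulli(0.5) input bit
            w = rng.getrandbits(1)            # Bernoulli(0.5) stored weight bit
            w_read = w ^ 1 if rng.random() < rer else w   # MDC read error
            num += c * x * w_read
            den += c
            ideal += x & w
        errors.append(rows * num / den - ideal)
    return statistics.pstdev(errors)


if __name__ == "__main__":
    for rer in (0.0, 1e-4, 1e-3, 1e-2):
        print(f"RER={rer:g}: error std = {mac_error_std(rer):.3f} LSB")
```

Sweeping `rer` in this way gives a curve of the same kind as the one described for Fig. 12, although the exact values depend on the assumed model.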

C. 8-Bit VC-MTJ CIM UNIT LAYOUT
The 8-bit VC-MTJ weight-group, along with the MDC and compute cells, is laid out in 28-nm CMOS technology to show the feasibility of the physical design and to evaluate the density of the proposed solution. The layout of the local array measures 2.8 µm × 1.8 µm and is shown in Fig. 13. There are eight VC-MTJs within the weight-group in a 4 × 2 arrangement connected to the same LBL. The VC-MTJ is fabricated between M5 and M6. Every group of two access transistors shares the same source diffusion to save area. The MDC is laid out on the top of the array, along with the reference MTJ and write switch. The output of the MDC is shared with eight compute cells, which are placed on the right side of the MTJ array, also in a 4 × 2 arrangement. A MOM capacitor is placed on top of each compute cell in M6-M8 and does not occupy extra area.
As mentioned before, to maintain high compute density and throughput, the bit-serial weight bit-parallel input scheme was implemented. This requires routing eight parallel ILs and eight WLs horizontally within the compact local array vertical pitch. While the 16 lines can be readily accommodated in a single horizontal routing layer (M5 was chosen) within the 1.8-µm row height, the routing is nontrivial since the MTJs partially block M1-M5 and the MOM capacitors above the compute cells completely block M6-M8. So, the ILs use a bridge on M7 in the MRAM region, as shown in Fig. 13(b) and (c). The CLs are routed vertically in M4. This arrangement frees up lower metals M2 and M3 for control signal routing and MDC output. LBL runs vertically on M6. VSS and SL run vertically on M2 and VDD runs vertically on M6-M8 on the shared MOM capacitor top plate.

D. ENERGY EFFICIENCY
The energy efficiency of the VC-MRAM CIM macro is evaluated at 0.8-V supply and estimated based on SPICE simulation that incorporates parasitic capacitances. The macro has 32 slices, each implementing a different weight filter. Each slice has 256 rows arranged vertically.
The MDC consumes 2.6 fJ per read operation on average. This read cost is amortized by a factor of eight through the bit-serial weight and bit-parallel input scheme described in Section III. The input bits are applied from outside of the macro and will consume communication energy from the input buffer next to the macro. The bottom plate of the MOM capacitors in the compute cells is precharged to VDD during the reset phase and discharged to VSS if the multiplication result is 1. The compute cells consume energy from switching the nMOS-based multiplication circuit and charging the capacitors. A 50% switching activity factor is assumed for the input buffers and compute cells. Furthermore, the parasitic capacitance on the CL consumes energy due to charging up to VDD and down to the analog MAC value during the precharge and evaluation phases, respectively. Each slice has eight CLs, and the analog voltage is converted to digital bits by eight ADCs. The 6-bit SAR ADC accounts for 33% of the macro energy, as shown in Fig. 14. The 6-bit ADC does not use the full precision of the MAC result, but previous capacitor-based CIM accelerators such as [26] have demonstrated negligible loss in network classification accuracy with respect to a fixed-point implementation for the same ratio between MAC and ADC precision. The macro energy consists of four main components: input buffer, in situ sense amplifier (MDC), compute cell, and column ADCs. Each slice consumes 2.02 pJ, which corresponds to 32 TOPS/W for 8-bit operation.
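The quoted efficiency figure can be reproduced with simple arithmetic under two assumptions that the text does not spell out: that the 2.02 pJ is the energy of one slice per compute cycle (a 1-b weight × 8-b input pass over 256 rows), so that a full 8-b × 8-b result takes eight cycles, and that one MAC counts as two operations. The function names are illustrative.

```python
def tops_per_watt(energy_per_cycle_j: float = 2.02e-12, rows: int = 256,
                  weight_bits: int = 8, ops_per_mac: int = 2) -> float:
    """Energy efficiency of one slice in TOPS/W for 8-b x 8-b MACs.

    Assumptions (illustrative): `energy_per_cycle_j` is the slice energy for
    one 1-b-weight compute cycle; a MAC counts as `ops_per_mac` operations.
    """
    ops = rows * ops_per_mac                    # 8-b operations per full MAC pass
    energy = energy_per_cycle_j * weight_bits   # eight cycles per 8-b weight
    return ops / energy / 1e12


def mdc_energy_per_compute_cell(read_energy_j: float = 2.6e-15,
                                reuse: int = 8) -> float:
    """MDC read energy amortized over the compute cells sharing one read."""
    return read_energy_j / reuse


if __name__ == "__main__":
    print(f"efficiency ~ {tops_per_watt():.1f} TOPS/W")
    print(f"amortized MDC energy = {mdc_energy_per_compute_cell() * 1e15:.3f} fJ per cell")
```

Under these assumptions the result lands close to the 32 TOPS/W stated above.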

E. THROUGHPUT AND AREA
The macro operates at 250 MHz. Before the compute starts, the in situ MDC first reads one MTJ from the weight-group. Unlike conventional NV-MRAM memory access, the specific MDC implementation, along with the choice of a smaller weight-group (eight cells), decouples the BL capacitance from the read time constant, enabling a fast read time of 0.5 ns. The weight-group's size can be increased to 16 or 32 to amortize the area/energy overhead of the MDC further but might lower the macro utilization ratio. The compute cells multiply and store the results in the capacitors. The 1-bit multiplication in the compute cell happens simultaneously with the MDC read. The charge sharing between capacitors on the same CL takes just 0.2 ns, since all the rows in the CL participate in the charge sharing irrespective of the individual 1-bit multiply result and the time constant is of the order of just 20 ps. This contrasts with current summation CIM architectures, where it is possible that only one cell discharges the CL in the worst case, which could potentially limit the throughput. The 6-bit SAR ADC uses a 2-GHz clock to convert the analog MAC result on the CL in 3.5 ns. The macro performs 8-bit MAC operations with a throughput of 303 GOPS/mm². Table 1 compares this work with other state-of-the-art works. Our proposed VC-MRAM CIM solution achieves 256 rows in parallel during compute, while not sacrificing the theoretical maximum signal dynamic range. The effective dynamic range in each CIM array is calculated as the ratio between the maximum signal amplitude and the worst-case error (at a 3σ confidence interval) from the reported statistical measured versus ideal MAC result plot. Although Deaville et al. [12] achieve 128 rows in parallel, the effective dynamic range is only 12 dB. Due to the capacitor-based compute, no active current is drawn during compute and the VC-MRAM CIM solution achieves 1.5×-30× higher energy efficiency compared to other nonvolatile solutions.
The throughput density is 4×-30× higher than RRAM/STT-MRAM solutions because of the high parallelism. Compared to the SRAM capacitor-based CIM solution [6], our solution achieves 2× higher energy efficiency and 1.4× higher throughput density, while having the benefits of nonvolatility. The macro in [6] achieves 1152 parallel rows with less than 1 LSB degradation of dynamic range due to the large MOM capacitor on top of the SRAM cell. The VC-MRAM CIM macro can also achieve more than 1000 parallel rows if the compute cell uses a larger MOM capacitor, but the density and energy efficiency will be sacrificed.
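The effective dynamic range metric used in this comparison can be expressed as a one-line calculation. The sketch assumes that the ratio is converted to decibels as $20\log_{10}$ of maximum signal over worst-case error; this convention is an assumption, chosen because it reproduces the 12 dB quoted for four effective levels. The function name is illustrative.

```python
import math


def effective_dynamic_range_db(max_signal: float, worst_case_error: float) -> float:
    """Effective dynamic range in dB, assuming a 20*log10 amplitude ratio."""
    if max_signal <= 0 or worst_case_error <= 0:
        raise ValueError("both arguments must be positive")
    return 20.0 * math.log10(max_signal / worst_case_error)


if __name__ == "__main__":
    # Four distinguishable levels versus a full 256-level range.
    print(f"{effective_dynamic_range_db(4, 1):.1f} dB")
    print(f"{effective_dynamic_range_db(256, 1):.1f} dB")
```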

V. CONCLUSION
We have proposed a robust CIM architecture that combines a new VC-MRAM technology with highly parallel capacitor-based compute. We have presented an in situ MDC that is offset tolerant, compact, and ultralow power. The proposed solution is evaluated by simulation and shows much higher energy efficiency and throughput than other nonvolatile memory-based CIM solutions.