An SRAM-Based Multibit In-Memory Matrix-Vector Multiplier With a Precision That Scales Linearly in Area, Time, and Power

A novel interleaved switched-capacitor and SRAM-based multibit matrix-vector multiply-accumulate engine for in-memory computing is presented. Its operation principle is based on first converting an SRAM-stored n-bit weight into a proportional voltage using a pipeline D/A converter built from $n+1$ equally sized stages. A switched-capacitor stage then multiplies these voltages with an m-bit digital input activation. Finally, the output voltages that correspond to the different multiplication results are accumulated along one column by means of charge-sharing. With our proposed architecture, the required circuit area, computation time, and power consumption scale linearly with the bit resolution of both the inputs and the weights. Analytical formulas are presented for the energy consumption in both capacitors and switches. Moreover, the impact of fabrication mismatch on analog computation accuracy is examined. The full system architecture is described, and the feasibility is demonstrated via a full macro implementation study in 14 nm, detailing area and energy consumption, as well as the overall latency. Finally, a specific design of a $128 \times 2048$ signed matrix-vector multiplication accelerator system with 6-bit weights and 6-bit inputs in 14 nm is presented, which runs at 2.43 TOP/s at an efficiency of 16.94 TOP/s/W, while using the nominal supply voltage of 0.8 V. If the operands’ precision is considered in the metric, then the efficiency becomes 609.7 TOP/s/W.

physically separated memory and processing units. This costs time and energy and constitutes an inherent performance bottleneck. Overcoming the restrictions of the classical von-Neumann-based architectures, which enforce a dogmatic separation of the processing unit and the memory subsystem, requires reevaluating the well-established charge-based memory technologies, such as SRAM, DRAM, and Flash [1]-[3], as well as the emerging resistance-based nonvolatile memory technologies [4], [5].
It is becoming increasingly clear that, for application areas such as artificial intelligence (AI), we need to transition to computing architectures in which logic and memory are colocated [6]. In-memory computing (IMC) is a novel non-von Neumann computing paradigm where certain computational tasks are performed in the memory itself by exploiting the physical attributes and state dynamics of the charge- or resistance-based memory devices [6]. Several computational tasks, such as logical operations, arithmetic operations, and even certain machine learning tasks, can be implemented in such an IMC-based system. As a result, the execution time, energy consumption, and silicon area of an IMC-based architecture are reduced compared to von-Neumann architectures, yielding a compact and highly efficient system [7].
Generally, IMC-approaches perform computations with relatively low numerical precision. Hence, IMC does not aim to replace digital floating-point arithmetic units and, instead, targets applications, such as deep neural network inference, which are resilient to low precision. To reach the numerical accuracy typically required for data analytics and scientific computing, the limitations arising from device variability and nonideal device characteristics need to be addressed. Thus, the concept of mixed-precision in-memory computing, which combines a conventional high-precision von Neumann machine with IMC, was introduced in [8]. Its application to deep neural network training was proposed in [7] and [9].
In IMC, the physics of the nanoscale memory devices and the organization of such devices in crossbar arrays are exploited to perform certain computational tasks within the memory unit. For instance, crossbar arrays of phase change memory (PCM) and resistive random access memory (ReRAM) devices can be used to store a matrix and perform analog matrix-vector multiplications at constant O(1) time complexity without intermediate movements of data. Specifically, modulation of the read voltage amplitude or duration that is applied to a PCM cell leads to a proportional change in the current that flows through it. This property can be used to perform an analog-domain multiplication with the value stored in the PCM cell in the form of a conductance. If many cells are activated on the same line, their respective currents are summed up in accordance with Kirchhoff's law. Thus, by combining the multilevel storage capability of PCM with Ohm's and Kirchhoff's laws, an analog IMC multiply-accumulate (MAC) engine can be constructed [10].
Besides the classical analog-domain processing issues, such as thermal and A/D converter (ADC) noise, memristor-based computation is, furthermore, subject to drift and 1/f noise [11], which are issues that can only be addressed through material and device innovations [12]. In addition, the large-scale current summations require wide and highly conductive bit-lines, which leads to a reduced area efficiency [13].
Static random access memory (SRAM) can, in a similar way to PCM and ReRAM, be enhanced with IMC capabilities. A native approach consists of using the ON-resistance $R_{DS,ON}$ of the pull-down transistors to convert the digital bits stored in the SRAM cells into a proportional current value [14]-[16]. However, serious practical limitations of $R_{DS,ON}$-based IMC in SRAM arise from cell-to-cell variations, as well as from the risk of unintended overwriting of stored bits. To address these issues, charge-based analog computation has been explored as an alternative to current-based approaches, due to the much better matching of back-end-of-the-line (BEOL) capacitors. This matching is also expected to improve with CMOS scaling.
For example, the 8T SRAM cell in [17] is enhanced by a transmission-gate (TG) and a single capacitor, giving rise to a 10T + C unit cell structure. Thus, by introducing the highly matched metal capacitors, the IMC design eliminates mismatch-ridden minimum-size transistors. However, the elements of the input vector and synaptic matrix of this XNOR-based IMC engine are limited to binary values only.
A solution for supporting at least multibit inputs is presented in [18]. Nevertheless, the weights are still kept in binary format. Moreover, since a partially pulsewidth-modulation (PWM)-based D/A converter (DAC) is used, the relationship between the number of input bits and the computation cycle time becomes inherently exponential, significantly impacting the latency. An example of a near-memory computing architecture that supports multibit inputs and synaptic weights is described in [19]. However, bandwidth limitations also arise here due to the restriction to row-by-row processing and the PWM-based digital-to-analog (D/A) conversion procedure for mapping the matrix values to voltages. Finally, another multibit near-memory computing architecture, based on a passive switched-capacitor approach, is described in [20]. Here, the exponential area requirements for the flying capacitors used in the implementation of the input DAC prevent an efficient extension of this approach to large-scale in-memory computing (IMC) arrays.
As can be seen from the aforementioned recent work, severe limitations in performance, precision, scalability, and circuit complexity have arisen in all CMOS-based IMC implementations, as soon as support for more than binary quantization for both the input signals and the synaptic weights was required. This article proposes a novel SRAM-based architecture supporting in-memory MAC-operations, where both input signals and matrix elements can assume multibit values while still exhibiting a linear dependence between cycle time, silicon area, power consumption, and the number of bits of inputs and weights.
The remainder of this article is organized as follows. In Section II, the in-memory computation circuit used for the analog MAC operation is presented, and an analytical model describing the functionality is given. Moreover, the energy consumption of an analog multiply operation is analyzed in detail, and the effect of mismatch is examined. The full memory architecture is described in Section III. An examination of the computational memory system's area and the energy consumption is presented in Section IV based on an implementation study. In addition, the performance is compared with the state of the art. Finally, Section VI concludes this article.

II. MULTIBIT IN-MEMORY COMPUTE UNIT
In essence, the energy and bandwidth gains achieved through IMC stem from moving digital arithmetic and logic operations into the analog domain. As implied by the term "in-memory," all computational primitives are executed within the memory subsystem, as opposed to near-memory computing architectures that keep the processing units separated at close proximity.
Matrix-vector multiplication and transpose-matrix-vector multiplication are among the most important computational primitives for machine learning and deep learning applications. Thus, the goal of this article is the acceleration of MAC operations, which can be written in the form of

$$y = W^{\top} x$$

where $^{\top}$ denotes the transpose vector or matrix operator. Each element $y_m$ of the vector $y$ can be written as a sum of $N$ products

$$y_m = \sum_{n=1}^{N} x_n \cdot w_{n,m}.$$

Fig. 1 contrasts the conventional von-Neumann architecture, where the memory and the processing units are separated, with the in-memory computing architecture. In the latter, the matrix elements that are stored in SRAM cells remain stationary, whereas processing occurs in the so-called in-memory compute units (IMCUs), which are collocated with the SRAM. Specifically, the stationary matrix elements $w_{n,m}$ (called weights) are stored in the SRAM array, and the inputs $x_n$ are fed from outside to the IMCU. Both inputs and weights are assumed to be signed quantities. For a given fixed-point arithmetic precision, the elements $w_{n,m}$ are stored in binary form wordwise, and in column-major format, into the SRAM array, as shown in Fig. 1(b). Thus, the SRAM cells, which store the binary representation of a weight, are collocated with one IMCU in a crossbar configuration. Using signed magnitude representation (SMR), the following correspondence between a weight value $w_{n,m}$ and the stored bits $b^{n,m}_k$ is established

$$w_{n,m} = (-1)^{b^{n,m}_{sign}} \cdot \sum_{k=1}^{n_w} b^{n,m}_k \cdot 2^{k-1}.$$

In the following sections, for simplicity, the column index $m$ will be dropped from $w_{n,m}$. Hence, without any loss of generality, the MAC operation will be illustrated for a single column only. Similarly, the corresponding weight bits will be denoted by $b^n_k$ and the input bits by $i^n_p$. Moreover, the magnitude of a weight $w_n$ will be represented by $n_w$ bits and the magnitude of the input $x_n$ by $n_x$ bits. The binary variables $b^n_{sign}$ and $i^n_{sign}$ denote the respective sign bits. Finally, 8T SRAM cells will be used to store the weight bits.
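As a purely digital reference for the notation above, the following minimal sketch encodes signed integers in SMR and evaluates the column sums $y_m = \sum_n x_n w_{n,m}$. The function names `to_smr`, `from_smr`, and `matvec` are illustrative assumptions and do not appear in the original design; the bit lists are ordered LSB first.

```python
from typing import List, Tuple


def to_smr(value: int, n_mag_bits: int) -> Tuple[int, List[int]]:
    """Encode a signed integer as (sign_bit, magnitude_bits), LSB first."""
    magnitude = abs(value)
    if magnitude >= 1 << n_mag_bits:
        raise ValueError("magnitude does not fit into n_mag_bits")
    sign_bit = 1 if value < 0 else 0
    magnitude_bits = [(magnitude >> k) & 1 for k in range(n_mag_bits)]
    return sign_bit, magnitude_bits


def from_smr(sign_bit: int, magnitude_bits: List[int]) -> int:
    """Decode (sign_bit, magnitude_bits) back into a signed integer."""
    magnitude = sum(bit << k for k, bit in enumerate(magnitude_bits))
    return -magnitude if sign_bit else magnitude


def matvec(weights: List[List[int]], inputs: List[int]) -> List[int]:
    """Compute y_m = sum_n x_n * w[n][m] for an N x M weight matrix."""
    n_rows, n_cols = len(weights), len(weights[0])
    return [
        sum(inputs[n] * weights[n][m] for n in range(n_rows))
        for m in range(n_cols)
    ]
```

Each column of `weights` corresponds to one column of IMCUs, and each row to one input $x_n$ that is shared along that row.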
The two main blocks of the IMCU, namely, the DAC for transforming the binary weight into a voltage value and the analog multiplier, will be detailed in the subsequent sections. The combination of both circuits will be referred to as an interleaved switched-capacitor-based multiplier.

A. Matrix Value D/A Conversion
The initial step in the analog computation procedure consists of performing a D/A conversion, which turns the stored weight bits $b^n_k$ into a proportional voltage $V_{w,n}$. This can be achieved by connecting the digitally stored weight bits from the SRAM cells to the inputs of a DAC. Fig. 2(a) shows an implementation built from equally sized unit capacitors $C_k$, interconnected through switches. The total number of required unit capacitors in the pipeline DAC scales linearly with the number of stored bits $n_w$ in the weight magnitude

$$n_{unitcap,w} = n_w + 1.$$

The circuit is built around the quasi-passive charge-sharing DAC design presented in [21] and [22], to which two key elements have been added to realize the proposed IMCU. First, a dynamic precharge voltage selection has been integrated to support SMR. Second, a single additional switched-capacitor stage unit is connected in cascade to perform the multiplication of the D/A converted digital weight with the input data.
To conduct the D/A conversion, a set of three nonoverlapping digital pulse signals $\phi_0$, $\phi_1$, and $\phi_2$ is used. Each SRAM cell, containing one bit $b^n_k$ of the weight, controls one stage in the pipeline DAC. Based on this weight bit $b^n_k$, the corresponding capacitor $C_k$ is initially precharged to either a voltage $V_{pre}$ (for $b^n_k = 1$) or to the common-mode $V_{CM}$ (for $b^n_k = 0$). Next, the top plates of capacitor $C_k$ and its predecessor $C_{k-1}$ are shorted, effectively averaging their voltages

$$V_{C_k} = \frac{1}{2}\left(V_{C_{k-1}} + b^n_k \cdot V_{pre} + \left(1 - b^n_k\right) \cdot V_{CM}\right).$$

The first capacitor $C_0$ is always precharged to $V_{CM}$ such that the significance of the various bits is respected during the subsequent voltage-sharing operations. This procedure can be continued until all magnitude bits of the weight are processed, finally yielding the weight-proportional voltage $V_{w,n}$ on the last capacitor $C_{n_w}$ after $n_{cyc,w}$ cycles

$$V_{w,n} = V_{CM} + \left(V_{pre} - V_{CM}\right) \cdot \sum_{k=1}^{n_w} b^n_k \cdot 2^{k-1-n_w}. \tag{5}$$

The voltage $V_{pre}$ is selected based on the desired sign of $V_{w,n}$ so that both positive and negative weights can be represented in the analog domain. In total, $n_{cyc,w}$ cycles are required until the voltage $V_{w,n}$ is generated, whereby $n_{cyc,w}$ depends linearly on the number of bits in the weight

$$n_{cyc,w} = n_w + 1. \tag{6}$$
Once the first valid $V_{w,n}$ voltage is obtained on the last capacitor, pipelining allows consecutive replication of $V_{w,n}$ every three clock cycles. This is required to perform the analog multiplication, described in the next section, at high bandwidth.
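The charge-sharing conversion described in this section can be sketched behaviorally as follows. This is a minimal model under simplifying assumptions that are not part of the original text: ideal and perfectly matched unit capacitors, complete settling, and no modeling of the pipelined timing. The function name `pipeline_dac` and the default voltages ($V_{pre} = 1$ V, $V_{CM} = 0$ V) are illustrative choices.

```python
from typing import List


def pipeline_dac(mag_bits: List[int], v_pre: float = 1.0, v_cm: float = 0.0) -> List[float]:
    """Return the voltage on C_1 .. C_nw after each charge-sharing step.

    mag_bits[0] is the LSB b_1. The last entry of the returned list is the
    weight-proportional voltage V_w held on the last capacitor.
    """
    v_prev = v_cm  # C_0 is always precharged to the common mode
    trace = []
    for bit in mag_bits:
        v_k = v_pre if bit else v_cm      # precharge C_k according to its weight bit
        v_prev = 0.5 * (v_prev + v_k)     # shorting equal capacitors averages voltages
        trace.append(v_prev)
    return trace


def dac_closed_form(magnitude: int, n_w: int, v_pre: float = 1.0, v_cm: float = 0.0) -> float:
    """Closed-form DAC output: V_CM + (V_pre - V_CM) * |w| / 2**n_w."""
    return v_cm + (v_pre - v_cm) * magnitude / 2 ** n_w
```

For the 5-bit magnitude $|w| = 21 = 10101_2$, both functions give $21/32$ V with the default voltages, which illustrates that each additional bit costs one capacitor and one sharing step.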

B. Analog Multiplication
As a next step, the analog multiplication of the voltage, which is proportional to the weight, with the input must be performed. It is assumed, similar to the weight $w_n$, that the input $x_n$ is represented as an $n_x$-bit fixed-point number in SMR. Thus, the sign of the multiplication result $s^n_{result}$ can be immediately obtained through an XOR operation between the respective signs of weight ($b^n_{sign}$) and input ($i^n_{sign}$)

$$s^n_{result} = b^n_{sign} \oplus i^n_{sign}. \tag{7}$$

Consequently, the precharge voltage $V_{pre}$ can be selected based on $s^n_{result}$. The corresponding circuit implementation is shown in Fig. 3, which illustrates an example of a transistor-level implementation of an SRAM-based 3-bit signed IMCU. In the case of an unsigned multiplication, the precharge voltage selection step and the related circuitry can be omitted. Furthermore, if the distributive law is used, a multibit fixed-point multiplication of an input magnitude $|x_n|$ by a weight magnitude $|w_n|$ can be reformulated as a sum of $n_x$ binary products

$$|x_n| \cdot |w_n| = \sum_{p=1}^{n_x} \left(i^n_p \cdot |w_n|\right) \cdot 2^{p-1}. \tag{8}$$

Therefore, the multiplication can be carried out successively in $n_x$ multiply and add steps while going through the input bits one by one. In hardware, this can be optimally implemented by modifying the control signals on the switches of the MSB capacitor $C_{n_w}$, which, at the $n_{cyc,w}$-th cycle, is charged to $V_{w,n}$

$$\phi_{MSB,add} = i^n_p \;\text{AND}\; \phi_{(n_w) \bmod 3} \tag{9}$$

$$\phi_{MSB,rst} = \overline{i^n_p} \;\text{AND}\; \phi_{(n_w) \bmod 3}. \tag{10}$$
The key advantage of the proposed implementation is that, in an IMC system, these two signals need only be generated once for the full row and not per matrix element. Depending on the input bit $i^n_p$, every three cycles, the capacitor $C_{n_w}$ either produces the weight-proportional voltage $V_{w,n}$ if $i^n_p = 1$ or otherwise zero, i.e., the common-mode level

$$V_{C_{n_w}} = \begin{cases} V_{w,n}, & i^n_p = 1 \\ V_{CM}, & i^n_p = 0. \end{cases}$$

This binary multiplication result is then accumulated via charge-sharing on a dedicated capacitor $C_{out,n}$. Starting from an initial voltage $V^{0}_{C,out,n} = V_{CM}$ on $C_{out,n}$, each step of charge-sharing halves the voltage on $C_{out,n}$ and, depending on the current input bit, adds $0.5 \cdot V_{w,n}$ to it

$$V^{p}_{C,out,n} = \frac{1}{2}\left(V^{p-1}_{C,out,n} + i^n_p \cdot V_{w,n} + \left(1 - i^n_p\right) \cdot V_{CM}\right).$$

The input bits need to be traversed from LSB to MSB to ensure that the added charge corresponds to the respective bit's significance. Conversely, traversing in the opposite direction, i.e., from MSB to LSB, would require doubling a charge and adding it to that of lower significance, which is nontrivial to implement as a circuit. The accumulation of the binary multiplication can start as soon as the first pipelined D/A conversion is flushed. To this end, the signal $d_{valid}$ indicates, after $n_{cyc,w}$ cycles, that the correct voltage $V_{w,n}$ is available, after which the accumulation is initiated. To summarize, an additional input bit is processed every three clock cycles, until all $n_x$ bits of the input magnitude have been multiplied and accumulated. The required number of cycles is obtained with the following formula:

$$n_{cyc,i} = 3 \cdot n_x - 2. \tag{15}$$

Note that the first input bit is processed one cycle after the pipeline DAC settles the first time. Ultimately, the following correspondence between a final output voltage $V_{C,out,final,n}$ and the multiplicands $w_n$ and $x_n$ is established

$$V_{C,out,final,n} = V_{CM} + \left(V_{pre} - V_{CM}\right) \cdot \frac{|w_n| \cdot |x_n|}{2^{\,n_w + n_x}} \tag{16}$$

where the polarity of $V_{pre} - V_{CM}$ is selected according to $s^n_{result}$.

Fig. 3. Transistor-level implementation of a 3-bit signed in-memory compute unit (IMCU) in combination with SRAM cells. All the transmission-gates (TGs) consist of a pMOS and nMOS pair so that the charge-sharing procedures are rapidly executed for the full voltage range. In the shown configuration, both 6T and 8T SRAM cells would be supported since the IMCU is not multiplexed to be shared with several memory cells. The XOR logic element is implemented here using nine transistors.
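The proportionality between the final output voltage and the product of the operand magnitudes can be made explicit by unrolling the LSB-first charge-sharing recursion. The short derivation below is a sketch that assumes ideal, equally sized capacitors and complete settling; the shorthand $\tilde{V} = V - V_{CM}$ is introduced here for convenience only.

```latex
% Each accumulation step averages the output with the MSB capacitor:
\tilde{V}^{\,p}_{C,out} = \tfrac{1}{2}\,\tilde{V}^{\,p-1}_{C,out} + \tfrac{1}{2}\, i_p\, \tilde{V}_{w},
\qquad \tilde{V}^{\,0}_{C,out} = 0 .

% The contribution of bit i_p is halved once when it is added and once more
% in each of the remaining (n_x - p) steps:
\tilde{V}^{\,n_x}_{C,out}
  = \tilde{V}_{w} \sum_{p=1}^{n_x} i_p \, 2^{-(n_x - p + 1)}
  = \tilde{V}_{w} \cdot \frac{1}{2^{n_x}} \sum_{p=1}^{n_x} i_p \, 2^{\,p-1}
  = \tilde{V}_{w} \cdot \frac{|x|}{2^{n_x}} .

% Inserting the DAC output \tilde{V}_w = \tilde{V}_{pre}\, |w| / 2^{n_w}:
\tilde{V}^{\,n_x}_{C,out} = \tilde{V}_{pre} \cdot \frac{|w| \cdot |x|}{2^{\,n_w + n_x}} .
```

The derivation also shows why the LSB must be processed first: the bit that is added earliest is attenuated the most.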

C. Analog Accumulation
At this stage, all multiplications between the rows of the weight matrix and the input data vector have been executed. The last operation needed to complete the matrix-vector multiplication is to sum up the results along each column. In the analog domain, this is achieved by shorting all output capacitors along one column to the node $V_{col}$ using dedicated switches controlled by the $\phi_{acc}$ signal. Since only one capacitor size is used throughout the entire array, the respective voltages $V_{C,out,final,n}$ are averaged

$$V_{col} = \frac{1}{N} \sum_{n=1}^{N} V_{C,out,final,n}. \tag{17}$$

The number of cycles $n_{cyc,acc}$ required to finalize the accumulation is given by the RC time constant of the charge-sharing procedure. Given the high number of switches that are used and the relatively small unit capacitance size, a settling time of $n_{cyc,acc} = 1$ will be assumed. All the resulting column voltages are proportional to the entries in the result vector $y$ and can finally be digitized using integrated ADCs. By including the number of cycles $n_{cyc,adc}$ that the ADC needs to sample $V_{col}$, the following equation for the total number of cycles needed for the in-memory MAC operation can be defined

$$n_{cyc} = n_{cyc,w} + n_{cyc,i} + n_{cyc,acc} + n_{cyc,adc} + n_{cyc,rst} = n_w + 3 \cdot n_x + 2 \tag{18}$$

where one cycle, $n_{cyc,rst} = 1$, is used to reset the full system. Moreover, the assumption is made that the ADC sampling time for an 8-bit output resolution is below the digital circuits' cycle time [23] so that $n_{cyc,adc} = 1$. Note that (18) gives evidence for the linear relation between latency and the number of weight and input bits.
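The cycle budget in (18) and the averaging along a column can be checked with a few lines of code. In this sketch, the per-phase cycle count for the input bits, $3 n_x - 2$, is an assumption inferred from (18) together with $n_{cyc,w} = n_w + 1$ and one cycle each for accumulation, ADC sampling, and reset. The function names are illustrative.

```python
from typing import Sequence


def total_cycles(n_w: int, n_x: int, n_acc: int = 1, n_adc: int = 1, n_rst: int = 1) -> int:
    """Total cycles of one in-memory MAC operation, following (18)."""
    n_cyc_w = n_w + 1        # pipeline DAC fill time
    n_cyc_i = 3 * n_x - 2    # assumed: inferred from (18), one input bit every 3 cycles
    return n_cyc_w + n_cyc_i + n_acc + n_adc + n_rst


def column_voltage(v_out: Sequence[float]) -> float:
    """Shorting N equal output capacitors averages their voltages."""
    return sum(v_out) / len(v_out)
```

For example, `total_cycles(2, 3)` evaluates to 13 and `total_cycles(5, 5)` to 22, which matches $n_w + 3 \cdot n_x + 2$ in both cases.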

D. Numerical Example
In the following, a simple numerical example will be given to demonstrate the functionality of the MAC circuitry. To this end, the weights will be quantized as 3-bit signed fixed-point numbers ($n_w = 2$), and 4-bit signed quantization will be used for the inputs ($n_x = 3$). A possible transistor-level implementation with appropriate control signals is given in Fig. 3. Waveforms and capacitor voltage evolution throughout the analog multiplication process are given in Fig. 4. For simplicity, the common-mode voltage will be defined as 0 V, and the precharge voltages will be set to ±1 V. A weight value of $w = -3$ and an input of $x = -5$ will be used, i.e., in SMR, $|w| = 3 = 11_2$ and $|x| = 5 = 101_2$ with both sign bits set. Using (7), the sign $s_{result} = +1$ of the multiplication result is determined immediately. As the result will be positive, the pipeline D/A will only use the positive precharge voltage $+V_{pre}$. The pipeline D/A starts synthesizing the weight voltage $V_w$ from the magnitude bits $b_1$ and $b_2$. Accordingly, the first capacitor (LSB) $V_{C1}$ settles on a voltage of 1/2 V and the second one on

$$V_{C2} = V_w = \tfrac{1}{2}\left(\tfrac{1}{2}\,\text{V} + 1\,\text{V}\right) = \tfrac{3}{4}\,\text{V}.$$

In accordance with the first input bit that is processed (the LSB, which is 1 for $|x| = 101_2$), this voltage is accumulated on the output capacitor in cycle 4, thus yielding

$$V_{C,out}[4] = \tfrac{1}{2}\left(0\,\text{V} + \tfrac{3}{4}\,\text{V}\right) = \tfrac{3}{8}\,\text{V}.$$

Since the second input bit is zero, the MSB capacitor $V_{C2}$ is discharged in cycle 6, after which the common-mode voltage $V_{CM}$ is merged with $V_{C,out}$ in cycle 7

$$V_{C,out}[7] = \tfrac{1}{2}\left(\tfrac{3}{8}\,\text{V} + 0\,\text{V}\right) = \tfrac{3}{16}\,\text{V}.$$

Finally, $V_{w}$ is merged a second time with $V_{C,out}$ in cycle 10, thus processing the last input bit

$$V_{C,out}[10] = \tfrac{1}{2}\left(\tfrac{3}{16}\,\text{V} + \tfrac{3}{4}\,\text{V}\right) = \tfrac{15}{32}\,\text{V}.$$

This value can also be obtained via (16) and corresponds to the final result of the multiplication operation. Had the sign been negative, the negative precharge voltage $-V_{pre}$ would have been used, and $V_{C,out}[10]$ would have been −15/32 V.
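The numerical example can be reproduced exactly with rational arithmetic. The sketch below is a behavioral model only: it assumes ideal, matched capacitors and does not model clock phases or pipelining. The function name `imcu_trace` is an illustrative choice.

```python
from fractions import Fraction
from typing import List, Tuple


def imcu_trace(
    w: int,
    x: int,
    n_w: int,
    n_x: int,
    v_pre: Fraction = Fraction(1),
    v_cm: Fraction = Fraction(0),
) -> Tuple[List[Fraction], List[Fraction]]:
    """Return (DAC capacitor voltages, output voltages after each input bit)."""
    # Sign of the product: XOR of the two sign bits selects +V_pre or -V_pre.
    sign = -1 if (w < 0) != (x < 0) else 1
    v_supply = v_cm + sign * (v_pre - v_cm)

    # Pipeline DAC: LSB first, C_0 starts at the common mode.
    v = v_cm
    dac = []
    for k in range(n_w):
        v_k = v_supply if (abs(w) >> k) & 1 else v_cm
        v = (v + v_k) / 2
        dac.append(v)
    v_w = v

    # Multiplier: share the MSB capacitor with C_out once per input bit, LSB first.
    v_out = v_cm
    acc = []
    for p in range(n_x):
        v_msb = v_w if (abs(x) >> p) & 1 else v_cm
        v_out = (v_out + v_msb) / 2
        acc.append(v_out)
    return dac, acc


dac, acc = imcu_trace(-3, -5, n_w=2, n_x=3)
print(dac)  # DAC voltages on C_1 and C_2
print(acc)  # output voltage after input bits 1, 2, and 3
```

With $w = -3$ and $x = -5$, the trace yields 1/2 V and 3/4 V on the DAC capacitors and 3/8 V, 3/16 V, and 15/32 V on the output capacitor, in agreement with the values stated in the example.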

E. Energy Consumption
In order to quantify the energy consumed by the charge-based analog multiplier, one full operational cycle will be analyzed. Since only a single IMCU will be examined, the row index $n$ will be omitted in the next sections for simplicity. The basic circuit operation of each 10T + C cell can be summarized in three steps: an initial unit-capacitor precharge or discharge cycle, depending on weight ($b_k$) and input bits ($i_p$), followed by two charge-sharing procedures, first with the previous capacitor and then with the consecutive capacitor. Note that charge-sharing itself is quasi-passive and consumes no energy except when switching the connected TG. Electrical energy is consumed only when the capacitors are precharged from the $\pm V_{pre}$ supplies. In addition to the data-dependent energy consumption occurring during precharge events, there is a data-independent part caused by the switching events of the TGs. Both contributions to the circuit's energy consumption will be examined in the following paragraphs, and the corresponding analytical formulas will be presented.

1) Precharge Events During Initialization:
Assuming all unit capacitors $C_k$ in the circuit are initially reset to the common-mode voltage $V_{CM}$, the energy drawn during the initial precharge cycles differs from the energy drawn once the D/A circuit operates in steady state. The following formula characterizes the capacitive energy consumed during any precharge event: where $C_{unit}$ denotes the unit capacitor size, $V_{C,L}$ represents the capacitor's last voltage before the precharge event occurs, and $b_k$ indicates the fact that only precharging to nonzero-bit values consumes energy. By using the pipelined DAC output voltage equation from (5), $V_{C,L}$ can be determined as This voltage corresponds to the pipeline DAC output at the bit number $k_2$, where the bits of higher significance from $k_2 + 1$ to $n_w$ have not yet been processed. To capture the case of an uninitialized pipeline D/A, the parameter $k_1$ is introduced. This missing initialization can be observed in Fig. 5, where the initial voltages in cycle 1 on all capacitors except $C_1$ do not contain any LSB information. Since all capacitors are initially reset to the common-mode voltage $V_{CM}$, these unprocessed bits are assumed to be zero for all voltages that are developed on the subsequent capacitors of the pipeline DAC via charge-sharing. The total number of these initialization pipeline runs $n_{r,init}$ depends on the number of weight bits $n_w$ used in the IMCU In each of these runs, a number of LSBs is missing, depending on the position of the start bit. For the runs starting in cycle 1 at capacitor 1, no bits are omitted, and for the run starting at capacitor 4, bits 1-3 are missing. Furthermore, due to the special signals of the analog multiplier [see (9) and (10)], the MSB capacitor is not precharged when the first input bit is zero ($i_1 = 0$).
Finally, the capacitor voltage prior to a precharge event can be given as a function of $V_{C,L}$ for a run number $r$ and unit capacitor number $k_x$ The total amount of energy dissipated in the initial phase can now be obtained as the sum of all consumed capacitive energy

2) Steady-State Pipeline D/A Precharge Events:
Once in steady state, each run in the pipeline D/A will consume the same amount of energy, except for the input-dependent power dissipation in the MSB and the MSB-1 capacitor. The input bit count $n_x$ determines the number of pipeline runs necessary to perform the analog multiplication. Note that the first bit is processed at the end of the initial phase in Fig. 5. Since the circuit operates in pipeline mode, an additional number of incomplete runs is initiated depending on the number of weight bits $n_w$. The number of steady-state pipeline runs $n_{r,st}$, thus, becomes By using (6) and (15), the duration of each steady-state run, including the incomplete ones, can be determined

$$n_{d,st}(r) = \min\{n_w - 2,\; n_{cyc,w} + n_{cyc,i} - 3 \cdot r\}.$$
Finally, the total energy dissipated in the pipeline D/A during steady state can be written as

Fig. 6. Energy consumed by the multibit IMCU during one complete analog multiplication operation. The weight and input magnitudes $|w|$ and $|x|$ are both quantized to 5 bits. For $|w| = 0$, the least amount of energy $E_{TG}$ is consumed since only the TGs operate and no energy is drawn from the $\pm V_{pre}$ supplies. The maximum amount of energy $E_{IMCU,max}$ is consumed for $|w| = 21$ and $|x| = 0$. A $C_{unit}$ of 2 fF and a capacitive load $C_{TG} = 1$ fF were selected for the inputs of one TG in 14 nm.

3) MSB-1 Capacitor Precharge Events:
Although, in steady state, the MSB-1 capacitor does not draw a constant amount of charge from the supply, its energy consumption depends on the input bit $i_p$ that is currently processed in the analog multiplier. If this input bit $i_p$ is equal to one, then the MSB and MSB-1 capacitors are shorted, which does not occur if $i_p = 0$. The various conditions are captured in the following formula: Given that, in steady state, the MSB capacitor is precharged exactly $n_x$ times, once for each input bit, the energy consumption can be computed as follows:

4) MSB Capacitor Precharge Events:
Similar to the MSB-1 capacitor, the energy consumption on the MSB capacitor also exhibits an input data dependence. In steady state, the MSB capacitor is shorted to the output capacitor $C_{out}$ to accumulate the results of the single-bit multiplication. During a subsequent precharge event, not only the current input bit $i_p$ but also all input bits that have been multiplied and added so far must be considered when calculating the voltage on the capacitor prior to precharging. To this end, the instantaneous output capacitor voltage is defined as a function of the number of input bits that have been processed The total energy consumption in the MSB capacitor is the sum over all precharge events occurring throughout the multiplication and accumulation of each input bit

5) Switching of TGs:
All TGs used in the presented analog multiplication circuit are assumed to be built exactly the same, from equally sized pMOS and nMOS transistors. While turning on the TG, the charging procedure of the nMOS transistor gate consumes energy, and while turning it off, the same is true for the pMOS transistor gate. From circuit simulation with the extracted TG netlist, the energy consumed during one turn-on and turn-off transient, $E_{TG,transient}$, can be obtained If the total number of turn-on and turn-off events $n_{TG,events}$ is determined, then the total amount of energy consumed can also be obtained. From observation of the circuit in Fig. 3, an expression for the number $n_{TG,\phi_0}$ of TGs connected to the signal $\phi_0$ can be derived The corresponding number of turn-on events $n_{TG,ev,\phi_0}$ follows from Fig. 4

$$n_{TG,ev,\phi_0} = \frac{1}{3} \cdot \left(n_{cyc} + 2\right).$$
The sum over the number of TGs multiplied by their respective number of switching events is finally multiplied by $E_{TG,transient}$, thus yielding $E_{TG,total}$, which is the total amount of energy consumed for switching the TGs in one analog multiplication unit. If added to the capacitive energy figures, the total amount of energy consumed during one IMCU multiplication procedure can be obtained

6) Discussion:
The significance of each contribution in the overall energy balance can be seen in Fig. 6 for an implementation using a quantization of 5 bits for both the weight and the input magnitude ($n_w = n_x = 5$). It is clear that the energy $E_{TG}$ spent for switching the TGs dominates the overall consumption. The peak, denoted by $E_{IMCU,max}$, occurs at $|w| = \{1, 0, 1, 0, 1\}_2$ when each charge-sharing procedure leads to the maximum $\Delta V$ possible, and thus, the highest energy is consumed during precharging. Conversely, the minimum amount of energy $E_{TG}$ is consumed for $w = 0$ when only the TGs are switched and no precharging occurs. Fig. 7 shows the impact of weight and input quantization on the peak energy, average power, and time per multiply operation. Since each additional weight bit implies another set of capacitors and switches, as well as one additional clock cycle, both latency and power consumption increase linearly. Thus, energy consumption, which is obtained as their product, shows a quadratic dependence on $n_w$. Nonetheless, with respect to the input bits $n_x$, the energy scaling remains completely linear since, because of pipelining, the circuit only needs to operate for three additional cycles per input bit, without any additional hardware required within the IMCU.

Fig. 7. Dependence of weight ($n_w$) and input ($n_x$) magnitude quantization on the energy, time, and average power consumed for one complete analog multiplication operation of a multibit IMCU. The sign bit is not included since it does not significantly impact the overall energy consumption. The red and blue stars indicate the design point selected for the implementation study.

F. Noise and Mismatch Impact
To ensure the highest possible accuracy of an analog IMC operation, all sources of nonlinearity and noise need to be known before an adequate circuit can be designed. Since the presented IMC circuit utilizes charge-sharing procedures between capacitors for the analog computation, the impact of TG ON-resistance $R_{TG,on}$ mismatch remains negligible. This is because the circuit's cycle time is defined to always ensure complete voltage settling on a unit capacitor given a target $n_{acc,mac}$-bit precision (from [21])

$$R_{TG,on} \cdot C_{unit} \cdot n_{acc,mac} \cdot \ln 2 < T_{cycle}. \tag{37}$$

The thermal $k_B T/C$ noise impact is reduced by determining a minimum size for $C_{unit}$ such that, once all capacitors along one column are shorted, the remaining noise amplitude is below the LSB of the employed ADC. Basically, a tradeoff between analog computation precision and latency, as well as energy efficiency, has to be made. The main source of nonlinearity in this system is the mismatch between the $C_{unit}$ capacitors due to manufacturing tolerances. As a consequence, charge-sharing procedures will not result in perfect averaging of the respective capacitor voltages. Given equally designed unit capacitors $C_0, C_1, \ldots, C_{n_w}, C_{out}$, the relative errors $\epsilon_{C_k} = \Delta C_k / C_{unit}$ can be modeled as independent random variables, normally distributed with $\mathcal{N}(0, \sigma^2)$. These errors impact the D/A conversion process of the two multiplicands, weight $w$ and input $x$, differently. Consequently, the weight is converted into the nonideal voltage $V_w$ by the pipeline DAC (38). In addition, the subsequent analog multiplier performs a nonideal scaling operation by a factor $\alpha_x$, which is proportional to an input value $x_n$ (39). Using (38) and (39), the nonideal multiplication result $V_{out}$ is obtained as Note that the pipeline DAC output is impacted by the mismatch of all capacitors except $C_{out}$, whereas the analog multiplier is impacted only by the relative mismatch between $C_{n_w}$ and $C_{out}$.
Since the IMCU can be described as a dual-input DAC, the common metrics of integral nonlinearity (INL) and differential nonlinearity (DNL) can be used to obtain a quantitative measure of the analog multiplication accuracy. A change in the stored digital values of either weight or input alters the analog output voltage $V_{\mathrm{out}}$. The impact of these variations can be measured by the weight- and input-related DNL metrics, i.e., $\mathrm{DNL}_w$ and $\mathrm{DNL}_x$. Given a technology's capacitor fabrication tolerance $\sigma$ and a maximum allowed DNL value for both weight and input, $\mathrm{DNL}_{\max} = \max\{\max_{w,x} |\mathrm{DNL}_w|, \max_{w,x} |\mathrm{DNL}_x|\}$, a target yield $Y_{\mathrm{DNL}}$ can be defined to determine the usable design space via Monte Carlo simulations, as shown in Fig. 8. From the confined areas shown in the plot, it becomes evident that an accurate BEOL fabrication process is required for high-precision analog computing. Advanced patterning techniques in deep submicron nodes allow the mismatch to be kept below 0.05%, even for small $C_{\mathrm{unit}}$ sizes [24]. Furthermore, matching will gradually improve in upcoming technologies since the metal fabrication precision is increasing as the transistors are shrinking in accordance with Moore's law [25], [26]. Additional improvements of the analog computing precision could be achieved by adding calibration circuitry but only at a significant cost of area and complexity. Considering that both these factors are critical in determining the IMCU's competitiveness, calibration overhead will be avoided in the presented implementation, and only the precision provided "for free" by the underlying technology node will be used. Assuming a $\sigma$ between 0.02% and 0.1%, the design point with five weight bits and five input bits will be chosen for implementation. Since only the magnitude is accounted for in the plot, this can be extended to a 6-bit signed weight and a 6-bit signed input. This is because the sign changes only the polarity of the precharge voltage $V_{\mathrm{pre}}$ and, thus, doubles the output range unimpaired by the capacitor mismatch. Instead, the precharge voltages $\pm V_{\mathrm{pre}}$, which are assumed to be provided externally, are required to be highly accurate with regard to symmetry around the common-mode voltage $V_{\mathrm{CM}}$ to avoid inconsistent scaling of positive and negative multiplication results. In Fig. 9, the transfer characteristic of the analog multiplier is shown. Despite the presence of mismatch, the 10-bit multiplication result tracks the ideal output curve to a great extent.
Fig. 8. IMCU design space obtained from 2000 Monte Carlo simulations considering mismatch between the unit capacitors $C_{\mathrm{unit}}$. A target yield of $Y_{\mathrm{DNL}} \geq 99\%$ for a maximum DNL smaller than 0.5 was assumed. Only the magnitude bits are considered. The sign bit is not included for either weight or input as it only determines the polarity of the precharge voltage and does not impact the DNL. The red star indicates the selected design point.
Fig. 9. Output characteristics of an IMCU using 5-bit unsigned weights and inputs. It is based on simulations using ideal switch models and nonideal unit capacitors $C_{\mathrm{unit}}$ with a relative matching variation of $\sigma = 0.1\%$. The respective DNL curves for the weight $\mathrm{DNL}_w$ (red) and input $\mathrm{DNL}_x$ (black) are shown in the top left plot. Both the DNL and INL remain bounded within ±1 LSB out of 10 bits.
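To illustrate how a yield figure of this kind can be estimated, the following Python sketch runs a small Monte Carlo experiment on the weight path only. It is a simplified stand-in, not the authors' simulation: the DAC model (each stage capacitor is precharged to the bit value and then shorted to the previous stage, LSB first), the function names, and the use of the ideal LSB for normalization are all assumptions made for this example, and the input-dependent multiplier stage is ignored.

```python
import random


def dac_voltage(code, caps, v_ref=1.0):
    """Assumed charge-sharing DAC model, LSB first.

    caps[0] starts discharged; each following stage cap is precharged to
    bit * v_ref and then shorted to the previous stage cap.
    """
    n_bits = len(caps) - 1
    v = 0.0
    for k in range(1, n_bits + 1):
        bit = (code >> (k - 1)) & 1
        v = (caps[k - 1] * v + caps[k] * bit * v_ref) / (caps[k - 1] + caps[k])
    return v


def max_abs_dnl(caps, v_ref=1.0):
    """Largest |DNL| over all code transitions, in units of the ideal LSB."""
    n_bits = len(caps) - 1
    lsb = v_ref / 2 ** n_bits
    levels = [dac_voltage(code, caps, v_ref) for code in range(2 ** n_bits)]
    return max(abs((hi - lo) / lsb - 1.0) for lo, hi in zip(levels, levels[1:]))


def dnl_yield(n_bits, sigma, dnl_max=0.5, trials=2000, seed=0):
    """Fraction of random mismatch samples whose max |DNL| stays below dnl_max."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        # n_bits + 1 unit capacitors with normally distributed relative error
        caps = [1.0 + rng.gauss(0.0, sigma) for _ in range(n_bits + 1)]
        if max_abs_dnl(caps) < dnl_max:
            passed += 1
    return passed / trials


print(f"Yield for 5 bits, sigma = 0.1%: {dnl_yield(5, 0.001):.3f}")
```

Sweeping `n_bits` and `sigma` with such a function and marking the region where the yield stays at or above the target gives a design-space map of the kind summarized in Fig. 8.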
In order to assess the performance of a full column of 128 IMCUs in the presence of other nonidealities, such as device mismatch and thermal noise, a transistor-level simulation is performed on a 14-nm implementation. The results are presented in Fig. 10 alongside the ideal outputs of a fixed-point digital implementation. It can be seen that the error waveform of the analog MAC output, plotted in the lower-right inset of Fig. 10, exhibits the typical S-shape characteristic, which arises from the $v_{gs}$-dependent TG ON-resistances, superimposed on the thermal noise error waveform. Furthermore, since this error is bounded between ±0.2 of an 8-bit LSB, it can be deduced that, when digitizing using an ADC with a minimum ENOB of 8, the obtained result will be very close to the truncated output of a full-precision fixed-point operation.
Deep neural network inference tasks, which are the designated applications for the presented IMC system, can tolerate this small reduction in the precision of the MAC operation, usually with no loss or, in certain cases, with insignificant loss in classification accuracy. The effects of ADC quantization, to which any reduced-precision implementation is subjected, are studied in detail in [27].

III. SYSTEM ARCHITECTURE
The goal of this work is to enhance standard SRAMs with IMC capabilities while maintaining the original memory architecture as much as possible. Conventional wordwise read and write procedures, for instance, are still needed to carry out fundamental memory I/O routines. In the system architecture, as depicted in Fig. 11, these basic elements are maintained with the same functionality. Their design and the mode of operation are described in [1]. The three building blocks that differentiate the novel SRAM architecture from the original will be described in the following sections.

A. IMC Subblocks
Support for performing multibit in-memory MAC operations is achieved by integrating the IMCU described in Fig. 3 into the SRAM array. If the bandwidth requirements for the IMC operation can be relaxed, then significant amounts of silicon area can be saved by time-multiplexing each analog multiplication circuitry between a set of $n_{\mathrm{shared}}$ words organized in a memory subarray. As a result, the execution of an in-memory MAC operation on the complete array will require an additional number of cycles, which is proportional to $n_{\mathrm{shared}}$. Furthermore, this entails a change in the storage element. Instead of directly using the internal nodes of the 6T cell, the data now need to be read locally, preferably simultaneously in all IMCUs. To this end, the 8T SRAM cell is employed since it allows local read procedures in the memory subarrays via local bit-lines, as well as global read/write operations via the legacy periphery [28].
For performing the local read, additional sense-amplifiers (SAs) will be required, which can be much smaller in size than the peripheral amplifiers due to the lower number of cells that they cover. Finally, the outputs of the SAs are connected to the inputs of the multibit IMCU, as shown in Fig. 11(b). By taking into account the number of cycles $n_{\mathrm{read,local}}$ required for the local read operation, the total latency for a full matrix-vector multiplication becomes $n_{\mathrm{cyc,total}} = n_{\mathrm{shared}} \cdot (n_{\mathrm{read,local}} + n_{\mathrm{cyc}})$, where $n_{\mathrm{cyc}}$ was defined in (18). Consequently, the local read procedures also enter the energy balance as an array-size-dependent term $E_{\mathrm{read,local}}$.

B. IMC FSM
The in-memory MAC operation requires the IMCU circuitry and a well-defined sequence of pulses, similar to those shown in the example of Fig. 4. One way of generating these signals would be to adopt a timing block similar to those in classical SRAM architectures. In this article, a simpler solution using clock-gated shift registers is proposed, which is more flexible in the implementation and gives a more conservative and comprehensible estimation for energy consumption.
After receiving a positive edge signal, the FSM enables the clock signal of a large block of shift registers, as shown in Fig. 11(a), for a well-defined number of cycles. Some signals are common for the entire array, for instance, the three pulse signals $\phi_0$, $\phi_1$, and $\phi_2$. Others, such as the signal pair $\phi_{\mathrm{MSB,add}}$ and $\phi_{\mathrm{MSB,rst}}$, are generated for each column depending on the input vector bits. In addition to the signals shown in Fig. 4, all signals required for performing the local read operation must also be provided by the FSM. Finally, all signals need to be buffered sufficiently in order to drive the respective inputs across the array.

C. ADC Design
After the analog MAC operation is completed, the final result, which corresponds to a voltage stored as charge across a column of output capacitors, needs digitization at sufficient precision. This necessitates the use of voltage input A/D converters. Given a large number of input signals, the ADCs, as shown in Fig. 11(c), should be designed to be as small, as fast, and as energy-efficient as possible. With this in mind, the SAR ADC design shown in [23] is used as a starting point. Since the input consists of charge on a large capacitor, voltage buffers and complicated sampling circuits can be avoided, and the input can instead be transferred by means of charge-sharing to the capacitive DAC (CDAC) of the SAR ADC. The conversion procedure itself can be executed using a self-timed state machine to achieve high-speed conversion cycles of below 1 ns in 14 nm [29]. Moreover, the cost of each conversion is bounded at 3.3 pJ. By pitch-matching the ADC circuit to the width of one IMC subblock, the conversion latency impact on the overall bandwidth can be kept minimal.
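For readers less familiar with successive-approximation conversion, the following Python sketch shows the basic binary-search decision loop of an idealized SAR ADC. It is a generic textbook model and not the design of [23]: the function name, the unipolar input range, and the default 8-bit resolution are assumptions, and the attenuation caused by charge-sharing the column voltage onto the CDAC is not modeled.

```python
def sar_convert(v_in, v_ref=1.0, n_bits=8):
    """Idealized SAR conversion: decide one bit per step, MSB first."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Keep the trial bit if the input is at or above the trial level.
        if v_in >= trial * v_ref / 2 ** n_bits:
            code = trial
    return code


# One comparator decision per bit, so conversion time grows linearly with n_bits.
for v in (0.0, 0.3, 0.5, 0.999):
    print(f"{v:.3f} V -> code {sar_convert(v)}")
```

Each conversion needs exactly `n_bits` comparator decisions, which is one reason a SAR topology suits a compact, pitch-matched column ADC.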

IV. SYSTEM IMPLEMENTATION STUDY AND ANALYSIS
To demonstrate the benefit of the IMC-based architecture, a full system implementation study is performed, detailing the various components' area and power consumption. Again, 6-bit signed weights and 6-bit signed inputs are assumed. The full memory has 128 × 2048 weights, arranged in 128 × 64 IMC subblocks, each holding 32 weights.
If standard design rules are employed rather than specialized SRAM push-rules, then the 8T SRAM cell size becomes 0.312 µm × 0.768 µm. Accordingly, the local SA circuit is designed with a matching height, using an area of 0.504 µm × 0.768 µm. Furthermore, this height is maintained in the various IMCU blocks so that, given the subcomponents' size reported in Table I, the total area of one IMCU can be determined to be 0.756 µm × 5.376 µm. Note that the unit capacitors are designed in the metals located above the transistors to keep the footprint as small as possible, similar to the approach taken in [17]. Finally, the size of one IMC subblock, consisting of IMCU, SAs, and SRAM array, is determined as 11.24 µm × 5.376 µm.
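The quoted subblock width can be cross-checked from the component sizes. The Python sketch below assumes that the 32 SRAM words, the local SA, and the IMCU are placed side by side along the width of a subblock; this layout is an inference from the numbers, not a statement taken from the article.

```python
# Component dimensions in micrometers, as quoted in the text.
CELL_W, CELL_H = 0.312, 0.768      # 8T SRAM cell
SA_W = 0.504                       # local sense amplifier width
IMCU_W, IMCU_H = 0.756, 5.376      # one IMCU
WORDS_PER_SUBBLOCK = 32

# Assumption: words, SA, and IMCU sit side by side in one subblock row.
subblock_w = WORDS_PER_SUBBLOCK * CELL_W + SA_W + IMCU_W
subblock_area = subblock_w * IMCU_H

print(f"Subblock width:  {subblock_w:.3f} um")
print(f"Subblock area:   {subblock_area:.2f} um^2")
```

Under this assumption the width evaluates to about 11.24 µm, in agreement with the stated subblock size of 11.24 µm × 5.376 µm.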
If the peripheral circuits, decoders, read/write circuitry, IMC FSMs, and ADCs are added, the area becomes 769.980 µm × 792.398 µm for the full system. Regarding area efficiency, 56.4% of the area is occupied by the SRAM cells, and the IMC-related overhead amounts to about 35.4%. Figures for the energy spent in each operation are listed in Table II. Initially, before the actual IMC operation begins, the first column of SRAM words has to be locally read to make the corresponding values available to the attached IMCUs. Note that this operation can be completed rapidly in 2 ns because the local SAs cover a relatively small SRAM subarray with an accordingly small local bitline capacitance, and the obtained results are used locally and not transferred to the periphery. If executed in all 8192 subarrays, a total energy of 196.61 pJ is consumed, according to simulations.
In the following step, after the local read, the IMCUs generate the voltages, which correspond to the MAC result and are eventually digitized by the ADCs. This process of alternating local-read and IMC operation is repeated 32 times, taking 216 ns, until the full matrix has been processed. Including the FSM and ADC energy, a total amount of 30.96 nJ is spent. These figures can be used to determine the full system throughput as 2.43 TOP/s at an efficiency of 16.94 TOP/s/W.
In relation to the throughput and energy efficiency figures, i.e., TOP/s and TOP/s/W, it has to be noted that the bit precision is not taken into account, thus putting the lowest precision implementations at an advantage. To adequately reflect the additional computational complexity tackled by multibit accelerators, the respective quantization of weight $n_w$ and input $n_x$ can be factored in, similar to the approach taken in [19], yielding precision-scaled TOP/s and TOP/s/W. This is shown in Table III, where recent implementations of analog in-memory MAC-operation accelerators using SRAM combined with capacitors [17], [18], [30], [31] are compared with the presented work.
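The system-level figures can be reproduced from the numbers quoted above. The Python sketch below assumes the common convention that one MAC counts as two operations and that the precision scaling is a simple multiplication by $n_w \cdot n_x$; both are assumptions, although they are consistent with the reported 2.43 TOP/s, 16.94 TOP/s/W, and 609.7 TOP/s/W to within rounding of the inputs.

```python
ROWS, COLS = 128, 2048   # matrix dimensions
OPS_PER_MAC = 2          # assumption: multiply + accumulate = 2 operations
T_TOTAL = 216e-9         # seconds for one full matrix-vector multiplication
E_TOTAL = 30.96e-9       # joules, including FSM and ADC energy
N_W, N_X = 6, 6          # weight and input precision in bits

ops = ROWS * COLS * OPS_PER_MAC
throughput = ops / T_TOTAL               # operations per second
efficiency = ops / E_TOTAL               # operations per joule (= OP/s/W)
scaled_efficiency = efficiency * N_W * N_X

print(f"Throughput:        {throughput / 1e12:.2f} TOP/s")
print(f"Efficiency:        {efficiency / 1e12:.2f} TOP/s/W")
print(f"Precision-scaled:  {scaled_efficiency / 1e12:.1f} TOP/s/W")
```

The small deviations in the last digit stem from the rounded time and energy values used as inputs.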
For example, a scheme that could scale in terms of weight and input bits is demonstrated in [30]. There, a specific implementation for 4-bit inputs is presented. These multibit inputs are realized by expanding the 4-bit input value into a number of pulses, which, based on the different weight bit values along the rows, causes the capacitive read-bit-lines to discharge by a proportional amount. However, this input-to-time conversion creates an inherent exponential dependence of the latency on the number of input bits, which limits finer input quantization. On the other hand, the multibit weights are realized in a single time step by employing a number of compensation and computation capacitors. Since the total capacitance has an exponential dependence on the number of weight bits ($\sim 2^{n_w}$), the chip area scales exponentially with the number of bits as well. Finally, note that the overall area overhead introduced for enabling the IMC capabilities remains manageable since, similar to [19], the standard SRAM macro internals remain unmodified, and only pitch-matched components are added to the periphery. In summary, the system described in [30] delivers high energy and area efficiency for the selected 4-bit input and weight quantization, with the relatively low throughput being the only downside. This architecture might match well with the common requirements for IoT and edge applications. However, scaling the inputs and the weights beyond 4 bits incurs significant area and latency penalties. The same observations apply to the design described in [18], which demonstrates simultaneous vector processing, though for binary weights only. This architecture supports input precision scalability, albeit at an exponential cost in terms of latency. Furthermore, the high energy-efficiency number shown in [18] becomes less than that presented in this article, once the operands' precision is factored in.
TABLE III. COMPARISON TABLE OF SRAM- AND CAPACITOR-BASED ANALOG IN-MEMORY MAC-OPERATION ACCELERATORS
Both the accelerator systems presented in [17] and [31] demonstrate high parallelism with completely binary implementations for inputs and weights. Note that, in binary cases, a multiply operation can be reduced to a single XNOR operation [32]. As a consequence of this simplification, no scalability in terms of precision can be achieved. In addition to restricting the operands' precision to binary only, [17] also applies binary batch-normalization on the analog MAC-operation result instead of digitizing using an ADC, thus contributing to the immensely high efficiency reported there.
In addition to the pure analog IMC approaches listed in Table III, a very interesting combination of a binary IMC approach with digital techniques to increase precision was presented in [33]. Specifically, through the use of digital shift-and-add circuitry, binary-only analog accelerators, such as those in [17] and [31], can gain linear scalability in terms of weight and input quantizations. However, this incurs a cost in terms of latency and energy due to the multiple ADC cycles, as well as an increase in the area due to the peripheral digital adder circuits.
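The principle behind such a shift-and-add extension can be sketched in a few lines of Python. The example below is a purely digital illustration under simplifying assumptions: operands are unsigned, each binary dot product (which would be the analog, ADC-digitized step in hardware) is modeled as an exact bitwise AND and sum, and the function names are hypothetical. It does not reproduce the specific circuit of [33].

```python
def binary_dot(a_bits, b_bits):
    """Dot product of two 0/1 vectors, standing in for one binary IMC pass."""
    return sum(a & b for a, b in zip(a_bits, b_bits))


def shift_add_dot(weights, inputs, n_w, n_x):
    """Multibit dot product assembled from binary passes via shift-and-add."""
    total = 0
    for i in range(n_w):
        w_plane = [(w >> i) & 1 for w in weights]   # i-th bit of every weight
        for j in range(n_x):
            x_plane = [(x >> j) & 1 for x in inputs]  # j-th bit of every input
            # Each partial result is weighted by 2^(i + j) in the digital domain.
            total += binary_dot(w_plane, x_plane) << (i + j)
    return total


weights = [3, 5, 2]
inputs = [1, 4, 7]
print(shift_add_dot(weights, inputs, n_w=3, n_x=3))
print(sum(w * x for w, x in zip(weights, inputs)))  # direct reference result
```

Every binary pass requires its own digitization before the shifted accumulation, which reflects the additional ADC cycles and adder hardware mentioned above.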

V. CONCLUSION
The cost, in terms of time and energy, associated with data movement has driven the concept of in-memory computing for neural network applications. According to this approach, the dominating matrix-vector operations are performed in-place, i.e., in the memory itself by exploiting certain physical properties of memory technologies. However, the main challenge of in-memory computing is the accuracy of the analog MAC operations.
In this article, we introduced a linearly scalable multibit in-memory computing system for accelerating MAC operations in standard SRAM. A novel interleaved switched-capacitor-based IMCU was proposed for conducting the analog computation, and its potential with regard to speed and energy efficiency was demonstrated. Although various SRAM-based matrix-vector multiplication engines have been proposed in the literature, our approach is the first to achieve computational precision that scales linearly in time, power, and area. Moreover, we have shown, via transistor-level Spectre simulations, that, by using multibit representations for the input signals and the weights, there is no significant penalty in the accuracy of the SRAM-based analog MAC operations compared with a corresponding all-digital implementation with the same precision.
From a system design perspective, applications requiring a rather high quantization or precision (4-8 bits) will benefit substantially from the linear scalability of the presented IMC circuit and architecture, which can offer high throughput at an acceptable cost of area and energy. Finally, besides SRAM as the underlying memory technology, other volatile or nonvolatile memory technologies using simple SA-based read, such as DRAM, MRAM, or binary PCM, could also potentially be used in conjunction with our IMCU concept to provide multibit MAC computing capabilities.