C3PU: Cross-Coupling Capacitor Processing Unit Using Analog-Mixed Signal for AI Inference

This paper presents a novel cross-coupling capacitor processing unit (C3PU) that supports analog-mixed signal in-memory computing to perform multiply-and-accumulate (MAC) operations. The C3PU consists of a capacitive unit, a CMOS transistor, and a voltage-to-time converter (VTC). The capacitive unit serves as a computational element that holds the multiplier operand and performs multiplication once the multiplicand is applied at its terminal. The multiplicand is the input voltage, which is converted to a pulse width signal using a low-power VTC. The transistor transfers the multiplication result to the output, where a voltage level is generated. A demonstrator of a $5 \times 4$ C3PU array that is capable of implementing 4 MAC units is presented. The design has been verified using Monte Carlo simulation in 65 nm technology. The $5 \times 4$ C3PU consumes 66.4 fJ/MAC at a 0.3 V supply voltage with an error of 5.7%. Compared to a digital $8 \times 4$-bit fixed-point MAC unit, the proposed unit consumes $3.4\times$ less energy and occupies $3.6\times$ less area with a similar error. The C3PU has been demonstrated through iris flower classification using an artificial neural network, which achieved a 90% classification accuracy compared to the ideal accuracy of 96.67% obtained using MATLAB.


I. INTRODUCTION
Multiply-and-accumulate (MAC) units are essential building blocks for digital processing units that are used in a multitude of applications, including artificial intelligence (AI) for edge devices, signal/image processing, convolution, and filtering [1]. Recently, research has focused on AI applications to address complex machine learning problems such as image/speech recognition and language translation [2]. Deep neural networks (DNNs) are widely utilized in such applications since they can achieve high accuracy [3]. However, DNN algorithms are computationally intensive, with large data sets that require high memory bandwidth. This results in memory access bottlenecks that introduce considerable energy and performance challenges. The memory access energy is 1-3 orders of magnitude higher than the compute energy [4]. At the same time, DNNs are approximate in nature, and many AI applications can tolerate lower accuracy [5]. This opens the opportunity for potential tradeoffs between energy efficiency, accuracy, and latency.
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Crippa.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
One direction to reduce the need for explicit memory access is to utilize in-memory computing (IMC) architectures. IMC has significant advantages in energy efficiency and throughput compared to traditional computing based on the von Neumann architecture [6]. IMC can be implemented in the digital [7], [8], analog [9], [10], or time [11], [12] domain for computing in artificial neural networks (ANNs), convolutional neural networks (CNNs), and DNNs. Analog computing has gained great interest as it shows significant advantages in computing efficiency, especially for larger crossbar array sizes [13]. One way to implement analog computing is through memristive devices that store the weights as conductance values [9], [14]. The voltage signal is applied to the memristor crossbar, and the multiplication by each memristor produces an output current, according to Ohm's law, that is accumulated across each column. Although the non-volatile memristor device can work in the analog domain, it suffers from low endurance and sneak path issues that may cause a state disturbance [15]. Another way of computing in the analog domain is through a capacitive network. The work demonstrated by IBM in [10] utilizes the capacitor as an analog memory to store the weights as charges that control the conductance of the transistors. However, this solution is limited by the relatively large and complex biasing circuit required to control the charges on the capacitor, in addition to the non-linearity caused by variations of the drain-to-source voltage of the transistor. The work in [16] discusses the circuit design of an analog processor used for CNN face detection.
It implements the analog memory through a sampling capacitor, an input capacitor, a unity gain buffer, and several switches, whereas the computation is deployed using SRAM. The work in [17] utilizes a CNN for face detection based on a hybrid analog-digital processor. It implements an analog memory using a capacitor and a source follower, whereas the multiplication with the weight is performed using a switched drain regulation current mirror (SDR-CM). The analog-based computing is utilized only for the first layer in the CNN to reduce the power consumption. Note that the analog computing in both [16] and [17] includes two separate blocks, one for storage and the other for computation. To eliminate the additional memory block, the work in [18] utilizes a switched capacitor circuit where the weights are stored in the capacitors and the convolution operation is performed simultaneously. The work in [6] employs both an 8T-SRAM as a memory and a cross-coupling capacitor as an accumulator to perform binary MAC operations using a bitwise XNOR gate.
To implement an analog MAC operation, this paper develops a novel cross-coupling capacitor (C3) computing unit, hence named the C3 processing unit (C3PU), coupled with voltage-to-time converter (VTC) circuitry. The C3PU performs multiplication using capacitive coupling and accumulation through the transistor bitline in the array. The main contributions of this paper can be summarized as follows:
• To the best of the authors' knowledge, this is the first circuit design that utilizes a cross-coupling capacitor for IMC as both a memory and a computational element to perform analog MAC operations.
• The proposed C3PU can be utilized in applications that heavily rely on vector-matrix multiplications, including but not limited to ANN, CNN, and DSP. The design is ideal for applications with fixed coefficients such as pre-trained CNN weights and image compression [19].
• A $5 \times 4$ C3PU crossbar is demonstrated that consumes 66.4 fJ/MAC at a 0.3 V voltage supply with an error of 5.7% compared to computation in MATLAB.
• The use of the proposed C3PU has been demonstrated through iris flower classification on a two-layer ANN. The synaptic weights are trained offline and then mapped into capacitance ratio values for the inference phase. The ANN classifier circuit is designed and simulated in 65 nm CMOS technology. It achieves a high inference accuracy of 90% compared to the baseline accuracy of 96.67% obtained from MATLAB.
The rest of this paper is organized as follows. Section II presents the C3PU circuit design and explains how the MAC operation is performed. Section III discusses the implementation of the MAC operations in a $5 \times 4$ C3PU crossbar architecture. Section IV shows an example of a potential C3PU application targeting iris flower classification using an ANN architecture in 65 nm technology. Finally, Section V concludes the paper.

II. PROPOSED C3PU CIRCUIT AND OPERATION
The following subsections discuss the operational details of the proposed C3PU. The basic principle is to use a coupling capacitance to transfer the input voltage to the transistor's gate. The current passed through the transistor is then linearly proportional to the generated gate voltage.
A. C3PU OPERATION
Fig. 1a shows the proposed C3PU circuit that performs the in-memory multiplication operation. The C3PU consists of a CMOS transistor and a capacitive unit that includes a cross-coupling capacitor $C_c$, a capacitor $C_b$ connected between the gate of the transistor and ground, and the transistor's gate capacitance $C_g$. The modulated input voltage amplitude $V_{in}$, which is the first multiplication operand, is applied at the terminal of the capacitive unit. The second operand is stored in the capacitive unit as an equivalent capacitance ratio $X_{eq} = \frac{C_c}{C_c + C_b + C_g}$. The capacitive computational unit multiplies the two operands and generates a voltage $V_g$ that is a function of $V_{in}$, $C_c$, $C_b$, and $C_g$ as given in Eq. 1. $V_g$ is applied to the gate of the CMOS transistor, producing a drain-source current $I_{ds}$ as given in Eq. 2, where $G_m$ is the transistor's transconductance. Note that $I_{ds}$ is proportional to the product of the two operands $V_{in}$ and $X_{eq}$. Since the multiplication is linear, the transistor must also operate in a linear mode in order to transfer the multiplication correctly to the output in the form of an electrical current.
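Eqs. 1 and 2 themselves are not reproduced in the text. From the definitions above (a capacitive divider driving the gate, followed by a linear transconductance), they presumably take the following form. This is a reconstruction under that assumption, and the original equations may include additional terms:

```latex
\begin{align}
V_g    &= \frac{C_c}{C_c + C_b + C_g}\, V_{in} = X_{eq}\, V_{in} \tag{1} \\
I_{ds} &= G_m\, V_g = G_m\, X_{eq}\, V_{in} \tag{2}
\end{align}
```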
The value of $V_g$ determines the operational mode of the transistor and affects its transconductance value and hence its linearity. Fig. 2 depicts the $I_{ds}$ of the transistor versus $V_g$ at $VDD_{C3PU} = 0.3$ V. As shown in the figure, $I_{ds}$ is approximately linear only when $V_g$ is between 0.5 V and 0.8 V, with a transconductance slope of 230.13 µS and a mean square error (MSE) of 2.37 pS between the observed and expected values. Linearity over such a small range of $V_g$ creates some design constraints. First, the input voltage has to be selected within a certain high value range. This means that $V_{in}$ requires normalization to accommodate low $V_{in}$ values, which results in a mapping error. Second, even when $V_{in}$ is high, the capacitance ratio $X_{eq}$ must also be high enough to provide a $V_g$ value large enough to run the transistor in linear mode.
To overcome these issues, which significantly affect the functionality of the proposed C3PU multiplier, the analog input voltage is processed in the time domain rather than the voltage domain. This is achieved using a voltage-to-time converter (VTC), as shown in Fig. 1b, which converts the amplitude of the analog input $V_{in}$ into a time delay to generate a modulated pulse width signal $V_{pw}$. This way, the voltage level of $V_{pw}$ is ensured to be high, with a value equal to the VTC's supply voltage $VDD = 1$ V. Consequently, the transistor will always operate in linear mode given that $X_{eq}$ is selected within a specific high range between 0.5 and 0.75 and $VDD_{C3PU}$ is low, with a value of 0.3 V. If $X_{eq} > 0.75$, the value of $V_g$ will saturate. The resultant $I_{ds}$ becomes a function of $V_{pw}$, as shown in Eq. 3, that is linearly proportional to the time delay. The proposed VTC circuit design, discussed in Section II-B, achieves high conversion linearity over a wide range of $V_{in}$. This guarantees that the C3PU performs a valid multiplication between $V_{in}$ and $X_{eq}$ by: a) providing a linear conversion from $V_{in}$ to $V_{pw}$, and b) running the transistor in linear mode.
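The constraint described above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the ideal capacitive-divider relation $V_g = X_{eq} \cdot V$ and using only the ranges quoted in the text (linear window of 0.5 V to 0.8 V, $X_{eq}$ between 0.5 and 0.75, pulse amplitude of 1 V). The function names are hypothetical, and $C_g = 0$ in the last example is an illustrative assumption, not a value from the paper.

```python
# Illustrative check of the gate-voltage window; names are hypothetical.
V_PW_HIGH = 1.0                # amplitude of V_pw, equal to the VTC supply (V)
LINEAR_VG_RANGE = (0.5, 0.8)   # V_g window where I_ds is approximately linear (V)


def equivalent_ratio(c_c, c_b, c_g):
    """X_eq = C_c / (C_c + C_b + C_g); any consistent capacitance unit works."""
    return c_c / (c_c + c_b + c_g)


def gate_voltage(x_eq, v_high=V_PW_HIGH):
    """Ideal capacitive divider: V_g = X_eq * applied voltage."""
    return x_eq * v_high


def in_linear_region(v_g, window=LINEAR_VG_RANGE):
    """True when V_g falls inside the approximately linear window."""
    low, high = window
    return low <= v_g <= high


# Time-domain input: the pulse amplitude is always 1 V, so X_eq alone sets V_g.
for x_eq in (0.5, 0.6, 0.75):
    print(f"X_eq={x_eq:.2f} -> V_g={gate_voltage(x_eq):.3f} V, "
          f"linear={in_linear_region(gate_voltage(x_eq))}")

# Voltage-domain input: a low amplitude such as 0.3 V leaves the window.
print(in_linear_region(gate_voltage(0.75, v_high=0.3)))

# Example ratio with C_c = C_b = 2.5 fF and an assumed C_g of 0 fF.
print(equivalent_ratio(2.5, 2.5, 0.0))
```

With a fixed 1 V pulse amplitude, every ratio in the stated range keeps $V_g$ inside the linear window, whereas a low-amplitude voltage-domain input does not, which is the motivation for moving the input to the time domain.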
Representing the input $V_{in}$ in the time domain has several advantages over the voltage domain, since both time and capacitance scale better with technology. In addition, the time domain exhibits less variation and provides better noise immunity than the voltage domain, where the signal-to-noise ratio is degraded due to voltage scaling [20].
B. PROPOSED VOLTAGE-TO-TIME CONVERTER (VTC)
Fig. 3 shows the block diagram of the proposed VTC circuit design. It consists of a sampling circuit, an inverter, and a current source. To achieve voltage-to-time conversion, the VTC has two operating phases: sampling and evaluation. The basic principle is to transfer the charges from the input to the capacitor during the sampling phase and then discharge this capacitor through a current source during the evaluation phase. A simple inverter is used to translate the time it takes to discharge the capacitor into a delay. The delay is linearly proportional to the input voltage.
During the sampling phase, as shown in Fig. 3b, $S_1$ and $S_4$ turn on when the clock $V_{clk} = 1$ V, and $S_2$ and $S_3$ are off when the inverted clock $V_{clkb} = 0$. The capacitor $C_1$ is precharged with a voltage $V_c$ that is equal to the input voltage $V_{in}$. The capacitor $C_2$ is charged with a voltage $V_x$ that is equal to the supply voltage $VDD$. During the evaluation phase, as shown in Fig. 3c, $S_1$ and $S_4$ turn off when $V_{clk} = 0$, and $S_2$ and $S_3$ turn on when $V_{clkb} = 1$ V. The node $V_c$ is coupled to $V_x$. In this phase, the functionality of the VTC depends on $V_{in}$. When $V_{in}$ is high, i.e., $V_{in} = VDD$, then $V_c = V_x$ and the initial charge across the capacitors is $Q_i = VDD(C_1 + C_2)$. On the other hand, when $V_{in}$ is small, i.e., $V_{in} = 0$, the initial charge across the capacitors is $Q_i = V_{in} C_1 + VDD \cdot C_2$. Due to the potential difference between $C_1$ and $C_2$, the charges are shared between them. Consequently, current flows from $C_2$ to $C_1$, which pumps up the voltage on $V_c$. Then, the node starts discharging through the current source $I$ until it reaches the switching point of the inverter $V_{sp}$, resulting in a final charge $Q_f = V_{sp}(C_1 + C_2)$. After that, the inverter pulls up the delayed output voltage $V_{out}$. The time it takes to discharge $V_x$ to the inverter's switching point voltage is referred to as the time delay $t_d$. This time delay, given in Eq. 4, depends on four main parameters: the voltage values of $VDD$ and $V_{in}$, the voltage value of $V_{sp}$, the sizes of the capacitors $C_1$ and $C_2$, and the average current $I_{avg}$ during the discharge. The $V_{sp}$ value is set by the aspect ratio of the PMOS and NMOS transistors of the inverter ($\beta_n / \beta_p$) as given in Eq. 5. The $I_{avg}$ value depends on the amount of charge stored in the capacitors, which varies linearly with $V_{in}$ given that $VDD$ is fixed. Thus, $t_d$ has a linear relationship with $V_{in}$.
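Eqs. 4 and 5 are not reproduced in the text. A charge-balance argument using the quantities defined above ($Q_i$, $Q_f$, $I_{avg}$) suggests the following form for Eq. 4. For Eq. 5, the expression shown is the standard long-channel textbook formula for the switching point of a CMOS inverter, where the threshold voltages $V_{tn}$ and $V_{tp}$ are symbols introduced here as assumptions. Both are sketches consistent with the description, not verbatim copies of the original equations:

```latex
\begin{align}
t_d &= \frac{Q_i - Q_f}{I_{avg}}
     = \frac{V_{in} C_1 + VDD \cdot C_2 - V_{sp}\,(C_1 + C_2)}{I_{avg}} \tag{4} \\
V_{sp} &= \frac{VDD - |V_{tp}| + V_{tn}\sqrt{\beta_n / \beta_p}}
               {1 + \sqrt{\beta_n / \beta_p}} \tag{5}
\end{align}
```

For $V_{in} = VDD$, the numerator of Eq. 4 reduces to $(VDD - V_{sp})(C_1 + C_2)$, which agrees with the initial charge $Q_i = VDD(C_1 + C_2)$ stated above.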
To implement the proposed VTC using CMOS, Fig. 4 shows the detailed circuit diagram. The switches $S_1$ and $S_3$ are replaced by the pass gates ($M_1$, $M_2$) and ($M_5$, $M_6$), respectively. The switches $S_2$ and $S_4$ are replaced by $M_3$ and $M_7$, respectively. The current source is implemented using $M_4$, which is controlled by a bias voltage $V_b$ to operate in the saturation region. The inverter is realized by $M_8$ and $M_9$.
To generate a pulse width signal $V_{pw}$, a digital logic block consisting of an inverter and an AND gate is added. During the sampling phase, when $V_{clk} = 0$ and $V_{clkb} = 1$, $M_3$ is off and $M_7$ is on, so that $C_2$ is charged to $VDD$. The pass gate ($M_1$, $M_2$) turns on to precharge $C_1$ with $V_c = V_{in}$. On the other hand, the pass gate ($M_5$, $M_6$) is off, which disconnects the node $V_x$ from $V_c$ to eliminate the short circuit current on the delay chain at low voltage levels of $V_{in}$. In this phase, $V_x = VDD$, which causes $V_{out} = 0$. During the evaluation phase, when $V_{clk} = 1$ and $V_{clkb} = 0$, the pass gate ($M_5$, $M_6$) and $M_3$ turn on, whereas the pass gate ($M_1$, $M_2$) and $M_7$ turn off. In this phase, $V_c$ is coupled to $V_x$ and the charges redistribute between $C_1$ and $C_2$. Initially, if $V_{in} < VDD$, then $V_c < V_x$. As a result, a current flows from $C_2$ to $C_1$, pumping up the voltage on $V_c$ as shown in Fig. 5 (see the gray waveform when $V_{in} = 0.1$ V). On the other hand, if $V_{in} = VDD$, then $V_c$ follows $V_x$, as shown in Fig. 5 when $V_{in} = 1$ V. In both cases, the capacitors start discharging through $M_4$, so that the capacitor current equals the drain-source current of $M_4$, $I_{ds4}$. This drops the value of $V_x$ until it reaches $V_{sp}$ of the inverter ($M_8$, $M_9$). The inverter then pulls up $V_{out}$, which is connected to an inverter chain whose output $V_{out-b}$ is ANDed with $V_{clk}$ to generate $V_{pw}$. Fig. 5 depicts the waveforms of the proposed VTC. Note that the proposed VTC controls the delayed $V_{out}$ at the rising edge of $V_{clk}$.
The proposed VTC circuit has been designed, implemented, and simulated in 65 nm industry-standard CMOS technology. The input voltage is set between 0 V and 1 V at $VDD = 1$ V. The sizes of the capacitors $C_{1,2}$ and the transistor $M_4$ are selected to support a minimum time delay of 107 ps at the minimum $V_{in}$ of 0 V. Metal-insulator-metal (MIM) capacitors of $C_1 = 27$ fF and $C_2 = 10$ fF are utilized. The $M_4$ size of 500 nm/140 nm, controlled by its gate voltage of $V_b = 0.5$ V, provides a current of 14 µA. The inverter is carefully sized to provide the desired $V_{sp}$. Hence, the aspect ratio of $M_9$ is 5× the aspect ratio of $M_8$ such that $V_{sp} = 0.35$ V. Table 1 summarizes the specifications of the proposed VTC design.
Fig. 7 shows the output time delay $t_{pw}$ from the VTC versus the input voltage observed from the simulation, in addition to the expected values. As depicted in the figure, the time delay is linearly proportional to the input voltage. Note that the VTC is designed to operate in approximate computing architectures for AI applications that are statistical in nature and tolerant to variation and noise [5], [21].
Noise simulation has been carried out to analyze the input-referred noise and the SNR of the VTC at $V_{in} = 1$ V and a frequency of 100 MHz. The input noise and signal power averages are obtained by integrating the noise and signal power spectra over their frequency range. SPICE simulation shows that the averaged input-referred noise and signal are 1.425 µV² and 5.67 V², resulting in an SNR value of 65.9 dB. The jitter of the VTC circuit has been analyzed and simulated and has an rms value of 2.57 ps. The VTC has a low MSE value of $4.15 \times 10^{-23}$ s.
To quantify the impact of mismatch variation on the pulse width value, Monte Carlo SPICE simulation is carried out with 200 samples. Fig. 8 shows the effect of mismatch variations on the time delay obtained from Monte Carlo simulation at $V_{in} = 1$ V.
As depicted in the figure, the standard deviation is low: 0.218 ns around a mean of 2.358 ns at $V_{in} = 1$ V. Hence, the ratio of the standard deviation to the mean is approximately 9%. This variation can be reduced by cascading multiple stages of the VTC circuit, as shown in Table 2. As the number of VTC stages increases, the variation decreases, down to 4.4% for 4 stages. For a 4-stage VTC with 200 samples, 3-sigma variations of 13.2% can be covered, which is equivalent to the 2-sigma variations of a 2-stage VTC.
Table 3 shows the comparison between the proposed design and prior works. Although the proposed VTC circuit has a lower conversion gain, the linearity range across $V_{in}$ is improved by 4× and 5.33× compared to [22] and [23], respectively. Moreover, for IMC applications where the computation can be performed in a few ns, the pulse width of $V_{pw}$ does not need to be large, and hence neither does the conversion gain. A figure of merit (FoM) is developed for the VTC circuit and given in Eq. 6. It indicates the accuracy of the VTC in providing conversion gain per unit power. The VTC's accuracy is 99.7%, and hence the FoM equals 322 µs/(V·W).
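The VTC numbers above can be related to each other with a first-order charge-balance model. The sketch below assumes $t_d = (Q_i - Q_f)/I_{avg}$ with a constant discharge current equal to the reported 14 µA, and it ignores the inverter chain and logic delays, so it is not expected to reproduce the simulated delays (for example, the 107 ps minimum or the 2.358 ns Monte Carlo mean) exactly. The function names are hypothetical.

```python
# First-order VTC model; component values are taken from the text,
# the constant-current assumption and the function names are illustrative.
C1 = 27e-15      # sampling capacitor C_1 (F)
C2 = 10e-15      # capacitor C_2 (F)
VDD = 1.0        # VTC supply voltage (V)
V_SP = 0.35      # inverter switching point (V)
I_AVG = 14e-6    # discharge current through M_4, assumed constant (A)


def ideal_delay(v_in, c1=C1, c2=C2, vdd=VDD, v_sp=V_SP, i_avg=I_AVG):
    """Time to discharge the shared charge down to the switching point.

    Clamped at zero: if the shared voltage already sits below V_sp, this
    simple model predicts no discharge time at all.
    """
    q_initial = v_in * c1 + vdd * c2
    q_final = v_sp * (c1 + c2)
    return max(q_initial - q_final, 0.0) / i_avg


def coefficient_of_variation(sigma, mean):
    """Ratio of standard deviation to mean."""
    return sigma / mean


for v_in in (0.25, 0.5, 0.75, 1.0):
    print(f"V_in={v_in:.2f} V -> t_d ~ {ideal_delay(v_in) * 1e9:.3f} ns")

# Mismatch figures quoted in the text.
cv_single = coefficient_of_variation(0.218e-9, 2.358e-9)
print(f"single-stage sigma/mean ~ {cv_single * 100:.1f}%")

sigma_4_stage = 4.4                    # percent, from the text
three_sigma_4_stage = 3 * sigma_4_stage
print(f"3-sigma for 4 stages ~ {three_sigma_4_stage:.1f}%")
# The text equates this with 2-sigma of the 2-stage VTC, which implies:
print(f"implied 2-stage sigma ~ {three_sigma_4_stage / 2:.1f}%")
```

In this model the delay grows linearly with $V_{in}$ at a slope of $C_1 / I_{avg}$, which mirrors the linear voltage-to-time relationship described in the text.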

III. C3PU CROSSBAR ARCHITECTURE FOR IMC APPLICATIONS
To demonstrate the advantage of the proposed design, a crossbar architecture of the C3PU and its periphery circuit is designed. Computational crossbars naturally realize highly parallel vector-matrix operations and hence efficiently support high throughput with significant savings compared to the digital counterpart. This efficiency is achieved by performing the MAC operation in the same place where the data is stored. Therefore, the $5 \times 4$ C3PU crossbar architecture is proposed, as shown in Fig. 9. The transistor source in each C3PU computational element is connected to the supply voltage $VDD_{C3PU}$. It is assumed that the analog input voltages $V_{in,1-5}$ come directly from the sensors. These inputs are converted into modulated pulse width signals $V_{pw,1-5}$ using 5 separate VTCs (discussed in Section II-B), which removes the need for the ADC used in the traditional design. The signals $V_{pw,1-5}$ represent the wordlines connected to the C3PU computational block to run it in linear mode. Each current produced by a C3PU is controlled by the multiplication of $V_{pw,i}$ and the capacitance ratio $X_{eq,ij}$ ($i$ is the row and $j$ is the column) and is then summed on the shared bitline. The resultant currents $I_{1-4}$ represent the complete MAC calculation of each column.
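Functionally, the crossbar described above computes a vector-matrix product: each column output is the sum over rows of input times stored ratio. The following is a minimal behavioral sketch of that ideal operation, ignoring all circuit effects. The function name and the example ratio matrix are hypothetical and chosen only for illustration.

```python
# Behavioral model of an ideal crossbar MAC; names and values are illustrative.
def crossbar_mac(v_in, x_eq):
    """Return out[j] = sum_i v_in[i] * x_eq[i][j] for every column j."""
    n_rows, n_cols = len(x_eq), len(x_eq[0])
    if len(v_in) != n_rows:
        raise ValueError("need exactly one input per crossbar row")
    return [
        sum(v_in[i] * x_eq[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]


# Five inputs (rows) and four columns, as in the 5 x 4 array.
inputs = [0.3, 0.6, 0.0, 0.1, 1.0]               # example input voltages (V)
ratios = [[0.5, 0.6, 0.7, 0.75] for _ in range(5)]  # assumed X_eq values

outputs = crossbar_mac(inputs, ratios)
for j, value in enumerate(outputs, start=1):
    print(f"column {j}: ideal MAC = {value:.3f}")
```

Each column evaluates five products and four additions, and all columns are evaluated in parallel on the shared wordlines.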
The currents are integrated to generate analog outputs $V_{1-4}$ to drive the actuator. Since the actuator function can be done in the analog domain, this reduces the overhead of going into the digital domain. The operation of the C3PU crossbar, given in Fig. 9, depends on two phases: computation and isolation. In the computation phase, when the clock signal $V_{clk} = 1$, the MAC operation is achieved by multiplying the $V_{pw,i}$ pulse widths with the capacitance ratios $\frac{C_{c,ij}}{C_{c,ij} + C_{b,ij} + C_{g,ij}}$. Then, the transistors transfer this multiplication into a current that is summed on each bitline. The summed currents are integrated over a time $t_1 - t_2$ using a virtual ground current integrator op-amp to provide the outputs as voltage levels $V_{1-4}$, as given in Eq. 7.
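Eq. 7 is not reproduced in the text. For an ideal current integrator with feedback capacitor $C_j$ on column $j$, the description above suggests the following form, where the sign convention of the op-amp stage is ignored and the per-cell current $I_{ds,ij}$ is a symbol introduced here for illustration:

```latex
\begin{equation}
V_j = \frac{1}{C_j} \int_{t_1}^{t_2} I_j(t)\, dt,
\qquad
I_j(t) = \sum_{i=1}^{5} I_{ds,ij}(t),
\qquad j = 1, \dots, 4 \tag{7}
\end{equation}
```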
The value of the output voltages depends on two main parameters: a) the time over which the current is accumulated, $t_1 - t_2$, and b) the capacitor size $C_j$. The time $t_1 - t_2$ is usually fixed and represents the pulse width of the clock. This time is set to be greater than the maximum pulse width of $V_{pw,i}$. The maximum pulse width of $V_{pw}$ is approximately 2 ns when the input voltage is at its maximum of $V_{in} = 1$ V. Thus, the pulse width of the clock is set to 3 ns to ensure the completion of the computation and accumulation of the currents. In addition, the $C_j$ size plays an essential role in determining the scaling factor that is required to allow $V_{1-4}$ to approximately reach the expected output levels. The scaling factor is calculated by dividing the obtained MAC output voltages $V_{1-4}$ by the expected values, and the $C_j$ size is set accordingly. Once the approximate voltages are achieved, the C3PU elements are isolated from the outputs by setting $V_{clk} = 0$ to enter the isolation phase. The isolation phase is essential to allow the proper functioning of the VTC and to initialize the output stage of the virtual ground op-amp. The period $T$, including the computation and isolation time taken to operate the MAC calculations, is 6 ns.
Table 4 shows the specifications of the C3PU crossbar architecture. The value of $C_c$ ranges between 2.5 fF and 8 fF, and the value of $C_b$ is fixed at 2.5 fF. Note that the proposed C3PU design targets hardwired fixed functions for AI applications where the weights are fixed. It can be modified to support applications that require programmable weights using emerging memcapacitors [25], [26]. However, this requires control circuits and a tunable voltage to program the capacitance value, which adds power overhead. The $5 \times 4$ C3PU crossbar shown in Fig. 9 with the specifications in Table 4 is designed and implemented in 65 nm technology. The input voltages are fed to the C3PU crossbar for 30 consecutive clock cycles representing the 30 input sets.
Each cycle has a different set of input voltage levels that are converted into modulated pulse width signals. Fig. 10 shows the input/output time domain waveforms of the $5 \times 4$ C3PU crossbar for two different input sets. The input voltages are validated at the negative clock edge, and the modulated pulse width signals are generated at the positive clock edge. The average computing error in the $5 \times 4$ C3PU crossbar is 5.7%. The error is calculated and averaged over 30 input samples by comparing the observed MAC output from simulation with the expected values. Table 5 shows the error matrix of the C3PU outputs when compared to the expected ones from MATLAB simulation at different input combinations selected from the test set. The energy efficiency of the $5 \times 4$ C3PU crossbar and the 5 VTC blocks is 26.3 fJ/MAC and 40.1 fJ/MAC, respectively, resulting in a total energy efficiency of 66.4 fJ/MAC. Monte Carlo simulation is carried out for the $5 \times 4$ C3PU crossbar. Fig. 11 shows the distribution of the MAC output from column 4 under mismatch variations where the inputs are set to $V_{in1} = 0.3$ V, $V_{in2} = 0.6$ V, $V_{in3} = 0$ V, $V_{in4} = 0.1$ V, and $V_{in5} = 1$ V. The output $V_4$ has a mean value of 0.315 V and a standard deviation $\sigma$ of 33 mV with a 9.5% variation. The minimum $\sigma$ value is 7.3 mV at an output voltage of 0 V, and the maximum $\sigma$ is 77 mV at an output voltage of 0.97 V.
The number of samples used is 200. Transistor size is 500 nm/60 nm.
Each MAC unit/column includes 5 multiplications and 4 additions. To further increase the number of operations, the crossbar array size can be enlarged. Some design constraints need to be considered when increasing the C3PU crossbar size. Adding more rows to the C3PU array increases the accumulated currents, which require a larger capacitor in the integrator circuit to achieve the desired output voltage. For example, every additional 5 rows demand an additional 300 fF capacitor. Therefore, there is a tradeoff between the number of rows and the integrator's capacitor size. The number of columns is also limited, as the line resistance affects the driving signal $V_{pw}$. The resistance of the line connecting the VTCs to the columns increases with the number of columns, and this degrades the pulse width of the $V_{pw}$ signal. Simulation results show that a C3PU crossbar with 32 columns reduces the pulse width of $V_{pw}$ by 10.8%. The maximum number of columns that the C3PU crossbar can support is 46, with a degradation of 13.4% in the pulse width. Another option to accommodate large MAC operations is to duplicate the C3PUs, similar to memory arrays. For example, multiple C3PU arrays can be placed to increase the number of columns and rows, where a repeater can be used instead of the VTC to generate the pulse width signal.
To provide a baseline for the proposed $5 \times 4$ C3PU crossbar, $5 \times 4$ fixed-point (FXP) crossbar units have been implemented using an ASIC design flow in 65 nm CMOS. Table 6 shows the performance of the 3 × 3-bit, 4 × 4-bit, 8 × 4-bit, and 8 × 8-bit FXP crossbars compared to the $5 \times 4$ C3PU crossbar. The error of the FXP MAC unit is calculated by comparing the observed output from the RTL simulation for each column in the crossbar with the expected ones from MATLAB simulation. The resultant error values are then averaged over 30 input sets. The average error of the C3PU, 5.6%, is comparable to the error percentage produced by the 8 × 4-bit MAC unit, 6.52%. Furthermore, the MSE values of the C3PU and 8 × 4-bit MAC crossbars are almost equal, at 0.082 and 0.099, respectively. The advantage of the C3PU is its lower energy and area consumption, by 3.4× and 3.6×, respectively, compared with the 8 × 4-bit MAC unit.
Table 7 compares the prior and proposed work. The proposed C3PU utilizes an AMS circuit to perform analog MAC with two analog inputs, whereas the works in [6] and [27] use an AMS circuit to conduct binary MAC with 1-bit × 1-bit inputs. Comparing the C3PU with its equivalent digital baseline (8-bit × 4-bit) in terms of accuracy, the energy efficiency is improved by 3.4×. Table 8 shows the comparison between prior analog MAC computing methods and the proposed work. The proposed work shows a higher energy efficiency of 158 TOPS/W compared with other analog MAC computing works. The reason behind the high TOPS/W is that the power consumption is low and the number of operations per second is high. Even though the number of operations per second in [16] is higher than in our work, our power consumption is lower, which results in a higher energy efficiency. The high energy efficiency of our design is a direct result of the implementation of a simple computation cell that consists of one transistor and two capacitors.

IV. C3PU DEMONSTRATOR FOR ANN APPLICATIONS
The advantage of the C3PU is demonstrated by accelerating the MAC operations found in an ANN using the iris data set [28]. The data set consists of 150 samples divided equally between the three classes of the iris flower, namely, Setosa, Versicolour, and Virginica. Each sample holds the following features, all in cm: sepal length, sepal width, petal length, and petal width. The architecture of the ANN consists of two layers: four nodes for the input layer, each representing one of the input features, followed by three hidden neurons, and lastly, three output neurons, one for each class. To implement the MAC operations in the ANN, the iris features are considered as the first operands, which are mapped into voltage values, and the weights are considered as the second operands, which are stored as capacitance ratios in the capacitive unit of the C3PU. A simple linear mapping algorithm is used between the neural weights and the capacitance ratios [13].
The training phase is performed offline using MATLAB by dividing the data set into 80% for training and 20% for testing. Post-training weights can have values with both positive and negative polarities. Hence, before mapping these weights into capacitance ratio values, they need to be shifted by the minimum weight value $w_{min}$. After performing the multiplication between the inputs and the shifted weights, the effect of the shifting operation must be removed by subtracting the following term from each MAC output: $|w_{min}| \times \sum_{i=1}^{n} IN_i$, where $IN_i$ is the $i$-th input to the hidden/output layer and $n$ is the number of input/hidden nodes. Mapping this operation into the C3PU architecture requires adding one column to the hidden and output crossbars to store the $w_{min}$ value in each layer.
Fig. 12 depicts the algorithm flow of the ANN classifier for the iris data set. It is a feedforward ANN, also called a multilayer perceptron [29]. It has two operational phases: phase 1 and phase 2. In phase 1, when $V_{clk} = 1$ and $\sim V_{clk-d} = 0$, the inputs are processed in the first layer. In phase 2, when $V_{clk} = 0$ and $\sim V_{clk-d} = 1$, the outputs from the first layer are taken and processed in the second layer to generate the required output iris classes. In phase 1, the four input features are mapped into four voltage levels $V_{in1-4}$. These voltages are then converted into four pulse width modulated signals $V_{pw1-4}$ using the four VTC blocks discussed in Section II-B. The bias voltage $V_{bias}$ is added as an input to better fit the ANN model and is also converted into a pulse width modulated signal $V_{pw5}$. The signals $V_{pw1-5}$, the first operands, are connected to the $5 \times 4$ C3PU weight matrix as explained previously in Fig. 9. The weights, the second operands, are stored as equivalent capacitance ratios $X_{eq}$ in the C3PU. The output voltages $V_{1-4}$ from the current integrators used at the end of each column in the C3PU weight matrix act as inputs to the second layer.
The current integrator inherently takes care of the scaling factor, which is determined by the ratio between the shifted output values from the neural network and the output from the C3PU. This is important to compensate for the mapping between the values.
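The weight-shifting step described above can be illustrated with a small numerical sketch. It assumes the shift is $w' = w - w_{min}$ with $w_{min} < 0$, and it models the extra crossbar column as a column whose stored value is $|w_{min}|$ in every row, so that its output equals the correction term $|w_{min}| \sum_i IN_i$. The function names and the example weights are hypothetical, and the mapping from shifted weights to capacitance ratios is omitted.

```python
# Illustrative weight shifting with an offset column; names are hypothetical.
def mac(inputs, weights):
    """out[j] = sum_i inputs[i] * weights[i][j]."""
    n_cols = len(weights[0])
    return [
        sum(x * row[j] for x, row in zip(inputs, weights))
        for j in range(n_cols)
    ]


def shift_with_offset_column(weights):
    """Shift all weights by w_min and append a column that stores -w_min."""
    w_min = min(min(row) for row in weights)
    shifted = [[w - w_min for w in row] + [-w_min] for row in weights]
    return shifted, w_min


def remove_shift(outputs):
    """Subtract the offset column (last output) from every other output."""
    correction = outputs[-1]
    return [value - correction for value in outputs[:-1]]


# Two inputs, two neurons, signed weights (integers keep the check exact).
weights = [[1, -2],
           [0, 3]]
inputs = [2, 1]

reference = mac(inputs, weights)                     # signed MAC, for comparison
shifted, w_min = shift_with_offset_column(weights)   # all entries are >= 0
recovered = remove_shift(mac(inputs, shifted))

print("shifted weights:", shifted)
print("reference:", reference, "recovered:", recovered)
```

Because every shifted entry is non-negative, it can be stored as a physical ratio, and subtracting the offset column recovers the signed result, which corresponds to subtracting the additional column output from the other column outputs.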
Once $V_{1-4}$ are generated, the classifier switches to phase 2 to pass them to the second layer. Before that, the impact of the shift operation applied to the weights needs to be removed by subtracting $V_4$ from $V_{1-3}$. Then, the subtracted outputs are passed through the ReLU activation function. In the proposed ANN classifier, the subtraction operation and the ReLU function are implemented in the time domain. To achieve this, $V_{1-4}$ are first converted to pulse width modulated signals using VTCs and then passed to the time domain subtractor and ReLU activation function to generate $V_{o-pw1-3}$. These output signals may have small pulse widths due to the subtraction operation, which do not correspond to the expected subtraction outputs. Therefore, the pulse widths of $V_{o-pw1-3}$ are scaled by a constant factor that depends on the expected subtraction output from the ANN using MATLAB and the observed outcomes from the ANN using the C3PU. After that, the scaled pulse width signals $V_{o-pw1-3-s}$ are fed to the $4 \times 4$ C3PU weight matrix. The output voltages from the weight matrix, $V_{o1-4}$, are passed to the subtractor and then to the softmax function to generate the proper class based on the input features.
Fig. 13 shows the detailed circuit implementation of the time domain subtractor, the ReLU activation function, and the delay element. Since $V_4$ is subtracted from the three variables $V_{1-3}$, each subtraction requires a separate digital circuit. The subtraction output can have a positive or a negative value. The ReLU activation function passes the positive value while setting the negative value to zero. This implementation is developed using AND, XOR, and inverter gates, as highlighted in the brown block in Fig. 13. To detect the difference between the two pulse widths, an XOR gate is utilized, which provides the subtraction output $a_{1-3}$.
To determine the sign of the subtraction, V 4−pw4 is inverted and then ANDed with V (1−3)−pw(1−3) to generate a signal b 1−3 . When b 1−3 = 1, the corresponding subtraction output is positive, whereas when b 1−3 = 0, it is negative. Finally, an AND gate passes the positive subtraction output as V o−pw1−3 while setting the negative subtraction output to zero. Figure 14 shows example output waveforms of the subtraction and ReLU function for V 1 > V 4 and V 1 < V 4 . As depicted in the figure, when V 1 > V 4 , the modulated pulse width of V 1−pw1 is greater than the pulse width of V 4−pw4 . The subtraction output is therefore positive and is passed with V o−pw1 = 1, having a pulse width T o−pw1 equal to the difference between the pulse widths of V 1−pw1 and V 4−pw4 . On the other hand, when V 1 < V 4 , the subtraction difference is negative (b 1 = 0), resulting in V o−pw1 = 0. Note that when the pulse width of a positive subtraction output is very narrow, it is rounded to zero, and the signal V o−pw1 disappears. This is referred to as quantization, which is widely used in the digital domain to increase computing energy efficiency while maintaining acceptable accuracy. The quantization in the time domain may affect the MAC outputs of the second C3PU crossbar. However, since the computation is employed for AI applications, relative results are sufficient for classification.
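The gate-level behavior of the time-domain subtractor and ReLU described above can be illustrated with a short behavioral sketch. It is not the authors' circuit: it assumes idealized binary waveforms sampled on a uniform time grid, with both pulses sharing a common rising edge at t = 0. The names `pulse`, `subtract_relu`, `quantize`, and `pulse_width_ps`, the 1 ps time step, the 3 ns window, and the 20 ps quantization threshold are all hypothetical choices made for illustration; the text does not specify the width below which a pulse disappears.

```python
DT_PS = 1            # hypothetical simulation time step, in picoseconds
WINDOW_PS = 3000     # hypothetical observation window, in picoseconds
MIN_WIDTH_PS = 20    # hypothetical width below which a pulse is rounded to zero


def pulse(width_ps):
    """Ideal pulse that rises at t = 0 and stays high for width_ps."""
    return [t < width_ps for t in range(0, WINDOW_PS, DT_PS)]


def subtract_relu(v_x_pw, v_4_pw):
    """Time-domain subtraction followed by ReLU, built from XOR, inverter, AND."""
    out = []
    for x, ref in zip(v_x_pw, v_4_pw):
        a = x != ref          # XOR: high wherever the two pulses differ
        b = x and not ref     # inverter + AND: high only where x outlasts V4
        out.append(a and b)   # final AND: pass the positive difference only
    return out


def pulse_width_ps(waveform):
    """Total high time of a sampled waveform, in picoseconds."""
    return sum(waveform) * DT_PS


def quantize(waveform, min_width_ps=MIN_WIDTH_PS):
    """Model very narrow pulses disappearing (assumed threshold)."""
    if pulse_width_ps(waveform) < min_width_ps:
        return [False] * len(waveform)
    return waveform


# V1 > V4: the output pulse width equals the difference of the two widths.
print(pulse_width_ps(subtract_relu(pulse(1200), pulse(700))))
# V1 < V4: the sign signal b stays low, so the output stays low.
print(pulse_width_ps(subtract_relu(pulse(700), pulse(1200))))
# Slightly positive difference: rounded to zero by the assumed threshold.
print(pulse_width_ps(quantize(subtract_relu(pulse(710), pulse(700)))))
```

With these hypothetical widths, the first case yields a 500 ps output pulse, while the second and third yield no output pulse, mirroring the V 1 > V 4 , V 1 < V 4 , and narrow-pulse cases discussed in the text.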
After that, the pulse width T o−pw1 of the signal V o−pw1 is scaled by a factor of approximately 18×, chosen by comparing the expected and observed subtraction outputs. Such a large factor cannot be implemented using inverter delays. Consequently, a VTC circuit is used as a delay element to scale the pulse width of V o−pw1 by 18×. To achieve this scale, the capacitor values in the VTC are adjusted (C 1 = 50 fF and C 2 = 2 fF), and the input voltage is set to the supply voltage. The inverted subtraction output ∼V o−pw1 serves as the clock of the VTC. Depending on its pulse width, the capacitors C 1 and C 2 (as discussed in Section II-B) are charged to a specific voltage level in the sampling phase. The wider the pulse of ∼V o−pw1 , the higher the voltage level across the capacitors and the longer they take to discharge through a current source in the evaluation phase. This means that the delay of the VTC's output V o−pw1−s is proportional to the pulse width of V o−pw1 .
The ANN classifier has been designed and simulated in 65 nm CMOS technology with a supply voltage of 1 V, except for the 5 × 4 and 4 × 4 weight matrices, which operate at a supply voltage of 0.3 V. The input voltages V in1−4 have a range of 0 V to 1 V, in addition to V bias = 1 V. The five input voltages are converted into modulated pulse width signals V pw1−5 that have pulse widths in the range of 165 ps to 2 ns. The modulated pulse width input signals V o1−4 of the second weight matrix have pulse widths in the range of 1.6 ns to 7.5 ns. The pulse width T 1 of V clk is set to 3 ns, and the pulse width T 2 of ∼V clk−d is set to 9 ns. The proposed ANN classifier using C3PU shown in Fig. 12 achieves an inference accuracy of 90%, whereas the ideal implementation of the ANN classifier in MATLAB has an inference accuracy of 96.67%. A 5% variation in the supply voltage reduces the inference accuracy by 3%.
The supply voltage variation affects the width of V pw , as it is a strong function of the current source. One way to reduce this variation is to replace the current source with a current mirror that is more robust and less sensitive to variation. Monte Carlo simulation has been carried out to study the effect of mismatch variations on the inference accuracy. Although the MAC output values from the C3PU crossbars change slightly, the inference accuracy remains 90%. This is because the classification does not depend on the exact MAC outputs but rather on their relative values.
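The observation that classification depends on relative rather than exact MAC outputs can be illustrated with a small sketch. The output values below are hypothetical and are not taken from the simulations reported here; `softmax` and `predicted_class` are illustrative helper names. The sketch only shows that a small perturbation of the outputs leaves the selected class unchanged as long as their ordering is preserved.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(values):
    """Index of the largest softmax probability."""
    probs = softmax(values)
    return probs.index(max(probs))


# Hypothetical output-layer values for the three iris classes.
nominal = [0.21, 0.34, 0.27]
# Same outputs with a small, hypothetical mismatch-induced perturbation.
perturbed = [0.20, 0.36, 0.26]

print(predicted_class(nominal), predicted_class(perturbed))
```

Both calls select the same class because softmax is monotonic: the largest input always receives the largest probability, so only a perturbation large enough to reorder the outputs can change the decision.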

V. CONCLUSION
This paper presented an analog-mixed signal MAC unit based on a cross-coupling capacitor implementation, named C3PU. The advantage of using a cross-coupling capacitor as both storage and processing element is that it simultaneously provides high-density and low-energy storage. One operand in the C3PU is stored in the capacitive unit, while the second operand is a pulse-width-modulated signal generated by a voltage-to-time converter. The multiplication outputs are transferred to an output current using CMOS transistors and then integrated using the current integrator op-amp. The 5 × 4 C3PU was developed to process all data simultaneously, realizing fully parallel vector-matrix multiplication in one cycle. The energy consumption of the 5 × 4 C3PU is 66.4 fJ/MAC at a 0.3 V supply voltage with an error of 5.7% in 65 nm technology. The inference accuracy of the ANN architecture has been evaluated using the proposed C3PU on an iris flower data set, achieving a 90% classification accuracy.