An Energy-Efficient and High Throughput in-Memory Computing Bit-Cell With Excellent Robustness Under Process Variations for Binary Neural Network

In-memory computing (IMC) is a promising approach for reducing the energy cost of data movement between memory and processor when running data-intensive deep learning applications. Together with Binary Neural Networks (BNNs), IMC provides a viable solution for running deep neural networks on edge devices with stringent memory and energy constraints. In this paper, we propose a novel 10T bit-cell with a back-end-of-line (BEOL) metal-oxide-metal (MOM) capacitor laid on pitch for in-memory computing. Our IMC bit-cell, when arranged in a memory array, performs binary convolution (XNOR followed by bit-count operations) and binary activation generation. We show that, when the binary layers of a BNN are mapped onto our IMC arrays for MNIST digit classification, 98.75% accuracy with an energy efficiency of 2193 TOPS/W and a throughput of 22857 GOPS can be obtained. We determine the memory array size considering word-line and bit-line nonidealities and show how these impact classification accuracy. We analyze the impact of process variations on classification accuracy and show how the word-line pulse tunability provided by our design can be used to improve the robustness of classification under process variations.


I. INTRODUCTION
The recent success of deep learning [1] in various domain-specific tasks involving object recognition, classification and decision making [2] is largely driven by architectural innovations in neural networks, commonly known as Deep Convolutional Neural Networks (DCNNs) [3], [4]. However, the high performance of DCNNs comes with huge memory requirements for storing the network parameters, which constrains the deployment of such networks on edge devices for mobile applications. Moreover, data-intensive deep learning applications, when implemented on a conventional von Neumann hardware architecture, require frequent memory access and data movement between memory and computation units, which can outweigh the energy cost of computation [5] and add large energy overhead.
Binary Neural Networks (BNNs) [6] provide a hardware-friendly solution to these problems without sacrificing the state-of-the-art DCNN accuracy. BNN and its variants (e.g. XNOR-Net [7]) reduce the precision of the network weights to a single bit. Such extreme quantization opens up the prospect of storing BNN weights fully in on-chip memory and reduces the energy consumption of off-chip data access. Moreover, the energy-intensive multiply-and-accumulate (MAC) operations of DCNNs are reduced to simple bit-wise XNOR and subsequent pop-count operations in BNNs, which offers large computational energy savings. Nevertheless, even BNNs, the memory- and energy-efficient variant of DCNNs, consume orders of magnitude more energy than the human brain for similar recognition or classification tasks [8]. One of the prime reasons for this gap is that in a von Neumann architecture, unlike in the human brain, memory and computation units are physically separated, and data movement between these two units incurs a large energy cost.
To cut down the energy cost of data movement, the concept of In-Memory Computing (IMC) [9]–[27] has been proposed and applied to accelerate the computations in neural networks, including BNNs. Although most of these prior works showed potential for high energy efficiency, design concerns such as computational reliability degradation with increasing simultaneous memory row access [11] exist. Modern high-performing neural networks usually have large filter sizes in different layers. Therefore, convolution operations (XNORs and pop-counts in BNNs) involving those filters would require accessing a large number of data rows in the memory concurrently. Hence, the issue of computational reliability due to signal-to-noise ratio (SNR) degradation needs to be solved in order to achieve high computational throughput. Additionally, although IMC designs such as local bit-line grouping [14] and word-line underdrive [11] have been proposed in the literature to address the data corruption issue, in general, 6T-based IMC designs are prone to memory disturb [11]–[15].
To increase the computational scalability of in-memory computing, Valavi et al. [18] proposed charge-domain computation based on a linear and stable metal-oxide-metal (MOM) finger capacitor. They augmented a standard 6T SRAM bit-cell with two PMOS transistors to conditionally charge the MOM capacitor laid over each bit-cell. Although accessing multiple memory rows of such an IMC bit-cell configuration improves computational SNR, [18] reports a relatively long time of 250 ns for one convolution operation involving the charge, discharge and reset phases of the MOM capacitor.
In an effort to reduce the time per convolution operation, increase energy efficiency and at the same time ensure the scalability of in-memory computation, we propose a novel IMC bit-cell configuration: a 10T SRAM bit-cell augmented with a BEOL MOM capacitor on pitch. Additionally, the use of NFET read-stack pairs, which decouple the read path from the 6T storage unit, enables a large voltage swing on the read bit-line to increase the signal margin without triggering the read-disturb issues commonly observed in 6T bit-cell based designs [13], [15]. Unlike Valavi et al. [18], we precharge all of the MOM capacitors of a memory column at the beginning of each operation and then conditionally discharge them based on the XNOR conditions. We show that our in-memory computation scheme, involving discharge, compare and precharge phases, takes ∼70x less time per convolution operation. Moreover, in our scheme, we have eliminated the need for area- and energy-hungry ADCs [19] and Multi-Level Sense Amplifiers (MLSAs) [20] for the pop-counting operation. Rather, in the compare phase, we use a comparator at the end of each memory column to compare the bit-line voltage after the discharge phase (XNORs in IMC bit-cells) against a predefined threshold voltage and generate binary activation outputs for the next layer. Furthermore, we show that the read word-line voltage amplitude and pulse-width tunability provided by our design enables us to control the average energy consumption per convolution operation, thus providing a knob to increase energy efficiency. We also show the usefulness of this on-silicon tunability feature in improving network classification accuracy in the presence of process variation. Overall, the main contributions of our work are as follows:
• We propose a novel IMC bit-cell, with a 10T SRAM bit-cell and a BEOL MOM capacitor laid on-pitch, that can perform binary convolution operations when organized in a memory array with a comparator per column.
• We show that 98.75% classification accuracy on the MNIST dataset with high energy efficiency and throughput can be obtained when a trained BNN is mapped onto our IMC hardware for data processing.
• We analyze the impact of word-line and bit-line nonidealities on classification accuracy and determine the optimum memory sub-array size.
• We perform an analysis of read bit-line discharge voltage variability in the presence of process variations by Monte Carlo simulations using the GLOBALFOUNDRIES 22FDX® PDK. We show a tight classification accuracy distribution under process variations. We also show how on-silicon tunability can be used to further reduce the impact of process variations.
The rest of the paper is organized as follows: Section II describes the operation of our IMC bit-cell and how we perform binary convolution and activation generation when these cells are organized in a memory array. Section III shows the impact of nonidealities in determining the memory array size and reports the BNN classification accuracy on the MNIST dataset when implemented on our IMC bit-cell arrays. In Section IV, we analyze the impact of process variations on the classification accuracy and show how on-silicon tunability can be useful in mitigating these impacts. We discuss and compare our results with state-of-the-art implementations in Section V, and in Section VI we conclude the paper.

II. IMC BIT-CELL AND BINARY CONVOLUTION

A. IMC BIT-CELL CIRCUIT AND XNOR OPERATION
A schematic of our proposed IMC bit-cell is shown in Fig. 1. In our design, we use a conventional 6T SRAM bit-cell for data storage. The complementary storage nodes of the bit-cell are denoted W and Wb. We augment the design with two read-ports (read-port 1 with NMOS M1–M2, read-port 2 with NMOS M3–M4) to form the 10T bit-cell structure, which provides read stability. However, unlike a conventional 10T SRAM bit-cell, where the read-ports are differentially connected, in our design we tie them to the same read bit-line (RBL). Additionally, without area penalty, a BEOL MOM capacitor is laid above each bit-cell on-pitch and tied to the RBL. The MOM capacitor is shared among all cells on a column, and its capacitance scales with the column height. The use of a large MOM capacitor on the RBL is an intentional design choice to limit the discharge rate to a range that can be resolved within a reasonable RWL pulse width. As we increase the number of bits per bit-line, the average number of bits discharging the bit-line capacitance increases proportionally, with no performance degradation. We designed the bit-cell with the GLOBALFOUNDRIES 22FDX® PDK; it takes ∼0.25 µm² of area and has ∼1.2 fF of MOM capacitance. The cell layout is shown in Fig. 2(a) for the active and poly layers, where the additional read-ports needed for IMC operation double the cell height compared to the base 6T cell. Fig. 2(b) shows how the MOM back-end capacitor, which uses metals 6 to 8, is placed above the SRAM WL metal layers. Our 10T cell derives from an area-efficient 6T ''slim cell'' layout [28] using foundry pushed rules. We are able to add the four read-port NFETs to the base 6T layout in a more compact fashion than is possible in the 8T design of Valavi et al. [18], where adding PFETs would entail a well-spacing constraint. Moreover, the simple operating principle of our design reduces the peripheral logic, such as the shorting switches needed in the accumulation phase in [18].
The write operation into our IMC cell is similar to the conventional 6T write operation through the write ports (WWL, BL, BLB). For reading, RBL is first precharged to VDD, then the complementary read word-line inputs (RWL, RWLB) are activated. Depending on the stored value on the storage nodes (W, Wb) and the read word-line inputs, the bit-line voltage will either discharge to ground through one of the read-ports or no discharge will happen. The table in Fig. 1 shows the discharge conditions for various combinations of RWL inputs and stored weight values (W), where an RBL discharge is defined as logic '1' and no discharge as logic '0'. For RWL of '0' (RWLB of '1') and W of '0' (Wb of '1'), the M3–M4 read-port conducts and discharges RBL; likewise, for RWL of '1' and W of '1', the M1–M2 read-port discharges RBL, while the mismatch cases leave RBL undisturbed. Thus, a read operation, through the read-ports sharing an RBL column, in our design is functionally equivalent to a bitwise XNOR operation. The XNOR operation in our IMC-cell design largely depends on the coherent discharging of the read bit-line (RBL). Instead of relying on the parasitic capacitance of RBL, we attach a MOM capacitor (C_MOM) to RBL, which serves as a controlled discharge capacitor. Moreover, C_MOM provides an extra knob in the IMC-cell design to control the discharge rate and hence the performance of the IMC-cell.
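The per-cell discharge condition in the table of Fig. 1 can be summarized as a minimal logic-level sketch (a behavioral model only; the port-to-transistor mapping follows the description above):

```python
def rbl_discharges(rwl: int, w: int) -> int:
    """Logic-level model of the per-cell discharge condition (Fig. 1).

    A discharge path exists when the input matches the stored weight:
      read-port 1 (M1-M2) conducts when RWL = 1 and W = 1,
      read-port 2 (M3-M4) conducts when RWLB = 1 and Wb = 1
        (i.e. RWL = 0 and W = 0).
    Discharge is logic '1', no discharge is logic '0', so the read
    is a bitwise XNOR of the input and the stored weight.
    """
    rwlb, wb = 1 - rwl, 1 - w
    port1 = rwl & w       # M1-M2 path
    port2 = rwlb & wb     # M3-M4 path
    return port1 | port2  # equals XNOR(rwl, w)

# Exhaustive check against XNOR
for rwl in (0, 1):
    for w in (0, 1):
        assert rbl_discharges(rwl, w) == (1 if rwl == w else 0)
```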

B. BINARY CONVOLUTION IN IMC ARRAY
In a neural network, convolution is defined by the dot product between input patches and filters. In a BNN, as the operands of the dot products are binary (+1/−1), the convolution operation reduces to binary convolution, which is XNOR followed by bit-counting of the XNOR'ed outputs [6], [7]. As our IMC bit-cell can perform the XNOR operation, it can be readily used to perform binary convolution. To do so, IMC cells are arranged in a memory array, and the weights of each filter of the BNN are mapped (written) into each column. One such memory column is shown in Fig. 3(a), where input patches are mapped as read word-line (RWL) voltage pulses. A collection of these columns produces the output feature map of a BNN layer, with each column performing the convolution and binarization operations. Notably, binary levels of +1 and −1 are mapped as logic 'High' and 'Low' in our design.
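The reduction from a +1/−1 dot product to XNOR plus bit-counting can be verified with a short sketch (NumPy; the 2·matches − n identity is standard for BNNs, and the vectors here are arbitrary illustrations):

```python
import numpy as np

def binary_dot(x_pm1, w_pm1):
    """Reference +1/-1 dot product (binary convolution per output)."""
    return int(np.dot(x_pm1, w_pm1))

def xnor_bitcount(x_bits, w_bits):
    """Hardware-style equivalent with +1 -> logic '1', -1 -> logic '0':
    XNOR counts the matching positions, and the signed bit-count is
    matches - mismatches = 2 * matches - n."""
    matches = int(np.sum(x_bits == w_bits))
    return 2 * matches - len(x_bits)

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=64)
w = rng.choice([-1, 1], size=64)
# The two formulations agree for any +1/-1 vectors
assert binary_dot(x, w) == xnor_bitcount(x > 0, w > 0)
```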
In our method, a binary convolution operation has three phases (shown in Fig. 3(b)): Discharge, Compare and Precharge. At the start of the computation, RBL is charged to VDD by turning on the precharge PMOS, M_pch. During this phase, all the read word-lines (both RWL and RWLB) are kept at logic 'Low', thus cutting off all the discharge paths. Next, in the discharge phase, M_pch is turned off and complementary read word-line pulses are set according to the binary input activations. The read word-line pulses trigger discharge conditions in different IMC cells in the column (as shown by the green arrow in Fig. 3(a)), and the RBL keeps discharging (shown in Fig. 3(b)) until the RWL pulses are turned off, marking the end of the discharge period. The final RBL voltage (V_RBL) after this period reflects the dot product of the input (I) and weight (W) vectors. In the compare phase, we use a comparator circuit to compare V_RBL with a reference voltage (V_REF). Here, we explain the operations in the compare phase for a uniform V_REF. We set the comparator reference voltage at the RBL voltage value when exactly 50% of the bit-cells in a column are discharging (which is equivalent to a bit-count value of zero). We do so because, with the way we map the XNOR computation into our IMC cell, a more-than-50% cell discharge condition translates to a positive bit-count value, while a less-than-50% discharge condition maps to a negative bit-count value. Since binary activation generation for the next layer is simply done by taking the sign of the bit-count value, in the compare phase our comparator simply senses the RBL voltage level and checks whether it is above or below the 50% discharge condition (V_REF). In the more (less) than 50% discharge condition, the RBL voltage level will be lower (higher) than V_REF and the comparator produces logic 'High' ('Low').
Therefore, we can leverage our method of in-memory computing to eliminate the need for ADCs or MLSAs for efficiently computing binary activations from the XNOR'ed outputs of the IMC cells.
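The three phases can be captured in an idealized column model (a sketch only: the discharge is assumed linear, and the per-cell voltage step DV_CELL is an assumed constant chosen so that the 50% condition lands exactly at V_REF = VDD/2):

```python
import numpy as np

VDD = 0.8      # supply voltage (V), as used in Section III
N_CELLS = 800  # IMC cells per column (FC 1 mapping)
# Assumed per-cell RBL drop; calibrated so a 50% match leaves V_RBL at VDD/2
DV_CELL = (VDD / 2) / (N_CELLS / 2)

def column_convolve(inputs, weights, v_ref=VDD / 2):
    """One binary convolution cycle on a column.

    Discharge: cells whose input matches the stored weight (XNOR = 1)
    pull RBL down; Compare: the comparator emits the binarized
    activation; Precharge happens before the next call.
    Returns +1 if the signed bit-count is positive, else -1.
    """
    matches = int(np.sum(np.asarray(inputs) == np.asarray(weights)))
    v_rbl = VDD - matches * DV_CELL  # idealized linear discharge
    # >50% matches discharges RBL below V_REF -> positive bit-count
    return +1 if v_rbl < v_ref else -1
```

For instance, an all-match column discharges RBL fully and yields +1, while an all-mismatch column leaves RBL near VDD and yields −1; the exact-50% case maps to −1 under this strict-inequality convention.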
The word-line pulse amplitude, V_RWL, and pulse width, T_PW, are chosen to meet certain design specifications. One such case is the selection of the comparator reference voltage level, V_REF. Fig. 3(b) shows the 50% RBL discharge condition for different V_RWL amplitudes. The solid line indicates the V_RWL and T_PW conditions found for the 50% RBL voltage level, hence V_REF, to be at V_DD/2. If V_RWL is increased (decreased) keeping T_PW fixed, the 50% RBL discharge level will go down (up) due to faster (slower) RBL discharge, as indicated by the dashed blue (red) line. Thus, the tunability of V_RWL and T_PW provides flexibility in the selection of the V_REF level.
Finally, after the binary convolution and activation generation are finished, a precharge phase is initiated for the next convolution cycle by turning on M_pch while turning off all the RWLs, restoring the RBL voltage level to V_DD. The energy required for this restoration is the dominant dynamic energy consumption during a binary convolution operation. Moreover, with the way we perform binary convolution, we expect 50% RBL voltage discharge on average. From the energy-efficiency point of view, therefore, the choice of the V_REF level is important, as we expect to restore V_DD − V_REF on the RBL in a typical convolution cycle. This, as we will see in Section IV, provides a knob for energy-efficiency/accuracy tuning.

III. MNIST CLASSIFICATION UNDER IMC ARRAY NONIDEALITIES
A. SIMULATION FRAMEWORK AND NETWORK STRUCTURE
We built an in-house simulation framework that trains the BNN on the MNIST [29] dataset with Python and PyTorch [30] libraries. Within the framework, we then map the binary network weights and activations onto the IMC array with the comparator and simulate the corresponding circuit in HSPICE using the GLOBALFOUNDRIES 22FDX® technology models. In this work, we used a post-layout extracted netlist, which gives the most accurate results obtainable by simulation. Ultimately, we extract the accuracy, energy and timing results using the combined Python-HSPICE simulation framework. Table 1 shows the BNN network structure that we have used for our simulation. This is a simplified LeNet architecture [31] which consists of two convolution (Conv.) layers and two fully connected (FC) layers. The first and last layers are full-precision layers [7] and we process them in software (Python). The Conv. 2 and FC 1 layers are binary layers and, notably, contain almost 99% of the network's parameters. Therefore, performing in-memory computation in these layers provides large energy savings and reduces computation time. A full-software implementation of the full-precision version of our BNN architecture achieves 99.21% classification accuracy, whereas its binarized version (with binarized Conv. 2 and FC 1 layers) achieves 99.03% accuracy in software, a minor degradation. We train and test the binary network for MNIST classification without any batch normalization [32] layers and without any bias units in the binary layers. This enables us to use a uniform voltage threshold (i.e. 50% RBL discharge) for all the neurons (comparators) for binarization. For the MNIST dataset, this design choice did not lead to any accuracy degradation while it reduced the complexity of the peripheral circuits.

B. IMC ARRAY SIZING AND IMPACT OF NONIDEALITIES
Mapping the binary Conv. 2 layer weights requires a memory array of size 500 × 50, and the binary FC 1 layer requires 800 × 500. Since the memory array size impacts the throughput of the design, determining the size of the memory sub-array is critical. In our design, we consider two main non-ideal factors that limit the size of the memory sub-array: word-line nonideality and bit-line nonideality. In modeling these nonidealities, we consider the word-line and bit-line parasitic resistance and capacitance, calibrated with the GLOBALFOUNDRIES 22FDX® technology.
In our implementation, we use V_DD = 0.8 V. We tune and set the word-line pulses at V_RWL = 0.37 V and the duration of the discharge phase at T_PW = 1.2 ns to set the 50% discharge condition, hence V_REF, to V_DD/2 (0.4 V). Fig. 3(a) implies that the same word-line pulses are shared by the IMC cells in the same row of the memory array. When parasitic word-line resistance and capacitance are considered, there will be an RC delay of the RWL pulses from the near-end IMC cells to the far-end IMC cells along the memory row. This effectively means that the far-end IMC cells get less discharge time than the near-end cells, since the RWL pulse is cut off for all the IMC cells in a row after a constant time, T_PW. As a result, for the same discharge condition, the RBL voltage at the far-end column of the same row will be higher than at the near-end column (the ideal discharge level). The RBL voltage difference between the near-end and far-end columns for the 50% discharge condition is plotted in Fig. 4(a). For example, if 100 columns are used, for the 50% discharge condition, RBL 1 will be 400 mV while RBL 100 is found to be 412.2 mV, a difference of 12.2 mV. From the plot, we also observe that this difference keeps increasing as the number of columns is increased. Since V_REF is the same for all the end-of-column comparators, this result indicates that as the number of columns increases, positive bit-count values are more likely to be interpreted as negative by the comparators at the far-end columns. Fig. 4(b) shows the impact of such nonidealities on MNIST classification accuracy. From the plot we observe that if the number of columns is increased beyond a certain point (e.g. 150), classification accuracy degrades drastically. Although a high number of columns is desired for high throughput, we select a memory sub-array with 100 columns to attain high classification accuracy (98.75%).
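The far-end effect can be illustrated with a rough behavioral sketch (not the calibrated 22FDX extraction; the per-column RWL delay TAU_COL is an assumed value, and the discharge is approximated as linear):

```python
VDD, T_PW = 0.8, 1.2e-9  # supply (V) and nominal RWL pulse width (s)
TAU_COL = 0.1e-12        # assumed incremental RWL RC delay per column (s)

def v_rbl_at_column(col, frac_discharging=0.5):
    """RBL voltage at column `col` for a given fraction of discharging
    cells: the delayed RWL pulse shortens the effective discharge window,
    so far-end cells discharge less (linear-discharge approximation,
    calibrated so the ideal 50% condition sits at VDD/2)."""
    t_eff = T_PW - col * TAU_COL
    return VDD - (VDD / 2) * (frac_discharging / 0.5) * (t_eff / T_PW)

near, far = v_rbl_at_column(1), v_rbl_at_column(100)
# The far-end column resolves a higher voltage for the same 50% condition,
# which a shared V_REF comparator can misread as a negative bit-count.
assert far > near
```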
For bit-line nonidealities, we simulated 500 and 800 IMC cells in a single column under the 50% discharge condition and observe only a ∼1 mV RBL discharge voltage difference between the two cases. The impact of RBL nonidealities on the discharge voltage is negligible because the RC discharge time constant depends mainly on the on-state resistance of the read-stack transistors and the MOM capacitance (C_MOM), which overshadow the parasitic resistance and capacitance of the RBL. Therefore, for the given BNN mapping, we use one 500 × 50 sub-array for the Conv. 2 layer and five 800 × 100 sub-arrays for the FC 1 layer.

IV. IMPACT OF PROCESS VARIATION
Since process variation plays a critical role in the operation of circuits composed of nano-scale devices, we perform an analysis to determine the impact of device-level variations on the discharge voltage and the overall classification accuracy of the BNN. Since the RBL voltage primarily depends on the discharge currents through the read-ports, we consider variations of the NMOS devices in these paths. Fig. 5 shows the histogram of RBL voltage variation for the 50% discharge condition (with V_RWL = 0.37 V, T_PW = 1.2 ns for 800 IMC cells) from 10,000 Monte Carlo simulations. The red line (distribution fit) shows that the RBL voltage variation follows a normal distribution. Since the on-silicon tunability feature of our design can trim the impact of global variations in the transistors and MOM capacitors, we only considered local variations in our simulations.

A. DISTURB MARGIN
To quantify the impact of RBL variation on the classification accuracy, we define the disturb margin as the range of discharge levels that are prone to misclassification due to variation. Fig. 6 shows how we calculate the disturb margin for 3σ variations. From the figure it is evident that the left tails of the RBL voltage distributions below the 50% discharge level and the right tails of the distributions above the 50% discharge level are at risk of misclassification due to variations. We define a discharge level as misclassified if its left or right distribution tail crosses the reference level (the mean of the 50% discharge level). Table 2 shows the disturb margin for the 3σ variation cases for different memory row sizes. For example, a disturb margin of 50(±3)% for 800 rows means that discharge levels from 47% to 53% will be misclassified in 3σ cases by the comparators.
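The 3σ disturb-margin calculation can be sketched as follows (a hedged illustration: the linear level-to-voltage model and the σ value are assumptions, with σ picked to roughly reproduce the 50(±3)% window reported for 800 rows in Table 2):

```python
VDD = 0.8
SIGMA = 8.2e-3  # assumed RBL standard deviation (V) at this row count

def v_mean(level_pct):
    """Mean RBL voltage for a given % of discharging cells
    (linear model: 0% -> VDD, 50% -> VDD/2, 100% -> 0 V)."""
    return VDD * (1 - level_pct / 100.0)

V_REF = v_mean(50)

def at_risk(level_pct, k=3):
    """A level is inside the disturb margin if its k-sigma tail
    crosses the reference (mean 50% discharge) level."""
    return abs(v_mean(level_pct) - V_REF) < k * SIGMA

disturb_margin = [p for p in range(40, 61) if at_risk(p)]
# With these assumed numbers the margin spans 47%..53%, i.e. 50(+/-3)%.
```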

B. IMPACT OF PROCESS VARIATION ON ACCURACY
From the RBL discharge voltage distributions and disturb margin calculations, we determine the classification accuracy distributions (shown in Fig. 7(a)-(c)) of our network from Monte Carlo simulations. From Fig. 7(a) we observe that, for the V_RWL = 0.37 V, T_PW = 1.2 ns word-line voltage configuration (case 1), we obtain a mean classification accuracy of 98.23% with a standard deviation of 0.39%, a slight degradation from the ideal case (98.75%). To improve the mean classification accuracy and reduce the classification variance, we can use the word-line voltage tunability feature of our design. Fig. 7(b) shows that for the V_RWL = 0.47 V, T_PW = 0.5 ns word-line voltage configuration (case 2), we get an improved mean classification accuracy of 98.72% with a reduced standard deviation of 0.12%. For this case, however, we need to set V_REF to 30 mV (a near-zero reference). Since, for case 2, the V_DD − V_REF difference is almost twice that of the case 1 configuration (with V_REF = 400 mV), we need to spend on average twice the energy per unit computation compared to case 1. Thus, our on-silicon tunability feature enables us to perform accuracy/energy-efficiency tuning based on design requirements.
We also analyzed how the coefficient of variation (CV: the ratio of the standard deviation to the mean of a distribution) and the disturb margin vary as we increase the number of IMC cells in a column. Since filter dimensions in modern deep networks are usually large, we expect to need a large number of memory rows to map such filters. From Table 2 we observe that as the number of memory rows is increased, both the CV and the disturb margin decrease. To explain the reduction in CV with increasing IMC cells per bit-line, let us consider the read bit-line discharge rate, D = I/(C·ΔV), where C refers to the read bit-line capacitance, which is dominated by the MOM capacitor of the designed IMC-cell, ΔV refers to the RBL voltage change (i.e., the difference between the precharged voltage and the voltage at the end of the RWL pulse), and I refers to the total discharge current through the read-stacks. As the number of cells per bit-line increases, the average discharge rate remains the same, since the number of cells discharging the bit-line capacitance increases proportionally. On the other hand, as we increase the number of cells per bit-line, while the variability in C changes insignificantly, the standard deviation of I, assuming a normal distribution, follows the root-sum-of-squares formula. For example, if n discharging IMC-cells are added on an RBL column, the mean of I increases by a factor of n while the standard deviation of I increases by a factor of √n, which in turn reduces the coefficient of variation (CV) by a factor of √n. As the CV defines the variability of the system, the RBL discharge rate variability improves with increasing cells per bit-line. This simple estimate of the reduction of CV with increasing cells per bit-line matches the simulated values in Table 2 very well for up to ∼1000 cells per bit-line. For very large numbers of cells per bit-line, the reduction in CV saturates.
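The root-sum-of-squares argument can be checked numerically with a small Monte Carlo sketch (illustrative mean and sigma for the per-cell current; not device data):

```python
import numpy as np

rng = np.random.default_rng(1)
MU, SIG, TRIALS = 1.0, 0.1, 10_000  # assumed per-cell current statistics

def cv_of_total_current(n):
    """CV of the summed discharge current of n parallel read-stacks:
    the mean grows as n*MU, the std as sqrt(n)*SIG, so the CV falls
    as SIG / (MU * sqrt(n))."""
    total = rng.normal(MU, SIG, size=(TRIALS, n)).sum(axis=1)
    return total.std() / total.mean()

ratio = cv_of_total_current(8) / cv_of_total_current(512)
# Expected ratio ~ sqrt(512 / 8) = 8: CV reduces as 1/sqrt(n)
assert 7.5 < ratio < 8.5
```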
Therefore, we expect to have less classification error due to process variability as larger BNN filters are mapped to our IMC array for binary convolutions.

V. RESULTS AND DISCUSSION
Performance of IMC array is commonly measured in terms of energy efficiency (TOPS/W) and throughput (GOPS).
Here, we present the TOPS/W and GOPS for our MNIST network. Since, in our design, we have a maximum sub-array size of 800 × 100, each column can perform 800 operations (XNORs). Our simulation shows that, during inference, 800 operations on average consume 346.5 fJ of energy, which corresponds to an energy efficiency of 2308 TOPS/W. For each column, an additional 18.2 fJ is required for binary activation generation, which corresponds to an energy efficiency of 2193 TOPS/W. Here we report energy efficiency and throughput numbers for the product-sum-binarization operation during inference, which includes the RWL driver, RBL driver and comparator. In Table 3, we compare the energy efficiency and throughput of our design for the product-sum operation with recently published binary neural network accelerators that use in-memory computing. This approach is aligned with the reporting of energy efficiency and throughput numbers in the literature (e.g. see Valavi et al. [18]) and ensures a fair comparison. Since inference does not require operating the 6T WL/BL drivers, as the trained weights are already written into the memory, the energy contributions from those elements are kept out of the calculation. We get better energy efficiency primarily from two sources. First, from the voltage scaling enabled by the 22FDX® technology node, we get ∼1.4x energy improvement compared to the state-of-the-art accelerator proposed by Valavi et al. [18]. Second, the on-silicon tunability feature of the design enables us to operate at V_REF = V_DD/2. Thus, on average (for a typical convolution cycle) we need to precharge V_DD/2 on the RBL instead of V_DD, which provides the rest of the energy efficiency improvement.
As far as throughput is concerned, we perform one binary convolution and activation generation (including the discharge, compare and precharge phases) within 3.5 ns. Since we use a maximum array size of 800 × 100, we can achieve a computational throughput of 22857 GOPS from our design. However, unlike Valavi et al. [18], we did not use IMC arrays with a neuron-tile architecture. Therefore, even larger throughput can be obtained from our IMC arrays by adopting architecture-level innovations.
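The reported efficiency and throughput figures follow directly from the numbers above; a quick arithmetic check (Python, using only values stated in the text):

```python
# Energy efficiency: 800 XNOR operations per column per cycle
OPS = 800
E_OPS_FJ = 346.5  # energy for 800 operations (fJ), from simulation
E_ACT_FJ = 18.2   # per-column activation-generation energy (fJ)

tops_per_w_ops = OPS / (E_OPS_FJ * 1e-15) / 1e12               # ~2308 TOPS/W
tops_per_w_all = OPS / ((E_OPS_FJ + E_ACT_FJ) * 1e-15) / 1e12  # ~2193 TOPS/W

# Throughput: an 800 x 100 sub-array finishes 80,000 ops every 3.5 ns
gops = (800 * 100) / 3.5e-9 / 1e9                              # ~22857 GOPS
```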
In order to assess the scalability of our design to larger networks with complex tasks, we train a binarized convolutional neural network based on the Network-In-Network (NIN) architecture [34] and map it onto our IMC hardware for the CIFAR-10 [35] dataset. We use a 9-layer convolutional network, as shown in Table 4, where we binarize all the layers except the first (Conv. 1) and last (Conv. 9) layers. We do not use any bias units in the binarized layers during training and inference. Since the network has a maximum of 192 filters, the IMC array design requires us to consider a maximum of 192 columns. For this task, we partition the memory array along the columns and design the memory sub-arrays with 50 columns, processing the data in these sub-arrays independently to achieve good classification performance under the circuit nonideality due to the RC delay of RWL pulses at far-end cells (as discussed in Section III.B). Moreover, the network has its largest filters in the Conv. 4 layer, each having 2400 binary weight values. We were able to map these filters into an IMC array with 2400 rows without array partitioning, as our IMC design is not limited by the number of cells per bit-line. In a traditional 6T SRAM operation, the larger bit-line capacitance associated with a higher number of cells per bit-line provokes read disturb. In our IMC design, on the other hand, we use a separate read-stack and intentionally increase the capacitance of the RBL by adding the MOM capacitor. As discussed in Section III.B, this design choice nullifies the impact of RBL nonidealities on classification accuracy and enables a large number of rows per bit-line. In our simulations for the CIFAR-10 dataset, we used a maximum IMC sub-array size of 2400 × 50. For example, for the Conv. 4 layer, 4 such sub-arrays are used to map the entire layer (2400 × 192) onto the IMC hardware for data processing.
In order to achieve better classification accuracy on CIFAR-10, we trained the binary NIN with batch normalization layers [32]. In this case, since each neuron learns its own binary activation threshold value, we cannot use a uniform threshold for all neurons. After training, we computed these threshold values from the batch normalization parameters, as demonstrated by Valavi et al. [18], and used them as reference values for the comparators. With these design considerations, for CIFAR-10, we obtain a classification accuracy of 85.83% from our hardware simulation, which is comparable to the accuracy (86.61%) of the software implementation. Moreover, for the IMC sub-array size considered for CIFAR-10, compared to MNIST, the throughput will increase while the energy efficiency will remain the same. Furthermore, as discussed in Section IV, for larger numbers of rows (cells per bit-line), the variation in RBL discharge voltage (quantified by %CV and disturb margin) reduces. Thus, for a larger network like the NIN considered here, the impact of process variability is expected to be less pronounced.
Overall, with this simulation-based study, we show a methodology to integrate our proposed IMC bit-cell into an array considering relevant circuit nonidealities and process variations while achieving high energy efficiency and throughput. Such analysis can be useful for optimizing and enhancing the performance and robustness of existing or future IMC bit-cell array designs for deep learning applications.

VI. CONCLUSION
In order to meet the growing demand for running deep learning applications on memory- and resource-constrained edge devices, data processing with binary neural networks running on in-memory computing hardware platforms emerges as a most promising approach. Motivated by the quest for energy-efficient and scalable in-memory computation, in this work, we propose a novel IMC bit-cell, with a 10T SRAM and a MOM capacitor, that can perform binary convolution operations when organized in a memory array. We show that, when the binary layers of a BNN are processed on our IMC array, a classification accuracy of 98.75% with 2193 TOPS/W and 22857 GOPS can be achieved (i.e. ∼2.5x TOPS/W and ∼1.2x GOPS improvement compared to the high-performing state-of-the-art IMC solution [18]). We show that our design is robust against process variations. Moreover, our design provides a knob for improving classification accuracy under process variations by trading off energy efficiency. Future work will be directed towards performance evaluation of our IMC arrays on larger datasets, overall system-level evaluations with necessary architectural innovations, and exploration of on-silicon tunability with back-gate bias [36].