3-D Monolithic Stacking of Complementary-FET on CMOS for Next Generation Compute-In-Memory SRAM

Monolithic 3D stacking of complementary FET (CFET) SRAM arrays increases integration density multi-fold while retaining the inherent SRAM advantages of low write power and near-infinite endurance. We propose stacking multiple 8-transistor CFET-SRAM layers on a regular CMOS periphery to achieve an ultra-high-density array for computing-in-memory (CIM). CFET and regular CMOS (FinFET) devices are measured and calibrated with the BSIM-CMG compact model. SPICE simulations are performed to evaluate the delay of CIM operation, power consumption, and analog computational error due to device non-linearity. The impact of device non-linearity on neural network inference accuracy is evaluated using the CIMulator simulation platform. The lower CFET current drive due to the amorphous (deposited) silicon channel is shown to have negligible impact on CIM operational delay in many cases, because the maximum allowable current that still maintains an accurate weighted sum is limited by wiring resistance rather than transistor drive strength. Compared to a regular 2D CMOS FinFET array, CFET SRAM cells show an improvement of up to 57.19% in TOPS/W. Furthermore, the performance in TOPS/(W·mm$^2$) is improved by up to $19\times$, a factor proportional to the number of stacked layers, which makes monolithically stacked CFET SRAM highly promising for future edge intelligence.


I. INTRODUCTION
The exponential growth of neural network (NN) size for artificial intelligence (AI) has placed an enormous demand on data-centric computational power. Computing-in-memory (CIM) addresses this demand by integrating processing and data storage to overcome the von Neumann bottleneck. Static random-access memory (SRAM) is a promising candidate for on-chip CIM [1], [2] compared to other electronic memory types because of its superior endurance, low write power, and noise immunity, yet it occupies the largest layout area among them (Fig. 1(a)). This may be addressed by vertically stacking nFET and pFET nanosheet devices into the complementary FET (CFET) structure (Fig. 1(b)) [3], [4]. Further stacking of CFET-SRAM layers via monolithic 3D integration reduces the footprint by more than a factor of n, where n is the number of CFET-SRAM stacks (Fig. 1(c)). A silicon channel deposited at low temperature (< 600 °C) has degraded carrier mobility, so the transistor drive current is much lower than that of regular CMOS [5]. For CIM, however, multiple SRAM rows are activated simultaneously for the weighted-sum operation, and in many cases the maximum current is limited by I-R drop rather than transistor drive strength. To minimize latency, regular CMOS FinFETs in the bottom single-crystalline silicon layer can be used for the peripheral drive and read-out circuitry that drives and senses the CFET SRAM arrays on top. The 8-transistor (8T) SRAM is suitable for CIM due to its high linearity and low read disturb. In this work, we design and lay out a CFET-based 8T-SRAM cell and compare it with a regular CMOS FinFET 8T-SRAM cell processed in the same fabrication facility in a similar test vehicle. BSIM-CMG [6] is calibrated to both technologies, and SPICE simulations are performed to compare the two cases and quantify improvements in power consumption, linearity, and latency.
We use the CIMulator [7] platform to compare CIM-SRAM built in the two technologies in terms of NN inference accuracy with a given software-trained 5-bit-weight network, taking into account the hardware-extracted device variation and nonlinearity found during SPICE simulation [1]. The advantages of 3D monolithically integrated CFET SRAM are highlighted in terms of CIM operations per power and operations per power per area.

II. CFET SRAM CELL DESIGN AND MONOLITHIC 3D INTEGRATION
Compute-in-memory operations are efficiently realized using the 8T-SRAM cell. These cells have less read bitline discharge current compared to 7T-SRAM and a reduced footprint compared to 10T-SRAM. However, 10T-SRAM cells are used to perform other Boolean operations. Hence, this paper considers the standard 6T-SRAM for memory and the 8T-SRAM for computing-in-memory applications. The CFET device, comprising a bottom n-channel and a top p-channel, is fabricated with a low-temperature process (< 600 °C) [4]. The CFET device is then used to realize an 8T-SRAM comprising 2 p-channel and 6 n-channel transistors (2P6N configuration) with the aid of an additional dummy transistor T1 that shares its gate with transistor RA1 (Fig. 2). The second read access transistor RA2 is combined with the read access transistor RA'2 of the adjacent SRAM cell to complete the structure. The layout of the structure is shown in Fig. 3; it has an area of 240 $\lambda^2$ compared to the standard FinFET area of 270 $\lambda^2$ (not shown here for simplicity). Stacking the p-channel on top of the n-channel reduces the footprint for both 6T-SRAM and 8T-SRAM, as shown in Table 1. The advantage is smaller for the 8T-SRAM design because of the dummy transistor added to realize the CFET structure. Further layout area reduction for CFET is possible via the buried power rail approach [3], yet this method is not compatible with 3D monolithic stacking. The main advantage of our polysilicon-channel CFET device is that its channel is deposited using plasma-enhanced chemical vapor deposition (PECVD) followed by solid-phase crystallization. This enables us to monolithically stack one SRAM layer on top of another by simply repeating the fabrication process steps. Stacking multiple SRAM layers also shortens the global interconnects, which minimizes R-C delay. In addition, the effective cost per layer is lower because the same photomasks are used for each layer.
However, the complexity of multiple layers will increase total processing costs and might cause yield reduction.
FinFET devices, on the other hand, are made of single-crystalline silicon and hence have superior drive strength compared to polysilicon devices. The FinFETs [9], [10], [11], [12] are fabricated with a gate-first high-K and metal gate process, with an Hf$_{0.5}$Zr$_{0.5}$O$_2$ gate dielectric of 10 nm thickness. The device was primarily developed as part of a ferroelectric FinFET (Fe-FinFET). Nevertheless, it can be used as a regular FinFET.

III. DEVICE CHARACTERIZATION AND PARAMETER EXTRACTION
For a fair comparison, the CFET and FinFET devices are made in the same nanofabrication facility with similar baseline processes. The fabricated devices of both types are then characterized and calibrated to the BSIM-CMG multiple-gate CMOS compact model following the benchmarking flow shown in Fig. 4. To ensure that the parameters are consistent with the physical properties, we follow the extraction method described in the BSIM-CMG manual [6]. The measured data show asymmetric threshold voltages ($V_{th}$) for n-channel and p-channel CFETs. The threshold voltage is tuned by a post-calibration adjustment of the effective gate work function, assuming that $V_{th}$ engineering is available through channel or gate stack engineering [13], [14]. The 8T-SRAM cell is designed and the widths of its transistors are tuned for better hold, read, and write stability, as described in Section IV. Peripheral components, including sense amplifiers, counters, and drivers, are realized using FinFET devices and optimized for speed and linearity, as described in Section V. Fig. 5(a) and 5(b) show the measured transfer characteristics of the CFET and the FinFET, respectively. One key difference between the CFET and FinFET devices is the smaller current in the CFET, which is due to the lower carrier mobility in the polycrystalline channel. Fig. 5(c) and 5(d) show the output characteristics and output conductance of n-channel CFET devices, which are discussed in detail in Section VI.

IV. READ, WRITE AND HOLD STABILITY OF CFET 8T-SRAM
8T-SRAM cells are designed for optimum stability, speed, and linearity. Accurate CIM results and high image recognition accuracy require superior read stability. To enhance the read stability, an 8T-SRAM cell is considered in which the two additional transistors decouple the output loading from the storage node during the read operation [15], thus significantly improving the read stability. In fact, for 8T-SRAM the read static noise margin (RSNM) is nearly the same as the hold static noise margin (HSNM) (Fig. 6(a)). The variation of HSNM as a function of $V_{dd}$ is shown in Fig. 6(b). Read stability is also quantified by the read N-curve, which is produced by biasing both bitlines at $V_{dd}$, sweeping the voltage at one of the storage nodes, and measuring the current that flows into it. Write stability, on the other hand, is quantified with write N-curves by biasing one bitline at $V_{dd}$ and the other bitline at ground [16], and then sweeping the voltage at the storage node to extract the nodal current. Fig. 6(c) and 6(d) show the critical write current and critical read current of the CFET SRAM cell as a function of $V_{dd}$, with the insets showing the write N-curve and read N-curve, respectively [17].

V. 3D CFET SRAM COMPUTE-IN-MEMORY ARCHITECTURE
CIM is a highly energy-efficient way of performing multiply-and-accumulate (MAC) operations. Fig. 7 shows the proposed CIM architecture for 3D monolithically stacked CFET SRAM arrays. The peripheral circuitry for data reading, data writing, and digital conversion of MAC results is placed in the first (bottom) layer with single-crystalline silicon for optimal read/write latency. Multiple CFET SRAM layers with deposited silicon channels are placed above. The weights obtained from off-chip training of the neural network are stored as SRAM cell contents through write operations. CIM generally employs a word-by-word write operation to initialize the neural network (NN) weights and to update them. A single row is selected by choosing the layer first, followed by the row in that layer, employing the layer decoder and row decoder simultaneously. A MAC (read) operation is performed by applying pulses corresponding to the input image data on the wordlines, which are multiplied with the contents of the SRAM cells. The MAC operation results in current being discharged through the SRAM cell in accordance with the data stored in the cell, or weight $W_N$, and the excitation duration $T_{WL}$ on the read wordline (RWL). The discharge currents $I_N$ add up along the column of the SRAM array, and the charge-sharing capacitors located on the bottom layer are discharged accordingly, lowering their voltage by $W_N \times \frac{1}{C_{BL}} \int_0^{T_{WL}} I_N(t)\,dt$. The remaining capacitor voltage (Eq. (1)) [1] is then used to estimate the MAC output using a flash analog-to-digital converter (ADC). The overall block diagram is shown in Fig. 8. The 8-bit input image is split into eight batches, each comprising 5-bit input pulses ($8 \times 2^5 = 256$ pulses). The adders and registers of the read periphery circuit accumulate the partial sums of each cycle that result from splitting the 8-bit input image (256 pulses).
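The bitline discharge described above can be illustrated numerically. The following is a minimal sketch, not the paper's Eq. (1): it assumes that the bitline capacitor is precharged to the supply voltage and that every active cell sinks the same constant current, so the integral reduces to a product. The function name and all numerical values are hypothetical and chosen only for illustration.

```python
def bitline_voltage_after_mac(weights, pulse_counts, i_cell, t_pulse, c_bl, v_dd):
    """Return the read-bitline voltage left after one MAC operation.

    Assumptions (not from the source): the capacitor starts at v_dd and
    each cell storing a 1 sinks a constant current i_cell per input pulse.
    """
    # Row n removes charge w_n * (number of pulses on row n) * i_cell * t_pulse.
    charge = sum(w * n * i_cell * t_pulse for w, n in zip(weights, pulse_counts))
    # Voltage drop on the bitline capacitor is the removed charge divided by C_BL.
    return v_dd - charge / c_bl


# Hypothetical 4-row column: stored weight bits and input pulse counts per row.
v_bl = bitline_voltage_after_mac(
    weights=[1, 0, 1, 1],
    pulse_counts=[3, 5, 0, 2],
    i_cell=1e-6,    # 1 uA per cell (illustrative)
    t_pulse=1e-9,   # 1 ns pulse width (illustrative)
    c_bl=50e-15,    # 50 fF bitline capacitance (illustrative)
    v_dd=0.8,
)
print(f"V_BL after MAC: {v_bl:.3f} V")
```

In this idealized picture the voltage drop is linear in the weighted sum of inputs; a finite output conductance makes the cell current depend on the bitline voltage and breaks that linearity.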

VI. SPICE AND NEURAL NETWORK COMPUTE-IN-MEMORY SIMULATION
To compare CFET and regular CMOS FinFET in terms of MAC latency, power consumption per operation, and the impact of read bitline nonlinearity [1], SPICE simulations are performed with 4-bit and 5-bit unary input pulses ($2^4$ and $2^5$ pulses, respectively) of 0.8 V each on arrays with 64 and 256 rows and with 4-bit and 5-bit weight precision. The resulting transient waveforms and the energy consumed in each step are shown in Fig. 9(a) and 9(b), respectively. The input pulse width applied on the wordline is tuned to ensure that $V_{BL}$ does not drop below 3% of $V_{dd}$ after the application of $2^4$ and $2^5$ pulses with all SRAM weights set to 1.
The performance of the CFET 8T-SRAM array is evaluated using CIMulator [7], a circuit-level benchmarking tool for neuromorphic circuits similar to NeuroSim [18]. We employ a 3-layer multi-layer perceptron (MLP) NN comprising 784, 200, and 10 neurons, respectively, to recognize handwritten digits from the MNIST [19] database. Table 2 describes the realization of the NN weights and biases of layers 1 and 2 using an SRAM CIM macro of size 64 × 60 (rows × columns) with 5-bit precision. The macro dimension of 64 × 60 is chosen because with 64 × 64 and 5-bit precision the last 4 columns would be unusable. To keep the number of computational units, or neurons, of the NN low, we adopt the one-device-as-synapse (1D1S) architecture [20], which uses a reference column of weights instead of negative synaptic devices. For instance, layer 1 comprises 784 × 200 × 5 SRAM cells. The 784 rows in layer 1 are realized using 12 instances of the 64 × 60 macro plus 16 rows from the 13th instance. The 200 × 5 (columns × precision) columns are realized using 16 instances of the 64 × 60 macro plus 40 columns from the 17th instance. Training the network on-chip is possible because SRAM has practically infinite endurance, but the process is energy-intensive, and training the NN every time is discouraged. Moreover, training a NN requires higher bit precision (8-bit), whereas for image recognition (feed-forward inference) a lower bit precision (4 or 5-bit) may be sufficient. Hence, we train the network off-chip with 8-bit weights and quantize them to 4 and 5 bits for feed-forward inference. This technique reduces the footprint of the CIM macro and the energy consumed by the chip with minimal degradation in accuracy.
The access transistors of the SRAM cell connected to the RBL need to be operated in the saturation regime. In the saturation regime, a device ideally passes a constant current irrespective of the drain voltage applied. The CFET technology is newly developed and is particularly vulnerable to process variation. Fig. 5(c) and 5(d) show a device with finite output conductance due to process variation [4]. Finite output conductance degrades the forward inference accuracy of analog summing, i.e., the MAC operation [1]. NNs are in general robust to synaptic array non-idealities, and one can often retrain a NN for a small number of epochs to recover the lost accuracy, as described below. Fig. 11 shows the RBL curves of the CFET device for 4-bit and 5-bit ADCs corresponding to 4-bit and 5-bit input pulses and weights, respectively. The nonlinearity θ is extracted by fitting the ADC output to the nonlinearity model, as shown in the inset of Fig. 11. The value of θ is easily adopted in the CIMulator platform to study the impact of the MAC nonlinearity on forward inference accuracy for the entire NN. Inference accuracy is severely degraded once MAC nonlinearity is considered, but it can be improved by retraining the NN (Fig. 12(a)). Following a brief retraining (fewer than 5 epochs, inset of Fig. 11), accuracy is recovered to 85.19% and 89.63% for 4-bit and 5-bit weights, respectively. Fig. 12(b) shows the accuracy of the NN at various stages of feed-forward inference. It should be noted that the accuracy after retraining is limited to 90% because of the lower bit precision (4 and 5 bits).
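The macro-instance counts quoted above for layer 1 follow from integer division of the layer dimensions by the macro dimensions. The sketch below reproduces that arithmetic; the helper name `tile` is hypothetical and is not part of CIMulator.

```python
def tile(total, macro_dim):
    """Split `total` rows or columns into full macro instances plus a remainder."""
    full, remainder = divmod(total, macro_dim)
    return full, remainder


MACRO_ROWS, MACRO_COLS = 64, 60   # macro size stated in the text
LAYER1_INPUTS = 784               # rows needed by layer 1
LAYER1_OUTPUTS = 200              # neurons in the hidden layer
WEIGHT_BITS = 5                   # one column per weight bit

rows_full, rows_rem = tile(LAYER1_INPUTS, MACRO_ROWS)
cols_full, cols_rem = tile(LAYER1_OUTPUTS * WEIGHT_BITS, MACRO_COLS)

print(f"rows: {rows_full} full instances + {rows_rem} rows of the next one")
print(f"columns: {cols_full} full instances + {cols_rem} columns of the next one")

# A 64-column macro would leave 64 % 5 columns unused at 5-bit precision.
unused_cols_64 = 64 % WEIGHT_BITS
```

The last line mirrors the stated reason for choosing 60 columns: 60 is a multiple of 5, whereas 64 leaves 4 columns that cannot hold a complete 5-bit weight.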

VII. PARASITIC EXTRACTION
Parasitic resistance is obtained using the equation $R = \rho (L/A)$, where $\rho$ is the resistivity of the copper interconnect, $L$ is its length, and $A$ is the cross-sectional area of the interconnect trench. The length of the read bitline $L_{RBL}$ and the corresponding resistance $R_{RBL}$ per unit SRAM cell for the CFET and FinFET devices are summarized in Table 3. A unit capacitance $C/L$ of 0.378 fF/μm [17] is used to obtain the read bitline capacitance $C_{RBL}$. Signals on the wordline, on the other hand, do not impact the performance of the CIM if they are driven with sufficient power. Hence, we consider only the bitline parasitics in our simulations.
Table 4 shows the device-, circuit-, and system-level comparison of CFET and regular CMOS FinFET technology for SRAM-CIM. Both 64-row and 256-row arrays are considered for the evaluation. The area of the CFET SRAM cell (Area$_{cell}$) is smaller than that of the FinFET SRAM cell due to the stacking of the n-channel and p-channel in the CFET. The smaller CFET SRAM cell area leads to smaller read bitline resistance and capacitance ($R_{BL}$ and $C_{RBL}$) for CFET SRAM arrays in comparison to FinFET SRAM arrays. The read bitline discharge current $I_{cell}$ is a function of $R_{BL}$ and $C_{RBL}$. For the 64-row case, the FinFET has a larger current and 3× lower latency due to its higher mobility, and slightly larger wiring capacitance. The CFET, on the other hand, shows 3× lower power due to its low $I_{cell}$, and hence its power efficiency (TOPS/W) is better. For 256 rows, to ensure that $V_{BL}$ does not drop below 3% of $V_{dd}$, $I_{cell}$ is reduced by tuning the input pulse voltage, which results in a low bitline current for the extracted $R_{BL}$ and $C_{RBL}$ values of the 256 × 256 macro. The latencies of CFET and CMOS FinFET become similar because the $I_{cell}$ currents are similar.
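The parasitic extraction based on R = ρ(L/A) and the unit capacitance C/L can be sketched as follows. Only the 0.378 fF/μm unit capacitance comes from the text; the resistivity (bulk copper, which underestimates the effective resistivity of narrow interconnects), the wire dimensions, and the cell pitch are illustrative assumptions and are not the values of Table 3.

```python
def wire_resistance(rho_ohm_m, length_m, area_m2):
    """Resistance of a uniform wire: R = rho * L / A."""
    return rho_ohm_m * length_m / area_m2


def wire_capacitance_fF(c_per_um_fF, length_um):
    """Capacitance of a wire from its per-unit-length capacitance."""
    return c_per_um_fF * length_um


RHO_CU = 1.68e-8          # ohm*m, bulk copper (assumption)
C_PER_UM = 0.378          # fF/um, unit capacitance quoted in the text

# Hypothetical bitline segment: 1 um long, 20 nm x 40 nm trench cross-section.
r_segment = wire_resistance(RHO_CU, 1e-6, 20e-9 * 40e-9)

# Hypothetical 64-row bitline with an assumed 0.1 um cell pitch.
bitline_length_um = 64 * 0.1
c_rbl = wire_capacitance_fF(C_PER_UM, bitline_length_um)

print(f"R per 1 um segment: {r_segment:.1f} ohm")
print(f"C_RBL for 64 rows:  {c_rbl:.4f} fF")
```

Both quantities scale linearly with bitline length, which is why a shorter cell height translates directly into lower bitline resistance and capacitance.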

VIII. PERFORMANCE BENCHMARKING
CFET shows an improvement of 4.24% and 57.19% in power efficiency (TOPS/W) because the smaller CFET SRAM cell area leads to lower parasitics; the benefit becomes significant for large arrays. The CFET SRAM cell requires about 7 masks for 1 stack of SRAM cells built using 2 metal layers. The FinFET also needs 7 masks to realize SRAM cells using 1 poly and 2 metal layers. Multiple stacks of SRAM cells (3D stacking) are feasible with the CFET device. The realization of the 3-layer MLP neural network described in Fig. 10 requires 17 stacks of CFET SRAM cells of dimension 64 × 60, or 4 stacks of SRAM cells of dimension 256 × 256. Stacking the layers on top of each other improves the performance in terms of TOPS/(W·mm$^2$) in proportion to the number of stacked layers. However, stacking incurs larger processing costs because the processing steps are repeated for every SRAM stack; this is the price paid for a performance per power per layout area that improves in proportion to the number of stacked layers.

IX. CONCLUSION
Measured CFET and regular CMOS FinFET devices are accurately calibrated in transfer and output characteristics using the BSIM-CMG compact model. An 8T-SRAM cell is realized using CFET technology, and SPICE simulations of 8T-SRAM CIM macros are conducted for the CFET and FinFET technologies. MAC latency, power, and non-linearity are evaluated, and a non-linearity model is developed and transferred to a higher-level NN simulator to evaluate system-level inference accuracy. Although the FinFET has an advantage in terms of latency, the CFET shows better performance per power due to its shorter SRAM cell height and reduced wiring resistance, and the efficiency gain becomes more evident for large array sizes. Performance in TOPS/(W·mm$^2$) is improved by 19× and 7× for 17 and 4 stacked layers of CFET SRAM cells of dimension 64 × 60 and 256 × 256, respectively. The accuracy lost to device nonlinearity caused by finite output conductance is recovered by retraining the NN for as few as 5 epochs. The CFET, being a gate-all-around device, has superior gate controllability and reduced short-channel effects, making it scalable beyond N3 [3]. In addition, the CFET with a deposited channel has a reduced footprint and is stackable, in contrast to regular CMOS.