IGZO CIM: Enabling In-Memory Computations Using Multilevel Capacitorless Indium–Gallium–Zinc–Oxide-Based Embedded DRAM Technology

Compute-in-memory (CIM) is a promising approach for efficiently performing data-centric computing (such as neural network computations). Among the multiple semiconductor memory technologies, embedded DRAM (eDRAM), which integrates the DRAM bit cell with high-performance logic transistors, can enable efficient CIM designs. However, silicon-based eDRAM technology suffers from poor retention time, incurring significant refresh power overhead. In contrast, eDRAM using back-end-of-line (BEOL) integrated $C$-axis aligned crystalline (CAAC) indium–gallium–zinc–oxide (IGZO) transistors, which exhibit extremely low leakage, is a promising memory technology with lower refresh power overhead. A long retention time in IGZO eDRAM can enable multilevel cell functionality, which can improve its efficacy in CIM applications. In this article, we explore a capacitorless IGZO eDRAM-based multilevel cell, capable of storing 1.5 bits/cell, for CIM designs focused on deep neural network (DNN) inference applications. We perform a detailed design space exploration of IGZO eDRAM sensitivity to process and temperature variations for read, write, and retention operations, followed by architecture-level simulations comparing performance and energy for different workloads. The effectiveness of the IGZO eDRAM-based CIM architecture is evaluated using a representative neural network: the proposed approach achieves 82% Top-1 inference accuracy on the CIFAR-10 dataset, compared with 87% software accuracy, while offering high bit cell storage density.

(DNNs). These emerging large-size DNN models cannot fit within the limited on-chip memory even in the latest server CPUs [1], GPUs [2], and specialized machine learning (ML) accelerators, such as Graphcore [3]. This necessitates a massive amount of data movement from off-chip memory to on-chip compute cores in modern ML accelerators, resulting in increased energy for computation. Thus, it is important to explore technologies and algorithms that maximize capacity and further reduce the data movement for performing multiply-and-accumulate (MAC) operations in ML workloads. On the algorithms front, different low-resolution networks have been proposed to reduce the data movement energy and computation cost. One such example is the use of binary/ternary neural networks, which use binary (+1, −1) or ternary (+1, 0, −1) weights and activations to perform MAC operations, resulting in reduced data movement. These networks approximate the dot product as a simple AND gate (binary) or a combination of AND and XOR gates (ternary), further reducing the computation energy.
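
The gate-level view of a ternary dot product can be illustrated with a short Python sketch. The (nonzero, sign) bit encoding and the function names below are assumptions made for illustration only; they are not taken from the cited networks or from the proposed design. Under this encoding, each elementwise product needs one AND (is the product nonzero?) and one XOR (is the product negative?).

```python
def encode_ternary(value):
    """Map a ternary value to (nonzero, sign) bits; sign = 1 means negative.

    This two-bit encoding is an illustrative assumption.
    """
    if value not in (-1, 0, 1):
        raise ValueError("ternary values must be -1, 0, or +1")
    return (1 if value != 0 else 0, 1 if value < 0 else 0)


def ternary_dot(weights, activations):
    """Dot product of two ternary vectors using one AND and one XOR per element."""
    if len(weights) != len(activations):
        raise ValueError("vectors must have equal length")
    total = 0
    for w, a in zip(weights, activations):
        w_nz, w_sign = encode_ternary(w)
        a_nz, a_sign = encode_ternary(a)
        nonzero = w_nz & a_nz        # AND: product is nonzero only if both operands are
        negative = w_sign ^ a_sign   # XOR: product is negative if the signs differ
        total += nonzero * (1 - 2 * negative)  # contributes +1, -1, or 0
    return total


weights = [1, -1, 0, 1, -1]
activations = [1, 1, -1, -1, -1]
print(ternary_dot(weights, activations))
```

The result matches the ordinary arithmetic dot product of the two vectors, which is the property that lets a ternary network replace multipliers with simple logic.
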
( 1 pJ/bit), thus having the desired attributes of a CIM design [9], [10]. However, the major limitation from eDRAM becoming a potentially strong candidate for CIM applica-

lizing a three-transistor one-capacitor (3T1C)-based IGZO eDRAM. However, the storage bit cell capacitor limits the amount of 3-D stacking in the IGZO-based capacitor. In this article, we explore the usage of capacitorless IGZO-based eDRAM for storing three levels (1.5 bits/cell). The capacitorless IGZO-based eDRAM can offer higher array density than the 3T1C eDRAM-based bit cell because of 3-D stacking without loss in storage density, thus making it well suited for high-performance, high-density CIM designs. IGZO eDRAM bit cells with a dedicated read port have been explored before [1]. However, prior approaches [13], [14] consider only 1-bit/cell CIM designs. In contrast, this work explores the possibility of multilevel cells (MLC) in IGZO eDRAM and the efficient mapping of ternary weights in a neural network for CIM applications, considering device-circuit-architecture-level analysis.

This article is organized as follows. Section II presents the CAAC-IGZO eDRAM leakage mechanism and the advantages of IGZO in terms of retention time. Section III describes modeling of the device. Section IV discusses the different bit cell topologies. Section V analyzes the read/write timing diagram for the capacitorless IGZO eDRAM bit cell. Section VI presents the variability study for the bit cell in terms of the SN voltage for the write operation. Section VII presents the read variability study in terms of the voltage at the read bitline (RBL). Section VIII validates the MLC potential by studying a CIM design that is capable of producing accurate MACs for ternary neural networks. Section IX presents the architecture-level simulations for different benchmarks and examines the trade-off between energy and latency for IGZO eDRAM over Si eDRAM. Sections X-XII present the analysis of CIFAR-10 results on a custom CNN. Section XIII summarizes the key analysis results and observations from this work on IGZO-based eDRAM.

The CAAC-IGZO transistors are typically realized as N-type devices having a moderate on-current and are suitable for low-temperature BEOL CMOS integration. This allows for increasing the bit cell density by stacking multiple layers of IGZO access transistors and backend capacitors in a 3-D fashion. In addition to 3-D integration, IGZO-based eDRAMs can increase bit density by storing multiple bits per cell, owing to the extremely low leakage characteristics of IGZO devices, with high sense margins for resolving between multiple storage capacitor voltage levels. The SN of eDRAMs and DRAMs leaks due to subthreshold leakage, band-to-band tunneling, and gate-induced drain leakage (GIDL) [15] of the access transistor, as well as storage capacitor leakage, as shown in Fig. 1. The extent of this leakage defines the refresh times of such memories. One mechanism for reducing the subthreshold leakage (exponentially dependent on the transistor gate-to-source voltage) is the use of negative word line voltages in the off-state. However, this negative voltage increases the electric field at the gate-drain overlap region, which leads to an increase in GIDL. The higher energy bandgap ($E_g$), higher effective mass of the electron ($m_0$), and higher relative permittivity ($\varepsilon_r$) in IGZO as compared with Si are the primary driving factors for reduced GIDL. This increases the retention time to more than ten days in IGZO-based eDRAMs [16]. Furthermore, low leakage enables successful retention of the bit cell contents for a longer time and, hence, enables reliable read (enough bitline

characteristics of low-leakage IGZO-based eDRAMs have been obtained by carefully optimizing the device parameters, such as doping of the channel (NBODY), mobility temperature coefficient (UTL), and nonuniform doping in the lateral direction (K0). These parameters enable tuning of the subthreshold slope and the off-state current. The on-current, which is typically in the range of µA and lower than the on-current of Si-based transistors, is modeled with a decreased mobility value using the low-field mobility coefficient parameter (U0).

FIGURE 3. Capacitorless IGZO eDRAM bit cell. SN is the bit cell storage node. WBL/RBL is write/read bitline. Write port transistor (orange) is of higher threshold voltage than read port transistor (green). RWL is read word line with the voltages required for read, write, and retention.

Fig. 3 shows a three-transistor capacitorless eDRAM structure that has been used for circuit simulations for multilevel storage. The write port transistor ($T_1$), marked in orange, has been optimized with a higher threshold voltage ($V_t$) using a larger body bias voltage so as to reduce the leakage of the bit cell. The read port transistors ($T_2$ and $T_3$) use a low threshold voltage ($V_t$) to reduce the read time and to provide a wider swing on the RBL, thus enabling a better sense margin. This allows storage of multiple levels in the same bit cell.

Write stability is measured in terms of the SN voltage at the end of write. The voltage at the SN modulates the effective resistance of the read port transistors, which, in turn, affects the read stability as well. Thus, the voltage at the SN at the end of write should be large enough to differentiate between different levels during read.

This leads to lesser discharge rates of the RBL node, leading to a decrease in the voltage difference between levels. However, it is important to note that, even at increased temperatures,

assuming 1σ $V_t$ of 30 mV for the read port and access transistors are used. This analysis considers a read operation performed immediately after a write operation (i.e., with minimal retention/standby time). This result captures the effect of capacitive coupling degrading the SN postwrite and the increase in SN value during read. When reading "10," the voltage at RBL is close to 0 V; for "01," it is close to 0.65 V; and for "00," it is close to 1.3 V. Thus, in the worst-case scenario, there is a difference of 0.4 V in RBL voltage at the end of read between "10" and "01" and 0.55 V between "01" and "00". This suggests that there is sufficient difference in the voltages for reading the levels "00," "01," and "10" reliably, as shown in Fig. 8.

Fig. 9 demonstrates that the bit cell can still resolve multiple levels when the read is performed long (10 s) after the write. This exploration captures the effect of the capacitive coupling degrading the storage node observed postwrite, the SN degrading because of leakage, and the slight increase in SN observed during read. Histograms corresponding to levels "10" and "01" are shifted to the right in contrast to Fig. 8, while the "00" histogram is shifted to the left. This is because the SN, while storing 0 V, increases over a period of time. This analysis shows that it is feasible to differentiate successfully between different bit cell contents.

2) tCL is the time taken to read the data once it is latched onto the row buffer. tWL and tWTR impose constraints on the successive commands of the row buffer and are independent of the underlying memory technology [22]. tRTP is a measure of the data stability in the cross-coupled inverters that feed into the write drivers of the memory array and is independent of memory technology.

3) tRP is an indication of the time taken between a successful precharge and completion of the activate command for write in the same bank. Thus, it is a measure of the write latency of the memory array [22] and is roughly four cycles at a 400-MHz clock.
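
As a quick check on the timing figure above, a cycle count can be converted to absolute time. The helper below is a minimal sketch; the function name is hypothetical, and only the four-cycle, 400-MHz figure comes from the text.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count at a clock frequency given in MHz to nanoseconds."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    # One cycle at f MHz lasts 1000 / f nanoseconds.
    return cycles * 1e3 / clock_mhz


t_rp_ns = cycles_to_ns(4, 400)
print(f"tRP is roughly {t_rp_ns:.1f} ns")
```

Four cycles at 400 MHz therefore correspond to about 10 ns.
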

tRCD and tRP in the case of IGZO-based eDRAMs are higher, because the on-current is lower in contrast to Si-based eDRAMs. Furthermore, a refresh time of 300 µs has been assumed for eDRAMs, and a refresh command is issued every 300 µs, which leads to performance degradation, as this is a dead cycle from a memory cycle point of view. The abovementioned delay parameters are used for calibrating DRAM for simulating IGZO-based eDRAM. A CPU trace-driven approach is used for running these benchmarks: instructions are read directly from the proposed benchmarks, and a simplified CPU core model performs nonmemory instructions and accesses memory for load/store instructions. A few of the CPU SPEC2006 benchmarks have been chosen to capture the effect on performance. Fig. 12

Ternary neural networks are an example of a low-resolution neural network where the weights and activations can take the values +1, 0, or −1. A block diagram highlighting the features of the proposed CIM architecture is shown in Fig. 14

The proposed CIM architecture operates on 1.5-bit wide input activations and weights and is efficient in terms of storage and compute. The data flow for performing the CIM operation is described as follows.

are assumed to be stationary, located in the compute array.

Fig. 8. Furthermore, the design was also tested across different temperatures to assess the temperature impact on the dot product computation. The design was extremely resilient to temperature variations, as observed in the read analysis.
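
The read analysis reports RBL voltages near 0 V, 0.65 V, and 1.3 V for the levels "10," "01," and "00," respectively. A minimal Python sketch of how such a readout could be resolved into one of the three levels is given below. The midpoint thresholds and all function names are assumptions made for illustration; the sensing circuit actually used in the design is not described here, and practical reference voltages would be tuned to the measured distributions.

```python
# Nominal end-of-read RBL voltages (in volts) quoted in the read analysis.
NOMINAL_RBL_V = {"10": 0.0, "01": 0.65, "00": 1.3}


def make_thresholds(levels):
    """Return level codes sorted by voltage and the midpoints between neighbors.

    Placing each decision threshold at the midpoint is an assumption.
    """
    ordered = sorted(levels.items(), key=lambda item: item[1])
    codes = [code for code, _ in ordered]
    thresholds = [
        (ordered[i][1] + ordered[i + 1][1]) / 2 for i in range(len(ordered) - 1)
    ]
    return codes, thresholds


def decode_rbl(v_rbl, levels=NOMINAL_RBL_V):
    """Map a sampled RBL voltage to the stored level code."""
    codes, thresholds = make_thresholds(levels)
    for code, threshold in zip(codes, thresholds):
        if v_rbl < threshold:
            return code
    return codes[-1]  # above every threshold: highest-voltage level


for sample in (0.05, 0.60, 1.20):
    print(sample, "->", decode_rbl(sample))
```

With midpoint thresholds at 0.325 V and 0.975 V, each nominal level sits 0.325 V from its nearest threshold, so only the variation-induced spread reported in the Monte Carlo histograms has to be absorbed by that margin.
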

The efficiency of the proposed multilevel cell IGZO-based eDRAM CIM design is quantified using the CIFAR-10 dataset with a representative convolutional neural network that has been trained to use 1.5-bit weights and activations, as shown. The network has four convolutional layers, with the first and second layers containing 32 channels each of size 32 × 32 and the third and fourth layers containing 64 channels each of size 32 × 32. A 3 × 3 kernel has been used. The proposed design achieves 82% Top-1 classification accuracy, compared with the 87% accuracy obtained from an ideal software implementation of the same network. The difference in accuracy stems from quantization loss. The design specifications of the proposed CIM design are given in Table 1. This analysis indicates that the CAAC-IGZO eDRAM can be a promising candidate for large-scale ternary CIM designs with good accuracy.
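
To give a sense of the storage this network requires, the following Python sketch counts the convolutional weights implied by the layer description above. It assumes three input channels (CIFAR-10 images are RGB), ignores biases, and omits any classifier layer, since none of these are specified in the text; the function name is hypothetical.

```python
def conv_weight_count(in_channels, out_channels, kernel=3):
    """Number of weights in one convolutional layer with a square kernel."""
    return in_channels * out_channels * kernel * kernel


# Channel counts: assumed 3-channel RGB input, then the four layers in the text.
layer_channels = [3, 32, 32, 64, 64]

per_layer = [
    conv_weight_count(c_in, c_out)
    for c_in, c_out in zip(layer_channels, layer_channels[1:])
]
total_weights = sum(per_layer)

print("weights per layer:", per_layer)    # [864, 9216, 18432, 36864]
print("total ternary weights:", total_weights)  # 65376
```

With one ternary weight mapped onto one 1.5-bit cell, the four convolutional layers would occupy about 65 thousand bit cells under these assumptions.
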

In this article, we make use of the extremely low leakage CAAC-IGZO-based eDRAM to perform CIM operations for ternary neural networks. The low leakage and high retention time of IGZO can be leveraged to enable multilevel cell functionality, which further increases the storage density. We present a detailed study involving a comparison between the leakage of IGZO and Si-based eDRAM, the different available topologies, and the shortcomings of each topology. We utilize the capacitorless IGZO-based eDRAM for storing 1.5 bits/cell. Architecture-level simulations comparing IPC and energy between DRAM and IGZO-based eDRAM are also presented. The feasibility of this proposal has been verified by performing Monte Carlo simulations for read, write, and retention. The Monte Carlo simulations suggest that the multilevel bit cell is robust to bit cell variations and offers a retention time of 1000 s for the given modeled device. The storage of 1.5 bits/cell allows efficient mapping of ternary weights onto a single bit cell. The overall architecture of the compute array, along with charge sharing for performing the dot product computation, has been presented. This approach is validated by performing compute for a custom neural network on the CIFAR-10 dataset, with the compute array showing good accuracy. The susceptibility of the CIM design to process variations is investigated and detailed in the process variations section.