Scalable 2T2R Logic Computation Structure: Design From Digital Logic Circuits to 3-D Stacked Memory Arrays

In the post-Moore era, post-complementary metal–oxide–semiconductor (CMOS) technologies have received intense interest for possible future digital logic applications beyond the CMOS scaling limits. In the meantime, from the system perspective, non-von Neumann architectures, such as processing-in-memory (PIM), are extensively explored to overcome the bottleneck of modern computers, known as the memory wall, for high-performance energy-efficient integrated circuits. In this article, we propose functionally complete nonvolatile logic gates based on a two-transistor-two-resistive random access memory (RRAM) (2T2R) unit structure, which is then used to form a reconfigurable three-transistor-two-RRAM (3T2R) chain with programmable interconnects for complex combinational logic circuits, and a dense 3-D stacked memory array architecture. The design has a highly regular and symmetric structure, while operations are flexible yet simple, without the need for complicated peripheral circuitry or a third resistive state. Implementations of an XNOR gate and a full adder using the 3T2R chain without extra routing/control gates or resistors are shown as demonstration examples of arithmetic unit design. The proposed computing scheme is intrinsic and efficient, with superior performance in speed and area. Easily integrated as a 3-D stacked array, the proposed memory architecture not only serves as a regular 3-D memory array but also performs logic computation within the same layer and between the stacked layers. Concurrent computations under multiple computation modes for flexible operations in the memory are presented. Bias schemes for selected/half-selected/unselected cells are also explained and verified.


I. INTRODUCTION
Over the past few decades, progress in the semiconductor industry was enabled by the downscaling of the metal-oxide-semiconductor field-effect transistor (MOSFET), serving as the workhorse of digital complementary metal-oxide-semiconductor (CMOS) systems for modern chips. However, technology scaling has reached a plateau due to critical factors, such as increasing power dissipation/heating issues, quantum mechanical effects, and intrinsic parameter fluctuations [1]. Additionally, a memory wall exists between computation and storage units in the conventional von Neumann architecture, which causes significant performance degradation [2]. Data movement between the two consumes most of the computation energy and operation time for applications such as machine learning [3]. Thus, the performance gap between the microprocessor and computer memory keeps growing [4].
Modern technologies are attempting to tackle these barriers from the device level to the architecture level. For instance, emerging devices, such as carbon nanotube FETs (CNFETs) [5], resistive random access memories (RRAMs) [6], and superconducting devices [7], are investigated to support post-CMOS technologies [8]. At the system level, graphics processing units (GPUs) are developed by extensively utilizing parallelism with increasing on-chip cores [9]. Application-specific processors with accelerators are also designed to improve efficiency [10]. Furthermore, the concept of processing/computing in-memory (PIM/CIM) was proposed, aiming to subvert the von Neumann architecture by conducting computation tasks within the memory [11], providing a method to suppress latency and power consumption. As the foundation of PIM schemes, in-memory logic computation based on computing memory devices [12], [13], [14], [15] shows promising potential with fast, low-power, and scalable devices.
In the past 20 years, digital CIM based on computational memory has been devoted to defining novel logic gate concepts and performing digital Boolean operations [16], [17], [18], [19], [20]. Resistive switching devices, such as RRAM, are widely used among these post-CMOS memories. RRAM possesses the advantages of simple device structure, high density, low power, fast speed, decent scalability, and excellent compatibility with the CMOS process [21], [22]. Therefore, it has been considered a suitable candidate for emerging circuit design and novel computation systems [23], [24], [25].
In this article, we propose an in-memory logic computation solution based on a two-transistor-two-RRAM (2T2R) structure, with two bipolar RRAMs connected back-to-back. First, the principle of 2T2R logic gates is introduced with different logic operations (Section III). Then, a uniform three-transistor-two-RRAM (3T2R) circuit structure is proposed to achieve arbitrary combinational logic, with computation methodologies discussed accordingly (Section IV). In addition, a 3-D stacked memory array structure capable of both regular memory functions and logic computation is illustrated to support large-scale integration (Section V). The stacked memory array can perform computations flexibly within one single layer or between different stacking layers.

II. RELATED WORK
Many PIM techniques have been proposed recently, the majority of which are application-specific. For example, works such as [10] and [26] are dedicated to neural network acceleration. Some works [12], [16], [18] deal with general implementations, such as bitwise OR, AND, XOR, and INV. However, functions such as addition and multiplication are not supported due to the difficulty and complexity of logic cascading. Several RRAM-based logic methodologies have been proposed to implement Boolean logic, but each has its own limitations. In [16], a stateful nonvolatile RRAM logic was designed for normally-OFF digital computing by adopting a serial switch. However, it requires a third RRAM resistance state obtained by adjusting the compliance current, which increases the cost of the overhead circuitry. Stateful RRAM-based material implication (IMP) proposed by Borghetti et al. [12] has lengthy operations and a low program margin due to the voltage drop on extra resistors. The one-transistor-one-resistor (1T1R)-based full adder proposed by Wang et al. [20] also requires complex reconfigurable wiring as overhead to change the cell connection in each computation step.
Some other recent schemes, such as memristor-aided logic (MAGIC) [17], memristor ratioed logic (MRL) [26], and complementary resistive switches (CRSs)-based crossbar [27], were also proposed to extend CIM methodologies. However, these computation methods still have difficulty realizing complex functions with a simple and scalable structure. Most of these designs focus only on the realization of fundamental logic functions, e.g., IMP and NAND. Furthermore, sophisticated overhead circuits are often mandatory, such as current-voltage converters to convert RRAM resistance to voltage operands.
In comparison, the proposed design stores the intermediate results of arithmetic operations in RRAM cells for result cascading, which avoids the frequent involvement of complex overhead circuitry for converting results to voltage operands seen in previous works. With a concept and operation scheme similar to [12] and [16], the proposed design avoids using additional resistors to assist logic operation [12] and provides more flexible routing with a p-n-n-p configuration than the p-n-p-n configuration in [16]. Furthermore, it possesses the advantages of nonvolatile memory storage and the capability to build a high-density in-memory computing system. It provides alternatives for applications such as erasable programmable read-only memory (EPROM), which has a high programming voltage, and field-programmable gate array (FPGA), which has reprogrammable interconnects.
FIGURE 1. I-V characteristics of a bipolar HfOx-based RRAM device generated by the ASU RRAM model [22] calibrated to the IMEC device [28]. Inset shows an RRAM device with two terminals (p as anode and n as cathode).

III. 2T2R LOGIC GATE
An RRAM device is a two-terminal element (p as anode and n as cathode), as shown in Fig. 1 (inset), with a metal-insulator-metal stack. Its resistance state changes when a voltage V_pn is applied across the two terminals. For a bipolar RRAM, applying V_pn > 0 (SET) causes the transition from the high-resistance state (HRS) to the low-resistance state (LRS) due to defect migration and formation of conductive filaments (CFs). Conversely, applying V_pn < 0 (RESET) causes the transition from LRS to HRS due to the breakdown of the CF. Fig. 1 depicts the I-V characteristics of a bipolar HfOx-based RRAM device, generated by the Arizona State University (ASU) RRAM model [22] using HSPICE. The model used in this article is calibrated to match the experimental HfOx-RRAM device behavior from IMEC [28] with 20-mV/s SET/RESET pulses. In this article, fast programming pulses with a ramp rate of 0.2 V/ns are used. In addition, V_SET = 2 V and V_RESET = −1.33 V. The relevant parameters used in this RRAM model are listed in Table 1.
The structure of the proposed 2T2R logic gates is shown in Fig. 2, where the two bipolar RRAMs are connected back-to-back. Two nMOS transistors act as access devices to each RRAM. The operation is explained as follows. First, the resistive states of the two RRAMs are initialized as two inputs: P (lower cell) and Q (upper cell). In this work, the LRS (50 kΩ) and HRS (1 MΩ) represent logic "0" and logic "1," respectively. The initialization can be done by applying SET/RESET voltages between the top/bottom terminal and the middle node "M" (labeled in Fig. 2). Then, three voltage pulses are applied simultaneously on the corresponding terminals: V_UL (operational voltage), G_P (enable/control voltage of the lower cell), and G_Q (enable/control voltage of the upper cell). Finally, the two outputs of the logic gate are stored in situ as the final RRAM states after the operation: P′ (lower cell) and Q′ (upper cell).
Overdrive gate voltages are applied to transistors to reduce transistor resistance and avoid any threshold voltage drop between gate and source.
To explain the logic gate principles, a parameter k is defined as the ratio of the SET and RESET voltage magnitudes, k = V_SET/|V_RESET|. For different RRAM devices with different k values, the available operation combinations (OPs) and their corresponding V_UL ranges are different. All the possible cases are listed in Table 2. However, for any given k, the operations AND and IMP are always available to guarantee functional completeness. In this article, the operations are verified based on the IMEC device presented in Fig. 1 with k = 1.5.
As mentioned earlier, the achievable logic operations of the 2T2R gates differ with the V_UL amplitude, as summarized in Table 2. For the RRAM used in this work with k = 1.5, three operations, OP1, OP2, and OP4, are analyzed in Sections III-A to III-C. OP3, OP5, and OPs for other k values can also be obtained according to Table 2. Because of the variations observed in RRAM switching voltage, the actual value of k can fluctuate [30]. For this reason, OP groups that are valid over a range of k (e.g., 1 < k < 2) are preferred, since the range tolerates variation in k and the resistance shift of RRAMs caused by continuous operation.
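The voltage windows themselves are given in Table 2. As a rough cross-check, the windows for the three analyzed OPs can be inferred from the case analysis in Sections III-A to III-C. The sketch below is our reconstruction, not the source's equations, and it assumes that R_HRS is much larger than R_LRS and that transistor voltage drops are negligible. Under these assumptions, a matched pair splits V_UL in half, while a mismatched pair places nearly all of V_UL across the HRS cell. The numeric values use V_SET = 2 V and |V_RESET| = 1.33 V.

```latex
\begin{align*}
\text{OP1 (no switching at } V_{UL}/2\text{, SET at } \approx V_{UL}\text{):}\quad
  & V_{\mathrm{SET}} < V_{UL} < 2\,|V_{\mathrm{RESET}}|
  && 2.00\ \mathrm{V} < V_{UL} < 2.66\ \mathrm{V} \\
\text{OP4 (RESET but no SET at } V_{UL}/2\text{):}\quad
  & 2\,|V_{\mathrm{RESET}}| < V_{UL} < 2\,V_{\mathrm{SET}}
  && 2.66\ \mathrm{V} < V_{UL} < 4.00\ \mathrm{V} \\
\text{OP2 (RESET and SET at } V_{UL}/2\text{):}\quad
  & V_{UL} > 2\,V_{\mathrm{SET}}
  && V_{UL} > 4.00\ \mathrm{V}
\end{align*}
```

Under the same assumptions, the OP1 window is non-empty only for k < 2 and the OP4 window only for k > 1, which is consistent with the stated preference for 1 < k < 2.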

A. OP1: P′ = P (BIT HOLD) AND Q′ = P · Q (AND)
In OP1, V_UL takes the amplitude V_OP1. Here and below, P′ and Q′ denote the final states of the lower and upper cells after the operation. When P = Q = 0 or P = Q = 1, the voltage drops on the upper and lower cells are both half of V_OP1, because the two cells have the same resistance state. Hence, the voltages across P and Q are V_Q,pn = V_OP1/2 and V_P,pn = −V_OP1/2. Because V_Q,pn < V_SET and |V_P,pn| < |V_RESET|, no SET/RESET transition can be triggered. When P = 0 and Q = 1, most of V_OP1 drops across the upper cell in HRS, so that V_Q,pn exceeds V_SET. Thus, Q′ = 0 (LRS) due to a SET process triggered on the upper cell Q. Meanwhile, the lower cell P remains at LRS, so that P′ = 0. In the case of P = 1 and Q = 0, most of V_OP1 drops across the lower cell P. Since P is already in HRS, no transition can take place. Thus, P′ = 1 and Q′ = 0. The truth table containing all the abovementioned cases is shown in Table 3. It indicates the Boolean functions P′ = P (bit hold) and Q′ = P · Q (AND). When Q = 1 (upper cell in HRS), the function Q′ = P · Q is equivalent to Q′ = P (highlighted in green in Table 3). This can be used to conduct "bit transfer" between cells for logic gate cascade, chain logic, and CIM array operations in Sections IV and V.

In OP2, V_UL takes the amplitude V_OP2. In the case of P = Q = 0, V_OP2 is equally divided between the upper and lower cells, so that V_Q,pn = V_OP2/2 and V_P,pn = −V_OP2/2. Since |V_P,pn| > |V_RESET|, this is sufficient to trigger RESET on P, so that P′ = 1. An analysis similar to that of OP1 applies to the cases of (P = 1, Q = 0) and (P = 0, Q = 1). In the case of P = Q = 1, the voltage distribution remains the same as (5). The truth table of OP2 is summarized in Table 4. The Boolean functions available in OP2 are P′ = Q → P (IMP) and Q′ = 0 (bit set). IMP guarantees logic completeness, and any Boolean function can be transformed into multiple IMPs. A "bit transfer" can also be obtained when P = 0, since P′ = Q → 0 = NOT(Q), which is highlighted in blue in Table 4. Note that functional completeness is also provided by combining the NOT operation of OP2 and the AND operation of OP1.

In OP4, a different range of the operation voltage V_UL is used. The outputs of the input cases (P = Q = 0), (P = 0, Q = 1), and (P = 1, Q = 0) are the same as in OP2. If P = Q = 1, since P and Q are both in HRS, their voltage drops are equal and V_Q,pn < V_SET. Therefore, the outputs stay the same as the initial states. The truth table of OP4 is summarized in Table 5.

All the abovementioned OPs (OP1, OP2, and OP4) can be completed in one step by applying the corresponding operational voltage V_UL. Because the input and output variables are both RRAM states, there are two ways to easily cascade 2T2R logic gates.
1) The output of the current gate propagates to the next stage through the bit transfer operation in OP1 or OP4.
2) Different RRAM pairs are selected through pass gate transistors (PGTs) of the 3T2R chain described in Section IV.
Due to the symmetry of the 2T2R structure, swapping the positions of the upper and lower RRAMs is equivalent to reversing the polarity of V_UL, as shown in Fig. 3(a) and (b). However, the two gate control voltages V_GQ and V_GP need to be changed when a negative V_UL is applied on the top terminal. Instead, a positive V_UL can be applied at the bottom terminal while the top terminal is grounded, as shown in Fig. 3(c). Effectively, the circuit in Fig. 3(c) is equivalent to the one in Fig. 3(a). These equivalent circuit conversions provide flexibility in designing combinational circuits and memory arrays, as shown in Sections IV and V.
In this work, the functionality and performance of the circuit design are verified through simulation using HSPICE. All transistors have the same size (W/L = 500/65 nm) using the predictive technology model (PTM) [29]. The width of the transistors is selected to provide a negligible voltage drop across the transistors (V_DS) in their ON states. If transistors with smaller channel width are used, the voltage drop across the nMOS transistor has to be considered and compensated in V_UL. In addition, thick-oxide transistors might be required due to reliability concerns (out of the scope of this article). RRAM devices with low V_SET and V_RESET are expected to alleviate this reliability problem, since the magnitude of V_UL can be lowered.
A low HRS/LRS ratio can become a potential problem for 2T2R gate operation, since the voltage drop over the RRAM in LRS becomes comparable to that over the RRAM in HRS. However, with an RRAM resistance ratio of up to 10^2 or 10^3 [32], [33], correctness of the logic operations can be guaranteed. The lifetime of this 2T2R logic gate also depends on the endurance of the RRAM, for which the best reported performance is 10^12 cycles [31]. With the continuous improvement of RRAM devices in terms of resistance ratio and endurance, the proposed logic gate can become a reliable solution for long-term logic computation.
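The case analysis of Section III can be reproduced with a very small behavioral model. The following Python sketch is illustrative only and is not the HSPICE setup used in the article. It assumes an ideal resistive divider, ideal access transistors, and a one-shot evaluation in which each cell switches according to the voltage it sees at the start of the pulse, with no re-division after a cell switches. The function names and the three example voltages (2.4 V, 3.2 V, 4.4 V) are hypothetical choices; only V_SET, V_RESET, and the two resistance values come from the text.

```python
from itertools import product

# Device values quoted in Section III.
V_SET = 2.0       # V, SET threshold (V_pn > 0)
V_RESET = -1.33   # V, RESET threshold (V_pn < 0)
R_LRS = 50e3      # ohms, logic "0"
R_HRS = 1e6       # ohms, logic "1"


def resistance(bit):
    """Map a logic value to the cell resistance (0 -> LRS, 1 -> HRS)."""
    return R_HRS if bit else R_LRS


def gate_2t2r(p, q, v_ul):
    """One-shot model of a 2T2R operation (assumption: ideal divider).

    p is the lower cell, q the upper cell. The upper cell sees a positive
    V_pn (SET direction) and the lower cell a negative V_pn (RESET
    direction), because the two RRAMs are connected back-to-back.
    """
    r_p, r_q = resistance(p), resistance(q)
    v_q_pn = v_ul * r_q / (r_p + r_q)
    v_p_pn = -v_ul * r_p / (r_p + r_q)
    q_next = 0 if v_q_pn >= V_SET else q      # SET drives the cell to LRS
    p_next = 1 if v_p_pn <= V_RESET else p    # RESET drives the cell to HRS
    return p_next, q_next


def truth_table(v_ul):
    """Return {(P, Q): (P', Q')} for all four input combinations."""
    return {(p, q): gate_2t2r(p, q, v_ul) for p, q in product((0, 1), repeat=2)}


# Hypothetical example amplitudes, one per operation.
V_OP1, V_OP4, V_OP2 = 2.4, 3.2, 4.4

if __name__ == "__main__":
    for name, v in (("OP1", V_OP1), ("OP4", V_OP4), ("OP2", V_OP2)):
        print(name, v, truth_table(v))
```

With these example amplitudes, the model yields bit hold and AND for OP1, IMP and bit set for OP2, and IMP and AND for OP4, which matches the Boolean functions stated in the text. It says nothing about switching dynamics, which is what the HSPICE simulation covers.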

IV. 3T2R CHAIN FOR COMBINATIONAL LOGIC
To design complex arithmetic logic circuits using the 2T2R logic gates, we propose a 3T2R chain structure. An nMOS PGT (1T) is added to connect two adjacent 2T2R units, as indicated in the dashed box in Fig. 4. This allows any two 1T1R units in the chain to form a 2T2R pair and perform the OPs discussed in Section III. Initialization of the RRAM states is done by applying SET/RESET voltages between "In" and "Ti" (i = 1, 2, . . .). The transistor gate control signals G1-Gx enable different programmable interconnections (1: connect; 0: disconnect). For example, with (G1, G2, G3, G4, Gx) = (1, 0, 0, 1, 1), P1 and P4 are connected and form a 2T2R logic gate to perform different OPs. This reconfigurability provides additional routing flexibility that previously proposed designs do not have.

A. XNOR GATE
One 3T2R chain can implement an XNOR gate using the IMP/AND functions.
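One standard way to build XNOR from IMP and AND is the identity XNOR(A, B) = (A → B) · (B → A). The Python sketch below checks this identity exhaustively. It is an assumption that this is the decomposition intended here, and the helper names `imp`, `and_`, and `xnor` are hypothetical; the sketch works on logic values only and does not model the step-by-step cell operations of the chain.

```python
def imp(q, p):
    """Material implication q -> p, the IMP primitive of the 2T2R gate."""
    return int((not q) or p)


def and_(p, q):
    """AND primitive of the 2T2R gate."""
    return p & q


def xnor(a, b):
    """XNOR composed only of IMP and AND: (a -> b) AND (b -> a)."""
    return and_(imp(a, b), imp(b, a))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xnor(a, b))
```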

B. 1-BIT FULL ADDER
A 1-bit full adder can also be realized using the five-unit 3T2R chain to demonstrate the design of an arithmetic block unit, as shown in Fig. 6. The carry out (C_out) and sum (S) are computed step by step from the inputs (A, B, C_in). As plotted in Fig. 7, the adder unit requires nine steps to calculate C_out and ten steps for S. The operations are simulated and verified to compute the results with A = 1, B = 0, and C_in = 1. The intermediate result A B = 0 is obtained in the fourth step and duplicated in the fifth step through the bit transfer operation. This result is reused, because both C_out and S need it for their individual calculations. The result C_out = 1 is stored in P10 in the ninth step, and S = 0 is computed in P4. Considering all eight possible input combinations of A, B, and C_in, the average number of switching events (either SET or RESET) is ≈1 per RRAM per adder operation.
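As a logic-level illustration, a full adder can be composed from the IMP and AND primitives alone. The Python sketch below is one possible decomposition, assumed for illustration; it is not the article's nine- and ten-step sequence and does not model cell assignment in the chain. NOT is obtained as Q → 0, as noted for OP2, and OR as (NOT a) → b. All function names are hypothetical.

```python
def imp(q, p):
    """Material implication q -> p (IMP primitive)."""
    return int((not q) or p)


def and_(p, q):
    """AND primitive."""
    return p & q


def not_(q):
    """NOT via IMP with a constant 0: q -> 0."""
    return imp(q, 0)


def or_(a, b):
    """OR via IMP: (NOT a) -> b."""
    return imp(not_(a), b)


def xnor(a, b):
    """XNOR via IMP and AND: (a -> b) AND (b -> a)."""
    return and_(imp(a, b), imp(b, a))


def full_adder(a, b, c_in):
    """Return (S, C_out) using only the primitives above."""
    x = xnor(a, b)                       # shared intermediate result
    s = xnor(x, c_in)                    # equals a XOR b XOR c_in
    c_out = or_(and_(a, b), and_(c_in, not_(x)))
    return s, c_out


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, full_adder(a, b, c))
```

For the simulated input A = 1, B = 0, C_in = 1, this decomposition gives S = 0 and C_out = 1, in agreement with the results reported above.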

C. READOUT STRUCTURE
After the computation is completed, it may be necessary to read the final result stored in the RRAM cell. The readout process of the 3T2R chain can be done by placing readout circuitry at the top and bottom positions of the chain, as shown in Fig. 8. All horizontal PGTs are conducting (the red path) during the reading process. Then, a small read voltage V_read (e.g., 100 mV) is applied on the middle nodes. The read-cell selection (read enable control) is achieved by the gate control of each 1T1R, similar to the function of the column multiplexer in conventional static random access memory (SRAM) arrays. The readout circuitry can be either a current sense amplifier (CSA) or a transimpedance amplifier (TIA) to sense the difference between RRAMs in HRS and LRS. To reduce the size of the overhead circuitry, a block of 2T2R cells should share one CSA/TIA. The actual number of gates placed in one block is determined by factors such as the details of the physical layout. Therefore, it depends on the CMOS and RRAM technologies used for actual implementation, which are not limited to the ones selected in this study.
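To give a sense of the sensing margin, the read currents can be estimated from the values quoted in this article (V_read = 100 mV, LRS = 50 kΩ, HRS = 1 MΩ). This is a back-of-the-envelope estimate that assumes a single selected RRAM in the read path and neglects the series resistance of the access transistor and PGTs.

```latex
\begin{align*}
I_{\mathrm{LRS}} &= \frac{V_{\mathrm{read}}}{R_{\mathrm{LRS}}}
                  = \frac{0.1\ \mathrm{V}}{50\ \mathrm{k\Omega}} = 2\ \mu\mathrm{A}, \\
I_{\mathrm{HRS}} &= \frac{V_{\mathrm{read}}}{R_{\mathrm{HRS}}}
                  = \frac{0.1\ \mathrm{V}}{1\ \mathrm{M\Omega}} = 0.1\ \mu\mathrm{A}, \\
\frac{I_{\mathrm{LRS}}}{I_{\mathrm{HRS}}} &= \frac{R_{\mathrm{HRS}}}{R_{\mathrm{LRS}}} = 20 .
\end{align*}
```

Under these assumptions, the CSA or TIA has to resolve a current difference of about 1.9 µA between the two states.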

D. DESIGN EVALUATION AND COMPARISON
The proposed 3T2R chain architecture can implement any combinational logic due to function completeness of the 2T2R logic gate. A sequence of logic computation can be finished by managing control signals through finite state machines (FSMs). The 3T2R chain structure also provides reprogrammability on interconnecting PGTs. The regular and compact 3T2R chain structure can also significantly simplify design and fabrication of the post-CMOS digital circuits and systems. Due to non-volatility of RRAM, any stored computation result does not require data refreshing. This can significantly reduce power consumption during idle mode.
The performance of a computing method can be evaluated by its computational complexity in terms of the following: 1) spatial complexity and 2) temporal complexity. For evaluation of the proposed RRAM-based logic circuits, the required number of RRAMs and transistors represents spatial complexity and required computation steps/cycles for logic computation represent temporal complexity.
For the first evaluation, a 1-bit full adder is implemented with the 3T2R chain design and 65-nm CMOS technology. The device area associated with the technology used is summarized in Table 6. The 1-bit adder is compared with multiple 1-bit full adder implementations in terms of area, as shown in Fig. 9. Compared with the CMOS static adder, mirror adder, and transmission gate (TG) adder, the 3T2R chain implementation saves about 53%, 45%, and 45% of area, respectively. Additionally, the 1-bit full adder design of the 3T2R chain is compared with other recent RRAM-based implementations [12], [16], [18], [20], [27], [34], [35] in terms of computation delay and circuit area, as shown in Fig. 10. Overall, the proposed 3T2R chain implementation demonstrates both low computation delay and small circuit area. The CRS-based adder of Siemon et al. [27] requires complicated peripheral circuitry to assist operations. Moreover, CRS has a "destructive read" problem, which limits its practical application. The design proposed in [16] is capable of finishing 1-bit addition in seven clock cycles with 11 RRAMs. However, the adder requires different wiring between these RRAM cells in each step, making it impractical for real circuit implementations. The 1T1R RRAM implementation of Wang et al. [20] is efficient, as it encodes inputs as both voltages and RRAM states. However, sophisticated peripheral circuits, including sense amplifiers and block decoders, are required for resistance-to-voltage conversion. In contrast, the proposed 3T2R-based adder is regular and simple, with no resistance-to-voltage conversion required.
VOLUME 8, NO. 2, DECEMBER 2022
An energy consumption comparison is not feasible in this case, since the proposed design depends on the technologies used in actual implementation and is not limited to the ones used in this study. However, the proposed 2T2R logic gate is estimated to consume ≈10 pJ of energy per programming cycle, considering the operation voltage, the resistance of the RRAM, and the duration of programming. Further lowering the computation energy is possible, considering that sub-pJ switching in RRAM devices has been reported [36], [37].

V. 3-D STACKED ARRAY FOR DATA STORAGE/LOGIC COMPUTATION
In this section, we propose a 2T2R-based 3-D array to achieve dense in-memory logic operations for large-scale integration. A 3-D crossbar array structure is shown in Fig. 11(a) with a two-layer structure connected back-to-back, with each layer being a 1T1R array, to illustrate the relative positioning of transistors, RRAMs, and signal lines. One 1T1R cell can be connected to any other cell from the same layer or the other layer to form a 2T2R logic gate. Two sets of select lines (SLs) running perpendicular to each other are used: SL_U connected to the top nodes of the 1T1R cells in the upper layer and SL_L connected to the bottom nodes of the 1T1R cells in the lower layer. Two sets of word lines (WLs), WL_U and WL_L, are used for switching control of the 1T1R cells. A set of bit lines (BLs) runs in parallel with SL_U, connecting the middle nodes of each 2T2R stack. Upper-level 1T1R cells that share the same BL also share the same SL_U. This 3-D array can be simplified to the diagram shown in Fig. 11(b), with WLs omitted for clarity. Fig. 11(c) displays a 2-D stick diagram describing the vertical view of one physical stack (stack 5) of the 2T2R gate in the center of the 3 × 3 array shown in Fig. 11(b).
The crossbar structure shown in Fig. 11(a) is not feasible for physical implementation, as RRAMs are implemented in the back end of line (BEOL) during fabrication. As an example, a physical implementation of a 2T2R stack is shown in Fig. 11(d) and should be used as the basic structure for implementing the physical layout of a complete system. Note that the 3-D implementation here differs from the state-of-the-art 3-D Xpoint [38] and 3-D vertical RRAMs [39] due to the need for nMOS transistors. Further area saving and structure simplification could be achieved by replacing the nMOS transistors with simpler selector devices.

A. CONVENTIONAL RANDOM ACCESS MEMORY OPERATIONS
The proposed 3-D array can be used as a conventional RRAM-based RAM for read and write operations, similar to 1T1R memory arrays. To write/read a specific cell [e.g., 5U in the array in Fig. 11(b)], WL_UB is enabled, while all other WL_U lines and all WL_L lines are disabled. All BLs are grounded. SL_UB is set to V_SET (SET), V_RESET (RESET), or V_read (read), while the other SL_U and SL_L lines are all grounded.

B. LOGIC COMPUTATION IN-MEMORY
In this 3-D array, the 2T2R OPs proposed in Section III can be carried out within the same layer or across the two layers by enabling the corresponding PGTs. With the different combinations, there are four computation modes available (UL_1, UL_2, LL_1, and LL_2), as shown in Fig. 12. One of the modes, for example, pairs two lower-layer cells (e.g., 5L + 6L) that share the same BL. Fig. 11(b) shows the first computation mode UL_1 with the 2T2R stack 5 (5U-5L pair) in the 3 × 3 array as an example. WL_UB and WL_LB are enabled to select the two 1T1R cells, while all the other WLs are disabled. All BLs are left floating to allow the middle node to be driven by any voltage applied between SL_UB and SL_LB. For example, SL_UB can be biased at various V_UL values while SL_LB is grounded to trigger different OPs, such as V_0 = V_OP1 for OP1 or V_0 = V_OP2 for OP2. Meanwhile, all other SL_U and SL_L lines are floating. Three types of cells can, therefore, be defined by their bias conditions under the abovementioned bias scheme, and they are highlighted in different colors in Fig. 11(b).
1) Selected cells (5U, 5L), which are highlighted in red.
2) Half-selected cells (2U, 2L, 8U, 8L) sharing the same WL and SL_L with the selected cells, highlighted in blue.
3) Unselected cells (others).
When the selected cells go through the designated computation according to the amplitude of V_0, the voltage across half-selected and unselected cells should not exceed V_SET or V_RESET, so that the stored states are preserved. Therefore, an initial predischarge may be required for the SL_U, SL_L, and BL lines connected to half-selected cells before the actual operations. On the other hand, unselected cells are less susceptible to accidental reprogramming because of the protection provided by the disconnected PGT.
For the four listed computation modes, parallel in-memory computations are supported. For instance, computation mode UL_1 can be applied to 2T2R stacks 2, 5, and 8 in parallel, as shown in Fig. 12(a). Meanwhile, all other cells are unselected with their nMOS switches OFF. Similarly, the parallelism of the other operation modes UL_2, LL_1, and LL_2 is shown in Fig. 12(b)-(d), respectively.
Using the flexible computation modes proposed earlier, complex computation solutions can be implemented. We propose two options here as case studies: 1) use both the upper and lower layers identically, for storage and computation and 2) use one layer (e.g., the upper layer) as regular RAM only, for general-purpose memory storage, and the other layer (the lower layer) for data processing and logic computation. Data stored in the upper-layer cells can be transferred to the lower layer via the bit transfer operation of OP1/OP4 under the UL_1/UL_2 modes. Then, the transferred data can be processed in the lower layer of 1T1R cells using computation modes LL_1 and/or LL_2. More sophisticated combinations and processing algorithms can be further explored to manage data storage and sequential computation steps.

VI. CONCLUSION
PIM provides an effective approach to overcome the restrictions of existing von Neumann-based computing methodologies. This article proposes a promising scheme for such applications, from the gate level to the circuit level and the system-architecture level. It illustrates the design of the following: 1) functionally complete, stateful logic gates based on 2T2R; 2) a regular, repeated, and reconfigurable 3T2R chain with programmable interconnects; and 3) a dense 3-D stacked memory array structure capable of performing concurrent computations. The 3-D array integrates the functionalities of the processing element and storage, with multiple computation modes available to achieve flexible calculations inside the memory. It can provide alternatives or improvements to existing EPROM, FPGA, and sensor node applications.