A Fully Non-Volatile Reconfigurable Magnetic Arithmetic Logic Unit Based on Majority Logic

Spintronics has achieved considerable success in addressing the shortcomings of conventional charge-based electronics and the von Neumann architecture by offering novel computational paradigms almost devoid of the leakage and volatility issues of traditional CMOS systems. In particular, hybrid CMOS/MTJ circuits, which integrate the benefits of spintronics with well-evolved CMOS technology, have been extensively investigated as candidates for building next-generation processors. Motivated by the importance of magnetic processors, this work proposes a novel fully non-volatile reconfigurable magnetic arithmetic logic unit (NVRMALU) based on majority logic. The paper covers NVRMALU at both the architecture level and the circuit level by discussing the multi-context hybrid CMOS/MTJ LIM architecture and the operation of the circuit. Furthermore, simulation results of the proposed fully non-volatile reconfigurable magnetic full adder (NVRMFA) making up NVRMALU reveal a total power reduction of around six- to forty-seven-fold compared to the contemporary magnetic full adders (MFAs) discussed here. NVRMALU also consumes six times less power than the double pass transistor clocked CMOS (DPTLCMOS) ALU, qualifying it as an excellent normally-OFF, instant-ON digital system. A four-bit extension of NVRMALU is presented to show the feasibility of the design for multi-bit applications. The transient analysis demonstrates the non-volatile and dynamically reconfigurable traits of NVRMALU in addition to verifying functionality. Additionally, variability analysis has been performed to study the factors controlling the read and write performance of hybrid circuits from a device-level perspective.


I. INTRODUCTION
Excellent computational and data storage capabilities with optimal power, speed, and area have always been the mantra of the semiconductor industry. However, the growth of classical charge-based electronics in the past few years has been curbed at both the device and architecture levels. The saturated scaling of transistors, owing to physical limitations and secondary effects such as leakage, in conjunction with the memory-wall issue at the architecture level, poses severe challenges in meeting the growing demands of emerging data-centric applications like Artificial Intelligence, Image Processing, and the Internet of Things. Hence, there is an increasing emphasis on Spintronics, an emerging nanotechnology that harnesses the intrinsic spin of an electron for computation and storage. In particular, the Magnetic Tunnel Junction (MTJ), the spin device making up Magnetic Random Access Memory (MRAM), has been extensively investigated by both industrial and academic communities for its promising features [1], [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Huo.
The latencies induced by long interconnects between logic and memory, together with the associated power consumption, constitute the memory-wall issue in the traditional von Neumann architecture and are overcome by the adoption of Logic-In-Memory (LIM) architecture. This, in addition to addressing the leakage effect and power overheads by employing non-volatile memory devices like MTJ, also paves the way for the close integration of memory and logic into a single entity. Furthermore, the coalescence of conventional CMOS technology, with its high operating speed and reliability, and the perks of MTJ such as nonvolatility, zero leakage power, large endurance, fast reading capability, high-density integration, and scalability [3], in the form of hybrid CMOS/MTJ circuits following smart architectures like LIM and In-Memory Computing (IMC) (Fig. 1), serves as a successful alternative to charge-based electronics [4]. Consequently, several hybrid CMOS/MTJ logic designs, ranging from magnetic flip-flops [5], non-volatile basic gates [6], and magnetic decoders [7] to magnetic full adders [8], [9], [10], [11], [12] and non-volatile magnetic arithmetic logic units [13], [14], have been reported in the literature.
The magnetic processor, an extension of such hybrid CMOS/MTJ logic paradigms that leverages the superior traits of MTJs, the sophistication of the LIM architecture, and mature CMOS technology, has been the need of the hour [15], [16]. Its potential to alleviate the cache coherence, power, and area overheads of current-day processors by offering complete utilization of the memory bandwidth and nonvolatility has triggered numerous designs in the literature [17], [18], [19], [20]. However, in [17] and [19] processing is still performed using CMOS, and MTJs are used only as non-volatile storage units, thus resembling logic near memory (LNM) architecture rather than LIM. All spin logic (ASL), used in [15] and [18], offers all the advantages of MTJ and eliminates CMOS transistors for logic and storage operations, but it has serious limitations such as high short-circuit power overheads and the need for complex control of clocking, spin diffusion length, and spin channel [21], [22]. Thus, hybrid CMOS/MTJ employing the Spin Transfer Torque (STT) switching mechanism, similar to MRAM cells, is chosen instead of ASL for our design of the arithmetic logic unit (ALU), the heart of a processor. Furthermore, STT-MTJ devices are more mature commercially than ASL [18] and can be integrated with existing CMOS technology thanks to advancements in 3-D fabrication techniques in the back-end-of-line (BEOL) process. Nevertheless, the majority logic paradigm, the inherent feature of ASL [18], is adopted in our application for its ability to realize complex/intensive Boolean functions with fewer gates than NAND/NOR logic [23].
Majority logic is a special case of threshold logic, where the output is evaluated as true or high (''1'') when more than half of the ''N'' inputs (here, N > 1 and N is odd) are true [24]. It can be expressed in terms of AND/OR as $\mathrm{MAJ}(a, b, c) = a \cdot b + b \cdot c + a \cdot c$, where a, b, and c are the inputs of a three-input majority gate. Additionally, majority logic together with an inverter is functionally complete, capable of realizing any Boolean expression [23]. This has spurred the design of non-volatile full adders using emerging nanotechnologies such as magnetic quantum dot cellular automata [25], ReRAM [23], ASL [26], nanomagnetic logic [27], and single-electron tunneling devices [28], whose primary logic primitive is majority logic. However, to the best of our knowledge, not many hybrid CMOS/MTJ full adder designs using majority logic and following the LIM architecture are reported in the literature. Furthermore, the existing LIM-based full adders [8], [9], [10], [11], [12] are either partially non-volatile, where only a few of the input operands are made non-volatile, or they suffer from high power and area consumption. Therefore, encouraged by the foregoing discussion, a Fully Non-Volatile Reconfigurable Magnetic Full Adder (NVRMFA) (Figs. 3, 4, 5) based on majority logic and using a hybrid CMOS/MTJ LIM architecture is presented in this article as the building block of a Fully Non-Volatile Reconfigurable Magnetic Arithmetic Logic Unit (NVRMALU). Full non-volatility is incorporated in our designs by employing a modified multi-context hybrid CMOS/MTJ LIM architecture (Fig. 1) [22], where all the input operands are stored in the MTJs. Such a setup eliminates the need to back up data during power loss, as required by partially non-volatile counterparts [8], [9], [10], [11]. Also, it is our understanding that the digital comparator functionality, a crucial operation in the processor datapath, is not included in the existing magnetic ALU designs reported in the literature [13], [14]. Therefore, a novel two-bit magnetic magnitude comparator and equality detector is integrated in NVRMALU, making it the first of its kind in the literature. Furthermore, the novel use of the multi-context hybrid CMOS/MTJ structure with LIM architecture, incorporating the benefits of both LIM and IMC architectures, together with the ultra-low-power performance of the proposed designs, distinguishes them from their contemporaries.
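The majority-logic definition given earlier in this section can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `maj`, `maj3_and_or`, `and2`, and `or2` are hypothetical, and the sketch only checks the Boolean identities, not any circuit behaviour.

```python
from itertools import product


def maj(*bits: int) -> int:
    """Majority vote over an odd number of 0/1 inputs."""
    if len(bits) % 2 == 0:
        raise ValueError("majority logic needs an odd number of inputs")
    return int(sum(bits) > len(bits) // 2)


def maj3_and_or(a: int, b: int, c: int) -> int:
    """Three-input majority written as a.b + b.c + a.c."""
    return (a & b) | (b & c) | (a & c)


# Exhaustive check that the AND/OR form matches the threshold definition.
for a, b, c in product((0, 1), repeat=3):
    assert maj(a, b, c) == maj3_and_or(a, b, c)


def and2(a: int, b: int) -> int:
    """Tying one majority input to 0 leaves a 2-input AND."""
    return maj(a, b, 0)


def or2(a: int, b: int) -> int:
    """Tying one majority input to 1 leaves a 2-input OR."""
    return maj(a, b, 1)
```

Together with an inverter, the AND and OR forms above are one way to see why majority logic is functionally complete.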
The remainder of this paper is organized as follows. Section II discusses the fundamentals of MTJ, the STT switching mechanism, and the multi-context hybrid CMOS/MTJ architecture and their advantages over their counterparts, along with a brief review of contemporary full adder designs using hybrid CMOS/MTJ LIM architecture. Section III presents the proposed NVRMALU design and describes its various functionalities and the working of the circuit, along with the four-bit extension of the ALU design. In Section IV, a comparative analysis of the proposed designs with their contemporaries and a variability analysis using Monte Carlo (MC) simulations are performed. Finally, Section V discusses the findings of the preceding sections, along with insights on fabrication challenges, as a conclusion and scope for future work.

II. BACKGROUND AND RELATED WORK

A. MTJ BASICS
Spintronics is an amalgamation of magnetism and electronics, exploiting the innate magnetic properties of an electron in spin devices such as MTJ for computational nanoelectronics. MTJ is a multilayered nanopillar comprising a non-magnetic dielectric layer sandwiched between two ferromagnetic layers, namely the reference layer (RL) and the free layer (FL).
A very thin MgO layer is employed as the dielectric layer to ensure high tunnel magnetoresistance (TMR) [29], a quantum mechanical phenomenon and the primary physics underlying the working of MTJ. The magnetic orientation of RL is fixed, while the relative magnetic orientation of FL with respect to RL decides the resistance of the device, making MTJ a programmable resistor due to the spin-dependent tunneling of electrons. For our application, Rap, the high resistance state of MTJ, in which FL is anti-parallel to RL, is used to store logic high or ''1'', while the low resistance state, Rp, in which FL is parallel to RL, denotes logic low or ''0''. This work uses p-MTJ (perpendicular magnetic tunnel junction) instead of i-MTJ (in-plane magnetic tunnel junction) because of the lower power dissipation, higher thermal stability, lower current density, and better scalability of p-MTJ [30].
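As a worked illustration of the two resistance states, and assuming the commonly used definition of the TMR ratio (the formula itself is not stated in this section), the anti-parallel resistance follows from Rp and TMR as:

```latex
\[
\mathrm{TMR} = \frac{R_{ap} - R_{p}}{R_{p}}
\quad\Longrightarrow\quad
R_{ap} = (1 + \mathrm{TMR})\, R_{p}
\]
```

Under this definition, a TMR of 200% gives Rap = 3Rp, so an MTJ storing ''1'' presents three times the resistance of one storing ''0''.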
The writing of data values into MTJ by switching the state of the device from Rp to Rap and vice versa can be achieved by various switching mechanisms that run the gamut from Field Induced Magnetic Switching (FIMS) [31] and Thermally Assisted Switching (TAS) [32] to Spin Orbit Torque (SOT) based switching [33] and Voltage Controlled Magnetic Anisotropy (VCMA) [34]. However, STT emerged as the optimal choice for our application given its simplicity and commercial maturity, while the other switching techniques suffer from limitations. For instance, FIMS and TAS suffer from high power dissipation and scalability issues [8]; SOT-MRAM is very sensitive to process variations, difficult to scale, and possesses lower memory density than STT [33]; and VCMA involves complex voltage control for write operations [35]. Moreover, these techniques are yet to mature commercially, which is required for a magnetic processor associated with MRAM. The nominal values of the vital MTJ parameters considered for this work [8] are summarised in Table 1. The write circuit from [36] is adopted for our design, as shown in Fig. 5(C). The four transistors (P1, P2, N1, N2) form an H-bridge structure (Fig. 5(C)) that provides a bidirectional current (denoted by red and blue arrows in Fig. 5(C)) for switching the MTJ state from P to AP and vice versa using the STT effect.

B. MULTI-CONTEXT HYBRID CMOS/MTJ LIM ARCHITECTURE
Various smart architectures such as LIM [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], LNM [17], [19], [37], and In-Memory Computing (IMC)/Computation in Memory (CIM) [38], [39] have been proposed as solutions to the von Neumann bottleneck. Advancements in 3-D stacked integrated circuit technology have made close positioning of logic and storage units possible, as done in LNM structures, alleviating interconnect delay and power and widening the memory bandwidth [40]. However, LNM cannot be deemed true IMC, since the computation and storage units still remain fundamentally distinct entities. IMC and LIM serve as enhanced versions of LNM, enabling in-situ computation while offering nonvolatility. The fine difference between IMC and LIM is that in the former, the memory array structure (MRAM, DRAM, SRAM) is exploited to perform computation in addition to storage by utilizing the innate analog quantities of memory devices, like the variable resistance of MTJ, along with peripheral circuitry like sense amplifiers and write circuits. In LIM, by contrast, the computational ability is embedded into memory cells by adding a plane of distributed logic units, thus modifying the memory entities to provide in-situ computation [8], [40]. Unlike IMC, LIM can be used for all hierarchies of memory, not only main memory and cache, which aids in addressing big data challenges where datasets are too large to fit in main memory [41]. However, LIM inhibits the maximum usage of memory as a storage unit, as the additional transistors in the modified memory cells introduce power and area overheads during the inherent storage functionality [41]. Therefore, we have adopted a modified multi-context hybrid CMOS/MTJ LIM architecture [22] (Fig. 1) that provides the flexibility to be used with IMC, as shown in Fig. 1(B), according to the application, thus serving as a bridge between LIM and IMC architectures. The same structure as in Fig. 1(A) can be realized using the array architecture in Fig. 1(B) by configuring the bit lines, source lines, and word lines accordingly. Multi-context or multiple-bit hybrid logic architecture (Fig. 1(A)) has multiple non-volatile cells forming a configuration plane for fast switching between contexts [22], [42].
Such an architecture facilitates the simultaneous use of memory cells for computation and storage, similar to [41], by controlling the enable signals En1 to EnN (here, N is a natural number) corresponding to the NMOS switch block. Logic computation is carried out by making all enable signals En1 to EnN high along with CLK. The two memory operations, writing into an MTJ and reading the MTJ state, are done by selecting only the one enable signal corresponding to the desired MTJ when CLK is low and high, respectively. Also, the symmetric structure of the design (Fig. 1) enhances the circuit performance by mitigating the impact of sneak currents [22], masking the functionality of the circuit layout post fabrication [43], and improving data security through the provision for retrieval of lost data from neighboring MTJ cells containing the same data in either true or complementary form [22]. The separated precharge sense amplifier proposed in [44] is employed as the read circuit (Figs. 3, 4, 5) for its higher sensing reliability and lower power consumption, although it consumes more area than the precharge sense amplifier. Together with the write circuit and the MTJ logic tree, it constitutes the three primary blocks of LIM architecture [6].
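The selection of compute, write, and read operations through CLK and the enable signals can be summarised in a short behavioural sketch. This is an illustration only: the function name `decode_mode` and the ''idle'' label for combinations the text does not describe are assumptions, and no electrical behaviour is modelled.

```python
from typing import Sequence


def decode_mode(clk: int, enables: Sequence[int]) -> str:
    """Classify the operation selected by CLK and enable signals En1..EnN."""
    active = sum(1 for e in enables if e)
    # All enables high together with CLK high: logic computation.
    if len(enables) > 1 and active == len(enables) and clk == 1:
        return "compute"
    # Exactly one enable high: a memory operation on that MTJ,
    # write when CLK is low and read when CLK is high.
    if active == 1:
        return "read" if clk == 1 else "write"
    # Any other combination is not described in the text (assumed idle here).
    return "idle"


if __name__ == "__main__":
    print(decode_mode(1, [1, 1, 1]))
    print(decode_mode(0, [0, 1, 0]))
    print(decode_mode(1, [0, 1, 0]))
```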

C. RELATED WORK
This section offers a brief qualitative review of existing magnetic adder and ALU designs using hybrid CMOS/MTJ LIM architecture. In [8], [9], [10], [11], and [35], the adders are partially non-volatile and use MTJs as an ancillary device for storing only one or a few of the input operands (only B is stored in MTJs in [9], [10], [11], [35] and only C is stored in MTJs in [8]), while CMOS transistors perform logic evaluation. This prohibits their use during power loss without data backup systems, which incur additional power (especially static dissipation) and area overheads. Such designs are also prone to input scheduling issues, requiring precise synchronization of input signals and CLK during the evaluation phase for proper output [45]. Hence, recognizing the need for full nonvolatility, researchers in [12], [46], and [47] have reported fully non-volatile adders that employ magnetic flip-flops or additional MTJs to store all the input operands. In [46], magnetic flip-flops are integrated with a CMOS full adder circuit to provide nonvolatility to the inputs, but at the cost of increased area and power. In [47], the additional MTJs that are part of the self-terminate write circuit are used for storing inputs A and B; however, much like [12], which uses 2-MTJ cells for storing each input (A, B), the logic computation is still done using CMOS transistors. Thus, they do not harness the full benefits of MTJ, and the power and area of such circuits are exacerbated when extending to multiple bits/inputs. These issues are addressed in [45], [48], and [49], which report fully non-volatile full adders with MTJs as part of the logic tree performing computation. However, the criss-cross arrangement and series connection of MTJs in these designs complicate the writing process, as a single write circuit will not suffice to write different input values, thus increasing write power and area. Furthermore, these designs possess the advantage of increased read margin due to higher resistances owing to the series connection of MTJs, but at the cost of increased read latency [50]. As for ALU designs following hybrid CMOS/MTJ LIM architecture, the circuits reported in [13] and [14] are partially non-volatile; the discussion of ALU designs using other nanotechnologies such as ASL, domain wall, etc. is outside the scope of this work.

III. PROPOSED MAGNETIC ARITHMETIC LOGIC UNIT
In this section, a novel fully non-volatile magnetic arithmetic logic unit, coupling the prowess of the multi-context hybrid CMOS/MTJ LIM architecture and the majority logic paradigm, is presented. Such an ALU finds use in applications like data feature extraction in edge computing/detection for big data applications needing low data precision, as demonstrated in Fig. 2 [51]. This enables the main processor to operate on the already processed (cooked) data from NVRMALU for high-precision computations, which, in addition to easing the von Neumann bottleneck (reduced data transfer between main processor and memory), introduces parallel processing in which the main processor and the proposed NVRMALU function simultaneously, increasing performance. This is made possible by the non-volatile trait paired with the simultaneous use of NVRMALU as memory and for computation, allowing the main processor to perform a different operation on the same data operands stored in NVRMALU. Furthermore, the nonvolatility of the presented ALU reduces the standby power consumption by making it possible to switch off the unused portions of the non-volatile memory.
FIGURE 2. Block diagram depicting the data transfer between the main processor and NVRMALU, a part of the non-volatile memory unit (NV memory unit). The yellow arrows denote the conventional communication between the processor and the NV memory unit. In contrast, the green arrow illustrates the transfer of cooked data by NVRMALU from the NV memory unit to the main processor for further computations [51].
Since binary addition is the central pillar of an ALU, the design of a 1-bit NVRMFA is presented first in the following subsection. Dynamic reconfigurability, the ability of a circuit to realize multiple logic functionalities during run-time using an external trigger such as control signals [52], is embedded into the NVRMFA by design. This reconfigurability is harnessed to extend the proposed NVRMFA (Figs. 4, 5) to NVRMALU by the addition of peripheral circuitry such as multiplexers, XNOR gates, AND gates, and inverters. Furthermore, the reconfigurability of the proposed design qualifies it as a polymorphic gate, which can be applied to realize hardware security primitives such as the prevention of IC tampering during fabrication, IC fingerprinting, and IC watermarking [43].

A. 1-BIT MAGNETIC ADDER DESIGN
In this section, a novel 1-bit two-input fully non-volatile magnetic full adder is presented to address the shortcomings of its predecessors in terms of power, power delay product (PDP), and non-volatility. The circuit works on dynamic logic, where the sense amplifier acts as a current comparator performing logic computation based on the difference in currents in the left and right branches (Figs. 3, 4, 5), set by their corresponding resistance difference, |Rl-Rr|. Here, Rl and Rr denote the net resistance of the MTJ logic tree in the left and right branches, respectively. The adder circuit comprises a Carry sub-circuit (C ckt) and a Sum sub-circuit (S ckt), corresponding to the two primary outputs of an adder. Similar to [50] and [53], the Carry function is realized using a three-input majority logic (M3), while a five-input majority logic (M5) underlies the working of the S ckt. The governing Boolean equations for the carry and sum functions are given by

$C_{out} = M3(A, B, C_{in}) = A \cdot B + B \cdot C_{in} + A \cdot C_{in}$ (1)

$Sum = M5(A, B, C_{in}, \overline{C_{out}}, \overline{C_{out}})$ (2)

Here in Eq. 1 and Eq. 2, A, B, and C in are input operands, whose weights are one, and C out is the output of M3 with weight = 1.
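The carry and sum functions described above can be checked with a minimal Boolean sketch. It assumes that M3 and M5 are plain majority votes and that the sum takes the inverted carry twice, matching the five operands listed for design 1; the names `maj` and `full_adder` are hypothetical and nothing here models the sense amplifier.

```python
from itertools import product


def maj(*bits: int) -> int:
    """Majority vote over an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)


def full_adder(a: int, b: int, cin: int) -> tuple:
    """Return (sum, carry) using only majority votes and one inversion."""
    cout = maj(a, b, cin)                  # carry: three-input majority
    cout_bar = 1 - cout                    # inverted carry
    s = maj(a, b, cin, cout_bar, cout_bar)  # sum: five-input majority
    return s, cout


# Exhaustive check against ordinary binary addition.
for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    assert 2 * cout + s == a + b + cin
```

The check passes for all eight input combinations, which is why the carry has to be available before the sum can be evaluated.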
Figs. 3, 4, and 5 show the S ckt design 1, the C ckt, and the S ckt design 2, respectively, where the conventional multi-context hybrid CMOS/MTJ structure [42] is modified to achieve adder functionality along with full non-volatility. Here, unlike in [8], [9], [10], and [11], one bit of data is stored by a single MTJ instead of a pair of complementary MTJs. All the MTJs present in the C ckt and S ckt are reconfigurable, and their state can be changed dynamically using the write circuit. This is in contrast to the conventional multi-context architecture [42], where either the right branch or the left branch is composed of fixed MTJs serving as a reference resistance. The modification achieves a higher resistance difference |Rl-Rr| and read margin by storing all input operands in MTJs in the left branch and their complements in the right branch, instead of fixing a reference resistance in the right branch, which yields a smaller difference. Also, it is difficult to realize a reference resistance value using the parallel arrangement of MTJs while preserving the symmetric structure. Additionally, the use of reconfigurable MTJs in both branches aids in the extension of the adder to a comparator, as shown in subsequent sections, albeit at the cost of increased write energy. Two designs of S ckt are proposed: design 1 follows Eq. 2, and design 2 is an optimized version of design 1.

1) DESIGN 1
This design follows the conventional 5-input majority logic, where five MTJs (Fig. 3), namely A, B, C in, C out bar 1, and C out bar 2, in the left branch and their complements in the right branch are used to store the operands of Eq. 2, given the symmetric structure. Thus, for ''N'' 1-bit inputs, 2N MTJs are required, N on each branch. Here, all MTJs have TMR = 200%, and the truth table of the sum logic along with the corresponding resistances is summarized in Fig. 3(B). However, this design bears the disadvantage of increased write energy and area overhead, which is optimized in design 2.

2) DESIGN 2
The original Eq. 2 is modified by replacing its two C out bar terms with a single C out bar term of weight = 2, as represented in Eq. 3. Here, the conventional norm of having an odd number of inputs for majority logic is amended by introducing the concept of weight/voter status (denoted by *) [24] for C out bar. The modified Eq. 3 is reflected in the hardware by tweaking the resistance value of the MTJ storing the C out /C out bar terms to a higher value compared to the other input operand MTJs. [...] will not yield satisfactory alteration in the MTJ resistance to satisfy Eq. 3. Thus, similar to [55] and [56], TMR, the cardinal physics behind MTJ, is chosen as the candidate for variation to achieve the desired higher resistance for MTJ C out bar (or MTJ C out). A heuristic approach was adopted in determining the required TMR ratio by calculating Rl and Rr for different combinations. Finally, a TMR ratio of 600% was fixed for MTJ C out bar (MTJ C out), as supported by the MTJ model library [57] and advancements in MTJ fabrication with MgO as a barrier layer. For instance, S. Ikeda et al. experimentally demonstrated TMR = 604% at room temperature for a pseudo-spin-valve MTJ structure in [58]. Fig. 5 shows the optimized S ckt with 4 MTJs on both branches, reducing the energy, latency, and area of the circuit. The carry output and its complement from C ckt (Fig. 4) are given as inputs to the write circuits along with the primary inputs (A, B, C in) and their complements.

FIGURE 5. (B) Buffer circuitry to shift CLK by 2.5 ns to obtain CLK sum. (C) STT write circuit schematic [36]. Here, Vdda = 1.25 V and the widths of transistors P1, P2, N1, N2 are taken as 1 µm for reliable writing. Input data controls the direction of current, while the Enable signal turns the circuit ON or OFF using the control circuit.

TABLE 3. Truth table for the proposed 1-bit NVRMFA along with corresponding resistances in left and right branches.
Due to the increased Rl and Rr (Table 3) compared to design 1 (Fig. 3(B)), a dip in sensing current (I sensing) reduces the read energy. Also, the read latency is improved owing to a predominantly lesser [...] value for design 2 (Table 3) compared to design 1 (Fig. 3(B)). Therefore, design 2 is adopted for the NVRMFA design, and in further discussions design 2 is referred to as S ckt. However, it should be noted that, from a fabrication standpoint, design 2 contains MTJs with different TMR and device parameters positioned in close proximity, which requires special, expensive manufacturing techniques with lower yield [59], [60].
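The effect of giving the C out bar MTJ a TMR of 600% can be explored with an idealised resistance model. The sketch below rests on several assumptions that are not stated explicitly in the text: the MTJs of a branch are treated as ideal parallel resistors, every MTJ shares the same Rp, Rap = (1 + TMR) Rp, and the sense amplifier is reduced to an ideal comparison in which the output is ''1'' when Rl > Rr. The names `mtj`, `parallel`, and `sum_design2` are hypothetical.

```python
from itertools import product

R_P = 1.0  # all resistances are expressed in units of Rp


def mtj(bit: int, tmr: float) -> float:
    """Resistance of one MTJ: Rp for logic 0, (1 + TMR) * Rp for logic 1."""
    return R_P * (1.0 + tmr) if bit else R_P


def parallel(resistances) -> float:
    """Net resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


def sum_design2(a: int, b: int, cin: int):
    """Idealised design-2 sum: returns (sum, Rl, Rr) in units of Rp."""
    cout = int(a + b + cin >= 2)
    # Left branch: A, B, C in at TMR = 200 % and C out bar at TMR = 600 %.
    left = [(a, 2.0), (b, 2.0), (cin, 2.0), (1 - cout, 6.0)]
    rl = parallel(mtj(bit, tmr) for bit, tmr in left)
    # Right branch stores the complements of the left-branch bits.
    rr = parallel(mtj(1 - bit, tmr) for bit, tmr in left)
    return int(rl > rr), rl, rr


for a, b, cin in product((0, 1), repeat=3):
    s, rl, rr = sum_design2(a, b, cin)
    assert s == a ^ b ^ cin  # matches the full-adder sum
    print(a, b, cin, f"Rl={rl:.3f}Rp", f"Rr={rr:.3f}Rp", f"sum={s}")
```

Under these assumptions the four-MTJ branch reproduces the full-adder sum for all eight input combinations, which is consistent with replacing the two C out bar MTJs of design 1 by a single higher-TMR device.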

3) OPERATION OF THE CIRCUIT
In C ckt (Fig. 4), the clock signal, CLK, is the primary control signal, typical of dynamic logic, controlling the write and read operations of the circuit. From Eq. 2, it can be inferred that C ckt has to be evaluated first, since its outputs (C out /C out bar) serve as inputs for S ckt. This mandates two separate clock pulses with an offset of 2.5 ns between them to ensure evaluation of the sum logic after carry computation. Fig. 5 shows the S ckt, where CLK is shifted by 2.5 ns using buffer circuitry (Fig. 5(B)) consisting of cascaded inverters to obtain CLK sum, the primary control signal for S ckt. The adder circuit operates in two phases, the precharge phase (P) and the evaluation phase (E). In the precharge phase, when clocks CLK and CLK sum are low or ''0'', the input values are written into the corresponding MTJs using the write circuit.
During the evaluation phase, when CLK and CLK sum are high or ''1'', the logic computation of Carry and Sum takes place based on the speed of current discharge induced by the difference in Rl and Rr. The left and right nodes of C ckt and S ckt connected to the outputs Carry & Carrybar (Fig. 4) and Sum & Sumbar [...]

118950 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

[...] where the values are written into MTJs simultaneously in both branches by using the same set of enable signals and the other being the simultaneous writing of primary inputs (A, B, C in) and their complements in both subcircuits C ckt and S ckt. A duration of 2.5 ns, the same as the evaluation time, is required for the writing of each MTJ, and the evaluation phase of C ckt always overlaps with the last write operation of S ckt for a given cycle. Accordingly, a precharge duration of 10 ns is considered with a 2.5 ns evaluation phase, yielding an operating frequency of 80 MHz for the adder. The remaining transistors are in the cut-off region during the precharge phase. In the evaluation phase, transistors M1, M4, M9, and M10 in C ckt and S ckt are switched OFF, cutting the connection between Vdd and the circuit. Transistors M11 and M12, the isolating transistors (which isolate the sense amplifier from [...] The voltage difference between nodes X and Y, |V X − V Y |, a reflection of the difference in Rl and Rr, is amplified by inverters I1 and I2, whose outputs serve as gate voltages for M7 & M8, controlling the logic evaluation. Here, unlike in conventional LIM architecture [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], there are two tailing discharge transistors, M19 & M20 (C ckt) and M21 & M22 (S ckt), instead of just one, in order to write distinct input values (excluding complements of inputs) into MTJs as in the case of the 2-bit comparator design. When two or more of the inputs are ''0'' or low, Rl < Rr, and vice versa when two or more of the inputs are ''1'' or high. Table 3 presents the truth table for the adder along with the Rl and Rr values for all input combinations. For Rl < Rr, the left node discharges faster than the right node and eventually reaches zero while the right node is pulled up to Vdd owing to the cross-coupled structure of the sense amplifier, and vice versa when Rl > Rr. This evaluates the carry logic (left node) and its complement (right node) in C ckt (Fig. 4) and the sum (left node) along with its complement (right node) in S ckt (Fig. 5). Functionality is verified by transient analysis, presented in Fig. 6. The working of the circuit can be better understood by considering the example case A, B, C in = 0, 1, 0 highlighted in Fig. 6. From Table 3, it can be observed that for C ckt, Rl = 0.428Rp and Rr = 0.6Rp, so Rl < Rr, while for S ckt, Rl = 0.403Rp and Rr = 0.375Rp, so Rl > Rr, thus giving carry = 0 and sum = 1, in line with the foregoing discussion.
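The C ckt figures quoted for the example case and the 80 MHz operating frequency can be reproduced with a short worked calculation. It assumes that the three MTJs of each branch act as ideal parallel resistors and that Rap = 3Rp for TMR = 200%; parasitics and transistor resistances are ignored.

```latex
\[
R_l = \left(\frac{1}{R_p} + \frac{1}{3R_p} + \frac{1}{R_p}\right)^{-1}
    = \frac{3}{7}\,R_p \approx 0.4286\,R_p ,
\qquad
R_r = \left(\frac{1}{3R_p} + \frac{1}{R_p} + \frac{1}{3R_p}\right)^{-1}
    = \frac{3}{5}\,R_p = 0.6\,R_p
\]
\[
f = \frac{1}{t_{\mathrm{precharge}} + t_{\mathrm{eval}}}
  = \frac{1}{10\,\mathrm{ns} + 2.5\,\mathrm{ns}}
  = 80\,\mathrm{MHz}
\]
```

Here the left branch holds A, B, C in = 0, 1, 0 and the right branch holds their complements, so Rl < Rr and the carry evaluates to 0, matching the values 0.428Rp and 0.6Rp quoted in the text.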

TABLE 5. Truth table for the Subtraction functionality of the proposed NVRMALU, along with corresponding resistances in left and right branches.

B. 1-BIT MAGNETIC SUBTRACTOR
Fig. 7 is the top-level block diagram of the proposed 1-bit NVRMALU cell, built using NVRMFA, including the input-output interconnects, while Fig. 8 shows its internal schematic. Each sub-circuit (Fig. 8(A),(B [...] 4 to configure the NVRMALU to perform different functionalities. The output nodes CL and CR (Fig. 8(A)) are mapped to Output 1 and Output 3 of the NVRMALU cell (Fig. 7), respectively. The left and right nodes of S ckt (Fig. 8(B)), SL and SR, serve as inputs to a 2:1 Mux controlled by the same above-mentioned control signal, S0, whose output is connected to Output 2 of the NVRMALU cell (Fig. 7).
The reconfigurability of the MTJs and the resemblance between the addition and subtraction functions are exploited to achieve subtractor functionality. Borrow (Fig. 8(A)) and difference (Fig. 8(B)) are evaluated using the Boolean expressions given by Eq. 4. The MTJs CL1, CL2, and CL3 (Fig. 8(A)) are configured as Abar, B, and B in, respectively, while their complements are written into MTJs CR1 to CR3 (Fig. 8(A)) by setting the control signals as S0 = 1, S1 = 0 for Mux carryL and Mux carryR (Table 4). Similarly, MTJs SL1 to SL4 and SR1 to SR4 are configured to store Abar, B, B in, and B out, the borrow output, and their complements, using Mux sumL and Mux sumR, respectively. The operation of the circuit is very similar to that of NVRMFA, where in the precharge [...] 5. The subtractor transient waveform is given in Fig. 9, where an example case, A, B, B in = 1, 1, 0, is elucidated. Furthermore, the majority logic primitives can achieve other multi-input Boolean functions outside the framework of the conventional functionalities discussed above. For instance, IMPLY logic can be implemented by setting B in = 1 in subtractor mode (M3(Abar, B, 1) = A → B) without any additional gates/circuitry.
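The borrow configuration (Abar, B, B in on the three carry-circuit MTJs) and the IMPLY special case can be checked with a small Boolean sketch. Only the borrow output is modelled, since that is what the configuration and the IMPLY expression above pin down; the function names `maj3`, `borrow`, and `imply` are hypothetical.

```python
from itertools import product


def maj3(a: int, b: int, c: int) -> int:
    """Three-input majority vote."""
    return int(a + b + c >= 2)


def borrow(a: int, b: int, b_in: int) -> int:
    """Borrow-out of A - B - B_in, evaluated as M3(Abar, B, B_in)."""
    return maj3(1 - a, b, b_in)


def imply(a: int, b: int) -> int:
    """A -> B obtained by fixing B_in = 1 in subtractor mode."""
    return maj3(1 - a, b, 1)


# Exhaustive checks against the arithmetic and logical definitions.
for a, b, b_in in product((0, 1), repeat=3):
    assert borrow(a, b, b_in) == int(a - b - b_in < 0)
for a, b in product((0, 1), repeat=2):
    assert imply(a, b) == int((not a) or b)
```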

C. BASIC LOGIC GATES
The six basic logic functions, AND, OR, XOR, NAND, NOR, and XNOR, are inherently encapsulated within the adder functionality, as the carry function, when expressed in sum-of-products form, is essentially a combination of AND and OR functions, while the sum function stems from the XOR operation. Thus, the full adder function can be modified to realize 2-input logic gates by configuring C in as a control signal to switch the circuit from AND/NAND/XOR functionality to OR/NOR/XNOR functionality (Table 4), given by Eq. 5. The simultaneous production of true and complementary outputs in C ckt and S ckt, owing to the cross-coupled structure and differential nature of the sense amplifier, eliminates the need for additional inverters to produce the complements of the AND, OR, and XOR functions. When C in = 0, the carry output (CL) (Fig. 8(A)) gives the AND logic (Output 1), while CR gives the NAND logic (Output 3); in S ckt (Fig. 8(B)), SL gives the XOR logic and SR gives the XNOR logic (SL or SR is selected as Output 2 according to the S0 signal using the 2:1 Mux in Fig. 7).
Similarly, when C in =1, CL in C ckt gives OR logic (Output 1) and its complement, NOR (CR) is available at Output 3, while SL in S ckt gives XNOR, SR gives XOR logic (Output 2 selected between SL and SR using S0 signal).Fig. 10 shows the transient response for all six logic gates along with an example case, A,B = 0,1 highlighted.Additionally, the 5-input majority logic underlying S ckt can be harnessed to achieve 3-input logic operations such as AND/NAND and OR/NOR by configuring MTJ C out bar as 0 and 1 respectively, according to Eq.6
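The gate mapping described above can be summarised in a small behavioural sketch. The function names are hypothetical, and the 3-input case assumes that the C_out_bar MTJ carries a weight of two in the 5-input majority (one reading of the TMR=600% replacement in Fig. 3(C)); the source equations themselves are not reproduced here.

```python
def maj(*bits):
    """Majority vote over an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)

def two_input_gates(a, b, c_in):
    """Outputs of the carry and sum nodes when C_in acts as the mode signal."""
    carry = maj(a, b, c_in)   # CL: AND for c_in=0, OR for c_in=1
    s = a ^ b ^ c_in          # SL: XOR for c_in=0, XNOR for c_in=1
    # CR and SR are the complementary nodes of the differential sense amplifier.
    return {"CL": carry, "CR": 1 - carry, "SL": s, "SR": 1 - s}

def three_input_gate(a, b, c, fixed):
    """5-input majority with a weight-2 input tied to `fixed`.

    fixed=0 yields AND(a, b, c); fixed=1 yields OR(a, b, c).
    The complementary node would give NAND / NOR respectively.
    """
    return maj(a, b, c, fixed, fixed)
```

With A,B = 0,1 and C_in = 0, the sketch gives AND = 0, NAND = 1, XOR = 1, and XNOR = 0.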

D. 2-BIT MAGNETIC COMPARATOR
Digital comparators are vital datapath elements, instrumental in implementing numerous controller algorithms within a processor, for applications ranging from general-purpose computer architecture operations such as memory addressing logic, queue buffers, and test circuits to specific applications such as digital image matching, arithmetic sorting, data compression, and digital neural networks [61]. Hence, their inclusion in the NVRMALU is a crucial next step in realizing a complete magnetic processor. A digital comparator is made up of two components: the magnitude comparator (Mag-Comp), which compares the magnitudes of its inputs to assess which is greater or lesser, and the equality detector (Eq-Detec), which checks whether the inputs are equal in magnitude. Since the concept of significant bits (MSB, LSB) is meaningful only for inputs with two or more bits, a 2-bit 2-input digital comparator (Fig. 8) is designed and demonstrated in this work. Also, the functionality of a 1-bit comparator diminishes into a 1-bit zero detector, which could be implemented using a magnetic flip-flop [5]. The intrinsic comparator nature of the sense amplifier is utilized to perform the magnitude comparison using C_ckt (Fig. 8(A)), instead of the relatively computationally expensive subtraction operation used in conventional magnitude comparators [62]. However, the differential nature of the sense amplifier inhibits the realization of equality detectors without peripheral circuits like XNOR gates. When the inputs A and B are equal, the left and right nodes (Fig. 8) partially discharge to 350 mV and are susceptible to discharging arbitrarily, depending on the instantaneous current profile in the left and right branches. Thus, a 2-bit equality detector circuit using S_ckt (Fig. 8(B)) and conventional AND and XNOR gates (Fig. 8(C)) is designed to realize the complete comparator operation.

In Fig. 8, MTJs CL1, CL2, and CL3 store the values A0, A1, and A1, respectively, using Mux_carryL, where A0 and A1 are the LSB and MSB of input A, respectively. Here, A1 is written twice (MTJs CL2 and CL3) to establish a weight of 2 for the MSB. Similarly, bits B0 and B1 of input B are written into the right-branch MTJs, CR1-CR3 in C_ckt, using Mux_carryR (Fig. 8(A) used as Mag-Comp). When A > B, Rl > Rr, and thus CL=1, CR=0 (Table 6), and vice versa for A < B, following the same working as the NVRMFA discussed above. The truth table of the two-bit two-input comparator is given in Table 6, which contains the resistances of each branch for all input combinations. Although the circuit in Fig. 8(C) evaluates the inputs for equality, it is still volatile, and thus its output and complement are stored in MTJs SR4 and SL4, respectively, using Mux_sumR and Mux_sumL, respectively. With the output of the equality detection stored, the operation is made fully non-volatile by also storing the inputs A0, A1 in MTJs SL1, SL2 and B0, B1 in MTJs SR1, SR2, using Mux_sumL and Mux_sumR, respectively (Fig. 8(B) used as Eq-Detec). As shown in Table 6, SR outputs "1" (high) only when A=B; to ensure this, MTJs SL3 and SR3 are fixed as 1/Rap and 0/Rp, respectively. The transient response is presented in Fig. 11, where two example cases, one for A=B and one for A≠B, are highlighted. Additionally, the nonvolatility trait of the proposed NVRMALU has been demonstrated in Fig. 12 for all functionalities. Furthermore, the dynamic reconfigurability of the proposed design, i.e., its ability to switch between multiple functionalities of the NVRMALU at run-time using the S0 and S1 signals, has been elucidated in Fig. 12.

FIGURE 12. Demonstration of nonvolatility and dynamic reconfigurability (shown using the green box, illustrating the change in S0, S1 according to Table 4) of the proposed NVRMALU circuit. The power loss scenario has been simulated from 37.5ns to 45ns and from 70ns to 82.5ns. The data retention process during power loss and the evaluation after power resumption are elucidated using the red square box. The black arrows in the output signals highlight how the outputs evaluated before power loss are recovered upon power resumption. Simulation performed for 100ns, considering example cases for all functionalities of NVRMALU.

FIGURE 13. Block diagram of the 4-bit extension of NVRMALU for all but the comparator functionality. CLK and CLK_sum are shifted by 2.5ns for every stage using buffer circuits as shown.
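The weighting scheme of the 2-bit comparator (MSB written twice, LSB once) can be illustrated with a purely logical sketch. The function name and return convention are hypothetical; the sketch models only the intended decisions and does not capture the arbitrary sense-amplifier outcome described for the A=B case.

```python
def compare_2bit(a1, a0, b1, b0):
    """Return (a_gt_b, a_lt_b, a_eq_b) for 2-bit inputs A = a1a0 and B = b1b0."""
    left = a0 + 2 * a1     # CL1 <- A0, CL2 and CL3 <- A1 (weight 2 for the MSB)
    right = b0 + 2 * b1    # CR1 <- B0, CR2 and CR3 <- B1
    gt = int(left > right)  # magnitude comparison, as performed by C_ckt
    lt = int(left < right)
    # Equality detection in the style of Fig. 8(C): per-bit XNOR followed by AND.
    eq = int((a0 == b0) and (a1 == b1))
    return gt, lt, eq
```

For example, A=2 (a1a0 = 10) against B=1 (b1b0 = 01) gives `(1, 0, 0)`, while equal inputs give `(0, 0, 1)`.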

E. FOUR-BIT EXTENSION OF ALU
The proposed NVRMALU has been demonstrated for multiple bits (Fig. 13) by designing a four-bit NVRMALU (excluding the comparator functionality), obtained by cascading 1-bit NVRMALU cells in a ripple-carry fashion. The four-bit extension for addition, subtraction, and logic gates is shown in Fig. 13. The four-bit extension of the comparator is not presented here, as it requires additional circuitry and a special architecture [61] other than the ripple-carry structure, which is chosen for its simplicity. In Fig. 13, four 1-bit NVRMALU cells (Fig. 7) are connected sequentially, following the rule that N-bit computation requires N 1-bit NVRMALU cells. A0, B0 to A3, B3 are given as inputs to the corresponding cells, while the C_out of the previous cell serves as an input to the next cell. A certain degree of parallelism has also been achieved by incorporating a pipelined architecture with four separate sets of clock pulses, enabling the simultaneous operation of all four cells. The CLK and CLK_sum signals for cell i (i=0,1,2,3) are each shifted by 2.5ns relative to cell i−1 using the buffer circuit (Fig. 13) to obtain a four-stage pipeline structure. Such a setup ensures that all four carry and sum output bits are available after 20ns, instead of at the end of the 4th cycle of CLK and CLK_sum as in the case of [8], which would be 50ns (1 cycle = 12.5ns). This corresponds to a 2.5-fold (250%) improvement in operating frequency, albeit at the cost of area compared to the conventional non-pipelined structure. Since all inputs and intermediate values such as C_out,i (i=1,2,3) are saved in MTJs, the circuit is completely non-volatile and suitable for instant ON-OFF applications and power loss scenarios, while also eliminating the need for registers to store intermediate values [8]. The control signals S0 and S1 are common to all cells, configuring the functionality/mode of each cell according to the opcode in Table 4.

FIGURE 14. Transient response of the 4-bit NVRMALU for the addition, subtraction, and basic logic gate functionalities. Colored bands A, B, C, D, E highlight the evaluation phases of C_ckt and S_ckt for cell i (i=0,1,2,3, respectively). Outputs 1 and 3 of cell i (i=1,2,3) overlap with (are available alongside) Output 2 of cell i−1. Example case A=13, B=10 is taken for the 4-bit addition and subtraction operations. Example case A=6, B=12 is considered for the AND/NAND/OR/NOR/XOR/XNOR logic operations.

TABLE 7. Performance comparison amongst the 1-bit DPTLCMOS FA, existing 1-bit MFA designs, and the proposed 1-bit NVRMFA @ 1.2V and 80MHz using the 45nm CMOS technology node.

TABLE 8. Performance metrics comparison for various functionalities between the proposed NVRMALU and DPTLCMOS ALU [63].
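The ripple-carry cascade and the latency figures quoted for the four-bit extension can be reproduced with a short behavioural sketch. The function names are hypothetical, and the pipelined latency is computed under the assumption that the last cell finishes one full cycle plus three 2.5ns stage shifts after the start, which is one reading consistent with the 20ns figure.

```python
def maj(*bits):
    """Majority vote over an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)

def ripple_add(a, b, n_bits=4, c_in=0):
    """Cascade n_bits 1-bit adder cells; return (sum_bits_as_int, carry_out)."""
    carry, total = c_in, 0
    for i in range(n_bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = maj(ai, bi, carry)          # C_out of cell i feeds cell i+1
    return total, carry

def ripple_sub(a, b, n_bits=4, b_in=0):
    """Cascade n_bits 1-bit subtractor cells; return (diff_bits_as_int, borrow_out)."""
    borrow, total = b_in, 0
    for i in range(n_bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ borrow) << i
        borrow = maj(1 - ai, bi, borrow)    # M3(Abar, B, B_in)
    return total, borrow

CYCLE_NS = 12.5        # one CLK cycle, from the text
STAGE_SHIFT_NS = 2.5   # clock shift between consecutive cells, from the text

def sequential_latency_ns(n_cells=4):
    """One full clock cycle per cell, as in the non-pipelined case."""
    return n_cells * CYCLE_NS

def pipelined_latency_ns(n_cells=4):
    """Assumed model: one cycle plus one stage shift per additional cell."""
    return CYCLE_NS + (n_cells - 1) * STAGE_SHIFT_NS
```

For the example case A=13, B=10, `ripple_add` yields sum bits 0111 with a carry-out of 1 (i.e. 23), and `ripple_sub` yields 0011 with no borrow (i.e. 3). The two latency helpers return 50ns and 20ns, a ratio of 2.5.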

IV. RESULTS AND ANALYSIS
This section assesses the robustness of the proposed circuit design and analyzes its resilience when subjected to the inevitable process variations of fabrication. In that regard, a comparative study between the NVRMFA, around which the working of the NVRMALU revolves, and its contemporaries is performed using performance indicators such as power, delay, and power-delay product (PDP). Following that, a variability analysis of the adder circuit, evaluating the effects of process variations in the MTJ on the read and write performance, is discussed. All the simulations were performed using the CMOS 45nm GPDK technology library and PMA_MTJ_6.1.5_Beta5, a Verilog-A-based model library [57], on the Cadence Spectre simulation platform.

A. QUALITATIVE ANALYSIS OF PERFORMANCE METRICS
Table 7 presents a qualitative comparison of performance metrics amongst the proposed NVRMFA, various existing hybrid CMOS/MTJ LIM adders, and the double pass transistor clocked CMOS (DPTLCMOS) adder obtained from the standard cell library of the STMicroelectronics design kit [63], [64]. To provide a fair comparison, all adder circuits in Table 7 were simulated using the CMOS 45nm GPDK technology and the PMA_MTJ library, with an 80 MHz operating frequency (precharge phase = 10ns, evaluation phase = 2.5ns) and a supply voltage of 1.2V. This ensures the same simulation setup and environmental conditions as for the proposed NVRMFA. Furthermore, the choice of magnetic adders for comparison is restricted to adders based on the STT switching mechanism to enhance the accuracy of the comparison.
From Table 7, it can be seen that a significant power reduction has been achieved for the proposed NVRMFA compared to other MFAs [8], [9], [10], [11], [12]. The static power dissipation is high for the MFAs [8], [9], [10], [11], [12], owing to the use of CMOS transistors for computation, which causes significant leakage losses during steady state. The dynamic power of the proposed NVRMFA is also greatly reduced compared to other MFAs [8], [9], [10], [11], [12], because in the proposed design only the transistors associated with the CLK and CLK_sum signals contribute to dynamic power consumption during the evaluation phase. In the MFAs [8], [9], [10], [11], [12], by contrast, the transistors corresponding to the inputs switch when CLK transitions from high to low and vice versa, in addition to the other transistors associated with CLK. Such a large power reduction for the NVRMFA is also possible because only a very low current, on the order of nA, passes through the MTJ logic tree, owing to the high resistance (on the order of MΩ) set by the write circuits connected in parallel, whose enable line is made low during the evaluation phase, in contrast to other MFAs [8], [9], [10], [11], [12]. Additionally, it is observed that most of the works [8], [9], [10], [11], [12] have not disclosed the integration of the write circuit with the main circuit or its performance indexes such as write power and delay. Also, given that the primary idea of LIM is to use the stored values for computation, the write operation and associated values are excluded from Table 7, where only values corresponding to the logic evaluation phase are reported. Furthermore, the power metrics in Table 7 correspond to the active mode of the adder circuit. It is to be noted that the standby power of the proposed NVRMFA is zero, as power can be completely switched OFF given the full nonvolatility, unlike the partially non-volatile MFAs [8], [9], [10], [11], [12], which incur standby power losses. However, the carry and sum delays of the NVRMFA are higher than those of the MFAs [8], [9], [10], [11], [12], owing to the double-tail discharge structure of the adopted sense amplifier (Figs. 4 and 5). Nevertheless, the reduction in overall power makes up for it, as seen through the PDP values in Table 7.
In order to benchmark the performance of the designed NVRMALU, the DPTLCMOS adder is extended into a DPTLCMOS ALU, as shown in the appendix, and its performance is compared with that of the NVRMALU and summarized in Table 8. The power corresponding to all functionalities of the NVRMALU (addition, subtraction, and logic gates) is found to be reduced by around sixfold, demonstrating superior performance. Therefore, the proposed NVRMFA and NVRMALU emerge as excellent instant ON-OFF digital systems for applications where the system is predominantly in standby mode [65], despite suffering from area overheads (high device count).

B. VARIABILITY ANALYSIS
Reliability analysis is imperative to understand the deviation in circuit performance under process variation during manufacturing and to identify remedial solutions for a successful design. To that end, a comparative analysis of the variability in power of the proposed NVRMFA and other MFAs is presented in Table 9. Also, to evaluate the sensitivity of the designed NVRMALU to variation in MTJ parameters, Fig. 15 has been plotted to identify the dominant MTJ parameter controlling the circuit performance.
In Table 9, 1000 MC runs have been performed by considering a 3σ variation in CMOS device parameters such as channel length and threshold voltage, and a 3% (σ=1%, as mentioned in [66]) Gaussian variation in MTJ parameters such as the MgO barrier thickness (t_ox), the free layer thickness (t_sl), and TMR. Since the MTJ parameters t_ox, t_sl, and TMR are the most susceptible to variations during fabrication [67], a 3% Gaussian variation is considered only in these parameters, as supported by the model, while keeping other material parameters constant. Additionally, the effects of Joule heating have been considered to understand the stochastic nature of MTJ switching. In Table 9, the variability coefficient (σ/mean) of the proposed NVRMFA is remarkably lower than that of all other MFAs except MFA [11], to which it is similar, suggesting better immunity of the proposed circuit towards process variations.
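The sampling scheme and the variability coefficient used above can be sketched as follows. This is only an illustration of the statistics, not of the Spectre Monte Carlo flow: the nominal parameter values are illustrative placeholders, the helper names are hypothetical, and the seed is fixed solely for repeatability.

```python
import random
import statistics

def sample_mtj_params(nominal, rel_sigma=0.01, rng=random):
    """Draw one Gaussian sample per parameter with sigma = rel_sigma * nominal.

    rel_sigma = 0.01 corresponds to the 3% (3-sigma, sigma = 1%) spread in the text.
    """
    return {name: rng.gauss(value, rel_sigma * value)
            for name, value in nominal.items()}

def variability_coefficient(samples):
    """Variability coefficient, sigma / mean, of a list of samples."""
    return statistics.pstdev(samples) / statistics.mean(samples)

# Illustrative nominal values (assumptions, not taken from the paper's tables).
nominal = {"t_ox_nm": 0.85, "t_sl_nm": 1.3, "tmr": 2.0}

rng = random.Random(0)                       # fixed seed for repeatability
runs = [sample_mtj_params(nominal, rng=rng) for _ in range(1000)]
cv_t_ox = variability_coefficient([r["t_ox_nm"] for r in runs])
```

With 1000 runs, `cv_t_ox` comes out close to the imposed 1% relative sigma; in the actual analysis the same ratio would be evaluated on the simulated power of each adder rather than on an input parameter.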
An understanding of the various causes of failure in read and write operations, expressed as the read error rate (RER) and write error rate (WER), respectively, is crucial in determining the reliability of the designed circuit. Since the NVRMALU revolves around the NVRMFA, a step-wise approach to studying the parameters controlling the RER and WER of the NVRMFA has been adopted and is presented in Fig. 15. In order to illustrate the dominance of MTJ parameters in deciding circuit performance, only MTJ variations are considered in the following discussion, while keeping CMOS parameters constant. Because process variations during fabrication affect all the MTJ parameters, the deviation in a single MTJ parameter cannot be isolated for analysis; hence, all three parameters are considered simultaneously. However, the analysis is carried out with a focus on t_ox and t_sl in Fig. 15(A) and Fig. 15(B), respectively, to determine which parameter has the most detrimental effect. Also, it is to be noted that read disturbance error is eliminated because only the leakage current flows (Figs. 4 and 5), which is lower than the critical switching current of the MTJ (I_co = 38.2µA) by a large margin.
Sense margin (SM) is defined as the difference in node voltages, |V_X − V_Y|, which is amplified by the inverters I1 and I2 in Figs. 3, 4, 5, and 8, controlling the discharge speed of transistors M7 and M8 during logic computation. Thus, this voltage difference must be large enough to produce differential discharge rates in M7 and M8, enabling reliable deterministic logic computation. SM is obtained as the product of the sensing current (I_sensing) and the resistive difference between the left and right branches (Δ = |Rl − Rr|). SM can be calculated by determining the node voltages at X and Y for the carry and sum sub-circuits of the NVRMALU during simulation and then evaluating their difference, which is found to be directly proportional to Δ. It is a crucial parameter indicating the success and efficiency of the read operation. First, SM (given by Eq. 7), which controls the read performance [68], is plotted as a function of t_ox in the presence of 3% Gaussian variation in t_sl and TMR in Fig. 15(A). Similarly, in Fig. 15(B), SM is plotted as a function of t_sl while simultaneously varying t_ox and TMR by 3% following a Gaussian distribution. The relation among SM, Δ, and Rp is given by Eq. 7, which, together with the Rp equation [68] given by Eq. 8, explains the growing trend of SM with t_ox in Fig. 15(A).
In Eq. 7, K is a positive number from Table 3.
In Eq. 8, F is the fitting parameter associated with the resistance-area product (RA) of the MTJ, φ is the energy barrier height, and A_F is the cross-sectional area of the MTJ [68]. The steeper trend in SM for S_ckt compared to that for C_ckt is due to two reasons, one being the higher resistance values caused by the use of TMR=600% in S_ckt. The higher I_sensing, owing to larger access transistor widths (M13-M20 in Fig. 5), and the lesser Δ (Table 3) constitute the second reason. Thus, Fig. 15(A) shows an improvement in read margin through an increase in Rp values, given the symmetric nature of the circuit, where Δ primarily controls the read decision process. This encourages the use of high-t_ox MTJs with high Rp values, as in the case of VCMA-based MTJs [34], to address the issue of reduced Δ in the modified multi-context hybrid architecture with a parallel arrangement of MTJs. Furthermore, Fig. 15(B) shows the minimal effect of t_sl on SM, as t_sl is not directly related to Rp (Eq. 8), indicating the dominant nature of t_ox compared to t_sl.
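The growth of Rp, and hence of SM, with t_ox can be illustrated numerically. The sketch below assumes the exponential tunnel-barrier form of Rp that is commonly used in PMA MTJ compact models (t_ox in angstrom, area in µm², φ in eV); it is not a reproduction of the paper's Eq. 7 or Eq. 8. The default values of `f` and `phi_ev`, the junction area, and the function names are all assumptions made for illustration.

```python
import math

def rp_ohm(t_ox_angstrom, area_um2, f=332.2, phi_ev=0.4):
    """Parallel-state resistance in an assumed compact-model form:

        Rp = t_ox / (F * sqrt(phi) * A) * exp(1.025 * t_ox * sqrt(phi))
    """
    root_phi = math.sqrt(phi_ev)
    prefactor = t_ox_angstrom / (f * root_phi * area_um2)
    return prefactor * math.exp(1.025 * t_ox_angstrom * root_phi)

def sense_margin_v(i_sensing_a, delta_ohm):
    """SM as the product of sensing current and branch resistance difference."""
    return i_sensing_a * delta_ohm

# Sweep t_ox in 0.05 nm (0.5 angstrom) steps for a hypothetical 40 nm circular MTJ.
area = math.pi * 0.02 ** 2                       # um^2, illustrative assumption
sweep = [rp_ohm(t, area) for t in (8.0, 8.5, 9.0, 9.5)]
```

Under these assumptions Rp rises monotonically (and roughly exponentially) with t_ox, so any Δ that scales with Rp gives a sense margin that rises with t_ox, in line with the trend described for Fig. 15(A).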
It was observed during the MC simulations (Fig. 15(A),(B)) that RER increases with increasing t_ox despite the increasing SM. This issue can be traced back to the failure to write the desired values into the MTJs for proper logic computation, caused by the increasing t_ox value and the consequent rise in Rp (Eq. 8), as shown in Fig. 15(C). Therefore, it can be concluded that I_MTJ > I_co and a sufficient I_pulse are the two crucial conditions for a proper write operation with the STT mechanism, which in turn determines the success of the logic evaluation/read operation. Consequently, it was observed that the WER of S_ckt is lower than that of C_ckt due to the increased I_MTJ (wider access transistors), and that the bit error rate (BER) increases as the WER increases. Also, it was inferred that the read operation is more prone to process variations in CMOS parameters, as the sense amplifier, the primary component of the read operation, is composed only of CMOS devices, while the write operation depends greatly on MTJ resistance variations. However, these failures in write and read operations are inevitable due to unavoidable process variations during fabrication at the device level. Furthermore, they are aggravated by device scaling to deep submicron technology nodes, where the restricted supply voltage causes reduced write current and switching probability. Thus, error correction schemes at the circuit level are mandatory to tackle these issues, such as the dynamic current/charge boosting technique, which increases I_MTJ according to process variations [70]; the adaptive write scheme, where the I_pulse length is modified according to the deviation in MTJ parameters [71]; and adaptive read schemes, which dynamically change the reference resistance of the sense amplifier [72]. In addition, other circuit-level methods, such as amplification of |V_X − V_Y| using cascaded inverters (I1, I2 in Figs. 4 and 5) to improve SM and increasing the widths of the access transistors (Figs. 4 and 5) and of the write circuit (Fig. 5(C)), are adopted to improve the reliability of the circuit at the cost of area and power.

V. CONCLUSION AND SCOPE FOR FUTURE WORK
In this article, as an attempt to design a magnetic processor, a novel ultra-low-power magnetic arithmetic logic unit (NVRMALU) has been presented, whose advantages over its counterparts include full nonvolatility, dynamic reconfigurability, and low power, making it ideal for normally OFF and instant ON applications. The extension of the NVRMALU to multi-bit computations has been demonstrated for all but the comparator functionality; its multi-bit extension, together with the inclusion of bit-wise functionalities such as logical and arithmetic shift operations for a complete magnetic ALU, constitutes a scope for future work. Also, from the variability analysis, it can be inferred that the resistance of the MTJ, a memristor, is the most crucial parameter. Precise control during manufacturing of parameters such as t_ox, which control the MTJ resistance, using advanced fabrication tools and techniques [73], as well as error correction schemes at the circuit level, are mandatory for robust performance. Furthermore, it is to be noted that spin devices such as the MTJ are yet to evolve to the mature and sophisticated stage achieved by CMOS technology, especially in terms of fabrication and reliability. In particular, the fabrication of MTJs with different TMR values and characteristics, by employing special techniques such as localized rapid thermal annealing with different annealing temperatures and durations, along with different MgO crystal oxidation conditions, remains to be explored [74], [75].

APPENDIX DOUBLE PASS TRANSISTOR CLOCKED CMOS ALU CIRCUIT
Kindly refer to Fig. 16.

FIGURE 1. Demonstration of the resemblance and flexibility to switch between the multi-context hybrid CMOS/MTJ LIM architecture shown in (A) and in-memory computing using the array architecture shown in (B). In (B), SL1-SLN, serving as inputs to the sense amplifier, and BL1-BLN are source lines and bit lines, respectively. They control the direction of the write current (shown by red and blue arrows), and En1-EnN act as word lines controlling MOS_L1-MOS_LN and MOS_R1-MOS_RN for selecting the corresponding MTJs, similar to (A).

FIGURE 3. (A) Circuit diagram of design 1 for S_ckt with 5 MTJs on each branch with the same TMR=200%. Here, MTJs A, B, C_in constitute Req, shown with a resistor in (C). (B) Truth table corresponding to design 1 with Rl, Rr values for all input combinations along with Δ. (C) Demonstration of the replacement of the 5-MTJ structure with a 4-MTJ structure with TMR=600% for MTJ C_out_bar and the corresponding equation.

FIGURE 5. (A) Design 2 of S_ckt, the optimized version that constitutes the Sum sub-circuit of the proposed 1-bit NVRMFA. The carry output and its complement from C_ckt (Fig. 4) are given as inputs to the write circuits along with the primary inputs (A, B, C_in) and their complements. (B) Buffer circuitry to shift CLK by 2.5ns to obtain CLK_sum. (C) STT write circuit schematic [36]. Here, Vdda = 1.25 V and the widths of transistors P1, P2, N1, N2 are taken as 1µm for reliable writing. The input data controls the direction of the current, while the Enable signal turns the circuit ON or OFF using the control circuit.

FIGURE 6. Transient waveform for the NVRMFA simulated for 100ns, covering all 8 input combinations (1 cycle = 12.5ns). Orange and blue rectangular strips denote the evaluation phases of C_ckt and S_ckt, respectively. Example case A, B, C_in = 0,1,0 is highlighted by the red box.

FIGURE 7. A top view/block diagram of the proposed NVRMALU cell.

TABLE 4. Opcode combinations for the various arithmetic and logical functionalities of the proposed NVRMALU.

FIGURE 8. Circuit diagram of the proposed NVRMALU. (A) Carry sub-circuit (C_ckt) with muxes (Mux_carryL & Mux_carryR) to configure the ALU for different functionalities by providing the corresponding inputs to the write circuit. (B) Sum sub-circuit (S_ckt) of the NVRMALU with the C_ckt outputs, primary inputs for arithmetic and logic operations, and the equality detector output serving as inputs to the muxes (Mux_sumL & Mux_sumR). Here, S_ckt performs the equality detection operation. (C) Peripheral circuit aiding S_ckt in realizing the equality detection operation.

Fig. 7 is the top-level block diagram of the proposed 1-bit NVRMALU cell, built using the NVRMFA, including the input-output interconnects, while Fig. 8 shows its internal schematic. Each sub-circuit (Fig. 8(A),(B)) of the NVRMALU comprises two 4:1 multiplexers (Mux_carryL & Mux_carryR in C_ckt and Mux_sumL & Mux_sumR in S_ckt) corresponding to two write circuits, one for each branch. The four functionalities of the NVRMALU (addition, subtraction, logical operations (6 basic logic gates), and 2-bit digital comparison) and their corresponding input data values are mapped to the write circuits configuring the MTJs (CL1-CL3 & CR1-CR3 in Fig. 8(A) and SL1-SL4 & SR1-SR4 in Fig. 8(B)) accordingly, using Mux_carryL & Mux_carryR and Mux_sumL & Mux_sumR, respectively. The 4:1 Mux control signals, S0 and S1, are set according to the opcodes summarized in Table 4 to configure the NVRMALU to perform different functionalities. The output nodes CL and CR (Fig. 8(A)) are mapped to Output 1 and Output 3 of the NVRMALU cell (Fig. 7), respectively. The left and right nodes of S_ckt (Fig. 8(B)), SL and SR, serve as inputs to a 2:1

TABLE 6. Truth table for the proposed 2-bit comparator along with the corresponding resistances in the left and right branches.

FIGURE 10. Transient response for the six basic logic gates functionality of the NVRMALU. The green box shows how C_in is used as the mode signal for switching between AND/NAND and OR/NOR modes. XOR and XNOR outputs are produced in both modes. Example cases A,B = 0,1 for AND/NAND mode and A,B = 1,0 for OR/NOR mode are highlighted.

FIGURE 15. (A) Plot of the variation in sense margin (SM for C_ckt & S_ckt) for variation in t_ox in steps of 0.05nm in the presence of 3% Gaussian deviation in t_sl and TMR, obtained using 1000 MC runs. (B) Plot of the variation in sense margin (SM for C_ckt & S_ckt) for variation in t_sl in steps of 0.1nm in the presence of 3% Gaussian deviation in t_ox and TMR, obtained using 1000 MC runs. (C) Plot showing the variation of Rp for a stand-alone MTJ against t_ox varied in steps of 0.025 nm, while keeping other parameters constant. (D) Plot of the variation in I_pulse length for different V_MTJ values against t_ox varied in steps of 0.025 nm, while keeping other parameters constant. Here, the I_pulse lengths for the P->AP and AP->P transitions are in close proximity, thus their average values are reported on the Y-axis.
Therefore, as the second step, to analyze the factors affecting write performance, Figs. 15(C),(D) are plotted, where Fig. 15(C) suggests a decline in the current across the MTJ (I_MTJ) for an increase in t_ox, leading to a reduced probability of deterministic switching of the MTJ state, as given by the stochastic switching equation in [69]. When I_MTJ is less than I_co (38.2µA) or in its proximity, a sufficiently long switching current pulse (I_pulse) is required to completely switch the MTJ state. The length of I_pulse is directly proportional to the Rp of the MTJ, as highlighted by Fig. 15(D), where the I_pulse length for different values of the voltage applied across the MTJ (V_MTJ), and hence of I_MTJ, is plotted against t_ox. Increasing V_MTJ increases I_MTJ, speeding up the switching process; hence the inverse relation between V_MTJ and I_pulse in Fig. 15(D). In Fig. 15(D), for V_MTJ=500mV, beyond t_ox=0.925nm, switching does not occur irrespective of the I_pulse length, because Rp rises to an extent where I_MTJ < I_co (I_MTJ ≈ 37.6µA), as given by Fig. 15(C); this is eliminated for higher V_MTJ values. Another interesting trend in Fig. 15(D) is that for V_MTJ=750mV, and particularly for V_MTJ=1V, I_pulse increases like a step function. This is due to different ranges of I_MTJ with different threshold values, much like different electron energy bands at the subatomic scale. For a given range of I_MTJ, where I_MTJ is sufficiently higher than I_co, enough STT current is produced to keep the I_pulse length constant despite varying Rp.
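The first write condition discussed above (I_MTJ must exceed I_co) can be expressed as a simple threshold check. The sketch treats the MTJ as a fixed resistor, so that I_MTJ = V_MTJ / R, which is a simplifying assumption since the real junction resistance depends on bias and state; the function names are hypothetical, while the 38.2µA and 37.6µA figures come from the text.

```python
I_CO_A = 38.2e-6   # critical switching current quoted in the text (A)

def mtj_current_a(v_mtj_v, r_mtj_ohm):
    """Current through the MTJ under the fixed-resistor assumption."""
    return v_mtj_v / r_mtj_ohm

def can_switch(v_mtj_v, r_mtj_ohm, i_co_a=I_CO_A):
    """True when the write current exceeds the critical switching current."""
    return mtj_current_a(v_mtj_v, r_mtj_ohm) > i_co_a

# Resistance at which 500 mV drives only about 37.6 uA, the failing point
# reported for t_ox beyond 0.925 nm.
r_limit_ohm = 0.5 / 37.6e-6

fails_at_500mv = not can_switch(0.5, r_limit_ohm)
passes_at_750mv = can_switch(0.75, r_limit_ohm)
passes_at_1v = can_switch(1.0, r_limit_ohm)
```

At the same resistance, raising V_MTJ to 750mV or 1V lifts I_MTJ well above I_co, which matches the observation that the switching failure disappears for higher V_MTJ values. The second condition, a sufficiently long I_pulse, is not modelled here.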

TABLE 2. Comparison between the two proposed Sum sub-circuit designs for NVRMFA.
[54] Fig. 3(C), it can be understood that the choice of resistance of the MTJ replacing MTJs C_out1 (C_out_bar1) & C_out2 (C_out_bar2) should ensure that the equation Req || C_out_bar1 || C_out_bar2 = Req || C_out_bar holds good. Variation of geometric parameters such as MTJ area as done in [54],

TABLE 9. Comparison of variation in power amongst various MFAs and the proposed NVRMFA under 3σ process variation using 1000 MC runs.