MagCiM: A Flexible and Non-Volatile Computing-in-Memory Processor for Energy-Efficient Logic Computation

This paper presents a high-performance and energy efficient processor exploiting a Magnetoresistive-based Computing-in-Memory array architecture (so-called MagCiM processor), to perform Boolean logic functions on operands stored in a memory array. The proposed processor efficiently addresses the memory wall and the leakage power consumption problems in conventional processors. The MagCiM processor utilizes mCell memory, a class of Magnetoresistive memory employing only Magnetic Tunnel Junction (MTJ) devices, to realize both computation-in-memory and on-chip instruction and data memories. The mCell memory is characterized by almost zero leakage power, high integration density, high level of reliability, and compatibility with the CMOS VLSI fabrication process. The circuit-level simulation results through comparisons with the previous work reveal that the MagCiM processor provides low occupation area, low power, and energy consumption and offers Normally-off instant-on computing capability, which makes it very suitable for embedded system applications. Based on our evaluations, a conventional processor based on the well-known MIPS architecture consumes about 13 times more energy while having 1.5 times more delay than the MagCiM processor.


I. INTRODUCTION
Big data processing applications, with their more random and less local access patterns, create unbalanced workloads on multi-core processors [1], which intensifies the memory wall problem. One effective solution to alleviate the communication costs is to place computation units in the proximity of memory units [2], the so-called Near Memory Processing (NMP). Instead of having close but distinct computation and memory units, the computations can be performed within memory units, known as Computing In Memory (CIM). With memory units capable of both storing data and performing simple computations, the memory wall problem can be addressed with minimal data movement [2]. SRAM technology might not be a fit candidate for CIM architectures due to its large area occupation (>30%) [3], high power consumption [3], and low reliability [4]. These issues worsen in data-intensive applications that demand larger memories to minimize the communication cost between off-chip and on-chip memories. Therefore, there are serious barriers against the deployment of SRAM memories for the design of memory-centric computing architectures such as CIM. Instead, magnetic tunnel junction (MTJ) based memories like STTRAMs feature small dimensions/high density, nearly zero leakage power, non-volatility, compatibility with semiconductors [5], and robustness against particle strikes. Various studies have been conducted on the design of MTJ-based memories [6], [7] and logic circuits [8], [9]. The non-volatility of MTJ cells is well suited to the Normally-off/Instant-on computing paradigm [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.
In general, the application of non-volatile memories such as ReRAM [11], [12] and STTRAMs [13]-[15] in the design of CIM architectures can be categorized into Computing in Memory Arrays (CIMA) [11], [13]-[19] and Computing in Memory Edge (CIME) [20]. In a typical CIMA architecture (see Fig. 1(a)), the computational logic is scattered between rows/columns of the memory, and the sense amplifiers are placed at the edge of the memory to read data. In contrast, in CIME architectures (Fig. 1(b)), computations are performed in a peripheral circuit at the edge of the memory unit. In other words, in CIMA the logic/arithmetic operations are done within the memory arrays, whereas in CIME extra computation units attached to the memory units as peripheral circuits perform them [16], [20].
The main shortcomings of the previous CIM architectures can be listed as follows.
I) ReRAM-based CIM architectures [11], [12] suffer from the higher operating voltage and higher read/write access delay of ReRAMs (w.r.t. STTRAMs). They also suffer from low endurance (about $10^8$), which makes ReRAMs an inappropriate candidate for on-chip memories.
II) The need for sense amplifiers to read data makes STT-MRAM-based memories vulnerable to decision failure, for two reasons: 1) temperature-related resistance variations in MTJs, which are especially harmful for modern MTJs with low contrast between their high and low resistances; and 2) the bit-line resistance becoming significantly larger than the MTJ resistance as the bit-line length grows in larger memories.
III) STT-MRAM-based memories suffer from the read disturb problem, in which a read current may unintentionally change the magnetization direction of an MTJ free layer, i.e., the data is corrupted during the sensing process [21]. This problem does not apply to Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM), as it uses different write and read current paths. Overall, the previous CIM architectures are vulnerable to decision failure regardless of the technology used [15], [19]. The STT-MRAM-based architectures are additionally vulnerable to read disturb failure [13], [16]-[18], [20].
IV) Although MTJs are experimentally proven to be robust against radiation-induced faults [22], the peripheral circuits in both STT-MRAM and SOT-MRAM based memories are still vulnerable to soft errors [23]. Consequently, all previously proposed CIM architectures based on either STT-MRAM or SOT-MRAM cells are vulnerable to radiation-induced errors. It should be noted that SOT-MRAM has been proposed to mitigate the following STT-MRAM challenges: 1) STT-MRAM cannot reliably operate at sub-ns scales due to long incubation delays [6], making it an unsuitable solution for L1/L2 SRAM cache replacement and non-volatile logic; 2) the shared read/write path can impair the read reliability, while the write current can impose severe stress on the memory cell, which results in a possible time-dependent degradation of the MTJ.
To the best of our knowledge, the research problem of designing a full MTJ-based processor has not yet been addressed, and this paper is the first research work in this context. The main contributions of this paper include:
1) An all-magnetic computing-in-memory-array (CIMA) architecture using MTJ cells (the so-called MagCiM) is proposed. To the best of our knowledge, this is the first CIM architecture capable of performing both bit-wise logic and arithmetic operations within a memory array. The proposed MagCiM architecture is robust against radiation-induced errors. We have eliminated the transistor-based read peripheral circuits, i.e., both the sense amplifiers and the additional circuits used by CIME architectures for arithmetic operations. Eliminating these circuits also significantly reduces leakage power consumption. The MagCiM is immune to decision failure, as it does not use sense amplifiers. Read disturb does not occur in the MagCiM, as the read and write current paths are disjoint.
2) Using MagCiM, we have designed and implemented an energy-efficient processor, the so-called MagCiM-processor, that performs all arithmetic and logic operations in memory arrays without needing any register file or arithmetic/logic unit.
3) To support normally-off instant-on computing, the MagCiM-processor can completely switch off when it is in the idle mode. If needed, it operates instantly with maximum performance. This capability will help battery-operated embedded systems save more energy by switching to power-off more frequently without storing the system state.
The paper is organized as follows. Section II reviews related work. Section III provides an overview of the mCell device. Section IV presents the proposed majority gate. Section V describes the CIM array used in the proposed processor. Section VI explains the structure of the proposed processor. Section VII describes normally-off computing in the MagCiM processor. Section VIII explains the pipelined design of the proposed processor. Section IX discusses the analysis and evaluation results. Finally, Section X presents the conclusion.

II. RELATED WORK
As mentioned in the introduction section, previous research work in the design of STT-MRAM/SOT-MRAM CIM architectures can be classified into two broad categories: a) Computing-In-Memory Arrays (CIMA), and b) Computation-In-Memory Edge (CIME). Table 1 lists recently proposed CIMA and CIME architectures and compares these architectures based on various features including the integration of the architecture in a processor, the used technology, the data extraction approach, failure mechanisms, logical bit-wise operation support, arithmetic operation support, and the type of CIM architecture. To make it easier to follow the paper, we also have added our proposed (MagCiM) architecture in the table. As shown in the table, the proposed MagCiM is the CIMA architecture that performs both logic and arithmetic operations within memory arrays. The MagCiM architecture is also highly robust against various failure mechanisms, and we have integrated it within a processor.
According to Table 1, the mentioned methods are not architecturally the same. Some of them are processors, and others are just a CIM. In addition, some perform only logic bit-wise operations, while others perform both logic bit-wise operations and arithmetic operations (by adding power-hungry modules). Given these architectural differences, comparing the methods in terms of power consumption and computational complexity in Table 1 would not be fair. Therefore, Table 1 states only the common features of the architectures.
The major shortcomings of the previously proposed CIMA architectures can be summarized as follows. 1) As most of them are designed based on STT-MRAM devices, they have reliability issues in decision making and read disturb. 2) Although computations are done within the memory, they still require complex extra MOS-based circuits and different control signals to connect specific rows/columns. Such circuits impose static and dynamic power overheads and negatively impact the reliability of the CIM architectures. 3) The most crucial part of designing a CIM architecture is integrating the CIM in a processor (this may require designing the corresponding instruction set and hardware architectures). Most of the previous CIMA architectures have ignored such an integration.
Although STT-MRAM based CIME architectures can perform both logic and arithmetic operations, they suffer from three problems: 1) The used fairly-complex circuit at the memory peripherals that performs arithmetic and/or logic operations imposes notable static and dynamic power consumption. The computing circuit is also prone to radiation-induced faults and may end up in a single point of failure. 2) The CIME architectures require additional signals across the memory's columns/rows that consume more energy when the memory capacity grows. 3) Like standard STTRAMs, CIME architectures are also subject to the common failure mechanisms such as decision failure and read disturb.

III. OVERVIEW OF mCell DEVICE
The storage cell used in the mCell-MagCiM array is a spintronic device called mCell [7] (shown in Fig. 2). The mCell is a four-terminal device with electrically separated read path (R, R*) and write path (W−, W+). It works based on a magnetic domain wall in the write path, which is displaced back and forth by positive and negative pulsed currents, respectively. When the write path is magnetically coupled to the free layer of a magnetic tunnel junction, the read path resistance changes as determined by the element's tunneling magnetoresistance. In our work, two terminals of the mCell comprise a write path, wherein the direction of the flowing input current changes the digital state of the device. The other two terminals form a read path that is electrically separated from the write path. The state of the device is detected as a high or low resistance through the read-path terminals. As shown in Fig. 2(b) and Fig. 2(c), an input current ($I_{mCell}$) larger than or equal to the threshold current ($I_C$) passing from W− to W+ changes the mCell read path resistance from a value between R and R* to the high (RH) value, and a reverse current changes it to the low (RL) value. Therefore, the inputs to the mCell are the currents $+I_{mCell}$ or $-I_{mCell}$. The resistance of the mCell can be measured by passing a sense current from R to R*. These two states are non-volatile, i.e., they are preserved even if the supply voltage is disconnected. Table 2 lists all essential parameter values used in the simulations and analysis of the MagCiM processor in this paper.

IV. THE PROPOSED MAJORITY GATE
It has been proven that the logic of a majority gate is functionally complete [28], i.e., it covers every logical function.
where $k_N = k'_N (W/L)_N$ and $k_P = k'_P (W/L)_P$; $k'_P$ and $k'_N$ are the process transconductance parameters. Equating the drain currents and solving the resulting equation, $V_M$ can be obtained by Eq. 3: The midpoint voltage is adjusted by changing $(W/L)_N$ and $(W/L)_P$ to obtain the necessary threshold. For the majority gate to work correctly, the threshold is adjusted to satisfy the following condition: the output will change from high to low every time the result becomes greater than '1.5'. Therefore, we have: Considering Eq. 3 and Eq. 5, it is concluded that the resistances of the mCells (RL, RH) affect the $k_N$ and $k_P$ values. The Gn inverter inverts the TD output so that both the data and its complement appear at the output. Fig. 4 shows the normal operation of the proposed majority gate, extracted by HSPICE simulations. As VDD, transistor specifications, and mCell specifications differ between technologies, the value of the midpoint voltage ($V_M$) also differs, and it should be obtained from Eq. 3 to Eq. 5 for each technology.

MagCiM uses three decoders (A, B, and C) and one de-multiplexer (DeMUX). The outputs of the decoders are connected to the inputs of OR gates to allow concurrent reading of three rows in memory. The DeMUX module makes it possible to write data to the selected row. The RD signal is applied to the enable pins of the decoders (A, B, and C), whereas the WR signal enables the DeMUX module. When the RD signal is set to '0', the decoders' outputs become '0', and when the WR signal is '0', the outputs of the DeMUX become '0'. In the MagCiM architecture, a computation process follows two steps. In the first half-cycle, the RD and WR signals are set to '1' and '0', respectively, to read the memory. Then, the three decoders are activated, and three stored bits of each column are selected simultaneously and applied to the majority gate to perform the computation on the selected bits. The result finally appears at the MagCiM output (Dout and its complement $\overline{Dout}$). The majority gate, responsible for all computations, performs computations simultaneously with the read operations in the first half-cycle. In the second half-cycle, the RD and WR signals are set to '0' and '1', respectively. The input data (Din) is then written into the MagCiM memory row determined by the DeMUX. Fig. 6 shows the MagCiM array, which uses the all-magnetic mCell-based bit-cells proposed in [7]. Locations denoted as M0, M1, and M2 are reserved locations of the MagCiM array used to realize logic bit-wise operations. The M0 cells are set to ''0 . . . 00'', whereas the M1 cells are set to ''1 . . . 11''. For arithmetic operations, the M2 cells are reserved for storing the result of the ripple carries. In the following, we explain the operations of these components in detail.
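The two half-cycles described above can be sketched behaviorally as follows. This is a minimal illustration, not the authors' circuit: the class name `MagCiMArray`, the 8-bit word width, the row count, and the assumption that the majority result is what gets written back in the second half-cycle are all hypothetical choices made for the example. It shows how the reserved all-zeros (M0) and all-ones (M1) rows would turn a three-input majority into AND and OR.

```python
# Behavioral sketch of one MagCiM compute cycle (not a circuit model).
WIDTH = 8                      # hypothetical word width, for illustration only
MASK = (1 << WIDTH) - 1


def maj(a, b, c):
    """Bitwise three-input majority of equal-width words."""
    return (a & b) | (a & c) | (b & c)


class MagCiMArray:
    """Hypothetical model: rows are integers, row 0 is M0, row 1 is M1."""

    def __init__(self, rows=16):
        self.mem = [0] * rows
        self.mem[0] = 0        # M0: reserved all-zeros row
        self.mem[1] = MASK     # M1: reserved all-ones row

    def cycle(self, a, b, c, dest, invert=False):
        # First half-cycle (RD=1, WR=0): three rows are read and combined
        # by the per-column majority gate.
        dout = maj(self.mem[a], self.mem[b], self.mem[c])
        if invert:             # take the complemented output instead
            dout = ~dout & MASK
        # Second half-cycle (RD=0, WR=1): the value is written to row dest
        # (assumed here to be the majority result).
        self.mem[dest] = dout
        return dout
```

With operands in rows 4 and 5, `cycle(4, 5, 0, 6)` would behave as AND, `cycle(4, 5, 1, 7)` as OR, and setting `invert=True` gives NAND or NOR.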

A. ALL-MAGNETIC mCell-BASED BITCELL
Every memory cell consists of three mCells, two of which form a buffer called the 'driving buffer'. The third mCell keeps the data and transfers it to the output in the reading phase.

B. WRITING INTO THE DRIVING BUFFER
Unlike in conventional memories, the WBL(i) signal is a current signal, i.e., the direction of the WBL(i) current determines the state of the driving buffer mCells (Fig. 7).

1) WRITING INTO A MEMORY CELL
To write into a memory cell, the WWL(j)+ and WWL(j)− writing lines are set to V+ and V− values, respectively. This causes the electrical current to flow through the writing path of the memory cell. The direction and the magnitude of the writing current depend on the pull-up and pull-down resistors of the driving buffer reading path. In Fig. 7(a), the pull-up resistor of the driving buffer is greater than the respective pull-down resistor on Path-1. If the reading path of the driving buffer is activated (by activating the WWL(j)+ and WWL(j)− lines), a right-to-left current on the writing path sets the value of the memory cell to '0'. In Fig. 7(b), the pull-up resistor of the driving buffer becomes smaller than the corresponding pull-down resistor on Path-2. If the reading path of the driving buffer is activated (by activating the WWL(j)+ and WWL(j)− lines), a left-to-right current on the writing path of the memory cell sets the value to '1'.
VOLUME 10, 2022

2) READING FROM A MEMORY CELL
To perform a reading operation, the RWL(j) line is driven to the value V+, and the output current is transferred to the BL(i) line through the reading path of the memory cell. Given that an mCell has two states, low resistance (logic '0') and high resistance (logic '1'), the current transferred to the RBL through the memory cell can also have two values. When the value of the storing element is '1', the output current is low (IOL); when the value of the storing element is '0', the output current is high (IOH). Fig. 8 shows a MagCiM memory cell, and Fig. 9 shows its corresponding waveforms.

C. THE mCell-MagCiM ARRAY
Similar to other random access memories (DRAM, MRAM, etc.), the MagCiM is divided into individual rows and columns. Each row has a dedicated word-line, and each column has a dedicated bit-line. When data is being written on the WBL(i) line, all of the buffers connected to WBL(i) will be initialized concurrently. However, as the storing element is separated from the WBL signal in each memory cell, it will not be affected. After activating the WWL(j) line, only the driving buffer data connected to WWL(j) will be transferred to the storing element. When RWL(j) is activated, the storing element's data will be read and transferred to BLs.
The WBL voltage depends on the array size, i.e., on the equivalent resistance of the reading paths of all driving buffers and the wire loads along the bit-line; it must be high enough to provide a current of at least 5 µA. The bit-line current can be set higher than 5 µA to reduce the probability of write failure (by providing safe margins through the separated writing and reading paths in the mCell). The WWL voltages can also be increased to improve the switching of the storing element in the writing process. Because these voltages drop across the reading paths of the driving buffers, their values should be chosen so that the storing elements still switch correctly despite the voltage drop.
During reading operations, an appropriate voltage is applied to the read word-line. As a result, a current flows through the reading path of the storing element. The magnitude of this current depends on the resistance of the storing element (RH or RL). This current is regarded as one of the majority gate inputs.

D. ARITHMETIC AND LOGIC UNIT (ALU)
In the following, we describe how the proposed processor can do arithmetic and logic operations.

1) LOGIC UNIT
Accordingly, when three memory cells of BL(l) ($m_A(i, l)$, $m_B(j, l)$, and $m_C(k, l)$) are selected as the inputs of the majority gate in the MagCiM array (Fig. 6), the reading path current of each mCell ($I_m$) can be IOH or IOL based on the cell state (RH or RL). According to Kirchhoff's current law (KCL), the sum of currents entering any junction is equal to the sum of currents leaving the junction. Therefore, the output current of BL(l) is the sum of the three cell currents: $I_{sum} = I_{mA}(i, l) + I_{mB}(j, l) + I_{mC}(k, l)$. $I_{sum}$ is converted into a voltage through the Mn transistor, and the result of the majority gate is then produced by the voltage threshold detector.
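The current-summing view of the majority gate can be illustrated with a small numeric sketch. The current values below are hypothetical placeholders (the real IOH and IOL follow from RH, RL, and the read voltage), and the ideal comparator stands in for the Mn transistor and threshold detector. The only facts taken from the text are that a stored '1' (RH) gives the low current IOL, a stored '0' (RL) gives the high current IOH, and the bit-line sums the three currents.

```python
# Hypothetical read currents in nanoamperes, chosen only for illustration.
I_OH_NA = 10_000   # cell stores '0' (low resistance RL)  -> high current
I_OL_NA = 5_000    # cell stores '1' (high resistance RH) -> low current


def cell_current(bit):
    """Read-path current of one cell for the stored bit."""
    return I_OL_NA if bit else I_OH_NA


# Decision threshold placed halfway between the sum for "exactly one cell
# stores 1" and the sum for "exactly two cells store 1" (an assumption).
I_TH_NA = ((2 * I_OH_NA + I_OL_NA) + (I_OH_NA + 2 * I_OL_NA)) // 2


def bitline_majority(a, b, c):
    """Sum the three cell currents (KCL on the bit-line) and threshold."""
    i_sum = cell_current(a) + cell_current(b) + cell_current(c)
    # More stored 1s means less total current, so a sum below the
    # threshold corresponds to a majority of 1s.
    return 1 if i_sum < I_TH_NA else 0
```

Because the four possible sums are evenly spaced, the margin on each side of the threshold in this idealized model is half the difference between IOH and IOL.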

2) ARITHMETIC UNIT
As mentioned earlier, all previously proposed CIMA architectures only support logic bit-wise operations inside the memory. The previous CIM architectures commonly embed arithmetic circuits in the peripheral circuitry of the memory to perform arithmetic operations. The arithmetic circuits are either based on CMOS or hybrid spin/CMOS circuits. However, the MagCiM processor supports both logic and arithmetic bit-wise operations using only magnetic cells. In the following text, we will explain how the MagCiM performs primary arithmetic operations, including addition, subtraction, multiplication, and division.

a: ADDITION OPERATION
A ripple carry adder adds the bits of the same significance and the carry-in from the previous stage and propagates the carry-out to the next stage. The relations between the inputs and the outputs of any stage are expressed as: In our processor, C(i) is directly generated in memory, and SUM(i) is generated using an XOR operation performed in memory without requiring any peripheral circuit at the edge of the memory. Fig. 10 shows the hardware and wiring for carry computation and propagation. Location M2 of the memory is a special location that stores the result of the ripple carries. According to Eq. 6, the majority circuit calculates the carry of the i-th stage.
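As an illustration of a ripple-carry adder built only from majority and inversion, the sketch below uses the well-known identities C(i) = MAJ(A(i), B(i), C(i-1)) and SUM(i) = MAJ(NOT C(i), C(i-1), MAJ(A(i), B(i), NOT C(i-1))). These are standard textbook relations, given here as an assumption; the exact form of the paper's own equations may differ. The function names and the 8-bit width are hypothetical.

```python
def maj(a, b, c):
    """Three-input majority of single bits."""
    return (a & b) | (a & c) | (b & c)


def inv(x):
    """Single-bit inversion (the complemented majority output)."""
    return x ^ 1


def full_adder(a, b, c_in):
    """One adder stage using three majority operations."""
    c_out = maj(a, b, c_in)                       # carry of this stage
    s = maj(inv(c_out), c_in, maj(a, b, inv(c_in)))  # equals a XOR b XOR c_in
    return s, c_out


def ripple_add(x, y, width=8):
    """Add two unsigned integers bit by bit, rippling the carry."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

Each stage depends on the carry of the previous one, which matches the role the text gives to the reserved carry location.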
The majority circuit used in MagCiM (shown in Fig. 10(b)) serially connects three mCells, A(i), B(i), and C(i-1), with an $R_{ref}$ of 562 Ω. The majority circuit consists of two separate parts: a reference resistance ($R_{ref}$) and a pull-down resistance. It also includes two opposite source/sink output currents ($I_{Src}$, $I_{Sink}$). The source and sink currents are obtained by comparing the pull-down resistance with the reference resistance; whether the pull-down resistance is higher or lower than the reference directly determines the output (C(i)). If the pull-down resistance is higher than the reference resistance, the reference current is greater than the pull-down current, so the output current becomes a source current ($I_{OUT} = I_{Src}$). Likewise, if the pull-down resistance is lower than the reference resistance, the output current becomes a sink current ($I_{OUT} = I_{Sink}$). The reference resistance provides a current between WWL(3)+ and the output whenever the output of the majority circuit becomes ''1'' (based on the inputs). Similarly, the pull-down resistance provides a current between the output and WWL(3)− whenever the output of the majority circuit becomes ''0'' (based on the inputs). Fig. 11 shows the regular operation of a MagCiM majority circuit.
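The resistance comparison can be made concrete with a small sketch. The values of `R_L` and `R_H` below are hypothetical, chosen only so that the reference value of 562 quoted in the text (taken here to be in ohms) falls between the series resistance for one high-resistance cell and that for two. The sketch also assumes that logic '1' corresponds to the high-resistance state, as in the reading description, and it ignores wire and transistor resistances.

```python
# Hypothetical mCell resistances in ohms (not taken from the paper's Table 2).
R_L, R_H = 125.0, 250.0
R_REF = 562.0   # reference resistance quoted in the text; unit assumed


def carry_from_resistance(a, b, c_prev):
    """Compare the series pull-down resistance with the reference."""
    # Three mCells in series: each contributes R_H for '1' and R_L for '0'.
    r_pull_down = sum(R_H if bit else R_L for bit in (a, b, c_prev))
    # Higher pull-down resistance than the reference -> source current -> '1'.
    return 1 if r_pull_down > R_REF else 0
```

With these placeholder values the series resistance is 375, 500, 625, or 750 for zero to three stored 1s, so the reference separates the two middle cases.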
To perform an ADD operation, M4 (A) is initialized first; M3 (B) is then initialized, and M2 (Carry) is directly generated. According to Eq. 7, the SUM can be calculated using three MAJ operations.
[29]. CORDIC is a simple and efficient algorithm to calculate the mentioned operations using only iterative shift-add operations.

3) COMPUTATION FLOW IN MagCiM
To explain the computation flow of the MagCiM processor, we have set up two subsequent case studies on the execution of i) logic operations (AND/NAND and OR/NOR operations) Fig. 12(b)). Fig. 12(c) shows the computation flow for a SUM operation used in the ADD instruction. An addition operation has two outputs, namely, SUM and Carry-out. The SUM is computed by XORing the inputs and Carry-in. We need to perform the XOR operation on three operands to produce SUM, e.g., M2, M3, and M4. MagCiM performs this in two steps. In the first step, a majority operation processes M2 and M3, and stores the result in Mi (Path 1 of Fig. 12(c), 2). In the second step, the second majority operation processes Mi and M4, and stores the result in Mj (Path 2 and 3 in Fig. 12(c)).

VI. THE ARCHITECTURE OF THE PROPOSED MagCiM PROCESSOR
Fig. 13 shows the structure of the proposed MagCiM single-cycle processor designed using all-magnetic magnetoresistive RAM. The MagCiM is a 32-bit RISC processor that executes instructions without needing any register file or ALU. Because the majority operation is logically complete, i.e., other operations can be restated using only majority expressions [28], the MagCiM processor uses majority gates for its computations. Table 3 shows the basic instructions of the proposed MagCiM processor based on the majority gate. The MagCiM single-cycle processor performs all tasks of every instruction in one clock cycle. A new instruction is fetched only when the execution of the previous one is finished. Execution of every instruction takes the following four steps.
• Instruction fetch (IF): the instruction is fetched from the instruction memory, and the address of the next instruction is generated.
• Instruction decode (ID): the instruction is decoded, and the operands are read.
• Memory access and EXecution (MEX): the memory (MagCiM) is accessed based on the computed addresses, and the majority function is executed.
• Write back (WB): the load operation is completed by writing the result from memory (MagCiM) into the memory (MagCiM) itself.
All the essential components of the MagCiM processor (including the PC, instruction memory, and MagCiM) are designed to be non-volatile to support normally-off computing [30]. In the following, we describe the functionality and structure of these components in detail.
The proposed processor has two memories: the MagCiM and the instruction memory. In the pipelined version, having only one memory for both data and instructions may result in structural hazards. To tackle this issue, we have used two separate memories, one for data and one for instructions.

A. NORMALLY-OFF COMPUTING
Normally-off computing (NoC) is a promising power-saving technique that offers extra power saving on top of existing low-power techniques such as dynamic voltage and frequency scaling, clock gating, and power gating. NoC has zero standby power consumption during idle time and instant-on characteristics [30] that are commonly used in numerous Internet of Things (IoT) applications with long-term sleep duration, e.g., wearable healthcare devices, wireless sensing, etc.
In general, the total power consumption of an IoT device can be described with Eq. 8, where $P_{Active}$, $P_{Sleep}$, and $P_{Wakeup}$ are respectively the power consumption during active mode, sleep mode, and wake-up transition: The weight of each parameter in the total power consumption strongly depends on the application. In some applications, the system spends most of its time in sleep mode. Accordingly, a low sleep-power consumption is the more critical design issue. On the other hand, for applications such as data loggers, the device often switches between active and sleep modes. In that case, the wake-up energy has to be reduced. When non-volatile memory is utilized in the computation device, $P_{Backup}$ and $P_{Restore}$ are completely removed. This significantly decreases the total energy consumption when the frequency of switching between the sleep and wake-up states is high. Briefly, the total energy consumption when using MagCiM is given by Eq. 9:
$E_{MagCiM} = P_{Active} \times T_{Active} + P_{Wakeup} \times T_{Wakeup}$ (9)
where $T_{Active}$ and $T_{Wakeup}$ are the average time durations in which the device is in the active and wake-up states. The total energy consumption when using MIPS is given by Eq. 10: Comparing Eq. 9 with Eq. 10, it can be seen that the MagCiM becomes more energy efficient, especially as $T_{Sleep}$ increases.
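A small numeric sketch can make the comparison concrete. The first function follows Eq. 9 directly. The second is only an assumed stand-in for a volatile baseline, built from the terms the text names (sleep, backup, and restore); it is not the paper's Eq. 10, and all names and numbers are hypothetical.

```python
def energy_magcim(p_active, t_active, p_wakeup, t_wakeup):
    """Eq. 9: active energy plus wake-up energy (power in W, time in s)."""
    return p_active * t_active + p_wakeup * t_wakeup


def energy_volatile(p_active, t_active, p_wakeup, t_wakeup,
                    p_sleep, t_sleep,
                    p_backup=0.0, t_backup=0.0,
                    p_restore=0.0, t_restore=0.0):
    """Assumed baseline: Eq. 9 terms plus sleep, backup, and restore energy."""
    return (energy_magcim(p_active, t_active, p_wakeup, t_wakeup)
            + p_sleep * t_sleep
            + p_backup * t_backup
            + p_restore * t_restore)


def saving(t_sleep, p_sleep=0.5):
    """Energy difference between the two models for a given sleep time."""
    base = dict(p_active=2.0, t_active=3.0, p_wakeup=1.0, t_wakeup=0.5)
    return energy_volatile(p_sleep=p_sleep, t_sleep=t_sleep, **base) \
        - energy_magcim(**base)
```

Under this assumed model the difference grows linearly with the sleep time, which is the qualitative trend the text describes.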

B. INSTRUCTION SET ARCHITECTURE
The 32-bit instruction set architecture (ISA) of the MagCiM processor contains three classes: M-type (for Memory), B-type (for Branch), and I-type (for Immediate). The MagCiM instruction format for each of the classes is shown in Fig. 14. Common fields for all three formats include i) the four most-significant bits of the instruction that are used for the operation code and ii) the branch bit. The instruction's OP bits [30:28] are sent to a control unit to determine the type of instruction; the type of instruction then determines which control signals are to be set, i.e., the instruction decode. If the branch bit is set to '1', the content of the md field is set to the PC-relative address of a memory location. M-type MagCiM instructions are used for arithmetic and logic operations. These instructions consist of four memory location fields: ma, mb, and mc are the three source fields, and md is the only destination field. The memory address fields ma (bits [27:21]), mb (bits [20:14]), mc (bits [13:7]
I-type MagCiM instructions are used for writing data into a MagCiM location. Per this format, the immediate data (a 16-bit constant value) is stored in bits [22:7]. Table 3 shows the basic instructions the MagCiM processor supports. It is worth mentioning that other operations can be done as a combination of the instructions of Table 3. Table 4 demonstrates how, for example, XNOR and XOR operations can be done using a majority instruction. An XOR gate is used for the implementation of the BNE instruction. If the two operands are the same, the XOR operation returns zero, and according to the control signals (see the NOR-gate output in Fig. 13), the branch is not taken. The SLL instruction (shift left logical, Dout 1 in Fig. 13) is done by wiring a '0' to the least significant bit of Dout and ignoring the most significant bit of Dout.
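The field layout described above can be sketched as a small decoder. The bit ranges for OP, ma, mb, mc, and the 16-bit immediate are those stated in the text. The position of md at bits [6:0] is an assumption inferred from the 7-bit width of the other address fields, and the function names are hypothetical.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range [hi:lo] from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_m_type(instr):
    """Split an M-type instruction into its fields."""
    return {
        "op": bits(instr, 30, 28),
        "ma": bits(instr, 27, 21),
        "mb": bits(instr, 20, 14),
        "mc": bits(instr, 13, 7),
        "md": bits(instr, 6, 0),    # assumed position, not stated in the text
    }


def decode_i_type(instr):
    """Split an I-type instruction into opcode and 16-bit immediate."""
    return {
        "op": bits(instr, 30, 28),
        "imm16": bits(instr, 22, 7),
    }
```

Seven-bit address fields can select one of 128 memory rows, which bounds the addressable MagCiM locations in this format.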

C. MagCiM ADDRESSING MODES
The ISA specifies the mechanisms by which a processor accesses operands for calculations. Fig. 15 shows how operands are identified for the three addressing modes of the MagCiM processor. Notably, the MagCiM supports more than one addressing mode in a single instruction. For example, the jMAJz instruction uses both PC-relative and memory addressing modes.
1. In the memory addressing mode, the operand is stored in a memory location.
2. In PC-relative addressing mode, the branch address is obtained as the summation of PC and a constant value embedded in the instruction.
3. In immediate addressing mode, the full target address is embedded as a constant value within the instruction.

VII. NORMALLY-OFF COMPUTING IN MagCiM PROCESSOR
To support normally-off computing in the MagCiM processor, we have proposed a non-volatile instruction memory (IM) and the program counter (PC), which are shown in Fig. 16 and Fig. 17. The IM cells are similar to MagCiM cells. The IM array benefits from one decoder. This decoder's output allows the stored bits of a row to be selected simultaneously and applied to current-to-voltage (I-V) converters. In the I-V converter (Fig. 16(c)), the Mn transistor converts Im Ai into a voltage and transfers it to an element called the voltage threshold detector (TD). The result finally appears in the IM output.
The proposed 8-bit non-volatile PC is composed of four components: 1. write circuits to program the mCell devices; 2. mCell devices to store the address; 3. sense amplifiers (SA) to read the address stored in the mCell devices according to their resistances; 4. SR-latches (slave) to provide the next address. The mCell device is utilized as a storage element in the PC register. The writing and reading process of the PC register is controlled by the signal ''CLK''. When ''CLK'' is high, the address is stored in the mCells by the write circuits in the form of different resistance levels, e.g., the high resistance state stands for logic '1' and vice versa; meanwhile, the slave latches keep the previous address. When ''CLK'' is low, the sense amplifiers read the stored address, and the slave latches become transparent and update their output address.
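The clocking behavior of the PC register can be sketched as a two-phase model. This is a behavioral illustration only: the class and method names are hypothetical, and the assumption that the volatile slave output resets to zero on power loss is made for the example rather than taken from the text.

```python
class NonVolatilePC:
    """Hypothetical two-phase model of the 8-bit non-volatile PC."""

    def __init__(self):
        self.mcell = 0   # non-volatile mCell contents (survive power-off)
        self.out = 0     # output of the slave latches (volatile)

    def clk_high(self, next_addr):
        # CLK high: the write circuits program the mCells with the new
        # address while the slave latches keep the previous address.
        self.mcell = next_addr & 0xFF

    def clk_low(self):
        # CLK low: the stored address is sensed and the slave latches
        # become transparent, updating the output.
        self.out = self.mcell
        return self.out

    def power_cycle(self):
        # Power is removed and restored: only the mCell state is kept.
        # Resetting the latch output to 0 is an assumption of this sketch.
        self.out = 0
```

After a power cycle, the next low phase of the clock recovers the address that was last written, which is the property that allows execution to resume without a separate restore step.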

VIII. PIPELINING IN MagCiM PROCESSOR
To enhance the MagCiM processor's performance, we have also designed a 3-stage pipelined architecture for the MagCiM processor (see Fig. 18). In Fig. 18, IFD/MXW and MXW/BR are pipeline registers that connect consecutive pipeline stages by holding data for only one clock cycle. Table 5 shows the data fields held by each pipeline register.

A. IFD STAGE
The IFD stage consists of the PC register, a simple adder, the instruction memory, and the control unit. At each clock cycle, the PC increments to address the instruction memory to fetch a new instruction. This stage also decodes the operands according to the operation code. The control signals are generated by the control unit and passed to the MXW stage along with PC+1 via the IFD/MXW pipeline register.

B. MXW STAGE
Based on the operation code and other instruction bits, the controller generates A, B, and C address lines for MagCiM. The data D1, D2, D3 are then passed to majority gates. Finally, the majority operation is performed. This process is done in the first half of the clock cycle when the MagCiM read signal (RD) is high. The result is then written into the MagCiM, corresponding to the write address rd, during the second half-cycle when the MagCiM write signal (WR) is high. The address, PC+1, and the branch signal are all written into the MXW/BR pipeline register.
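The read-compute-write sequence of the MXW stage can be illustrated with the following sketch. The function names and the list-based memory are hypothetical. The AND/OR helpers rely on a standard identity of the three-input majority function (fixing one input to 0 or 1); whether the MagCiM instruction set encodes AND and OR in exactly this way is an assumption here.

```python
def maj3(d1, d2, d3):
    """Three-input majority: returns 1 when at least two inputs are 1."""
    return (d1 & d2) | (d1 & d3) | (d2 & d3)


def and2(a, b):
    """AND obtained from majority by fixing the third input to 0."""
    return maj3(a, b, 0)


def or2(a, b):
    """OR obtained from majority by fixing the third input to 1."""
    return maj3(a, b, 1)


def mxw_cycle(memory, addr_a, addr_b, addr_c, rd):
    """One MXW cycle on a list-based memory (illustrative only)."""
    # First half-cycle (RD high): read D1, D2, D3 and compute the majority.
    result = maj3(memory[addr_a], memory[addr_b], memory[addr_c])
    # Second half-cycle (WR high): write the result to address rd.
    memory[rd] = result
    return result
```

Because the read and the write happen in different halves of the same clock cycle, the model performs them strictly in sequence inside a single call.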

C. BR STAGE
This stage is used to obtain the next instruction address for the branch instructions. In this stage, the zero flag and branch address are calculated, and the result is written into the PC.
As in any other pipelined processor, there are situations in which the MagCiM pipeline needs to be stalled due to structural, control, and data hazards. The straightforward solution to resolve data or branch hazards is to insert NOP ('no operation' in Table 3) instructions at assembly/link time: the assembler/compiler inserts NOPs after branch instructions and before data-dependent instructions.
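A minimal sketch of such a NOP-insertion pass is given below. The tuple encoding `(mnemonic, destination, sources)`, the `MAJ` mnemonic, and the default numbers of NOPs (`branch_nops`, `data_nops`) are assumptions made for illustration; the text does not state how many NOPs the MagCiM assembler actually inserts.

```python
NOP = ("NOP", None, ())


def insert_nops(program, branch_ops=("jMAJz",), branch_nops=2, data_nops=1):
    """Insert NOPs after branches and before data-dependent instructions.

    program: list of (mnemonic, destination, sources) tuples (hypothetical encoding).
    """
    out = []
    for op, dest, srcs in program:
        # Look at the last `data_nops` emitted instructions, most recent first.
        window = out[-data_nops:] if data_nops > 0 else []
        for gap, (_, prev_dest, _) in enumerate(reversed(window)):
            if prev_dest is not None and prev_dest in srcs:
                # Pad until `data_nops` instructions separate writer and reader.
                out.extend([NOP] * (data_nops - gap))
                break
        out.append((op, dest, tuple(srcs)))
        if op in branch_ops:
            # Fill the slots after a branch until its target is known.
            out.extend([NOP] * branch_nops)
    return out
```

The pass is purely static: it trades code size and cycles for a pipeline that needs no hardware hazard detection.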

IX. SIMULATION SYSTEM SETUP AND RESULTS
To evaluate the MagCiM processor, we have employed the same method as [18] (see Fig. 19). First, circuit-level simulations are done using a Verilog-A model proposed in [31] for mCells. Then, mCells and interface CMOS circuits are co-simulated in Cadence Spectre and HSPICE tools. The MagCiM processor is developed at the circuit level using the HSPICE tool. Simulations are carried out using the 32nm Predictive Technology Model (PTM) technology with a nominal voltage of 0.9 V. For the system-level simulations, we employ a modified self-consistent NVSim [32] along with an in-house developed C++ code for our power consumption and occupied area evaluations.
FIGURE 19. Circuit and system level simulation framework.

A. RELIABILITY ANALYSIS
In order to analyze the read failures under process variation of mCells for MagCiM operations, we performed Monte-Carlo circuit-level simulations. We increase the amount of process variation from ±5% to ±25% and run 100,000 simulations for each level of process variation. Table 6 shows the percentage of iterations in which MagCiM operates incorrectly for each level of variation. Two conclusions are reached: first, as expected, with up to ±5% variation there are zero errors in MagCiM; and second, even with ±10% and ±15% variation, the percentage of erroneous MagCiM operations across 100,000 iterations each is just 0.18% and 5.31%, respectively. These results show that MagCiM is reliable even in the presence of significant process variation.
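The structure of such a Monte-Carlo experiment can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: the resistance values, the read voltage, the uniform variation model, and the current-threshold readout are hypothetical and are not the Verilog-A mCell model used in the paper, so the resulting rates are not expected to match Table 6.

```python
import random

# Hypothetical device parameters (not taken from the paper).
R_LOW = 5_000.0     # ohms, low-resistance state, logic '0'
R_HIGH = 10_000.0   # ohms, high-resistance state, logic '1'
V_READ = 0.1        # volts


def sensed_majority(bits, variation, rng):
    """Toy readout: sum the three cell currents and compare with a threshold."""
    total_current = 0.0
    for bit in bits:
        nominal = R_HIGH if bit else R_LOW
        # Each resistance deviates uniformly within +/- variation.
        resistance = nominal * (1.0 + rng.uniform(-variation, variation))
        total_current += V_READ / resistance
    i_low, i_high = V_READ / R_LOW, V_READ / R_HIGH
    # Threshold midway between the nominal one-'1' and two-'1' currents.
    threshold = ((2 * i_low + i_high) + (i_low + 2 * i_high)) / 2
    # More high-resistance cells mean less current, i.e. majority '1'.
    return 1 if total_current < threshold else 0


def failure_rate(variation, trials=10_000, seed=0):
    """Fraction of random trials in which the sensed majority is wrong."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        bits = [rng.randint(0, 1) for _ in range(3)]
        expected = 1 if sum(bits) >= 2 else 0
        if sensed_majority(bits, variation, rng) != expected:
            failures += 1
    return failures / trials
```

Sweeping `variation` over 0.05, 0.10, 0.15, 0.20, and 0.25 mirrors the shape of the experiment described above, with one failure percentage reported per variation level.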
The reliability of an STT-MRAM device depends on the following failure mechanisms: read disturb (MTJ flipping during read), write failure (incorrect write operation), and decision failure (incorrect sensing of the MTJ value). An efficient way to solve the read disturb problem is to separate the read and write paths of MTJ devices [33]. For this purpose, we have used "mCell". Write failure occurs when the MTJ device's write current is less than its critical current. The write failure rate can be decreased by increasing the write pulse duration or the write current value [33]. In mCell, the critical current is 5 µA; we have set the write current to 10 µA to reduce the write failure rate in the MagCiM processor. Decision failure occurs in STT-MRAMs in which sense amplifiers are used to read the values of the MTJs by comparing the MTJ's resistance with a reference resistance [33]. Decision failure will not happen in MagCiM because sense amplifiers are not used to extract the result of the cells.

B. MEMORY PERFORMANCE EVALUATION
According to Table 7, the proposed MagCiM shows the lowest read dynamic energy in comparison to the other designs. In addition, MagCiM reduces the total leakage power compared to SRAM. However, the proposed MagCiM shows a longer average latency than SRAM due to the longer write latency of magnetic memory storage. Moreover, the area overhead of the proposed MagCiM is 42.31% more than that of STT-MRAM but still 37.76% less than that of the SRAM design. It should be noted that the first and foremost advantage of spintronic memories, compared to SRAM, is their non-volatility, with a retention time of almost 10 years [5].

C. MagCiM VERSUS A TYPICAL MIPS
We have compared the MagCiM processor with a typical MIPS processor [34] under various instructions (ADD, SUB, MUL, DIV, SQRT, and EXP) in terms of instruction execution latency, the leakage power consumption, and the energy consumption. To this end, we have implemented a typical MIPS processor at the circuit level using the HSPICE tool under 32nm PTM CMOS technology.
As the MagCiM processor does not have any register file or ALU, it is highly area-efficient compared to a typical RISC processor like MIPS. Using 32nm technology, MagCiM exhibits about 76% less occupied area than the MIPS processor. MagCiM exploits mCells (a particular type of MTJ-based memory cell) to cope with the leakage power consumption issue. Fig. 20.a shows the leakage power consumption of MagCiM versus its competitor, the MIPS processor. According to our results, almost 98% leakage power saving is achieved by the MagCiM processor, mainly because of magnetic memory and computation. In addition, as the instruction memory, the data memory, and the program counter are non-volatile, MagCiM is a normally-off enabled processor.
Latency in our evaluations is measured as the product of the number of clock cycles required for executing an instruction and the clock signal period. Fig. 20.b confirms that the MagCiM processor offers a significantly lower delay than the MIPS processor for all instructions. On average, it exhibits a 58% improvement in latency as compared to the MIPS processor. The energy per instruction is measured as the product of the execution latency and the instruction's power consumption. According to Fig. 20.c, the MagCiM processor consumes much less energy than the MIPS processor.
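The two metrics defined in this paragraph reduce to simple products, which the following sketch makes explicit. The function names are hypothetical, and any numbers passed to them are placeholders rather than measured values from Fig. 20.

```python
def latency_s(clock_cycles, clock_period_s):
    """Instruction latency = number of clock cycles x clock period (seconds)."""
    return clock_cycles * clock_period_s


def energy_j(latency, power_w):
    """Energy per instruction = execution latency x power consumption (joules)."""
    return latency * power_w


def improvement_percent(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline
```

Under this definition, a 58% latency improvement means that the proposed design needs 42% of the baseline's latency.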

D. COMPARISON WITH THE PREVIOUS WORK
Although NVM-based CIM architectures provide substantial performance improvement and energy efficiency for data-intensive applications, they are vulnerable to Data Remanence Attacks (DRA) as they may hold persistent data. Therefore, it makes sense to employ some sort of encryption algorithm. In this subsection, we compare the MagCiM processor with different architectures proposed in the previous work. To keep the results comparable, we use the same case study benchmark as the earlier works, i.e., the Advanced Encryption Standard (AES) algorithm. The architectures selected for this comparison are:
I) The Computing-in-memory Edge (CIME) architecture proposed in [18], which is based on 4-terminal spin Hall effect-driven domain wall motion devices. In that work, three different parallelism levels have been proposed for the CIME architecture, which are referred to as the P1, P2, and P4 configurations. Similar to our work, a Verilog-A model of their MTJ-based devices is developed to co-simulate with the interface CMOS circuits in Cadence Spectre and SPICE. The authors have also used the 45nm North Carolina State University (NCSU) Product Development Kit (PDK) library for their SPICE simulations [18].
II) ASIC: The AES hardware accelerator proposed in [36], fabricated in 22nm CMOS technology, is the next candidate for this comparison experiment. Although this research work is not in the context of CIM, we have intentionally selected it to compare the MagCiM processor's efficiency with ASIC designs.
III) CMOL is a hybrid CMOS-Nano circuit proposed in [38] for the implementation of the AES algorithm. To evaluate the CMOL circuit, HSPICE simulations were conducted using a CMOS 45nm technology.
IV) A special-purpose spintronic-based CIME architecture designed to perform AES encryption [39], the so-called DW-AES. To evaluate DW-AES, the authors have used SPICE simulations using 32nm technology. Three different configurations of DW-AES have been proposed in this research, namely Baseline, Pipeline, and Multi-issue DW-AES [39].
V) The MIPS [34] and the pipeline MIPS processors [34] that we used in the previous subsection are used in this comparison as well.
VI) Finally, a software-implemented AES algorithm on a general-purpose processor (GPP) [35] is also used in our comparisons.
We have compared the MagCiM processor with the mentioned architectures in terms of energy consumption, occupied area, number of clock cycles, and leakage power dissipation. In our comparisons with the mentioned designs, we have used the results reported in their corresponding papers. As mentioned earlier, for the MIPS processors, we have used the simulation results of our MIPS implementations. To be consistent with the previous work, we have carried out our SPICE simulations with the 32nm Predictive Technology Model (PTM) for the MagCiM and MIPS processors.
Fig. 21, Fig. 22, and Fig. 23 show the leakage power dissipation, the occupied area, and the energy consumption results, respectively (all results are normalized to those of the MagCiM processor). As shown in these figures, both MagCiM and the Pipeline MagCiM consume less leakage power, occupy less area, and dissipate less energy than all of the other architectures. To accurately compare the MagCiM processor with the different architectures in terms of performance, we should measure the total execution time, i.e., the product of the total number of clock cycles and the clock period. As shown in Table 8, except for the CMOS ASIC design of AES [39], MagCiM has a shorter execution time than the other architectures.

E. A DISCUSSION ON EFFICIENCY AND APPLICATIONS OF MagCiM
It should be noted that MagCiM can be used as a tiny embedded microcontroller. In most tiny embedded microcontrollers, there is no hardware multiplier or divider (e.g., ATMega128, Nios II Processor, iCE40HX8K FPGA). In such cases, the missing logic is usually implemented as an assembly language routine and made available as part of the standard library. Therefore, such embedded processors, including MagCiM, take much more time to perform scientific computations than processors designed and optimized for high-performance computation. However, in embedded system applications such as edge devices, where power consumption and cost are serious concerns, the employment of such processors is inevitable. Firstly, in embedded applications, as opposed to high-performance applications, the computations are not complex; there is only a very limited number of complex operations with large operands in such applications. Secondly, most embedded systems are battery-operated systems that use a harvesting approach to charge the battery. Therefore, power consumption and robustness against power shortage are crucial factors. In the MagCiM processor, we have addressed these two important factors by using pure magnetic cells. The non-volatility of MagCiM (originating from the non-volatility of the PC, the Instruction Memory, and the Data Memory) protects data against temporary power outages caused by noise or by battery depletion when harvesting resources are unavailable. Thus, after energy becomes available again, the reading and execution of instructions continue from the point of the outage. In volatile processors such as MIPS, by contrast, the algorithm must be restarted from the beginning after a temporary power outage. Therefore, the non-volatile feature plays a major role in saving power and reducing the execution time. The higher the number of outages, the larger the performance difference between MagCiM and other processors.
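The effect of power outages on execution time can be made concrete with a simple model. The sketch below is an idealization under stated assumptions: outages are given as powered-cycle counts, restoring state after an outage costs nothing, a non-volatile processor resumes exactly where it stopped, and a volatile processor restarts from the beginning. The function name is hypothetical.

```python
def powered_cycles_to_finish(program_cycles, outage_times, non_volatile):
    """Powered clock cycles needed to complete a program despite outages.

    outage_times: powered-cycle counts at which power is lost (assumed model).
    """
    elapsed = 0    # powered cycles spent so far
    progress = 0   # cycles of useful work that are still retained
    for t in sorted(outage_times):
        run = t - elapsed
        if progress + run >= program_cycles:
            break  # the program completes before this outage occurs
        elapsed = t
        # A non-volatile processor keeps its progress; a volatile one loses it.
        progress = progress + run if non_volatile else 0
    return elapsed + (program_cycles - progress)
```

In this model, the non-volatile case always needs exactly `program_cycles` powered cycles, whereas the volatile case grows with the time of the last outage that interrupts the program, which is why more outages widen the gap.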
The main advantages of the proposed MagCiM are:
1) The MagCiM processor uses pure magnetic memories. In other words, only mCell devices have been used for the implementation of the memory cells. This makes the implementation of MagCiM easier and less expensive compared to the previous work (in the previous work, a combination of MTJs and transistors has been used in the structure of the memory cells).
2) As opposed to the previous works that employ MTJ devices, Read Disturb does not occur in MagCiM due to the use of mCell, in which the write and read paths are completely separated.
3) Previous works have employed a sense amplifier to extract the output, while MagCiM uses a majority gate. This design prevents Decision Failure. In addition, fewer devices are used in its implementation as compared to sense amplifiers.
4) As MagCiM does not use sense amplifiers, it is significantly robust against Single Event Upsets (SEUs) and Decision Failure. In addition, process variation has less effect on MagCiM.

X. CONCLUSION
In this paper, a novel Computing-In-Memory processor was proposed and evaluated. The MagCiM processor supports normally-off computing using an all-magnetic mCell-based memory. The paper combines the advantages of both non-volatile memory and computation-in-memory architecture to offer a low-power and high-performance processor. The throughput of the MagCiM processor was also boosted using the proposed pipeline architecture. Circuit-level simulation results revealed that the MagCiM processor consumes significantly less leakage power and energy and occupies less area than the previously proposed methods, while also offering the Normally-off computing capability, which is a very viable computing paradigm for the design of ultra-low-power cyber-physical embedded systems.

He has published more than 80 journal articles and conference papers, and served as a reviewer and a PC Member for a wide variety of IEEE, ACM, Elsevier, and Springer journals and conferences. He has also served as a Panelist for the National Science Foundation, for reviewing grant proposals. His research interests include security and reliability of multi-processor systems-on-chips (MPSoC), hardware security in embedded and cyber-physical systems, and hardware design and acceleration for deep/spiking neural networks.
MAHDI FAZELI received the M.Sc. and Ph.D. degrees in computer engineering from the Sharif University of Technology, Tehran, Iran, in 2005 and 2011, respectively. He is currently an Associate Professor with the School of Information Technology, Halmstad University, Sweden. His research interests include hardware security and trust, reliable VLSI circuits and systems, energy efficient computing, and dependable embedded systems. VOLUME 10, 2022