Efficiency of Double-Barrier Magnetic Tunnel Junction-Based Digital eNVM Array for Neuro-Inspired Computing

This brief deals with the impact of spin-transfer torque magnetic random access memory (STT-MRAM) cell based on double-barrier magnetic tunnel junction (DMTJ) on the performance of a two-layer multilayer perceptron (MLP) neural network. The DMTJ-based cell is benchmarked against the conventional single-barrier MTJ (SMTJ) counterpart by means of a comprehensive evaluation carried out through a state-of-the-art device-to-algorithm simulation framework. The benchmark is based on the MNIST handwritten dataset, Verilog-A MTJ compact models developed by our group, and 0.8 V FinFET technology. Our results point out that the use of DMTJ-based STT-MRAM cells to implement digital embedded non-volatile memory (eNVM) synaptic core allows write/read energy and latency improvements of about 53%/61% and 66%/17%, respectively, as compared to the SMTJ-based equivalent design. This is achieved by ensuring a reduced area footprint and a learning accuracy of about 91%. Such results make the DMTJ-based STT-MRAM cell a good eNVM option for neuro-inspired computing.


I. INTRODUCTION
Deep neural networks (DNNs) have been demonstrated in machine learning (ML) applications including image processing/classification/recognition, natural language processing, and visual intelligence [1], [2], [3].
Owing to features such as small cell area footprint, short programming time, and good endurance and data retention [4], [5], there is increasing interest in neuro-inspired computing based on emerging non-volatile memories (eNVMs) such as resistive RAM (RRAM), phase change memory (PCM), spin-transfer torque magnetic random access memory (STT-MRAM), and ferroelectric field-effect transistor (FeFET), which add flexibility to the development of DNNs. Although analog-synapse eNVM-based architectures can be competitive in terms of energy and latency, they mainly suffer from low online learning accuracy [6]. To deal with this issue, digital-synapse-based architectures have been widely considered [4], [7]. As a potential eNVM candidate for digital synapse devices, the STT-MRAM cell offers low operating voltage, reasonably fast operation, high density, relatively large endurance, low fabrication cost, low power consumption, and scalability [8], [9], [10]. Typically, STT-MRAM-based DNN implementations rely on conventional single-barrier MTJ (SMTJ) devices [11]. However, these devices require a high write current, which limits the overall energy efficiency and latency of the DNN. To counteract this, one solution is to use a double-barrier MTJ (DMTJ), with two reference layers, which enables higher-speed operation, lower power consumption, and a more energy-efficient switching process [10], [12], [13].
We evaluate the impact of the DMTJ-based STT-MRAM cell on DNNs by using the Cadence Virtuoso environment for circuit-level simulations, along with the multilayer perceptron (MLP) + NeuroSimV3.0 simulator for computing-in-memory (CiM) based neural network accelerators [7]. More precisely, NeuroSim is used to support a 2-layer MLP neural network to benchmark the DNN architecture, relying on SMTJ-based and DMTJ-based digital synapse devices, in online learning and offline classification with the MNIST handwritten digit dataset.
Our results show that the use of the DMTJ-based STT-MRAM cell in a digital eNVM synaptic core allows write/read energy and latency improvements of about 53%/61% and 66%/17%, respectively, as compared to the SMTJ-based counterpart. This is achieved while ensuring a learning accuracy of about 91%, suggesting that the DMTJ-based STT-MRAM cell could be a promising candidate for digital synapses in neuro-inspired computing.
This brief is organized as follows. Section II details the simulation framework, its customization, and its settings from device to algorithm level. Section III discusses the system-level performance evaluation in terms of accuracy, area, latency, and energy. Finally, Section IV concludes this brief.
II. SIMULATION FRAMEWORK - MLP + NEUROSIMV3.0
The NeuroSim simulator estimates the algorithm-level performance by emulating the online learning and offline classification scenarios with the MNIST handwritten digit dataset in a 2-layer MLP neural network [7], [14], [15], [16]. As shown in Fig. 1, the evaluation framework takes into account the whole system, from the device and bitcell levels to the memory architecture and algorithm levels. The input parameters of the simulation tool include memory type, non-ideal device parameters, transistor technology node, network topology and array size, training dataset and traces, etc. For the full list of input parameters/variables, the reader is referred to [7]. The outputs of the simulator include: (1) the memory architecture-level performance metrics, such as area, latency, and dynamic energy, and (2) the algorithm-level learning accuracy at run-time. As for the design options of digital synaptic arrays, SRAM or eNVM bitcells can be used.

A. Device Level
As shown in Fig. 1(a), we consider STT-SMTJ/DMTJ devices, whose main physical and performance parameters are listed in Table I. The STT-MTJs are described through Verilog-A based compact models [17], [18], which have been validated against full micromagnetic simulations and experimental results. In particular, the STT-MTJ models use experimental data reported in [10], [19]. These models further account for the impact of process variability on the STT-MTJs. Specifically, the variability, modeled through Gaussian-distributed variations, was set to 1% for both the free-layer and oxide thickness, 3% for the tunnel magnetoresistance (TMR) ratio, and 5% for the cross-section area [10].
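The variability settings above can be illustrated with a minimal Monte Carlo sketch. It assumes that each quoted percentage is a relative standard deviation (sigma/mean) and that the four parameters vary independently; neither assumption is stated in the text. The nominal values, apart from the 0.85 nm oxide thickness used later in this brief, are hypothetical placeholders and are not taken from Table I. The actual compact models are written in Verilog-A, so this Python code only mimics the sampling step.

```python
import random

# Relative standard deviations (sigma/mean), as quoted in the text.
SIGMA_REL = {
    "t_fl": 0.01,   # free-layer thickness
    "t_ox": 0.01,   # oxide thickness
    "tmr": 0.03,    # TMR ratio
    "area": 0.05,   # cross-section area
}

# Nominal values: HYPOTHETICAL placeholders for illustration,
# except t_ox = 0.85 nm, which is the value used in this brief.
NOMINAL = {
    "t_fl": 1.0e-9,    # m (assumed)
    "t_ox": 0.85e-9,   # m
    "tmr": 1.0,        # dimensionless, i.e. 100 % (assumed)
    "area": 7.0e-16,   # m^2 (assumed)
}


def sample_device(rng):
    """Draw one device instance with independent Gaussian variations."""
    return {
        name: rng.gauss(NOMINAL[name], SIGMA_REL[name] * NOMINAL[name])
        for name in NOMINAL
    }


def monte_carlo(n_samples, seed=0):
    """Return a list of n_samples randomly perturbed device parameter sets."""
    rng = random.Random(seed)
    return [sample_device(rng) for _ in range(n_samples)]


if __name__ == "__main__":
    for device in monte_carlo(3):
        print(device)
```

Each sampled parameter set would then be passed to the device model for one Monte Carlo run; how the real flow couples the samples to the Verilog-A models is not described here.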
1) SMTJ: The SMTJ consists of two ferromagnetic (FM) layers: one with fixed magnetization, called the reference layer (RL), and one with free magnetization, called the free layer (FL), whose magnetization direction can be changed by applying a current greater than the critical switching current of the device [10]. Based on the relative magnetization direction of the FL and RL, the SMTJ can reside in one of two stable states: parallel (P) or antiparallel (AP). If the two FM layers have the same magnetization direction, i.e., RL and FL in P, the resistance of the MTJ is low ($R_0$), indicating a "0" state. Conversely, if the two layers have different magnetization directions, i.e., RL and FL in AP, the resistance of the MTJ is high ($R_1$), indicating a "1" state [10].
2) DMTJ: The FL is sandwiched between two MgO oxide barriers, each of them interfaced with one RL. The low resistance state ("0") corresponds to the FL in P and AP with respect to the top RL and the bottom RL, respectively. In the high resistance state ("1"), the FL is in AP and P with respect to the top RL and the bottom RL, respectively. Accordingly, the DMTJ resistances in states "0" and "1" can be calculated as $R_0 = R_{P,T} + R_{AP,B}$ and $R_1 = R_{AP,T} + R_{P,B}$, respectively [10]. Owing to the presence of the second reference layer, the spin-transfer torque is enhanced [18]. Therefore, the write switching current is reduced as compared to the conventional SMTJ device.
Fig. 1(b) shows the considered SMTJ-based and DMTJ-based bitcell configurations, designed in a 28 nm FinFET technology featuring a nominal supply voltage of 0.8 V. These are cells with two complementary transistors and one MTJ (2T1MTJ), in reverse and standard connection (2T1MTJ-RC and 2T1MTJ-SC) for the SMTJ- and DMTJ-based bitcells, respectively. According to the study carried out in [10], these are the most write energy-efficient bitcell configurations.
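As an illustration of the state-resistance expressions given above for the DMTJ, the following sketch computes $R_0$, $R_1$, and the resulting conductance ON/OFF ratio. The function names and all numerical resistance values are hypothetical and chosen only for readability; they are not taken from Table I or from the compact models.

```python
def dmtj_resistances(r_p_top, r_ap_top, r_p_bot, r_ap_bot):
    """Return (R0, R1) of a DMTJ as the series sum of its two barriers.

    State "0": FL parallel to the top RL, antiparallel to the bottom RL.
    State "1": FL antiparallel to the top RL, parallel to the bottom RL.
    """
    r0 = r_p_top + r_ap_bot
    r1 = r_ap_top + r_p_bot
    return r0, r1


def on_off_ratio(r0, r1):
    """Conductance ON/OFF ratio: G_on / G_off = (1/R0) / (1/R1) = R1 / R0."""
    return r1 / r0


if __name__ == "__main__":
    # HYPOTHETICAL barrier resistances in ohms, for illustration only.
    r_p_top, r_ap_top = 4000.0, 8000.0   # top barrier
    r_p_bot, r_ap_bot = 500.0, 1000.0    # bottom barrier (assumed lower resistance)

    r0, r1 = dmtj_resistances(r_p_top, r_ap_top, r_p_bot, r_ap_bot)
    print("DMTJ: R0 =", r0, "R1 =", r1, "ON/OFF =", on_off_ratio(r0, r1))

    # An SMTJ with the same (top) barrier has R0 = R_P and R1 = R_AP.
    print("SMTJ: ON/OFF =", on_off_ratio(r_p_top, r_ap_top))
```

With these assumed values the series contribution of the second barrier lowers the DMTJ ON/OFF ratio relative to an SMTJ with the same top barrier, since the bottom barrier adds its high resistance to the "0" state and its low resistance to the "1" state.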

B. Bitcell to Memory Architecture Level
At the architecture level shown in Fig. 1(c) and Fig. 1(d), the two synaptic cores of the 2-layer MLP are considered. Each synaptic core is a computation unit specifically designed for weighted sum and weight update [7], [14]. Among the available design options for the synaptic cores, we considered the digital eNVM based on a pseudo-crossbar array.
For an accurate modeling of the MLP NN, we have adapted NeuroSim to match the considered FinFET technology node. To this aim, a fine-grained electrical characterization of the transistors was carried out using the Cadence Virtuoso tool. More specifically, the MLP NN uses transistor information such as gate capacitance, mobility, threshold voltage, ON/OFF current, etc. Therefore, the synaptic core and periphery neuron of the MLP NN are accurately built for the considered 28 nm FinFET process.

C. Algorithm Level
At the algorithm level, the standard MNIST benchmark data is used for online learning (60k images for the training dataset and 10k images for the testing dataset) and offline classification [7].
The considered MLP is a fully connected neural network, where each neuron node in one layer connects to every neuron node in the following layer. Fig. 1(e) shows the flow of the neural network, where the MNIST images are cropped and encoded into black-and-white data to simplify the hardware implementation. The network consists of an input layer, a hidden layer, and an output layer. The connections between the input-hidden and hidden-output layers are represented by the weight matrices $W_{IH}$ and $W_{HO}$, respectively. As shown in Fig. 1(e), the network topology contains 400 neurons (20×20 MNIST image) in the input layer, 100 neurons in the hidden layer, and 10 neurons (10 classes of digits) in the output layer.
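The 400-100-10 topology described above can be sketched in a few lines of plain Python. This is only a software illustration of the fully connected feed-forward pass, not the NeuroSim implementation: the sigmoid activation, the random weight initialization in [-1, 1], and all function names are assumptions introduced here.

```python
import math
import random

# Layer sizes as given in the text: 20x20 input image, 100 hidden, 10 classes.
N_INPUT, N_HIDDEN, N_OUTPUT = 400, 100, 10


def init_weights(n_in, n_out, rng):
    """Weight matrix stored as n_out rows of n_in values (random init is assumed)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(weights, inputs):
    """Weighted sum per output neuron, followed by an assumed sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def feed_forward(w_ih, w_ho, pixels):
    """Propagate one black-and-white image through both synaptic layers."""
    hidden = layer(w_ih, pixels)
    return layer(w_ho, hidden)


def num_synapses(weights):
    """Number of weights (synaptic connections) in one matrix."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    rng = random.Random(0)
    w_ih = init_weights(N_INPUT, N_HIDDEN, rng)    # W_IH: 400 x 100 connections
    w_ho = init_weights(N_HIDDEN, N_OUTPUT, rng)   # W_HO: 100 x 10 connections
    image = [rng.randint(0, 1) for _ in range(N_INPUT)]  # random 1-bit pixels
    outputs = feed_forward(w_ih, w_ho, image)
    print("synapses:", num_synapses(w_ih), "+", num_synapses(w_ho))
    print("predicted class:", outputs.index(max(outputs)))
```

The two matrices hold 400 × 100 = 40,000 and 100 × 10 = 1,000 weights, respectively, which is the amount of synaptic storage the two cores must provide before accounting for the number of bits per weight.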

III. SIMULATION RESULTS
The NeuroSim framework shown in Fig. 1 was calibrated with the 0.8 V FinFET technology parameters, along with the electrical characteristics of the considered 2T1MTJ-based bitcells, which are the cells of the pseudo-crossbar eNVM digital synaptic core. Bitcell-level results account for both SMTJ/DMTJ and FinFET device-to-device variability through extensive Monte Carlo simulations. Table II shows the bitcell-level parameters for the energy-optimal cell size and configurations (refer to Fig. 1(b)). It is worth mentioning that these results are obtained at parity of tunnel magnetoresistance (TMR) ratio and oxide thickness, i.e., $t_{ox,SMTJ} = t_{ox,t,DMTJ} = 0.85$ nm. Performance results for write and read operations are obtained while ensuring a write error rate (WER) of $10^{-7}$ and a read disturbance rate (RDR) of $10^{-9}$, respectively. From Table II, it is clear that, thanks to its reduced switching and read currents, the DMTJ-based bitcell is the more energy-efficient alternative under write/read operations. Overall, at the bitcell level, the DMTJ-based alternative shows energy savings of about 72% and 97% for read and write operations, respectively, while ensuring faster (65.7%) switching than the SMTJ-based bitcell.
The parameters reported in Table II were used as input to NeuroSim to evaluate the algorithm-level performance.
To limit the training time, we employ 15 epochs (i.e., number of training iterations), with 8000 and 1000 MNIST images for training and testing, respectively, giving a total of 120000 MNIST images being trained. We used the online-learning-in-hardware configuration, which handles testing and training, for both weighted sum and weight update, entirely in hardware.

A. Performance Analysis
The performance of the SMTJ- and DMTJ-based 2-layer MLP neural networks is evaluated in terms of learning accuracy versus latency and energy consumption, calculated at run-time.
The read (weighted sum, i.e., feed forward operation) and write (weight update operation) latency and energy are shown in Fig. 2. We can observe that, with the DMTJ-based eNVM cell, the weighted sum and weight update operations reach the highest accuracy much faster than with the SMTJ-based counterpart, while at the same time consuming less energy. This is due to the reduced energy and write pulse width of the DMTJ-based bitcell (refer to Table II).
From Fig. 2(a), it is worth noting that the delta latency (i.e., the time between iterations) in the feed forward operation is roughly the same for the SMTJ- and DMTJ-based alternatives, mainly due to their similar read pulse width requirements.
As for the weight update operation, the delta latency between epochs is 14 ms and 4.7 ms for the SMTJ- and DMTJ-based alternatives, respectively. These values are larger than in the feed forward operation because of the larger pulse width required for writing. As compared with the SMTJ-based alternative, the DMTJ-based cell shows a latency improvement of about 18% and 66% in feed forward and weight update operations, respectively, during online learning. Similar results have been obtained for the energy consumption, as shown in Fig. 2(b). The DMTJ-based cell shows lower energy consumption than the SMTJ-based alternative, owing to its reduced bitcell read/write energy. The results show an improvement of about 61% and 54% during feed forward and weight update, respectively.
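As a quick consistency check, the weight-update latency improvement can be recovered from the two per-epoch delta latencies quoted above, assuming that the 14 ms value refers to the SMTJ-based cell and the 4.7 ms value to the DMTJ-based cell, and that the improvement is defined relative to the SMTJ-based value:

```latex
\[
\frac{\Delta t_{\mathrm{SMTJ}} - \Delta t_{\mathrm{DMTJ}}}{\Delta t_{\mathrm{SMTJ}}}
  = \frac{14\,\mathrm{ms} - 4.7\,\mathrm{ms}}{14\,\mathrm{ms}}
  = \frac{9.3}{14} \approx 0.664
\]
```

This corresponds to the improvement of about 66% reported for the weight update operation.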
The benchmark results show that, while the DMTJ-based solution achieves a good accuracy (> 90%), the SMTJ-based neural network reaches a learning accuracy of about 83%.
The degradation in learning accuracy is attributed to the devices' poor ON/OFF ratio [6].
In addition, we estimate the area occupation as extracted from NeuroSim. Fig. 3 shows the total area footprint. The area occupation for the SMTJ-based and DMTJ-based alternatives is 0.0788 mm$^2$ and 0.0531 mm$^2$, respectively. The DMTJ-based design achieves the smaller area footprint owing to its smaller bitcell area (see Table II), which corresponds to the energy-optimal cell size.

B. Impact of Synaptic Device Properties on Accuracy
During the weight update, the ON/OFF conductance ratio of the device should be sufficiently large, i.e., the lowest conductance state (OFF-state) should be low enough to represent the zero weight in the algorithm [6]. To quantify the impact of the device properties on the learning accuracy, we carried out an analysis for both STT-MTJ alternatives by varying $t_{ox}$/$t_{ox,t}$.
Decreasing the oxide thickness of both devices affects the ON and OFF resistance of the bitcell. When considering a top barrier of $t_{ox,SMTJ} = t_{ox,t,DMTJ} = 0.80$ nm, the conductance ON/OFF ratios for the SMTJ- and DMTJ-based cells are 1.91 and 1.88, respectively. The reduced ON/OFF conductance ratio in the DMTJ-based cell can be explained by the presence of the second oxide barrier. As a result, the accuracy of the SMTJ-based cell increases by 5.9%, while that of the DMTJ-based cell decreases by 2.4% (see Fig. 4). Note that the use of very thin oxide barriers could lead to breakdown of the MTJ structure. To deal with this reliability issue, the write voltages have to be reduced [20].
Table III reports the energy, latency, accuracy, and area results obtained at different values of oxide thickness for the SMTJ- and DMTJ-based cells. From Table III, the SMTJ-based cell at $t_{ox} = 0.85$ nm has lower latency and energy consumption than the SMTJ-based cell at $t_{ox} = 0.80$ nm in the feed forward operation. In contrast, during the weight update, its latency and energy consumption are higher at $t_{ox} = 0.85$ nm. Moreover, during both the feed forward and weight update operations, the DMTJ-based cell at $t_{ox} = 0.80$ nm consumes less energy than its $t_{ox} = 0.85$ nm counterpart. Furthermore, the DMTJ-based cell at $t_{ox} = 0.85$ nm is faster than at $t_{ox} = 0.80$ nm in the weighted sum operation, whereas during the weight update the DMTJ-based cell at $t_{ox} = 0.80$ nm has lower latency than its $t_{ox} = 0.85$ nm counterpart.
Finally, we have also compared the DMTJ- and SMTJ-based solutions at $t_{ox} = 0.80$ nm. The DMTJ-based cell shows a latency improvement of about 23% and 57% in feed forward and weight update operations, respectively, compared with the SMTJ-based cell.
As for the energy consumption, the analysis shows results similar to those obtained at $t_{ox,SMTJ} = t_{ox,t,DMTJ} = 0.85$ nm, with improvements of about 62% and 55% during feed forward and weight update, respectively.

IV. CONCLUSION
In this brief, we have explored the STT-MTJ synaptic pseudo-crossbar array architecture and device/transistor models in NeuroSim. We have used the NeuroSim emulator to evaluate the learning accuracy of 2-layer MLP neural networks at the run-time of online learning in eNVM devices such as MTJ-based STT-MRAM. Our results show that, at parity of TMR and oxide thickness, as compared to the conventional SMTJ-based alternative, the DMTJ-based solution proves to be faster during feed forward and weight update operations by about 18% and 66%, respectively, more energy efficient under read (−60.7%) and write (−53.7%) operations, and less area hungry (−35%) at an energy-optimal bitcell configuration/size. This occurs while also achieving an accuracy close to 91% when running the neural network with the MNIST dataset. Our study suggests that DMTJ-based eNVM synaptic cores are good candidates to replace conventional SRAM-based solutions.