Neuromorphic In-Memory RRAM NAND/NOR Circuit Performance Analysis in a CNN Training Framework on the Edge for Low Power IoT

Training a CNN involves computationally intensive optimization algorithms that fit the network to a training dataset and update the network weights used for inference and subsequent pattern classification. The application of in-memory computation would therefore enable a highly power-efficient, low-latency on-the-edge CNN training technique by avoiding the memory wall created by external memory read/write operations (off-chip instruction and data transfer). A memory write-verify and re-program technique can control RRAM variability; still, memory verification and re-programming is a complex process that requires additional resources for the practical implementation of the verification circuit. In this study, we demonstrate a practical First-In Max-Out (FIMO)-based cache memory, called the Maximum Count Binary Comparator (MCBC) Layer, using 1T3R, 1T5R, and 1T7R RRAM structures in a probability-based accuracy-improvement architecture, without the conventional verification process. We constructed a 10-layer modified MobileNET with filter sizes ranging from 32 to 512 and trained it on the Traffic Sign Recognition Database (TSRD) using a three-tier abstraction simulation learning framework: (1) a high-level, 10-layer CNN implementation with Python+TensorFlow; (2) Verilog HDL based FP32MUL and FP32ADD (32-bit floating-point adder and multiplier) circuits constructed with RRAM NAND gates using 1T2R structures; and (3) a digital Look-Up-Table (LUT) model for RRAM variability. An edge learning framework (for the forward pass) is demonstrated using digital RRAM-NAND/NOR universal gates integrated with the MCBC layer to partially circumvent the impact of RRAM variability and to quantify the effect of RRAM variability on the CNN training prediction accuracy for 65nm CMOS OxRAM (TiN/HfO2/Hf/TiN) devices with current compliances of 5, 10, and $50\,\mu\text{A}$ for low-power IoT applications. The MCBC layer was simulated using a SPICE model, for which the estimated chip layout is $1150\times 1230$ nm$^2$ per logical gate input; repeating the logical operations of the NOR gate for {1, 3, 5, and 7} cycles improved the overall prediction accuracy from 10% to 60%, respectively.


I. INTRODUCTION
In today's deep learning era, advancements in artificial intelligence (AI) models are achieved with bigger training
datasets. However, bigger is not always better. Training high-performance AI models and deploying them entails a tremendous amount of computation power. The model training and inferencing are typically performed by servers placed in the datacenter. The estimated temperature range for a datacenter is between 21 °C and 24 °C, and in order to maintain this range, conventional heating, ventilation, and air conditioning (HVAC) systems such as Air Cooling, Free Cooling, Two-phase Cooled Systems, etc. are used, resulting in a substantial by-product: carbon emissions [1]. It has been documented that global carbon emissions due to cloud computing amount to 2.5% to 3.7% of overall global emissions, far outweighing the contributions from the aviation sector [2]. Training a deep neural network (DNN) model takes considerable mathematical calculation, long running times, high energy, and dedicated parallel processing units such as Intel CPUs, NVIDIA GPUs, AMD GPUs, and Google TPUs performing millions of floating-point operations per second [3], [4]. A consolidated list of popular chip makers and their chips with operational performance to power cloud-based deep learning model training is shown in Table 1.
The parallel processing chips used for training can achieve ∼10 × 10^5 GOPS (Giga Operations Per Second) with a peak power consumption of ∼500W and can perform computation at precisions from 8-bit integer up to 64-bit floating point [5]. The recent trend shows rapid progress in autonomous driving systems integrated with deep learning and AI-based navigation systems deployed for efficiency improvements in the transportation sector and for enhancing safer environments. The four major modules for the autonomous navigation system are (1) perception and localization, (2) high-level path planning, (3) low-level path planning, and (4) motion controllers. Today, all four modules use deep learning with LIDAR (to sense distance) and high-speed automotive camera data to perform the necessary sensing and timely control. Training an autonomous AI model equally requires high computation power, and therefore exploring the use of in-memory technology will help reduce the overall power consumption and enable moving the training process from the cloud to the edge [6], [7], [8].
With an in-memory computation system, the bottleneck and extra power barrier to achieving high-bandwidth data transfer between the external memory chip and the processor are significantly minimized using the non-von Neumann architecture [9]. The application of non-volatile memory device technologies such as resistive-switching random access memory (RRAM), phase-change memory (PCM), magnetic random-access memory (MRAM), and ferroelectric random-access memory (FeRAM) has been studied for in-memory applications [10].

TABLE 1. Popular chipsets used for cloud-based AI model training [5].

Here, we intend to study
further the application of lower power oxygen vacancy-based RRAM for in-memory circuits used for edge-based training to build AI models for autonomous systems. The oxygen vacancy RRAM (OxRAM) is popular for its ultra-low power switching, CMOS-compatible process fabrication, high endurance cycling, and multi-bit pseudo-analog memory storage [11]. Ultra-low-power in-memory computation makes low-power, battery-operated IoT applications practical. However, OxRAM devices exhibit stochastic switching due to oxygen ion/vacancy drift/diffusion and irregular stochastic conductive filament formation and rupture while switching the device between the two resistive states, namely the low resistance state (LRS) and the high resistance state (HRS) [12]. The following section reviews various methodologies and process improvements from different studies that deal with the imperfect switching and its effect on overall system performance when applied to a deep learning neural network.

A. OVERVIEW OF RRAM VARIABILITY CONTROL METHODS
RRAM switching variability is an inherent property of the diffusing oxygen ions in the switching process, and two prominent methods are widely studied to achieve more controlled and enhanced device switching. The first method uses various fabrication process improvements with different material stacks, while the second method relies on an appropriate re-programming scheme to identify the more defective devices on a given crossbar array and re-map them by re-programming these defective devices to improve device performance. However, the more prominent methodology is still to improve the fabrication process and material properties. The oxygen ion movement is stochastic in nature, and therefore it is not easy to achieve controlled switching, particularly at lower current compliance, when the number of oxygen vacancies comprising the filament is lower and more widely spread [13], [14], [15]. We see significant research and studies performed in the above areas, and therefore in this paper we focus on the second method: improving the variability using a re-programming scheme and re-mapping the more defective array group.
The studies on re-programming and verification using various programming architectures involving RRAM device model data [16], [17], [18], [19], [20], actual fabricated device array data, or intelligent workload mapping of device array data [21], [22], [23], [24], [25] are discussed in Table 2. The intelligent workload mapping strategy uses a Hill-Climbing-based local search technique to map the cluster-to-crossbar array and maximize the inference accuracy [16]. The Shift and Duplicate Kernel (SDK) convolutional weight mapping architecture uses multiple copies of the same weights, and the mapping algorithm selects the less defective data [17]. The above techniques used complex mapping algorithms to address the device variability; however, more area and power were needed to implement such techniques on silicon. The switching imperfection in RRAM due to the stochastic distribution of oxygen vacancies is improved by performing a two-step write-verification scheme where the device is programmed-verified-reprogrammed through an iterative process to achieve the device improvement [18], [19], [20], [21], [22], [23]. Alternatively, the defective RRAM array is grouped according to the defect severity level and the re-programming iterations are defined based on the severity levels, as shown by An et al. in Ref. [24]. A multiple RRAM re-programming and verification scheme was conducted using quantized trained weights to improve the RRAM variability by Pan et al. [25]. However, all the listed techniques have an additional step to verify and re-program. Here, the verification process is much more complex than the re-programming process, and it requires additional logic real estate on the silicon.

FIGURE 1. (a) Maximum Count Binary Comparator (MCBC) Layer configured with a 3-bit FIMO cache memory with Registers 1, 2, and 3. The input data is shifted 1 bit per clock from left to right, and upon reaching 3 clock cycles, the maximum occupancy probability F(ϒ) outputs a single bit with the most frequently occurring input value. (b) A stream of 3 input patterns and the most frequently occurring bit transmitted to the output after 3 clock cycles. (c) Simulation table showing the output (ϒ), which is the most frequently occurring bit among χ1, χ2, and χ3.

In this study, we explore a simple RRAM array, designed as cache memory, and used for
the reprogramming scheme without verification to keep the RRAM validation process quick and straightforward. Hence, we propose a practical and straightforward FIMO-based cache memory model that stores 3, 5, or 7 input bits and outputs a single bit matching the most frequently occurring input at a given instant of time, resulting in a probability-based accuracy improvement architecture called the ''Maximum Count Binary Comparator'' (MCBC) Layer. A two-stage combinational circuit demonstrates the functionality of the FIMO, as shown in Fig. 1(a): the first stage operates as counter logic to count the occurrence of each input (total number of 1's and 0's), and the second stage acts as a comparator to output the maximum value from the given inputs. A simple operational use case is shown in Fig. 1(b). The most frequently occurring bit of the input stream is propagated as the output, demonstrating the probability-based write/read architecture to mitigate the switching defects inherent to the RRAM device. The hardware-based reprogramming scheme uses 1T3R, 1T5R, and 1T7R structures for the RRAM-NAND and RRAM-NOR gates applied to the in-memory architecture.
The MCBC Layer acts as a special shift register, where the (1-bit) input data is shifted serially into the memory array (3, 5, or 7 bits), and outputs a single bit of data matching the most frequently occurring value in the input serial stream at any given instant of time, as shown in Fig. 1. The shift registers are configured with 3-bit, 5-bit, and 7-bit memory arrays, and each of the memory elements is connected in series. For the given 3-bit, 5-bit, and 7-bit configurations, the register takes serial input data sizes of 3, 5, and 7, respectively. Upon reaching the maximum array size, a single output is generated with the most frequently occurring input data from the array. A total of 3, 5, and 7 clock cycles is required for the above configurations to generate a single output. We refer to the MCBC logic scheme as Repetitive Cycles (RC), and it is further used in our learning simulation framework to enhance and study the RRAM variability trend when applied in an in-memory computation circuit. Henceforth in our discussion, RC represents the depth of the underlying shift register array used in the MCBC, ranging from 3 to 7. Following the functional simulation of the MCBC, we show a practical implementation of the MCBC circuit using transistor and RRAM logic in Section III.B. We also present a practical analysis of the MCBC using a 65nm SPICE transistor model and resistor circuit.
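To make the MCBC behavior concrete, the following minimal Python sketch models the FIMO majority-vote operation described above; the class name, the clearing of the register after an output, and the usage values are illustrative assumptions and not part of the actual framework code:

from collections import deque

class MCBCLayer:
    # Behavioral sketch of the Maximum Count Binary Comparator (MCBC) layer:
    # an RC-deep First-In Max-Out (FIMO) shift register that collects one bit
    # per clock and, once full, outputs the bit value occurring most often.
    def __init__(self, rc=3):            # rc = 3, 5, or 7 (repetitive cycles / register depth)
        assert rc in (3, 5, 7)
        self.rc = rc
        self.regs = deque(maxlen=rc)     # serial shift register

    def shift_in(self, bit):
        # Shift one bit in; return the majority bit after rc clocks, else None.
        self.regs.append(bit)
        if len(self.regs) == self.rc:
            out = 1 if sum(self.regs) > self.rc // 2 else 0   # F(Y): most occurring bit
            self.regs.clear()                                 # assumed: register resets per output
            return out
        return None

# Example matching Fig. 1(b): the stream 1, 0, 1 yields the majority bit 1 after 3 clocks.
mcbc = MCBCLayer(rc=3)
print([mcbc.shift_in(b) for b in (1, 0, 1)])   # [None, None, 1]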

B. VARIABILITY STUDY IN NEUROMORPHIC LEARNING CIRCUITS ON THE EDGE
Training a CNN is an unavoidable process, and significant studies have been conducted to move the training process from the cloud to the edge, using low-power in-memory computation architectures to avoid the excessive energy required to handle the memory wall during memory read-write operations. Table 3 summarizes various edge-based in-memory studies conducted for low-power IoT applications. A study conducted by Feng et al. in Ref. [26] shows an OxRAM device stack used to construct a neural network of size 144-40-10 to train the MNIST dataset using a hybrid compute-in-memory (CIM) architecture, illustrating the CNN model accuracy impact for the given RRAM device. In another study, a similar oxide device exploration was conducted by Li et al. [27] on a 784-400-10 MLP structure to emulate an FPGA+ARM based RRAM crossbar array to train the MNIST dataset and benchmark the prediction accuracy drop versus RRAM device variability. Adding to this list, a (10nm) HfOx/Ti/TiN resistance distribution data set was used to build a shallow DNN network using a PyTorch learning framework with neuron activations that take binary values (+1 and −1) with a simple XOR operation scheme by Majumdar et al. in Ref. [28]. These studies are conducted on simple neural networks, whereas a practical network would comprise deep hidden layers with tightly packed convolution operations to achieve the required prediction accuracy. Therefore, exploring a simulation framework with practical, deep NN layers will be useful in understanding the trend of RRAM variability for a real-world application.
The RRAM non-ideal model performance was studied by Wang et al. in Ref. [29] using deep practical CNNs such as VGG and ResNET architectures to train the MNIST and CIFAR-10 datasets. Here, (1) quantization-aware, (2) device-aware, (3) tile-aware, and (4) conductance weight mapping architectures are used to study the impact of the RRAM programming variability on the training network. A second study by Qiu et al. [30] was conducted on deep practical CNNs (LeNET/VGG/AlexNET) using a SPICE RRAM model. A CIFAR-10 dataset was used with an MNSIM2.0-based Training-In-Memory simulation framework to perform training at the edge and analyze the CNN prediction accuracy. The two studies aim to show the device variability trend by using RRAM models that use a small memory array. Hence, such a training framework can be tested for different RRAM current compliances with varying memory window overlap. A study with varying RRAM current compliance is necessary to quantify the RRAM variability and its impact on training accuracy at different current compliance scales when it is considered for an IoT application at very low power.
The RRAM crossbar array enables interesting multiplication and addition operations implemented using Kirchhoff's law to apply the convolution operations of a CNN. The crossbar arrangement uses comparatively fewer resources. However, the overhead of using external peripherals such as the ADC/DAC to read and write the data on the analog crossbar array is significant in terms of power and area. The CNN is parallel in nature, where the total number of parallel processing computational elements defines the throughput latency. Hence the primary constraints for designing an IoT system are defined by the total available processing power and the chip area occupied by the parallel processing elements, which in turn define the latency of the system. The study conducted by Yu et al. in Ref. [31] used TSMC 40nm RRAM technology and Intel 22nm RRAM technology to build a VGG-8 NN trained on a CIFAR-10 dataset using a modified NeuroSim, with the intention of optimizing ADC use by means of a MUX-based ADC. The same group utilized a mixed RRAM design with RRAM memory designed for MSB bits and regular memory used for LSB bits, as shown in Ref. [32]. Another work by Liu et al. in Ref. [33] uses a 1Kb quantum point contact oxide RRAM model to build VGGNet on a CIFAR-10 dataset, applied on a device-to-system simulation framework to validate the on-chip training performance. A similar edge-based training analysis was conducted by Giordano et al. in [34] on a 40nm-CMOS foundry on-chip RAM used on ResNet-18 with an ImageNet dataset, effectively demonstrating the training accuracy performance. Considering the above studies, it is evident that exploring the use of an RRAM-based in-memory computation system in the digital domain will be useful to eliminate the ADC/DAC overheads, with a synchronized global digital clock driving the system for a more organized and controlled operation to achieve high-speed digital in-memory computation logic.

FIGURE 2. The various steps of a CNN training operation, shown as the forward pass and backward pass, both executed in a loop to compute the training weights and loss function. The forward pass consists of transformation and mapping operations such as convolution, pooling, fully connected, and softmax, whereas the backward pass computes the weight difference between the 'n' and 'n-1' loop iterations to derive the optimized weights and training loss or error.

Hence, we focus on designing and operating the RRAM as digital memory to harness the
advantage of avoiding the peripheral circuit overheads and achieve high external noise immunity when designed as a digital system.
Our current work aims to build a full-sized practical CNN with 10 layers (filter sizes ranging from 32 to 512) of a modified MobileNET trained with the Traffic Sign Recognition Database (TSRD) using a three-tier abstraction simulation learning framework: (1) a high-level 10-layered CNN implementation with Python+TensorFlow; (2) Verilog HDL based FP32MUL and FP32ADD (32-bit floating-point adder and multiplier) circuits constructed with NAND gates of 1T2R structures for the logic computations; and (3) a Digital Look-Up-Table (LUT) model for encoding RRAM variability.
The novelty of our work is the methodology of an edge learning framework (forward pass) using digital RRAM-NAND/NOR universal gates integrated with the Maximum Count Binary Comparator (MCBC) Layer to control the impact of RRAM variability and to quantify the RRAM variability on the CNN training prediction accuracy for varying low device current compliances ranging from 5 to 50µA for ultra-low power IoT applications. We have also demonstrated a practical implementation of the RRAM-NAND-based standard cell and simulated it using the SPICE model. Today we see frameworks such as TinyML targeted at embedded processors and performing well for low-footprint applications. In such embedded-specific CNN architectures, resource reduction is achieved by pruning and shrinking the deep neural network to fit a small-memory computing device. Such systems are popular for binary result conditions, such as ''visual wake words'' saying YES or NO, to predict the trained label from the input image. However, in a vision-guided autonomous system, a CNN navigation guidance system requires considerably richer feature mapping than simple YES or NO prediction conditions, which is very well achievable using a deep CNN such as MobileNET, which uses a wide range of filter sizes and intense convolution operations. Large-scale datasets such as CIFAR100 and COCO are structured with a few hundred generalized categories, whereas an autonomous application trained with the traffic road sign dataset will aid in developing a more optimized navigation system with a higher recognition rate by training the CNN with a specific ''traffic road sign'' dataset. Section II presents the CNN simulation framework used to train a model to identify the road signs, wherein the convolution operations are implemented with RRAM variability encoded floating-point multipliers and adders, and the prediction error loss is computed between the hardware and software pipelines. Section III discusses the results and trends obtained using the learning framework for different current compliance RRAM datasets and the prediction trend variation obtained by applying the Maximum Count Binary Comparator Layer to each RRAM NAND gate. We conclude our work in Section IV with a summary and inferences based on all the analyses carried out.

II. SIMULATION METHODOLOGY FOR DESIGN AND IMPLEMENTATION OF RRAM BASED IN-MEMORY COMPUTATION MODELS APPLIED TO TRAINING CIRCUIT

A. FUNDAMENTAL ELEMENTS OF CNN TRAINING OPERATION AND WEIGHT COMPUTATION

1) TRAINING
Training is a constructive process that involves parallel computation performed on the input images, applying convolution and linear transformation operations for feature map extraction, and finally calculating the loss function during the Backward Pass to update the learning weights, as shown in Fig. 2. The training process is classified into Forward Pass and Backward Pass operations; the former passes a batch of input images through the given network in the forward direction, and the predicted output is then compared with the actual labels. Knowing how close the predicted values are to the actual labels, the weights of the network are adjusted using the Backward Pass operation, so that the network predicts closer to the actual labels the next time the images are seen. This process is repeated for all available batches of images, which constitutes an epoch. The training process is performed for as many epochs as necessary to achieve the desired accuracy level, and the five steps involved are: (1) prepare the images in a batch for training, (2) pass the batch to the network (Forward Pass), (3) calculate the loss between predicted and actual data, (4) compute the gradient of the loss function and update the weights to reduce the loss, (5) loop: repeat 1-4 for 'n' iterations to reach the required minimum error.
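The five steps map onto a standard mini-batch training loop. A minimal TensorFlow sketch is given below; the names model and dataset and the hyper-parameter values are placeholders and not the actual framework code:

import tensorflow as tf

def train(model, dataset, epochs=32, lr=0.01):
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr)        # SGD, as used in this work
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    for epoch in range(epochs):                                   # Step 5: repeat for 'n' iterations
        for images, labels in dataset:                            # Step 1: batch of training images
            with tf.GradientTape() as tape:
                predictions = model(images, training=True)        # Step 2: Forward Pass
                loss = loss_fn(labels, predictions)               # Step 3: loss (predicted vs. actual)
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))  # Step 4: weight update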

2) FORWARD PASS
The Forward Pass is a pipeline with stages of chained matrix transformation functions performed to extract features from the input image to classify it against the trained label class. The various transformation functions are convolution, ReLU activation function, Maxpool operation, fully connected layer and Cross-Entropy coding (Softmax). Each layer in the network has its own transformation, and all individual layers constitute the total transformation of the given network. The objective of this operation is to transform and map the inputs to the right output class with minimum possible error [35].

3) CONVOLUTION
Convolution is the sum-of-products (SOP) matrix operation between the input image and its local neighborhood or filters. The filter matrix of size (n × n) slides over the image matrix (m × m), and the SOP between these two matrices generates the feature map, with the consideration that m is greater than n. The stride is the size of the step the convolution filter moves each time over the image matrix [36].
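As a minimal illustration of the SOP operation, the following NumPy sketch (function and variable names are illustrative) slides an (n × n) filter over an (m × m) image with a configurable stride:

import numpy as np

def conv2d(image, kernel, stride=1):
    # Sum-of-products (SOP) convolution of an (m x m) image with an (n x n) filter,
    # assuming m > n and no padding.
    m, n = image.shape[0], kernel.shape[0]
    out = (m - n) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = image[i*stride:i*stride + n, j*stride:j*stride + n]
            fmap[i, j] = np.sum(window * kernel)   # SOP at one filter position
    return fmap

# Example: a 5x5 image convolved with a 3x3 filter at stride 1 gives a 3x3 feature map.
print(conv2d(np.arange(25.0).reshape(5, 5), np.ones((3, 3))).shape)   # (3, 3)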

4) ReLU-MAXPOOL
The rectified linear unit (ReLU) activation function is a piecewise-linear operation whereby the output equals the given input if it is positive, while for all other cases the output is held at zero [37]. ReLU is a popular function among many neural networks because it performs well for a broader class of data sets. Max Pooling is a down-sampling approach that reduces the computation load and avoids network overfitting [38].

5) FULLY CONNECTED LAYER
The feed-forward neural network known as the Fully Connected layer forms the last few layers in the CNN and receives a flattened, 1-dimensional data array from the final pooling or convolution layers [39].

6) SOFTMAX
The Softmax is an activation function, used with cross-entropy loss, that predicts a multinomial probability distribution at the last or output layer of the neural network for classification problems where class membership requires more than two class labels [40].
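For reference, softmax maps the output-layer logits $z_i$ of the $K$ classes to probabilities, and the cross-entropy loss is computed against the one-hot label $y_i$:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\sum_{i=1}^{K} y_i \log \sigma(z)_i$$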

7) BACKWARD PASS
The Backward Pass is a computationally intense operation where the weights are updated based on the learning rate and by computing the past and present weight gradients [41]. Adam optimization and Stochastic Gradient Descent (SGD) are the commonly used gradient-based optimizers, and SGD is used as the gradient calculation algorithm in our current training framework.
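The SGD weight update used here can be stated compactly, with learning rate $\eta$ and loss $L$ computed over a mini-batch:

$$w_{n} = w_{n-1} - \eta \, \nabla_{w} L(w_{n-1})$$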

8) LOSS FUNCTION
The Loss Function, one of the critical operations in the training process, is the difference between the expected value and the predicted value. The loss is used to calculate the gradients, which are further used to update the weights of the network layers [42].

9) ACCURACY METRICS
The confusion matrix is one of the more useful accuracy metrics; it shows where the trained model became confused in predicting the correct labels and helps us re-train the model with more related data sets to improve prediction accuracy.

B. OVERVIEW OF THE TRAINING DATASET AND CNN USED FOR SIMULATION
Autonomous vehicles are developed to make roads safer by reducing human error. However, safety here is a function of how precisely the navigation system identifies its surroundings and how fast decisions are made with uncompromising accuracy. The navigation system must process multi-dimensional parameters such as vehicle speed, road lane identification, the position of neighbors (vehicles and civil structures), road and traffic signs, weather conditions, etc., to make the right navigation decision. The identification of traffic signboards is one of the primary essentials of autonomous systems, and we have used the Traffic Sign Recognition Database (TSRD) for our training simulation and analysis. The TSRD is extracted from the traffic sign database repository of Beijing Jiaotong University, China, and it consists of 6000 images, of which 4000 images are used for training and 2000 are test/validation images. The images are classified into 10 categories: No_Entry, No_Left, No_Right, No_Parking, Stop_sign, Entry, Left_Turn, Right_Turn, Parking, and Pedestrian Crossing, as shown in Fig. 3. There is an eleventh category called Others, under which any sign that does not fall into the previous 10 categories is grouped; all our simulation results are shown for the main 10 categories only. The first 5 traffic sign categories are the mandatory signs that signal the navigation system to actions that represent ''not to do'' or ''can't do'' and are designated with a red color border (No_Entry to Stop_Sign); the next 5 sign categories include ''can do'' symbols with blue and white colored signs. The 10 categories of labelled images were trained for 50K steps over 32 epochs using Stochastic Gradient Descent weight optimization for the MobileNET CNN. In real time, the front-facing camera and traffic sign recognition system that complement the navigation system must filter out background scenes and should be able to read signs from an angle, with faded text, or on broken signboards, ensuring consistent readability at high vehicle speed, etc. All of these conditional parameters and considerations are out of the scope of this study and are considered problems with already available solutions. Here, we assume the camera system has inbuilt pre-processing modules for image enhancement. The backend CNN system receives a good quality, pre-processed image data set that is to be classified into any one of the trained 10 categories with high accuracy.
The MobileNetV1 [43] CNN is designed to operate on small-footprint embedded devices with reduced model size and complexity. The CNN operates with two different convolution modules, which comprise the Depth Separable Convolution (DS) followed by the Pointwise Convolution (PW). The DS performs a single convolution on every channel rather than combining all three and flattening them. The DS is a 3 × 3 convolution operation, and all the outputs are combined with a single 1 × 1 PW convolution in a single step. The DS is constructed with two layers, one for filtering and the other for combining all the outputs; hence this factorization significantly reduces computation and is suitable for IoT applications. MobileNet is a 30-layer architecture with a combination of stride-2 convolution blocks, depthwise-pointwise convolution blocks, fully connected layers, and the softmax classifier. The convolution layers are computationally intense and consume more than 80% of the overall device power. Hence, we confine our simulation and study to the convolution layers, and Fig. 4(a) shows the slightly modified first 10 convolution layers of MobileNet that we have used for our device variability simulation framework. Fig. 4(b) shows the core modules of every convolution layer, such as the Conv module configured with a 3 × 3 convolution, followed by Batch Normalization (BN) and ReLU layers. In contrast, the second core, the Conv_dw module, configures a 3 × 3 convolution and BN, further applied with a 1 × 1 convolution, BN, and ReLU. The Conv layers operate with a stride of 2 while Conv_dw operates with a stride of 1. We have considered using the 10 convolution layers in 5 different groups as shown in Fig. 4(a), with Group 5 comprising layers 1 and 2 with a filter size of 32; Group 4 including layers 1 to 4 with filter sizes of 32 and 64; and further groupings performed similarly all the way up to Group 1, with 10 layers and filter sizes from 32 up to 512. The trained weights in Groups 1-5 are represented in a 32-bit floating-point format, which consists of the mantissa (23 bits), exponent (8 bits), and sign bit (1 bit).
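A minimal TensorFlow/Keras sketch of the Conv and Conv_dw modules of Fig. 4(b) is given below; the layer ordering follows the description above, while the input shape, filter count, and an added ReLU after the depthwise BN are illustrative assumptions rather than the exact configuration of the modified network:

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, stride=2):
    # Conv module: 3x3 convolution -> BN -> ReLU (stride 2)
    x = layers.Conv2D(filters, 3, strides=stride, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def conv_dw_block(x, filters, stride=1):
    # Conv_dw module: 3x3 depthwise convolution -> BN (-> ReLU, assumed),
    # then 1x1 pointwise convolution -> BN -> ReLU
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Example: the two layers of Group 5 (filter size 32).
inputs = tf.keras.Input(shape=(224, 224, 3))
x = conv_block(inputs, 32, stride=2)
x = conv_dw_block(x, 32, stride=1)
model = tf.keras.Model(inputs, x)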

C. LUT-BASED RRAM VARIABILITY ENCODING SCHEME AND NEUROMORPHIC SIMULATION FRAMEWORK TO TRAIN ROAD SIGN DATASETS
The TiN/HfO2/Hf/TiN RRAM element's electrical characterization data set showing device-to-device and cycle-to-cycle variability is considered for our in-memory 1T2R gate logic simulation. The RRAM resistance data is extracted from Fantini et al. [15] with varying I_comp switching data for 5, 10, and 50 µA. The resistive element was fabricated on a 65nm CMOS process with an oxide thickness of 5 nm. Since we intend to examine the low-power IoT regime, we confine our simulation to an OxRAM device operating in a low-power regime and do not consider a CBRAM device, which has comparatively higher power due to the metallic nature of the switching conductive filament [44], [45]. The objective of our simulation framework is to demonstrate the configuration of an RRAM-based device in a CNN, where the device operates in the digital domain. The device's Low Resistance State (LRS) is considered the ''Logic-0'' state, and the High Resistance State (HRS) is considered ''Logic-1'' in our simulation framework. We have proposed a practical circuit in Section III.B, demonstrating a digital RRAM-based gate used to build the convolution operations as digital circuits with high noise immunity, driven by a digital synchronizing clock for precise and high-speed operation compared to analog circuits. We perceive RRAM devices to be configured on a crossbar array, demonstrating multiple resistive states with greater power saving and higher density (smaller chip area). In spite of all its advantages, the given crossbar design exhibits a sneak path effect. Moreover, additional peripheral circuits such as ADCs and DACs are employed to read and write from the RRAM device in a crossbar array. Today's image acquisition and preprocessing modules operate in a digital pipeline with a Digital Signal Processing (DSP) backbone, where the images captured in a CMOS sensor are transmitted to an Image Signal Processing module for preprocessing, followed by CNN recognition and finally a post-processing module for final presentation. As such, the entire pipeline is digitized.
An analog crossbar CIM (Compute-In-Memory) RRAM system becomes challenging to fabricate as a mixed-signal processing SoC (System On Chip) for a vision-based neuromorphic application. The I_comp-dependent LRS and HRS logarithmic resistance distributions extracted from Fantini et al. [15] are shown in Fig. 5(a). Plotting the resistance as a normal distribution shows an overlap region between the LRS and HRS states, as shown in Fig. 5(b); considering the LRS as Logic-0 and HRS as Logic-1, these overlap regions show the possibility of false-0 and false-1 reads, which result in the device-level variability being translated into the final circuit. The false-0 and false-1 regions represent the memory window overlap region. For higher current compliance, the overlap is minimal or negligible. For lower current compliance, the overlap is significant and critical. Therefore, quantifying the impact of the device variability for different current compliances is critical to understanding the final usability of the device.
We propose a Look-Up-Table (LUT) model to encode the RRAM variability into RRAM-based NAND and NOR gate logic. The universal gates are the basic building blocks for any digital circuit. With the RRAM NAND and NOR logic, we aim to construct a convolution framework to demonstrate the 10-layered MobileNET. The LUT model works by encoding the resistance region from 0 to Log10(0.5R) as Logic-0 and the next half of the resistance distribution region, from Log10(0.5R) to Log10(R), as Logic-1, from the Fantini et al. [15] data set for 5, 10, and 50µA, respectively. A 1T2R structure is further proposed, as shown in Fig. 5(d), where two parallel RRAM devices are connected to the gate terminal of the MOS transistor. Here, the RRAM resistance threshold controls the transistor gate operation in a sequence that demonstrates the NAND and NOR functionality.
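A minimal Python sketch of the LUT encoding and of a variability-encoded 1T2R NAND read is given below. The resistance distributions, sample counts, and function names are illustrative placeholders (the actual LUT holds the measured data from Ref. [15]), and the decision threshold is simplified to the midpoint of the two placeholder distributions rather than the exact Log10(0.5R) boundary used by the framework:

import numpy as np

rng = np.random.default_rng(0)

# Placeholder log10-resistance distributions (ohms); the real LUT holds 5000 measured samples per state.
LRS_LOG_R = rng.normal(4.0, 0.5, 5000)      # intended Logic-0
HRS_LOG_R = rng.normal(5.5, 0.5, 5000)      # intended Logic-1
THRESHOLD = 0.5 * (np.median(LRS_LOG_R) + np.median(HRS_LOG_R))   # simplified decision boundary

def read_bit(intended_bit):
    # Sample one resistance from the LUT for the intended state and decode it.
    # A sample landing on the wrong side of the threshold yields a false-0 / false-1.
    sample = rng.choice(HRS_LOG_R if intended_bit else LRS_LOG_R)
    return int(sample > THRESHOLD)

def rram_nand(a, b):
    # 1T2R NAND evaluated on variability-encoded reads of its two inputs.
    return int(not (read_bit(a) and read_bit(b)))

print([rram_nand(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))])   # ideally [1, 1, 1, 0]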

1) NOR GATE
When inputs A and B are at Logic-0 (zero voltage), the transistor is not gated and stays in the OFF state, as a result of which the output is at Logic-1, close to VCC (the supply voltage). When input A, B, or both are supplied with a Logic-1, an equivalent current resulting from the parallel resistors flows into the gate of the transistor, supplying the required gating voltage to turn the transistor ON. This results in a low resistance path from VCC to ground (GND) and causes the output to be held at Logic-0 for the given input conditions, demonstrating a NOR operation.

2) NAND GATE
The operation is reversed for the NAND gate, where the resulting voltage from the two input resistors reaches the gating threshold only when both inputs are held at Logic-1. For all other input combinations the gating threshold is not met and the transistor stays OFF, demonstrating the NAND operation.
The RRAM-based NAND and NOR circuits demonstrated here may work theoretically, but in order to make these circuits a practical working system, we need to add a gate biasing resistance and carefully choose the input resistance range to achieve the gating sequence logic of the NAND and NOR gates. We have simulated and shown the various resistance values for the inputs and biasing resistance in Section III.A.
The learning process used here is constructed as a three-stage simulation framework using TensorFlow and Python programming, as shown in Fig. 6. Stage 1 is the Forward Pass of the MobileNET CNN with 10 layers of varying and increasing filter size, ranging from 32 to 512, with Depth-Wise and Point-Wise convolution modules, as shown in the layer groups of Fig. 4. The Stage-1 Forward Pass consists of two parallel CNN computation pipelines called the Software (SW) and Hardware (HW) pipelines. The SW pipeline was purely implemented with the TensorFlow framework, while the HW pipeline was implemented with a combination of Python and Verilog HDL, with the RRAM variability encoded NAND gates used to perform the convolution operations in all 10 layers of the CNN. Every NAND gate in the HW pipeline is implemented using the 1T2R structure shown in Fig. 5(d), and furthermore, the two RRAM devices in the NAND gate are encoded with the varying resistance data using the LUT model, where the 0 to Log10(0.5R) range defines Logic-0 and Log10(0.5R) to Log10(R) gives Logic-1.
The Hardware pipeline is programmed with Python and Verilog HDL, and the data exchange between the two programming languages is handled through CSV (Comma-Separated Values) files, as shown in the block diagram in Fig. 7. Here, the Python program implements the 10 layers of the MobileNET architecture with varying filter sizes; the sub-routine of the convolution operations inside the 10 layers calls the Verilog program, where a combinational circuit of 32-bit floating-point multipliers and adders is used to perform the convolution operation. As discussed in our previous work [46], the general Verilog convolution implementation includes seven layers of abstraction that expose all the NAND gates used for the convolution operation. Hence, by looking at the NAND structure, we substitute our proposed 1T2R structure to obtain the truth table for the NAND as in Fig. 7. The low-level subroutine in the Verilog code defines the 1T2R functionality for the NAND gate, and the simulation framework uses a LUT file with a 5000-sample varying RRAM resistance data set for LRS and HRS, from which the framework randomly reads and assigns the 1T2R resistance values. Thus, the false-0 and false-1 states from the RRAM data set are encoded into the FP32MUL and FP32ADD modules through the 1T2R NAND structure, which in turn affects the convolution operations performed in the HW pipeline.
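The Python-to-Verilog hand-off can be pictured with the short sketch below; the simulator invocation (Icarus Verilog) and the file names are assumptions for illustration only, as the original framework's scripts and tool choice are not detailed here:

import csv
import subprocess

def hw_conv_via_verilog(operands, verilog_src="fp32_conv.v", sim_out="sim.out"):
    # Write FP32 operands for the FP32MUL/FP32ADD testbench to a CSV file,
    # run the Verilog simulation, and read the convolution results back.
    with open("operands.csv", "w", newline="") as f:
        csv.writer(f).writerows(operands)
    subprocess.run(["iverilog", "-o", sim_out, verilog_src], check=True)  # compile (assumed tool)
    subprocess.run(["vvp", sim_out], check=True)                          # run; testbench reads/writes the CSVs
    with open("results.csv") as f:
        return [[float(x) for x in row] for row in csv.reader(f)]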
Stage 2 is the backpropagation process, where the error value is calculated as the weight difference between the expected and computed image, and Stochastic Gradient Descent (SGD) is implemented to adjust and update the training weights. Here, in Stage 2, the entire process was implemented using TensorFlow and Python, and no hardware logic (RRAM NAND) was used. Finally, Stage 3 is the simulation framework that predicts the difference in calculation between the SW and HW pipelines from Stages 1 and 2. The difference between the prediction percentages (confidence) using SW- and HW-trained weights quantifies the variability inherited from the underlying RRAM NAND logic (1T2R). Every NAND logic is coupled to an MCBC layer to improve the RRAM variability, as in a computational logic pipeline even a few gate failures can result in a massive system calculation failure. The MCBC layer works like an intermediate cache memory which takes 3 to 7 inputs and outputs the best, or most frequently occurring, data among the inputs at any given instant. The MCBC layer decreases the probability of device variability/failure to enhance the system performance. The prediction error reduction is quantified in detail for three different configurations of MCBC (RC 1, 3, and 7) in the following results and discussion. A practical MCBC layer is demonstrated and discussed in Section III.B. The Python pseudo-code for the SW and HW pipeline is shown in Fig. 8, where Fig. 8(a) shows the 10 layers of convolution implemented with the specific filter sizes, and Fig. 8(b) shows the two different subroutines of SW and HW. The SW is implemented with a direct TensorFlow library call, and the HW subroutine calls the Verilog model to perform the convolution operations.

FIGURE 6. A TensorFlow Python and Verilog HDL-based three-stage simulation framework to validate the RRAM variability encoded CNN training network. STAGE-1: constructed with the forward pass software pipeline, SW (actual logic), and hardware pipeline, HW (RRAM encoded logic). STAGE-2: Backpropagation using stochastic gradient descent (SGD) to update the trained weights. STAGE-3: Prediction error computation between the accuracies obtained from the SW and HW pipelines.

III. RESULTS AND DISCUSSION-PERFORMANCE OF RRAM BASED IN-MEMORY COMPUTATION CIRCUIT IN A CNN TRAINING SYSTEM
6000 different traffic sign images were used for the training framework, of which 4000 images were used for training and the rest were test images. A Stochastic Gradient Descent (SGD) weight optimizer and an error function were used to configure the framework to compute the training accuracy given by the loss function. Final trained weights were obtained from the 50,000 training steps resulting from different simulation tests using the above parameter setup. Under the various scenarios, the training test results obtained are plotted on the Y-axis while the corresponding training steps (epochs) are plotted on the X-axis, and the resulting trends are examined in depth in the different cases as follows.

A. IMPACT OF DIFFERENT MAXIMUM COUNT BINARY COMPARATOR LAYERS, HIDDEN LAYERS AND TRAINING IMAGE LABEL CATEGORY ON PREDICTION ERROR FOR VARYING RRAM COMPLIANCE
The training accuracy trend is examined for the following parameters: 10 convolution layers (Group 1), 10 different image categories, the 10µA RRAM current compliance resistive encoded data set for the HW pipeline, and four different Repetitive Cycles (RC), viz. RC = 1, 3, 5, and 7, as shown in Fig. 9. The objective of this analysis is to plot the variability in training trends for the different RC, as it is known that increasing the RC reduces the RRAM variability as a result of probability. Here, we aim to quantify the effect of variability on the corresponding trend improvement in the final prediction accuracy as the training steps increase. The figure is generated from five simulation results: the SW pipeline computed software-trained data (SD) and four different RRAM variability encoded HW pipeline logics with RC = 1, 3, 5, and 7, with the obtained training accuracy on the Y-axis and the number of training steps on the X-axis. As the training steps increase to 9000, the SD accuracy gradually increases to 12%; thereafter, the SD accuracy rises steeply to 88% at 27,000 steps, and at 50,000 steps the obtained training results are 92% accurate. We use the SD data set training accuracy as a benchmark to compare the four different hardware-trained logics. As explained, the RC-3 logic is performed by repeating the basic NAND operation three times with RRAM encoded data and choosing the most frequently occurring outcome from the three results; hence the entire convolution operation in the given 10 layers is repeated three times in total for RC-3. The same concept applies to RC-5 and RC-7, meaning the operations are repeated five and seven times respectively, while RC-1 is simulated with no repetitive cycle. The 10µA RRAM data set has an approximately 10% overlap in the LRS and HRS resistive distributions, and the impact on the training accuracy is plotted using the given overlap and different repetitive cycles. RC-1 reaches a maximum of 15% accuracy at the 50,000th training step, with the accuracy almost reaching a steady oscillation state from 13,000 steps onwards. RC-3, RC-5, and RC-7 follow a similar trend, reaching maximum accuracies of approximately 30%, 60%, and 68%, respectively. The accuracy improvement from RC-3 to RC-5 is significantly higher than the improvement from RC-5 to RC-7. Hence, it is evident that an increment in the RC does not translate to significant accuracy improvements, and the effect on the prediction accuracy lies more in the underlying RRAM resistance overlap range between the LRS and HRS than in the RC. The idea of using repetitive cycles to improve RRAM logic element variability is necessary only for low RRAM current compliance, since the repetitive logic tends to increase the power budget compared to the SW pipeline implementation for the higher I_comp devices.
The power consumption is compared for the high RC / low compliance versus low RC / high compliance scenario, considering the median LRS resistance for the 50µA and 5µA distributions from Fig. 5(a), with the memory operating voltage taken as 1.2V (CMOS voltage). As expected, more clock cycles are required for a high RC, since the same operation is repeated more times. The power difference is quantified for Scenario A, low RC / high compliance (RC1 @ 50µA), and Scenario B, high RC / low compliance (RC7 @ 5µA), using the standard power law equation with parameters such as current (50µA and 5µA), voltage (1.2V), LRS median resistance (Log10(R) of 4 and 5.5), and RC (1 and 7). Scenario B operates with 6 more clock cycles than Scenario A to perform a single memory operation (read); still, Scenario B consumes only 8.6% of the power of the Scenario A setup.
Following the simulation of the different RC logics, the accuracy trend between the HW and SW pipelines is estimated by reducing the number of hidden layers. The simulation starts out with 10 layers, followed by a reduction of 2 hidden layers for each simulation cycle. The layer reduction is performed over five different groups, as shown in Fig. 10. Group 1 has the greatest number of layers and convolution elements, whereas Groups 2, 3, 4, and 5 are constructed with lower filter sizes ranging from 32 to 256. The filter size of the given 10 layers is 32 for layers 1 and 2, 64 for layers 3 and 4, 128 for layers 5 and 6, 256 for layers 7 and 8, and 512 for layers 9 and 10. A 10-layered CNN functions effectively in autonomous vehicle and process-industry applications, and its backpropagation training performs well with a sufficiently large dataset, which is our basis for setting ''10 layers'' as the upper limit in our simulation framework to quantify the RRAM device variability on the prediction error rate for the layer configurations {2∼10}.
It is to be noted that a very deep CNN is not optimal from a hardware (RRAM-based in-memory circuit) perspective due to compounded variability as the number of hidden layers increases. Still, a very shallow network may not have high prediction accuracy due to the fewer feature extraction logics used to classify the input image. On the other hand, a full software implementation (without RRAM in-memory logic) prefers to use deep networks as much as possible to achieve high accuracy at the cost of high computing power. The color scale depth of the input image (RGB or Monochrome) also contributes to the computational intensity in every layer of the network. Hence, a delicate balance must be met for choosing the right network size considering the available power and required prediction accuracy trade-offs as defined by the end use application.
It is obvious that reducing the hidden layer count will increase the performance of the RRAM logic elements; in the recent past, several studies [47], [48] have explored the use of tiny CNN modules for low-power embedded applications with fewer hidden layers to perform simple recognition in real time. References [47], [48] use a smaller number of layers {VGG-1-16 (6 layers) [47] and 5 layers with filter sizes from 32∼4 [48]} to demonstrate image classification; no benchmark comparison was conducted between the implementation methodology of these references and our simulation methodology. The study of different hidden layer counts complements the use of CNNs in low-power IoT applications by quantifying the achievable prediction accuracy with the given low-power RRAM logical element. The simulation shown below was performed using the following parameters: the 10µA RRAM I_comp HW pipeline, 5 repetitive cycles, 10 different road sign categories, and 5 different hidden layer groups. The obtained prediction accuracy is plotted on the Y-axis and the number of training steps on the X-axis in Fig. 10. The network prediction accuracy decreases with the reduction of the hidden layers, since the overall feature extraction and mapping is reduced; more importantly, we observe that the training curve trend differences between the SW and HW pipelines also reduce as the filter size is lowered, as seen in Fig. 10, which shows the quantified impact of RRAM variability over the different filter sizes and computational logic. The SW and HW pipeline prediction accuracies achieved with Group 1 (10 layers) are 92% and 59%, respectively, for parameters of 10µA I_comp and 5 repetitive cycles. The SW and HW pipeline accuracy results of Groups 2, 3, 4, and 5 with the same parameter settings are recorded as (89/58)%, (80/60)%, (42/37)%, and (32/31)%, respectively. Interestingly, the SW pipeline accuracy decreased for a lower number of hidden layers (as expected), resulting from the lower count of features mapped, while the accuracy of the HW pipeline improved relative to the SW pipeline with fewer layers; the SW prediction accuracy drastically reduced from 6 layers down to 4 and 2 layers because such shallow configurations have smaller convolution filter sizes of 64 and 32, respectively.
To make this concrete, consider a practical application: an IoT-enabled, battery-powered smart bin sensor with camera and inferencing modules deployed to assist the waste management industry in increasing the recycling index and contributing towards a sustainable environment. For this application, RC5-L4 (Repetitive Cycle = 5 and CNN Layers = 4) from Fig. 10(f) is considered more suitable for a smart bin sensor with an in-memory edge computation system running on battery power.
The computational power of the gradient calculation for the convolution operator's filter data is a function of the training image data set, the number of hidden layers, the total training steps, and the total number of labeled categories. We consider regrouping the labeled categories used in the simulation while retaining the other training parameters, such as image data set size and the number of hidden layers, and keeping the training steps constant. The intention of the regrouped image category simulation is to further study the impact of RRAM logic element variability on the training process for computing the weights with the highest achievable prediction accuracy. The training data set consists of 10 categories (No Entry, Stop Sign, No Left, No Right, No Parking, Entry, Parking, Left Turn, Right Turn, and Pedestrian Crossing), and based on the driving instruction/operation, the given 10 categories are regrouped into 6-category and 2-category groups, as shown in Table 4, for generating the training trends of the SW and HW pipelines. The 6-category group is formulated with the following driving instruction logic: (1) ... (6) Watch out for pedestrians crossing. An assumption made here is that there will not be any T-junctions or crossroads while driving, in order to resolve the tie between the left or right direction sub-conditions in (2) and (5). Note that categories (3) ''No-Parking'' and (6) ''Pedestrian Crossing'' are not clubbed together in the 10-category and 6-category groups. For the 2-category group, we classify the entire data set into ''Can Do'' and ''Don't Do'' categories. The 2-category group setting is not practical for use in the primary navigation system for autonomous driving, as it is necessary for the recognition and features to be mapped to a much broader set of classes/conditions for more accurate prediction and precise decision-making. Still, we intend to simulate the 2-category group for a comparative study of the variability trend obtained in the prediction. However, the 2-category group-based trained weights can be used as a low-powered secondary navigation system to complement the primary navigation system. The constant training simulation parameters are 10 CNN layers, NAND logic repetitive cycles of RC = 5, 10 different road sign categories, and RRAM devices with I_comp ranging from 5 to 50µA. Based on the given training parameters, the simulation was performed for 50,000 steps, and the obtained training accuracy is plotted in Fig. 11. We see the training accuracy percentage for I_comp = 50µA increase from 92% (Category 10) to 97% (Category 6) and further to 98% (Category 2). A similar trend is observed for 5µA and 10µA, with a fractional increment in training accuracy from their base accuracy. Hence, the reduction in label categories results in increased prediction accuracy for the given RRAM logic element-based computation. There is another trend to be noted in Fig. 11, where we see the slope to reach the saturation peak of the accuracy curve increasing as the category count decreases. The maximum prediction accuracy is attained in fewer thousand training steps for Category 2 when compared to the other two category groups, meaning that training for a smaller number of categories can be stopped much earlier while achieving the maximum training accuracy with reduced computation power. However, the accuracy can still be further improved with a larger training data set by continuing the learning process for further performance improvement.
Precision and Recall are useful metrics to understand and further tune the training accuracy trend with appropriate parameters and training data settings. The precision metric is the ratio of true positive results to the sum of all true positive and false positive results, while the recall metric is the ratio of true positives to the sum of true positives and false negatives. A confusion matrix is a summary of the prediction results for the given classification, using the numbers of correct and incorrect predictions summarized with count values and broken down by each class. The confusion matrix plot is intended to display the Precision and Recall for the three category groups of 10, 6, and 2. Fifty images are used to compute the prediction results for the 3 category groups; applying the trained weights obtained from 50,000 training steps with the HW pipeline configured with the 50µA RRAM logical element and 5 repetitive cycles for every NAND logic, the results are obtained and plotted in Fig. 12. The X-axis of the confusion matrix represents the predicted labels, the Y-axis shows the labels of the actual input images, the diagonal represents correct predictions, and the other elements in the matrix represent incorrect predictions. To explain further by looking at Fig. 12(a), the prediction for the No-Entry label was 92.46% accurate from the result of 50 test images, and the breakdown of the remaining ∼8% incorrect labeling for the given No-Entry image classification includes No-Left (2.47%), No-Parking (0.67%), Parking (3.17%), and Pedestrian-Crossing (1.23%). A similar interpretation can be used for the Category 6 and Category 2 groups from Figs. 12(b) and (c), respectively. The average prediction accuracy calculated with the diagonal elements (dark blue) of the confusion matrix in Fig. 12 shows the trend whereby the prediction accuracy using defective hardware (RRAM) logic improves as the labeled image categories are reduced from 10 to 2. The average prediction accuracy for Category 10 is 91.6%, Category 6 is 94.7%, and Category 2 is 97.34%. While the prediction accuracy increases for the given RRAM variability by 5-6%, the overall classification into finer labels is significantly compromised from Category 10 down to Category 2, which might nullify the benefits of the end-use application.
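Stated compactly:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN are the counts of true positives, false positives, and false negatives, respectively.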

B. EMULATING NEUROMORPHIC CIRCUIT USING RRAM DIGITAL UNIVERSAL GATES AND POWER CONSUMPTION BENCHMARK
The truth table of the universal digital gates is presented using 1T2R RRAM logic in Section II and Fig. 5(d). However, the proposed 1T2R structure requires a few additional passive and active electronic components to make the circuit useful for practical applications. As shown in Figs. 13(a) and 13(b), simple modifications are performed on the given 1T2R structure: an additional bias resistor added in series with the input resistors A and B gives a 1T4R structure, as in Fig. 13(a). The bias resistor acts as a voltage divider when a voltage/logic level changes across input resistors A/B and maintains the required gating threshold voltage to switch the transistor ON/OFF. The load resistor acts as a strong pull-up resistor for the output (OUT). The 2T3R configuration is a practical circuit as well; here, the load resistor is replaced with a second transistor to reduce the power dissipated in the load resistor for a higher load rating. The bias resistor draws significantly less current to gate the transistor. Hence, all the resistors in Figs. 13(a, b) are selected in the mega-ohm range of resistance to keep the power dissipation minimal and ensure the practicability of the circuit. The following sections show the various resistance values and their design considerations with SPICE simulations. Today's modern NAND CMOS circuit, shown in Fig. 13(c), is implemented using 4 transistors to perform the required switching operations to exhibit the NAND truth table. The MCBC circuit functions as a First-In Max-Out (FIMO) memory to cache the input data and combinational logic to output the most frequently occurring input data at any given time. We propose using the RRAM elements with external program and bias voltages, and a transistor structure, to demonstrate a cache memory that outputs the most frequent occurrence from the input. The design of the bias and program voltage sources and the select 1 & 2 (SEL 1 & 2) logic switches is out of our current simulation scope. We assume these peripheral circuits are designed as global circuits, whereby a few circuits can be used to program and reset the entire array of RRAM NAND gates.
The select-1 (SEL 1) switch chooses between the bias (operating voltage or input) and the RESET voltage of the RRAM devices, while the select-2 (SEL 2) switch functions as a toggle switch that programs one RRAM per clock cycle; for RC=3, three clock cycles are used to program the three RRAMs. The bias resistor, connected in series with the input resistors R1, R2, and R3, supplies the required gating threshold voltage (V_GS) to the transistor (T1). The input data-holding resistors R1, R2, and R3 are connected in parallel, and the overall equivalent resistance (R_eq) varies with the resistance values of R1, R2, and R3. The three parallel resistors replicate the cache memory by taking in 3 input bits and outputting the most frequently occurring bit. The programming sequences of the 3 RRAMs are shown as 4 states in Fig. 14, mimicking a cache memory that outputs the most frequently received bit. When T1's gate is supplied with a voltage equal to or greater than the gating threshold, T1 turns ON and the output (otherwise held at V_CC) is pulled low (logic-0) through T1; conversely, for V_GS less than the gating threshold, T1 stays OFF and the output is pulled high (logic-1). All 4 combinations of R1, R2, and R3 and their corresponding resistance values are shown in Fig. 14. Resistance values less than or equal to 1 MΩ are treated as logic-0, while values greater than or equal to 100 MΩ are treated as logic-1. During state 1 (S1), where all three resistances are 1 MΩ, the equivalent resistance is about 333 kΩ; with the bias resistor fixed at 500 kΩ, V_GS ≈ 0.72 V, which exceeds the gating voltage of 0.60 V and turns T1 ON. For state S2, R1 = R2 = 1 MΩ and R3 = 100 MΩ, giving R_eq ≈ 497 kΩ and a gating voltage of 0.602 V across the 500 kΩ bias resistor, which keeps T1 ON. For both S1 and S2, the most frequently occurring resistance value is 1 MΩ (defined as logic-0), and both states also produce logic-0 at the output. A similar operation sequence is seen in states S3 and S4, where the most frequently occurring resistance value is 100 MΩ, corresponding to logic-1 at the output.
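To make the divider arithmetic explicit, the sketch below reproduces the four states with the resistance values quoted above. The 1.2 V bias supply is an assumption on our part (it is the value that reproduces the quoted V_GS of ~0.72 V and ~0.602 V), not a figure stated in the text.

```python
# Minimal sketch of the MCBC (RC=3) voltage-divider behaviour described above.
V_BIAS = 1.2        # assumed bias/operating voltage [V] (not stated explicitly in the text)
R_BIAS = 500e3      # bias resistor [ohm]
V_TH   = 0.60       # transistor gating threshold [V]
R_LOGIC0, R_LOGIC1 = 1e6, 100e6      # RRAM resistance ranges for logic-0 / logic-1

def parallel(resistors):
    return 1.0 / sum(1.0 / r for r in resistors)

states = {
    "S1": [R_LOGIC0, R_LOGIC0, R_LOGIC0],
    "S2": [R_LOGIC0, R_LOGIC0, R_LOGIC1],
    "S3": [R_LOGIC0, R_LOGIC1, R_LOGIC1],
    "S4": [R_LOGIC1, R_LOGIC1, R_LOGIC1],
}

for name, rs in states.items():
    r_eq = parallel(rs)
    v_gs = V_BIAS * R_BIAS / (R_BIAS + r_eq)          # voltage across the bias resistor
    out = 0 if v_gs >= V_TH else 1                     # T1 ON pulls the output low
    majority = 0 if sum(r <= R_LOGIC0 for r in rs) >= 2 else 1
    print(f"{name}: R_eq={r_eq/1e3:7.1f} kOhm, V_GS={v_gs:.3f} V, "
          f"OUT=logic-{out}, majority bit={majority}")
```

Running the sketch reproduces the behaviour described in the text: the output follows the majority (most frequently programmed) bit in all four states.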
Hence, the given 1T4R logic mimics a cache memory that outputs the most recurring bit from its inputs. A similar sequence can be used for RC=5 and RC=7 by using 5 and 7 parallel input RRAMs. To keep the power dissipation negligible, the resistances must be maintained in the megaohm range and above. The LTspice tool is used for the simulation and analysis of the RRAM-NOR circuit. A 65nm MOS transistor model from the Arizona State University (ASU) repository was used to build a SPICE circuit of the RRAM-NOR gate with two MCBC (RC=3) layers connected to the input ports of the NOR gate, as shown in Fig. 15. Here the blue shaded area shows the NOR circuit, and the orange/green shaded regions are the two MCBC circuits. In Fig. 15, the variable resistors represent the RRAM devices {R5∼R8, R10∼R13}, and the fixed resistors are {R2, R1 = 120 MΩ}, {R9, R4, R14 = 1 MΩ}, and R3 = 150 MΩ. The resistance values of the MCBC circuits are programmed according to the input sequence using select switches 1 and 2, as explained in the earlier section. The NOR operation is demonstrated over 10 clock cycles (steps) for the SPICE circuit, as shown in Fig. 16. Cycles 1∼4 program the resistances of the RRAMs in the given 1T4R structure; in cycles 5∼6 the 1T4R structure, now programmed with the required resistance values, is switched back to the operational mode so that it operates as RRAM-NOR logic from the 7th clock cycle onward. Hence, clock cycles 1∼6 are one-time configuration cycles
during the chip power ON. Once the chip is configured with the appropriate RRAM resistance values and functions as computational logic performing the convolution operation, the latency of the RRAM-NOR gate is just 1 cycle, as demonstrated over clock cycles 7∼8 (NOR truth table). A large number of RRAM-NOR gates are combined into combinational and sequential circuits to realize the 10 layers of convolution operations during the forward pass. Clock cycles 7 to 10 demonstrate the NOR truth logic: for logic-0 (0 V) at both inputs A and B, the gating voltage is 0 V, keeping T1 OFF and setting logic-1 at the output; for the other input combinations of A and B, the gating voltage varies from 0.66 V to 0.85 V, keeping T1 ON so that logic-0 is obtained at the output, thus demonstrating the NOR operation. The layout design for the simulated SPICE circuit is shown in Fig. 17, where the unit cell consists of a NOR (1T3R) structure [49]; the other metal layers and interconnect dimensions follow those shown in Fig. 17. From the layout design, the unit cell's overall dimension is 3190 × 1230 nm². The NOR and NAND RRAM gates have the same structure except for the variation in series and parallel resistor values; therefore, this layout design is common to both universal gates. We omit the select switch circuit design from the layout drawing and restrict our analysis to the primary logic unit cell.
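Because NOR is a universal gate, the behavioural rule demonstrated above (output high only when both inputs are low) is sufficient to compose every other primitive the convolution datapath needs. The short sketch below illustrates this composition at the Boolean level only; it is not a circuit or SPICE model, and the gate names are generic rather than taken from the paper's netlists.

```python
# Behavioural illustration of NOR universality: every gate below is built from
# the single NOR primitive, mirroring how the RRAM-NOR unit cell can be
# replicated into combinational logic (e.g. adder slices of the kind used in FP32ADD).
def NOR(a: int, b: int) -> int:
    return 1 if (a == 0 and b == 0) else 0

def NOT(a: int) -> int:
    return NOR(a, a)

def OR(a: int, b: int) -> int:
    return NOT(NOR(a, b))

def AND(a: int, b: int) -> int:
    return NOR(NOT(a), NOT(b))

def XOR(a: int, b: int) -> int:
    return AND(OR(a, b), NOT(AND(a, b)))

def half_adder(a: int, b: int):
    """One adder slice (sum, carry), the building block of wider adders."""
    return XOR(a, b), AND(a, b)

# Quick check of the NOR truth table demonstrated in clock cycles 7-10:
for a in (0, 1):
    for b in (0, 1):
        print(f"A={a} B={b} -> NOR={NOR(a, b)}")
```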
The layout drawing proposed here is an initial design intended to illustrate the methodology of an RRAM-based NOR unit cell; there is certainly room for further optimizing the unit cell layout, which we intend to explore in future work. Furthermore, the layout can be optimized by following the foundry's design recommendations, with a standard cell design fine-tuned for the specific fabrication process. The above simulation results and the dimensions of the logical compute element (RRAM gate) were then used to estimate and project the overall chip area, power consumption, heat dissipation, and battery life (for the IoT application) for the given 5 hidden-layer groups (Group 1 with 10 layers down to Group 5 with 2 layers). The formulas used to compute the size and performance of the different convolution groups are given in Eqns. (1)∼(4) below. The hidden layers use generic 3×3 and 1×1 convolution blocks with filter sizes ranging from 32 to 512; the estimated total NAND logic counts needed to construct the 3×3 and 1×1 blocks are 5.3M and 1.7M gates, respectively. The gate counts for the convolution modules were extracted from our previous work [46]. The total number of RRAM NAND gates is given by the sum, over all hidden layers, of the product of the per-layer convolution module gate count and the corresponding filter size, as shown in Eqn. (1).
where
T_lgc = total logic gates in all convolution layers,
Conv_gc = convolution operator gate count per layer,
Q_fms = convolution filter size of a specific layer,
n = total number of hidden convolution layers.
The dimension of a unit cell comprising one NOR/NAND gate with two Maximum Count Binary Comparator layers of size RC=3 is estimated from the layout design as 3190 × 1230 nm². Hence, the total RRAM NAND gate count together with the unit cell area gives the overall chip area for the respective hidden layer count. Other peripheral components such as the interconnect bus, clock bus, input/output buffers, pads, etc. are not considered in the overall size calculation, since these components are foundry specific; we omit them to keep the estimate restricted to the required RRAM NAND logic. Based on the above assumptions, the overall chip size is calculated using Eqn. (2).
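Read literally from the prose, Eqn. (1) is a sum of per-layer (gate count × filter size) products and Eqn. (2) multiplies that total by the unit-cell area; the sketch below implements that reading. The per-layer (block, filter size) pairs are illustrative placeholders, not the paper's exact MobileNET configuration.

```python
# Sketch of the Eqn. (1)/(2) estimates as described in the text.
CONV3x3_GATES = 5.3e6                 # NAND gates for a 3x3 convolution block (from the text)
CONV1x1_GATES = 1.7e6                 # NAND gates for a 1x1 convolution block (from the text)
UNIT_CELL_AREA_NM2 = 3190 * 1230      # NOR/NAND unit cell with two MCBC (RC=3) layers [nm^2]

layers = [                            # (Conv_gc, Q_fms) per hidden layer -- illustrative only
    (CONV3x3_GATES, 32),
    (CONV1x1_GATES, 64),
    (CONV3x3_GATES, 128),
    (CONV1x1_GATES, 256),
    (CONV1x1_GATES, 512),
]

T_lgc = sum(conv_gc * q_fms for conv_gc, q_fms in layers)       # Eqn. (1)
chip_area_mm2 = T_lgc * UNIT_CELL_AREA_NM2 * 1e-12              # Eqn. (2): nm^2 -> mm^2
print(f"T_lgc = {T_lgc:.3e} NAND gates, estimated chip area = {chip_area_mm2:.1f} mm^2")
```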
where
E_iter = estimated energy to process one image (pixel size: 224 × 224) at 100 MHz operation,
V_oper = digital logic-1 threshold voltage,
I_mem = RRAM operating current compliance,
F_clk = simulation digital clock speed.
Today, IoT applications are often powered by an external battery source. Hence, we have estimated the achievable battery life while executing the given RRAM NAND-based convolution operations on the edge using a 48V, 14Ah battery source. The battery size was chosen from a standard, readily available portable battery pack of the kind used to power e-scooters. Eqn. (4) estimates the battery pack lifetime for the different convolution operations. Last but not least, heat dissipation is one of the primary parameters defining the reliability and lifespan of an IoT deployment. The power estimation and the related battery life are given for the RRAM-NAND logic alone; the additional power used for synaptic weight read/write is excluded, as our framework focuses only on simulating the in-memory mathematical logic/operators that perform the convolution operation.
where
B_lpc = estimated battery life to perform the convolutions in all hidden layers with the given source (48V, 14Ah),
E_src = battery source rating (48V, 14Ah).
A consolidation of the energy consumption and estimated battery lifetime obtained from Eqns. (1)∼(4) for our simulated in-memory circuit, across the different hidden-layer group counts and filter sizes, is listed in Table 5. The table shows that the total computational energy and the chip area consumed drop by as much as 10-20X as the hidden layer count and the RRAM current compliance decrease. However, the design challenge is to strike a balance in which the required battery lifetime is met by the right combination of hidden layers and RRAM switching current. Several IoT applications must operate with a low maintenance frequency because they are deployed at remote sites; such applications should consider a 10µA RRAM with a 6-layer CNN, or a 50µA RRAM with a 4-layer CNN, for a 6-month to 1-year maintenance schedule, provided the prediction accuracy drop introduced by the RRAM device variability remains within the range acceptable to the end application.
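The exact forms of Eqns. (3) and (4) are not reproduced in the text above, so the sketch below only illustrates one plausible reading: each NAND evaluation draws the compliance current I_mem at V_oper for one clock period, and the battery life follows from the resulting average power against the 48V, 14Ah (672 Wh) source. The ops_per_image and frame-rate values are placeholders, not figures from the paper or Table 5.

```python
# Hedged sketch of Eqn. (3)/(4)-style estimates under the assumptions stated above.
V_OPER = 1.2          # digital logic-1 threshold voltage [V] (assumed)
I_MEM  = 50e-6        # RRAM current compliance [A], e.g. 50 uA
F_CLK  = 100e6        # simulation clock speed [Hz]
E_SRC_WH = 48 * 14    # battery source rating: 48 V x 14 Ah = 672 Wh

def energy_per_image(ops_per_image: float) -> float:
    """Eqn. (3)-style estimate: energy [J] to process one image."""
    return ops_per_image * V_OPER * I_MEM / F_CLK     # E = V * I * t per gate evaluation

def battery_life_days(ops_per_image: float, images_per_second: float) -> float:
    """Eqn. (4)-style estimate: battery life [days] for continuous operation."""
    power_w = energy_per_image(ops_per_image) * images_per_second
    return (E_SRC_WH / power_w) / 24.0                # Wh / W = hours -> days

# Example with placeholder workload numbers (illustrative only):
print(f"{battery_life_days(ops_per_image=1e12, images_per_second=1.0):.1f} days")
```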

IV. CONCLUSION OF THE STUDY
We have simulated and quantified the impact of RRAM variability on the training accuracy of a deep CNN, using three different RRAM current compliances of 5, 10, and 50µA corresponding to soft-to-hard filamentation regimes. A suitable 65nm (TiN/HfO2/Hf/TiN) OxRAM NAND/NOR logic was constructed and simulated for five different groups of hidden layers (convolution operations) to show the training accuracy trend, implementing a purely digital RRAM logic through the look-up table (LUT) approach. Our LUT-based RRAM resistive encoding methodology was demonstrated within a neuromorphic simulation framework using Python and TensorFlow in the upper layer and Verilog HDL in the lower layer, constructing FP32MUL and FP32ADD from RRAM (1T4R / 2T3R) based NAND logic. We have demonstrated that adding MCBC logic to the standard RRAM NAND logic mitigates the impact of the overall device variability. The MCBC layer is estimated to consume an additional 1150 × 1230 nm² per logic gate per input, which results in an overall prediction accuracy improvement from 10% to 60% (RC1 = 10%, RC3 = 30%, RC5 = 45%, and RC7 = 60%). The estimated battery life ranges from 19 to 466 days for CNNs of 10 down to 2 layers when configured with a 50µA RRAM switching current, considering the maximum prediction accuracy achieved relative to the software training pipeline. Finally, the overall chip area, power consumption, estimated battery lifespan, and heat dissipation of the in-memory circuit design are derived for IoT deployment in a truly edge implementation.
As a continuation of this work, we intend to build and simulate the backward-pass computation using the same RRAM NAND/NOR logic and to quantify the impact of memristive performance variability on the final learning accuracy trend for a fully edge-based forward and backward computation scheme. This study provides a practical, multi-faceted design tool for RRAM-based edge computing that quantifies the impact of the CNN architecture and the RRAM operating current levels on prediction accuracy, chip area, energy consumption, and battery replacement frequency. The design tool can potentially be harnessed as a multi-objective design optimization and decision-making framework for RRAM edge-compute applications, depending on the end-user-defined requirements of the application in context.