Computational Failure Analysis of In-Memory RRAM Architecture for Pattern Classification CNN Circuits

Power-efficient data processing subsystems that perform millions of complex concurrent arithmetic operations per second are essential to meeting the growing demand of edge computing applications, given the volume of data collected by real-time Internet-of-Things (IoT) sensors. In addition, in-memory computation, in which memory and processing elements are placed on a single wafer, has enabled promising computational power savings by avoiding the memory wall encountered while accessing the memory array. Resistive RAM (RRAM), with its simple metal-insulator-metal (MIM) structure, is a very appealing candidate for in-memory computation given its ultralow switching power and its compatibility with Complementary Metal Oxide Semiconductor (CMOS) process fabrication. Despite these advantages, however, the resistive switching phenomenon in RRAM has an inherent stochastic variability. On the algorithmic side, convolution neural networks (CNN) have gained popularity in image classification applications, and the network architecture is memory-intensive because it must store the trained weights. Hence, an RRAM-based CNN system will pave the way for a power-efficient image classification system on the edge. Given the inherent variability in RRAM (inter-device and intra-device), however, the accuracy of the CNN's prediction is expected to drop. This motivates us to quantify the impact of RRAM variability on the CNN trained weights and classification accuracy (prediction loss). In this study, we have constructed a Look-Up-Table (LUT) based model for encoding the resistive variability of a 65nm CMOS 1T1R OxRAM (TiN/HfO2/Hf/TiN) over a wide current compliance range (2μA to 250μA) into the CNN's trained weights in a digital regime. The RRAM resistance encoded trained weights are in turn used to simulate the two extreme CNN architectures, namely, the Fully Serial System (FSS) and the Fully Parallel System (FPS). The prediction variability trends of these architectures are quantified as a function of the current compliance, the RRAM resistive variability, the CNN's convolution matrix sizes (5×5, 3×3, 1×1, and 1×1 max pool), the total number of layers in the CNN, and the input image pixel size.


I. INTRODUCTION
Video data analytics has transformed today's industries by providing more significant and precise insights for process automation, improving productivity and personal safety and building strong customer relations across the services and solutions provided today [1]. Cloud-based video analytics has proven to be extremely powerful, offering greater scalability of computational power, data redundancy, quick deployment, and regulatory compliance. However, it fails to perform for applications with low internet bandwidth and for mission-critical on-the-fly decision making [2]. Edge computing is a new approach to network architecture in which the data processing function is relocated closer to where the data is collected, so that the data can be analyzed in real time; edge devices are now powerful enough to gather and process more data than ever before. Because not every snapshot of data is transmitted to a central server for processing, network latency is drastically reduced, which enhances the performance of real-time applications such as convolution neural network (CNN) based image classification. Placing the memory and computational logic on a single chip reduces the computational power by avoiding the von Neumann memory wall created during memory read and write operations [3]. This performance improvement fosters the development of in-memory computation architectures and helps address the increasingly unacceptable levels of power utilization in data centers (which currently require intensive and expensive cooling solutions).
It is essential to have an insight into today's in-memory computation approaches for reducing computation power using various emerging memory devices. On a large commercial server, almost 50% of the total operation power is consumed by the off-chip Dynamic Random-Access Memory (DRAM). A bulk memory transfer scheme using the existing DRAM operation principle was proposed by Onur et al. to improve power usage [4]. Furthermore, Biswas et al. [5] demonstrated a 10T bit-cell-based Static Random-Access Memory (SRAM) holding 1-bit filter weights that performs dot-product operations with >98% accuracy for classifying MNIST hand-written text, with better energy efficiency achieved by reducing data transfer. Other than the commercial memories DRAM and SRAM, a variety of emerging memory devices have also been fabricated and studied for in-memory operation, some of which are discussed here.
A 2D array of processing elements using a Spin Transfer Torque RAM (STT-MRAM) based in-memory architecture was proposed by Agarwal et al. [6] and tested with significant energy savings of 1.75X over the commercial memory for an image classification application. A Boolean NAND operation demonstrated with other emerging memories, such as Ferroelectric RAM (FeRAM) and Phase Change RAM (PCRAM), by assigning binary codes to different physical device states, has also paved the way for in-memory computation applications [7]. Adding to the current in-memory studies, quantum-dot cellular automata (QCA) have emerged as a new breed of nanoelectronics with significant performance improvement over the conventional von Neumann architecture [8]. Recent studies presented a fully scalable in-memory Resistive RAM (RRAM) architecture of an edge-aware anisotropic filtering algorithm aimed at computer vision applications, demonstrating a reduction in memory operations of 64% to 92%, resulting in power savings of up to 75% [9]. As CMOS technology approaches its physical limits, NVM-based neuro-inspired computing chips offer a promising route, which has attracted increasing research effort in device engineering; an extensive review is given in [10].

A. Overview of RRAM Synapse-based CNN System
The RRAM device is a simple metal-insulator-metal (MIM) structure that switches into a conduction state when subjected to a moderate voltage / electric field and exhibits non-volatile behavior. The features enabling RRAM's popularity amongst the emerging memory technology candidates are its (i) simple device structure that can be integrated into today's CMOS fabrication environment and its compatibility with back-end-of-the-line (BEOL) process thermal budgets, (ii) 3-D cross-point architecture with a memory cell area of 4F², (iii) low-cost non-volatile memory with an operation speed as fast as tens of nanoseconds per bit, (iv) per-device multi-bit memory storage feature and (v) significantly high endurance of 10⁶ cycles with ultra-low switching energy in the range of a few pJ [11].
The resistive switching mechanisms can be broadly classified into two categories. The first is electrochemical metallization memory (ECM), in which the conductive switching path is formed by metal cations in an electrochemical process. The second is valence change memory (VCM) (also known as oxygen vacancy RAM (OxRAM)), in which conductive switching is achieved by oxygen vacancies generated under an active electric field. Typically, ECM devices are fabricated with one active metal electrode (Cu, Ag, Ni) and VCM devices with one inert metal electrode (Pt, Ru, Au or Ir), sandwiched in an MIM structure [12]. We limit our study here to VCM since its switching power is several orders of magnitude lower than that of ECM (metal migration through the dielectric media is always more power intensive than simple bond breaking induced oxygen vacancy generation). Real-world metal-oxide RRAM applications are notably restrained by the cycle-to-cycle and device-to-device variability inherent in the device switching mechanism. The stochastic nature of oxygen vacancy generation, migration, and recombination results in the formation of a non-uniform conductive filament (CF) of varying size, shape and/or pattern; the filament formation and rupturing process leaves residual oxygen vacancies inside the tunneling gap region, leading to stochastic atomic / ionic motion and making variability a property intrinsic to the metal-oxide RRAM [13,14].
The Convolution Neural Network (CNN) is a successful edge computation algorithm for on-the-fly image classification applications [15]. The CNN was designed as a first-order computation function to classify complex image patterns swiftly using an edge filter-based matrix convolution technique, followed by a fully connected neural network with an activation function to identify the pre-trained image patterns. The CNN is a biologically inspired model with a memory-centric algorithm for memorizing pre-trained patterns or images, like that of a human brain. The memory-intensive deep structure of the CNN lends itself to in-memory computation; hence it paves the way for using RRAM as synapses or memory units, creating integrated devices with ultra-low power application capability [16]. As mentioned earlier, the RRAM is subject to stochastic variability, which induces performance degradation in the end application; hence, conducting a study to quantify this performance degradation is critical.

B. Variability Study in Neuromorphic Circuits
Applying RRAM as a synaptic memory in a CNN has triggered interest among various research groups in investigating the prediction accuracy loss caused by device variability, and a consolidated review of such studies is provided in Table 1. The table lists the different RRAM devices and the machine learning (ML) simulation methodologies used, along with their merits and remarks. Refs. [17][18][19] demonstrate a filamentary non-ideal RRAM model programmed into an ML simulation architecture as a shallow analog crossbar array to study Neural Network (NN) prediction accuracy variability with MNIST handwritten text. However, today's CNNs are packed deeply with many computationally intensive hidden layers to improve prediction performance, which motivates a study of the prediction variability trend in a more complex and practical CNN. Significant efforts are also seen in characterizing new device stacks to improve the RRAM device switching effects. Such studies extensively showcase the crossbar prediction accuracy as a function of device-level time-dependent fluctuation and temperature variations [20][21][22][23][24]. These studies are confined to a single current compliance. Hence, in addition to the random noise variability models, a further study with a varying and wide current compliance will give more insight into the design considerations of a non-ideal RRAM for low power IoT (Internet-of-Things) applications. Substantial research is also evident in modeling oxide RRAM conductance data in an MLP (Multi-Layer Perceptron) as a mixed-signal crossbar to realize low power and highly efficient MAC (Multiplier and Accumulator) circuits [25][26][27][28][29].
There is a clear computational benefit in terms of lower power consumption for the convolution operation, but the additional peripheral circuits, such as analog-to-digital (ADC) and digital-to-analog (DAC) converters, add to the complexity of the design for processing the mixed-signal data. Hence, exploring digital RRAM synapses is useful for designing optimized and compact circuits.
With this in mind, we propose here a Look-Up-Table (LUT) based [...] large variations in the shape and size of the conducting filaments) by considering the two extreme convolution architectures (FSS and FPS) of a practical CNN network. It is essential to examine and quantify the hardware (RRAM) variability and its impact on the prediction accuracy for these extreme architectures, which we have recorded in this work. The current work is a marked improvement over recent past studies, which are aimed only at assessing the mean value of the prediction error brought about by RRAM device variability with a very limited range of operating current compliance.

[Table 1 (excerpt): The study extensively shows the device retention model, ADC quantization effects, and trade-offs between inference accuracy, energy efficiency, throughput, area, and memory utilization. The framework operates the RRAM in the analog regime, and hence further to this study, one can explore operating RRAM in a digital domain for better noise immunity in real-world IoT applications. [29] J. Doevenspeck]
*These studies are proposed with a generic RRAM model, which can be applied to any device, but no specific device was shown as a proof of concept.
The structure of this paper is as follows. Section II presents the simulation methodology followed for encoding the resistance value of RRAM into the CNN trained weights and for computing the prediction error loss between the software and hardware (RRAM) trained weights. Section III discusses the results obtained and the trends observed when the RRAM encoded trained weights, covering a wide current compliance range, are applied in the two extreme convolution architectures of today's CNN. We conclude our work in Section IV with a summary and inferences based on all the analysis carried out.

A. Enhancing Prediction Accuracy with Edge Detection based Inception Function
An insight into the generalized CNN architecture unveils the underlying two fundamental mathematical matrix functions, namely the inception and fully connected (FC) neural network.
Both have a parallel structure and perform pattern classification by manipulating every image pixel concurrently in the given input image. Of the two functions, the inception is the more computation- and memory-intensive, while the FC neural network layer consists of an activation function that compares the similarity between the manipulated image pixels and the trained data and signals to the next neuron the likelihood of the current image against the trained data [30]. The inception function consumes ~80% of the overall resources when compared to the FC neural network in a commercial CNN [31]; hence, we limit our simulations and variability examination to the memory-intensive inception operation alone. The inception is a function of matrix convolution and max-pooling operations; both belong to a class of edge enhancement techniques for effective pattern classification.
Edge enhancement is a type of image processing used to enhance a pattern's edge in an image to improve its apparent sharpness. The edge filter works by increasing the contrast of the edge or boundary between the subject and the background. This effect results in bright and dark highlights on both sides of the edges in the image, making the pattern more prominent from the background. The edges are highlighted by mathematically manipulating every pixel with pre-defined filter data, as illustrated in Fig. 2. A convolution is an advanced edge enhancement technique, wherein each resultant element is derived from the sum of the products of two given matrices, namely the input image pixel matrix and the filter data matrix. Hence, to convolve an image with a pixel size of $n \times n$, we choose a convolution window size of $m \times m$, which is smaller than the given image. The convolution process is repeated on the given image by shifting the convolution window by $s$ pixels, known as the stride. The generalized convolution formula for an $m \times m$ matrix is given by Eqn. (1),

$$C(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{m} I\big(s(x-1)+i,\; s(y-1)+j\big)\, F(i, j) \qquad (1)$$

Where:
$C$ = Resultant matrix obtained by convolving the input image matrix ($I$) and the trained weight filter matrix ($F$); Size of $C$ is $k \times k$, with $k = \lfloor (n + 2p - m)/s \rfloor + 1$.
$I$ = Input image matrix, holding the image's RGB pixel values that are to be classified or identified; Size of $I$ is $n \times n$.
$F$ = Filter matrix consisting of trained weights obtained by the back propagation based stochastic gradient descent algorithm, trained on a considerable amount of labeled data; Size of $F$ is $m \times m$, which is less than or equal to the size of $I$.
$m \times m$ = Defines the convolution operation window size in the given 2D image, which is smaller than the size of $I$.
$s$ = Stride, the delta between the locations of two consecutive convolution windows.
$i, j$ = The row ($i$) and column ($j$) number of the given filter matrix and image matrix elements.
$x, y$ = The row ($x$) and column ($y$) element of the $C$ matrix; the size of the $C$ matrix represents the convolution operation size.
$p$ = Padding, the number of pixels (value equal to zero) added to an image when the kernel or trained weight of a CNN is convolved, to keep the convolution window's size as $m \times m$ when the stride approaches the end of the given matrix; in Eqn. (1), $I$ denotes the image after padding.
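The strided, zero-padded sum-of-products described above can be sketched in NumPy as follows. This is a minimal illustration rather than the simulation code used in this work: the function name `conv2d`, its parameter names, and the test image are hypothetical, and, as in most CNN frameworks, the filter is applied without flipping (cross-correlation).

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over a zero-padded `image` and sum the products."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    # Zero padding keeps the window size fixed near the image border.
    padded = np.pad(image, padding, mode="constant", constant_values=0.0)
    k_rows, k_cols = kernel.shape
    out_rows = (padded.shape[0] - k_rows) // stride + 1
    out_cols = (padded.shape[1] - k_cols) // stride + 1
    out = np.zeros((out_rows, out_cols))
    for x in range(out_rows):
        for y in range(out_cols):
            r0, c0 = x * stride, y * stride
            window = padded[r0:r0 + k_rows, c0:c0 + k_cols]
            out[x, y] = np.sum(window * kernel)  # sum of products
    return out

# Illustrative 4x4 image and 3x3 filter of ones (assumed values).
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
result = conv2d(image, kernel, stride=1, padding=0)
```

With no padding the 4×4 image yields a 2×2 result; a padding of one pixel keeps the output at 4×4 for a stride of 1.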
Max pooling is a discrete quantization technique for down sampling the input pixel matrix, and this dramatically reduces over-fitting at the FC neuron activation layer. Today's inception employs the max pooling function in all the layers to improve prediction accuracy and to significantly reduce computation power on the following layers by down sampling the image [32]. The generalized max pooling function is shown in Eqn. (2), and a simple 4×4 max pool example is shown in Fig. 3, wherein a 4×4 matrix is down sampled to 2×2 by copying the maximum value of each 2×2 array to the new max pooled matrix.

$$M = \max_{1 \le i,\, j \le q} A(i, j) \qquad (2)$$

Where:
$M$ = Maximum value from the given input matrix.
$A$ = Input matrix; Size of $A$ is $q \times q$.

A sequence of edge-enhancing convolution functions with filter matrices of varying size, layered in different groups and called Inception, has become the building block of today's CNN with an enhanced prediction accuracy rate. For the given inception network, the smallest convolution function is 1×1 pixel in size, and the largest is 5×5 pixels. Hence, we see that the convolution matrix resolution is kept low to keep the edge sampling rate as high as possible for improved edge detection. Thus, with higher computational resolution, the prediction performance is increased at the cost of higher computational power. Matrix convolutions of sizes 1×1, 3×3, and 5×5 and the 3×3 max pool are the most used operations in today's CNN architectures. We considered these four common and prominent computation functions for our simulation study by connecting them in a fully serial and a fully parallel sequence, as shown in Fig. 1.
These two architectures are the two extreme operation sequences seen in today's CNN. We conduct our variability simulation study on these extreme architectures by encoding the RRAM's electrical resistance as a Look-Up-Table (LUT). This will allow us to quantify the impact of RRAM's variability on these two extreme architectures and analyze the error propagation trend from layer to layer of the CNN for the given OxRAM's wide range of current compliance.
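The max pooling operation used in these inception blocks, in which a 4×4 matrix is down sampled to 2×2 by keeping the maximum of each 2×2 array, can be sketched as follows. The function name `max_pool2d` and the matrix values are illustrative assumptions (they are not the values of Fig. 3), and the sketch assumes non-overlapping pooling windows.

```python
import numpy as np

def max_pool2d(matrix, pool=2):
    """Non-overlapping max pooling: keep the maximum of each pool x pool block."""
    matrix = np.asarray(matrix)
    rows, cols = matrix.shape
    # Drop trailing rows/columns that do not fill a complete block.
    rows -= rows % pool
    cols -= cols % pool
    blocks = matrix[:rows, :cols].reshape(rows // pool, pool, cols // pool, pool)
    return blocks.max(axis=(1, 3))

# Illustrative 4x4 input (assumed values).
a = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool2d(a, pool=2)  # 2x2 matrix of block maxima
```

Because only comparisons are involved, no arithmetic on the trained weights takes place in this block.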

B. Encoding Scheme of RRAM Resistance Variability on Synaptic Data
We have extracted the resistance distribution data from one of the most comprehensive RRAM variability data sets published to date, by Fantini et al. from IMEC [33]. In that study, the authors report OxRAM resistance data distributions for a wide range of current compliances, Icomp → {2, 5, 10, 25, 50, 100 and 250}µA. A 65nm CMOS 1T1R OxRAM stack comprising TiN/HfO2/Hf/TiN with a cell size of 20×20 nm² was fabricated in their work. The device was subjected to several switching operations from the Low Resistive State (LRS) to the High Resistive State (HRS). The cycle-to-cycle resistance variability trend was measured and plotted as a lognormal distribution, which we have replotted in Fig. 4a.
The resistive variability trend of LRS and HRS for the wide switching current compliance data set was extracted to construct the CNN trained weight based LUT model so as to further analyze the CNN prediction performance degradation trend. The work in Ref. [33] records the device switching for 200 read/write cycles at different current compliances as a cumulative resistance distribution, as shown in Fig. 4a. Here, we further use the linear extrapolation technique to synthesize a significantly larger data set at the very low and very high percentiles (tail ends) from the measured 200 cycles of switching data. The X-axis represents the LRS and HRS resistance distribution for the different current compliances, and the Y-axis is the device cumulative probability. The LRS data represents logic-0 and the HRS data represents logic-1; hence, we extrapolate the LRS curve and the HRS curve towards the negative X-axis until both curves of the specific Icomp intersect. Thus, we extrapolate a more realistic large data set from the actual 200 read/write cycle data set, which shows a significant overlap between the corresponding LRS and HRS. This overlap results in a prediction accuracy error, which is encoded into the LUT-based CNN trained weight model. The technique used to encode RRAM resistance into the CNN trained weights is explained in the following sections. The formula for the linear extrapolation is shown in Eqn. (3). Using any two specific endpoints on the given resistance distribution curve, namely $(x_1, y_1)$ and $(x_2, y_2)$, the new data points are extrapolated by running the formula in a loop. Here the endpoint $(x_1, y_1)$ is the initial point on the curve, and it is fixed, while $(x_2, y_2)$ is the last computed value from every iteration.

$$y = y_1 + \frac{(x - x_1)(y_2 - y_1)}{x_2 - x_1} \qquad (3)$$

Where:
x-axis = Resistance distribution for the given current compliance.
y-axis = Cumulative probability value of the resistance for any specific current compliance.
$x, y$ = Resistance data points to be extrapolated.
$x_2, y_2$ = Last computed resistance data.
$x_1, y_1$ = Initial resistance data points.

In general, the approximated overlap between the specific LRS and HRS distributions after extrapolation is relatively small for high Icomp and larger for low Icomp. The larger overlap at low Icomp is due to the lower number of oxygen vacancy defects and the higher relative change in defect count for smaller conducting filaments during the SET and RESET transitions.
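The looped linear extrapolation described above, with a fixed initial point and a moving last-computed point, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `extrapolate_tail`, the fixed `step` along the x-axis, and the sample coordinates are hypothetical, and the axes are assumed to be already transformed so that the distribution tails are approximately straight lines.

```python
def extrapolate_tail(x1, y1, x2, y2, step, n_points):
    """Extend a distribution tail by repeated two-point linear extrapolation.

    (x1, y1) is the fixed initial point; (x2, y2) is replaced by the
    newly computed point after every iteration.
    """
    points = []
    for _ in range(n_points):
        x = x2 + step                                   # next x position
        y = y1 + (x - x1) * (y2 - y1) / (x2 - x1)       # two-point line
        points.append((x, y))
        x2, y2 = x, y                                   # last computed value
    return points

# Illustrative call: extend a line through (0, 0) and (1, 2) by three points.
tail = extrapolate_tail(0.0, 0.0, 1.0, 2.0, step=1.0, n_points=3)
```

A negative `step` extends the curve toward lower x values, which is the direction needed to reach the LRS/HRS intersection in this setting.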

C. Generating LUT model for Variability Trend Analysis
The GoogleNet CNN's synaptic weights are trained for 1000 image categories with 1.2 million images. The network was constructed with an Inception architecture to enhance the prediction accuracy by localized object detection with comparatively fewer hyperparameters, 25 million compared against 60 million for its predecessor AlexNet [34]. The Inception architecture is formed by stacking smaller CNNs on top of each other to create a deeper network. The basic blocks of such an Inception framework are 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling. The GoogleNet consists of 9 symmetric inception layers, namely 3a, 3b, 4a-4e, 5a, and 5b. Here, we extract GoogleNet's weights from the 3a inception layer for our further simulation, and the underlying convolution sizes of this layer are 1×1, 3×3, 5×5, and 3×3 max pool.
Every trained weight in the CNN is represented in a 32-bit floating-point format, which consists of a mantissa (23 bits), an exponent (8 bits), and a sign bit (1 bit), as illustrated in Fig. 4(c). As proposed in our past work [35], the normal resistance distribution of RRAM is encoded into the mantissa part of the CNN trained weights as logic-0 with LRS data and logic-1 with HRS data, based on a threshold resistance value RTH (if R < RTH, it is logic "0", and if R > RTH, it is logic "1"); both HRS and LRS resistance data are extracted from the extrapolated data set. For every trained weight of the 3a inception layer of GoogleNet, we "encode" 1000 points of logic-0 or logic-1 from the given RRAM resistance distribution plot for any given current compliance. This results in a LUT with 1000 varying mantissas inheriting the RRAM variability for a given single trained weight. Fig. 4(b) shows a false logic-0 and a false logic-1 at the intersection of the LRS and HRS distributions. This represents the RRAM variability, and these incorrectly encoded "0" and "1" bits are embedded into the software trained weight mantissa as error data, resulting in a prediction accuracy drop. Hence, with the proposed LUT technique, the RRAM electrical variability is encoded into the given CNN software trained weight and further used to simulate and quantify the impact of RRAM variability on the prediction accuracy rate if the software trained CNN were to be implemented on the "edge".
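The mantissa encoding described above can be illustrated with the following sketch. It is not the authors' implementation: the helper names (`encode_weight`, `build_lut`), the lognormal parameters used to synthesize LRS/HRS samples, and the threshold value are all assumptions chosen for illustration, whereas in this work the resistance samples come from the extrapolated data of Ref. [33].

```python
import numpy as np

MANTISSA_BITS = 23  # float32: 1 sign bit, 8 exponent bits, 23 mantissa bits

def float_to_bits(weight):
    """Raw 32-bit pattern of a float32 value as a Python int."""
    return int(np.array([weight], dtype=np.float32).view(np.uint32)[0])

def bits_to_float(bits):
    """float32 value of a raw 32-bit pattern."""
    return float(np.array([bits], dtype=np.uint32).view(np.float32)[0])

def encode_weight(weight, lrs, hrs, r_th, rng):
    """One LUT entry: mantissa bits re-read through sampled RRAM resistances."""
    bits = float_to_bits(weight)
    out = bits & ~((1 << MANTISSA_BITS) - 1)      # keep sign and exponent
    for k in range(MANTISSA_BITS):
        ideal = (bits >> k) & 1
        samples = hrs if ideal else lrs           # logic-1 -> HRS, logic-0 -> LRS
        r = samples[rng.integers(len(samples))]   # one stored resistance
        read = 1 if r > r_th else 0               # threshold read-out
        out |= read << k                          # may be a false 0 or false 1
    return bits_to_float(out)

def build_lut(weight, lrs, hrs, r_th, entries=1000, seed=0):
    """LUT of `entries` variability-encoded versions of a single weight."""
    rng = np.random.default_rng(seed)
    values = [encode_weight(weight, lrs, hrs, r_th, rng) for _ in range(entries)]
    return np.array(values, dtype=np.float32)

# Synthetic, assumed distributions (not measured data).
demo_rng = np.random.default_rng(1)
lrs = demo_rng.lognormal(mean=np.log(1e4), sigma=0.3, size=2000)
hrs = demo_rng.lognormal(mean=np.log(1e6), sigma=0.8, size=2000)
lut = build_lut(0.15625, lrs, hrs, r_th=1e5)
```

Where the two distributions do not overlap, every LUT entry equals the original weight; overlap produces entries whose mantissa differs from the software value.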
The FPS and FSS inception schemes are developed using the Keras framework. Keras is a deep learning library written in the Python programming language, running on the TensorFlow machine learning platform. The proposed Keras-based FSS and FPS inception architectures are implemented as two parallel computational pipelines, as shown in Fig. 5. The first pipeline works with the original GoogleNet trained weights, called the software trained weights, and the second pipeline takes the RRAM resistance encoded data, referred to as the hardware trained weights. The difference between the outputs of the two inception pipelines gives the actual prediction error, which is due to the RRAM resistance variability (and the false '0' and '1' shown in Fig. 4(b)) encoded into the actual trained weights. For the given input image from the ImageNet-ILSVRC (ImageNet Large Scale Visual Recognition Challenge) data set, we simulated 5000 cycles for a single image, and this procedure was repeated for every Icomp data set of the OxRAM. The obtained prediction and power variability trends are discussed in the following sections.

III. RESULTS AND DISCUSSION
The prediction error trends obtained from the computational difference between the two pipelines using software and hardware trained (RRAM) weights are shown in Figs. 6-8. The computational difference is a relative error difference between the software and hardware pipeline outputs.
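One plausible way to compute such a relative error between the two pipeline outputs is sketched below. The exact error definition is an assumption here: the function name `relative_error_percent`, the small `eps` guard against division by zero, and the clipping to 100% are illustrative choices rather than the formula used in this work.

```python
import numpy as np

def relative_error_percent(software_out, hardware_out, eps=1e-12):
    """Element-wise relative error (%) of hardware vs. software outputs."""
    sw = np.asarray(software_out, dtype=float)
    hw = np.asarray(hardware_out, dtype=float)
    err = np.abs(sw - hw) / (np.abs(sw) + eps) * 100.0
    # Assumed convention: the error is capped at 100%.
    return np.clip(err, 0.0, 100.0)

# Illustrative outputs for a few simulation cycles (assumed values).
sw_out = np.array([2.0, 2.0, 1.0])
hw_out = np.array([1.0, 2.0, 5.0])
errors = relative_error_percent(sw_out, hw_out)
mean_error, std_error = errors.mean(), errors.std()
```

Collecting `errors` over all simulation cycles gives the mean and spread that are plotted against Icomp.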

A. Impact of Convolution Size on Prediction Error for Varying Compliance
The mean prediction error for three different convolution operations (1×1, 3×3, and 5×5) based on our simulation framework is shown in Fig. 6. Note that we did not simulate the max pool block of the hidden layers because there is no arithmetic manipulation involved in the max pool function. Using a comparison module, the max pool operation takes the maximum value in the pixel group and drops the other low-value pixels. Hence, we compare the various convolution operations for the given wide range of current compliances. The prediction error trend for the 1×1 convolution starts from ~63% prediction error for 2µA Icomp, and a steep decline in the error rate is observed as Icomp increases to 5µA, 10µA and 25µA, respectively. Subsequently, the error becomes insensitive to the higher current compliances, as shown in Fig. 6. When we compare the mean trend among the three convolution operations (1×1, 3×3, and 5×5), the magnitudes of the error values for the 3×3 and 5×5 convolutions are 1.5 times and 2 times higher, respectively, than that of the 1×1 convolution operation. The 3×3 and 5×5 matrices are larger than the 1×1 convolution matrix, and so is the probability of false bits getting encoded in the computation. This explains the higher prediction error for an increasing size of convolution operation, as more RRAM devices need to be used to construct the synapses of the hidden layer.
The rise in prediction error for lower Icomp can again be explained based on Figure 4(a). The memory window between HRS and LRS drastically reduces for lower Icomp, which results in higher overlap between the resistance state distributions. Moreover, at low Icomp, the conducting filament is very narrow with very few defects in it and hence, for repeated switching, the relative change in defect count within the filament results in a wide variation of the resistance state. In other words, the probability of false-0 and false-1 rise steeply as we move from 50-100 µA (which falls into the "hard breakdown" regime for dielectrics) to 2-5 µA (traditionally referred to as "soft breakdown"). Furthermore, subjecting the device to consecutive SET and RESET for many thousands of cycles, the defect count and defect density spread are also affected by the gradual reduction in the mobility of the oxygen vacancy defects, resulting in further memory window overlap, more so again at low Icomp. These effects get absorbed into the encoded trained weights and further amplified in the convolution layer's matrix multiplication and summation function resulting in the trends as shown in Figures 6-8.

B. Comparing Errors in Fully Series and Fully Parallel Architectures for Varying Compliance
The prediction error trend for the two extreme inception architectures (FPS and FSS) originating from RRAM variability encoded trained weights is shown in Figs. 7 and 8.
Here, the standard deviation (σ%) of the relative error between the hardware and software pipelines, obtained by simulating a single image over 5000 repetitive simulation cycles, is plotted. Every simulation cycle uses a random entry from the LUT with RRAM variability encoded weights to classify the given image. From the simulation results of the FPS architecture, it is clear that both the variance and the mean of the error decrease for higher Icomp (Fig. 7). It is also important to note that the error flattens out to a finite non-zero value of ~5% for Icomp > 100µA, which suggests that it is unnecessary to operate the device at even higher powers, as the further reduction in relative error is too low to justify the use of a higher power consuming architecture for the edge application. We can achieve at least ~60% power reduction by choosing 100µA instead of 250µA.
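The ~60% figure can be checked with a short worked calculation, assuming that the per-bit power scales as the product of the operating voltage and the compliance current and that the same voltage applies at both compliances (1.5 V, the operating voltage quoted for Fig. 9, is used here for illustration; $P_{\text{bit}}$ and $V$ are symbols introduced for this sketch).

```latex
\begin{align*}
P_{\text{bit}} &= V \cdot I_{\text{comp}} \\
P_{\text{bit}}(100\,\mu\text{A}) &= 1.5\,\text{V} \times 100\,\mu\text{A} = 150\,\mu\text{W} \\
P_{\text{bit}}(250\,\mu\text{A}) &= 1.5\,\text{V} \times 250\,\mu\text{A} = 375\,\mu\text{W} \\
\text{Power reduction} &= 1 - \frac{150\,\mu\text{W}}{375\,\mu\text{W}} = 1 - 0.4 = 0.6 \;(60\%)
\end{align*}
```

Under this assumption the voltage cancels, so the saving depends only on the ratio of the two compliance currents.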
From Fig. 8, it is worth noting that the prediction error mean is much higher, and the variance is also comparatively higher, for the FSS architecture. The FSS system is constructed with a serial chain of convolution operations where the error compounds significantly due to matrix multiplication in every block of the serial chain. This explains why the mean error for the serial architecture is much higher than for the parallel one (shown by dotted purple lines). Surprisingly, the variance for 2µA and 5µA is comparatively smaller than for the higher Icomp, contrary to expectation. It should be noted that this is purely an artifact of the definition of the error, which cannot be more than 100%. The upper percentile error bars have already hit their ceiling of 100% for Icomp ~ {2, 5} µA.

C. Trade-Off Between Prediction Error and Power Consumption
The error variance and the power consumption per memory bit for the two extreme convolution operations, namely 5×5 and 1×1, are studied and plotted in Fig. 9. Here, the Y1-axis on the left shows the error distribution for 5000 cycles at each of the given Icomp, and the Y2-axis on the right shows the corresponding power per bit trend for the synaptic weight with an operating voltage of 1.5V. For higher Icomp, the memory window is wider, with less overlap between LRS and HRS; hence the prediction error spread is smaller, but at the cost of a high computation power budget. For discussion, let us consider the error spread for 5×5 and 1×1 at 2µA and 250µA; the variables responsible for the error spread are the memory window overlap and the convolution matrix's size. Analyzing the curve of the 5×5 convolution function in Fig. 9(a), the prediction error spread for the large Icomp of 250µA is approximately 4X smaller than for Icomp = 2µA. Furthermore, the power consumption at 250µA is about 15X higher than that for 2µA (see yellow line in Fig. 9(a)). In comparison, the prediction error magnitude for the 250µA 1×1 convolution is approximately 8X lower than for Icomp = 2µA, as shown in Fig. 9(b).

[Figure caption: The convolution operation is repeated using encoded trained weights for different Icomp of RRAM operation ranging from 2µA to 250µA. It is worth noting that the edge enhancement is much clearer towards the right for higher current compliance, while for very low Icomp ~ 2µA, the edges are hardly discernible.]
For a comparative analysis of power saving, let us consider an edge device performing a 1×1 convolution operation on a video stream of 224×224 pixels and powered by a coin battery of 130mAh. The 1×1 convolution matrix requires 23 bits of mantissa and uses 23 RRAM devices to hold the trained weights (1 device per bit, assuming binary digital RRAM). In this scenario, we can compute that the 1×1 edge device can be operated for a lifetime of 2826 hours (~118 days) at Icomp ~ 2µA, whereas the operation would last only about 22 hours (less than a day) for Icomp ~ 250µA. Fig. 10 illustrates the convolution loss using RRAM encoded GoogleNet trained weights for the given wide current compliance resistance distribution scale applied to the 1×1 convolution. For illustrative purposes, we have considered two sets of images. The first image depicts a dog, which takes up to 80% of the pixels in the given 224×224 image. The second image is that of ants, which occupy 30% of the pixels in the standard image size of 224×224 pixels. Here, we can deduce that the ant image is comparatively more convoluted and suffers more visible pattern loss, making it appear more blurred than the dog image. Thus, the computational device failure depends on the pattern size in the given standard 224×224 pixel image and on the probability of RRAM false 0/1 bits being encoded at a given bit position. The RRAM device stack performance also depends on the fabrication conditions, which usually span a wide range of parameters and process conditions. Here, we omit the influence of the fabrication process parameters on the device variability for simplicity and stick to the given material stack's resistance distribution data. There is always a trade-off between the CNN network topology, operating power, and end application accuracy.
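The battery lifetime figures quoted above can be reproduced with a simple estimate, sketched below. The model is an assumption made for illustration: each of the 23 RRAM devices is taken to draw Icomp continuously, and peripheral circuits, duty cycling, and battery derating are ignored. The function name `battery_life_hours` is hypothetical.

```python
def battery_life_hours(capacity_mah, i_comp_ua, n_devices=23):
    """Hours of operation if n_devices each draw i_comp_ua continuously."""
    total_current_ma = n_devices * i_comp_ua * 1e-3   # uA -> mA
    return capacity_mah / total_current_ma

life_low = battery_life_hours(130, 2)      # Icomp = 2 uA
life_high = battery_life_hours(130, 250)   # Icomp = 250 uA
print(f"2 uA:   {life_low:.0f} h ({life_low / 24:.0f} days)")
print(f"250 uA: {life_high:.1f} h")
```

Under these assumptions the estimate gives about 2826 hours at 2µA and roughly 22.6 hours at 250µA, in line with the values stated in the text.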
Battery-powered IoT applications in the real world demand low operating power for a long lifetime; hence, compromising prediction accuracy by choosing a smaller current compliance will significantly extend battery life. Recent studies show that edge AI applications with fault tolerance are the next trend for designing low-powered IoT edge devices for monitoring and sensing in remote applications such as oil platforms, covered drains, remote surveillance systems, etc. [36]. An overall prediction error tolerance of 20 to 40% is acceptable in such applications, where sampling and regression trend analysis can determine the error deviation. We can still operate the RRAM at high current compliance with a wide memory window for mission-critical operations alone.

D. Error Propagation Across a Deep CNN Architecture
While the previous analysis focuses purely on the error induced by a single FSS or FPS layer, the CNNs used for most applications typically consist of 4-10 layers. A model of the error propagation through multiple hidden layers of FSS or FPS is therefore helpful for understanding the overall drop in prediction accuracy of the end application. Hence, we have considered two deep networks of 10 hidden layers each, with every layer configured in either the FPS or the FSS extreme architecture, as shown in Fig. 11. The multiple hidden layers in FSS mode are denoted as multiple Fully Serial Sequence (mFSS-CNN), and for the FPS mode, we coin the term multiple Fully Parallel Sequence (mFPS-CNN). Here we use simple statistical formulae to estimate the propagation of layer-to-layer uncertainty. We assume that the FSS and FPS error variance is fixed for a given current compliance, as recorded in Figs. 7 and 8. Chaining the same structure (FSS or FPS) as multiple hidden layers, the error propagation would follow the model below.
The standard deviation of the error (σ) for the respective FPS and FSS architectures is computed for each Icomp-encoded data set using Eqn. (4), with the mean error obtained from 5000 repetitive stimulation cycles. The propagation of error across the deep network can then be estimated by Eqn. (5), where n = 10 is the depth of the network (number of hidden convolution layers) and σCNN represents the overall network error deviation.
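The layer-to-layer estimate can be illustrated with a short sketch. Note that it is built on assumptions of our own and is not a reproduction of Eqns. (4) and (5): we assume a population standard deviation about the mean error for a single layer, and we assume that independent layers with identical σ combine in quadrature (root-sum-square), so that the deviation after k layers is σ√k. The input errors below are synthetic placeholders, not measured data.

```python
import math
import random


def error_std(errors):
    """Population standard deviation of per-cycle errors about their mean."""
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))


def propagate_std(sigma_layer, n_layers):
    """Deviation after each of n_layers identical, independent layers.

    Assumption: variances add, so after k layers sigma_k = sigma * sqrt(k).
    """
    return [sigma_layer * math.sqrt(k) for k in range(1, n_layers + 1)]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic per-cycle error percentages standing in for 5000 cycles
    errors = [random.gauss(5.0, 2.0) for _ in range(5000)]
    sigma = error_std(errors)
    for layer, s in enumerate(propagate_std(sigma, 10), start=1):
        print(f"layer {layer:2d}: sigma = {s:.2f} %")
```

If the layer errors were correlated rather than independent, the growth with depth would be faster than √k, so this sketch should be read as an optimistic baseline.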
We have used the error variance of Icomp = {25, 50, 100} µA to compute the layer-to-layer standard deviation error percentage (σ%) using Eqns. (4) and (5), and the results are shown in Figs. 12(a)-(c), respectively. Considering the output of the 10th layer, σ% at 25µA is ~4 times higher than at 100µA, and a similar trend follows for the respective hidden layers. Note that the mean of the error in the CNN is likely to be the same as the mean of any single layer of the network. The mean errors are indicated in the legend of the plots in Fig. 12 and denoted by "ℳ". Hence, our approach for computing the layer-to-layer error variability enhancement can be used as a preliminary setup to quantify the error trend in a multiple hidden layer CNN. For simplicity, we exclude other device fabrication parameters from this discussion.

IV. CONCLUSIONS OF THE STUDY
In this study, we have highlighted the critical issues involved in the power consumption bottleneck brought about by the memory system in today's cloud servers. The alternative is to move towards in-memory computation for image processing IoT applications. We have compared various modern in-memory technologies and considered RRAM as the candidate for analysis, given its low power footprint, silicon CMOS compatibility, ease of fabrication, and robustness. For a wide range of compliance levels ranging from soft (2-5µA) to hard breakdown (100-250µA), we have quantified the trade-off between power and prediction accuracy for a CNN. The impact of the series-parallel architecture on the prediction error has also been considered, and we have extended our analysis to present the worst case (mFSS) and best case (mFPS) scenarios of how error propagates through a deep CNN of up to 10 hidden convolution layers. The look-up table-based framework proposed here is device technology agnostic and can be used for error quantification in any edge compute application for any device, as long as its operating state variability can be characterized comprehensively, as was done by Fantini et al. in Ref. [33]. This is the first study that clearly quantifies the impact of a hardware realization of an RRAM-based CNN on a practical large-scale open-source network, the popularly used GoogleNet.
Note that our study assumes that the training of the weights in the CNN still happens in the cloud, and the edge computing here is purely for the inference side, using hardware to replace the software-trained weights in order to minimize or even eliminate latency issues due to server-node communication traffic. A truly edge application would require training (learning) and inference to both happen on the nodes themselves, and the training process will also be heavily affected by the inherent device switching variability. We are in the process of extending our framework to account for the variability and error induced in the training phase of a fully RRAM-based network learning process using NAND-based computational logic and 2T-1R architectures to replace the NAND operation. The impact of forward learning and backpropagation on the training robustness of a fully RRAM-based CNN will be the subject of our next study, building on our work here.

ACKNOWLEDGMENT
The first author would like to thank the Ministry of Education (MOE), Singapore for providing the research student scholarship (RSS) at SUTD for 2018-2022. The corresponding author would also like to acknowledge the financial and logistical support from the A*STAR Brain Efficient Nanomechanical Artificial Intelligence Computing (BRENAIC) Research Project No. A18A5b0056, which enabled the work to be accomplished.