Exploring Model Stability of Deep Neural Networks for Reliable RRAM-Based In-Memory Acceleration

RRAM-based in-memory computing (IMC) effectively accelerates deep neural networks (DNNs). Furthermore, model compression techniques, such as quantization and pruning, are necessary to improve algorithm mapping and hardware performance. However, in the presence of RRAM device variations, low-precision and sparse DNNs suffer from severe post-mapping accuracy loss. To address this, in this work, we investigate a new metric, model stability, from the loss landscape to help shed light on accuracy loss under variations and model compression, which guides an algorithmic solution to maximize model stability and mitigate accuracy loss. Based on statistical data from a CMOS/RRAM 1T1R test chip at 65nm, we characterize wafer-level RRAM variations and develop a cross-layer benchmark tool that incorporates quantization, pruning, device variations, model stability, and IMC architecture parameters to assess post-mapping accuracy and hardware performance. Leveraging this tool, we show that a loss-landscape-based DNN model selection for stability effectively tolerates device variations and achieves a post-mapping accuracy higher than that with 50% lower RRAM variations. Moreover, we quantitatively interpret why model pruning increases the sensitivity to variations, while a lower-precision model has better tolerance to variations. Finally, we propose a novel variation-aware training method to improve model stability, in which there exists the most stable model for the best post-mapping accuracy of compressed DNNs. Experimental evaluation of the method shows up to 19%, 21%, and 11% post-mapping accuracy improvement for our 65nm RRAM device, across various precision and sparsity, on CIFAR-10, CIFAR-100, and SVHN datasets, respectively.


INTRODUCTION
Deep neural networks (DNNs) outperform humans in a variety of applications, such as computer vision and natural language processing. Higher accuracy comes at the cost of increased computational complexity and model size, posing great challenges to traditional architectures [1]. In addition, limited on-chip memory capacity leads to a significant amount of communication with off-chip memory, whose energy consumption is 1,000× higher than that of computations [2].
RRAM-based IMC accelerators provide a dense and parallel structure to achieve high performance and energy efficiency [3], [4]. Prior works with RRAM-based crossbar architectures have shown up to 1,000× improvement in energy efficiency compared to CPUs/GPUs [3], [4], [5], [6], [7]. The increased energy efficiency is attributed to a full-custom design under the assumption that all weights are stored on-chip [3], [4], [5]. However, RRAM-based IMC architectures incur a significant area overhead, especially as DNN model sizes rapidly increase. Hence, model compression (e.g., pruning and quantization) is necessary for RRAM-based in-memory acceleration of DNNs.
In reality, RRAM suffers from statistical variations such as quantization error, device-to-device write variations, stuck-at-faults, and limited R_off/R_on ratio, posing a significant challenge to designing reliable RRAM-based IMC architectures [10], [11], [12]. The statistical variations in RRAM cause deviation in the programmed resistance, leading to a significant loss in post-mapping accuracy (i.e., accuracy in the presence of RRAM variations) for DNNs. To mitigate the post-mapping accuracy loss in DNNs, variation-aware training (VAT) is employed [9], [10], [12], [13], [14]. VAT exploits the inherent redundancy in DNNs by embedding the device variations (σ), based on a log-normal or normal distribution model, into the training process to achieve a variation-tolerant model, with no need for re-training for each individual RRAM chip [9], [10], [13]. The conventional VAT techniques train and test the DNN model at the same level of variation. Fig. 1a shows the post-mapping accuracy for ResNet-20 [15] on the CIFAR-10 dataset for different RRAM levels across various average RRAM variations. The baseline accuracy for the floating-point 32-bit (FP-32) model is shown in red (dashed line). The red curve shows the variation of post-mapping accuracy with RRAM levels for our 65nm RRAM data with an average variation (σ_avg) of 0.3. For a given R_off/R_on ratio, a higher number of RRAM levels leads to higher variation and lower accuracy, and vice versa. The higher variation at a higher number of RRAM levels arises from the increased HRS state utilization. Further, we analyze the scenario with a reduced average variation of 0.2 and 0.15. A reduction in RRAM variation through improved process control improves the post-mapping accuracy. However, although the reduction in RRAM variation improves the accuracy, it does not restore the accuracy to the FP-32 level (baseline).
Next, we analyze the effect of RRAM variations on a sparse and quantized DNN. Fig. 1b shows the accuracy of ResNet-20 for CIFAR-10 at 29% sparsity for different RRAM write variations and data precision using the conventional VAT method [9]. Conventional VAT proves ineffective under pruning and quantization, resulting in reduced post-mapping accuracy. Furthermore, a lower precision (ternary) helps improve the post-mapping accuracy, as shown in Fig. 1b. Hence, there is a need for a more systematic solution for reliable RRAM-based in-memory computing for dense, sparse, and quantized DNNs.
To address this, in this work, we propose a new metric, model stability, from the loss landscape to help shed light on accuracy under variations and model compression and guide an algorithmic solution that mitigates the loss. The model stability is visualized by the loss landscape and evaluated by the roughness score [16]. A lower roughness score indicates a smoother loss landscape and a more stable model. Through this, we select the more stable model that can withstand the variations better. The proposed model stability-based model selection effectively tolerates device variations and achieves a post-mapping accuracy higher than that with 50% lower RRAM variations. Next, we propose a novel variation-aware training (VAT) method for best model stability in compressed DNNs. The proposed method utilizes VAT to train the compressed DNN with different scales of device variations (σ) to search for the most stable model and improve post-mapping accuracy. For a given DNN model, higher model stability implies better tolerance of variations and thus higher post-mapping accuracy. We utilize a structured pruning method [8] and model quantization [17], [18] to compress the DNN. The pruning method considers the mapping of the DNN onto the RRAM crossbar for best IMC performance. We show that pruning results in a less stable model, while quantization improves the model stability. We demonstrate that the proposed method achieves up to 19%, 21%, and 11% improvement in post-mapping accuracy on CIFAR-10, CIFAR-100, and SVHN datasets, respectively.
The major contributions of this work are as follows:
• We propose a new metric, model stability, using the loss landscape to mitigate the accuracy loss of dense and compressed DNNs in the presence of RRAM variations.
• Using model stability as a metric to choose the more stable model results in an accuracy improvement similar to that of 50% lower RRAM variations achieved through costly process control.
• We further propose a model stability-based VAT method for compressed DNNs, which searches for the most stable model under variations to achieve the best post-mapping accuracy, without knowing the exact amount of RRAM testing variations upfront.
• Finally, we show that the model stability-based VAT method achieves up to 19%, 21%, and 11% improvement in accuracy for compressed DNNs on CIFAR-10, CIFAR-100, and SVHN datasets, respectively.

DNN MODEL STABILITY
Given a trained DNN model, its model stability is an intrinsic property to withstand perturbations, such as variations in model weights and input noise. Model stability of a DNN, i.e., the DNN generalization capability, is directly related to the contour of the loss function [16], [19], [20], [21]. A flatter contour of the loss function leads to a larger region of acceptable minima, which allows the DNN model to better tolerate variations in both weights and inputs. Conversely, a steeper contour of the loss function leads to a smaller region of acceptable minima [19], which implies that any perturbation to the weights or inputs will lead to appreciable movement of the minima point and thus reduce model accuracy. In this work, we utilize DNN model stability as a metric to guide reliable RRAM-based IMC acceleration.

Fig. 1. Post-mapping accuracy for ResNet-20 on CIFAR-10 (a) across different RRAM levels and average RRAM variations. For a given R_off/R_on ratio, a higher number of RRAM levels suffers from higher variations and thus lower accuracy, while a lower average RRAM variation results in higher accuracy; (b) at 8-bit and ternary precision, with 29% sparsity [8]. The model is trained and tested with the same variation (σ) [9], [10]. The 8-bit model has more accuracy loss than the ternary model as σ increases.

Landscape Visualization
To quantitatively understand model stability, we utilize the landscape visualization method [21] to visualize the minima of the loss function. In [21], filter normalization is applied to remove the scaling effect of injected noise, and a 3-dimensional matrix is generated with x, y, and z coordinates, where x and y represent the scales of two random perturbations injected into the model and z is the value of the loss function. Essentially, this matrix captures the fluctuation of the loss function under local perturbations around the local minimum.
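The construction of the (x, y, z) matrix can be illustrated with a minimal NumPy sketch. It assumes the weights are stored as a 2-D array with one row per filter and that the loss is available as a Python callable; `filter_normalized_direction`, `loss_landscape`, and `toy_loss` are hypothetical names introduced here for illustration and are not the tool from [21].

```python
import numpy as np

def filter_normalized_direction(weights, rng):
    """Draw a random direction and rescale each row (filter) so that its
    norm matches the norm of the corresponding weight row."""
    direction = rng.standard_normal(weights.shape)
    d_norm = np.linalg.norm(direction, axis=1, keepdims=True)
    w_norm = np.linalg.norm(weights, axis=1, keepdims=True)
    return direction * (w_norm / (d_norm + 1e-12))

def loss_landscape(loss_fn, weights, span=1.0, steps=21, seed=0):
    """Evaluate z = loss(weights + x * d1 + y * d2) on a square grid."""
    rng = np.random.default_rng(seed)
    d1 = filter_normalized_direction(weights, rng)
    d2 = filter_normalized_direction(weights, rng)
    coords = np.linspace(-span, span, steps)
    z = np.empty((steps, steps))
    for i, x in enumerate(coords):
        for j, y in enumerate(coords):
            z[i, j] = loss_fn(weights + x * d1 + y * d2)
    return coords, z

def toy_loss(w):
    """Stand-in loss (assumption): a quadratic bowl centered at all-ones."""
    return float(np.sum((w - 1.0) ** 2))

toy_weights = np.ones((4, 9))  # 4 filters with 3x3 weights each (illustrative)
coords, z = loss_landscape(toy_loss, toy_weights)
print(z.shape)
```

In a real setting `loss_fn` would run the trained DNN on a batch of data; the toy loss here only keeps the sketch self-contained.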

Roughness Score
To further quantify the stability of the loss landscape, we calculate the smoothness of the loss function, defined as the roughness score. We fit the 3-dimensional data from the landscape using a quadratic regression on the terms x², y², x, and y (linear in the coefficients) and obtain the mean square error (MSE) of the fitted model, where w_j denotes the fitted coefficients. We denote the stability or roughness score of the DNN model as MSE(z; x², y², x, y, ŵ). A smaller MSE arises from a flat and smooth landscape, and vice versa. Note that such a method was previously used to improve the accuracy in continual learning [16], but that work did not consider device variations, sparsity, and quantization. To the best of our knowledge, this is the first time that model stability has been employed to provide systematic guidance to improve the post-mapping accuracy for the acceleration of dense and compressed DNNs using RRAM-based IMC architectures.
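The roughness score described above can be sketched as an ordinary least-squares fit followed by an MSE computation. This is a minimal sketch: the inclusion of a constant (intercept) term is an assumption, since the exact fitted form is not reproduced here, and `roughness_score` is a hypothetical helper name.

```python
import numpy as np

def roughness_score(x, y, z, intercept=True):
    """Fit z ~ w1*x^2 + w2*y^2 + w3*x + w4*y (+ w5) by least squares and
    return the mean square error of the fit together with the coefficients."""
    x, y, z = (np.asarray(a, dtype=float).ravel() for a in (x, y, z))
    columns = [x ** 2, y ** 2, x, y]
    if intercept:  # assumption: a constant offset is part of the fit
        columns.append(np.ones_like(x))
    design = np.column_stack(columns)
    w_hat, *_ = np.linalg.lstsq(design, z, rcond=None)
    residual = z - design @ w_hat
    return float(np.mean(residual ** 2)), w_hat

# Illustrative surfaces (not measured data): a smooth bowl and a rippled one.
xs, ys = np.meshgrid(np.linspace(-1, 1, 21), np.linspace(-1, 1, 21))
smooth = 2 * xs ** 2 + 3 * ys ** 2 + 0.5
rough = smooth + 0.3 * np.sin(8 * xs) * np.cos(8 * ys)

smooth_score, _ = roughness_score(xs, ys, smooth)
rough_score, _ = roughness_score(xs, ys, rough)
print(smooth_score < rough_score)
```

The smooth bowl is fitted almost exactly, so its score is near zero, while the rippled surface leaves a residual that the quadratic terms cannot absorb.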

Roughness Score and DNN Accuracy
Fig. 2 shows the accuracy and roughness score for different DNNs. We generate different versions of a DNN model by utilizing different weight initializations, which leads to different roughness scores and corresponding accuracy. VGG-16 for CIFAR-10 achieves 94.2% accuracy and the lowest roughness score of 118×10^-3. At the same time, a ResNet-20 version achieves the lowest accuracy of 89.5% with a roughness score of 278×10^-3. To further understand the relationship between the roughness score, loss landscape, and DNN accuracy, we visualize the loss landscape using the method detailed in [21]. VGG-16 has a shallow and smooth loss landscape, while the lowest-accuracy ResNet-20 variant has a rough loss landscape. Through these examples we establish that a lower roughness score leads to a smoother loss landscape, more acceptable local minima for the loss function, and a higher DNN accuracy.

65NM 1T1R RRAM DEVICE
To accurately model the RRAM device properties, RRAM data is collected from a fully integrated 1T1R structure on a 300mm wafer, using a custom RRAM module within the SUNY Polytechnic Institute's 65nm process. The size of each RRAM device is 100nm×100nm. The RRAM device stack comprises a 6nm HfO2 mem-resistive switching layer, a 6nm PVD Ti oxygen exchange layer (OEL), and TiN electrodes (top and bottom). Electrical characterization is performed using a pulse-based approach with a magnitude of 1V to 1.2V and a width of 10ms for the set and reset operations of RRAM devices. Fig. 3 shows the wafer-level cycle-to-cycle switching variations for the 65nm RRAM device measured using the pulse-based switching technique. The high-resistance state (HRS) has a higher variation of up to 0.6, while the low-resistance state (LRS) has a lower variation of up to 0.1. The average variation (σ_avg) for the entire range of HRS and LRS amounts to 0.3. Fig. 4 shows the normalized variation for different R_off/R_on ratios across 1110 1T1R RRAM devices at 65nm. As the R_off/R_on ratio goes up, the HRS state utilization increases. A higher HRS state utilization results in higher device variations and an increase in overall average variation, as shown in Fig. 3. Overall, a lower R_off/R_on ratio results in lower average device variation at the cost of fewer usable resistance levels and lower hardware performance, and vice versa. Fig. 5 shows the retention of both HRS and LRS for 10^5 seconds at 100°C. The grey region overlaid on the plot shows the static variation from the RRAM device. The 65nm 1T1R RRAM device shows low retention variation, as shown in Fig. 5. Table 1 shows the endurance data for both HRS and LRS states up to 1 billion cycles. Both the HRS and LRS states show high endurance, with a clear distinction between the two states up to 1 billion cycles.

Fig. 2. A lower roughness score leads to a smoother loss landscape, higher stability, and thus, higher model accuracy.
Since static write variations are dominant, in this work, we do not consider the effects of retention or endurance. Table 2 summarizes the device models used in the cross-layer simulation framework described in Section 4. In this work, we use the 65nm RRAM models for all our experiments to provide realistic results.

CROSS-LAYER SIMULATION FRAMEWORK
In this work, we develop an in-house simulator to perform system-level benchmarking of the RRAM-based IMC architecture. Fig. 6 shows the block diagram of the cross-layer benchmarking tool utilized.
The simulator incorporates device, circuit, architecture, and algorithm models in a single framework to perform system-level benchmarking. The simulator provides post-mapping accuracy (hardware accuracy), the overall hardware performance, the roofline model, and model stability. The inputs to the simulator include the DNN structure, data precision, target sparsity, technology node, bits per RRAM cell, IMC crossbar size, ADC resolution, read-out method, frequency, NoC topology, and NoC size, among others. Table 3 shows the comparison between different popular IMC simulators. Compared to prior works [22], [23], [24], the proposed simulator supports both SPICE and behavioral-based computation fabric estimation, both accuracy and hardware performance estimation, NoC-mesh interconnects, estimation of the roofline model and model stability of the DNN, and sparsity and quantization.
Furthermore, the supported sparsity follows a hardware-friendly structure as detailed in Section 4.4.1.

Device Models
The tool incorporates device models from the 65nm 1T1R RRAM device, as shown in Table 2. The RRAM device variations are modeled using the log-normal distribution [10], [25]. The R_off/R_on ratio of the RRAM device ranges between 2 and 650. We assume that the discrete resistance levels used to represent the weights are within the limited R_off/R_on ratio. The maximum number of resistance levels that a single RRAM cell can handle is 16, which limits the weight precision mapped to a single cell to 4 bits.

Circuit Estimator
The circuit estimator performs the estimation of the IMC, peripheral circuits, and digital modules within the architecture. We benchmark the overall circuit estimator with SIAM [26]. The circuit estimator performs the mapping of the DNN onto the RRAM-based IMC crossbar architecture. The mapping utilized within the estimator follows that in [26]. The IMC circuit components include the crossbar, wordline (WL) driver, level shifters, bitline (BL) and selectline (SL) multiplexers (MUX), column MUX, analog-to-digital converter (ADC), and shift and add circuit. In addition, the circuit estimator benchmarks the accumulator, pooling unit, and buffer circuits in the architecture. The estimator utilizes the device models and the transistor properties to perform the overall estimation.
[...] associated peripheral circuits. The peripheral circuits include an ADC, column MUX, switch matrix, WL decoder and driver, level shifters, and SL and BL MUX. In addition, the PE consists of a local buffer that is utilized for the movement of activations and partial sums into and out of the PE.

Architecture
The architecture utilizes an NoC-mesh to perform the on-chip data movement at the tile level. Each tile is associated with an NoC router that performs the packet scheduling and routing. The NoC utilizes an X-Y routing mechanism. To benchmark the NoC interconnect, we utilize a cycle-accurate simulator that is benchmarked against the NoC module within SIAM [26]. To perform the estimation, we generate traces for the packets communicated between the tiles, similar to that detailed in [26]. These traces are then utilized as the input to the NoC estimator to evaluate the cost of on-chip communication.

Algorithm
The algorithm component of the simulator performs the DNN training, pruning, quantization, variation-aware training (VAT), evaluation of model stability, and hardware-aware training. The VAT performed in this work utilizes the device models detailed in Table 2. We utilize the log-normal distribution to add the variations to each weight. The variations are added such that they depend on the weight value, thus having a one-to-one correspondence with the real hardware. In addition, for the hardware-aware training we include the effect of the limited precision of the RRAM, the ADC, and the accumulator within the shift and add circuit. Finally, the algorithm component of the simulator evaluates the model stability of the DNN after training. To achieve this, we evaluate the roughness score of the loss function and visualize the loss landscape as detailed in Section 2. In the following sections we detail the pruning and quantization methodologies utilized in this work.
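The weight-dependent log-normal variation can be sketched as a multiplicative perturbation, w' = w · exp(θ) with θ drawn from N(0, σ²). This multiplicative form is a common way to model log-normal device variation and is an assumption here; the simulator's exact mapping from weight value to σ (for example, a larger σ for HRS-heavy levels) is not reproduced. `add_lognormal_variation` is a hypothetical helper name.

```python
import numpy as np

def add_lognormal_variation(weights, sigma, rng=None):
    """Return weights * exp(theta) with theta ~ N(0, sigma^2).

    sigma may be a scalar or an array broadcastable to the weight shape,
    which allows a different variation level per weight (assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), weights.shape)
    theta = rng.normal(loc=0.0, scale=sigma)
    return weights * np.exp(theta)

# Illustrative use: perturb a small weight tensor with sigma = 0.3.
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4))
w_noisy = add_lognormal_variation(w, sigma=0.3, rng=rng)
print(np.all(np.sign(w_noisy) == np.sign(w)))
```

Because the noise is multiplicative, larger weights see larger absolute deviations, and zero (pruned) weights stay exactly zero. During VAT, a fresh perturbation would be drawn for each training step.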

Pruning
In this work, we adopt the structured pruning method in [8]. It utilizes weight-penalty clipping with a layer-wise self-adapting threshold δ_l, where L is the number of layers, G_l is the number of groups in the l-th layer, a hyper-parameter is tuned based on the dataset, and α is the scaling coefficient. The pruning is conducted group-wise along the output channel dimension: for layer l with weight matrix W_l ∈ R^(N_of × N_if × K_x × K_y), we choose a group of size N_g along the N_if dimension, where N_g is determined by the crossbar size. Groups of K_x × K_y × N_g weights are pruned across output channels to favor the IMC.
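The grouping itself can be sketched in NumPy as follows. This sketch only shows how K_x × K_y × N_g groups are formed and zeroed; it uses a fixed, user-supplied L2-norm threshold as a stand-in and does not reproduce the self-adapting clipping threshold or the training-time penalty of [8]. `prune_groups` is a hypothetical helper name.

```python
import numpy as np

def prune_groups(weights, group_size, threshold):
    """Zero out K_x x K_y x N_g groups whose L2 norm is below `threshold`.

    weights: array of shape (N_of, N_if, K_x, K_y).
    Returns the pruned weights, the keep mask of shape (N_of, N_if // N_g),
    and the resulting group sparsity."""
    n_of, n_if, k_x, k_y = weights.shape
    if n_if % group_size != 0:
        raise ValueError("N_if must be divisible by the group size")
    n_groups = n_if // group_size
    grouped = weights.reshape(n_of, n_groups, group_size, k_x, k_y)
    norms = np.sqrt((grouped ** 2).sum(axis=(2, 3, 4)))
    keep = norms >= threshold
    pruned = grouped * keep[:, :, None, None, None]
    sparsity = 1.0 - float(keep.mean())
    return pruned.reshape(weights.shape), keep, sparsity

# Illustrative layer: 2 output channels, 16 input channels, 3x3 kernels.
w = np.ones((2, 16, 3, 3))
w[0, :8] *= 0.01  # make one group small so that it falls below the threshold
w_pruned, keep_mask, sparsity = prune_groups(w, group_size=8, threshold=1.0)
print(keep_mask, sparsity)
```

Each zeroed group corresponds to a block of weights that no longer needs to be mapped onto a crossbar column, which is what makes this sparsity pattern hardware-friendly.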

Quantization
The pruned model is further compressed by applying quantization. For 4-bit or higher precision, we employ uniform in-training quantization [18]. For ternary precision, we follow the ternarization method in [17]. For both ternary and higher bit precision, we employ the straight-through estimator (STE) in backward propagation to counteract the non-differentiability of the discrete quantization function. In this work, we focus on 8-bit and ternary weight quantization.
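As an illustration of uniform weight quantization, the following minimal sketch applies a generic symmetric quantizer to a weight array. It is not the specific in-training scheme of [18]; the symmetric range and max-based scale are assumptions, and `quantize_uniform` is a hypothetical helper name.

```python
import numpy as np

def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to signed integer codes.

    Returns the de-quantized weights, the integer codes, and the step size."""
    weights = np.asarray(weights, dtype=float)
    q_max = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(weights))
    scale = max_abs / q_max if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -q_max, q_max).astype(int)
    # With a straight-through estimator, the backward pass would treat the
    # rounding above as identity, so gradients flow to the latent weights.
    return codes * scale, codes, scale

w = np.linspace(-1.0, 1.0, 11)
w_q, codes, scale = quantize_uniform(w, bits=8)
print(np.max(np.abs(w_q - w)) <= scale / 2 + 1e-12)
```

The rounding error of each weight is bounded by half a quantization step, which is the resolution that the convergence discussion in the following subsection refers to.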

Convergence Analysis
Several works analyze the convergence of DNNs under pruning and quantization [27], [28], [29]. First, for weight quantization, [27] finds that, assuming the loss function is L-Lipschitz smooth and the gradients are bounded, the loss of the quantized model converges to an error that depends on the weight quantization resolution Δw and the weight dimension d. Here, R(T) is the evaluated regret, $R(T) = \sum_{t=1}^{T} \left( f(W_t) - f(W_t^{*}) \right)$, where W_t^* is the best model parameter, T is the number of iterations, and D is the finite diameter of the domain. The smaller Δw or d is, the smaller the error. In addition, [29] provides a convergence analysis of the group Lasso-based pruning method. By setting suitable smoothing parameters, [29] proves both the weak and the strong convergence of the training process for the smoothed neural network. In practice, they show that weak convergence indicates that the norm of the gradients of the smooth cost function goes to zero, and strong convergence implies that the weight sequence tends to a fixed point W_f. Moreover, they demonstrate that this convergence result for the smoothed network is consistent with that of the original non-smooth network with the group-lasso penalty, i.e., the smoothed error function is consistent with the original error function E(W_t).

Fig. 6. The benchmarking tool incorporates algorithm (pruning and quantization), RRAM IMC architecture properties, circuit models, and the 65nm 1T1R RRAM device models. The tool outputs post-mapping accuracy, hardware performance, roofline model, and model stability.

TABLE 3 Comparison Between Different IMC Simulators

MODEL STABILITY FOR RRAM-BASED IMC
In this section, we detail the algorithm utilized to evaluate the model stability of a DNN in the presence of RRAM variations, sparsity, and quantization. Algorithm 1 details the methodology utilized in this work to evaluate the model stability. First, for each DNN model we perform training to generate the floating-point model A. We perform inference with model A to calculate the inference accuracy. Next, we quantize the DNN model to fixed-point weights and activations across all layers of the DNN to generate model B. Uniform quantization is employed to maximize the hardware performance for the RRAM-based IMC architecture. The quantized model is then pruned to generate the structured sparse DNN model C. Thereafter, we add the RRAM variations (σ_train) and perform hardware-aware training for the quantized model to generate model D. In addition to adding the RRAM variations, hardware-aware training involves breaking the convolution or fully-connected (FC) layer into partial convolutions and FC layer operations based on the size of the crossbar and adding the ADC quantization for the column sum from each crossbar. We then perform inference for the hardware-aware trained model D in the presence of the RRAM testing variations (σ_test) to generate the realistic hardware accuracy. Next, to evaluate the model stability for each model, we plot the loss landscape as defined in Section 2.1. Finally, we evaluate the roughness score of the loss landscape (models A, B, C, and D) to quantify the stability of the DNN model. In this work, we propose to use model stability as a metric for reliable RRAM-based IMC acceleration. To achieve this, we propose two directions: first, a model stability-based model selection, and second, a model stability-based VAT method.
Algorithm 1 (excerpt)
11: Add RRAM variations (σ_train) and perform hardware-aware training /* Model D */
12: Plot loss landscape using the tool in [21] for models A, B, C, and D
13: Calculate roughness score of loss landscape for models A, B, C, and D
14: end

Given a dataset, Fig. 2 illustrates that the choice of the DNN model significantly affects the accuracy. Based on this observation, we propose a novel loss landscape-based model selection for stability that tolerates RRAM device variations and achieves higher post-mapping accuracy. Such an observation is attributed to the DNN model stability. Model stability of a trained DNN is the intrinsic property to withstand perturbations such as variations and noise. A more stable model will be more robust under RRAM variations and have higher post-mapping accuracy.
Model stability-based model selection provides a viable solution when there is a choice for the target DNN. However, if the DNN cannot be changed and is compressed, model selection cannot be utilized. To address this, in this work, we propose a novel model stability-based VAT to improve the post-mapping accuracy of DNNs under sparsity and quantization. Previous VAT approaches focus on non-pruned DNNs, require precise knowledge of the RRAM testing variations, and apply that to the training [9], [10]. Distinct from that, we first train the sparse and quantized DNN model with different scales of device variations (σ_train), without knowing the exact amount of RRAM testing variations. The range of σ_train is from 0.1 to 0.5, as suggested by the 65nm 1T1R RRAM data. Furthermore, we evaluate the loss landscape and the roughness score for each of the different VAT variants to help identify the optimal model with the highest model stability (Algorithm 1). A higher model stability from the optimal scale of training variation leads to higher post-mapping accuracy.
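The search over training variations can be sketched as a simple loop that trains one VAT variant per σ_train and keeps the variant with the lowest roughness score. The callables `train_fn` and `roughness_fn` are hypothetical placeholders for the VAT training and roughness evaluation steps; the stand-in functions in the usage lines are arbitrary and do not represent measured results.

```python
def select_most_stable(sigma_train_grid, train_fn, roughness_fn):
    """Train one VAT variant per sigma_train and return the
    (sigma_train, roughness, model) triple with the lowest roughness."""
    best = None
    for sigma_train in sigma_train_grid:
        model = train_fn(sigma_train)    # variation-aware training at this scale
        score = roughness_fn(model)      # roughness score of its loss landscape
        if best is None or score < best[1]:
            best = (sigma_train, score, model)
    return best

# Stand-ins (assumptions) so that the sketch runs without a DNN framework.
def fake_train(sigma_train):
    return {"sigma_train": sigma_train}

def fake_roughness(model):
    return (model["sigma_train"] - 0.2) ** 2

grid = [0.1, 0.2, 0.3, 0.4, 0.5]
best_sigma, best_score, best_model = select_most_stable(grid, fake_train, fake_roughness)
print(best_sigma)
```

Note that the selection criterion uses only the roughness score, so it does not require the testing variation to be known in advance.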

Experimental Setup
We evaluate the proposed model stability metric for reliable RRAM-based IMC acceleration using two main methods. First, we demonstrate a DNN model selection for higher model stability, which improves the overall DNN accuracy. We evaluate the proposed method for ResNet-20 on CIFAR-10, DenseNet-40 for CIFAR-10 and CIFAR-100, and ResNet-32 for CIFAR-100. All experiments are done for a crossbar size of 256×256 with a 5-bit ADC at the periphery. We evaluate for different RRAM levels ranging from 2 to 16.
Next, we extend the model stability metric to present a novel model stability-based VAT method to mitigate the accuracy degradation in RRAM-based IMC architectures. We evaluate the proposed VAT method for ResNet-20 on CIFAR-10, VGG-16 on CIFAR-10, ResNet-32 on SVHN, and ResNet-56 for CIFAR-100. We evaluate the VAT method in the presence of RRAM variations that range from 0.1 to 0.5, ternary and 8-bit quantization, and different levels of structured sparsity generated using the pruning method detailed in Section 4.4.1. The pruning group size is chosen in accordance with the crossbar size for maximum hardware inference performance. We utilize the same crossbar size of 256×256 with a 5-bit ADC at the periphery.

Effect of Pruning and Quantization on RRAM IMC
We follow the mapping as in [26]. In the pruning method, we set the group size in accordance with the number of rows of the crossbar. For example, for a crossbar of size 72×64 and a kernel size of 3×3, we set the group size to be 8. Hence, we prune groups of 3×3×8 weights along the output feature dimension. Therefore, we are able to skip the mapping of 3×3×8 weights along the column dimension of the RRAM crossbar while maintaining high utilization. In this work, we set the group size to be 8 and the crossbar size as 72×64. Fig. 7 shows the energy-delay product (EDP) for ResNet-20 on CIFAR-10 across different sparsity, quantization (8-bit and ternary), and crossbar sizes. A higher sparsity and lower precision lead to lower EDP, and vice versa. At higher rates of sparsity, the EDP reduces exponentially across different grades of quantization, thus increasing the hardware performance. The pruned model has a roughness score higher than the FP-32 model, resulting in a rougher loss landscape, less stability, and lower accuracy. At the same time, the addition of quantization to the pruned model results in a reduced roughness score, making it more stable with a smoother loss landscape and a higher accuracy. Hence, we quantitatively establish through the roughness score and loss landscape that quantization helps improve the model stability and is a necessary step for reliable RRAM-based IMC acceleration of sparse DNNs.
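The group-size example given earlier in this subsection (72 crossbar rows, 3×3 kernels, group size 8) is consistent with dividing the number of crossbar rows by the kernel area. The short sketch below works through that arithmetic; the division rule is inferred from the example rather than stated as the general rule of the method, and `group_size_for_crossbar` is a hypothetical helper name.

```python
def group_size_for_crossbar(crossbar_rows, k_x, k_y):
    """Largest number of input channels N_g whose K_x x K_y kernels fit in
    one crossbar column, plus the number of rows left unused."""
    kernel_area = k_x * k_y
    group_size = crossbar_rows // kernel_area
    unused_rows = crossbar_rows - group_size * kernel_area
    return group_size, unused_rows

group_size, unused_rows = group_size_for_crossbar(72, 3, 3)
weights_per_group = 3 * 3 * group_size
print(group_size, unused_rows, weights_per_group)
```

With these numbers each pruned group of 3×3×8 = 72 weights fills exactly one crossbar column, so skipping a group leaves no partially used column.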

Precision and Variation
The inherent variations with the RRAM device result in significant accuracy degradation. Section 1 details the effect of RRAM variation for ResNet-20 on CIFAR-10 with full precision and pruned and quantized models (8-bit and ternary). Through this, we establish that higher RRAM levels lead to higher variation and degradation in post-mapping accuracy.
Next, we evaluate the effect of ADC precision on post-mapping accuracy. Fig. 9 shows the total inference energy breakdown for ResNet-20 on CIFAR-10 at two ADC precisions, 8-bit and 5-bit. We divide the total energy into ADC, buffer, and other (accumulator, NoC, crossbar, ReLU, pooling, etc.) components. It can be seen that a higher ADC precision leads to higher energy, dominated by the ADC, and higher post-mapping accuracy. At the same time, a lower precision reduces the ADC component, resulting in reduced total energy. At higher RRAM levels, the ADC cost reduces due to the reduced number of crossbars and associated peripherals. Considering the dramatic design challenge and cost, a low-power, high-throughput, and high-precision ADC may not be practical soon for RRAM IMC.

Roofline Model
In this section, we develop a roofline model that comprises the number of RRAM levels, RRAM variation, stuck-at-faults, R_off/R_on ratio, and ADC precision. We evaluate the post-mapping accuracy for two DNNs, DenseNet-40 and ResNet-32 for the CIFAR-100 dataset, for our fabricated HfO2-based RRAM device. Next, we consider the ADC precision in the accuracy estimation and evaluate the post-mapping accuracy. The red curve shows the maximum achievable post-mapping accuracy with a 5-bit ADC. Hence, the RRAM-based IMC accuracy at lower RRAM levels is limited by the ADC precision, while at higher RRAM levels, the RRAM device limits the accuracy. Finally, for our 16-level 65nm 1T1R RRAM devices, only 4-6 levels are useful to achieve the best performance due to the ADC precision, RRAM variations, and algorithm limits.

Model Stability-Based Model Selection
In this section, we show the efficacy of the model stability-based model selection method. Here the training and testing RRAM variations are the same (σ) [10]. Table 4 shows the post-mapping accuracy of various DNNs for CIFAR-10 and CIFAR-100 datasets. We evaluate ResNet-20 and DenseNet-40 for CIFAR-10. For our 65nm RRAM device with a device variation (σ) of 0.3, ResNet-20 achieves 68.83% accuracy. A 50% reduction in RRAM variation (σ = 0.15) through process control results in 72.04% accuracy, a gain of about 3.2 percentage points. At the same time, DenseNet-40, which has higher model stability, achieves a higher accuracy of 72.34% at a σ of 0.3, an improvement of 0.3 and about 3.5 percentage points over ResNet-20 with σ of 0.15 and 0.3, respectively. We note that the more stable model also has a smaller model size, which contributes to improved hardware performance. The improved accuracy is attributed to the lower roughness score and higher model stability from the proposed loss landscape-based model selection. Through this, we establish that a loss landscape-based model selection achieves higher post-mapping accuracy than a 50% reduction in RRAM device variation through process control.

Model Stability-Based VAT
In this section, we detail the efficacy of the proposed model stability-based VAT method. Fig. 11 shows the loss landscape, roughness score, and post-mapping accuracy for ResNet-20 at 29% sparsity and 8-bit quantization for a testing variation (σ) of 0.1 on the CIFAR-10 dataset. The model is trained with different scales of variations (σ) from 0.1 to 0.3 to generate each of the VAT models. The most stable VAT model, with σ equal to 0.15, has the lowest roughness score of 91×10^-3, resulting in a smoother loss landscape, higher model stability, and higher post-mapping accuracy. As the roughness score increases, the loss landscape becomes rougher, and the post-mapping accuracy reduces. Fig. 12 shows the detailed result. The optimal scale of training variation is different for each testing variation and is chosen based on the model stability. Thus, the proposed method is also applicable to situations where the precise RRAM testing variation is unknown. We repeat the same experiment for VGG-16 on CIFAR-10 with ternary quantization and 83% sparsity, as shown in Fig. 13. Table 5 shows the detailed results for VGG-16 on CIFAR-10 across the entire range of testing variations. Conventional methods refer to using the same scale of variation for both training and testing [9], [10]. In contrast, in this work, we show that an optimal scale for training variation results in higher model stability and post-mapping accuracy. Furthermore, at a higher range of testing variations, the proposed method provides greater improvement in post-mapping accuracy. Hence, the systematic model stability-based VAT method is effective in choosing the optimal VAT model at different precision and sparsity for best accuracy across a range of DNN models. Table 6 shows the overall results across different models and datasets.

Fig. 9. ADC dominates the energy consumption, especially under high ADC precision. With higher RRAM levels, its portion reduces due to the reduced number of crossbars and associated peripherals.
All post-mapping accuracy values are compared to those of conventional VAT, where the training and testing variations are the same. ResNet-20 on CIFAR-10 at 29% sparsity and 8-bit precision shows an 11.5% improvement in post-mapping accuracy with the proposed method at a testing variation of 0.3. At the same time, VGG-16 at 88.3% and 82.3% sparsity with ternary quantization achieves 2.2% and 19% improvement in post-mapping accuracy for testing variations (σ) of 0.2 and 0.4, respectively. We evaluate ResNet-32 on the SVHN dataset with ternary quantization at two sparsity levels (48.4% and 71.5%) and achieve up to 11.2% improvement in post-mapping accuracy. Finally, for ResNet-56 on CIFAR-100, at 17.8% sparsity, 8-bit quantization, and a testing variation of 0.15, we achieve a 21.1% improvement in post-mapping accuracy.
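The selection step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: `train_with_variation` and `roughness_score` are hypothetical callables standing in for VAT training at a given σ and for the roughness score defined in Section 2, and the toy stand-ins in the usage lines are synthetic, not measured data.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def select_most_stable(
    train_sigmas: Iterable[float],
    train_with_variation: Callable[[float], Any],
    roughness_score: Callable[[Any], float],
) -> Tuple[float, Dict[float, float]]:
    """Train one VAT model per training sigma and return the sigma whose
    model has the lowest roughness score (the most stable model)."""
    scores: Dict[float, float] = {}
    for sigma in train_sigmas:
        model = train_with_variation(sigma)     # hypothetical training routine
        scores[sigma] = roughness_score(model)  # hypothetical stability metric
    best_sigma = min(scores, key=scores.get)
    return best_sigma, scores


# Toy stand-ins (assumptions for illustration only, not measured data):
# the "model" just records its training sigma, and the "roughness" is a
# synthetic bowl-shaped function with its minimum at sigma = 0.15.
def toy_train(sigma: float) -> dict:
    return {"train_sigma": sigma}


def toy_roughness(model: dict) -> float:
    return (model["train_sigma"] - 0.15) ** 2


best_sigma, scores = select_most_stable(
    [0.1, 0.15, 0.2, 0.25, 0.3], toy_train, toy_roughness
)
print(best_sigma)
```

Because the choice depends only on the roughness score of each candidate, the sketch never needs the testing variation as an input, which mirrors the claim that only an expected range of variations is required.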

Comparison With Other Work
We compare the proposed model stability-based VAT method with the state-of-the-art method proposed in [9]. Table 7 shows the post-mapping accuracy for VGG-16 on CIFAR-10. The proposed method achieves a 5.1% higher average improvement (as defined in [9]) in post-mapping accuracy than [9]. Furthermore, the proposed method provides an improvement in the presence of structured sparsity, which reduces the model stability due to the presence of more sensitive weights. Finally, the proposed method does not require precise prior knowledge of the testing variations (it requires only the expected range), hence providing a more generic solution for reliable RRAM-based IMC acceleration.

RELATED WORK
RRAM-Based IMC Architectures
IMC-based hardware architectures have emerged as a promising alternative to conventional von-Neumann architectures. The crossbar-based IMC structure efficiently combines memory access and analog-domain computation into a single unit for the acceleration of DNNs, and several RRAM-based IMC architectures have been proposed [3], [4], [5], [6], [7], [26], [30]. Authors of ISAAC [5] proposed an RRAM-based IMC architecture for DNN inference. The architecture utilizes a crossbar of size 128×128 to perform the multiply-and-accumulate operations [31]. Authors in [30] proposed a systolic array-based RRAM IMC design for DNN inference. Authors in [4] proposed a methodology to optimize the IMC utilization, area, and energy by utilizing a heterogeneous IMC structure and a custom NoC router-to-tile mapping and scheduling. However, these prior works do not focus on the non-idealities associated with RRAM-based IMC. To address this, in this work, utilizing model stability as a metric, we propose a novel model selection and VAT method to improve the post-mapping accuracy of RRAM-based IMC architectures.
The proposed methods incorporate the effect of architectural and circuit properties into the accuracy estimation while ensuring the best performance.

Pruning and Quantization
Pruning is an effective method to reduce the DNN model size. Pruning can be classified into element-wise pruning, kernel-wise or channel-wise pruning for structured sparsity, and group-wise pruning that follows the underlying hardware execution flow. Element-wise pruning [32], [33] prunes the weights of a DNN in a random manner, while structured pruning techniques [34], [35] produce a structured sparse DNN. Such structured sparsity is more hardware-friendly and achieves better performance than random pruning. Other works have focused on crossbar-aware pruning and FPGA-aware pruning by incorporating the crossbar structure [36] and the underlying FPGA execution flow [37]. Quantization, in turn, provides an efficient method to reduce the bit precision of the weights and activations within the DNN. A quantized DNN model improves hardware performance at the cost of reduced inference accuracy. Prior works have focused on weight and activation quantization using uniform and non-uniform methods [17], [18], [38].
In contrast, in this work, we utilize group-wise structured sparsity along the output channel dimension, where the size of the group is determined by the number of rows in the crossbar. Such pruning requires no additional hardware overhead or special mapping techniques. Furthermore, we combine pruning with quantization to generate a structured sparse low-precision DNN model. We utilize both uniform (4 bits and higher) [18] and ternary quantization [17]. We also analyze the effect of sparsity and quantization on the DNN model by utilizing the model stability and the loss landscape visualization. Through our results, we show that pruning reduces the model stability while quantization improves it. Finally, we analyze the effect of the non-idealities associated with RRAM-based IMC architectures in the presence of lower precision and sparsity for the DNN.
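The group-wise pruning described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure of this work: the weight matrix is assumed to be laid out with input rows along the first axis and output channels along the second, each group is a run of `crossbar_rows` consecutive rows within one output channel, and groups are ranked by their L2 norm (the ranking criterion is an assumption; the text above does not specify it).

```python
import math
from typing import List


def group_prune(
    weights: List[List[float]], crossbar_rows: int, sparsity: float
) -> List[List[float]]:
    """Zero out the lowest-norm groups until roughly `sparsity` of all
    groups are pruned. A group is `crossbar_rows` consecutive rows of a
    single output channel (column), matching one crossbar column segment."""
    n_rows, n_cols = len(weights), len(weights[0])
    groups = []
    for col in range(n_cols):
        for start in range(0, n_rows, crossbar_rows):
            stop = min(start + crossbar_rows, n_rows)
            norm = math.sqrt(sum(weights[r][col] ** 2 for r in range(start, stop)))
            groups.append((norm, col, start))
    groups.sort()  # smallest L2 norm first
    n_prune = int(sparsity * len(groups))
    pruned = [row[:] for row in weights]  # copy so the input is not modified
    for _, col, start in groups[:n_prune]:
        for r in range(start, min(start + crossbar_rows, n_rows)):
            pruned[r][col] = 0.0
    return pruned


# Toy 4x2 weight matrix with a hypothetical 2-row crossbar.
toy_weights = [
    [0.1, 2.0],
    [0.1, 2.0],
    [3.0, 0.2],
    [3.0, 0.2],
]
pruned_weights = group_prune(toy_weights, crossbar_rows=2, sparsity=0.5)
```

Because every pruned group covers a full crossbar column segment, the zeros align with the physical array, which is consistent with the statement that no additional hardware or special mapping is needed.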

Mitigation of Post-Mapping Accuracy Loss
Several methods have been proposed to mitigate the post-mapping accuracy loss due to RRAM variations. Closed-Loop-on-Device (CLD) and Open-Loop-off-Device (OLD) perform iterative read-verify-write (R-V-W) operations at the RRAM device level until the resistance converges to the desired value [39], [40]. Other approaches, such as [41], perform VAT based on a known device variation (σ) characterized from devices, while [9] combines VAT with dynamic precision quantization to mitigate the post-mapping accuracy loss. These approaches partially recover the accuracy but fail to consider the effect of sparsity and quantization on the DNN model. Furthermore, these works do not provide a systematic metric for obtaining a reliable DNN model. In addition, these methods assume a known RRAM device variation and utilize the same scale of variation in training and testing. Hence, precise variation models need to be extracted by mapping the weights onto the fabricated RRAM device. To address these drawbacks, in this work, we propose model stability as a metric for reliable RRAM-based in-memory computing. First, utilizing the roughness score and loss landscape, as defined in Section 2, we show that a lower roughness score leads to higher model stability and higher accuracy. Second, we propose a model stability-based model selection where the choice of the model is driven by the inherent model stability in the presence of the variations within the RRAM-based IMC architecture. Finally, based on model stability, we provide a systematic VAT method that searches for an optimal scale of variation during training for the best accuracy. The proposed VAT method is agnostic to the actual variations within the RRAM device and provides a generic solution for reliable RRAM-based IMC acceleration.
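For contrast with the training-time approach proposed here, the device-level read-verify-write loop mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of [39] or [40]: `ToyRRAMCell` is a hypothetical device model in which each write lands at the target scaled by a Gaussian error, and the relative tolerance and iteration cap are assumed parameters.

```python
import random
from typing import Tuple


class ToyRRAMCell:
    """Hypothetical cell: each write lands at target * (1 + Gaussian error)."""

    def __init__(self, sigma: float, seed: int = 0) -> None:
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.conductance = 0.0

    def write(self, target: float) -> None:
        self.conductance = target * (1.0 + self.rng.gauss(0.0, self.sigma))

    def read(self) -> float:
        return self.conductance


def read_verify_write(
    cell: ToyRRAMCell, target: float, tolerance: float, max_iters: int = 50
) -> Tuple[float, int]:
    """Rewrite the cell until the read-back value is within a relative
    tolerance of the target, or until max_iters writes have been issued.
    Returns (final_value, number_of_writes)."""
    writes = 0
    value = cell.read()                      # read
    while abs(value - target) > tolerance * abs(target) and writes < max_iters:
        cell.write(target)                   # write
        writes += 1
        value = cell.read()                  # read back and verify on next check
    return value, writes


cell = ToyRRAMCell(sigma=0.1, seed=0)
value, writes = read_verify_write(cell, target=1.0, tolerance=0.05)
```

The loop makes the cost of this family of methods visible: every weight may need several programming pulses on the fabricated device, whereas the stability-based VAT method performs its search during training.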

CONCLUSION
In this work, we explore the model stability of DNNs as a metric for reliable RRAM-based in-memory acceleration. Utilizing the loss landscape and roughness score, we show that a more stable model has a lower roughness score, a smoother loss landscape, and higher accuracy under variations. To provide a realistic evaluation, we measured statistical variations from a 65nm 1T1R RRAM test chip and integrated them into a cross-layer benchmark tool to assess model accuracy and other performance metrics under variations. Based on the model stability of DNNs, we propose two methods to achieve reliable RRAM-based in-memory acceleration. First, we propose a novel model stability-based model selection that effectively tolerates RRAM device variations and achieves higher accuracy than that with 50% lower RRAM variations for both the CIFAR-10 and CIFAR-100 datasets. Second, we propose a variation-aware training (VAT) method to mitigate the post-mapping accuracy loss in sparse and quantized DNNs. We conclude that quantization improves the stability under variations, leading to higher accuracy, but pruning reduces the model stability. The proposed VAT method searches for the most stable model to mitigate the post-mapping accuracy loss without prior knowledge of the testing RRAM variations and without re-training during mapping. Experimental evaluation shows up to 19%, 21%, and 11% improvement in post-mapping accuracy at different sparsity, quantization, and device variations on the CIFAR-10, CIFAR-100, and SVHN datasets, respectively.
Rajiv V. Joshi (Fellow, IEEE) received the BTech degree from the Indian Institute of Technology, Bombay, India, the MS degree from the Massachusetts Institute of Technology, and the Dr. Eng. Sc. degree from Columbia University. He is a research staff member and key technical lead with the T. J. Watson Research Center, IBM. He has successfully led predictive failure analytic techniques for yield prediction and the technology-driven SRAM with the IBM Server Group. He developed and commercialized novel memory designs which are universally accepted. He received three Outstanding Technical Achievement (OTA) awards and three highest Corporate Patent Portfolio awards for licensing contributions, holds 60 invention plateaus, and has more than 260 US patents and more than 400 including international patents. He has authored and co-authored more than 210 papers. He received the NY IP Law Association "Inventor of the Year" Award in Feb 2020, the IEEE Daniel Noble Award for 2018, the Industrial Pioneer Award, the Best Editor Award from the IEEE TVLSI journal, and the 2015 BMM Award, and was inducted into the New Jersey Inventor Hall of Fame in 2014.
Nathaniel C. Cady (Member, IEEE) received the BA and PhD degrees from Cornell University, in Ithaca, New York. He is currently an empire innovation professor in nanobioscience with the Colleges of Nanoscale Science and Engineering, SUNY Polytechnic Institute. His active research interests include the development of novel biosensor technologies and biology-inspired nanoelectronics, including novel hardware for neuromorphic computing.
Deliang Fan (Member, IEEE) is currently an assistant professor with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, Arizona. His primary research interests include energy-efficient and high-performance Big Data processing-in-memory circuit, architecture, and algorithm, with applications in deep neural networks, data encryption, graph processing and bioinformatics acceleration-in-memory systems, hardware-aware deep learning optimization, brain-inspired (neuromorphic) computing, and AI security. He has authored and co-authored more than 130 peer-reviewed international journal/conference papers in the above areas. He is the recipient of the best paper awards of the 2019 ACM Great Lakes Symposium on VLSI, the 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), and the 2017 IEEE ISVLSI. He is also the technical area chair of DAC 2021, GLSVLSI 2019/2020/2021, and ISQED 2019/2020/2021, and the financial chair of ISVLSI 2019. Please refer to https://dfan.engineering.asu.edu/ for more details.
Yu Cao (Fellow, IEEE) received the BS degree in physics from Peking University, in 1996, the MA degree in biophysics and PhD degree in electrical engineering from the University of California, Berkeley, in 1999 and 2002, respectively. He is now a professor in electrical engineering with Arizona State University, Tempe, Arizona. He has published numerous articles and two books on nano-CMOS modeling and physical design. His research interests include neural-inspired computing, hardware design for on-chip learning, and reliable integration of nanoelectronics. He served as associate editor of IEEE Transactions on CAD, and on the technical program committee of many conferences.