A High-Speed and High-Efficiency Diverse Error Margin Write-Verify Scheme for an RRAM-Based Neuromorphic Hardware Accelerator

Resistive random access memory (RRAM)-based neuromorphic hardware accelerators are attractive platforms for neural network acceleration due to their high energy efficiency. However, the inherent variations of RRAM, arising from the diffusion or recombination of oxygen vacancies, can cause significant conductance deviation from the target value, resulting in noticeable performance degradation. In practical ex situ training, write-verify methods are widely adopted to avoid this issue when transferring a trained network model. However, the intensive read and reprogram operations make conventional write-verify methods require extensive programming time and energy. In this brief, for the first time, we propose a novel write-verify scheme that transfers each weight with a different acceptable error margin, achieving high-speed and high-efficiency weight transfer while maintaining network performance. Our experimental results show that the speed and energy efficiency of the write-verify process can be improved significantly, by up to ×3.4∼×9.0 and ×4.1∼×14.1, respectively.


I. INTRODUCTION
RESISTIVE random access memory (RRAM) has been extensively studied as a promising candidate for neuromorphic computing [1], [2]. Highly parallel RRAM-based crossbars are attractive platforms for neural network acceleration [3], [4], [5], [6], [7], [8], [9], [10]. In neuromorphic computing, the conductance values of the RRAM cells in crossbars represent the synaptic weights in the network and should be programmed before computation. However, due to the diffusion or recombination of oxygen vacancies in multiple weakly conductive filament regions, the RRAM conductance might fluctuate when programmed [2]. Deviation of the programmed weights from the trained target weights caused by these variations is inevitable and degrades the network performance significantly. Two mainstream solutions, in situ [3], [4], [5], [6] and ex situ training, have been proposed to mitigate the impact of conductance fluctuation on network performance. In situ training, which trains directly on a crossbar array, is effective but requires extra complex hardware for backpropagation and weight updating. In contrast, widely used practical ex situ training methods can be easily implemented [7], [8], [9], [10]. The networks are trained using existing software platforms and then transferred to a neuromorphic computing accelerator. To transfer an externally trained network model to a crossbar, RRAM cells are programmed to the target conductance states within an accepted error margin by using write-verify methods. Write-verify methods can reduce the weight deviation remarkably while maintaining network performance.
However, using an identical error margin for each weight in the write-verify process is extremely energy- and time-consuming, since it demands a large number of read and program operations for every individual weight. In addition, an identical error margin that is too large or too small may degrade network performance by applying imprecise weights, or reduce transfer efficiency by adding write-verify cycles. Such costs make it impractical to reprogram large-scale RRAM-based neuromorphic hardware accelerators for different tasks. These issues hinder application to areas such as mobile edge computing scenarios, which demand high-speed and high-efficiency write-verify schemes for weight transfer. However, an efficient solution for weight transfer is lacking.
A straightforward way to tackle these issues is to relax the limit of the error margin for each weight differently, resulting in fewer read and program operations. However, due to the difficulty and complexity of evaluating the influence of each weight deviation on network performance, determining an acceptable error margin for every weight remains a great challenge. In this brief, we present a unified and efficient write-verify scheme using diverse error margins. The major contributions of this brief are summarized below:
• The proposed ex situ training method converts the weights in the traditional network into probabilistic distributions. Based on these distributions, the scheme transfers each weight with a different acceptable error range and can greatly reduce the cost of weight transfer.
• We evaluate the proposed diverse error margin scheme with two typical deep neural networks on a classification task.

II. PRELIMINARIES

A. RRAM-Based Neuromorphic Hardware Accelerator
The basic operation of a deep neural network can be expressed as vector-matrix multiplications (VMMs). A differential pair of RRAM cells represents each synaptic weight value in a neural network, so both positive and negative weights can be fully represented by RRAM. The voltages applied on the word lines form the input vector of a network layer, and the computing result vector is the accumulated output currents flowing on the bit lines. In an RRAM-based VMM operation, we assume that the dimensions of the input vector of the network layer and the result vector are n and m, respectively. The entry of the output current vector is I_j = Σ_{i=1}^{n} W_{i,j} V_i, where W_{i,j} = G^+_{i,j} − G^-_{i,j} is the differential conductance of a pair of RRAM cells, representing one synaptic weight in the network, and V_i is the entry of the input voltage vector.
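As a concrete illustration of this mapping, the short NumPy sketch below computes the bit-line currents from a differential conductance pair; the array sizes, conductance window, and function name are illustrative choices rather than details taken from the accelerator described here.

```python
import numpy as np

def crossbar_vmm(V, G_pos, G_neg):
    """Ideal differential-pair VMM: I_j = sum_i (G+_{i,j} - G-_{i,j}) * V_i.

    V     : (n,)   read voltages applied on the word lines
    G_pos : (n, m) conductances of the positive RRAM cells (S)
    G_neg : (n, m) conductances of the negative RRAM cells (S)
    returns (m,)   accumulated bit-line currents (A)
    """
    W = G_pos - G_neg   # effective signed weights W_{i,j}
    return V @ W        # current summation along each bit line

# Toy example with n = 4 inputs and m = 3 outputs
rng = np.random.default_rng(0)
V = rng.uniform(0.0, 0.2, size=4)
G_pos = rng.uniform(2e-6, 20e-6, size=(4, 3))   # 2-20 uS window (illustrative)
G_neg = rng.uniform(2e-6, 20e-6, size=(4, 3))
print(crossbar_vmm(V, G_pos, G_neg))
```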

B. Ex Situ Training
The neural network models are first trained to obtain target weights, utilizing ex situ training approaches before hardware computation. In typical ex situ training [7], [10], the network training process is performed on a conventional software platform. The weights in the network are optimized until the network achieves the expected performance. Next, the quantized target conductance values of RRAM corresponding to the weights are obtained. Finally, the target conductance values are transferred into the hardware accelerator. The ex situ training method can easily make use of existing high-performing computation platforms. For multiple transfers, the network training process does not need to be repeated, since the learned weights can be directly transferred into multiple hardware accelerators. Therefore, the ex situ training method can be applied on a large scale, especially for edge devices used for inference.

C. Weight Transfer With Verification
The RRAM-based accelerator suffers from various sources of variations and noise [12]. It is difficult to precisely transfer learned weights into the hardware accelerator due to these variations, and such transfer errors can have a significant impact on a neural network's performance. To address the transfer error, Alibart et al. [13] proposed a simple, widely used closed-loop feedback scheme to modulate the RRAM conductance. The RRAM is first programmed to an initial random conductance state. Then, the conductance value is measured by a read pulse to verify whether it is precise, i.e., whether its value falls within an identical error margin of the target value. If it does not, the RRAM is programmed again with an additional program pulse to push its value toward the error margin. Through this series of repeated program and read operations, the conductance value continuously approaches the target value until it is acceptable.
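To make the loop concrete, here is a minimal simulation of such a closed-loop write-verify procedure. The ToyCell device model, its step sizes and noise levels, and the 0.56 μS margin are invented purely for illustration; only the program-read-compare loop reflects the scheme described above.

```python
import numpy as np

rng = np.random.default_rng(1)

class ToyCell:
    """Toy RRAM cell: each pulse moves the conductance by a noisy step."""
    def __init__(self, g0=5e-6, step=0.3e-6, step_noise=0.1e-6, read_noise=0.05e-6):
        self.g, self.step = g0, step
        self.step_noise, self.read_noise = step_noise, read_noise

    def read(self):
        return self.g + rng.normal(0.0, self.read_noise)

    def program(self, sign):
        # sign = +1 for a SET pulse (increase G), -1 for a RESET pulse (decrease G)
        self.g += sign * (self.step + rng.normal(0.0, self.step_noise))

def write_verify(cell, g_target, margin, max_cycles=1000):
    """Program and re-read until the measured conductance falls within
    +/- margin of the target; returns the number of write-verify cycles."""
    cycles = 0
    g = cell.read()
    while abs(g - g_target) > margin and cycles < max_cycles:
        cell.program(+1 if g < g_target else -1)
        g = cell.read()
        cycles += 1   # one program pulse plus one read pulse per cycle
    return cycles

print(write_verify(ToyCell(), g_target=12e-6, margin=0.56e-6))
```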

D. Bayesian Neural Network
Unlike traditional neural networks, where the weights are fixed values, Bayesian neural network (BNN) weights are represented by random variables [14], [15]. A BNN is a parametric model that incorporates the flexibility of neural networks into a Bayesian framework. The learning process of a BNN involves probability distributions; this crucial component incorporates probabilistic weights into the network training process [16]. Thus, the trained weight parameters and calculations must be resilient under tolerable weight deviations. Therefore, a BNN can be used to determine appropriate weights and acceptable diverse error margins. An advantage of a BNN is that it combines the objective function of the network task with the learning of the weight distribution during training. Moreover, a BNN can take the intrinsic non-ideal factors of RRAM, such as read variation, into account as network parameters. The disadvantage of a BNN is the increased training time, which can be mitigated by pretraining.
III. PROPOSED DIVERSE SCHEME

In the traditional flow of an RRAM-based accelerator, the network training process optimizes the weights, and the target weights are then transferred into crossbar arrays with write-verify methods. However, using an identical error margin for each weight in the write-verify process demands a large number of read and program operations, leading to large energy and time consumption. Hence, we propose a high-speed and high-efficiency diverse error margin write-verify (DIVERSE) scheme. The proposed DIVERSE scheme consists of an ex situ training method and a write-verify method. The proposed training method uses trainable parameters to obtain acceptable weight deviations. A diverse error margin determination method is proposed to relax the requirement for exact matches between each RRAM weight and its target weight. The proposed write-verify method uses a different error margin for each weight, reducing the cost of weight transfer significantly.

A. Overview of the DIVERSE Scheme
The DIVERSE scheme involves three major phases, as illustrated in Fig. 1. First, a traditional neural network is transformed into a BNN with the same network structure. The BNN is trained to obtain the appropriate probabilistic weights. The weights are then allocated distinct error margins based on the optimized parameters in the BNN, and the network is eventually transferred onto an RRAM-based accelerator using the proposed diverse error margin write-verify method.

B. Ex Situ Training Using Probabilistic Weights
To acquire acceptable weight deviations without any network performance degradation, a BNN with the same network structure is created from the conventional neural network model. We use Bayes by backprop (BBB), an approximate variational method introduced by Blundell et al. [15], to learn a probability distribution over the parameters of the created BNN. Each probabilistic weight in the BNN follows a normal distribution N(μ_{i,j}, σ²_{i,j}). The mean μ corresponds to the fixed weight in the conventional neural network; when using pretrained conventional network models, we can simply initialize the mean μ with the pretrained weight values. The standard deviation σ is an additional parameter that captures the uncertainty of the weight and can be optimized easily by stochastic gradient descent [17]. The optimized mean and variance represent the appropriate deviation of each weight without affecting the network performance, so that different weights can use different write error ranges. In the RRAM case, the mean μ_{i,j} is the target weight Ŵ_{i,j} to be transferred onto the crossbar, and the acceptable deviation of weight Ŵ_{i,j} is indicated by the standard deviation σ_{i,j}. Training the BNN is the main cost of the ex situ training step. If we use pretrained network models, the training process only fine-tunes the weights, and the cost can be lower than that of training a traditional network. If we train the BNN from scratch without pretrained models, the training cost is usually about twice that of a traditional network with a similar architecture, because every single point-estimate weight in the traditional network becomes two learnable parameters (mean and variance). However, this training process runs on existing software platforms and is a one-time cost for a given application; in mass production of hardware accelerators, the programming cost of the multiple transfers is therefore the relevant expense.
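As a rough sketch of how such probabilistic weights can be trained, the PyTorch snippet below implements one Bayes-by-backprop style layer with a learnable mean and standard deviation per weight, using the reparameterization trick and a Gaussian prior. The layer name, prior scale, KL weight, and training data are placeholders; this is a simplified illustration of the idea, not the exact training setup used in this brief.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Linear layer with probabilistic weights W ~ N(mu, sigma^2),
    trained with the reparameterization trick (Bayes by backprop)."""
    def __init__(self, in_f, out_f, pretrained_w=None):
        super().__init__()
        # Mean mu can be initialized from pretrained point-estimate weights.
        init_mu = pretrained_w if pretrained_w is not None \
            else 0.1 * torch.randn(out_f, in_f)
        self.mu = nn.Parameter(init_mu.clone())
        # rho parameterizes sigma = softplus(rho) > 0.
        self.rho = nn.Parameter(torch.full((out_f, in_f), -4.0))
        self.bias = nn.Parameter(torch.zeros(out_f))

    def sigma(self):
        return F.softplus(self.rho)

    def forward(self, x):
        eps = torch.randn_like(self.mu)      # sample Gaussian noise
        w = self.mu + self.sigma() * eps     # W = mu + sigma * eps
        return F.linear(x, w, self.bias)

    def kl(self, prior_sigma=0.1):
        # KL( N(mu, sigma^2) || N(0, prior_sigma^2) ), summed over all weights.
        s2, p2 = self.sigma() ** 2, prior_sigma ** 2
        return 0.5 * torch.sum(s2 / p2 + self.mu ** 2 / p2 - 1.0 - torch.log(s2 / p2))

# One illustrative training step: task loss plus a scaled KL regularizer.
layer = BayesLinear(784, 10)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = F.cross_entropy(layer(x), y) + 1e-5 * layer.kl()
opt.zero_grad()
loss.backward()
opt.step()
# After training, layer.mu gives target weights and layer.sigma() their tolerances.
```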
After training, the optimal probabilistic weights are resilient and preserve the network performance. Moreover, certain general requirements are imposed during learning to keep the network consistent with the RRAM-based accelerator: the weights are truncated within an identical symmetric range so that the trained weights can be implemented with differential RRAM cells, and an identical minimum value is imposed on the standard deviations so that they are always larger than the read variations, giving the network better read-variation resistance.
Hence, the proposed ex situ training employs probabilistic weights in the training process to generate robust target weights and tolerable weight deviations. This ensures that the network's inference output is resistant to change even when the RRAM weights are transferred with various error margins.

C. Diverse Error Margin Determination
After learning with the proposed ex situ training method, the target weight values in the network are represented by the means μ of the normal distributions, and the acceptable deviation of each weight corresponds to its standard deviation σ. The proposed training method ensures that the network performance is not sensitive to a certain level of deviation (related to σ) of the RRAM weight. In other words, it is not necessary to transfer every weight using a small identical error margin. The larger the standard deviation σ, the larger the acceptable deviation of the weight. Hence, we assign different error margins to the weights, proportional to the standard deviation σ. We can formulate the error margin as EM_{i,j} = k · σ_{i,j}, where EM_{i,j} and σ_{i,j} are the error margin and learned standard deviation of weight Ŵ_{i,j} in the network, respectively, and k is the proportionality factor, which is the same for every weight and is determined by the network and learning task. Then, the upper and lower tolerance boundaries of weight Ŵ_{i,j} are BU_{i,j} = [Ŵ_{i,j} + EM_{i,j}]_Q and BL_{i,j} = [Ŵ_{i,j} − EM_{i,j}]_Q, where [·]_Q is the quantization operation. Therefore, we can use different error margins to transfer the weights with a write-verify scheme and relax the requirement on the circuit.
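A minimal sketch of this determination step is given below, assuming a uniform quantizer over a symmetric weight range; the range, level count, and function names are illustrative rather than taken from the actual hardware mapping.

```python
import numpy as np

def quantize(x, lo, hi, levels):
    """Snap values to a uniform grid of `levels` points on [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    return lo + np.round((np.clip(x, lo, hi) - lo) / step) * step

def diverse_error_margins(w_target, sigma, k, w_min=-18e-6, w_max=18e-6, levels=256):
    """Per-weight error margins EM = k * sigma and the quantized tolerance
    boundaries BU = [W + EM]_Q and BL = [W - EM]_Q around each target weight."""
    w_target, sigma = np.asarray(w_target), np.asarray(sigma)
    em = k * sigma
    bu = quantize(w_target + em, w_min, w_max, levels)
    bl = quantize(w_target - em, w_min, w_max, levels)
    return em, bu, bl

# Example: three weights with different learned standard deviations
em, bu, bl = diverse_error_margins([4e-6, -2e-6, 9e-6], [0.3e-6, 0.8e-6, 0.5e-6], k=1.0)
print(em, bu, bl)
```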

D. Transfer With Diverse Error Margin Write-Verification
To transfer the weights in a network into RRAM conductances, a diverse error margin write-verify method is proposed, as illustrated in Fig. 2a. The target weight Ŵ_{i,j} and the corresponding acceptable error margin EM_{i,j} are determined after completing the previous two phases. The proposed method measures the present RRAM weight of the differential pair. The verification process then checks whether the deviation is within the allowable error margin by comparing the present weight value W_{i,j} with the tolerance boundaries BU_{i,j} and BL_{i,j}. If the verification result is "pass", the write-verify procedure is completed. Otherwise, the conductance of the RRAM is programmed to reduce the deviation until the verify result is "pass". Fig. 2b shows a prototype verification circuit, whose verify logic is the same as that in Fig. 4a for tightening conductance distributions. Each verify VSA is connected to a reference voltage that represents a conductance threshold. The reference voltages are generated by a resistor voltage divider network. The circuit aims to tighten the RRAM conductance within the window formed by the high and low thresholds (BU_{i,j} and BL_{i,j}). As in the conventional write-verify method, all RRAM devices in the same column share a common verification circuit; however, they do not share common thresholds. In fact, when verifying each RRAM device, BU_{i,j} and BL_{i,j} must be reconfigured for the conventional method as well as for ours.
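The per-weight verification loop can be sketched as follows; `read_cell` and `program_cell` stand in for the actual read pulse and SET/RESET pulse drivers, and the loop body is only a schematic rendering of the flow in Fig. 2a.

```python
def diverse_write_verify(read_cell, program_cell, bu, bl, max_cycles=1000):
    """Write-verify against a per-weight window [BL, BU]: read, compare with
    the boundaries, and reprogram toward the window center until the measured
    value passes. Returns the number of write-verify cycles used."""
    center = 0.5 * (bu + bl)
    cycles = 0
    w = read_cell()
    while not (bl <= w <= bu) and cycles < max_cycles:
        program_cell(+1 if w < center else -1)  # SET if below the window, RESET if above
        w = read_cell()
        cycles += 1                             # one program pulse + one read pulse
    return cycles
```

Because BU_{i,j} and BL_{i,j} are derived from the learned σ of each weight, weights with a larger tolerated deviation exit this loop after fewer cycles, which is where the speed and energy savings come from.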

IV. EXPERIMENTAL EVALUATION

A. Experimental Setup
Two common deep neural networks, a multilayer perceptron (MLP) and a CNN called LeNet, are used to validate our proposed scheme on the MNIST dataset. First, the networks are transformed into BNNs with the same network structure, and the BNNs are trained on the MNIST training set using the proposed ex situ training method. We employ the widely used Adam algorithm for training. The conductance window ratio of the RRAM is 10.0, and the minimum conductance is 2.0 μS. Then, the tolerance boundaries are quantized to n conductance levels.
Next, we transfer the weights to the RRAM-based accelerator through the DIVERSE scheme. We obtain the statistics of conductance variation from [18] to establish the simulated RRAM program variation model used in the evaluated experiments; the analog switching data are measured using identical pulses during the program process. The program variation model is ΔG = ΔG_ideal · (1 + Normal(0, S)), where the ideal conductance change (ΔG_ideal) and the update variation factor (S), which depend on the current conductance state (G) and the operation direction (SET or RESET), are obtained from [21]. The read variation model, used to evaluate the final classification drop under different read variations σ_read, is a simple additive normal noise model: G_read = G + Normal(0, σ²_read). The minimum value of the standard deviation is set according to data measured on a physical RRAM-based accelerator in [19]; this small conductance fluctuation mainly originates from RTN, and σ_read = 0.2 μS. For computing-in-memory applications, the endurance of our RRAM device can reach 10^6 cycles [20] and the retention time is 10^4 s at 85 °C [21]. Owing to the robust tolerance of the neuromorphic computing system, endurance and retention have almost no influence, so they are not considered in our experiments. We should point out, however, that the proposed method saves write-verify iterations and programming time, which further improves device endurance and retention.
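For reference, here is one way such variation models can be coded in simulation. The multiplicative form of the program update and the specific parameter values are our reading of the description above and of [18]; treat them as assumptions, not a faithful reproduction of the original models.

```python
import numpy as np

rng = np.random.default_rng(42)

def program_update(g, dg_ideal, S):
    """One program pulse with update variation: the ideal conductance change
    dg_ideal is perturbed by a factor drawn from Normal(0, S) (assumed form;
    dg_ideal and S depend on the state G and the SET/RESET direction)."""
    return g + dg_ideal * (1.0 + rng.normal(0.0, S))

def read_with_noise(g, sigma_read=0.2e-6):
    """Read variation: additive normal noise on the stored conductance."""
    return g + rng.normal(0.0, sigma_read)

# Example: one noisy SET step followed by a noisy read
g = program_update(5e-6, dg_ideal=0.3e-6, S=0.2)
print(read_with_noise(g))
```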
Finally, using the MNIST test set, the accuracy drop of the network is evaluated; the error rate is computed as the number of incorrect predictions divided by the total number of predictions.
The network's accuracy drop is stochastic due to random weight variations. We therefore perform the weight transfer 20 times and analyze the resulting average accuracy drop in each scenario. We study the impact of the DIVERSE scheme by testing various scenarios of the conductance level n and the proportionality factor k. A combination of one program and one read operation is regarded as a single write-verify cycle, and the time spent on weight transfer is proportional to the write-verify cycle number. Therefore, we also average the write-verify cycle number over these trials.
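A compact sketch of this evaluation protocol, assuming a hypothetical run_transfer() helper that performs one stochastic transfer and returns its accuracy drop and total write-verify cycle count:

```python
def average_over_transfers(run_transfer, trials=20):
    """Repeat the stochastic weight transfer and average the accuracy drop
    and write-verify cycle count over the trials."""
    drops, cycles = zip(*(run_transfer() for _ in range(trials)))
    return sum(drops) / trials, sum(cycles) / trials
```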

B. Experiments on MLP
A two-layer MLP with 100 neurons in the hidden layer is used to validate the effectiveness of the DIVERSE scheme; it has 0.15M weights and is represented by 0.30M RRAM cells. The conventional weight transfer method employs identical error margins (IEM) to transfer all weights. As the IEM value increases, the accuracy drop grows and fewer write-verify cycles are required, as shown in Fig. 3a. When the IEM is 0.56 μS, the accuracy drop stays small at around 2.83%, and 4.61×10^5 write-verify cycles are required, fewer than in the cases with smaller identical error margins. Fig. 3a also shows that the DIVERSE scheme requires fewer write-verify cycles than the conventional method for the same accuracy drop.
As shown in Fig. 4a, we study the impact of different error margin proportionality factors k on the number of write-verify cycles and the accuracy drop of the RRAM network. The solid lines with markers indicate the accuracy drop under different conductance levels n, and the color of each marker indicates the write-verify cycle number normalized to the conventional method. When n equals 256, the complexity of the verify circuit and the performance of the network are balanced, as shown in Fig. 4a. In this scenario, the required cycle number for the DIVERSE scheme is reduced to 3.35×10^5 when k = 1.00. The relative accuracy loss is zero (0.00%) even when the error margin factor is k = 1.20, which requires fewer than 1.88×10^5 write-verify cycles, and the energy efficiency is improved by ×1.7 compared with the conventional weight transfer method (Fig. 4b). When k = 2.10, the cycle number is further reduced to 29.59% (×3.4), and the energy efficiency is improved by ×4.1, even though the accuracy loss is 1.00%. These results indicate that reasonable diverse error margin settings can significantly reduce the number of write cycles, saving time and energy while maintaining accuracy.

C. Experiments on LeNet
The proposed DIVERSE scheme is also tested using LeNet, a typical network with convolution layers. The network is made up of 2 convolution layers, 2 max pooling layers, and 3 fully connected layers; it has 0.12M weights and is represented by 0.24M RRAM cells. In comparison to the previous MLP model, it is a fairly deep neural network. After being trained and transferred with the traditional weight transfer method, the final accuracy drop of LeNet is 0.29%, and 3.45×10^5 write cycles are needed, as shown in Fig. 3b. The figure also shows that the DIVERSE scheme requires fewer write-verify cycles than the conventional method for the same accuracy drop. The same experiments as those for the MLP model are carried out, and the results are presented in Fig. 5.
When the conductance level n equals 256, a better tradeoff between verification circuit complexity and network performance is obtained, as shown in Fig. 5a. In this scenario (n = 256 and k = 1.00), the weight transfer cycle number is reduced to 0.89×10^5, and the energy efficiency is improved by ×4.9 with only a slight accuracy loss (0.05%). Moreover, the accuracy remains almost unchanged (a 1.00% drop) when k = 1.60, which improves the energy efficiency almost 14.1 times compared to the traditional method, as shown in Fig. 5b. These experiments show that the DIVERSE scheme then requires only 3.83×10^4 write cycles (×9.0), leading to high-speed and high-efficiency weight transfer. Also, Fig. 4a and Fig. 5a show that LeNet is more sensitive to error margins: its accuracy drops faster than the MLP's as the proportionality factor k increases. This might be caused by the large differences among the weights of the convolution layers, whose deviations impact the capability to extract features.
V. CONCLUSION

This brief proposed a DIVERSE scheme to achieve high-speed and high-efficiency weight transfer for RRAM-based accelerators. The scheme acquires a different tolerable error margin for each weight, which greatly relaxes the constraint on weight deviation from the target value. The experimental results reveal that the DIVERSE scheme can significantly improve the speed and energy efficiency of weight transfer, by ×3.4∼×9.0 and ×4.1∼×14.1, respectively.