BETTER: Bayesian-Based Training and Lightweight Transfer Architecture for Reliable and High-Speed Memristor Neural Network Deployment

Deep learning models implemented with memristors offer high scalability and energy efficiency, promising a compact and efficient computing architecture for resource-constrained edge computing applications. These technologies integrate data storage and computation in a highly parallel memristor crossbar array architecture. However, the significant variations arising from the inherent physical randomness of memristors cause large performance degradation in deep learning models, and maintaining performance during deployment incurs substantial energy costs and transfer time. In this brief, for the first time, we propose a unified architecture consisting of a Bayesian-based training method and a lightweight transfer scheme. The proposed architecture tackles the robustness, energy, and time-consumption issues caused by memristor variations. Our experimental results show that it can nearly double the speed and energy efficiency of deploying deep learning models.


I. INTRODUCTION
Deep learning has rapidly developed and achieved remarkable successes in many real-world tasks, including object detection, natural language processing, and strategy development. Deep learning models will surely be widely used in high-complexity inference tasks at the edge. However, conventional von Neumann hardware requires computational energy along with memory-access and memory-leakage components to run deep learning models [1]. The constrained power and area budgets of edge computing applications hinder the deployment of deep learning at the edge. On the other hand, memristor technologies have shown significant advantages over conventional von Neumann hardware in both computing parallelism and energy efficiency [2]. Their high scalability and high-performance computation enable an efficient computing method for resource-constrained edge computing applications.
To accelerate the inference of deep learning models, trained network models must first be deployed locally on an edge device. In memristor crossbar arrays, however, intrinsic conductance variations pose a significant challenge. Due to the random physical behavior of the unpredictable motion of ions and electrons, the conductance exhibits uncertain perturbations [3]. These conductance variations therefore limit the number of distinguishable conductance levels. Moreover, it is impossible to achieve high-precision conductance modulation through a single program operation when transferring model weights. These memristor variations cause weight deviations, leading to considerable network performance degradation [4].
Approaches at both the algorithm and circuit levels have been proposed to overcome this problem. On the one hand, recent research [5], [6], [7], [8] attempts to accommodate variations using training algorithms that repeat evaluation and training. These approaches suffer from serious scalability issues and must be redesigned for each learning task. In addition, recent work retrains the network layers belonging to the SRAM block to recover the accuracy degradation caused by variations after the networks are deployed to a hybrid RRAM-SRAM system [6]. This method requires expensive retraining processes, including the overhead of implementing backpropagation. On the other hand, at the transfer level, closed-loop weight transfer methods (also referred to as write-verify schemes) are widely used to mitigate memristor variations through a feedback scheme that achieves high-precision weight transfer. Memristors are iteratively programmed until their conductance falls within a tolerated error margin. However, since every weight in the network is transferred to the crossbar array through a write-verify step, these approaches consume enormous energy and time for the massive number of weights in large-scale deep neural networks [6], [9], [10].
The sensitivity of network performance to each weight is not uniform [11]. This permits a relaxed and highly efficient transfer method if a codesign solution can fully exploit the redundant capacity of the network. Therefore, a novel co-optimization spanning the algorithm, circuit, and even device levels is needed for reliable and highly efficient memristor neural network deployment. In this brief, we propose a unified architecture covering both variation-aware training and weight transfer, namely, the Bayesian-based training and lightweight transfer (BETTER) architecture. The weight uncertainty unique to a Bayesian neural network (BNN) enables us to handle device variations and learn the task in a single training process. To the best of our knowledge, this is the first lightweight transfer method that bridges the gap between the algorithm and circuit levels. The specific contributions of this brief are as follows:
• A variation-aware training method inspired by a BNN is proposed. The proposed method learns robust weights under a unified optimization objective, considering both the prior knowledge of the variations in a memristor crossbar and the learning task.
• A weight selection method is proposed to find the key learned weights. Using the selection method, it is only necessary to transfer a small part of the weights to attain an accuracy comparable to the ideal network, resulting in a low-cost weight transfer.
• We evaluate the proposed BETTER architecture using three typical deep learning models on a simple dataset, MNIST, and a more complex dataset, CIFAR-100. The results show that our architecture obtains robust networks and improves the speed and energy efficiency of weight transfer by almost 100%.

II. PRELIMINARIES

A. Memristor Neural Networks (MNN)
The basic operation of each layer in a deep learning model can be expressed as vector-matrix multiplications (VMMs), which can be efficiently and naturally implemented using memristor crossbar arrays [2]. Usually, the differential conductance of a pair of programmed memristors in a crossbar array represents a weight value in the neural network. Due to the randomness of unpredictable ion motion, the conductance of a memristor exhibits uncertain perturbations [12]. Stochasticity in ion migration and electron movement causes varying conductive filaments and fluctuating currents, respectively. The two major resulting variation types, program variations and read variations [4], produce random deviations from the expected weight value as well as measurement errors, resulting in inaccurate VMM operations and degraded MNN performance.
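To make the differential-pair mapping concrete, the following is a minimal NumPy sketch of a crossbar VMM under a simple additive Gaussian read-variation model. The mapping scheme, noise level, and all names here are illustrative assumptions, not the measured device model used later in this brief.

```python
import numpy as np

# Illustrative constants: the conductance window matches Section IV-A,
# but the read-noise level and the mapping scheme are assumptions.
G_MIN, G_MAX = 2.0e-6, 20.0e-6   # conductance window (siemens)
SIGMA_READ = 0.25e-6             # additive read-variation std (assumed)

def weights_to_conductances(w):
    """Map signed weights onto differential conductance pairs (G+, G-).

    The effective weight is proportional to (G+ - G-): positive weights
    are carried by G+ and negative weights by G-.
    """
    scale = (G_MAX - G_MIN) / np.abs(w).max()
    g_pos = G_MIN + scale * np.clip(w, 0.0, None)
    g_neg = G_MIN + scale * np.clip(-w, 0.0, None)
    return g_pos, g_neg, scale

def noisy_vmm(x, g_pos, g_neg, scale, rng):
    """One vector-matrix multiplication on the crossbar under read variation."""
    g_eff = (g_pos + rng.normal(0.0, SIGMA_READ, g_pos.shape)
             - g_neg - rng.normal(0.0, SIGMA_READ, g_neg.shape))
    return x @ (g_eff / scale)   # rescale summed currents back to weight units

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(784, 100))   # one layer's weight matrix
x = rng.random(784)                          # input activation vector
y = noisy_vmm(x, *weights_to_conductances(w), rng)
```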

B. Ex Situ Training and Weight Transfer of an MNN
To deploy the weights of a neural network onto memristors, a mainstream solution, the ex situ training method, is usually adopted [9]. In typical ex situ training, the network is trained on an external software platform to obtain the learned weights. Then, the learned weights, after quantization, are transferred into the physical memristor crossbar array. The aforementioned variations limit the precision when the learned weights are directly transferred into a memristor crossbar array, which significantly degrades the performance of the neural network. To address this issue, Alibart et al. [13] proposed a widely used iterative closed-loop feedback scheme to modulate memristors. In this typical write-verify method, the memristor is programmed through a series of repeated program and read operations until its conductance falls within an error margin of the target value. However, write-verify methods consume enormous energy and time since they are performed for every weight in the network.
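A schematic sketch of this closed-loop scheme follows. The loop structure mirrors the write-verify idea of [13], but the toy device model, step sizes, and noise levels are stand-ins introduced only so the example runs.

```python
import numpy as np

def write_verify(target_g, program_pulse, read, margin=0.5e-6, max_cycles=100):
    """Program one memristor until its read-back conductance lies within
    `margin` of `target_g`. The returned cycle count is the quantity that
    dominates transfer time and energy.
    """
    cycles = 0
    g = read()
    while abs(g - target_g) > margin and cycles < max_cycles:
        program_pulse(np.sign(target_g - g))  # SET if too low, RESET if too high
        g = read()
        cycles += 1
    return cycles

# Toy stand-in device (assumed): each pulse moves conductance by a noisy step.
rng = np.random.default_rng(1)
state = {"g": 2.0e-6}
def read():
    return state["g"] + rng.normal(0.0, 0.1e-6)
def program_pulse(direction):
    state["g"] += direction * (0.6e-6 + rng.normal(0.0, 0.2e-6))

n_cycles = write_verify(target_g=10.0e-6, program_pulse=program_pulse, read=read)
```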

C. Bayesian Neural Network
A BNN is a parametric model that places the flexibility of neural networks in a Bayesian framework [14]. Unlike the weights in conventional networks, which are fixed values, all weights in a BNN are represented by probability distributions over possible values. Given a dataset D, the goal of BNN training is to optimize the posterior distribution of the weights p(w|D) using Bayes' theorem:

$$p(w|D) = \frac{p(D|w)\,p(w)}{p(D)}$$

where p(w) is the prior weight distribution, p(D|w) = p(y|x, w) is the likelihood corresponding to the output of the network, and p(D) is the marginal likelihood. The true posterior p(w|D) is usually approximated by q(w|θ) using an inference method. The goal of minimizing the closeness between q(w|θ) and p(w|D), measured with the Kullback-Leibler (KL) divergence, can be formulated as:

$$\theta^{*} = \arg\min_{\theta} \mathrm{KL}\big(q(w|\theta)\,\|\,p(w|D)\big) = \arg\min_{\theta} \mathrm{KL}\big(q(w|\theta)\,\|\,p(w)\big) - \mathbb{E}_{q(w|\theta)}\big[\log p(D|w)\big]$$

A backpropagation algorithm can be used to optimize the complexity cost KL(q(w|θ)||p(w)) and the likelihood cost E_{q(w|θ)}[log p(D|w)]. For a Gaussian weight distribution, θ is equivalent to the mean μ and standard deviation σ, and the posterior q(w|θ) can be formulated as:

$$q(w|\theta) = \prod_{i} \mathcal{N}\big(w_{i};\, \mu_{i}, \sigma_{i}^{2}\big)$$

A BNN infers the posterior weight distribution from the prior p(w) and the likelihood p(D|w). This main characteristic introduces weight uncertainty into the learning process; therefore, the learned weight parameters and computations must be robust under perturbation of the weights. The variations in an MNN are similar to the weight uncertainty in a BNN. Consequently, a BNN can be used to obtain appropriate weights and enhance the reliability of an MNN. Recently, some works exploiting memristor variability to implement BNNs have been reported [15], [16], [17]. However, they require a complex memristor variation model to train the BNN during the ex situ training phase, and such a model is hard to construct precisely due to the complex physical mechanism. Our proposed BETTER architecture views the prior p(w) in the training algorithm from a new perspective and eliminates the need for the complex variation model used in the above works. We also propose a lightweight transfer scheme that deploys only a small part of the weights rather than all of the BNN weights as in previous works.
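To ground the objective above in code, here is a minimal PyTorch-style sketch of a Bayesian linear layer trained by backpropagation with the standard reparameterization trick. The layer sizes, prior width, and KL weighting are illustrative assumptions, not the settings used in this brief.

```python
import torch
import torch.nn.functional as F

class BayesianLinear(torch.nn.Module):
    """Linear layer with a factorized Gaussian posterior q(w|θ) = N(μ, σ²)."""
    def __init__(self, n_in, n_out, prior_sigma=0.1):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = torch.nn.Parameter(torch.full((n_out, n_in), -3.0))
        self.prior_sigma = prior_sigma

    def forward(self, x):
        sigma = F.softplus(self.rho)                    # σ > 0 via softplus
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterized sample
        # Closed-form KL(q || p) against the zero-mean Gaussian prior N(0, s²)
        s = self.prior_sigma
        self.kl = (torch.log(s / sigma)
                   + (sigma ** 2 + self.mu ** 2) / (2 * s ** 2) - 0.5).sum()
        return F.linear(x, w)

layer = BayesianLinear(784, 10)
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
# Likelihood cost plus complexity cost, as in the objective above;
# the 1/50000 weighting of the KL term is an assumed dataset-size factor.
loss = F.cross_entropy(layer(x), y) + layer.kl / 50000.0
loss.backward()
```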

III. BETTER ARCHITECTURE
In the existing ex situ training method, training and weight transfer are separated at the algorithm and mapping levels. Instead, we propose a unified architecture covering both variation-aware ex situ training and weight transfer, namely, the Bayesian-based training and lightweight transfer (BETTER) architecture. The architecture bridges the gap between the algorithm, mapping, and device levels. The proposed Bayesian-based ex situ training method uses the worst-case device variations as a prior to obtain reliable neural network models. In addition, the proposed lightweight transfer needs to transfer only part of the network weights, which efficiently deploys the network on memristor crossbars. The architecture not only significantly reduces the cost of weight transfer but also retains the high accuracy and stability of the MNN.

A. Overview of the BETTER Architecture
The overall process of the proposed BETTER architecture is shown in Fig. 1. First, the memristor crossbar provides its variation statistics as the prior to a BNN, and a network model is trained through the BNN training process to obtain the optimized posterior weight distribution. Then, the key weights in the network are selected and retained using the proposed significant weight selection method. Finally, the network is deployed on the memristor crossbar by transferring only a small percentage of the weights.
There is a two-way connection between the algorithm and the memristor (circuit) level. In the direction from circuit to algorithm, the crossbar array provides the memristor variations as prior knowledge to the BNN in the proposed Bayesian-based training algorithm. In the reverse direction, from algorithm to circuit, the training method feeds the selected target weights back to the circuit, enabling a lightweight transfer. As a result, we obtain a robust MNN and significantly reduce the energy and time consumption of weight transfer.

B. Bayesian-Based Training With Variation Priori
First, since the transferred weight values can be considered samples generated from some distribution, we can naturally view the memristor weights as uncertain weights following a Gaussian distribution N(μ, σ²) in a BNN. The means μ are the target weights that will be transferred to the crossbar. Second, when optimizing the complexity cost term in the objective function of the BNN, the weight distribution N(μ, σ²) is driven to be as similar as possible to the prior. Thus, we can use the worst-case variations as the prior knowledge to obtain a large posterior standard deviation σ during training. After training, the learned weight distributions tolerate the worst-case situation as far as possible while guaranteeing network performance. Third, since read variation persists throughout the lifetime of a device, the conductance deviation never reaches zero; therefore, we further restrict the minimum value of the posterior standard deviation in the learning process. Furthermore, some general constraints make the network more compatible with the memristor crossbar array; for example, the weights are truncated to a symmetric range due to the limited conductance window of the memristor. Hence, the proposed Bayesian-based training integrates the influence of memristor variations into network training, ensuring that the inference output of the MNN is robust and reliable even under perturbation of the memristor weights.
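A sketch of these modifications, grafted onto the Bayesian layer from Section II-C: the prior standard deviation is set from the worst-case variation, the posterior standard deviation is floored, and the means are truncated to a symmetric range. The numeric constants below are placeholders, not the measured values of [18].

```python
import torch
import torch.nn.functional as F

# Placeholder constants standing in for the measured statistics of [18];
# the actual values come from the physical crossbar characterization.
SIGMA_WORST = 0.05   # worst-case variation used as the prior std
SIGMA_MIN   = 0.01   # floor on the posterior std (read variation never vanishes)
W_MAX       = 1.0    # symmetric weight bound from the conductance window

def sample_weight(mu, rho):
    """Draw one weight sample under the variation-aware constraints."""
    sigma = F.softplus(rho).clamp(min=SIGMA_MIN)  # restrict minimum posterior std
    mu = mu.clamp(-W_MAX, W_MAX)                  # truncate to a symmetric range
    return mu + sigma * torch.randn_like(sigma), sigma

def complexity_cost(mu, sigma):
    """KL(q(w|θ) || p(w)) against the worst-variation prior N(0, SIGMA_WORST²)."""
    return (torch.log(torch.as_tensor(SIGMA_WORST)) - torch.log(sigma)
            + (sigma ** 2 + mu ** 2) / (2 * SIGMA_WORST ** 2) - 0.5).sum()

mu = torch.zeros(10, 784, requires_grad=True)
rho = torch.full((10, 784), -3.0, requires_grad=True)
w, sigma = sample_weight(mu, rho)
complexity_cost(mu.clamp(-W_MAX, W_MAX), sigma).backward()
```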

C. Significant Weight Selection
To reduce the cost of transferring weights into a crossbar, we propose a significant weight selection method; only the selected weights are transferred to the hardware. Before deploying the network, the memristors are first programmed to an initial low-conductance state. Hence, not transferring a weight is equivalent to fixing it to zero. This suggests that a weight can be dropped if its probability density at zero is sufficiently high. For a Gaussian weight, this density is governed by the signal-to-noise ratio (SNR = μ/σ): the larger the SNR, the lower the probability density at zero, and the more significant the weight is for the network.
Therefore, we leverage the SNR to develop a significant weight selection approach, as sketched below. First, we evaluate the SNR of every weight in the network. Then, we sort all the weights in descending order of SNR. Finally, we select the top group as the significant weights. The specific percentage of significant weights can be determined according to the actual network model and dataset.
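A minimal sketch of this procedure follows. Taking the SNR on the weight magnitude is our interpretation (since μ can be negative), and the 55% keep fraction mirrors the MLP experiment in Section IV-B rather than a general recommendation.

```python
import numpy as np

def select_significant_weights(mu, sigma, keep_frac):
    """Boolean mask keeping the top `keep_frac` of weights by SNR.

    Unselected devices stay in the initial low-conductance state,
    which is equivalent to fixing those weights to zero.
    """
    snr = np.abs(mu) / sigma                 # SNR taken on the weight magnitude
    k = max(1, int(keep_frac * snr.size))
    threshold = np.sort(snr, axis=None)[-k]  # k-th largest SNR value
    return snr >= threshold

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 0.1, size=(100, 784))
sigma = rng.uniform(0.01, 0.05, size=(100, 784))
mask = select_significant_weights(mu, sigma, keep_frac=0.55)
# Only mu[mask] is written to the crossbar via write-verify cycles.
```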

IV. EXPERIMENTAL EVALUATION
To verify the proposed architecture, we use three typical deep neural networks, one multilayer perceptron (MLP) and two convolutional neural networks (LeNet and AlexNet), for classification tasks on the MNIST and CIFAR-100 datasets.

A. Experimental Setup
First, the networks are trained on 50,000 images of the MNIST training set using the proposed Bayesian-based training method. The worst-case variations for the prior standard deviation and the minimum value of the posterior standard deviation are obtained from [18], measured on a physical memristor crossbar array. Then, the weights are quantized to 8 bits and transferred to a memristor crossbar array, and the misclassification rate of each network is evaluated. For the program variation model of the memristor, we obtain the variation statistics from [19], where the analog switching behaviors are tested under identical pulses during the program process. The program variation model is shown in Table I: the ideal conductance change (ΔG_ideal) and the update variation factor (S) depend on the current conductance state (G) and the operation direction (SET or RESET). The read variation model, used to evaluate the classification drop under different read variations σ_read, is a simple additive normal noise model, as shown in Table I. The conductance window of the memristor is from 2.0 μS to 20.0 μS; that is, the current window is from 0.4 μA to 4.0 μA at a read voltage V_read = 0.2 V. When a transferred weight is verified on the crossbar, the error margin is set to 0.5 μS.
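The sketch below ties the setup numbers together: it maps 8-bit quantized weight means into the stated 2.0-20.0 μS window and applies the additive read-noise model. The quantization and mapping schemes themselves are assumptions made for illustration.

```python
import numpy as np

# Constants taken from the setup above; the quantization and mapping
# schemes are illustrative assumptions.
G_MIN, G_MAX = 2.0e-6, 20.0e-6   # conductance window (siemens)
ERROR_MARGIN = 0.5e-6            # write-verify tolerance (siemens)
V_READ = 0.2                     # read voltage (volts)

def quantize_8bit(w):
    """Symmetric 8-bit quantization of the learned weight means."""
    step = np.abs(w).max() / 127.0
    return np.round(w / step) * step

def to_conductance(w):
    """Linear map of weight magnitude into the conductance window."""
    g = G_MIN + (G_MAX - G_MIN) * np.abs(w) / np.abs(w).max()
    return g, np.sign(w)           # sign handled by the differential pair

def read_current(g, sigma_read, rng):
    """Read current under the additive normal read-variation model."""
    return V_READ * (g + rng.normal(0.0, sigma_read, g.shape))

rng = np.random.default_rng(0)
w = quantize_8bit(rng.normal(0.0, 0.1, size=(100, 784)))
g, sign = to_conductance(w)
i = read_current(g, sigma_read=0.75e-6, rng=rng)   # e.g., σ_read = 0.75 μS
```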
Since the weight deviations are randomly generated, the misclassification rate of the network is also stochastic. To precisely evaluate the final misclassification rate in each case, we repeat the weight transfer ten times and classify the images in the test set. To assess the proposed significant weight selection method, we compare it against two traditional baselines: mean-based and standard-deviation-based weight selection. For these baselines, we sort all weights by their learned mean μ or standard deviation σ and select the top percentage as the significant weights. We consider a pair of program and read operations as one write-verify cycle. The time and energy consumed by weight transfer are positively related to the number of write-verify cycles; hence, we use the average write-verify cycle count over ten weight transfers as an indicator of time and energy consumption. We further study the impact of different read variations σ_read on inference accuracy to evaluate the robustness of our architecture. A sketch of the three selection criteria follows.
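For clarity, here is a sketch of the three selection criteria compared in Figs. 2-4. Taking the magnitude of μ for the mean-based baseline is an assumption; the keep fraction is a free parameter swept in the experiments.

```python
import numpy as np

def significance_score(mu, sigma, criterion):
    """Scores for the three weight selection criteria compared here."""
    return {
        "mean": np.abs(mu),         # mean-based baseline (magnitude assumed)
        "std": sigma,               # standard-deviation-based baseline
        "snr": np.abs(mu) / sigma,  # proposed SNR-based criterion
    }[criterion]

def top_fraction_mask(score, keep_frac):
    """Keep the `keep_frac` highest-scoring weights."""
    k = max(1, int(keep_frac * score.size))
    return score >= np.sort(score, axis=None)[-k]

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 0.1, size=(100, 784))
sigma = rng.uniform(0.01, 0.05, size=(100, 784))
masks = {c: top_fraction_mask(significance_score(mu, sigma, c), 0.55)
         for c in ("mean", "std", "snr")}
```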

B. Experiments Using an MLP
The effectiveness of the proposed BETTER architecture is first evaluated on a two-layer MLP whose hidden layer has 100 neuron units. We trained the MLP with the proposed training method using the Adam optimizer with a learning rate of 1 × 10⁻³. The traditional weight transfer method, which transfers all the weights, attains a final misclassification rate of approximately 2.86% and requires 5.2 × 10⁵ write-verify cycles; the accuracy is identical to that of the original BNN training. Based on the trained MLP, we investigate the impact of different selection methods and different percentages of significant weights on the number of write-verify cycles and the misclassification rate of the memristor network. The results are shown in Fig. 2.
In Fig. 2, solid lines with different markers indicate the misclassification rates of the three weight selection methods, and the marker color represents the number of write-verify cycles. The SNR-based weight selection method requires only approximately 55% of the weights and 3.3 × 10⁵ cycles to deploy the MLP without any accuracy drop, an almost 2× speedup over the traditional method. In contrast, the mean-based method requires approximately 80% of the weights and 4.8 × 10⁵ cycles, and the standard-deviation-based method requires approximately 65% of the weights and 3.5 × 10⁵ cycles. This suggests that the SNR-based method selects weights better than the other two in general and significantly reduces the number of write cycles while keeping the accuracy unchanged. As Table II shows, our architecture consistently maintains a robust inference output across different variations; the accuracy drops by only 1.05% even at the largest read variation, σ_read = 1.25 μS.

C. Experiments Using LeNet
In this section, a larger network with convolution layers, LeNet, is used to verify the effectiveness of the proposed BETTER architecture. LeNet consists of two convolution layers, two pooling layers, and three fully-connected layers. After being trained with the proposed training method and transferred with the traditional weight transfer method, LeNet attains a final misclassification rate of 0.23% and requires 3.9 × 10⁵ cycles. The same experiments as for the MLP are performed, and the results are shown in Fig. 3. To recover the accuracy, the mean-based, standard-deviation-based, and SNR-based weight selection methods require 3.4 × 10⁵, 3.5 × 10⁵, and 2.6 × 10⁵ cycles, respectively, transferring only 70%, 85%, and 50% of the weights. These experiments illustrate that the SNR-based selection method realizes a lightweight transfer that moves only 50% of the weights and reduces the number of cycles by 50%. As shown in Table II, LeNet consistently guarantees computational robustness even at a read variation of σ_read = 0.75 μS. In addition, we found that LeNet is more sensitive to read variation than the MLP: its accuracy drops by approximately 6.83% under the largest read variation. This may be because large variations in the weights of convolution kernels degrade their feature extraction ability.

D. Experiments Using AlexNet
Finally, we verify the effectiveness of the proposed architecture with a deeper network, AlexNet, on the larger CIFAR-100 dataset. AlexNet is relatively deep compared with the previous two networks, consisting of five convolution layers, max pooling, and three dense layers. The experimental results are shown in Fig. 4. AlexNet requires 3.5 × 10⁷ cycles to transfer all the weights and attains a final misclassification rate of 64%, the same as the baseline [20] indicated by the dotted line in Fig. 4. Although the standard-deviation-based method shows negligible selection ability for this deep network, the SNR-based method, which also depends on the standard deviation, shows the best performance: it requires only 40% of the weights and 2.4 × 10⁷ cycles, an almost 2× speedup over the traditional method. Moreover, the relative classification drop of AlexNet is negligible when the read variation σ_read is smaller than 0.75 μS. AlexNet is also more sensitive to read variation than LeNet due to the continuous accumulation of variation through the deep network.
V. CONCLUSION

This brief presents a unified BETTER architecture consisting of a Bayesian-based training method and a lightweight transfer scheme. Covering the algorithm, transfer, and device levels, the unified architecture tackles the robustness, energy, and time-consumption issues of MNN deployment caused by memristor variations. The proposed Bayesian-based training method learns robust weights considering both the variations in a memristor crossbar and the learning task. The proposed weight selection relaxes the requirements on memristor device operation, leading to a high-efficiency, high-speed lightweight transfer. Our experimental results show that the proposed BETTER architecture reduces the time and energy consumption of weight transfer by almost 50%, an almost 2× improvement in speed and energy efficiency.