A Multilayer Neural Accelerator With Binary Activations Based on Phase-Change Memory

Novel in-memory computing circuits based on arrays of emerging nonvolatile memories, such as phase-change memory (PCM), can boost the performance of artificial intelligence applications. However, the spread of PCM-based circuits is currently hindered by the lack of a design framework enabling fast, efficient, and low-power neural networks. In this work, a novel approach to the conceptual and technical design of integrated neural networks is proposed. In particular, to reduce the power consumption and complexity of state-of-the-art solutions, we propose a fully analog computing approach where the analog-to-digital converter (ADC) is replaced by a simple comparator. The analog building blocks of the accelerator are presented and validated in Cadence Virtuoso. The major nonidealities, such as PCM conductance variability, conductance drift, IR drop, and readout threshold, are studied by considering their impact on accuracy.

and natural language processing [1]. ML algorithms usually rely on intensive matrix-vector multiplication (MVM), which results in time- and energy-consuming transfer of input data and model parameters between the dynamic random access memories (DRAMs) and the CPUs [2], [3]. On the other hand, in-memory computing emulates the parallel computation of the brain [4], [5], [6], thus overcoming the main limitations of conventional digital computing systems, also thanks to high-density, back-end-of-the-line nonvolatile memories, such as resistive random access memory (RRAM) and phase-change memory (PCM) [5], [7], [8], [9], [10]. In particular, as shown in Fig. 1(a), the MVM is the most intensive operation for implementing hardware accelerators of large neural networks. MVM can be executed efficiently by in-memory computing hardware thanks to arrays of nonvolatile memories [11], which are shown in Fig. 1(b). These cross-point arrays inherently perform parallel multiply-and-accumulate (MAC) operations by direct application of Ohm's and Kirchhoff's laws [2], [12]. Several mixed-signal accelerators integrating a memory array have been proposed for neural network inference [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. The performance of some of these accelerators in terms of throughput, energy efficiency, and input, output, and weight precision is reported in Table I. Hardware accelerators based on in-memory computing usually rely on a mixed analog-digital approach with an analog-to-digital converter (ADC) for two main reasons: 1) achieving sufficient signal integrity and 2) enabling digital processing of the information, including activation functions, shift-and-add, and normalization operations, which might not be straightforward in the analog domain. Note that an ADC consumes more than 80% of the circuit power and more than 60% of the circuit area [23], [24]. Thus, simplifying or even removing the ADC would result in a strong improvement for the integrated design of neural network accelerators.
This work addresses the hardware design of a PCM-based multilayer neural network with comparator-based nonlinear activation functions. The circuit speeds up the workload by avoiding the use of ADCs, and it is tested on the recognition of the Modified National Institute of Standards and Technology (MNIST) and Fashion-MNIST datasets. An activation-slope-aware training method is used, and the accuracy loss is minimized in the tests using step-function activations and 4-bit quantized weights. The network is implemented on a cross-point array of real conductances leveraging a multibit resistive weight mapping. This technique is extended from the RRAM-array use case presented in [14] and [15] to our PCM-based network. Previous works investigated the impact of PCM variability on the classification accuracy. In [25], the variability and drift of PCM devices were simulated by rescaling the network weights by the maximum conductance of an experimental distribution. Programming variation with zero mean and variance obtained from the Gaussian fitting of the distribution was added. In our work, we extend these results by considering the effects of PCM variability on a quantized network with multibit resistive weights, thus offering a comprehensive picture of the joint effect of weight quantization and variability. We focus on the interaction between these nonidealities and the Heaviside activation of our network. We also consider the impact of IR drop on the network accuracy. Unlike the analysis carried out in previous works [26], we evaluate the effects of IR drop on a complete inference task. The higher average conductance of the cells adopted in our work also forces us to address a more severe IR-drop configuration.
Our work presents a fully analog hardware implementation of the bit-line (BL) readout and neural activation, including a transimpedance stage, a weighting star of resistors, an integration stage, and a dynamic StrongARM comparator. The joint effect of conductance variability, drift, IR drop, and readout threshold on the network accuracy is carefully investigated. Fig. 1(b) shows the cross-point memory array based on one-transistor/one-resistor (1T1R) PCM cells, with the top electrodes (TEs) of the PCM devices connected to the BLs and the bottom electrodes connected to the drains of MOS transistors, which work as selectors. The gate of each transistor is driven by a voltage signal applied at the corresponding word line (WL), while the TE is kept at a relatively low read voltage V_READ. The resulting BL current is proportional to the dot product between the input vector of the gate voltages applied to the WLs (binary on/off voltages) and the vector of analog conductance values stored in the memory cells of the BL.
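As an illustration of the dot-product property described above, the following minimal sketch computes ideal BL currents from binary WL inputs and a conductance matrix. The function name bitline_currents, the array size, and the conductance range are hypothetical choices made for this example; the 0.2 V read voltage is the clamp value quoted later for the readout circuit. Wire resistance, selector nonlinearity, and variability are ignored.

```python
import numpy as np

def bitline_currents(x, G, v_read=0.2):
    """Ideal BL currents for binary WL inputs.

    x      : (n_wl,) array of 0/1 word-line activations
    G      : (n_wl, n_bl) cell conductances in siemens
    v_read : read voltage clamped on the BLs, in volts
    """
    x = np.asarray(x, dtype=float)
    G = np.asarray(G, dtype=float)
    # Ohm's law per cell (I = G * V) and Kirchhoff's current law per BL (sum).
    return v_read * (x @ G)

# Hypothetical 8 x 4 array with conductances between 1 uS and 20 uS.
rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 20e-6, size=(8, 4))
x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
i_bl = bitline_currents(x, G)
print(i_bl)  # four BL currents in amperes
```

Only the rows selected by an active WL contribute to each BL current, which is why binary inputs reduce the multiplication to a gated sum of conductances.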

III. SLOPE-UPDATE TRAINING
The network we propose in this work has 784 binary inputs, ten outputs, and 150 hidden neurons. An element always set to one is also appended to the input vector of each layer, so that the bias can be implemented [1]. After an MVM operation, the outputs of each layer undergo a nonlinear activation. We used the logistic function in (1), where A and B are free parameters, as the nonlinear activation function. Fig. 2(a) shows the logistic function for increasing B, which describes the slope of the activation function. During training, the input patterns are fed forward across the network. Errors are evaluated by computing the squared difference between the ten-element one-hot code associated with the input image and the actual output of the network. The network weights are modified according to the product between this error and a model parameter called the learning rate. Backpropagation and gradient descent are used during the process [27]. The hyperparameters of the model, namely, A and B in (1), the number of epochs, and the learning rate η, were selected to obtain, at the same time, the largest accuracy and a high slope for the activation function, so that the logistic function in (1) can be implemented as an analog comparator. At high slope coefficients B, none of the combinations of the other parameters gave high accuracy. The slope of the activation function was therefore increased gradually during training, as shown in Fig. 2(a). We call this method slope-update training. Two variants of the method are considered to optimize the training process: the parameters A and B of the derivative function used in backpropagation are either chosen equal to the A and B of the function used in the forward propagation or selected to obtain a larger full-width half-maximum (FWHM), as shown in Fig. 2. Fig. 2(c) shows the top accuracy achieved for the different training methods as a function of parameter B.
In these calculations, weights were assumed to have floating-point precision, and testing was performed with comparators as activations. Various training schemes were assumed, as summarized in Table II. The training using slope update minimizes the accuracy loss when the final slope of the activation function increases. Slope update coupled with a large-FWHM derivative σ'(x) allows efficient backpropagation of the errors, thus ensuring the highest accuracy.
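The idea behind slope-update training can be sketched numerically. The sketch below assumes the logistic form A / (1 + exp(-B x)); the text only states that A and B are free parameters, so this form is an assumption, as are the linear slope schedule and all function names. It shows that the logistic function approaches a step as B grows, and that the FWHM of its derivative shrinks as 1/B, which is why a wider derivative may be preferred in backpropagation.

```python
import numpy as np

def logistic(x, A=1.0, B=1.0):
    # Assumed form A / (1 + exp(-B x)); not confirmed by the source.
    return A / (1.0 + np.exp(-B * np.asarray(x, dtype=float)))

def logistic_grad(x, A=1.0, B=1.0):
    # Derivative of the assumed form: A * B * s * (1 - s).
    s = logistic(x, 1.0, B)
    return A * B * s * (1.0 - s)

def derivative_fwhm(B):
    # Width at half maximum of logistic_grad, solved from s * (1 - s) = 1/8.
    return 2.0 * np.log(3.0 + 2.0 * np.sqrt(2.0)) / B

def step(x):
    # Comparator-like activation used at test time.
    return (np.asarray(x) > 0).astype(float)

def slope_schedule(b_start, b_end, n_epochs):
    # Hypothetical linear increase of the slope over the epochs.
    return np.linspace(b_start, b_end, n_epochs)

x = np.array([-0.5, -0.1, 0.1, 0.5])
gaps = []
for B in slope_schedule(1.0, 50.0, 5):
    gap = np.max(np.abs(logistic(x, B=B) - step(x)))
    gaps.append(gap)
    print(f"B = {B:5.2f}  max |logistic - step| = {gap:.4f}  FWHM = {derivative_fwhm(B):.3f}")
```

In a training loop, the forward pass would use the current B from the schedule, while the backward pass could use logistic_grad with a smaller B to keep a wide derivative.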

IV. QUANTIZATION
Nonvolatile memory devices, such as PCM cells, are not suitable to represent high-precision analog weights [28]. The 64-bit full-precision weights obtained by the supervised training algorithm must then be quantized to limited-precision levels. Fig. 3 schematically illustrates the quantization and mapping steps. For each network layer, the full-precision weights generally have a zero-mean distribution, as in Fig. 3(a). Note that 4 bits are used for quantization; thus, 16 quantized levels can be represented. Sixteen equally spaced values between ±3.5σ of the full-precision distribution are chosen for this purpose. In order to minimize the accuracy loss, the incremental-quantization method proposed in [29] has been adopted. As shown in Table II, this approach, coupled with the slope update and the use of a large-FWHM backpropagation function, ensured a small accuracy loss.
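The level selection described above can be sketched as follows. This is a plain one-shot nearest-level quantizer, used here only to illustrate the 16 equally spaced levels between ±3.5σ; it does not reproduce the incremental-quantization procedure of [29]. The function name quantize_4bit and the synthetic Gaussian weights are assumptions for the example.

```python
import numpy as np

def quantize_4bit(w, n_sigma=3.5, n_bits=4):
    """Map weights to the nearest of 2**n_bits equally spaced levels.

    Returns the quantized weights, the level indices (0..15 for 4 bits),
    and the level values.
    """
    w = np.asarray(w, dtype=float)
    bound = n_sigma * w.std()
    levels = np.linspace(-bound, bound, 2 ** n_bits)
    # Nearest-level search; weights beyond the bounds saturate at the extremes.
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx], idx, levels

# Synthetic zero-mean weights standing in for one trained layer.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000)
w_q, idx, levels = quantize_4bit(w)
print(levels.size, idx.min(), idx.max())
```

The integer index returned for each weight is the quantity that a 4-bit resistive code would store.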
In a 1T1R memory array, single PCM cells cannot represent positive and negative weights, because the current is sourced in a single direction. To avoid this limitation, we rely on the multibit resistive weight approach presented in [9], [14], and [15]. As shown in Fig. 3(a), the quantized distribution of weights is shifted to a positive range only. The product between the input vector and a matrix of positive and negative weights is recovered by subtracting a midpoint reference, as shown in Fig. 3(b). The positive quantized weights are mapped on the conductance of the PCM array, with each cell programmed in either a low-resistive state (LRS) or a high-resistive state (HRS). By combining these binary (LRS/HRS) PCM conductances, 4-bit binary codes are obtained to describe each one of the 16 quantized levels. The value of a dot-product operation is obtained by binary weighting of the currents of four PCMs.
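A minimal sketch of this mapping is given below, under idealized assumptions that are not taken from the source: the HRS conductance is zero, the LRS conductance is a single nominal value g_lrs, the read voltage is normalized to one, and the midpoint reference equals half of the full-scale code for every active input. The function names are hypothetical. With these assumptions, the binary-weighted sum of four BL currents minus the reference reproduces the signed dot product up to a scale factor.

```python
import numpy as np

def to_bits(idx, n_bits=4):
    # Binary code of each level index, least significant bit first.
    idx = np.asarray(idx)
    return (idx[..., None] >> np.arange(n_bits)) & 1

def signed_dot_from_binary_cells(x, idx, g_lrs=1.0, n_bits=4):
    """Signed dot product recovered from four binary-cell bit lines.

    x   : (n_wl,) binary inputs
    idx : (n_wl,) integer level indices in 0..2**n_bits - 1
    """
    bits = to_bits(idx, n_bits)                 # (n_wl, n_bits), 1 = LRS, 0 = HRS
    i_bl = x @ (bits * g_lrs)                   # one current per bit line
    weighted = i_bl @ (2.0 ** np.arange(n_bits))
    # Midpoint reference: half of the full-scale code per active input.
    midpoint = 0.5 * (2 ** n_bits - 1) * g_lrs * x.sum()
    return weighted - midpoint

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=32)
idx = rng.integers(0, 16, size=32)
result = signed_dot_from_binary_cells(x, idx)
expected = np.sum(x * (idx - 7.5))              # signed weights in units of one level
print(result, expected)
```

The sign of the result is what a comparator-based activation would evaluate, so the absolute scale of g_lrs does not matter in this idealized picture.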

V. SIMULATIONS WITH PCM DISTRIBUTIONS
The effect of the variability and drift of PCM devices on the network accuracy is simulated considering the conductance distributions obtained from previous experimental measurements [30]. The PCM devices were initialized by a forming operation to make the Ge-rich composition of the phase-change material uniform [31], as shown in Fig. 4(a). Higher forming currents can be used to create a larger conductive region in the PCM device. Fig. 4(b) shows the average value and the variance of the LRS conductance G_LRS as a function of the forming current [32]. Fig. 4(c) shows the computed network accuracy versus I_FORM. Higher forming currents lead to a better accuracy, although at the expense of a higher power consumption. In fact, by programming the cells with I_FORM = 150 μA, a BL draws at most 165 μA, while the maximum current is 317 μA for I_FORM = 550 μA, roughly corresponding to a twofold increase in power dissipation. Since accuracy trades off against power dissipation, applications that require a low power consumption may have to accept a relatively low classification accuracy.
The performance of the network is evaluated at increasing time, to account for the effect of the drift. Fig. 5(a) shows the PCM conductance as a function of time after the programming pulse. The measured distributions of PCM devices are fit to extract their mean value and variance. The conductances used in network simulations are drawn from the obtained distributions. Fig. 5(b) and (c) shows the accuracy obtained by considering the time evolution of the PCM conductance. In the simulations corresponding to the red plot, the mean values of the LRS and the HRS distributions decrease over time, as does the dynamic range of the differential analog dot product. The accuracy of the network is not affected: because of the self-referential implementation of the analog circuitry, the sign of this difference with respect to an infinite-slope activation threshold is preserved. The conductance variance increases with time for both the HRS and the LRS. The evolution in time of the conductance variance causes a statistical spread of the analog dot products around the Heaviside activation threshold. As shown in the green plot of Fig. 5(b) and (c), when only the time evolution of the conductance variance is considered, a moderate accuracy drop is observed. On the contrary, when both the mean value and the variance of the PCM conductance distribution evolve in time, the conductance variance has a larger impact on accuracy. In this case, as shown in the light-blue plot of Fig. 5(b) and (c), the statistical spread superimposed on the lower-dynamic-range analog dot-product signals causes a larger accuracy drop. Nevertheless, the simulation results demonstrate a good resilience of the network to drift, because of the self-compensation between the weights in the array and in the reference combined with the step-like activation function.
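The self-compensation argument can be illustrated with a toy model. The sketch below is not the simulation flow of this work: it assumes an ideal HRS conductance of zero, a noise-free reference that drifts by the same factor as the array, and multiplicative Gaussian spread on the LRS cells. The function name, the array size, and the values scale = 0.5 and sigma = 0.2 are arbitrary choices for the example. A common scaling of the mean conductance leaves every binary output unchanged, whereas conductance spread can flip outputs whose dot product lies close to the threshold.

```python
import numpy as np

def binary_outputs(x, bits, scale=1.0, sigma=0.0, rng=None):
    """Comparator outputs for binary-coded weights.

    x     : (n_wl,) binary inputs
    bits  : (n_wl, n_neurons, n_bits) cell states, 1 = LRS, 0 = HRS
    scale : common factor applied to all conductances (drift of the mean)
    sigma : relative standard deviation of the LRS conductance
    """
    x = np.asarray(x, dtype=float)
    g = scale * bits.astype(float)
    if sigma > 0.0:
        g = g * (1.0 + sigma * rng.standard_normal(g.shape))
    n_bits = bits.shape[-1]
    i_bl = np.einsum("i,inb->nb", x, g)           # one current per neuron and bit
    weighted = i_bl @ (2.0 ** np.arange(n_bits))
    # Reference drifts with the array: half of the full-scale code per active input.
    reference = 0.5 * (2 ** n_bits - 1) * scale * x.sum()
    return weighted > reference

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=128)
bits = rng.integers(0, 2, size=(128, 50, 4))
fresh = binary_outputs(x, bits)
drifted_mean = binary_outputs(x, bits, scale=0.5)
drifted_both = binary_outputs(x, bits, scale=0.5, sigma=0.2, rng=rng)
print("flips, mean drift only:", int(np.sum(fresh != drifted_mean)))
print("flips, mean drift and spread:", int(np.sum(fresh != drifted_both)))
```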

VI. SIMULATIONS WITH IR DROP
The current flowing in the BL causes a voltage drop across the BL wire resistance and across the resistance of the BL decoder, as shown in Fig. 6(a). A decoder circuit is indeed necessary to share the readout among different BLs of the array. This IR-drop effect superimposes an input-dependent error on the current of each cell, thus affecting the linearity of the MVM operation [26], [33]. The network was simulated to take into account the voltage drop across the decoder resistance and the IR drop accumulated along the BL, assuming a cell-to-cell wire resistance R_wire = 0.5 Ω. To reduce the impact of IR drop, the readout of a BL was divided into different steps by reading 16, 32, 64, or 128 cells at a time. The current obtained from different read steps was then integrated in the analog domain, because the step function used as activation needs to operate on the result of a full analog dot product. Fig. 6(b) and (c) shows the simulation results of the impact of IR drop. The results indicate that good performance can be obtained if the voltage drop across the decoder resistance is kept below 10% of the read voltage, with a reading of 32 out of 128 WLs for MNIST and of 16 out of 128 WLs for Fashion-MNIST. Since a higher forming current results in a higher accuracy but also in a higher voltage drop on the BL, the compensation of the IR drop has to take into account the average conductance of the PCM cells in the array. In the framework of a hardware-software codesign, the PCM array and the reading circuit could be jointly reprogrammed. When higher PCM conductances, hence higher network accuracies, are targeted, the number of BL cells read at a time can be reduced to minimize the IR drop, without requiring expensive retraining processes.

VII. READOUT CIRCUIT
Fig. 7(a) schematically shows the first layer of the network. Input signals are applied to the WLs of the array in parallel. The BLs share the readout circuits of each array by means of a BL decoder, whose input dimensions depend on the number of outputs of the network layer. A readout and weighting circuit like the one represented in Fig. 7(b) is used to assign a binary weight to each of the four BLs, while clamping the BL read voltage to 0.2 V. The readout circuit includes four transimpedance stages to convert the BL currents into voltages. The outputs of these amplifiers are connected using binary-weighted resistors, having resistance R, 2R, 4R, and 8R. A second noninverting stage is used to amplify the signal. The output of the second stage is, thus, proportional to the binary-weighted sum of the four BL currents, which is equivalent to the result of a dot product between a 16-, 32-, 64-, or 128-WL input and a four-BL column vector. The same readout integrates both the positive array and the reference, thus avoiding any potential mismatch due to process variations.
Since the synaptic weights are split in different arrays and only 32 out of 128 WLs are activated at a time, analog integrators are used to accumulate the partial dot products from different arrays and different read steps. Connecting a different BL segment to the transimpedance inverting input at each read step modifies the circuit linear response and causes transient effects. A two-step integration avoids the influence of these effects on the integrated charge. Fig. 7(c) shows the integration circuit. In the first integration step, the output of the readout reaches steady state, and in the second, the charge stored across C_1 is transferred to the feedback capacitor C_2. While one channel integrates the BL currents, the other one integrates the two reference columns, the latter requiring a gain factor of 1/2 in the ratio C_1/C_2 to set the reference voltage to half of the dynamic range.
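The binary weighting performed by the resistor star of the readout can be sketched with ideal components. The sketch assumes that the common node of the four resistors drives a high-impedance input, so that its voltage follows Millman's theorem; it also ignores the sign inversion of the transimpedance stages and the offset due to the read voltage. The values of r_unit, r_tia, and gain, as well as the function names, are hypothetical, and the expression is not a reproduction of the equation used in this work.

```python
import numpy as np

def star_node_voltage(v_stage, r_unit=10e3):
    """Voltage at the common node of resistors R, 2R, 4R, 8R.

    v_stage : four stage voltages, ordered from the most to the least
              significant bit line.
    """
    v_stage = np.asarray(v_stage, dtype=float)
    conductances = 1.0 / (r_unit * np.array([1.0, 2.0, 4.0, 8.0]))
    # Millman's theorem for an unloaded node: conductance-weighted average.
    return (v_stage * conductances).sum() / conductances.sum()

def readout_output(i_bl, r_tia=10e3, gain=2.0):
    # Ideal transimpedance conversion followed by weighting and amplification.
    v_stage = r_tia * np.asarray(i_bl, dtype=float)
    return gain * star_node_voltage(v_stage)

msb_only = readout_output([1e-6, 0.0, 0.0, 0.0])
lsb_only = readout_output([0.0, 0.0, 0.0, 1e-6])
print(msb_only / lsb_only)  # ratio between the most and least significant weights
```

The resulting weights are 8/15, 4/15, 2/15, and 1/15 of the stage voltages, which preserves the 8:4:2:1 ratio required by the 4-bit code.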
Since, given an input pattern, the reference is the same for any of the equivalent four-BL columns, the charge integrated on the reference side is maintained during the integration of all the BLs in the positive array. The integrated outputs on the array and the reference side are used as input voltages of the dynamic StrongARM comparator, which is designed based on a differential stage driving a latch, as shown in Fig. 7(d) [34]. The output bits of the comparison are then used as binary inputs for the next network layer. A control block is used to set the timing of the circuit and to manage the communication between layers. When a layer completes the dot-product and activation operations, its outputs are propagated in parallel as inputs to the following layer. With the proposed readout and integrator, reading a group of four BLs takes two clock cycles. To reduce the impact of IR drop, a 128-cell BL is divided into four sections of 32 cells. Thus, 2 × 4 clock cycles are used to integrate a full BL, and two additional clock cycles are needed to generate the comparator output and reset the integrator. The classification of the full MNIST dataset in a network with 150 hidden neurons requires reading 152 × 4 BLs, including the reference, for 10 000 patterns. The circuit design assumes a clock period of 50 ns, and the classification can, thus, be completed in less than 1 s.

VIII. NETWORK SIMULATIONS
Fig. 8 shows a summary of the simulated accuracy for MNIST [Fig. 8(a)] and Fashion-MNIST [Fig. 8(b)], where the various nonidealities are separately considered. The baseline accuracy is reported for training with binary inputs and 64-bit weights, and testing by using step functions instead of sigmoids. The following items show the accuracy for different nonidealities taken into account. We considered quantization, conductance variability, drift, and IR drop, as well as the nonideal properties of the readout, namely, the comparator offset and the mismatch between the integration channels. These statistical variables are obtained from Monte Carlo circuit simulations in Cadence Virtuoso. The statistical variations of the PCM conductance affect the accuracy by introducing a statistical uncertainty around the result of the analog dot product, which is presented at the comparator input. IR drop and drift instead lead to an overall reduction of the signal full-scale range. The combination of these effects results in lower accuracies.
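As a cross-check of the inference-time estimate given for the readout circuit, the cycle count can be worked out explicitly. The calculation below is one possible reading of the stated figures and rests on assumptions: the 152 groups of four BLs are read sequentially through a single shared readout, each group takes the full 2 × 4 integration cycles plus two cycles for comparison and reset, and only this first-layer workload is counted. All constant names are hypothetical.

```python
CLOCK_PERIOD_S = 50e-9        # clock period stated in the text
CYCLES_PER_READ_STEP = 2      # two clock cycles per group of four BLs
READ_STEPS_PER_BL = 4         # 128 cells read in four sections of 32
CYCLES_COMPARE_RESET = 2      # comparator output and integrator reset
BL_GROUPS = 152               # four-BL groups, including the reference
PATTERNS = 10_000             # test patterns

# Cycles needed to integrate and evaluate one group of four BLs.
cycles_per_group = CYCLES_PER_READ_STEP * READ_STEPS_PER_BL + CYCLES_COMPARE_RESET
# Total cycles for all groups and all patterns, read sequentially (assumption).
total_cycles = cycles_per_group * BL_GROUPS * PATTERNS
total_time_s = total_cycles * CLOCK_PERIOD_S

print(f"{cycles_per_group} cycles per group, {total_time_s:.2f} s in total")
```

Under these assumptions the total is about 0.76 s, which is consistent with the statement that the classification completes in less than 1 s.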
IR drop is caused by the current flowing in a BL and increases with the number of PCM synapses and their respective currents. Fashion-MNIST is strongly affected by IR drop, as a result of the lower sparsity of the input patterns. When drift is included, the accuracy slightly increases with respect to the time-zero test, as the increase of the average PCM resistance causes a lower current to flow in the array, which suppresses the IR drop. Drift can, thus, partially offset the effect of IR drop.
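The dependence of IR drop on the number of active cells and on their conductance can be illustrated with a first-order model of the BL wire. The sketch computes the cell currents at the nominal read voltage, so the feedback of the voltage drop on the currents is neglected, and the decoder resistance is not included. The 10 μS cell conductance and the function name are hypothetical; the wire resistance is the 0.5 quoted in the text, assumed here to be in ohms, and the read voltage is the 0.2 V clamp of the readout.

```python
import numpy as np

def ir_drop_profile(g, x, v_read=0.2, r_wire=0.5):
    """First-order voltage drop at each cell along a BL.

    g : (n,) cell conductances in siemens, index 0 nearest to the readout
    x : (n,) binary WL activations
    """
    g = np.asarray(g, dtype=float)
    x = np.asarray(x, dtype=float)
    i_cell = v_read * g * x
    # Wire segment j carries the current of every cell at position j or beyond.
    seg_current = np.cumsum(i_cell[::-1])[::-1]
    # The drop at cell k is the sum of the drops on segments 0..k.
    return r_wire * np.cumsum(seg_current)

n_cells = 128
g = np.full(n_cells, 10e-6)              # hypothetical 10 uS cells
all_on = np.ones(n_cells)
far_section = np.zeros(n_cells)
far_section[-32:] = 1.0                  # only the 32 farthest WLs active

drop_full = ir_drop_profile(g, all_on)[-1]
drop_section = ir_drop_profile(g, far_section)[-1]
drop_drifted = ir_drop_profile(0.5 * g, all_on)[-1]
print(drop_full, drop_section, drop_drifted)
```

In this model, reading one section at a time lowers the worst-case drop even for the section farthest from the readout, and a uniform decrease of the conductance lowers the drop by the same factor.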
The drive for energy efficiency in edge devices pushes toward the binarization of convolutional neural networks (CNNs) [35]. Long short-term memory (LSTM) networks with binary activations and states and binary graph neural networks (GNNs) have also been proposed [36], [37]. We, therefore, consider our approach transferable to other network architectures and problem sizes. The analysis presented in this article of the interaction among PCM variance, PCM drift, and IR drop, and of their effect on the decision performed by the binary threshold of the comparator, can indeed serve as a reference for the hardware implementation of any binary-activated network.
To address the deployment of networks requiring higher precision of the activation, an extension of the proposed circuit is possible. In our work, we compared the result of a weighted analog dot product to a midpoint reference, generating a binary activation. In applications that require activation with higher precision, the analog dot product could be compared with multiple reference levels, generated from additional reference BLs, similar to what was discussed in [38]. If each one of the new voltage references is held by a replica of the reference integrator, the area occupied by the readout circuit on the in-memory computing chip increases. Using a single reference integrator and scheduling in time the comparisons with different references require more time to infer an input pattern. In both cases, the power consumption of the circuit increases. The analysis of these trade-offs is left to future work.
Work on the quantization of attention-based GNNs shows that the quantization of the attention layers is critical for accuracy, requiring up to 32 bits in a full-integer inference [39]. Nevertheless, these network models are mainly intended to be deployed on the cloud, whereas the proposed in-memory computing circuit targets edge applications, where the exploration could be restricted to lower precision ranges.

IX. CONCLUSION
This work presented the study of a multilayer hardware neural accelerator based on PCM synapses. The communication between different layers of the network was simplified by the use of comparators instead of ADCs. The effect of the variability and drift of the binary PCM states on accuracy was addressed. IR drop was mitigated by introducing an integrator, which allows smaller portions of a BL to be read in different time steps. The network was simulated on the MNIST and Fashion-MNIST datasets, showing good robustness. This work enhances the relevance of PCM-based circuits for artificial intelligence applications, and it paves the way for accurate and small-area hardware neural networks.