I. Introduction
The inference of a DNN mainly involves computing a series of vector-matrix multiplies, \begin{equation*}\mathbf{y} = \mathbf{W}\mathbf{x}\tag{1}\end{equation*}
and nonlinear transformations such as pooling, ReLU, and softmax. The vector-matrix multiply can be performed on a resistive crossbar array fully in parallel and in constant time using Ohm's and Kirchhoff's laws, an approach proposed more than 50 years ago [1]. The weight matrix W is stored as differential conductance values in the crossbar array. The input vector x is applied as voltage pulses to the rows, and the resulting vector y is read as integrated current signals from the columns, as illustrated in Fig. 1. However, the vector-matrix multiply performed by the crossbar array and its supporting peripheral circuitry is only an approximation of (1) and can be written as
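To make the mapping concrete, the following is a minimal NumPy sketch of the ideal crossbar multiply of (1) using the differential-conductance encoding described above; the function name ideal_crossbar_mvm and the maximum device conductance g_max are illustrative assumptions, not values from this work.
\begin{verbatim}
import numpy as np

def ideal_crossbar_mvm(W, x, g_max=25e-6):
    # Map each weight onto a differential conductance pair:
    # positive weights go to G_plus, negative weights to G_minus.
    # g_max (here 25 uS) is an assumed maximum device conductance.
    scale = g_max / np.max(np.abs(W))
    g_plus = np.clip(W, 0.0, None) * scale
    g_minus = np.clip(-W, 0.0, None) * scale
    # Rows carry the input as voltages; each column integrates the
    # currents it receives (Kirchhoff's current law), so every
    # column computes a dot product by Ohm's law (I = G * V).
    i_plus = g_plus @ x
    i_minus = g_minus @ x
    # Differential read-out, rescaled back to the weight domain.
    return (i_plus - i_minus) / scale

# Example: a 3x4 weight matrix applied to a length-4 input.
W = np.random.randn(3, 4)
x = np.random.randn(4)
print(np.allclose(ideal_crossbar_mvm(W, x), W @ x))  # True
\end{verbatim}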
\begin{equation*}\mathbf{y} = \mathrm{ADC}\left( \mathbf{W}_{\mathrm{r}}\, \mathrm{DAC}(\mathbf{x}) + \text{noise} \right)\tag{2}\end{equation*}
where DAC discretizes the input vector x, noise is introduced by the analog computation (e.g., thermal, 1/f, or op-amp noise), and ADC clips and discretizes the output. In addition, Wr, the weight matrix stored on the crossbar array, deviates from the original weight matrix W of the model due to transfer errors, limited conductance ranges, device failures, and variability. All these hardware-induced constraints and variations make inference of DNNs on analog hardware a non-trivial task. As we show here, naïvely trying to obtain a better approximation of (1) by designing low-noise, high-accuracy analog hardware leads to very challenging hardware specifications that make this approach impractical, especially for large-scale networks. Instead, if the hardware non-idealities are introduced into the training process, high accuracy can be maintained during inference. When the hardware-induced non-idealities are introduced into training, the system solves a constrained optimization problem and is therefore guaranteed to perform better when run on analog hardware subject to the same constraints applied during training. Furthermore, the noise terms introduced during training act as regularizers and improve the robustness of DNN inference to hardware failures. The training process of a DNN should therefore be tightly coupled (married) to the hardware on which the trained network will run during inference, as illustrated in Fig. 2. If a conventional model trained with floating-point arithmetic is used for inference on analog hardware, its performance will degrade significantly compared to a model obtained with hardware-aware (HWA) training [2]–[3].
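As a concrete illustration of (2) and of noise injection during HWA training, the following NumPy sketch applies each non-ideality in turn; the bit widths, noise level, clipping bound, and multiplicative weight-noise model are illustrative assumptions, not hardware specifications from this work.
\begin{verbatim}
import numpy as np

def analog_mvm(W_r, x, in_bits=8, out_bits=8,
               noise_std=0.02, out_clip=10.0):
    # DAC: discretize the input to 2**in_bits levels (assumed width).
    levels_in = 2**in_bits - 1
    x_max = np.max(np.abs(x)) + 1e-12
    x_q = np.round(x / x_max * (levels_in / 2)) * (2 * x_max / levels_in)
    # Analog compute with additive noise (thermal, 1/f, op-amp, ...),
    # modeled here as zero-mean Gaussian on the output currents.
    y = W_r @ x_q + noise_std * np.random.randn(W_r.shape[0])
    # ADC: clip the output range, then discretize to 2**out_bits levels.
    levels_out = 2**out_bits - 1
    y = np.clip(y, -out_clip, out_clip)
    return np.round(y / out_clip * (levels_out / 2)) \
        * (2 * out_clip / levels_out)

def hwa_forward(W, x, weight_noise_std=0.02, **kwargs):
    # HWA training sketch: perturb the stored weights (transfer error,
    # variability; Gaussian model assumed here) and reuse the same
    # noisy forward pass inside the training loop, so the network
    # learns under the constraints it will face at inference time.
    W_r = W * (1.0 + weight_noise_std * np.random.randn(*W.shape))
    return analog_mvm(W_r, x, **kwargs)
\end{verbatim}
Training with hwa_forward in the loop couples the optimization to the hardware constraints, whereas passing a floating-point-trained W through analog_mvm only at test time exposes the accuracy degradation described above.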