Training of mixed-signal optical convolutional neural network with reduced quantization level

Mixed-signal artificial neural networks (ANNs) that employ analog matrix-multiplication accelerators can achieve higher speed and improved power efficiency. Though analog computing is known to be susceptible to noise and device imperfections, various analog computing paradigms have been considered as promising solutions to address the growing computing demand in machine learning applications, thanks to the robustness of ANNs. This robustness has been explored in low-precision, fixed-point ANN models, which have proven successful in compressing ANN model size on digital computers. However, these promising results and network training algorithms cannot be easily migrated to analog accelerators, because digital computers typically carry intermediate results with a higher bit width, even though the inputs and weights of each ANN layer have a low bit width, whereas analog intermediate results have low precision, analogous to digital signals with a reduced quantization level. Here we report a training method for mixed-signal ANNs with two types of errors in their analog signals: random noise and deterministic errors (distortions). The results show that mixed-signal ANNs trained with our proposed method can achieve an equivalent classification accuracy with a noise level of up to 50% of the ideal quantization step size. We have demonstrated this training method on a mixed-signal optical convolutional neural network based on diffractive optics.


Introduction
Artificial neural networks (ANNs) are growing larger and deeper [1][2][3] to tackle tasks of increasing complexity [4][5][6]. To accommodate the computation demand of future neural network structures, specialized computing hardware and data formats have been engineered. Various low-precision or even binary neural networks (BNNs), accompanied by specifically designed training algorithms [7][8][9][10], have proven successful in accelerating inference and reducing the memory footprint [11,12] by using a low-bit-width, fixed-point data format for the weights and inputs. When designing and deploying these networks on digital computers [10], intermediate results (e.g., activations) often need to be cached in a higher-precision format than the weights and inputs to achieve the expected accuracy.
Recently, due to their advantages in speed and power efficiency [13], analog computing paradigms have been considered as solutions to the growing demand in neural network computing, with implementations in both electronics [14,15] and photonics [16,17]. However, analog computing is susceptible to ambient noise and device imperfections [18]. Ex-situ training has been deployed on a simulated analog unit using a fixed-point data format [19], analogous to low-bit-width neural networks on a digital computer. Yet a model trained by such a method is likely to have inferior inference performance [20], as analog intermediate results cannot match the full precision of a digital computer. To overcome this performance degradation, fine-tuning of the analog parameters on each computation node [15,21] can be performed, though this requires an exhaustive training effort. There has not been an efficient training method that is robust to the errors in analog ANNs.
In this work, we incorporate two types of common analog computation errors, random noise and deterministic errors, into the training process, extending low-precision neural network training to mixed-signal or analog computing platforms. The network trained with our method is robust against an analog signal noise level as high as 50% of the quantization step, indicating that mixed-signal neural networks can operate at a reduced quantization level. We have demonstrated a trained model on a programmable optical convolutional neural network.

Low-precision training
Low-precision neural networks perform matrix multiplications or convolutions between fixed-point inputs and weights, which are typically the required data format for many digital tensor processing units [22]. Fig. 1(a) illustrates the computation scheme of a low-precision neural network layer. A fixed-point processor computes the activations h^(l) from the quantized inputs a^(l-1) and weights W^(l),

h^(l) = a^(l-1) ⋅ W^(l),

where ⋅ denotes the matrix multiplication or convolution operation. The activations h^(l) require a higher bit width than the inputs and weights due to the associated accumulation process [10]. A nonlinear function f(⋅) is then applied to the activations, along with a quantization operation Q(⋅), to match the input data format of the next layer,

a^(l) = Q(f(h^(l))).

Here f(⋅) can be a common neural network operation such as batch normalization, down-sampling, ReLU, etc., or a combination of multiple operations. The quantization operation is defined as

Q(x) = 2^(m+1-k) [2^(k-m-1) x],

where k is the total bit width, m is the number of integer bits, and [⋅] denotes rounding to the nearest integer. Several nonlinearities, such as clipping and scaling functions, have been purposefully designed for easier integration with the quantization operation [23,24].
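As a concrete illustration, the quantization operation above can be sketched in a few lines of numpy. This is a minimal sketch: the function name is ours, and the symmetric saturation to the representable range is an assumption beyond the rounding rule stated in the text.

```python
import numpy as np

def quantize(x, k=8, m=3):
    """Fixed-point quantization with k total bits (one of them a sign bit)
    and m integer bits; the quantization step is 2**(m + 1 - k)."""
    step = 2.0 ** (m + 1 - k)
    n_max = 2 ** (k - 1) - 1               # largest representable magnitude index
    n = np.clip(np.round(x / step), -n_max, n_max)
    return n * step

x = np.array([0.03, 1.0, -2.72, 100.0])
q = quantize(x)                            # 100.0 saturates at 7.9375
```

With k = 8 and m = 3, the step size is 2^(-4) = 0.0625, and representable values span roughly ±2^m, matching the set of fixed-point levels described in the next section.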

Mixed-signal ANN layer with an analog acceleration unit
A growing number of neural network architectures have replaced traditional digital fixed-point matrix multiplication or convolution calculations with analog counterparts for speed and power efficiency [15][16][17]. Fig. 1(b) illustrates the computation scheme of a mixed-signal neural network layer with an analog acceleration unit. A set of digital inputs a^(l-1) and weights W^(l) are sent to their corresponding digital-to-analog convertors (DACs), generating the inputs ã^(l-1) and weights W̃^(l), respectively, for the analog accelerator. The activations ã^(l-1) ⋅ W̃^(l) from the analog acceleration unit are collected by a detector, producing signals h̃^(l). The detected signals are then sent to a digital processing unit, which maps the activations to the inputs of the next layer via a nonlinear function, a^(l) = f(h̃^(l)), that can include operations similar to those in digital low-precision neural networks.
Computational errors in an analog acceleration unit include random noise and deterministic errors. These two types of errors can apply to any of the analog signals ã^(l-1), W̃^(l), and h̃^(l). From a statistical perspective, random noise introduces a variance to the signal, and deterministic errors introduce a bias to the signal.
Deterministic errors typically originate from the nonlinear response g(⋅) of the modulators, DACs, or detectors. The output of g(⋅) can be either continuous or discrete. For a continuous output, the signal x̃ distorted from the ideal signal x is given by

x̃ = g(x).   (4)

Examples of continuous deterministic errors include the gamma curve of a detector, or the sinusoidal relation between intensity and phase in an interferometry-based intensity modulator [25]. A discrete deterministic error maps the tensor x to a set of values determined by hardware specifications,

x̃ = g(x) ∈ X′,   (5)

where X′ is the set of discrete values. Examples of discrete deterministic errors include DACs that can only generate discrete voltage levels, or analog-to-digital convertors (ADCs) that digitize an analog signal with a fixed number of levels. The quantization error in digital low-precision or binary networks can be considered a special case of Eq. (5), in which the set X′ = {±n ⋅ 2^(m+1-k); n = 0, 1, …, 2^(k-1) − 1} for fixed-point quantization with k bits.
Noise is modeled as a random variable n added to an ideal signal x. The signal corrupted by noise, x̃, can be expressed as

x̃ = x + n,   (6)

where n is assumed to follow an unbiased distribution. If the random noise in an experimental platform introduces a bias to the signal, this bias can be merged into the fixed-pattern distortion. Note that random noise and deterministic errors can be combined to model the errors associated with any analog signal in a practical analog acceleration unit.
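Both error types are straightforward to emulate in software. The sketch below is a minimal numpy illustration: the function names and the example level set are ours, and mapping each value to the nearest available level is an assumed converter model. It applies a discrete distortion (Eq. (5)) followed by unbiased Gaussian noise (Eq. (6)):

```python
import numpy as np

rng = np.random.default_rng(0)

def discrete_distort(x, levels):
    """Discrete deterministic error: map each element of x to the nearest
    value in the hardware-defined set of levels (Eq. (5))."""
    levels = np.asarray(levels)
    idx = np.abs(x[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def add_noise(x, sigma):
    """Random noise (Eq. (6)): add unbiased Gaussian noise to the signal."""
    return x + rng.normal(0.0, sigma, size=x.shape)

# A signal corrupted by both a distortion and random noise
x = np.linspace(0.0, 1.0, 5)
x_tilde = add_noise(discrete_distort(x, levels=[0.0, 0.4, 0.8, 1.1]), sigma=0.05)
```

The non-uniform level set mimics a distorted converter whose outputs are not strictly the ideal binary or fixed-point values.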

Training of neural networks with analog computation units
Our proposed training method for mixed-signal ANNs considers the two types of errors described above in both the forward pass and the gradient backpropagation during the training process. The gradient flow through Eq. (6) can be computed from the noisy instance of the tensor used in the forward inference [26],

∂L/∂x = ∂L/∂x̃.

The gradient flow through the deterministic error process x̃ = g(x) (Eq. (4)) involves the derivative of g(⋅),

∂L/∂x = ∂L/∂x̃ ⋅ g′(x).

The derivative of a continuous nonlinear response g(⋅) is readily available. In the case that the output of g(⋅) is discrete, the gradient is 0 almost everywhere since g(⋅) is piecewise constant. To preserve the gradient flow, we use a gradient clipping method similar to that in BNN [8],

∂L/∂x = ∂L/∂x̃ ⋅ 1_{t1 < x < t2},

where ∂L/∂x̃ is the gradient with respect to the distorted tensor x̃; 1_{t1 < x < t2} denotes a binary tensor with the same shape as the ideal tensor x, with value 1 for elements of x within the range (t1, t2), and 0 for elements of x outside that range. t1 and t2 are typically chosen as the output range of g(⋅). For the special case of binarization, t1 and t2 are −1 and 1 (or 0 and 1 if X′ = {0, 1} in Eq. (5)), respectively.

We have constructed a mixed-signal, low-precision convolutional neural network, termed MCNN, that classifies the input digit, as shown in Fig. 2, with binary inputs and kernels in all layers. The MCNN consists only of convolutional layers to facilitate its later deployment on a mixed-signal diffractive-optics-based system. The input digit from the MNIST dataset (28×28) is gradually down-sampled to a 3×3 image representing the probability of each digit given the input image. The first layer convolves the 28×28 input image with 64 3×3 kernels and outputs a 64-channel activation tensor with a size of 28×28×64. The 64 channels are then individually batch-normalized and max-pooled with 2×2 down-sampling to a 14×14×64 tensor as the input of layer 2.
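The clipped-gradient (straight-through) rule described above can be sketched as a pair of forward/backward functions. This is a minimal numpy illustration for the binarization case with X′ = {−1, +1}; the function names are ours:

```python
import numpy as np

def binarize_forward(x):
    """Forward pass of a discrete distortion g(.) with output set {-1, +1}."""
    return np.where(x >= 0.0, 1.0, -1.0)

def binarize_backward(grad_xtilde, x, t1=-1.0, t2=1.0):
    """Clipped-gradient (straight-through) backward pass: pass dL/dx~ through
    unchanged where t1 < x < t2, and zero it elsewhere."""
    mask = ((x > t1) & (x < t2)).astype(x.dtype)
    return grad_xtilde * mask

x = np.array([-1.5, -0.3, 0.2, 2.0])
x_tilde = binarize_forward(x)               # [-1., -1., 1., 1.]
g = binarize_backward(np.ones_like(x), x)   # [0., 1., 1., 0.]
```

Zeroing the gradient outside (t1, t2) prevents weights that are already saturated well beyond the discrete output range from drifting further while contributing nothing to the forward pass.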
Layers 2 and 3 consist of the same convolution and post-processing operations, except that the numbers of kernels used are 128 and 256, respectively. The input of layer 4 is a 3×3×256 tensor down-sampled from the layer 3 activations by extracting the 2nd, 5th, and 7th elements along the horizontal and vertical spatial dimensions. Layer 4 performs a weighted sum of all 256 channels, applies a softmax activation function, and outputs a final 3×3 image. Because only 9 possible labels can be produced from this MCNN, we excluded the digit '6'.

Mixed-signal convolution neural network simulation
The computation errors that we consider in this MCNN simulation are the discrete deterministic errors on the inputs and weights, as well as random noise on the detector. In layer l, the analog inputs ã^(l) and weights W̃^(l) produced by the convertors from the ideal, digital values a^(l) and W^(l) are

ã^(l) = g_a(a^(l)) ∈ A′,   W̃^(l) = g_W(W^(l)) ∈ W′,   (10)

respectively. Here A′ and W′ are the sets of discrete input and weight tensors. For an input with M×N pixels and S input channels, the activations in a CNN convolution with 3×3 kernels and T output channels are computed by an analog accelerator as

h̃^(l)_{i,j,t} = Σ_{s=1}^{S} Σ_{u=−1}^{1} Σ_{v=−1}^{1} ã^(l−1)_{i+u,j+v,s} W̃^(l)_{u,v,s,t} + n,   (11)

where i, j denote the indexes of the convolutional result; i+u, j+v are the indexes of the input image; s is the index of input channels, and t is the index of output channels; n denotes the random additive noise, which is modeled by an unbiased Gaussian distribution n ~ N(0, σ²). Here we assume that the inputs are zero-padded. For simplicity, we omit the pixel and channel indexes in the tensors when they are not ambiguous. The activations h̃^(l) then undergo digital post-processing, which consists of batch normalization and 2×2 max pooling,

a^(l) = f(h̃^(l)) = MaxPool(BatchNorm(h̃^(l))).
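The noisy analog convolution of Eq. (11) can be simulated directly. Below is a minimal numpy sketch; the function name is ours, while the zero padding and the per-element Gaussian detector noise follow the description above:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_conv(a, w, sigma):
    """Simulated analog 3x3 convolution (Eq. (11)): zero-padded 'same'
    convolution over S input channels summed into T output channels, with
    unbiased Gaussian noise added to each detected activation."""
    M, N, S = a.shape
    _, _, S2, T = w.shape
    assert S == S2
    a_pad = np.pad(a, ((1, 1), (1, 1), (0, 0)))
    h = np.zeros((M, N, T))
    for u in range(3):
        for v in range(3):
            # accumulate each kernel tap over all input channels
            h += np.einsum('ijs,st->ijt', a_pad[u:u + M, v:v + N, :], w[u, v])
    return h + rng.normal(0.0, sigma, size=h.shape)

h = noisy_conv(np.ones((5, 5, 1)), np.ones((3, 3, 1, 1)), sigma=0.0)  # centre value 9.0
```

With sigma = 0 and all-ones inputs and kernel, interior activations reach the ideal maximum of 9, the range quoted later for binary inputs and kernels.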

Training of MCNN
The random noise and the deterministic errors are both quantified by the root-mean-square error (RMSE), which indicates the average deviation per element of the tensor,

RMSE = |x̃ − x|₂ / √dim(x).

Here x̃ is the tensor x corrupted by errors; dim(x) is the total number of elements in the tensor x; |⋅|₂ denotes the L2-norm. If x̃ is corrupted by unbiased Gaussian noise N(0, σ²), the RMSE reduces to the standard deviation σ.
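The reduction of the RMSE to σ for unbiased Gaussian noise is easy to verify numerically. A short numpy sketch (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def rmse(x_tilde, x):
    """Average per-element deviation: |x~ - x|_2 / sqrt(dim(x))."""
    x_tilde, x = np.asarray(x_tilde), np.asarray(x)
    return np.linalg.norm((x_tilde - x).ravel()) / np.sqrt(x.size)

x = np.zeros(100_000)
x_noisy = x + rng.normal(0.0, 0.5, size=x.shape)
# For unbiased Gaussian noise, the RMSE approaches the standard deviation (0.5)
```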
The MCNN was trained considering binary inputs A′ = {0, 1}, the set of kernels W′ that can be displayed in the experiment, and a noise level σ = 0.5 for the additive noise n. These parameters were selected to match the experimental MCNN setup. As a comparison, we also trained a reference MCNN model of the same structure, but without the random noise term n in Eq. (11). The training of the reference model was similar to the BNN training [8], except that the binarization of the kernels was replaced by rounding to the nearest experimental kernels, as in Eq. (10). After training, we tested the accuracy of the trained MCNN on the MNIST test dataset under various levels of simulated Gaussian noise on the activations.
For each Gaussian noise level σ, we ran 7 noisy inference instances by randomly sampling n from N(0, σ²) to obtain the mean and standard deviation of the accuracy.

Inference simulation of MCNN
Fig. 4 plots the inference accuracy at various noise levels, quantified by the RMSE, for the MCNNs trained with our method and with the BNN method. The MCNN trained with our method maintains its inference accuracy up to σ = 0.5, where the accuracy is 75.0±3.2% for our method and 47.3±3.1% for BNN training. These results show that adding random noise in the training improves the performance in a mixed-signal scenario, an effect similar to regularization of the neural network parameters [27]. Notice that here we trained the MCNN off-line by modeling the analog computation using random noise and deterministic errors in the forward and backpropagation processes. An in-situ [14] forward pass through the physical MCNN setup could leverage the full potential of the speed and efficiency provided by the analog accelerator unit. The probability of the input digit being classified as '5' is 99.2% and 83.1%, respectively, for the MCNN with our training method and that with BNN training. Though the MCNN trained with the BNN method still correctly classifies this digit, the probability is reduced, and confusions from digits '0', '3', and '8' can be observed in its output. Our treatment of the analog computation noise is similar to stochastic quantization [10] to a lower precision level on a digital computer, indicating that a mixed-signal neural network can operate at a reduced quantization level. Table 1 shows the inference accuracy of the two MCNNs on a simulated digital low-bit-width system. We kept the same range of the activations, while stochastically quantizing them to 3, 2, and 1 bit(s), corresponding to 8, 4, and 2 quantization levels, respectively. For a binary input convolving with 3×3 binary kernels, the ideal activations range from 0 to 9.
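The stochastic quantization of the activations used for Table 1 can be sketched as follows. This is a minimal numpy illustration: the function name is ours, and the 0-to-9 activation range follows the text; rounding up with probability equal to the fractional position keeps the quantization unbiased on average.

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_quantize(x, n_levels, lo=0.0, hi=9.0):
    """Stochastically round x onto n_levels evenly spaced values in [lo, hi];
    round up with probability equal to the fractional position so that the
    quantized value is unbiased in expectation."""
    step = (hi - lo) / (n_levels - 1)
    scaled = (np.clip(x, lo, hi) - lo) / step
    floor = np.floor(scaled)
    frac = scaled - floor
    up = (rng.random(x.shape) < frac).astype(x.dtype)
    return lo + (floor + up) * step
```

For example, with 4 levels (2 bits) an activation of 1.0 is mapped to 0.0 or 3.0, with a mean of 1.0 over many samples.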
The fluctuation due to unbiased Gaussian noise with σ = 0.5 can range from −1.0 to 1.0, considering the 95% confidence interval. The MCNN trained with our method maintains its accuracy at 2 bits (4 levels) or higher quantization levels, consistent with the amount of random noise it can tolerate. Quantization in digital, fixed-point neural networks is often performed stochastically to avoid introducing a quantization bias, which is undesirable in low-precision neural networks. Likewise, if some distortion is left uncorrected in the training of the MCNN, the residual deterministic error will introduce a bias, which accumulates throughout the layers. Fig. 6 plots the inference accuracy versus the RMSE of the residual error for the two MCNNs trained with our method and the BNN method. The drop in accuracy as the residual error increases is consistent with the results on digital platforms [8][9][10], indicating that our training is still sensitive to the bias from uncorrected errors.

Most of the existing diffractive-optics-based neural networks employ non-erasable diffractive optical elements to represent a pre-trained set of weights [28,29], and hence cannot be easily re-programmed. Here, we constructed a fully programmable optical mixed-signal convolutional network layer based on a 4f system for the deployment of the trained MCNN model. The layer input is a digital micromirror device (DMD, ViALUX V4100 DLP7000, pixel size 13.7 μm). The analog convolution is performed by a phase-only spatial light modulator (SLM, Meadowlark Optics P1920-400-800-HDMI, pixel size 9.2 μm) on the Fourier plane, as shown in Fig. 3(a).
The light field then passes through a 200 mm tube lens (Thorlabs TTL-200), L1, that creates a Fourier transform (FT) of the input on the SLM. The FT of the kernel W̃^(l), approximated as phase-only, is loaded onto the SLM. Upon reflection off the SLM, the FT of the input is multiplied by the FT of the kernel, thereby implementing the analog convolution. A beam splitter directs the reflected beam from the SLM through a lens L2 (identical to L1), performing the inverse FT to yield the desired convolution between the input and the kernel, which is captured by a camera (JAI Ltd. GO-5000M-USB, 5.0 μm pixel pitch). To implement the CNN operation in Eq. (11), the kernels must be flipped horizontally and vertically before use.
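This convolution-via-Fourier-multiplication can be emulated numerically with FFTs. The sketch below is a minimal numpy illustration, not the experimental processing chain: the function name is ours, and the FFT implies circular rather than zero-padded boundaries. The kernel flip mirrors the flip described above, so that the true (optical) convolution reproduces the CNN operation.

```python
import numpy as np

def conv_4f(image, kernel):
    """Emulate the 4f system: multiply the FT of the input by the FT of the
    (flipped) 3x3 kernel, then inverse-FT to obtain the convolution."""
    M, N = image.shape
    k = kernel[::-1, ::-1]                 # flip so FFT convolution = CNN operation
    k_full = np.zeros((M, N))
    for u in range(3):
        for v in range(3):
            # place each tap at its offset relative to the (0, 0) origin
            k_full[(u - 1) % M, (v - 1) % N] = k[u, v]
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(k_full)))

out = conv_4f(np.ones((5, 5)), np.ones((3, 3)))   # every pixel equals 9.0
```

Because the FFT boundary is circular, edge pixels wrap around instead of seeing zeros; in the experiment, the separation between tiled channels plays the role of padding.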
The input patterns ã^(l) that can be displayed are strictly binary due to the use of the DMD as the input device, hence A′ = {0, 1}. The use of phase-only masks [30] to approximate a complex Fourier filter gives rise to distortions in the kernels. For 3×3 binary kernels, there are a total of 511 non-trivial kernels. We pre-calculated the 511 phase masks needed to display all the non-trivial kernels. Because of experimental artifacts and approximations, the actual 511 kernels, W′, displayed in the experiment are not strictly binary. The distorted kernels are calibrated by imaging a single pixel displayed on the DMD through the optical system for each phase mask.
Due to aberrations and the limited numerical aperture of the 4f system, the full-width-at-half-maximum (FWHM) of its point-spread-function (PSF) is about 4 camera pixels. To mitigate crosstalk due to the PSF, we introduced a 3-pixel separation between adjacent samples of the input on the DMD and an 8-pixel separation for the kernels on the SLM, accounting for the differences between the pixel sizes of the DMD and the camera. To take advantage of the spatial bandwidth of the 4f system, we tiled multiple input channels in 2×2 and 4×4 formations for layer 2 and layer 3, respectively. Fig. 3(b) shows the tiled input on the DMD. After the raw image on the camera is acquired, we performed an 8×8 down-sampling and separated the tiled channels to recover h̃ in its native spatial resolution.
Fig. 3(c) shows the layer-by-layer activations of the MCNN model, trained with our method, in classifying the input digit '5'. With the calibrated kernel set W′, the ideal convolution between the DMD input and the actual kernels in each layer can be computed. We compared the ideal convolution results with the activations obtained from the raw camera images. The distributions of the errors in the activation tensor h̃ are plotted in Fig. 3(d) for all MCNN layers in the experiment. These distributions all resemble a Gaussian shape, with standard deviations σ = 0.37, 0.38, and 0.58 for layers 1 through 3, respectively. The error in each layer is consistent with our choice of the random noise level σ = 0.5 in the MCNN simulation. Despite the presence of this activation error, the MCNN trained with our method achieved the correct inference.

Summary
We have demonstrated a training method that incorporates analog computation errors into neural network training for deployment on mixed-signal computation platforms. Neural networks trained with our method are robust against a noise RMSE of 0.5 in the analog computing process, and thus can tolerate the reduced precision of the activations. Compared with a neural network trained using conventional backpropagation, our training method maintains the inference performance at approximately half of the precision levels determined by the data format and device specifications. This allows us to deploy a trained convolutional neural network on a mixed-signal, diffractive-optics-based convolution system that exhibits convolution errors and kernel distortions.

Figure 1:
Figure 1: Computation scheme of (a) a digital fixed-point neural network layer and (b) a neural network layer with an analog acceleration unit. Variables marked by tildes are analog signals.

Figure 2:
Figure 2: Structure of the convolutional neural network in the simulation.

Figure 3.
Figure 3. Classification accuracy as a function of noise RMSE added in the inference simulation.

Fig. 5
Fig. 5 exemplifies the layer-by-layer activations of the two MCNNs in Fig. 4 at σ = 0.5 for the input digit '5'. The probability of the input digit being classified as '5' is 99.2% and 83.1%, respectively, for the MCNN with our training method and that with BNN training. Though the MCNN trained with the BNN method still correctly classifies this digit, the probability is reduced, and confusions from digits '0', '3', and '8' can be observed in its output.

Figure 4.
Figure 4. Layer-by-layer activations of the MCNNs trained with the BNN method and with our method for the input digit '5'.

Figure 5.
Figure 5. Accuracy vs. RMSE of residual deterministic errors added in the simulated inference process.
The DMD is illuminated by a collimated beam from a 12 mW laser source (Coherent OBIS LX, λ = 488 nm). Each element of each input channel, ã^(l−1,s)_{i,j}, is represented by one DMD pixel, in either the on or the off state.

Table 1:
Inference accuracy of the MCNN with reduced activation quantization bit widths