MR-PIPA: An Integrated Multilevel RRAM (HfOx)-Based Processing-In-Pixel Accelerator

This work paves the way toward a processing-in-pixel (PIP) accelerator based on multilevel HfOx resistive random access memory (RRAM) as a flexible, energy-efficient, and high-performance solution for real-time, smart image processing at edge devices. The proposed design intrinsically implements and supports a coarse-grained convolution operation in low-bit-width neural networks (NNs), leveraging a novel compute pixel with nonvolatile weight storage at the sensor side. Our evaluations show that such a design can remarkably reduce the power consumption of data conversion and transmission to an off-chip processor while maintaining accuracy comparable to recent in-sensor computing designs. The proposed design, named the integrated multilevel RRAM (HfOx)-based processing-in-pixel accelerator (MR-PIPA), achieves a frame rate of 1000 frames/s and an efficiency of ~1.89 TOp/s/W, while substantially reducing data conversion and transmission energy by ~84% compared to a baseline design, at the cost of minor accuracy degradation.


I. INTRODUCTION
Internet-of-Things (IoT) devices are expected to reach $1100B in revenue by 2025, with a web of interconnections estimated to consist of 75+ billion IoT devices, spanning wearable devices as well as smart cities and industries [1], [2]. Artificial Intelligence-of-Things (AIoT) nodes are composed of a variety of sensors, which collect and process data from the environment and people. A great deal of the captured sensory data is redundant and unstructured. The conversion and transmission of large volumes of raw data to a backend processor at the edge are energy-intensive and highly latent [1], [3]. These issues can be addressed by shifting the computing architecture from a cloud-centric to a thing-centric (data-centric) perspective, where IoT nodes process the sensed data themselves. On top of these challenges, artificial intelligence tasks that require hundreds of layers of convolutional neural networks (CNNs) face severe computational and storage constraints. There have been considerable advancements in both software and hardware to improve CNN efficiency by mitigating the ''power and memory wall'' bottleneck.
From the software point of view, exploration of shallower but wider CNN models, quantizing parameters, and network binarization [4] is widely accomplished. A recent development is reducing computing complexity and model size using low-bit-width weights and activations. By converting the multiplication-and-accumulate (MAC) operation into the corresponding AND-bitcount operations in [4], Zhou et al. performed bit-wise convolution between the inputs and the low-bit-width weights. Binarized CNNs (BNNs), as an extreme quantization method, have achieved acceptable accuracy on both small [5] and large datasets [4] after removing some high-precision requirements. By binarizing the weight and/or input feature map, they offer a promising solution to mitigate the aforementioned bottlenecks in storage and computation.
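As a concrete illustration of the AND-bitcount reduction mentioned above, the minimal Python sketch below (our own toy code, not the implementation from [4]) shows how a dot product of binarized activations and weights collapses into a bitwise AND followed by a population count:

```python
# Sketch: bit-wise "convolution" for binarized vectors. With activations
# and weights encoded as {0, 1} bits packed into integers, the MAC
# operation reduces to an AND followed by a popcount.

def and_bitcount_dot(x_bits: int, w_bits: int) -> int:
    """Dot product of two bit-vectors packed into integers."""
    return bin(x_bits & w_bits).count("1")

# Example: x = 1011, w = 1101 -> AND = 1001 -> popcount = 2,
# identical to sum(x_i * w_i) over the unpacked bits.
x, w = 0b1011, 0b1101
assert and_bitcount_dot(x, w) == sum(
    ((x >> i) & 1) * ((w >> i) & 1) for i in range(4)
)
print(and_bitcount_dot(x, w))  # 2
```

The same identity is what lets hardware replace multipliers with simple gates for low-bit-width layers.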
From the hardware point of view, the underlying operations should be realized using efficient mechanisms. Conventional processing elements are designed around a von-Neumann computing model with separate memory and processing blocks interconnected via buses, which poses serious problems, such as long memory access latency, limited memory bandwidth, and energy-hungry data transfer, limiting the edge device's efficiency and working time [2]. In addition, this presents several significant issues at the upper level, including bandwidth congestion and security concerns. The concept of instant image preprocessing with smart image sensors has therefore been extensively investigated [2], [6], [7], [8] as a potential remedy. By using an on-chip processor, the digital output from pixels can be processed where the sensor is located, paving the way for enhanced sensor paradigms such as processing-near-sensor (PNS), as depicted in Fig. 1(b). Other promising alternatives are the processing-in-sensor (PIS) platform [7], [9], shown in Fig. 1(c), which processes pre-analog-to-digital converter (ADC) data, and the hybrid PIS-PNS platform [1], which incorporates vision sensors and eliminates redundant data output. Generally, PIS units process images before transmitting them to an on-chip processor for further processing. Typical designs rely on this type of data transfer (from CMOS image sensors to memory), which reduces the speed of feature extraction. With a PIS unit, a computation core can: 1) significantly reduce the power consumption of converting photo-currents into pixel values used for image processing; 2) accelerate data processing; and 3) alleviate the memory bottleneck problem [1], [2].
This article develops a new efficient processing-in-pixel (PIP) paradigm, as shown in Fig. 1(d), named the integrated multilevel RRAM (HfOx)-based processing-in-pixel accelerator (MR-PIPA), co-integrating always-on sensing and processing capabilities for image sensors. The main contributions of this work are as follows.
1) We experimentally demonstrate an integrated two-bit-per-cell resistive random access memory (RRAM)-based weight storage unit. As low resistance states (LRSs) of the RRAM devices can lead to high power consumption, we run extensive device-level experiments on the fabricated device to achieve multilevel high resistive states.
2) The MR-PIPA architecture is developed based on a set of innovative microarchitectural and circuit-level schemes optimized to process the first layer of quantized neural networks (QNNs) using nonvolatile RRAM components to store weights, offering energy efficiency and speedup.
3) We present a solid bottom-up evaluation framework and a PIP assessment simulator to analyze the whole system's performance.
4) MR-PIPA's performance and energy efficiency are thoroughly evaluated and then compared with recent IoT sensory platforms.

II. BACKGROUND AND MOTIVATION
Systematic integration of computing and sensor arrays has been widely studied in three forms: processing near the sensor to eliminate off-chip data transmission and reduce ADC bandwidth, known as PNS [8]; combining sensor and processing elements in the so-called PIS [9], [10], [11]; and integrating pixels and computation units, known as PIP [7], [8]. In [8], photo-currents are converted into pulsewidth modulation signals, and a dedicated analog processor is used to perform feature extraction, reducing the amount of power consumed by the ADC. To run spatiotemporal image processing, 3-D-stacked column-parallel ADCs and processing elements are implemented and utilized in [2]. The CMOS image sensor with dual-mode delta-sigma ADCs described in [12] is designed to process the first convolutional (Conv.) layer of binarized-weight neural networks (BWNNs). Charge-sharing tunable capacitors are used by RedEye [13] to implement the convolution operation. By sacrificing accuracy in favor of energy savings, this design reduces energy consumption compared to a central processing unit (CPU)/graphics processing unit (GPU). However, for high-accuracy computation, the required energy per frame increases dramatically, by 100×. As a PIS platform, a processing-in-sensor architecture integrating MAC operations into the image sensor (MACSen) [7] processes the first convolution layer of BWNNs with the correlated double sampling procedure and achieves speeds of 1000 fps in the computation mode. This method, however, suffers from a large area overhead and high power consumption.

In this work, we are motivated mainly by three observations to develop a PIP accelerator for the first layer of QNNs. First, from the accuracy point of view, in most QNN accelerators, the first and the last layers of the networks remain in full precision, that is, in the floating-point domain. This translates to a performance bottleneck in different hardware/software co-design accelerators and requires excessive memory and processing resources [14].
The continuous-valued inputs can be readily handled as fixed-point values with n bits of precision. To quantify the cost of the first layer, we utilize the deep neural network (NN) energy estimation tool developed by the Massachusetts Institute of Technology (MIT) [15] to assess the energy requirements. Fig. 2 depicts the breakdown of normalized energy consumption of a three-layer multilayer perceptron (MLP). As observed, the first layer consumes considerably more energy than the other layers for both computation (purple block) and data movement (the other three blocks). It is worth noting that this breakdown can vary for different NN architectures. Second, in conventional image sensors, most of the power (>96% [16]) is consumed by processing and converting pixel values. This means that pixel circuits consume only 4% of the power to perform photovoltaic conversions, whereas signal amplification, analog-to-digital conversion (ADC), and data transmission consume most of the power. Third, almost all PNS/PIS/PIP systems are hardwired, so their functionalities are limited to simple preprocessing tasks such as first-layer BWNN computation.

III. PROPOSED RRAM-BASED MULTIBIT STORAGE
RRAM is a two-terminal nonvolatile memory (NVM) that stores data in varying resistive states by creating and rupturing a conductive filament within the metal-oxide insulator, as shown in Fig. 3(a). Fig. 3(b) illustrates a transmission electron micrograph (TEM) of the fabricated TiN/Ti/HfO2/TiN RRAM device integrated with a CMOS n-channel field-effect transistor (nFET) in 65-nm CMOS technology to realize a 1T1R unit cell as the primary storage element in the proposed PIP accelerator. In the set phase, the conductive filament connects the top and bottom electrodes, leading to an LRS, whereas in the reset phase, the filament breaks and the resistance of the device increases, yielding a high resistance state (HRS), as shown in Fig. 3(a). Switching between LRS and HRS allows RRAM to operate as a binary storage/memory element. Leveraging different switching schemes enables RRAM devices to store multilevel resistance states [Fig. 3(c)] for multibit-per-cell storage [17]. The most commonly used ways to produce multilevel resistance states are modulating the compliance current to reach multiple LRSs and modulating the reset voltage amplitude to reach multiple HRSs [18], [19]. The first approach results in an increased cell current due to the low resistance and consequently increases overall system power consumption, while the latter results in higher HRS variability. Therefore, we propose a promising device-to-system-level codesign approach to reduce overall system power consumption, aiming at multiple well-defined HRS levels. Fig. 4(a) shows the experimental results for switching voltage pulse widths across the RRAM and gate voltages on the transistor [Fig. 3(b)]. The device-level switching experiments are performed using a semiautomated Suss Microtech probe station with a high-precision semiconductor device analyzer B1500. Switching pulsewidths from 100 ns to 1 ms and a range of gate voltages during switching are considered on 15 devices, with 1000 cycles for each condition.
The median resistance values in the HRS range from 80 to 200 kΩ. This approach yields much higher resistances than the low resistance levels, which range from 3 to 30 kΩ [20]. To reduce HRS variability, we adopted a read-write-verify approach to achieve resistances within a specific window, as shown in Fig. 4(b) [17]. The selected experimental resistance states then serve as the potential memory states for MR-PIPA. We confirmed that the read-write-verify strategy requires a minimal number of programming cycles: the box plots in Fig. 4(b) show that the required median number of programming cycles is as low as 20.
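The read-write-verify loop can be sketched as below. The device model, target window, and function names are illustrative stand-ins of our own, not the measured behavior of the fabricated HfOx cells:

```python
# Minimal sketch of a read-write-verify programming loop that places a
# cell inside a target HRS window. The toy device model below simply
# drifts upward with each reset pulse; real HfOx cells behave differently.
import random

def program_cell(read_resistance, apply_reset_pulse,
                 target_kohm=(80.0, 120.0), max_cycles=100):
    """Pulse until the read-back resistance lands in the target window."""
    for cycle in range(1, max_cycles + 1):
        r = read_resistance()
        if target_kohm[0] <= r <= target_kohm[1]:
            return cycle  # verified: state is inside the window
        apply_reset_pulse()  # nudge the filament toward higher resistance
    raise RuntimeError("cell did not converge within max_cycles")

# Toy device: resistance (kOhm) increases by 5-15 kOhm per reset pulse.
state = {"r": 40.0}
cycles = program_cell(lambda: state["r"],
                      lambda: state.update(r=state["r"] + random.uniform(5, 15)))
print(cycles)  # number of programming cycles needed
```

In the real flow, the verify step reads the device through the same 1T1R path used for inference, so the loop terminates once the state is distinguishable from its neighbors.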

IV. MR-PIPA ARCHITECTURE
We propose an energy-efficient and high-performance solution for real-time and smart image processing for AIoT devices. MR-PIPA will integrate sensing and processing phases and can intrinsically implement a coarse-grained convolution operation required in a wide variety of image-processing tasks such as classification by processing the first layer in QNNs. Once the object is roughly detected, MR-PIPA will switch to a typical sensing mode to capture the image for a fine-grained convolution.

A. MICROARCHITECTURE
At the architecture level, the MR-PIPA array consists of an m × n compute focal plane (CFP), row and column controllers (Ctrl), a command decoder, sensor timing ctrl, and sensor I/O, operating in two modes, that is, sensing and processing, as shown in Fig. 5(a). The CFP is designed to co-integrate sensing and processing of the first layer of QNNs, targeting low-power and coarse-grained classification. To enable this, the conventional pixel unit is upgraded to a compute pixel (CP). The Ri (row) signal is controlled by the row Ctrl and shared across pixels located in the same row to enable access during the row-wise sensing mode. The core part of MR-PIPA is the CP unit consisting of a pixel connected to v NVM elements, as shown in Fig. 5(b). A sense bitline (SBL) is shared across pixels on the same column and connected to the sensor I/O for the sensing mode. Moreover, CPs share v compute bit-lines (CBLs), each connected to a sense amplifier for processing, as indicated by the purple line in Fig. 5(a). The first-layer weight corresponding to each pixel is prestored as an RRAM conductance, and an efficient coarse-grained MAC operation is then accomplished in a voltage-controlled crossbar fashion. Fig. 6(a) depicts a sample MLP, wherein CP 1,1 -CP m,n are linked to out1 via NVM 1 's weight. Similarly, every pixel is connected to out2-outv. To maximize MAC computation throughput and fully leverage MR-PIPA's parallelism, we propose a hardware mapping scheme and a connection configuration between CP elements and the corresponding NVM add-ons, shown in Fig. 6(b), to implement the target NN.
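The weight-mapping idea can be sketched as follows. The resistance values echo the experimental HRS range (80-200 kΩ) from Section III, but the signed differential mapping below is our own simplification, not the exact scheme of Fig. 6(b):

```python
# Sketch: store a signed 2-bit first-layer weight as one of four
# resistance levels attached to its pixel, with positive and negative
# weights placed on separate (differential) columns.

# Resistance levels (kOhm) standing in for the four experimental states;
# weight magnitude 0 means the cell path is effectively open.
R_LEVELS_KOHM = {0: float("inf"), 1: 200.0, 2: 120.0, 3: 80.0}

def map_weight(w: int):
    """Map a signed 2-bit weight to (R+ column, R- column) resistances."""
    mag = R_LEVELS_KOHM[abs(w)]
    off = float("inf")  # the unselected column stays effectively open
    return (mag, off) if w >= 0 else (off, mag)

print(map_weight(+3))  # (80.0, inf): strongest positive weight
print(map_weight(-2))  # (inf, 120.0): mid-strength negative weight
```

Splitting each weight across a positive and a negative column is what later lets a differential amplifier recover the signed MAC result from two column currents.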

B. PIXEL DESIGN 1) BASIC PIXEL STRUCTURE
A basic three-transistor (3T) pixel structure is depicted in Fig. 7(a) [21]. It comprises a photodiode (PD) as the primary sensing component, a reset transistor, a source-follower transistor, and a transfer transistor. The PD is a semiconductor sensor that generates a photo-current (I PH ) proportional to the brightness of the incident light, that is, the number of photons. A simplified equivalent circuit of the PD is shown in Fig. 7(a) [22]. During exposure, the PD functions as a leaky capacitance, while the leakage rate depends proportionally on the illumination [23]. The photo-current, I PH , generated by the PD can be calculated from the active PD area (A PD ), responsivity (R), and input irradiance (E in ) as I PH = A PD × R × E in . As shown in Fig. 7(b), during the bright illumination phase, the capacitor discharges faster and decreases the voltage across the PD more quickly. During low illumination, I PH is low, which results in a low voltage drop across the PD. The source-follower (SF) operates as a voltage buffer between the sensing element PD and the readout path, replicating the voltage for readout.
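The photo-current relation above is simple enough to check numerically; the sketch below uses illustrative values (not measured parameters of this design):

```python
# Sketch of the photo-current relation I_PH = A_PD * R * E_in.
# All numbers below are illustrative examples, not device data.

def photo_current(a_pd, responsivity, e_in):
    """I_PH [A] from PD area [m^2], responsivity [A/W], irradiance [W/m^2]."""
    return a_pd * responsivity * e_in

# e.g., a 10 um x 10 um PD (1e-10 m^2), 0.5 A/W, 1 W/m^2 irradiance:
i_ph = photo_current(100e-12, 0.5, 1.0)
print(f"{i_ph:.2e} A")  # 5.00e-11 A
```

Doubling the irradiance doubles I PH, which is why a brighter scene discharges the PD capacitance proportionally faster.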

2) COMPUTE ADD-ON
The compute add-on structure depicted in Fig. 5(b) consists of two functional blocks: 1) an input encoder and 2) 1T1R cells. The input encoder converts the output of the basic pixel circuit into the input of the 1T1R cell. The 1T1R cell (part of the 1T1R array) acts as an analog multiplier unit for the column-wise MAC operation. The input encoder unit consists of four transistors, of which T4 and T5 are logic transistors with an operating voltage of 1.2 V, while T6 and T7 are thick-oxide 1.8-V transistors. The 1T1R devices are integrated with thick gate-oxide transistors T8 and T9. These transistors' maximum operating voltage is 3.3 V, allowing them to withstand the high forming and programming voltages required by the RRAM cells. The proposed design follows three critical considerations (Cs) as follows.

a: LOCATION OF RRAM DEVICES
Thin-oxide transistors require a smaller area and are suitable for low-power applications, as they have a low safe operating voltage, for example, 1.2 V. On the other hand, thick-oxide transistors can withstand large operating voltages, for example, 3.3 V, but suffer from higher power and area consumption. Hence, to reduce power and area, the pixel circuit is typically designed using low-operating-voltage thin-oxide transistors. However, RRAM devices require high forming and programming voltages (∼3.3 V). If the RRAM devices were connected directly across the PD or the pixel circuit transistors, the forming/programming voltages would far exceed their operating voltages [Fig. 8(a)], which can damage the low-voltage devices. MR-PIPA separates the pixel sensing and computing modules by transferring the signal from the pixel circuit through input encoders to the gates of the thick-oxide transistors, as shown in Fig. 5(b). We then use thick-oxide transistors with the RRAM, allowing it to be formed or programmed at the required higher voltage.

b: COMPUTE ADD-ON OUTPUT
The next subtle but critical consideration concerns input encoding for the RRAM cells, that is, converting the PD voltage into an input for the RRAM cells. In standard RRAM-based matrix multiplication, for a binary input x, which can be either 0 or 1, each RRAM cell current can be expressed as I = x · (V R /R) [24]. Here, V R is the applied voltage across the device, and R is the RRAM resistance of the cell. It follows from this equation that for a zero input, the RRAM-based compute unit should ideally produce zero current. With an improper input encoder for the PIP circuit, the RRAM cell can produce a nonzero cell current I RR when the input is 0 [Fig. 8(b)]. In our design, we follow the conventional RRAM-based in-memory crossbar operations for NN inference, as shown in Fig. 8(b).
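The per-cell current model above can be written out directly; the sketch below is a toy check of the I = x · (V R /R) relation with the 0.2-V read voltage used later in this article:

```python
# Sketch of the per-cell current model I = x * (V_R / R) with binary
# input x in {0, 1}; a proper encoder must guarantee I = 0 when x = 0.

def cell_current(x: int, v_r: float, r_ohm: float) -> float:
    """Ideal 1T1R cell current for a binary input x."""
    assert x in (0, 1)
    return x * (v_r / r_ohm)

# V_R = 0.2 V across a 100 kOhm cell:
print(cell_current(1, 0.2, 100e3))  # 2e-06 (2 uA)
print(cell_current(0, 0.2, 100e3))  # 0.0 (ideal encoder output)
```

The x = 0 branch is exactly the case the input encoder must enforce in hardware: any leakage current there accumulates across the whole column and corrupts the MAC result.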

c: FILL-FACTOR
The pixel is fabricated on silicon for hardware deployment using the CMOS fabrication process. Typically, for imaging applications, a larger sensing area is preferred. The ratio of the PD sensing area to the total pixel area is defined as the fill factor. It is desirable to increase the PD area, which increases the fill factor. However, depending on the application and the add-on pixel capability, such as in-pixel digital processing, a fill-factor-versus-feature tradeoff is chosen. Since the RRAM devices are fabricated in the back end of line (BEOL), they consume no large silicon area, as shown in Fig. 8(c). Although the fill factor is unaffected by the RRAM itself, its access transistors can affect the fill factor.

C. OPERATIONAL MODES
To initialize MR-PIPA, the proposed pixel circuit must first go through forming and programming of the RRAM devices for weight storage. The filament [Fig. 3(a)] required for resistive switching can be formed by applying V R = 3.3 V across the RRAM once. Forming is performed by turning on transistor T1; this results in the input encoder output being V IG . As the input encoder is followed by the RRAM cells, V IG is applied to the gates of T8 and T9, which are integrated in series with the RRAM [Fig. 9(b)]. As for multilevel programming, different 1T1R gate voltages from 1 to 1.8 V are required [Fig. 4(a)]; these are obtained with a similar approach by applying different V IG values (1-1.8 V) [Fig. 9(c)]. As we utilize a bipolar RRAM, opposite-polarity voltages are required for the set and reset operations, as shown in Fig. 3(a). This is accomplished by applying positive voltages across opposite electrodes of the RRAM, as shown in Fig. 9(c).
In the sensing mode, initially, setting Rst = ''high,'' the reverse-biased PD is charged to V DDL = 1.2 V [Fig. 7(a) and (b)] [21]. In this way, turning on the access transistor T3 and the k 1 switch at the shared ADC [Fig. 5(c)] allows the C 1 capacitor to fully charge through the SBL. By turning off T1, the PD generates an I PH based on the external light intensity, which leads to a voltage drop (V PD ) at the gate of T2. Once again, by turning on T3, and this time the k 2 switch, C 2 is selected to record the voltage drop. Therefore, the voltage values before and after the image light exposure, that is, V 1 and V 2 in Fig. 5(c), are sampled. The difference between the two voltages is sensed with an amplifier; this value is proportional to the voltage drop on V PD . In other words, the voltage at the cathode of the PD can be read at the pixel output.
During the object-detection mode, we leverage the efficient crossbar MAC with the 1T1R array. As RRAM cells store data as resistive states, the resultant cell current is I RR = V R /R, where V R is the voltage applied across the cell [see Fig. 5(b)]. The voltage applied across the 1T1R cell, also known as the read voltage V R , is chosen as low as 0.2 V such that it does not alter the programmed state of the device (e.g., the voltage required to set or reset the device is ≥0.7 V). Here, the output of the input encoder controls the T8/T9 transistor gate voltage [Fig. 5(b)]. If T8/T9's gate voltage is larger than the threshold voltage (0.7 V), it allows the current to pass through; as a result, the cell current is I RR = V R /R = V R · G. Here, R is one of the four resistive states representing the stored weight, and G represents the conductance of the cell. If the T8/T9 transistor gate voltage is 0, the transistor blocks the current, resulting in no cell current (I RR = 0 A).
As discussed previously, under high illumination, the voltage across the PD, V PD , is low, and vice versa [Fig. 7(b)]. The proposed input encoder converts V PD so that the output is logic ''1'' during low illumination (dark pixel) and logic ''0'' during high illumination (bright pixel). The first inverter (T4 and T5) of the input encoder operates at 1.2 V and converts V PD to a 0- or 1.2-V output for the second inverter. The second inverter consists of thick-oxide 1.8-V transistors (T6 and T7), which allow the 0-1.8-V gate voltage for multilevel programming [Fig. 9(c)]. As the threshold voltage of the T6 and T7 transistors is below 0.7 V, the output of the second inverter is (V IG , 0 V) [Fig. 5(b)]. The resultant output of the input encoder is thus V IG for low/dark illumination and 0 V for high/bright illumination. Accordingly, the resultant cell currents for low and high illumination are I RR = V R /R and 0, respectively. Then, to combine and quantify the currents from both positive and negative weight connections, we constructed a differential amplifier [Fig. 5(d)]. The input currents into the operational amplifier in each column pair come from two columns holding the positive and the negative weights [Fig. 5(a)]. Each column current is the summation of the currents from the 1T1R cells; for example, the positive weight current for the jth column can be described as Σ_{i=1}^{M} V R · G+ i,j . The resultant output voltage of the operational amplifier is proportional to Σ_{i=1}^{M} V R · (G+ i,j − G− i,j ), where G+ i,j and G− i,j are the conductances of the RRAM cells in row i of column pair j storing the positive and negative weights, respectively. From a programmer's standpoint, MR-PIPA is a third-party accelerator rather than a memory unit. Thus, for general-purpose parallel execution, an Instruction Set Architecture (ISA) and a virtual machine will be needed. With these, any user-level program can be translated at install time into MR-PIPA's hardware instruction set to support MAC.
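The differential column readout described above amounts to a signed sum of cell currents; the toy model below (our own notation, with conductances in siemens) makes the arithmetic explicit:

```python
# Sketch of the differential column MAC: the amplifier output tracks
# sum_i V_R * G+_{i,j}  minus  sum_i V_R * G-_{i,j}, with binary
# encoder outputs x_i gating which cells conduct.

def column_mac(x, g_pos, g_neg, v_r=0.2):
    """Signed MAC for one column pair; x holds binary encoder outputs."""
    i_pos = sum(xi * v_r * gp for xi, gp in zip(x, g_pos))
    i_neg = sum(xi * v_r * gn for xi, gn in zip(x, g_neg))
    return i_pos - i_neg  # amplifier output is proportional to this

# Three pixels: dark, bright, dark -> encoder outputs x = [1, 0, 1]
x = [1, 0, 1]
g_pos = [1 / 80e3, 0.0, 0.0]          # positive weight on pixel 1 only
g_neg = [0.0, 1 / 120e3, 1 / 200e3]   # negative weights on pixels 2 and 3
print(column_mac(x, g_pos, g_neg))    # 2.5e-6 - 1.0e-6 = 1.5e-6 A
```

Note that the bright pixel (x = 0) contributes nothing to either column, which is exactly the zero-input behavior the encoder enforces.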

V. PERFORMANCE EVALUATION

A. FRAMEWORK AND METHODOLOGY
To assess the performance of the proposed design, we developed a simulation framework from scratch consisting of three main components, as shown in Fig. 10. First, at the device level, we fabricated the proposed RRAM device and extracted the switching data and resistance ranges experimentally. Second, at the circuit level, we fully implemented MR-PIPA with its peripheral circuitry using the IBM 65-nm CMOS10LPe PDK in Cadence to obtain the performance parameters. We trained a PyTorch QNN model inspired by [4], extracting the first-layer weights. MR-PIPA's RRAM elements are then programmed at the circuit level with the quantized 2-bit weights. Third, after the first-layer computation, the results are recorded and fed into a behavioral-level in-house simulator to simulate the whole network at the architecture level and extract the performance parameters and inference accuracy.
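The 2-bit quantization step that precedes programming can be sketched as below. This symmetric rounding scheme and the function name are our own illustration; the exact quantizer used in the trained QNN model may differ:

```python
# Sketch: map float first-layer weights to the four representable
# signed 2-bit levels before programming them as RRAM conductances.

def quantize_2bit(w, w_max):
    """Round a float weight to the nearest of 4 symmetric 2-bit levels."""
    levels = [-1.0, -1/3, 1/3, 1.0]  # symmetric 2-bit code
    scaled = max(-1.0, min(1.0, w / w_max))  # clip into [-1, 1]
    return min(levels, key=lambda l: abs(l - scaled)) * w_max

weights = [0.9, -0.05, 0.4, -0.7]
print([round(quantize_2bit(w, 1.0), 3) for w in weights])
# [1.0, -0.333, 0.333, -1.0]
```

Each resulting level then corresponds to one of the four experimentally demonstrated resistance states, with the sign selecting the positive or negative column.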

B. DEVICE-TO-CIRCUIT LEVEL RESULTS
The proposed CP was designed at a 65-nm process node. The pixel's PD was simulated as a parallel capacitor, with the photo-current representing the illumination. The capacitance value (13 fF) was calculated from the doping concentration of the 65-nm CMOS process and the PD area (Section IV-B). For demonstration, the lowest case of high illumination (bright pixel) was taken as ∼13 klux, and the highest case of low illumination (dark pixel) was taken as ∼130 lux. The resultant I PH values used for the simulations are 10 and 0.1 nA for high and low illumination, respectively.
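A back-of-the-envelope check of the discharge rates implied by these numbers (13 fF capacitance, 10 nA vs. 0.1 nA photo-current; the 1-µs exposure time below is our own illustrative choice):

```python
# Linear discharge model: delta_V = I_PH * t / C_PD for the simulated
# PD capacitance, comparing the bright- and dark-pixel photo-currents.

C_PD = 13e-15     # F, from the 65-nm process estimate above
V_RESET = 1.2     # V, PD reset voltage (V_DDL)

def delta_v(i_ph, t):
    """Voltage drop across the PD after exposure time t (linear model)."""
    return i_ph * t / C_PD

# After an illustrative 1 us exposure:
print(round(V_RESET - delta_v(10e-9, 1e-6), 3))   # bright: 0.431 V left
print(round(V_RESET - delta_v(0.1e-9, 1e-6), 3))  # dark: 1.192 V left
```

The two residual voltages land on opposite sides of V DDL /2 = 0.6 V, which is what lets the inverter-based input encoder resolve them into clean rail-to-rail logic levels.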
We simulated both high and low illumination with 10% variation and observed the response at different points of the circuit. First, the voltage response across the PD shows the expected high and low voltage drops over time, respectively [Fig. 11(a) and (e)]. It also confirms that the add-on compute unit does not affect the pixel sensing operation. Fig. 11(b) and (f) shows the input encoder output. As the proposed input encoders are inverters, they switch to the rail voltages of 0 and 1.2 V during the MAC operation. The switching from 1.2 to 0 V occurs before the ''read'' operation (at 0.8 × 10 −6 s). We observe that the proposed design is immune to I PH variation, as any V PD during high or low illumination is converted to a rail-to-rail 0 or 1.2 V [Fig. 11(f)]. As the input encoder output acts as the input for the thick-oxide transistor integrated with the RRAM, the 0- and 1.2-V voltages fall below and above the transistor's threshold voltage of 0.7 V, respectively. As a result, no current flows through the RRAM cell for a 0-V input encoder output, that is, during high illumination. On the other hand, during low illumination, the input encoder output becomes 1.2 V, and the cell output current, according to Ohm's law, is I RR = V R /R. It is noteworthy that the RRAM cell output current [Fig. 11(b) and (f)] is independent of I PH variation. This immunity is a result of using inverters for input encoding: for an analog input voltage above or below V DDL /2, the output of an inverter is 0 V or V DDL , respectively. Fig. 11(d) and (h) shows that when both I PH and RRAM resistance variations are present, the output RRAM cell current depends only on the RRAM resistance variation. The RRAM cell current for the four different resistance levels is shown in Fig. 12. Even with variations considered, the cell currents are distinguishable for the different resistances/weights stored.

C. CIRCUIT-TO-ARCHITECTURE LEVEL RESULTS
We limited the weight precision to four resistance levels. This can be readily used to map and accelerate binary, ternary, and quaternary NNs. Table 1 compares the structural and performance parameters of selected PIP and sensor designs from the literature. As different designs are developed for specific domains, for an impartial comparison, we estimated and normalized the power consumption when all units executed the same task of processing the first layer of CNNs. Our cross-layer simulation results show that MR-PIPA achieves a frame rate of 1000 frames/s. This comes from the massively parallel CPs. However, the design in [6] achieves the highest frame rate, and the design in [2] imposes the smallest pixel size enabling in-sensor computing. As for the area, our simulation results reported in Table 1 show that the proposed MR-PIPA compute pixel occupies ∼6 × 6 µm² in 65 nm. As we do not have access to the other layouts' configurations, a fully fair comparison of area overheads is almost impossible. However, we believe that a rough assessment can be made by comparing the number of transistors in previous SRAM-based designs with MR-PIPA's lower-overhead compute add-on. We reimplemented MACSen [7] at the circuit level as the only CNN accelerator developed with the same purpose. Our evaluation showed that MR-PIPA consumes ∼74% less power than MACSen when performing the same task. Compared to [6], MR-PIPA substantially reduces data conversion and transmission energy by ∼84%. While Table 1 focuses on various PIS architectures (close-to-pixel computation) primarily supporting CNNs in the binary domain, a recent architecture [26] presents a systolic neural CPU that fuses the operation of a traditional CPU and a systolic CNN accelerator.
It converts 10 CPU cores into an 8-bit systolic CNN accelerator, showing comparable performance (1.82 TOPS/W @65 nm versus 1.89 TOPS/W @65 nm in MR-PIPA) while providing higher flexibility and bit-width (up to 8 bits). Putting everything together, MR-PIPA offers: 1) a low-overhead, dual-mode, and reconfigurable design that preserves the sensing performance and adds a processing mode to remarkably reduce the power consumption of data conversion and transmission; 2) a single-cycle in-sensor processing mechanism to improve image-processing speed; 3) a highly parallel in-sensor processing design to achieve ultrahigh throughput; and 4) the exploitation of NVM, which reduces standby power consumption during idle time and offers instant wake-up and resilience to power failure.

D. ACCURACY
An image classification task is selected to demonstrate the benefits of the MR-PIPA design. In the original BWNN topology, all the layers except the first and last were implemented with quantized weights [27]. However, in these tasks, the number of input channels is relatively low compared with the number of channels in the internal layers, so the required parameters and computations are small, and quantizing the input layer is not a significant issue [27]. Therefore, in almost all previously developed 3T- and 4T-pixel PIP designs, the first layer is implemented with quantized weights, realizing a BWNN [7]. An identical NN accelerator can then be used to accelerate the remaining layers after the first layer has been computed.

d: DATASETS
We conducted experiments on several datasets, including the Modified National Institute of Standards and Technology (MNIST) database [28], Fashion-MNIST [29], the MIT CBCL face database (MCFD) [30], and street view house numbers (SVHN) [31]. MNIST is a gray-scale dataset that contains 70 000 28 × 28 images of handwritten digits from 0 to 9: 60 000 images for the training set and 10 000 images for the testing set. Similar to MNIST, Fashion-MNIST consists of 28 × 28 gray-scale images spanning ten fashion categories, with 60 000 training and 10 000 testing images. The MCFD face recognition database contains face images of ten subjects, where each image is normalized to 20 × 20 pixels. The training data consist of 6977 images, while the testing data consist of 24 045 images. Finally, we also exploit SVHN with 73 257 training digits, 26 032 testing digits, and 531 131 additional digits for extra training data. The images are preprocessed to 20 × 20 from the original 32 × 32 cropped version and fed to the model.

e: NN ARCHITECTURE
To evaluate our design and perform a fair comparison, we developed two networks: a two-layer MLP and a CNN with three convolutional and three fully connected (FC) layers, the latter equivalently implemented by convolutional layers. Herein, the first layer is executed at the device level, and its outputs are then fed into the second layer of the algorithm, which is implemented in Python. The comparison of classification accuracies is summarized in Table 2. The results show that higher accuracy can be achieved using our MR-PIPA architecture, which can handle four analog values (2-bit quantized weights) rather than two (1-bit).

VI. CONCLUSION
This work presents a PIP accelerator that intrinsically implements and supports a coarse-grained convolution operation in low-bit-width QNNs, leveraging a novel compute pixel with nonvolatile weight storage at the sensor side. We demonstrate four distinct high resistance levels in order to decrease overall system power consumption. Our results demonstrate acceptable accuracy on various datasets, while MR-PIPA achieves a frame rate of 1000 frames/s and an efficiency of ∼1.89 TOp/s/W.