An Efficient In-Memory Computing Architecture for Image Enhancement in AI Applications

Random Spray Retinex (RSR) is a valued image enhancement algorithm thanks to its effectiveness in improving image quality. However, its computational complexity and its demands in terms of hardware resources and memory accesses have hampered its deployment in many application scenarios, for instance in IoT systems with limited hardware resources. With the rise of Artificial Intelligence (AI), image enhancement has become essential to improve the performance of many emerging applications. In this paper, we propose the use of RSR as a pre-processing filter before the task of semantic segmentation of low-quality urban road scenes. Using the publicly available Cityscapes dataset, we compare the performance of a pre-trained deep semantic segmentation network on dark noisy images and on RSR pre-processed images. Our findings confirm the effectiveness of RSR in improving segmentation accuracy. In addition, to address the computational complexity and the suitability to edge devices, we propose a novel efficient implementation of RSR using resistive random access memory (RRAM) technology. The architecture provides highly parallel analog in-memory computing (IMC) capabilities. A detailed, efficient, and low-latency implementation of RSR using RRAM-CMOS technology is described. The design is verified using SPICE simulations with measured data from fabricated RRAM and 65nm CMOS technologies. The approach presented here represents an important step towards a low-complexity, real-time, hardware-friendly architecture and design for Retinex algorithms on edge devices.


I. INTRODUCTION
Digital images captured in various application areas, such as medical imaging, space exploration, and underwater environments, are often characterized by low quality [1]. This can be due to insufficient lighting conditions or to the dynamic character of the environment. This is also the case for images acquired in autonomous vehicle driving applications. An issue common to these images is the large difference in lightness between the bright and dark areas. This poses a challenge to subsequent processing tasks, such as image segmentation. Thus, a quality-enhancement pre-processing step is normally added to the pipeline with the aim of increasing the overall effectiveness of the processing.
In this respect, one has to consider the tension between the global and local aspects of the processing. Increasing the brightness of the overall image will improve the visibility of the dark areas, but reduce the visibility of details in the bright areas. Several global techniques have been developed to address the low-light enhancement problem, such as histogram equalization [2], gamma correction, and tone mapping [3]. Nevertheless, these techniques produce a sort of overexposure when the grey levels in an image are concentrated around a certain intensity [4]. In addition, intensity saturation can result from the inconsistent enhancement performed by these algorithms, as they rely on global information from the image. This calls for a differential treatment of the different image regions, based on both local and global information.
To address these challenges, more complex algorithms have been developed, such as adaptive contrast enhancement [4], [5], adaptive histogram equalization [6], [7], and Spatial Color Algorithms (SCAs) [8], [9]. SCAs are driven by Retinex principles, inspired by the behavior of the Human Visual System (HVS). These algorithms - RSR [10], STRESS [11], STAR [12], and ReMark [13], to mention a few that belong to the Milano Retinex family [14] - are widely employed to enhance real-world images. They implement two important characteristics of human color vision: (i) the independent analysis of the color components of the visual signal; (ii) the color adjustment based on local spatial and visual information. The algorithms of the Retinex family - originally created as a model of the Human Visual System - turned out to be endowed with several desirable image enhancement properties [15].
These algorithms use both local and global information, and this results in efficient enhancement for the dark areas.
However, these algorithms are typically computationally intensive: this makes it hard to deploy them in real-time applications (especially on resource-limited edge devices) [16].
To address these challenges, this paper presents a novel low-complexity and real-time HW-friendly architecture and design of the Retinex algorithm. Furthermore, to the best of our knowledge, this work is the first to utilize the In-Memory Computing (IMC) feature of emerging memristor devices to reduce the power and improve the speed of the conventional Retinex algorithms.
Memristors are a type of Resistive RAM (RRAM) technology that provides low-power solutions at a low cost. A memristor contains a thin oxide film sandwiched between two metal electrodes [17] and has the ability to save information with zero leakage current, high endurance, relatively fast write time, and small cell size. Furthermore, the memristor has both storage and computing capabilities, which make it a suitable building block for IMC [18]- [20].
The new paradigm presented in this paper supports parallel computing and provides efficiency gains in area and power. It utilizes the analog computations in the memristor crossbar and uses the same physical elements for both processing and storage [18], [19], [21]. As a result, it substantially reduces the computational complexity resulting from the data-intensive Retinex algorithm.
As a case study, among the many Retinex variants, Random Spray Retinex (RSR) has been selected as a test for the proposed architecture due to its proven effectiveness in image enhancement [22] (we quantify image quality using several quality assessment metrics). We use RSR as a pre-processing filter before the task of semantic segmentation of low-quality urban road scenes. Using the publicly available Cityscapes dataset [23], we compare the performance of a pre-trained deep semantic segmentation network on dark noisy images and on RSR pre-processed images. In addition, the image pixel accuracy is studied for different numbers of bits to understand the impact of this design choice on accuracy and energy consumption. The latter study is essential for the implementation phase, as it helps to identify the desired quantization level.
The main contributions of this paper are summarized as follows.
1) We implement an efficient RSR algorithm using emerging memristor technology.
2) We use the RSR algorithm as a pre-processing filter for image enhancement before applying image segmentation, to demonstrate the effectiveness of such an implementation in a practical setting.
The remainder of the paper is organized as follows. Section II presents a thorough background on Retinex algorithms. After that, the proposed architecture for memristor-based Retinex is described in Section III. Then, the simulation results for the proposed architecture are provided in Section IV. Finally, Section V presents the conclusions and the planned future work.

II. BACKGROUND

A. THE RETINEX ALGORITHM FAMILY
Hereafter we frame the RSR algorithm within the family of Retinex algorithms and summarize its main variants.
Several Retinex algorithms have been developed over the years [24], based on the original Retinex, which was formulated in the 1960s and relied on random paths. This algorithm was defined by several mechanisms that specified how the information collected along the random paths was processed and progressively integrated into the corrected output image. Among those mechanisms, the most characteristic was the so-called reset, which allowed the correction to be referred to the maximum values found in the vicinity of the corrected pixel. The reset mechanism was preserved by several algorithms later developed in the Retinex family, among them the algorithms of the subfamily called Milano-Retinex [14], to which RSR belongs. For the sake of completeness, we mention that this core mechanism was dropped by other, simplified algorithms, among them the so-called NASA-Retinex [25]: this simplification made the processing more efficient, at the price of giving up one of the distinctive characteristics of the Human Visual System.
The reset-based Retinex algorithms produce an enhanced image where the chromatic dominant of the light and any smooth gradients are partially suppressed [26], while scene details and edges are enhanced [27]. The Retinex algorithm performs spatial color processing, i.e., it processes colors based on their positions and combines this information through a specific aggregation equation. As for the spatial exploration, some algorithms use random walk processes [28] or their probabilistic representation [13], and others use point sampling processes [29], [30] or their probabilistic representation [31]-[34], while the aggregation can involve various kinds of averaging of the local intensity maxima.
RSR relies on a point sampling process and was created by Provenzi et al. [10], based on the observation that a point sampling process can explore the surroundings of a pixel in a less redundant way than a random walk. The pixel samples were dubbed sprays, hence the name Random Spray Retinex (RSR). In RSR, for each pixel to be corrected - the target - the algorithm generates several collections of informative pixels in the neighborhood, extracts from each collection the maximum intensity, and eventually uses a suitable average of those values as a white reference for rescaling the input brightness of the target to compute the output value. The authors demonstrated that the spray technique outperforms the path-based strategy for collecting neighbor information.
Later, Banić et al. [22] proposed Light RSR, an efficient algorithmic implementation that reduces the computational time while keeping the same spatial sampling method as RSR. A further study by the same authors [35] investigated the feasibility of using RSR for global illumination estimation based on the local RSR results. Moreover, a new light Random Spray Retinex-based image enhancement method was proposed by Banić and Lončarić [36]. It can be used as a color correction method, a brightness adjustment method, or both. Although it operates locally, it performs a fixed number of operations per pixel, which means that its computational speed is almost independent of the parameter values used.
Later on, RSR was combined with the Automatic Color Equalization (ACE) algorithm. The two algorithms are complementary in their spatially variant approach, and as a result their output images exhibit complementary advantages and defects [37]. RSR shows good saturation properties, but has insufficient detail-recovering capabilities. ACE, instead, has a propensity to bring out details, but it tends to wash out images.
Furthermore, Lecca et al. [38] modified the RSR algorithm to control the locality of the color filtering by considering the spatial image information. The spatial information is integrated into the RSR channel lightness computation at each pixel through a weighting function inversely proportional to the distance from the spray center. Finally, Tanaka et al. [39] produced two variants of RSR by concentrating on the region of interest (ROI): the first variant proposed a cone distribution based on anatomical data as the ROI, while the second focused on the visual resolution of the visual field information and considered it the ROI. Among the most efficient and fastest implementations of RSR is FuzzyRSR [29], which exploits the same spray for the correction of several pixels.

B. REAL TIME HARDWARE IMPLEMENTATIONS
Some hardware implementations of Retinex algorithms have been proposed in the literature. For example, digital signal processor (DSP)-based real-time realizations of the NASA-Retinex algorithms (as mentioned in the previous subsection, algorithms very different from RSR and much less demanding in terms of performance) - a Single-Scale Retinex algorithm applied to monochrome images, and a simplified version of the Multi-Scale Retinex with color restoration [25] - were proposed in [40], [41]. In this case, however, the system performance is significantly lower than that of the original, non-simplified algorithm [42], [43]; furthermore, the DSP solution itself is not suitable for edge devices due to the high power consumption and cost of the hardware.
Conventional architectures show huge computational costs due to the required processing layers, arithmetic operations, and the number of iterations. In order to address this issue and increase the speed of these algorithms, the implementation of a hardware accelerator based on a field-programmable gate array (FPGA) was proposed in [42]- [45].
Li et al. [42] present a completely parallel FPGA-based architecture for the implementation of multi-scale Retinex in an outdoor application. Address encoding and distributed arithmetic are used to optimize the Gaussian kernel, and concurrent multi-scale convolutions are accomplished. Furthermore, Ustukov et al. [44] modify the multi-scale Retinex algorithm, improving its performance by using different methods of picture blurring, such as tabular value replacement instead of computing logarithm values. The method's ability to combine algorithms allows it to be implemented on an FPGA as a threaded conveyor.
Park et al. [1] proposed a low-cost, high-throughput design for a Retinex video enhancement method. The hardware (HW) architecture is built on an FPGA and achieves a throughput of 60 frames per second for a 1920 × 1080 picture with little delay. By employing a small line buffer instead of a frame buffer, applying the notion of approximate computing to the complex Gaussian filter, and devising a novel and nontrivial exponentiation operation, the proposed FPGA architecture lowers HW resource usage while retaining quality and speed. Moreover, Masri et al. [45] suggested a flexible and effective architecture for real-time video frame enhancement that can be implemented in a single FPGA. The video enhancement algorithm is based on Retinex. To regulate the dynamic range of poorly lit images while keeping visual details, a novel illuminance estimation methodology was used. The video enhancement settings are regulated in real time by an embedded microprocessor, allowing the system to adapt to the peculiarities of the incoming pictures and the ambient lighting.
Nonetheless, FPGAs have limited memory capacity and require the image to be stored in external dynamic random-access memory (DRAM), which increases energy consumption and latency [42], [43], [45]. Furthermore, one has to consider the complex trade-off between performance, hardware (HW) resources, and efficiency degradation in terms of HW design. Li and Tsai [46] proposed a low-cost, high-speed HW implementation for contrast preservation and dynamic range compression. However, the Gaussian filter used in their algorithm can only have a small size.
Furthermore, Moore et al. [47] proposed a hardware implementation consisting of resistive grids that average or smooth the pixel intensities. However, it lacks the reset operation, which is an essential feature of Retinex theory.
To address the challenges associated with the above-mentioned implementations, this paper presents a novel low-complexity and real-time HW-friendly architecture and design of the Retinex algorithm. Furthermore, to the best of our knowledge, this paper presents the first efficient RRAM-based hardware implementation of the RSR algorithm. We propose using memristor-based structures that can perform highly parallel operations, which reduce area and energy and accelerate the computation of the Retinex algorithm.

III. RANDOM SPRAY RETINEX
Here, we introduce the RSR algorithm, the semantic segmentation task, and the fundamental design blocks of a typical hardware implementation of the RSR algorithm.

A. THE RSR ALGORITHM
In RSR, stochastic sampling is used for the estimation of the local white reference from the neighborhood of the target input pixel intensity i τ , where the index τ denotes that the intensity refers to the target pixel. For each chromatic channel and each target pixel, the algorithm works as follows.
1) Repeat N times the following spray generation and processing cycle:
   i) Sample n points from a neighborhood Ω_τ of the target, following a particular sampling profile [34], thus obtaining an n-point set.
   ii) Get the corresponding sample of n input intensities, S*_s = {i_k}, k = 1, ..., n, where s indicates the spray index: the bare spray.
   iii) Add the target intensity i_τ to the set, obtaining the (n+1)-intensity set S_s = {i_τ} ∪ S*_s: the augmented spray.
   iv) Compute the maximum intensity y_s of the augmented spray, i.e., y_s = max(i_τ, max_k i_k).
After repeating steps (i) through (iv) N times, a set of maxima (y_1, y_2, ..., y_s, ..., y_N) is obtained.
2) Compute the harmonic average of the maxima, y_H = N / Σ_{s=1..N} (1/y_s), and use it as the white reference that rescales the input brightness of the target: the output channel lightness is ℓ_τ = i_τ / y_H.
The following three parameters affect the processing performance: i) the number of sprays N controls the noise: increasing N lowers the chromatic noise; ii) the number of points per spray n controls the sensitivity of the sampling to local intensity maxima: increasing n increases the probability that a small bright patch is used as the reference white; iii) the locality of the filtering (the difference between the influence of the closest points and that of the farther points in the neighborhood) is controlled by the sampling profile [36]. Such a profile is a non-increasing function of the distance r from the target and represents the probability that a pixel at that distance is picked during the sampling process [30]. Among the most used profiles are the flat profile and the profile that decreases as 1/(1 + r)^α with the distance, with α ≥ 1 (typical values are α = 2, 3, 4).
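The per-pixel procedure above can be sketched in Python as follows. The spray radius, the radial sampling density (a simple log-uniform profile, not a calibrated 1/(1 + r)^α profile), and all default parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rsr_pixel(img, ty, tx, N=20, n=16, radius=32, rng=None):
    """Sketch of the RSR correction for one chromatic channel and one target.

    img    : 2-D float array, intensities in (0, 1]
    ty, tx : target pixel coordinates
    N      : number of sprays; n : points per spray (assumed defaults)
    """
    rng = rng or np.random.default_rng()
    h, w = img.shape
    i_tau = img[ty, tx]
    maxima = np.empty(N)
    for s in range(N):
        # (i) sample n radii with density decreasing with distance (assumption)
        u = rng.random(n)
        r = (1.0 + radius) ** u - 1.0                  # denser near r = 0
        theta = rng.random(n) * 2.0 * np.pi
        ys = np.clip(ty + (r * np.sin(theta)).astype(int), 0, h - 1)
        xs = np.clip(tx + (r * np.cos(theta)).astype(int), 0, w - 1)
        # (ii)-(iv) bare spray intensities, augmented with the target, then max
        maxima[s] = max(i_tau, img[ys, xs].max())
    # 2) harmonic average of the maxima, used as the local white reference
    y_h = N / np.sum(1.0 / maxima)
    return min(1.0, i_tau / y_h)
```

On a uniform image every spray maximum equals the target intensity, so the output is exactly 1.0, as expected for a pixel that is itself the local white reference.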

B. RETINEX WITH SEMANTIC SEGMENTATION
In this section, we report a study of the effects of illumination changes and contrast enhancement on the effectiveness of the semantic segmentation of urban road scenes. This case is motivated by several settings, including autonomous vehicle control. Autonomous vehicle applications need to operate correctly across different scenarios; however, environmental factors, such as weather and poor illumination, can deteriorate the quality of the acquired images, thus compromising safety.
Using the publicly available Cityscapes dataset, we simulate underexposed images and compare the performance of a standard pre-trained deep semantic segmentation network on the original and dark images. The Cityscapes dataset focuses on semantic understanding of urban street scenes and has 30 classes: road, sidewalk, parking, rail track, person, rider, car, truck, bus, on rails, motorcycle, bicycle, caravan, trailer, building, wall, fence, guard rail, bridge, tunnel, pole, pole group, traffic sign, traffic light, vegetation, terrain, sky, ground, dynamic, and static.
The training and evaluation of our approach are carried out using the public large-scale Audi Autonomous Driving Dataset (A2D2) [48], which contains over 40,000 labeled images, from which we take a subset of 12,497 images with dimensions of 1920 × 1208 pixels. Training images are cropped and resized to 384 × 384 pixels, while test images are resized so that the largest dimension is 768 pixels, with the original aspect ratio preserved.
The dark images are generated using the approach proposed by Christopher et al. [49], who used equation (1) to model the consequences of underexposure on each pixel in an image. The brightness values (i.e., the V channel when the picture is converted to the HSV color space) of the corresponding pixels in the original and modified images are V1 ∈ (0, 1) and V2 ∈ (0, 1), respectively. θ1 is a threshold randomly generated for each picture such that (µ − σ) ≤ θ1 ≤ µ, where µ is the mean and σ the standard deviation of all pixel values V throughout the whole image; θ2 is a second, lower threshold that controls the amount of compression applied to dark picture sections. For the selected dataset, θ2 is set to θ2 = θ1 × 0.1, such that pixels with V1 < θ1 yield V2 = V1 × 0.1. Essentially, the dynamic range of pixel values below the threshold is compressed, while the dynamic range of pixel values above the threshold is increased. Figure 1 shows a sample picture before and after the application of equation (1).
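The darkening model can be sketched as follows. The below-threshold branch (V2 = V1 × 0.1) is stated in the text; the linear stretching of the above-threshold range is our assumption chosen to keep the mapping continuous and to expand the bright range, as the text describes.

```python
import numpy as np

def simulate_underexposure(v1, rng=None):
    """Sketch of the underexposure model: v1 is the HSV V channel in (0, 1).

    Below theta1 the source specifies V2 = V1 * 0.1; the remapping above
    theta1 is an assumed continuous linear stretch of [theta1, 1] to
    [theta2, 1].
    """
    rng = rng or np.random.default_rng()
    mu, sigma = v1.mean(), v1.std()
    theta1 = rng.uniform(max(mu - sigma, 1e-6), mu)  # random per-image threshold
    theta2 = 0.1 * theta1                            # second, lower threshold
    v2 = np.where(
        v1 < theta1,
        v1 * (theta2 / theta1),                              # compress darks (x0.1)
        theta2 + (v1 - theta1) * (1 - theta2) / (1 - theta1),  # stretch brights
    )
    return np.clip(v2, 0.0, 1.0)
```

The mapping is monotonically non-decreasing, so the ordering of pixel brightnesses is preserved while the dark range is compressed.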
The pre-trained DeepLab v3+ architecture [50] is used in this paper to compare the segmentation performance on the original, dark, quantized, and enhanced images. As illustrated in Figure 5, the dark noisy images are passed through the Random Spray Retinex algorithm to enhance their lighting. Then, the enhanced images are passed to the pre-trained semantic segmentation model (DeepLab). Finally, an error is calculated as the mean per-pixel cross-entropy loss between the output of the segmentation model and the ground-truth segmentation.
The Accuracy (Acc), Recall (Rec), Precision (Prec), and Jaccard metric (intersection over union, IoU) of the results are calculated against the ground-truth data for the different classes present in the dataset. To perform the comparison, the segmentation metrics shown in Table 1 are calculated for the original image, the dark image, the enhanced image after applying RSR, and the quantized image with RSR, which is the one used for the memristor implementation, as shown in the following sections. The accuracy improves after applying RSR with n=3 and a=10: better than on the dark simulated noisy pictures, although not on par with the original image. For the broader image processing and computer vision community, we expanded the results in Table 1 with baseline techniques such as histogram equalization and some of its more advanced forms. Adjusting image intensity values, histogram equalization, and contrast-limited adaptive histogram equalization are three functions especially well suited for contrast enhancement; their differences are reflected in Figure 2. The first method adjusts the image intensity values or color map, which boosts the image's contrast; by default, 1% of the data is saturated at the low and high intensities of the input. The second method is histogram equalization, which improves visual contrast by altering the values of an intensity image so that the histogram of the output image closely matches a desired histogram (a uniform distribution by default). The third and last method is contrast-limited adaptive histogram equalization, which, unlike histogram equalization, works on small data sections (tiles) rather than the complete picture. The contrast of each tile is increased so that the histogram of each output area comes close to matching the desired histogram (a uniform distribution by default). To prevent amplifying any noise that may be present in the picture, the contrast enhancement can be limited.
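For reference, the per-class segmentation metrics named above can be computed from a confusion matrix as in the following sketch (class count and label maps are illustrative; this is not the paper's evaluation code).

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Accuracy, per-class Recall, Precision, and IoU from integer label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)   # rows: ground truth, cols: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as class c but labeled otherwise
    fn = cm.sum(axis=1) - tp        # labeled c but predicted otherwise
    eps = 1e-12                     # guard against empty classes
    return {
        "accuracy": tp.sum() / cm.sum(),
        "recall": tp / (tp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "iou": tp / (tp + fp + fn + eps),   # Jaccard: intersection over union
    }
```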
Furthermore, Figure 3 shows that both n and a have a significant impact on the brightness adjustment. Moreover, a test was performed on images from the ColorChecker image dataset [36], which contains 568 8-bit sRGB images, most of which have a size of 874 × 583, as shown in Figure 4.
We add that the purpose of this subsection was not to carry out a comprehensive comparison of the RSR variants in terms of image quality enhancement: such studies can be found in [27], [51] and in several papers devoted to variants of RSR, such as [52]-[55].

C. TYPICAL CONVENTIONAL HARDWARE IMPLEMENTATION FOR RSR
Most traditional versions of Retinex work offline due to their computational complexity. Some versions of RSR, such as LightRSR [56] and FuzzyRSR [29], have considerably reduced the computational load, some of them, like SuPeR [57], at the price of extra complexity in the code implementation. Some research has also been carried out on the possibility of emulating Spatial Color Algorithms by learning the corresponding function with Artificial Neural Networks [58]: in those cases, the algorithm is fast and can be used online, but the time for training on a class of examples is non-negligible. We chose to use RSR instead of LightRSR because the latter uses weights, so that its implementation is slightly more involved than the former. RSR is also more commonly adopted, which makes it easier to compare our results with those of other papers.
Conventional architectures show huge computational costs due to the number of iterations and the required processing layers and arithmetic operations. To the best of the authors' knowledge, sparsity in Retinex has not yet been exploited in hardware. This is because high-dimensional sparse data are usually beyond what commodity hardware allows. The complexity of RSR processing is reported in terms of the number of scale-to-max operations, their accumulations over a given augmented spray, and the required memory resources. Figure 6 presents the conventional architecture for RSR. The n masks represent the augmented random sprays generated from the input image by shifting (e.g., block A is a unique random spray for a given target pixel). This is equivalent to the use of image extensions with distributed arithmetic and convolution filters [42]. The multiple random sprays are generated following a pipelined data flow. FIFOs (First-In-First-Out buffers) are used as line buffers in most FPGA-based image processing, which means that data reading is a serial operation. Pipelining can be realized with address encoding based on the random generation method. The address-encoding concept controls the address when the sprays are serially read out from storage. As presented in Figure 6, the x × y × n pixel values are fed to the line buffers in a pipelined data flow for comparison, max scaling, accumulation, averaging, and resampling with respect to the input image, where x, y, and n denote the rows, the columns, and the number of pixel elements per spray, respectively. Given the methods mentioned above for HW-based Retinex implementation, memristor-based in-memory computing paradigms are used to perform the scale-to-max operations and their accumulation, as highlighted in Figure 6 (blocks (b) and (c)).
The latter reduces the memory accesses and the computational complexity, paving the way to efficient hardware processing of highly intensive tasks with a high degree of sparsity.

D. QUANTIZATION VS IMAGE QUALITY METRICS
Since the memristor provides a limited number of states, the intensity needs to be quantized. To study the impact of quantization, we use seven different full-reference metrics to assess image quality and compare the original image with the quantized image. The Matlab code for evaluating these metrics is provided in [59]. The MSE is the most common metric used in the literature for assessing image quality, as it is simple and does not involve costly computations [60], [61]. MSE works satisfactorily when the distortion is mainly caused by the contamination of additive noise [62]. Another popular image quality metric used in prior studies is the PSNR, the ratio between the maximum possible power of a signal and the power of the corrupting noise [63], [64]. PSNR is measured with respect to the peak signal power. It involves simple calculations, has a clear physical meaning, and is convenient in the context of optimization. However, PSNR does not account for the characteristics of the human visual system (HVS) [62].
As shown in Table 2 and Table 3, as the quantization level increases, the PSNR increases, while the MSE and the NAE decrease. Moreover, the structural differences between reference and test images will generally increase as the quantization step size becomes larger. Hence, the AD is a monotonically decreasing function of the quantization step size, but the SC and MD are monotonically increasing functions of the quantization step size [65].
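The full-reference metrics discussed above can be computed as in the following sketch. The definitions follow common usage (MSE, PSNR, Average Difference AD, Maximum Difference MD, Normalized Absolute Error NAE, Structural Content SC) and may differ in detail from the Matlab code of [59].

```python
import numpy as np

def quality_metrics(ref, test):
    """Full-reference image quality metrics for 8-bit images (common forms)."""
    ref = ref.astype(float)
    test = test.astype(float)
    diff = ref - test
    mse = np.mean(diff ** 2)
    # PSNR relative to the peak signal value of an 8-bit image (255)
    psnr = 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
    return {
        "MSE": mse,
        "PSNR": psnr,
        "AD": np.mean(diff),                              # Average Difference
        "MD": np.max(np.abs(diff)),                       # Maximum Difference
        "NAE": np.sum(np.abs(diff)) / np.sum(np.abs(ref)),  # Normalized Abs. Error
        "SC": np.sum(ref ** 2) / np.sum(test ** 2),       # Structural Content
    }
```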

IV. MEMRISTOR-BASED IN-MEMORY COMPUTING ARCHITECTURE FOR RSR
In this section, the proposed novel hybrid CMOS-memristor architecture for the Random Spray Retinex algorithm is detailed. In addition, the experimental and simulation results for the proposed model are presented.

A. 4-BIT MEMRISTOR MODEL
The results provided in Section III show that a 4-bit quantization is generally sufficient to achieve an acceptable image resolution (see Table 2 and Table 3 for Figure 7 and Figure 8, respectively). In this work, an approximate 16-level fixed-point conductance state is adopted. This number is chosen to be consistent with the real memristor device fabricated by our group [21]. The writing process requires the memristive states to be separated within the switching window over equal intervals. As shown in Figure 9(a), a behavioral model is fitted to the experimental gradual switching of the memristor. The resistance change with the number of pulses v(t) is described by this model, where R_max = 2800 Ω, R_min = 157 Ω, and v_max = 21 represent, respectively, the maximum resistance, the minimum resistance, and the maximum pulse number required to switch the device between the minimum and maximum resistance states. These parameters are directly extracted from the experimental data. α = 1.5 V^-1 is the parameter that controls the nonlinear behavior of the resistance update, and F is a function of α that fits the state transition within the range [R_min, R_max]. As shown in Figure 9(b), a nonlinear change in the device resistance can be obtained by tuning α. This is equivalent to a 4-bit characterization, which is used to code the physical data acquired from the environment so as to be compatible with RRAM voltage-pulse programming and efficient in-memory processing.
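The 4-bit encoding described above can be sketched as follows. The linear mapping from quantization level to pulse count and to a target resistance in [R_min, R_max] is an illustrative assumption; only the device constants (2800 Ω, 157 Ω, 21 pulses) come from the text.

```python
# Device constants from the fabricated RRAM cell described in the text.
R_MAX = 2800.0   # maximum resistance, ohms
R_MIN = 157.0    # minimum resistance, ohms
V_MAX = 21       # pulse count spanning the full switching window

def quantize_4bit(pixel):
    """Map an 8-bit intensity (0-255) to one of 16 levels (0-15)."""
    return int(pixel) >> 4

def pulses_for_level(level):
    """Hypothetical linear mapping from a 4-bit level to a programming pulse count."""
    return round(level / 15 * V_MAX)

def target_resistance(level):
    """Hypothetical linear placement of the 16 states within [R_MIN, R_MAX]."""
    return R_MAX - level / 15 * (R_MAX - R_MIN)
```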

B. LOGICAL MEMRISTOR-BASED IN-MEMORY COMPUTING ARCHITECTURE
Realizing a practical memristor crossbar-based in-memory computing system usually requires the integration of multiple memristor crossbar arrays [66]. In general, splitting the resistive states into different arrays is beneficial for parallel computing, which is increasingly needed as the system scale grows.
A single in-memory computing macro-core of the proposed architecture is presented in Figure 10. It has the fundamental one-transistor-one-memristor (1T1R) topology [19], which enables reliable and uniform analog switching behavior. With the proposed hybrid-processing scheme, parallel processing is implemented as in [19], which reduces the latency by a factor of n^2. Each input pixel value is encoded by a pulse number according to its quantized bit number. This allows the direct writing of resistive values to the target cross-point memristors. Using the in-memory computing paradigm, the augmented sprays realize the scale-to-max operation in terms of the resistance evolution over time of the selected memristor cross-point cells under the application of the equivalent modulated pulse signal. Furthermore, memristor arrays are highly efficient at performing parallel multiply-and-accumulate (MAC) operations under shared inputs for different arithmetic values. Different fetched batches of random sprays are passed to the memristor cross-point cells separately by applying the input signals as described in the previous section. The training scheme for the memristor crossbar sets the constraints under which a batch of the intermediate signal is not supplied as input until those constraints are met. Thus, the previous batch needs to have already been used to calculate the desired weight updates, and the corresponding memristor conductance needs to be already well tuned to the greater modulated pixel amplitude. The desired max updates of the states with respect to the pulse sequence are stored in the accessed memristor element of the crossbar. Then, the second-row memristor conductance is updated after inputting the second input batch for the same target pixel.
During this processing stage, batches from another target pixel are input to the memristor crossbar and fed into the unoccupied memristor-based scale-to-max operators in parallel [19]. These operations are repeated through in-memory computation until all the random paths have been written to the memristor-based scale-to-max operators.
After the encoded pulses corresponding to the pixel intensities are applied to the bit lines, the output currents through the source lines are sensed and accumulated. This current is the weighted sum corresponding to the input patch and the chosen augmented spray. The latest resistance values, holding different weights, are written to different rows, and the entire memristor array performs the MACs in parallel under the same inputs (read voltage); thus, all the desired weighted-sum results are obtained concurrently. Afterward, each output sense-integrator value is averaged and then resampled with the associated, predefined target pixel inputs. Figure 11 shows a SPICE simulation [67] of the scale-to-max operation through different memristor cells. The successive batches of an augmented spray result in an incremental resistance behavior that retains the greatest value in the end state.
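The parallel MAC read-out can likewise be sketched behaviorally: each crossbar row holds conductances (the stored weights), a shared read-voltage vector drives the bit lines, and each source line accumulates a weighted-sum current (Ohm's law plus Kirchhoff's current law). The array sizes and values below are illustrative, not taken from the fabricated design.

```python
def crossbar_mac(conductances, read_voltages):
    """All rows see the same read voltages; each row's output current is
    a dot product, produced concurrently in the analog array."""
    return [sum(g * v for g, v in zip(row, read_voltages))
            for row in conductances]

G = [[0.5, 0.1, 0.9],   # row 0: weights written for one augmented spray
     [0.2, 0.8, 0.3]]   # row 1: weights written for another spray
v = [1.0, 0.5, 0.25]    # shared read voltages (encoded inputs)

currents = crossbar_mac(G, v)
avg = sum(currents) / len(currents)   # sense outputs averaged, as in the text
print(currents, avg)
```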
The output images match the 4-bit quantization results presented in Section II for both natural and pattern images. In the memristor-based in-memory computing system reported in [68], however, the resolution loss is attributed mainly to two factors: first, non-ideal device characteristics such as device variations, array yield problems, and device reliability issues; second, as in this approach, the limited precision due to weight quantization. Although the accuracy is not fully recovered given the limited quantization precision, the results suggest that the hybrid-processing method can effectively recover high-resolution accuracy by resampling with the original target pixel after averaging the accumulated memristor weights. These findings indicate that parallel memristor-based in-memory computation achieves high resolution while greatly accelerating the RSR algorithm. In addition, the associated chip-area expenditure is minimized by reducing the number of memory accesses and arithmetic operations.
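The hybrid-processing idea above can be illustrated with a minimal sketch: the per-spray maxima are stored at only 4-bit precision, but the final resampling combines them with the original full-precision target pixel, limiting how much the quantization step can cost. The exact RSR normalization used in the paper's pipeline is assumed here to be the target value divided by the averaged maxima; the function names are hypothetical.

```python
def quantize(x, bits=4):
    """Round x in [0, 1] to the nearest of 2**bits - 1 levels."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels

def rsr_output(target, spray_maxes, bits=4):
    """Average the 4-bit scale-to-max weights, then resample with the
    original (unquantized) target pixel, as in the hybrid scheme."""
    avg_max = sum(quantize(m, bits) for m in spray_maxes) / len(spray_maxes)
    return min(target / avg_max, 1.0)
```

Because the division reintroduces the full-precision target, the output resolution is set mostly by the averaging over sprays rather than by any single 4-bit cell.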

V. PERFORMANCE ASSESSMENT AND DISCUSSIONS
In this section, a comparative analysis of the proposed memristor-based IMC architecture against conventional FPGA-based solutions is presented.
An image size of $I_z = 256 \times 256$ is chosen for the performance evaluation in terms of memory accesses ($Mem_{acc}$), number of arithmetic operations ($Ar.op$), and area cost ($A$). Table 4 reports these metrics for the computation of a given RSR target pixel using traditional FPGA solutions and the proposed IMC architecture.

A. MEMORY ACCESS
In a traditional FPGA solution, the total number of memory accesses required to compute a given target pixel from the random-path information is
$$Mem_{acc} = Mem_{acc}(p_{row}) \times Mem_{acc}(p_{cols}) \times n \times N, \qquad (4)$$
where $p_{row}$ and $p_{cols}$ denote the accesses to the $x, y$ image coordinates from the line buffers of the random pixel generation and pipelining block, as shown in Figure 6, $n$ is the number of set pixel points per single random path, and $N$ is the number of processing cycles. An additional $n \times N$ accesses are needed for the scaled-to-max values stored in the FIFOs to be accumulated, plus $n \times N$ for the averaging and resampling processes. For the memristor-based IMC architecture, by contrast, the $n$ set points of a given random spray are held in-memory and the four-step algorithm is processed for the targeted pixel in an analog manner, as shown in Figure 10, leading to $Mem_{acc}(p_{row}) \times Mem_{acc}(p_{cols}) \times N$ memory accesses. In the following, RSR's default values, $n = 250$ and $N = 25$, are used for the objective assessments.
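Plugging in the RSR defaults gives a quick sense of the gap. The sketch below evaluates Eq. (4) under our simplifying assumption that $Mem_{acc}(p_{row}) = Mem_{acc}(p_{cols}) = 1$ (the paper states the expression symbolically; actual coordinate-fetch costs may differ).

```python
def mem_acc_fpga(n, N, row=1, cols=1):
    """Per-target-pixel accesses in the FPGA pipeline, per Eq. (4) plus
    the two additional n*N terms (FIFO reads, averaging/resampling)."""
    core = row * cols * n * N   # Eq. (4): coordinate fetches per set point
    fifo = n * N                # scaled-to-max values read from the FIFOs
    avg = n * N                 # averaging and resampling accesses
    return core + fifo + avg

def mem_acc_imc(n, N, row=1, cols=1):
    """IMC case: the spray is held in-memory, so only N accesses remain."""
    return row * cols * N

n, N = 250, 25                  # RSR defaults from the text
print(mem_acc_fpga(n, N), mem_acc_imc(n, N))
```

Under these assumptions the FPGA pipeline needs three orders of magnitude more accesses per target pixel than the IMC architecture.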
With the IMC paradigm, the $N \cdot (n + 1)$ comparison ($Comp(\cdot)$) and accumulation ($Acc(\cdot)$) operations are performed as intrinsic in-memory computations (i.e., only the access pulse train is required across the device nodes). As shown in Table 4, the memristor-based IMC thus yields a large reduction in both the number of memory accesses and the number of arithmetic operations compared with the traditional FPGA implementation.
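For completeness, the count of operations absorbed into the analog array follows directly from the expression above; with the same defaults, $N \cdot (n + 1)$ comparisons and accumulations never touch a digital ALU.

```python
def inmem_ops(n, N):
    """Comp(.) and Acc(.) operations done intrinsically in-memory."""
    return N * (n + 1)

print(inmem_ops(250, 25))  # 6275 operations folded into the array
```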

VI. CONCLUSION
This paper presented a novel, efficient hybrid CMOS-memristor approach for Random Spray Retinex. The proposed solution provides a high-speed and energy/area-efficient architecture compared with the conventional Retinex scheme. Image quality was assessed using several image quality metrics and different quantization levels. A 4-bit memristor was used as the computing element, operating as a scale-to-max store of resistive values. MAC operations were performed in parallel through the accumulation of output currents. This design can be extended to other memristor-based in-memory computing systems that use scale-to-max operators and employ sparse input data to boost their overall performance efficiently. The proposed approach is a valuable step towards developing efficient memristor-based computer vision and deep learning applications. In addition, the use of the RSR algorithm as a pre-processing step for AI applications was investigated; the results showed improved segmentation accuracy when RSR was applied to noisy images.