A FerroFET-Based In-Memory Processor for Solving Distributed and Iterative Optimizations via Least-Squares Method

In recent years, several designs that use in-memory processing to accelerate machine-learning inference problems have been proposed. Such designs are also a perfect fit for discrete, dynamic, and distributed systems that can solve large-dimensional optimization problems using iterative algorithms. For in-memory computations, ferroelectric field-effect transistors (FerroFETs) owing to their compact area and distinguishable multiple states offer promising possibilities. We present a distributed architecture that uses FerroFET memory and implements in-memory processing to solve a template problem of least squares minimization. Through this architecture, we demonstrate an improvement of <inline-formula> <tex-math notation="LaTeX">$21 \times $ </tex-math></inline-formula> in energy efficiency and <inline-formula> <tex-math notation="LaTeX">$3 \times $ </tex-math></inline-formula> in compute time compared to a static random access memory (SRAM)-based <italic>processing-in-memory</italic> (PIM) architecture.


I. INTRODUCTION
M ODERN computing systems based on the Von-Neumann architecture rely on a clear distinction between logic and memory and process information by executing a sequence of precise atomic instructions with periodic uploads to the memory. Such systems are the foundation of the digital revolution that began with the demonstration of the self-aligned planar-gate silicon MOSFET in the 1960s and was accelerated by rapid advances in transistor technology. However, in the last decade, the volume of data collected by distributed sensors and networks has grown exponentially. Ingesting, processing, and extracting actionable intelligence out of this abundant data require a large amount of data traffic between logic and memory blocks, leading to the problem of memory bottleneck. This requires novel ways of architecting the compute platform. For example, by embedding processing elements in the memory subarray itself in so-called processing-in-memory (PIM) architectures [1]- [5], the traditional Von-Neumann bottleneck can be addressed and significant acceleration and improved power efficiency can be achieved. In order to solve the memory bottleneck problem, current research focuses on architectures and memory arrays that can accelerate memory-based processing for machine learning applications. Designs explore the use of static random access memory (SRAM) arrays [6], crossbar arrays with ReRAMs [7]- [9], memristors [10]- [12], and spintronic MRAMs [13].
Apart from inference, one ubiquitous algorithm in signal processing and autonomous systems is optimization-in particular, convex optimization. Least squares minimization is such a template problem and is the focus of this paper. We demonstrate that distributed convex optimization via least squares method can be efficiently implemented in an iterative dynamical system using a systolic PIM architecture, with breakthrough energy efficiency and performance. In particular, the iterative and parallel nature of memory-read makes the systolic PIM a good candidate for the proposed algorithm. This is further made possible by a parallel development in device technologies, namely, the advent of multiple embedded nonvolatile memories (eNVM). Among all competing eNVM technologies, ferroelectric field-effect transistors (FerroFETs) have emerged as promising candidates due to their compact size, multi-level storage, nanosecond read-write, and high energy efficiency. We demonstrate that a systolic PIM architecture, using FerroFET pseudocrosspoint array can solve least squares minimization with 21× improvement in energy efficiency compared with an SRAM PIM architecture.

II. CONVEX LEAST SQUARE MINIMIZATION
Before discussing the systolic PIM architecture, we present a brief overview of distributed least squares minimization as a template problem, with widespread applications in discrete signal processing. In particular, it is a common tool for signal reconstruction where the process of sampling is nonuniform [14], [15] such as in computerized tomography (CT), magnetic resonance imaging (MRI) [16], radar signal processing, light detection and ranging (LIDAR) systems, and so on. Consider (1) u and v are the horizontal and the vertical arguments of a continuous signal; (2) x and y are the discrete coordinate indexes; and (3) ω x and ω y are horizontal and vertical spatial frequencies. Let f (u, v) be a band-limited signal in R 2 . The signal is nonuniformly sampled and are stored in vector b, which are referred to as f (x, y). The objective is to use the nonuniform samples to obtain complete reconstruction of f (u, v) in N x · N y dimensional subspace. Fig. 1 shows an example of f (u, v) and the results of nonuniform sampling. In this algorithm, we assume that f (u, v) lies in an N x · N y dimensional subspace. To reconstruct the signal accurately we have used 2-D lapped orthogonal transform (LOT) cosine-IV harmonics as the basis functions. A smoothing function g(u, v) is applied to all the basis functions to avoid distortions. Equation (1) shows a general LOT cosine-IV basis function. Here, f (u, v) is split into K x by K y frames, [k x , k y ] represents a specific frame, ω x and ω y indicate the harmonic in horizontal and vertical directions Since f (u, v) lies in a N x · N y dimensional subspace, it can be expressed as The key point to note here would be that LOT cosine-IV has compact support and the different frames are loosely coupled to each other. In fact, for samples in each frame, the nontrivial dependence would extend only to the adjacent frames apart from itself. According to (2), we can write an equation for each sample and collect them into matrix-vector product form and the coefficients can be found by solving the inverse-linear problem of Here b is the sample vector, z is the coefficient vector obtained by stacking the coefficients α(k x , ω x , k y , ω y ), and A is referred to as the Grammian (Gram) matrix of the basis. When the size of A matrix is large (as in most applications), a direct solution is not possible. Therefore, alternatively we follow an iterative approach, the Jacobi method. A general update of z in jth component at the kth iteration is given as Some observations are worth emphasizing: 1) to update z k j , only values from previous iterations are needed and 2) columns of A are coupled only with neighboring frames, which leads to simpler computation of B ji . Such a system maps naturally to a systolic PIM architecture with: 1) near neighbor connections and 2) embedded linear algebraic operators on the periphery of the subarray-as will be described in Sections III and IV.

III. OVERVIEW OF FerroFETs-BASED PIM: MODELING AND EXPERIMENTAL VERIFICATION
In this paper, we explore FerroFETs as the technology of choice for implementing resistive cross-bar architectures that can accelerate linear algebraic operations. In particular, HfO 2 -based FerroFETs have recently received great interest for its application in nonvolatile memory (NVM) [17]. It is CMOS-compatible and retains ferroelectricity for thin films with thickness around 10 nm. By tuning the portion of the switched ferroelectric domain, a FerroFET can exhibit multiple intermediate states, which has been used in neuromorphic computing [18], [19].
The operation of FerroFET as a multi-valued eNVM storage is different from a traditional binary memory [17] in that a series of weak pulses are applied to set the device in the desired state [18], [19]. Various pulse schemes are proposed to tune the state, including identical pulse schemes [21], pulsewidth modulation schemes [22], and pulse-amplitude  modulation schemes [19], [23]. For illustration, Fig. 2 illustrates the operation with a pulse-amplitude modulation scheme, which is used in this paper. Fig. 2(e) shows the applied pulse waveform. After each pulse, the percentage of switched ferroelectric domains is modified. The device states are shown in Fig. 2(a)-(d). The device I DS -V GS values corresponding to different states are shown in Fig. 2(f), which shows the intermediate states. The different states could be sensed by applying a read pulse, V R , the corresponding drain-to-source conductance, G DS , can be sensed. Fig. 2(g) shows the ideal G DS as a function of applied pulse numbers. G DS increases/decreases linearly with pulse number during potentiation/depression, respectively. A symmetrical potentiation/depression is necessary for high accuracy computation. The experimental procedure is outside the scope of this paper and is described in [24]. The FerroFET model includes the atomistic simulation of domain dynamics with a driftdiffusion-based FET model. The simulation results closely match the experimental data and are shown in Fig. 3, where the different conductance levels are shown as a function of the number of programming pulses.

IV. FerroFet PIM ARCHITECTURE AND END-TO-END TOOL CHAIN DEVELOPMENT
In this paper, we explore the FerroFET memory-based processing in memory (PIM) architecture in a hierarchical manner. A short description of each layer of the design abstraction is provided here. Fig. 4 provides the flowchart of the entire design cycle from devices to the PIM architecture. The salient features are as follows.  1) There are 64 cores, eight rows, with each row containing eight cores. With respect to Section II, this implies N x = N y = 8. 2) Each core is capable of performing Jacobi iterations with subspace dimensions K x and K y (horizontal and vertical dimensions) equal to 8. The subspace dimensions determine the core complexity and accuracy of signal reconstruction. From our analysis, we identified that 8 × 8 subspace dimensions are sufficient for signal-processing applications in hand. 3) Analog-to-digital converters (ADCs) are critical in terms of determining the latency and power consumption. In order to explore the design space properly, we have used ADCs with different resolutions and design constraints. 4) For the current design, the B-coefficients (B −1 jj B ji ) and z-coefficients (z k j ) are represented in 12-bit fixed-point representations where the MSB 6 bits represent the integer part and last 6 bits represent the fractional part. 5) To model the system, we have used Spice for simulating bit cells, Verilog and VerilogA models for array-level circuit architecture simulations and gem5 for architectural simulations.

A. FerroFET CELL STRUCTURE
Fig . 5 shows the schematic for a differential FerroFET memory cell. The cell, apart from storage, provides the facility to compute 12-bit by 3-bit in-memory multiplications. Unlike previous work [25]- [27], the proposed bit-cell allows both positive and negative values for stored values as well as the inputs. During a read operation, the word line (WL) is fully turned on, appropriate V GS values are provided through GL1 and GL2. The entire row is read simultaneously through the current that is accumulated on source line (SL). The accumulated current corresponding to G and V is given by The weights of B-coefficients are encoded as multiples of 2 G, and the inputs or z-coefficients are coded as multiples of V . Here, both the G (B-coefficients) and V (z-coefficients) can be positive or negative; or in other words, no additional peripheral structure is required that is determined by the sign of the number being multiplied. The FerroFET-based product evaluation has been done by implementing the full design through spice simulation.
This cell structure allows in situ analog computation of multiply and accumulate (with both positive and negative operands) in the memory array itself. Fig. 6 shows the block diagram for the entire core and provides the detailed structure of the FerroFET memory array. Cores can be divided into three major blocks: 1) the FerroFET memory array that computes vector dot product (sum of products); 2) peripheral blocks; and 3) the communication block. The memory array and the peripheral blocks together form the compute unit. Each core has a maximum of eight compute units corresponding to each neighbor. The details of the architecture and the subblocks are shown as a part of the supplementary material. Here, we discuss the salient features only.

1) FerroFET MEMORY ARRAY STRUCTURE
The hierarchy of the FerroFET memory array has been shown in detail in Fig. 6. In each iteration, the memory array performs matrix-vector product of B and z using a pseudocrossbar architecture.

2) PERIPHERAL BLOCKS
The current summing FerroFET subarrays have per-column ADCs to digitize the summation of the inner products. The peripheral blocks include, shift plus add (S + A) arrays, adders to collect the output of each compute unit, followed by a subtraction block. Once these blocks finish their operation the z-coefficients are computed and sent to the communication blocks. Each core receives inputs from the neighboring cores. Digital to analog converters (DACs) produce voltage signals corresponding to a digital value of z-coefficients and these voltages are asserted on bit-lines (BL1, BL2) of the memory array.

3) COMMUNICATION UNIT
Communication between cores is done through an asynchronous mechanism. In this design, a four-phase handshake protocol has been used because of reduced logical complexity and competitive power and area efficiency when compared with respect to a two-phase protocol. The details of the protocol have been discussed in the Supplementary material.

C. SYSTEM ARCHITECTURE
The proposed architecture comprises of eight rows with eight cores in each. The entire design is synthesized in the 28-nm CMOS process. To simulate and obtain latency and power estimations for the baseline Von-Neumann architecture, we used the gem5 simulator [28] and McPAT [29]. Table 1 shows the system specifications for the gem5 simulator. For each iteration of the baseline Von-Neumann architecture, we collect a set of workload statistics. The system configuration and the data for a single iteration are then run through McPAT to obtain power estimations.
Simultaneously, we construct an SRAM PIM to compare its performance with the proposed FerroFET-based PIM architecture. In this design, we use a single read and write ports and peripheral adders and multipliers to design a compute unit. The structure of cores in the SRAM PIM is identical to that of the FerroFET PIM. The SRAM PIM prototype also consists of 64 cores.  is defined as the L2 norm of the difference of Z between the proposed architecture and a corresponding floating-point architecture. In our design, we use 2/3/4/7 bits/cell to store 12 bits (excluding sign bit) of the fixed point (6 bits for integer and 6 bits for the decimal). For example, the range corresponding to 2 bits with the sign bit, that is, [−4, 3] is represented by 3 bits/cell (due to the cell architecture). In our design the default ADC resolution is 16 bits; and we also study the effect of 16-bit data converters on the design. We use the linear part of the FerroFET's conductance, as discussed above.

V. DESIGN SPACE EXPLORATION
We observe that the average normalized error increases as the number of bits/cell increases as shown in Fig. 7(a). This is attributed to the fact that the use of a larger number of bits/cell requires higher ADC and DAC bit resolution to maintain precision. average normalized error from 7 bits/cell FerroFET array is much larger than 2, 3, 4 bit/cell FerroFET array mainly due to the loss of precision during data conversion. A higher resolution from the data converters beyond 16 bits requires noise-shaping and advanced architectures that are not amenable for low-power designs.
In order to quantify the effect of the finite resolution of the ADC/DAC on the fidelity of the final results, we plot the average normalized error of Z in Fig. 7(b). Three cases corresponding to the ADC/DAC resolution of 12, 14, and 16 bits are studied. Here the number of bits per FerroFET cell is considered to be 3. We observe that an ADC/DAC of 14-bit resolution results in convergence, whereas the quantization offered resulting for a 12-bit ADC/DAC is unacceptable. This leads to the design point where 14-bit ADC/DACs are used in the peripherals.
So far, we have studied the effect of the peripheral circuits and storage architecture on the convergence of the optimization algorithm. FerroFETs, in spite of their multi-state storage capability, suffer from inherent nonlinearities where the conductance does not change linearly with the number of pulses. We analyze the effect of this nonlinearity in conductance on the average normalized error of Z in Fig. 9. The nonlinearity in conductance of FerroFET is modeled as a normalized sigmoid function where G max and G min are the maximum and minimum conductance values, α is an empirically derived parameter. This is in contrast to the convex/concave functions that have been used in [30], [25], and [31] to model nonlinearity. We note that in the case of FerroFETs, the sigmoidal function is 1) a better fit and 2) physically meaningful. The sigmoidal conductance response manifests from the approximately Gaussian distribution of coercive fields among individual domains within the ferroelectric. Therefore, an amplitude-modulated pulse scheme, which in essence integrates across the domain distribution is expected to produce sigmoidal characteristics. Fig. 8(a) presents the experimental data of device-to-device V th variation among 40 FerroFETs and the maximum variation in V th is 30%. More detailed experimental data are shown in [33]. Like other algorithms, we note that increasing variations will increase the error in computation. Fig. 8(b) shows the averaged normalized error of the algorithm with respect to % of random variation on V th after 20 iterations of the algorithm. As expected, the error increases as the number of bits of storage per cell increases.  In this design, the number of bits per FerroFET cell is assumed to be 3. It is shown that if α is greater than 0.1, the average normalized error increases as the number of iterations progresses. This illustrates that the use of Fer-roFETs in optimizations for PIM architectures require linear changes in conductance during potentiation and depression. In [30], the authors have shown that when resistive processing units (RPUs) are used in crosspoint architectures for solving inference in deep neural network architectures, the resistive units need high degrees of linearity. We arrive at a similar conclusion when such resistive elements are used in solving optimization problems. This motivates further research in the device community to address the issue of nonlinearity when PIM architectures are used for solving linear-algebraic problems.
We study the effect of the design space on critical system parameters such as compute time, energy, power, and area. The number of bits that can be stored in a FerroFET decides the FerroFET array size. Our baseline design uses a cell with 4 bits/cell. We also consider the case of 5 bits/cell where we need 64 × 256 memory cells (eight subarrays of 64 × 32 dimension) to store all the B-coefficients. As we decrease the number of bits/cell, the total number of memory cells required increases. For example, a design with 3 bits/cell Similarly, the DAC resolution also affects the compute unit area and other critical metrics. In this architecture, the multistage DAC resolution can be configured to 2, 3, 6, and 12 bits. The main role of the DAC is to provide analog values of the z-coefficients, which are represented in a 12-bit fixedpoint format. As we reduce the DAC resolution, there are two options that can be pursued in the design: 1) duplicate the subarrays to compute in parallel and maintain the compute time at the expense of area overhead and 2) perform the computations sequentially. The sequential computation can be explained by the following simple example. For a 6-bit DAC, we first evaluate the sum with six LSB bits of all the z-coefficients, and in the next cycle, we evaluate the sum with the six MSB bits for all z-coefficients and eventually add them with appropriate scales using S + A blocks. We define the first approach as parallel computation which results in higher throughput but lower area efficiency and the second approach as sequential computation, which consumes the lower area at the cost of lower throughput. Another important fact to note is that decreasing the number of bits/cell or the DAC resolution reduces the dynamic range of the read current out of SL lines resulting in simpler peripheral design. In our case studies, we have optimized the read peripheral circuits and ADCs based on the DAC configuration [32].
Figs. 10 and 11 illustrate the compute time and energy as the DAC resolution and number of bits/cell are varied for the parallel-computation and sequentialcomputation approaches, respectively. It can be clearly seen from the two figures that in case of a sequential approach, the computation time is 2-3× higher than in the parallel-computation approach. For parallel computation [ Fig. 10(a)-(d)], we observe a trend that the compute time goes up as the DAC resolution increases. This is because the ADC starts to dominate the system latency. As we increase the DAC resolution, to maintain the same quantization error VOLUME 5, NO. 2, DECEMBER 2019

FIGURE 11. Compute time and energy behavior of the compute unit versus DAC resolution for the sequential-computation approach and storage per FerroFET memory cell is (a) 2 bit/cell, (b) 3 bits/cell, (c) 4 bits/cell, and (d) 5 bits/cell.
for the read current a higher resolution ADC is required and ADC latency increases super-linearly as the resolution increases. In Fig. 10(a) and (b), a monotonic decrease in energy is noted as the DAC resolution increases. This is because for both cases, the parallel memory array and associated peripheral hardware overhead is the dominant factor, which decreases as the DAC resolution increases and eventually causes a reduction in the overall energy consumed. However, for Fig. 10(c) and (d) that have higher bits/cell (4 and 5 bits respectively) the ADC overhead starts to be significant. As mentioned before, as the DAC resolution for these two cases increases, we have to switch to a higher resolution ADC that adds to the energy consumed and off-sets the improvement due to the reduction of the parallel subarrays and adders. Fig. 11 exhibits an increasing trend of compute time as the DAC resolution and bits/cell decrease. With less bits/cell and DAC resolution, it results in multiple iterations of compute cycle since the number of subarrays is fixed. Due to the energy tradeoff between peripheral units and the ADC (discussed above), the trend for energy dissipation is similar to Fig. 10. Also, it can be noted that the sequential approach consumes higher energy than the parallel approach due to the multiple iterations that are required. The comparison with an SRAM PIM structure has been shown using a dotted line in each of the histograms. The proposed design outperforms SRAM PIM structure in terms of compute time and energy for the majority of design cases, as has been shown.
Figs. 12 and 13 present the latency and energy breakdown of each block in the computation and communication units. In Fig. 7, we present the analysis of the averaged normalized error of the nonuniform sampling algorithm with respect to the ADC bit resolution and the number of bits that a single FerroFET cell can store. Based on this analysis, the normalized error is minimized when the ADC bit resolution is ≥14 bits and number of bits per FerroFET cell is ≥3. With the same system configuration as shown in Fig. 6, a 12-bit DAC, a 14-bit ADC and 3 bits/cell, we calculated the latency  Fig. 12(a), the block that takes the most latency is 14-bit ADC, which has 92% of the total latency [32]. In case of SRAM+ALU PIM, the computation in ALU takes the 63% of the total latency. In Fig. 13(a), communication between the neighboring cores dissipate 62% of the total energy since we use a four-phase handshaking mechanism with Muller-C elements (details in supplement material) whose clock frequency is 1 GHz. In Fig. 13(b), SRAM and its peripherals dissipate the most amount of power because the SRAM size expanded three times compared to the size of FerroFET cells to store all elements of B coefficients from (4). Fig. 14 shows the total power of the computation unit when the number of bits/cell and DAC resolution are varied for the parallel and sequential cases. From both Fig. 14(a) and (b) we observe that power consumption reduces as we increase either the number of bits/cell or the DAC resolution. From this, we conclude that the total power consumed is determined by both the memory subarrays and peripheral logic. As the number of bits/cell or the DAC resolution increase, we observe a reduction in number of S + A array stages and memory subarrays, and this reduction causes an overall reduction in  power. Further when Fig. 14(a) and (b) are compared to each other the parallel computation approach consumes higher power because of the additional memory array and associated peripheral hardware requirements. Fig. 15 shows the total area of the computation unit when the number of bits/cell and the DAC resolution are varied for the parallel and sequential cases. For the parallel computation approach [ Fig. 15(a)], the area is larger than in the sequential approach [ Fig. 15(b)] since the computations are executed in parallel with a higher number of memory subarrays and peripheral blocks. As the DAC resolution and the number of bits/cell increase the total area increases because the memory subarray, S + A and multistage adders required are lesser in number, and they dominate any increase caused by the ADC area. For all the figures the dotted lines show the performance of a corresponding SRAM + ALU Von-Neumann architecture (baseline). Table 2 presents the architectural results of compute time and energy for the baseline, SRAM PIM and FerroFET PIM architectures of 64 cores. FerroFET PIM shows 3× improvement in compute time and 21× improvements in energy efficiency compared to SRAM PIM.

VI. APPLICATIONS
As examples of prototypical problems that can be solved using the proposed algorithm and architecture, we present two applications: 1) signal reconstruction from 1-D EEG signals and 2) recovery of CT Images used in medical imaging.  Typical examples have been shown in Fig. 16(a) and (b). Both the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are shown in Fig. 17. We note that increasing the subspace dimension increases the fidelity of the reconstruction process. This justifies the use of a subspace dimension of 8 × 8 for the current applications in hand. It also shows the power of iterative algorithms in systolic PIM architectures for solving distributed convex optimization.

VII. CONCLUSION
This paper presents a systolic PIM architecture based on analog FerroFet pseudo-crosspoint arrays with in situ computation to enable distributed convex optimization via least square minimization. Key contributions of the paper are as follows.
1) A FerroFET-based differential cell can compute matrix multiplication of both positive and negative numbers. 2) A FerroFET-based PIM architecture for solving a least squares minimization. 3) Development of a complete end-to-end tool chain and demonstration of 21× in energy efficiency and 3× in compute time compared to an SRAM-based PIM architecture.
We demonstrate that cross-bar resistive architectures are not only capable of accelerating machine-learning algorithms, but also distributed optimization in a systolic array. After graduation, he joined IBM India as a Physical Design R&D Engineer to work on the power series and Z-mainframe microprocessors. In this role, he was responsible for the implementation of synthesizable circuits while satisfying timing, power, noise, electromigration, design-for-test, and design-for-manufacturing specifications. His current research interests include the design of low-power digital circuits, low-dropout voltage regulators, and power management for wide dynamic range computation. VIJAYKRISHNAN NARAYANAN (S'97-A'98-M'03-SM'08-F'11) is currently the Robert Noll Chair Professor of Computer Science and Engineering and Electrical Engineering with Pennsylvania State University, State College, PA, USA. He also leads the National Science Foundation Expeditions in Computing Center. He is also a Thrust-Leader of the DARPA/SRC JUMP Center for Brain Inspired Computing. He has mentored more than 100 graduate students and published more than 400 papers. His current research interests include embedded systems, computer architecture, and design automation.

GUS HENRY SMITH
Dr. Narayanan is a fellow of the Association of Computing Machinery.