RM-NTT: An RRAM-Based Compute-in-Memory Number Theoretic Transform Accelerator

As more cloud computing resources are used for machine learning training and inference, privacy-preserving techniques that prevent data from being revealed on cloud platforms are attracting increasing interest. Homomorphic encryption (HE) is one of the most promising techniques for privacy-preserving machine learning because HE allows data to be evaluated in encrypted form. However, deep neural network (DNN) implementations using HE are orders of magnitude slower than plaintext implementations. The use of very long polynomials and the associated number theoretic transform (NTT) operations for polynomial multiplication are the main bottlenecks of practical HE implementations. This article introduces RM-NTT (RRAM number theoretic transform), a resistive random access memory (RRAM)-based compute-in-memory (CIM) system to accelerate NTT and inverse NTT (INTT) operations. Instead of running fast Fourier transform (FFT)-like algorithms, RM-NTT uses a vector-matrix multiplication (VMM) approach to achieve maximal parallelism during NTT and INTT operations. To improve efficiency, RM-NTT stores modified forms of the twiddle factors in the RRAM arrays so that NTT and INTT can be processed in the same arrays, and employs a Montgomery reduction algorithm to convert the VMM results. The proposed optimizations allow RM-NTT to significantly reduce NTT operation latency compared with other NTT accelerators, including both CIM- and non-CIM-based designs. The effects of different RM-NTT design parameters and device nonidealities are also discussed.


I. INTRODUCTION
Homomorphic encryption (HE) is a public-key cryptosystem that enables evaluations over encrypted data [1], [2], [3], [4], [5]. As shown in Fig. 1, a client can send its confidential data in encrypted form to outsourced cloud resources, and the data can remain encrypted during the evaluations. This strong privacy-preserving feature makes HE one of the most promising candidates for secure machine learning: it allows private data from multiple parties to participate in training and inference, even on a third-party cloud platform, without losing privacy [6], [7], [8]. The security of the most widely used HE schemes, such as BFV [4] and CKKS [5], is based on the hardness of the ring learning with errors (RLWE) problem, built upon polynomial rings over a finite integer field. However, data-intensive computations over the very long plaintext and ciphertext polynomials come with significant computational overhead. For example, MNIST data in encrypted form requires 120× more memory than in unencrypted form (570× more if one considers compression techniques on the unencrypted data) [9]. This data intensity limits the computation speed of HE, especially when the data have to be moved in and out of memory. In addition, even the most advanced HE machine learning inference engines are orders of magnitude slower than inference over unencrypted data [10], [11], a significant gap that must be overcome to make HE feasible in real applications.
The number theoretic transform (NTT) is used extensively in HE to accelerate polynomial multiplications. When two polynomials are transformed into the NTT domain, their multiplication becomes an elementwise multiplication with O(n) time complexity, whereas schoolbook polynomial multiplication requires O(n²). However, due to the long polynomials used, the NTT conversion itself becomes a major factor limiting HE implementation speed. For example, a recent study found that NTT consumes 51% of the ciphertext multiplication execution time and 55% of the overall HE ResNet50 inference time [12], [13]. Accelerating the NTT conversion thus becomes the most important task for HE acceleration.
Many NTT accelerators target fast Fourier transform (FFT)-like algorithms for the NTT operation. Such an algorithm consists of n/2 modular multiplications per stage and log₂ n stages, reducing the time complexity of polynomial multiplication to O(n log₂ n). Various platforms, including FPGAs and ASICs, have been developed to speed up the modular arithmetic and the complex data transfer between stages used in the FFT-like algorithms [14], [15], [16], [17], [18], [19]. Although the optimized processing units can perform the modular arithmetic in one cycle, a large local memory needs to be placed near the processing units and accessed frequently for the operands. Like most data-intensive computations, the NTT suffers from the high energy consumption and long delays associated with the large data movement between the processor and off-chip/on-chip memory (the memory bottleneck) [20], [21].
Compute-in-memory (CIM) is a technique that enables computations to be executed in parallel inside the memory where the data are stored [22]. CIM can therefore eliminate frequent data accesses and transfers, potentially leading to significantly improved throughput and energy efficiency [23], [24]. Recently, CIM-based NTT accelerators have been proposed using FFT-like algorithms [14], [17]. Theoretically, the modular arithmetic operations can be computed in parallel over the memory arrays. However, the log₂ n stages originating from the FFT-like algorithms still need to be processed serially, leading to long execution times. Here, we propose RM-NTT (RRAM number theoretic transform), a CIM NTT accelerator based on resistive random access memory (RRAM) or similar technologies that uses a vector-matrix multiplication (VMM) approach to compute the NTT. RM-NTT stores the twiddle factors, which are constant for given NTT parameters, in 1-transistor-1-RRAM (1T1R) arrays and performs the NTT operations fully in parallel (without internal stages), leading to much shorter latency than the FFT-based approaches. Fig. 2 shows one example of RM-NTT usage in the polynomial multiplication process. To ensure accuracy, we choose a bit-serial representation of the input polynomials. To further improve latency and reduce the number of required operations, we employ a modified Montgomery modular reduction algorithm for the modulo reduction on the VMM outputs and develop a technique to process NTT and inverse NTT (INTT) in the same 1T1R arrays. Finally, the proposed RM-NTT system performance is evaluated in the TSMC 28-nm HPC+ process and benchmarked against the state of the art, and prerequisite RRAM device specifications for proper RM-NTT utilization are suggested.

II. BACKGROUND AND RELATED WORKS
This section introduces the mathematical structures of RLWE and NTT/INTT, and reviews related works. The same notation is used throughout the rest of this article unless otherwise noted.

A. RING LEARNING WITH ERRORS
Let the integer n be a power of two representing the length of the polynomials. R_q = Z_q[X]/(X^n + 1) denotes a polynomial ring where the degree of each polynomial is at most n − 1 and all coefficients lie in the finite integer field Z_q = {0, 1, . . . , q − 1}, where q is a large prime number [25]. In other words, polynomial arithmetic (e.g., addition and multiplication) in R_q must be followed by modulo (X^n + 1) and modulo q operations. In the RLWE setting, with a given secret key in the form of a polynomial s, it is computationally hard to distinguish noisy polynomial pairs (a, a · s + e) from uniformly random and independent polynomial pairs in R_q, where a is uniformly sampled from R_q, s is typically sampled uniformly from {−1, 0, 1} for efficiency reasons, and e is chosen from an error distribution χ (e.g., a discrete Gaussian distribution over the ring) [26], [27]. The security of most HE schemes, such as BFV [4] and CKKS [5], relies on the hardness of RLWE.
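As an illustration, the ring arithmetic and an RLWE sample can be sketched in Python. This is a toy example with hypothetical parameters (n = 8, q = 17), far below the sizes used in real HE schemes, and with a uniform ternary stand-in for the Gaussian error distribution:

```python
import random

random.seed(1)  # reproducible toy example

def poly_mul_mod(a, b, q, n):
    """Schoolbook multiplication in R_q = Z_q[X]/(X^n + 1):
    the wrap-around past degree n - 1 picks up a minus sign."""
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % q
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % q
    return res

n, q = 8, 17  # toy parameters; practical HE uses n >= 2**10 and much larger q
a = [random.randrange(q) for _ in range(n)]             # uniform public polynomial
s = [random.choice([-1, 0, 1]) % q for _ in range(n)]   # ternary secret key
e = [random.choice([-1, 0, 1]) % q for _ in range(n)]   # small error term
b = [(x + y) % q for x, y in zip(poly_mul_mod(a, s, q, n), e)]  # RLWE pair (a, b)
```

Recovering e from (a, b) with the secret s is trivial (b − a·s = e), while distinguishing (a, b) from a uniform pair without s is the hard problem.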

B. NUMBER THEORETIC TRANSFORM
NTT is a variant of the DFT defined over Z_q instead of the complex numbers. For a given a = Σ_{i=0}^{n−1} a_i X^i, the n-point NTT is defined as ã_i = Σ_{j=0}^{n−1} w^{i×j} a_j mod q, where w is a primitive nth root of unity in Z_q and w^{i×j} is termed a twiddle factor. The INTT is defined similarly, except for an additional multiplication by n^{−1} and a minus sign in the exponent of the twiddle factors: a_i = n^{−1} Σ_{j=0}^{n−1} w^{−i×j} ã_j mod q. Note that n^{−1} and w^{−1} are the modular inverses of n and w in Z_q, respectively. Fig. 2 shows the VMM form of NTT.
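These definitions translate directly into the VMM form used by RM-NTT. The following Python sketch (with toy parameters n = 8, q = 17, and w = 9, a primitive 8th root of unity mod 17) computes NTT and INTT as explicit O(n²) vector-matrix products and checks the round trip:

```python
def ntt(a, w, q):
    """Direct VMM-form NTT: a_hat[i] = sum_j w^(i*j) * a[j] mod q."""
    n = len(a)
    return [sum(pow(w, i * j, q) * a[j] for j in range(n)) % q for i in range(n)]

def intt(a_hat, w, q):
    """Direct VMM-form INTT using the modular inverses of w and n."""
    n = len(a_hat)
    n_inv, w_inv = pow(n, -1, q), pow(w, -1, q)
    return [n_inv * sum(pow(w_inv, i * j, q) * a_hat[j] for j in range(n)) % q
            for i in range(n)]

n, q, w = 8, 17, 9   # q ≡ 1 (mod n); 9 is a primitive 8th root of unity mod 17
a = [1, 2, 3, 4, 0, 0, 0, 0]
assert intt(ntt(a, w, q), w, q) == a   # round trip recovers the input
```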
Like the DFT, the NTT can also be computed by FFT-like algorithms, such as the Cooley-Tukey (CT) [28] and Gentleman-Sande (GS) [29] butterflies. The CT butterfly takes a bit-reversed input and produces a normal-ordered output, whereas the GS butterfly takes a normal-ordered input and produces a bit-reversed output. A butterfly in either algorithm performs modular arithmetic (addition, subtraction, and multiplication) on two inputs and produces two outputs. The CT algorithm schematic is shown in Fig. 3 for n = 8. It has three stages in series, where each stage has four butterflies.
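For comparison with the VMM form, a minimal iterative CT-style NTT can be sketched as follows (Python, with toy parameters n = 8, q = 17, w = 9; the bit-reversal step and the log₂ n serial stages are explicit):

```python
def bit_reverse(a):
    """Reorder a length-2^m list into bit-reversed index order."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

def ntt_ct(a, w, q):
    """Iterative Cooley-Tukey NTT: log2(n) serial stages of n/2 butterflies each."""
    a = bit_reverse(a)
    n = len(a)
    length = 2
    while length <= n:                        # one pass per stage
        w_len = pow(w, n // length, q)        # twiddle step for this stage
        for start in range(0, n, length):
            tw = 1
            for j in range(start, start + length // 2):
                u, v = a[j], a[j + length // 2] * tw % q
                a[j], a[j + length // 2] = (u + v) % q, (u - v) % q
                tw = tw * w_len % q
        length *= 2
    return a
```

Each stage depends on the previous one, which is exactly the serial dependency that CIM implementations of FFT-like algorithms cannot remove.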

C. RELATED WORKS
In prior works that accelerate NTT using the FFT-like algorithms, the butterfly architectures and the data transfer schemes (to other butterflies in the next stage) have been the targets of optimization on different platforms. For example, Zhang et al. [19] proposed a five-stage pipelined butterfly unit supporting fast modular multiplications and a ping-pong memory access method on an FPGA platform. The modular multiplication is converted into modular addition by calculating intermediate values through the pipeline stages. Two dual-port memory blocks on the FPGA allow the butterfly unit to read two coefficients from memory and write back two calculated results to memory in one cycle. Similarly, Banerjee et al. [18] implemented a ping-pong memory access method using single-port SRAMs with two memory banks in an ASIC implementation, where each memory bank has four single-port SRAMs. Two of the single-port SRAMs read two coefficients from one memory bank, and the other two single-port SRAMs in the other memory bank write two outputs in one cycle to minimize data transfer latency.
Recently, NTT accelerators based on CIM systems have also been attempted. Bit-serial modular addition, subtraction, and multiplication using 6T-SRAM arrays and near-memory logic circuits are implemented in [17]. A bit-serial comparator attached to each column compares the temporal result with q to keep the result within [0, q). During the NTT operation, an intermediate result is stored below the two operands in the same column of the SRAM arrays, and it must be bit-serially transferred to the operand area (either the same column or another column) for the next-stage butterfly operation. Although this approach can execute n/2 butterflies in parallel using the SRAM arrays, the FFT-like algorithm requires log₂ n serial stages that cannot be parallelized. Nejatollahi et al. [14] utilized in-memory logic operations using RRAM to implement the GS butterfly and reduction algorithms. The use of an unfolded structure achieves high throughput with a relatively small area overhead compared to other memory technologies, owing to the high area density of RRAM devices. However, frequently programming the intermediate results into the RRAM cells for the next stage and changing the output RRAM conductance for the logic operations may cause endurance problems in the RRAM devices. In addition, the RRAM write operation generally requires longer latency and much higher energy consumption than the RRAM read operation.

III. RM-NTT ARCHITECTURE AND OPERATION FLOW
Instead of using the FFT-like algorithm, we aim to perform NTT/INTT operations fully in parallel using a VMM approach in the proposed RM-NTT architecture. This is feasible because the twiddle factor matrix is constant for given NTT parameters (n, q). We additionally propose a technique to implement NTT and INTT operations in the same 1T1R arrays and introduce a modified Montgomery reduction algorithm that executes the modular reduction on the VMM results after all analog multiply-accumulate (MAC) operations, reducing the number of required modulo operations.

A. RM-NTT ARCHITECTURE
The overall architecture of RM-NTT is shown in Fig. 4, consisting of a global buffer and 64 tiles, where the global buffer transfers inputs to all tiles and collects all outputs from the tiles. All tiles share the same input and perform complete NTT/INTT in parallel. Each tile includes four CIM processing elements (PEs) that perform bit-serial MAC operations in parallel. Outputs of the four PEs correspond to partial sums of the VMM and are further processed by digital circuits (an adder and Montgomery reduction unit) in the tile. The inputs and outputs of the tile have the same k-bit width, where k is the bit-width of q.
Each PE consists of a 128 × 128 array of 1T1R cells connected to wordline (WL), bitline (BL), and sourceline (SL) drivers. The array size was chosen to balance performance and accuracy: throughput improves with larger arrays, but the effects of device variation and parasitic nonidealities (e.g., wire resistance) also increase with array size, leading to larger output errors. Larger arrays may be used to produce even better performance if those effects can be efficiently controlled through device and array operating algorithm optimizations. Each WL controls the 128 transistor gates in a row, and each BL is connected to the source electrodes of the 128 transistors in a column to accumulate the output current from the cells on the same BL. The accumulated current is sampled by a sample-and-hold (S&H) circuit at the end of the BL. The SL runs parallel to the BL and supplies a write- or read-voltage pulse to the top electrodes of the 128 RRAM cells along the column. The total RRAM capacity in RM-NTT is 4 Mb. Fig. 5 shows the details of the CIM microarchitecture. Considering the large area of high-precision analog-to-digital converters (ADCs), we chose to share one ADC across eight BLs in a time-multiplexed scheme, i.e., the S&H outputs of eight BLs are connected to an 8-to-1 MUX in the BL driver and read out serially per input injection. One such subarray with a shared ADC is called a BL group. A k-bit input requires k input injections to the array for the bit-serial operation. Even though multilevel cells or analog inputs can, in general, improve system power and latency, we chose single-level cells and bit-serial inputs to minimize computation error.

B. TWIDDLE FACTOR MATRIX MAPPING
Twiddle factors are stored in the RRAM arrays as conductance values. Since the twiddle factor matrix of the NTT is constant for given parameters (n, q) regardless of the inputs, this approach avoids constant RRAM writes and implements parallel MAC operations through Kirchhoff's current law with the applied input pulses. Unlike the FFT-like algorithm-based CIM designs [14], [17], which need to program the current-stage outputs into the memory arrays for the next-stage operations, the VMM approach enables RM-NTT to process the NTT without programming any intermediate values into the RRAM arrays.
To improve reliability, RM-NTT uses only single-bit RRAMs that have two resistance states: the low-resistance state (LRS) and the high-resistance state (HRS). One k-bit twiddle factor is programmed into k 1-bit RRAM cells on the same WL, from the most significant bit (MSB) to the least significant bit (LSB). Specifically, we decompose a twiddle factor as w^{i×j} = Σ_{c=0}^{k−1} w_c^{i×j} · 2^c, where each bit w_c^{i×j} is either 0 or 1 and c denotes the bit significance. When w_c^{i×j} is 1 (0), the corresponding RRAM cell is programmed to the LRS (HRS). Fig. 5 shows an example of the mapped twiddle factors for k = 14. In RM-NTT, two BL groups (16 BLs) are tied together to store twiddle factors up to 16 bits long.
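The bit decomposition can be illustrated with a short Python sketch (the helper name and the example value are ours, not from the design):

```python
def to_bitplanes(value, k):
    """Split a k-bit twiddle factor into its k bits, MSB first.
    Bit c (with weight 2^c) maps to one RRAM cell: 1 -> LRS, 0 -> HRS."""
    return [(value >> c) & 1 for c in reversed(range(k))]

k = 14
tf = 12345                      # an example 14-bit twiddle factor value
planes = to_bitplanes(tf, k)    # planes[0] is the MSB cell state
# recombining the bit planes with their significances recovers the value
assert sum(bit << c for bit, c in zip(planes, reversed(range(k)))) == tf
```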

C. NUMBER THEORETIC TRANSFORM OPERATION
The NTT operations are implemented by VMM operations performed in a bit-serial manner to ensure full precision. During the k cycles, the SL drivers apply V_read to the mapped columns of the array, while the WL drivers apply pulses corresponding to the input bit values (V_DD for ''1'' and GND for ''0'') to the mapped rows. Let b denote the input bit index and c an output column; the output current can then be written as I_{i,b,c} = Σ_{j=0}^{n_WL−1} V_{j,b} · G_{j,c}, where V_{j,b} is either V_read or 0, G_{j,c} is the conductance of the RRAM cell at row j and column c, n_WL is the number of WLs, and i is the NTT output index.
The WL and SL voltage driving time includes a precharging step (T_PC) and a sampling step by the S&H circuitry (T_S). The S&H circuits convert the output current to a voltage during T_S and hold that voltage until the ADC accesses it. Fig. 5 shows an example of one PE operation during T_S. After the ADC operation, the quantized digital value is left-shifted by (b + c) bits. The two shifted temporal values of the two BL groups are then added to the existing partial sum in the 64-bit register, and the addition result is stored back in the same 64-bit register for the next step. When the bit-serial input operation is completed, an adder in the tile accumulates the final partial sums of the four PEs, followed by the Montgomery reduction to complete the NTT operation. The 64-bit registers in the PEs are then cleared. The described VMM process can be performed in all tiles fully in parallel.
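The shift-and-add flow above can be emulated in software. The sketch below (our own functional model, not the RTL) performs one binary VMM per (input bit b, weight column c) pair, left-shifts the dot product by b + c, and accumulates, reproducing the full-precision integer VMM:

```python
def bit_serial_vmm(x, W, k):
    """Functional model of the bit-serial MAC: each pass applies one input
    bit plane to one weight bit column and accumulates with shift b + c."""
    out = [0] * len(W)
    for b in range(k):                           # bit-serial input injections
        xb = [(v >> b) & 1 for v in x]           # WL pulse pattern for bit b
        for c in range(k):                       # weight bit columns
            for i, row in enumerate(W):
                dot = sum(((wv >> c) & 1) * xv for wv, xv in zip(row, xb))
                out[i] += dot << (b + c)         # shift by combined significance
    return out

x = [3, 5, 7]
W = [[1, 2, 3], [4, 5, 6]]
assert bit_serial_vmm(x, W, 4) == [34, 79]      # matches the integer VMM
```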
With the 64 tiles, we can map twiddle factors up to n = 512 and k = 16. However, we can exploit the symmetry property of the twiddle factor (w^i = w^{i+n}) and employ a technique used in the FFT algorithms, dividing the input index into even and odd parts, to execute the NTT operation up to n = 1024. Specifically, splitting the inputs into even- and odd-indexed parts, the NTT equation can be written as

ã_i = Σ_{j=0}^{n/2−1} w^{i(2j)} a_{2j} + Σ_{j=0}^{n/2−1} w^{i(2j+1)} a_{2j+1}
ã_{i+n/2} = Σ_{j=0}^{n/2−1} w^{i(2j)} a_{2j} − Σ_{j=0}^{n/2−1} w^{i(2j+1)} a_{2j+1}     (1)

for i = 0, . . . , n/2 − 1, using w^{n/2} = −1. Note that the first and second halves of the even and odd inputs are multiplied by the same twiddle factor matrices, and ã_i and ã_{i+n/2} use the same VMM results. This means that the actual VMM computations only need to be performed with the two twiddle factor matrices w^{i(2j)} and w^{i(2j+1)}, where i runs from 0 to n/2 − 1 and j from 0 to n/4 − 1. As shown in Fig. 6, two PEs are designed to process the w^{i(2j)} twiddle factor matrices and the other two PEs process the w^{i(2j+1)} matrices. VMMs using the first halves of the even and odd inputs [Fig. 6(a)] and the second halves [Fig. 6(b)] are processed by the four PEs. The computation results stored in the 64-bit registers of the four PEs are then added and subtracted to obtain ã_i and ã_{i+n/2}, respectively. Therefore, the design can process NTT/INTT operations up to n = 1024, doubling the limit set by the RM-NTT memory size.
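The even/odd folding can be checked with a short sketch: one (n/2)-wide VMM over the even inputs and one over the odd inputs yield both ã_i and ã_{i+n/2} via one addition and one subtraction (Python, toy parameters n = 8, q = 17, w = 9):

```python
def split_ntt(a, w, q):
    """n-point NTT from two half-size VMMs: the even/odd partial sums are
    shared between outputs i and i + n/2 (using w^(n/2) = -1 mod q)."""
    n, half = len(a), len(a) // 2
    out = [0] * n
    for i in range(half):
        even = sum(pow(w, i * 2 * j, q) * a[2 * j] for j in range(half)) % q
        odd = sum(pow(w, i * (2 * j + 1), q) * a[2 * j + 1] for j in range(half)) % q
        out[i] = (even + odd) % q         # add for the first half of the outputs
        out[i + half] = (even - odd) % q  # subtract for the second half
    return out
```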

D. INVERSE NUMBER THEORETIC TRANSFORM OPERATION
We note that the same tiles can be used for both NTT and INTT by changing the INTT output index and the twiddle factor matrix. Specifically, we rewrite the exponents of the twiddle factors in the INTT equation as a_i = n^{−1} Σ_{j=0}^{n−1} w^{(n−i)×j} ã_j mod q, since the twiddle factor w^{−i×j} is identical to w^{(n−i)×j} by the symmetry property. Substituting i' = n − i, the INTT equation becomes a_{n−i'} = n^{−1} Σ_{j=0}^{n−1} w^{i'×j} ã_j mod q, where a_n = a_0. The INTT output order is thus reversed except for the first element. Another difference between NTT and INTT is the multiplication by n^{−1}. Instead of multiplying the VMM results by n^{−1}, RM-NTT multiplies the twiddle factors by n^{−1} in advance, storing n^{−1} w^{i×j}. The NTT operation consequently needs to multiply the VMM results by n, which is more efficient because multiplying by n (a power of 2) is a left-shift operation.
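The index-reversal trick can be verified with a Python sketch that computes the INTT using only the forward twiddle matrix w^{i×j} (toy parameters n = 8, q = 17, w = 9; here the n^{−1} scaling is applied explicitly rather than pre-folded into the stored factors):

```python
def intt_from_forward_matrix(a_hat, w, q):
    """INTT via the forward twiddle matrix: compute f[i'] = sum_j w^(i'*j) * a_hat[j]
    mod q, then reverse the output order (except element 0) and scale by n^-1."""
    n = len(a_hat)
    fwd = [sum(pow(w, i * j, q) * a_hat[j] for j in range(n)) % q for i in range(n)]
    n_inv = pow(n, -1, q)
    out = [0] * n
    out[0] = n_inv * fwd[0] % q          # a_0 = a_n wraps to position 0
    for i in range(1, n):
        out[n - i] = n_inv * fwd[i] % q  # reversed order for the remaining outputs
    return out
```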

E. MODIFIED MONTGOMERY REDUCTION ALGORITHM
Since the general modular reduction algorithm assumes a two-integer multiplication as its operand, we modify the algorithm to accept a VMM result (a sum of n two-integer multiplications) instead. Specifically, RM-NTT adopts and modifies the Montgomery reduction algorithm because its design parameter r allows the operand to be larger than a two-integer multiplication. The modified Montgomery reduction algorithm for the VMM approach is described in Algorithm 1. The Montgomery reduction algorithm requires the operand T to be in the Montgomery space. RM-NTT pretransfers the twiddle factors to the Montgomery space such that w̃^{i×j} = n^{−1} w^{i×j} r mod q, since they are constant for the given (n, q). The VMM result T is then already in the Montgomery space. When n = 1024, T can be negative due to the subtractions in (1). If T is negative, nq²/2 is added to make it positive, since adding a multiple of q does not change the modulo result. After the multiplication by n for the NTT operation (described in Section III-C), T lies in the range [0, n²q²]. To minimize the logic operations, we choose r = n² 2^k, which ensures that only one ''if-else'' step is needed for the reduced value to lie in the correct range of [0, q). The overall area overhead of the circuit implementing Algorithm 1 is estimated to be 1% of the total tile area, and the dominant area unit remains the ADC block. The area estimation is obtained as described in Section IV.

Algorithm 1: Modified Montgomery Reduction Algorithm for NTT and INTT
Input: An integer T = Σ_{i=0}^{n−1} a_i w̃_i, where a_i, w_i ∈ Z_q and w̃ = w n^{−1} r mod q
Output: An integer t = T n r^{−1} mod q (for NTT) or t = T r^{−1} mod q (for INTT)
1: 2^{k−1} < q < 2^k and r = n² 2^k
2: q' is an integer such that q' = (r r^{−1} − 1)/q
3: m ← (T mod r) · q' mod r
4: t ← (T + m q)/r
5: if t ≥ q then t ← t − q

VOLUME 8, NO. 2, DECEMBER 2022
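A functional Python model of the reduction (the INTT form t = T r^{−1} mod q; the helper name and the toy parameters are ours) shows why r = n² 2^k needs only one conditional subtraction for T in [0, n²q²):

```python
def montgomery_reduce(T, q, k, n):
    """Modified Montgomery reduction for a VMM result T in [0, n^2 * q^2):
    returns T * r^-1 mod q with r = n^2 * 2^k, using one conditional subtraction."""
    r = n * n << k                              # r = n^2 * 2^k, a power of two
    q_prime = (r * pow(r, -1, q) - 1) // q      # q * q' ≡ -1 (mod r)
    m = (T % r) * q_prime % r                   # make T + m*q divisible by r
    t = (T + m * q) >> (r.bit_length() - 1)     # exact division by r is a shift
    return t - q if t >= q else t               # single ''if-else'' correction
```

Since T < n²q² < rq (because q < 2^k), the quotient t = (T + mq)/r is below 2q, so the single subtraction suffices.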

IV. RESULTS AND DISCUSSION
We evaluate the RM-NTT performance with the TSMC 28-nm HPC+ process design kit (PDK), except for the ADC block, whose values are obtained from reported results [30]. The digital circuits are implemented in SystemVerilog and synthesized using Synopsys Design Compiler. The RRAM devices are assumed to have a 1000 ON/OFF ratio and a 30 µS LRS conductance. With a 0.1 V read pulse, an RRAM device thus produces 3 µA in the LRS and 0.003 µA in the HRS. The system clock frequency is 400 MHz. T_PC and T_S are 10 and 2.5 ns, respectively. The costs of output index reversal during INTT and of input index modification are assumed to be low and are not included in the performance estimation. Table 1 shows the performance estimates of RM-NTT along with other state-of-the-art NTT accelerators for comparison. All columns of the BL groups are used for the k = 16 case, whereas seven out of the eight columns in a BL group are programmed for the k = 14 case. The latency of the 16-bit cases is therefore longer than that of the 14-bit cases because RM-NTT needs to operate one more column during ADC time multiplexing. Fig. 7 shows the modified twiddle factors that are programmed in the xth tile for the n = 512 and 1024 cases. For shorter polynomials, e.g., the n = 256 case, duplicate copies of the twiddle factor matrices can also be stored to further reduce the latency, with each copy taking half of the bit-serial input and thus halving the latency of the bit-serial operation. Fig. 8 shows the energy breakdown of the different modules in RM-NTT, along with the energy and latency trends for different values of n.

A. COMPARISON WITH NON-CIM PLATFORMS
RM-NTT outperforms other implementations in terms of latency due to its parallel operations. For the (1024, 14) parameter case, it achieves 21× lower latency than the field-programmable gate array (FPGA) design with double-bandwidth memory access and pipelined modular arithmetic units [19]. Compared to the ASIC design of [18], which uses the (512, 14) parameters, RM-NTT achieves 71× faster NTT computations, whereas the consumed energy is 1.4× larger. In addition, the latency of non-CIM designs tends to increase superlinearly, i.e., more than doubling when n is doubled. This is principally because not all butterflies are operated in parallel in every stage and the number of internal stages also increases with n.

B. COMPARISON WITH OTHER CIM PLATFORMS
The approach based on RRAM in-memory logic in [14] requires more energy than the VMM approach used in RM-NTT because switching the output RRAM devices for logic operations consumes much more energy than the read process. In addition, transferring the intermediate data to the next-stage RRAM arrays also requires an expensive switching process. RM-NTT shows 64× lower latency and 7.4× better energy efficiency than [14] for the (1024, 16) parameters, showing that VMM operations with constant RRAM arrays are more suitable for RRAM-based CIM NTT acceleration. Compared with the SRAM-based CIM design of [17], RM-NTT shows 26× lower latency and 1.6× larger energy consumption for the (1024, 14) parameters. Although the FFT-like designs on CIM platforms can operate all butterflies in parallel within each stage, the log₂ n stages must be executed in series, leading to longer latency than the VMM approach.

C. LATENCY AND ENERGY ANALYSIS
The ADC is typically the dominant source of power consumption in CIM designs that exploit VMM operations [31], [32], [33]. Since ADC power increases with resolution, a general trend in RRAM-based CIM designs is to explore low-resolution ADC operation [34]. This approach is suitable for neural network applications because the networks offer inherent error robustness. However, HE systems in general cannot tolerate computing errors, and high-precision ADCs covering the full precision of the VMM have to be used. In our case, RM-NTT uses SAR ADCs for low-power operation. The ADC power is proportional to 2^N/(N + 1), where N is the ADC bit resolution [35]. Theoretically, a (log₂(n_WL) + 1)-bit ADC is needed to cover all n_WL + 1 output levels [36]. However, the ''+1'' bit corresponds to the single case in which all input and twiddle factor bit values are ''1,'' which can be handled with a 1-bit comparator. As a result, our design uses log₂(n_WL)-bit ADCs along with a 1-bit comparator to handle all output cases.
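The resolution saving can be quantified with the stated power model (a back-of-the-envelope sketch; the function names are ours):

```python
from math import log2

def adc_bits(n_wl):
    """ADC resolution used in RM-NTT: log2(n_WL) bits; the single all-ones
    output level is handled separately by a 1-bit comparator."""
    return int(log2(n_wl))

def rel_adc_power(N):
    """Relative SAR ADC power, proportional to 2^N / (N + 1)."""
    return 2 ** N / (N + 1)

# shaving the ''+1'' bit: 8-bit vs. 7-bit ADC for n_WL = 128
saving = rel_adc_power(adc_bits(128) + 1) / rel_adc_power(adc_bits(128))  # ≈ 1.78x
```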
One approach to accommodate lower resolution ADCs is to enable fewer rows at a time, at the cost of longer operation cycles. When n_en WLs are enabled in one cycle (e.g., n_en = 64) instead of n_WL = 128, the latency increases by n_WL/n_en times. Fig. 9 shows the trends of the required cycles and the ADC energy consumption for full-precision VMM operations with different n_en. The results in Fig. 9 clearly suggest that enabling all rows per cycle is the better choice from the latency and energy consumption points of view. In practice, however, n_en is limited by the RRAM device ON/OFF ratio (HRS/LRS) and conductance variation effects. n_en should be smaller than the ON/OFF ratio to separate level 0 (all n_en RRAMs in the HRS) from level 1 (only one RRAM in the LRS) in the worst case of all active inputs.

D. ANALYSIS ON DEVICE VARIATION
During the RRAM programming process, the cell conductance can deviate from its target value due to the stochastic switching behavior of RRAM [37], [38]. We assume that the RRAM device conductance follows a Gaussian distribution [39], [40], [41]. Since NTT cannot tolerate bit errors, it requires tighter conductance variation control than the conventional VMM operations used for neural networks (such as those discussed in [39]). During VMM operations, the variation effects of the n_en cells on a BL accumulate. For instance, Fig. 10(a) shows examples of the output current distribution when i of the n_en cells on the BL are in the LRS. Note that 0 to i LRS cells can produce ON-current and 0 to (n_en − i) HRS cells can produce OFF-current under unknown random inputs. Given the device parameters, including the ON/OFF ratio and the device variation (σ/µ), we identify the optimal ADC quantization level location [x in Fig. 10(a)] that minimizes the probability of an ADC quantization error. An error occurs when the output current distribution crosses the expected ADC quantization level [red lines in Fig. 10(a)], and a constant quantization level location must be used for all i cases (from 0 to n_en) for a given ADC design. For example, with the device parameters µ_ON = 3 µA, ON/OFF ratio = 1000, σ/µ = 0.01, and n_en = 128, the optimal x is 1.5 µA for all i cases, as shown in Fig. 10(b). The black and green lines in Fig. 10(b) denote the worst-case quantization error over all i cases originating from the black and green distributions shown in Fig. 10(a), respectively. With the optimal ADC quantization level (x = 1.5 µA), the probability of computing errors due to device nonidealities then depends on the number of cells in the LRS, as shown in Fig. 10(c).
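Under the stated Gaussian assumption, the per-readout error probability can be approximated with a Gaussian tail bound. The sketch below is a simplified model we constructed (it treats the decision boundaries as symmetric at ±x around the nominal level and ignores the mean shift from the OFF-current); it reproduces the qualitative trend of Fig. 10(c), where the error probability grows with the number of LRS cells:

```python
from math import erfc, sqrt

def quant_error_prob(i, n_en, mu_on=3e-6, on_off=1000, sigma_over_mu=0.01, x=1.5e-6):
    """Approximate probability that the BL current for i LRS cells (of n_en
    active cells) crosses an ADC decision boundary placed x away from the
    nominal level; per-cell conductance ~ Gaussian with relative spread sigma/mu."""
    mu_off = mu_on / on_off
    # variances of independent per-cell currents add along the BL
    var = i * (sigma_over_mu * mu_on) ** 2 + (n_en - i) * (sigma_over_mu * mu_off) ** 2
    if var == 0:
        return 0.0
    # two-sided Gaussian tail beyond the +/- x boundaries
    return erfc(x / (sqrt(var) * sqrt(2)))
```

With µ_ON = 3 µA, an ON/OFF ratio of 1000, σ/µ = 0.01, and x = 1.5 µA, the modeled error probability at i = n_en = 128 is far larger than at i = 64, consistent with the trend in Fig. 10(c).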
A smaller n_en narrows the output current distribution, reducing the probability of computing errors originating from the device nonidealities. Fig. 10(d) shows the probability of computing errors over all VMM operations after programming the RRAM arrays for the (n, k) = (1024, 14) NTT operation with different device parameters. The computing error probability decreases from 1.9E−3 to 2.06E−8 when n_en is reduced from 128 to 64. Blue blanks represent probabilities lower than 1E−10, and red blanks represent a probability of 1. As described earlier, cases with ON/OFF ratio < n_en are not considered and are denoted by × and gray in Fig. 10(d). One can choose the proper RM-NTT operating parameter n_en from Fig. 10(d) given the RRAM properties. The RM-NTT latency values in Table 1 are obtained with ON/OFF ratio = 1000, σ/µ = 0.01, and n_en = 128. Further device optimizations and better write-verify programming schemes that minimize device variability can lower the error probability and allow a larger n_en, with better latency and lower energy consumption.

V. CONCLUSION
In this article, we introduce RM-NTT, an NTT accelerator using an RRAM-based CIM design that significantly reduces NTT latency. RM-NTT stores the modified twiddle factor matrices in the RRAM arrays and uses the VMM method to compute NTT/INTT operations fully in parallel. By modifying the INTT output index and the twiddle factors in advance, NTT and INTT operations can be performed on the same RRAM arrays. In addition, RM-NTT adopts a modified Montgomery reduction algorithm that executes the modular reduction on the VMM results after all analog MAC operations. In performance estimations, RM-NTT achieves much lower NTT operation latency than other NTT accelerators, including other CIM-based designs. We further investigate the impact of design parameters and device nonidealities on latency, power, and computing error, and provide guidance on how to operate RM-NTT with given device parameters.