FPGA-Based Hardware Accelerator for Leveled Ring-LWE Fully Homomorphic Encryption

Fully homomorphic encryption (FHE) allows arbitrary computation on encrypted data and has great potential in privacy-preserving cloud computing and the secure outsourcing of computational tasks. However, its excessive computational complexity is the key limitation restricting the practical application of FHE. In this paper we propose an FPGA-based highly parallel architecture to accelerate FHE schemes based on the ring learning with errors (RLWE) problem; specifically, we present a fast implementation of the leveled fully homomorphic encryption scheme BGV. In order to reduce computation latency and improve performance, we apply both circuit-level and block-level pipeline strategies to raise the clock frequency and, as a result, the processing speed of the polynomial multipliers and homomorphic evaluation functions. At the same time, multiple polynomial multipliers and modular reduction units are deployed in parallel to further improve hardware performance. Finally, we implemented and tested our architecture on a Virtex UltraScale FPGA platform. Running at 150MHz, our implementation achieved <inline-formula> <tex-math notation="LaTeX">$4.60\times \sim 9.49\times $ </tex-math></inline-formula> speedup for homomorphic encryption and decryption with respect to an optimized software implementation on an Intel i7 processor running at 3.1GHz, and the throughput was increased by <inline-formula> <tex-math notation="LaTeX">$1.03\times \sim 4.64\times $ </tex-math></inline-formula> compared to a previous hardware implementation of BGV. Compared to a hardware implementation of FV, the throughput of our accelerator also achieved <inline-formula> <tex-math notation="LaTeX">$5.05\times $ </tex-math></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$167.3\times $ </tex-math></inline-formula> speedup for homomorphic addition and homomorphic multiplication respectively.


I. INTRODUCTION
Fully homomorphic encryption (FHE) [1] provides a theoretical and practical solution for cloud computing security and privacy preservation: it can perform arbitrary computations directly over ciphertext without disclosing personal sensitive information. Concretely, users can encrypt their data and upload it in ciphertext form to the cloud server, which then performs computations on the encrypted data (hidden from the cloud owner). Unless the private key of the FHE scheme is obtained, no one can recover the plaintext. Because FHE has the property that any computation on ciphertext is equivalent to performing the same computation on plaintext, users can obtain the final result by decrypting the ciphertext with their private keys. Some interesting applications of FHE include: secure outsourcing of matching computation on genomic data [2], private information retrieval (PIR) and privacy-preserving data mining (PPDM) [3], multi-party computation (MPC) and privacy-preserving prediction from consumption data in smart electronic instrumentation [4], and training neural networks over encrypted data [5], [6].
(The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Huo.)
The concept of FHE was first introduced by Rivest, Adleman, and Dertouzos in 1978 [7]. Constructing FHE schemes, however, proved to be a difficult problem that remained open until Gentry proposed the first FHE scheme based on ideal lattices in 2009 [8]. Despite this groundbreaking work, Gentry's scheme was not practical because of its low performance. Since then, many researchers and cryptographers have introduced more efficient schemes to improve the performance of FHE, such as DGHV10 [9], BV11 [10], BGV12 [11], FV12 [12], GSW13 [13], FHEW [14], and TFHE [15]. Despite great progress in algorithmic performance, these FHE schemes are still too slow for practical scenarios. Even somewhat homomorphic encryption (SWHE) schemes, which support only a limited number of operations on encrypted data, remain slow. Speed has become one of the main factors limiting the practical application of FHE.
At present, there are two main kinds of implementations for accelerating homomorphic encryption operations: software and hardware. However, the efficiency of software implementations [16]-[21] is still too low for practical applications. For instance, a homomorphic evaluation of AES-128 in [16] takes over 36 hours based on the NTL C++ library, while the homomorphic evaluation of the lightweight block cipher SIMON-32/64 in [17] takes around 20 min and 50 min using C++ implementations of FV and YASHE respectively. Even the GPU-based implementation of matrix-vector multiplication for BGV leveled FHE in [20] needs at least 2 seconds when evaluated on an NVIDIA Tesla K20, which has 2,496 cores and 5GB GDDR5 memory. On the other hand, the existing hardware implementations [22]-[33] realize only a limited set of FHE functions, and their performance is still low. For instance, Cao et al. [23] proposed only an FPGA-based large-integer FFT multiplier and a Barrett modular reduction for accelerating FHE operations over the integers, and Wang et al. [24] merely presented a 768K-bit multiplier based on a 64K-point FFT processor targeting Gentry-Halevi FHE primitives. Although Pöppelmann et al. [28] and Roy et al. [30] each proposed homomorphic evaluation architectures for YASHE with different parameter sets, the implementation in [30] evaluated SIMON-64/128 in approximately 157 seconds at 143MHz, so the latency is still very large. Hardware implementations of the LTV and FV schemes were also introduced by Doröz et al. [29] and Roy et al. [32]. However, they focus only on accelerating either homomorphic evaluation or homomorphic encryption, not both, and their performance is not high enough.
From the perspective of hardware implementation, existing FHE hardware accelerators include FPGA-based and ASIC-based implementations. ASIC-based implementations suffer from long development cycles and high cost, while FPGA-based implementations offer better programming flexibility, lower development difficulty and cost, and can achieve a good compromise between different design factors. Therefore, this paper uses an FPGA to realize the FHE accelerator. As a general hardware implementation platform, the FPGA has been widely used in various hardware design fields, such as neural networks [34], image processing [35], and cryptographic algorithms [36], [37]. We implement our FHE accelerator on an FPGA platform, with novelty in both algorithm selection and hardware acceleration architecture; it is also a new extension of the FPGA platform to fully homomorphic encryption applications.
In this paper, we present a complete FPGA-based hardware accelerator for homomorphic encryption and homomorphic evaluation of the BGV leveled FHE scheme, which is the first efficient FHE scheme based on the Learning with Errors (LWE) or Ring LWE (RLWE) problem and an important basis for other FHE variants. To the best of our knowledge, no complete hardware implementation of Ring-LWE based BGV had been published prior to this paper. A very recent paper by Pedrosa [37] implements the encryption and decryption of the BGV algorithm in hardware on an FPGA, but does not provide a hardware architecture for the homomorphic evaluation functions. The goal of our accelerator is to provide a complete implementation of a solution that is mature enough for practical application. We implement all required components for homomorphic encryption and homomorphic evaluation in hardware.

A. OUR CONTRIBUTIONS
The contributions of our paper can be summarized as follows: We propose an efficient hardware implementation of the BGV leveled FHE scheme, which is, to the best of our knowledge, the first complete FPGA-based Ring-LWE accelerator for the BGV algorithm. In contrast to prior art for other FHE schemes, our architecture supports both homomorphic encryption and homomorphic evaluation, and can be tailored to concrete application needs. We leverage multi-layer parallelism, from the circuit level to the arithmetic block level, to accelerate the operations. Nevertheless, we see our work as a first step towards a practical accelerator.
We improve the performance of polynomial multiplication over the ring by designing an NTT-based negative wrapped convolution (NWC) algorithm, which adopts four-level pipelines and a single-round iterative structure. The optimized structure achieves a good trade-off between performance and area. In addition, a resource-saving and high-performance modular reduction algorithm is presented, which occupies only half the resources of Barrett reduction.
We introduce the hardware architectures of the KeySwitch and ModSwitch modules, which are necessary to implement the leveled BGV FHE scheme. For KeySwitch, we select switching-key parameters that are more suitable for hardware implementation and improve its efficiency by using multi-level pipelines. We propose the first hardware structure for ModSwitch, which decreases the noise of ciphertexts produced by homomorphic evaluation by switching to a smaller modulus.

B. PAPER OUTLINE
The organization of this paper is as follows. Section II describes the related works and the BGV leveled FHE scheme, as well as the parameter set that we use. Section III provides the algorithms and optimization methods for the computation-intensive operations of the homomorphic evaluation functions. The hardware architecture overview and details are provided in Section IV. The resource utilization and performance of our implementations are shown in Section V. Section VI summarizes the paper.

II. BACKGROUND AND RELATED WORK
A. RELATED WORK
As mentioned above, software implementations are not yet efficient enough for real-time applications and may require minutes or hours to evaluate even simple functions or algorithms. For instance, a homomorphic evaluation of AES-128 [16] is reported to take over 36 hours based on the NTL C++ library, running on an Intel Xeon processor at 2.0GHz with 256GB RAM; even using SIMD techniques, the amortized rate is about 40 minutes per block. Another software homomorphic evaluation, of the decryption function of the lightweight block cipher SIMON-32/64 (resp. SIMON-64/128) [17], is reported to take around 3062s (resp. 12418s) and 1029s (resp. 4196s) using C++ implementations of FV and YASHE respectively on a 4-core Intel Core i7 processor at 3.4GHz. To address the shortcomings of software implementations, many optimized architectures and accelerators based on Graphics Processing Units (GPUs) and FPGAs/ASICs have been proposed.
The GPU is an alternative computing platform for accelerating homomorphic evaluation in FHE. Wang et al. [18] proposed the first GPU-based accelerator for FHE, targeting the Gentry-Halevi scheme [19]; the FHE primitives were implemented on an NVIDIA C2050 GPU with a dimension of 2048, and achieved speedup factors of around 7 compared to the original CPU implementation on an Intel Xeon X5650 processor running at 2.67GHz with 14GB RAM. A GPU-based implementation of a BGV leveled FHE accelerator was then introduced by Wang et al. [20]; the CRT-based matrix-vector multiplication was evaluated on an NVIDIA Tesla K20, achieving 35.2 times and 273.6 times speedup compared to the CRT-based method and NTL library implementations on CPU. Badawi et al. [21] proposed multi-threaded CPU as well as GPU implementations of RNS variants of the BFV scheme on NVIDIA Tesla K80 and V100-PCIe; the performance was faster by two orders of magnitude than prior results. However, GPU-based implementations normally offer less performance per watt, and their speed is still too low compared to dedicated hardware implementations.
Many FPGA-based or ASIC-based accelerators have been proposed to improve the performance of FHE schemes. One line of research focuses on hardware acceleration of large-integer multiplication [22]-[27], which is the main bottleneck of these FHE schemes. Cao et al. [23] proposed the first hardware implementations of encryption primitives for FHE over the integers on a Xilinx Virtex-7 FPGA platform, improving performance by a factor of up to 44 compared to the corresponding software implementation; a large-integer FFT multiplier and a Barrett modular reduction were proposed that could accelerate the FHE operations by 11 times. A 768K-bit multiplier based on a 64K-point FFT processor was introduced by Wang et al. [24]; the multiplier was prototyped on an Altera Stratix-V FPGA at 100MHz and was about twice as fast as the same algorithm executed on the NVIDIA C2050 GPU at 1.15GHz. Doröz et al. [27] presented a custom architecture for the Gentry-Halevi FHE scheme featuring an optimized multi-million-bit multiplier based on the Schönhage-Strassen multiplication algorithm; it occupied a footprint of less than 30 million gates at a frequency of 666MHz when synthesized using a 90nm TSMC library, with performance equivalent to the Xeon software implementation but slower than the GPU implementation.
Several works [28]-[33] focus on improving the performance of concrete FHE schemes on FPGAs. Pöppelmann et al. [28] proposed an architecture for the YASHE [38] scheme and implemented their design on the Catapult board equipped with an Altera Stratix V FPGA and two 4GB DRAMs; with an efficient double-buffered memory access scheme and a Number Theoretic Transform (NTT) based polynomial multiplier, for the parameter set (n = 16384, log2 q = 512) they can perform a homomorphic addition in 0.94ms and a homomorphic multiplication in 48.67ms. A hardware accelerator for the LTV [39] based SWHE scheme was introduced by Doröz et al. [29]; when synthesized for a Xilinx Virtex-7, the presented architecture can compute the product of large polynomials in 6.25ms, which is more than 102 times faster than the software implementation. Roy et al. [30] also proposed a hardware architecture for the YASHE scheme, compiled on a Xilinx Virtex-7 FPGA; the implementation evaluated SIMON-64/128 in approximately 157s at 143MHz and was 26.6 times faster than the software implementation. However, their assumption of unlimited memory bandwidth, which renders off-chip memory accesses free of cost, is not realistic. Perhaps the closest work to ours is by Roy et al. [32], in which the authors presented an architecture for the FV scheme and implemented the design on a Xilinx Zynq UltraScale+ MPSoC ZCU102; the implementation achieved over 13 times speedup at 200MHz with respect to FV-NFLlib executing on an Intel i5 processor running at 1.8GHz.

B. BGV FHE SCHEME
In this section we briefly introduce the BGV fully homomorphic encryption scheme. The BGV scheme was proposed by Brakerski, Gentry, and Vaikuntanathan [11] in 2012. It is the first leveled FHE scheme that works without Gentry's bootstrapping procedure, and it can be based on either the learning with errors (LWE) or Ring-LWE (RLWE) problem, with 2^λ security against known attacks. This paper focuses on the Ring-LWE based BGV.
For the mathematical preliminaries, suppose the polynomial degree n is a power-of-two integer. We define the integer polynomial ring R = Z[x]/(f(x)) with reduction polynomial f(x) = x^n + 1, whose elements have degree at most n − 1. For the ciphertext modulus q, we define the ciphertext space R_q = R/qR, the residue ring of R obtained by reducing each polynomial coefficient modulo q. In actual computation, we represent the coefficients of R_q in [0, q − 1] ∩ Z. For the plaintext modulus p, the plaintext space is defined as R_p = R/pR, where each coefficient of a polynomial is represented as an integer modulo p. In the key generation, encryption, and KeySwitch operations of BGV, polynomials are sampled from a discrete Gaussian distribution χ_σ over R with a small standard deviation σ. In practice, we take the private key to be a polynomial with coefficients from a narrow set such as {−1, 0, 1}. The security of the scheme is determined by the degree n, the size of the ciphertext modulus q, and the Gaussian distribution χ_σ.
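The ring arithmetic above can be illustrated with a minimal software sketch (this is illustration only, not part of the paper's hardware): coefficients reduce modulo q, and powers x^k with k ≥ n wrap around with a sign flip because x^n ≡ −1 mod (x^n + 1). The toy values n = 8, q = 17 are assumptions chosen for readability.

```python
# Minimal sketch of arithmetic in R_q = Z_q[x]/(x^n + 1).

def ring_mul(a, b, n, q):
    """Schoolbook product of two degree < n polynomials in Z_q[x]/(x^n + 1)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                # x^(i+j) = x^(i+j-n) * x^n ≡ -x^(i+j-n) mod (x^n + 1)
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

# Example of the wraparound: x^7 * x = x^8 ≡ -1 in Z_17[x]/(x^8 + 1)
n, q = 8, 17
print(ring_mul([0]*7 + [1], [0, 1] + [0]*6, n, q))  # [16, 0, 0, 0, 0, 0, 0, 0]
```

The −1 shows up as 16, i.e. q − 1, because coefficients are kept in [0, q − 1].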
With these preliminaries, now we enumerate the functions used in BGV scheme as follows.
1) BGV.Setup(λ): For a given security parameter λ, choose the polynomial degree n, the ciphertext modulus q, the plaintext modulus p, and the Gaussian distribution χ_σ. Normally, n is a power-of-two integer, q is a positive integer satisfying q ≡ 1 mod (2n), and p is 2 or a positive integer much less than q.
2) BGV.KeyGen(params): Sample a polynomial s′ ← χ_σ and return the secret key sk = s = (1, s′) ∈ R_q^2. Sample a polynomial a′ ← R_q uniformly at random and an error polynomial e′ ← χ_σ. Compute b ← a′s′ + pe′ ∈ R_q, where p ≪ q. Return the public key pk = (b, −a′) ∈ R_q^2, which satisfies pk · s = (b, −a′) · (1, s′) = b − a′s′ = pe′. The scheme needs another key, called the switching key, in the SwitchKey function. To compute the switching key, we first choose the parameter t ≤ q, sample a polynomial vector ex_a over R_q uniformly and ex_e ← χ_σ, then compute ex_b = ex_a · s′ + p · ex_e + Powersof_t((s′)^2) over R_q. The function Powersof_t(a) scales an element a ∈ R_q by the successive powers of t, namely Powersof_t(a) = (a · t^i)_{i=0}^{ℓ}, where ℓ = ⌊log_t q⌋ in this scheme. Finally, return the switching key epk = (rlk_0, rlk_1) = (−ex_a, ex_b).
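The core KeyGen relation b − a′s′ = pe′ can be checked numerically with a toy, insecure sketch (tiny n and q are assumptions for illustration; this is not the paper's key generator, and the narrow {−1, 0, 1} noise follows the simplification described in Section II-C):

```python
import random

# Toy numerical check of the BGV.KeyGen relation b = a'*s' + p*e' (mod q).
random.seed(1)
n, q, p = 8, 257, 2                      # q ≡ 1 mod 2n: 257 mod 16 = 1

def ring_mul(u, v):
    """Schoolbook product in Z_q[x]/(x^n + 1)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            s = u[i] * v[j]
            if i + j < n:
                c[i + j] = (c[i + j] + s) % q
            else:
                c[i + j - n] = (c[i + j - n] - s) % q   # x^n ≡ -1
    return c

def narrow():
    """Stand-in for the narrow chi_sigma distribution used in the paper."""
    return [random.choice([-1, 0, 1]) for _ in range(n)]

sec = narrow()                                        # secret s'
a = [random.randrange(q) for _ in range(n)]           # uniform a'
err = narrow()                                        # error e'
b = [(x + p * y) % q for x, y in zip(ring_mul(a, sec), err)]

# pk . sk = b - a'*s' should recover p*e' exactly once recentered around 0
diff = [(bi - ci) % q for bi, ci in zip(b, ring_mul(a, sec))]
centered = [d - q if d > q // 2 else d for d in diff]
```

Decryption works for the same reason: subtracting a′s′ leaves only the small multiple of p, which vanishes modulo p.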

C. PARAMETER SET
From a proof-of-concept perspective, we use a small parameter set with polynomial degree n = 128 (namely, the degree of the reduction polynomial f(x) is 128). In order to take advantage of the structure of the polynomial ring and make our hardware implementation more practical, we choose the plaintext modulus p = 32 (i.e. log2(p) = 5 bits) and the ciphertext modulus q = 257^3 (i.e. log2(q) = 25 bits), so we can evaluate not only bit-level operations but also integer-level operations. By using the packing method [40], we can embed multiple plaintexts into different coefficients of a single ciphertext and evaluate a function on all of them in parallel with a single execution. Normally, the ciphertext modulus q is chosen as a big prime satisfying q ≡ 1 mod (2n). However, we choose q as the product of three primes, mainly for the convenience of implementing and verifying the ModSwitch primitive of BGV, which changes the original modulus to a smaller one while preserving the correctness of decryption under the same secret key. Moreover, since the coefficients of polynomials in the BGV scheme are signed integers, to keep some redundancy we set the bit-width of polynomial coefficients and the modulus q to 27-bit signed integers in our actual implementation. Following [11], [31], we take the discrete Gaussian distribution χ_σ to be a polynomial with coefficients drawn uniformly at random from {−1, 0, 1} for simplicity. Although this may affect the security to some extent, it is feasible from the perspective of a hardware implementation of the primitives. The parameters of our design are detailed in Table 1.
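The stated parameter facts can be verified with a few lines of arithmetic (pure sanity checks on the numbers quoted in the text, nothing hardware-specific):

```python
# Sanity checks on the parameter set: n = 128, p = 32, q = 257^3.
n, p = 128, 32
q = 257 ** 3

assert q == 16974593            # the modulus value used in Section III
assert q.bit_length() == 25     # log2(q) ≈ 25 bits (27-bit signed in hardware)
# NTT-friendliness: q and its prime factor 257 are both ≡ 1 mod 2n = 256,
# so the required roots of unity exist modulo every prime factor of q.
assert q % (2 * n) == 1
assert 257 % (2 * n) == 1
print("parameter set is NTT-friendly")
```

Since 257 ≡ 1 mod 256, every power of 257 is also ≡ 1 mod 256, which is why the product of three copies of 257 still satisfies the q ≡ 1 mod (2n) condition.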

III. ALGORITHMS AND OPTIMIZATIONS
Through analysis of the BGV scheme, we find that polynomial multiplication and modular reduction are the most frequently used and time-consuming operations. Even without considering the encryption and decryption algorithms (which normally belong to the client), the homomorphic multiplication and KeySwitch operations alone (which normally belong to the server) need at least 8 polynomial multiplications and many more modular reductions each time. Therefore, we optimize these two operations respectively, which lays the foundation for the BGV hardware implementation that follows.

A. POLYNOMIAL MULTIPLICATION
In our parameter set, the polynomials consist of 128 coefficients, each a 27-bit signed integer. For such large polynomials and coefficients, the computation time is largely determined by the complexity of the polynomial multiplication algorithm. The main methods of polynomial multiplication include [41]: the school-book algorithm, Karatsuba-based algorithms, the Toom-Cook algorithm, Fast Fourier Transform (FFT) based algorithms, and so on. FFT-based polynomial multiplication applies a divide-and-conquer technique to reduce the computation of the Discrete Fourier Transform (DFT) to smaller problems and has the lowest time complexity, O(n log n). During a polynomial multiplication, a forward FFT is applied to the input polynomials to bring them into the Fourier domain; then a coefficient-wise multiplication is performed in the Fourier domain; finally, an inverse FFT (IFFT) is required to bring the results back to the polynomial representation. However, the FFT and IFFT are computed over the real numbers and thus suffer from approximation errors, which is not suitable for cryptographic applications. Instead, we use the NTT, a generalization of the FFT to finite rings, to perform the polynomial multiplication.
Suppose a(x) is a polynomial of degree less than n in the ring R_q, and let ω be a primitive n-th root of unity in R_q. Then the n-point NTT of a(x) is defined as
A_i = Σ_{j=0}^{n−1} a_j · ω^{ij} mod q, i = 0, 1, . . . , n − 1,
and the n-point inverse NTT (INTT) can be calculated by
a_j = n^{−1} · Σ_{i=0}^{n−1} A_i · ω^{−ij} mod q, j = 0, 1, . . . , n − 1.
Since n is coprime to q, n has an inverse n^{−1} modulo q with n · n^{−1} ≡ 1 mod q, and ω has an inverse ω^{−1} modulo q with ω · ω^{−1} ≡ 1 mod q. Note that the NTT exists if and only if n divides p − 1 for each prime factor p of q, and the primitive n-th root of unity ω satisfies ω^n ≡ 1 mod q. Because the time complexity of the basic NTT algorithm is still O(n^2) and offers no advantage, we use the butterfly-based NTT with time complexity O(n log n) to construct the polynomial multiplications. An iterative version of the butterfly-based NTT algorithm is shown in Algorithm 1. Three nested loops perform the NTT; inside the inner-most loop, the butterfly operation is computed, consisting of a modular multiplication by a twiddle factor ω_m followed by a modular addition and a modular subtraction. Using the NTT, one can efficiently perform polynomial multiplication through the convolution theorem. Let a(x) and b(x) be polynomials of degree n − 1 with coefficients a_i, b_i ∈ Z_q for i = 0, 1, . . . , n − 1. The convolution of their coefficient vectors yields the product polynomial c(x) of degree 2n − 2, with coefficients c_i ∈ Z_q for i = 0, 1, . . . , 2n − 2. Since the product has 2n − 1 coefficients, the input vectors a and b must be zero-padded to a length of at least 2n (of the form 2^k) to obtain the correct result. Although the NTT makes the complexity quasi-linear, the length of the vectors and the number of point-wise multiplications are doubled.
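A software model of the iterative butterfly NTT/INTT can be sketched as follows. This uses a decimation-in-time variant with an up-front bit-reversal permutation (the hardware described later uses a DIF butterfly; both compute the same transform), and the toy parameters n = 8, q = 257, ω = 64 are assumptions chosen for readability:

```python
# Iterative butterfly NTT: A[i] = sum_j a[j] * omega^(i*j) mod q.

def ntt(a, omega, q):
    n = len(a)
    a = a[:]                             # work on a copy
    j = 0
    for i in range(1, n):                # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                   # log2(n) stages of n/2 butterflies
        w_m = pow(omega, n // length, q) # twiddle base for this stage
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % q
                a[k] = (u + v) % q       # butterfly: 1 mul, 1 add, 1 sub
                a[k + length // 2] = (u - v) % q
                w = w * w_m % q
        length <<= 1
    return a

def intt(A, omega, q):
    """Inverse NTT: run the NTT with omega^-1, then scale by n^-1 mod q."""
    n = len(A)
    a = ntt(A, pow(omega, -1, q), q)     # pow(x, -1, q): modular inverse
    inv_n = pow(n, -1, q)
    return [x * inv_n % q for x in a]
```

A round trip `intt(ntt(a, 64, 257), 64, 257)` returns `a`, and the forward transform matches the O(n^2) definition above term by term.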
In this paper, we use the negative wrapped convolution (NWC) method [42] to compute the product of polynomials a(x) and b(x) in R_q, which avoids the zero padding. Let a = (a_0, a_1, . . . , a_{n−1}), b = (b_0, b_1, . . . , b_{n−1}), and c = (c_0, c_1, . . . , c_{n−1}) be vectors of length n; then each element of c can be calculated as
c_i = Σ_{j=0}^{i} a_j · b_{i−j} − Σ_{j=i+1}^{n−1} a_j · b_{n+i−j} mod q, i = 0, 1, . . . , n − 1.
This computation is equivalent to performing the polynomial multiplication modulo (x^n + 1), using the property x^n ≡ −1 mod (x^n + 1). Let ω be the primitive n-th root of unity in R_q and let e be a square root of ω, satisfying e^2 ≡ ω mod q. To guarantee the existence of e when q is a prime or a product of primes and n is a power of 2, we require q ≡ 1 mod 2n. To compute the negative wrapped convolution based on the NTT, we first form the weighted vectors
ã = (a_0, ea_1, . . . , e^{n−1}a_{n−1}), b̃ = (b_0, eb_1, . . . , e^{n−1}b_{n−1}).
Then the negative wrapped convolution of ã and b̃ can be computed by NTT and INTT as
c̃ = INTT_n^ω(NTT_n^ω(ã) ⊙ NTT_n^ω(b̃)),
where ⊙ denotes point-wise multiplication. Finally, we undo the weighting to obtain the final result of the polynomial multiplication of a(x) and b(x):
c_i = e^{−i} · c̃_i mod q, i = 0, 1, . . . , n − 1.
Details of the polynomial multiplication by negative wrapped convolution using the NTT are shown in Algorithm 2. Compared with the zero-padding method, the negative wrapped convolution reduces the length of the NTT and of the point-wise multiplication from 2n to n; moreover, the explicit reduction modulo x^n + 1 is omitted. Therefore, the performance of polynomial multiplication is improved greatly, and the improvement becomes more obvious as the length of the polynomials increases.
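The weight-transform-multiply-unweight flow above can be sketched end to end in software (self-contained, so it repeats the small NTT routine; the toy parameters n = 8, q = 257, e = 249 with e^2 ≡ 64 mod 257 are assumptions, not the paper's n = 128, q = 257^3):

```python
# Sketch of the NWC polynomial multiplication (the flow of Algorithm 2).

def ntt(a, omega, q):
    n = len(a)
    a = a[:]
    j = 0
    for i in range(1, n):                # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                   # iterative butterflies
        w_m = pow(omega, n // length, q)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % q
                a[k], a[k + length // 2] = (u + v) % q, (u - v) % q
                w = w * w_m % q
        length <<= 1
    return a

def nwc_mul(a, b, q, e):
    """Product of a(x) and b(x) modulo (x^n + 1, q), with e^2 = omega mod q."""
    n = len(a)
    omega = e * e % q
    at = [a[i] * pow(e, i, q) % q for i in range(n)]      # weight by e^i
    bt = [b[i] * pow(e, i, q) % q for i in range(n)]
    C = [x * y % q for x, y in zip(ntt(at, omega, q), ntt(bt, omega, q))]
    ct = ntt(C, pow(omega, -1, q), q)                     # INTT without 1/n
    inv_n = pow(n, -1, q)
    return [ct[i] * inv_n * pow(e, -i, q) % q for i in range(n)]
```

All transforms have length n rather than 2n, and the x^n + 1 wraparound comes out of the unweighting step for free, which is exactly the saving claimed above.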

Algorithm 2 NTT-Based Polynomial Multiplication
Input: Coefficient vectors a = (a_0, a_1, . . . , a_{n−1}) and b = (b_0, b_1, . . . , b_{n−1}) of polynomials a(x) and b(x) respectively, where a_i, b_i ∈ Z_q, i = 0, 1, . . . , n − 1; the primitive n-th root of unity ω and its inverse ω^{−1} in R_q; the square root e of ω and its inverse e^{−1}; the polynomial degree n and its inverse n^{−1}.
Output: The coefficient vector c of c(x) = a(x) · b(x) mod (x^n + 1, q).

B. MODULAR REDUCTION
Modular reduction is another time-consuming unit in the BGV scheme, especially within the polynomial multiplications. Almost all polynomial coefficients in the ciphertext space are bounded over R_q, where q = 16974593 in our scheme. Concretely, a modular reduction must be performed after each multiplication in order to work with elements of the polynomial ring at all times, and the same holds after repeated additions. Modular reduction operations take a significant amount of time and resources, representing a critical part of the polynomial multipliers; it is therefore important to design an efficient and area-saving modular reduction. Since the modular reduction after an addition is relatively simple, requiring only a few conditional subtractions, this paper focuses on the modular reduction after a multiplication.
Some previous works perform efficient modular reduction for specific moduli, such as Solinas primes [28] or pseudo-Fermat primes [42], and most of them use the Barrett modular reduction algorithm [30], [36]. However, these implementations are restrictive and do not allow arbitrary selection of the modulus, and our experiments confirmed that these methods occupy relatively large resources for the value of q that we chose.
Therefore, from the perspective of hardware design, we propose a lookup table (LUT) based modular reduction algorithm, shown in Algorithm 3, which adopts a divide-and-conquer strategy. Let the bit length of the modulus q be k, and the bit length of the input number x be m. When m > k, we first split x into the bits above position k and the bits below position k. For the low k bits, we compare against q and conditionally subtract q. Moreover, for the bits above position k, we divide every four

Algorithm 3 Modular Reduction Algorithm
Let k be the bit length of q, and x be the integer with a maximum bit length m.
1: while m > k do
2:   x_L = x mod 2^k, x_H = ⌊x / 2^k⌋
3:   if x_L ≥ q then x_L = x_L − q
4:   sum = x_L
5:   split x_H into 4-bit groups g_0, g_1, . . .
6:   for each group g_i do
7:     t_i = (g_i · 2^{k+4i}) mod q (pre-computed, read from a 4-input LUT)
8:     sum = sum + t_i
9:   end for
10:   x = sum
11:   m = sum.length
12: end while
13: y = x
14: Return(y)

bits into a group and perform pre-computations, which can be mapped into 4-input LUTs in hardware. We then add the results, take the sum as the new value of x and its length as the new value of m, and repeat the above operations until m ≤ k, finally outputting the modular reduction result y. Compared with traditional methods, our modular reduction algorithm supports moduli of arbitrary length. At the same time, by introducing pre-computation, we quickly reduce the number of input bits, the overall algorithm complexity, and the resource occupation.
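The folding strategy can be modeled in software as follows. This is a sketch, not the hardware: the LUT contents mirror the pre-computed (v · 2^{k+4i}) mod q values, the group count is an assumption sized for the roughly 54-bit products of two 27-bit coefficients, and a final conditional subtraction handles low-bit results that are still ≥ q:

```python
# Software model of the LUT-based reduction x mod q for q = 257^3.

q = 257 ** 3                 # 16974593
k = q.bit_length()           # 25
GROUPS = 16                  # assumed: enough groups for double-width products
# table[i][v] = (v * 2^(k + 4i)) mod q -- one 4-input LUT per group in hardware
table = [[(v << (k + 4 * i)) % q for v in range(16)] for i in range(GROUPS)]

def lut_mod(x):
    """Reduce x mod q by repeatedly folding the bits above position k."""
    while x.bit_length() > k:
        low, high = x & ((1 << k) - 1), x >> k
        acc, i = low, 0
        while high:                      # one 4-bit group per LUT
            acc += table[i][high & 0xF]
            high >>= 4
            i += 1
        x = acc                          # much shorter than before; iterate
    while x >= q:                        # final conditional subtraction
        x -= q
    return x
```

Each pass replaces the high bits by small pre-reduced contributions, so a 54-bit product collapses to at most k bits within a few iterations; every table entry is already less than q, which bounds the partial sums.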

IV. ARCHITECTURE OVERVIEW
In this section, we first describe the overall architecture of the BGV accelerator. Then the kernel acceleration units, including polynomial multiplication and modular reduction, are introduced in detail. Finally, the hardware implementation of each component of the BGV accelerator is presented.

A. OVERALL ARCHITECTURE
The architecture overview of our BGV accelerator is presented in Fig. 1. Although the goal of our implementation is to accelerate the server-side homomorphic evaluation operations BGV.HomMult and BGV.KeySwitch (and polynomial multiplication and modular reduction in general), from a proof-of-concept perspective we also implement the encryption and decryption operations, which are assumed to be performed on a client. We note that, except for the Gaussian sampler, almost all components required for the BGV scheme are present in our design. By default, we assume that the secret key sk = (1, s′), public key pk = (b, −a′), and switching key epk = (rlk_0, rlk_1) are generated directly by the CPU, and that the plaintext polynomials m_1 and m_2 are read from the memory of the CPU according to the requirements of the specific application program. When the accelerator receives the plaintext polynomials, it first performs the encryption operation under the control of the main controller. Then the ciphertexts ct_1 and ct_2 are sent to the homomorphic addition and homomorphic multiplication modules respectively to perform the homomorphic evaluation computations. Because of the dimension expansion of the ciphertext and the excessive noise growth in homomorphic multiplication, the KeySwitch and ModSwitch operations are required. If more than one homomorphic addition or multiplication is needed, the result of the homomorphic evaluation function can be fed back to the corresponding unit for the next round of computation. Otherwise, the final result of the homomorphic addition or multiplication is obtained by decryption.
It should be noted that the modulus q in our polynomial multiplication unit is fixed, so the output of ModSwitch cannot be directly decrypted by the accelerator; the ModSwitch module in our accelerator serves only as a primitive verification and is not connected to the decryption circuit. In order to reduce the length of the critical path and keep the data updated in time, intermediate results and parameters can be temporarily stored in block RAMs or registers. Further analysis shows that, compared with homomorphic addition, the calculation path of homomorphic multiplication is extremely long; it is therefore necessary to optimize the kernel computing units (e.g. polynomial multiplication and modular reduction) and the KeySwitch module.

B. KERNEL ACCELERATION UNITS 1) NTT-BASED POLYNOMIAL MULTIPLIER
Following the iterative version of the butterfly-based NTT algorithm, a pipelined architecture for the NTT transformation is designed and presented in Figure 2. As can be seen from Figure 2, the overall data path is divided into four functional processing units. First of all, in order to temporarily store the input data and enable pipelining, a simple dual-port RAM, which can read and write concurrently, is used to store the n coefficients of the polynomials. The dual-port RAM has two input buses and two output buses, corresponding to the upper and lower data paths of the butterfly unit respectively. Since the dual-port RAM has two read ports and two write ports, it supports two reads and two writes per cycle. Next, the butterfly unit, with a Decimation-in-Frequency (DIF) structure, performs the additions and multiplications with modular reduction and shares the same data path across all stages of the NTT; its output is fed back to the dual-port RAM for the next butterfly computation. Instead of generating ω^i mod q, ω^{−i} mod q, e^i mod q, and e^{−i} mod q on the fly, ROMs are used to store these pre-computed values. In addition, an update unit is responsible for updating the address indices i_1 and i_2, the offsets k_1 and k_2, and the group and stage indices of the NTT transformation, and for writing the coefficients back to the corresponding RAM addresses. With the fully pipelined architecture, after a certain warm-up period the coefficients of the NTT appear successively at the output, one per cycle. Last but not least, since the input values of the polynomials arrive in natural order while the NTT values are indexed in bit-reversed order, a bit-reverse operation is performed after the NTT to restore the order of the output sequence.
The update unit is the most significant and complex component of the NTT, as it directly determines the processing order of the butterfly operations. According to the parameter set proposed in Section II and the NTT algorithm presented in Section III, the NTT transform consists of log_2 n = 7 stages, from stage 1 to stage 7. Each stage is divided into multiple groups, and each group further contains multiple butterflies. Each time the stage index increases by one, the number of groups doubles and the number of butterflies in each group is halved. At the last stage, there are in total n/2 = 64 groups, and each group contains only a single butterfly. In addition, note that the twiddle factors ω^i mod q only participate in the multiplication on the lower path of each group, while the upper path remains unchanged, and within a group the same twiddle factor occurs. The address index i_1 is used to write the upper-path data of a butterfly in each group to the dual-port RAM, and the address index i_2 performs the same operation for the lower-path data of the same butterfly. The parameters k_1 and k_2 are the inter-group and intra-group offsets of the address indices i_1 and i_2 respectively, and they satisfy the relationship k_1 = 2k_2.
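The stage/group/butterfly bookkeeping described above can be sketched as a short software model. The function below is an illustrative reconstruction (not the paper's RTL): for a DIF NTT over n = 128 points it emits, for every butterfly, the stage index and the two RAM addresses playing the roles of i_1 and i_2.

```python
# Illustrative software model of the update unit's index schedule: a DIF
# NTT over n = 128 points has log2(n) = 7 stages; stage s has 2^(s-1)
# groups, and the butterfly at offset j of a group starting at g pairs
# addresses i1 = g + j (upper path) and i2 = g + j + half (lower path).
def dif_schedule(n=128):
    pairs, stage, half = [], 1, n // 2
    while half >= 1:
        for g in range(0, n, 2 * half):   # number of groups doubles each stage
            for j in range(half):         # butterflies per group halve each stage
                pairs.append((stage, g + j, g + j + half))
        half //= 2
        stage += 1
    return pairs
```

At stage 7 the schedule degenerates into 64 single-butterfly groups with adjacent addresses, matching the description above.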
With the NTT processing unit described above, we find that the structure of the INTT is almost identical to that of the NTT, except that the twiddle factors ω^i mod q are replaced by ω^{-i} mod q, and an additional multiplication by n^{-1} mod q is needed at the end of the INTT. Hence, the INTT processing unit can reuse the NTT unit, which simplifies the control logic. The data flow of the NTT-based multiplier is depicted in Figure 3. Considering that steps 3, 4 and steps 6, 7 of Algorithm 2 have no data dependency, we enable a high-speed design by processing these pairs of steps in parallel. Owing to the compact iterative butterfly-based NTT structure, NTT^n_ω(ã) and NTT^n_ω(b̃) can be computed directly by replicating the NTT structure, and INTT^n_ω(c̃) can also be computed by reusing the NTT unit. In order to fully utilize the last stage of the NTT, the point-wise multiplication can be absorbed into that stage: instead of multiplying by the powers of ω_m in the last stage, the point multiplication is computed directly, which saves n cycles compared with the original method. In general, the values of n^{-1} mod q and e^{-i} mod q are pre-computed and their multiplications are performed separately; we improve on this by pre-computing the values of e^{-i} · n^{-1} mod q directly, which saves another n cycles. Furthermore, similar to the point-wise multiplication, the multiplication by e^{-i} · n^{-1} mod q can also be combined with the multiplication by the powers of ω_m in the last stage of the INTT, reducing the cycle count by a further n. The cycle requirements of the NTT-based multiplier are presented in Table 2. Without pipelining, the cycles required for one polynomial multiplication can be reduced to n · (log_2 n + 1).
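The full NTT, point-wise multiply, INTT flow of Figure 3 can be modeled in a few lines of software. This is an illustrative sketch under toy parameters (n = 8, q = 257, with 3 as a primitive root mod 257), not the paper's n = 128 and 27-bit q; it uses a recursive NTT in place of the iterative pipelined one, and performs the pre/post scaling and the n^{-1} step separately instead of fusing them into the last stage as the paper's optimization does.

```python
# Toy-scale model of negacyclic polynomial multiplication in
# Z_q[x]/(x^n + 1) via NTT / point-wise multiply / INTT.
def ntt(a, omega, q):
    """Recursive radix-2 NTT; the hardware uses an iterative DIF pipeline."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % q, q)
    odd = ntt(a[1::2], omega * omega % q, q)
    out, w = [0] * n, 1
    for i in range(n // 2):
        t = w * odd[i] % q
        out[i] = (even[i] + t) % q
        out[i + n // 2] = (even[i] - t) % q
        w = w * omega % q
    return out

def negacyclic_mul(a, b, n=8, q=257):
    # psi plays the role of the paper's e: a primitive 2n-th root of unity.
    psi = pow(3, (q - 1) // (2 * n), q)   # 3 is a primitive root mod 257
    omega, inv_n = psi * psi % q, pow(n, -1, q)
    # Pre-scale by psi^i so the cyclic NTT convolution becomes negacyclic.
    at = ntt([a[i] * pow(psi, i, q) % q for i in range(n)], omega, q)
    bt = ntt([b[i] * pow(psi, i, q) % q for i in range(n)], omega, q)
    # Point-wise multiply, then INTT = NTT with omega^{-1}, scaled by n^{-1}.
    ct = ntt([x * y % q for x, y in zip(at, bt)], pow(omega, -1, q), q)
    return [ct[i] * inv_n * pow(psi, -i, q) % q for i in range(n)]
```

The separate `inv_n` and `psi^{-i}` multiplications at the end correspond exactly to the factors the paper merges into the last INTT stage to save 2n cycles.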

2) LUT-BASED MODULAR REDUCTION
Since the coefficients of polynomials in the BGV scheme are signed integers and the modulus q is a 25-bit number, in order to keep some redundancy we set the bit-width of the polynomial coefficients and the modulus q to 27-bit signed integers. Correspondingly, the product of two polynomial coefficients produces at most a 54-bit signed number. The modular reduction unit reduces the 54-bit output of the polynomial multiplication by the 27-bit modulus q. Since the input data is 27 bits wider than the modulus q, bit-by-bit modular reduction would be far too slow. Hence, we propose a resource-saving and high-performance LUT-based modular reduction unit that reduces the product into [0, q - 1], after which the result can be centrally lifted to a value in (-q/2, q/2). Since the input of the additive modular reduction is only a few bits (generally 1∼2 bits) wider than the modulus q, its architecture is relatively simple: it only needs a few conditional subtractions that test whether the input is larger than the modulus q, i.e. if it is larger than q, we subtract q from the input and output the difference; otherwise, we output the input unchanged. This paper mainly focuses on the architecture of the modular reduction for multiplication. The architecture for computing the reduction modulo q for multiplication is presented in Figure 4. As can be seen from Figure 4, the architecture has three stages. At the beginning of the computation, the low 25-bit input data is compared with the modulus q by performing a conditional subtraction. Based on the sign of the subtraction, either x[24 : 0] or x[24 : 0] - q participates in the subsequent computation as the input of the adder. For the high 27-bit input data, we take every 4 bits as a group, pre-compute the modular reduction of every possible group value, and store the results in the corresponding 4-input LUTs.
Since 27 is not divisible by 4, the remaining bits are used separately as the input address of a smaller LUT. Then we sum all the outputs of the LUTs and the MUX using an adder circuit. Similarly, in stage 2, we perform a conditional subtraction on the addition result of the previous stage and look up the pre-computed value for the remaining high 3-bit input data. In the last stage, only a conditional subtraction is performed on the previous addition result, and the final reduction result is output by padding the most significant 2 bits with 2'b00. Our analysis shows that the number of input bits shrinks substantially after each stage; for example, in stage 1 the width is reduced from 54 bits to 27 bits, almost a half. In addition, since LUTs are an inherent resource of the FPGA, we can make full use of the FPGA's hardware resources to improve the performance of the modular reduction while decreasing the area cost. Note that our modular reduction architecture can be easily extended to support arbitrary moduli with larger bit-widths.
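The reduction flow can be mimicked functionally in software. The sketch below is a behavioral model, not the bit-exact three-stage circuit: the high-order bits are split into 4-bit groups whose contributions (group << position) mod q come from precomputed tables, standing in for the FPGA's 4-input LUTs, and a short chain of conditional subtractions replaces the staged compares.

```python
# Behavioral model of LUT-based reduction of a wide product modulo the
# paper's 25-bit ciphertext modulus q = 257^3. Group widths and staging
# here are illustrative, not the exact Figure 4 pipeline.
Q = 257 ** 3  # 16974593

def make_tables(q=Q, low_bits=25, high_bits=29, group=4):
    """One table per 4-bit slice of the high bits: entry v is (v << pos) mod q."""
    tables, pos = [], low_bits
    while pos < low_bits + high_bits:
        width = min(group, low_bits + high_bits - pos)
        tables.append([(v << pos) % q for v in range(1 << width)])
        pos += width
    return tables

TABLES = make_tables()

def lut_mod(x, q=Q, low_bits=25):
    acc = x & ((1 << low_bits) - 1)       # keep the low bits directly
    hi, pos = x >> low_bits, 0
    for tab in TABLES:                    # add each slice's precomputed residue
        width = len(tab).bit_length() - 1
        acc += tab[(hi >> pos) & ((1 << width) - 1)]
        pos += width
    while acc >= q:                       # conditional subtractions finish it off
        acc -= q
    return acc
```

Because each table entry is already below q, the partial sum is only a few multiples of q above the final result, which is why a handful of conditional subtractions (three pipelined stages in the circuit) suffice.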

C. UNITS OF HOMOMORPHIC ENCRYPTION
Theoretically, homomorphic encryption runs on the client side, including the encryption and decryption algorithms. However, for functional completeness and performance optimization, this paper also gives an architecture for homomorphic encryption, depicted in Figure 5. As can be seen, the whole architecture is directed by control signals generated by a controller. The architecture includes three NTT-based multipliers that perform the modular polynomial multiplication over the ring R_q, a Gaussian sampler to generate the error polynomials (in the hardware implementation a binary uniform distribution is actually used), adders and modular reduction units to compute the addition results, and registers that provide temporary storage for the input and output data.

1) ENCRYPTION UNIT
The encryption unit encrypts the input plaintext polynomial m with the public key pk = (-a, b). Initially, all input polynomials are stored in the input registers for one pipeline level, and then the error polynomial r is multiplied by the public-key components -a and b simultaneously using the previously designed NTT-based multipliers Mult 1 and Mult 2. Once Mult 1 and Mult 2 complete, their results are added in parallel with p · e_1 and p · e_2 respectively. Moreover, the plaintext polynomial m is further added to p · e_1 + b × r, and 27-bit additive modular operations are performed in parallel on the polynomials m + p · e_1 + b × r and p · e_2 - a × r, which reduce the coefficients of the resulting polynomials into R_q. Note that in our hardware implementation, instead of using DSPs to realize the multiplications by p in p · e_1 and p · e_2, they are implemented by left-shifting each coefficient by log_2 p = 5 bits (since p = 32 is a power of 2), which effectively reduces DSP occupation and speeds up the coefficient multiplications. At last, the ciphertext polynomials c_0 and c_1 are obtained after one level of buffering.

2) DECRYPTION UNIT
Correspondingly, the decryption unit recovers the original plaintext polynomial m from the ciphertext polynomials c̃_0 and c̃_1, which may come directly from the output of the encryption module or from the output of the KeySwitch module.
When the decryption unit is enabled by the controller, the multiplication between the ciphertext polynomial c̃_1 and the private-key polynomial s is computed using the NTT-based multiplier Mult 3. The output of Mult 3 is then added to the ciphertext c̃_0, and a 27-bit additive modular reduction is performed on the sum, which reduces the coefficients of the polynomial to the range [0, q - 1]. However, according to the BGV scheme, in order to decrypt the ciphertext correctly, the coefficients of the polynomial need to be further reduced to (-q/2, q/2). Hence, we add a conditional judgment using MUX logic to reduce the coefficients to the correct range: if an input coefficient is larger than (q + 1)/2, we subtract q from it before output; otherwise, we output the original coefficient. Finally, a modulo-p reduction is performed on the output of the MUX; since the modulus p equals 32, a power of 2, we directly take the low log_2 p = 5 bits of each polynomial coefficient as the output of the reduction modulo p. Similarly, the plaintext polynomial m is obtained after one level of buffering. It should be noted that since all operations in our architecture are performed on signed numbers, particular attention must be paid to the sign bits and the actual values of the polynomial coefficients during the computation, and sign-extension operations must be carried out where necessary.
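The encryption and decryption data paths just described can be modeled end to end in software. This is a toy sketch, not the hardware: n = 8 stands in for the paper's n = 128, the schoolbook `polymul` stands in for the NTT-based multipliers, and Python's `random` with ternary noise stands in for the binary uniform sampler. The 5-bit left shift for multiplication by p = 32 and the centered lift before the mod-p step mirror the circuit.

```python
# Toy BGV-style encrypt/decrypt model (Figure 5 data paths).
import random

N, Q, P = 8, 257 ** 3, 32

def polymul(a, b):
    """Negacyclic schoolbook multiply in Z_q[x]/(x^n + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = -1 if i + j >= N else 1   # x^n = -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % Q
    return c

def small():
    return [random.randint(-1, 1) for _ in range(N)]

def keygen():
    s, e = small(), small()
    a = [random.randrange(Q) for _ in range(N)]
    b = [(x + P * y) % Q for x, y in zip(polymul(a, s), e)]  # b = a*s + p*e
    return s, ([-x % Q for x in a], b)                        # pk = (-a, b)

def encrypt(pk, m):
    neg_a, b = pk
    r, e1, e2 = small(), small(), small()
    # (y << 5) is the shift-based multiplication by p = 32.
    c0 = [(x + (y << 5) + z) % Q for x, y, z in zip(polymul(b, r), e1, m)]
    c1 = [(x + (y << 5)) % Q for x, y in zip(polymul(neg_a, r), e2)]
    return c0, c1             # (m + p*e1 + b*r,  p*e2 - a*r)

def decrypt(s, ct):
    c0, c1 = ct
    raw = [(x + y) % Q for x, y in zip(c0, polymul(c1, s))]   # c0 + c1*s
    centered = [v - Q if v > Q // 2 else v for v in raw]      # lift to (-q/2, q/2)
    return [v % P for v in centered]                          # low 5 bits = mod p
```

Decryption works because c_0 + c_1 · s = m + p · (e · r + e_1 + e_2 · s), and for small noise the centered lift leaves m + p · (noise), whose low 5 bits recover m.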

D. UNITS OF HOMOMORPHIC EVALUATION
In this section we describe the hardware architectures of the homomorphic evaluation functions BGV.HomAdd, BGV.HomMult, BGV.KeySwitch and BGV.ModSwitch. Note that these functions are normally executed on the cloud-server side and are the focus of our acceleration. Let the ciphertext polynomials be ct_1 = (c_0, c_1) and ct_2 = (c'_0, c'_1); BGV.HomAdd adds the two input ciphertexts and outputs their sum ct_HomAdd = (c_0 + c'_0, c_1 + c'_1). BGV.HomMult is the most complicated function: it directly causes dimension expansion and rapid noise growth of the ciphertexts, which ultimately leads to decryption failure. To solve these two problems, BGV.KeySwitch and BGV.ModSwitch are introduced. BGV.KeySwitch reduces the dimension of the ciphertexts from three components back to two, while BGV.ModSwitch reduces the noise by shrinking the modulus, still ensuring the correctness of decryption.

1) HOMOMORPHIC ADDITION AND MULTIPLICATION
The architecture for the homomorphic addition and multiplication of BGV is depicted in Figure 6. Initially, the ciphertext polynomials ct_1 and ct_2 are input to the MUXs and cached in the input registers. The homomorphic addition logic is relatively simple: the outputs of the registers directly undergo the modular addition operations ct_HomAdd = (c_0 + c'_0, c_1 + c'_1) in parallel. If a further homomorphic addition or multiplication is needed, the results of the modular addition are written back to the corresponding MUX after being stored in the output registers for one level. Otherwise, the results of the modular addition are output directly to the decryption unit to recover the sum of the two plaintext polynomials. Similarly, the outputs of the input registers can simultaneously perform the homomorphic multiplication using the NTT-based multipliers Mult 1, Mult 2, Mult 3 and Mult 4. For the outputs of the middle two multipliers, a further additive modular operation is needed. Finally, we obtain the multiplication result ct_HomMult = (d_0, d_1, d_2) = (c_0 c'_0, c_0 c'_1 + c_1 c'_0, c_1 c'_1). However, since the dimensions of the input and output polynomials of the homomorphic multiplication do not match, a KeySwitch module must be added, which transforms the multiplicative ciphertext polynomials from (d_0, d_1, d_2) to (c̃_0, c̃_1). The details of KeySwitch are described in the next subsection. The output ciphertext polynomials of KeySwitch can also be fed back to the homomorphic addition and multiplication units to participate in the next round of computation.
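The HomAdd and HomMult data flows above can be captured in a small structural sketch. The schoolbook `polymul` below is a functional stand-in for the four NTT-based multipliers, and the toy size n = 8 replaces the paper's n = 128.

```python
# Structural model of HomAdd and the three-component HomMult of Figure 6.
N, Q = 8, 257 ** 3

def polymul(a, b):
    """Negacyclic schoolbook multiply in Z_q[x]/(x^n + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = -1 if i + j >= N else 1   # x^n = -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % Q
    return c

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def hom_add(ct1, ct2):
    return poly_add(ct1[0], ct2[0]), poly_add(ct1[1], ct2[1])

def hom_mult(ct1, ct2):
    (a0, a1), (b0, b1) = ct1, ct2
    d0 = polymul(a0, b0)
    d1 = poly_add(polymul(a0, b1), polymul(a1, b0))  # middle two multipliers + add
    d2 = polymul(a1, b1)
    return d0, d1, d2  # three components; KeySwitch maps these back to two
```

The identity (a_0 + a_1 s)(b_0 + b_1 s) = d_0 + d_1 s + d_2 s^2 is exactly why the three-component result is decryptable under (1, s, s^2) and must be key-switched back to two components.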

2) KEYSWITCH UNIT
The KeySwitch technique makes a ciphertext homomorphically decryptable with a different secret key; it reduces the dimension of the homomorphic multiplication result at the cost of a small noise growth. Concretely, when two ciphertexts are multiplied, KeySwitch transforms the multiplicative ciphertext ct_HomMult with three components, which can be decrypted with s_Mul = (1, s', s'^2), back to a new ciphertext ct_keyswitch with two components that is decryptable with the secret key s = (1, s'). KeySwitch includes two functional parts: the generation of the switching key and the application of the switching key. Generally, the former is performed in software during the key generation step; it takes an extended key (initially we have pk) and produces another extended key by adding in a so-called ''key-switching matrix'' from s'^2 to s', returning a new extended key epk (i.e. the switching key). The latter transforms a ciphertext decryptable with s'^2 into a new ciphertext decryptable with s' using this switching key, producing the ciphertext ct_keyswitch. Excluding the generation of the switching key, the KeySwitch algorithm is presented in Algorithm 4.
Algorithm 4 KeySwitch (excluding switching-key generation)
// Decompose d_2 in base t.
1: for (j = 0; j ≤ ℓ; ) do
2:   d_{2,j} = d_2 mod t
3:   d_2 = (d_2 - d_{2,j}) / t
4:   j = j + 1
5: end for
// Sum the products of the components of the switching key and d_2.
6: for (i = 0, sum_1 = 0, sum_2 = 0; i ≤ ℓ; i = i + 1) do
7:   sum_1 = sum_1 + Mult(rlk_{1,i}, d_{2,i})
8:   sum_2 = sum_2 + Mult(rlk_{0,i}, d_{2,i})
9: end for
10: c̃_0 = (d_0 + sum_1) mod q
11: c̃_1 = (d_1 + sum_2) mod q
12: Return ct_keyswitch = (c̃_0, c̃_1)
Given d_2, we can write it in base t (steps 1∼5 of Algorithm 4) as d_2 = Σ_{i=0}^{ℓ} d_{2,i} · t^i, where each d_{2,i} is a polynomial with coefficients in [0, t - 1]. We can then output the KeySwitch ciphertext (steps 6∼11) as c̃_0 = (d_0 + sum_1) mod q and c̃_1 = (d_1 + sum_2) mod q, where epk = {(rlk_{0,i}, rlk_{1,i})}_{i=0}^{ℓ} = {(-ex_a_i, ex_b_i)}_{i=0}^{ℓ} is the switching key for the key s'^2. Note that the function Mult in steps 7∼8 represents the modular polynomial multiplication in R_q. KeySwitch is another of the most computationally intensive operations in the BGV scheme. In order to reduce the number of loop iterations in steps 1 and 6 of Algorithm 4, we set the key-switching parameter t to 2^13 in our implementation, so ℓ = ⌊log_t q⌋ = 1 and the number of components of the switching key epk is equal to 2, i.e. rlk_0 = {-ex_a_0, -ex_a_1} and rlk_1 = {ex_b_0, ex_b_1}. A parallel processing architecture for KeySwitch is shown in Figure 7. Thanks to the fully pipelined and parallel processing architecture, and to the optimized polynomial multipliers, the overall performance of the KeySwitch module is greatly improved. However, while KeySwitch reduces the dimension of the multiplicative ciphertext, it also introduces some additional noise, which may lead to decryption failure. Hence, we need to consider how to further reduce the KeySwitch noise to ensure the correctness of decryption.
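The base-t digit decomposition at the heart of KeySwitch (steps 1∼5 of Algorithm 4) is simple to model per coefficient. With the paper's t = 2^13 and the ~25-bit q, ℓ = ⌊log_t q⌋ = 1, so each coefficient of d_2 splits into two 13-bit digits.

```python
# Model of the base-t decomposition in steps 1-5 of Algorithm 4,
# with t = 2^13 and ell = 1 (two digit polynomials d_{2,0}, d_{2,1}).
T_BITS, ELL = 13, 1

def decompose(c):
    """Split one coefficient into ell+1 base-t digits, least significant first."""
    return [(c >> (T_BITS * i)) & ((1 << T_BITS) - 1) for i in range(ELL + 1)]

def recompose(digits):
    """Inverse of decompose: sum of d_i * t^i."""
    return sum(d << (T_BITS * i) for i, d in enumerate(digits))
```

Keeping the digits below t = 2^13 is what bounds the extra noise introduced by the switching-key multiplications in steps 6∼9.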

3) MODSWITCH UNIT
ModSwitch (also known as modulus switching) gives us a very powerful and lightweight way to manage the noise in the BGV scheme. This technique permits the evaluator to reduce the magnitude of the noise in a ciphertext by scaling the ciphertext down, without knowing the secret key. More specifically, suppose c is a valid encryption of m under secret key s modulo Q, and c' is a simple scaling of c, namely the vector closest to (q/Q)c such that c' ≡ c (mod 2). If s is a short vector and q is sufficiently smaller than Q, it can be proved that c' is a valid encryption of m under the secret key s modulo q.
In other words, we can reduce the noise of the ciphertext c by transforming c modulo Q into a smaller ciphertext c' modulo q while preserving correctness under the same secret key.
As mentioned before, if the noise of the ciphertext generated by key switching grows too fast, we can use ModSwitch to reduce the ciphertext noise and thus further increase the depth of the multiplication circuit. Compared with previous works, one of the main contributions of our work is the design of a ModSwitch algorithm and hardware architecture for a plaintext modulus p ≠ 2 (p equals 32 in our scheme). The proposed ModSwitch algorithm is shown in Algorithm 5. Suppose ct_keyswitch = (c̃_0, c̃_1) is the KeySwitch ciphertext with modulus q_l at level l (initially, q_l = q and l = 1), where q_l is a product of primes satisfying q_l = ∏_{j=0}^{l} p_j for l = 0 to L - 1, each p_j ≡ 1 (mod p), and L is a system parameter. Suppose q_{l'} is the modulus of the ModSwitch ciphertext at level l', where l' > l, and the scaling factor Δ = q_l / q_{l'} satisfies q_{l'} < q_l. Then, we first perform a reduction modulo Δ and a reduction modulo p on (c̃_0, c̃_1) respectively, and in order to ensure that the coefficients of the correction terms δc_0 and δc_1 are divisible by p, we further subtract Δ · δ'c_0 and Δ · δ'c_1 from the coefficients of δc_0 and δc_1 respectively (steps 3∼17). After fixing the coefficients of δc_0 (s.t. δc_0 ≡ c̃_0 (mod Δ) and δc_0 ≡ 0 (mod p)) and δc_1 (s.t. δc_1 ≡ c̃_1 (mod Δ) and δc_1 ≡ 0 (mod p)), the ModSwitch ciphertext (c'_0, c'_1) is obtained by computing c'_0 = (c̃_0 - δc_0)/Δ and c'_1 = (c̃_1 - δc_1)/Δ respectively. Following Algorithm 5, we further propose the architecture of ModSwitch shown in Figure 8. In our implementation, the initial ciphertext modulus is q_l = q = 257^3 = 16974593, the ModSwitch ciphertext modulus is q_{l'} = 257^2 = 66049, and the scaling factor is Δ = 257. Since there is no data dependency between the ciphertexts c̃_0 and c̃_1 during the modulus switching, their ModSwitch operations can be performed completely in parallel.
When the ciphertexts c̃_0 and c̃_1 arrive, we first perform the reduction modulo Δ and the reduction modulo p respectively (steps 1∼2 of Algorithm 5). Since the modulus p is a power of 2, its reduction simplifies to directly taking the lower 5 bits. Next, from the results of the modulo-Δ reduction (namely δc_0 and δc_1) we subtract the product of Δ and the results of the modulo-p reduction (namely δ'c_0 and δ'c_1). The MUXs are used to select as output either the results of the modulo-Δ reduction or the results of the subtraction (steps 3∼17 of Algorithm 5). Then, the subtractions c̃_0 - δc_0 and c̃_1 - δc_1 are performed on the outputs of the MUXs respectively. Finally, we obtain the ModSwitch ciphertexts (c'_0, c'_1) by performing the division operations on the previous subtraction results; the division is realized directly with an IP core. Note that although we present the hardware architecture of the ModSwitch unit, it is not connected in the final overall accelerator, mainly because the ciphertext modulus of our NTT-based multiplier is fixed and can no longer be applied to the ModSwitch result.
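The coefficient-level fix-up of Algorithm 5 can be checked with a tiny model. This uses the paper's concrete values Δ = 257 and p = 32 (note Δ ≡ 1 mod p, which is exactly why the subtraction step forces divisibility by p); the final reduction modulo q_{l'} and the signed/centered handling of the hardware are omitted for clarity.

```python
# Per-coefficient model of the ModSwitch fix-up and scaling in Algorithm 5.
DELTA, P = 257, 32   # scaling factor and plaintext modulus; DELTA % P == 1

def modswitch_coeff(c):
    d = c % DELTA            # d ≡ c (mod Delta)                (steps 1-2)
    d -= DELTA * (d % P)     # force d ≡ 0 (mod p): works since Delta ≡ 1 (mod p)
    return (c - d) // DELTA  # exact division, since c - d ≡ 0 (mod Delta)
```

The two congruences on the correction term d guarantee that the output both shrinks the coefficient by a factor Δ (up to a small additive fix-up) and preserves the plaintext residue modulo p.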

E. GENERALIZATION AND DISCUSSION
Although our FPGA-based high-parallelism architecture mainly targets the BGV FHE scheme, it is worth noting that the architectures of the kernel acceleration units (including the polynomial multiplication unit and the modular reduction unit), the KeySwitch unit and the ModSwitch unit are also perfectly suitable for other leveled Ring-LWE homomorphic encryption schemes such as FV [12], [31], [32] and YASHE [17], [28]. Furthermore, the homomorphic encryption unit and the homomorphic evaluation unit can be applied to other Ring-LWE FHE schemes with minor modifications of the multiplication factors, the signs of polynomials or some key parameters. For example, when performing the encryption algorithm, we only need to additionally multiply the plaintext polynomial by Δ = ⌊q/t⌋, and multiply the ciphertext by t/q when performing the decryption algorithm and homomorphic multiplication; then our homomorphic encryption accelerator is fully applicable to the FV and YASHE algorithms. On the other hand, although we use a small parameter set with polynomial degree n = 128 and a 27-bit ciphertext modulus, our hardware accelerator still supports larger parameter sets simply by increasing the computation cycles of the polynomial multipliers and slightly modifying the architecture of the modular reduction unit. Therefore, the proposed Ring-LWE accelerator generalizes well to different application scenarios.

A. RESOURCE CONSUMPTION
The proposed hardware accelerator for the Ring-LWE based BGV scheme is described in the Verilog HDL language, and synthesized and implemented in Xilinx VIVADO on a Virtex UltraScale FPGA platform with the chip XCVU125-FLVA2014-1HV-E. We evaluate our design on a small parameter set: the size (n) of the ciphertext polynomial is 128, and the coefficients of the ciphertext polynomial are 27-bit signed integers under the ciphertext modulus (q). From the perspective of the hardware implementation and the application scenarios of packing technology, we set the plaintext modulus (p) as a 5-bit number, which means that the sum or product of the coefficients of two plaintext polynomials cannot exceed p; otherwise, wrap-around will occur. We elaborate the resource consumption of each component of our design, from the basic modules to the whole accelerator, in Table 3. As can be seen, since the NTT-based multiplier contains three identical NTT transformation units, two of which are used for the input polynomials and one for the output polynomial, the resource consumption of the multiplier is about 3 times that of a single NTT transformation. Further analysis shows that the multipliers are the primary part of the accelerator's resource overhead; the design consumes a total of 11 multipliers. Specifically, the encryption module and decryption module occupy two multipliers and one multiplier respectively, while the homomorphic multiplication and KeySwitch modules consume four multipliers each. Therefore, the area overhead of these modules is approximately proportional to the number of multipliers consumed; e.g. the LUTs/registers/BRAMs/DSPs of the KeySwitch module are almost 4 times those of the NTT-based multiplier. Note that the 27-bit modular reduction (MR) and the 54-bit modular reduction represent the modular operations for addition and multiplication respectively, and their architectures are implemented with combinational logic circuits.
The LUTs and registers consumed by ModSwitch are slightly large, mainly due to the use of two divider IP cores, each of which occupies 495 LUTs and 315 registers. The block RAM (BRAM) in our implementation represents the on-chip memory for fast read and write operations, and is used to realize the dual-port RAMs and read-only ROMs. The BRAMs in our design consist of RAMB36 units, each of which can hold 128 27-bit values. The Digital Signal Processing (DSP) blocks perform the 27-bit coefficient multiplications using DSP48E2 units. Finally, the resource consumption of the complete design (excluding the ModSwitch module) is given in the last row of Table 3.

B. PERFORMANCE EVALUATION
In our implementation, all polynomial coefficients are input to each component of the accelerator serially, and the intermediate results are temporarily stored in BRAMs or registers to maintain high-speed pipelined processing. The operating clock frequency directly affects the performance of the accelerator; in order to reduce the delay of the critical path, we eliminated several critical paths over many design iterations by altering the data flow of the computation, minimizing the number of logic levels per pipeline stage, etc. Finally, our accelerator runs at 150 MHz on the Virtex UltraScale FPGA. Table 4 presents the performance of the basic operations. The NTT-based multiplier is still the unit that most affects the performance of the accelerator. A single NTT transformation takes 1153 cycles to process 128 coefficients serially using a four-stage finite state machine; at 150 MHz, this corresponds to 7.69 µs. Since the multiplier chains two NTT transformations (used as NTT and INTT), it consumes approximately twice as many cycles as a single NTT. However, when the pipeline is full, the effective latency of the multiplier drops to 6.84 µs. For the same reason, the encryption, decryption, homomorphic multiplication and KeySwitch modules each adopt a single level of multipliers, so their clock overhead is the same as that of one multiplier. Since the homomorphic addition is a combinational logic circuit, the addition can be performed together with the following register-store operation, which occupies only a few clock cycles. We then evaluate the end-to-end performance of homomorphic addition and multiplication from the encryption module to the decryption module. For a single set of polynomial inputs, a total of 4235 cycles and 8341 cycles are spent on HomAdd_Enc_Dec and HomMult_Enc_Dec, corresponding to 28.24 µs and 55.63 µs respectively.
If the pipeline is full, the effective cycle counts of HomAdd_Enc_Dec and HomMult_Enc_Dec are both reduced to 1025, corresponding to 6.84 µs.
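The reported figures can be cross-checked with simple arithmetic: 1025 pipeline-full cycles at 150 MHz for one set of n = 128 coefficients of 27 bits each gives the latency and throughput quoted in the comparison tables (small rounding differences aside).

```python
# Back-of-envelope consistency check of the reported performance figures.
CYCLES, F_MHZ = 1025, 150     # pipeline-full cycles, clock frequency
N_COEFF, WIDTH = 128, 27      # coefficients per polynomial, bits per coefficient

latency_us = CYCLES / F_MHZ                      # about 6.83 us (paper: ~6.84 us)
throughput_mbps = N_COEFF * WIDTH / latency_us   # about 506 Mbps (paper: 505.26 Mbps)
```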

C. COMPARISON WITH RELATED WORKS
Firstly, we compare our 54-bit LUT-based modular reduction unit with the Barrett, pseudo-Fermat-prime and straightforward modular reduction methods of Reference [37], as shown in Table 5. A 64-bit input value with a 31-bit modulus q (equal to 0×439.0001) is chosen as the input parameter in Reference [37]. As can be seen, although the bit widths of the input value and modulus in our design are slightly smaller, the LUTs consumed by our modular reduction are still about 10 times fewer than for the Barrett algorithm, which has the lowest resource overhead in Reference [37]. For a fair comparison, we also followed the Barrett implementation method of Reference [43] and designed an improved Barrett modular reduction unit with the same parameters as our LUT-based modular reduction; even so, the LUT cost of our design is about 2 times lower than that of the improved Barrett method of Reference [43] under the same conditions. Secondly, the resource consumption and performance of the polynomial multipliers are compared. Because of the differences in parameter choices and implementation platforms, a completely fair comparison between the different implementations is not always possible. Hence, we compare our polynomial multiplier with related works in terms of throughput and normalized efficiency, as shown in Table 6. Reference [37] proposes a Pease polynomial multiplier and a Cooley-Tukey polynomial multiplier for BGV homomorphic encryption with polynomial length N = 32768 and moduli of q = 192 bits and q = 64 bits respectively. A pipelined and loop-unrolled Schoenhage-Strassen FFT polynomial multiplier (N = 1024, q = 31 bits) is presented in Reference [44]. Reference [45] describes an open-source FPGA implementation of an NTT-based polynomial multiplier (N = 128, q = 14 bits) for post-quantum cryptographic primitives. Finally, an optimized Karatsuba-based multiplier is proposed in Reference [46].
In order to improve the processing speed and throughput of the multiplier, we adopt a multi-level pipeline structure in the horizontal direction and a parallel NTT processing structure in the vertical direction. When the pipeline is full, the latency of the proposed multiplier is about 6.84 µs, and the corresponding throughput is 505.26 Mbps. The throughput of our design achieves a 1.89∼7.31 times speedup compared with the other works, except for the Cooley-Tukey multiplier in [37] and the Karatsuba-based multiplier in [46]. This is mainly because the Cooley-Tukey multiplier [37] adopts a higher-performance ''ping-pong'' BRAM structure and two parallel butterflies at the cost of area, while the Karatsuba-based multiplier [46] has lower algorithmic complexity. However, in terms of normalized efficiency, our multiplier still holds advantages of 1.62 times and 1.18 times over the Cooley-Tukey multiplier [37] and the Karatsuba-based multiplier [46] respectively. In addition, although our efficiency is slightly lower than that of the Schoenhage-Strassen multiplier in [44], we achieve more than 7× its throughput. Therefore, when the pipeline is full, the proposed NTT-based multiplier has great advantages in performance and normalized efficiency compared with the state of the art. However, if the pipeline is not full, the latency of the proposed multiplier lies between 6.84 µs and 14.54 µs, the performance and efficiency of our multiplier drop by at most half, and the advantages of our parallel and pipelined structure cannot be fully exploited. Finally, it is worth noting that the resource occupation and performance figures for Reference [37] in Table 6, which uses a Stratix V FPGA device, are taken directly from the original literature, so they can be used for performance comparison without resource-occupation conversion.
If it is necessary to account for the time-cost impact of the different process technologies of different devices, we can normalize the throughput by dividing the throughput figures by the clock frequency. The results show that the speedup ratio over the Pease implementation [37] is further increased by 1.2 times, while the speedup ratio over the Cooley-Tukey implementation [37] still maintains a certain advantage.
Lastly, we compare the performance of our accelerator with similar works, as shown in Table 7. Since implementations of the BGV scheme are limited in the literature, in addition to comparing with the existing BGV software and hardware implementations in [37], we also compare our accelerator with the FV implementation in [31], FV being the Ring-LWE FHE scheme most similar to BGV. As discussed in Subsection E of Section IV, the BGV scheme can be easily extended to the FV scheme with minor modifications, so it is reasonable to compare our BGV accelerator with an FV implementation.
Reference [37] proposes a typical software implementation on a general-purpose computer, and further presents two hardware implementations of a BGV homomorphic encryption accelerator based on the Pease multiplier and the Cooley-Tukey multiplier respectively. To improve the performance of the accelerator, they not only use the negative wrapped convolution to speed up the NTT-based polynomial multiplier, but also use the Chinese Remainder Theorem (CRT) to optimize the polynomial multiplication over a larger ciphertext space. For the parameter set with polynomial size 32768 and a 1088-bit ciphertext space, the software implementation requires about 670 ms and 324 ms for the encryption and decryption algorithms, while the Pease implementation and the Cooley-Tukey implementation require 327 ms and 166 ms for encryption, and 53 ms and 73 ms for decryption, respectively. As mentioned previously, if there are multiple sets of input polynomials and the pipeline is full, the processing speed of our design reaches 6.84 µs for the encryption and decryption algorithms, and the throughput is 505.26 Mbps, which is about 9.49 times and 4.60 times higher than that of the software implementation [37], and about 4.64 times and 2.17 times higher than that of the Pease implementation [37]. When compared with the Cooley-Tukey implementation [37], our design achieves a 2.36 times improvement for the encryption algorithm, while the throughput of the decryption algorithm is increased by 1.03 times.
Roy et al. [31] introduced HEPCloud, an FPGA-based multicore processor for FV somewhat-homomorphic function evaluation. To efficiently implement the homomorphic addition and multiplication of the FV scheme, they simplify the modular reduction by lifting a polynomial in R_q to the ring R_Q with a larger modulus Q, and scaling back to the ring R_q when the computations are completed. They report computation times for homomorphic addition and homomorphic multiplication of about 0.05 s and 26.67 s respectively, owing to slow memory access. The throughput of our design is about 5.05 times and 167.30 times higher for the homomorphic addition and homomorphic multiplication evaluations. In terms of normalized efficiency, although the throughput per LUT of our homomorphic addition is slightly lower than that of [31], the normalized efficiency of our homomorphic multiplication is still 31.6 times higher.
Furthermore, if the pipeline is not full, the latency of homomorphic encryption and homomorphic evaluation increases to a maximum of 14.55 µs, and the throughput of our implementation drops accordingly. Even so, our performance remains better than that of the previous works.
Finally, note that although Table 7 lists two FPGA devices (Stratix and Virtex) from different vendors, this does not affect the resource-overhead comparison: reference [37] only reports the time cost of encryption and decryption and provides no resource consumption figures. If one must account for the different process technologies of the two devices, the throughput values in Table 7 can be divided by the clock frequency to eliminate the impact of the frequency difference. In that case, the speedup ratios are reduced by about two-thirds, yet our accelerator still retains an advantage of several to dozens of times, except for the decryption unit and the homomorphic addition unit.
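The frequency normalization above amounts to converting throughput into bits per clock cycle (Mbps divided by MHz), which removes the advantage a design gains purely from a faster process node. A minimal sketch, using only this paper's own figures (505.26 Mbps at 150 MHz); a competitor's numbers would be normalized the same way:

```python
# Frequency-normalized throughput: Mbps / MHz = bits transferred per cycle.
# This factors out clock-frequency (process-technology) advantages when
# comparing designs on different FPGA families.

def bits_per_cycle(throughput_mbps, freq_mhz):
    return throughput_mbps / freq_mhz

ours = bits_per_cycle(505.26, 150.0)   # this paper's encryption throughput
print(f"{ours:.2f} bits/cycle")        # about 3.37 bits per cycle
```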

VI. SECURITY DISCUSSION
The security of our accelerator comprises two parts: the security of the homomorphic encryption algorithm and the security of the FPGA-based hardware accelerator architecture.
From a proof-of-concept perspective, we use a small parameter set with polynomial degree n = 128 and a 27-bit ciphertext modulus, so the security level of our design is slightly below 128 bits. However, as discussed in subsection E of Section IV, our hardware accelerator can easily be extended to larger polynomial degrees to support higher security levels, requiring only minor modifications to the computation cycles and the structure of the modular reduction. For example, with polynomial degree n = 1024 and ciphertext modulus log_2 q = 27, the security level of the homomorphic encryption algorithm reaches 128 bits [47]. Meanwhile, the security of our BGV scheme rests on the Ring-LWE assumption, which reduces to worst-case problems on ideal lattices; this ensures that our accelerator and the FHE algorithm resist attacks by future quantum computers.
On the other hand, the security of our FPGA-based hardware accelerator is guaranteed mainly by the FPGA platform and the homomorphic encryption algorithm. A classical method to reverse engineer a chip is the black-box attack [48], in which the attacker feeds in all possible input combinations and records the corresponding outputs. Given the complexity of our design and the size of our state-of-the-art FPGA platform, extracting the inner logic of our accelerator in this way is infeasible without massive computing resources; the nature of the FHE algorithm itself further hinders such an attack. The readback attack [49] is another conventional attack on FPGA implementations: the attacker reads the configuration of the FPGA through the programming interface or JTAG to obtain the private keys or the FHE algorithm. This attack can be prevented by setting a security bit in the FPGA, which disables such features; better still, our FPGA-based accelerator can be embedded in a secure environment where the configuration information is erased as soon as tampering is detected. To obtain the private keys or the FHE algorithm, an attacker would then have to reverse-engineer the bitstream [50]. FPGA manufacturers claim that the security of the bitstream relies on keeping the layout of the configuration information undisclosed. Hence, encrypting the configuration file is the most effective and practical countermeasure: it prevents not only reverse-engineering attacks but also the cloning of SRAM FPGAs. Although we have listed some conventional attacks and countermeasures for our FPGA-based hardware implementation, new attack methods and protection strategies will keep emerging as FHE algorithms and their acceleration technologies evolve; since the security of FPGA-based hardware architectures is a separate and complex problem, we do not discuss it further here due to length limitations.

VII. CONCLUSION
This paper focuses on an FPGA hardware implementation of Ring-LWE based leveled fully homomorphic encryption. We present a hardware implementation of the BGV scheme that covers all components required for homomorphic encryption and homomorphic evaluation, and our architecture provides a trade-off between hardware cost and performance. To accelerate the computation-intensive operations of the homomorphic encryption functions, we put forward a high-performance iterative NTT-based modular polynomial multiplier and a self-designed LUT-based modular reduction unit with low resource consumption. On this basis, we accelerate each functional component of homomorphic encryption and homomorphic evaluation, covering both the client-side and server-side functions. In particular, we implement the homomorphic evaluation module of the BGV scheme, including modulus switching, for the first time. Finally, we evaluate the resource consumption and performance of our implementation on a Virtex UltraScale FPGA platform. We find that our modular reduction unit saves at least 2 times the area, and our polynomial multiplier achieves at least 20% higher normalized efficiency compared with existing implementations. Moreover, we demonstrate that the performance of our overall architecture is also optimal, at the cost of slightly larger resource occupation. As future work, we will extend our design to multicore application scenarios and support a wider range of parameters.