Accelerating Finite Field Arithmetic for Homomorphic Encryption on GPUs

Fully homomorphic encryption (FHE) is a rapidly developing technology that enables computation directly on encrypted data, making it a compelling solution for security in cloud-based systems. In addition, modern FHE schemes are believed to be resistant to quantum attacks. Although FHE offers unprecedented potential for security, current implementations suffer from prohibitively high latency. Finite field arithmetic operations, particularly the multiplication of high-degree polynomials, are key computational bottlenecks. The parallel processing capabilities provided by modern GPUs make them compelling candidates to target these highly parallelizable workloads. In this article, we discuss methods to accelerate polynomial multiplication with GPUs, with the goal of making FHE practical.

Computer security continues to grow in importance as the computation and storage of sensitive data are increasingly outsourced to cloud-based computing services. Moreover, with the rapid development of quantum computers, we expect to see new classes of threats capable of defeating the security of long-trusted cryptosystems.
Fully homomorphic encryption (FHE) is an emerging technology with the potential to address both of these problems. FHE enables computation directly on encrypted data, concealing both operands and results from wily attackers on untrusted computing servers (see Figure 1). The security of many modern FHE schemes is based on the hardness of the ring learning with errors problem,1 which is presumed to be resistant to quantum attacks.
A major hurdle to deploying FHE in real-world applications is overcoming the high computational costs associated with its workloads. Fortunately, FHE workloads are highly parallelizable, making FHE a strong candidate for GPU acceleration. We focus our attention on two key computational bottlenecks: 1) modular reduction, or the computation of the remainder in integer division, and 2) the number theoretic transform (NTT) used to efficiently multiply polynomials. We begin by presenting background on these bottlenecks, followed by some potential approaches to accelerate them on a GPU. To that end, we leverage a number of architectural features, such as shared memory and intrinsic instructions.
We build on our preliminary exploration into accelerating FHE.2 The main contributions of this work include the following: a modulus selection methodology for Barrett reduction; a "lazy" Barrett-based NTT optimization; an evaluation of a residue number system (RNS)-based extension of our prior NTT implementation, including a comparison of 32-bit versus 64-bit word sizes; and an evaluation of multiple methods for computing the 1/N scaling in inverse NTTs.

BARRETT REDUCTION
We begin with an investigation of one of the most well-known and widely used modular reduction algorithms: Barrett reduction. In particular, we discuss the costly correctional subtraction operation appearing in computations involving Barrett reduction and present novel methods to significantly decrease the number of these operations.

Background: Modular Arithmetic and Correctional Subtraction
Let x mod q denote the remainder of a nonnegative integer x divided by a positive integer q. The naive method for performing modular reduction (i.e., the computation of x mod q) is via an integer division operation: x mod q = x − ⌊x/q⌋ × q. However, integer division is computationally expensive. Moreover, integer division typically takes a number of cycles that is highly dependent on the data, making it susceptible to side-channel attacks.3 Fortunately, there are a number of attractive alternatives for performing modular reduction, especially in conjunction with arithmetic operations such as addition and multiplication, that avoid undesirable integer division operations.
For example, Algorithm 1 specifies a simple and efficient computation of the modular reduction of a sum. Observe that, assuming a and b are each reduced (i.e., they lie in [0, q)), either a + b is already reduced, or a + b lies in [q, 2q) and requires a single correctional subtraction to become reduced (see lines 2 and 3). Correctional subtractions are a common feature in modular reduction and arithmetic algorithms. Fortunately, correctional subtractions can be implemented without branch instructions, avoiding control divergence that can significantly impact performance on GPUs.

Algorithm 1: Baseline Modular Addition Algorithm
Require: 0 ≤ a, b < q; len(q) ≤ j − 1
Ensure: sum = (a + b) mod q
1: sum ← a + b
2: if sum ≥ q then
3:   sum ← sum − q
4: return sum

Note that the bit length (i.e., the number of bits in the binary representation) of the modulus q is limited to at most one less than the word length j, preventing overflow of the intermediate operations (e.g., a + b).
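On a GPU, the correctional subtraction in lines 2–3 compiles to a predicated instruction rather than a divergent branch. The same branch-free form can be sketched in Python (a minimal illustration; the function name is ours):

```python
def mod_add(a: int, b: int, q: int) -> int:
    """Modular addition per Algorithm 1, with a branch-free correctional
    subtraction.

    Assumes 0 <= a, b < q, so a + b < 2q and one subtraction suffices.
    The comparison (s >= q) evaluates to 0 or 1, so the subtraction is
    applied without a branch -- mirroring how a GPU compiler can emit a
    predicated instruction instead of a divergent one.
    """
    s = a + b
    return s - q * (s >= q)
```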
Algorithm 2 specifies the classical Barrett reduction algorithm commonly used to efficiently reduce products.

SECURITY AND PRIVACY-PRESERVING EXECUTION ENVIRONMENTS
The key idea underlying Barrett reduction is that it replaces the computation of the quotient ⌊x/q⌋ with the computation of an "approximate quotient" quot(x) = ⌊(⌊x/2^m⌋ × μ) / 2^m⌋, where m = len(q). Provided the "Barrett constant" μ = ⌊2^(2m)/q⌋ is precomputed, the approximate quotient can be efficiently computed without integer division, as the divisions by powers of two in lines 1–2 can be implemented via bit shifts (which are highly efficient operations on GPUs). The number of correctional subtractions required to fully reduce x via classical Barrett reduction equals the difference ⌊x/q⌋ − quot(x) between the actual and the approximate quotient. Provided x is sufficiently small (i.e., less than 2^(2m)), this number is either zero, one, or two (lines 4–7).
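A minimal sketch of classical Barrett reduction follows; for brevity μ is recomputed inside the function, whereas a real kernel would precompute it once per modulus:

```python
def barrett_reduce(x: int, q: int) -> int:
    """Classical Barrett reduction of x mod q for 0 <= x < 2^(2m), m = len(q).

    The approximate quotient replaces the integer division floor(x/q) with
    two bit shifts and a multiplication by the precomputed constant
    mu = floor(2^(2m) / q).  The approximation undershoots floor(x/q) by at
    most two, so at most two correctional subtractions are needed.
    """
    m = q.bit_length()
    mu = (1 << (2 * m)) // q          # Barrett constant (precompute in practice)
    quot = ((x >> m) * mu) >> m       # approximate quotient (two shifts, one mul)
    r = x - quot * q                  # partial remainder
    while r >= q:                     # at most two correctional subtractions
        r -= q
    return r
```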

Omitting Correctional Subtractions
In the original article introducing Barrett reduction, Paul Barrett remarked, without further qualification, that his algorithm requires a second conditional subtraction "only in 1% of cases."4 This motivates the following question: is it possible to modify Barrett's algorithm so that the second correctional subtraction (i.e., lines 6–7) may be omitted?

Improved Quotient Approximation
To address this question, we first consider an approach that modifies the classical quotient approximation. Dhem and Quisquater5 introduced the following generalization of Barrett's approximate quotient with parameters α and β: quot(x) = ⌊(⌊x/2^(m+β)⌋ × ⌊2^(m+α)/q⌋) / 2^(α−β)⌋. Each instantiation (α, β) corresponds to both an upper bound on the number of required correctional subtractions and an upper bound on the bit length of the modulus in the corresponding reduction algorithm. These two upper bounds are directly related and thus present a tradeoff. Instantiations requiring as few as zero and as many as three correctional subtractions have demonstrated favorable performance for various applications.6 Many prior studies and open source libraries (e.g., the well-known OpenFHE library7) use the instantiation (α, β) = (m + 3, −2), which requires at most one correctional subtraction and restricts the modulus length to at most four less than the word length (e.g., a 28-bit modulus for a single-precision 32-bit word on a GPU).
To the best of our knowledge, the instantiation (α, β) = (m + 1, −2) does not appear in any paper or open source library. This apparently overlooked instantiation, which we call BetterBarrett, is specified in Algorithm 3. Like the instantiation used by OpenFHE, BetterBarrett requires at most one correctional subtraction. However, BetterBarrett only restricts the modulus length to at most two less than the word length, making it an attractive alternative for applications where large moduli are desirable (such as FHE).
Observe that BetterBarrett shares the same restriction on its modulus as the classical Barrett reduction. As such, BetterBarrett may be viewed as a more performant replacement for the classical Barrett reduction (as shown in Figure 2). We are not aware of a practical context where classical Barrett is preferable to BetterBarrett.
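Under our reading of the Dhem–Quisquater parameterization, BetterBarrett can be sketched as follows; the helper name is ours, and the shift amounts m + β = m − 2 and α − β = m + 3 follow from (α, β) = (m + 1, −2):

```python
def better_barrett(x: int, q: int) -> int:
    """Dhem-style Barrett reduction with (alpha, beta) = (m + 1, -2).

    For a modulus q of bit length m and any product x <= (q - 1)^2, the
    approximate quotient undershoots floor(x/q) by at most one, so a single
    (branch-free) correctional subtraction fully reduces x.  Both shifts
    are by fixed amounts, and no integer division is performed at runtime.
    """
    m = q.bit_length()
    mu = (1 << (2 * m + 1)) // q             # floor(2^(m + alpha) / q)
    quot = ((x >> (m - 2)) * mu) >> (m + 3)  # shifts by m + beta and alpha - beta
    r = x - quot * q                         # partial remainder, in [0, 2q)
    return r - q * (r >= q)                  # at most one correctional subtraction
```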

Modulus Selection
Next, we examine a second approach for optimizing Barrett reduction that takes the choice of modulus into consideration. Recall that classical Barrett reduction requires at most two correctional subtractions to fully reduce an integer x in [0, 2^(2len(q))). This prompts an interesting question: are there any moduli q for which classical Barrett reduction requires at most one correctional subtraction to reduce any integer in [0, (q − 1)²] (and, thus, any product, the common use case for Barrett reduction)? A naive approach to determining whether a given modulus q satisfies this property involves a brute-force verification that ⌊x/q⌋ − quot(x) ≤ 1 for all x in [0, (q − 1)²]. We prove in Theorem 1 that the number of required verifications can be reduced by a factor of q.

September/October 2023 IEEE Micro
Theorem 1: Barrett reduction requires at most k correctional subtractions to fully reduce any integer in [0, L] if and only if j − quot(j × q) ≤ k for all j in [0, ⌊L/q⌋].
Proof: The claim follows from the fact that the quotient function x ↦ ⌊x/q⌋ and the approximate quotient function x ↦ quot(x) are monotone increasing, with the former remaining constant on the intervals [j × q, (j + 1) × q) for each integer j. The difference ⌊x/q⌋ − quot(x) is therefore maximized at the left endpoint j × q of each interval.
Using a search algorithm based on Theorem 1, we found that moduli q requiring at most one correctional subtraction to fully reduce a product via classical Barrett reduction are surprisingly common. For example, of the 395 30-bit primes q that are "negacyclic friendly" for N = 2^16 (i.e., moduli that are relevant for FHE), 192 require at most one correctional subtraction to fully reduce a product. This provides yet another alternative to the widely used Dhem–Quisquater reduction corresponding to (α, β) = (m + 3, −2), one that imposes no further restriction on the modulus length.
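The verification enabled by Theorem 1 can be sketched as follows; the function name is ours, and it reports the worst case over all products rather than testing a fixed k:

```python
def max_barrett_subtractions(q: int) -> int:
    """Maximum number of correctional subtractions classical Barrett
    reduction needs for any product x in [0, (q - 1)^2].

    By Theorem 1, the error floor(x/q) - quot(x) is maximized at the left
    endpoint of each interval [j*q, (j + 1)*q), so it suffices to test
    x = j*q for j in [0, floor((q - 1)^2 / q)] -- about q checks instead
    of about q^2.
    """
    m = q.bit_length()
    mu = (1 << (2 * m)) // q
    worst = 0
    for j in range((q - 1) ** 2 // q + 1):
        x = j * q
        quot = ((x >> m) * mu) >> m
        worst = max(worst, j - quot)   # floor(x/q) equals j exactly here
    return worst
```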
In general, this modulus selection approach may be useful in contexts, such as implementations of the NTT based on the RNS, where a relatively large set of moduli must be selected from a heavily restricted class of candidates. A currently prevalent modulus selection approach involves simply choosing the largest several moduli from the candidate set; see, e.g., the Microsoft Simple Encrypted Arithmetic Library.

POLYNOMIAL MULTIPLICATION
Next, we shift our attention from modular multiplication to polynomial multiplication. More specifically, we focus on the NTT, an algorithm commonly used to efficiently implement polynomial multiplication in the context of FHE. Throughout, we fix a degree-bound N and represent a polynomial Σ_{i=0}^{N−1} a_i x^i as an N-dimensional coefficient vector a = (a_0, a_1, ..., a_{N−1}).

Background: NTT
The naive approach to multiplying polynomials requires on the order of N² products and sums. This can be reduced to order N log(N) using the celebrated fast Fourier transform (FFT). The FFT consists of an iteration of stages in which pairs of coefficients are transformed via a butterfly operation. There are numerous variations of the butterfly operation, but a classical baseline is the Cooley–Tukey butterfly, which maps a pair (a_0, a_1) to (a_0 + f × a_1 mod q, a_0 − f × a_1 mod q) for a precomputed twiddle factor f. In the context of FHE, the term "polynomial multiplication" typically refers to negacyclic convolution, a slight variant of the usual polynomial multiplication. On hardware platforms such as CPUs and GPUs, negacyclic convolution is commonly implemented via an efficient combination of two specialized variants of the FFT, known as the merged NTT (specified in Algorithm 4) and the merged inverse NTT. In this case, the negacyclic convolution of a and b can be computed via the expression (1/N) × iNTT(NTT(a) ⊙ NTT(b)), where ⊙ denotes the Hadamard (elementwise) product. For further details, see Pöppelmann et al.8 We henceforth omit the modifier "merged" for brevity.
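The transform identity above can be illustrated end to end with a toy example. The sketch below deliberately uses the O(N²) evaluation form of the transforms rather than the merged butterfly network, with a small illustrative prime; all names are ours:

```python
def negacyclic_mul(a, b, q, psi):
    """Negacyclic product of a and b modulo (x^N + 1, q) via the identity
    (1/N) * iNTT(NTT(a) . NTT(b)).

    psi must be a primitive 2N-th root of unity mod q (so psi^N = -1).
    Weighting the inputs by powers of psi turns negacyclic convolution
    into cyclic convolution, which the DFT with omega = psi^2 diagonalizes.
    """
    n = len(a)
    w = pow(psi, 2, q)                        # omega: primitive N-th root of unity

    def dft(v, root):                         # naive O(N^2) transform, for clarity
        return [sum(v[i] * pow(root, i * k, q) for i in range(n)) % q
                for k in range(n)]

    ah = dft([a[i] * pow(psi, i, q) % q for i in range(n)], w)
    bh = dft([b[i] * pow(psi, i, q) % q for i in range(n)], w)
    ch = [x * y % q for x, y in zip(ah, bh)]  # Hadamard product
    c = dft(ch, pow(w, q - 2, q))             # inverse DFT uses root^(-1)
    n_inv, psi_inv = pow(n, q - 2, q), pow(psi, q - 2, q)
    return [c[k] * n_inv % q * pow(psi_inv, k, q) % q for k in range(n)]
```

With q = 17 and N = 4, psi = 2 works since 2^4 = 16 ≡ −1 (mod 17).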

Although the intermediate operations are only partially reduced (hence, the qualifier "lazy"), the final result is correct. This yields a total of N(log(N)/2 − 1) fewer correctional subtractions for each NTT.
Harvey's optimization is specified for NTTs based on Shoup's method, an approach to computing NTTs that relies on precomputed companion values for the twiddle factors in order to be implemented efficiently. We propose a Barrett-based version of Harvey's lazy butterfly that requires no such precomputed modifications.
Our lazy Barrett-based butterfly is specified in Algorithm 5. We define the truncated Barrett reduction function barrett_trunc_q to be Barrett reduction (e.g., BetterBarrett) with the correctional subtraction omitted (i.e., lines 4–5).

Algorithm 5: A Lazy Barrett-Based Butterfly
Require: 0 ≤ a_0, a_1 ≤ 2(q − 1); 0 ≤ f < q; m = len(q) ≤ j − 2; q < 2^(m−1/2)
Ensure: the outputs are congruent modulo q to those of the Cooley–Tukey butterfly and lie in [0, 2(q − 1)]

Our lazy Barrett-based NTT is defined by making the following modifications to Algorithm 4: 1) further constraining the modulus by q < 2^(len(q)−1/2); 2) replacing the butterfly in line 9 with the lazy butterfly specified in Algorithm 5; and 3) applying a correctional subtraction to each entry of the output that is greater than or equal to q.

Theorem 2: The lazy Barrett-based NTT is correct.
Proof: It suffices to show that the input to barrett_trunc_q in line 2 of Algorithm 5 lies in the prescribed domain [0, 2^(2len(q))). If f lies in [0, q) and a_1 lies in [0, 2(q − 1)], then 0 ≤ f × a_1 ≤ 2(q − 1)². This is bounded above by 2^(2len(q)) since q < 2^(len(q)−1/2).
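To make the lazy idea concrete, the sketch below keeps butterfly inputs and outputs in [0, 4q) and reduces the twiddle product with a truncated Barrett step. The interval bookkeeping is our own simplification, in the spirit of Harvey's butterfly with the Shoup multiplication replaced by a truncated BetterBarrett product, and differs in detail from Algorithm 5:

```python
def barrett_trunc(x, q, mu, m):
    """Truncated Barrett: congruent to x mod q, result left in [0, 2q).
    Requires x < 2^(2m) and 2*q*q < 2^(2m), i.e., q < 2^(m - 1/2)."""
    quot = ((x >> (m - 2)) * mu) >> (m + 3)
    return x - quot * q                  # the correctional subtraction is omitted

def lazy_ct_butterfly(x, y, f, q, mu, m):
    """Cooley-Tukey butterfly with deferred ("lazy") reduction.

    Inputs x, y lie in [0, 4q); outputs lie in [0, 4q) and are congruent
    mod q to (x + f*y, x - f*y).  The inputs are only brought down to
    [0, 2q) by a single conditional subtraction of 2q, and the product
    f*y is reduced lazily, saving one correctional subtraction per
    butterfly.  Full reduction happens once, after the final stage.
    """
    x -= 2 * q * (x >= 2 * q)            # x now in [0, 2q)
    y -= 2 * q * (y >= 2 * q)            # y now in [0, 2q)
    t = barrett_trunc(f * y, q, mu, m)   # f*y mod q, lazily, in [0, 2q)
    return x + t, x - t + 2 * q          # both in [0, 4q)
```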

Background: The RNS
To guarantee a sufficient level of security for FHE, the set ℤ_Q of coefficients must be sufficiently large. For example, it is common to require len(Q) > 1000. To efficiently implement the NTT over such large coefficients on a hardware platform with significantly smaller word lengths (e.g., the 32- or 64-bit data types available on many GPUs), we use the well-known RNS.
To use the RNS, we choose Q to be a product of distinct word-sized, negacyclic-friendly primes q_1, q_2, ..., q_ℓ, each of equal bit length m. A computation of an NTT with respect to Q can then be replaced by a computation of ℓ NTTs with respect to the ℓ distinct prime moduli. The number ℓ, called the number of limbs, is inversely related to the modulus length m by ℓ = len(Q)/m. This implies an interesting tradeoff between the workload size (i.e., the total number of butterflies computed in the ℓ NTTs) and the operational complexity (i.e., the complexity of arithmetic operations with respect to the m-bit moduli).
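The RNS decomposition and its inverse via the Chinese remainder theorem (CRT) can be sketched as follows; the primes here are small for illustration, whereas a real implementation would use word-sized negacyclic-friendly primes:

```python
from math import prod

def to_rns(x, moduli):
    """Decompose x into its residues modulo each (pairwise coprime) limb."""
    return [x % q for q in moduli]

def from_rns(residues, moduli):
    """Reconstruct x in [0, Q) from its RNS limbs via the CRT:
    x = sum_i r_i * (Q/q_i) * ((Q/q_i)^(-1) mod q_i)  mod Q."""
    Q = prod(moduli)
    x = 0
    for r, q in zip(residues, moduli):
        m = Q // q
        x += r * m * pow(m, -1, q)   # modular inverse (Python 3.8+)
    return x % Q
```

Because each residue fits in a machine word, all limb-wise arithmetic (including the ℓ NTTs) proceeds independently and in parallel.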

EFFICIENT METHODS TO ACCELERATE POLYNOMIAL MULTIPLICATION ON GPUs
Given the large degree-bound N and the large modulus Q of each polynomial in FHE, polynomial multiplication requires a significant amount of computation. Fortunately, polynomial multiplication algorithms based on the NTT (and, more generally, the FFT) are highly parallelizable and can exploit the massive number of threads and high memory throughput of modern GPUs.
GPU architectures are based on the single-instruction, multiple-thread programming model, in which groups of threads (called warps) execute instructions over a collection of data in parallel (in lockstep). Warps are grouped into blocks, which are further grouped into grids. In particular, we target the NVIDIA V100 GPU, which consists of 128 kB of L1 data and instruction caches per streaming multiprocessor (SM), a multibanked shared L2 cache (6.1 MB), and 16 GB of HBM2 global memory. Each SM also has a low-latency, user-configurable cache called shared memory (configurable in size up to 96 kB). As we will explain, we tailor our implementations by applying incremental optimizations aimed at harnessing these underlying architectural features to speed up our FHE workloads.
We obtain performance metrics for our kernels using the following tools: the NVIDIA Binary Instrumentation Tool for tracing memory transactions, the Nsight Compute kernel profiler for fetching performance counters, and the Nsight Systems performance analysis tool to obtain kernel scheduler performance and measure synchronization overheads. Our experiments provide insights into our kernel implementations and are aimed at mitigating the performance impacts of microarchitectural bottlenecks.

For all of our implementations, we use BetterBarrett to implement modular multiplication. We consider implementations based on both 32-bit and 64-bit data types, using 30-bit and 62-bit moduli, respectively (the largest bit lengths permitted by BetterBarrett reduction). We fix the NTT size at N = 2^16 and the modulus length at len(Q) = 1860. Using the RNS, this choice translates to 62 limbs for our 32-bit implementation and 30 limbs for our 64-bit implementation. For each limb, we use the merged NTT algorithm.

GPU Implementations and Analysis
Next, we discuss a series of incremental optimizations targeting the GPU architecture.
Our baseline NTT implementation employs the hybrid-kernel optimization,10 which combines single-stage and multistage GPU kernels. This approach leverages the following observation: after k stages of the NTT have been executed, the remaining computation can be subdivided into 2^k subcomputations, each operating on a distinct set of N/2^k coefficients.
We outline the hybrid-kernel approach for N = 2^16, which has 16 stages. During the first five stages, a naive single-stage kernel approach assigns one kernel per stage and synchronizes between kernels with interblock synchronization to ensure that all blocks have access to updated data. In practice, this is achieved by using a barrier on kernel finish. After the fifth stage, the remaining computation can be subdivided into 2^5 = 32 subcomputations, each operating on a distinct set of 2^11 = 2048 coefficients. For each of these subcomputations, a multistage kernel approach assigns the associated set of coefficients to a 1024-thread block. Multistage kernels exchange interblock synchronizations for intrablock synchronizations, where the barrier affects only the threads of a block, ensuring that the data in shared memory are updated before the next stage proceeds. This does not affect the rest of the blocks, allowing them to continue computation without interruption.
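The independence property that the multistage kernels exploit can be checked with a few lines of index arithmetic. This sketch assumes a radix-2 decimation-in-time stage ordering (butterfly stride N/2 in stage 1, down to stride 1 in the last stage):

```python
def butterfly_pairs(n, stage):
    """Index pairs touched at a given stage (1-indexed) of a radix-2
    decimation-in-time NTT of size n: groups of size n >> (stage - 1),
    each pairing index j with j + half the group size."""
    group = n >> (stage - 1)
    half = group >> 1
    return [(base + j, base + j + half)
            for base in range(0, n, group)
            for j in range(half)]
```

For n = 2^16, every butterfly in stages 6–16 stays inside one aligned 2048-coefficient block, so each block's worth of coefficients can be pinned to a single thread block in shared memory; stage 5 still crosses block boundaries, which is why the first five stages need interblock synchronization.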
Additionally, multistage kernels effectively leverage shared memory for efficient coefficient storage. Using shared memory mitigates redundant memory accesses (reducing memory pressure on the L1 and L2 caches) while increasing compute throughput. In Figure 3, we see the impact of shared memory on single-limb NTT and inverse NTT kernels. Incorporating shared memory provides more than a 20% improvement in the compute throughput of the NTT while significantly reducing the demands on the L1 and L2 data caches. [Shared memory bandwidth does not contribute to L1 and L2 cache bandwidth, but it does contribute to dynamic random-access memory (DRAM) bandwidth.]

Radix Configurations
Furthermore, we experimented with various configurations of higher-radix butterflies. The radix of a butterfly corresponds to the number of its operands. An increase in the radix results in a decrease in both the number of stages in each NTT and the number of kernel launches (and the associated kernel launch overhead). However, an increase in the radix can also reduce the amount of parallelism as each thread processes more data. This tradeoff can be seen in Figure 4. We achieved the best performance with a mixed 4/16-radix implementation.
Hierarchical NTT
To further increase block-level parallelism, we implemented a (2-D) hierarchical NTT.11 We refer to our implementation as our 2-D NTT. In our 2-D NTT, a set of 2^16 coefficients is arranged into a 2^8 × 2^8 row-major matrix. Processing consists of two steps. In the first step, a 2^8-point NTT is performed on each of the 2^8 columns. In the second step, a 2^8-point NTT is performed on each of the 2^8 rows. Block-level synchronizations occur between these two steps. We assign each of the 2^8-point NTTs to a single block, resulting in 256 blocks per step. Each step maps to a kernel, where each thread computes a single butterfly per stage and performs thread-level synchronizations between stages. This approach allows us to make extensive use of shared memory while reducing the pressure on the L1 and L2 data caches.
Word Lengths of 32 Versus 64 Bits
The choice of word length involves a tradeoff between operational complexity and workload size (i.e., the number of NTTs, or limbs, in the workload). Figure 5 presents performance metrics for both 32-bit and 64-bit implementations over various values of the modulus Q. The evaluation is based on two assumptions. First, we fix the NTT size at N = 2^16; in practice, the value of N depends on the value of Q. Second, we model the relationship between the number of limbs ℓ and the modulus length m as ℓ = ⌈len(Q)/m⌉. As seen in Figure 5, the baseline NTT runtimes for the 32-bit and 64-bit implementations are similar. This result aligns with prior work12 that found a "negligible difference" in the runtimes associated with these word lengths.
We also see how DRAM throughput is affected by the use of shared memory. Our 2-D NTT implementation uses shared memory throughout the entire computation, whereas the baseline implementation must read coefficients directly from global memory for five stages. The resulting latency from those accesses reduces the number of active warps, as they must wait for data to arrive, degrading compute throughput.
In the 2-D NTT, 64-bit operations take more cycles to execute than 32-bit operations; as a result, 64-bit operations are more effective at hiding memory latencies. In addition, we find that the reduced workload size resulting from 64-bit operations further offsets the impact of their extra execution cycles.

Comparison of 1/N-Scaling Methods
If a and b are polynomials, their product can be computed via the expression (1/N) × iNTT(NTT(a) ⊙ NTT(b)), where ⊙ denotes the Hadamard product. In this section, we compare the performance of three methods of performing the 1/N scaling: two that are commonly used and a third that uses precomputation. The naive approach simply scales the output of the iNTT by 1/N.
However, the 1/N scaling does not need to occur at the end. It is straightforward to verify that scaling commutes with butterfly operations, with stages of an NTT, and with entire NTTs (and their inverses). This enables a second approach, which scales the output of each butterfly by 1/2; we refer to this approach as butterfly scaling. Note that 1/2 scaling can be efficiently computed on the GPU's arithmetic logic unit via bit shifts and sums. A third approach moves half of these 1/2 scalings into the precomputed twiddle factor array, which we refer to as twiddle factor scaling.
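The modular 1/2 scaling mentioned above can be sketched as a shift plus a conditional add (for an odd modulus q); the function name is ours:

```python
def half_mod(x: int, q: int) -> int:
    """Divide x by 2 modulo an odd modulus q using only a shift and an add.

    If x is even, x/2 is exact.  If x is odd, then x + q is even (q is odd)
    and congruent to x mod q, so (x + q) >> 1 is the modular half.  No
    multiplication or division instruction is needed.
    """
    return (x >> 1) if x % 2 == 0 else ((x + q) >> 1)
```

Applying this once per butterfly across all log(N) stages of the iNTT accumulates the full 1/N factor.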
We collected performance metrics for each of these three methods for both our 32-bit and 64-bit baseline iNTT implementations. As expected, the naive 1/N scaling was the slowest. Butterfly scaling provided 1.29% and 3.85% speedups for the 32-bit and 64-bit implementations, respectively. Twiddle factor scaling provided the most significant speedups, at 3.13% and 4.86%, respectively. This was also to be expected, as much of the computation is moved into precomputation.

Lazy Arithmetic
To investigate the performance improvements provided by our lazy Barrett-based NTT algorithm, we collected performance metrics for lazy and nonlazy variants of both the baseline and 2-D NTT kernels. Our results are summarized in Figure 6. In our baseline implementations, laziness provides 4.3% and 6.7% speedups for the 32-bit and 64-bit variants, respectively. In our 2-D implementations, laziness gives 2.6% and 9.0% speedups, respectively. This performance improvement is largely due to an increase in memory throughput (DRAM, L1 cache, and L2 cache), as shown in Figure 6.

CONCLUSION
Finite field arithmetic operations are significant computational bottlenecks in FHE, impeding the widespread deployment of FHE (and lattice-based cryptography more generally) in real-world systems. We present optimizations to classical algorithms that decrease the number of required correctional subtractions, resulting in significant performance improvements. In addition, in highly parallel workloads, such as the NTT, we can obtain further speedup by carefully mapping kernels to manycore systems (e.g., GPUs). We found that applying all of the proposed optimizations produces a 2.08× speedup over a state-of-the-art NTT implementation. For FHE to be deployed in next-generation systems, performance barriers remain to be overcome. Some potential paths forward include using larger word sizes as well as leveraging multi-GPU systems.