Faster Homomorphic Trace-Type Function Evaluation

Homomorphic encryption enables computations over encrypted data without decryption and can be used to outsource computations to an untrusted party. In homomorphic encryption based on the hardness of ring learning with errors, which offers promising security and functionality, a plaintext is represented by a polynomial. A plaintext can be treated as a vector whose homomorphic evaluation enables component-wise addition and multiplication, as well as rotation across the components. We focus on a commonly used and time-consuming subroutine that homomorphically sums up the components of the vector or homomorphically extracts the coefficients of the polynomial, and call it the homomorphic trace-type function. We improve the efficiency of homomorphic trace-type function evaluation, which is conventionally performed by repeating a homomorphic rotation followed by an addition (rotations-and-sums). To correctly add a rotated ciphertext to an unrotated one, a special operation called key-switching must be performed on the rotated one. As key-switching is computationally expensive, the rotations-and-sums approach is inherently inefficient. We propose a more efficient trace-type function evaluation using loop-unrolling, which is compatible with other optimization techniques such as hoisting and can exploit multi-threading. We show that rotations-and-sums is not the optimal solution in terms of runtime complexity and that a trade-off exists between time and space. Experimental results demonstrate that our proposed method runs 1.32-2.12 times faster than the previous method.


I. INTRODUCTION
Homomorphic encryption (HE) allows computations over encrypted data without decryption. Since Gentry's first construction of fully HE [1], numerous improvements in its theory and implementation have been made to increase efficiency and to support different types of computation (e.g., [2]-[5]).
These subsequent schemes are based on the hardness of the learning with errors (LWE) problem [6] or its ring variant, the ring LWE (RLWE) problem [7]. RLWE-based HE schemes provide amortization because a plaintext is represented by a polynomial. This allows one to pack multiple messages into the polynomial coefficients, thereby performing homomorphic operations such as component-wise addition, polynomial multiplication (convolution), and rotation (with a change of signs in a certain case). Moreover, the polynomial can be seen as a vector via batching [4], [8], where each of the elements is called a plaintext slot. Batching enables homomorphic operations such as component-wise additions, multiplications, and rotations without changing signs across the slots.
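As a plaintext-level illustration of these slot-wise semantics (ignoring encryption and noise entirely), the three operations can be sketched as follows; the function names are ours, not taken from any HE library:

```python
# Plaintext-level sketch of batching slot semantics (illustration only;
# real schemes apply these operations to encrypted polynomials).

def slot_add(a, b):
    """Component-wise addition across plaintext slots."""
    return [x + y for x, y in zip(a, b)]

def slot_mul(a, b):
    """Component-wise multiplication across plaintext slots."""
    return [x * y for x, y in zip(a, b)]

def slot_rotate(a, r):
    """Cyclic rotation of the slots by r positions (no sign changes)."""
    return a[r:] + a[:r]

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
print(slot_add(a, b))     # [6, 8, 10, 12]
print(slot_mul(a, b))     # [5, 12, 21, 32]
print(slot_rotate(a, 1))  # [2, 3, 4, 1]
```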
Various useful methods for homomorphic function evaluation are expressed as a rotation followed by an addition, performed recursively. For instance, the total-sums operation [9] sums up all the slot values and stores the result back to all the slots. Total-sums and its partial variant (i.e., partially adding up consecutive slots) are used in several applications (e.g., [10]-[12]). These operations are performed by O(log N) cyclic rotations and additions for O(N) slots. Moreover, the trace function isolates coefficients by recursively performing rotations and sums as in total-sums but with different rotation amounts [13]-[15]. The trace function is used during conversion between RLWE/LWE ciphertexts [15] and coefficient selection [14]. We refer to these types of operations (i.e., successive rotations and sums) as trace-type functions.
Because of the functionality above, trace-type functions are used in many applications. However, their evaluation is computationally expensive compared with homomorphic addition and multiplication due to the successive application of homomorphic rotations. Specifically, the trace-type function evaluation is 2.49-5.43x (resp. 85.83-124.26x) slower than homomorphic multiplication in the B/FV scheme (resp. the CKKS scheme), according to the benchmark provided in [18, Table 1]. A homomorphic rotation is realized by an automorphism, which not only rotates the plaintext slots but also changes the corresponding secret key. To homomorphically add a rotated ciphertext, which is encrypted under the changed key, to the original ciphertext, the changed key must be reverted to the original one via key-switching, which is a computationally expensive operation.
Thus, trace-type functions become inefficient due to the O(log N) sequential iterations, each involving an automorphism followed by an expensive key-switching.
Contribution: In this paper, we strive to improve the efficiency of the trace-type function evaluation. To this end, we propose a more efficient trace-type function evaluation that uses loop-unrolling to reduce the number of expensive operations, leveraging a property of automorphisms, the special-modulus technique for key-switching, and a multi-core environment. This reduces the computational cost compared with the sequential method [9], [15] at the expense of a slight increase in the required storage. The high-level idea of our approach is to successively unroll consecutive subloops in the trace-type function evaluation. Loop-unrolling enables parallelization and optimizations that cannot be implemented in the sequential method. We parameterize the number of iterations after the unrolling and demonstrate that the proposed method with a suitable unrolling parameter outperforms the sequential method in both the single- and multi-threading modes.
The remainder of this paper is structured as follows. Section II presents notations and background for our development. Section III details the proposed method, and its performance evaluation is reported in Section IV. In Section V, we describe related work. We conclude the paper in Section VI.

II. PRELIMINARIES
A. NOTATION
For an integer q, we denote by Z_q := Z/qZ the quotient ring, which we identify with [-q/2, q/2) ∩ Z. Let N be a power-of-two integer. We denote by R = Z[X]/(X^N + 1) a polynomial ring and by R_q = R/qR its quotient ring. We also denote by [a]_q the reduction of an integer a modulo q into the above interval. We use [a] to denote the range {0, 1, ..., a - 1} ∩ Z for a positive integer a. We use Z*_{2N} = {2k + 1}_{k ∈ [N]} to denote the set of odd positive integers smaller than 2N. For a set S, we write u ← S for sampling an element from S uniformly at random, and u ← X for sampling an element from a distribution X. The symbol ⊕ denotes homomorphic addition. We write ⟨·, ·⟩ for the inner product of two vectors of ring elements, and |·| for the length of an array or the size of a set.
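The centered reduction [a]_q into [-q/2, q/2) ∩ Z can be sketched as follows (a toy helper of ours, not from any library):

```python
def centered_mod(a, q):
    """Reduce a modulo q into the interval [-q/2, q/2), as in the notation [a]_q."""
    r = a % q                        # representative in [0, q)
    return r - q if r >= (q + 1) // 2 else r

print(centered_mod(7, 8))   # -1
print(centered_mod(4, 8))   # -4  (left endpoint -q/2 is included)
print(centered_mod(3, 7))   # 3
```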

B. RESIDUE NUMBER SYSTEM (RNS)
The ciphertext modulus is typically larger than the CPU word size (e.g., 64 bits). To handle the ciphertext data efficiently, the modulus is represented by a product of small, pairwise co-prime moduli. Let Q be a ciphertext modulus such that Q = ∏_{i=0}^{ℓ} q_i is a product of ℓ + 1 small integers, each fitting into the word size and pairwise relatively co-prime. Then, by the Chinese Remainder Theorem, addition and multiplication of elements over Z_Q can be performed on the residues, each in Z_{q_i}, independently in parallel. Similarly, an element of R_Q can be represented via an array of elements, each in R_{q_i} for i ∈ [ℓ + 1]. This representation is called the RNS representation, where addition, multiplication, and automorphisms on an element of R_Q can be performed on the elements of the R_{q_i} in parallel. The parameter ℓ is reduced throughout the homomorphic evaluation (typically after homomorphic multiplications) and is upper bounded by L. We call a ciphertext w.r.t. modulus Q a level-ℓ ciphertext.
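The CRT-based arithmetic described above can be illustrated with toy (non-word-size) moduli; the helper names are ours:

```python
from functools import reduce

def to_rns(x, moduli):
    """RNS representation: residues of x modulo each (pairwise co-prime) q_i."""
    return [x % q for q in moduli]

def from_rns(residues, moduli):
    """CRT reconstruction of x mod Q from its residues, Q = prod(q_i)."""
    Q = reduce(lambda a, b: a * b, moduli)
    x = 0
    for r, q in zip(residues, moduli):
        Qi = Q // q
        x += r * Qi * pow(Qi, -1, q)   # pow(., -1, q): modular inverse (Python 3.8+)
    return x % Q

moduli = [13, 17, 19]                  # toy co-prime moduli, Q = 4199
a, b = 1234, 2345
ra, rb = to_rns(a, moduli), to_rns(b, moduli)
# Addition and multiplication act independently on each residue:
sum_rns  = [(x + y) % q for x, y, q in zip(ra, rb, moduli)]
prod_rns = [(x * y) % q for x, y, q in zip(ra, rb, moduli)]
assert from_rns(sum_rns, moduli) == (a + b) % 4199
assert from_rns(prod_rns, moduli) == (a * b) % 4199
```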
Typically, two different representations are used to express elements of each R_{q_i}. The coefficient representation is the most natural one, as the coefficients with respect to the power basis are aligned as a vector. This representation allows the direct application of nonlinear operations (e.g., scaling, lifting, and word decomposition) on R_{q_i}. On the other hand, the multiplication of two elements of R_{q_i} suffers from quadratic complexity in N. The evaluation representation enables multiplication in O(N) at the cost of O(N log N) conversion between the two representations. To convert the coefficient representation into the evaluation representation, the number theoretic transform (NTT) is required, whereas the inverse NTT is required in the reverse direction. The (inverse) NTT is a bottleneck in HE implementations with full-RNS capability [19]-[22].
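The quadratic-cost multiplication in coefficient representation over Z_q[X]/(X^N + 1) can be sketched as follows (schoolbook version; an NTT-based version would instead use pointwise products in evaluation representation at O(N log N) conversion cost):

```python
def negacyclic_mul(a, b, q):
    """Schoolbook multiplication in Z_q[X]/(X^N + 1): O(N^2) in coefficient
    representation; wrap-around terms use X^N = -1, which flips the sign."""
    N = len(a)
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:                       # degree >= N wraps with a sign flip
                c[k - N] = (c[k - N] - a[i] * b[j]) % q
    return c

# (1 + X)^2 = 1 + 2X + X^2 in Z_17[X]/(X^4 + 1):
print(negacyclic_mul([1, 1, 0, 0], [1, 1, 0, 0], 17))  # [1, 2, 1, 0]
# X^3 * X = X^4 = -1:
print(negacyclic_mul([0, 0, 0, 1], [0, 1, 0, 0], 17))  # [16, 0, 0, 0]
```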
Throughout the paper, we use a superscript within parentheses to denote a ring element with respect to a large modulus in the RNS representation. Namely, for x ∈ R_Q, we denote its RNS representation by {x^(i)}_{i ∈ [ℓ+1]}, where x^(i) = [x]_{q_i}, and call each x^(i) an RNS component.

C. AUTOMORPHISM AND KEY-SWITCHING
Automorphism: For t ∈ Z*_{2N}, an automorphism is defined as the map ψ_t : R → R, a(X) ↦ a(X^t) mod Φ_{2N}(X), where Φ_{2N}(X) = X^N + 1. After applying an automorphism to a ciphertext, the corresponding secret key changes. Homomorphic evaluation between ciphertexts under different secret keys is not supported; to perform further computations on such ciphertexts, one needs to switch the corresponding key into another via key-switching. If we apply an automorphism ψ_t to a vector of polynomials, it affects each of the polynomials. The effect of an automorphism is to rotate the plaintext slots/coefficients of a polynomial by a specific amount [23].
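At the plaintext level, the map a(X) ↦ a(X^t) mod (X^N + 1) can be sketched on a coefficient vector as follows (toy parameters; in an actual scheme this acts on encrypted polynomials):

```python
def automorphism(a, t, q):
    """psi_t: a(X) -> a(X^t) mod (X^N + 1), for odd t in Z*_{2N}.
    Coefficient a_i moves to degree i*t mod 2N; degrees >= N wrap with a sign flip."""
    N = len(a)
    out = [0] * N
    for i, coeff in enumerate(a):
        d = (i * t) % (2 * N)
        if d < N:
            out[d] = (out[d] + coeff) % q
        else:
            out[d - N] = (out[d - N] - coeff) % q
    return out

# N = 4, t = 3: X -> X^3, X^2 -> X^6 = -X^2, X^3 -> X^9 = X
a = [1, 2, 3, 4]                       # 1 + 2X + 3X^2 + 4X^3 in Z_17[X]/(X^4 + 1)
print(automorphism(a, 3, 17))          # [1, 4, 14, 2], i.e. 1 + 4X - 3X^2 + 2X^3
```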
Key-Switching: For c_1 ∈ R_Q, let Decomp be a function that outputs a set of small elements (c̃_0, c̃_1, ..., c̃_{d-1}) ∈ R^d for decomposition length (or number of digits) d, such that c_1 = Σ_{j ∈ [d]} c̃_j · g_j for an associated gadget vector (g_0, ..., g_{d-1}). Let χ be a discrete Gaussian distribution and P the special modulus. Key-switching consists of the following two algorithms:
• SwitchKeyGen(s, s'): For each j ∈ [d], sample a_j ← R_{PQ} uniformly at random and e_j ← χ, and output swk_{s→s'} = (swk_{s→s'}[0], swk_{s→s'}[1]), where swk_{s→s'}[0][j] = -a_j · s' + e_j + P · g_j · s and swk_{s→s'}[1][j] = a_j over R_{PQ}.
• KeySwitch(c = (c_0, c_1), swk_{s→s'}): Compute ĉ = Decomp(c_1) and output (c'_0, c'_1) ∈ R²_Q, where c'_0 = c_0 + ⌊⟨ĉ, swk_{s→s'}[0]⟩/P⌉ and c'_1 = ⌊⟨ĉ, swk_{s→s'}[1]⟩/P⌉.
We defer the details of KeySwitch with the RNS representation to the next section.

D. RNS-FRIENDLY KEY-SWITCHING
The idea behind RNS-friendly key-switching is to decompose c_1 ∈ R_Q into small ring elements via simple modular reductions by partial products of Q, followed by constant multiplications. Such modular reductions are almost free in the RNS representation, as the RNS components of c_1 w.r.t. the partial products are already available; they can be obtained by simply splitting the RNS components of c_1. The size of the partial products determines the decomposition granularity and the size of the special modulus. Since the decomposition granularity is controllable [26], we can choose the best granularity depending on the target computation and the security level.
More specifically, the decomposition granularity is controlled by the maximal number of digits d for a particular L. Let α = (L + 1)/d be the number of RNS moduli per digit, where (L + 1) is divisible by d. We use β = ⌈(ℓ + 1)/α⌉ to denote the number of digits for key-switching on a level-ℓ ciphertext. The difference between d and β is due to the fact that ℓ changes during the homomorphic evaluation. Let Q_j = ∏_{i=jα}^{(j+1)α-1} q_i for j ∈ [d] be the partial product associated with the j-th digit, and let Q̂_j = Q_L/Q_j. Then, for x ∈ R_Q, the decomposition function is defined by Decomp(x) = ([x]_{Q_0}, ..., [x]_{Q_{β-1}}), whose j-th entry is an element of R_{Q_j}. We denote by C = {q_0, q_1, ..., q_ℓ} and B = {p_0, ..., p_{k-1}} the bases that respectively correspond to the ciphertext modulus Q and the special modulus P. We further denote by C_j = {q_{jα}, ..., q_{(j+1)α-1}} the base corresponding to the partial product Q_j. We use D_i = {∪_{j∈[i]} C_j} ∪ B and C̄_j = D_β \ C_j. Note that when α ∤ (ℓ + 1), the (β - 1)-th digit consists of only (ℓ + 1) mod α elements, because the remaining α - ((ℓ + 1) mod α) elements are all zeros and do not need to be stored.
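The digit-splitting step, which is nearly free in RNS form, can be sketched as follows (the list entries stand in for actual RNS components; the helper name is ours):

```python
def split_digits(rns_components, alpha):
    """Group the (l+1) RNS components of c_1 into beta = ceil((l+1)/alpha)
    digits of at most alpha components each. The modular reductions by the
    partial products Q_j are free: the residues are already available."""
    beta = -(-len(rns_components) // alpha)   # ceiling division
    return [rns_components[j * alpha:(j + 1) * alpha] for j in range(beta)]

# l + 1 = 7 components, alpha = 3  ->  beta = 3 digits; the last digit
# holds only (l + 1) mod alpha = 1 component.
comps = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6']
print(split_digits(comps, 3))
# [['x0', 'x1', 'x2'], ['x3', 'x4', 'x5'], ['x6']]
```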
Recall that the inner product is required during key-switching. Using the special modulus technique, an evaluation key for a level-ℓ ciphertext is defined over R_{PQ}^{β×2}. However, each of the small digits is represented over R_{Q_j}. To make these digits interact with the evaluation key, we need to extend the base of each digit from C_j to D_β through (approximate) fast base conversion [27] (see Appendix C for more details). Note that the input of the fast base conversion must be in coefficient representation, whereas the inner product requires its inputs to be in evaluation representation to perform efficient multiplications. Thus, we need several (inverse) NTT calls during key-switching.
Below, we describe the key-switching procedure following the implementation of PALISADE v1.9.2 [28]. We assume that both input and output ciphertexts are in evaluation representation. For a ciphertext c = (c_0, c_1) ∈ R²_Q whose RNS representation is (c_0^(j), c_1^(j)) for j ∈ [ℓ + 1], the key-switching procedure comprises the following subroutines, in which the computationally expensive steps consisting of (inverse) NTTs are underlined.
1. RNS-decomposition and modulus-raise: This step takes an element of R_Q and outputs an array of β elements of R_{PQ}. Both input and output are in evaluation representation. For each j ∈ [β], we extend the base of the j-th digit {c_1^(i)}_{i ∈ [α]} from C_j into D_β via fast base conversion. Since C_j ⊂ D_β, it suffices to extend from C_j to C̄_j, followed by NTTs to obtain the results in evaluation representation for the inner product in the next step. By concatenating the resulting RNS components and the original digits already available in evaluation representation, we obtain each digit ĉ_{1,j} as an element of R_{PQ}. Denoting these procedures by ModUp_{C_j→D_β}, we compute ĉ_{1,j} = ModUp_{C_j→D_β}([c_1]_{Q_j}). The details of ModUp_{C_j→D_β} are provided in Algorithm 7 in Appendix C.
2. Inner product: This step computes the inner product of the digits (ĉ_{1,0}, ..., ĉ_{1,β-1}) and the evaluation key swk = (swk_0, ..., swk_{β-1}), both in evaluation representation, yielding ĉt = Σ_{j ∈ [β]} ĉ_{1,j} · swk_j ∈ R²_{PQ}.

3. Modulus-down: For ĉt (in evaluation representation), this step performs division by P followed by rounding, which shrinks the base D_β into C and reduces the key-switching noise. These are performed by fast base conversion followed by subtraction and constant multiplication. More specifically, we first parse the RNS components of ĉt w.r.t. B and apply fast base conversion from B to C on them. Then, we subtract the resulting elements from ĉt w.r.t. C, followed by multiplication by P^{-1}. As ĉt is in evaluation representation, we need inverse NTTs before performing the fast base conversion. We denote this procedure by RNSModDown, whose details are available in Appendix C (Algorithm 8).

4. Addition and output: Finally, we add c_0 to the first component of the result and output the switched ciphertext.
Complexity: We measure the complexity by the number of single-precision modular multiplications (MMs) with respect to the small moduli in the RNS representation. Step 4 (addition and output) does not require any MMs.

Algorithm 1 Automorphism and Key-Switching (AKS)
We outline the key-switching procedure performed immediately after the automorphism with index t ∈ Z*_{2N} in Algorithm 1. Whenever referring to Algorithm 1 (AKS), we omit the evaluation key swk_{s(X^t)→s} when it is clear from the context. For the details of RNSDecompModUp and RNSModDown, see Algorithm 6 and Algorithm 8 in Appendix C, respectively.

E. HOMOMORPHIC TRACE-TYPE FUNCTION
A trace-type function can be expressed as recursive automorphisms-and-sums. We consider a recursion that proceeds M (≤ log N) times, as naturally derived from existing functionalities such as total-sums and trace. Both functionalities are often used with several choices of M, depending on the application. For instance, when M = log N, the algorithm exactly corresponds to LWE-to-RLWE conversion [15] and total-sums [9]. When M < log N, the algorithm exactly corresponds to the partial variants of LWE-to-RLWE (coefficient selection [14], [15]) and of total-sums. For simplicity, we do not distinguish total-sums from its partial variants.

Algorithm 2 Homomorphic Trace-Type Function Evaluation
The framework of the trace-type function evaluation is described in Algorithm 2, which takes as input a ciphertext c and a list of automorphism indices k. The concrete functionality is determined by specifying k. For instance, the trace is defined by the automorphism indices of [15], going down from the largest ring to smaller rings in a tower of rings. The total-sums procedure is defined by the list of automorphism indices k = {g^{2^i}}_{i ∈ [M]} for some generator g ∈ Z*_{2N}, homomorphically permuting the slots by a power-of-two amount followed by an addition at every iteration.
Despite the usefulness of the trace-type function, Algorithm 2 is inefficient because it requires O(log N) key-switching operations. In the next section, we provide a solution that reduces this complexity.
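The rotations-and-sums pattern underlying Algorithm 2 can be sketched on plaintext vectors as follows, using total-sums as the concrete instance; homomorphically, each `rotate` call would be an automorphism followed by key-switching:

```python
def rotate(a, r):
    """Plaintext stand-in for a homomorphic rotation by r slots."""
    return a[r:] + a[:r]

def total_sums(slots, rotate):
    """Rotations-and-sums: after log2(N) rotate-then-add steps, every slot
    holds the sum of all slots (sketch for power-of-two slot counts)."""
    c = list(slots)
    r = 1
    while r < len(c):
        c = [x + y for x, y in zip(c, rotate(c, r))]  # rotate by 1, 2, 4, ...
        r *= 2
    return c

print(total_sums([1, 2, 3, 4, 5, 6, 7, 8], rotate))
# [36, 36, 36, 36, 36, 36, 36, 36]
```

Each of the log₂ N = 3 steps costs one rotation (and, homomorphically, one key-switching), which is exactly the O(log N) bottleneck the paper targets.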

III. PROPOSED LOOP-UNROLLING METHOD
The trace-type function is inherently sequential, with each loop iteration consisting of O(log N) (inverse) NTT calls. To gain efficiency, we strive to reduce the total number of (inverse) NTT calls by reducing the number of iterations via loop-unrolling. As N typically ranges from 2^10 to 2^15 [29], unrolling all iterations is not efficient when M is large (see Section IV).
Instead, we successively unroll fewer iterations. Namely, let M (≤ log N) be the number of iterations required in Algorithm 2, and let h (≤ M) be the total number of iterations after successive unrollings. We unroll ⌊M/h⌋ consecutive (and non-overlapping) iterations (h - (M mod h)) times. If (M mod h) ≠ 0, we then unroll the remaining consecutive and non-overlapping ⌊M/h⌋ + 1 iterations (M mod h) times. For instance, when M = 14 and h = 3, we unroll the first four iterations, the next five iterations, and the last five iterations sequentially, yielding three unrolled iterations.
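The partition of the M iterations into h unrolled groups can be sketched as follows (the helper name is ours):

```python
def unroll_partition(M, h):
    """Partition M iterations into h unrolled groups: (h - M % h) groups of
    floor(M/h) iterations, followed by (M % h) groups of floor(M/h) + 1."""
    q, r = divmod(M, h)
    return [q] * (h - r) + [q + 1] * r

print(unroll_partition(14, 3))   # [4, 5, 5]  (the example in the text)
print(unroll_partition(15, 15))  # all ones: h = M recovers no unrolling
```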
By unrolling the iterations, consecutive automorphisms need to be applied to a ciphertext, and they can be composed into a single one. This is because ψ_{t_1}(ψ_{t_0}(c)) = ψ_{t_0·t_1 mod 2N}(c) for t_0, t_1 ∈ Z*_{2N}; in particular, AKS(t_1, AKS(t_0, c)) = AKS(t_0·t_1, c). Thus, the number of automorphisms and sums required per unrolled iteration is O(N^{1/h}), and the number of automorphisms required over all iterations becomes O(h·N^{1/h}). Due to the unrolling and composition, each unrolled loop requires additional automorphisms and the corresponding evaluation keys compared with Algorithm 2. As a result, this incurs higher storage complexity. Specifically, the increase of the complexity is substantial for smaller h and slight for larger h, as shown in Subsection III-B. The proposed algorithm is outlined in Algorithm 3 as a generalization of the existing trace-type function evaluation.
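The composition property ψ_{t_1} ∘ ψ_{t_0} = ψ_{t_0·t_1 mod 2N} can be checked numerically at the coefficient level (toy parameters; `psi` is our plaintext-level sketch of the automorphism):

```python
def psi(a, t, q):
    """a(X) -> a(X^t) mod (X^N + 1), coefficient-level sketch for odd t."""
    N = len(a)
    out = [0] * N
    for i, c in enumerate(a):
        d = (i * t) % (2 * N)
        out[d % N] = (out[d % N] + (c if d < N else -c)) % q
    return out

N, q = 8, 97
a = [3, 1, 4, 1, 5, 9, 2, 6]
t0, t1 = 3, 5
# Composing two automorphisms equals a single one with index t0*t1 mod 2N,
# so consecutive automorphisms in an unrolled loop collapse into one:
assert psi(psi(a, t0, q), t1, q) == psi(a, (t0 * t1) % (2 * N), q)
print((t0 * t1) % (2 * N))  # 15
```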

Algorithm 3 Concept of Proposed Homomorphic Evaluation of Unrolled Trace-Type Function Evaluation
Input:
• c: a ciphertext
• h: the number of iterations after unrollings
• K = {k_0, ..., k_{h-1}}: a set of arrays of automorphism indices, one per iteration
Output: ciphertext c
At first glance, each of the iterations in Algorithm 3 requires different automorphisms on a ciphertext followed by key-switching, each consisting of the several sequential steps described in Subsection II-D. However, Algorithm 3 can harness many optimizations to reduce the complexity of the key-switchings, as described below.

A. OPTIMIZATION TECHNIQUES
The proposed algorithm harnesses many well-known optimization techniques that can be applied independently to each unrolled iteration. We here consider one unrolled iteration consisting of n automorphisms and key-switching processes. In addition, we assume the automorphisms are applied to a ciphertext c = (c_0, c_1) ∈ R²_Q with indices t_0, t_1, ..., t_{n-1} ∈ Z*_{2N}.

1) HOISTING, PRE-AUTOMORPHISMS, AND LAZY ModDown
Hoisting and Pre-Automorphism: Given that the automorphisms per unrolled iteration are applied to the same ciphertext, we can leverage the hoisting technique [22], [30]. Specifically, the RNS-decomposition and modulus-raise described in Subsection II-D are performed once, and the resulting decomposed element in R^β_{PQ} is reused for the different automorphisms followed by key-switching. Although the initial idea in [22] is to apply the n automorphisms to the decomposed ring element, this incurs a non-negligible cost due to copying and performing the automorphisms on the decomposed ring element. Recently, Bossuat et al. [30] suggested avoiding automorphisms on the decomposed ring elements by applying their inverses to the corresponding evaluation keys during key generation. This method avoids the automorphisms on the array of large ring elements.
Lazy Modulus-Down: While hoisting avoids n invocations of RNS-decomposition and modulus-raise, the procedure still contains n invocations of modulus-down. We further adopt lazy modulus-down as in [30] (where it is called the second layer of hoisting). With hoisting and the automorphisms on the evaluation keys, the n distinct automorphisms are applied to c_0 ∈ R_Q and to the results of the inner products over R²_{PQ}. Then, with lazy modulus-down, the results of the inner products are summed before a single modulus-down is performed. Specifically, our goal is to compute Σ_{e ∈ [n]} ψ_{t_e}(⟨ĉ_1, swk_e⟩) over R²_{PQ}, where each evaluation key swk_e already incorporates the inverse automorphism; i.e., swk_e can be generated via SwitchKeyGen(s, s(X^{(t_e)^{-1}})) during key generation. As a result, the number of (inverse) NTTs during the modulus-down is reduced from O(n) in the naïve implementation to O(1).

2) PUTTING IT ALL TOGETHER
By incorporating the aforementioned techniques, the number of operations in the key-switchings before and after the inner product is reduced compared with the naïve approach to Algorithm 3. For convenience, we refer to RNS-decomposition, modulus-raise, and lazy modulus-down as hoisted operations. The hoisted operations are performed exactly h times. When h < M, the number of (inverse) NTT calls is reduced compared with Algorithm 2, as expected. However, the number of inner products is O(h·N^{1/h}), which increases as h decreases and becomes non-negligible for relatively large M. Hence, we need to balance the complexities of the hoisted operations and the inner products.
As seen in the next subsection, the growth of the complexity of the hoisted operations and that of the inner products are asymmetric in h, so the computational efficiency can be improved by choosing a proper h. This comes from the difference between the unit costs of the hoisted operations and the inner product (i.e., a single inner product is cheaper than the hoisted operations).

Algorithm 4 Homomorphic Evaluation of Unrolled Trace-Type Function Evaluation (HoistKS)
Algorithm 4 shows the procedure for each unrolled iteration. We hide the details of the procedures consisting of (inverse) NTTs and defer them to Appendix C; hence, all polynomials are in evaluation representation at the level of Algorithm 4. As an application of Algorithm 4, we also provide the unrolled version of the homomorphic trace-type function evaluation in Algorithm 5.

B. APPROXIMATE COST ANALYSIS
We derive an approximately optimal h for a given HE parameter in terms of modular-multiplication complexity, as commonly considered in the literature (e.g., [26], [30]). Note that Algorithm 4 is similar to the linear transformation provided in [30, Algorithm 6]; the difference is that the trace-type function does not contain any multiplicative factor.
More specifically, we consider the cost as the number of MMs over either Z_{q_j} or Z_{p_i}. Namely, we have the following cost for each unrolled iteration in Algorithm 4. Note that some of the procedures are exactly the same as the non-hoisted key-switching process already presented in Subsection II-D.
• 1) Hoisted operations: This step is identical to the one in the key-switching in Subsection II-D. The complexity is (β + 1)(ℓ + 1)N log N + αβ(ℓ + 2)N + (ℓ + 1)N MMs.
• 2) Inner-product-and-automorphism (Lines 5-9): This step iterates n := |k| times. Each iteration consists of automorphisms on three ring elements, one in R_Q and the other two (i.e., the result of the inner product) in R_{PQ}. The complexity of the inner product is exactly the same as in the key-switching described in Subsection II-D, which is 2β(ℓ + 1 + α)N MMs. Thus, the total complexity is given by 2n·β(ℓ + 1 + α)N MMs. Note that the number of automorphisms per iteration is 2(ℓ + k + 1) + (ℓ + 1), as opposed to 2(ℓ + 1) in the rotations-and-sums.
For convenience, we denote the total cost of all operations before (resp. after) inner-product-and-automorphism per unrolled iteration by C_H1 (resp. C_H2). These two costs, as well as the cost C_IP of a single inner product, are summarized in Table 1 to analyze the cost as a function of h. (Techniques used in [30] drop the costs of the hoisted operations by (ℓ + 1)N and (ℓ + 1)N log N MMs on C_H1 and C_H2, respectively. Although we did not adopt such optimizations, our loop-unrolling is still expected to work better than the rotations-and-sums; however, the speed-up factor would be slightly lower than what we show in the experimental results.)
We now define our cost function. The loop-unrolling controls the number of hoisted operations and that of inner-product-and-automorphism operations. Specifically, the hoisted operations are performed exactly h times. Each unrolled iteration requires either (2^{⌊M/h⌋} - 1) or (2^{⌊M/h⌋+1} - 1) inner-product-and-automorphism operations, repeated h - r and r times over the h iterations, respectively. Thus, we have
C(h) = h·(C_H1 + C_H2) + ((h - r)·(2^{⌊M/h⌋} - 1) + r·(2^{⌊M/h⌋+1} - 1))·C_IP,   (1)
where r = M mod h. Now, let g(h) = 2^{⌊M/h⌋}·(h + r) and C_H = C_H1 + C_H2 be an auxiliary function and the cost of all hoisted operations per unrolled iteration, respectively. Then, the cost function in Eqn. (1) can be re-expressed as follows:
C(h) = h·C_H + (g(h) - h)·C_IP.
From the cost function, we show that our loop-unrolling provides better complexity than the rotations-and-sums by using the following lemma:
Lemma 1: C_H > C_IP.
Proof sketch: From Table 1, we compare the N log N term of C_H1 with C_IP. That is, C_IP = 2β(ℓ + 1 + α)N = 2β(ℓ + 1)N + 2αβN ≈ 2(β + 1)(ℓ + 1)N, where the last approximation uses αβ ≈ ℓ + 1. As log N > 2 holds in practice, C_H1 ≥ (β + 1)(ℓ + 1)N log N > C_IP, and thus C_H > C_IP holds. This implies that our loop-unrolling is expected to run better regardless of the concrete choices of HE parameters, such as α, β, and the ciphertext level ℓ.
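Under the cost model above, g(h) and C(h) are easy to tabulate. The unit costs below are illustrative placeholders satisfying C_H > C_IP (the Lemma-1 condition), not measured values:

```python
def g(M, h):
    """g(h) = 2^(floor(M/h)) * (h + r) with r = M mod h; the number of
    inner-product-and-automorphism operations is g(h) - h."""
    q, r = divmod(M, h)
    return (2 ** q) * (h + r)

def cost(M, h, C_H, C_IP):
    """C(h) = h*C_H + (g(h) - h)*C_IP: h hoisted operations plus
    (g(h) - h) inner products."""
    return h * C_H + (g(M, h) - h) * C_IP

# Toy unit costs (illustrative only) with C_H > C_IP:
M, C_H, C_IP = 15, 100.0, 10.0
best_h = min(range(1, M + 1), key=lambda h: cost(M, h, C_H, C_IP))
print(best_h, cost(M, best_h, C_H, C_IP))  # -> 5 850.0
# h = M recovers the rotations-and-sums cost M*(C_H + C_IP) = 1650:
assert cost(M, M, C_H, C_IP) == M * (C_H + C_IP)
```

With these toy costs the minimizing h is well below M, mirroring the paper's observation that the cost-efficient h shrinks as the ratio C_H/C_IP grows.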
Below, we give a theoretical analysis to understand the effect of h. From Lemma 1, we derive a corollary (Corollary 1.1) and a lemma (Lemma 2) characterizing when decreasing h reduces the cost.
Given C_H and C_IP, we can find the maximal e satisfying the inequality in Lemma 2. The bound g(M - e) - g(M - e + 1) + 1 becomes higher as M and e increase. The HE parameter only affects the ratio C_H/C_IP, which tends to make the cost-efficient h smaller as the ratio grows. Nevertheless, there also exists an upper bound on h. Moreover, the speed-up factor over C(M) is given by
C(M)/C(h) = M·(C_H + C_IP) / (h·C_H + (g(h) - h)·C_IP),
where (g(h) - h)/h ≥ 1, with equality when h = M. Thus, for h < M, the speed-up factor is less than M/h but approaches it as the ratio C_H/C_IP grows. As seen above, the ratio impacts both the cost-efficient h and the speed-up factor. We now describe how the ratio changes with k (= α) and ℓ.
Effect of k (= α) and ℓ: The choice of k affects the complexity and the effect of the loop-unrollings, and its effect depends on ℓ, as they together determine β. Figure 1 shows the change of the ratio when varying ℓ and k. For each k, the graph generally shows a decrease/increase of the ratio as ℓ grows: the decrease is caused by the increase of β, whereas the increase occurs when β does not change as ℓ grows. These extreme cases are observable for k = 1 and k = 10, respectively. These behaviors are explained by considering C_H and C_IP as functions of ℓ and comparing C_H(ℓ)/C_IP(ℓ) with C_H(ℓ + 1)/C_IP(ℓ + 1); we defer the details to Appendix B. The loop-unrolling is expected to work more effectively when ℓ and k are large.
Figure 3 shows the cost breakdown with stacked bars, where h = 1, 2 are omitted for visualization purposes. We can see that the dominant terms depend on β for relatively large h, whereas the C_IP term becomes dominant for relatively small h. For larger (resp. smaller) β, C_H1 (resp. C_H2) tends to become dominant.
Assuming all parameters except h are fixed, the dominant cost shifts from the hoisted part to the inner-product part as h decreases, while the overall complexity is reduced. The cost for h = M is clearly dominated by the hoisted operations. Hence, with loop-unrolling, the linear decrease of the hoisted cost and the non-linear increase of the inner-product cost in h enable the overall speedup of the homomorphic trace-type function evaluation.
Storage Cost: Although the loop-unrolling decreases the cost compared with the case h = M, it also increases the number of required evaluation keys, thereby increasing the storage cost. The number of evaluation keys is exactly the number of inner products (i.e., g(h) - h). We show how this number differs depending on M and h for a fixed log N in Figure 4. The figure precisely shows the shape of g(h) - h, which is monotonically decreasing in h. Thus, the rotations-and-sums approach [9], [15] (h = M) can be considered the storage-efficient solution. Compared with Figure 2, where increasing h beyond some particular point typically increases the cost, we see that there exists a trade-off between storage and time.
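The evaluation-key count g(h) - h can be tabulated the same way, confirming the monotone decrease in h (the helper name is ours):

```python
def num_eval_keys(M, h):
    """Evaluation keys needed equals the number of inner products, g(h) - h,
    with g(h) = 2^(floor(M/h)) * (h + M mod h)."""
    q, r = divmod(M, h)
    return (2 ** q) * (h + r) - h

M = 15
keys = [num_eval_keys(M, h) for h in range(1, M + 1)]
# Monotonically decreasing in h: h = M (rotations-and-sums) is storage-optimal
# with exactly M keys, while full unrolling (h = 1) needs 2^M - 1 keys.
assert all(a >= b for a, b in zip(keys, keys[1:]))
print(keys[0], keys[-1])  # -> 32767 15
```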
Note on Parallelism: We note that the proposed method has more opportunities for parallelization per iteration than Algorithm 2. Specifically, we can parallelize line 5 of Algorithm 4 across the inner-product-and-automorphism operations.
As the analysis is approximate, in the sense that the costs of automorphisms and additions as well as implementation factors such as memory access and cache misses are ignored, the optimal h can differ for a given HE parameter and an actual implementation. In the next section, we provide experimental results.

IV. PERFORMANCE EVALUATION
A. EXPERIMENTAL SETUP
We compared the performance of the proposed method with that of two existing methods, namely, the recursive rotations-and-sums procedure [9], [15] and the baby-step giant-step (BS/GS) algorithm for evaluating all the necessary automorphism indices [22]. These methods are captured in our parameter setting as h = M and h = 2, respectively. Since the loop-unrolling in the proposed method enables parallelization of the entire procedure at the level of inner-product-and-automorphism (i.e., line 5 in Algorithm 4), we also compare all methods in both single- and multi-threaded modes.
We conducted all experiments with the PALISADE library (v1.9.2) [28], as it is the only library that provides both parallelization and the generic key-switching with the special modulus [26]. All experiments were run on a machine running Ubuntu 16.04 equipped with an Intel Xeon Gold 5122 CPU at 3.60 GHz and 32 GB of RAM. All code was written in C++ and compiled with g++ 7.2.0 using the '-O3' option. We used the CKKS scheme with a scaling factor of 2^40 in APPROXRESCALE mode, varying the multiplicative depth to see the scalability in the number of RNS moduli ℓ + 1. Note that the choice of the scaling parameter depends on the application; we aim to show the trend for variable choices of ℓ. All parameter sets satisfy the 128-bit security level according to the standardization white paper [29]. To reduce the automorphism costs, we modified the implementation of the automorphism on polynomials in evaluation representation so that it is realized via a simple permutation: instead of performing automorphisms on three (resp. two) ring elements in our method (resp. in the rotations-and-sums), we pre-compute a permutation vector once on the fly and then apply a simple permutation to each RNS component.
In the context of LWE-to-RLWE conversion [15], the authors reported performance on top of a modified version of SEAL in single-threaded mode only. Since SEAL does not support the HK20 key-switching method and the implementation is not publicly available, we cannot provide a fair comparison against it; however, its runtime is also expected to be reduced by our method.
We use multi-threading differently depending on the choice of h. For h = M, we apply it at the level of RNS moduli, except for the fast base conversion used in modulus-raise and modulus-down, which is parallelized at the level of the ring dimension, as provided in PALISADE. We used four threads for all multi-threaded experiments. For h < M, we used multi-threading in nearly the same way as for h = M, except that we applied it at the level of the inner-product-and-automorphism within an unrolled iteration.
Next, we confirm the actual speed-up in an actual computing environment with both single- and multi-threaded modes. All results are averaged over 10 trials. We show the runtime comparison for log N = 15 with M ∈ {2, ⌊log N/2⌋, log N}, L = 9, ℓ ∈ {1, 4, 9}, and k ∈ {1, 5, 10} to see the scalability of our method in the ring dimension, multiplicative depth, and number of RNS moduli for the special modulus.

B. RESULTS
Tables 2a-2d show the runtime comparison for M ∈ {2, 15}, where we omitted M = 7 (shown in a subsequent plot) for ease of readability. All tables show that our method works faster than h = M for all parameter settings. Tables 2a and 2b (resp. 2c and 2d) show the results for M = 15 (resp. M = 2) in single- and multi-threaded modes, respectively. They show that loop-unrolling runs faster than the rotations-and-sums. More specifically, it gives speed-ups by a factor of 1.32-1.48 and 1.47-2.12 in single- and multi-threaded modes, respectively. Below, we describe the results with single-threading followed by those with multi-threading in detail.
With single-threading for M = 15, it can be seen that the most runtime-efficient h is less than 15 − ⌊15/2⌋ = 8, following the theoretical analysis. The concrete value of the runtime-efficient h depends on the (k, ℓ) pair. More specifically, when (k, ℓ) changes such that the C_H-to-C_IP ratio goes up, h tends to become smaller. For instance, we can see that, for ℓ = 4, 9, the runtime-efficient h decreases as k grows. For M = 2, the runtime-efficient h is simply equal to one, as the loop-unrolling works effectively.
We can also observe that the magnitude of the runtime is affected by k, regardless of the concrete choice of h. For a fixed ℓ, increasing k generally decreases the overall runtime as long as β decreases. This can be observed in the tables when ℓ = 9 for all k, and for ℓ = 4 with k ranging from 1 to 5.

53070 VOLUME 9, 2021

TABLE 2. Runtime comparison (in milliseconds) for log N = 15, M ∈ {2, 15} with different (k, ℓ) pairs in both single- and multi-threaded modes. "N/A" represents the case where a process was killed by the operating system due to large memory consumption. For each table, h = 2 and h = M respectively correspond to [22] and the rotations-and-sums [9], [15].
When the increase of (ℓ, k) does not change β, the runtime simply increases. This is shown for ℓ = 4 from k = 5 to 10. An exception occurs when ℓ = 1, where k = 1, incurring β = 2, provides a faster runtime than larger k and thus reaches the minimum.
With multi-threading, loop-unrolling has more opportunities to parallelize the entire procedure. Regardless of the concrete value of M, multi-threading gives speed-ups over the single-threaded mode. For M = 15, as opposed to the single-threaded mode, the optimal h becomes lower, which can be observed at h = 5. This simply means that the parallelization of the inner-product-and-automorphism can take advantage of multi-threading for smaller h. For M = 2, the runtime-efficient h is equal to one, the same as in the single-threaded mode. Since the trend of the other performance numbers is the same as in the single-threaded mode, we do not go into further details.
As opposed to the theoretical analysis, we see that h = M can give lower runtime than h = M − 1 in the experiments. This comes from a difference between our algorithm (Algorithm 5) and the rotations-and-sums (Algorithm 2) that is not captured by the approximate cost analysis. For instance, the rotations-and-sums requires automorphisms on two elements in R_Q, resulting in 2(ℓ + 1) automorphisms on RNS components. On the other hand, our algorithm needs to perform automorphisms on two elements in R_PQ and one element in R_Q. Hence, it requires 2(ℓ + 1 + k) + (ℓ + 1) = 3ℓ + 2k + 3 automorphisms on RNS components. As a result, our algorithm performs ℓ + 2k + 1 extra automorphisms on RNS components per inner-product. This extra overhead accumulates as M becomes larger; in our experiments, we see this when M = 15. For M = 7, 15, we see that the optimal h tends to be higher than the theoretical estimate. One reason would be that the cost of the hoisted operations is somewhat overestimated. In the implementation, the approximate base conversion leverages lazy modular reduction (e.g., see [31]), a technique that defers the modular reduction until the summation in the inner-product is done. This yields a lower runtime for the hoisted operations than expected. Hence, the C_H-to-C_IP ratio becomes lower, yielding a higher runtime-efficient h.
Speed-up Factor vs. Memory Usage: We show the trade-off between runtime and storage in Figure 6 by changing k for fixed ℓ and M in the single-threaded mode, where we plot the speed-up factor and the memory blowup over h = 15 with solid and dotted lines, respectively. Moreover, we add a horizontal blue line at y = 1 to visualize the valid choices of h that outperform h = M in terms of runtime. In Figure 6a, such h can be chosen from {5, 6, . . . , 12}, where the maximal speed-up is attained at h = 8. Since the storage cost increases as h decreases, the valid h should be chosen from {8, 9, . . . , 12} in practice. Note that the linear decrease with a slope of one from h = 8 to 14 in the storage cost is due to the property of g(h) − h and Corollary 1.1. We can also observe that all storage blowups are higher than the speed-up factors. In Figures 6b and 6c, we see that some choices of h provide speed-up factors that outperform the storage blowup, showing a better trade-off than the above. This is because higher (k, ℓ) pairs are chosen and the storage blowup factor for a fixed h is independent of (k, ℓ), which enables the loop-unrolling to work more effectively. Therefore, we conclude that our loop-unrolling gives a time-space trade-off, and can even provide a speed-up factor larger than the storage blowup factor for some choices of HE parameters in practice.
Overall, apart from implementation factors, our experimental results follow the trend of the approximate cost function. Thus, it is worth applying loop-unrolling to the trace-type function evaluation. Although we did not observe a parameter set for which our method completely failed against h = M, there can be cases where the loop-unrolling cannot run due to large memory usage, which may not happen for the rotations-and-sums. This situation can occur with larger HE parameter settings having a higher ring dimension (e.g., log N ≥ 16) and a large multiplicative depth L, depending on storage availability.

C. DISCUSSION
As shown in the tables, the best h providing the minimum runtime depends on application-dependent parameters such as M, log N, k, and ℓ. Since an HE parameter set is pre-determined by the application, and the search space of h is bounded by M ≤ log N, which is around 15, it is easy to determine the best h (e.g., by examining all possible h ≤ M) depending on the objective. For instance, since the choice of h incurs a trade-off between time and storage, it is recommended to find the fastest h satisfying a given storage requirement.
Moreover, it is possible to consider k as a parameter chosen per level ℓ, rather than as a static parameter over all ℓ's; that is, we can set the number of special moduli k for each level. As can be seen from the tables, k = 1 provides the minimum runtime for ℓ = 1 over all h, whereas for ℓ = 9, k = 10 provides the minimum. Hence, it is reasonable to create evaluation keys for each level. However, such an optimization requires additional memory, and we leave its adoption as future work.

V. RELATED WORK
The trace-type function evaluation has been studied separately for total-sums and trace, and it can be seen as a special type of linear transformation. The total-sums algorithm was first presented in [9]. It works via a ''repeated-doubling'' approach, which recursively applies rotations-and-sums. Its design is generic and handles vectors of any length (including non-power-of-two lengths). In contrast, we focus on vectors of power-of-two length, as they are naturally derived from the power-of-two cyclotomic ring. Summing up all the slots is not always required; instead, partial summation is used (e.g., [11]) by performing rotations-and-sums fewer than log N times.
While the total-sums operation works over the plaintext slots, the trace function works over polynomial coefficients. The main effect of the trace function is the isolation of a coefficient of a polynomial. When only a specific coefficient should be extracted, the problem is the same as LWE-to-RLWE conversion [15], [32]. The initial solution [32] applies as many key-switching operations as the ring dimension to an LWE ciphertext. Although this approach provides high parallelism, it suffers from computational inefficiency and requires gigabytes or terabytes of storage due to the required O(N) evaluation keys. Recently, these limitations have been removed in an LWE-to-RLWE conversion based on recursive automorphisms-and-sums [15], resembling total-sums but using different indices for the automorphisms. As a result, both the storage and computational complexities have been reduced by an exponential factor. Our development further improves the efficiency of the homomorphic function evaluation expressed by rotations-and-sums (or, equivalently, automorphisms-and-sums) with a trade-off between storage and computational costs.
The trace-type function can also be considered as a linear transformation. The functionality of a linear transformation is achieved by homomorphically summing rotated ciphertexts multiplied by some constants. In trace-type function evaluation, the multiplicative factor is always one, and there can be other potential solutions from the literature on linear transformations. Halevi and Shoup [22] presented a set of optimization techniques to reduce the cost of homomorphic linear transformations. We leverage their techniques, such as hoisting and baby-step giant-step (BS/GS). Note that the naïve adaptation of the BS/GS algorithm suffers from inefficiency, specifically for large M. Our algorithm represents a recursive application of BS/GS with improved efficiency over previous work. In [22], the authors introduced three variants of key-switching strategies: minimal key-switching, BS/GS, and a full strategy. These strategies differ in the number of key-switchings needed to perform a single automorphism on a ciphertext. Our implementation follows the full strategy in the sense that a single homomorphic automorphism is realized by exactly one automorphism and key-switching, whereas the other strategies may require two or more key-switchings to realize a single homomorphic automorphism. As we are interested in improving computational efficiency, adapting the other key-switching strategies is beyond the scope of this paper. (Footnote 11: The manuscript of [22], updated since its original submission, is available on ePrint and indeed mentions the trace map. However, it does not provide a detailed analysis, such as the lowest complexity, or experimental results as we do. Moreover, further optimizations beyond hoisting, such as lazy mod-down and offline automorphism, are not considered there.)
Recently, Bossuat et al. [30] proposed a set of optimization techniques for hoisted automorphisms. Their techniques are generic, and we adopted them in our method. However, their problem setting for generic matrix-vector multiplication is slightly different from ours. Specifically, their cost model for the key-switching in a generic linear transformation is asymmetric in the BS and GS. The asymmetry is caused by the multiplication performed after applying an automorphism to a ciphertext, which requires the lazy modulus-down for a baby-step in each iteration of a giant-step. In our case, no multiplication is required. Hence, the lazy modulus-down inside the baby-step can be performed immediately, independently of the giant-step. Moreover, they do not provide timing results in multi-threaded mode, which is important for our loop-unrolling technique.

VI. CONCLUSION
In this paper, we developed a faster method to evaluate the homomorphic trace-type function by using loop-unrolling along with a set of optimizations to amortize costly key-switchings. We demonstrated via an approximate cost analysis that loop-unrolling improves efficiency, and empirically showed that our method achieves 1.32-2.12x faster runtime than the previous rotations-and-sums method. Our method can help construct homomorphic encryption-based privacy-preserving applications; adopting it in such applications is part of our future work. It would also be interesting to develop a way to automatically determine the best unrolling parameter h for a given computing environment.

APPENDIX B DETAILS OF THE BEHAVIOR OF THE C_H TO C_IP RATIO
A. THE RATIO ACCORDING TO THE INCREASE OF ℓ
For a fixed α (≥ 1), we denote β_ℓ = ⌈(ℓ + 1)/α⌉. We treat C_H = C_H1 + C_H2 and C_IP as functions of ℓ. Namely, let C_H(ℓ) be the function representing the cost of all hoisted operations per unrolled iteration, and C_IP(ℓ) = 2β_ℓ(ℓ + 1 + α)N the function representing the cost of a single inner-product. We aim to show how the ratio C_H(ℓ)/C_IP(ℓ) behaves when ℓ increases to ℓ + 1. For ℓ and ℓ + 1, we have either 1) β_{ℓ+1} = β_ℓ or 2) β_{ℓ+1} = β_ℓ + 1. Moreover, C_H(ℓ + 1) and C_IP(ℓ + 1) can be expressed as C_H(ℓ + 1) = C_H(ℓ) + a and C_IP(ℓ + 1) = C_IP(ℓ) + b for some a, b, respectively. Thus, we need to show the inequalities by specifying a and b depending on the conditions on β_ℓ and β_{ℓ+1}. Below, we scale everything by 1/N, since N appears as a common multiplicative factor. Case 1) β_ℓ = β_{ℓ+1}: Since C_IP(ℓ), C_H(ℓ) > 0, we need to show the corresponding inequality, and we show that it holds for the log N terms and for the other terms separately.
When β_ℓ ≥ ℓ, we only have two cases: α = 1 with any ℓ, or ℓ = 0 with any α. In either case, we have β_ℓ = ℓ + 1. Replacing β_ℓ with ℓ + 1 yields the negative terms −(ℓ² + 2ℓ + 1 + 4ℓ + 4) = −(ℓ² + 6ℓ + 5), whose magnitude is smaller than that of the positive terms. When β_ℓ < ℓ, the magnitude of the negative terms is clearly smaller than that of the positive terms.

APPENDIX C DETAILS OF SUBROUTINES
c^(i) ← NTT(c^(i))
9: end for
10: return (c^(0), . . . , c^(ℓ))

We now describe the complexity of the base conversion in terms of the number of MMs (modular multiplications). For each i ∈ [ℓ_1], the term y_i := [x_i · (Q/q_i)^{-1}]_{q_i} is computed once and is then reused to compute (Σ_i y_i · (Q/q_i) mod p_j) for each j ∈ [ℓ_2]. Thus, the total complexity is ℓ_1 + ℓ_1ℓ_2 = ℓ_1(1 + ℓ_2) MMs. The base conversion on an element of R_Q can be done by applying it coefficient-wise; thus, the complexity is Nℓ_1(1 + ℓ_2) MMs. Note that the polynomial must be in coefficient representation.

B. ModUp AND ModDown
We detail the subroutines ModUp_{C_j→D_β} and RNSModDown used for key-switching. These are based on the approximate base conversion algorithm. Since C_j is a subset of D_β, ModUp_{C_j→D_β} can be implemented with a base conversion from C_j to the moduli of D_β not in C_j, followed by aligning the RNS components. For an element x ∈ R_PQ, we assume its RNS representation is aligned as ([x]_{p_0}, [x]_{p_1}, . . . , [x]_{p_{k−1}}, [x]_{q_0}, [x]_{q_1}, . . . , [x]_{q_ℓ}), in that order.