Revisiting Higher-Order Masked Comparison for Lattice-Based Cryptography: Algorithms and Bit-Sliced Implementations

Masked comparison is one of the most expensive operations in side-channel secure implementations of lattice-based post-quantum cryptography, especially for higher masking orders. First, we introduce two new masked comparison algorithms, which improve the arithmetic comparison of D’Anvers et al. (2021) and the hybrid comparison method of Coron et al. (2021) respectively. We then look into implementation-specific optimizations, and show that small specific adaptations can have a significant impact on the overall performance. Finally, we implement various state-of-the-art comparison algorithms and benchmark them on the same platform (ARM-Cortex M4) to allow a fair comparison between them. We improve on the arithmetic comparison of D’Anvers et al. with a factor ≈ 20% by using Galois field multiplications and on the hybrid comparison of Coron et al. with a factor ≈ 25% by streamlining the design. Our implementation-specific improvements allow a speedup of a straightforward comparison implementation of ≈ 33%. We discuss the differences between the various algorithms and provide the implementations and a testing framework to ease future research.


INTRODUCTION
Current standards for public-key cryptography, such as RSA or ECC, are under threat of quantum computers. In response, the cryptographic community started work on replacement algorithms that are secure in the presence of large-scale quantum computers. Such quantum-resistant algorithms are known under the term post-quantum cryptography. In 2016, the National Institute of Standards and Technology (NIST) started a standardization process to find a new post-quantum encryption and digital signature standard [3]. At the moment we are in the final stage of this process, with 4 encryption finalists and 3 signature finalists. Out of these finalists, 3 encryption schemes (Kyber [4], Saber [5] and NTRU [6]) and 2 signature schemes (Dilithium [7] and Falcon [8]) are from the family of lattice-based cryptographic schemes. In this paper we will specifically focus on lattice-based schemes.
When deploying the future standard, one has to take into account the possibility of side-channel attacks. Side-channel attacks are attacks that use information leakage as a result of computation, such as timing, power consumption or electromagnetic radiation. These leakages give an adversary extra information that could be used to break the cryptographic primitive with smaller effort compared to breaking the underlying mathematics.
One popular method to protect against side-channel attacks is masking. Masking has been introduced by Chari et al. [20] and provides a framework to harden cryptographic implementations against side-channel leakage. The main idea of masking is to split sensitive values into S shares, so that an adversary that has access to at most t < S shares does not learn any sensitive information. The parameter t denotes the order of the masking, and is typically equal to S − 1. The terminology around masking has been extended by Barthe et al. [21], introducing Non-Interference (NI) and Strong Non-Interference (SNI) to allow easier composition of masked building blocks, typically called gadgets.
Masked implementations of encryption standardization candidates were presented for Saber by Van Beirendonck et al. [22] for first order, and later by Coron et al. [2] for higher masking orders. A masked Kyber implementation for generic masking orders was introduced by Bos et al. [23]. Fritzmann et al. [24] optimized a masked implementation of Saber and Kyber using instruction set extensions.
For the signature candidates, Dilithium was masked by Migliore et al. [25].
Looking at the cost of masking the various building blocks, one can see that the bottlenecks differ between masked and unmasked implementations. Unmasked implementations are typically dominated by the polynomial multiplication and the generation of the public matrix. For masked implementations, the most expensive building block is an equality check/comparison operation between the input ciphertext array and a re-encrypted ciphertext array. In this paper, we specifically look at different methods to securely implement this comparison. A complicating factor is that the input ciphertext is compressed for both Kyber and Saber, which will have an effect on which methods can be used in practice.
One observation that one can make is that there is a clear difference between first-order and higher-order masking, in that there are specific methods that can be used to speed up first-order masking that do not scale to higher orders. For the comparison, one can use the first-order method of Oder et al. [26]. Their idea is to implement a check to see whether a masked array is zero by hashing both shares separately and comparing only the hashed values in the end. A small change to their method, necessary for security, has been discussed in [27]. The compression can be performed efficiently using table-based A2B conversion [28], [29], specifically developed for first-order masking.
For higher orders, several techniques have been developed, which follow the same pattern: first, a preprocessing on the arithmetically masked array, second, a conversion from the arithmetic to the Boolean masking domain, third, a postprocessing, and finally a comparison on the final Boolean masked values. The difference between the various methods lies in the preprocessing and postprocessing steps.
Barthe et al. [30] solved the masked comparison challenge by switching from the arithmetic masking domain to a Boolean masked representation, and then performing the comparison using masked bitwise operations. Bache et al. [31] showed a method to compress the number of array coefficients that needs to be compared by taking a random sum. Bhasin et al. [27] showed a security problem in this method, and adapted the idea to get around the security problems. The drawback of this method is that it only works for specific cases, i.e., prime moduli without compression of the ciphertext. D'Anvers et al. [1] later showed how to implement this method for both prime and power-of-two moduli with compression.
A different approach was taken by Bos et al. [23], who instead of compressing the masked ciphertext, leave it uncompressed and perform two masked checks to see if it is within the required range, i.e., a high- and a low-end check. Removing the compression here comes at a cost of two (cheaper) checks per coefficient. Coron et al. [2] introduced several new ideas to more efficiently perform this range check.

Contributions
Our contributions are threefold: first we introduce an improved version of the comparison method of [1]. Instead of working with arithmetic multiplications modulo some big power-of-two, we propose to work in a Galois field, which saves us a conversion from the Boolean to the arithmetic masking domain and significantly reduces the cost of the comparison operation. We also develop a streamlined version of the Kyber-specific compression of Coron et al. [2]. Both our algorithms outperform the comparisons they are based on.
Second, we discuss specific implementation details such as bitslicing, and changing the Boolean representation after A2B conversion. We show that these implementation changes have a significant impact in reducing the cost of the algorithms.
In the third and final part of the paper we compare the state-of-the-art comparison methods. We implement several algorithms using the same underlying A2B conversion implementation and on the same target platform. We then perform the benchmarking on both Saber and Kyber. By doing this we aim to make a fair and practically useful comparison between the various comparison methods available. We will make our optimized implementations of these algorithms available at https://github.com/KULeuven-COSIC/Revisiting-Masked-Comparison.

Notation
We denote with ⌊·⌋ flooring a number to the nearest lower integer, and with ⌊·⌉ rounding, with ties rounded upwards. ⌊x⌉_{q→p} is a shorthand for switching an input x from modulus q to modulus p and rounding. These operations are extended to vectors, polynomials or vectors of polynomials coefficient-wise. As we will see in Section 2.2, these operations are also extended to masked variables by applying them sharewise. Let x ≫ b denote bitwise shifting x to the right by b positions, which is equal to ⌊x/2^b⌋. For an array or a polynomial x, denote with x[i] the i-th coefficient of x. Let x ← χ denote sampling x according to a distribution χ, and let x ←_r χ denote a pseudorandom sampling based on a seed r. Let U(S) denote the uniform distribution over a set S.

Masking
Masking is a technique to protect implementations of cryptographic algorithms against side-channel attacks. The main idea is to split sensitive values into S shares so that an adversary only learns sensitive information if he has access to at least t + 1 shares, where typically t + 1 = S. For a sensitive value x we will denote its masked version with x^{(·)}, where x is recovered by combining all shares. The notation x^{(i)} specifically denotes the i-th share of the masked x^{(·)}.
There are various methods to accomplish a sharing, and we will specifically utilize two: Boolean masking and arithmetic masking. In Boolean masking, a sensitive value is masked by XOR'ing it with uniformly random strings such that x = ⊕_{i=0}^{S−1} x_B^{(i)}. For arithmetic masking, one first chooses a masking modulus q, after which masking is performed by subtracting uniformly random strings such that x = Σ_{i=0}^{S−1} x_A^{(i)} mod q. Arithmetic masking is typically used when performing arithmetic operations on the shares, as linear operations (addition, multiplication with a constant) are efficient under this masking. Boolean masking is typically used when performing Boolean operations on data. For more information on masking we refer to [21], [32].
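As an illustration, the splitting and recombination of both masking types can be sketched in a few lines of Python (a minimal sketch; the share count, modulus and register width below are example values, not tied to any specific implementation):

```python
import secrets

S = 3       # number of shares (example value)
Q = 3329    # arithmetic masking modulus (Kyber's q, as an example)
W = 16      # register width for Boolean masking

def mask_boolean(x, n=S):
    """Split x into n Boolean shares with x = s[0] ^ s[1] ^ ... ^ s[n-1]."""
    shares = [secrets.randbelow(1 << W) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]

def unmask_boolean(shares):
    out = 0
    for s in shares:
        out ^= s
    return out

def mask_arithmetic(x, q=Q, n=S):
    """Split x into n arithmetic shares with x = sum(shares) mod q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    return shares + [(x - sum(shares)) % q]

def unmask_arithmetic(shares, q=Q):
    return sum(shares) % q
```

Note how linear arithmetic operations stay cheap: a public constant is added to a single arithmetic share, and multiplication by a constant is applied to every share independently.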

Lattice-Based Encryption
In this paper we will specifically look at the comparison operation that happens at the end of the decapsulation if compiled using the Fujisaki-Okamoto transformation. To give some context we introduce lattice-based encryption in this section, and will explain the Fujisaki-Okamoto (FO) transformation in the next section. We focus on a general algorithm that can be used to describe both Saber and Kyber.
Algorithms 1, 2, and 3 depict a lattice-based encryption and decryption procedure (decryption computes v ← ⌊v_c⌉_{T→q} and m ← ⌊v − s^T · u⌉_{q→2}, and returns m). It works on vectors of ring elements R_q^k, with R_q = Z_q[X]/(X^n + 1). In both Saber and Kyber, n = 256 and k has a value between 2 and 4 depending on the security level. The main difference between the two is that Saber works with a power-of-two modulus q = 2^13 while Kyber works with a prime q = 3329. Both algorithms compress the ciphertext from modulus q to lower moduli p and T for transmission of the ciphertext (and the public key in the case of Saber). The values of p and T differ between the various versions of Kyber and Saber. Both are chosen to be powers of two, with p = 2^10 or 2^11, while T has a smaller value, typically around T = 2^4. The modulus q_2 is the public key compression modulus, which equals 2^10 for Saber, but q_2 = q for Kyber as it has no public key compression. The distribution χ(R_q^k) returns vectors with small coefficients that are drawn from a binomial distribution. For more information we refer to the original publications of Kyber [33] and Saber [34].
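The modulus switching ⌊·⌉_{q→p} and its approximate inverse can be made concrete with Kyber's parameters (a sketch; the function names are ours):

```python
Q = 3329   # Kyber's ciphertext modulus q
P = 1024   # compression modulus p = 2**10

def compress(x, q=Q, p=P):
    """Modulus switching with rounding: round(p*x/q) mod p, ties rounded up."""
    return ((2 * p * x + q) // (2 * q)) % p

def decompress(y, q=Q, p=P):
    """Approximate inverse: round(q*y/p), ties rounded up."""
    return (2 * q * y + p) // (2 * p)
```

Compression is lossy (several values of Z_q map to the same compressed value), but decompressing and re-compressing always recovers the compressed value, since the rounding error stays below half a step.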

Fujisaki-Okamoto Transformation
The encryption scheme described in Section 2.3 only provides security from passive adversaries (IND-CPA). To achieve active security (IND-CCA) one can use a generic transformation such as a post-quantum version of the Fujisaki-Okamoto transformation [35], [36]. The main idea of such a transformation is to make the encryption deterministic based on a random seed, which is then transmitted as the message. During decapsulation, the ciphertext is decrypted into the random seed, which allows the ciphertext to be recomputed. The re-encrypted ciphertext is then compared with the input ciphertext and the procedure is aborted if both ciphertexts are not the same.
Algorithms 4, 5, and 6 give a more detailed look into the Fujisaki-Okamoto transformation, where the functions G and H are cryptographic hash functions and where KDF is a key derivation function. We will denote variables computed during re-encryption with an accent, to clearly distinguish them from the input ciphertext. In this paper we will specifically look at the comparison in line 5 of Algorithm 6. The input ciphertext is a publicly known value, and thus not sensitive to leakage. The re-encrypted ciphertext is sensitive and should be masked. As an example, an attacker that could see (part of) the re-encrypted ciphertext could mount a chosen-ciphertext side-channel attack comparable to the attack described in [10], where side-channel information of the re-encrypted ciphertext can be used to determine if a ciphertext failed to decrypt.
This re-encrypted ciphertext initially has coefficients modulo q, but is compressed in lines 5 and 6 of Algorithm 2 before the comparison. The comparison operation we investigate in this paper includes the compression as an integral part of the algorithm. The re-encrypted ciphertext (before compression) is typically arithmetically masked. We will also ignore the ring structure of the ciphertext, and consider a polynomial in R = Z_q[X]/(X^n + 1) as a vector in Z_q^n and a vector of polynomials in R^k as a vector in Z_q^{kn}. This is reasonable as we don't use any property of the ring in the comparison operation.

COMPARISON METHODS
On a high level, a comparison algorithm can be constructed by subtracting the input ciphertext from the reencrypted ciphertext and performing a bitwise OR on all bits representing the result of the subtraction. However, in practice, there are some obstacles that need to be overcome to do this.
First, the re-encrypted ciphertext is typically arithmetically masked, which works well for the subtraction of both ciphertexts, but is ill-suited for the subsequent bitwise OR operation. Therefore, one typically wants to perform an arithmetic to Boolean (A2B) conversion on the data between the subtraction and the bitwise OR.
Second, the input and re-encrypted ciphertext are not in the same domain, as the input ciphertext is compressed. Moreover, the A2B conversion is not straightforward when working with prime moduli q.
We will first discuss these issues, and then give an overview of three state-of-the-art comparison techniques.

A2B and Compression
In this section we will discuss the subtraction of both ciphertexts and subsequent A2B conversion. We will first tackle the case of Saber, i.e., power-of-two q, and then talk about Kyber, i.e., prime q. While the power-of-two technique is relatively straightforward, the necessary adaptations to make this technique work for prime moduli were introduced by Fritzmann et al. [24].
Looking at the first ciphertext component u_c, we want to compute Δu^{(·)} = A2B(u'_c^{(·)} − u_c) from the input ciphertext u_c and the re-encrypted uncompressed ciphertext u'^{(·)}, where u'_c^{(·)} denotes the compressed re-encrypted ciphertext ⌊u'^{(·)}⌉_{q→p}.
First we look at the case of power-of-two q, p. To efficiently compute Δu^{(·)} we want to compute the arithmetic operations in the arithmetic domain, while computing the flooring operation in the Boolean domain. To this end we rewrite the equation as:

Δu^{(·)} = A2B( u'^{(·)} + q/(2p) − (q/p) · u_c ) ≫ log_2(q/p),

where the constant q/(2p) − (q/p) · u_c is added to a single arithmetic share and the shift is performed sharewise in the Boolean domain. For prime q, this rewriting is not straightforward for two reasons: first, log_2(p/q) is not an integer, which would mean we have to shift with a fractional number, which makes no sense, and second, the term in the A2B conversion has an infinite fractional representation.

Fritzmann et al. [24] noticed that only a limited precision is needed in the fractional representation. Given the number of bits t needed for the required precision, they rewrite the expression above as:

Δu^{(·)} = ( A2B( Σ_{k=0}^{S−1} ⌊(2^t · p/q) · u'^{(k)}⌋ + 2^{t−1} − 2^t · u_c ) ≫ t ) mod p,

where each sharewise flooring introduces a rounding error in [0, 1), and where the result is exact if t is large enough to avoid any error due to the flooring operation, as proven in [24]. Note that the flooring operation, the multiplications and the shift operation are performed independently on each share. In practice we need t to be an integer bigger than log_2(S) − log_2(⌈q/2⌉/q − 0.5), which is 13 for Kyber if S = 3.
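The fixed-precision trick can be checked numerically. The sketch below uses an illustrative t = 16, chosen comfortably above the bound rather than at the minimum, and sums the shares in the clear purely to verify the arithmetic identity; a masked implementation would instead sum them in the Boolean domain after A2B:

```python
Q, P = 3329, 1024  # Kyber's prime q and the compression modulus p
S = 3              # number of arithmetic shares
T = 16             # precision bits; chosen so that S / 2**T < 1 / (2*Q)

def delta_reference(u_prime, u_c):
    """Unmasked reference: (round(P*u'/Q) - u_c) mod P."""
    return (((2 * P * u_prime + Q) // (2 * Q)) - u_c) % P

def delta_from_shares(shares, u_c):
    """Sharewise fixed-precision version: floor each share times P*2^T/Q,
    sum, add the rounding constant 2^(T-1), subtract u_c*2^T, shift by T.
    The q-overflow term (a multiple of P*2^T) vanishes after the mod P."""
    y = sum((x * P << T) // Q for x in shares)  # sharewise floors
    return ((y + (1 << (T - 1)) - (u_c << T)) >> T) % P
```

With t this large, the total rounding error of the S sharewise floors stays below the smallest fractional part that can occur, so both functions agree on every input.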

Algorithm 6. KEM.DECAPS
Input: Ciphertext of KEM c; Input: Secret key of KEM sk
1: Extract (sk′ ∥ pk ∥ H(pk) ∥ z) from sk; …

Similar derivations can be performed to calculate Δv^{(·)} = A2B(v'_c^{(·)} − v_c), where one only needs to replace u with v and the modulus p with T. To simplify the algorithms presented in the rest of the paper, we will define a function precalc_{q→p}(u'^{(·)}, u_c) that calculates Δu^{(·)} = A2B(u'_c^{(·)} − u_c) from u'^{(·)} and u_c as described above. Similarly, we define precalc_{q→T}(v'^{(·)}, v_c) as the function that calculates Δv^{(·)} = A2B(v'_c^{(·)} − v_c) from v'^{(·)} and v_c.

Simple Method
The simplest method to perform the comparison would be to perform the preprocessing as described above. This results in a Boolean masked array of coefficients, which should be checked for equality to zero. One can then do the zero check by performing a bitwise masked OR operation, which can easily be obtained from a masked AND operation [37] combined with masked NOT operations, the latter only requiring a bitwise negation of one share. This description can be seen as a variant of the comparison method used by Barthe et al. [30] to mask the GLP signature scheme. The resulting algorithm is given in Algorithm 7.
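These masked bitwise building blocks can be sketched as follows (a minimal sketch with an ISW-style masked AND in the spirit of [37]; the helper names and 16-bit width are illustrative, not taken from any reference implementation):

```python
import secrets

W = 16  # register width (example)

def mask(x, n=3):
    """Boolean masking: x equals the XOR of the shares."""
    shares = [secrets.randbelow(1 << W) for _ in range(n - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]

def unmask(shares):
    out = 0
    for s in shares:
        out ^= s
    return out

def sec_and(a, b):
    """ISW-style masked AND on Boolean share vectors a and b."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(1 << W)
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c

def sec_not(a):
    """Masked NOT: bitwise negation of a single share."""
    return [a[0] ^ ((1 << W) - 1)] + a[1:]

def sec_or(a, b):
    """Masked OR via De Morgan: a | b = ~(~a & ~b)."""
    return sec_not(sec_and(sec_not(a), sec_not(b)))
```

The zero check of Algorithm 7 then amounts to OR-accumulating all Boolean masked coefficients and testing whether the unmasked accumulator is zero.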

Arithmetic Comparison
The masked OR operation in the simple approach needs to be calculated on kn coefficients of log_2(p) bits and n coefficients of log_2(T) bits. To reduce the number of masked OR operations, D'Anvers et al. [1] propose a technique to reduce the (k + 1)n coefficients that need to be checked into one (bigger) coefficient by summing them together. This technique is inspired by the random sum method of Bache et al. [31]. However, to avoid chosen-ciphertext attacks where adaptations in one coefficient are offset with an inverse adaptation in another coefficient, all coefficients are first multiplied with a random number before summation. As this random number is the same for all shares of a coefficient, and due to distributivity (i.e., r · Σ_k x^{(k)} = Σ_k r · x^{(k)}), it can be proven that the resulting sum equals zero if all masked coefficients are zero.
One drawback of this method is that there is a small collision probability with which an incorrect input ciphertext is wrongly accepted. This collision probability equals 2^{−s}, with s a security parameter related to the bit-size of R, and cannot be influenced by an adversary. As such, it is not possible to increase the probability of obtaining a failure using for example failure boosting [38]. In many adversarial models, the adversary is limited in the number of queries Q he can perform, and the parameter s should be chosen such that an adversary cannot reasonably find collisions, or: 2^s ≫ Q.
For a more detailed description of the algorithm and more in-depth security analysis we refer the interested reader to the original publication [1].
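The random-sum reduction can be sketched as follows (an illustrative sketch: the modulus and helper names are ours, and the actual algorithm of [1] works modulo p · 2^{s−1} with randomness hidden from the adversary):

```python
import secrets

M = 1 << 16  # working modulus for the sum (illustrative)
S = 3        # number of shares

def mask_arith(x, q=M, n=S):
    """Arithmetic masking: x = sum(shares) mod q."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    return shares + [(x - sum(shares)) % q]

def unmask_arith(shares, q=M):
    return sum(shares) % q

def random_sum_reduce(shared_coeffs, randomness, q=M):
    """Collapse a masked array into one masked accumulator.
    Each coefficient is multiplied by a public random r_i that is identical
    for all its shares; by distributivity the accumulator's shares sum to
    sum_i r_i * x_i mod q, which is zero whenever all x_i are zero."""
    acc = [0] * len(shared_coeffs[0])
    for r, shares in zip(randomness, shared_coeffs):
        for k, s in enumerate(shares):
            acc[k] = (acc[k] + r * s) % q
    return acc
```

A nonzero coefficient survives the reduction except when the random multipliers happen to cancel it, which is the 2^{−s} collision probability discussed above.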

Hybrid Comparison
Coron et al. [2] introduce a hybrid method to perform the comparison. They first build several subfunctions and combine them into one comparison algorithm aimed at prime moduli q, as used in Kyber. These subfunctions include two new tests to check the zeroness of a polynomial and 'decompress-and-multiply', a method to process a masked ciphertext without performing compression, by instead decompressing the nonsensitive input ciphertext u_c. In this section we give a high-level overview of their comparison algorithm, which is given in Algorithm 9. For more details we refer to [2], [24] and [27].
The idea of the hybrid method is that the first and second ciphertext parts are processed using different approaches. The reason is that the first part of the ciphertext u only undergoes a small compression, while the second part v typically undergoes stronger compression. The decompress-and-multiply technique is only efficient for small compression, and is therefore only used for u, while v is processed in a more traditional approach. We will first look into the processing of the first part of the ciphertext u, then discuss the second part v and finally the postprocessing to combine both parts.

Algorithm 9. HYBRID METHOD
Input: Input ciphertext u_c, v_c; Input: Re-encrypted ciphertext u'^{(·)}, v'^{(·)}
// Adapted procedure for u

To process the first part of the ciphertext, instead of compressing the masked coefficients of u'^{(·)}, the public ciphertext u_c is decompressed. For each coefficient u_c[i], this results in multiple possible decompressed values u[i][j]. For each of these possible decompressed values we subtract u[i][j] from the masked recomputed ciphertext u'^{(·)}[i]. The result of this subtraction should equal zero for one j (the one corresponding to the original decompressed value of u_c[i]). We then perform a masked multiply on all these values, Δu^{(·)}[i] = Π_j (u[i][j] − u'^{(·)}[i]), which results in a masked zero if and only if the decompressed ciphertext equals the recomputed ciphertext. These steps are given in lines 1 to 6 of Algorithm 9.
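For a single coefficient, the decompress-and-multiply idea can be sketched as follows (function names are ours; the naive O(q) candidate search is only for clarity, and a real implementation enumerates the few consecutive candidates directly and multiplies masked values):

```python
Q, P = 3329, 1024  # Kyber's prime modulus q and compression modulus p

def compress(x):
    """Ciphertext compression: round(P*x/Q) mod P."""
    return ((2 * P * x + Q) // (2 * Q)) % P

def decompress_candidates(u_c):
    """All values in Z_q that compress to u_c (naive search for clarity)."""
    return [x for x in range(Q) if compress(x) == u_c]

def product_check(u_c, u_prime):
    """prod_j (u[j] - u') mod Q over all candidates u[j] of u_c.
    Zero iff compress(u_prime) == u_c: since Q is prime, a product of
    nonzero factors cannot vanish in the field Z_q."""
    acc = 1
    for cand in decompress_candidates(u_c):
        acc = (acc * (cand - u_prime)) % Q
    return acc
```

Since q/p is small for u, each coefficient has only a handful of candidates, which is why this approach is only efficient under small compression.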
Meanwhile, the second part of the ciphertext undergoes the simple comparison procedure from Section 3.2 in lines 7–8 of Algorithm 9. This results in a Boolean masked bit representing the result of the comparison of v_c and v'^{(·)}. This bit is then converted to arithmetic masking modulo q and added to the processed first ciphertext part.
The result of the above algorithm is a vector in Z_q^{nk+1} that needs to be equal to zero. This vector fulfills the condition to use the ReduceComparison technique of Bhasin et al. [27], which reduces the number of coefficients that need to be checked for zeroness. This reduction is performed in lines 11–17. The algorithm is then finished by performing a zero check on the resulting polynomial.
As is the case in the ReduceComparison technique, this algorithm also has a probability of accepting invalid ciphertexts. This probability is upper bounded by q^{−l_2}, with q the modulus and l_2 the number of coefficients after compression. As before, the adversary cannot increase this collision probability as it is entirely dependent on internal random values.

NEW COMPARISON ALGORITHMS
In the previous section we detailed three state-of-the-art comparison algorithms. In this section we first improve on the arithmetic comparison technique, and then present a simplified version of the hybrid comparison technique. We will show in Section 6 that both techniques outperform their original algorithms.

Galois Field Compression
We first describe an improved version of the arithmetic comparison method described in Section 3.3. The main difference between both algorithms is that the multiplication is changed from an arithmetic multiplication modulo p · 2^{s−1} to a multiplication in a Galois field with characteristic 2. The main advantage of this approach is that addition in a Galois field of characteristic 2 is an XOR of the inputs, which works well on a Boolean representation. Therefore, multiplication and addition can be natively performed on Boolean masked shares, eliminating the need for the expensive B2A_{p·2^{s−1}} conversion.
More precisely, for the multiplication operation we represent the inputs as polynomials with binary coefficients in Z_2[X] and perform a polynomial multiplication, after which a reduction modulo an irreducible polynomial f is executed. We denote this multiplication operation with the symbol ⊗. It is possible to avoid the reduction step of this multiplication to reduce the complexity of the algorithm. The downside is an increase in the number of coefficients that need to be processed in the OR operation. In Section 6 we will see that the OR operation cost is negligible compared to the rest of the algorithm, and as such we implement the Galois field multiplication without the reduction, that is, as a multiplication between binary polynomials.
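The unreduced multiplication of binary polynomials amounts to a carry-less multiply, which can be sketched as (a sketch to pin down the arithmetic; a bit-sliced Cortex-M4 implementation would realize this with shift-and-XOR operations on share registers):

```python
def clmul(a, b):
    """Carry-less multiplication: multiply bit-polynomials a and b in
    GF(2)[X] without reduction modulo an irreducible polynomial."""
    acc = 0
    while b:
        if b & 1:      # coefficient of X^i in b is set
            acc ^= a   # add (XOR) the shifted copy of a
        a <<= 1
        b >>= 1
    return acc
```

The property that makes this compatible with Boolean masking is XOR-linearity: clmul(r, x ^ y) == clmul(r, x) ^ clmul(r, y), so a public multiplier can be applied to each Boolean share independently.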
The Galois field comparison method can be found in Algorithm 10. It returns 1 upon input of a valid ciphertext (u_c, v_c) = (⌊u'^{(·)}⌉_{q→p}, ⌊v'^{(·)}⌉_{q→T}), and 0 with probability at least 1 − 2^{−s} if this condition is not fulfilled.
Proof. This proof largely follows the analogous proof of [1], with the difference that some sums are replaced with XOR operations, and some arithmetic multiplications with GF multiplications. We first derive the value of E^{(·)} at the end of the algorithm,
which by definition of precalc_{q→p} and precalc_{q→T} can be expressed in the compressed differences. We will further denote the terms (⌊Σ_{k=0}^{S−1} u'^{(k)}⌉_{q→p} − u_c)[i] and (⌊Σ_{k=0}^{S−1} v'^{(k)}⌉_{q→T} − v_c)[i] with β_i and γ_i respectively, for conciseness. This gives the following simplified expression for E^{(·)}:

E^{(·)} = ⊕_{i=0}^{kn−1} (r_i ⊗ β_i) ⊕ ⊕_{i=0}^{n−1} (r_{kn+i} ⊗ γ_i).

Correctness. If the input ciphertext (u_c, v_c) matches the recomputed compressed ciphertext (⌊u'^{(·)}⌉_{q→p}, ⌊v'^{(·)}⌉_{q→T}), then all β_i and γ_i are zero and thus E^{(·)} is zero. This proves the first statement.
Security. If the input ciphertext does not match the recomputed ciphertext, there is at least one β_i or γ_i that does not equal zero. Without loss of generality, we will assume that β_0 is a nonzero coefficient. We can then separate this coefficient from the equation:

E^{(·)} = (r_0 ⊗ β_0) ⊕ ( ⊕_{i=1}^{kn−1} (r_i ⊗ β_i) ⊕ ⊕_{i=0}^{n−1} (r_{kn+i} ⊗ γ_i) ),

and simplify this equation into E^{(·)} = (r_0 ⊗ X) ⊕ Y by taking X = β_0 and Y equal to the remaining terms. The adversary is tasked with finding values X and Y so that E^{(·)} = (r_0 ⊗ X) ⊕ Y = 0. A necessary condition for this is that (r_0 ⊗ X) ⊕ Y mod f = 0, with f an irreducible polynomial of degree 2s, which means that the condition can be rewritten as r_0 ⊗ X = Y mod f. As r_0 is independent of the terms X and Y and is unknown to the adversary, the probability of finding a ciphertext such that this condition is fulfilled is limited to the guessing entropy of r_0, which equals 2^{−s}. This proof can be easily generalized if another value of β_i or γ_i is nonzero.

Proof. Due to the similarities with the arithmetic comparison method, we can rely on the proof from [1]. To support this claim we will highlight the differences between the Galois field method and the arithmetic comparison method, and then show that they do not change the security proof. The t-SNI security proof of the arithmetic comparison [1] divides the algorithm into 4 types of gadgets. Gadgets G_0 and G_1 correspond to the preprocessing in the arithmetic comparison and are exactly the same in the Galois field method. Gadget G_4 corresponds to the final equality test in the arithmetic comparison method, which is the same as the OR operation on line 9. The only difference between both algorithms thus lies in gadgets G_2 and G_3. Gadget G_2 is no longer needed in the Galois field method, as we no longer need to perform a B2A conversion.
This leaves us gadget G_3, which differs between both approaches. In the arithmetic comparison, gadget G_3 computes an arithmetic random sum, while in the Galois field method, the gadget computes a random sum on binary polynomials. However, both approaches perform the computations on each share separately. This property is what is used in the original proof [1], and as it also applies to the Galois field method, the original proof still holds for the Galois field method. □

Streamlined Hybrid
In this section we introduce an improved version of the hybrid compression technique from [2]. One disadvantage of the hybrid method is that it is complex in comparison to the other comparison methods, due to the various subfunctions used. The aim of the streamlined hybrid method is to simplify the implementation of the hybrid method, while also improving its efficiency.
One of the main speedups of the hybrid comparison method is due to the reduction of the number of coefficients that need to be converted from arithmetic to Boolean masking in the A2B step. This is achieved by using the decompress-and-multiply technique from [2] and then performing the comparison reduction from [27]. These steps are only efficient for the first ciphertext part u. In the streamlined hybrid method we still use these techniques as they provide a significant speedup.
After these operations we revert to the standard simple procedure from Algorithm 7. As we will show in Section 5, the A2B and OR operations can be sped up significantly using implementation tricks. This means that while these operations theoretically don't scale as well as some alternatives in [2], they do outperform those functions in practical implementations. Due to their simplicity and efficiency we choose the postprocessing of the simple method over the postprocessing of the hybrid method. Specifically, we convert the remaining coefficients, from both the compressed ciphertext and the second ciphertext part v, to the Boolean domain and perform the OR operation on the Boolean masked coefficients. Algorithm 11 gives a high-level overview of our streamlined hybrid method.
Proof. The ciphertext consists of two parts. The second part v is treated in the same way as in the simple method, and thus shares the same characteristics: if v_c = ⌊v'^{(·)}⌉_{q→T} then Δv^{(·)} = 0, and if the ciphertexts do not match then Δv^{(·)} ≠ 0.
As such we will focus the proof on the value of E_B^{(·)}. We will first consider a valid first ciphertext part u_c = ⌊u'^{(·)}⌉_{q→p}, and then an invalid first ciphertext part where u_c ≠ ⌊u'^{(·)}⌉_{q→p}.
Correctness. If u_c = ⌊u'^{(·)}⌉_{q→p}, then by definition of Decompress, for each coefficient i one of the decompressed values u[i][j] equals u'^{(·)}[i], which means u[i][j] − u'^{(·)}[i] = 0 for this u[i][j]. This also implies that one term of the multiplication is zero for each coefficient and thus Δu^{(·)} is the zero vector. If Δu^{(·)} is a zero vector, then E^{(·)} is a sum of terms that are all zero, and thus E^{(·)} equals zero.
This leaves the A2B conversion of line 12 of Algorithm 11. While E^{(·)} is the zero vector, the individual shares are not necessarily equal to zero. However, similar to the derivation in [24], with ℓ the bit-size of the converted result, we can write:

E_B^{(·)} = ( Σ_{k=0}^{S−1} ⌊(2^{t+ℓ}/q) · E^{(k)}⌋ + 2^t − 1 ) ≫ t = ( (2^{t+ℓ}/q) · Σ_{k=0}^{S−1} E^{(k)} − Σ_{k=0}^{S−1} e_k + 2^t − 1 ) ≫ t mod 2^ℓ,

where e_k is a rounding error in [0, 1). Now we can use the fact that E^{(·)} equals zero, so that Σ_{k=0}^{S−1} E^{(k)} = m · q for some integer 0 ≤ m < S, to simplify this expression to:

E_B^{(·)} = ( m · 2^{t+ℓ} − Σ_{k=0}^{S−1} e_k + 2^t − 1 ) ≫ t mod 2^ℓ.

At the upper bound, where all e_k = 0, we have (m · 2^{t+ℓ} + 2^t − 1) ≫ t mod 2^ℓ = m · 2^ℓ mod 2^ℓ = 0, while at the lower bound, with e_k → 1, this gives (m · 2^{t+ℓ} + 2^t − 1 − S) ≫ t mod 2^ℓ, which also results in zero as long as 2^t − 1 > S, or t > log_2(S + 1). We proved that if u_c = ⌊u'^{(·)}⌉_{q→p}, then ⊕_{k=0}^{S−1} E_B^{(k)} equals zero, and we know from the proofs of the simple method that if v_c = ⌊v'^{(·)}⌉_{q→T} then Δv^{(·)} = 0. As the result is computed as an OR of these values, we proved that a valid ciphertext (u_c, v_c) = (⌊u'^{(·)}⌉_{q→p}, ⌊v'^{(·)}⌉_{q→T}) will return 0.

Security. If the ciphertext is invalid, (u_c, v_c) ≠ (⌊u'^{(·)}⌉_{q→p}, ⌊v'^{(·)}⌉_{q→T}), at least one of the coefficients of u_c or v_c is invalid. As v_c is processed using exactly the same procedure as the simple comparison, we know that an invalid coefficient of v will propagate to a nonzero Δv^{(·)} and the result of the algorithm will be 1.
The second case is that u_c has at least one invalid coefficient, and without loss of generality we will assume that this is the first coefficient. This means that none of the decompressed values u[0][j] equals the recomputed ciphertext u'^{(·)}[0]. As the multiplication to obtain Δu^{(·)} is calculated in the field Z_q, and none of the terms are zero, we know that Δu^{(·)}[0] ≠ 0.
This means that, with probability at least $1 - q^{-l_2}$, at least one of the coefficients of $E^{(\cdot)}$ is nonzero.
In the end, our goal is to have at least one coefficient of $E_B^{(\cdot)}$ be nonzero, which results in a returned value of 1 due to the OR operation. If a coefficient of $E^{(\cdot)}$ is nonzero (without loss of generality we assume it is the first coefficient), we have
$$\sum_{k=0}^{S-1} E_B^{(k)} \equiv \frac{2^t \cdot E^{(\cdot)}}{q} - \sum_{k=0}^{S-1} \epsilon_k \pmod{2^t},$$
with $\epsilon_k$ a value in $[0,1)$ as derived above. Remember that $E^{(\cdot)} \neq 0$; as such, $E_B^{(\cdot)} = 0$ can only occur due to an over- or underflow. The two closest values for which this could happen are $E^{(\cdot)} = 1$ or $E^{(\cdot)} = q - 1$.
First we look at the possibility of an underflow. The worst-case scenario is $E^{(\cdot)} = 1$ and all $\epsilon_k = 1$, which gives
$$\sum_{k=0}^{S-1} E_B^{(k)} \equiv \frac{2^t}{q} - S \pmod{2^t}.$$
This does not equal zero as long as $\frac{2^t}{q} > S$. For the scenario of an overflow, the worst case is $E^{(\cdot)} = q - 1$ and all $\epsilon_k = 0$, which gives $2^t - \frac{2^t}{q} \bmod 2^t$, which is likewise nonzero.
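The rounding-error behaviour above can be illustrated numerically. The following sketch is our own illustration (not the masked implementation); the names and the parameters $q = 3329$ (Kyber), $S = 3$ shares and $t = 12$ bits are chosen only for the example. It checks that a sharing of zero, after scaling every share by $2^t/q$ with a floor, always recombines to a value in $\{0\} \cup (2^t - S, 2^t)$, which is exactly the $-\sum_k \epsilon_k$ term from the derivation:

```c
#include <stdint.h>
#include <stdlib.h>

#define Q 3329u     /* Kyber prime, example parameter */
#define NSHARES 3   /* S: number of arithmetic shares */
#define TBITS 12u   /* t: output precision in bits */

/* Scale every arithmetic share of E by 2^t/q with a floor and recombine
 * mod 2^t, mimicking the conversion analyzed above. */
static uint32_t scaled_recombine(const uint32_t e[NSHARES]) {
    uint64_t sum = 0;
    for (int k = 0; k < NSHARES; k++)
        sum += ((uint64_t)e[k] << TBITS) / Q;  /* floor(2^t * e_k / q) */
    return (uint32_t)(sum & ((1u << TBITS) - 1));
}

/* Draw a fresh uniform sharing of the value 0 mod q. */
static void share_zero(uint32_t e[NSHARES]) {
    uint32_t acc = 0;
    for (int k = 0; k < NSHARES - 1; k++) {
        e[k] = (uint32_t)rand() % Q;
        acc = (acc + e[k]) % Q;
    }
    e[NSHARES - 1] = (Q - acc) % Q;
}
```

Each floor introduces an error $\epsilon_k \in [0,1)$, so the recombined value stays within $S$ of a multiple of $2^t$, in line with the bounds derived above.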
In conclusion, an invalid ciphertext will result in at least one nonzero coefficient in $res_B^{(\cdot)}$ with probability at least $1 - q^{-l_2}$, and thus in a result of 1.

Proof. The streamlined hybrid comparison is a combination of the hybrid comparison and the simple comparison. As such, we can use the t-SNI security of the gadgets of the hybrid comparison, as proved in [2], to prove t-SNI security of the streamlined hybrid comparison. In this proof we divide the streamlined hybrid method into gadgets that correspond to gadgets already proven t-SNI secure for the hybrid comparison ([2], [27]) or the simple comparison ([1]). The streamlined hybrid comparison can be split into four gadgets, of which $G_1$ to $G_3$ correspond to gadgets in the hybrid comparison. Gadget $G_1$ is the masked multiplication calculated in lines 3 to 5 of Algorithm 11; these lines correspond to the secMultList algorithm of [2], proven t-SNI secure in Theorem 16 of that paper. Gadget $G_2$ represents lines 7 to 10, which correspond to the ReduceComparison technique from [27], proven t-SNI secure in Theorem 2 of that paper. Gadget $G_3$ is the A2B conversion, which should be instantiated with a t-SNI secure A2B conversion. The rest of the algorithm, considered gadget $G_4$, proceeds exactly as the simple comparison method and therefore has the same security guarantees.
□

IMPLEMENTATION ASPECTS
To obtain an efficient implementation of the comparison, it is not only important to select a good algorithm, but also to consider implementation aspects. In this section we first look at the importance of bitslicing the A2B conversion and the OR operation. We show that bitslicing the A2B conversion gives a significant speedup and is essential to obtain an efficient implementation. Notably, bitslicing is applied in the comparison implementation by D'Anvers et al. [1] but not in the implementations by Bos et al. [23] and Coron et al. [2], which makes a direct comparison of these results difficult.
Bitslicing typically needs pre- and postprocessing to correctly align the data in memory. In the second part of this section we show that, due to a reinterpretation of the outputs, it is not always necessary to perform this postprocessing in our case.

Bitslicing
Most comparison implementations use the A2B conversion and OR operation of Coron et al. [37]. A first observation is that this A2B conversion and the OR function involve almost exclusively bitwise operations. These can be bitsliced on a 32-bit CPU: 32 coefficients are processed in parallel, with each bitwise operation acting on all 32 inputs at the same time.
Such an implementation requires pre- and postprocessing to rearrange the inputs in memory: 32 input coefficients are taken in and transposed. In the preprocessing, the first bit of each coefficient is put in the first register, the second bit of each coefficient in the second register, and so on. Bitwise calculations are then performed on each register, digesting 32 coefficients at the same time. At the end of the A2B conversion, the postprocessing restores the output to 32 coefficients of $\log_2(p)$ or $\log_2(T)$ bits.
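The memory transposition can be sketched as follows. This is our own plain C illustration (the names `bitslice32`/`unbitslice32` and the choice of `NBITS` as a stand-in for $\log_2(p)$ are ours); optimized implementations use a constant-time mask-and-swap transpose rather than this quadratic loop:

```c
#include <stdint.h>

#define NBITS 10  /* stand-in for log2(p) or log2(T) bits per coefficient */

/* Preprocessing: gather bit b of each of the 32 input coefficients into
 * bit i of slice[b], so every 32-bit register holds one bit position of
 * all 32 coefficients. */
static void bitslice32(const uint32_t coeff[32], uint32_t slice[NBITS]) {
    for (int b = 0; b < NBITS; b++) {
        uint32_t w = 0;
        for (int i = 0; i < 32; i++)
            w |= ((coeff[i] >> b) & 1u) << i;
        slice[b] = w;
    }
}

/* Postprocessing: restore the 32 coefficients of NBITS bits each. */
static void unbitslice32(const uint32_t slice[NBITS], uint32_t coeff[32]) {
    for (int i = 0; i < 32; i++) {
        uint32_t c = 0;
        for (int b = 0; b < NBITS; b++)
            c |= ((slice[b] >> i) & 1u) << b;
        coeff[i] = c;
    }
}
```

After the preprocessing, every bitwise AND/XOR inside the A2B conversion operates on one slice register and thus digests 32 coefficients at once.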
Due to this pre- and post-A2B memory realignment, the speedup is not a full factor of 32, but as can be seen from Table 1, bitslicing still has a significant impact on the efficiency of the algorithm.

Reinterpretation of the Boolean Masked Bits
After the A2B conversion, the main goal of the previous comparison techniques is to check whether all coefficients of the polynomial or vector are zero. This is equivalent to requiring that all Boolean masked bits are zero. As such, after the A2B conversion one can represent the bits at will, for example as a vector with coefficients in $\mathbb{Z}_{2^{32}}$ instead of coefficients in $\mathbb{Z}_p$ or $\mathbb{Z}_T$.
There are multiple advantages to such a change in representation. For example, in the simple comparison method it is more efficient to perform the OR operation on coefficients of 32 bits due to bitslicing. It is also possible to use such a representation switch in the Galois field method, where lines 4 to 8 of Algorithm 10 then operate on a vector with coefficients in $\mathbb{Z}_{2^{32}}$ instead of $\mathbb{Z}_p$, which results in fewer coefficients to process, and thus lower execution time and reduced randomness consumption.
Moreover, the representation can be chosen so as to avoid any post-A2B memory realignment in the bitsliced A2B function. Remember that the coefficients at the output are aligned in memory as $\log_2(p)$ (or $\log_2(T)$ for $v$) registers of 32 bits. We can then reinterpret these registers as $\log_2(p)$ coefficients in $\mathbb{Z}_{2^{32}}$: the 32 coefficients of $\log_2(p)$ bits become $\log_2(p)$ coefficients of 32 bits. This works because the 32 coefficients of $\log_2(p)$ bits are all zero if and only if the $\log_2(p)$ coefficients of 32 bits are all zero.
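The resulting zero-check can be sketched as follows. This is our own unmasked data-flow illustration (names ours); in the masked implementation the reduction is of course performed on Boolean shares using the masked OR gadget:

```c
#include <stdint.h>

#define NBITS 10  /* number of slice registers output by the bitsliced A2B */

/* The NBITS output registers each hold one bit position of 32 coefficients.
 * Instead of un-bitslicing, we reinterpret them as NBITS coefficients of
 * 32 bits: their OR is zero exactly when all 32 original coefficients are
 * zero, which is all the final check needs. */
static uint32_t or_reduce(const uint32_t slice[NBITS]) {
    uint32_t acc = 0;
    for (int b = 0; b < NBITS; b++)
        acc |= slice[b];
    return acc;  /* nonzero iff some original coefficient was nonzero */
}
```

Note that this skips the postprocessing transpose entirely: the OR-reduction consumes the slice registers in the layout the bitsliced A2B already produces.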
Avoiding the post-A2B memory alignment step has a significant impact on the total A2B cost: as can be seen in Table 2, the postprocessing accounts for 19-23% of the full A2B procedure. Table 1 shows the impact of reinterpreting the coefficients as described in this section; our method leads to a speedup of around 23% for the simple comparison. To the best of our knowledge, such a change of representation has not been presented or implemented in previous works.

EVALUATION
We have implemented and benchmarked the various algorithms described in this paper. Benchmarking was performed on an STM32F407 board with an ARM-Cortex M4F using arm-none-eabi-gcc version 9.2.1 with -O3. The system clock was set to 24 MHz and the TRNG clock to 48 MHz, in accordance with the popular benchmarking framework PQM4 [39]. Randomness is sampled from the on-chip TRNG and its sampling cost is included in the cycle counts.

The Simple, Galois field and streamlined hybrid methods can be optimized using the optimized bitslicing from Section 5.2, which is the case for the numbers in Table 3. The arithmetic method has a security parameter $s = 54$, while the Galois field method has an increased security of $s = 64$, which should be sufficient for cases where an adversary is limited to $2^{64}$ queries. These specific values of $s$ were chosen because they make the implementation variables align nicely with the 32-bit registers of our microprocessor. For the Galois field method with reinterpretation of the Boolean masked bits, we have 32-bit coefficients and 64-bit randomness $r$, which results in a 96-bit output $E^{(\cdot)}$. To increase $s$, one can select 32-bit coefficients with, for example, 96- or 128-bit randomness $r$, which results in a 128- or 160-bit output $E^{(\cdot)}$, respectively. However, we believe such an increase of $s$ is overkill in most scenarios, as discussed in Section 3.3.
The Hybrid method is the original implementation of Coron et al. [2], adapted to allow execution on an ARM platform and with a bitsliced A2B conversion to allow a fair comparison. Both the hybrid and the streamlined hybrid method have a collision probability below $2^{-128}$. Note that it would be possible to increase this collision probability to around $2^{-64}$ without sacrificing security in many situations, as discussed above. However, since the difference in cycle cost between both options is minimal, we stick to a value similar to that of [2].
We choose not to measure stack memory usage, as these comparison methods can easily be optimized for stack usage when implemented in a full decapsulation operation. The idea of such an optimization is a greedy approach: immediately after a coefficient of $u'^{(\cdot)}$ or $v'^{(\cdot)}$ becomes available, as much of the comparison as possible is calculated. Such a greedy approach leads to minimal stack usage compared to other functions in the decapsulation, as only the coefficient in current use and a limited number of intermediate variables need to be stored.
In the rest of this section we will compare the different methods to the simple method. We will start with the (streamlined) hybrid comparison, and then move to the arithmetic/GF method.

Hybrid Comparison
The (streamlined) hybrid comparison essentially performs an additional preprocessing step to reduce the number of coefficients that need to be A2B converted. As can be seen in Table 4, the preprocessing becomes significantly more expensive, but this is compensated by a larger reduction in the subsequent A2B cost. Notably, the hybrid comparison only works for prime-moduli schemes, i.e., Kyber, and not for power-of-two $q$.
Our streamlined hybrid comparison, which uses the simple method to finish the calculations, outperforms the hybrid comparison from [40]. While the initial calculations are the same, the final comparison is significantly faster in the streamlined hybrid comparison. An additional advantage is that the codebase of the streamlined hybrid comparison is less complex, as it requires fewer functions and the functions themselves are simpler.

Arithmetic/GF Comparison
Comparing the arithmetic and Galois field comparison methods weighs clearly in favour of the Galois field method. This is mostly due to the elimination of the expensive B2A conversion.
On the other hand, the ARM-Cortex M4 has native support for the arithmetic multiplication, while it lacks support for the Galois field multiplication. This impacts the cost of the multiply-accumulate operation, which costs 197k cycles for the arithmetic version versus 1,619k cycles for the Galois field version (these numbers include randomness sampling, multiplication, and the addition to obtain $E^{(\cdot)}$).
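To illustrate why, a carry-less multiplication, the core of a Galois field multiplication, must be emulated bit by bit in software when no hardware multiplier is available. The following generic shift-and-XOR sketch is our own (the name `clmul32` is ours, and reduction by the field polynomial is omitted):

```c
#include <stdint.h>

/* Carry-less (polynomial) multiplication over GF(2).  Without hardware
 * support, each operand bit costs a test, a shift and an XOR, which is why
 * the Galois field multiply-accumulate is roughly 8x slower than the
 * natively supported arithmetic one on this platform.  Reduction modulo
 * the field polynomial is omitted in this sketch. */
static uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1u)
            r ^= (uint64_t)a << i;
    return r;
}
```

A single hardware carry-less multiply instruction (as found on larger cores) would replace this entire loop, which motivates the hardware-support scenario discussed next.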
In a scenario where the Galois field multiplication has similar hardware support, for example in a hardware implementation or a hardware/software codesign, the Galois field method would slightly outperform the simple comparison method, as can be derived from Table 4. This would come at a slight increase in implementation complexity.
It is possible to combine the streamlined hybrid method with the Galois field method. The streamlined hybrid method focuses on reducing the preprocessing cost, while the Galois field method focuses on the postprocessing cost. It is therefore straightforward to combine both methods, with the output of the A2B conversion serving as the interface between them.

CONCLUSION

The state-of-the-art higher-order masked comparison techniques can be generalized into a common framework consisting of a preprocessing step, an A2B conversion, a postprocessing step and a final OR operation. In the most simple case, preprocessing is kept to the bare minimum and no postprocessing is performed. Coron et al. [40] introduced a hybrid method, specifically aimed at prime moduli $q$, that reduces the A2B cost by performing additional preprocessing. We sped up this design by ≈ 25% with our streamlined hybrid algorithm. D'Anvers et al. [1] introduced a technique to speed up the OR operation at an increased postprocessing cost. We improved this method by ≈ 20% by replacing the arithmetic multiplication with a Galois field multiplication. While this method does not outperform the simple comparison method on a microprocessor platform, due to the lack of hardware support for the Galois field multiplication, it might be interesting to compare both methods on platforms where support for this multiplication can be built in.
We also looked into implementation optimizations. We reiterated the importance of bitslicing, and showed that additional speedups are possible by reinterpreting the Boolean masked bits output by the A2B conversion. The latter optimization simplifies our codebase and reduces the cycle count of the simple method by ≈ 33%.
Our comparison was performed on an ARM-Cortex M4 microprocessor. Interesting future work would be to make a similar comparison on other platforms, where one can add hardware support for the masked A2B and OR operations, or for the multiplications needed in the hybrid and Galois field comparison methods.
Jan-Pieter D'Anvers received the MSc and PhD degrees in electrical engineering from KU Leuven, in 2015 and 2021, respectively. He is currently a postdoctoral researcher with the COSIC Research Group, KU Leuven, funded by an FWO (Research Foundation Flanders) postdoctoral grant. His research focuses on the design, security and side-channel security of post-quantum cryptography, and fully homomorphic encryption. He is co-designer of Saber, one of the final candidates in the NIST post-quantum standardization process.
Michiel Van Beirendonck received the BSc and MSc degrees in electrical engineering from KU Leuven, Belgium, in 2017 and 2019, respectively. He is currently working toward the PhD degree with the Research Group COSIC, KU Leuven. During his MSc studies, he spent one year with EPFL, Switzerland, as part of the SEMP exchange program. His research focuses broadly on the implementational challenges of lattice-based cryptography. He has worked extensively on side-channel attacks and countermeasures for post-quantum cryptosystems, as well as hardware acceleration of fully homomorphic encryption schemes.
Ingrid Verbauwhede (Fellow, IEEE) is a professor with the Research Group COSIC, Electrical Engineering Department, KU Leuven. At COSIC, she leads the secure embedded systems and hardware group. She was elected as member of the Royal Flemish Academy of Belgium for science and the arts in 2011. She received the IEEE 2017 Computer Society Technical Achievement Award. She is a recipient of two ERC Advanced Grants, one in 2016 and a second one in 2021. She is a pioneer in the field of efficient and secure implementations of cryptographic algorithms on many different platforms: ASIC, FPGA, embedded, cloud. With her research, she bridges the gaps between electronics, the mathematics of cryptography, and the security of trusted computing. Her group owns and operates an advanced electronic security evaluation lab. She is a fellow of the IACR.