Speeding Up LAT: Generating a Linear Approximation Table Using a Bitsliced Implementation

The substitution box (S-box) is one of the major components of cryptographic algorithms. To ensure sufficient security against linear cryptanalysis, one of the most powerful attacks, cryptographic algorithm designers must find an S-box with a sufficiently low linear spectrum. However, to the best of our knowledge, most published S-box analysis tools cannot generate linear approximation tables for large S-boxes, such as 16-bit S-boxes. Even tools that support the generation of 16-bit linear approximation tables using parallel processing, such as Eval16BitSbox, require a long time. We used bitslicing, which efficiently processes bitwise operations in parallel by grouping independent operations, to generate a linear approximation table. In this study, the existing generation method, which operates on the S-box element by element, was upgraded to operate on vectors in a bitsliced manner. This improved method enabled the immediate creation of tables, even for 16-bit S-boxes. This approach allows cryptographic algorithm designers to consider a wider variety of S-boxes.


I. INTRODUCTION
For post-quantum security, block ciphers with large block sizes are required. If a block cipher has a large block size but uses small components, a large number of rounds is needed to ensure adequate security. Increasing the component size may increase the complexity of one round, but it reduces the number of rounds. Therefore, large components are required for secure and efficient encryption with fewer rounds [1]-[5]. However, measuring security becomes more complicated as the size of the components increases. A substitution box (S-box) is a representative nonlinear function used in cryptographic algorithms.
The S-box has various security properties, including differential uniformity, linearity (or non-linearity), algebraic degree, and the strict avalanche criterion. When an S-box is used for an encryption algorithm, cryptanalysis is usually performed using these security properties of the S-box.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.
Differential cryptanalysis and linear cryptanalysis are the most significant and powerful attack techniques [6]-[11]. Differential cryptanalysis exploits how differences in the inputs of each round propagate to differences in the outputs. When an S-box is used as the nonlinear function, this attack uses the difference distribution table (DDT) of the S-box. The DDT tabulates how often each input difference leads to each output difference. Linear cryptanalysis uses an equation obtained by linearly approximating the relationship between the input and output bits of the S-box. Such an approximation can be obtained from the linear approximation table (LAT) of the S-box.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Cryptanalysis largely utilizes these two tables. Finding an S-box for which both properties are secure is therefore an important issue for cryptographic algorithm designers. S-box analysis tools, such as PEIGEN, SAGE, SET, and BSAT, generally provide facilities for generating the DDT and LAT [12]-[15]. However, as the bit size of the S-box grows to around 16 bits, most tools do not support LAT generation, or the generation process does not finish. In the most widely used table generation method for an n-bit S-box, $2^{2n}$ computations are required to generate the DDT, and $2^{3n}$ computations are required to generate the LAT. Hence, for a 16-bit S-box, the DDT requires $2^{32}$ computations, and the LAT requires $2^{48}$ computations. The amount of computation necessary for LAT generation is difficult to perform on a typical computer. To the best of our knowledge, papers proposing a 16-bit S-box either suggested a theoretical security bound or used Eval16BitSbox [1]-[3], [5]. For example, in the case of MISTY, a security bound was presented through a structural proof of the S-box, but concrete security values were not presented. A tool called Eval16BitSbox generates the LAT of 16-bit S-boxes using parallel processes [16]. With parallel programming, adding resources (workers) speeds up the generation of the LAT, but the computation per worker, $2^{3n}$ divided by the number of workers, is still too large for most computers. To the best of our knowledge, there is no tool that can perform this amount of computation and generate an LAT for a 16-bit S-box in a short time.
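The operation counts quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `table_costs` and the throughput constant `STEPS_PER_SECOND` are assumptions introduced here to make the scale concrete, not figures from the experiments.

```python
def table_costs(n):
    """Return (DDT cost, LAT cost) in elementary steps for an n-bit S-box."""
    ddt = 2 ** (2 * n)  # every input difference against every input
    lat = 2 ** (3 * n)  # every (input mask, output mask) pair against every input
    return ddt, lat


for n in (4, 8, 16):
    ddt, lat = table_costs(n)
    print(f"n={n:2d}  DDT=2^{2 * n} ({ddt:.3e})  LAT=2^{3 * n} ({lat:.3e})")

# Hypothetical throughput, chosen only to illustrate the order of magnitude.
STEPS_PER_SECOND = 10 ** 9
days = table_costs(16)[1] / STEPS_PER_SECOND / 86400
print(f"16-bit LAT at 1e9 steps/s: about {days:.1f} days")
```

Each additional input bit multiplies the LAT cost by eight, which is why a method that is comfortable at 8 bits becomes impractical at 16 bits.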
We searched for a universal and easy way to generate LATs for large S-boxes, and found a solution in bitslicing. Bitslicing, proposed by Kwan, can efficiently parallelize bitwise operations [17]. This feature is often used to implement cryptographic algorithms efficiently [18]-[20]. It is also used to process S-boxes in parallel on mobile and embedded platforms because of its efficient parallel processing and constant operation speed [21], [22]. A characteristic feature of algorithms using bitslicing is that independent operations are grouped into registers for effective parallel processing. In this study, we applied bitslicing to LAT generation. We computed the LATs and the linear spectra for 16-bit S-boxes. This approach solved the problem of the large amount of computation required to generate LATs, and could generate a 16-bit LAT in a few seconds.
The contributions of this study are as follows:
1) Our proposed high-speed LAT generation method performs operations in units of vectors by bitslicing. The amount of computation is further reduced by pre-computation. Our proposed tool reduces the LAT generation time for an n-bit S-box from approximately $O(2^{3n})$ to approximately $O(2^{2n})$. As a result, a 16-bit S-box LAT is generated in less than 3 s.
2) We generated 8- and 16-bit S-box LATs using our method and others and compared them. We found that, on a single core, our LAT generation was 65 to 1,200 times faster than that of existing algorithms.
3) For the first time, we presented the whole linear spectrum and the linearity of MISTY's 16-bit S-box, based on our improved LAT generation method.
4) We expect our method to help cryptographic algorithm designers to create a wider variety of S-boxes.

II. PREVIOUS LINEAR APPROXIMATION TABLE GENERATION METHODS AND GENERATION SPEED LIMITATIONS
The LAT of an S-box is a table that lists the linear biases between the input and output masks. For an input mask a and an output mask b, the LAT of the n-bit S-box ($S_n$) is defined as follows:
$$\mathrm{LAT}(a, b) = \#\{x \in \mathrm{GF}(2^n) \mid a \cdot x = b \cdot S_n(x)\} - 2^{n-1}$$
where $\cdot$ denotes the dot product over GF(2). The maximum absolute value of the LAT, excluding the (0, 0) entry, is called the linearity. The linear spectrum is the number of LAT entries that take each bias value. Low linearity and a small number of high biases in the linear spectrum make linear cryptanalysis difficult. Algorithm 1 is the most basic LAT generation algorithm known, based on the abovementioned definition [12]-[15].

Algorithm 1 Basic Algorithm for Generating LAT
Require: bit size n, S_n
1: for a in range(2^n) do
[...]
6: end for
7: end for
8: end for
9: return LAT

A rough calculation based on the dot products with the input and output masks indicates that the basic algorithm requires $O(2^{3n})$ operations for an n-bit S-box. The number of operations therefore grows eightfold whenever the S-box size increases by 1 bit; hence, the algorithm is not suitable for generating the LAT of large S-boxes, such as 16-bit S-boxes. Eval16BitSbox processes this algorithm in parallel to improve the speed of execution.
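As an illustration of the element-wise approach, the following Python sketch enumerates every (a, b, x) triple directly. It is not the listing of Algorithm 1; it assumes the common convention LAT(a, b) = #{x : a·x = b·S(x)} − 2^(n−1), and the helper names (`parity`, `lat_basic`, `linearity`, `linear_spectrum`) are hypothetical. The 4-bit PRESENT S-box is used only as convenient test data.

```python
from collections import Counter


def parity(v):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(v).count("1") & 1


def lat_basic(sbox, n):
    """LAT by direct enumeration over all masks a, b and all inputs x."""
    size = 1 << n
    lat = [[0] * size for _ in range(size)]
    for a in range(size):
        for b in range(size):
            matches = 0
            for x in range(size):
                # a.x and b.S(x) are dot products over GF(2)
                if parity(a & x) == parity(b & sbox[x]):
                    matches += 1
            lat[a][b] = matches - (size >> 1)
    return lat


def linearity(lat):
    """Largest absolute LAT entry, excluding the (0, 0) entry."""
    return max(abs(v)
               for a, row in enumerate(lat)
               for b, v in enumerate(row)
               if (a, b) != (0, 0))


def linear_spectrum(lat):
    """Number of LAT entries taking each bias value."""
    return Counter(v for row in lat for v in row)


# Example data (an assumption of this sketch): the 4-bit PRESENT S-box.
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
lat = lat_basic(PRESENT, 4)
print(linearity(lat), sorted(linear_spectrum(lat).items()))
```

The three nested loops of length 2^n make the cubic growth in table size explicit; the innermost loop over x is the part that the bitsliced method replaces with word-wide XORs.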

III. NEW METHOD FOR GENERATING LINEAR APPROXIMATION TABLES WITH A BITSLICED IMPLEMENTATION
A. APPLICATION OF THE KEY IDEA
Applying bitslicing to an n-bit S-box S means viewing the S-box as an n-tuple of component functions $S = (S_0, S_1, \cdots, S_{n-1})$ and encoding each Boolean function $S_i : \mathrm{GF}(2^n) \to \mathrm{GF}(2)$ as the truth table of all its outputs, that is, as a $2^n$-bit vector [23], [24]. For example, the identity permutation I over $\mathrm{GF}(2^4)$ would be represented as the 4-tuple of 16-bit words (0x00FF, 0x0F0F, 0x3333, 0x5555). In this paper, we denote the bitsliced form of f as $\overline{f}$. Line 4 of Algorithm 1 can be executed in parallel because, for fixed masks a and b, it is evaluated independently for each x. The key concept of our study is to compute line 4 using bitslicing. First, we changed $x \cdot a$ to a bitsliced representation by regarding x as I(x). Next, we obtained the value $t_{in}$ by XORing the components of $\overline{I}$ corresponding to a. Subsequently, we calculated $\overline{S}$ and acquired the value $t_{out}$ by XORing the components corresponding to b. Finally, the bias for masks a and b was computed from the Hamming weight (HW) of $t_{in} \oplus t_{out}$. Fig. 1 shows an example. In the data, the top is the most significant bit (MSB), and the bottom is the least significant bit (LSB).
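The encoding and the single-entry computation described above can be sketched in Python, whose arbitrary-precision integers serve as 2^n-bit vectors. The function names (`bitslice`, `xor_selected`, `lat_entry`) are hypothetical, the bit ordering (component 0 is the MSB, input x = 0 sits in the most significant position of each word) is chosen to match the identity example in the text, and the sign convention 2^(n−1) − HW is an assumption.

```python
def bitslice(sbox, n):
    """Return n truth-table words, one per output bit (MSB first).

    Input x = 0 occupies the most significant position of each word.
    """
    words = []
    for i in range(n):
        shift = n - 1 - i          # i = 0 selects the MSB of the output
        w = 0
        for x in range(1 << n):
            w = (w << 1) | ((sbox[x] >> shift) & 1)
        words.append(w)
    return words


def xor_selected(words, mask, n):
    """XOR together the words selected by the bits of mask (MSB first)."""
    t = 0
    for i in range(n):
        if (mask >> (n - 1 - i)) & 1:
            t ^= words[i]
    return t


def lat_entry(sbox, n, a, b):
    """One LAT entry computed with whole-vector XORs and a Hamming weight."""
    identity = bitslice(list(range(1 << n)), n)
    t_in = xor_selected(identity, a, n)            # bit x holds a.x
    t_out = xor_selected(bitslice(sbox, n), b, n)  # bit x holds b.S(x)
    mismatches = bin(t_in ^ t_out).count("1")      # HW of t_in XOR t_out
    return (1 << (n - 1)) - mismatches


identity_words = bitslice(list(range(16)), 4)
print([hex(w) for w in identity_words])
```

A single XOR of two words here replaces 2^n separate parity comparisons in the element-wise algorithm, which is where the speed-up comes from.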
Algorithm 2 is a non-optimized bitsliced implementation of LAT generation, reconstructed based on the abovementioned methodology. $\overline{N}$ denotes the bitsliced form of data N. I is the identity permutation over $\mathrm{GF}(2^n)$, representing the input of $S_n$.

Algorithm 2 Generating LAT With Bitsliced Implementation(1)
Require: bit size n, S_n
1: for a in range(2^n) do
2: for b in range(2^n) do
[...]
9: end for
10: end for /* HW returns Hamming weight */
11: return LAT

B. ALGORITHM IMPROVEMENT
Algorithm 2 has room for improvement with respect to speed. We analyzed the LAT generation process and improved the speed by reducing the loops and pre-computing the $t_{in}$ and $t_{out}$ operations.

1) PRE-COMPUTATION PROCESSES
To improve the performance, we pre-computed and tabulated the data generated by S-box independent operations before the LAT generation process. The S-box dependent operations were also pre-computed when the LAT generation process started. With these pre-computations, the main operations (lines 5 and 6 in Algorithm 2) in the LAT generation were changed to O(1) and O(n), respectively, requiring only simple table references and HW calculations. The generation algorithm of the LAT was divided into three operations: $t_{in}$ generation, $t_{out}$ generation, and an XOR of the results. The first operation, $t_{in}$, depends only on I and a. We can therefore produce the table $M_{in}$ by pre-computing it before generating the LAT. Because $M_{in}$ is independent of the S-box, it can be built once for each S-box size and reused, which reduces the amount of computation. The second operation, $t_{out}$, does not depend on a. We can therefore remove $t_{out}$ from the loop and generate the table $M_{out}$ by pre-computation, as with $t_{in}$. For the final operation, the innermost loop becomes two table references and a simple XOR.
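The pre-computation idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tool's implementation: `bitslice`, `build_mask_table`, and `lat_precomputed` are hypothetical names, the tables play the roles of M_in and M_out, and the sign convention 2^(n−1) − HW is assumed. Each table entry is obtained from a previously computed entry with one XOR.

```python
def bitslice(sbox, n):
    """n truth-table words (MSB component first, x = 0 in the top bit)."""
    words = []
    for i in range(n):
        w = 0
        for x in range(1 << n):
            w = (w << 1) | ((sbox[x] >> (n - 1 - i)) & 1)
        words.append(w)
    return words


def build_mask_table(words, n):
    """table[m] = XOR of the words selected by mask m (MSB first)."""
    table = [0] * (1 << n)
    for m in range(1, 1 << n):
        low = m & -m                   # lowest set bit of m
        i = n - low.bit_length()       # index of the matching word
        table[m] = table[m ^ low] ^ words[i]
    return table


def lat_precomputed(sbox, n):
    """LAT from two lookup tables: M_in (S-box independent) and M_out."""
    size = 1 << n
    half = size >> 1
    m_in = build_mask_table(bitslice(list(range(size)), n), n)
    m_out = build_mask_table(bitslice(sbox, n), n)
    # Innermost step: two table references, one XOR, one Hamming weight.
    return [[half - bin(m_in[a] ^ m_out[b]).count("1")
             for b in range(size)]
            for a in range(size)]


# Example data (an assumption of this sketch): the 4-bit PRESENT S-box.
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
lat = lat_precomputed(PRESENT, 4)
print(lat[0][0], max(abs(v) for row in lat[1:] for v in row))
```

Because `m_in` depends only on n, a real tool could cache it across S-boxes of the same size, while `m_out` must be rebuilt for every S-box.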

2) SEPARATION OF THE LAT MASK
We separated the MSB from each mask to reduce by half the amount of computation involved in $M_{in}$ and $M_{out}$ and the storage space of all tables. In this process, each loop was reduced by half, but the innermost operation was quadrupled. We divided the n-bit masks into two groups according to whether the MSB was 0 or 1. The groups were identical, except for the MSB. Thus, the LAT value was computed from the tables for the masks whose MSB was 0, and the LAT value for the corresponding masks whose MSB was 1 was obtained by additionally XORing the MSB component. This procedure cuts the memory required and the amount of computation needed for the pre-computation by more than half. Algorithm 4 is an improved bitsliced implementation for LAT generation.
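One way to read the mask separation is sketched below in Python. It is not the listing of Algorithm 4: the names (`bitslice`, `build_mask_table`, `lat_split_msb`) are hypothetical, the tables cover only masks whose MSB is 0, and the sign convention 2^(n−1) − HW is an assumption. Each pass through the inner loop fills four LAT entries, which corresponds to the quadrupled innermost operation.

```python
def bitslice(sbox, n):
    """n truth-table words (MSB component first, x = 0 in the top bit)."""
    words = []
    for i in range(n):
        w = 0
        for x in range(1 << n):
            w = (w << 1) | ((sbox[x] >> (n - 1 - i)) & 1)
        words.append(w)
    return words


def build_mask_table(words, k):
    """table[m] = XOR of the k words selected by the k-bit mask m."""
    table = [0] * (1 << k)
    for m in range(1, 1 << k):
        low = m & -m
        table[m] = table[m ^ low] ^ words[k - low.bit_length()]
    return table


def hw(v):
    """Hamming weight of a non-negative integer."""
    return bin(v).count("1")


def lat_split_msb(sbox, n):
    size, half = 1 << n, 1 << (n - 1)
    ident, s_words = bitslice(list(range(size)), n), bitslice(sbox, n)
    # Half-size tables: only masks whose MSB is 0 are tabulated.
    m_in = build_mask_table(ident[1:], n - 1)
    m_out = build_mask_table(s_words[1:], n - 1)
    in_msb, out_msb = ident[0], s_words[0]
    lat = [[0] * size for _ in range(size)]
    for a in range(half):
        for b in range(half):
            t = m_in[a] ^ m_out[b]
            lat[a][b] = half - hw(t)
            lat[a | half][b] = half - hw(t ^ in_msb)
            lat[a][b | half] = half - hw(t ^ out_msb)
            lat[a | half][b | half] = half - hw(t ^ in_msb ^ out_msb)
    return lat


# Example data (an assumption of this sketch): the 4-bit PRESENT S-box.
PRESENT = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
           0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
lat = lat_split_msb(PRESENT, 4)
print(lat[0][0], lat[8][8])
```

The trade-off is visible in the code: the two tables hold 2^(n−1) entries each instead of 2^n, at the price of three extra XOR and Hamming-weight steps per loop iteration.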
A rough calculation based on the innermost loop suggests that the improved algorithm requires $O(2^{2n})$ operations for an n-bit S-box.

IV. COMPARISON WITH PREVIOUS METHODS
To evaluate the performance of the proposed algorithm, we investigated various tools used to analyze S-box properties. For our experiments, we used a personal computer with an Intel Core i7-11700K processor at 3.6 GHz, an NVIDIA GTX 1080 Ti, 64 GB of RAM, and an MSI MAG Z590 TORPEDO mainboard. For the 8-bit S-boxes, we collected cases in which the source code was released together with an article. We excluded cases in which the tool did not work properly, or in which the LAT generation result was incorrect. The tools were written in either C/C++ or Python, but the LAT generation tool implemented in Python was treated as C/C++ because it called compiled C code through ctypes, a C-compatible foreign function library. Thus, for Python, the basic algorithm was implemented and compared. Finally, we selected the S-box analysis tools PEIGEN [12], SET [13], and SAGE [14] as the LAT generation time comparison groups for the 8-bit S-boxes.
We measured the average LAT generation time over 10,000 random 8-bit S-boxes using each tool, except for the basic algorithm implemented in Python, which was measured on only 100 random 8-bit S-boxes because of its long execution time. In the case of 16-bit S-boxes, the LAT generation time was compared with that of Eval16BitSbox, an LAT generator that supports multi-core-based parallel processes through the Parallel Java 2 Library [25]. Because of its slow generation speed, the time for Eval16BitSbox was estimated from the time needed to generate the biases of all output masks for randomly chosen input masks a on randomly generated 16-bit S-boxes. Table 1 shows the experimental results.
Among the tools in Table 1 implemented for S-boxes of 8 bits or less, PEIGEN was the best performing tool based on the C/C++ language, taking 12 milliseconds to generate an LAT. The slowest tool was SAGE, because it used the C library from Python. We also tested other tools shared on GitHub, in addition to the tools in Table 1, but we could not find any tool with a better performance than PEIGEN. Our method took 182 microseconds to generate the LAT, an over 65× performance improvement compared with PEIGEN. In the case of Python, the LAT generation method implemented using Algorithm 1 required more than 1 minute. The rapid generation of LATs in Python normally requires the use of a C library through ctypes, as in SAGE. However, our method, implemented in Python, took 50 milliseconds to generate an LAT, an over 1,200× performance improvement. When generating LATs for a 16-bit S-box, the performance was improved by over 174×. The notation [workers] in Table 1 indicates the number of threads used, and the time was measured based on a single thread. Our method was effective even with a single thread, but, owing to the use of bitslicing, it was even more effective with parallel programming. With parallel programming using CUDA, it took at most 3 seconds to generate an LAT for a 16-bit S-box. PEIGEN, SAGE, and SET did not work for the LAT generation of a 16-bit S-box because of the large size of the S-box.

V. CONCLUSION
In this study, we developed a method of generating LATs using a bitsliced representation. The amount of computation was reduced to approximately $2^{2n}$ by pre-computing $t_{in}$ and $t_{out}$. Our algorithm was faster than other tools, and the LAT of a 16-bit S-box could be generated in a short time. Our study has widened the range of S-box sizes that can be investigated in the future. We expect our method to be valuable to cryptographic algorithm designers using large S-boxes.