Designing a New XTS-AES Parallel Optimization Implementation Technique for Fast File Encryption

XTS-AES is a disk encryption mode of operation based on the block cipher AES. As disk capacities increase, several studies have sought to accelerate encryption with XTS-AES, including research on parallel XTS-AES encryption using GPUs. These studies, however, parallelize only the AES computation; the XTS mode itself has not been optimized, because the <inline-formula> <tex-math notation="LaTeX">$\alpha ^{j}$ </tex-math></inline-formula> computation included in XTS mode is ill-suited to parallel operation. In this paper, we therefore propose several techniques for high-speed encryption on the GPU that restructure XTS-AES into a form advantageous for parallel operation. The core idea is to pre-compute the <inline-formula> <tex-math notation="LaTeX">$\alpha ^{j}$ </tex-math></inline-formula> values on the CPU in a form that is easy to process on the GPU. To this end, we analyze the <inline-formula> <tex-math notation="LaTeX">$\alpha ^{j}$ </tex-math></inline-formula> computation and identify the parts that can be optimized. First, we present a method that replaces multiple operations with a single table lookup. Building on this table-lookup technique, we then propose a method that partially skips the otherwise strictly sequential <inline-formula> <tex-math notation="LaTeX">$\alpha ^{j}$ </tex-math></inline-formula> computation. We present various measurements to identify the optimal implementation, and we compare the performance of the OpenSSL XTS-AES implementation on the CPU with that of our proposed optimized implementation on the GPU.


I. INTRODUCTION
Various security systems and cryptographic algorithms have been developed to protect user information. Disk encryption [1] is a technology that encrypts a computer's hard disk to prevent information leakage caused by theft or loss. For example, BitLocker [2] on Windows provides Full-Disk Encryption (FDE), encrypting an entire disk partition with one key. In addition, various disk encryption software, such as VeraCrypt and TrueCrypt, has been used in this area.
A common disk encryption method is the XTS mode of operation with the block cipher algorithm AES [3]. XTS is a tweakable encryption mode: the tweak value is a combination of the sector address and the index of the block within the sector, which has the advantage that identical plaintexts yield different ciphertexts depending on the location of the file. The associate editor coordinating the review of this manuscript and approving it for publication was Kuo-Ching Ying.
As disk sizes increase, optimization of XTS-AES is required to perform disk encryption efficiently. In XTS-AES, the encryption of each plaintext block is performed independently, so a parallel computing device such as a GPU can be utilized. However, XTS mode includes not only the AES encryption process but also the computation of the tweak value required for encryption. In XTS mode, the plaintext is encrypted using powers of the element α defined over a Galois field: for a total of j plaintext blocks, the powers α^1, ..., α^j are computed, and the i-th block is encrypted using α^i.
Various optimization studies targeting fast file encryption with XTS-AES have been conducted. Although these studies contributed to the fast encryption of many plaintext blocks by computing AES in parallel, none has proposed an optimization technique for the XTS mode itself. This is because the α^j computation included in XTS mode is not suitable for parallel operation: since each multiplication by α reads the most significant bit (msb) of the previous result, the structure is sequential, and the next α^(i+1) operation cannot be performed before the previous α^i operation is finished. Therefore, in this paper, we introduce a technique that optimizes the entire XTS mode by utilizing the GPU, and we propose a method that optimizes the α^j computation process using a lookup table and intermediate values.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This method is a new technique that performs pre-processing on the CPU so that the subsequent operations can be executed efficiently on the GPU. Additionally, we present an efficient implementation technique that computes the AES block cipher used in XTS mode in parallel. The major contributions of this paper are as follows:
• First optimization implementation of the entire XTS mode
In this paper, we introduce an optimization technique for XTS-AES that does not simply propose a parallel encryption method for AES. We restructure the operation of XTS mode, which is not suitable for parallel operation, into a form that is, so that the entire XTS-AES process can be computed in parallel.
• State-of-the-art implementation of XTS-AES
The optimized implementation proposed in this paper shows the best computational performance to date. Our results are 240.1 times faster than the naive CPU version and 21.96 times faster than XTS-AES parallelized naively on the GPU. They are also 12.23 (XTS-AES-128) and 14.64 (XTS-AES-256) times faster than the most recent XTS-AES implementation in OpenSSL.

• Foundational technique applicable in various fields
We have proposed a fundamental optimization method for XTS-AES that can be utilized in various fields of research using XTS-AES. Beyond disk encryption, this technique can be used for memory encryption based on memory addresses, and it can also be applied to encrypting data on networks or mobile devices.
This paper is organized as follows. Section II introduces existing optimization research on XTS-AES. Section III reviews the structure and encryption procedure of XTS-AES. Section IV describes the optimization techniques we propose in this paper. Section V presents the performance results of the proposed implementation and compares them with existing XTS-AES optimization studies. Finally, Section VI discusses the significance of our findings.

II. RELATED WORKS
Research on optimizing the AES encryption process on GPUs remains active. Recently, various optimization methods and results for AES on GPUs have been presented in [4], [5], and [6].
However, not much research has been done on XTS mode yet. In [7] and [8], methods for parallel encryption of XTS mode using OpenMP [9] were proposed; [8] exploited the fact that, as in CTR mode, each plaintext block can be encrypted independently, so that multiple threads can encrypt blocks in parallel. [10] presented several optimized XTS-AES techniques that use up to 32 processors simultaneously via MPI [11] to rapidly encrypt large amounts of data in parallel. In [12], a framework that accelerates XTS mode on the GPU for use on mobile devices was proposed; the system was implemented as a software module that encrypts each 512-byte sector in parallel. For hardware environments, XTS mode has been studied on FPGAs: in [13], [14], and [15], various implementation techniques were proposed to encrypt XTS-AES efficiently on FPGAs.
Apart from academic research, XTS-AES is provided by OpenSSL [16], an open-source implementation of TLS and SSL.

III. BACKGROUND FOR XTS-AES
A. NOTATION
Table 1 summarizes the parameters for understanding XTS-AES.

B. XTS MODE
XTS [1] is a mode of operation for block ciphers, like the ECB, CBC, and CTR modes. It is a tweakable cipher specialized for encrypting block-oriented data and was standardized by IEEE in 2007. XTS mode has since been used in various disk encryption technologies, as it avoids known weaknesses of the earlier CBC and XEX modes. The operating structure of XTS mode is shown in Figure 1.
The encryption process of XTS mode is as follows. XTS mode uses a total of two different keys. First, the tweak value is encrypted with the first key; this encrypted result is shared, with the same value, by all plaintext blocks. The encrypted tweak is then multiplied by α^i according to the index i of each block. The multiplied values differ for each block and are used in a total of two XOR operations: the first XOR is performed with the plaintext, and the second XOR is performed on the result of encrypting the first XORed value with the second key.
The big difference between XTS mode and other block cipher modes of operation is the multiplication by α^j, which is illustrated in detail in Figure 2. The first value, α^0, is the tweak encrypted with the first key. Alpha values are stored as polynomials over the Galois field GF(2^128), from the top byte α^i[15] down to the bottom byte α^i[0], and α^j is obtained by multiplying by α repeatedly at each step. Each multiplication by α doubles the value by shifting all bits left by one; if the former most significant bit was 1, 135 is then XORed into the bottom byte. That is, one α multiplication requires a total of 16 byte-wise left shifts and 15 or 16 XORs (15 when the most significant bit (msb) is 0, 16 when it is 1).

C. AES
AES [3] is a well-known standard block cipher that is still used in many fields. AES comes in three variants depending on key size, of which XTS mode uses two. AES-128 uses 128-bit keys, so XTS mode requires 256 bits of key material in total because two different keys are needed; AES-256 uses 256-bit keys, for a total of 512 bits in XTS mode. The encryption process for one round of AES is shown in Figure 3. The AES block cipher is based on the Substitution-Permutation Network (SPN) structure, and each round consists of SubBytes, ShiftRows, MixColumns, and AddRoundKey. SubBytes is a non-linear substitution function. ShiftRows performs a cyclic rotation on the state. MixColumns is a matrix multiplication over the Galois field GF(2^8) with reducing polynomial x^8 + x^4 + x^3 + x + 1. AddRoundKey XORs the expanded round key with the state. The number of rounds is 10 (AES-128) or 14 (AES-256). The per-round operations of AES can be merged and stored as 32-bit tables, called T-tables [17].
Using T-tables, encryption can be performed with 16 table lookups per round. The principle of T-table construction and its use for encryption is as follows, where s_(i,j) denotes the j-th word of the state in round i.
D. GPU
The GPU is a device developed to handle graphics operations. Recently, many General-Purpose computing on GPU (GPGPU) techniques, which use NVIDIA's CUDA [18] library to run general computations on GPUs, have come into use. The advantage of GPUs is that they can handle many operations in parallel.
A GPU kernel consists of multiple blocks on a grid, and each block is composed of multiple threads. On NVIDIA GPUs, each block can use at most 1024 threads, but since resources per block are limited, the number of threads must be tuned with consideration for the memory, such as registers, used by each thread.
Many cryptographic algorithms use not only basic bit-wise operations but also lookup tables. Since several types of memory exist on a GPU, performance differs greatly depending on which memory the reference table is stored in and read from. The overall memory structure of the GPU is shown in Figure 4.
If the reference table is stored in global memory, which is off-chip and accessible by all threads, its slow load speed can make performance significantly worse than with other memories. Shared memory is on-chip, so it is faster than global memory, and it can be shared by threads within the same block; however, its size is limited, and a bank conflict occurs when multiple threads access the same shared memory bank. Registers are the fastest memory, but the number of registers a thread can use is severely limited, so efficient register design and use are required. Separately, the constant memory of the GPU has a slow access speed by itself, but because frequently used values are cached, it can achieve access speeds comparable to registers. Therefore, when using GPU memory, it is important to avoid global memory and to create the best reference environment by appropriately distributing data among shared memory, constant memory, and registers.

IV. PROPOSED XTS-AES OPTIMIZATION TECHNIQUE

A. PROBLEM
In XTS mode, each plaintext block can be encrypted independently, as in CTR mode, but the value XORed into each block differs according to the block's sector number. If there are j plaintext blocks, XTS mode requires the sequential computation of the powers α^1 to α^j of the encrypted tweak value. The problem is that the larger the data to be encrypted, the larger j becomes, and the greater the load of computing the powers of α. For example, if we encrypt 1 GB of data, each plaintext block is 16 bytes, so j assigns each block an index from 1 to 67,108,864. This method of computation needs to be improved because the capacity of the storage devices used today has grown significantly compared to the past. Therefore, in this paper, we introduce an optimization technique that parallelizes these sequentially processed operations for efficient XTS-AES encryption of large data.

B. MAIN IDEA
Our main idea is to use intermediate values to parallelize the computation of the powers of α, which would otherwise be computed sequentially. Rather than multiplying by α one step at a time, we present a method that skips ahead by a fixed interval, for example directly to the 8th or 128th next power of α. Once these intermediate values are calculated on the CPU, the skipped intervals can be filled in, in parallel, on the GPU. Figure 5 summarizes our main idea.

C. LOOKUP TABLE
During the power computation, the whole 128-bit value is shifted left by 1 bit each time α is multiplied, and 135 is XORed (or not) into the bottom of the data according to the msb. 135 is the 8-bit value 10000111(2). Instead of considering the msb one bit at a time, we decided to compute the final XORed value from the top 8 bits at once. This is possible because the XORs at the bottom of the data do not affect the Most Significant Byte (MSB). For example, suppose the top 8 bits of the data are 11111111(2). When computing the 8th power of α, XOR-with-135 and a 1-bit left shift are repeated a total of eight times: the 135 XORed in the first step ends up shifted left by 7 bits, the one in the second step by 6 bits, and so on. The final result for all 8 bits is shown in Figure 6.
This XORed value depends only on the MSB, regardless of the lower bits of the data. Therefore, we can build a table of the 256 results for the values 00000000(2) to 11111111(2) that the top 8 bits can take. The results for all 8-bit inputs are given in Table 2.

D. INTERMEDIATE VALUE
The use of the table reduces the computation of the 8th power of α to a single table lookup. In this case, the whole value must also be shifted left by 8 bits, but since 8 bits is one byte, it suffices to increase the byte index of each data byte by one, with no bit shifting. Extending this, we can compute the next 128 bits, that is, the 128th power of α, by looking up all 16 bytes of the data in the table. Figure 7 shows the process of looking up the MSB in the table and then XORing the result into the least significant 2 bytes. Figure 8 shows the process of looking up each of the 16 bytes in the table and then XORing the result into its shifted position.
The primitive operation of repeated multiplication by α is given in Algorithm 1. For a 128-bit α^i, the operation proceeds as follows: the value of α^(i+1) is α^i doubled modulo 2^128, implemented as a 1-bit left shift; then, if the most significant bit of α^i was 1, XORing 135 into the result yields the final α^(i+1).
The optimized operations that shorten the one-by-one multiplication by α are given in Algorithms 2 and 3. The difference from Algorithm 1 is that α^(i+8) or α^(i+128) is computed directly from α^i instead of α^(i+1).
In Algorithm 2, α^(i+8) is the result of multiplying α^i by α eight times, that is, multiplying by 2 eight times. Since α is stored as 16 bytes, there is no need to multiply by 2^8 explicitly; the positions in the byte array are simply moved up by one, as shown in Figure 7, where each byte moves to the next byte position. The process of XORing 135 while reading the most significant bit eight times is replaced by a single table lookup: the most significant byte is fed into table T (Table 2), and the 16-bit result is XORed into the low bytes.

E. PARALLEL OPERATION IN GPU
By converting the computation of the powers of α into a form that is easy to parallelize and transferring it to the GPU, the GPU can perform the power computation on α in parallel and then encrypt each plaintext block independently. Inside the GPU, encryption proceeds in two stages. The first is the generation of the tweak values for all blocks, from α^0 to α^(j-1), using the received α^0, α^128, α^256, ..., α^(j-128); the second is the encryption of the plaintext using these tweak values.

Algorithm 1 Primitive Operation of Repeated Multiplication of α. Input: 128-bit data α^i. Output: 128-bit data α^(i+1).

Algorithm 2 Optimized Operation of Repeated Multiplication of α^8. Input: 128-bit data α^i; T = alpha table (Table 2). Output: 128-bit data α^(i+8).

F. PARALLEL ENCRYPTION PROCESS
Each GPU thread performs encryption using one tweak value and one plaintext block. In XTS mode, the tweak value is XORed with the data both before and after the block encryption; therefore, the plaintext must be stored in GPU memory.
To encrypt the plaintext after XORing it with the tweak value inside the GPU, the plaintext data must be copied from the CPU to the GPU in advance. We use CUDA streams to reduce the memory copy time between the CPU and the GPU: by dividing the data among the streams, each stream can perform its memory copies and computation asynchronously. We leveraged 32 CUDA streams, the maximum number available on the GPU, to maximize the pipelining of encryption and memory copies; this configuration showed the highest performance.

Algorithm 3 Optimized Operation of Repeated Multiplication of α^128. Input: 128-bit data α^i; T = alpha table (Table 2). Output: 128-bit data α^(i+128). 1: for j = 0 → 14 do 2: n = j-th least significant byte of α^i 3: ...
In XTS-AES, all plaintext blocks use the same key, so the round keys can be expanded on the CPU and copied to the GPU's constant memory for use. GPU constant memory improves memory reference speed by caching frequently used values. In addition, we implemented AES with a 32-bit word size using T-boxes, exploiting the GPU's 32-bit registers to speed up the AES encryption process.
When the T-box is stored in shared memory, a bank conflict can occur if multiple threads access the same bank address. To avoid this problem, we replicated the T-box once per bank so that each thread refers to a different bank address within the same shared memory. In our implementation, 32 identical T-boxes, matching the number of banks, were stored in shared memory and used for encryption.

V. EVALUATION
In this section, we evaluate the performance of our proposed XTS-AES optimization. First, we profile the computational weight of the tweak value calculation on the CPU and on the GPU. We then compare the performance when encryption is performed exclusively on the CPU versus exclusively on the GPU, and summarize how performance differs across our optimization variants. Finally, for comparison with other XTS-AES implementations, we compare the performance of the XTS-AES provided by the open-source OpenSSL [16] with that of our proposed implementation. OpenSSL supports CPU multi-threading and parallel operation through AVX instructions [19], which allows cryptographic operations to run very quickly even on the CPU.
The environment used to measure the implementation performance is as follows. On an AMD Ryzen 9 5900X CPU (overclocked to 4.7 GHz), we evaluated the performance of our naive CPU implementation and benchmarked the XTS-AES implementation of OpenSSL 3.0.1. The performance of all GPU implementations was measured on an NVIDIA GeForce RTX 3090. All results are averages over 1,000 iterations. Due to the computational characteristics of GPUs, there is a data size at which parallel performance reaches a saturation point; Tables 3 and 4 therefore present the results at a data size of 128 MB, the saturation point.
The performance results of the GPU were measured based on the time it takes to copy all the plaintext data from the CPU to the GPU and then copy the ciphertext data from the GPU back to the CPU after encryption.
The processing times for the α^j computation and for encryption in XTS-AES are shown in Table 3. The naive CPU implementation sequentially performs all α^j computations and then sequentially encrypts with AES using the generated α^i values. The naive GPU implementation performs the α^j computation sequentially but encrypts with AES in parallel using the generated α^i values. Comparing the two, the α^j computation time is the same, but the encryption time is greatly reduced in the GPU environment; the total time to encrypt the data with XTS-AES on the naive GPU (35.72 ms) is about 1/11 that of the naive CPU (381.76 ms), a 10.69-fold speedup.

Table 4 compares our several optimized implementations. The optimized GPU implementations compute the intermediate values α^8 or α^128 through the lookup table so that the α^j computation can be performed in parallel, and then perform the remaining α^i calculations and the encryption in parallel. Unlike the naive GPU implementation, which parallelizes only the encryption, the implementations that restructure the α^j computation for parallel operation greatly reduce the α^j computation time. As the intermediate-value interval for α^j grows, the encryption time increases slightly, but the total XTS-AES time gradually decreases. The implementation optimized to use α^128 as the intermediate value (1.59 ms) reduces the computation time to about 1/22 that of the naive GPU implementation (34.91 ms), a 21.96-fold speedup.

Table 5 compares the performance of XTS-AES in OpenSSL with that of the optimized implementation proposed in this paper.
The percentage figures in the table are the performance improvements of the GPU implementation over OpenSSL 3.0.1. The overall advantage of the GPU implementation grows as the block size in XTS-AES increases. At a block size of 8192, XTS-AES-128 was about 12.23 times faster and XTS-AES-256 about 14.64 times faster.

VI. CONCLUSION
In this paper, we proposed several optimization techniques for efficiently computing XTS-AES, an encryption method used for disk encryption. We presented implementation techniques that transform the tweak computation, which is not suited to parallel operation, into a parallelizable form using a lookup table and intermediate values. As a result, we achieved performance improvements of about 12.23 times (XTS-AES-128) and 14.64 times (XTS-AES-256) over the OpenSSL implementation. The techniques and results proposed in this paper can be used for various disk encryption functions, and beyond disk encryption, for mobile device or network encryption that can use location information in encryption. Furthermore, since our technique optimizes the XTS operating mode itself rather than the AES block cipher, it does not depend on a specific algorithm and can be applied to the XTS mode of various cryptographic algorithms. In the future, we plan to apply our optimization techniques to VeraCrypt, an open-source FDE software, and evaluate the resulting performance improvement.