Bitsliced Implementation of Non-Algebraic 8×8 Cryptographic S-Boxes Using x86-64 Processor SIMD Instructions

The article is devoted to the software bitsliced implementation of randomly generated $8\times 8$ S-Boxes of block ciphers, focused on the use of logical SIMD instructions from the SSE, AVX and AVX-512 extensions of x86-64 processors. A heuristic algorithm for minimizing non-algebraic S-Boxes in three logical bases is proposed: universal, based on the logical instructions AND, OR, XOR, NOT, which allows implementation on any 8/16/32/64-bit processor; extended, based on the instructions AND, OR, XOR, NOT, AND-NOT, which allows implementation on x86-64 processors; and ternary, based on ternary logic instructions, for implementation on x86-64 processors with AVX-512 support. On average, bitsliced representations of non-algebraic S-Boxes in these logical bases require 400/380/200 logical instructions, respectively. The performance of bitsliced implementations of the S-Boxes of the “Kalyna” cipher using SSE/AVX/AVX-512 logical instructions was measured on an Intel Xeon Skylake-SP processor. A fast non-bitsliced alternative is also proposed: a bytesliced SubBytes operation based on the AVX-512VBMI extension, which is resistant to timing and cache attacks.


I. INTRODUCTION
Symmetric block ciphers (SBC) are used for cryptographic data protection in a wide range of applications. Given the growing amount of data, modern SBC must provide high-performance software implementations of encryption for a wide class of microprocessor architectures with different computing capabilities and available resources. An equally important requirement is increased resistance to side-channel attacks: for low-end CPUs (8/16/32-bit microcontrollers) these are primarily power analysis attacks, for high-end CPUs (x86, ARM Cortex-A) mainly timing and cache attacks.
There are several approaches to implementing SBC, differing in computation time, security, and resource requirements: classic, table-based, SIMD-based, and bitsliced. For many ciphers, if hardware cryptographic accelerators are not used, it is the bitsliced approach that provides the highest speed of software implementation on various processor platforms [1], [2], [3], [4], [5], [6], [7]. Moreover, the bitsliced implementation approach offers the following benefits:
• constant execution time of cryptographic operations, with resistance to timing and cache attacks [8], [9],
• low demands on computing resources,
• maximum use of the capabilities of modern high-end microprocessors to increase performance through parallelization of both code execution (superscalarity) and data processing (SIMD technology),
• suitability for adaptation to low-end CPUs and for hardware implementation on FPGA and ASIC.
The basic idea behind bitslicing is to describe a cipher in terms of logical operations, for example AND, XOR, OR, NOT. In processors, any such logical operation can be represented by an appropriate instruction (including a vector one) that processes many bits simultaneously. High performance is achieved because the CPU processes many elements of the cipher (bytes, blocks) in parallel, using fast logical instructions and a few simple operations such as bit permutations. The larger the capacity of the bitsliced register, the greater the performance gain; this approach is therefore especially effective for vector instructions operating on wide registers.
The main problem in the bitsliced implementation of SBC is the representation of S-Boxes larger than 6×6, especially if they are randomly generated and are not described by algebraic equations. This article is devoted to this very case. Since the gain from software bitslicing grows with the register length, we focus on representing an arbitrary 8×8 S-Box using the 128/256/512-bit vector SIMD extensions of the x86-64 instruction set.
Universal classic approaches to minimizing logical functions do not take into account the specifics of the bitsliced representation, and they must also balance the quality of the result against the computation time. When minimizing static S-Boxes that are designed to serve for many years in various cryptographic primitives, computation time can be sacrificed by using more time-consuming minimization algorithms that give better results. This paper proposes one such approach: the truth table describing the S-Box is first broken down into fragments for which an exhaustive search for the optimal representation (in terms of the number of gates) is feasible, followed by gradual minimization that takes into account the intermediate results obtained in the previous steps. This is much more effective than two-level minimization.
The work also offers approaches to bitsliced and bytesliced calculations of any 8×8 S-Box based on the new capabilities of AVX-512 technology. In particular, the use of a ternary instruction (computing any logical function of three arguments) reduces the number of instructions compared to traditional bitsliced representations based on the AND, XOR, OR, NOT instructions. In addition, new byte manipulation instructions allow the non-linear substitution operation to be performed on bytes held in vector registers without memory or cache access, making the implementation resistant to timing and cache attacks.
The aim of this article is to estimate the minimum number of logical instructions/gates required for the bitsliced representation of a non-algebraic 8×8 S-Box, using the S-Boxes of the "Kalyna" cipher as an example, provided that the logical instructions offered by the SIMD extensions of x86-64 CPUs are used.
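As a minimal illustration of this idea, the sketch below evaluates the one-bit function f = (a AND b) XOR c for 64 independent inputs at once. It is a portable C++ model, not code from this work: a 64-bit word stands in for a SIMD register, and the names `make_slice`, `f_sliced` and `f_scalar` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Collect bit `pos` of each of the 64 input bytes into one 64-bit "slice":
// bit k of the result is bit `pos` of in[k].
std::uint64_t make_slice(const std::uint8_t (&in)[64], unsigned pos) {
    std::uint64_t slice = 0;
    for (unsigned k = 0; k < 64; ++k)
        slice |= static_cast<std::uint64_t>((in[k] >> pos) & 1u) << k;
    return slice;
}

// Bitsliced f = (a AND b) XOR c: two logical operations process 64 instances.
std::uint64_t f_sliced(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    return (a & b) ^ c;
}

// Reference: the same function on single bits.
unsigned f_scalar(unsigned a, unsigned b, unsigned c) {
    return (a & b) ^ c;
}
```

With a 512-bit register in place of the 64-bit word, the same two operations would cover 512 instances, which is why the gain grows with register width.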

II. ANALYSIS OF RECENT RESEARCH AND LITERATURE
The most difficult step in a bitsliced implementation of a cryptographic algorithm, a step that largely determines the overall performance, is the logical representation of the nonlinear S-Box tables. In the case of a hardware implementation, the logical base consists of the logic gates (Gate Equivalent, GE) AND, OR, XOR, NOT. In a software bitsliced implementation, the gates are replaced with the corresponding instructions, which are available in most processor architectures. So from now on, the terms gate and instruction will be used synonymously. It should be noted that some processors do not have a NOT instruction; it is then emulated by the XOR instruction, since $\bar{x} = 1 \oplus x$. Since processor logic instructions mostly take two operands, the logic gates must be two-input (Fig. 1), so it is possible to switch unambiguously between the logical representation and the software one.
Let us consider the approaches to representing an S-Box as combinational logic with a minimum number of two-input AND, OR, XOR, NOT gates. The approach to optimizing an 8×8 S-Box depends on which of the following three variants was used to construct it:
1) S-Boxes with a built-in algebraic structure, such as in the AES or SM4 ciphers. This type of S-Box enables a compact bitsliced implementation by taking advantage of the properties of the underlying algebraic transform. However, the possibility of an analytical description facilitates cryptanalysis.
2) S-Boxes generated from smaller S-Boxes (usually 4×4). Three constructs are most commonly used for generation: the Feistel network, SPN or the MISTY structure [10]. These S-Boxes can also have a compact bitsliced implementation. Examples of ciphers include CLEFIA, Crypton, ICEBERG, and many others.
3) Non-algebraic S-Boxes, randomly generated with specific cryptographic properties. The absence of any analytical relationships describing the S-Box data complicates their bitsliced representation, but at the same time makes them more resistant to differential, linear and algebraic cryptanalysis. Examples of ciphers that implement this approach are "Kalyna", KHAZAD and Anubis.
Much research is devoted to the logical minimization of the AES S-Box, with the best results summarized in Table I. The main approach is to split the S-Box into linear and non-linear parts and minimize them using heuristic methods.
The results (minimal number of logic gates) presented in Table I relate to the hardware implementation, not the software implementation. It should be emphasized that for a hardware implementation the minimization criterion should be not only the number of logic gates, but also the chip area required for a given technology, the length (depth) of the signal propagation path, or the energy consumption. From the physical point of view, not all logic gates are equivalent according to these criteria, which was taken into account, for example, in [14], [15].
Considered in terms of software implementation, the best result in Table I is the one that uses 128 gates [12], because the others use NAND/NOR gates, each of which requires two processor instructions. Therefore, this 128-gate description of the AES S-Box from [12] will be used to compare the performance with the bitsliced representations of non-algebraic S-Boxes obtained in this paper.
The research described in [11], [12], [13], [14], and [15] exploits the algebraic structure of the AES S-Box, which significantly simplifies the minimization compared to randomly generated data, so such or similar approaches cannot be used directly in this paper.
SAT solvers are used for minimizing small S-Boxes (e.g. 4×4) that are typical of lightweight ciphers and are used in generating 8×8 S-Boxes. The representation of various minimization criteria in SAT terms is described in [16]; in the context of the presented research, the following are particularly important:
1) Bitslice gate complexity (BGC): the minimum number of AND, OR, XOR, NOT operations, used in bitsliced software implementations for various processor architectures.
2) Gate complexity (GC): the minimum number of gates with the additional use of NAND, NOR and XNOR, required for an efficient hardware implementation or a software implementation on processors that provide such instructions.
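The NOT emulation mentioned above can be checked directly. In the short sketch below (portable C++, hypothetical function name), the constant 1 of the identity becomes an all-ones word, because in a bitsliced register every bit lane holds its own independent value.

```cpp
#include <cassert>
#include <cstdint>

// NOT emulated with XOR: each lane is XORed with 1, i.e. the word with all ones.
std::uint64_t not_via_xor(std::uint64_t x) {
    const std::uint64_t ones = 0xFFFFFFFFFFFFFFFFull;
    return x ^ ones;
}
```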
The results for these minimization criteria are shown in Table II.
The problem is that this approach only works for small S-Boxes, up to 5×5, and even for such small S-Boxes a solution is not always found (Table II: Minalpher, Keccak). In the case of 8×8 S-Boxes, this approach cannot be applied because of its computational complexity. It is possible to break 8×8 tables into smaller parts and minimize each part separately with SAT solvers, but the result will not be optimal, as it will not take into account the relationships between the parts. The functional capabilities of existing specialized utilities for the bitsliced representation of S-Boxes are shown in [17]; their common disadvantage is that they cannot work with randomly generated 8×8 S-Boxes.
An example of minimizing non-algebraic S-Boxes is [18], which describes several AES S-Box hardware implementations synthesized in the Synopsys Design Compiler environment, including a direct tabular implementation (without using the AES S-Box algebraic structure), which requires 1312 gates. Similar results were published in [19] where AES S-Box was synthesized into VHDL in Xilinx Design Suite 14.5 with a total of 1150 gates. It can be expected that similar results will be obtained for other non-algebraic tables.
The 8×8 S-Boxes can be considered as logical functions given by truth tables. Classic methods of minimizing logical functions represented by a truth table, such as the Karnaugh map or the Quine-McCluskey prime implicant method, are not appropriate in this case because they perform a two-level minimization using only AND, OR, and NOT (without XOR) and do not directly take the two-input gate requirement into account. The de facto standard for minimizing logical functions with a large number of variables is the Espresso program, which uses a heuristic algorithm [20]. There are also derivatives of this program such as BOOM (which is faster), Logic Friday (which provides a graphical interface for the Espresso algorithm), ABC and others. In practice, Espresso gives results very close to the global minimum, using several orders of magnitude less memory and time than other methods. The Espresso algorithm is used at the minimization stage in many tools and CAD packages for the synthesis of logic circuits (FPGA, ASIC).
However, it should be emphasized that Espresso is also not the best solution for non-algebraic S-Boxes: it still uses only the AND, OR, NOT operations (the XOR operation can be introduced after the minimization result is obtained, but not during the process), and it does not take into account the restriction of gates to two inputs.
The results of the S-Boxes minimization of the "Kalyna" cipher using the Logic Friday program are presented in Table III.
Note that the number of gates includes NAND elements that require two processor instructions in the software representation, so the total number of operations in the software implementation will be even greater than that indicated in Table III. Although the results are better than in [18] and [19], they are not satisfactory in terms of speed.
Thus, none of the minimization methods considered gives satisfactory results, or can be applied directly to randomly generated 8×8 S-Boxes. Therefore, the authors developed and implemented their own heuristic methods in C++ to minimize such S-Boxes, targeted at three scenarios:
1) UNIVERSAL. This bitsliced representation uses only {AND, OR, XOR, NOT} operations and can be implemented in software on any processor that has instructions for these operations (including SIMD), as well as in hardware, using the corresponding logic gates.
2) EXTENDED. A software bitsliced representation oriented to the use of the AND, OR, XOR, NOT and AND-NOT instructions, which are available both in the basic instruction set architecture and in all SSE/AVX/AVX-512 extensions of x86-64 CPUs. This representation is universal for x86-64 processors.
3) TERNARY. A software bitsliced representation based on the logical SIMD instruction vpternlogd, which is only available on x86-64 processors with the AVX-512 extension.
TABLE IV. THE TRUTH TABLE FRAGMENT FOR SBOX0
For each scenario, the heuristic minimization program automatically generates C++ functions based on SIMD SSE/AVX/AVX-512 instructions, which perform the bitsliced computations for a given S-Box. To represent SIMD instructions in C++ code, intrinsic functions [21] are used, which are a high-level wrapper for assembly language instructions. The resulting bitsliced implementations of the S-Boxes used in this study are available at [22].

III. VECTOR EXTENSIONS OF THE X86-64 PROCESSORS INSTRUCTION SET
Modern microprocessors with the x86-64 architecture support several sets of vector instructions: SSE, AVX/AVX2, AVX-512 [21], [23]. Their capabilities relevant to the goal of this work are presented below. SSE (Streaming SIMD Extensions) is a set of over 70 SIMD instructions that use eight 128-bit registers xmm0-xmm7 added to the processor architecture. Subsequently, the SSE technology was supplemented with new extensions such as SSE2, SSE3, SSSE3 and SSE4, which significantly increased its efficiency. With the advent of 64-bit processors, the number of vector registers increased to 16 (xmm0-xmm15). Hereinafter, SSE or SSE-128 will stand for all 128-bit extensions SSE, SSE2, SSE3, SSSE3, and SSE4. An important limitation of the SSE extensions is the two-operand instruction format (a = a + b), in which the content of one of the operands is lost, so additional data transfers are required. The advantage of the SSE technology is that it is supported by almost all x86-64 CPUs on the market today, as the last extension, SSE4, dates from 2008.
AVX (Advanced Vector Extensions) is an extension of the instruction set of x86 microprocessors that has been available in processors since 2011. The AVX2 extension is a further development of AVX for integer operands, implemented in processors since 2013. Hereinafter, AVX or AVX-256 will stand for both 256-bit vector extensions, AVX and AVX2. In AVX, the width of the sixteen SIMD registers (ymm0-ymm15) was increased from 128 to 256 bits, new instructions were added, the non-destructive instruction format (c = a + b) was introduced, and the requirements for aligning operands in memory were relaxed. AVX technology is currently supported by most mainstream processors and provides a high level of parallelism, so the authors consider it essential for bitsliced implementations.
AVX-512 is a further extension of the 256-bit AVX instructions. The number of registers has been increased to 32 (zmm0-zmm31), their width has been expanded from 256 to 512 bits, many new instructions have been added and the existing capabilities have been extended. AVX-512 is the generic name for many extensions, not all of which are necessarily supported by the CPUs that implement it, with the exception of AVX-512F (Foundation). So far, only a small percentage of processors support the AVX-512 extension, which started to appear in top processor models in 2017.

IV. BITSLICED IMPLEMENTATION
A. S-Box Representation
In the bitsliced representation, substitution tables can be thought of as logical functions given by 8×8 truth tables. For example, a fragment of Sbox0 of the cipher "Kalyna" has the form shown in Table IV, where x0-x7 are input variables and y0-y7 are output variables.
The proposed approach to bitsliced implementation consists of three main stages: 1) Breakdown of the 256-bit logical variables y0-y7, for which it is difficult to find a representation with a number of gates close to the minimum, into smaller parts (vectors), for which an exhaustive search over all representation variants (logical schemes) can be performed and the optimal logical scheme chosen. A 32-bit vector was selected, which keeps the memory and computational complexity requirements of the following steps realistic. Thus, each 256-bit variable yi will be described by eight 32-bit vectors yi[j], j = 0, …, 7. Since the vectors yi[j] are 32-bit, the 5 lower input variables (x0, x1, x2, x3, x4) are used to minimize them.
The x5-x7 variables form 8 masks m[j], which are superimposed on the vectors of the corresponding output variables y0-y7. In the case of Sbox0, the truth table takes the form shown in Table V. The formula describing this representation has the form:
$$y_i = \bigvee_{j=0}^{7} \left( m[j] \wedge y_i[j] \right), \quad i = 0, \ldots, 7. \tag{1}$$
The input data for the algorithm are the vectors x0 = 0xaaaaaaaa, x1 = 0xcccccccc, x2 = 0xf0f0f0f0, x3 = 0xff00ff00, x4 = 0xffff0000. There are also two trivial vectors: the zero vector 0x0 and the all-ones vector 0xffffffff.
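The split of a truth table into 32-bit vectors can be sketched as follows. This is an illustrative model under stated assumptions, not the authors' program: `demo_sbox` is a hypothetical stand-in table (not a "Kalyna" S-Box), the helper names are invented, and bit k of a fragment is assumed to correspond to input value 32·j + k, which is the convention under which the input columns come out as the constants quoted above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using SBox = std::array<std::uint8_t, 256>;

// Hypothetical stand-in table (NOT a Kalyna S-Box); any byte table works here.
SBox demo_sbox() {
    SBox s{};
    for (unsigned x = 0; x < 256; ++x)
        s[x] = static_cast<std::uint8_t>(x * 167u + 29u);
    return s;
}

// 32-bit fragment y_i[j]: bit k is output bit i of S[32*j + k].
std::uint32_t y_val(const SBox& s, unsigned i, unsigned j) {
    std::uint32_t v = 0;
    for (unsigned k = 0; k < 32; ++k)
        v |= static_cast<std::uint32_t>((s[32 * j + k] >> i) & 1u) << k;
    return v;
}

// Truth-table column of input variable x_n (n = 0..4) over the 32 rows
// of one fragment: bit k is bit n of the row number k.
std::uint32_t x_col(unsigned n) {
    std::uint32_t v = 0;
    for (unsigned k = 0; k < 32; ++k)
        v |= static_cast<std::uint32_t>((k >> n) & 1u) << k;
    return v;
}
```

Calling `y_val` for i = 0..7 and j = 0..7 yields the 64 vectors that the minimization has to express through x0-x4.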
So the minimization problem can be formulated as follows: a set of 5 vectors (32-bit numbers) x0-x4 is given. It is necessary to compute all 64 y_val vectors that make up the given nonlinear substitution table S-Box, using a minimum of logical operations and, respectively, intermediate variables.
Then the values of y0-y7 are calculated using formula (1).
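To make the role of the masks concrete, the sketch below evaluates an arbitrary byte table in bitsliced form on 64 bytes at once, with a 64-bit word standing in for a SIMD register. It is deliberately naive: each vector yi[j] is computed as an unminimized OR of minterms of x0-x4, whereas the point of the proposed algorithm is to replace exactly this part with a short gate sequence. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using SBox   = std::array<std::uint8_t, 256>;
using Plane  = std::uint64_t;            // stand-in for one SIMD register
using Planes = std::array<Plane, 8>;     // bit planes x0..x7 or y0..y7

// AND of literals: lanes where variables x[first..first+nbits-1] encode `value`.
Plane minterm(const Planes& x, unsigned first, unsigned nbits, unsigned value) {
    Plane m = ~Plane{0};
    for (unsigned b = 0; b < nbits; ++b)
        m &= ((value >> b) & 1u) ? x[first + b] : ~x[first + b];
    return m;
}

Planes sbox_bitsliced(const SBox& s, const Planes& x) {
    std::array<Plane, 32> low{};         // the 32 minterms of x0..x4
    for (unsigned k = 0; k < 32; ++k) low[k] = minterm(x, 0, 5, k);
    Planes y{};
    for (unsigned j = 0; j < 8; ++j) {
        const Plane m = minterm(x, 5, 3, j);   // mask m[j] from x5..x7
        for (unsigned i = 0; i < 8; ++i) {
            Plane yij = 0;                     // y_i[j], unminimized
            for (unsigned k = 0; k < 32; ++k)
                if ((s[32 * j + k] >> i) & 1u) yij |= low[k];
            y[i] |= m & yij;                   // accumulate m[j] AND y_i[j]
        }
    }
    return y;
}

// Convert 64 bytes to bit planes and back (data layout helpers).
Planes slice(const std::array<std::uint8_t, 64>& in) {
    Planes p{};
    for (unsigned k = 0; k < 64; ++k)
        for (unsigned i = 0; i < 8; ++i)
            p[i] |= static_cast<Plane>((in[k] >> i) & 1u) << k;
    return p;
}

std::array<std::uint8_t, 64> unslice(const Planes& p) {
    std::array<std::uint8_t, 64> out{};
    for (unsigned k = 0; k < 64; ++k)
        for (unsigned i = 0; i < 8; ++i)
            out[k] |= static_cast<std::uint8_t>(((p[i] >> k) & 1u) << i);
    return out;
}
```

Because exactly one mask m[j] is set in every lane, the OR over j simply selects the fragment that matches the upper three input bits.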

B. Universal
In this bitsliced representation, only the standard logical AND, OR, XOR, NOT instructions are used. Therefore, this variant can be implemented on any 8/16/32/64-bit processor, including the simplest 8-bit MCUs, as well as on x86-64 processors with vector instruction support, where the SIMD instructions AND, OR, XOR are also needed […] t required for their calculation are entered into a special trace structure, from which a system of logical equations is then formed. The value generated by the logical expression is the key that is searched for in trace. Every row in trace has the following format: […]
The results of applying the developed algorithm to various S-Boxes are shown in Table IX. Overall, non-algebraic S-Boxes require about 400 gates, almost three times more than the AES S-Box. On average, 4.09 instructions ($\bar{x} = 4.09$) are needed to compute one vector. The results obtained are much better than those obtained with the Espresso program.
The results for AES and SM4 are given in Table IX for comparison, although, from a practical point of view, applying the proposed heuristic algorithm to them does not make sense, as their algebraic structure allows for much more efficient optimization.

C. Extended
All x86-64 processors with SSE/AVX/AVX-512 extensions have a SIMD andn instruction that implements the logic function ANDNOT(a, b) = !a & b (Table VII). The logical base for the bitsliced representation can therefore be extended to {AND, OR, XOR, NOT, ANDNOT}, which slightly reduces the total number of instructions. The rest of the details are the same as in the UNIVERSAL algorithm.
The AND-NOT instruction also has a non-vectorized version that works with three 32/64-bit operands: andn r64/r32, r64/r32, r64/r32. The EXTENDED representation can therefore also be used on x86 processors without SIMD extensions, provided that they support this instruction (it belongs to the BMI1 extension). The total number of logic gates (instructions) representing the 8×8 S-Box will be: […] The results are given in Table IX. On average, the number of gates is reduced by 18, and the calculation of one vector requires 3.84 instructions ($\bar{x} = 3.84$).
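The saving from the extra operation can be seen on a bitwise select, sketched below in portable C++ with hypothetical function names. In the UNIVERSAL basis the expression costs four operations (AND, NOT, AND, OR); with ANDNOT as a single operation it costs three. On SSE2 the corresponding intrinsic is `_mm_andnot_si128(a, b)`, which computes (NOT a) AND b; the scalar `andnot` here only models that semantics, and an optimizing compiler may fuse the scalar forms anyway.

```cpp
#include <cassert>
#include <cstdint>

// Model of the AND-NOT operation: ANDNOT(a, b) = !a & b.
std::uint64_t andnot(std::uint64_t a, std::uint64_t b) {
    return ~a & b;
}

// Select p where m is 1 and q where m is 0, UNIVERSAL basis: 4 operations.
std::uint64_t select_universal(std::uint64_t m, std::uint64_t p, std::uint64_t q) {
    const std::uint64_t not_m = ~m;        // NOT
    return (m & p) | (not_m & q);          // AND, AND, OR
}

// The same select in the EXTENDED basis: 3 operations.
std::uint64_t select_extended(std::uint64_t m, std::uint64_t p, std::uint64_t q) {
    return (m & p) | andnot(m, q);         // AND, ANDNOT, OR
}
```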

D. Ternary
On x86-64 processors with AVX-512 support, it is possible to use the bitwise ternary instruction vpternlogd, which implements any three-operand logic function by means of a 3-input look-up table (LUT) (Fig. 2). The instruction takes three 512-bit operands a, b, and c and, for each bit position, forms a 3-bit index from the corresponding bits of a, b, and c. The imm8 byte is the truth table of the desired logical function: it defines the output bit for each value of the index.
In this way, it is possible to construct a bitsliced S-Box representation using only this instruction to synthesize the desired Boolean expressions. The ternary instruction has the same latency and throughput as the basic logical AND, OR, XOR instructions (Table VIII), but it reduces the total number of instructions by combining several operations into one.
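The behaviour of the ternary instruction can be modelled in a few lines of scalar C++. The sketch below is an emulation on 32-bit words, not the intrinsic itself; the function name `tern` follows the notation used in this section, and the index is assumed to be formed as 4·a + 2·b + c, so that the imm8 of a function can be obtained by evaluating it on the constants 0xF0, 0xCC and 0xAA.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a 3-input LUT applied to every bit position:
// index = 4*a_k + 2*b_k + c_k, result bit = bit `index` of imm8.
std::uint32_t tern(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                   std::uint8_t imm8) {
    std::uint32_t r = 0;
    for (unsigned k = 0; k < 32; ++k) {
        const unsigned idx = (((a >> k) & 1u) << 2)
                           | (((b >> k) & 1u) << 1)
                           |  ((c >> k) & 1u);
        r |= static_cast<std::uint32_t>((imm8 >> idx) & 1u) << k;
    }
    return r;
}

// imm8 values derived by evaluating the function on 0xF0, 0xCC, 0xAA.
constexpr std::uint8_t IMM_XOR3   = 0xF0 ^ 0xCC ^ 0xAA;            // a ^ b ^ c
constexpr std::uint8_t IMM_SELECT = (0xF0 & 0xCC) | (0x0F & 0xAA); // a ? b : c
constexpr std::uint8_t IMM_AND3   = 0xF0 & 0xCC & 0xAA;            // a & b & c
```

A three-input XOR, which costs two instructions in the UNIVERSAL basis, is a single table look-up here, which is where the reduction in instruction count comes from.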
Note that the vpternlogd instruction is included in AVX-512F (Foundation) and is therefore supported by all processors with AVX-512 technology, making this implementation universal for this class of processors. The functions for computing the masks m[j] = f(x7, x6, x5) require one ternary instruction each, as do the masking and accumulation steps yi = yi | (m[j] & yi[j]).
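Both single-instruction steps can be written down explicitly. The sketch below is a scalar model with hypothetical helper names, assuming the index order 4·x7 + 2·x6 + x5: under that assumption the mask m[j] is the LUT with only bit j of imm8 set, and the step yi | (m[j] & yi[j]) is the LUT 0xF8 = 0xF0 | (0xCC & 0xAA).

```cpp
#include <cassert>
#include <cstdint>

// Word-wide model of a 3-input LUT: OR together the minterms selected by imm8.
std::uint64_t tern(std::uint64_t a, std::uint64_t b, std::uint64_t c,
                   std::uint8_t imm8) {
    std::uint64_t r = 0;
    for (unsigned idx = 0; idx < 8; ++idx) {
        if (!((imm8 >> idx) & 1u)) continue;
        r |= ((idx & 4u) ? a : ~a) & ((idx & 2u) ? b : ~b) & ((idx & 1u) ? c : ~c);
    }
    return r;
}

// Mask m[j]: lanes where (x7, x6, x5) encodes j; imm8 has only bit j set.
std::uint64_t mask(std::uint64_t x7, std::uint64_t x6, std::uint64_t x5,
                   unsigned j) {
    return tern(x7, x6, x5, static_cast<std::uint8_t>(1u << j));
}

// One accumulation step y | (m & v) as a single LUT with imm8 = 0xF8.
std::uint64_t accumulate(std::uint64_t y, std::uint64_t m, std::uint64_t v) {
    return tern(y, m, v, 0xF8);
}
```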
Thus, 8 ternary instructions are needed for the masks, 64 for the masking steps and, on average, $\bar{x}$ for each of the 64 vectors, so the total number of ternary instructions to represent the 8×8 S-Box will be given by the formula:
$$N = 8 + 64 + 64\,\bar{x},$$
where $\bar{x}$ is the average number of ternary instructions representing a single y_val vector.
The algorithm for finding vector representations using ternary instructions is presented in Algorithm 2. The vector search starts from the initial base base = {x0, x1, x2, x3, x4}. First, a search is made for all 32-bit vectors that can be formed with a single ternary instruction tern(a, b, c, imm8), where a, b, c ∈ base, imm8 = 1…254. If one of the vectors yi[j] is found, its value is added to base, the corresponding entry trace = {yi[j]: a, b, c, imm8} is added to the trace, and the search is restarted from the beginning.
If no vector is found, all possible vectors vect = tern(a, b, c, imm8), a, b, c ∈ base, imm8 = 1…254, and corresponding […] tern(a, b, d, imm8) […]
The results of the bitsliced S-Box representation using the ternary instruction are given in Table IX. On average, computing one vector requires 2.14 instructions ($\bar{x} = 2.14$).
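The first phase described above (single-instruction hits with restart) can be sketched as follows. This is an illustrative model, not Algorithm 2 itself: the names `TraceRow`, `find_one_step` and `search` are hypothetical, `tern` is a scalar emulation of the ternary instruction, and the fallback phase for the case where no vector is found is not modelled, so the loop simply stops there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One trace entry: value = tern(a, b, c, imm8).
struct TraceRow {
    std::uint32_t value, a, b, c;
    std::uint8_t imm8;
};

// Word-wide model of a 3-input LUT (index = 4a + 2b + c per bit position).
std::uint32_t tern(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                   std::uint8_t imm8) {
    std::uint32_t r = 0;
    for (unsigned idx = 0; idx < 8; ++idx) {
        if (!((imm8 >> idx) & 1u)) continue;
        r |= ((idx & 4u) ? a : ~a) & ((idx & 2u) ? b : ~b) & ((idx & 1u) ? c : ~c);
    }
    return r;
}

// Try every tern(a, b, c, imm8) with a, b, c in base and imm8 = 1..254;
// report the first result that equals one of the wanted vectors.
bool find_one_step(const std::vector<std::uint32_t>& base,
                   const std::vector<std::uint32_t>& targets, TraceRow& hit) {
    for (std::uint32_t a : base)
        for (std::uint32_t b : base)
            for (std::uint32_t c : base)
                for (unsigned imm = 1; imm <= 254; ++imm) {
                    const std::uint8_t imm8 = static_cast<std::uint8_t>(imm);
                    const std::uint32_t v = tern(a, b, c, imm8);
                    if (std::find(targets.begin(), targets.end(), v) != targets.end()) {
                        hit = TraceRow{v, a, b, c, imm8};
                        return true;
                    }
                }
    return false;
}

// Repeat: on a hit, extend the base, record the trace row, restart the search.
std::vector<TraceRow> search(std::vector<std::uint32_t> base,
                             std::vector<std::uint32_t> targets) {
    std::vector<TraceRow> trace;
    TraceRow hit{};
    while (!targets.empty() && find_one_step(base, targets, hit)) {
        trace.push_back(hit);
        base.push_back(hit.value);
        targets.erase(std::remove(targets.begin(), targets.end(), hit.value),
                      targets.end());
    }
    return trace;  // targets not reached in this phase are left unresolved
}
```

The restart matters: a vector that depends on four input variables cannot be produced from the initial base in one instruction, but becomes reachable once a suitable intermediate vector has been added to the base.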

V. PERFORMANCE EVALUATION OF S-BOX BITSLICED IMPLEMENTATIONS
In the case of a software bitsliced implementation, comparing only the number of logical instructions required for the S-Box representation does not answer the question of which implementation is faster and by how much, because it does not take into account the time spent on data transfer. Bitsliced implementations use many variables, far more than the number of general-purpose registers (16 in x86-64 processors) or SIMD registers (16/32 in x86-64 processors). For example, Sbox0 in the UNIVERSAL representation requires 229 intermediate variables ti to compute the y_val vectors. These variables are therefore stored on the stack, and the compiler generates code to load them from the stack into SIMD registers before an intrinsic function is called and to store them back afterwards. In superscalar processors, these intermediate operations will occur largely in parallel with the execution of logic instructions, but this is almost impossible to predict. Therefore, real-world measurements should be performed to evaluate the impact of data transfer operations on performance.
All S-Box implementations and the code to evaluate their performance were compiled using the compilers from Microsoft Visual Studio 2019 (hereinafter MSVC) and mingw-w64 (hereinafter GCC). Compilation was performed with the maximum available optimization, using the parameters /O2 (Maximize Speed) for MSVC and -O3 (Optimize fully for speed) for GCC. The unit of performance measurement is the number of cycles needed to process one byte, cycles per byte (hereinafter cpb).
The number of cycles was measured using the rdtsc instruction to read the Time Stamp Counter (TSC) according to the method described in [24]. To reduce the impact of the processor context switching, multiple S-Box calculations were performed and the minimum value was taken as the final measurement result.
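The "repeat and take the minimum" scheme can be sketched as below. This is not the measurement code of this work: as a portable stand-in for reading the TSC with rdtsc, the sketch uses `std::chrono::steady_clock`, so it reports nanoseconds rather than cycles, and the names `min_time_ns` and `per_byte` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>

// Run `run` `repeats` times and return the smallest elapsed time in ns.
// Taking the minimum filters out runs disturbed by context switches.
template <class F>
std::int64_t min_time_ns(F&& run, unsigned repeats) {
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (unsigned r = 0; r < repeats; ++r) {
        const clock::time_point t0 = clock::now();
        run();
        const clock::time_point t1 = clock::now();
        const std::int64_t dt =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (dt < best) best = dt;
    }
    return best;
}

// Cost per processed byte; with a cycle counter as the time source
// this would be the cpb figure.
double per_byte(std::int64_t ticks, std::size_t bytes) {
    return static_cast<double>(ticks) / static_cast<double>(bytes);
}
```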
The speed measurement results are shown in Tables X-XII. As can be seen from the results presented, the bitsliced computation of a non-algebraic S-Box is approximately 4 times (SSE-128/AVX-256) to 3 times (AVX-512) slower than the bitsliced computation of the AES S-Box.
The TERNARY implementations, although they use almost half as many logical instructions (≈210) as the UNIVERSAL (≈400) and EXTENDED (≈380) representations, are only slightly ahead of them. This can probably be explained by the fact that a ternary instruction takes three source operands and an immediate instead of two source operands, and therefore requires more loads from the stack and creates more data dependencies.

VI. BYTE-SLICED IMPLEMENTATION
It is believed that bitslicing is capable of providing the fastest software implementation of most symmetric block ciphers on x86 CPUs. This applies, for example, to the AES cipher without the use of AES-NI [3]. However, with the advent of the AVX-512VBMI (Vector Byte Manipulation Instructions) extension of x86-64 processors, the byte substitution operation can be performed efficiently and in constant time for 64 bytes held in zmm registers, using the vpermi2b permutation instruction (Table XIII). This may make the bitslicing approach for symmetric block ciphers less efficient on such processors.
The vpermi2b instruction selects bytes from registers a and b based on the indices in the idx register. The seventh bit of each index (bit 6) specifies the register from which the byte is taken (0: a, 1: b), and bits 5-0 (the six lower bits) specify the position of that byte inside the register. If the corresponding bit of the mask k equals zero, 0 is written to the result register. This means that the vpermi2b instruction performs a lookup within a 128-byte table (bytes are selected from two 64-byte zmm registers). The pseudocode of this instruction [21]: […]
Suppose the substitution operation is to be performed on the zmm register in: out = SBox[in]. Prior to encryption, the 256-byte S-Box is placed sequentially in the four 64-byte zmm registers sb0-sb3. Since there is no access to memory or cache during the encryption process, and the S-Box is permanently stored in the zmm registers, this guarantees protection against timing and cache attacks. First, _mm512_cmplt_epu8_mask(in, step) produces a 64-bit mask m that marks the bytes of in with a value lower than 128 (the step register contains the value 128). These bytes are then replaced with the values from the sb0-sb1 registers, and the mask m is inverted (that is, it now points to the elements with value ≥ 128). Finally, the remaining bytes of in are replaced by the values from the registers sb2-sb3 (Fig. 3).
Only 5 instructions are needed to substitute the 64 bytes of the zmm register in: two vpermi2b SIMD instructions, one SIMD OR instruction, one SIMD comparison instruction and one 64-bit NOT instruction (step and sb0-sb3 are loaded only once, before the encryption starts). The number of instructions does not depend on the S-Box type (algebraic or non-algebraic), and no prior minimization of the table is required. Moreover, unlike bitslicing, there is no need either to compute intermediate variables t or to pass them between the stack and SIMD registers; all operations can be performed using zmm registers only.
The best software bitsliced AES S-Box computation requires 128 SIMD instructions (excluding data transfer instructions; see Table I and [12]) to substitute 64 × 8 bytes (there are eight 64-byte AVX-512 input registers x0-x7), which is equivalent to 16 instructions per 64 bytes, about three times more than with vpermi2b.
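The data flow of this substitution can be followed in the scalar model below. It is an emulation for illustration only, with hypothetical names: a 64-byte array stands in for a zmm register, `permute2_maskz` models the zero-masked two-table byte permute (on real hardware the presumed counterpart is the intrinsic `_mm512_maskz_permutex2var_epi8`), and, unlike the instruction, the scalar loop indexes memory with secret data and is therefore not itself constant-time.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Reg = std::array<std::uint8_t, 64>;   // stand-in for one zmm register

// Model of the zero-masked two-table byte permute: index bit 6 picks the
// table (0: a, 1: b), bits 5..0 pick the byte; lanes with mask bit 0 give 0.
Reg permute2_maskz(std::uint64_t k, const Reg& a, const Reg& idx, const Reg& b) {
    Reg r{};
    for (unsigned i = 0; i < 64; ++i) {
        if (!((k >> i) & 1u)) continue;
        const unsigned sel = idx[i] & 0x7Fu;
        r[i] = (sel & 0x40u) ? b[sel & 0x3Fu] : a[sel];
    }
    return r;
}

// Model of the unsigned byte comparison: bit i is set when in[i] < step.
std::uint64_t cmplt_mask(const Reg& in, std::uint8_t step) {
    std::uint64_t m = 0;
    for (unsigned i = 0; i < 64; ++i)
        if (in[i] < step) m |= std::uint64_t{1} << i;
    return m;
}

// Place the 256-byte table sequentially into four 64-byte registers.
std::array<Reg, 4> load_sbox(const std::array<std::uint8_t, 256>& table) {
    std::array<Reg, 4> sb{};
    for (unsigned x = 0; x < 256; ++x) sb[x / 64][x % 64] = table[x];
    return sb;
}

// out = SBox[in]: compare, two permutes, one mask inversion, one OR.
Reg sub_bytes(const Reg& in, const std::array<Reg, 4>& sb) {
    const std::uint64_t m = cmplt_mask(in, 128);
    const Reg lo = permute2_maskz(m, sb[0], in, sb[1]);    // bytes < 128
    const Reg hi = permute2_maskz(~m, sb[2], in, sb[3]);   // bytes >= 128
    Reg out{};
    for (unsigned i = 0; i < 64; ++i)
        out[i] = static_cast<std::uint8_t>(lo[i] | hi[i]);
    return out;
}
```

No index adjustment is needed for the second permute, because only the seven lower bits of each index are used: a byte value of 128 + n addresses entry n of the sb2-sb3 pair.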
However, the AVX-512VBMI extension is still rare and not supported by all CPUs. Currently, support for AVX-512VBMI has been announced on processors with Intel Cannon Lake (2018), Ice Lake (2019) and Tiger Lake (2020) microarchitecture.

VII. CONCLUSION
The article presents SIMD-oriented software implementations of non-algebraic 8×8 S-Boxes, together with an estimation and comparison of their performance with that of the AES S-Box on x86-64 processors with SSE-128/AVX-256/AVX-512 extensions. Heuristic minimization algorithms using three logical function bases have been proposed:
1) UNIVERSAL {AND, OR, XOR, NOT}, enabling both hardware and software implementation for any processor,
2) EXTENDED {AND, OR, XOR, NOT, AND-NOT}, enabling software implementation on any x86-64 or x86-32 processor,
3) TERNARY, enabling software implementation only on x86-64 processors with AVX-512 support.
The performance of bitsliced implementations can potentially be increased further by additional minimization of the description of the nonlinear substitution tables, for example, by switching from 32-bit to 64-bit vectors within the heuristic minimization algorithm proposed in this paper.