Optimized Implementation of SM4 on AVR Microcontrollers, RISC-V Processors, and ARM Processors

. The SM4 block cipher is a Chinese domestic crpytographic that was introduced in 2003. Since the algorithm was developed for the use in wireless sensor networks, it is mandated in the Chinese National Standard for Wireless LAN WAPI (Wired Authentication and Privacy Infrastructure). The SM4 block cipher uses a 128-bit block size and a 32-bit round key. This consists of 32 rounds and one reverse translation R . In this paper, we present the optimized implementation of the SM4 block cipher on 8-bit AVR microcontrollers, which are widely used in wireless sensor devices, the optimized implementation of the SM4 block cipher on 32-bit RISC-V processors, which are open-source based computer architectures, and the optimized implementation of SM4 on 64-bit ARM processors with the parallel computation, which are widely used in smart-phone and tablet. In the AVR microcontroller, it is implemented in three versions, including speed-optimization, memory-optimization, and code-optimization. As a result, speed-optimization, memory-optimization, and code-optimization achieved 205.2 cycles per byte, 213.3 cycles per byte and 207.4 cycles per byte, respectively. This is faster than the reference implementation written in C (1670.7 cycles per byte). The implementation on 32-bit RISC-V processors 128.8 cycles per byte. This is faster than the reference C code implementation (345.7 cycles per byte). The implementation on 64-bit ARM processors is 8.62 cycles per byte. This is faster than the reference C code implementation (120.07 cycles per byte).


Introduction
A number of sensor nodes are used to collect the data in wireless sensor networks.Tiny sensor nodes have limited computation resources, such as computing power, memory size, and battery life.Since cryptographic algorithms are based on complicated operations, it is difficult to achieve the high availability for wireless sensor networks in secure packets.To resolve this problem, lightweight block cipher algorithms have been proposed.Lightweight cryptography algorithms require low resources than ordinary cryptographic algorithms.The SM4 block cipher is one of the lightweight block cipher, which is Chinese National Standard for wireless LAN WAPI (Wired Authentication and Privacy Infrastructure).[1] In this paper, we propose optimized implementations of the SM4 block cipher on low-end 8-bit AVR microcontrollers, 32-bit RISC-V processors and high-end 64-bit ARM processors.Main contributions are as follow: 1.1 Contributions -Optimized implementations of the SM4 block cipher on 8-bit AVR microcontrollers.We implemented the SM4 block cipher on low-end AVR microcontrollers.SM4 block cipher requires the 128-bit block size, while AVR microcontrollers only support 8-bit wise general purpose registers.Therefore, the efficient register allocation should be considered.We proposed the optimal register allocation.Furthermore, the SM4 block cipher requires the 32-bit wise rotation operation, while 8-bit wise operations are performed on AVR microcontrollers.We suggested the optimized implementation of 32-bit wise rotation on 8-bit development environments.-Optimized implementations of the SM4 block cipher on 32-bit RISC-V Processors.RISC-V is an open-source based computer architecture that supports new instruction sets for operations.This paper presents the first optimized implementation of SM4 on 32-bit RISC-V processors.In particular, we optimized S-Box operations with RISC-V instructions.-Parallel implementations of the SM4 block cipher on 64-bit ARM Processors.64-bit ARM processors support SIMD (Single Instruction Multiple Data) features, which can process the data in a parallel way.We propose the parallel implementation of the SM4 block cipher in 12-way approaches.This implementation encrypts 12 plaintext blocks at once through SIMD instructions.For the optimal implementation, we introduce the vector register allocation plan with arrangement and efficient instructions for the optimized parallel implementation.

SM4 Block Cipher
The SM4 block cipher is one of a China domestic crpytographic algorithm, which was first published in 2003.It was a cryptographic standard issued by the OS-CCA (Office of State Commercial Crpytography Administration) [2].Table 1 shows the list of SM4 parameters.The left of Figure 1 describes encryption tasks for the SM4 block cipher.The SM4 block cipher consists of 4 computations; Round function (F), Permutations (T and T'), Nonlinear transformation (tau), Linear transformations (L) and (L'), and S-box (S).

Round function (F).
The plaintext of the SM4 block cipher is divided in four 32-bit units, called X. Round function (F) requires 5 arguments, which are X 0 , X 1 , X 2 , and X 3 , and round key.F can be defined as the following equation.
The right of Figure 1 represents the Round function F structure.
Permutations T and T'.T is the permutation function that requires 32-bit input values, and makes 32-bit outputs.It has the reversible feature.Permutations T and T' consists of tau and L.
Nonlinear transformation tau.The nonlinear transformation (tau) uses 4 S-boxes, which needs 32-bit inputs and returns 32-bit outputs.It is performed in a parallel way.Each input value does not affect each other.Nonlinear transformation tau can be represented as follow, where A and B are 32-bit input value and 32-bit output value, respectively.The type of a i and b i is a 8-bit wise string.
A = (a 0 , a 1 , a 2 , a 3 ); tau(A) = (S(a 0 ), S(a 1 ), S(a 2 ), S(a 3 ));  [3].We evaluated the performance on ATmega128.This is the one of 8-bit AVR microcontroller family.It has 128KB of programmable flash memory, 4KB internal SRAM, 4KB EEPROM, and 64KB optional external memory space [4].AVR registers are denoted as R0 to R31.Some registers have special features as follows: -ZERO register: R1 is the zero register that always represents 0 value.However, it can be used freely for general purposes.This R1 register should be zeroed at the end of the operation.-Callee saved registers: R2-R17 and R28-R29 are callee saved registers (i.e.non-volatile registers).These registers saved important values (i.e.long-lived values and data from callee).These must be preserved in the stack before it is used.-Pointer address registers: R26-R31 can be used as a pointer address by combining two registers.When these are used for the pointer address, these are written as X (R26-R27), Y (R28-R29), Z (R30-R31) notation.R28-R29 are also callee saved registers.
RISC-V is a new computer CPU structure under development at UC Berkeley since 2010.It is not just for academic or research purposes, but for commercialization in the industrial world.The main feature of the RISC-V processor is that the basic instruction set is provided by the consortium, but there are no restrictions on the extended instructions that users can add.Therefore, when utilizing this, it is possible to increase the speed of the target application service by customizing the RISC-V processor.In this paper, the 32-bit structure RV32I used for performance comparison provides 32-bit registers 32 (x0-x31) [5].

Target Processor: 64-bit High-end ARM Processors.
ARMv8-A is the next generation ARM architecture of ARMv7, simply called ARMv8.It has two architectures, which are 32-bit AArch32 and 64-bit AArch64.
In this paper, we targeted the AArch64 architecture, in short A64.A64 has 32 64-bit general purpose scalar registers that can handle 32-bit, and 64-bit data.In addition, there are 32 128-bit vector registers, it can be utilized for the parallel implementation with SIMD [6].We used vector registers to implement the SM4 block cipher in a parallel way.

Related works.
In this section, we introduce optimized implementations of block ciphers on embedded processors.In [8], the revised version of CHAM was optimized on 8-bit AVR microcontrollers.In [8], they suggested optimized 8-bit wise rotation and 32-bit wise rotation.This implementation utilized the pre-calculation technique with the counter mode of operation.In [7], parallel implementations are presented.In [9], the optimized ARIA block cipher was presented.They optimized primitive operations, including rotation operation, a subsitutelayer, and a diffusion-layer on the low-end AVR microcontroller.In [10], they proposed the compact implementation of PRESENT block cipher, which is introduced in CHES'07 [11].It optimally implemented the PRESENT through pre-computation technique.In [12], the compact implementation of AES (Advanced Encryption Standard) block cipher on Intel processors was presented (i.e.FACE).This implementation applied pre-computation technique that precalculate repetitive values, and reused them.In ICISC'19, they proposed optimized implementation of FACE on the AVR microcontroller was presented [13].
It extended the pre-computation to the round 3. The implementation is also secure against CPA (Correlation Power Analysis).

Optimized Implementation of the SM4 Block Cipher
In this Section, we introduce the optimized implementation of the SM4 block cipher on 8-bit AVR microcontrollers, 32-bit RISC-V processors, and 64-bit ARM processors.The optimal performance is achieved through efficient register allocation and instruction techniques.

8-bit Low-end AVR Microcontrollers
Instruction set.AVR microcontrollers have useful instruction sets.Generally instructions take 1 or 2 clock cycles.Instructions used to implement the optimized SM4 block cipher are summarized in Table 2 [7].
Register utilization.For the optimized implementation, we efficiently allocated registers.Detailed descriptions are as follows: -X blocks.In Section 2.1, the SM4 block cipher stores 128-bit plaintext into 4 32-bit X.However, 8-bit AVR microcontrollers have 8-bit wise registers that can only represent the 8-bit data.Four registers are required to handle one Figure 2 shows the whole register allocation.Each rectangle represents 8-bit register.Two-colored registers are used for multiple purposes.

X blocks Address pointer Round key
Temporary values ZERO Fig. 2. Register allocation for the SM4.Optimized implementation of 32-bit wise rotation.The rotation of the SM4 block cipher is 32-bit wise operation, but AVR microcontrollers perform only 8-bit wise.32-bit wise rotation can be implemented with following instructions; LSL, ROL, ADC, MOV, and MOVW.Each rotation can be implemented following Table 3.When input and output values of the 8 or 16 rotation operation are in the same register, it needs one temporary register.This takes more clock cycles.We separated input and output registers.This makes results of rotation in different registers.This implementation eliminates the temporary register, and takes less clock cycles to transfer values to the temporary register than the previous method.
Efficient S-Box implementation.In this paper, there are three optimization perspectives on AVR microcontrollers (speed-optimization, memory-optimization, and code-optimization).In terms of speed-optimization, storing the S-box in RAM is effective.The LD instruction loads the S-Box value with 2 clock cycles.This can get the S-Box value, quickly.On the other hand, for the memoryoptimization perspective, S-Box can be saved to flash memory.In Section 2.2, it was confirmed that the AVR microcontroller has larger flash memory than RAM.Therefore, the memory-optimized implementation can be useful in situations where the lack of RAM.The memory-optimization can be implemented with the LP M instruction, which takes 3 clock cycles.As a result, the memoryoptimization takes a longer executing time than the speed-optimization.For the case of code-optimization, we utilized the looped implementation, which sacrificed the performance but achieved the optimal code size.3.2 32-bit RISC-V Processors 32-bit RISC-V processor supports 32-bit wise instructions.This is useful to perform 32-bit wise operations of SM4.For the optimal implementation in RISC-V, we propose Rotation optimization and efficient S-Box implementation.
Rotation Optimized Implementation.The rotation operation is not supported in RISC-V.Therefore, the rotation operation is implemented using the SLLI, SRLI, and OR instructions.ROL(n) can be implemented by OR the value of SLLI(n) and SRLI(32 − n).
Efficient S-Box implementation.RISC-V is using 32-bit registers.However, in S-Box, it is converted to a pre-computed value in bytes.Therefore, it is necessary to convert a 32-bit value by dividing it 8-bit units.For the implementation, SP (Stack Pointer) and LBU (Load Unsigned Byte) are used.SP has the address of the current stack, and LBU loads only the 1-byte value of the address pointed to.The S-Box process is the same as Algorithm 1.In Algorithm 1, the result value of S-Box is stored in T1, and A2 has the address of S-box.

64-bit high-end ARM Processors
On the 64-bit ARMv8 processor, the efficient implementation is possible by using vector registers.When implemented in a parallel-way, 12 plaintexts can be encrypted at once.Since ARMv8 has 32 vector registers, we utilized these registers in an optimal way.First, vector registers (v0-v11) are storing plaintext.Second, vector registers (v12-v15) have intermediate values, and the v15 register is also used for saving the round key value.Third, v16-v31 registers used for the S-Box look-up table.The SM4 encryption is performed on ARM processors as following order; Loading phase, Register transpose step, Round function layer, and Storing phase.Instructions summary.Table 4 shows instructions for implementing the SM4 block cipher, in a parallel-way.Most of instructions are vector instructions, except the ADR instruction.The ADR instruction is used to the store address of The UZP2 instruction does same operation, but read an odd numbered elements.In this process, registers are grouped by four and 32-bit blocks are arranged to be stored in one register.In total, 3 iterations are repeated to align 12 plaintexts.At the end of encryption, the transpose step is performed once again, to retrieve vector registers.Figure 3 shows the operation process of UZP1 and UZP2 instructions.
Algorithm 3 Alignment of the plaintext in vector instructions.Step #1 Step #2 Fig. 3. UZP1 and UZP2 instructions process for SM4.In Algorithm 4, line 9-20 shows the source code that implements linear transformations (L) of the round function.The rotation operation is implemented using the left shift operations SHL and SRI instructions.Using only three registers (v12, V13, v14,), v15 is used as a temporary register to store the round key value.In order to use only 3 registers, the rotation operation is performed and then XOR is performed, immediately.

Round function
Algorithm 4 Round Function of the plaintext in vector instruction.Storing phase.In the last storing phase, the encryption result is saved.Algorithm 5 is to perform an operation that stores the ciphertext in the memory.The result value (v0-v11) is stored in the memory address (x0) by 512-bits in a post incremental method, and 12 ciphertexts are stored by performing a total of 3 operations.

Evaluation
In this Section, we present the evaluation of proposed implementations.The evaluation is conducted separately for each implementation environment.The performance evaluation is based on clock cycles per byte (cpb).The reason for result is that the proposed implementation is implemented in an optimal form using an AVR assembly.In particular, the efficient rotation is used in the Linear transformation (L), it makes better performance than the reference C code implementation.In addition, it can be compare each criteria.Speed-optimization achieved best performance than the others, Memory-optimization requires the least RAM size, and Codeoptimization has the least ROM size.

Implementations of SM4 Block Cipher on 32-bit RISC-V Processors
This section analyzes and evaluates the performance of the SM4 encryption implementation on RISC-V.In this paper, Proceed performance measurements on RISC-V, optimization techniques were not applied.The RISC-V implementation does not use extensions and relies on the RV32I-based ISA.For the performance measurement, HiFive1 Rev B development board with 32-bit E31 RISC-V core was used.Results are shown in left part of Table 6.For the reference code, the execution timing is 345.7 cpb.The implementation achieved 128.8 cpb, showing a performance improvement by 2.68×.

Speed-optimization of SM4 Block Cipher on 64-bit ARM Processors
This section analyzes and evaluates the performance of the SM4 encryption implementation on ARMv8.It was written using Xcode and the calculation speed was measured by Apple A13 Bionic.The Apple A13 Bionic is a 64-bit ARMbased single chip (2.65 GHz) designed by Apple.The performance comparison is done with the reference code implemented in C language.Results are shown in right part of Table 6.For the reference code, the execution timing is 120.07 cpb.The proposed implementation achieved 8.62 cpb, showing a performance improvement by 12.93×.

Conclusion
In this paper, we present optimized implementations of the SM4 block cipher on AVR microcontrollers, RISC-V processors, and ARM processors.With optimized implementation techniques, the performance is significantly improved than previous approaches.We believe that this paper will be helpful to implement the SM4 block cipher in various environments, including both low-end and high-end Internet of Things.

Fig. 1 .
Fig. 1.Encryption flow of the SM4 block cipher and the round function structure.

Table 1 .
Parameters for the SM4 block cipher.

Table 2 .
4ummarized instruction set of AVR microcontrollers for optimized SM4 block cipher.Rd: Destination register, Rr: Source register.As a result, there are 4 X that quarters of plaintext.16registersare required to store the whole plaintext.-Roundkey,and T input/output.Each F requires a 32-bit round key.48-bit registers used to save the round key.The round key is used as the input value of T by performing the XOR operation with X blocks.Therefore, round key registers are also used to store parameters or results of T. -Nonlinear operation.The nonlinear transformation (tau) performs the rotation operation.8 registers are required for result and intermediate values of rotation.4 out of 8 registers store the tau output result.-Address pointer.In order to load the value into a register on AVR microcontrollers, it should be accessed through the address pointer.In this case, there are 3 kinds of values for the function call; Plaintext, Round key, and S-box values.We allocate X pointer for loading plaintext, and storing ciphertext, Y pointer for Round key, and Z pointer for S-box values.Especially, the X pointer address (R26 and R27) do not need to during round functions.These registers are used to store temporary values.The R30 register is always fixed to 0 value, because it stored the lower address of S-Box.This can be used to the temporary ZERO register.-Loop index.Using the CPI instruction, it is possible to compare a register value with a constant value.To implement the loop statement, it requires only one register to store loop index.This register is shared with the temporary value register.It needs to preserve an index value on the stack.It can be implement through PUSH and POP instructions.

Table 3 .
Optimized 32-bit wise rotation operation on 8-bit environments where i and j represent specific registers.
layer.Source codes for Round function layer are shown at line 1-8 of Algorithm 4, which operates the nonlinear transformation (tau).It is implemented by TBL and TBX instructions to seek the S-box table.TBL and TBX instructions read a value from a vector element in the index source register, search each result as an index in the byte table of the source table register, and write the result to the destination register.The first 64 bytes of S-Box is extracted through the TBL instruction.The TBX instruction searches the table in the next range of previous TBL instruction.To search for the next branch of S-Box, subtraction to the value of the index source register by 0x40 and then using the TBX instruction are performed, subsequently.

Table 6 .
Comparison result of execution timing (cycles per byte) on 32-bit RISC-V processors (left) and 64-bit ARM processors (right).Proposed implementations are targeted for the ATmega128 processor, which is one of AVR family.Source codes are implemented over Microchip studio framework, and compiled -O2 option.Since there are no other SM4 block cipher implementations on AVR microcontrollers.Performance comparisons are done with reference C code implementations.Comparison results are shown in Table5.Reference C code takes 1670.69cpb (clock cycles per byte), while the proposed speed-optimization implementation achieved 205.2 cpb, memoryoptimization implementation recorded 213.3 cpb, and code-optimization implementation reached 207.4 cpb.