Optimized Implementation of PIPO Block Cipher on 32-Bit ARM and RISC-V Processors

The lightweight block cipher PIPO-64/128 was presented at ICISC 2020. PIPO, which operates on 8-bit units and uses an unbalanced-bridge S-box, outperformed other lightweight block ciphers in an 8-bit AVR environment. Optimization methods for PIPO have since been proposed for various environments; however, no optimization research has been conducted for two popular 32-bit processors, ARM Cortex-M4 and RISC-V. Because the RISC-V and ARM Cortex-M series platforms do not support bit-based Single Instruction Multiple Data (SIMD) instructions, several aspects must be considered when applying a forced parallelization strategy. In this article, we discuss the implementation methodology of PIPO for 32-bit RISC-V and ARM Cortex-M4 environments. We optimize the S-Layer through the proposed register scheduling and masking technique while maintaining parallelism in the R-Layer implementation. Moreover, we propose an on-the-fly key scheduling technique for further performance improvement. Finally, compared with the existing reference implementations on the RISC-V and ARM Cortex-M4 platforms, our software achieves performance improvements of 229% and 370%, respectively, when 4 plaintexts are encrypted simultaneously.

ated the importance of data confidentiality, integrity, and terms of memory usage, speed, code size, and low-power consumption; recently, a Lightweight Cryptography (LWC) competition for lightweight Authenticated Encryption with Associated Data (AEAD) was held by the National Institute of Standards and Technology (NIST). Since the AEAD algorithms submitted to the LWC competition use lightweight cryptography as a core primitive, the performance of lightweight cryptography is one of the important evaluation criteria. However, since IoT devices are relatively vulnerable to side-channel attacks, whether side-channel countermeasures can be applied to the core primitive and how those countermeasures perform are also important considerations.

The associate editor coordinating the review of this manuscript and approving it for publication was Junggab Son.
in R-Layer with round keys held in general-purpose registers. Finally, our implementation is further optimized through handwritten assembly.

We compare our PIPO block cipher implementation in detail with various lightweight and general block ciphers in RISC-V and ARM Cortex-M4 environments. We evaluate the practical applicability of the PIPO block cipher through a detailed comparison based on RAM usage, code size, and Clock cycles Per Byte (CPB). Finally, we show that our optimized PIPO software is sufficiently competitive.

In this article, we expand on our previous work published in WISA'21 [10]. In WISA'21, page limitations made it difficult to describe our optimization technique in detail; therefore, in this article, we describe the optimization methods in detail and additionally present implementation techniques for the ARM Cortex-M4 device to demonstrate the extensibility of our optimization methodology.

C. OUTLINE
The rest of this article is structured as follows: Section II introduces the PIPO block cipher and our target platforms. Section III reviews implementations of cryptographic algorithms on RISC-V and the ARM Cortex-M series. Section IV introduces our main idea for implementing PIPO. Results of our implementations are presented in Section V, before we conclude the article in Section VI.

In this section, we describe in detail the elements essential to the implementation of PIPO [18]. Therefore, when existing lightweight ciphers were designed, side-channel attacks were not a major consideration; however, the importance of mounting countermeasures against side-channel attacks in the IoT environment is increasing as a result of various studies on side-channel attacks. The PIPO block cipher presented in ICISC'20 is a lightweight cipher that is friendly to SW/HW implementation and to countermeasures against side-channel attacks [1]. The S-Layer is built on an Unbalanced-Bridge structure that uses few bit operations, which has the advantage of requiring relatively few rounds compared to other lightweight block ciphers. In ICISC'20, the performance evaluation was conducted in an 8-bit AVR environment, which is known as the most constrained embedded device, and PIPO proved its superiority there. However, since the actual performance of a cryptographic algorithm differs depending on the target device, performance on the most widely used 32-bit devices should also be evaluated.

The overall structure of the PIPO block cipher is shown in

The ARM Cortex-M family is the most popular 32-bit platform. So far, various benchmarking and optimization studies of cryptographic algorithms have been performed on the Cortex-M series. In particular, Cortex-M4 is a target device for the performance evaluation of algorithms submitted to the Lightweight Cryptography (LWC) and Post-Quantum Cryptography (PQC) competitions held by NIST [7], [26]. The target device of this article, the Cortex-M4, has 16 32-bit general-purpose registers. Of these, 14 can be used in an actual implementation; the remaining two are the program counter and the stack pointer. On the Cortex-M4, bit-wise and arithmetic instructions require a single cycle, but memory access instructions require 2 cycles. As on RISC-V, not every 32-bit value can be used for address reference, but compared to RISC-V, flexible indexing is possible when registers are used as addresses. SIMD instructions are available for specific 8/16-bit units, but they are not considered for block ciphers and hash functions, in which bit-wise shift instructions are mainly used. Similarly, the optional flag instructions are not used in this article. The most distinctive feature of the ARM processor is the barrel shifter, which can apply bit-wise shift and rotate operations to an operand of almost any instruction at no additional cost.
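As a rough illustration of why the barrel shifter matters, the following C sketch XORs a word with a rotated copy of another word. The helper names `ror32` and `xor_rotated` are hypothetical and are not taken from the implementation described in this article. On Cortex-M4, a compiler can typically fold the rotation into the second operand of a single EOR instruction, whereas the base RV32I instruction set has no rotate instruction and needs separate shift, OR, and XOR instructions for the same expression.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits; n must be in the range 0..31. */
uint32_t ror32(uint32_t v, unsigned n)
{
    assert(n < 32u);
    return n ? (v >> n) | (v << (32u - n)) : v;
}

/* x XOR (y rotated right by 8 bits).
 * Cortex-M4: expressible as one data-processing instruction with a
 * rotated second operand. Base RV32I: two shifts, an OR, and an XOR. */
uint32_t xor_rotated(uint32_t x, uint32_t y)
{
    return x ^ ror32(y, 8u);
}
```

The rotation amount of 8 is arbitrary; the point is only that the rotate itself adds no instruction on Cortex-M4.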
Since the S-Layer of PIPO is designed to be bitslice-friendly, PIPO is sliced differently from general block ciphers such as AES [6]; it is therefore desirable not to apply the ARM-specific instruction scheduler presented in [27]. Therefore, it is necessary to devise a strategy to use as many

Unfortunately, the performance benefits of the PIPO block cipher seen in Table 1 are limited to 8-bit AVR environments.

Studies of block cipher implementations on RISC-V have mainly moved toward designing extended instruction sets in hardware. Besides the various extension sets introduced in Section II-B, [31] designed extension sets for the ARIA block cipher and [32] proposed a cryptographic extension set for LWC. However, since these studies address the finite-field operations of ARIA and instruction sets for the 4 × 4 S-boxes of LWC, they have little relevance to the PIPO block cipher. Research on SW implementations of block ciphers using the base ISA has also been actively conducted. In 2019, when RISC-V began to become popular, benchmark studies of AES, ChaCha20, and Keccak were conducted using RV32I [33]. The comparison with Cortex-M4 implementations showed that, although RISC-V has about twice as many general-purpose registers as Cortex-M4, the barrel shifter contributes more to implementation speed for block ciphers and hash functions; the study therefore showed that RISC-V has difficulty catching up with Cortex-M4 when only RV32I is used. In [34], an implementation study was conducted on the algorithms of the LWC competition using RV32I. That research mainly exploited the characteristics of RV32I and its assembly rather than those of the algorithms themselves. The elimination of branch instructions, loop unrolling, and interleaving to prevent pipeline stalls were combined to suit the RV32I model.
The work that focuses on properties of the block cipher itself is the Fixslicing work done on RISC-V and Cortex-M4. The authors applied Fixslicing to GIFT and SKINNY [7] and later extended it to AES [6]. The implication of these studies is that, when only RV32I is used without a barrel shifter, the optimization strategy for a block cipher should reduce memory accesses by using as many general-purpose registers as possible.

[37]. Cortex-M4 has a core very similar to that of Cortex-M3, but it supports Floating-Point Unit (FPU) instructions and single-cycle multiplication instructions, making it particularly efficient for implementing public-key cryptosystems [38]. However, as in the LWC competition held by NIST, SW/HW implementation research for the performance evaluation and optimization of block ciphers on Cortex-M4 is also being actively conducted. In this article, we cannot cover all of the vast body of existing research; therefore, we examine a few core SW implementations on Cortex-M4. While RISC-V has seen little progress in toolchain and compiler research, tools for instruction scheduling and register allocation for the Cortex-M series were proposed in [27]. These tools performed better than gcc and clang, and their efficiency was demonstrated through an implementation of AES. In 2020, an optimization study of the HIGHT block cipher was conducted in [5]. Since HIGHT uses 8-bit words like the PIPO block cipher, 4 words can be processed simultaneously in a 32-bit register; however, because HIGHT is an Add-Rotate-XOR (ARX)-based block cipher, it differs from PIPO in that SIMD instructions can be applied. In [8], an AES implementation study was conducted. It implemented AES and CounTeR (CTR) Mode by

cost occurs on the S-Layer. Another method is to not follow bitslicing and instead use an S-box table.
To use an S-box table, the bits of a plaintext stored in the forward direction must first be rearranged with the SWAPMOVE technique so that the table can be accessed in memory [39]; however, since the R-Layer performs rotate-shift operations on the forward array, the array must be rearranged back to the forward direction after the S-box lookup. Therefore, in the reference code, the forward bitslicing implementation is much faster than the S-box method, which incurs the cost of two SWAPMOVE operations in every round of the PIPO block cipher.

We examined whether the Fixslicing technique introduced in [7] can be applied to the PIPO block cipher. In an S-box-based implementation, if the bits are rearranged after the memory reference to the positions of the next S-box bits instead of the forward direction, and if this uses fewer instructions per round than the bitslicing method, the approach can be effective enough. In other words, our goal was to eliminate the SWAPMOVE used in every round of the PIPO block cipher. However, the R-Layer consists of bit rotate-shift operations, and, unlike in the AES and GIFT block ciphers, there is no regular relation between the unit blocks in the PIPO block cipher. In addition, when 4 plaintexts are processed in a 32-bit environment, the size of the S-box also grows by multiples, so this method is inefficient. After all, the S-Layer of the PIPO block cipher is based on the Unbalanced-Bridge structure, unlike GIFT and bitsliced AES, which require a realignment in the initial round; therefore, it is difficult to apply the Fixslicing technique.

Therefore, we choose the forward bitslicing implementation and devise a strategy for the forced parallel implementation. Figure 3 shows the register scheduling for a 32-bit platform. In pt_i^j, j is the index of the plaintext and i is the index of the bit of the j-th plaintext. For the bitslicing-based S-Layer, the four plaintexts are aligned in 8-bit units in each register.

for RISC-V.
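The alignment of four plaintexts in 8-bit units per register can be sketched in C as follows. This is a minimal illustration, not the layout of Figure 3: the function names, the contiguous 32-byte input buffer, and the lane order (plaintext j in bits 8j to 8j+7 of each register) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PIPO_ROWS  8  /* a 64-bit PIPO block viewed as eight 8-bit rows */
#define PIPO_LANES 4  /* plaintexts processed in parallel */

/* Pack four 8-byte blocks so that reg[i] holds row i of every block.
 * Block j occupies bits 8*j .. 8*j+7 of each register (assumed lane order). */
void pack_blocks(uint32_t reg[PIPO_ROWS],
                 const uint8_t pt[PIPO_ROWS * PIPO_LANES])
{
    assert(reg && pt);
    for (int i = 0; i < PIPO_ROWS; i++) {
        uint32_t w = 0;
        for (int j = 0; j < PIPO_LANES; j++)
            w |= (uint32_t)pt[PIPO_ROWS * j + i] << (8 * j);
        reg[i] = w;
    }
}

/* Inverse of pack_blocks: write the four blocks back as 32 contiguous bytes. */
void unpack_blocks(uint8_t out[PIPO_ROWS * PIPO_LANES],
                   const uint32_t reg[PIPO_ROWS])
{
    assert(out && reg);
    for (int i = 0; i < PIPO_ROWS; i++)
        for (int j = 0; j < PIPO_LANES; j++)
            out[PIPO_ROWS * j + i] = (uint8_t)(reg[i] >> (8 * j));
}
```

With this layout, every bit-wise AND, OR, or XOR applied to the eight registers evaluates the bitsliced S-Layer for all four plaintexts at once, which is the benefit the forced parallel strategy aims for.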
We observe that the XOR of the two 32-bit values in each mask pair required for masking is 0xFFFFFFFF. For example, XORing 0x01010101 with 0xFFFFFFFF produces 0xFEFEFEFE, which is the mask paired with 0x01010101. The value 0xFFFFFFFF needs to be loaded only once, at the beginning of the R-Layer. With this technique, masking costs 18 (2(LW) × 1(masking) × 7(PT) + 2(LA) + 2(LW)) clock cycles per round, and thus 234 (18 × 13) clock cycles over the 13 rounds of PIPO-64/128. This method therefore achieves bit-masking more efficiently than the alternative method that uses LI instructions.
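The complementary mask pair can be illustrated with a short C sketch that rotates each 8-bit lane of a packed 32-bit word. The function name `rotl8_lanes` and the left rotation direction are assumptions for the example, and the first mask is computed arithmetically here for brevity, whereas the assembly described above loads it from memory. The second mask is derived with a single XOR against 0xFFFFFFFF, as in the text, and the enum simply restates the cycle arithmetic given above.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate each of the four 8-bit lanes of x left by r bits (1 <= r <= 7).
 * The two masks are complements, so the second one is derived from the
 * first with a single XOR against 0xFFFFFFFF. */
uint32_t rotl8_lanes(uint32_t x, unsigned r)
{
    assert(r >= 1u && r <= 7u);
    const uint32_t all_ones = 0xFFFFFFFFu;
    uint32_t wrap_mask = 0x01010101u * ((1u << r) - 1u); /* low r bits of each lane */
    uint32_t keep_mask = wrap_mask ^ all_ones;           /* 0xFEFEFEFE for r = 1 */
    return ((x << r) & keep_mask) | ((x >> (8u - r)) & wrap_mask);
}

/* Masking cost from the text: 2(LW) x 1(masking) x 7(PT) + 2(LA) + 2(LW). */
enum {
    MASK_CYCLES_PER_ROUND = 2 * 1 * 7 + 2 + 2,
    PIPO_64_128_ROUNDS    = 13,
    MASK_CYCLES_TOTAL     = MASK_CYCLES_PER_ROUND * PIPO_64_128_ROUNDS
};
```

The masks are what keep bits from leaking between neighboring plaintexts: a plain 32-bit shift would move the top bits of one lane into the next, and the mask pair removes exactly those bits before the two halves are combined.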

In the case of Cortex-M4, a powerful feature called the barrel shifter exists; unfortunately, the RISC-V platform has fewer instructions that are effective for block cipher implementation than the Cortex-M4 platform. However, as introduced in Section II-B, a RISC-V device has twice as many general-purpose registers as Cortex-M4; therefore, we focused on how to make the most of this characteristic. In this section, we propose a methodology that reduces the memory access cost by holding the master key in registers in RISC-V environments. The detailed register scheduling for the RISC-V platform is shown in Figure 4. The shaded areas of Figure 4 are registers that are not used by the actual implementation. Our implementation divides the 128-bit master key and stores it in 4 of the 32 general-purpose registers of RISC-V (x14 to x17). If the PIPO block cipher uses a 256-bit security level, we can still store the entire 256-bit master key by additionally using the x2, x7, x12, and x13 registers. Since the master key is held in registers, there is no need to load it in every round. The only cost in the AddRoundKey process is unpacking and expanding the master key for forced parallelism. Of course, AddRoundKey could also be optimized to a simple load by computing all round keys in advance, but since each round key would need to be extended to 32 bits and stored, an additional memory space

made every effort to ensure that performance measurements are performed as fairly and accurately as possible. Our implementation is available in two versions. The first implementation (marked by †) encrypts one plaintext; it ports the reference implementation methodology to handwritten assembly suitable for the RISC-V and Cortex-M4 environments.
The second implementation (marked by ‡) applies the methodology we propose; it encrypts 4 plaintexts together, in parallel, using assembly instructions. We evaluate our performance based on the performance improvement of the second software. The reference implementation of the PIPO block cipher presented in [1] was measured by compiling it for Cortex-M4 and RISC-V, respectively.

Our RISC-V target platform is the HiFive1 Rev B board, which contains SiFive's 32-bit RISC-V processor with 16 kB of SRAM, 32 MB of flash, and a clock frequency of 320 MHz. The integrated development environment is SiFive's FreedomStudio (ver. 2020), and our work is a handwritten assembly implementation of the PIPO block cipher. To build the code, we use gcc (ver. 10.1.0) with the -O3 option. Benchmarking was conducted with the same method as on the RISC-V platform. For benchmarking, 32 bytes of plaintext are encrypted 10,000 times and the results are averaged. To ensure that no instructions or data remain in cache memory, a dummy operation is performed after each execution of the PIPO block cipher. Dummy operations are not counted in the average clock cycles. The process of loading and aligning the 4 plaintexts, and the process of rearranging and writing out the ciphertexts aligned in the general-purpose registers, are also included in the clock cycle counts.

We evaluate our optimization strategy in a RISC-V environment on the HiFive1 Rev B board described above. Table 2 shows the performance of implementations of various cryptographic algorithms and of our software on a RISC-V platform. Unfortunately, research on lightweight block ciphers in the RISC-V environment is scarce compared to that on the Cortex-M3/M4 platforms; therefore, we survey all the latest cryptographic software implemented on RISC-V platforms.
Fix-GIFT and Fix-AES are implementations to which the Fixslicing technique proposed in [7] and [6] is applied.

for the Cortex-M3 and M4 platforms. The biggest differences between Cortex-M3 and Cortex-M4 are the multiplication instructions and the Floating-Point Unit instructions, which do not significantly affect the performance of an actual block cipher implementation. For that reason, we refer to the Fix-AES performance report in [6], which shows that performance did not change significantly between the Cortex-M3 and Cortex-M4 environments. Compared to the reference implementation of the PIPO block cipher, our PIPO† software, without the parallelization strategy, achieved 171 CPB (a performance improvement of about 81%). PIPO‡, with the forced parallelization strategy applied, achieved 66 CPB (a performance improvement of approximately 370%). Comparing our PIPO software with other lightweight block ciphers, PIPO‡ achieves better performance than the Fix-GIFT, PRESENT, RECTANGLE, and SIMON software. The HIGHT software presented in [5] achieves 56 CPB for the encryption process but requires an additional 49 CPB for the key scheduling process. Since our software includes the key scheduling inside the encryption process, PIPO‡ is faster than the HIGHT implementation once all the clock cycles required for the actual encryption are counted. In addition, our implementation achieves lower RAM usage and higher performance than general block ciphers with 128-bit plaintext lengths, such as Fix-AES and ARIA.
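Because the key scheduling runs inside the encryption routine, each round key byte has to be expanded across the four plaintext lanes on the fly. The following C sketch shows one way this expansion could look. It rests on several assumptions that should be checked against the PIPO specification [1]: that round key r is the 64-bit key half K_(r mod 2) with the round number XORed into its lowest byte, that the four key words are ordered K0 then K1 with the low byte first, and that all four plaintexts share one key. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Copy one byte into all four 8-bit lanes of a 32-bit word. */
uint32_t splat_byte(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Round key word for one row of four parallel blocks.
 * key[0..1] is taken as K0 and key[2..3] as K1, low byte first (assumed). */
uint32_t round_key_row(const uint32_t key[4], unsigned round, unsigned row)
{
    assert(key && round <= 13u && row < 8u);
    const uint32_t *half = key + 2u * (round & 1u);   /* K_(round mod 2) */
    uint8_t b = (uint8_t)(half[row >> 2] >> (8u * (row & 3u)));
    if (row == 0u)
        b ^= (uint8_t)round;                          /* round constant (assumed placement) */
    return splat_byte(b);
}
```

On RV32I, which has no multiply instruction in the base set, the byte replication would be done with shifts and ORs instead of the multiplication used here; the key words themselves never leave the registers, so no load is needed per round.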
Unfortunately, similar to what was reported in the previous section, the PIPO block cipher also shows lower performance than LEA on the Cortex-M4

Since different keys may be used even for small plaintexts, depending on the application, we do not pre-compute all round keys, which keeps RAM usage minimal. This choice makes our software flexible, in that it can be used with all block cipher modes of operation and across platforms and application scenarios. The additional overhead incurred by aligning the plaintexts, loading the bit-masking values, and expanding the round keys for the forced parallelization strategy is offset to an extent by architectural characteristics. In the case of RISC-V, loading the round key in each round of the PIPO block cipher is avoided by holding the key in general-purpose registers. In the case of Cortex-M4, the barrel shifter reduces the clock cycles of the bit-masking and round key processes. Finally, we report that our PIPO software genererence Implementation is 197 CPB, which is basically lower than the LEA block cipher. Therefore, it can be considered