Efficient Implementation of Lightweight Hash Functions on GPU and Quantum Computers for IoT Applications

Secure communication is important for Internet of Things (IoT) applications to avoid cybersecurity attacks. One of the key security aspects is data integrity, which can be protected by employing cryptographic hash functions. Recently, the US National Institute of Standards and Technology (NIST) announced a competition to standardize lightweight hash functions, which can be used in IoT applications. IoT communication involves various hardware platforms, from low-end microcontrollers to high-end cloud servers with GPU accelerators. Since many sensor nodes are connected to the gateway devices and cloud servers, performing high-throughput integrity checks is important to secure IoT applications. However, this is a time-consuming task even for high-end servers, which may affect the response time in IoT systems. Moreover, no prior work has evaluated the performance of NIST candidates on contemporary processors like GPUs and quantum computers. In this study, we show that with carefully crafted implementation techniques, all the finalist hash function candidates in the NIST standardization competition can achieve high throughput (up to 1,000 Gbps) on an RTX 3080 GPU. This research output can be used by IoT gateway devices and cloud servers to perform data integrity checks at high speed, thus ensuring a timely response. In addition, this is the first study to showcase the implementation of NIST lightweight hash functions on a quantum computer (ProjectQ). Besides securing the communication in IoT, these efficient implementations on a GPU and a quantum computer can be used to evaluate the strength of the respective hash functions against brute-force attacks.

1) The first efficient implementations of PHOTON-Beetle, ASCON, Xoodyak, and SPARKLE on GPU platforms are presented in this paper. Proposed techniques include table-based implementation with warp shuffle instructions and various memory optimization techniques on GPU platforms. The performance of these implementations was evaluated on a high-end GPU platform (RTX 3080). The hash throughput of the proposed implementations was up to 1,000 Gbps, which is fast enough to handle the massive traffic of an IoT system.
2) We report the first implementation of the PHOTON-Beetle, ASCON, Xoodyak, and SPARKLE hash functions on quantum computers. The hash functions were optimized taking into account the reversible computing environment in quantum computers, which is different from classical computers. The implementation was performed on ProjectQ, a quantum programming tool provided by ETH Zurich and IBM [17].
3) For the purpose of reproduction, we share the GPU implementation codes in the public domain at: https://github.com/benlwk/lwcnist-finalists and the quantum circuit implementation codes in the public domain at: https://github.com/starj1023/lwcnist-finalists-QC

II. BACKGROUND
This section describes how cryptographic hash functions are used to check data integrity in IoT communication. It also provides an overview of the selected hash functions and implementation platforms.

A. SECURE COMMUNICATION IN IOT APPLICATIONS
Referring to Figure 1, an IoT system consists of three communicating parties: sensor nodes, gateway devices, and a cloud server. Sensor nodes are usually placed ubiquitously to collect important sensor data. Because of this requirement, sensor nodes are designed with low-power microcontrollers and powered by batteries. Gateway devices are placed at strategic locations to obtain the IoT data from sensor nodes. These gateway devices need to handle connections from many sensor nodes, so they are usually implemented with a more powerful processor and connected to a continuous power source. The communication between gateway devices and sensor nodes utilizes wireless technology, like Bluetooth Low Energy (BLE) or Zigbee. In other words, the sensor nodes are usually not directly connected to the Internet. On the other hand, the cloud server communicates with the gateway devices through an Internet connection, which is usually protected through the TLS protocol. Data integrity is important for security because it ensures that the collected sensor data is not maliciously modified during the communication process, from sensor nodes to the cloud server. With the use of a cryptographic hash function, any malicious modification of the communicated sensor data can be easily detected. This allows us to verify the integrity of the sensor data on the gateway or server side, which greatly strengthens the security of IoT communication. On top of that, the hash function is also used to construct a mutual authentication protocol [18] or a Hash-based Message Authentication Code (HMAC) to ensure integrity and authenticity. The role of a hash-based signature in IoT systems was also investigated in a prior work [19].
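The integrity check described above can be sketched in a few lines. This is only a minimal illustration: SHA-256 from the Python standard library stands in for a lightweight hash function (an assumption made so that the example runs without extra packages), and the sensor payload, key, and function names are hypothetical.

```python
import hashlib
import hmac


def digest(payload: bytes) -> bytes:
    # SHA-256 is a stand-in here; a deployment would use the chosen lightweight hash.
    return hashlib.sha256(payload).digest()


def verify_integrity(payload: bytes, received_digest: bytes) -> bool:
    # The gateway recomputes the hash and compares it in constant time.
    return hmac.compare_digest(digest(payload), received_digest)


def tag(key: bytes, payload: bytes) -> bytes:
    # Keyed variant (HMAC): adds authenticity on top of integrity.
    return hmac.new(key, payload, hashlib.sha256).digest()


reading = b"node=17;temp=23.5"            # hypothetical sensor payload
sent_digest = digest(reading)             # computed on the sensor node
intact = verify_integrity(reading, sent_digest)
tampered = verify_integrity(b"node=17;temp=99.9", sent_digest)
print(intact, tampered)
```

A gateway that serves many sensor nodes repeats `verify_integrity` for every received message, which is the workload that the rest of the paper offloads to a GPU.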
Although hash functions are generally considered lightweight, efficient implementation is still important because of the massive amount of traffic in IoT communication. For instance, the gateway device may need to perform a data integrity check (i.e., recomputing the hash value) on all sensor data it receives. This can impose a huge burden on the gateway device and potentially degrade its response time, causing unwanted communication delay. Note that the gateway device may still need to perform other computations like data summarizing and edge computing; performing integrity checks for many sensor nodes on a regular basis is a demanding task. To mitigate this potential performance bottleneck, we can offload the data integrity check to an accelerator (e.g., a GPU), following the strategy proposed by Chang et al. [20]. Hence, efficient implementation of hash functions on GPU platforms is crucial to secure future IoT communication systems, especially applications that have a large number of sensor nodes. Besides that, the security level of hash functions is usually estimated based on classical computers. Due to the emergence of quantum computers, their security level must be re-examined against this new processor architecture.
In March 2021, NIST announced that four hash function candidates (PHOTON-Beetle, Ascon, Xoodyak, and Sparkle) had successfully advanced into the final round. Another six AEAD candidates (Elephant, GIFT-COFB, Grain128-AEAD, ISAP, Romulus, and TinyJambu) also advanced into the final round. Note that PHOTON-Beetle, Ascon, and Sparkle can also be configured to operate as AEAD. This sub-section provides an overview of the four finalist hash functions that were selected for implementation in the present study. More detailed descriptions can be found in the respective specifications submitted to NIST for standardization [21]-[24]. Notations used to describe operations in these four hash functions are presented in Table 1.
PHOTON-Beetle [21] uses the PHOTON permutation function and the sponge-based mode Beetle to construct the hash function. The main computation lies in the PHOTON permutation function, which is described in Algorithm 1. The PHOTON permutation makes use of a 4-bit S-box described in Table 2.

[Algorithm 2: Xoodoo permutation function; the listing is not recoverable here.]
Sparkle [24] is an SPN-based cryptographic primitive that can be used for authenticated encryption and hashing. The Sparkle permutation function consists of an Alzette ARX-box and a linear diffusion layer. The Alzette ARX-box, described in Algorithm 3, is a Feistel-like 64-bit block cipher that provides quick diffusion.

C. OVERVIEW OF THE GPU ARCHITECTURE
A GPU is a massively parallel architecture consisting of hundreds to thousands of cores. To achieve high throughput, every core is assigned the same instruction but operates on a different piece of data. This is essentially a Single Instruction Multiple Data (SIMD) parallel computing paradigm. The GPU has a deep memory architecture that needs to be used carefully in order to achieve high performance. The DRAM is the global memory in the GPU. It tends to be large in size but very slow in access speed. Shared memory is a user-managed cache that can be used to cache temporary data or look-up tables; it is faster than global memory but small in size (e.g., 96 KB). The register is the fastest memory in a GPU, but it is limited to thread-level access and small in size (64K registers per streaming multiprocessor). To exchange data across different threads, we need to rely on shared memory or warp shuffle instructions. A more detailed explanation of the GPU architecture and its programming model can be found in [26].

Algorithm 3 Alzette ARX-box in the Sparkle permutation function.
1: x[8] ▷ 256-bit state represented in eight 32-bit words
2: c ▷ Round constant
3: x ← x + (y ≫ 31)
4: y ← y ⊕ (x ≫ 24)
5: x ← x ⊕ c
6: x ← x + (y ≫ 17)
7: y ← y ⊕ (x ≫ 17)
8: x ← x ⊕ c
9: x ← x + (y ≫ 0)
10: y ← y ⊕ (x ≫ 31)
11: x ← x ⊕ c
12: x ← x + (y ≫ 24)
13: y ← y ⊕ (x ≫ 16)
14: x ← x ⊕ c

Algorithm 4 Linear diffusion layer $L_6(x)$ in the Sparkle permutation function.
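The Alzette listing (Algorithm 3) maps directly onto word-level operations. The sketch below models it on one 64-bit branch (two 32-bit words x and y), treating ≫ as a 32-bit right rotation; the function names and the test constant are illustrative assumptions, and the inverse is included only to show that every step is reversible.

```python
MASK32 = 0xFFFFFFFF

# (rotation applied to y before the addition, rotation applied to x before the XOR)
ROTATIONS = ((31, 24), (17, 17), (0, 31), (24, 16))


def rotr32(value: int, amount: int) -> int:
    """Rotate a 32-bit word to the right."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def alzette(x: int, y: int, c: int) -> tuple[int, int]:
    """Four add-rotate-xor rounds, following lines 3 to 14 of Algorithm 3."""
    for rot_y, rot_x in ROTATIONS:
        x = (x + rotr32(y, rot_y)) & MASK32
        y ^= rotr32(x, rot_x)
        x ^= c
    return x, y


def alzette_inverse(x: int, y: int, c: int) -> tuple[int, int]:
    """Undo the rounds in reverse order."""
    for rot_y, rot_x in reversed(ROTATIONS):
        x ^= c
        y ^= rotr32(x, rot_x)
        x = (x - rotr32(y, rot_y)) & MASK32
    return x, y


constant = 0x12345678                      # arbitrary test constant, not a Sparkle constant
out_x, out_y = alzette(0x01234567, 0x89ABCDEF, constant)
back = alzette_inverse(out_x, out_y, constant)
print(hex(back[0]), hex(back[1]))
```

Because the whole box uses only 32-bit additions, rotations, and XORs, it needs no tables and therefore no shared memory on a GPU.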

D. QUANTUM COMPUTERS FOR BRUTE-FORCE ATTACK
A pre-image attack on a hash function involves finding a message that outputs a specific hash value. Pre-image resistance indicates that it is difficult to find a pre-image $x$ for a given $y$ in the hash function $h(x) = y$. The Grover search algorithm is a quantum algorithm that is optimal for pre-image attacks on hash functions [27]. Compared to the pre-image attack on a classical computer, which requires $2^n$ searches (worst case), the Grover pre-image attack finds a pre-image with high probability with only $2^{n/2}$ searches. The steps of the Grover pre-image attack are as follows.
1) An n-qubit message is prepared in the superposition state $|\psi\rangle$ using Hadamard gates. This ensures that all basis states have the same amplitude.
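2) A hash function implemented as a quantum circuit is located in the oracle $f(x)$, which is defined as follows.
$$f(x) = \begin{cases} 1 & \text{if } h(x) = y \\ 0 & \text{otherwise} \end{cases}$$
$$U_f(|\psi\rangle |-\rangle) = \frac{1}{2^{n/2}} \sum_{x=0}^{2^n - 1} (-1)^{f(x)} |x\rangle |-\rangle$$
The oracle operator $U_f$ turns the sign of the solution (i.e., the pre-image) negative. The factor $(-1)^{f(x)}$ applies to all states, but since $(-1)^1 = -1$, the sign becomes negative only when $f(x) = 1$.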
3) The probability is increased by amplifying the amplitude of the negative sign state in the diffusion operator.
The Grover algorithm repeats steps 2 and 3 to increase the probability of measuring a solution. The optimal number of Grover iterations is $\lfloor \frac{\pi}{4} 2^{n/2} \rfloor$ (about $2^{n/2}$) [28]. That is, the classical pre-image attack, which requires $2^n$ searches, is reduced to $2^{n/2}$ searches by using the Grover search algorithm. What is important in this attack is to efficiently implement the hash function $h(x)$ as a quantum circuit. Since the diffusion operator has a typical structure, no special technique is needed to implement it.
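The three steps above can be traced on a toy instance with a plain state-vector simulation. The sketch below is an illustration under stated assumptions, not the attack circuit used in this work: `toy_hash` is a hypothetical 8-bit bijection (so each digest has exactly one pre-image), and the amplitudes are tracked as ordinary Python floats.

```python
import math


def toy_hash(x: int) -> int:
    # Hypothetical 8-bit stand-in for h(x); the odd multiplier makes it a bijection.
    return (167 * x + 13) % 256


def grover_preimage(target: int, n: int = 8) -> tuple[int, float, int]:
    size = 1 << n
    # Step 1: uniform superposition, every basis state has amplitude 1/sqrt(2^n).
    amplitudes = [1.0 / math.sqrt(size)] * size
    iterations = math.floor(math.pi / 4 * 2 ** (n / 2))
    for _ in range(iterations):
        # Step 2: the oracle flips the sign of the state with toy_hash(x) == target.
        amplitudes = [-a if toy_hash(x) == target else a
                      for x, a in enumerate(amplitudes)]
        # Step 3: the diffusion operator reflects every amplitude about the mean.
        mean = sum(amplitudes) / size
        amplitudes = [2.0 * mean - a for a in amplitudes]
    probabilities = [a * a for a in amplitudes]
    best = max(range(size), key=probabilities.__getitem__)
    return best, probabilities[best], iterations


secret = 0xA5
found, probability, iterations = grover_preimage(toy_hash(secret))
print(found == secret, iterations, round(probability, 4))
```

For n = 8 the loop runs 12 times, compared with up to 256 evaluations for an exhaustive classical search.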
The advent of large-scale quantum computers poses a threat to the cryptographic community, as such a computer would be one of the best cryptanalysis tools available. Cryptanalysis, which has been performed on classical computers so far, needs to be performed on quantum computers as well in order to provide sufficient confidence in the underlying hash functions. This is evident from the effort of NIST in estimating the post-quantum security strength according to the cost of applying the Grover algorithm to symmetric key cryptography [16].

E. QUANTUM GATES
Quantum computing is reversible for all changes except measurement. Reversibility means that the initial state can be reproduced using only the output state. There are quantum gates with reversible properties that can replace classical gates. Figure 2 shows representative quantum gates used in quantum computing.
1) NOT / X gate: NOT(x) = $\bar{x}$. This inverts the input qubit.
2) CNOT gate: CNOT(x, y) = (x, x ⊕ y). One of the two qubits acts as a control qubit. If the control qubit x is set to 1, y is inverted.
3) SWAP gate: SWAP(x, y) = (y, x). This exchanges the states of the two qubits x and y.
4) CCNOT / Toffoli gate: Toffoli(x, y, z) = (x, y, x · y ⊕ z). Two control qubits are used. When both control qubits x and y are 1, z is inverted.
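On computational basis states, the four gates act as ordinary bit functions, which makes their reversibility easy to check. The following sketch uses classical bits only (no superposition) and hypothetical function names; it verifies that applying each gate twice restores every input.

```python
from itertools import product


def x_gate(x: int) -> tuple[int]:
    return (x ^ 1,)


def cnot(x: int, y: int) -> tuple[int, int]:
    return x, x ^ y


def swap(x: int, y: int) -> tuple[int, int]:
    return y, x


def toffoli(x: int, y: int, z: int) -> tuple[int, int, int]:
    return x, y, (x & y) ^ z


def is_self_inverse(gate, arity: int) -> bool:
    # A gate is its own inverse if applying it twice returns every input unchanged.
    return all(gate(*gate(*bits)) == bits for bits in product((0, 1), repeat=arity))


results = {
    "X": is_self_inverse(x_gate, 1),
    "CNOT": is_self_inverse(cnot, 2),
    "SWAP": is_self_inverse(swap, 2),
    "Toffoli": is_self_inverse(toffoli, 3),
}
print(results)
```

Note that the Toffoli gate computes an AND into its third bit only when that bit starts at 0, which is why classical AND operations cost an extra target qubit in a reversible circuit.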

III. DEVELOPMENT OF IMPLEMENTATION TECHNIQUES ON GPU
This section describes the optimization techniques developed to implement the selected hash functions on the GPU. Note that in order to achieve high throughput, we adopt a coarse-grained parallel method, wherein many parallel threads are launched and each thread computes one hash value independently.

A. PHOTON-BEETLE
The PHOTON permutation function (Algorithm 1) operates on a 256-bit state organized in an 8-bit array (X) with 8 × 8 dimensions. The SubCells, ShiftRows and MixColumnSerial operations can be combined and pre-computed in a table (Table), which is used in Algorithm 5. The pre-computed table in the PHOTON permutation function only consumes 128 32-bit words, so it can be cached in the shared memory for faster access. A closer look into Algorithm 5 reveals that the access pattern to Table is determined by the state in PHOTON (X, line 9). Since the value in state X is random, the access to Table is also random. If Table is stored in shared memory, the accesses are very likely to experience bank conflicts, which degrades performance.
To improve the performance and avoid bank conflicts, we propose another technique that stores the pre-computed table in registers and accesses it through warp shuffle instructions.
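One way to picture a register-resident table is to spread its 128 words over the 32 threads (lanes) of a warp, four words per lane, and to let each lane fetch the word it needs from the lane that owns it. The Python model below is a sketch of that idea only: the layout (`table[r * 32 + lane]` in register `r` of `lane`) and all names are assumptions, and `shuffle` merely imitates the data movement of a CUDA warp shuffle such as `__shfl_sync`; the layout used in the published GPU code may differ.

```python
WARP_SIZE = 32
TABLE_WORDS = 128
REGS_PER_LANE = TABLE_WORDS // WARP_SIZE   # four 32-bit words per thread


def distribute(table: list[int]) -> list[list[int]]:
    # Assumed layout: register r of lane l holds table[r * 32 + l].
    return [[table[r * WARP_SIZE + lane] for r in range(REGS_PER_LANE)]
            for lane in range(WARP_SIZE)]


def shuffle(values: list[int], source_lanes: list[int]) -> list[int]:
    # Models a warp shuffle: lane i receives the value held by lane source_lanes[i].
    return [values[src] for src in source_lanes]


def warp_lookup(lane_regs: list[list[int]], indices: list[int]) -> list[int]:
    # Every lane looks up its own (state-dependent) index without shared memory.
    sources = [index % WARP_SIZE for index in indices]
    result = [0] * WARP_SIZE
    for r in range(REGS_PER_LANE):          # one shuffle per register slot
        fetched = shuffle([lane_regs[lane][r] for lane in range(WARP_SIZE)], sources)
        for lane in range(WARP_SIZE):
            if indices[lane] // WARP_SIZE == r:
                result[lane] = fetched[lane]
    return result


table = [(i * 2654435761) & 0xFFFFFFFF for i in range(TABLE_WORDS)]  # dummy contents
indices = [(37 * lane + 5) % TABLE_WORDS for lane in range(WARP_SIZE)]
looked_up = warp_lookup(distribute(table), indices)
print(looked_up == [table[i] for i in indices])
```

Because the data moves between registers, arbitrary index patterns cost the same number of shuffles, whereas the same pattern in shared memory could serialize on bank conflicts.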

B. ASCON
The Ascon permutation function operates on a 320-bit state, represented as a 5 × 64-bit array. The S-box in Ascon can be implemented in a bit-sliced manner, which is very efficient on both high-end processors and constrained devices. Algorithm 7 shows the implementation of one round of the Ascon permutation, which is repeated for 12 rounds. We propose to utilize the bit-sliced approach when implementing the S-box (lines 8-12), which, unlike the PHOTON-Beetle implementation, does not use any shared memory. The linear layer in the Ascon permutation function can also be implemented using simple logical and shift operations (lines 17-21). Note that the NVIDIA GPU does not come with a native rotate instruction, so rotate operations were replaced with two shifts and one XOR instruction.
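A bit-sliced round needs nothing beyond AND, NOT, XOR, and shifts. The sketch below follows the publicly documented Ascon S-box and linear-layer formulas as we understand them; it is a Python model for illustration rather than the CUDA kernel, and the helper names (`rotr64`, `sbox_bitsliced`, `sbox5`) are assumptions.

```python
MASK64 = (1 << 64) - 1


def rotr64(value: int, amount: int) -> int:
    # Two shifts and one XOR; valid for 0 < amount < 64 because the halves do not overlap.
    return ((value >> amount) ^ (value << (64 - amount))) & MASK64


def sbox_bitsliced(x0, x1, x2, x3, x4, mask=MASK64):
    # Applies the 5-bit S-box to every bit position of the five words at once.
    x0 ^= x4
    x4 ^= x3
    x2 ^= x1
    t0 = ~x0 & x1 & mask
    t1 = ~x1 & x2 & mask
    t2 = ~x2 & x3 & mask
    t3 = ~x3 & x4 & mask
    t4 = ~x4 & x0 & mask
    x0 ^= t1
    x1 ^= t2
    x2 ^= t3
    x3 ^= t4
    x4 ^= t0
    x1 ^= x0
    x0 ^= x4
    x3 ^= x2
    x2 = ~x2 & mask
    return x0, x1, x2, x3, x4


def linear_layer(x0, x1, x2, x3, x4):
    # Each word is XORed with two rotated copies of itself.
    return (x0 ^ rotr64(x0, 19) ^ rotr64(x0, 28),
            x1 ^ rotr64(x1, 61) ^ rotr64(x1, 39),
            x2 ^ rotr64(x2, 1) ^ rotr64(x2, 6),
            x3 ^ rotr64(x3, 10) ^ rotr64(x3, 17),
            x4 ^ rotr64(x4, 7) ^ rotr64(x4, 41))


def sbox5(value: int) -> int:
    # Evaluate the S-box on one 5-bit input, with x0 as the most significant bit.
    bits = [(value >> (4 - i)) & 1 for i in range(5)]
    out = sbox_bitsliced(*bits, mask=1)
    return sum(bit << (4 - i) for i, bit in enumerate(out))


print([hex(sbox5(v)) for v in range(4)])
```

Since every thread works on its own five 64-bit words, the S-box and the linear layer stay entirely in registers.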

C. XOODYAK
Xoodyak uses a permutation (Xoodoo) similar to that of the Keccak hash function. Unlike the other three selected hash functions, Xoodoo does not have any S-box or ARX-box layer. In our GPU implementation, the round constants are used by all threads, so they were stored in constant memory. Unlike the pre-computed Table in PHOTON-Beetle, which is accessed repeatedly in each round, these Xoodoo round constants are only read once and consumed by every thread, so they are very likely to be cached in the L1 cache. Hence, we did not store them in the shared memory, as it would not have provided any performance gain. Our GPU implementation of the Xoodoo permutation function follows Algorithm 2 closely, so we do not repeat it here.

D. SPARKLE
The SPARKLE permutation consists of an ARX-box layer followed by a linear layer. The Alzette ARX-box in SPARKLE can be executed efficiently using only simple arithmetic and logical operations (see Algorithm 3). For the same reason as in Xoodyak, the round constants in SPARKLE are stored in constant memory instead of shared memory or registers. The implementation of the SPARKLE-256 permutation function is illustrated in Algorithm 8.

IV. DEVELOPMENT OF IMPLEMENTATION TECHNIQUES ON QUANTUM COMPUTERS

A. PHOTON-BEETLE
The PHOTON permutation function (Algorithm 1) operates in a 256-qubit state organized in a 4-qubit array with 8 × 8 dimensions. The PHOTON permutation function, which consists of AddConstant, SubCells, ShiftRows, and MixColumnSerial, was implemented as a quantum circuit as follows.
In AddConstant, the predetermined constants RC and IC are XORed with each other. In this case, it can be implemented using only NOT gates, and the overlapping parts are omitted. For example, when k = 1 and i = 1, two NOT gates would be applied to the first qubit of X[1, 0]; they cancel each other and are omitted, so a NOT gate is applied only to the second qubit of X[1, 0].
SubCells applies the 4-qubit S-box to each of the 64 cells of the 256-qubit state. When implementing an S-box in classical computing, a lookup table is a common choice. However, in quantum computing, this approach is quite inefficient. Therefore, a quantum circuit that computes the S-box output from its input should be implemented. Quantum circuit implementations of an S-box sometimes incur additional qubits or increase the circuit cost. To solve this, we use the LIGHTER-R tool [29] to convert Table 2 into ANF (Algebraic Normal Form). LIGHTER-R can find reversible implementations of a 4-bit S-box. The implementation works in place, thus no additional qubits are allocated. Since most of the cost of the PHOTON permutation function is incurred by the S-box, an efficient S-box implementation is important. The ANF-based quantum circuit of the PHOTON S-box is shown in Figure 3. LIGHTER-R is described in detail in [29].
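The cancellation of overlapping NOT gates in AddConstant amounts to XORing the two constants classically before emitting any gate. A minimal sketch, assuming 4-bit constants and using illustrative values chosen to reproduce the situation described above (both constants set the first bit, only one sets the second):

```python
def not_gate_positions(rc: int, ic: int, width: int = 4) -> list[int]:
    # Bits set in both constants would receive two NOT gates, which cancel out;
    # only the bits of rc XOR ic need a gate.
    combined = rc ^ ic
    return [bit for bit in range(width) if (combined >> bit) & 1]


def naive_gate_count(rc: int, ic: int) -> int:
    # One NOT gate per set bit of each constant, without any cancellation.
    return bin(rc).count("1") + bin(ic).count("1")


rc_example, ic_example = 0b0011, 0b0001    # illustrative values, not taken from the specification tables
positions = not_gate_positions(rc_example, ic_example)
print(positions, naive_gate_count(rc_example, ic_example))
```

Here three NOT gates shrink to one, applied to the second qubit (bit index 1) of the cell.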
In ShiftRows, the arrangement of qubits is changed, which can be done using only Swap gates. For convenience we used Swap gates in the implementation, but we did not count them as quantum resources. This is because Swap gates can be replaced by relabeling qubits [30]-[32] (called a logical swap). Algorithm 9 describes ShiftRows implemented as a quantum circuit. SWAP4 means a Swap operation in units of 4 qubits.
In MixColumnSerial, matrix multiplication in $GF(2^4)$ is used. For general multiplication, Toffoli gates replace AND operations. Since only constant multiplications are used in this matrix multiplication, only CNOT gates are needed, which have a lower cost than Toffoli gates. We already know the modulus $x^4 + x + 1$, thus we can implement the multiplication circuit for each constant using only CNOT gates [33]. The circuit for $C \cdot X \bmod x^4 + x + 1$ with the constant $C = 2$ is shown in Figure 4. Since X has to be used continuously, the product is stored in the newly allocated qubits $r_0, r_1, r_2, r_3$. We prepared modular multiplication quantum circuits for C = 0 to 15 and used them according to the value of C in the matrix multiplication of MixColumnSerial.
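Multiplication by a fixed constant in $GF(2^4)$ is a linear map over bits, so its circuit can be derived mechanically: the image of each basis element $x^i$ tells which output qubits receive a CNOT controlled by input bit i. The sketch below derives such CNOT lists for every constant and checks them against a classical reference multiplier; the gate ordering and names are assumptions and need not match Figure 4.

```python
def gf16_mul(a: int, b: int) -> int:
    # Classical reference: multiply in GF(2^4) modulo x^4 + x + 1 (0b10011).
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result


def cnot_circuit(constant: int) -> list[tuple[int, int]]:
    # Each pair is (control bit of X, target bit of the fresh output register r).
    gates = []
    for i in range(4):
        image = gf16_mul(constant, 1 << i)   # where the basis element x^i is mapped
        gates.extend((i, j) for j in range(4) if (image >> j) & 1)
    return gates


def run_circuit(gates: list[tuple[int, int]], x: int) -> int:
    # r starts at |0000>; every CNOT XORs one bit of X into one bit of r.
    r = 0
    for control, target in gates:
        if (x >> control) & 1:
            r ^= 1 << target
    return r


times_two = cnot_circuit(2)
print(times_two)
```

For C = 2 this yields five CNOT gates and no Toffoli gate: $r_0 = x_3$, $r_1 = x_0 \oplus x_3$, $r_2 = x_1$, $r_3 = x_2$.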

B. ASCON
The Ascon permutation function consists of AddConstant, a Substitution layer (Table 3), and a Linear diffusion layer (Equation 1). AddConstant adds a round constant to the state and is implemented using only NOT gates, as in PHOTON. For the Substitution layer, it is inefficient to implement an S-box in the form of Table 3 as a quantum circuit. In PHOTON, we converted Table 2 to ANF using LIGHTER-R, but since Ascon uses a 5-bit S-box, LIGHTER-R (only suitable for a 4-bit S-box) could not be applied. Therefore, we implemented the S-box in ANF (Figure 5) as specified in the Ascon paper [22]. The notation ⊙ indicates an AND operation. The Substitution layer and Linear diffusion layer operate on a 320-qubit state, represented as a 5 × 64-qubit array $x_i$ ($i = 0, \ldots, 4$). When computing $x_0$ in the S-box, we need the final $x_4$ (yellow highlight in Figure 5). It is therefore efficient to compute in the order $x_4, x_0, x_1, x_2, x_3$. Generating the final $x_4$, $x_0$, $x_1$ is not a problem. However, in order to obtain $x_2$ and $x_3$, the values of $x_4$ and $x_0$ before the S-box are required (red highlight in Figure 5). One way to solve this is to store the values ($x_4$ and $x_0$ before the S-box) in temporary qubits. Instead, we use the additional qubits that are allocated for the Linear diffusion layer.
In the Linear diffusion layer, to compute $x_0$, the values of $x_0 ≫ 19$ and $x_0 ≫ 28$ are needed simultaneously. If the first qubit $x_0[0]$ is updated in place using $x_0[19]$ and $x_0[28]$, the original $x_0[0]$ value disappears. Since $x_0[45]$ and $x_0[36]$ can then no longer be computed, new qubits are allocated to store the updated value. To reduce the number of qubits, we present an S-box quantum circuit that uses the qubits newly allocated in the Linear diffusion layer. This approach allows the Substitution layer and the Linear diffusion layer to share temporary qubits. As a result, the temporary qubits that would otherwise be allocated separately for the two layers are allocated only once, for the Linear diffusion layer.
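The dependency that forces the extra qubits can be checked with a classical bit-level model. The sketch below treats $x_0$ as a list of 64 bits with index 0 as the least significant bit (an assumed convention under which bit j of $x_0 ≫ r$ is bit (j + r) mod 64 of $x_0$) and compares the correct out-of-place update with a naive sequential in-place update.

```python
WIDTH = 64
ROTATIONS = (19, 28)   # rotation amounts of the x0 row in the linear diffusion layer


def readers_of(bit: int) -> list[int]:
    # Output positions whose new value reads the ORIGINAL value of input bit `bit`.
    return sorted({bit} | {(bit - r) % WIDTH for r in ROTATIONS})


def update_out_of_place(bits: list[int]) -> list[int]:
    # Correct result: every output is computed from the original input bits.
    return [bits[j] ^ bits[(j + 19) % WIDTH] ^ bits[(j + 28) % WIDTH]
            for j in range(WIDTH)]


def update_naive_in_place(bits: list[int]) -> list[int]:
    # Overwrites bit j immediately, so later positions may read updated values.
    bits = list(bits)
    for j in range(WIDTH):
        bits[j] ^= bits[(j + 19) % WIDTH] ^ bits[(j + 28) % WIDTH]
    return bits


state = [0] * WIDTH
state[19] = 1                              # single set bit as a test pattern
correct = update_out_of_place(state)
naive = update_naive_in_place(state)
print(readers_of(0), correct == naive)
```

Positions 36 and 45 both read the original bit 0, so once bit 0 is overwritten they can no longer be computed correctly, which matches the reason given above for writing the updated values into newly allocated qubits.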
We design an efficient S-box quantum circuit by utilizing the reverse operation and taking into account the Linear diffusion layer (Equation 1). Figure 6 shows the structure of the proposed S-box quantum circuit. In this quantum circuit, the S-box is applied to one qubit of each register, and the resulting value is transferred to the temporary qubit of the Linear diffusion layer using CNOT gates. Then, to compute $x_2$ and $x_3$, a reverse operation (except for LD) is performed to recover $x_4$ and $x_0$ as they were before the S-box. Finally, we compute $x_2$ and $x_3$ without temporary qubits, using $x_4$ and $x_0$ from before the S-box.

C. XOODYAK
The Xoodoo permutation function operates in a 384-qubit state, represented in a 3×128-qubit array (A 0 , A 1 , A 2 ), and each 128-qubit is arranged in a 4×32 array. Algorithm 10 describes each step of the Xoodoo permutation implemented as a quantum circuit.
For the mixing layer θ, we need to allocate a new 128-qubit register P for $P = A_0 + A_1 + A_2$. Then $A_0$, $A_1$, and $A_2$ are XORed into P using 3 × CNOT128. CNOT128 means CNOT gates operating in units of 128 qubits. In ≪ (a, b) of θ, a means a rotation in 32-bit units in a 128-bit state, and b means a rotation in 1-bit units in a 32-bit state. We used RotateCNOT to XOR P into $A_0$, $A_1$, $A_2$ based on a logical swap for P. RotateCNOT is shown in Algorithm 11. In this way, the rotation operation can be performed without using Swap gates. In $\rho_{west}$ and $\rho_{east}$, the rotation operations can be replaced with a logical swap as in RotateCNOT, but for the convenience of implementation, we used Swap gates. The step ι, which adds the constant $C_i$ to $A_0$, is performed using only NOT gates in the same way as AddConstant in the PHOTON permutation function.
Most of the quantum gates and qubits are used for the nonlinear layer χ. Toffoli gates (high cost) were used to replace the AND operations on $A_0$, $A_1$, $A_2$, and the results were stored in the newly allocated $B_0$, $B_1$ and $B_2$. However, we reduced the use of qubits by avoiding the allocation of $B_2$. After computing $B_0 = \bar{A}_1 \cdot A_2$ and $B_1 = \bar{A}_2 \cdot A_0$, the reverse operations restore the values of $A_1$ and $A_2$. Then $A_2 = A_2 + \bar{A}_0 \cdot A_1$ (which replaces $A_2 = A_2 + B_2$) avoids allocating qubits for $B_2$. When $A_2$ is completed, $B_0$ and $B_1$ can be XORed into $A_0$ and $A_1$ with CNOT128. The reverse operation for CNOT gates does not add a large overhead to the gate count and depth of the quantum circuit. We thus save 128 qubits in every round of the permutation with little overhead in gates and depth. Lastly, $\rho_{east}$ is performed using Swap gates.
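The ordering argument for χ can be checked classically: $A_2$ has to be updated while $A_0$ and $A_1$ still hold their original values, after which $B_0$ and $B_1$ are folded in. The sketch below models each 128-qubit plane as a Python integer; it is a bit-level illustration with hypothetical names, not the quantum circuit itself.

```python
import random

MASK128 = (1 << 128) - 1


def chi_reference(a0: int, a1: int, a2: int) -> tuple[int, int, int]:
    # Textbook form: three temporaries B0, B1, B2.
    b0 = ~a1 & a2 & MASK128
    b1 = ~a2 & a0 & MASK128
    b2 = ~a0 & a1 & MASK128
    return a0 ^ b0, a1 ^ b1, a2 ^ b2


def chi_without_b2(a0: int, a1: int, a2: int) -> tuple[int, int, int]:
    b0 = ~a1 & a2 & MASK128        # Toffoli gates into the fresh register B0
    b1 = ~a2 & a0 & MASK128        # Toffoli gates into the fresh register B1
    a2 ^= ~a0 & a1 & MASK128       # Toffoli gates straight into A2 (A0, A1 still original)
    a0 ^= b0                       # CNOT128 from B0
    a1 ^= b1                       # CNOT128 from B1
    return a0, a1, a2


rng = random.Random(2024)
samples = [tuple(rng.getrandbits(128) for _ in range(3)) for _ in range(100)]
matches = all(chi_reference(*s) == chi_without_b2(*s) for s in samples)
print(matches)
```

Writing $\bar{A}_0 \cdot A_1$ directly into $A_2$ is what removes the third 128-qubit register from every round.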

D. SPARKLE
This section only describes the implementation technique for the Sparkle384 permutation. The same technique works for Sparkle512. Sparkle permutations consist of an ARX-box layer followed by a linear layer. For the additions in the ARX-box, a quantum adder is required. For this, we used an improved quantum ripple-carry adder, called the CDKM adder [34]. The ripple-carry adder stores the result of the addition A + B in B and keeps A as it is (i.e., ADD(A, B, r) = (A, A + B, r)). The ripple-carry adder allocates two carry qubits (r) for addition. However, since the ARX-box uses modular addition that ignores the highest carry, we only allocated a single qubit $r_0$. Since $r_0$ is initialized to 0 again after the addition, it can be reused in subsequent additions. However, we design parallel addition by using a few more qubits, which greatly reduces the depth. In a round, Sparkle384 applies the ARX-box 6 times and Sparkle512 applies it 8 times. Since these ARX-boxes are independent of each other, parallel addition is possible. For this, we do not use only $r_0$, but $r_0, \ldots, r_5$ for Sparkle384 and $r_0, \ldots, r_7$ for Sparkle512. Implementation details can be found in our source code. Algorithm 12 describes an ARX-box implemented as a quantum circuit. For additions and XORs with a rotated input (e.g., x + (y ≫ 31), y ⊕ (x ≫ 24)), no resources are spent on the rotation because RotateCNOT and RotateADD are based on logical swaps. RotateCNOT32 and RotateADD32, which are based on logical swaps and operate in 32-qubit units, are similar to RotateCNOT in the Xoodoo permutation but simpler to implement. Algorithm 13 describes RotateCNOT32. For RotateADD32, a CDKM adder in units of 32 qubits is used. As in RotateCNOT32, the qubits $a_i$ are relabeled according to the rotated result (i.e., logical swaps).
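The idea behind RotateCNOT32 is that a rotation only changes which control qubit is wired to which target, so it costs no gates. A minimal classical sketch, assuming that ≫ denotes a 32-bit right rotation with bit 0 as the least significant bit; the function names are hypothetical and the gate order need not match Algorithm 13.

```python
MASK32 = 0xFFFFFFFF


def rotr32(value: int, amount: int) -> int:
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def rotate_cnot32(amount: int) -> list[tuple[int, int]]:
    # CNOT(control = x[(i + amount) mod 32], target = y[i]) for every bit i.
    # The rotation is absorbed into the control index: a logical swap, no Swap gates.
    return [((i + amount) % 32, i) for i in range(32)]


def apply_cnots(gates: list[tuple[int, int]], x: int, y: int) -> int:
    # Classical action of the CNOT list on basis states.
    for control, target in gates:
        if (x >> control) & 1:
            y ^= 1 << target
    return y


x_word, y_word = 0x80000001, 0x0F0F0F0F
via_gates = apply_cnots(rotate_cnot32(24), x_word, y_word)
print(hex(via_gates), via_gates == y_word ^ rotr32(x_word, 24))
```

The gate list always contains exactly 32 CNOT gates, regardless of the rotation amount.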
In the linear layer $L_6(x)$, a temporary value t for $y_0 \oplus y_1 \oplus y_2$ is used. In classical computing, using temporary storage (t) like this is not a problem. However, in quantum computing, the qubits for t must be newly allocated, and since they cannot be recycled, they must be allocated for every $L_6(x)$, which is very inefficient. We solved this by designing a quantum circuit for $L_6(x)$ as in Algorithm 14. Algorithm 14 computes $y_2 = y_0 \oplus y_1 \oplus y_2$ (value preparation) and XORs $y_2$ into $x_3$, $x_4$ and $x_5$ (lines 8∼19). CNOT16 and CNOT32 indicate CNOT operations in units of 16 and 32 qubits. In the last step, the value preparation is reversed to return to the original $y_2$. In the linear diffusion layer, $L_6(y)$ is also performed on y. Since $L_6(y)$ differs from $L_6(x)$ only in its operands and the implementation technique is the same, the quantum circuit for $L_6(y)$ is omitted.
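The prepare, use, and reverse pattern can be modelled with ordinary 32-bit words. The sketch below covers only the part of the linear layer that depends on t and leaves out the branch permutation; the function `ell` follows our reading of the Sparkle specification (a 16-bit rotation XORed with the low half) and should be treated as an assumption, as should all names.

```python
MASK32 = 0xFFFFFFFF


def ell(value: int) -> int:
    # Assumed Sparkle ell function: rot16(v) XOR (v & 0xFFFF).
    rot16 = ((value << 16) | (value >> 16)) & MASK32
    return rot16 ^ (value & 0xFFFF)


def with_temporary(xs: list[int], ys: list[int]) -> tuple[list[int], list[int]]:
    # Classical style: a separate temporary word t.
    t = ys[0] ^ ys[1] ^ ys[2]
    return [x ^ ell(t) for x in xs], list(ys)


def without_temporary(xs: list[int], ys: list[int]) -> tuple[list[int], list[int]]:
    xs, ys = list(xs), list(ys)
    ys[2] ^= ys[0]                       # value preparation (CNOT32)
    ys[2] ^= ys[1]                       # value preparation (CNOT32)
    xs = [x ^ ell(ys[2]) for x in xs]    # XOR into x3, x4, x5 (CNOT32 and CNOT16)
    ys[2] ^= ys[1]                       # reverse of the value preparation
    ys[2] ^= ys[0]
    return xs, ys


x_words = [0x11111111, 0x22222222, 0x33333333]   # stand-ins for x3, x4, x5
y_words = [0xDEADBEEF, 0x01234567, 0x89ABCDEF]   # stand-ins for y0, y1, y2
result_temp = with_temporary(x_words, y_words)
result_free = without_temporary(x_words, y_words)
print(result_temp == result_free)
```

Because `ell` is linear over bits, XORing `ell(y2)` into a word needs only CNOT gates, consistent with the CNOT16 and CNOT32 units mentioned above, and $y_2$ is restored at the end, so no qubits are left allocated.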

V. EXPERIMENTAL RESULTS AND DISCUSSIONS
This section presents the implementation of the selected NIST lightweight hash functions on two different platforms: a GPU and a quantum computer. The GPU implementation was performed on a workstation equipped with an Intel i9-10900K CPU and an RTX 3080 GPU. The quantum computer implementation was performed on ProjectQ, which enables quantum programming and simulation.

A. RESULTS OF IMPLEMENTATION ON GPU
This study focused on achieving a high throughput for all of the hash functions implemented on the GPU. To achieve this, all experiments were conducted by launching P blocks in parallel, with each block consisting of 512 threads. Each thread performed one hash operation on a message whose length (MLEN) ranged from 64 bytes to 512 bytes. This represents the common sizes of IoT sensor data typically found in sensor nodes that are built on constrained devices with only a few KB of RAM available. The throughput was measured in Gigabits per second (Gbps). Figure 7 shows the throughput achieved by PHOTON-Beetle in our GPU implementation. The shared memory version was always slower than the proposed warp shuffle version by approximately 40%. This is because in the PHOTON round function, the shared memory used to store the pre-computed table is accessed in a random manner, which may introduce many bank conflicts. In contrast, the warp shuffle version stores the pre-computed table in registers, which are not affected by any random access pattern. Hence, the warp shuffle version consistently outperformed the shared memory version.
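For the experimental setup above (P blocks of 512 threads, one MLEN-byte message per thread), throughput in Gbps follows from the total number of hashed bits and the elapsed kernel time. The helper below is a sketch using the usual definition of throughput, which we assume matches the measurement here; the launch size and timing in the example are hypothetical and are not measurements from this study.

```python
def throughput_gbps(blocks: int, threads_per_block: int,
                    mlen_bytes: int, elapsed_seconds: float) -> float:
    # Every thread hashes one message of mlen_bytes bytes.
    total_bits = blocks * threads_per_block * mlen_bytes * 8
    return total_bits / elapsed_seconds / 1e9


# Hypothetical launch: 1024 blocks x 512 threads, 64-byte messages, 1 ms kernel time.
example = throughput_gbps(blocks=1024, threads_per_block=512,
                          mlen_bytes=64, elapsed_seconds=1e-3)
print(round(example, 3))
```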
Compared to PHOTON-Beetle, the other three candidates achieved a much higher throughput. Referring to Figure 8, Sparkle was able to achieve very high throughput across different MLEN, ranging from 850 Gbps to 1000 Gbps. Xoodyak and Ascon performed at a similar level, achieving throughput that ranged from 400 Gbps to 500 Gbps. The throughput achieved by these three candidates was an order of magnitude higher than that of PHOTON-Beetle. The main reason for the difference in performance is that PHOTON-Beetle uses byte-wise operations, which are efficient on constrained devices (e.g., an 8-bit microcontroller) but not on a GPU with a 32-bit architecture. On the other hand, Sparkle, Xoodyak and Ascon are designed around word-level operations (32-bit or 64-bit), which can be implemented efficiently on a GPU.

B. RESULTS OF IMPLEMENTATION ON A QUANTUM COMPUTER
A large-scale quantum computer capable of running the entire quantum circuits proposed in this work is not yet available. However, simulation and analysis can be performed using quantum programming tools. This is also a common practice found in other work [?], [8], [10], [11]. For our implementation, we used the quantum programming tool ProjectQ. The implementation of the quantum circuits was validated using the ClassicalSimulator library, and the quantum resources used were analyzed using the ResourceCounter library. All of the hash functions implemented in this paper were optimized for qubits and quantum gates in the reversible computing environment of quantum computers. Table 4 shows the quantum resources for the quantum circuits of the hash functions. The input message length is fixed to 256 bits (384 bits only for ESCH384). Among the four NIST lightweight hash functions, Xoodyak uses few quantum gates and has the lowest circuit depth. Conversely, Sparkle uses many quantum gates and has the highest circuit depth. This is because the quantum adder used in Sparkle requires many quantum gates and has a high circuit depth. Note that although quantum addition uses a lot of resources, the depth is greatly reduced in this work by designing parallel addition. In terms of the number of qubits, Sparkle can be implemented with a relatively small number of qubits.
The quantum resources required to implement a quantum circuit for a hash function can be utilized to evaluate its resistance to quantum attacks. At the current level of advancement in quantum computers, the number of available qubits is insufficient. The number of qubits therefore determines when the circuit can actually run on a quantum computer. The depth represents the length of the circuit from start to end, which is related to the execution time [35].

C. POST-QUANTUM SECURITY STRENGTH
In this section, we estimate the post-quantum security strength of the NIST lightweight hash functions using the post-quantum security requirements presented by NIST [16]. In symmetric key cryptography, the security strength is halved when the Grover algorithm is applied. However, if the cost of applying it is high, the target cipher or hash function can be considered resistant to quantum attacks. Therefore, the quantum cost required to attack the target symmetric key cryptography is used to evaluate the post-quantum security strength. NIST presented the following requirements for the security strength of post-quantum cryptosystems.
• Attacks that break the security strength of a block cipher with a 128-bit key must require similar or more resources than those required for an attack against a block cipher (e.g., AES-128).
• Attacks that break the security strength of a 256-bit hash function must require similar or more resources than those required for an attack against a hash function (e.g., SHA-256 or SHA3-256).
NIST estimates the quantum attack cost for symmetric key cryptography as D (total gates × total depth) [16]. For the block cipher AES-128, NIST estimates the cost of a quantum attack to be $2^{170}$ (D), citing Grassl's implementation of AES quantum circuits [8]. On the other hand, NIST did not give an estimated quantum cost for hash functions (only for classical gates). Thus, we estimate the attack cost (D) for SHA3-256 [9] following the estimation method in [16]. The attack cost was estimated based on the quantum circuit for SHA3-256 (Table 4). For detailed analysis, we decompose each Toffoli gate into 7 T gates and 9 Clifford gates with a T-depth of 3, identical to the approach of [9]. X gates and CNOT gates are counted as Clifford gates. Table 5 is a resource analysis at the T+Clifford level for the NIST lightweight hash functions and SHA3-256. Now we estimate the cost of Grover's pre-image attack (Section II-D) for the NIST lightweight hash functions and SHA3-256 based on the quantum resources in Table 4. Grover's algorithm consists of an oracle and a diffusion operator, but the cost of the diffusion operator is commonly ignored when estimating the cost [8], [11], [16], [36], because its overhead is negligible. We likewise count only the oracle in the cost of Grover's algorithm.
In Grover's oracle, the hash function is executed twice (hashing and its reverse). Therefore, twice the resources of Table 5 are used, except for qubits. The resources for the single multi-controlled NOT gate that compares the generated hash value to a known hash value were omitted for simplicity. The optimal number of Grover search iterations is $\lfloor \frac{\pi}{4} 2^{n/2} \rfloor$. Thus, for a 256-bit input message, the oracle is repeated $\lfloor \frac{\pi}{4} 2^{128} \rfloor$ times ($\lfloor \frac{\pi}{4} 2^{192} \rfloor$ times for ESCH384). Finally, the resources for the attack were estimated as Table 5 × 2 × $\lfloor \frac{\pi}{4} 2^{128} \rfloor$ (Table 5 × 2 × $\lfloor \frac{\pi}{4} 2^{192} \rfloor$ for ESCH384) and are shown in Table 6. Note that the number of qubits was not counted in D, since NIST does not consider the number of qubits in estimating the attack cost.
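The estimation procedure above reduces to a few lines of arithmetic. The sketch below computes the base-2 logarithm of D = total gates × total depth from per-hash resource counts; the gate and depth figures in the example are placeholders chosen for readability, not values from Table 5, and the iteration count is evaluated in floating point, which is sufficient for an order-of-magnitude estimate.

```python
import math


def grover_iterations(n_bits: int) -> int:
    # floor(pi/4 * 2^(n/2)); the float product limits precision to about 53 bits.
    return math.floor(math.pi / 4 * 2 ** (n_bits // 2))


def attack_cost_log2(gates_per_hash: int, depth_per_hash: int, n_bits: int) -> float:
    # The oracle runs the hash circuit twice (hashing and its reverse) per iteration.
    iterations = grover_iterations(n_bits)
    total_gates = 2 * gates_per_hash * iterations
    total_depth = 2 * depth_per_hash * iterations
    # D = total gates x total depth, reported as log2(D).
    return math.log2(total_gates) + math.log2(total_depth)


# Placeholder circuit: 2^20 gates and depth 2^10 per hash evaluation, 256-bit input.
iterations_log2 = math.log2(grover_iterations(256))
example_cost = attack_cost_log2(2 ** 20, 2 ** 10, 256)
print(round(iterations_log2, 4), round(example_cost, 3))
```

For a 256-bit input the iteration count alone contributes about $2^{127.65}$ to both the gate count and the depth, so the per-hash circuit cost shifts D only by a comparatively small factor.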
It can be seen that the attack costs for the 256-bit hash functions of PHOTON-Beetle, Sparkle (ESCH256), Xoodyak, and Ascon are lower than that of SHA3-256 ($1.574 \cdot 2^{295}$), which is the NIST security requirement. One way to meet the NIST-defined security requirements against quantum computer attacks is to increase the length of the message, a well-known countermeasure. If the length is doubled, the security strength against quantum computer attacks is still halved, but the originally intended security strength is obtained. In addition, quantum attacks become difficult in terms of cost because the number of Grover iterations increases exponentially with the length of the input message (e.g., ESCH384 ($1.538 \cdot 2^{422}$) in Table 6). Lastly, increasing the number of permutation functions, which occupy most of the quantum resources in the NIST lightweight hash functions, is another way to satisfy the post-quantum security strength in terms of cost.

VI. CONCLUSION
Conducting high-throughput data integrity checks is essential to protect communications in IoT systems. In this study, we proposed techniques to optimize the four lightweight hash function finalists in the NIST standardization competition (PHOTON-Beetle, Ascon, Xoodyak and Sparkle). All four candidates achieved high hashing throughput (70 Gbps to 1000 Gbps) on a GPU platform, which can be used to perform high-performance data integrity checks in IoT systems. The implementation of these four hash functions on a quantum computer was analyzed using ProjectQ. Further, we estimated the cost of a Grover pre-image attack and compared it with NIST's post-quantum security requirements. Our work contributes to the analysis of hash functions on quantum computers. The output from this article can be used to protect IoT communication (high-throughput integrity checks) as well as to analyze the vulnerabilities of these hash functions against brute-force attacks [37].