A Hardware Architecture of NIST Lightweight Cryptography Applied in IPSec to Secure High-Throughput Low-Latency IoT Networks

The Internet of Things (IoT) has grown rapidly in recent years, becoming an integral part of many areas of our lives. Many IoT networks require high data throughput and low latency to support real-time communication and data transmission, enabling improved efficiency, cost savings, and enhanced decision-making capabilities in industries such as manufacturing, healthcare, transportation, and smart cities. However, with the increasing amount of data being transmitted, the security of high-speed IoT networks becomes a critical concern. In this paper, we propose a hardware architecture for Ascon, a NIST lightweight cryptography standard, to enable high-throughput, low-latency security services in IPSec protocols. Results show that the ESP protocol can achieve a maximum throughput of 8.806 Gbps and a minimum latency of 427 ns using only 2812 slices. This ESP core, together with the proposed Ascon implementation, can be used in IoT gateways to provide security services for high-speed, low-latency IoT networks.


I. INTRODUCTION
The IoT is becoming increasingly important in many areas of our daily lives, including smart cities, healthcare, and transportation. IoT networks connect a large number of devices such as sensors, actuators, gateways, and smartphones. These devices can collect and exchange data or send it to the cloud for further processing. However, as more and more devices are connected to the internet, the potential for cyberattacks and data breaches increases, making security a critical concern for organizations that rely on IoT-enabled devices and services.
One of the main challenges associated with IoT security is that many of these devices are resource-constrained, with limited processing power that makes tasks like encryption difficult. In addition, the requirements of IoT systems vary greatly, from high-throughput to area-constrained designs. For this reason, many efforts have been made to research and develop cryptographic algorithms that are well-suited for use in IoT networks [2]. One solution is to use Authenticated Encryption with Associated Data (AEAD) [3], which combines the cryptographic services of confidentiality, integrity, and authentication into one algorithm and often requires fewer resources than traditional schemes.
The associate editor coordinating the review of this manuscript and approving it for publication was Sedat Akleylek.
In August 2018, the US National Institute of Standards and Technology (NIST) began a process to standardize lightweight cryptography (LWC). Most of the submitted lightweight algorithms are AEAD algorithms with optional hash functionality. Of the 57 submissions NIST received for consideration as a standard, 56 were chosen as candidates for the first round. Among them, 32 were selected to progress to the second round. In March 2021, NIST announced ten finalists, and Ascon is one of them. NIST LWC candidates are evaluated based on factors such as area, memory, throughput, latency, and power consumption. Several hardware implementations of Ascon have been reported, such as in [5], [6], [7], and [8], while other works like [4] show that Ascon has the highest throughput-to-area (TPA) ratio compared to other candidates.
Many IoT applications today, such as augmented reality, high-resolution video streaming, self-driving cars, and smart environments, require higher data rates, large bandwidth, and high throughput. With the emergence of Fifth Generation (5G) cellular networks, the demand for high-speed, low-latency, and secure IoT networks is even more significant [9].
In order to secure IoT networks, LWC algorithms have to be implemented in some network security protocols [10]. Many studies have been conducted on the hardware implementation of Ascon, but most of them are focused on the implementation of the algorithm itself without evaluating it in specific network security protocols. Therefore, it is difficult to determine its suitability for protecting IoT networks in real-world scenarios.
IoT networks can utilize a variety of network security protocols, but the most popular are IPSec, TLS, and DTLS. These protocols are commonly used to secure communication in IoT networks and can also be used to create Virtual Private Network (VPN) connections for secure remote access. The IPSec protocol encrypts and authenticates network layer traffic and protects all packets sent along a path. On the other hand, TLS/DTLS, which operates at the transport layer, ensures privacy and data integrity between two communicating applications. The choice of a protocol depends on the specific requirements of IoT networks. This work focuses on IPSec as a solution to secure IoT networks by providing confidentiality, integrity, and authenticity of the data transmitted. To meet the high-speed requirements of IoT networks, the IPSec processing operation, including the AEAD algorithm, must be implemented in hardware. Our previous work in [1] presented a high-throughput FPGA implementation of the IPSec core with two different cipher suites, AES-256-GCM-16 and AES-256-CTR-HMAC-SHA256. This paper presents a hardware acceleration approach for the Ascon-128a algorithm on FPGA, enabling high-throughput and low-latency IPSec protocol implementations.
Our contribution is a novel architecture for the Ascon implementation suitable for an IPsec design, which enables high-throughput, low-latency security services. While there has been significant interest in hardware implementations of Ascon, most existing studies have focused only on implementing the algorithm itself without evaluating its effectiveness in specific network security protocols. Furthermore, many of these implementations use hardware APIs for a fair comparison with other LWC ciphers, but these APIs are not suitable for high-speed network designs.
Our work provides one of the first comprehensive insights into the hardware implementation of Ascon across the entire IPSec network stack. We also present our implementation of the IPSec protocol in FPGA using the proposed Ascon architecture and compare the results with other IPsec implementations. Our findings suggest that the proposed architecture can significantly improve the performance of IPsec implementations while maintaining a high level of security.

II. RELATED WORK
This section discusses the recent hardware implementations of Ascon presented in the literature. Additionally, some works on implementing the IPSec protocol in FPGAs are also reviewed.

A. ASCON IMPLEMENTATION
The first Ascon hardware implementation is presented in [5]. This paper examines two implementation strategies, one that prioritizes high throughput and another that aims for minimal area usage. The high-throughput design allows for unrolling 1 to 6 rounds in a single clock cycle, while the low-area design processes only one bit per cycle. The work in [11] was written by the same authors and achieved similar results. However, these implementations were designed for the application-specific integrated circuit (ASIC) platform, whereas our target platform is field-programmable gate arrays (FPGAs).
The FPGA implementations in [12] are carried out for ASCON-128 and KETJE-SR from Competition for Authenticated Encryption: Security, Applicability, and Robustness (CAESAR) using CAESAR Hardware API. The widths of I/O buses are set at 32-bit, and they use single-port memory to store the key along with state initialization constants. The results were evaluated in terms of area and throughput on the Spartan-6 FPGA.
In [6], the authors presented FPGA implementations of Ascon and ACORN that were selected as CAESAR finalists. They compared the throughput and throughput-to-area ratio of these ciphers with AES-GCM. In [14], the authors implemented several CAESAR Authenticated Ciphers, including Ascon, using an improved version of the CAESAR HW Development Package designed for lightweight applications. They provided high-speed and lightweight implementations of these ciphers on Spartan-6 FPGA.
The authors in [4] proposed FPGA implementations of six NIST LWC Round 2 candidate ciphers, including Ascon, on the Artix-7, Spartan-6, and Cyclone-V FPGAs. Implementations are compliant with the LWC hardware applications programming interface (LWC HW API [13]) and are tested in actual hardware. They found that SpoC has the smallest area and power, while Ascon has the highest throughput-to-area (TPA) ratio. A comparative study of FPGA and ASIC implementations of Ascon and several other algorithms selected from the CAESAR competition is discussed in [7]. The use of dynamic partial reconfiguration (DPR) technology for FPGAs, which enables switching between different lightweight ciphers and reduces area utilization, is presented in [16].
In [15], the authors introduced the performance and resource utilization of the cryptosystems implemented for both constrained devices (ASCON-128) and the high-performance platforms (hybrid ASCON) denoted as H_ASCON. The authors in [17] provide a hardware implementation of Ascon using different strategies (unrolled, round-based, and serialized) and apply it in different applications.
In [30], the authors explore the hardware performance of ASCON for artificial intelligence (AI) enabled IoT devices. Unrolled and recursive strategies have been adopted for ASCON implementations on Virtex-4, Virtex-7, and Spartan-6 FPGA families.
The authors in [30] propose low-cost error-detection mechanisms as countermeasures against fault attacks for the hardware implementations of ASCON. The proposed error detection schemes are also benchmarked on two FPGA families (Spartan-7 and Kintex-7).
In [32], a flexible, reconfigurable, and energy-efficient crypto-processor to run ASCON is introduced. The proposed ASCON crypto-processor runs in six different modes: Encryption, decryption, and hash function with different data sizes.
Most of the Ascon implementations mentioned above are not suitable for high-throughput, low-latency networks due to their interface requirements (using a hardware API or memory-mapped access), which result in additional cycles for bus conversion or memory read/write operations and make integration into the IPSec stack difficult. Furthermore, the implementations need to be redesigned to efficiently process and handle different AEAD states without relying on complicated instructions or headers, such as those described in [13]. This work introduces an architecture specifically designed for the implementation of Ascon that meets the requirements of high-throughput, low-latency networks.

B. IPSEC IMPLEMENTATION
In [18], the authors proposed an IPSec implementation on Xilinx Virtex-II Pro FPGA. They moved the key management protocol into the software that runs on the PowerPC. The IPSec protocol was implemented using a softcore processor while encryption and authentication algorithms were performed in hardware.
In [19], an IPSec implementation on the ML410 board is presented. The design uses AES128 in CBC mode as the encryption algorithm and AES-XCBC-MAC-96 as the integrity algorithm. The IPSec gateway relies on hardware for time-critical operations like data encryption, network filtering, and packet routing. It uses many hard macros provided by the Xilinx Virtex-4 FX-series FPGA, such as memories, media access controllers, and the embedded PowerPC CPU.
In [20], the authors propose an architecture for implementing IPSec on a Xilinx Virtex-4. The proposed solution is based on the partial reconfiguration technique. They use a round-robin scheduling algorithm to switch between Encapsulating Security Payload (ESP), Authentication Header (AH) and Internet Key Exchange (IKE) in hardware. However, this approach comes with a time delay for switching the crypto core and thus does not allow for extremely high throughput in a typical setting.
In [21], the authors presented a multi-core architecture to implement IPSec protocol. This multi-core architecture can be configured with the number of AH/ESP cores and AES-HMAC-SHA-1 cores to achieve high throughput.
In [24], an FPGA-based reconfigurable IPSec ESP core is presented that can provide security services to IoT applications as a bump-in-the-wire (BITW) IPSec solution. The design supports ESP transport and tunnel modes, and results are reported for Virtex FPGAs.
The authors in [22] proposed using parallel encryption modules for a single task using a round-robin style. Parallel access to SAD in this solution was done using an arbiter system in its own clock domain for the best performance.
In [23], the authors presented a complete IPSec implementation, both IPSec-AH and IPSec-ESP protocols, each with transport and tunnel mode operation. They use AES in CTR mode for IPSec-ESP and SHA-3 for IPSec-AH. The performance of IPSec-ESP tunnel mode was measured without packet integrity.
The hardware implementation of IPSec using AES for security services has been extensively studied. However, there has been limited research on the hardware implementation of Ascon AEAD within IPSec. This work proposes one of the first IPSec implementations using the Ascon architecture in hardware and compares the results with other IPSec implementations.

A. AUTHENTICATED ENCRYPTION WITH ASSOCIATED DATA
Traditional cryptographic algorithms only provided confidentiality and integrity/authenticity as separate services, but this approach alone was not enough to protect confidentiality without also ensuring integrity. This led to the development of the concept of authenticated encryption (AE). Authenticated encryption ensures the protection of both confidentiality and integrity under a single secret key. In addition, authenticated encryption also provides the authenticity of unencrypted data, such as TCP/IP headers for routing packets or AH/ESP headers for IPSec protocol. This scheme is called authenticated encryption with associated data (AEAD) [25].
The inputs of AEAD typically consist of key K, nonce N (a public number), associated data AD of arbitrary length and plaintext message PT, also of an arbitrary length. After encryption, the outputs are ciphertext CT and the generated tag, T (Figure 1). The decryption process takes key K, nonce N, associated data AD, ciphertext CT, and tag T as input. If the value of the calculated tag in the decryption process is equal to the tag T provided, then it is considered valid, and the original plaintext (PT) is revealed. If the calculated tag in the decryption process does not match the provided tag, the decryption is considered to have failed, and the transaction is rejected.

B. ASCON
Ascon is a cipher suite that provides authenticated encryption with associated data (AEAD) and hashing functionality. The Ascon cipher suite utilizes the sponge design methodology, which shares some construction similarities with the SHA-3 contest winner Keccak [26]. The design rationale behind Ascon is to provide the best trade-off between security, size and speed in both software and hardware.
The Ascon cipher suite has two versions, Ascon-128 and Ascon-128a [27]. These schemes have different parameters, as shown in Table 1.
Ascon-128a uses 8 rounds to process each data block, as opposed to 6 rounds in Ascon-128, but it also has a larger block size of 128 bits, double that of Ascon-128's 64 bits. The Ascon-128a cipher suite therefore has a higher throughput than Ascon-128. Since we are targeting high-speed IoT networks, the Ascon-128a algorithm is more suitable. Note that Ascon-128a is the secondary recommendation after Ascon-128, and it is currently uncertain whether the finalized standard will include both of them. However, since the two algorithms are not significantly different, transitioning from one to the other should not pose any major challenges.
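As a back-of-the-envelope check of this throughput argument, the number of data bits absorbed per permutation round can be compared for the two variants. This estimate considers only the data-processing phase and ignores initialization and finalization, so it is a rough guide for long messages rather than a measured result:

```latex
\left.\frac{r}{b}\right|_{\text{Ascon-128}} = \frac{64}{6} \approx 10.7 \ \text{bits/round},
\qquad
\left.\frac{r}{b}\right|_{\text{Ascon-128a}} = \frac{128}{8} = 16 \ \text{bits/round},
\qquad
\frac{16}{64/6} = 1.5
```

Under this simplification, Ascon-128a processes about 1.5 times as much data per round as Ascon-128.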
Ascon has a state of 320 bits and two permutations $p^a$ and $p^b$. The 320-bit state $S$ is divided into an outer part $S_r$ of $r$ bits and an inner part $S_c$ of $c$ bits, where the rate $r$ and capacity $c = 320 - r$ depend on the Ascon variant. The 320-bit state $S$ is split into five 64-bit register words:
$$S = S_r \,\|\, S_c = x_0 \,\|\, x_1 \,\|\, x_2 \,\|\, x_3 \,\|\, x_4$$
Ascon authenticated encryption or decryption consists of four phases: Initialization, Associated Data (AD), Processing Plaintext/Ciphertext, and Finalization.

1) INITIALIZATION
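The 320-bit initial state of Ascon is constructed from the secret key $K$, the nonce $N$, and an initialization vector (IV) that is predefined by the algorithm:
$$S \leftarrow IV \,\|\, K \,\|\, N$$
The initial state goes through $a$ rounds of the round transformation and is then XORed with the secret key $K$, where $k$ is the key length in bits:
$$S \leftarrow p^a(S) \oplus (0^{320-k} \,\|\, K)$$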

2) ASSOCIATED DATA
Ascon appends a single 1 and the smallest number of 0s to the associated data $A$ to obtain a multiple of $r$ bits and splits it into $s$ blocks of $r$ bits, $A_1, \ldots, A_s$. In case $A$ is empty, no padding is applied and $s = 0$. Each block $A_i$ with $i = 1, \ldots, s$ is XORed to the first $r$ bits $S_r$ of the state $S$, followed by $b$ rounds of the round transformation:
$$S \leftarrow p^b((S_r \oplus A_i) \,\|\, S_c), \quad 1 \le i \le s$$

3) PROCESSING PLAINTEXT/CIPHERTEXT
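The same padding rule, a single 1 followed by the smallest number of 0s, is applied to the plaintext $P$, which is split into $t$ blocks $P_1, \ldots, P_t$ of $r$ bits. Each ciphertext block is generated by XORing the plaintext block with the first $r$ bits $S_r$ of the state $S$; every block except the last is followed by $b$ rounds of the round transformation:
$$C_i \leftarrow S_r \oplus P_i, \qquad S \leftarrow \begin{cases} p^b(C_i \,\|\, S_c) & \text{if } 1 \le i < t \\ C_i \,\|\, S_c & \text{if } i = t \end{cases}$$
The last ciphertext block is truncated to the length $\ell = |P| \bmod r$ of the unpadded last plaintext block:
$$\tilde{C}_t \leftarrow \lfloor C_t \rfloor_{\ell}$$
The decryption process is similar:
$$P_i \leftarrow S_r \oplus C_i, \qquad S \leftarrow p^b(C_i \,\|\, S_c), \quad 1 \le i < t$$
$$\tilde{P}_t \leftarrow \lfloor S_r \rfloor_{\ell} \oplus \tilde{C}_t, \qquad S \leftarrow (S_r \oplus (\tilde{P}_t \,\|\, 1 \,\|\, 0^{r-1-\ell})) \,\|\, S_c$$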
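The padding rule shared by the associated data and the plaintext can be sketched in a few lines of Python. This is a minimal illustration assuming byte-aligned inputs, where the single 1 bit followed by 0 bits becomes the byte 0x80 followed by zero bytes; the function name `pad_and_split` and its parameters are hypothetical and are not taken from the paper or its source code.

```python
def pad_and_split(data: bytes, rate_bytes: int, is_associated_data: bool) -> list:
    """Apply the 1-then-0s padding and split the result into rate-sized blocks."""
    if is_associated_data and len(data) == 0:
        return []  # empty associated data: no padding is applied, s = 0
    # One 0x80 byte (a single 1 bit) plus the fewest zero bytes that reach
    # the next multiple of the rate.
    zeros = rate_bytes - (len(data) % rate_bytes) - 1
    padded = data + b"\x80" + b"\x00" * zeros
    return [padded[i:i + rate_bytes] for i in range(0, len(padded), rate_bytes)]
```

Note that an empty plaintext still yields one padded block, and an input that already fills a whole block gains an extra block that contains only padding.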

4) FINALIZATION
In the finalization, the secret key $K$ is XORed to the internal state and the state is transformed by the permutation $p^a$ using $a$ rounds. The tag $T$ consists of the last (least significant) 128 bits of the state XORed with the last 128 bits of the key $K$, where $\lceil X \rceil^{128}$ denotes the last 128 bits of $X$:
$$S \leftarrow p^a(S \oplus (0^r \,\|\, K \,\|\, 0^{c-k})), \qquad T \leftarrow \lceil S \rceil^{128} \oplus \lceil K \rceil^{128}$$
The encryption algorithm returns $T$ together with the ciphertext. The decryption algorithm returns the plaintext only if the calculated tag value matches the received tag value. The whole encryption process is depicted in Figure 2.

C. PERMUTATIONS OF ASCON
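The permutations of Ascon are built from the round transformation $p$, which is based on the SPN structure and consists of three steps, $p = p_L \circ p_S \circ p_C$: the addition of constants, the substitution layer, and the linear diffusion layer. The permutations $p^a$ and $p^b$ are applied to the 320-bit state $S$ and differ only in the number of rounds. The constant addition step $p_C$ adds a round constant $c_r$ to register word $x_2$ of the state $S$ in round $i$:
$$x_2 \leftarrow x_2 \oplus c_r$$
The substitution layer $p_S$ updates the state $S$ with 64 parallel applications of the 5-bit S-box $S(x)$ defined in Figure 3 to the five registers $x_0, \ldots, x_4$.
The S-box can also be implemented efficiently using the bitsliced technique [27], that is, as a short sequence of XOR, AND, and NOT operations applied to the five 64-bit register words. The linear diffusion layer $p_L$ provides diffusion within each 64-bit register word $x_i$ by XOR and right-rotation (circular shift, $\ggg$) operations:
$$\begin{aligned}
x_0 &\leftarrow x_0 \oplus (x_0 \ggg 19) \oplus (x_0 \ggg 28) \\
x_1 &\leftarrow x_1 \oplus (x_1 \ggg 61) \oplus (x_1 \ggg 39) \\
x_2 &\leftarrow x_2 \oplus (x_2 \ggg 1) \oplus (x_2 \ggg 6) \\
x_3 &\leftarrow x_3 \oplus (x_3 \ggg 10) \oplus (x_3 \ggg 17) \\
x_4 &\leftarrow x_4 \oplus (x_4 \ggg 7) \oplus (x_4 \ggg 41)
\end{aligned}$$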
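To make the four phases and the round transformation concrete, the following Python sketch models Ascon-128a in software. It is an illustrative reading of the Ascon v1.2 specification, not the authors' hardware design: the function names are hypothetical, the IV constant, round constants, and domain-separation bit are taken from the specification rather than from this paper, and the sketch has not been validated here against the official test vectors and is not constant-time.

```python
import hmac

MASK = (1 << 64) - 1
IV_128A = 0x80800C0800000000   # k=128, r=128, a=12, b=8 (per the Ascon v1.2 spec)
RATE, A, B = 16, 12, 8         # rate in bytes, rounds of p^a and p^b
ROT = [(19, 28), (61, 39), (1, 6), (10, 17), (7, 41)]  # rotation pairs of p_L


def rotr(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK


def sbox_layer(S):
    """Bitsliced p_S: 64 parallel 5-bit S-boxes using only XOR, AND, NOT."""
    S[0] ^= S[4]; S[4] ^= S[3]; S[2] ^= S[1]
    T = [(S[i] ^ MASK) & S[(i + 1) % 5] for i in range(5)]
    for i in range(5):
        S[i] ^= T[(i + 1) % 5]
    S[1] ^= S[0]; S[0] ^= S[4]; S[3] ^= S[2]; S[2] ^= MASK


def permute(S, rounds):
    """Apply the last `rounds` of the 12 round transformations in place."""
    for r in range(12 - rounds, 12):
        S[2] ^= 0xF0 - 0x0F * r              # p_C: constants 0xF0, 0xE1, ..., 0x4B
        sbox_layer(S)                        # p_S
        for i, (m, n) in enumerate(ROT):     # p_L
            S[i] ^= rotr(S[i], m) ^ rotr(S[i], n)


def words(b16):
    return int.from_bytes(b16[:8], "big"), int.from_bytes(b16[8:16], "big")


def to_bytes(w0, w1):
    return w0.to_bytes(8, "big") + w1.to_bytes(8, "big")


def pad(data):
    return data + b"\x80" + bytes(RATE - len(data) % RATE - 1)


def _start(key, nonce, ad):
    k0, k1 = words(key)
    n0, n1 = words(nonce)
    S = [IV_128A, k0, k1, n0, n1]            # 1) initialization
    permute(S, A)
    S[3] ^= k0; S[4] ^= k1
    if ad:                                   # 2) associated data (skipped if empty)
        padded = pad(ad)
        for off in range(0, len(padded), RATE):
            a0, a1 = words(padded[off:off + RATE])
            S[0] ^= a0; S[1] ^= a1
            permute(S, B)
    S[4] ^= 1                                # domain separation bit (from the spec)
    return S, k0, k1


def _tag(S, k0, k1):
    S[2] ^= k0; S[3] ^= k1                   # 4) finalization
    permute(S, A)
    return to_bytes(S[3] ^ k0, S[4] ^ k1)


def encrypt(key, nonce, ad, pt):
    S, k0, k1 = _start(key, nonce, ad)
    padded, ct = pad(pt), b""
    for off in range(0, len(padded), RATE):  # 3) plaintext processing
        p0, p1 = words(padded[off:off + RATE])
        S[0] ^= p0; S[1] ^= p1
        ct += to_bytes(S[0], S[1])
        if off + RATE < len(padded):         # no p^b after the last block
            permute(S, B)
    return ct[:len(pt)], _tag(S, k0, k1)


def decrypt(key, nonce, ad, ct, tag):
    S, k0, k1 = _start(key, nonce, ad)
    full, pt = len(ct) - len(ct) % RATE, b""
    for off in range(0, full, RATE):
        c0, c1 = words(ct[off:off + RATE])
        pt += to_bytes(S[0] ^ c0, S[1] ^ c1)
        S[0], S[1] = c0, c1
        permute(S, B)
    last = bytes(c ^ s for c, s in zip(ct[full:], to_bytes(S[0], S[1])))
    l0, l1 = words(pad(last))
    S[0] ^= l0; S[1] ^= l1
    if not hmac.compare_digest(_tag(S, k0, k1), tag):
        return None                          # tag mismatch: reject the message
    return pt + last
```

With a 16-byte key and nonce, `encrypt(key, nonce, ad, pt)` returns a ciphertext of the same length as `pt` plus a 16-byte tag, and `decrypt` returns `None` whenever the tag check fails. In the hardware architecture the same steps are unrolled into combinational logic rather than executed as a loop.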

IV. DESIGN FOR HIGH THROUGHPUT AND LOW LATENCY IPSEC
The software implementation of IPSec (in the kernel space) generally cannot provide very high speeds and low latency. The protocol must be implemented entirely in hardware to achieve these goals. A general architecture of IPSec in hardware is shown in Figure 4, and it has the following main components:
- Header parser analyzes the packet header and identifies its type, such as ARP (Address Resolution Protocol), IP (Internet Protocol), ESP (Encapsulating Security Payload), or other network protocols.
- FIFO (First In, First Out) stores packets in memory while they wait for processing.
- Packet filter inspects packets and selectively allows, protects, or blocks them based on predefined rules in the security policy database.
- SAD (Security Association Database) controller retrieves cryptographic keys and other parameters required to protect incoming packets.
- ESP process provides confidentiality, authentication, and integrity services following the ESP protocol.
An IPSec design must run on a wide datapath or system bus to achieve high throughput. A wider datapath or system bus allows more data to be processed at a time, resulting in better throughput. Industry-standard buses, such as the AXI (Advanced eXtensible Interface) Stream bus, support large data widths and allow for flexible adjustment of the data width. In addition, an IPSec design must use as few clock cycles as possible for each operation to achieve low latency. Latency in a network is critical, as it can result in packet loss or poor Quality-of-Service (QoS).
The most time-consuming tasks in IPSec are database lookup and cryptographic operations (encryption, decryption, packet integrity). While database lookup in FPGA can benefit from using a single-clock-cycle CAM/TCAM (Content Addressable Memory/Ternary Content Addressable Memory) search, the performance of cryptographic operations depends on the nature of the algorithm. The AES cipher is widely adopted in the IPSec standard. However, AES does not provide authentication and integrity services on its own, so authentication and integrity algorithms, such as HMAC-SHA, are generally used alongside it. This approach can result in higher latency, as computing HMAC-SHA for large packets is a computationally expensive process and cannot be parallelized easily. The AES-GCM algorithm, which has emerged in recent years, can provide both confidentiality and integrity protection with minimal computational overhead. Nevertheless, it is often implemented using a pipelined architecture that may require additional resources to achieve low latency. In this work, we evaluate the performance of the Ascon AEAD cipher as a potential replacement for AES-GCM in IoT networks.

V. PROPOSED ARCHITECTURE FOR ASCON
Most Ascon implementations are built on top of the Hardware API for Lightweight Cryptography [13]. This ensures compatibility among implementations of the same algorithm by different designers, fairness, and ease of benchmarking and evaluation. A simplified version of the LWC interface is shown in Figure 5.
The LWC interface can communicate directly with interfaces like AXI4 and FIFO. However, this hardware API also includes a communication protocol consisting of commands, headers, and data to process the different data types required for authenticated encryption and decryption. This complicates the design, as designers have to follow the header format and the sequence of operations in the protocol. Moreover, the LWC hardware API restricts the bus width to 8, 16, or 32 bits, making it ill-suited to high-speed applications that require 64-bit or 128-bit data buses. Therefore, a new hardware architecture must be designed for this purpose. The data bus of the cipher must match the bus of the surrounding design in order to avoid wasting additional cycles on bus conversion. Additionally, the authenticated encryption process must be as fast as possible to support high-speed, low-latency IPSec and other designs in IoT networks. The interface of the proposed architecture is shown in Figure 6.
The signals for associated data, datain and dataout have a 128-bit datapath and comply with the AXI stream standard.
The key and nonce are loaded at the beginning of the process when the data is valid. The signal en_dec controls whether the process performs encryption or decryption.
The hardware architecture for Ascon-128a is illustrated in Figure 7. During the Initialization and Finalization phases, the Ascon permutation is unrolled four times, while in Associated Data and Processing Plaintext/Ciphertext phases, the Ascon permutation is unrolled eight times. The permutation in Initialization and Finalization phases is 12 rounds, so Ascon permutation unrolled ×4 will be iterated three times. These processes are executed only once, and as such do not affect the overall performance of the design. An iterative loop strategy is used to minimize resource usage. In order to maximize the throughput of the design, Ascon permutation unrolled ×8 is used in Associated Data and Processing Plaintext/Ciphertext phases. This approach allows the system to efficiently process 128-bit data in each clock cycle, resulting in faster and more efficient encryption and decryption operations.
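The unrolling schedule described above can be turned into an idealized cycle estimate. The sketch below counts only the clock cycles spent in the permutation, following the algorithm's phase structure with the unroll factors stated in the text. It is an assumption-laden model rather than a description of the actual core: it ignores key and nonce loading, handshaking, and any other control overhead, and the function names are hypothetical.

```python
import math

A_ROUNDS, B_ROUNDS = 12, 8   # rounds of p^a and p^b in Ascon-128a
RATE_BYTES = 16              # 128-bit rate


def padded_blocks(n_bytes: int, skip_if_empty: bool) -> int:
    """Number of rate-sized blocks after the 1-then-0s padding."""
    if skip_if_empty and n_bytes == 0:
        return 0
    return n_bytes // RATE_BYTES + 1


def permutation_cycles(ad_bytes: int, pt_bytes: int,
                       slow_unroll: int = 4, fast_unroll: int = 8) -> int:
    """Idealized cycle count spent in the permutation for one message."""
    init = math.ceil(A_ROUNDS / slow_unroll)        # 12 rounds / 4 per cycle = 3
    final = math.ceil(A_ROUNDS / slow_unroll)       # 3 cycles again
    per_block = math.ceil(B_ROUNDS / fast_unroll)   # 8 rounds / 8 per cycle = 1
    s = padded_blocks(ad_bytes, skip_if_empty=True)
    t = padded_blocks(pt_bytes, skip_if_empty=False)
    # p^b runs after every AD block and after every plaintext block but the last.
    return init + per_block * (s + t - 1) + final
```

For example, under this model a message with 8 bytes of associated data and 64 bytes of plaintext spends 3 + 5 + 3 = 11 cycles in the permutation, whereas using the ×4 unrolled permutation for the data phases as well would raise this to 16 cycles.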

VI. IPSEC CORE WITH ASCON
To put the Ascon hardware architecture into practical use, we integrate it within the IPSec protocol. For the implementation of the IPSec core in FPGA, we utilize the IPSec ESP implementation in [1] with some modifications for the Ascon algorithm, as shown in Figure 8.
VOLUME 11, 2023 89245 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In the ESP encapsulation process, for a fair comparison and compatibility with different cipher suites, we maintain the ESP header and ESP trailer format as shown in Table 2. The IPSec core uses Ascon authenticated encryption instead of AES-GCM or AES-CTR/HMAC-SHA [1]. In GCM and CTR modes, the authentication and integrity operations must be implemented separately, whereas the Ascon cipher combines both operations into a single algorithm. The Ascon hardware architecture is capable of encrypting/decrypting 128 bits of data each clock cycle after the key and nonce have been provided. The nonce is generated in a way that guarantees uniqueness for each packet. When the last data block is processed, the resulting tag is used as the ICV (Integrity Check Value) for packet integrity checks.
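The encapsulation flow can be illustrated with a short Python sketch. Since Table 2 is not reproduced here, the field layout below follows the generic ESP format of RFC 4303 (SPI, sequence number, payload, padding, pad length, next header, ICV), and the choice of the ESP header as associated data is an assumption borrowed from common AEAD usage in ESP. The function `esp_encapsulate` and the `aead_encrypt` callable are hypothetical names; how the nonce is derived or transmitted is deliberately left out.

```python
import struct


def esp_encapsulate(spi, seq, payload, next_header, aead_encrypt, key, nonce):
    """Build an ESP packet body: header | ciphertext | ICV (sketch, RFC 4303 layout).

    aead_encrypt(key, nonce, ad, pt) must return (ciphertext, tag); it stands in
    for the Ascon core and is an assumed interface, not the paper's API.
    """
    header = struct.pack("!II", spi, seq)        # SPI and sequence number, 4 bytes each
    # Pad so that payload + padding + the 2 trailer bytes end on a 4-byte boundary.
    pad_len = (-(len(payload) + 2)) % 4
    padding = bytes(range(1, pad_len + 1))        # RFC 4303 default padding: 1, 2, 3, ...
    plaintext = payload + padding + struct.pack("!BB", pad_len, next_header)
    # Assumption: the ESP header is authenticated as associated data.
    ciphertext, icv = aead_encrypt(key, nonce, header, plaintext)
    return header + ciphertext + icv              # the AEAD tag serves as the ICV
```

Passing the AEAD as a callable keeps the sketch independent of the cipher suite, which mirrors the idea of keeping the ESP header and trailer format unchanged across suites.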

VII. RESULTS AND DISCUSSIONS
The Ascon-128a core was synthesized for a Virtex-7 device using Xilinx Vivado 15.4. The proposed Ascon-128a source code is available on GitHub [35]. The results are compared with other implementations in Table 3.
The Ascon-128a core can process 128 bits of data in each clock cycle at a maximum frequency of 104 MHz, resulting in a maximum throughput of 128 × 104 = 13312 Mbps. This throughput is 57% higher than in [29] but lower than in [17]. As discussed in Section IV, our primary goal is to design an IPSec system with high throughput and low latency. In order to deliver security services with minimal latency, it is essential that the design uses as few clock cycles as possible for each operation. Our proposed architecture needs one clock cycle for each 128-bit data block, while [17] takes two. The IPSec system is a complex system with multiple components, making it sensitive to even a single clock cycle of delay, which can adversely affect all other components. The resource utilization of our implementation is also higher, as we use the Ascon permutation unrolled ×8 instead of the unrolled ×4 architecture used in [17] and [29]. This leads to a lower throughput-to-area (TPA) ratio compared to [17] and [29], but it is still higher than that of the remaining implementations. Although optimizing for area-constrained environments is crucial in IoT, certain scenarios prioritize high-performance lightweight cryptography to secure real-time communication, video processing, or large-scale data transmission [33], [34]. Some examples of such scenarios include:
- Video Surveillance: Real-time streaming and analysis of high-definition video feeds require high throughput to efficiently transmit, process, and store video data, especially when dealing with multiple cameras or performing video analytics at the edge.
- IoT Gateways: Acting as intermediaries between IoT devices and backend systems, gateways handle data aggregation, protocol translation, and security functions. High throughput capabilities are crucial to efficiently manage incoming and outgoing data streams, particularly in scenarios with numerous connected devices.
The IPSec core and Ascon are synthesized and implemented on the Xilinx Zynq-7000 SoC. With this Ascon-128a cipher suite, the maximum frequency of the IPSec core in the Zynq-7000 is 98.2 MHz. The core takes 129 clock cycles to process a 1446-byte packet, resulting in a throughput of 8.806 Gbps. The distribution of the 129 clock cycles is given in Table 4. Figure 9 shows the speed for different packet lengths from 64 to 1446 bytes. Figure 10 shows the latency of the IPSec operations for different packet lengths. The minimum latency is 427 ns (42 clock cycles) for 64-byte packets. The implementation result is also compared with other works in Table 5.
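The reported throughput and latency figures follow directly from the bit counts, cycle counts, and clock frequencies given in the text. The short Python sketch below reproduces that arithmetic; the helper names are hypothetical, and all input values are those stated above.

```python
def throughput_bps(bits: int, cycles: int, f_hz: float) -> float:
    """Throughput when `bits` are processed in `cycles` clock cycles at `f_hz`."""
    return bits * f_hz / cycles


def latency_ns(cycles: int, f_hz: float) -> float:
    """Latency of `cycles` clock cycles at `f_hz`, in nanoseconds."""
    return cycles / f_hz * 1e9


# Ascon-128a core: 128 bits per cycle at 104 MHz.
core_bps = throughput_bps(128, 1, 104e6)

# IPSec ESP core: a 1446-byte packet in 129 cycles at 98.2 MHz.
esp_bps = throughput_bps(1446 * 8, 129, 98.2e6)

# Minimum latency: 42 cycles at 98.2 MHz for a 64-byte packet.
min_latency_ns = latency_ns(42, 98.2e6)
```

Evaluating these expressions gives 13.312 Gbps for the core, approximately 8.806 Gbps for the ESP core, and a latency between 427 ns and 428 ns, in line with the reported values.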
It is important to note that the resource results for the IPSec implementation reported here do not include the SPD and SAD databases as opposed to 64 SPD and 64 SAD used in [1].
The LUT and slice utilization of the IPSec core with the Ascon-128a cipher suite is relatively low when compared to other IPSec cores with AES implementations, yet it still achieves a high speed of 8.806 Gbps. This throughput is lower only than that reported in [21], which achieved 11.28 Gbps. However, it is important to note that the IPSec design in [21] utilized four ESP cores, while our design utilized only one. Furthermore, their reported result is based solely on synthesis without a detailed slice report. While this Ascon architecture is well suited for high-throughput, low-latency networks, its maximum clock frequency is relatively low, which in turn lowers the frequency of the overall IPSec design.

VIII. CONCLUSION
This work proposes a high-performance, low-latency hardware architecture for Ascon, a novel NIST lightweight cryptography standard. Additionally, we present the implementation and evaluation of the IPSec ESP protocol using this architecture in FPGA. All ESP processes and cryptographic algorithms are performed in hardware to achieve high performance with little overhead. The Ascon architecture and IPSec protocol implementation presented in this paper can help address the need for secure, high-speed, low-latency communication in emerging IoT networks. In future work, we will focus on reducing the power consumption of the Ascon and IPSec hardware architecture, targeting resource-constrained, low-power edge devices within IoT networks.