SnapMem: Hardware/Software Cooperative Memory Resistant to Cache-Related Attacks on ARM-FPGA Embedded SoC

ARM-FPGA embedded SoCs have been widely used in the fields of 5G Wireless, next-generation advanced driver-assistance systems (ADASs) and Industrial Internet of Things due to their high performance and hardware design flexibility. However, this type of SoC suffers from various security threats, one of which is cross-domain cache-related attacks, such as Flush+Reload, Flush+Flush, Meltdown, and Spectre. Many hardware and software defenses have been proposed to resist these cross-domain cache-related attacks. However, hardware defenses require modifications of the basic architecture and therefore cannot be deployed on existing devices. On the other hand, software runtime defenses have incomplete coverage or introduce significant performance overhead. In this article, we propose SnapMem, a hardware/software cooperative memory that can make sensitive data burn after reading on ARM-FPGA embedded SoCs. Any process can only access the SnapMem created by itself. Through the cooperation of software and hardware, SnapMem can transfer sensitive data in or out of main memory in real time. Based on this burn-after-reading mechanism, SnapMem can effectively prevent attackers from stealing sensitive data of the victim process or kernel space. Security and performance evaluations show that SnapMem can resist all cross-domain cache-related attacks while introducing lower performance overhead than other software runtime defenses on ARM-FPGA embedded SoCs.

Jingquan Ge and Fengwei Zhang, Senior Member, IEEE

I. INTRODUCTION
IN RECENT years, with the rapid development of automotive electronic systems, Internet of Things and 5G Wireless, the market demand for SoCs with high performance, low power consumption, and versatility is increasing. ARM-FPGA embedded SoCs, which combine both software and hardware, give system architects and ARM developers a flexible platform to meet the performance, power and functional needs of customers. This type of SoC, such as the Xilinx Zynq and Versal series [1], has been widely used in the fields mentioned above. However, like Intel and AMD's x86 products or Qualcomm and Samsung's ARM products, ARM-FPGA embedded SoCs also suffer from a variety of security threats. Among them, cache-related attacks are one of the most prominent threats.
The authors are with the Research Institute of Trustworthy Autonomous Systems and the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: gerty1986823@126.com; zhangfw@sustech.edu.cn).
Digital Object Identifier 10.1109/TCAD.2024.3392082
On the other hand, the main goal of cache-related attacks is to steal sensitive data in main memory. Meanwhile, most of the data in main memory is not sensitive and does not need to be protected. The security of the system can therefore be guaranteed as long as the sensitive data in main memory is protected. However, since cache-related attacks can theoretically steal the entire kernel and process memory, they can ideally steal all data in main memory. A secure approach is therefore to store sensitive data outside main memory. This approach cannot be implemented on general SoC platforms, because general SoCs can only access and process data in main memory. Fortunately, ARM-FPGA embedded SoCs provide a platform for system developers to partially modify the hardware layer. We can freely design hardware peripherals on this type of SoC and mount them on the AXI bus of the ARM CPU. Using this software/hardware cooperative mechanism, we can design a more secure memory mechanism to store sensitive data outside main memory.
In this article, we present SnapMem, a memory mechanism that burns after reading. SnapMem is a software/hardware co-design, which consists of a hardware module and a software module on the ARM-FPGA embedded SoC. The software module has two functions: the first is the management of process access permissions, and the second is the control of the hardware module. Each process can create its own unique SnapMem and can only access its own SnapMem. Any attempt by a process to access the SnapMem of another process is strictly prohibited. In addition, the software module controls the hardware module to move sensitive data from the main memory to the hardware memory or vice versa. When sensitive data is not being accessed, it is stored not in the main memory but in SnapMem's hardware memory. This sensitive data is only moved to the main memory when it is accessed using SnapMem's API. More importantly, SnapMem's hardware memory has no address mapping for the ARM CPU; it can only be accessed by SnapMem's hardware module. Therefore, cache-related attacks are powerless against sensitive data in SnapMem.
We implement SnapMem by designing the software module, the hardware module, and SnapMem's APIs, and by modifying the Linux kernel. We perform various types of cache-related attacks to evaluate the security of SnapMem. The results show that SnapMem can defend against all cache-related attacks, which makes it more secure than other software defense solutions on ARM-based platforms. In addition, we conduct performance evaluations on SnapMem APIs and the original Linux APIs, respectively. Moreover, we evaluate the system performance overhead of SnapMem by running UnixBench and SPECrate2017. The two overheads of SnapMem (0.07% on UnixBench and 0.04% on SPECrate2017) are lower than those of all other software runtime defense schemes.
Our main contributions are summarized as follows.
1) We create a hardware/software co-design of a more secure memory mechanism. This memory mechanism not only isolates sensitive data from main memory, but also isolates sensitive data from different processes, which can resist all cache-related attacks.
2) Our design requires only one software module, one hardware module and minimal modification to the Linux kernel, which does not require modification of the basic CPU architecture. Therefore, it can be easily deployed on ARM-FPGA embedded devices that have been widely used.
3) We conduct security and performance evaluations on the real ARM-FPGA embedded SoC. The security results show that the defense capability of this memory mechanism can cover all cache-related attacks. Performance evaluations indicate that our design has lower performance overhead than all other software runtime defenses.
4) We propose a new idea for the secure improvement of memory mechanism. This memory mechanism only requires isolated control and memory modules to be mounted on the data and address buses, so it can be widely used on various platforms, such as ASICs and CPU-FPGAs.

II. BACKGROUND AND MOTIVATION
In this section, we first give a detailed description of the ARM-FPGA embedded SoC. Then, we discuss the evolution of cache-related attacks. Finally, we introduce the existing defenses and their shortcomings in depth.

A. ARM-FPGA Embedded SoC
ARM-FPGA embedded SoCs combine the software programmability of ARM processors and the hardware programmability of FPGAs, which enables differentiated and customizable solutions. This type of SoC [1], [41] has flooded the market and is widely used in areas such as autonomous driving, 5G communications, and the Internet of Things. Among these SoCs, Zynq UltraScale+ MPSoC [42] is a typical representative. For simplicity and without loss of generality, we take Zynq UltraScale+ MPSoC as an example to explain the hardware architecture of the ARM-FPGA embedded SoC in detail.
Fig. 1 shows the simplified hardware architecture of ZU9EG, a representative product of the Zynq UltraScale+ MPSoC family. To be more intuitive, Fig. 1 omits hardware components that are not related to our design, including the dual-core Arm Cortex-R5F and the on-chip memory. As can be seen from Fig. 1, the simplified hardware architecture of ZU9EG consists of three parts, namely the quad-core Arm Cortex-A53 MPCore, the main memory, and the FPGA. On Zynq UltraScale+ MPSoC, the communication between the ARM MPCore and the FPGA is based on the AXI bus. Similarly, the communication between the FPGA and the main memory is also based on the AXI bus. In other words, the ARM MPCore can access the FPGA through the AXI bus, while the FPGA can also use the AXI bus to access the main memory.
The FPGA, in turn, contains two important types of components, namely custom IP and block RAM. Block RAM is the storage space owned by the FPGA itself. These block RAMs can be configured to be visible or invisible to the AXI bus. When the block RAMs are configured to be invisible to the AXI bus, they can only be accessed by the custom IP of the FPGA. We design SnapMem based on this feature of block RAM. The custom IP represents the hardware programming capability of the FPGA; it is the hardware module designed by the hardware engineer. When the custom IP needs to communicate with the ARM MPCore or the main memory, an AXI interface needs to be configured for it. AXI interfaces [43] can be divided into two categories in terms of function, namely master interfaces and slave interfaces. As the name implies, a master interface can actively access the main memory according to the physical address. Correspondingly, a slave interface means that the custom IP has its own physical address and can be accessed by the ARM MPCore. In the design of SnapMem, both AXI master and slave interfaces are required.

B. Cache-Related Attacks
Cache-related attacks can be divided into three categories, namely low resolution cache attacks, high resolution cache attacks, and transient execution attacks. Low resolution cache attacks were launched by early security researchers using statistical methods [2], [5], [8], [9], [44], [45]. Because this type of attack is noisy and its efficiency is very low, it is called a low resolution cache attack. In the second decade of the 21st century, cache attacks based on cache flush instructions or operations began to emerge, such as Evict+Reload [11], Flush+Reload [10], and Flush+Flush [13]. Lipp et al. [14] successfully implemented the Evict+Reload, Flush+Reload and Flush+Flush attacks on ARMv8-A. Since 2018, transient execution attacks have attracted researchers' attention, including Spectre-BTB [16], Spectre-PHT [16], [17], Spectre-STL [18], Meltdown [15], Meltdown-GP [23], MeltdownPrime, and SpectrePrime [25]. This type of attack exploits the transient execution vulnerability of the CPU and steals sensitive information through the cache side channel. Therefore, it is also an important category of cache-related attacks.

C. Defenses and Limitations
Hardware Defenses: The most effective defense solution for Meltdown, Spectre and their variants is to modify the hardware architecture of modern processors. In the past three years, many hardware defense schemes have been proposed, such as SafeSpec [26], Conditional Speculation [28], SpectreGuard [30], ConTExT [27], InvisiSpec [46], STT [47], Reusetrap [29], SpecCFI [31], MuonTrap [32], and GhostMinion [33]. Most of these schemes are effective against Meltdown and Spectre variants and have little performance overhead. However, modifications to the processor hardware architecture can only be implemented on next-generation products. These defenses cannot be deployed on existing devices.
Software Runtime Defenses: Before the discovery of Meltdown and Spectre, researchers [34], [35], [48] often defended against cache-related attacks by monitoring cache activity in real time. But these runtime defenses are only effective for Prime+Probe or encryption-targeted Flush+Reload. They are incapable of defending against Meltdown, Spectre and their variants. In the Linux kernel for ARMv8-A, there are three runtime defenses that can effectively defend against Meltdown and Spectre variants, namely kernel page table isolation (KPTI) [39], Spectre-BTB mitigation [37], and Spectre-STL mitigation [38]. However, each of these defenses targets only one Meltdown or Spectre variant. They cannot cover all Meltdown or Spectre variants. Moreover, each of the three defenses introduces a certain performance overhead. In addition, there are two software runtime defenses [49], [50] that focus on designing more secure APIs to resist cache-related attacks. But both of these works are only effective for cache-related attacks that use flush instructions.
Static Code Fixing: Static code analysis/fixing defenses [40], [51], [52] are effective for defending against or detecting cache-related attacks. However, fixing static code imposes a large performance overhead on the runtime system [40].
Software/Hardware Collaborative Defenses: Two encryption implementations [53], [54] based on software/hardware co-design can resist the cache timing attack, which is an old, low resolution cache-related attack. These two defenses are powerless against the new high resolution cache-related attacks. There are two software/hardware collaborative defenses [55], [56] that can resist the latest high resolution cache-related attacks and Spectre/Meltdown variants. However, these two schemes are based on the detection of flush instructions and cannot defend against cache-related attacks that do not require flush instructions, such as Prime+Probe, Evict+Reload, MeltdownPrime, and SpectrePrime.

III. THREAT MODEL

A. Our Threat Model
Our assumptions about the attacker are as follows. First, the attacker can execute arbitrary code on the same machine as the victim process, but without root privileges. Therefore, the attacker can launch all cache-related attacks, such as Prime+Probe, Evict+Reload, Flush+Reload, Flush+Flush, Meltdown, Spectre, and so on. Since the attacker does not have root privileges, she cannot access the software or hardware modules of SnapMem. Second, the attacker knows the address layout of the victim process or kernel. In fact, it is not very difficult to obtain the address layout of the victim process or kernel. In many attack scenarios, an attacker can legally obtain a copy of the victim's process or kernel binary. Using this binary copy, the attacker can load and fully reproduce the address layout on her own host. The only difference between the address layout of these copies running on the attacker's host and that of the victim process or kernel is the base address. For the victim process, the attacker only needs to exploit a vulnerability such as a memory read to obtain its base address. As for the base address of the victim kernel, it often follows certain operating system rules. For example, in an ARM64-based OS, the kernel base address is usually 0xFFFF000000000000, while in an x86_64-based OS, the kernel base address is usually 0xFFFF800000000000. Based on the obtained address layout, the attacker can infer the virtual and physical addresses of sensitive data in the victim process or kernel space. This assumption enables attackers to clean the targeted cache lines to launch high resolution cache-related attacks.
In summary, the attacker can implement all cross-domain cache-related attacks to steal sensitive data throughout the main memory. However, the attacker cannot access SnapMem's modules and thus cannot affect SnapMem's data and control flow. Therefore, the attacker has one and only one way to access SnapMem, which is to call SnapMem's APIs.

B. Threat Model Comparison
In order to explain SnapMem's threat model more clearly and intuitively, we compare SnapMem with existing defense solutions. According to the introduction in Section II-C, defense solutions can be divided into four categories, namely hardware defenses, software runtime defenses, static code fixing, and software/hardware collaborative defenses. Among them, the threat models of the first two categories are relatively clear, while the threat models of the latter two categories are relatively vague. Therefore, we focus on comparing SnapMem with hardware defenses and software runtime defenses.
The first category is hardware defense solutions that modify the hardware architecture, such as Conditional Speculation [28], MuonTrap [32], SpecCFI [31], SpectreGuard [30], InvisiSpec [46], and STT [47]. The threat models for this type of hardware defense solution are almost identical and can be summarized into three assumptions. The first assumption of the hardware defense solutions is the same as SnapMem's, which is that the attacker can run code on the same machine as the victim process. The second assumption of the hardware defense solutions is quite different from SnapMem's. SnapMem assumes that attackers can launch all cache-related attacks, including transient execution attacks (Meltdown, Spectre and their variants), as well as traditional cache-related attacks (Prime+Probe, Evict+Reload, Flush+Reload and Flush+Flush). The hardware defense solutions assume that attackers can only launch transient execution attacks. The third assumption of the hardware defense solutions is the same as SnapMem's, which also requires the attacker to obtain the address layout of the victim process. As can be seen from the above comparisons, the threat model of the hardware defense solutions makes weaker assumptions about the attacker's capabilities than SnapMem's.
The second category is software defense solutions, such as KPTI [39], HomeAlone [34], and CloudRadar [35]. The threat model assumptions of these three software defense solutions are very similar to SnapMem's. However, when these three software defense solutions were proposed, transient execution attacks had not yet been discovered. Therefore, their threat models only assume traditional cache-related attacks and do not include transient execution attacks. In other words, these three software defense solutions make weaker assumptions about attacker capabilities than SnapMem.
Overall, SnapMem's threat model makes stronger assumptions about attacker capabilities than existing defense solutions. In other words, SnapMem's defense capabilities are more powerful than those of existing defense solutions.

IV. DESIGN
In this section, we first introduce the entire design of SnapMem. Then, we provide details of the software and hardware modules of SnapMem.

A. Overview of SnapMem
SnapMem is a more secure memory mechanism for sensitive data. Its core idea is to hide sensitive data where the CPU cannot access it unless SnapMem's APIs are called. SnapMem's APIs have a process-based authentication mechanism. In other words, each process can create its own SnapMem, but can only access the SnapMem created by itself. Similarly, a malicious process can create its own SnapMem, but cannot access the SnapMem of other processes. Even if the malicious process can read the entire main memory using cache-related attacks, it still cannot steal sensitive data in SnapMem, because the sensitive data is stored in SnapMem's hardware memory, not in the main memory. Moreover, the hardware memory of SnapMem can only be accessed by the hardware module of SnapMem, not by the CPU.
Fig. 2 shows the entire hardware and software architecture of SnapMem. From Fig. 2, we can see that the entire software and hardware system is divided into three layers, namely user space, kernel space, and hardware layer. SnapMem has components in all three layers. In user space, SnapMem has a total of four API components, which are SnapMem_alloc(), SnapMem_read(), SnapMem_write(), and SnapMem_free(), respectively. As the name implies, the function of SnapMem_alloc() is to allocate hardware memory of SnapMem. The functions of SnapMem_read() and SnapMem_write() are to read and write SnapMem's hardware memory, respectively. SnapMem_free() completes the release of SnapMem's hardware memory.
In kernel space, the design of SnapMem mainly includes the SnapMem software module and the modifications to two kernel functions (_do_fork() and do_exit()). The SnapMem software module implements the functions of the four SnapMem APIs in kernel space. Process identity authentication is done by the Process ID Checker submodule. In the hardware layer, SnapMem has two components: 1) the SnapMem hardware module and 2) the SnapMem memory. The SnapMem hardware module communicates with software and transfers data between the hardware and the main memory. The Controller submodule is mainly responsible for communicating with software. Data movement is completed by another submodule, Data Mover. The function of SnapMem memory is to store sensitive data on hardware.
To reduce the overhead under certain conditions, we design two versions of SnapMem, named SnapMem-strict and SnapMem-light. The strict version of SnapMem actively clears sensitive data in the main memory after each read and write. The light version of SnapMem does not actively clear the main memory after each read and write. SnapMem-light passively clears the main memory only when the legitimate process actively calls the two APIs SnapMem_light_open() and SnapMem_light_close() to close the SnapMem.
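The difference between the two versions can be illustrated with a small software model. This is only a sketch under stated assumptions, not the implementation evaluated in this article: `hw_mem` stands in for the block RAM, `main_mem` for the kernel buffer in main memory, and the region size, the helper names, and the exact semantics assumed for SnapMem_light_open() (expose the data) and SnapMem_light_close() (hide and clear it) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SNAPMEM_BYTES 16  /* hypothetical region size for this model */

/* Software model: hw_mem stands in for the block RAM that the CPU cannot
 * address, main_mem for the kernel buffer that cache attacks can reach. */
typedef struct {
    unsigned char hw_mem[SNAPMEM_BYTES];
    unsigned char main_mem[SNAPMEM_BYTES];
} snapmem_model;

/* Data Mover direction 1: hardware memory -> main memory. */
void move_in(snapmem_model *m)
{
    assert(m != NULL);
    memcpy(m->main_mem, m->hw_mem, SNAPMEM_BYTES);
}

/* Data Mover direction 2: main memory -> hardware memory, then clear
 * the copy left behind in main memory. */
void move_out(snapmem_model *m)
{
    assert(m != NULL);
    memcpy(m->hw_mem, m->main_mem, SNAPMEM_BYTES);
    memset(m->main_mem, 0, SNAPMEM_BYTES);
}

/* SnapMem-strict: every call re-hides the data before it returns. */
void strict_read(snapmem_model *m, unsigned char *user)
{
    move_in(m);
    memcpy(user, m->main_mem, SNAPMEM_BYTES);
    move_out(m);
}

void strict_write(snapmem_model *m, const unsigned char *user)
{
    assert(m != NULL);
    memcpy(m->main_mem, user, SNAPMEM_BYTES);
    move_out(m);
}

/* SnapMem-light (assumed semantics): open exposes the data once,
 * close hides it and clears main memory; reads in between are plain copies. */
void light_open(snapmem_model *m)  { move_in(m); }
void light_close(snapmem_model *m) { move_out(m); }

void light_read(const snapmem_model *m, unsigned char *user)
{
    memcpy(user, m->main_mem, SNAPMEM_BYTES);
}

/* True when no byte of the sensitive data remains in main memory. */
bool main_mem_is_clear(const snapmem_model *m)
{
    for (size_t i = 0; i < SNAPMEM_BYTES; i++)
        if (m->main_mem[i] != 0)
            return false;
    return true;
}
```

In this model the strict calls never leave data in `main_mem` once they return, whereas the light version trades that guarantee for fewer data movements: the data stays exposed between the open and close calls.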
The data and control flow of SnapMem-strict are shown in Fig. 3. For simplicity and intuition, Fig. 3 focuses on reading and writing to SnapMem and omits SnapMem_alloc() and SnapMem_free(). From Fig. 3, we can see that there are eight steps for reading and six steps for writing to SnapMem. Below, we introduce the eight steps of SnapMem_read() and six steps of SnapMem_write() in detail, respectively.
① Process ID Checker checks if the caller is legal.
1) The identity of the caller is checked by Process ID Checker.
2) Software module transfers data from user to kernel space.
3) Software module passes the information of memory size and offset to Controller.
4) Controller tells Data Mover the memory size and offset.
5) Data Mover reads out the data of the corresponding address in the main memory, and clears the main memory.
6) Data Mover sends the read data into the hardware memory.
In the SnapMem-light design, SnapMem_read() and SnapMem_write() are much simpler than in SnapMem-strict. In SnapMem-light, SnapMem_read() only needs to run steps ① and ⑥, while SnapMem_write() only needs to complete steps 1) and 2). All other steps are completed by the two APIs SnapMem_light_open() and SnapMem_light_close(). We describe the differences between the two versions in more detail in Section IV-B.

B. Software of SnapMem
As described in Section IV-A, the main function of the SnapMem software module is to complete the operations of the four APIs in kernel space. Below, we describe the functional design of these four APIs in detail. In addition, we introduce the modifications to the original kernel functions that support SnapMem. For ease of introduction, we list all parameters and signals that the software module of SnapMem needs to use in Table I.
SnapMem_alloc(): SnapMem_alloc() mainly completes three operations. First, it queries whether the current process has already created a SnapMem, and exits with an error if so. Second, it allocates a SnapMem for the current process and stores the process ID. Third, it assigns a SnapMem_ID to the current process.

SnapMem_read(): The operation steps of SnapMem_read() in kernel space are shown in Algorithm 1. A detailed description of all parameters and variables can be found in Table I. Starting from the third line of Algorithm 1, we can clearly see the difference between SnapMem-strict and SnapMem-light. The read operation of SnapMem-light omits the steps of controlling the hardware, so its steps are very simple. From the fourth line of Algorithm 1, we know that cache lines need to be flushed and invalidated before passing parameters to Controller. The purpose of these steps is to ensure that all data is flushed from the cache back to the main memory, so that the hardware transfers the correct data in the main memory.
After all parameters are passed to the hardware, we need to start the hardware to transfer the data from SnapMem memory into the main memory, as shown in line 7 of Algorithm 1. Subsequently, the cache line is invalidated again in line 8. This step ensures that CPU access to SnapMem_data bypasses the cache and directly accesses the main memory. When the hardware completes the data transfer, SnapMem_data is moved from kernel to user space in line 10. Then, the operations related to the cache lines and controlling the hardware are re-executed. The difference is that the data transfer direction of the hardware is reversed, from the main memory to SnapMem memory, as shown in lines 11-16. This reverse data movement rehides sensitive data in hardware memory.
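The ordering constraints described for the strict read can be made explicit with a short trace model. This is a hedged sketch, not Algorithm 1 itself: the step names, the trace structure, and the assumption that the re-hiding phase repeats the same cache and hardware operations in the same order are ours, and the real kernel and hardware calls are replaced by entries in a log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the operations described in the text. */
typedef enum {
    STEP_CHECK_CALLER,      /* SnapMem_ID / state checks            */
    STEP_FLUSH_INVALIDATE,  /* flush and invalidate the cache lines */
    STEP_PASS_PARAMS,       /* pass parameters to Controller        */
    STEP_START_HW_TO_MAIN,  /* hardware memory -> main memory       */
    STEP_INVALIDATE,        /* invalidate the cache lines again     */
    STEP_COPY_TO_USER,      /* kernel space -> user space           */
    STEP_START_MAIN_TO_HW   /* main memory -> hardware memory       */
} snapmem_step;

#define TRACE_MAX 16

typedef struct {
    snapmem_step steps[TRACE_MAX];
    size_t n;
} snapmem_trace;

/* Append one step to the trace (silently drops steps past TRACE_MAX). */
void record(snapmem_trace *t, snapmem_step s)
{
    assert(t != NULL);
    if (t->n < TRACE_MAX)
        t->steps[t->n++] = s;
}

/* Model of the strict read path; returns 0 on success, -1 if the caller
 * does not own the SnapMem. Only the ordering of operations is modeled. */
int snapmem_read_strict(snapmem_trace *t, int caller_is_owner)
{
    record(t, STEP_CHECK_CALLER);
    if (!caller_is_owner)
        return -1;

    /* Phase 1: bring the data from hardware memory into main memory. */
    record(t, STEP_FLUSH_INVALIDATE);
    record(t, STEP_PASS_PARAMS);
    record(t, STEP_START_HW_TO_MAIN);
    record(t, STEP_INVALIDATE);
    record(t, STEP_COPY_TO_USER);

    /* Phase 2: same operations with the transfer direction reversed,
     * so the data is hidden in hardware memory again. */
    record(t, STEP_FLUSH_INVALIDATE);
    record(t, STEP_PASS_PARAMS);
    record(t, STEP_START_MAIN_TO_HW);
    record(t, STEP_INVALIDATE);
    return 0;
}
```

The point of the model is that the copy to user space sits strictly between the two hardware transfers, and that a rejected caller never reaches any cache or hardware operation.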
SnapMem_write(): Algorithm 2 shows the operation steps of SnapMem_write() in kernel space. As can be seen from Algorithm 2, the operation steps of SnapMem_write() are very similar to those of SnapMem_read(). The SnapMem_ID and SnapMem_state related operations are the same as in SnapMem_read(), as shown in lines 1-3. In addition, SnapMem_write() does not require cache line and hardware related operations before transferring data from user to kernel space. Lines 5-10 show the steps for transferring SnapMem_data from the main memory to the hardware memory, which are the same as lines 11-16 of SnapMem_read(). In general, SnapMem_write() is much simpler than SnapMem_read(), because it only runs the operations related to the cache line and hardware control once.
SnapMem_free(): SnapMem_free() mainly runs two operations. First, it clears the stored current process ID to zero. Second, it clears the SnapMem_ID value assigned to the current process to zero.
Modifications to Kernel: SnapMem's modifications to the kernel mainly target two kernel functions, _do_fork() and do_exit(). The two functions are run when a process is created and terminated, respectively. We modify them mainly to prevent memory leaks of SnapMem. Unexpected termination of a process occurs frequently in a running system, and it leaves the allocated SnapMem unreleased. In addition, the release of SnapMem requires a call to SnapMem_free(). If the programmer forgets to call SnapMem_free() at the end of the program, SnapMem cannot be released either. Both situations lead to memory leaks of SnapMem. On an ARM-FPGA SoC, SnapMem is a precious resource, so we cannot tolerate these two kinds of SnapMem memory leaks. Therefore, we modified _do_fork() and do_exit() so that the kernel checks for memory leaks of SnapMem on both process creation and termination. The operation they add is to check whether the current process has its own unreleased SnapMem. If it does, the SnapMem is released, and the stored process ID and SnapMem_ID are cleared to zero.

C. Hardware of SnapMem
As can be seen from Figs. 2 and 3, the SnapMem hardware module consists of two submodules, namely Controller and Data Mover. Below, we describe the operation steps of these two submodules in detail. The introduction of SnapMem memory is omitted here because it is built on Zynq's block RAM and uses Zynq's free, commercially available, mature interface IPs. In other words, SnapMem memory does not introduce any functional innovation.
Controller: The function of Controller is mainly to control Data Mover and to convert addresses and signals. Controller has a total of four operation steps as follows.

2) Based on SnapMem_ID, SnapMem_offset and SnapMem_size, calculate the address range of SnapMem_data.
3) Send the address range of SnapMem_data to Data Mover.
4) Send the trigger signal to Data Mover to start its data transfer function.
Data Mover: In the SnapMem hardware module, Data Mover is the core component. Fig. 4 shows the state machine of Data Mover. As Fig. 4 shows, Data Mover has a total of five states, namely Idle, Controller_Clear, Read, Memory_clear, and Write.
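Step 2) of Controller, the address-range calculation, can be illustrated with a small C model of the logic. The actual hardware is written in Verilog and its address layout is not given here, so this sketch assumes that every SnapMem_ID owns one fixed-size, contiguous region of the block RAM; `SNAPMEM_REGION_BYTES`, `SNAPMEM_REGIONS`, and the function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: SNAPMEM_REGIONS equally sized regions, one per SnapMem_ID. */
#define SNAPMEM_REGION_BYTES 4096u
#define SNAPMEM_REGIONS      8u

/* Half-open byte range [start, end) inside the hardware memory. */
typedef struct {
    uint32_t start;
    uint32_t end;
} snapmem_range;

/* Compute the address range of SnapMem_data from SnapMem_ID, SnapMem_offset
 * and SnapMem_size. Returns false when the request would leave the region
 * owned by that ID, so one process can never address another's data. */
bool snapmem_range_of(uint32_t id, uint32_t offset, uint32_t size,
                      snapmem_range *out)
{
    assert(out != NULL);
    if (id >= SNAPMEM_REGIONS || size == 0u)
        return false;
    if (offset >= SNAPMEM_REGION_BYTES ||
        size > SNAPMEM_REGION_BYTES - offset)
        return false;                      /* written to avoid overflow */

    out->start = id * SNAPMEM_REGION_BYTES + offset;
    out->end   = out->start + size;
    return true;
}
```

Under these assumptions, a request with SnapMem_ID 2, offset 16 and size 32 maps to the range [8208, 8240), while a request that would cross the end of its region is rejected.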

V. IMPLEMENTATION
Section IV focuses on the design principle of SnapMem. In this section, we introduce the implementation of SnapMem in detail. Like Section IV, we describe the implementation from both software and hardware aspects.

A. Software of SnapMem
SnapMem_alloc(): We use the ioctl() system call to pass a parameter to the SnapMem software module in the kernel. This parameter instructs the kernel to allocate SnapMem memory. We define a global array SnapMem_PID_list in the kernel. Each element of SnapMem_PID_list is used to hold the ID of the process that owns SnapMem. The index of the element is its SnapMem_ID. By iterating over the elements of SnapMem_PID_list, the first empty element can be found. Then, the ID of the current process is stored in this element.
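The allocation logic over SnapMem_PID_list can be sketched in plain C as follows. This is a minimal user-space model rather than the kernel code: the array length `SNAPMEM_MAX`, the convention that an element holding 0 is empty, and the function names `snapmem_lookup()` and `snapmem_alloc()` are assumptions made for illustration.

```c
#define SNAPMEM_MAX      8   /* hypothetical number of SnapMem regions */
#define SNAPMEM_NO_OWNER 0   /* an element holding 0 is treated as empty */

/* Element i holds the PID that owns the SnapMem with SnapMem_ID == i. */
int SnapMem_PID_list[SNAPMEM_MAX];

/* Return the SnapMem_ID owned by pid, or -1 if pid owns no SnapMem. */
int snapmem_lookup(int pid)
{
    if (pid <= 0)
        return -1;
    for (int i = 0; i < SNAPMEM_MAX; i++)
        if (SnapMem_PID_list[i] == pid)
            return i;
    return -1;
}

/* Allocate a SnapMem for pid. Returns the new SnapMem_ID, or -1 when pid
 * is invalid, already owns a SnapMem, or no element is free. */
int snapmem_alloc(int pid)
{
    if (pid <= 0 || snapmem_lookup(pid) >= 0)
        return -1;
    for (int i = 0; i < SNAPMEM_MAX; i++) {
        if (SnapMem_PID_list[i] == SNAPMEM_NO_OWNER) {
            SnapMem_PID_list[i] = pid;   /* first empty element wins */
            return i;
        }
    }
    return -1;
}
```

The same traversal serves as the process identity check: a caller whose PID is not found in the list owns no SnapMem and is rejected.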
SnapMem_read(): SnapMem_read() uses the read() system call to enter the kernel and passes two parameters, SnapMem_offset and SnapMem_size. The SnapMem_ID can be obtained by traversing SnapMem_PID_list. We use the two instructions DC CIVAC and DC IVAC to flush and invalidate the cache line. Then, we use iowrite32() to write the three parameters SnapMem_ID, SnapMem_offset, and SnapMem_size, followed by the trigger signal, to Controller. We use the kernel function copy_to_user() to move data from kernel space to user space.
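The parameter hand-off to Controller can be sketched as a sequence of 32-bit register writes. The register layout of Controller is not specified in this article, so the word offsets below are purely hypothetical, and `reg_write32()` is a user-space stand-in for the kernel's iowrite32() so that the sketch runs without the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register map of Controller, as 32-bit word indices. */
enum {
    REG_SNAPMEM_ID = 0,
    REG_OFFSET     = 1,
    REG_SIZE       = 2,
    REG_TRIGGER    = 3,
    REG_COUNT      = 4
};

/* Stand-in for iowrite32(): in the kernel, base would be the ioremap()ed
 * address of Controller's AXI slave interface. */
void reg_write32(volatile uint32_t *base, unsigned reg, uint32_t value)
{
    assert(base != NULL && reg < REG_COUNT);
    base[reg] = value;
}

/* Write the three parameters first and raise the trigger last, so that
 * Controller never starts a transfer with stale parameters. */
void controller_start(volatile uint32_t *base,
                      uint32_t id, uint32_t offset, uint32_t size)
{
    reg_write32(base, REG_SNAPMEM_ID, id);
    reg_write32(base, REG_OFFSET, offset);
    reg_write32(base, REG_SIZE, size);
    reg_write32(base, REG_TRIGGER, 1u);
}
```

Writing the trigger after the parameters is the one ordering property this sketch is meant to show; everything else about the interface is an assumption.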
SnapMem_write(): The implementations of SnapMem_write() and SnapMem_read() are similar, with only two differences. First, SnapMem_write() enters the kernel using the write() system call. Second, moving data from user to kernel space is done with copy_from_user().
SnapMem_free(): The implementation of SnapMem_free() is also based on the ioctl() system call. The kernel traverses the array SnapMem_PID_list, finds the element that stores the current process ID, and clears it to zero.
Modifications to Kernel: The operations added to _do_fork() and do_exit() are the same as those of SnapMem_free(). The two modified kernel functions traverse SnapMem_PID_list and clear the element storing the current PID.
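Because SnapMem_free() and the two kernel hooks perform the same traversal, they can share one helper, as the following sketch suggests. It is a user-space model under assumptions: `SNAPMEM_MAX`, `snapmem_release()`, and the hook names `snapmem_on_fork()` and `snapmem_on_exit()` are hypothetical and only indicate where the added operations would be called from _do_fork() and do_exit().

```c
#define SNAPMEM_MAX 8   /* hypothetical number of SnapMem regions */

/* Element i holds the PID that owns SnapMem_ID i; 0 means unowned. */
int SnapMem_PID_list[SNAPMEM_MAX];

/* Clear the element that stores pid. Returns the released SnapMem_ID,
 * or -1 if pid owns no SnapMem. */
int snapmem_release(int pid)
{
    if (pid <= 0)
        return -1;
    for (int i = 0; i < SNAPMEM_MAX; i++) {
        if (SnapMem_PID_list[i] == pid) {
            SnapMem_PID_list[i] = 0;
            return i;
        }
    }
    return -1;
}

/* Hypothetical hook for process creation: a stale entry left under this
 * PID is released before the new process can use SnapMem. */
void snapmem_on_fork(int pid)
{
    (void)snapmem_release(pid);
}

/* Hypothetical hook for process termination: release whatever the exiting
 * process still owns, whether or not it called SnapMem_free(). */
void snapmem_on_exit(int pid)
{
    (void)snapmem_release(pid);
}
```

Releasing on both paths means that neither an abnormal termination nor a forgotten SnapMem_free() call can leave an entry permanently occupied.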

B. Hardware of SnapMem
We wrote the hardware code for SnapMem in Verilog HDL, and synthesized and implemented the hardware of SnapMem in the Vivado Design Suite.
Controller: We configured an AXI slave interface for Controller. Through AXI SmartConnect [57] of Xilinx, we connect the AXI slave interface of Controller to the M_AXI_HPM0_FPD interface [43] of the ARM MPCore processor. Therefore, the processor can access Controller through the physical address of the AXI slave interface.
Data Mover: Because Data Mover needs to actively access the main memory and the hardware memory, it is configured with an AXI master interface.Similar to Controller, Data Mover is also based on AXI SmartConnect and is connected to S_AXI_HP0_FPD [43] of the ARM MPCore processor.

VI. EVALUATION
In this section, we detail the security, performance, and hardware evaluation of SnapMem. First, our evaluation environment is introduced in Section VI-A, and the evaluation design is described in Section VI-B. Then, we describe the security evaluation results in Section VI-C. In Section VI-D, we show the performance evaluation results. Finally, hardware overhead is provided in Section VI-E.

A. Evaluation Environment
Our experimental platform is the Xilinx ZCU102 Evaluation Board [58]. It has one ZU9EG SoC with a total of four ARM Cortex-A53 cores. The platform also has 4 GB of DDR4 memory and 29 GB of hard drive storage. We implemented SnapMem and the other defense solutions based on Linux kernel 5.4.0 running in Ubuntu 16.04.6 LTS on the ZCU102 platform. The cross compiler we use is aarch64-linux-gnu- with gcc version 10.1.0.

TABLE III SECURITY EVALUATION OF SNAPMEM

B. Evaluation Design
The evaluation experiments are divided into: 1) security analysis; 2) performance analysis; and 3) hardware overhead. In the security analysis, all attacks run on the real ZU9EG SoC platform. However, due to the weak transient execution capability of the ZU9EG, transient execution attacks are simulated. For a detailed introduction to the simulated attacks, please refer to Section VIII. Performance analysis is divided into three aspects: 1) API; 2) encryption application; and 3) system performance overhead. The API experiments mainly focus on the time overhead of API calls. The encryption application experiments use SnapMem to build a more secure AES implementation. The system performance experiments compare the performance overhead of SnapMem, KPTI, Spectre-BTB mitigation, and Spectre-STL mitigation running on the ZU9EG SoC platform. Note that the API comparison experiment also includes an overhead test of reading and writing a small amount of data. The specific results can be found in Table IV. In this test, the hardware defense solutions (STT and InvisiSpec) were running in the Gem5 simulator, while the software defense solutions and SnapMem were running on the ZU9EG platform. Gem5 is not used to simulate SnapMem because the block RAM used by SnapMem relies on Xilinx-specific interface IPs and timing logic. Simulating SnapMem in Gem5 would introduce large distortion and ultimately undermine the fidelity of the overhead comparison results.

C. Security Analysis
We have tested all cache-related attacks that can be implemented on this ARM-FPGA SoC. Since the branch prediction and out-of-order execution capabilities of the ARM Cortex-A53 core are not strong enough, Meltdown, Spectre, and their variant attacks cannot actually run on the platform. Therefore, we simulated these types of variant attacks using /dev/mem of the Linux file system. Experimental results in Table III prove that SnapMem can effectively defend against all

D. Performance Analysis
APIs of SnapMem: In the performance evaluation experiments, we first test the latency of the four SnapMem APIs. To make the results more intuitive, we compare each SnapMem API with the original Linux API of similar functionality. Fig. 5 shows the comparison results of SnapMem_alloc() and malloc(). As can be seen from Fig. 5, the call delay of SnapMem_alloc() is much larger than that of malloc(), by about 10 times. One reason for such a big difference is that the call delay of malloc() itself is very small. The second reason is that the process authentication mechanism of SnapMem_alloc() takes much more time. Fig. 5 also shows that the call delay of the two APIs is independent of the allocated memory size. Fig. 6 shows the latency comparison between SnapMem's access APIs and memcpy(). This result includes both SnapMem-strict and SnapMem-light for read and write operations. The vertical axis of Fig. 6 grows exponentially. As can be seen from Fig. 6, the latency of SnapMem-light's read and write functions is of the same order of magnitude as that of memcpy(). Also, as the data size increases, the access latency of SnapMem-light gets closer to that of memcpy(). This indicates that the access overhead of SnapMem-light is indeed small. In contrast, the access overhead of SnapMem-strict is much greater than that of the other two. This is mainly because each read and write operation of SnapMem-strict is accompanied by operations and waits on hardware and cache lines. Therefore, SnapMem-light is more suitable for scenarios with lower security but higher performance requirements, while SnapMem-strict suits the opposite case.
However, we can also see from Fig. 6 that SnapMem is not suitable for reading and writing small amounts of data. When reading and writing 128 bytes of data, the performance overhead is very large for both SnapMem-strict and SnapMem-light. The results in Table IV show more clearly the comparison of read and write overhead for a small amount of data between different defense solutions. Among hardware defense solutions, STT and InvisiSpec are two open-source projects that can be simulated and evaluated on Gem5 [59]. We used the Gem5 simulator to simulate the STT and InvisiSpec solutions, respectively, and tested the time overhead of memcpy(). We also tested the time overhead of memcpy() under three software defense solutions: KPTI, Spectre-BTB mitigation, and Spectre-STL mitigation. From Table IV, we can see that for small data volumes, the overhead of SnapMem is far higher than that of the other solutions. Therefore, the current SnapMem design is not suitable for read and write operations on small data volumes. In addition, Table IV shows that the cost of the hardware defense solutions is much greater than that of the software defense solutions. This is because the hardware defense solutions modify the cache hierarchy or predictor, which greatly affects the memory read and write speed. Correspondingly, the kernel source code modifications made by the software defense solutions have very little impact on the memory read and write speed.
Fig. 7 shows the call delay comparison results of SnapMem_free() and free(). This result is similar to that of Fig. 5.
AES Encryption: To see the overhead of SnapMem in practical applications, we designed T-table AES encryption implementations based on the two versions of SnapMem. Fig. 8 shows the encryption time comparison between the original T-table AES, SnapMem-strict-based AES, and SnapMem-light-based AES implementations. As can be seen from Fig. 8, when the amount of data is relatively small, the overheads of the two SnapMem-based AES implementations are much larger than that of the original version of AES. But as the amount of data increases, the encryption times of the three implementations converge. This result shows that SnapMem has low performance overhead for large data volumes but performs poorly for small data volumes.
UnixBench [60]: UnixBench was designed to provide a basic indicator of the performance of a Unix-like system. We run UnixBench under five system settings to test the performance overhead of critical operations on the Linux system. Fig. 9 shows the comparative evaluation results of UnixBench. As can be seen from Fig. 9, the overhead of the Context Switching test increases significantly on the SnapMem-enabled Linux system. This is mainly because Context Switching creates and terminates a large number of processes, while SnapMem adds operations to release SnapMem memory during the creation and termination of processes.
Overall, KPTI-enabled Linux has the largest total overhead on UnixBench, which is as high as 1.5%. The total overheads of the Spectre-BTB mitigation and the Spectre-STL mitigation are 0.98% and 0.6%, respectively. In comparison, SnapMem has a total UnixBench overhead of just 0.07%. This result indicates that SnapMem has much less impact on the critical operations of the Linux system than the other three defenses.
SPEC_CPU 2017 [61]: The SPEC_CPU 2017 suites provide a comparative measure of compute-intensive performance using workloads developed from real user applications. We therefore use SPEC_CPU 2017 to compare the impact of SnapMem and other defenses on user application performance. We evaluate all of the SPEC_CPU 2017 INT and FP applications under five system settings. These five system settings are the same as in the UnixBench tests. Fig. 10 shows the evaluation results of the SPECrate2017 benchmark. As shown in Fig. 10, SnapMem has a high overhead in the namd and xalancbmk tests. This is because both namd and xalancbmk create and terminate processes frequently.
Overall, in the SPEC_CPU 2017 tests, the KPTI mitigation has the highest total overhead at 0.14%. It is followed by the Spectre-STL mitigation, with a total overhead of 0.08%. Both SnapMem and the Spectre-BTB mitigation have a total overhead of 0.04%. This result indicates that SnapMem and the Spectre-BTB mitigation have less performance impact on user applications than the other two defenses.
Data Mover and Controller occupy far fewer hardware resources than the AXI interface. Data Mover uses 5.8% of LUTs and 1.3% of registers, while Controller uses 2.9% of LUTs and 6.1% of registers. On the other hand, SnapMem memory takes up all the block RAM. Moreover, SnapMem memory also uses 3.9% of LUTs and 3.8% of registers due to the need for a communication interface.

VII. RELATED WORK
In this section, we present several software runtime defenses and software/hardware co-designs that, like SnapMem, resist cache-related attacks.
Several effective software runtime defenses have been proposed to resist Prime+Probe, Flush+Reload, and Flush+Flush attacks. HomeAlone [34] can detect Prime+Probe attacks by using the same side channel (L2 cache) as attackers. CloudRadar [35] can detect Prime+Probe and Flush+Reload attacks in real time using hardware performance counters. Secure collaborative APIs (SCAPIs) [50] provide a collaborative mechanism for flush and time APIs to resist Flush+Reload and Flush+Flush in real time.
Researchers have also proposed effective software runtime defenses against Spectre, Meltdown, and their variants. Google presents the return trampoline (retpoline) [36] to defend against Spectre-BTB. It replaces indirect branches with push+return instruction sequences to prevent BTB poisoning. KPTI [39] can defend against Meltdown because it ensures that user space has no valid mapping to kernel space. EPTI [62] was designed to defend against the Meltdown attack in the cloud. It has less overhead than KPTI and can be applied to unpatched VMs. In addition, the Linux open-source community provides two software runtime schemes on ARMv8-A: Spectre-BTB mitigation [37] and Spectre-STL mitigation [38].
There are also some studies that use software/hardware co-design to defend against cache-related attacks in real time. AdapTimer [55] is a software/hardware collaborative timer that can resist flush-based cache-related attacks on ARM-FPGA embedded SoCs. SecFlush [56] designs a flush detector in hardware, which cooperates with software to mitigate cache-related attacks.

VIII. DISCUSSION
Simulated Attacks: All experiments of SnapMem are based on Xilinx's ZU9EG platform, which has four ARM Cortex-A53 cores. However, due to the weak branch prediction and out-of-order execution capabilities of the ARM Cortex-A53 multicore processor, we cannot successfully run Meltdown, Spectre-BTB, and Spectre-STL attacks on this platform. Therefore, we can only run simulated experiments for these three attacks. The simulated Meltdown, Spectre-BTB, and Spectre-STL attacks use /dev/mem of the Linux file system to directly read and write the main memory corresponding to SnapMem. Although they are not real attacks, they have a greater ability to steal sensitive data than real attacks. The failure of these three simulated attacks therefore supports the security of SnapMem.
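To make the idea of a /dev/mem-based probe concrete, the following Python sketch maps a window of a memory-like file at a given physical offset and checks whether it still holds nonzero data. It is only an illustration under assumptions: the function names are hypothetical, the physical address of the main memory region backing SnapMem is platform specific and not given here, and reading /dev/mem itself requires root privileges and a kernel that permits such access. The same code works on any regular file, which is how it can be tried safely.

```python
import mmap
import os


def read_physical_window(path, phys_addr, size):
    """Return a copy of `size` bytes starting at offset `phys_addr` of `path`.

    mmap offsets must be aligned to the allocation granularity, so the
    mapping starts at the enclosing aligned base and is sliced afterwards.
    """
    page = mmap.ALLOCATIONGRANULARITY
    base = phys_addr - (phys_addr % page)
    delta = phys_addr - base
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, delta + size, access=mmap.ACCESS_READ,
                       offset=base) as window:
            return window[delta:delta + size]
    finally:
        os.close(fd)


def window_is_burned(path, phys_addr, size):
    """True if the probed window contains only zero bytes, i.e. the probe
    finds no residual sensitive data in that region."""
    return not any(read_physical_window(path, phys_addr, size))
```

On the target board, path would be "/dev/mem" and phys_addr the (assumed, platform-specific) base of the main memory reserved for SnapMem; a probe that always sees zeros is consistent with the burn-after-reading behavior described above.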
Overhead of API: From the performance evaluation results, we can see that SnapMem has a very large overhead when reading and writing a small amount of data. This is because every read and write of SnapMem starts the hardware module regardless of the data size. In future work, we aim to solve this problem of large overhead for small amounts of data. A feasible solution is to make the hardware memory of SnapMem directly accessible to the CPU when reading and writing small amounts of data. SnapMem's memory is accessible only while SnapMem's read and write APIs are being called, and is inaccessible at other times. Here, we tentatively name this solution SnapMem_Opt. The core of the SnapMem_Opt solution is a control register that determines whether SnapMem can be directly accessed by the CPU. This control register is privileged and can only be controlled by SnapMem's APIs in user space. Since this solution omits the process of moving data from SnapMem memory to main memory, it greatly reduces the read and write overhead for a small amount of data. We use the Gem5 simulator to design a proof-of-concept version of SnapMem_Opt and conduct a feasibility assessment of the read and write overhead. Without loss of generality, the SnapMem_Opt solution does not use dedicated block RAM and interface IPs, but is designed based on a separately isolated physical space in the main memory. Due to time constraints, this proof-of-concept version of SnapMem_Opt did not undergo any hardware or software optimization. Therefore, it is far from optimal performance and full functionality, but it is enough to verify the feasibility of the technical route. Table VI shows the comparison of read and write overhead for small amounts of data between SnapMem_Opt and SnapMem.
As can be seen from Table VI, compared with SnapMem, the overhead of the proof-of-concept version of SnapMem_Opt is greatly reduced. This experimental result indicates that the technical route of SnapMem_Opt is feasible. In our future work, we will conduct comprehensive performance optimization and improvements on the proof-of-concept version of SnapMem_Opt. We believe that the overhead of the improved version of SnapMem_Opt will be further reduced.

IX. CONCLUSION
Cache-related attacks have become a huge security threat to ARM-FPGA embedded SoCs. Existing hardware defenses cannot be deployed on existing SoCs, while software runtime defenses have limited defense capabilities and high performance overhead. This article presents SnapMem, a software/hardware collaborative defense that can be deployed on existing ARM-FPGA platforms. SnapMem is a more secure memory that burns after reading. Any process can only access the SnapMem created by itself. Sensitive data appears briefly in the main memory only when accessed by a legitimate process calling SnapMem's APIs. Based on this mechanism, SnapMem can resist all cache-related attacks. We implemented various cache-related attacks on a real ARM-FPGA platform and verified that SnapMem is more secure than other defense schemes. The system performance evaluation results show that the overheads of SnapMem are the lowest among all defense solutions.

Manuscript received 28 November 2023; revised 1 March 2024; accepted 5 April 2024. Date of publication 22 April 2024; date of current version 20 September 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62372218, and in part by the Shenzhen Science and Technology Program under Grant SGDX20201103095408029. This article was recommended by Associate Editor R. S. Chakraborty. (Corresponding author: Fengwei Zhang.)

② Software module communicates with Controller to tell the hardware what memory size and offset to access.
③ Controller passes the size and offset information to Data Mover, and starts the read function of Data Mover.
④ Data Mover reads out the data of the corresponding address in the hardware memory one by one.
⑤ Data Mover sends the read data into the main memory.
⑥ Software module transfers data from kernel to user space.
⑦ Data Mover reads out the data in the main memory and clears the main memory one by one.
⑧ Data Mover sends the read data into the hardware memory.
Correspondingly, SnapMem_write() (strict) has similar six steps as follows.

Algorithm 2: SnapMem_write() of SnapMem
Input: Offset of SnapMem: SnapMem_offset; Size of SnapMem: SnapMem_size;
Output: Data written to SnapMem: SnapMem_data;
1 if The current process has a SnapMem then
2     Get its own SnapMem_ID;
3     if SnapMem_state is SnapMem-strict then
4         Transfer SnapMem_data in the main memory from user space to kernel space;
5         Flush and Invalidate all cache lines of the main memory corresponding to SnapMem;
6         Pass the SnapMem_ID to Controller;
7         Pass SnapMem_offset and SnapMem_size to Controller;
8         Send a SnapMem_in trigger to Controller;
9         Invalidate all cache lines of the main memory corresponding to SnapMem;
10        Waiting for the Data Mover to complete its work;
11    else
12        Transfer SnapMem_data in the main memory from user space to kernel space;
13 else
14    Return an error to user space;
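The control flow of Algorithm 2 can be traced with a small user-space model. The Python sketch below is not the kernel module: FakeController, the register names passed to its iowrite32() method, the log list that stands in for cache maintenance and waiting, the bounds check, and the -1 error value are all assumptions introduced for illustration. It only reproduces the ordering of the steps for the strict and light versions.

```python
class FakeController:
    """Stand-in for the FPGA Controller: records register writes instead
    of touching hardware (hypothetical helper, not part of SnapMem)."""

    def __init__(self):
        self.writes = []

    def iowrite32(self, name, value):
        self.writes.append((name, value))


def snapmem_write(pid, pid_list, strict, user_data, offset, size,
                  main_mem, ctrl, log):
    """Model of Algorithm 2; returns the number of bytes written or -1."""
    if pid not in pid_list:
        return -1                          # the process has no SnapMem
    snapmem_id = pid_list.index(pid)       # get its own SnapMem_ID
    # Assumed sanity check; the algorithm itself does not spell it out.
    if offset < 0 or size < 0 or offset + size > len(main_mem) \
            or len(user_data) < size:
        return -1
    if strict:
        # User space -> kernel space (copy_from_user() in the kernel).
        main_mem[offset:offset + size] = user_data[:size]
        log.append("flush_invalidate")     # flush and invalidate cache lines
        ctrl.iowrite32("SnapMem_ID", snapmem_id)
        ctrl.iowrite32("SnapMem_offset", offset)
        ctrl.iowrite32("SnapMem_size", size)
        ctrl.iowrite32("SnapMem_in", 1)    # trigger main -> hardware memory
        log.append("invalidate")           # invalidate cache lines again
        log.append("wait_data_mover")      # wait until Data Mover is done
    else:
        # SnapMem-light: only the copy into main memory takes place.
        main_mem[offset:offset + size] = user_data[:size]
    return size
```

In the strict path the Controller sees the ID, offset, size, and trigger in that order, while the light path never touches the Controller, which matches the lower latency reported for SnapMem-light.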

The big delay difference between SnapMem_free() and free() in Fig. 7 is also because of the process authentication mechanism of SnapMem_free().

TABLE I EXPLANATIONS OF ALL THE PARAMETERS AND VARIABLES USED IN SOFTWARE OF SNAPMEM

From Algorithm 1, we can see that if the process owns a SnapMem, it will get its own unique SnapMem_ID. The SnapMem_ID determines which SnapMem memory it can access. The operations on this SnapMem_ID implement the authentication mechanism of the process. After getting the SnapMem_ID, the status of SnapMem must be checked. This status determines whether SnapMem is a strict version or a

Input: Offset of SnapMem: SnapMem_offset; Size of SnapMem: SnapMem_size;
Output: Data read from SnapMem: SnapMem_data;
1 if The current process has a SnapMem then
…
5 Pass the SnapMem_ID to Controller;
6 Pass SnapMem_offset and SnapMem_size to Controller;
7 Send a SnapMem_out trigger to Controller;
Waiting for the Data Mover to complete its work;
…
12 Pass the SnapMem_ID to Controller;
13 Pass SnapMem_offset and SnapMem_size to Controller;
14 Send a SnapMem_in trigger to Controller;
15 Invalidate all cache lines of the main memory corresponding to SnapMem;
16 Waiting for the Data Mover to complete its work;

Table II lists all registers and variables that Data Mover needs to use. Below, we describe the operation steps in the five states of Data Mover in detail.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE II EXPLANATIONS OF ALL THE REGISTERS AND VARIABLES USED IN Data Mover

1) Idle: This is the initial state of Data Mover. In the Idle state, the main function of Data Mover is to wait for the input of signals and parameters, including SourceStartAddr, SourceEndAddr, DestinStartAddr, DestinEndAddr, and DataMovTrig.
2) Controller_Clear: When DataMovTrig is not equal to 0, Data Mover enters the Controller_Clear state from the Idle state. In this state, Data Mover mainly completes one operation, namely clearing the DataMovTrig register of Controller to zero. If this DataMovTrig register is not cleared, Data Mover will be triggered again after completing the data transfer and returning to the Idle state.
3) Read: This state has two functions. The first is to read the data at the source address into RdDataReg of Data Mover. Then, Data Mover puts the read data into WrDataReg, where it waits to be written to the destination address.
4) Memory_Clear: There are two operations in this state. The first is to clear the data at the source address that has just been read. Then, the RdAddrReg register is incremented by 4.
5) Write: In this state, Data Mover first writes the data in WrDataReg to the destination address. Then, Data Mover checks whether WrAddrReg is equal to DestinEndAddr. If WrAddrReg is equal to DestinEndAddr, Data Mover goes back to the Idle state. If WrAddrReg is not equal to DestinEndAddr, WrAddrReg is incremented by 4.
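The five states can be exercised with a small behavioral model. The Python sketch below is not the Verilog implementation; it assumes that memory is a dictionary of word-aligned addresses, that the address registers are latched when leaving Idle, that the states are visited in the order Idle, Controller_Clear, Read, Memory_Clear, Write, and that Write returns to Read until DestinEndAddr (taken as the address of the last word) is reached. The trigger() and run() helpers are hypothetical conveniences.

```python
from enum import Enum, auto

WORD = 4  # addresses advance by 4 bytes per transfer, as in the text


class State(Enum):
    IDLE = auto()
    CONTROLLER_CLEAR = auto()
    READ = auto()
    MEMORY_CLEAR = auto()
    WRITE = auto()


class DataMover:
    def __init__(self, memory):
        self.mem = memory            # dict: word-aligned address -> value
        self.state = State.IDLE
        self.data_mov_trig = 0       # models Controller's DataMovTrig
        self.src_start = self.src_end = 0
        self.dst_start = self.dst_end = 0
        self.rd_addr = self.wr_addr = 0   # RdAddrReg / WrAddrReg
        self.rd_data = self.wr_data = 0   # RdDataReg / WrDataReg

    def trigger(self, src_start, src_end, dst_start, dst_end):
        """Model Controller loading the parameters and raising the trigger."""
        self.src_start, self.src_end = src_start, src_end
        self.dst_start, self.dst_end = dst_start, dst_end
        self.data_mov_trig = 1

    def step(self):
        """Advance the state machine by one state."""
        if self.state is State.IDLE:
            if self.data_mov_trig != 0:
                self.rd_addr, self.wr_addr = self.src_start, self.dst_start
                self.state = State.CONTROLLER_CLEAR
        elif self.state is State.CONTROLLER_CLEAR:
            self.data_mov_trig = 0   # otherwise the mover would restart
            self.state = State.READ
        elif self.state is State.READ:
            self.rd_data = self.mem.get(self.rd_addr, 0)
            self.wr_data = self.rd_data
            self.state = State.MEMORY_CLEAR
        elif self.state is State.MEMORY_CLEAR:
            self.mem[self.rd_addr] = 0   # burn the source word
            self.rd_addr += WORD
            self.state = State.WRITE
        elif self.state is State.WRITE:
            self.mem[self.wr_addr] = self.wr_data
            if self.wr_addr == self.dst_end:
                self.state = State.IDLE
            else:
                self.wr_addr += WORD
                self.state = State.READ

    def run(self):
        """Step until the mover is back in Idle."""
        self.step()
        while self.state is not State.IDLE:
            self.step()
```

After run() returns, every source word has been cleared and copied to the destination range, which is the burn-after-reading behavior the state descriptions imply; the model assumes the source and destination ranges do not overlap.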

TABLE IV READING AND WRITING OVERHEAD FOR SMALL AMOUNTS OF DATA (128 BYTES). THE OVERHEADS OF HARDWARE SOLUTIONS (STT AND INVISISPEC) ARE COMPARED WITH THE ORIGINAL CPU HARDWARE ARCHITECTURE, AND THE OVERHEADS OF SOFTWARE SOLUTIONS (KPTI, SPECTRE-BTB MITIGATION, AND SPECTRE-STL MITIGATION) ARE COMPARED WITH THE ORIGINAL LINUX OS. THE OVERHEAD OF SNAPMEM IS ALSO COMPARED TO THE memcpy() FUNCTION

Table V provides the hardware implementation overhead of SnapMem. As can be seen from Table V, in the hardware design of the entire SnapMem, the AXI interface occupies the most hardware resources: 87% of all LUTs and 89% of all registers.

TABLE V HARDWARE IMPLEMENTATION OVERHEAD OF SNAPMEM

TABLE VI COMPARISON OF READING AND WRITING OVERHEAD FOR SMALL AMOUNTS OF DATA (128 BYTES) BETWEEN SNAPMEM_OPT AND SNAPMEM