TS-Perf: General Performance Measurement of Trusted Execution Environment and Rich Execution Environment on Intel SGX, Arm TrustZone, and RISC-V Keystone

A trusted execution environment (TEE) is a new hardware security feature that is isolated from a normal OS (i.e., rich execution environment (REE)). The TEE enables us to run a critical process, but its behavior is invisible to the normal OS, which makes it difficult to debug and tune the performance. In addition, the hardware/software architectures of TEEs differ among CPUs. For example, Intel SGX allows user-mode only, whereas Arm TrustZone and RISC-V Keystone run a trusted OS. Furthermore, each TEE has its own SDK for programming. Each SDK offers its own APIs, which makes it difficult to write a common program. These features make it difficult to compare the performance fairly between TEE and REE on different CPUs. To obtain precise performance and behavior in TEE, we propose TS-perf, which is a compiler-based performance measurement method. TS-perf accesses the hardware timestamp counter in TEE as well as REE and keeps a precise log. The code for measurement is inserted in a TEE binary by the compiler options (i.e., profile option, constructor, and destructor). Furthermore, we utilize the separate compilation technique so that the same benchmark binary is used for a fair comparison between TEE and REE. The architecture of TS-perf is general and is implemented for three TEE architectures (Arm TrustZone, Intel SGX, and RISC-V Keystone). TS-perf measures the performance of GlobalPlatform's TEE internal APIs, matrix multiplication, memory access, and storage access. The comparisons show the difference in performance between TEE and REE and the unusual behavior of trusted applications (TAs).

Most TEE hardware architectures are implemented by changing the state from REE to TEE on a core, which means that the same core is used in REE and TEE. One might therefore expect the same performance in REE and TEE, but the features and limitations of each TEE architecture affect performance. For example, Intel SGX offers 96MB encrypted memory and user-mode (ring 3) only for a TA, whereas Arm TrustZone and RISC-V Keystone offer supervisor-mode to run a trusted OS on normal memory. (For example, Arm TrustZone is used by many smartphones with a vendor-specific trusted OS: KNOX for Samsung [19]-[21], RTOSck for Huawei [22], QSEE for Qualcomm [23], [24], etc. RISC-V Keystone also offers some choices: eyrie [14], [15], seL4 [25], etc.) In addition, each TEE has a software development kit (SDK) for programming a TA. Unfortunately, the abstraction depends on the SDK, and APIs for TAs are not standardized (namely, there is no POSIX-like standard). These features make it difficult to write a portable program for TEE.
On the other hand, availability is an important factor for security, but the current TEE does not support adequate performance measurement tools because TEE hides the behavior from the normal OS. Therefore, TA debugging and performance tuning are not easy. To solve this issue, some approaches have been proposed [26]-[34], but they incur high overhead or depend on special hardware. In addition, the previous approaches are not based on the same reliable time counter in TEE and REE, so the performance comparison is not accurate.
To solve these problems, we have implemented the portable GlobalPlatform TEE internal API library for different TEE architectures, but the performance details have not been measured because we lacked a suitable measurement method. In addition, we cannot confirm that the performance in TEE is the same as that in REE. This motivated us to establish a fair performance comparison that runs the same binary in TEE and REE and measures the performance with the same time source in both environments.
This paper proposes the precise measurement method TS-perf, which utilizes a CPU hardware timestamp counter in TEE and REE, and GCC [35] compiler options to insert the code for measurement. TS-perf is combined with the separate compilation technique and makes it possible to run the same binary in TEE and REE to enable fair performance comparison. The behavior of the TA is also monitored from the view of the normal OS and compared with the results of TS-perf. The comparison reveals some unusual behaviors (i.e., unexpected core changes and unexplained CPU load); we confirm that they come from differences in TEE implementations or from a bug, and we find an interrupt-handler bug between REE and TEE.
The remainder of this paper is organized as follows. Section II describes the background knowledge. Section III depicts the design of TS-perf and the separate compilation. Section IV describes the implementation of TS-perf, and Section V shows the measured performance and behavior. Section VI describes some related topics, and Section VII concludes this paper.

II. BACKGROUND
This section describes the background of the TEE architecture, time counter, common TEE programming API, and related works for performance and behavior measurement in TEE.

A. TEE ARCHITECTURE
The three TEE architectures used in this paper are described. These TEEs are implemented by changing the state of the CPU cores.

1) ARM TrustZone
Since smartphones, game machines, and set-top boxes use Arm Cortex-A TrustZone [3], [4] for critical processing, it is the most popular TEE. TrustZone offers a two-world model, i.e., the secure world (i.e., TEE) and the normal world (i.e., REE). The world (namely, the state of REE and TEE) is changed by the SMC (secure monitor call) instruction. Each world has user- and supervisor-mode and runs applications on a kernel (i.e., trusted OS or normal OS). Many trusted OSes on TrustZone (e.g., QSEE [23], [24], OP-TEE [36]) offer the APIs defined by GlobalPlatform [37], [38] for TA programming. Some APIs require the help of a normal OS (e.g., secure storage).
The memory allocation for TEE is flexible, but most implementations allocate the limited memory at the boot time to keep a small trusted computing base (TCB). When an interrupt is issued, a trusted OS can handle it if the interrupt is caused by the secure world's peripheral. However, many interrupts are passed to a normal OS with a context switch because they are issued for the normal world.

2) INTEL SGX
Intel Software Guard Extensions (SGX) [5]-[8] offers a single address model of TEE, named enclave. SGX allows a TA as a part of a normal application on a host OS, and the TA runs in user-mode (ring 3) in an enclave, which is the state of TEE. An enclave is created dynamically, and the TA is loaded on it. The TA is implemented as a shared library offered by Intel SDK. When an interrupt is issued, the processing is passed to the normal OS in REE. The total memory for enclaves is defined by UEFI and reserved at power-on. The maximum size is fixed (128MB is reserved on SGX version 1, and 256MB is reserved on version 2). The memory region is encrypted with the key generated at power-on.
Intel SDK includes the enclave definition language (EDL) named edger8r, which offers glue codes for secure communications between the normal application and the TA (i.e., OCALL from the TA to the application and ECALL from the application to the TA). The glue codes check the region of the pointer and the size of the buffer. They wrap the edge/boundary and serialize/deserialize the arguments/results. The glue codes are crafted with formal verification to reduce attack surfaces. An exhaustive analysis of the EDL is described in [39].

3) RISC-V KEYSTONE
RISC-V is an open instruction set architecture (ISA) and has some TEE implementations [10]-[18]. This paper utilizes Keystone because it is open-source and runs on a real CPU (SiFive's Unleashed board). Keystone, a mixed architecture of Arm TrustZone and Intel SGX, can create a TEE dynamically like SGX, whereas TrustZone offers only one TEE at boot time. The TEE is named enclave as in SGX but has user- and supervisor-mode as in TrustZone. Each TEE has its own runtime, which works as an OS kernel. The memory for an enclave is dispatched dynamically from REE (i.e., Linux) using the physical memory protection (PMP) mechanism. The dispatched memory is sanitized and used for critical processing. The PMP also manages the state of cores and changes from REE to TEE. In the same manner as Intel SGX, Keystone offers the EDL named keyedger, and the glue codes protect the OCALL (footnote 2).

B. TWO TYPES OF TIME COUNTERS
Current computers keep two types of time. One is the hardware timestamp, and the other is the system clock which indicates the global clock (i.e., calendar clock).

Footnote 2: Current Keystone has not implemented ECALL yet.

1) TIMESTAMP CYCLE
The hardware timestamp counts internal processor clock cycles. The timestamp counter is a special hardware resource in a CPU, and access is allowed by special instructions. For example, the Arm CPU's counter-timer physical count register CNTPCT_EL0 is accessible to the supervisor only, but the counter-timer virtual count register CNTVCT_EL0 is accessible to the user. In addition, some CPUs can change their clock frequency to reduce power consumption, but the internal processor clock cycle is not affected, and the timestamp counter remains monotonic. Table 1 shows the timestamp counters of the target architectures of this paper. They are Intel x86-64's timestamp counter (TSC) with the rdtsc instruction, Arm Cortex-A's counter-timer virtual count register (CNTVCT_EL0) with the mrs instruction, and RISC-V's hardware performance monitor (HPM) with the rdcycle instruction. The table also includes the frequency of the hardware timestamp of the CPU used in this paper. Arm Cortex-A53 has a hardware timestamp that is slower than the CPU clock frequency (1,400 MHz). Thus, the resolution is not high but is sufficient to measure a function-level amount of code, as shown in Section V.
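As an illustration of the three instructions named above, the following is a minimal sketch of a user-mode timestamp read in C. The function names `ts_read` and `ts_ticks_to_us` are hypothetical and are not taken from TS-perf; the `clock_gettime` branch is only a stand-in so that the sketch builds on other targets. Whether each instruction is permitted in user mode depends on the platform configuration (for example, some kernels disable user-mode `rdcycle`).

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read the per-architecture hardware timestamp.
 * The instruction choice follows the text: rdtsc (x86-64),
 * mrs CNTVCT_EL0 (Arm Cortex-A), rdcycle (RISC-V). */
static inline uint64_t ts_read(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t v;
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(v));
    return v;
#elif defined(__riscv) && (__riscv_xlen == 64)
    uint64_t v;
    __asm__ __volatile__("rdcycle %0" : "=r"(v));
    return v;
#else
    /* Assumption: fallback for other targets, not a hardware counter. */
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + (uint64_t)t.tv_nsec;
#endif
}

/* Convert a tick difference to microseconds, given the counter
 * frequency in Hz (the per-CPU value would come from Table 1). */
static inline double ts_ticks_to_us(uint64_t ticks, uint64_t hz)
{
    return (double)ticks * 1e6 / (double)hz;
}
```

A caller would take `ts_read()` before and after the code of interest and pass the difference to `ts_ticks_to_us()` together with the counter frequency of the CPU in use.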

2) SYSTEM CLOCK
The system clock indicates the global clock, which is used for verifying certificates and logging. In general, the source of the system clock is based on an external peripheral named real-time clock (RTC) at boot time. The RTC has a battery power supply and keeps working even if the CPU power is down.
The system clock is obtained by the system call gettimeofday() on Linux. If gettimeofday() were implemented naively, it would access the RTC on every call. However, much time is needed to access the external peripheral. To reduce the access overhead, the initial value of the system clock is obtained from the RTC, and the system clock is then maintained by other clock resources (e.g., timestamp counter). In addition, most implementations of gettimeofday() use a virtual dynamic shared object (vDSO) and omit the context switch of a system call.
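For comparison with the hardware timestamp, the system clock can be read as in the following sketch. The wrapper name `system_clock_us` is hypothetical; only the standard gettimeofday() call is used.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical wrapper: system (calendar) clock in microseconds
 * since the Unix epoch, or 0 if gettimeofday() fails. */
uint64_t system_clock_us(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return 0;
    return (uint64_t)tv.tv_sec * 1000000ull + (uint64_t)tv.tv_usec;
}
```

Unlike the timestamp counter, this value can be adjusted by the OS (for example, by time synchronization), so it is less suitable for measuring short intervals.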

C. COMMON TEE PROGRAMMING API
In general, each TEE has an SDK for its TA programming. For example, Intel offers SGX-SDK [5], and RISC-V Keystone offers Keystone SDK [40]. Because these SDKs expose different APIs, a TA written for one TEE cannot be built for another. To solve this problem, we implemented a portable library of GlobalPlatform's TEE Internal API [41] because the specification is open and widely used on smartphones.
On Arm TrustZone, the trusted OS OP-TEE already offers GlobalPlatform's TEE Internal API, and we implemented the portable library for Intel SGX and RISC-V Keystone. Intel SGX has no supervisor mode, and RISC-V Keystone offers the eyrie runtime at the supervisor level, which provides a limited API for a TA. GlobalPlatform's TEE Internal API includes approximately 300 APIs with many arguments that include many cipher suites. We selected notable APIs for common applications.
We divide the GlobalPlatform APIs into 2 categories: CPU-independent and CPU-dependent. CPU-independent APIs are cryptographic functions and can be implemented easily as a portable library, whereas CPU-dependent APIs are functions related to secure storage, timer, and random number. These APIs must be customized for each architecture but are also implemented as a library.

D. RELATED WORKS
The importance of TEE performance measurement has been recognized, and some methods have been proposed. Table 2 summarizes the related works. Gjerdrum et al. [26], SGX-Perf [27], and TEEMon [29] are Intel SGX specific solutions and cannot be applied to other architectures. Gjerdrum's and SGX-Perf are early approaches and use the timestamp outside TEE. Their main target is the performance of SGX primitive functions (e.g., context switches), and they do not provide a method to measure the functions of a TA. TEEMon is a real-time performance monitoring framework that uses the timestamp inside TEE, but the overhead is reported to be high, from 5% to 17%. TEE-Perf [28] is a general perf implementation but assumes a recorder process that occupies a core along with the perf process. The recorder process has a software counter on a shared memory, which is incremented by an infinite loop. The perf process obtains the value of the software counter via shared memory. TEE-Perf can be applied to REE, but the implementation is redundant and inaccurate (the overhead is reported to be up to 5.7× higher than that for Linux perf).
On Arm TrustZone, some trusted OSs offer measurement functions because they are based on the GlobalPlatform APIs, which include time measurement (i.e., TEE_GetREETime() and TEE_GetSystemTime()). Furthermore, OP-TEE offers gprof [30], [31] to measure the performance of each function. The paper [32] customizes OP-TEE to measure the performance of the stress test benchmark STRESS-NG [42]. They are useful but are not generic enough to enable comparison with other architectures.
Some CPU vendors offer hardware assistance to obtain the performance information. Intel Processor Trace (PT) [33] and Arm CoreSight [34] are well-known mechanisms and are integrated into current perf tools, but they are disabled by default on SGX and TrustZone.
TS-perf measures performance using a general hardware timestamp counter inside TEE and REE, and the results are precise. We offer the same APIs on different TEE architectures and make it possible to compare the performance between different architectures.

III. DESIGN
This section describes two design issues for TS-perf and the method of comparing REE and TEE.

A. DESIGN OF TS-PERF
TS-perf is a compiler-based performance measurement method that obtains the time from the hardware timestamp counter in TEE and REE. The time data are kept in memory during performance measurement because writing the data to a file involves high overhead. TS-perf writes the time data in memory to a file after the measurement.
To implement these functions, TS-perf has three challenges. The first concerns the access to the hardware timestamp counter in TEE and measuring the time before and after a function. The second is the memory allocation for logging in TEE. The third is the writing of data from TEE to a file in REE. TS-perf solves these challenges using the features of the GNU Compiler Collection (GCC) [35] (i.e., profile option, constructor, and destructor).
The means of accessing the hardware timestamp counter in TEE is dependent on each architecture. On Intel SGX, TS-perf is implemented for SGX version 2 (SGXv2) with the rdtsc instruction because SGX version 1 (SGXv1) does not allow access to TSC in user-mode. SGXv2 is offered for limited CPUs only, but we chose an available machine. Accessing the timestamp counter in SGX is not our contribution, but we utilize it for fair performance comparison. Arm TrustZone and RISC-V Keystone allow access to their timestamp counters in user-mode (i.e., CNTVCT_EL0 and HPM with the mrs and rdcycle instructions, respectively). The code to obtain the timestamp must be inserted at the top and bottom of a target function. Fortunately, the profile option of GCC offers this feature, and TS-perf utilizes it.
The second challenge concerns how to keep the time data in memory in TEE. The memory region must be assigned before measurement to reduce the runtime overhead. Fortunately, the constructor of GCC offers the mechanism to insert code before the main program. TS-perf utilizes the constructor to reserve memory for logging.
The third challenge is reporting log data to REE after measurement because access to REE includes heavy overhead. The destructor of GCC enables insertion of a code after the main function, and TS-perf utilizes it. The way to save data from TEE to a file in REE depends on the TEE implementation. TS-perf follows the manner suitable for each architecture.
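The second and third challenges can be sketched in plain C with GCC's constructor and destructor attributes. This is a minimal REE-side illustration under stated assumptions, not the TS-perf source: the function names, the output file name `ts_perf_sketch.log`, and the use of `malloc` and `fopen` are hypothetical. Inside a TEE, the allocation and the final write would go through each SDK's own mechanisms.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LOG_BYTES (64 * 1024)   /* 64KB, the buffer size used in the text */

static uint8_t *log_buf;        /* reserved before main() runs */
static size_t   log_used;

/* Runs before main(): reserve the log buffer once, so that no
 * allocation happens while measuring. */
__attribute__((constructor))
static void log_prepare(void)
{
    log_buf = malloc(LOG_BYTES);
    log_used = 0;
}

/* Append one record in memory; returns -1 when the buffer is full. */
int log_append(const void *rec, size_t len)
{
    if (log_buf == NULL || len > LOG_BYTES - log_used)
        return -1;
    memcpy(log_buf + log_used, rec, len);
    log_used += len;
    return 0;
}

/* Number of bytes currently stored in the log buffer. */
size_t log_size(void)
{
    return log_used;
}

/* Runs after main(): write the whole buffer out in one step. */
__attribute__((destructor))
static void log_report(void)
{
    if (log_buf == NULL)
        return;
    FILE *f = fopen("ts_perf_sketch.log", "wb");  /* hypothetical path */
    if (f != NULL) {
        fwrite(log_buf, 1, log_used, f);
        fclose(f);
    }
    free(log_buf);
    log_buf = NULL;
}
```

Deferring all file output to the destructor keeps the expensive REE access out of the measured region, which is the design point made above.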

B. DESIGN OF A METHOD FOR COMPARING REE AND TEE
TS-perf enables precise performance measurement based on the same timestamp in TEE and REE, but this alone is insufficient because the same binary cannot, in general, be run in TEE and REE. The reason is that the system calls are different, and a TA must link special libraries for TEE execution. In addition, the compiler options are different in many cases. If link time optimization (LTO) is enabled, the functions may be optimized differently. To solve this problem, we utilize the separate compilation technique. Figure 1 shows the overview. The functions to be measured are saved in an original source file and compiled to an object file (in Figure 1, memory and CPU are the measuring targets). The benchmark main and TS-perf codes are prepared separately for TEE and REE because they depend on these environments. The benchmark main code includes the function to call the measuring target, and the TS-perf code uses GCC profiling. TS-perf offers measuring codes for both TEE and REE. Hence, for the sake of fair comparison, TS-perf does not use GCC's original profiler.
The measuring target object file is linked to the benchmark main and TS-perf object and creates a benchmark binary. The same measuring target binaries are thus used in REE and TEE.
Unfortunately, this technique cannot be applied to functions that use TEE architecture-dependent APIs (e.g., secure storage). In that case, the source code is written to be as similar as possible and compiled with the same options.

IV. IMPLEMENTATION
TS-perf is implemented for Arm TrustZone, Intel SGX, and RISC-V Keystone. The implementation of TS-perf consists of three steps on GCC [35]. TS-perf is also implemented for REE, but this section concentrates on TEE because the REE implementation is almost the same and simpler.
Step 1 (Compiler Phase): The profile option of GCC is enabled by a compile flag when building the object that contains the functions to be measured. The flag is -finstrument-functions, which inserts a code at the top and the bottom of every function. These inserted codes access the timestamp counter from TEE on each architecture.
TS-perf also requires the codes for preparing the log buffer and reporting the log to REE (i.e., normal OS). These codes must be executed before or after the main part of TA execution. The GCC compiler offers __attribute__((constructor)) and __attribute__((destructor)) to insert code before and after the main function. However, the linker script for the TA is not compatible across architectures, and it is not easy to insert arbitrary codes.
Fortunately, each TA has its own entry point, which enables the insertion of arbitrary codes before and after the TA. The entry points are eapp_entry, ecall_ta_main, and TA_InvokeCommandEntryPoint on RISC-V Keystone, Intel SGX, and Arm TrustZone, respectively. The code for obtaining a 64KB log buffer is inserted before the entry point. Its size is fixed for simplicity, and it is large enough to profile each function. The code to report the log data to REE is inserted after the entry point. The implementation depends on each architecture, as described below.
Step 2 (Recorder Phase): During the TA execution, the address of the function and the timestamp counter value at each enter/exit are logged into the buffer. After the TA is finished, the buffer data must be written to the log file in REE. (Note: TS-perf runs a log-saving code in REE and writes the log file using the system call of Linux.) In TS-perf in TEE, the __profiler_map_info and __profiler_unmap_info functions are registered in the entry point of each TEE and play an important role in logging. The __profiler_map_info function is inserted before the main function of the TA by __attribute__((constructor)) and prepares the log buffer (64KB). The __profiler_unmap_info function is inserted after the main function of the TA by __attribute__((destructor)) and saves the log to a file in REE (i.e., Linux).
The implementation of the __profiler_unmap_info function is different on each TEE because it needs to collaborate with Linux. In Keystone and SGX, OCALL functions are used to write the log buffer to the file in REE. On the other hand, in TrustZone (namely, OP-TEE), the log data are moved into the shared buffer. After the end of the TA, it is written to the file in REE.
In RISC-V Keystone, the entry point is the eapp_entry function, which is marked with the EAPP_ENTRY keyword and is where __profiler_unmap_info is registered. The __profiler_unmap_info function uses OCALL functions such as open, write, and close for the log file on Linux (in Code 1).

Code 1. Saving log in REE by OCALL on RISC-V Keystone.

Code 2. Saving log in REE by OCALL on Intel SGX.
In Intel SGX, the __profiler_unmap_info is also registered at the SGX entry point ecall_ta_main. The OCALL functions are used in the same way as Keystone, but the arguments are not the same (in Code 2).
In Arm TrustZone, the secure OS OP-TEE manages a TA. Unlike Keystone and SGX, OP-TEE has no OCALL, and we have to write a program to share log data. We prepare the buffer both in REE and in TEE explicitly to share the log data. In REE, the shared buffer is allocated with the output flag. The buffer is conveyed to the TA by TEEC_InvokeCommand, which starts the TA in TEE (in Code 3).

Code 3. Getting Buffer between TEE and REE on OP-TEE/Arm TrustZone.
Inside the TA, the __profiler_unmap_info function is also registered by TA_InvokeCommandEntryPoint. In TEE, after the main program, the __profiler_unmap_info function is called (in Code 4). In the __profiler_unmap_info function, the log data are simply moved to the shared buffer.
When the TA terminates, the TEEC_InvokeCommand function returns in REE. The buffer is packed with the log data for each function. The log data are saved with POSIX-compliant functions such as open, write, and close in REE (in Code 5).
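The REE-side save step can be sketched with the POSIX calls named above. This is an illustration only: the function name `save_log` and its signature are hypothetical, and the listing is not a reproduction of the paper's own code.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes of log data to path.
 * Returns 0 on success and -1 on failure. */
int save_log(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted: retry the write */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

In the OP-TEE case described above, `buf` would be the shared buffer that TEEC_InvokeCommand returns filled with log data.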
Step 3 (Analyzer Phase): The performance result is saved as a binary file. The analysis tool parses the data, organizes them into a readable format, and compares the figures between the different architectures.
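One way to turn such a log into per-function figures is to pair each exit record with the most recent unmatched enter record. The sketch below assumes a hypothetical record layout (function address, timestamp, enter/exit flag); the actual binary format of the TS-perf log is not specified in this section, and the names `ts_record`, `ts_total`, and `ts_analyze` are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical log record and per-function summary. */
struct ts_record { uint64_t func; uint64_t ticks; uint64_t is_exit; };
struct ts_total  { uint64_t func; uint64_t calls; uint64_t ticks; };

#define TS_MAX_DEPTH 64   /* assumed upper bound on call nesting */

/* Accumulate inclusive elapsed ticks per function.
 * Returns the number of entries written to out. */
size_t ts_analyze(const struct ts_record *rec, size_t n,
                  struct ts_total *out, size_t out_cap)
{
    uint64_t open_func[TS_MAX_DEPTH];
    uint64_t open_tick[TS_MAX_DEPTH];
    size_t depth = 0, used = 0;

    for (size_t i = 0; i < n; i++) {
        if (!rec[i].is_exit) {
            /* Enter: push onto the stack of open calls. */
            if (depth < TS_MAX_DEPTH) {
                open_func[depth] = rec[i].func;
                open_tick[depth] = rec[i].ticks;
                depth++;
            }
            continue;
        }
        /* Exit: skip records that do not match the innermost open call. */
        if (depth == 0 || open_func[depth - 1] != rec[i].func)
            continue;
        depth--;
        uint64_t elapsed = rec[i].ticks - open_tick[depth];

        /* Find the summary entry for this function, or create one. */
        size_t k = 0;
        while (k < used && out[k].func != rec[i].func)
            k++;
        if (k == used) {
            if (used == out_cap)
                continue;               /* summary table is full */
            out[used].func = rec[i].func;
            out[used].calls = 0;
            out[used].ticks = 0;
            used++;
        }
        out[k].calls++;
        out[k].ticks += elapsed;
    }
    return used;
}
```

The accumulated ticks are inclusive, that is, the time of a caller contains the time of its callees. Dividing by the counter frequency of each CPU makes the figures comparable across architectures.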

V. EVALUATION
TS-perf measured several types of benchmarks in TEE and REE on the three architectures listed in Table 3. The CPU clock frequency is fixed at 1,000 MHz by cpufreq-set to prevent automatic changes on Intel x86-64. However, we performed the evaluations at 1,400 MHz on Arm because the Raspberry Pi3 B+ offers 600

VOLUME 9, 2021

The evaluations aim to show (1) the accuracy and precision of performance measurement, (2) the difference in TEE implementation on different CPUs, and the difference between TEE and REE on the same CPU, and (3) the unusual behavior in TEE.

A. OVERHEAD FOR OBTAINING TIME
To show the accuracy of TS-perf, we measured the time functions of the GlobalPlatform TEE Internal APIs: TEE_GetREETime() and TEE_GetSystemTime(). TEE_GetREETime() obtains the system clock from REE, and TEE_GetSystemTime() obtains the hardware timestamp in the user-mode of TEE, which is the same source as that used in TS-perf. Table 4 shows the results. TEE_GetREETime() causes an OCALL, and its average time is more than 15 µs on each architecture. On the other hand, the average time of TEE_GetSystemTime() is less than 0.5 µs on each architecture. Hence, TEE_GetSystemTime() is on average at least 30 times faster than TEE_GetREETime() (15 µs / 0.5 µs = 30). The maximum time and standard deviation in TEE_GetREETime() on Arm TrustZone and Intel SGX were higher than those on RISC-V Keystone. We speculate that the differences are caused by the complex hardware on Arm and Intel (e.g., cache hierarchy, branch prediction). The relative maximum time and standard deviation on TEE_GetSystemTime() are less than those on TEE_GetREETime(), but the absolute values of TEE_GetSystemTime() are shorter than those on TEE_GetREETime().
The time-related functions were measured by TS-perf, which uses the same hardware timestamp as TEE_GetSystemTime(). The standard deviations were low, which indicates that TS-perf is stable and accurate.

B. TEE AND REE BENCHMARKS
We use three original benchmarks because existing benchmarks are not suitable for TEE. They assume input/output or system calls that are not supported in TEE. In addition, we want to show a fair performance comparison between TEE and REE. The three benchmarks are CPU, memory, and storage intensive.

1) FEATURES OF BENCHMARKS
CPU Intensive: CPU-intensive benchmarks measure 25,000,000 multiplications of integers or double-precision floating-point numbers. They are simple arithmetic benchmarks and are assumed to have no difference in TEE and REE. The benchmarks utilize the separate compilation technique for a fair comparison. (Note: The iteration number is determined by the average elapsed time on all architectures.)
Memory Intensive: Memory-intensive benchmarks measure 1MB memory read/write access sequentially or randomly. The benchmarks may cause performance differences when the memory is encrypted. However, cache and branch prediction may hide the performance difference. The benchmarks utilize the separate compilation technique for a fair comparison. (Note: The memory size is chosen so that all architectures can be compared.)
Storage Intensive: Storage-intensive benchmarks measure sequential read or write access to a 1MB file only, because the current implementations of the GlobalPlatform APIs for storage (i.e., TEE_WriteObjectData() and TEE_ReadObjectData()) do not allow random access. Read or write access occurs in 32KB units due to the TEE buffer size. Storage depends on the different API implementations in TEE and REE. Therefore, the separate compilation technique cannot be used.
Table 5 summarizes the results for TEE and REE, and Figure 2 visualizes the results for the CPU- and memory-intensive benchmarks.
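The CPU- and memory-intensive measuring targets described above could take roughly the following shape. This is a hedged sketch, not the benchmark source used in the paper: the function names, the multiplier constants, and the linear congruential generator used for random indices are all assumptions. In the evaluation, the iteration count would be 25,000,000 and the buffer size 1MB.

```c
#include <stdint.h>
#include <stddef.h>

/* Integer multiplication loop; volatile keeps the compiler from
 * folding the loop away. The product wraps modulo 2^64. */
uint64_t bench_mul_int(uint64_t iters)
{
    volatile uint64_t acc = 1;
    for (uint64_t i = 0; i < iters; i++)
        acc = acc * 3u;
    return acc;
}

/* Double-precision multiplication loop. */
double bench_mul_double(uint64_t iters)
{
    volatile double acc = 1.0;
    for (uint64_t i = 0; i < iters; i++)
        acc = acc * 1.0000001;
    return acc;
}

/* Sequential write followed by sequential read over the buffer. */
uint64_t bench_mem_seq(volatile uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)i;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Random write and read; indices come from a simple LCG
 * (assumption: any cheap pseudo-random index source would do). */
uint64_t bench_mem_rand(volatile uint8_t *buf, size_t len)
{
    uint32_t x = 12345u;
    uint64_t sum = 0;
    if (len == 0)
        return 0;
    for (size_t i = 0; i < len; i++) {
        x = x * 1664525u + 1013904223u;
        buf[x % len] = (uint8_t)i;
        sum += buf[(x >> 8) % len];
    }
    return sum;
}
```

Because these functions use no system calls or SDK APIs, the same object file can be linked into both the REE and the TEE benchmark binaries, which is what the separate compilation technique requires.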

2) RESULTS OF BENCHMARKS
CPU Intensive: The results show almost the same performance for the multiplication of integers and double-precision floating-point numbers in TEE and REE. These results are quite natural because TEE and REE run on the same core architecture. However, each architecture has a slight difference. Arm TrustZone shows that TEE is approximately 3% slower; this impact is the highest among the three CPUs. The maximum and minimum times did not have large differences, but the differences were stable. We hypothesize that architectural differences are the cause, and the analysis is left to future work.
Memory Intensive: The results show almost the same performance for sequential and random memory access in TEE and REE of Intel SGX and RISC-V Keystone. Arm TrustZone shows that TEE is slower, especially with 11% overhead on random access. Arm has a large impact on random memory access in TEE, and programmers should exercise caution.
We expected that SGX's memory encryption mechanism would cause performance degradation, but the results do not show this effect. We analyzed the performance on Intel further by changing the memory size from 1MB to 32MB. Figure 3 shows the results. The sequential access performance is almost proportional to the memory size, but the random accesses are slower for TEE than for REE. Table 6 shows the detailed results. The performance degradation is not clear until 4MB. We attribute the similar performance on small memory to the CPU cache because the Pentium J5005 has a 4MB L2 cache. We also attribute the degradation in random access in TEE to memory encryption because the effects of the cache are the same in REE and TEE. The overhead for memory encryption is exposed upon large memory random access.
Storage Intensive: The results show the difference between TEE and REE. As expected, the results were unstable because TEE requires OCALL to save the encrypted data in REE (i.e., Linux). On TrustZone, both read and write performance in TEE showed a large difference, perhaps because implementation is complex for OP-TEE in terms of file access. SGX and Keystone cause OCALLs, which include glue code created by the EDL. We expected that the EDL affects the performance. However, the stability is not good, and the results cannot clearly show the effect of the EDL.

C. COMPARING FROM THE VIEW OF REE
The behaviors of TEE benchmarks were monitored by htop on Linux (REE), which shows the load on each core. The results showed two unusual behaviors: (1) the TEE running core was changed, and (2) the core load did not remain at 100% even if a heavy benchmark was run. Table 7 summarizes the results.

1) CORE CHANGE
Because the core maintained a 100% CPU load, htop indicated which core was used for the TEE benchmark. The CPU load category shown by htop was different in each TEE architecture: system load for TrustZone and Keystone and user load for SGX, which indicated the view of the TA from REE. In addition, htop showed that the core with 100% load sometimes changed. This behavior was unusual.
We confirmed that the TA's core was changed when the normal application in REE was changed by the taskset command, which can designate the running application core. These results indicate that the Linux scheduler changes the process's core even if the process uses TEE. Therefore, the current Linux scheduler does not recognize whether a process uses TEE. We regard this as a next research topic for the scheduler to collaborate with TEE.
TrustZone showed more unusual behavior. The TA sometimes did not follow the normal application, and thus, it ran on a different core from the normal application. We suspect that OP-TEE changes the core when it accepts an SMC instruction. This does not violate the rule of the trusted OS, but we could not determine why, leaving this question to future research.

2) CPU LOAD
On RISC-V, htop showed that the core load sometimes changed from 100% to 0%; this was unusual because the TA should consume the CPU until it finishes, and the results of TS-perf did not show a large difference. We analyzed the code of the eyrie runtime, and the unusual behavior led us to find a bug. The bug is in the treatment of handle_timer_interrupt, which shares the platform-level interrupt controller (PLIC) between Linux and eyrie. Because of the bug, the CPU load on a core is omitted. We posit that this result was caused by a design mismatch between TEE and REE and subsequently discuss this topic in section VI-C.

VI. DISCUSSIONS
A. APPLYING TS-PERF TO ANOTHER TEE ARCHITECTURE
TEE implementation is not limited to core sharing, e.g., Apple iPhone's Secure Enclave [46], [47]. The secure enclave is implemented on another CPU and does not cause core change. The performance information is also hidden from the normal OS. This style of TEE architecture is not related to core change and does not affect application performance in REE. In addition, the separated-CPU TEE can avoid microarchitectural vulnerabilities (e.g., Spectre [48]), which also affect the TEE (e.g., ForeShadow [49]). However, the implementation results in a higher cost. Even if the core-sharing style TEE is used, hyperthreading technology causes vulnerabilities to side channel attacks. Disabling hyperthreading is recommended for some CPUs. Fortunately, the CPU used in this paper has no hyperthreading.
The portability of TS-perf can be preserved on a separated-CPU TEE if the time measurement works and communication between REE and TEE is guaranteed. This extension is enabled by the compiler-based performance measurement method.
On the other hand, some core-shared TEE has another cryptographic accelerator, e.g., Secure Element (SE) [50] or Rambus CryptoManager [51], which work as a root of trust [52]. GlobalPlatform defines the API from core-shared TEE to SE [53]. This style hides the performance of the cryptographic accelerator, and current TS-perf cannot cover the performance measurement.

B. COVERAGE OF TS-PERF
Fair performance comparison is a fundamental issue because current hardware and OSes have performance hiding mechanisms, e.g., cache hierarchy, branch prediction, and Linux's page cache for I/O. As mentioned in section V-B, the performance degradation caused by memory encryption was not easy to disclose. These performance hiding mechanisms are effective for small access sizes and fixed patterns. In general, traditional TAs have been used for cryptographic processing, and the binaries were small, so they benefited from the performance hiding mechanisms. However, current TAs are used for machine learning, genome analysis, privacy processing, etc. The codes and data are large, and the processing shows native performance. Since performance tuning becomes more important, TS-perf aims to aid the development of these TAs.
TS-perf is not limited to the same benchmark library and can measure the precise performance of different binaries using the hardware timestamp counter. For example, TS-perf can measure a binary that is optimized for REE or TEE. The results may show another perspective on this difference. This topic is the subject of our future work.

C. INTEGRATED DESIGN BETWEEN REE AND TEE
As mentioned in section I, the programming and execution environments are different between REE and TEE, which includes hardware architecture as well as software architecture. This style was effective on smartphones because the target applications are limited (e.g., key management, DRM management). However, TEE has become popular, and many normal applications want to be executed in TEE (e.g., machine learning and genome analysis). They require the execution of the same normal program in TEE.
To run normal applications without customization, SGX-LKL [54] and SCONE [55] have been developed; however, they cannot offer complete compatibility. For example, SGX-LKL does not support fork(). Current SCONE supports fork() but recommends avoiding fork() based on the performance problem.
We think that these problems are caused by the unfixed abstraction of TEE. As mentioned in section V-C, the mismatch between TEE and REE causes some unusual behavior. A seamless programming style in REE and TEE is desired, but the abstraction model and its supporting formal verification tools are not established. Hence, TEE remains a subject of study in many research fields. TS-perf is a compiler-based performance measurement method and offers a seamless programming tool that can bridge REE and TEE.

VII. CONCLUSION
TS-perf is a general compiler-based performance measurement method and can be applied in many TEE implementations. TS-perf is based on the timestamp counter that is available in REE and TEE on three architectures (Arm Cortex-A, Intel x86-64, and RISC-V U540), and this method enables a fair comparison between REE and TEE. To conduct a fair comparison, we also propose to utilize the separate compilation and enable the use of the same binary in REE and TEE. The performance results showed the sameness (arithmetic performance) and difference (memory encryption and storage) between REE and TEE. The TEE results were also compared with the view from REE, and strange core change and an interrupt-handler bug were found.

AKIRA TSUKAMOTO received the M.S. degree in computer science from Columbia University, New York. He currently works with the National Institute of Advanced Industrial Science and Technology (AIST). He has worked on products based on Cell/B.E. and Arm. His research interests include software engineering on networks, operating systems, and system security, and he is enthusiastic regarding any kind of technical development.