Software-Level Memory Regulation to Reduce Execution Time Variation on Multicore Real-Time Systems

Modern real-time embedded systems are equipped with multi-core processors to execute computationally intensive tasks. In a multi-core architecture, the last-level cache is shared by the cores. The shared cache becomes a non-deterministic resource, which affects the independent execution of real-time tasks. We propose a method to reduce the variation in execution time caused by interference in the shared cache. Existing solutions have relied on memory scheduling approaches that avoid concurrent memory access to guarantee deterministic execution time. However, these methods require complex analysis to accurately estimate the worst-case execution time and schedule tasks in an overly conservative manner. Unlike existing works, the proposed method prevents simultaneous memory access by using the side effect of memory barriers rather than complicated analysis. A memory barrier is inserted based on a simple code analysis that is performed in units of basic blocks using the LLVM compiler. The proposed method requires no modification of the operating system or the task execution flow and also shows a relatively fast analysis time. To verify the proposed method, we compared the standard deviation of the execution time of each core in a situation where shared cache interference occurs on a multi-core processor. Experimental results show that the proposed basic block-based memory barrier insertion method can reduce the variation in execution time by up to 80% when interference occurs.

• This study presents a basic block-based fine-grained analysis method and pass implementation using LLVM to insert a memory barrier in the task code. In particular, it proposes inserting a memory barrier based on the memory footprint of the basic block.

The structure of this paper is as follows. Section II describes the memory structure and memory barriers of the ARM architecture. Section III deals with the procedure and implementation of the basic block-based memory barrier insertion proposed in this study. Section IV describes the experimental environment and analyzes the interference reduction effect of the proposed method. Section V describes related works, and Section VI describes the limitations of the proposed method and future works. Finally, Section VII provides the conclusion.

II. BACKGROUNDS
This study presents a PREM-like method for reducing shared cache interference using a barrier insertion technique that controls task scheduling. The task is analyzed in units of basic blocks, and interference is reduced by inserting memory barriers based on the memory footprint of each basic block. In particular, this paper prevents concurrent access to memory by exploiting the operational side effect of memory barriers for memory access ordering. This section briefly describes the memory hierarchy of the ARM Cortex-A architecture, which is the experimental environment of this study, and then explains the operation of the memory barrier based on it.

Figure 1 shows the memory hierarchy of the ARM Cortex-A53 processor used in this study. We focus on the data cache because this study only considers data memory accesses.

Cortex-A53 is a quad-core processor wherein each core has its own data and instruction caches (level 1 (L1) caches). The L1 data cache is physically indexed and tagged and operates with a write-back (WB) policy: WB marks a cache line as dirty after a cache write and updates external memory only when the cache line is evicted or explicitly cleaned.

The level 2 (L2) cache is organized in the L2 memory system as a unified data and instruction cache. Data is not filled into the L2 cache when it is first fetched from the system; allocation occurs only when a line is evicted from the L1 cache. The L2 memory system is composed of the integrated snoop control unit (SCU), advanced microcontroller bus architecture (AMBA) 4 AXI coherency extensions (ACE), the AMBA 5 coherent hub interface (CHI) master bus interface, the accelerator coherency port (ACP), and the L2 cache. The SCU maintains the coherency of the L1 data cache of each core and connects the four cores into the cluster. The SCU also manages interconnect actions such as arbitration.

In the use of a barrier, the access field parameter can be used as follows [18]:

• Load-load/store: While all loads have to complete before the barrier, stores do not. Loads and stores that appear after the barrier in program order must wait until the barrier is completed.

• Store-store: The barrier only affects stores, and loads can be freely re-ordered.

• Any-any: Both loads and stores must complete before the barrier, and loads and stores after the barrier in program order must wait until the barrier is completed.

Since a wait occurs while a memory barrier is processed, the execution time overhead of inserting barriers must be considered. In addition, since the overhead of a barrier depends on the processor and memory structure, the characteristics of the architecture should also be analyzed. A study analyzing the overhead of memory barriers on the ARM architecture showed that throughput could be reduced by up to five times with DSB [19].

In the ARM architecture, the system's memory map is divided into several areas with various access rights, memory types, and cache policies. Therefore, the range of the memory area can be specified when calling the barrier [17]. The memory areas are depicted in Figure 2:

• Non-shareable (NSH) domain: The NSH is a memory area only a single process can access. Therefore, an NSH area cannot be accessed from other processes. It generally corresponds to the private cache of each processor.

• Inner-shareable (ISH) domain: The ISH is a memory area shared between multiple processors. However, it is not shared with other system domains (e.g., a graphics processing unit (GPU)). Therefore, it includes each processor's private and shared caches.

• Outer-shareable (OSH) domain: An OSH is a memory area shared by one or more domains in the system. As the OSH includes the ISH, it contains the main memory, which can be shared by domains outside the processor, such as a multi-core processor or a GPU.

Since this study aims to reduce shared cache interference between multi-cores, the ISH domain is targeted.

As described above, when a barrier is issued by a core, it is propagated through the ACE and the interconnect. Moreover, with DMB, memory operations that occur after the barrier are held pending. Besides, execution time and throughput overhead occur because of the insertion of memory barriers. To address this problem, this study also analyzes the overhead caused by the proposed method by subdividing the threshold of the memory footprint of the basic block, which is the criterion for inserting a memory barrier.

The aim here is to minimize concurrent memory access on multiple cores by exploiting these side effects. This section describes the basic block-based memory barrier insertion method proposed in this study.

The overall flow for inserting the memory barrier is shown in Figure 3. The compiler generally divides the program into basic blocks as the first step. A basic block is a straight-line code sequence without branches, apart from its entry and exit. Therefore, the code is divided into basic blocks based on branch (br) instructions between the blocks and the end of the program (ret).

In this study, LLVM was used to convert C code to intermediate representation (IR) code for basic block analysis (Figure 3 (a) and (b)), and LLVM Pass was employed to insert barriers (Figure 3 (c)). LLVM is capable of source- and target-independent code generation. After converting the source code, the IR basic blocks were analyzed using the implemented LLVM Pass. To analyze the basic blocks, the analysis and insertion steps were separated into two passes: a Basic Block Parsing Pass and an Insert Memory Barrier Pass. Then, based on the analysis results, IR code with memory barriers inserted was generated. This code was compiled as a static library with Clang (version 9.0.0 armv7l) (Figure 3 (d)). An accurate analysis of the basic blocks is impossible if extra code for execution time measurement is inserted into the source code. Therefore, this study built a static library and called it from external code to measure the execution time. C code was converted into IR using LLVM Clang, and a separate pass was developed to analyze the IR basic blocks using ModulePass. For the experiment, bitcode was converted into human-readable IR code (*.ll) using the llvm-dis tool. The dyn_cast<> template was used to examine the instructions of each basic block. All basic blocks of the program were statically analyzed for barrier insertion using LLVM Pass. For barrier insertion, the memory footprint of each basic block, calculated by examining the block's memory access instructions, was used.

When the example IR code in Figure 4 is executed, it moves to the @atan basic block with a call instruction. When the execution of that basic block is finished, control returns to basic block #110. To identify the precise memory footprint of basic block #110, it should be divided into two parts based on the call instruction.
Algorithm 1 provides pseudo-code that analyzes the memory footprint of a basic block with consideration of this call flow.

Table 2 shows the arguments used for each benchmark. The execution time overhead because of the insertion of memory barriers was analyzed. For the experiment, as shown in Figure 6, each benchmark in Table 1 was executed using only Core 0, and the cycle counter of Cortex-A53 was read at the start and end points of the benchmark to calculate the elapsed cycles. In addition, the cached memory data was cleaned before each benchmark execution.

Figure 8 shows the execution time overhead according to the memory barrier insertion. For the experiment, the Insert Memory Barrier Pass of Section III was used. For comparison, the execution time was also measured when a memory barrier was inserted in all basic blocks of each benchmark. As a result, the execution time of all benchmarks increased when a memory barrier was inserted in all basic blocks. In particular, in the case of Bitcount, the execution time on the small and large datasets increased by 2.2 times. Since the Bitcount benchmark uses several algorithms for bit counting, it has less repetitively executed code than the other benchmarks; further, as shown in Table 2, the total number of basic blocks executed is larger than in the other benchmarks. Therefore, it is concluded that the execution time increases significantly when a memory barrier is inserted in every basic block.

In the small dataset, the increase in execution time from memory barrier insertion was insignificant for all thresholds. On the large dataset, when the threshold was 32, Basicmath and Qsort had a 1.1-fold increase in execution time. The execution times of Bitcount and FFT were not affected, although the basic blocks with memory barriers inserted accounted for 1.2% and 6.0% of the execution at runtime, respectively, as shown in Figure 7b. When the threshold was 64, only Qsort increased its execution time, by 1.1 times. Even though memory barriers were inserted in Basicmath and FFT, there was no change in execution time when the threshold was 128, as in the case of 64.

With the threshold-based memory barrier insertion method proposed in this study, a memory barrier is inserted in every benchmark except Qsort on the small dataset. In particular, at thresholds 32 and 64, the execution time of some benchmarks on the large datasets increased by 1.1 times, and there was no significant change in the others. Therefore, it can be seen that the execution time overhead from the insertion of the memory barrier is not significant.

The benchmarks Basicmath, Bitcount, Qsort, and FFT were assigned to Cores 0, 1, 2, and 3, respectively.

To confirm the effect of the memory barrier insertion, the effect of inter-core interference was analyzed; Figure 10 shows the results. In Qsort, the L2 cache miss ratio increased by 0.2%, but the standard deviation of the execution time was large.

As shown in Table 2, because Bitcount's execution time was the longest, it is concluded that the effect of interference on it is minor, even when it runs simultaneously with the other benchmarks.

In the case of the large dataset, the results were similar to those of the small dataset, except for Bitcount. In the cache miss ratio, Basicmath and Bitcount more than doubled, while Qsort and FFT decreased slightly.

To insert the memory barrier through the basic block analysis proposed in this study, the Insert Memory Barrier Pass shown in Figure 3 was used. To analyze the change in execution time because of the insertion of the memory barrier, the experimental method shown in Figure 9 was followed, as in the previous experiment. Moreover, for comparison, the insertion of memory barriers in all basic blocks was measured.

Figure 11 shows the experimental results on the small dataset. The x-axis of each graph shows the experimental results according to the threshold, and each benchmark corresponds to one of Cores 0-3 (for comparison with the existing graphs, the benchmark's name is written instead of the core number for convenience). Figure 11a shows the increase in execution time because of memory barrier insertion and interference, Figure 11b shows the standard deviation ratio of the execution time, and Figure 11c shows the change in the cache miss ratio for each case. The value in each graph represents the ratio to the corresponding value in the interference case of Figure 10.

When a memory barrier is inserted in all basic blocks (full in Figure 11), the execution time changes by about 1.1 to 2.2 times. The increase in execution time was similar to the overhead caused by the insertion of the memory barrier in Figure 8, but in the case of Qsort, it increased by a further 0.3 times because of interference.

When the threshold was 32, the standard deviation ratio of the execution time increased by 1.6 times in the case of Basicmath; the standard deviations of the remaining benchmarks were reduced by 0.2 times. The cache miss ratio was also similar to that at threshold 32. When the threshold was 128, the execution time did not increase, but Basicmath's execution time standard deviation ratio increased by a factor of 3.0. In the case of the cache miss ratio, except for FFT, the L2 cache miss ratio slightly increased.

Through the experiments on small datasets, the threshold-based memory barrier insertion method did not reduce the execution time compared to the situation where interference occurred, but it typically reduced the standard deviation of the execution time. In particular, when the threshold was 32, the standard deviation of the benchmark execution time on all cores did not increase compared to the interference situation. Moreover, while there was no significant change in cache misses overall, they decreased in the execution of some core benchmarks. In the case of Qsort, no memory barrier was inserted for any threshold, as shown in Figure 7b, but the standard deviation ratio of the execution time was reduced by up to 0.3 times. This means that even if no memory barrier is inserted in a task, it may be affected by the memory barrier operations performed on other cores.

Figure 12 shows the result of inserting memory barriers for the large dataset. The axes and expression of the graphs are the same as in Figure 10.

When interference occurs and memory barriers are inserted in all basic blocks, the results are similar to the memory barrier insertion overhead in Figure 8. As in the experimental results on the small dataset, the large dataset also indicates that the overhead caused by the insertion of the memory barrier is larger than the effect of interference when a memory barrier is inserted in all basic blocks. The standard deviation of the execution time also increased by up to 2.2 times compared to the interference situation, and cache misses increased by 1.6%. When the threshold was 32, the execution times increased by 1.1, 1.0, 1.1, and 1.0 times for each benchmark, but the standard deviations of the execution times were 0.6, 1.0, 0.9, and 0.6 times. In contrast, there was no decrease in the cache miss ratio, which increased by up to 1.6%. When the threshold was 64, the execution time did not increase except for Qsort. However, the standard deviation of the execution time and the cache miss ratio increased by up to 1.5 times. When the threshold was 128, there was no change in the execution time. The standard deviation of the execution time increased by up to 1.6 times, excluding FFT. As shown in Figure 7b, when the threshold was 128, almost no memory barriers were inserted in the benchmarks other than FFT, which explains why only the standard deviation of the FFT execution time partially improved. Accordingly, the cache miss ratio did not show a significant change.

Memory-centric scheduling (MCS) [25] was proposed to avoid or limit concurrent access to shared memory. Task scheduling using time-division multiple access (TDMA) [5], [26], shown in Figure 13, is a typical MCS approach that executes only one task per globally scheduled time slot. In a multi-core architecture, this approach is inefficient because it allows only one core to run at a time.
Therefore, TDMA has low utilization but does not cause inter-core interference. Consequently, tight bounding of the execution time is possible, even in a shared memory structure.

A three-phase execution model was proposed to compensate for the low utilization of TDMA [27], [28], [29], as shown in Figure 13b. The three-phase execution model in Figure 13b is one example of several execution flows. It increases concurrency by dividing a task into a memory-centric (M) phase (''Read'' and ''Write'' in Figure 13b) and a computation (C) phase (''Execution'' in Figure 13b). The M phase prefetches data and instructions from the shared global memory to the local memory. During the C phase, the processor performs computations on the data. Because the C phase does not access shared memory, it avoids contention and can be executed concurrently with the M phases of other cores. A problem with this model is that either the code must be implemented from scratch, or the legacy code must be modified according to the model.

State-of-the-art three-phase execution models [12], [13] comprise automated code analysis, transformation, and scheduling for PREM execution. These studies aim to avoid contention and eliminate interference between cores. Table 3 compares the state-of-the-art PREM methods with the proposed method.

Previous studies performed automated region-based memory profiling for source code transformation using the three-phase model. The source code of the task was divided into several segments for this model. Each segment was then configured to be smaller than the core's private cache (e.g., L1) based on the memory footprint used during code execution. Accordingly, the code was analyzed, and loop unrolling and tiling were performed. Each segment consisted of three phases: read, execute, and write. As its memory usage was larger than that of the original, the transformed code was divided into more segments. Therefore, the time during which memory access had to be isolated on the other cores increased.

The worst-case execution time (WCET) was estimated using ILP analysis to optimize the three-phase task scheduling.

This study addressed the interference in the shared cache that may occur in a multi-core real-time system. A benchmark suite consisting of tasks for a traditional embedded system was used for the experiment. In particular, four benchmarks were selected and tested in two groups according to the amount of input data.

A limitation of this study is the lack of experiments on combinations of benchmarks with various workloads. The impact of shared cache interference may vary depending on the performance characteristics of each benchmark. In particular, memory-intensive deep learning operations have recently been applied to real-time systems [30], [31]. Therefore, future work will analyze the interference effect according to the performance characteristics using various benchmarks. In the experiments, the insertion of the memory barrier is decided based on the threshold, and the same threshold value is applied to the benchmark of each core. It is also necessary to consider the performance overhead and interference reduction obtained by applying a threshold based on the characteristics of each benchmark.

Another limitation of this study is that, unlike the previously proposed PREM studies, inter-core interference may still occur even if a memory barrier is used. Furthermore, optimizations such as out-of-order execution cannot be used because of the inserted memory barriers; the performance degradation from this should also be analyzed.

Finally, the operating characteristics of memory barriers differ depending on each architecture's implementation. Hence, it is necessary to analyze whether the proposed method can reduce the interference caused by the shared cache outside of the ARM architecture.

This study aims to reduce the variation of task execution time caused by the interference from the shared cache. Reducing this variation can assist in the tight bounding of execution time, which is one of the important factors in a real-time system. The occurrence of interference caused by shared cache contention in a multi-core architecture was analyzed, and a method to reduce task execution time variations by inserting memory barriers into the basic blocks of the source code using LLVM Pass was proposed. This study exploited side effects of the memory barrier, such as the delay of memory operation execution and the blocking that occurs when simultaneous memory barrier requests are issued, and presented a fine-grained analysis method that divides a basic block based on its call instructions. The memory footprint of each basic block was used for the memory barrier insertion. Through experiments, the execution time overhead according to the insertion of memory barriers was analyzed to show the variation of execution time by threshold. In particular, when the threshold was 32 bytes, no increase in execution time because of the insertion of the memory barrier was evident. Additionally, it was shown that the standard deviation of the execution time of the tasks on all cores was reduced by up to 80%. In addition, the proposed method has the advantage of not requiring modification of the OS or the task execution flow.