Supporting Swap in Real-time Task Scheduling for Unified Power-Saving in CPU and Memory

As the size of data grows rapidly in modern IoT (Internet-of-Things) and CPS (Cyber-Physical System) applications, the memory power consumption of real-time embedded systems increases dramatically. Unlike general-purpose systems, where memory consumes about 10% of the CPU power, the memory power of modern real-time systems amounts to 20-50% of the CPU power. This is because the memory size of a real-time system should be large enough to accommodate the entire task set, and thus DRAM refresh operations become a major source of power consumption. In this article, we present a new swap scheme for real-time systems, which aims at reducing memory power consumption. To support swap with real-time constraints, we adopt high-speed NVM storage and co-optimize power-savings in CPU and memory. Unlike traditional real-time task models that only consider the executions in CPU, we define an extended task model that also characterizes the memory and storage paths of tasks, and tightly evaluate the worst-case execution time by formulating the overlapped latency between CPU and memory. By optimizing the CPU supply voltage and the memory swap ratio of a given task set, our scheme reduces the energy consumption of real-time systems by 31.1% on average under various workload conditions.


I. INTRODUCTION
With the recent advances in IoT (Internet-of-Things) and CPS (Cyber-Physical System) technologies, reducing the power consumption in battery-based real-time systems is becoming increasingly important [1,2]. Dynamic voltage scaling is a widely used technique for CPU power-saving by lowering the supply voltage of a processor when the computational load of tasks is less than the processing capacity of CPU [3,4,5]. If we lower the voltage supplied to CPU, the computational speed of the processor becomes slow, which increases the execution time of tasks. However, as CPU power consumption is proportional to the square of the supply voltage, although the execution time is increased, the overall energy can be saved. Thus, CPU voltage scaling offers flexibilities in real-time task scheduling by considering the computational load of tasks and energy-saving effects.
Meanwhile, as the size of data grows rapidly in modern embedded systems, the memory power consumption of the system increases dramatically [6,7]. Due to its volatile medium characteristics, DRAM needs continuous refresh of all cells in order to maintain its contents even when no read/write operation is performed. As the memory size increases, the power consumed by refresh also increases, which accounts for a significant portion of total power consumption in real-time embedded systems [8].
Unlike general-purpose systems, real-time systems keep the entire footprint of all tasks in memory to guarantee deadlines, so it is not possible to use virtual memory swap that loads data from storage on demand [9]. This is because predicting the time of accessing code or data in storage is not feasible during the execution of tasks in CPU. Thus, the memory size of a real-time system should be large enough to accommodate the entire task set, which makes DRAM refresh operations the major source of power consumption. Note that this is not the case for general-purpose systems like laptops, where the two main sources of power consumption are CPU and display, while DRAM accounts for only 3% of total power consumption [45]. Specifically, when comparing CPU and memory power in laptops, memory consumes only 10% of CPU power consumption [45]. However, this large gap has been narrowed in modern real-time systems, as CPUs adopt power-saving techniques like voltage scaling while the size of real-time tasks that should reside in memory continues to grow. For this reason, the memory power of mobile embedded systems and real-time systems has increased to 20-50% of CPU power [4,6].
In this article, we present a new swap scheme for real-time systems, which aims at reducing the DRAM size and memory power consumption. To support swap with real-time constraints, we adopt high-speed NVM (non-volatile memory) storage and accurately estimate the swap latency. NVM technologies have recently attracted attention, and some commercial products like Intel's Optane™ are already available on the market [12]. As high-speed NVM storage has low variance in access time [10,13], our idea is to place a certain part of a task in NVM storage rather than shadowing it in memory. This can reduce the size of DRAM and the memory energy consumption in real-time embedded systems.
We then integrate our swap scheme with CPU voltage scaling and formulate the effect of the two techniques as a unified measure. Although CPU voltage scaling and memory swap reduce the energy consumption, they increase the execution time of tasks, possibly resulting in deadline misses of real-time tasks. Thus, we define an extended task model that characterizes memory and storage latency as well as CPU executions, and accurately estimate the worst-case execution time when adopting these energy-saving techniques. In our model, we tightly evaluate the worst-case execution time by formulating the overlapped latency between CPU, memory, and swap storage. As co-optimizing the power-saving configurations of CPU and memory with real-time constraints is a complex optimization problem, we use genetic algorithms to determine the CPU voltage level and the memory swap ratio of a given task set.
To assess the effectiveness of the proposed scheme, we perform simulation experiments for a wide range of workload conditions. Our experimental results show that the proposed scheme significantly reduces the power consumption of real-time systems. Specifically, the energy-saving effect is 31.1% on average and up to 45.6% without deadline misses. The main contributions of this article can be summarized as follows.
• Unlike traditional real-time task models that only consider executions in CPU, we define an extended task model that also characterizes the memory and storage paths of tasks.
• We propose a swap scheme for real-time systems, which partially swaps out a certain portion of a task and restores it before the task activates, thereby preventing page faults.
• Our model tightly evaluates the scaled worst-case execution time of a task, considering the overlapped latency between CPU, memory, and swap storage, to minimize overall energy consumption.
• We design a steady-state genetic algorithm to co-optimize the energy consumption in CPU and memory without deadline misses by defining appropriate cost functions and genetic operators.
The remainder of this article is organized as follows. Section II briefly summarizes previous works related to this article. In Section III, we explain the partial swap scheme and integrate it with CPU voltage scaling for energy-efficient real-time task scheduling. Section IV describes the optimization of our problem with genetic algorithms. In Section V, we present experimental results to validate the effectiveness of the proposed scheme. Finally, we conclude this article in Section VI.

A. SWAP IN REAL-TIME SYSTEMS
Traditional memory swap used in general purpose systems determines the part of a task to be swapped based on the prediction of re-reference likelihood by the replacement algorithm [19,28]. For example, the CLOCK algorithm evicts a page not used recently as it is not likely to be used again in the near future. However, page faults may occur in these systems as an evicted page can be used again. Unlike such types of systems, real-time systems do not allow page faults since unpredictable I/O latency may incur deadline misses [9]. Thus, the full address space of a task is pinned on the physical memory once a task starts its execution.
In order to satisfy this semantic, swap in real-time systems should work in a completely different way from that in general-purpose systems. To this end, our scheme swaps out a certain portion of a task during its inactive period, but restores the swapped part to physical memory before its activation. Thus, it is guaranteed that the entire footprint of a task resides in physical memory while the task is active. This indicates that we do not need to consider the target of swap. Thus, the main focus of our swap is to determine how much of a task's memory should be involved in swap rather than to design a replacement algorithm.
Previous studies on paging systems have suggested some replacement algorithms for soft real-time tasks. Kim et al. adopt flash memory as a code storage and present a new page replacement algorithm in portable media players [21]. Lee et al. present MRT-PLRU (multitasking real-time constrained combination of pinning and LRU) that combines pinning and LRU (least recently used) policies to reduce the memory size of real-time systems [22]. However, these studies probabilistically guarantee the real-time task's deadlines without fully satisfying the constraints of hard real-time systems. Thus, they are different from our approach that meets the complete real-time constraints with memory swap.

B. REAL-TIME TASK SCHEDULING
Real-time task scheduling has been widely studied for decades. For periodic tasks, EDF (Earliest Deadline First) is known to find a schedule that does not miss the deadlines of all tasks in the task set if there exists any feasible schedule. However, EDF cannot be used if there are multiple processors or cores to execute the task set. Baruah et al. present the Pfair (Proportionate-fair) scheduling that optimally and efficiently schedules periodic tasks on symmetric multiprocessors [24]. Pfair scheduling differs from traditional real-time scheduling principles in that tasks are explicitly required to proceed at a steady rate.
Anderson and Srinivasan present a work-conserving version of Pfair scheduling called ER-fair (Early-Release fair) [30]. ER-fair differs from original Pfair scheduling in that it allows the execution of the latter part of a task as soon as the former part of the same task is completed. Anderson and Srinivasan also define the notion of intra-sporadic task, in which subtasks of a task may be released late, and present variants of Pfair and ER-fair scheduling [31]. They prove the feasibility condition for scheduling intra-sporadic tasks, and present a polynomial-time algorithm that can be used to optimally schedule intra-sporadic tasks on 1 or 2 core processors.
In real-time systems, the utilization test with the worst-case execution time of tasks should be done beforehand, as we need to know whether the fixed resources can accommodate the given real-time task set. However, actual executions may complete much earlier than the worst case, which leads to a significant waste of resources. To cope with this situation, some reactive schemes have been presented. Chen et al. decide the baseline schedule of the task set in advance, but while the tasks are actually executed, new proactive schedules are generated by considering the completion of tasks or the arrival of new tasks, leading to efficient resource management [32]. This can be adopted in cloud computing environments where resources can be scaled as the workload evolves.
Dehnavi et al. utilize the hybrid cloud infrastructure for scheduling of real-time tasks in industrial systems [33]. They propose resource provisioning policies to partition a given workload among different computing tiers, including local private clouds, edge nodes, fog nodes, and public cloud data centers. Zhou et al. propose a theoretical model for real-time scheduling problems in dynamic cloud manufacturing services [34]. To improve performances, they also propose a scheduling policy based on dynamic data-driven simulations.

C. DYNAMIC VOLTAGE SCALING
Dynamic voltage scaling has been studied extensively for the power-saving of processors in various industrial systems [1,2]. Pillai and Shin propose three techniques to find the lowest voltage level that meets the deadlines of a given real-time task set [1]. They are static, cycle-conserving, and look-ahead voltage scaling techniques. Static voltage scaling selects the voltage level of a processor statically, whereas cycle-conserving voltage scaling makes use of the reclaimed cycles to decrease the voltage level of a processor if the execution of a real-time task finishes earlier than its worst-case execution time. Look-ahead voltage scaling tries to lower the voltage level of a processor even more by analyzing the required amount of computation in the near future, and postpones the scheduling of the task based on the result of the analysis.
Lee et al. make use of the slack time to lower the voltage level of a processor [16]. Specifically, the voltage level of a processor is lowered if some clock cycles are reclaimed when a task completes before its deadline. Ghor and Aggoune aim to find the schedules with the least voltage level of a processor for real-time tasks by utilizing the slack time [2]. In particular, their algorithm aims at stretching the worst-case execution time of real-time tasks as much as possible without violating deadlines. Nam et al. present a task scheduling policy for real-time systems that aims at reducing the energy consumption of CPU and memory selectively by considering the relative energy-saving effect of the two layers [4]. To reduce the time overhead of scheduling and maximize the power-saving effect, their policy adopts dynamic programming with the constraint of resource utilization. Bahn et al. present a new task model for hybrid memory placement in hard real-time systems [15]. Specifically, they re-evaluate the worst-case execution time of a task by considering the memory location of each task in heterogeneous memory environments.

III. THE PROPOSED SCHEME
In our task model, a real-time task set is defined as a set of n tasks {τ1, τ2, …, τn}, and the target system has a CPU with a voltage scaling function and main memory supporting swap, as shown in Figure 1. A task τi is characterized by <Ci, Ti, Mi>, where Ci is the worst-case execution time of τi with the default CPU voltage and no memory swap, Ti is the period of τi, and Mi is the memory footprint of τi. We consider periodic real-time tasks, and thus the deadlines are implicitly determined by the period.
By following the common assumptions of real-time task models in previous work [4,16], and considering our partial swap model, we make the following four assumptions.

Assumption 1. All tasks are independent, and thus the result of a task does not affect others.
There may be some dependent tasks in real-world task sets, but most real-time scheduling studies make this assumption to simplify the problem without loss of generality. That is, dependent tasks can be merged into a single task, as they should be performed sequentially by using the result of the preceding task as the input to the following task.

Assumption 2.
Tasks can be preempted and the overhead of context switch from one task to another is negligible.
In computer systems, CPU is a representative resource that allows the preemption of a task during its execution. This is because the context of a task can be easily saved and restored without incurring large overhead. Specifically, the context switch of a processor usually takes 5-10 microseconds, which is less than 0.01% of the minimum time quantum between context switches, and thus we can hide it by including it in the actual execution time of a task [35].

Assumption 3.
When the target clock frequency is determined, the supply voltage of CPU can be adjusted accordingly.
When the clock frequency of a processor increases, the supply voltage of a processor also becomes higher. Although they do not have exact linear relations, it is known that the supply voltage of real processors is adjusted according to clock frequency based on a linear-like function. For example, in Transmeta Crusoe processors, when the clock frequency is changed from 500 MHz to 1 GHz, the supply voltage is adjusted from 1.35 V to 2.80 V [36].

Assumption 4. The access time of swap storage is predictable.
This was not possible in HDD or flash storage, where the access time depends heavily on the internal state of storage [20]. In HDD storage, the access time varies depending on the disk scheduling and the head movement. In flash memory, the access time fluctuates greatly when garbage collection is performed [18,19]. However, NVM storage has predictable access time [11,14], and thus we can estimate the worst case access time to load storage data to memory if we know the size of data to be swapped.

A. BASIC MODEL
In our model, the worst-case execution time Ci of a task should be adjusted as we use CPU voltage scaling and memory swap. That is, Ci should be recalculated based on the longest time path between CPU and memory, considering the increased latency due to the lowered supply voltage and swap I/O. Since executions in CPU and memory can be overlapped, the scaled worst-case execution time is determined by the slower of the two time components, namely executing instructions in CPU and accessing memory. That is, we tightly estimate the latency that may overlap between CPU and memory by defining the function f for scaling Ci as follows.
$$f(C_i) = \max\{ f^{CPU}(C_i),\ f^{SWAP}(C_i) \} + \varepsilon_i$$
Here, $f^{CPU}(C_i)$ and $f^{SWAP}(C_i)$ are the scaled worst-case execution times of τi obtained by applying CPU voltage scaling and swap, respectively, and $\varepsilon_i$ is the stall factor for executing swap I/O commands in CPU. $f^{CPU}(C_i)$ stretches Ci in inverse proportion to the clock frequency,
$$f^{CPU}(C_i) = \frac{C_i}{\mu_i}$$
where $\mu_i$ is the relative clock frequency of CPU compared to the default frequency to execute task τi, while $f^{SWAP}(C_i)$ is defined in terms of ri, the swap ratio of τi, and $latency^{SWAP}$, the time required to access the swap storage as a function of the swap I/O size.
Based on this model, the schedulability test of a real-time task set can be performed by the following utilization test,
$$\sum_{i=1}^{n} \frac{f(C_i)}{T_i} \le 1 \qquad (4)$$
which requires that the total utilization computed from the scaled worst-case execution times after adopting CPU voltage scaling and memory swap does not exceed the capacity of the processor.
If a real-time task set passes this schedulability test, we can obtain a feasible schedule for the given set of tasks by the earliest deadline first (EDF) algorithm [16,17]. Note that EDF schedules the task with the nearest deadline first.
Let us look at TABLE I to see an example situation of the utilization test. There are two tasks, τ1 and τ2, whose worst-case execution times C1 and C2 are 4 and 8, respectively, and whose periods are both 25. The schedulability of the task set can be tested by calculating the utilization of the tasks, i.e., U = 4/25 + 8/25 = 0.48. As U is less than 1, the task set is schedulable. Figure 2(a) shows the scheduling result for the example in TABLE I. As shown in the figure, all the tasks can be executed within their deadlines, but the schedule contains a large portion of idle intervals.
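The utilization test and the resulting EDF schedule for this example can be checked with a minimal sketch. The function names `utilization` and `edf_schedule` are hypothetical, and the simulator assumes integer execution times and periods, unit-length time slots, and deadlines equal to periods; it is an illustration rather than the simulator used in our experiments.

```python
from fractions import Fraction

def utilization(tasks):
    """Total utilization of periodic tasks given as (wcet, period) integer pairs."""
    return sum(Fraction(c, t) for c, t in tasks)

def edf_schedule(tasks, horizon):
    """Simulate preemptive EDF in unit time slots.

    Returns (timeline, missed): timeline[k] is the index of the task run in
    slot k (None for an idle slot), and missed tells whether any job was
    still unfinished at its deadline.
    """
    n = len(tasks)
    remaining = [0] * n   # work left in the current job of each task
    deadline = [0] * n    # absolute deadline of the current job
    timeline, missed = [], False
    for now in range(horizon):
        for i, (c, t) in enumerate(tasks):
            if now % t == 0:  # a new job is released at each period start
                missed = missed or remaining[i] > 0
                remaining[i], deadline[i] = c, now + t
        ready = [i for i in range(n) if remaining[i] > 0]
        # EDF rule: run the ready job with the nearest deadline.
        run = min(ready, key=lambda i: deadline[i]) if ready else None
        if run is not None:
            remaining[run] -= 1
        timeline.append(run)
    # Jobs whose deadline falls within the horizon must also be finished.
    missed = missed or any(r > 0 and d <= horizon
                           for r, d in zip(remaining, deadline))
    return timeline, missed

tasks = [(4, 25), (8, 25)]          # (C_i, T_i) of the two example tasks
u = utilization(tasks)
timeline, missed = edf_schedule(tasks, horizon=25)
print(u, u <= 1, missed, timeline.count(None))  # 12/25 True False 13
```

The 13 idle slots out of 25 correspond to the unused capacity 1 - 0.48 = 0.52 of each period.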
This idle slot can be reduced by lowering the supply voltage of CPU, thereby increasing the system utilization. For example, if a low relative clock frequency of 0.5 is applied to both tasks τ1 and τ2, the scaled worst-case execution times f(C1) and f(C2) will be 8 and 16, respectively. As a result, the CPU utilization increases to U = 8/25 + 16/25 = 0.96, where U < 1 is still satisfied, so the task set is schedulable. Figure 2(b) shows the scheduling result after voltage scaling is adopted. As we see, the idle slot is greatly reduced compared to Figure 2(a), which will eventually lead to reduced power consumption.
Now, let us see how the power consumption can be further reduced by considering the memory system. In real-time systems, storage I/O does not occur as all tasks reside in memory and virtual memory swap is not allowed. This is because storage I/O increases the memory access time excessively and makes the prediction of the execution time difficult. However, we focus on the fact that recently emerged NVM storage has fast and predictable access time, and thus we suggest partial swap that places some part of a task in NVM storage and loads the swapped part to memory before the task is executed in CPU. This eventually leads to the reduction of the DRAM capacity, contributing to the power-saving of the system.
As swap-out and swap-in need I/O time, the worst-case execution time of tasks should be re-evaluated by considering this latency. If we increase the swap ratio of a task, more memory power can be saved by reducing the DRAM capacity, but the worst-case execution time of the task may increase due to the handling of swap I/O. Conversely, if we lower the swap ratio, the worst-case execution time of a task may increase less, but the power-saving effect would be reduced.
Meanwhile, there is a trade-off between memory swap and CPU voltage scaling. For example, if we increase the swap ratio, the possibility of lowering the CPU voltage is decreased. Thus, it is necessary to maximize power-savings by combining these two techniques appropriately. In addition, as CPU voltage scaling and memory swap can be overlapped, tight evaluation of the worst case execution time is necessary.
Let us see how our partial swap works with the example in TABLE I. Figure 3 shows the result of applying swap in conjunction with the CPU voltage scaling of Figure 2(b). In the figure, the swap ratios of tasks τ1 and τ2 are both 0.5, and a swap command takes 0.5 time units in CPU. Note that the exact locations of the swap commands may vary in real situations, but we have marked them at the end of each task for simplicity. Recall that in the schedule of Figure 2(b), 1 time unit per period remains after adopting voltage scaling. Thus, the swap overhead should not exceed this remaining time slot. Although the I/O time for swap-out and swap-in increases as the swap ratio of a task becomes higher, actual swap I/Os can be overlapped with CPU executions. Thus, in an ideal case, the I/O latency can be hidden and the worst-case execution time will be increased only by the swap I/O command in CPU. However, as the swap ratio increases, the swap I/O time of a task may exceed the inactive period of the task. For this reason, the swap ratio of a task is generally set to less than 1.0. As shown in Figure 3, the swap ratio in our example is set to 0.5, and the memory size is reduced to 75% of that of the system without swap. Figure 3 also shows how the design of the system can be changed by adopting our swap with this example.
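The arithmetic of this example can be reproduced with a short sketch. It assumes that the scaled worst-case execution time is the slower of the CPU path and the swap path plus the CPU time of the swap command, as described above. The swap-path times of 6 and 12 time units are hypothetical values chosen only so that the swap I/O is fully hidden behind the CPU execution, and the names `scaled_wcet` and `is_schedulable` are illustrative.

```python
from fractions import Fraction

def scaled_wcet(c, mu, swap_path, eps):
    """Scaled WCET: slower of the CPU and swap paths, plus the swap command time.

    c         -- WCET at the default frequency without swap
    mu        -- relative clock frequency (1 = default)
    swap_path -- time of the swap path (hypothetical input, not derived here)
    eps       -- CPU time spent issuing swap I/O commands
    """
    cpu_path = Fraction(c) / Fraction(mu)
    return max(cpu_path, Fraction(swap_path)) + Fraction(eps)

def is_schedulable(tasks, cores=1):
    """Utilization test over tasks given as (c, period, mu, swap_path, eps)."""
    total = sum(scaled_wcet(c, mu, sp, eps) / period
                for c, period, mu, sp, eps in tasks)
    return total, total <= cores

half = Fraction(1, 2)
tasks = [
    (4, 25, half, 6, half),    # task 1: CPU path 8, swap path hidden, +0.5
    (8, 25, half, 12, half),   # task 2: CPU path 16, swap path hidden, +0.5
]
total, ok = is_schedulable(tasks)
print(total, ok)  # 1 True
```

With both swap commands accounted for, the scaled execution times are 8.5 and 16.5, which exactly consume the 1 time unit left per period by voltage scaling. Passing `cores=K` gives the corresponding test for a K-core processor.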

B. EXTENDING THE BASIC MODEL
We now discuss how the basic model can be extended for multi-core systems. In our CPU model, we assume that all cores have the same computing power (i.e., a symmetric multi-core processor), as in most server and embedded processor architectures. In this architecture, if a task executed on one core is interrupted, it can be resumed later on another core. However, a task is not allowed to be executed on multiple cores concurrently, as a single task should be executed in a sequential manner.
Based on this multi-core architecture, the feasibility test in our basic model can be simply extended by replacing the right side of Equation (4) by K, where K is the number of cores in the processor, that is
$$\sum_{i=1}^{n} \frac{f(C_i)}{T_i} \le K \qquad (4)'$$
In a single-core processor system, if the schedulability test is passed, we can use the EDF algorithm to determine a schedule that does not miss the deadlines of all tasks. However, this does not work in multi-core processors as a single task is not allowed to be executed in multiple cores concurrently. To cope with this situation, Pfair (Proportionate-fair) scheduling has been introduced as a way of scheduling periodic tasks on multi-core/multi-processor systems [24,30]. The basic philosophy of Pfair is similar to EDF, but it performs scheduling based on time quanta to make each task progress at a steady rate. We determine the schedule of a task set through Pfair scheduling when it passes the utilization test of Equation (4)'.

C. ENERGY POWER MODEL
The CPU energy $E^{CPU}$ of a CMOS processor is dominated by charging and discharging gates in circuits, and can be formulated as a function of supply voltage and operating frequency [37,38], that is
$$E^{CPU} = \sum_{i=1}^{n} c \cdot V_i^{2} \cdot f_i \cdot t_i$$
where c is the effective switch capacitance, Vi is the supply voltage for task τi, fi is the CPU clock frequency for executing task τi, and ti is the time to execute task τi under this CPU mode.
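To illustrate why lowering the voltage saves energy even though the execution time grows, the following sketch evaluates the standard CMOS energy expression c·V²·f·t for a fixed amount of work at two operating points. The linear voltage mapping in `supply_voltage`, its endpoint values, and the capacitance value are hypothetical and do not represent the processor model used in our evaluation.

```python
import math

def cpu_energy(c_eff, voltage, freq_hz, seconds):
    """Dynamic CPU energy c * V^2 * f * t for one task under one CPU mode."""
    return c_eff * voltage ** 2 * freq_hz * seconds

def supply_voltage(mu, v_min=0.2, v_max=1.0):
    """Hypothetical linear voltage/frequency mapping (an assumption)."""
    return v_min + (v_max - v_min) * mu

def task_energy(cycles, mu, f_max_hz=1.6e9, c_eff=1e-9):
    """Energy to execute a fixed number of cycles at relative frequency mu."""
    freq = mu * f_max_hz
    seconds = cycles / freq          # a lower frequency stretches the run time
    return cpu_energy(c_eff, supply_voltage(mu), freq, seconds)

full = task_energy(1.6e9, mu=1.0)    # 1 s at the default frequency
half = task_energy(1.6e9, mu=0.5)    # 2 s at half the frequency
ratio = half / full                  # depends only on (V_half / V_full)^2
print(round(ratio, 2))
```

Since f·t is the constant cycle count, the execution time cancels out and the energy ratio reduces to the squared voltage ratio, (0.6/1.0)² under the assumed mapping.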
In our model, the supply voltage Vi is adjusted as the clock frequency fi varies. It is known that the clock frequency and the supply voltage have a linear-like relation, although the relation is not exactly linear. The function that models this relation depends on the processor, and we use the ARM Cortex-R52 processor model [39].
The memory energy $E^{MEM}$ is the sum of the dynamic energy $E^{M}_{dyn}$ and the static energy $E^{M}_{stat}$ [40], that is
$$E^{MEM} = E^{M}_{dyn} + E^{M}_{stat}$$
The dynamic energy $E^{M}_{dyn}$ is the energy consumed while read or write operations are performed [41], which can be modeled as
$$E^{M}_{dyn} = \sum_{i=1}^{n} \left( \mathit{read}_i \cdot E^{M}_{read} + \mathit{write}_i \cdot E^{M}_{write} \right)$$
where readi and writei are the numbers of memory read and write operations of task τi, respectively, and $E^{M}_{read}$ and $E^{M}_{write}$ are the read and write energy for the default access size of DRAM, respectively.
The static energy $E^{M}_{stat}$ is the energy consumed constantly irrespective of any operations in DRAM, which is calculated from $P^{M}$, the static power of DRAM per unit capacity, ri, the swap ratio of task τi, and T, the total running time of the system.
The storage energy $E^{STR}$ of our swap is calculated from ri, the swap ratio of task τi, ni, the total number of swaps performed on task τi, and $E^{STR}_{read}$ and $E^{STR}_{write}$, the read and write energy for the unit access size of NVM, respectively. Note that we do not consider the static power of storage as it is non-volatile, and thus does not spend refresh power.
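A small sketch of the DRAM part of this model is given below. It takes the DRAM capacity as a direct input rather than deriving it from the swap ratios, and the 4 GB capacity, one-hour running time, 64-byte access size, and access counts are hypothetical; the 1 W/GB static power and 0.1 nJ/bit access energy are the DRAM parameters of our experimental configuration.

```python
import math

def dram_dynamic_energy(accesses, e_read, e_write):
    """Dynamic energy: per-task (reads, writes) counts times per-access energy."""
    return sum(reads * e_read + writes * e_write for reads, writes in accesses)

def dram_static_energy(p_static_w_per_gb, capacity_gb, seconds):
    """Static (refresh-dominated) energy: power per capacity * capacity * time."""
    return p_static_w_per_gb * capacity_gb * seconds

# Static energy over one hour for a hypothetical 4 GB system, and for the
# same system with DRAM reduced to 75% of its size by partial swap.
full = dram_static_energy(1.0, 4.0, 3600.0)      # joules
reduced = dram_static_energy(1.0, 3.0, 3600.0)   # joules
saving = 1 - reduced / full

# Dynamic energy with 0.1 nJ/bit and an assumed 64-byte (512-bit) access size.
e_access = 0.1e-9 * 512
dynamic = dram_dynamic_energy([(1000, 500), (2000, 0)], e_access, e_access)
print(full, reduced, saving)  # 14400.0 10800.0 0.25
```

In this toy setting the static term is several orders of magnitude larger than the dynamic term, which is consistent with the observation that refresh is the major source of memory power consumption.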

IV. OPTIMIZATIONS WITH GENETIC ALGORITHMS
Our problem is to select the CPU voltage level and the memory swap ratio of all real-time tasks in the task set, with the aim of minimizing the energy consumption of the system under deadline constraints. This is a kind of combinatorial optimization problem known to be NP-hard. For example, if there are 4 CPU voltage levels and 5 memory swap ratios, the number of possible states for each task is 20. When the number of tasks is N, there are 20^N cases, and searching all of these is not feasible even with high-end server systems.
To cope with this situation, we explore our search space by genetic algorithms [23]. Specifically, we maintain a small number of candidate solutions that represent the voltage level and the swap ratio of the tasks, and evolve the solution set until it converges. Typical genetic algorithms evolve the solution set by completely replacing an old set with a new one at each iteration. However, this usually loses some of the superior solutions maintained so far, making convergence difficult. Specifically, if there are multiple domains to optimize together as in our problem (i.e., CPU and memory), it takes much time to converge, and in some cases, the solution set does not converge even after a large number of iterations [9]. To resolve this issue, we replace only a few solutions per iteration. This type of genetic algorithm is called a steady-state GA, which has the ability of fast convergence [25].

A. ENCODING
In genetic algorithms, a solution is typically represented by a linear string. As our problem needs to determine the memory swap ratio and the CPU voltage level of all tasks in the task set, we use two strings and the length of the string is equal to the total number of tasks as shown in Figure 4.
In theory, we can set various levels of CPU voltage and memory swap ratio, but in practice we need to set a limited number of levels. The default setting of our configuration consists of 4 CPU voltage levels and 4 swap ratios. Specifically, in our encoding, each entry in the CPU string is represented by a 2-bit value {0, 1, 2, 3}, which represents the clock frequency of CPU {1, 0.5, 0.25, 0.125}, respectively. Similarly, entries in the memory string can have a 2-bit value {0, 1, 2, 3} representing the swap ratio of {0, 0.125, 0.25, 0.5}, respectively. Note that task τ1 in Figure 4 is executed under the CPU clock frequency of 1 and the swap ratio of τ1 is 0. Based on this encoding method, we randomly generate 100 solutions as an initial population.
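The encoding described above can be sketched as follows. The names `FREQ_LEVELS`, `SWAP_LEVELS`, `random_solution`, and `decode` are illustrative, and the task count of 6 and the random seed are arbitrary choices.

```python
import random

FREQ_LEVELS = (1.0, 0.5, 0.25, 0.125)   # gene value -> relative clock frequency
SWAP_LEVELS = (0.0, 0.125, 0.25, 0.5)   # gene value -> swap ratio

def random_solution(n_tasks, rng):
    """One solution: a CPU string and a memory string, one 2-bit gene per task."""
    cpu = [rng.randrange(len(FREQ_LEVELS)) for _ in range(n_tasks)]
    mem = [rng.randrange(len(SWAP_LEVELS)) for _ in range(n_tasks)]
    return cpu, mem

def decode(solution):
    """Map gene values to (clock frequency, swap ratio) pairs, one per task."""
    cpu, mem = solution
    return [(FREQ_LEVELS[g], SWAP_LEVELS[h]) for g, h in zip(cpu, mem)]

rng = random.Random(0)                   # fixed seed for reproducibility
population = [random_solution(6, rng) for _ in range(100)]

# A task with CPU gene 0 and memory gene 0 runs at frequency 1 with no swap.
print(decode(([0, 1], [0, 3])))          # [(1.0, 0.0), (0.5, 0.5)]
```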
In genetic algorithms, a cost function is needed to evaluate the quality of a solution. We define our cost function as the energy consumption of the system when the scheduling is performed with the resource configurations that the solution represents. If the CPU utilization exceeds 1 and thus the scheduling is not feasible with the given solution, a penalty value is added to the cost function. That is,
$$Cost(i) = Energy(i) + \alpha \cdot Penalty(i)$$
where Energy(i) is the energy consumption of the tasks scheduled by solution i, α is the weight factor, and Penalty(i) is the penalty function of solution i, which applies in case the solution does not pass the schedulability test.
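A minimal sketch of such a penalized cost is shown below. The specific penalty used here, the amount by which the utilization exceeds the processor capacity, is an assumption made for illustration and is not necessarily the penalty function of our implementation; the weight value is likewise hypothetical.

```python
def cost(energy, utilization, alpha, capacity=1.0):
    """Energy plus a weighted penalty for infeasible solutions.

    The penalty form (utilization excess over capacity) is an assumption:
    it is zero for schedulable solutions and grows with the violation.
    """
    penalty = max(0.0, utilization - capacity)
    return energy + alpha * penalty

print(cost(10.0, 0.96, alpha=1000.0))  # 10.0 (feasible, no penalty)
print(cost(10.0, 1.5, alpha=1000.0))   # 510.0 (infeasible, penalized)
```

A large weight α keeps infeasible solutions from outcompeting feasible ones, while a graded penalty still lets the search distinguish slightly infeasible solutions from grossly infeasible ones.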

B. SELECTION OF PARENT SOLUTIONS
A selection operation chooses two parent solutions in the current population for generating one or two offspring solutions to evolve the population. This is usually based on a probabilistic rule that assigns higher probabilities to better solutions in order to improve the solution set. In our problem, the goodness of a solution is evaluated based on the cost of the solution. However, if the selection probability is excessively biased, the result of selection may be limited to a small number of extremely superior solutions. This has a risk of premature convergence to a local optimum as the characteristics of a few solutions may rapidly dominate the entire population. To cope with this situation, instead of assigning a selection probability based on the cost of a solution as it is, we rank the solutions by their cost order and then assign selection probabilities based on their ranks. Specifically, we normalize the selection probability such that the best solution in the population is four times more probable to be selected than the worst one [23].
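Rank-based selection with a 4:1 ratio between the best and the worst solution can be sketched as follows. Linear interpolation of the weights between the two extremes is an assumption, since only the ratio is fixed by the description above, and the function names are illustrative.

```python
import random

def rank_weights(n, ratio=4.0):
    """Selection weights by rank: best gets `ratio`, worst gets 1, linear between."""
    if n == 1:
        return [1.0]
    return [ratio - (ratio - 1.0) * k / (n - 1) for k in range(n)]

def select_parents(population, costs, rng):
    """Pick two parents with probability determined by cost rank, not raw cost."""
    order = sorted(range(len(population)), key=lambda i: costs[i])  # best first
    weights = rank_weights(len(order))
    first, second = rng.choices(order, weights=weights, k=2)
    return population[first], population[second]

rng = random.Random(1)
population = [f"s{i}" for i in range(10)]      # stand-ins for encoded solutions
costs = [float(i) for i in range(10)]          # s0 is the best, s9 the worst
parent_a, parent_b = select_parents(population, costs, rng)
```

Because the weights depend only on rank, a solution with an extremely low cost is still at most four times more likely to be chosen than the worst one, which limits premature convergence. Note that this sketch samples with replacement, so the same solution may be drawn as both parents.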

C. CROSSOVER AND MUTATION
The crossover operation merges a certain part of the string from two parents to generate offspring. We use 1-point crossover that randomly selects a cut point within a string, and the offspring is generated by copying the left segment of the cut point from one parent and the right segment from the other. As we have two strings that represent CPU and memory configurations, we select the cut point of each string to perform 1-point crossover independently.
After crossover, a mutation operation is performed on the generated offspring, which perturbs a certain location of the strings in order to search the problem space widely and avoid staying in a local optimum. Our mutation selects a random location in the strings and changes it to another random value. The mutation probability of our genetic algorithm is set to 0.01.
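The two operators can be sketched as follows, with a solution represented as a pair of equal-length lists (CPU string, memory string). Applying the mutation probability once per string is one possible reading of the description above and is an assumption, as are the function names.

```python
import random

def one_point_crossover(a, b, rng):
    """Left segment from parent a, right segment from parent b (len >= 2)."""
    cut = rng.randrange(1, len(a))   # cut strictly inside the string
    return a[:cut] + b[cut:]

def crossover(parent1, parent2, rng):
    """Apply 1-point crossover to each string with an independent cut point."""
    return tuple(one_point_crossover(s1, s2, rng)
                 for s1, s2 in zip(parent1, parent2))

def mutate(solution, rng, p_mut=0.01, n_levels=4):
    """With probability p_mut per string, set one random gene to a new value."""
    mutated = []
    for string in solution:
        string = list(string)        # copy so the input is left untouched
        if rng.random() < p_mut:
            pos = rng.randrange(len(string))
            others = [v for v in range(n_levels) if v != string[pos]]
            string[pos] = rng.choice(others)
        mutated.append(string)
    return tuple(mutated)

rng = random.Random(42)
p1 = ([0] * 6, [0] * 6)
p2 = ([3] * 6, [3] * 6)
child = mutate(crossover(p1, p2, rng), rng)
```

Selecting the cut points independently lets the CPU configuration of one parent be combined with a differently split memory configuration of the other, which matters because the two strings interact through the utilization constraint.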

D. REPLACEMENT
After an offspring is generated by crossover and mutation, a new population is produced by substituting a solution in the current generation by the offspring. In this article, we discard the worst solution, i.e. a solution that incurs the highest cost, in the current population and insert the offspring generated. Note that this is the most commonly used replacement method in steady-state genetic algorithms [25].
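The replacement step can be sketched as a small in-place update. The function name `replace_worst` and the parallel `costs` list are illustrative choices rather than details of our implementation.

```python
def replace_worst(population, costs, offspring, offspring_cost):
    """Steady-state replacement: overwrite the highest-cost solution in place.

    Returns the index of the replaced solution.
    """
    worst = max(range(len(population)), key=lambda i: costs[i])
    population[worst] = offspring
    costs[worst] = offspring_cost
    return worst

population = ["a", "b", "c"]       # stand-ins for encoded solutions
costs = [1.0, 5.0, 3.0]
index = replace_worst(population, costs, "d", 2.0)
print(index, population, costs)    # 1 ['a', 'd', 'c'] [1.0, 2.0, 3.0]
```

Since only one solution changes per iteration, the remaining solutions, including the best ones found so far, survive into the next generation.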

E. STOPPING CRITERIA AND CONVERGENCE
It is not an easy matter to determine the number of iterations repeated for the evolution of genetic algorithms as it is sensitive to the experimental configurations and the problem domain. Instead of setting the constant number of iterations, we repeat the evolution until the population converges [23]. In order to ensure the convergence of our genetic algorithm, we monitor the cost of each solution and the utilization of the system when the task set is scheduled by the solution. Our monitoring results showed that the population converges within 10,000 generations in all cases, and the average running time of our genetic algorithm is 8.7 seconds for its convergence. We also confirmed that our genetic algorithm does not converge to a local optimum since the utilization of the final solution approaches the full capacity of the resources unless the task set is too small to fully utilize the resources. This proves that our policy have the ability of finding sufficiently good solutions that satisfy real-time constraints as well as energy efficiency.
The complexity of genetic algorithms is not easy to prove analytically. Empirically, however, we found that our genetic algorithm converges within a constant number of iterations regardless of the number of tasks (up to the 1,000 tasks we tested), and hence its complexity in terms of the number of iterations can be considered as O(1). Also, as we consider hard real-time systems where resource configurations and schedulability should be determined at the design phase, the running time of our genetic algorithm does not affect the actual execution of tasks in target systems.

V. EXPERIMENTAL RESULTS
In this section, we conduct simulation experiments to assess the effectiveness of the proposed scheme, called PSVS-GA (Partial Swap with Voltage Scaling using Genetic Algorithms). We developed an in-house simulator to evaluate PSVS-GA [42]. We compare PSVS-GA with three schemes: VS-GA (Voltage Scaling using Genetic Algorithms), PS-GA (Partial Swap using Genetic Algorithms), and Baseline. Baseline uses neither CPU voltage scaling nor memory swap. PS-GA uses partial swap for memory power-saving similar to PSVS-GA, but does not use CPU voltage scaling. VS-GA optimizes the CPU voltage level for each task, but does not consider memory swap. Similar to the proposed PSVS-GA, PS-GA and VS-GA make use of genetic algorithms for their optimizations. This implies that the improvement of our scheme over PS-GA and VS-GA is obtained through the tight modeling of worst-case execution time by hiding the overlapped latency, rather than just through thorough optimization by genetic algorithms.
The experimental configuration of our simulation is based on a 1.6 GHz 4-core ARM Cortex-R52 real-time processor [39]. For simulating NVM storage, we use PCM (Phase-Change Memory), which is regarded as a type of fast storage media in many previous studies [27,29]. The read and write latencies of PCM are set to 100 (ns) and 350 (ns), respectively, and the read and write energy of PCM is set to 0.2 (nJ/bit) and 1.0 (nJ/bit), respectively [28]. For simulating DRAM memory, the read and write latencies are both set to 50 (ns) and the read/write energy is set to 0.1 (nJ/bit) following previous studies [26,28]. The static power of DRAM is set to 1 (W/GB) and the default size of DRAM is set to the entire footprint of workloads in order not to incur any page faults.
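To give a feel for these parameters, the following sketch computes the energy of transferring one page and the static energy of keeping DRAM powered. The energy and static-power values are the ones listed above; the 4 KB page size, the helper names, and the per-page granularity are assumptions made for illustration and are not taken from our simulator.

```python
# Energy parameters from the configuration above (nJ per bit).
PCM = {"read_nj_per_bit": 0.2, "write_nj_per_bit": 1.0}
DRAM = {"read_nj_per_bit": 0.1, "write_nj_per_bit": 0.1}
DRAM_STATIC_W_PER_GB = 1.0

PAGE_BYTES = 4096  # assumed page size, not specified in the text

def transfer_energy_uj(nj_per_bit, num_bytes=PAGE_BYTES):
    """Dynamic energy (in microjoules) of reading or writing num_bytes."""
    return nj_per_bit * num_bytes * 8 / 1000.0

def dram_static_energy_j(size_gb, seconds):
    """Static energy (in joules) of keeping size_gb of DRAM powered for the given time."""
    return DRAM_STATIC_W_PER_GB * size_gb * seconds

if __name__ == "__main__":
    print("PCM page read  (uJ):", transfer_energy_uj(PCM["read_nj_per_bit"]))
    print("PCM page write (uJ):", transfer_energy_uj(PCM["write_nj_per_bit"]))
    print("DRAM page access (uJ):", transfer_energy_uj(DRAM["read_nj_per_bit"]))
    print("2 GB DRAM static energy over 10 s (J):", dram_static_energy_j(2.0, 10.0))
```

Under these assumptions, a swap operation pays a one-time dynamic cost per page, whereas the static cost of DRAM grows with both its capacity and the time it stays powered.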
Our experiments were conducted under a wide range of workload conditions. We vary the workload density from 0.1 to 0.8, where a workload density of 1.0 indicates that the CPU resources of the system are fully saturated. The number of real-time tasks is set to 100, and the worst-case execution time of each task is randomly generated between 1 ms and 500 ms. The period of each task is determined based on the target workload density of 0.1 to 0.8. To assess the effectiveness of the proposed scheme in more realistic target systems, we performed additional experiments under two realistic workload conditions, the Robotic Highway Safety Marker (RSM) workload [43] and the IoT workload [44]. RSM is a task set for the actions of a mobile robot that carries safety markers on a highway for road construction safety. IoT is a task set for the actions of a real-time controller in an industrial machine hand. Tables II and III list the task configurations of the RSM and IoT workloads, respectively.
We first plot the energy consumption of each scheme normalized to that of Baseline. That is, the energy consumption of Baseline is set to 1.0 and the relative value of each scheme scaled to Baseline is plotted. As shown in the figure, PSVS-GA exhibits the best results in all cases. Specifically, the energy-saving effect of PSVS-GA is large when the density of workloads is not high. The reason is that there are more chances of resource optimization with respect to energy-saving when the load of tasks is low. In such situations, we can make use of energy-saving techniques such as voltage scaling and swap more aggressively without deadline misses. On the other hand, as the density of the workload increases, idle time slots are reduced and hence power-saving techniques become less effective. For example, lowering the CPU supply voltage is difficult in these cases as it may incur deadline misses of real-time tasks.
Now, let us compare the results in detail. The reduced energy consumption of PSVS-GA is 31.1% on average and up to 45.6% compared to Baseline. When compared to PS-GA and VS-GA, the energy-saving effect of PSVS-GA is 23.1% and 14.1% on average, respectively. When comparing the relative effect of CPU voltage scaling and memory swap, VS-GA performs better than PS-GA in most cases, implying that voltage scaling is more effective than swap in terms of energy-saving. The only case where PS-GA outperforms VS-GA occurs when the workload density is 0.1. In that case, the schedule contains too many empty time slots, which cannot be filled even if the CPU voltage is lowered as much as possible. This will be further discussed with Figure 8, which shows that the CPU utilization is still low despite applying CPU voltage scaling maximally. However, in any case, PSVS-GA outperforms PS-GA and VS-GA, which implies that combining the voltage scaling and swap techniques yields even better results.
Figures 6 and 7, respectively, show the energy consumption in CPU and memory for the four schemes as the workload density is varied. As shown in Figure 6, schemes using CPU voltage scaling, i.e., PSVS-GA and VS-GA, reduce the CPU energy consumption significantly compared to those that do not use it. The power-saving effect of voltage scaling is small when the workload becomes heavy because there is less opportunity to utilize idle CPU time slots by lowering the supply voltage. When we compare PSVS-GA and VS-GA, VS-GA performs slightly better than PSVS-GA with respect to CPU energy consumption. This is because PSVS-GA needs additional time to access swap storage, which also increases the execution time in CPU. However, as shown in the figure, the effect of swap on CPU energy consumption is very small.
Now, let us see the energy consumption in memory. As shown in Figure 7, PSVS-GA and PS-GA, which support swap, consume less memory energy than VS-GA and Baseline, which use memory shadowing. This is because supporting swap has the effect of reducing the DRAM capacity of the system, which can save the refresh power of DRAM significantly. Meanwhile, when using swap, additional CPU time is required for the swap-in and swap-out process, which may increase the CPU energy consumption. This can be seen in Figure 6, where PS-GA spends more CPU energy than Baseline. However, such overhead is compensated by the energy-saving effect in memory as shown in Figure 7. Moreover, although PSVS-GA also adopts swap, its CPU energy is not noticeably increased compared to VS-GA as shown in Figure 6. This implies that using CPU voltage scaling along with memory swap and co-optimizing them can hide the swap overhead in CPU by overlapping the execution in CPU and I/O. That is, CPU issues I/O commands and executes other tasks while the actual I/O is performed. Another notable phenomenon is that the effectiveness of swap is less influenced by the workload density. That is, swap is still effective in power-saving even when the workload becomes heavy as shown in Figure 7, which is different from the effectiveness of CPU voltage scaling in Figure 6.
Figure 8 shows the CPU utilization of the four schemes as a function of the workload density. As we see in the figure, PSVS-GA exhibits high utilization of almost 1.0 except for the case of the workload density 0.1. Note that PSVS-GA makes use of the power-saving techniques as much as possible to maximize the resource utilization. However, as discussed previously, when the workload density is 0.1, there are too many empty time slots, which cannot be filled even with the full usage of CPU voltage scaling and memory swap. The utilization of VS-GA is also near 1.0 as we optimize it by genetic algorithms. However, the energy-saving effect of VS-GA is lower than that of PSVS-GA as it does not use swap. PS-GA also raises the CPU utilization compared to Baseline, but the effect is limited as it does not make use of CPU power-saving techniques. The utilization of Baseline is consistently lower than that of PSVS-GA and VS-GA, and the gap becomes wider as the workload density is decreased.
Figure 9 compares the DRAM size used in the four schemes. As we see in the figure, PSVS-GA reduces the DRAM size significantly compared to VS-GA and Baseline. Specifically, the DRAM size used in PSVS-GA is smaller than that of these two schemes by 34.3% on average. The effect of reducing the DRAM size is large when the workload density becomes low. This is because the inactive period of a task becomes longer when the workload is not heavy, implying that the swap ratio can be further increased without deadline misses in this case. When we compare PS-GA and PSVS-GA, PS-GA reduces more DRAM capacity than PSVS-GA in synthetic workloads, especially when the workload is heavy. As our optimization goal focuses on the overall energy-saving rather than the DRAM size reduction, this implies that CPU voltage scaling is more effective than reducing the DRAM size in terms of energy-saving under heavy workloads. In real workloads, however, PSVS-GA performs better than PS-GA as shown in Figure 9(b). This implies that the density of the real workloads we simulated is not high compared to the synthetic workloads.

VI. CONCLUSIONS
As the size of data grows rapidly in modern embedded systems, the DRAM memory of the system keeps increasing and accounts for a large portion of the power consumption. In this article, we presented a new real-time task scheduling scheme that supports partial swap in order to reduce the DRAM size of the system. To enable swap functions, we adopted high-speed NVM storage with predictable access latency, which allows for the feasible estimation of the worst-case execution time of real-time tasks. Unlike typical real-time systems that maintain the entire footprint of tasks in memory, we place a certain part of real-time tasks in NVM storage and perform swapping. The ratio of swap for each task is determined based on the schedulability and the power-saving effect.
We also combined our swap scheme with CPU voltage scaling by formulating the effect of the two techniques as a unified measure, and co-optimized the supply voltage of CPU and the swap ratio of memory for each task with respect to energy consumption. Our experimental results under a wide range of workload conditions showed that the energy-saving effect of the proposed scheme is 31.1% on average without any deadline misses.