Virtualizing and Scheduling FPGA Resources in Cloud Computing Datacenters

Cloud service providers increasingly augment their computing infrastructures with reconfigurable hardware platforms such as field-programmable gate arrays (FPGAs). Adding FPGAs to a cloud environment involves non-trivial challenges. The first challenge is virtualizing FPGAs as part of the cloud resources. As a standard virtualization framework is lacking, there is a need for an efficient framework for virtualizing FPGAs. Furthermore, FPGA resources are used in conjunction with central processing units (CPUs) and graphics processing units (GPUs) to accelerate the execution of tasks. Therefore, to gain the benefits of these powerful accelerating platforms, the second challenge is to optimize the allocation of tasks onto the capable resources within a cloud data center. This work proposes an FPGA virtualization framework that abstracts the physical FPGAs into virtual pools of FPGA resources. The work further presents an integer linear programming (ILP) optimization model to minimize the makespan of tasks where FPGA resources are part of the cloud data center. Given the complex nature of the problem, a simulated annealing (SA) metaheuristic is developed to achieve gains in performance compared to the exact method and to scale up and handle many tasks and resources while providing near-optimal solutions. Experimental results show that SA reduces the makespan of a large dataset with 1000 tasks and 100 resources by up to 30% when compared to first-come-first-served (FCFS) and shortest-deadline-first (SDF) algorithms. Lastly, to quantify the performance of FPGA-enabled cloud data centers, the work extends the CloudSim simulator (an open-source cloud simulator) to enable FPGA as a resource in its environment. The proposed virtualization framework and the SA scheduler are integrated into the environment. Simulation results show that the execution time of tasks is reduced by up to 78% when FPGA accelerators are used.

• Implements and evaluates the simulated annealing algorithm that obtains a near-optimal solution for the modeled scheduling problem.

• Extends the CloudSim simulation toolkit to include the proposed FPGA virtualization framework and scheduling algorithm to validate the proposed solution.

The rest of the paper is organized as follows: Section II discusses the research work in the literature. Section III proposes the FPGA virtualization framework. Section IV discusses the proposed model for FPGA resource scheduling. Section V describes the proposed heuristic-based scheduling algorithm. Section VI explains the various experiments conducted and the obtained results. Lastly, Section VII concludes the paper.

II. RELATED WORK

Multiple research approaches in the literature discuss virtualization, partition styles, and resource allocation for FPGA within different settings. This section reviews the related work and discusses the research gap that this work aims to solve.

A. FPGA VIRTUALIZATION APPROACHES

Chen et al. [4] proposed a virtualization framework for enabling FPGAs in the cloud. The hardware layer of the framework is divided into three logical sublayers: the user sublayer, the service sublayer, and the platform sublayer. These layers have static hardware modules implemented within the FPGA fabric. The platform sublayer consists of static functional components, such as memory and network controllers, that handle data communication from and to the FPGA. The service sublayer manages the configuration of application hardware logic and data using modules such as a configuration controller, a job queue, and a job scheduler. This is considered the most significant layer because it enables partial reconfiguration of accelerators and provides interfaces for users to access their FPGA accelerators. The user sublayer, the topmost sublayer, comprises a static layout with four asymmetric partitions. The partitions, also known as empty accelerator slots or partially reconfigurable regions (PRRs), are labeled A, B, C, and D. When the configuration controller receives the hardware application logic in the form of bitstreams, it configures the hardware defined by that logic into one of the four accelerator slots. Moreover, the framework has a hypervisor layer that is responsible for receiving user requests to create accelerators. Using the Accelerator-as-a-Service (AaaS) model, when a user requests a specific accelerator, the hypervisor layer either selects an idle accelerator slot to configure the requested accelerator, finds an existing accelerator belonging to the user, or rejects the request if no slot is available. In addition, the hypervisor tracks the usage and status of each accelerator slot, i.e., whether the slot is idle or occupied.
Using this framework, each FPGA is divided into four on-chip accelerator slots, and each slot differs in the number of resources constrained by the logical partitioning.

wrapper, is automatically generated based on the user's specifications. The wrapper virtualizes the physical I/O resources in the FPGA and allows users to design accelerator hardware without considering physical I/O constraints. Furthermore, Al-Aghbari et al. implemented their framework in a real-world cloud infrastructure [10]. They explained the implementation of the FPGA hypervisor and elaborated on how its frontend functions are exposed to the cloud user as application programming interfaces (APIs), while the backend functions are implemented in the FPGA chip. Using the hypervisor, users can create, manage, and destroy accelerators configured in the vFPGAs. Moreover, each accelerator in the vFPGA is assigned an IP address, allowing any host machine or FPGA in the network to communicate with the accelerator. Additionally, the authors used Xilinx Virtex-6 XC6VLX550T FPGAs as their hardware platform. Their implemented virtualization framework used 58,123 look-up tables, 52,649 flip-flops, 422 random access memory (RAM) blocks, and 560 digital signal processing (DSP) blocks in each FPGA.

B. FPGA PARTITIONING STYLES

DPR-supported FPGAs perform reconfigurations based on a partitioning style. Partitioning refers to the way partially reconfigurable regions (PRRs) are formed after logically dividing the FPGA fabric. According to [11], there are three partitioning styles: island style, slot style, and grid style. The island-style partitioning is the least difficult to implement, where an FPGA has one or more PRRs.
Each PRR can exclusively configure one application hardware, and an application cannot occupy more than one PRR, hence the name ''island.'' However, this partitioning style is limited when a single island does not have sufficient resources to configure the hardware. The slot-style partitioning is where identical PRRs called slots are created on the fabric, either column-wise or row-wise. Hardware designs then occupy one or more slots once they are configured. This partitioning is not straightforward to implement because each slot must have a static interface to communicate between the configured application hardware and the static hardware. However, it addresses the limitation of the previous partitioning style; hence, an application can be provisioned more than one slot if required. Another challenge in this partitioning style is that an FPGA has heterogeneous resources. Therefore, obtaining identical partitions or slots that are architecturally homogeneous is challenging. Since some regions in the FPGA fabric are excluded from being a part of any slot, the resources in these regions are unutilized, leading to poor resource utilization. Finally, the grid-style partitioning is where an FPGA is segmented into a grid, and the grid cells are either static regions or PRRs, depending on the type of hardware configured. Hardware designs occupy one or more cells to actualize accelerators. Moreover, it improves resource utilization compared to the slot style because fewer unpartitioned regions exist. In contrast, implementing this style and producing homogeneous grid cells is more challenging as an FPGA fabric contains heterogeneous resources. In addition, each PRR cell must

implementations of these algorithms are being proposed in the literature, making FPGA scheduling an active research topic.
For example, the proposed scheduling algorithm in [19]

proposed scheduler provisions. In the scheduling algorithm, an accelerator slot's computing capacity is a parameter defined as the number of virtual CPUs (vCPUs). Suppose a task is executed using one of the accelerator slots and accelerated n times faster than a single vCPU. In that case, the slot's computing capacity is said to be n vCPUs. Moreover, it is a metric-based scheduling algorithm where the proposed metric is called benefit. The benefit of an accelerator slot is calculated by summing the speedup of all tasks on that accelerator slot in terms of the number of vCPUs. The objective is to assign each task to the slot with the highest benefit value for that task. Additionally, the authors make the scheduling algorithm dynamic by allowing task preemption, where new incoming tasks yielding a higher benefit on a particular slot can replace the existing ones. However, the algorithm has two disadvantages. The first is that the resource pools are based on the virtualization framework of [4] and consist of heterogeneous partitions or accelerator slots. This leads to poor resource utilization, as bigger accelerator slots will always hold a higher benefit value than the smaller ones, and the scheduler always selects slots with the highest benefit for every task. The second issue is that the scheduler must calculate the benefit across the entire pool of accelerator slots for each task before resource allocation. This is computationally intensive; therefore, the algorithm does not scale with a larger number of resources or tasks.
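As a rough sketch (not the authors' implementation; the slot identifiers and speedup numbers below are hypothetical), the benefit-based selection described for [19] amounts to picking, for each task, the slot where its speedup in vCPUs is highest:

```python
# Sketch of the benefit metric described for [19]: a slot's benefit
# for a task is the task's speedup on that slot (in vCPUs), and the
# scheduler assigns each task to the slot with the highest benefit.
def pick_slot(task_speedups):
    """task_speedups: dict mapping slot_id -> speedup (in vCPUs)."""
    return max(task_speedups, key=task_speedups.get)

def slot_benefit(assigned_speedups):
    """Benefit of a slot = sum of speedups of tasks currently on it."""
    return sum(assigned_speedups)

# Hypothetical numbers: the biggest slot always yields the largest
# speedup, illustrating the utilization problem noted in the text.
speedups = {"A": 8, "B": 4, "C": 2, "D": 1}
print(pick_slot(speedups))  # the largest slot always wins
```

Because the benefit must be evaluated across every slot for every task, the cost grows with the product of the task and slot counts, which is the scalability issue noted above.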

In [20], an FPGA resource scheduling algorithm is proposed that minimizes the makespan of a batch of requests to improve resource utilization at the node level. The requests are acceleration tasks that are to be executed using FPGA hardware. They are split into three categories: computation-intensive, network-intensive, and a combination of both. Three optimization models are presented to tackle each task category independently. These models represent NP-hard problems, and as a result, an approximation algorithm is proposed that uses relaxation and rounding to find feasible solutions. The results of the algorithm are compared to the shortest-job-first and longest-job-first algorithms. However, the proposed models work with a resource pool that constitutes whole physical FPGA chips instead of PRRs. Therefore, the algorithm only allocates one or more chips per task and cannot allocate low-level FPGA resources, resulting in poor resource utilization.

Most of the proposed scheduling algorithms in the literature consider whole FPGA chips within the resource pool. The authors in [21] proposed a metric-based multi-objective scheduler that minimizes energy consumption by allocating computation-intensive tasks to compute nodes with FPGAs. The scheduler decides whether to schedule a task to a compute node with or without an FPGA based on the task's workload. Moreover, in [22], the proposed model is a max-min joint optimization model that maximizes cloud users' satisfaction while minimizing the loss of benefits for the cloud providers. The proposed algorithm is a generic MATLAB scheduler presented as a black box. Although the authors claim that the resource pool was obtained through MFMA virtualization, this is not evident, as the pool contains whole FPGA chips instead of slots or PRRs.

It was observed from the reviewed scheduling algorithms that most of them deal with resources from a resource pool consisting of whole FPGA chips [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]–[31].

One of the static modules is the clock management module, also known as the clock manager. It is possible to have different modules run on different clock frequencies in the same FPGA. For example, the network manager may operate at a clock frequency different from that of the configuration manager. Such modules are then said to be operating in different clock domains. The clock manager enables clock signals of the correct frequency to be routed to the corresponding synchronous modules. Furthermore, the clock manager can utilize phase-locked loops (PLLs) available in the FPGA to generate the different clock frequencies needed.
Although the primary function of a PLL is to detect and fix timing violations in the circuit, it can also be used as a clock generator to drive clock signals at the desired frequency to one or more modules of the same clock domain. However, an extra piece of hardware is required for communication between two clock domains operating at different speeds. We use asynchronous first-in-first-out (FIFO) buffers extensively for inter-clock-domain communication. When a module transfers data at a specific frequency to another module, it writes the signal or the data to a buffer. The recipient module can then read from the asynchronous buffer at its own operating frequency. Using buffers eliminates the danger of metastability, a state in which a digital system becomes unstable and gives an uncertain output (i.e., neither logical '1' nor '0') for an unbounded time [32].

In addition, since Ethernet packets on both ports use the TCP/IP stack, a single implementation for sending and receiving data from both ports is applied in the network manager.

As a packet is received and decapsulated, the payload is sent to a data router, whereas the destination IP address

The vFPGA manager is discussed in detail in Section G.

The configuration manager is a static hardware module mainly responsible for creating the user's accelerator design in the FPGA. Accelerator designs come in the form of bitstreams or bit files. Application hardware logic is sent as partial bitstreams over the network from a host machine using a toolchain provided by the FPGA vendor. The network manager receives bitstreams within the payload content of Ethernet packets, decapsulates the packets, extracts the bitstreams, and writes them to buffers from which the configuration manager can read the bit files. Moreover, the configuration manager uses volatile memory to hold bitstreams and a dedicated controller submodule to communicate with the ICAP interface [9], [33]. DPR uses the ICAP controller to download bitstream data into dynamic regions (PRRs) and reconfigures the regions to create the accelerator hardware specified in the partial bitstreams. Once an accelerator is configured, the configuration manager informs the vFPGA manager of the new accelerator's physical location. The network manager receives an acknowledgment that an accelerator has been successfully created. Next, the network manager assigns the accelerator a unique IP address, and the address table stores this information. Users can then establish TCP sessions with their respective accelerators to send and receive data.

Accelerators in the FPGA require a static interface to exchange data with static logic modules such as the network manager. Irrespective of the design of an accelerator, the way in which it communicates with the network manager does not change. Hence, the communication interface must be static. According to [9], the interface is called a wrapper. However, in this work, we refer to it as the adapter interface, or an adapter. As shown in Fig. 3, an adapter consists of buffer memories, serializer, deserializer, bit packer, and bit unpacker submodules. It has a read buffer and a write buffer from which the accelerator can read and to which it can write, respectively. Data are moved in chunks of bits called (data) words. Submodules inside the adapter are used to change the size of the data word between modules. This is mainly to resolve any mismatch in word length when static logic modules send data in specific

This idea of encapsulating an accelerator and an adapter module to create a vFPGA differs from the work done in [9], in which a vFPGA represents a PRR and may be either occupied with an accelerator hardware or empty.
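As a loose software analogy for the adapter's bit packer and unpacker (the widths and the big-endian chunking are illustrative assumptions, not the authors' RTL), repacking a stream between mismatched word lengths can be sketched as:

```python
def repack(words, in_width, out_width):
    """Repack a stream of in_width-bit words into out_width-bit words,
    modeling the adapter's bit packer/unpacker (software analogy only).
    Bits are accumulated in arrival order and re-sliced at the output
    width; any remainder smaller than out_width is held back."""
    bits = 0
    nbits = 0
    out = []
    for w in words:
        bits = (bits << in_width) | (w & ((1 << in_width) - 1))
        nbits += in_width
        while nbits >= out_width:
            nbits -= out_width
            out.append((bits >> nbits) & ((1 << out_width) - 1))
    return out

# A 32-bit word split into four 8-bit words, and the reverse direction.
print([hex(w) for w in repack([0xAABBCCDD], 32, 8)])
print(hex(repack([0xAA, 0xBB], 8, 16)[0]))
```

The same routine covers both the serializing (wide to narrow) and deserializing (narrow to wide) directions, which is why the adapter pairs a packer with an unpacker.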

Each FPGA can contain one or more vFPGAs, depending on the number of resources the FPGA can provide and the number of resources each vFPGA demands. Therefore, a vFPGA manager is required, as shown in Fig. 5. The vFPGA manager is a module responsible for monitoring and maintaining vFPGAs. Each vFPGA, upon instantiation, is given a unique ID by the vFPGA manager for internal addressing. A vFPGA can be in one of two states: idle or busy. It is idle if its accelerator is not currently processing a task and busy otherwise. The manager constantly monitors the status of each vFPGA instance, whether idle or busy. Moreover, since each accelerator must be addressable to enable communication with the user, the vFPGA manager assigns and retracts the IP addresses of vFPGAs. This is done by maintaining two tables: one table contains IP address-to-vFPGA ID mappings, and the other contains vFPGA ID-to-physical address mappings. Whenever the network manager routes data to this manager, it looks up the destination IP address in the first table to find the corresponding vFPGA ID. Then, it looks up the second table to find the physical address of the vFPGA. Thereby, the vFPGA manager forwards the application data from the network manager to the intended accelerator. This is the core element of the virtualization framework implemented in this work, where physical address spaces are mapped to virtual address spaces.
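A minimal software sketch of the two-table translation (the class name, method names, and addresses are assumptions for illustration; the real manager is a hardware module):

```python
class VFPGAManager:
    """Software sketch of the vFPGA manager's address translation:
    IP address -> vFPGA ID (table 1) -> physical address (table 2)."""
    def __init__(self):
        self.ip_to_id = {}      # table 1: IP address -> vFPGA ID
        self.id_to_phys = {}    # table 2: vFPGA ID -> physical address
        self._next_id = 0

    def register(self, ip_addr, phys_addr):
        """Instantiate a vFPGA: assign a unique internal ID and
        record both mappings."""
        vfpga_id = self._next_id
        self._next_id += 1
        self.ip_to_id[ip_addr] = vfpga_id
        self.id_to_phys[vfpga_id] = phys_addr
        return vfpga_id

    def route(self, dest_ip):
        """Forward data from the network manager: two table lookups
        yield the physical address of the target accelerator."""
        vfpga_id = self.ip_to_id[dest_ip]
        return self.id_to_phys[vfpga_id]

mgr = VFPGAManager()
mgr.register("10.0.0.7", 0x4000)   # hypothetical IP and address
print(hex(mgr.route("10.0.0.7")))
```

The indirection through the ID table is what lets the framework remap a virtual address space onto physical PRRs without the user ever seeing physical locations.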

Similarly, when an accelerator attempts to send data back to its user, the manager looks up the physical address of the network manager and writes the data into the appropriate buffers for the network manager to read. In addition, all communication between the accelerator inside a vFPGA and the vFPGA manager happens via the adapter interface mentioned earlier.

VOLUME 10, 2022

The result of virtualization is a pool of abstract resources that can be quickly and efficiently provisioned to tasks. To achieve this, one of the backend functions of the FPGA hypervisor is to track the number of occupied and unoccupied regions at any given time. With this, we can obtain the total number of unutilized regions across all FPGAs in the data center at any given time. Therefore, we can generate a pool of available FPGA regions using the hypervisor of each FPGA that monitors and tracks the number of regions. Upon provisioning any FPGA region for configuring an accelerator, the hypervisor of that FPGA will notify a centralized control module, i.e., a unified manager, that the total number of resources in the pool has been reduced by one. Thus, the resource pool is constantly updated owing to the communication between FPGA hypervisors and the control module.
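The bookkeeping between the per-FPGA hypervisors and the unified manager can be outlined as follows (a software sketch with assumed interfaces and region counts):

```python
class UnifiedManager:
    """Centralized control module: keeps the data-center-wide count
    of free FPGA regions, updated by hypervisor notifications."""
    def __init__(self):
        self.free_regions = 0

    def notify(self, delta):
        self.free_regions += delta

class Hypervisor:
    """Per-FPGA hypervisor: tracks its own free PRRs and notifies
    the unified manager on every provision/release."""
    def __init__(self, regions, manager):
        self.free = set(range(regions))
        self.manager = manager
        manager.notify(len(self.free))   # report initial free regions

    def provision(self):
        region = self.free.pop()         # allocate one PRR
        self.manager.notify(-1)          # pool shrinks by one
        return region

    def release(self, region):
        self.free.add(region)
        self.manager.notify(+1)

um = UnifiedManager()
hvs = [Hypervisor(4, um) for _ in range(3)]   # e.g., 3 FPGAs, 4 PRRs each
hvs[0].provision()
print(um.free_regions)  # 11
```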

As a result of implementing the proposed virtualization framework, we obtained a pool of FPGA resources from which resources can be provisioned to cloud tasks. In this context, a cloud task refers to a user request for configuring an accelerator, whereas a resource refers to a PRR in an FPGA chip that resulted from the partitioning process. The FPGA resource allocation must effectively manage the cloud resources and execute the consumers' tasks. Hence, this section presents an integer linear programming (ILP) optimization model to minimize the completion time of tasks within the pool of FPGA resources. The work assumes a cloud data center infrastructure with computing servers, limited FPGA-based accelerators, and networking resources that interconnect compute-to-compute and compute-to-FPGA resources. Moreover, the model assumes the following in the cloud computing environment:

are ignored [13], [38].

Constraint (2) ensures that a block can be part of only one task at a time. That is, a resource block may be allocated to at most one task at any point in time. Two or more tasks cannot share the same resource block simultaneously.

Constraint (3) ensures that any task's execution must be completed before its deadline. This is performed by checking whether a task's start time plus its execution time is less than or equal to its deadline.

Constraint (4) ensures that a task has all the required blocks ready at its start time. Therefore, it ensures that the number of blocks allocated at a task's start time equals the number of blocks required.

Constraint (5) ensures that a task has the same block allocation at its start and end times. Blocks allocated at the start time should remain allocated until the end time of a task.

Constraint (6) ensures that a task can be allocated a specific number of blocks for a fixed number of time units. While constraint (4) checks for the correct number of blocks allocated at the start time, constraint (6) ensures that the same number of blocks remains consistently allocated from the start to the end time of a task.

Constraint (7) ensures that blocks continue executing the task until it is completed without any interruption. Hence, this constraint safeguards the no-preemption assumption.
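Since the ILP equations themselves are not reproduced in this excerpt, the following sketch only illustrates how a candidate schedule can be checked against constraints (2) through (6) as described; the data layout and the toy instance are assumptions:

```python
def feasible(tasks, schedule):
    """tasks: {k: (exec_time, req_blocks, deadline)}
    schedule: {k: (start_time, frozenset_of_blocks)}
    Checks constraints (2)-(6) as described in the text; allocating a
    fixed block set for the task's whole run also reflects the
    no-preemption constraint (7)."""
    busy = {}  # (block, t) -> task, for the exclusivity check (2)
    for k, (start, blocks) in schedule.items():
        exec_time, req, deadline = tasks[k]
        if start + exec_time > deadline:       # deadline (3)
            return False
        if len(blocks) != req:                 # required blocks at start (4)
            return False
        for t in range(start, start + exec_time):  # same blocks held (5), (6)
            for b in blocks:
                if (b, t) in busy:             # one task per block per slot (2)
                    return False
                busy[(b, t)] = k
    return True

# Toy instance: task -> (exec_time, required_blocks, deadline).
tasks = {1: (2, 1, 4), 2: (2, 1, 4)}
ok = {1: (0, frozenset({0})), 2: (0, frozenset({1}))}
bad = {1: (0, frozenset({0})), 2: (0, frozenset({0}))}  # shared block
print(feasible(tasks, ok), feasible(tasks, bad))  # True False
```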

The proposed model is validated for small-scale problem instances using the IBM ILOG CPLEX optimization engine, which implements the branch-and-cut exact solution method. CPLEX ran on a Windows machine with 16 GB of DDR4 DRAM at 3000 MHz and a 6-core processor at 3.6 GHz. Four experiments with an increasing number of tasks and resources were conducted (Table 5).

The created model was validated with the exact solution method for the first three experiments and achieved the results presented in Fig. 6, Fig. 7, and Fig. 8, respectively. The figures show the discrete time units t, the FPGA blocks b, and the task k allocated to a block at a specific time slot.

Fig. 9 illustrates the elapsed time (y-axis) on CPLEX for finding the exact solution in each experiment (x-axis). The simulation times of these experiments are ∼1 ms, 860 ms, and 320,000 ms, and the makespans are 3, 11, and 19, respectively. Although the reported solution for the third experiment was obtained after running CPLEX for around 5 minutes, we allowed CPLEX to run for over 20 hours to find the optimal solution. The entire search space could not be exhausted, and an improved solution was not found. For the fourth experiment, CPLEX could not handle the dataset size.

In the exact solution approach used by CPLEX, the number of permutations grows exponentially with an increase in the number of tasks and resources, as shown in Fig. 9. Exploring the entire search space is infeasible due to the exponential growth in time complexity. The presented scheduling problem is NP-hard and requires a heuristic-based approach that ensures a near-optimal solution. The following section introduces the proposed heuristic approach.

function and the initial temperature, the perturbation is either accepted or rejected, and accordingly, the neighbor becomes the current state and the starting point of the next perturbation. This process continues iteratively, and the temperature is reduced in each iteration based on the predefined cooling rate. The process terminates if the SA does not find a better neighbor for a certain number of iterations or the temperature reaches close to zero. Reaching the predefined iteration threshold is known as the convergence of the solution. We chose 16 as the iteration threshold based on several parameter-tuning experiments.

Algorithm 1 Implementation of Simulated Annealing
Input: Task list consisting of each task's execution time, required blocks, and deadline;
Initial configuration: Task list is sorted using earliest-deadline-first to obtain X_soln;
Determine initial temperature T(0);
Determine freezing temperature T_f;
while (T(i) > T_f and not converged) do
    repeat
        Perturb(X_soln) by swapping two tasks randomly;
        Find neighbor solution X_new;
        Compute ΔZ = cost(X_new) − cost(X_soln);
        if (ΔZ ≤ 0) then
            Update X_soln;    /* accept perturbation */
        else if (random(0, 1) < e^(−ΔZ/T(i))) then
            Update X_soln;
        else
            Reject X_new;
        endif
    until thermal equilibrium;
    Save best-so-far X_soln;
    Check convergence;
    T(i + 1) = αT(i);    /* cooling schedule */
endwhile
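A runnable sketch of Algorithm 1 (the greedy list-scheduling decoder, the parameter values, and the toy instance are illustrative assumptions, not the authors' exact implementation):

```python
import math
import random

def makespan(order, tasks, num_blocks):
    """Greedy decoder (an assumption, not from the paper): schedule
    tasks in the given order onto identical blocks; return the
    makespan, or inf if any deadline is violated."""
    free_at = [0] * num_blocks            # time each block becomes free
    end = 0
    for k in order:
        exec_time, req, deadline = tasks[k]
        free_at.sort()
        start = free_at[req - 1]          # wait for req earliest-free blocks
        finish = start + exec_time
        if finish > deadline:
            return float("inf")
        for i in range(req):
            free_at[i] = finish
        end = max(end, finish)
    return end

def simulated_annealing(tasks, num_blocks, t0=10.0, tf=1e-3, alpha=0.9,
                        inner=30, seed=1):
    rng = random.Random(seed)
    # Initial configuration: earliest deadline first.
    soln = sorted(range(len(tasks)), key=lambda k: tasks[k][2])
    best = soln[:]
    t = t0
    while t > tf:
        for _ in range(inner):            # thermal equilibrium loop
            new = soln[:]
            i, j = rng.sample(range(len(new)), 2)
            new[i], new[j] = new[j], new[i]   # swap two random tasks
            dz = (makespan(new, tasks, num_blocks)
                  - makespan(soln, tasks, num_blocks))
            if dz <= 0 or rng.random() < math.exp(-dz / t):
                soln = new                # Metropolis acceptance
            if (makespan(soln, tasks, num_blocks)
                    < makespan(best, tasks, num_blocks)):
                best = soln[:]
        t *= alpha                        # cooling schedule
    return best, makespan(best, tasks, num_blocks)

# Toy instance: (exec_time, required_blocks, deadline) per task.
tasks = [(3, 2, 10), (2, 1, 6), (4, 2, 12), (1, 1, 3)]
order, ms = simulated_annealing(tasks, num_blocks=3)
print(ms)
```

Infeasible neighbors get a makespan of inf, so their ΔZ is +inf and exp(-ΔZ/T) is 0; they are never accepted, which keeps the current solution feasible throughout the run.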

In Section IV, the model was validated using a very small number of tasks and resources, given the limitation of the exact solution method in CPLEX. Besides the gain in performance in finding a solution compared with an exact method, heuristic solutions must also be scalable to handle large problem sizes. To test the scalability of the proposed algorithm, we generated a long list of tasks with an adequate pool of resources such that a feasible solution exists.

Recall that each task has a specific number of required FPGA resource blocks or PRRs, an execution time, and a deadline. We use the Poisson distribution to randomly draw values for both the number of blocks and the execution time. The Poisson distribution is a discrete random distribution that gives the probability of a number of events occurring over a fixed time interval. It assumes that events occur at a constant rate and that each event occurs independently of the time since the last event. In contrast to a continuous normal distribution, the

X_soln undergoes several perturbations during the annealing process until it reaches an equilibrium where the final task allocation represents the best solution. The initial configuration is obtained by allocating the tasks using the earliest-deadline-first algorithm, where the task with the earliest deadline is scheduled first, followed by the next task, until all tasks have been allocated.
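Drawing the task attributes from a Poisson distribution, as described above, can be done with Knuth's sampling method; the rates and the +1 offsets (to avoid zero blocks or zero execution time) are illustrative assumptions, not the paper's parameters:

```python
import math
import random

def poisson(lam, rng):
    """Draw one Poisson(lam)-distributed value via Knuth's method:
    multiply uniforms until the product drops below exp(-lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)
# Hypothetical rates: mean execution time ~6 units, mean ~4 blocks;
# the +1 keeps both attributes strictly positive.
tasks = [(poisson(6, rng) + 1, poisson(4, rng) + 1) for _ in range(1000)]
mean_exec = sum(t[0] for t in tasks) / len(tasks)
print(mean_exec)
```

Over 1000 draws the sample mean of the execution time lands near 7 (the mean of Poisson(6) shifted by 1), which is a quick sanity check on the generator.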

The 1D array representation is by far the best alternative because the only constraint that must be checked during the perturbations is the deadline constraint. This contrasts with the 3D Boolean array, which was the initial alternative, where too many constraints must be validated to determine the feasibility of the solution. We also devised a linked-list representation where each time node is connected to the next time node in a singly linked list. Each time node is also connected to another singly linked list consisting of resource blocks, or resource nodes.

There are three main components to the Metropolis step, namely, the perturbation, the acceptance criterion, and the thermal equilibrium criterion. We start by perturbing the existing solution X_soln by randomly selecting two tasks in the task list and swapping the order in which they appear. After that, we attempt to schedule the task list by taking the tasks in order of deadline, yielding X_new. While doing so, only the deadline constraint, i.e., Constraint (3), needs to be rechecked to ensure that deadlines are not violated. Next, the acceptance criterion outlined in Algorithm 1 checks the change in the objective function, ΔZ = Z(X_new) − Z(X_soln). If the change due to the perturbation reduces the objective function, the perturbation is accepted and X_soln becomes X_new. In other words, if the makespan of the X_new schedule is smaller than that of the X_soln schedule, which is the best-so-far, X_soln is updated to X_new. On the other hand, if the perturbation causes an increase in the objective function, it is only accepted with a probability of e^(−ΔZ/T(i)). The acceptance criterion applies only to perturbations yielding a feasible X_soln. Furthermore, the inner loop in the algorithm deals with thermal equilibrium. As more neighboring solutions are found for the same temperature value, the algorithm is said to be reaching thermal equilibrium. Hence, thermal equilibrium is nothing more than a predefined number of iterations for the inner loop. We set the thermal equilibrium criterion to one-third of the dataset size.
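The acceptance criterion in isolation, with temperatures chosen only to illustrate the effect of cooling:

```python
import math
import random

def accept(delta_z, temperature, rng=random):
    """Metropolis criterion: always accept improvements (ΔZ <= 0);
    accept a worse neighbor with probability exp(-ΔZ/T)."""
    if delta_z <= 0:
        return True
    return rng.random() < math.exp(-delta_z / temperature)

# The same uphill move (ΔZ = 2) is likely at a high temperature and
# essentially impossible once the system has cooled.
print(math.exp(-2 / 10.0))   # high T: probability ~0.82
print(math.exp(-2 / 0.1))    # low T: probability ~2e-9
```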

The initial temperature T(0) yields a high acceptance probability of around 0.8 for moving to worse neighboring states. On the other hand, the freezing temperature yields a very small acceptance probability of around 2^−25, rendering worse neighboring moves practically impossible; hence, only better neighboring states are allowed. The cooling schedule used in our work is T(i + 1) = αT(i), where α = 0.9. The symbol α denotes the cooling rate of the temperature for the next iteration.
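Given a representative uphill cost change ΔZ (ΔZ = 1 below is an illustrative choice, not a value from the paper), the stated acceptance probabilities pin down T(0) and T_f, and the cooling schedule gives the number of temperature steps between them:

```python
import math

delta_z = 1.0                      # representative uphill move (assumed)
t0 = -delta_z / math.log(0.8)      # solve exp(-ΔZ/T0) = 0.8
tf = -delta_z / math.log(2**-25)   # solve exp(-ΔZ/Tf) = 2^-25
alpha = 0.9

# Number of cooling steps T(i+1) = 0.9 T(i) needed to fall from T0 to Tf.
steps = math.ceil(math.log(tf / t0) / math.log(alpha))
print(round(t0, 4), round(tf, 5), steps)
```

For ΔZ = 1 this gives T(0) ≈ 4.48, T_f ≈ 0.058, and about 42 cooling steps; scaling ΔZ scales both temperatures but leaves the step count unchanged.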

Algorithmic parameters can heavily affect the simulation time. In the proposed SA, the simulation terminates when the solution does not change for a set number of iterations. Therefore, the iteration threshold becomes the algorithm's termination condition, and we must tune this parameter to reduce the simulation time as much as possible. At the same time, minimizing the simulation time must not compromise the quality of the solution too much.

To conduct the parameter tuning experiments, we consider three datasets of different sizes in terms of the number of tasks, the execution time of each task, and the number of blocks each task requires. The problems were run through the proposed algorithm while varying the maximum number of iterations allowed from 2 to 128 (2, 4, 8, 16, 32, 64, 128), and we observed the solution behavior in terms of the improvement in the objective function and the degradation in the incurred simulation time as the algorithm tries to converge. Table 6 shows the specifications of the different dataset sizes. Note that throughout the parameter tuning experiments, we keep the resource pool constant at 1000 FPGA blocks, ensuring sufficient resources for all tasks in each dataset.

Fig. 12 and Fig. 13 show the objective value of the proposed schedule and the simulation duration, respectively, for the small dataset against the number of iterations. It is apparent from Fig. 12 that the algorithm obtained the best solution of 6 time units within the first 2 iterations. Clearly, increasing the number of iterations does not reduce the objective any further, as 6 time units appears to be the minimum makespan for the tasks. Furthermore, Fig. 13 shows that increasing the number of iterations degrades performance, since the elapsed simulation time grows with the number of iterations. Therefore, we conclude that for a small problem, 2 iterations are sufficient to obtain a suboptimal schedule. Running the algorithm for 2 iterations on the small dataset takes an average of 1 millisecond.

Fig. 14 and Fig. 15 show the objective of the proposed schedule and the elapsed simulation time, respectively, for the medium dataset plotted against the number of iterations. Fig. 14 shows that as the number of iterations increases, a better scheduling solution with a shorter makespan is produced. This is because allowing the algorithm to run for a larger number of iterations before termination increases the possibility of finding a better schedule. Moreover, Fig. 15 further confirms our previous observation that an increase in the number of iterations increases the simulation duration. We conclude from the two experiments that a [...]

[...] unscheduled and newly queued tasks. Hence, Algorithm 2 shows the adaptive SA, which differs from the SA in steps A-1 to A-6. First, the algorithm deals with the initial batch of tasks in the task list and evaluates it as per the proposed model's objective function (see Equation 1). After convergence is achieved, it saves the best-so-far schedule and calls a delay function until a new batch of tasks arrives. The delay function is based on a random value from a Poisson distribution. The task list is updated with newly arrived tasks in ascending order of task deadlines. Then, it starts over the process of finding a new schedule with a minimum makespan that considers tasks from both the previous batch and the new batch. However, previous tasks whose execution has already started are excluded from the rescheduling process (i.e., no preemption of tasks), and only the tasks that have not started executing are passed forth. In this way, the adaptive SA enables dynamic scheduling of tasks, which is common in cloud environments.
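The batch-handling loop just described can be sketched as follows. Here `solver` stands in for one SA run to convergence, the exponential inter-arrival delay models the Poisson arrival process, and all field names are illustrative assumptions rather than the paper's code:

```python
import random

def run_adaptive(solver, first_batch, later_batches, rate=1.0):
    """Adaptive loop sketch: solve the current task list, wait a
    Poisson-process (exponential) delay, merge the next batch in
    earliest-deadline order, and never reschedule a task whose
    execution has already started (no preemption).
    `solver` returns {task_id: start_time}; names are illustrative."""
    tasks = sorted(first_batch, key=lambda t: t["deadline"])
    schedule = solver(tasks)
    now = 0.0
    for batch in later_batches:
        now += random.expovariate(rate)   # delay until the next batch arrives
        # keep only tasks that have not begun executing by `now`
        tasks = [t for t in tasks if schedule[t["id"]] > now] + list(batch)
        tasks.sort(key=lambda t: t["deadline"])  # earliest deadline first
        schedule.update(solver(tasks))    # reschedule the remaining tasks
    return schedule
```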

This section presents the conducted experiments and the achieved results to validate the proposed virtualization framework and heuristic algorithm. Multiple experiments were conducted to evaluate the exact solution and the proposed heuristic solution, and the quality and performance of the solutions yielded by both methods are compared. We further [...]

The three experiments that were used to validate the exact solution method were applied to the SA algorithm. Tables 2, 3, and 4 provide the task specifications of the first three experiments. As discussed in Section V, the SA first finds an initial solution that might be infeasible, and then iteratively converges to suboptimal feasible solutions. Fig. 18 and Fig. 19 show the initial and final solutions, respectively, which were obtained in less than a millisecond of simulation time. In these figures, b represents a single FPGA resource block or PRR, k is a task, and t is the unit time. As the heuristic approach finds a schedule for the tasks, resources are allocated to each task for a specific duration of time. Moreover, both the exact and heuristic methods perform comparably in terms of speed, and both produce the same scheduling solution for the first experiment. Therefore, solving this scheduling problem validates that both methods produce optimal solutions.

Algorithm 2 Implementation of Adaptive SA
Input: Task list that consists of task execution time, required blocks, and deadline;
Initial configuration X_soln: Task list is sorted using earliest deadline first;
Determine initial temperature T(0);
Determine freezing temperature T_f;
A-1: Current schedule is empty;
A-2: repeat
       while (T(i) > T_f and not converged) do
         repeat
           Perturb (X_soln) by swapping two tasks randomly;
           Find neighbor solution X_new;
           if (X_new is accepted by the Metropolis criterion) then
             X_soln = X_new;
           else
             Reject X_new;
           endif
         until thermal equilibrium
         Save best-so-far X_soln;
         Check convergence;
         T(i + 1) = alpha * T(i); /* cooling schedule */
       endwhile
A-3: Current schedule = best-so-far X_soln;
A-4: Delay (based on Poisson distribution);
A-5: Update task list with new set of tasks;
A-6: until (true)
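A candidate solution in Algorithm 2 is an ordering of tasks, and its objective is the makespan of the decoded schedule. A minimal decoder might look like the following sketch, under the simplifying assumption that any free PRRs can serve any task (field names are illustrative, not the paper's model):

```python
def makespan(order, tasks, num_blocks):
    """Decode a task order into a schedule on `num_blocks` PRRs and
    return the makespan. Each task occupies `blocks` PRRs for `time`
    units; blocks are chosen greedily by earliest availability."""
    free_at = [0] * num_blocks            # time at which each PRR becomes free
    finish = 0
    for k in order:
        need, dur = tasks[k]["blocks"], tasks[k]["time"]
        # pick the `need` blocks that free up earliest
        chosen = sorted(range(num_blocks), key=lambda b: free_at[b])[:need]
        start = max(free_at[b] for b in chosen)
        for b in chosen:
            free_at[b] = start + dur
        finish = max(finish, start + dur)
    return finish
```

Swapping two entries of `order`, as in the perturbation step of Algorithm 2, yields a neighbor whose makespan can be re-evaluated with this function.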

In the second experiment, the SA obtained an optimal final solution in one millisecond. Fig. 20 shows the initial solution from the SA, which was infeasible because the schedule violated the deadline constraint for task 4 (see Table 3 for the task specifications). Task 4 has a deadline of t = 11, whereas its execution in the initial solution finished at t = 12. The SA then iterated further and yielded the final solution within a millisecond, as shown in Fig. 21. The final solution of the proposed SA produced the same objective value as the exact method. The SA took 1 ms, whereas the exact method took 860 ms to obtain the same solution. [...] the SA produced a near-optimal solution for experiment 4, with 100 resources and 1000 tasks, whereas the exact method failed to run.
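The infeasibility in this experiment is precisely a deadline violation, which a check like the following flags (a small sketch with illustrative names; `schedule` maps each task id to its start time):

```python
def violations(schedule, tasks):
    """Return the ids of tasks whose finish time exceeds their deadline."""
    return [k for k, start in schedule.items()
            if start + tasks[k]["time"] > tasks[k]["deadline"]]
```

Applied to the initial solution of Fig. 20, such a check would report task 4 (start 7, execution time 5, deadline 11: it finishes at t = 12).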

In the next set of experiments, we consider the traditional FCFS and SDF algorithms and provide our implementation of them to draw comparisons with the proposed adaptive SA. Table 7 shows the number of tasks and resources used in the three experiments. The incoming tasks were sent to the schedulers in several batches to simulate a real-world cloud environment. Therefore, all the schedulers under the experiment are adaptive and perform dynamic scheduling.

In FCFS, tasks in the list are scheduled in the order they arrive at the data center. The algorithm looks for available resources from the pool and allocates them to each task. In case there are no free resources, it searches for resources that can execute the task at hand before its deadline. As a new set of tasks arrives, the scheduler tries to allocate resources for the first task in the set. It performs a linear search until it finds the required number of available blocks for allocation. It also tracks the estimated time at which the busy blocks finish execution. Suppose the scheduler fails to find any available resource in the pool. In that case, it chooses the occupied blocks that yield the minimum estimated finish time of execution and queues the task on those blocks. The estimated finish time of execution for any block includes the execution time of the currently executing task and the tasks previously queued on the block. Moreover, in contrast to the adaptive SA, if a new set of tasks arrives at the FCFS scheduler, tasks [...]

[...] experiments, the SA outperforms the other two. The underperformance of FCFS and SDF is because, whenever there is a lack of available resources, these algorithms must perform a linear search over all resources until they find resources that can finish the execution of the selected task before its deadline. Performing this search for each task increases the time complexity. In the case of the SA, all the tasks in the list are scheduled at once, validating the deadline constraint without any linear search over the resource pool. Therefore, the proposed heuristic technique takes a shorter time to find a schedule. Moreover, SDF takes even longer than FCFS because it performs sorting each time before scheduling a new batch of tasks.
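A compact sketch of this FCFS allocation policy, as we read the description above (names and structures are illustrative): idle blocks found by a linear scan are claimed first; otherwise the task is queued on the blocks with the minimum estimated finish time, provided its deadline can still be met. SDF would apply the same routine to a deadline-sorted task list.

```python
def fcfs_assign(task, blocks, now=0):
    """`blocks` is a list of dicts with 'free_at', the estimated time
    at which each block finishes its queue. Returns the task's start
    time, or None if no allocation meets the deadline."""
    need = task["blocks"]
    idle = [b for b in blocks if b["free_at"] <= now]
    if len(idle) >= need:
        chosen = idle[:need]              # linear scan: first free blocks win
    else:
        # queue on the busy blocks with the minimum estimated finish time
        chosen = sorted(blocks, key=lambda b: b["free_at"])[:need]
    start = max(max(b["free_at"] for b in chosen), now)
    if start + task["time"] > task["deadline"]:
        return None                       # deadline cannot be met
    for b in chosen:
        b["free_at"] = start + task["time"]
    return start
```

The per-task linear search over `blocks` is exactly the overhead discussed above; it is repeated for every arriving task, whereas the SA evaluates a whole candidate schedule at once.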

In this section, we conduct various experiments to examine the impact of different parameters on the scheduling algorithm.

The heuristic increases in time complexity as the number of tasks increases. The degradation in performance is not as pronounced when we increase other parameters, for example, the number of resources in the pool, the average amount of resources required by each task, or the execution deadlines. The impact on scheduling performance is much greater when the dataset size changes. To demonstrate this, we carried out two series of experiments. In the first series, we decrease the number of resource blocks in each experiment while keeping the number of tasks constant. In the second series, we increase the number of tasks in every experiment while the number of resources remains constant. Both series show an increase in the elapsed simulation time. Table 8 and Table 9 show the task specifications of the two experiments.

[...] Table 8. The x-axis presents the experiment number, and the y-axis presents the elapsed time in milliseconds. We can observe from the figure that the increase in the elapsed time is linear. Moreover, the elapsed time increases because resources must carry out task execution for longer periods of time as we decrease the number of resources in the pool. Each resource block must execute a greater number of tasks in series, one after another, and fewer tasks are executed in parallel due to the lack of sufficient resources. Therefore, decreasing resources increases the simulation time.

Table 10 and Table 11 show the details of the experiments conducted. We set the dataset size to 500 tasks and the resource pool size to 500 blocks for all the experiments. We observe that increasing both task parameters causes a slight exponential increase in the elapsed simulation time. [...]

We can observe that for this test case, by using a vFPGA, [...]
Since the FPGA architecture is different from traditional cloud resources, different virtualization mechanisms have been developed and proposed in the literature. Besides virtualization schemes, scheduling techniques for FPGA resource pools are also an active field of research. Furthermore, validating virtualization and scheduling approaches on real hardware platforms is not simple, because the hardware resources required to set up a cloud environment are expensive and building the environment is time-consuming. Hence, researchers depend heavily on cloud simulators for validation purposes.

This work explored several FPGA virtualization frameworks and proposed an efficient virtualization approach for DPR-enabled FPGAs in the cloud. Our framework abstracted physical FPGA chips into a pool of PRRs using grid-style partitioning and implemented MFMA virtualization. Consequently, this enabled multi-tenancy and improved FPGA resource utilization. Moreover, this work used an infrastructure where FPGAs in the hardware layer are connected to host machines via PCIe and to network devices via Ethernet. This is unlike most of the works reported in the literature, where only one type of physical connection was established with the FPGA. As a result of the additional connectivity, the FPGAs could be used as both local and global accelerators across the network. In addition, the framework used an adapter interface that served as a static communication interface between accelerators and the various framework managers. By automatically generating the adapter, users were allowed to be more productive and focus on application development instead of designing communication interfaces. Furthermore, the role of the FPGA hypervisor was significant, as it provided frontend functions to initialize, operate, and terminate vFPGAs. The unified manager interfaces with these functions to efficiently manage and maintain a pool of FPGA resources. Moreover, the hypervisor implementation in this work is novel because the frontend and backend functions were implemented in separate modules, as a vFPGA manager and a configuration manager, respectively. This allowed for a modular framework architecture and made implementing the framework in CloudSim easier.

A typical cloud receives a bulk of user requests for accelerator services; these are the cloud tasks that must be scheduled efficiently. We formulated an optimization model whose objective was to minimize the makespan of cloud tasks. Unlike the models presented in the literature, the proposed model can be used to optimize resource allocation from a pool of architecturally homogeneous resources, irrespective of the resource type. This means the model is suitable for resource types such as PRRs in a grid, columnar slots, and even whole FPGA fabrics in a multi-FPGA infrastructure. Moreover, the proposed implementation of the SA algorithm, a metaheuristic technique, yielded suboptimal solutions. We improved the SA by incorporating steps of the Metropolis algorithm, which explores neighboring solutions iteratively. In addition, we made the SA adaptive to support dynamic scheduling. This was crucial because tasks in a cloud arrive in [...]