Virtual Machine Placement Optimization for Big Data Applications in Cloud Computing

Big data and cloud computing are two advanced technologies that have overcome many computing and analytical challenges in recent years. With the rise in the applications of these technologies, efficient and optimized utilization of the related resources has become essential. The procedure of placing virtual machines (VMs) on physical machines (PMs) affects the performance, speed, and costs of cloud computing services. VM placement in cloud computing is an NP-hard problem. Indeed, the problem is more complicated in big data tasks due to the need to transfer high volumes of traffic between VMs. This paper proposes a new approach for VM placement in a multi-data-center (DC) cloud environment. The aware genetic algorithm first fit (AGAFF) is a context-aware algorithm that distinguishes big data tasks with an input tag and uses a structure to minimize the traffic between MapReduce nodes. This multi-objective algorithm is based on the genetic algorithm, incorporated with the first fit methodology. The algorithm minimizes energy usage by minimizing the number of used servers, the intra-DC traffic of big data tasks, and VMs' live migrations while maximizing the relevant usage of CPU and RAM in every server. Furthermore, it improves job execution time, especially in big data processing, and reduces service level agreement (SLA) violations. A comparison between the results of AGAFF and four other algorithms shows about a 61% reduction in energy consumption on average at different scales and confirms a decrease in the number of needed PMs, the intra-DC traffic of big data processing, and the number of live migrations.

• Reducing big data traffic and placing interrelated MapReduce VMs in a structure with the minimum cost.
• Presenting a multi-objective solution that manages the waste of resources and energy consumption by placing newly arrived VMs and migrating present ones in case of necessity.

• Defining a scalable solution for heterogeneous multi-DC cloud systems and using modern data center architectures.
• Proposing a comprehensive algorithm regarding real-world service provisioning limitations, circumstances, and priorities.

The rest of this article is organized as follows. Section two reviews the basic concepts. In section three, previous related work is investigated. Sections four and five are dedicated to problem formulation and the AGAFF algorithm, respectively. Eventually, section six discusses the results and analysis of AGAFF.

Given that the perception of the problem and its solution requires knowledge of cloud computing and big data, this section gives a brief explanation of both technologies.

A. CLOUD COMPUTING
In cloud computing, the management of resource provisioning is centralized, and resources are allocated to requests on demand. This allocation is agile, flexible, and elastic. According to the definition given by the National Institute of Standards and Technology (NIST) of the United States [18], cloud computing is a model for enabling network access to a shared pool of configurable resources (e.g., networks, servers, storage, applications, and services). Access to resources is ubiquitous and convenient [18], [19]. On-demand self-service, broad network access, rapid elasticity, and measured services are essential characteristics of cloud computing [18]. Cloud computing platforms are generally provided in three layers, including infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Moreover, four common deployment models are private cloud, public cloud, hybrid cloud, and community cloud.

B. BIG DATA
Big data is an evolving term that describes any huge amount of structured, semi-structured, or unstructured data that has the potential to be mined for useful information [7]. It refers to large datasets that require non-traditional scalable solutions for data acquisition, storage, management, analysis, and visualization [20]. Big data can be characterized by a set of characteristics such as volume, velocity, variety, variability, veracity, value, and validity. The data are generated from multiple sources such as the Internet of things (IoT), social media applications, medical records, emails, documents, websites, science data, sensors, smartphones, and other resources [21]. This amount of data is so large that it was estimated to account for 30 percent of the data stored in data centers in 2021, up from 18 percent in 2016 [22].
Big data can use analytics such as artificial intelligence (AI), machine learning (ML), and deep learning (DL) to produce better analysis [21].

As mentioned before, energy accounts for a large part of cloud operation costs. TPJS [39] was an energy- and locality-efficient MapReduce multi-job scheduling algorithm. The main characteristic of this solution was that its basic unit of resource allocation was a rack. The multi-job scheduling process was divided into two phases: multi-job pre-mapping and parallel job execution. In the first phase, multiple jobs were merged into a job group, and each job in the group was centrally pre-mapped to multiple booked racks. In the second phase, each reduce task of one job was mapped to multiple map tasks to form a task group. The performance of the algorithm was evaluated by comparing its results with those of three other typical methods with respect to job scheduling time, resource balance rate, rack-to-rack traffic, number of used racks, and energy usage.

MOGA [16] was another proposed solution based on genetic algorithms. Its objectives were minimizing energy consumption, resource wastage, and bandwidth usage. MOGA's main idea was to place traffic-dependent VMs on the same PM; if that was not possible, they were preferably placed under a zone of the access layer, otherwise under the aggregation layer, and in the worst case, under the core layer of the DC's tree architecture [16]. The results were compared with an ACO-based algorithm, a heuristic-based FFD solution, and a random-based approach.

Improving cloud efficiency is another subject that some researchers have tried to integrate into the placement problem in different ways. The Purlieus model [29] was one of the first studies focused on data placement in big data applications. Its goal was to reduce the network distance between storage and compute nodes for both "map and reduce" processing. To this end, the authors tried to improve data locality for the MapReduce phases. They divided MapReduce tasks into three categories according to the volume of input data in every phase and then adopted a different strategy for each one. They showed that the combination of their two proposed techniques for VM placement improved the tasks' execution speed by 9.1% to 100%, compared with other techniques.

In [8], Li et al. provided the CAM platform using a min-cost flow approach. Their goal was to reconcile data and VM resources' initial allocation and migration to avoid placement anomalies. They used a MapReduce task classification similar to that of [29] and finally showed that their proposed algorithm made the network traffic three times lower and the speed of MapReduce tasks 8.6 times higher.

However, it had been proved that the VM placement problem (minimizing latency) followed a triangular inequality; so, the best answer was at least twice the optimal answer.

To approach the problem, we use the Knapsack problem as a common method to model the VM placement problem in the first step and then make the model richer, step by step. Table 2 shows the notations used in AGAFF.

The Knapsack is an NP-hard problem [14], [15], [28], [29], [34], [35], including a set of items, each with a weight and a value. The objective is to maximize the total value with respect to the weight limitation. We formulate the problem as follows. If $x_i$, $v_i$, and $w_i$ represent the item, value, and weight, respectively, we have:

$$\max \sum_{i=1}^{n} v_i x_i \quad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \quad x_i \in \{0, 1\},$$

where W is the maximum sustainable weight. Since the Knapsack problem is proven to be NP-hard, we use a constrained optimization method to solve it. To this end, we suppose every answer causes a violation that is not fixed. To measure the violation penalty, we define the equation below:

$$viol(x) = \max\left(0,\ \sum_{i=1}^{n} w_i x_i - W\right).$$

In the next step, we want to optimize the violation penalty. Therefore, we turn the constrained problem into an unconstrained one using the penalty function:

$$f(x) = \sum_{i=1}^{n} v_i x_i - \lambda \cdot viol(x).$$

Now, we map the VM placement problem to Knapsack. To manage and control the placement process, we use a central control system, which is aware of the cloud system, available resources, servers' specifications, and other necessary information for service provisioning. We consider a distributed cloud environment containing m physical machines (PM_1, PM_2, PM_3, ..., PM_m) to define the system. Every PM has k component resources. Our problem is to place n VMs on available PMs to minimize the total energy consumption, CPU and RAM energy usage, and the interrelated traffic between VMs, while the performance indices remain in the best possible situation. We suppose every PM is a bin, and $x_i$ represents the VMs that should be located. It includes running and newly arrived VMs at each moment, and $w_i$ denotes the required processing resources for each VM. The problem is to place as many VMs as possible on every PM while the sum of their required processing resources does not exceed the PM's capacity.
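As an illustrative sketch of the penalty method above (not the paper's implementation; the function names and the linear penalty weight `lam` are assumptions), the unconstrained evaluation of a candidate answer can be written as:

```python
# Penalty-based evaluation of a Knapsack answer: total value minus a
# penalty proportional to how far the packed weight exceeds the capacity.
# `lam` (the penalty weight) is an illustrative choice, not from the paper.

def violation(x, w, W):
    """Amount by which the packed weight exceeds the capacity W (0 if feasible)."""
    packed = sum(wi for xi, wi in zip(x, w) if xi)
    return max(0, packed - W)

def penalized_value(x, v, w, W, lam=10.0):
    """Unconstrained objective: total value minus a penalty for capacity violation."""
    value = sum(vi for xi, vi in zip(x, v) if xi)
    return value - lam * violation(x, w, W)

# Example: three items, capacity W = 5, infeasible selection of items 1 and 3.
x = [1, 0, 1]          # item selection vector
v = [6, 4, 5]          # values
w = [3, 2, 4]          # weights
print(penalized_value(x, v, w, W=5))   # 6 + 5 - 10 * (7 - 5) = -9.0
```

An infeasible answer is not discarded outright; it simply scores worse, which lets a search method move through infeasible regions toward feasible, high-value ones.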

A. CLOUD ENERGY MODEL

As stated before, minimizing energy usage is the objective of AGAFF since it has a great impact on cloud providers' OPEX.

It is also an important step toward green computing.
The total energy consumption of the cloud is the sum of the servers' consumption:

$$E = \sum_{i=1}^{m}\left(E_{CPU_i} + E_{mem_i} + E_{baseline_i}\right),$$

where m is the number of PMs, E_CPU and E_mem represent the energy consumption of the CPU and memory, respectively, and E_baseline is the base power usage that is empirically determined. E_baseline represents the energy consumption of a server when no user-level process is active [44]. The formula is the same for heterogeneous and homogeneous servers.
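A minimal sketch of this model (the field names and the rule that only powered-on PMs contribute are assumptions of the sketch) is:

```python
# Total-energy model: each powered-on PM contributes its CPU energy,
# memory energy, and a fixed empirical baseline; powered-off PMs contribute
# nothing. Field names are illustrative assumptions.

def total_energy(pms):
    """pms: list of dicts with keys 'on', 'e_cpu', 'e_mem', 'e_baseline'."""
    return sum(p["e_cpu"] + p["e_mem"] + p["e_baseline"] for p in pms if p["on"])

pms = [
    {"on": True,  "e_cpu": 120.0, "e_mem": 30.0, "e_baseline": 70.0},
    {"on": False, "e_cpu": 0.0,   "e_mem": 0.0,  "e_baseline": 70.0},
]
print(total_energy(pms))  # 220.0
```

The baseline term is why consolidating VMs onto fewer servers saves energy: switching a PM off removes its entire baseline cost, not just its load-dependent cost.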

Energy waste in PMs is inversely related to the servers' processing power usage. In optimum solutions, energy waste must be at a minimum by using the maximum processing capacity of the servers. So, we try to decrease energy losses in AGAFF as much as possible. To calculate the energy waste ratio, we define (8) when server i is turned on.
To diminish energy usage, interrelated VMs are preferably placed on the same server. If that is not possible, sitting in the same rack is the second choice, and being in distinct racks in the same data center is the last option. Although we evaluate the performance of AGAFF in a multi-DC cloud environment, it avoids placing interrelated VMs of a big data task in different data centers.

Upon the entrance of every big data task, AGAFF's central control system generates an n×n upper triangular adjacency matrix that shows which VMs are interrelated:

$$A = \left[a_{ij}\right]_{n \times n}, \qquad a_{ij} \in \{0, 1\},$$

where $a_{ij}$ represents the relation between VM_i and VM_j. If $a_{ij}$ is 1, it means VM_i and VM_j are interrelated. Otherwise, there is no special relation between VM_i and VM_j. By using matrix A and based on the number of hops between every two VMs in the leaf-spine data center topology, we define Bigdataviol. This index displays the cost imposed by big data VMs' traffic (the Bigdataviol calculation is described in the next section). The cost is calculated based on the number of hops between interrelated VMs in a task. Now, we can compute the energy consumption of big data traffic by (10).

GA forms the initial population randomly by default. This procedure increases the number of solution iterations needed to reach the best results, which produces a longer delay. AGAFF improves the setting of the initial population. To conduct a smart initialization, we combine the results of the random and first fit algorithms. One-third of the first generation is randomly generated. In another one-third, VM placement is done based on the requested CPU according to the first fit method; the first PM to start placement is chosen randomly. The other one-third of the population is generated by using the first fit method based on the requested RAM.
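The mixed initialization above can be sketched as follows. The chromosome encoding (gene i holds the PM index of VM i) and the single-resource capacity check are simplifying assumptions of this sketch:

```python
import random

# AGAFF-style mixed initial population: one third random placements,
# one third first-fit by requested CPU, one third first-fit by requested RAM.
# The first PM scanned by first fit is chosen at random, as in the text.

def first_fit(vms, pm_caps, key, start):
    """Place each VM on the first PM (scanning from `start`) with room left."""
    m = len(pm_caps)
    free = [dict(c) for c in pm_caps]          # remaining capacity per PM
    placement = []
    for vm in vms:
        for j in range(m):
            pm = (start + j) % m
            if free[pm][key] >= vm[key]:
                free[pm][key] -= vm[key]
                placement.append(pm)
                break
        else:
            placement.append(-1)               # no PM could host this VM
    return placement

def initial_population(vms, pm_caps, size):
    pop = []
    for i in range(size):
        if i < size // 3:                      # random third
            pop.append([random.randrange(len(pm_caps)) for _ in vms])
        elif i < 2 * size // 3:                # first fit by requested CPU
            pop.append(first_fit(vms, pm_caps, "cpu", random.randrange(len(pm_caps))))
        else:                                  # first fit by requested RAM
            pop.append(first_fit(vms, pm_caps, "ram", random.randrange(len(pm_caps))))
    return pop
```

Seeding two-thirds of the population with first-fit solutions gives the GA already-feasible, well-packed starting points while the random third preserves diversity.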

After generating the initial population, a crossover operator is applied. The crossover operator defines how two parents are combined to obtain two offspring [15]. We use a one-point crossover, and PC% of the parents are chosen randomly.
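One-point crossover on placement chromosomes can be sketched as below; choosing the cut point uniformly at random is an assumption of this sketch:

```python
import random

# One-point crossover: cut both parents at the same random point and swap
# the tails, yielding two offspring.

def one_point_crossover(parent_a, parent_b, rng=random):
    cut = rng.randrange(1, len(parent_a))      # never cut at the very ends
    child1 = parent_a[:cut] + parent_b[cut:]
    child2 = parent_b[:cut] + parent_a[cut:]
    return child1, child2

# Example with placement chromosomes (gene i = PM index of VM i).
c1, c2 = one_point_crossover([0, 0, 1, 2], [3, 3, 3, 3])
```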

Mutation, another operator used in GA, aims to produce a random change of genes in the chromosome to introduce diversity into the chromosomes [47]. In AGAFF, the mutation operator chooses PM% of the chromosomes. Then, one of the genes (for example, x) is selected randomly. To mutate the chromosome, x/2 replaces x, lowering the VM placement dispersion. Accordingly, some mutated chromosomes are added to the initial population.
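A sketch of this mutation follows; interpreting x/2 as integer division (since genes are PM indices) is an assumption of the sketch:

```python
import random

# AGAFF-style mutation: pick one gene x at random and replace it with x // 2,
# pulling placements toward lower-numbered PMs and reducing dispersion.

def mutate(chromosome, rng=random):
    mutant = list(chromosome)
    i = rng.randrange(len(mutant))
    mutant[i] = mutant[i] // 2
    return mutant
```

Unlike a uniform random mutation, this biased operator nudges VMs toward a smaller set of PMs, which matches the consolidation objective.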

At this step, the whole population is sorted based on cost value. Chromosomes with lower costs are preferred. The algorithm repeats the whole process until the termination condition is met. The termination condition can be determined based on the operational conditions of the environment or the rate of tasks entering the cloud. We set AGAFF's number of iterations as the termination condition. Figure 3 and figure 4 depict the process of the crossover operator and a sample of the mutation operator, respectively.
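Putting the steps together, the overall loop can be sketched as follows; `init`, `crossover`, `mutate`, and `cost` stand for the operators described above, and applying mutation to every offspring is a simplification of this sketch:

```python
# AGAFF-style evolutionary loop: generate offspring, mutate, sort the whole
# population by cost (lower is better), keep the fittest, and stop after a
# fixed number of iterations (the termination condition used by AGAFF).

def evolve(init, crossover, mutate, cost, iterations, pop_size, rng):
    population = init(pop_size)
    for _ in range(iterations):
        offspring = []
        for _ in range(pop_size // 2):
            a, b = rng.sample(population, 2)   # pick two parents
            offspring.extend(crossover(a, b))
        population += [mutate(c) for c in offspring]
        population.sort(key=cost)              # lower-cost chromosomes first
        population = population[:pop_size]     # keep the fittest
    return population[0]                       # best placement found
```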

As stated in the previous section, AGAFF's main objective is minimizing the total cloud energy consumption. We should consider several constraints to reach an applicable solution. To achieve the goal, we define a cost function by (15). AGAFF selects the answers with the minimum cost.

One of the costly challenges is the traffic produced by big data tasks within a data center. A good way to manage this traffic is by optimizing the placement of VMs on the cloud. AGAFF is a context-aware algorithm; in other words, when a big data task arrives, we assign the traffic matrix to it. We define highly cohesive VMs as those whose data transfer in the time interval t is more than a threshold. To model the cloud data centers, we consider a multi-DC cloud with a leaf-spine topology [12]. The structure is shown in Figure 5.
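The hop-based traffic cost over a leaf-spine fabric can be sketched as below. The concrete hop counts (0 on the same PM, 2 via the shared leaf switch, 4 via a spine switch) are assumptions of this sketch, not figures from the paper:

```python
# Hop distance between two VMs in a leaf-spine data center, used to weight
# the traffic of interrelated big data VMs.

def hops(vm_a, vm_b):
    """vm_x: dict with 'pm' and 'rack' identifiers."""
    if vm_a["pm"] == vm_b["pm"]:
        return 0            # co-located: traffic never leaves the server
    if vm_a["rack"] == vm_b["rack"]:
        return 2            # server -> leaf switch -> server
    return 4                # server -> leaf -> spine -> leaf -> server

def bigdata_cost(adj, vms):
    """Sum hop distances over all interrelated VM pairs (adj is upper triangular)."""
    n = len(vms)
    return sum(hops(vms[i], vms[j])
               for i in range(n) for j in range(i + 1, n) if adj[i][j])
```

A placement that keeps interrelated VMs on the same PM or rack therefore scores a lower cost, which is exactly the preference order described for AGAFF.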

Here, it must be noticed that we put some limitations on the placement of big data tasks. Since considerable traffic is transmitted between highly cohesive VMs, the algorithm is not allowed to place these VMs in different data centers.

This helps save energy by reducing the costs of data transfer and by lowering the running latency of big data tasks, as in (21), shown at the bottom of the page.

Observations based on our experience in the XaaS Public Cloud show that in the real world, RAM violation has the main impact on system performance, and its violation can result in the failure of the computing process. Figure 6 illustrates BigdataViol(j).

Optimizing the number of used PMs, minimizing big data traffic through better placement, and decreasing the number of VM migrations by AGAFF are the main reasons for the energy usage reduction. Furthermore, as mentioned before, AGAFF avoids placing interrelated VMs of a big data task in different data centers. This strategy plays a significant role in decreasing the intra-DC big data traffic in a multi-DC cloud.

An interesting point is that the same strategy is applied in GARand, but since its initial population is generated randomly, the number of PMs does not approach the optimum number in the determined time. So, by increasing the number of VMs and enhancing the necessity for more iterations, GARand's results get worse compared with the AGAFF results. Random presents the worst results in scattering VMs. RAM violation, CPU violation, and the percentage of live migrations are in the next ranks. It can be seen in figure 11(b) that AGAFF's total violation is lower than that of the other algorithms, except for GARand. This behavior of GARand can be justified by comparing its performance in figures 11(a and b). Similar to AGAFF, GARand is an evolutionary algorithm that works based on genetics. However, GARand uses the random method to generate the initial population. Figure 11(a) shows the impact of using different methods in generating the initial population. Both AGAFF and GARand can show the best performance in two components out of the four.
If we did not consider the live migration overhead in AGAFF (γ = 0), the total violation would generally decrease. However, the overhead is included to impose real-world limitations in AGAFF. Figures 12(c and d) represent the violation of PMs' resource components in the placement process. In these items, GARand shows better functionality as a result of using more PMs compared with AGAFF. It is worth noting that in BF and FF, VMs are located on PMs based on the required RAM. This results in zero RAM violation and a notable CPU violation. We have chosen RAM as the more critical resource given our observations in the XaaS Public Cloud. However, this experiment can be done by giving priority to the CPU, which replaces the results of RAM violation with CPU violation. If the scheduling in BF and FF is executed based on the required RAM and CPU at the same time, the number of used PMs goes up significantly.

The emergence of cloud computing has changed many aspects of computing. It provides an economical and efficient platform for high-level technologies, such as big data, IoT, artificial intelligence, and edge computing. However, its expansion may cause some challenges, such as high-scale energy consumption, CO2 emissions, and global warming. Many researchers have focused on related subjects to decrease the side effects of cloud computing by optimizing VM placement. Presenting various solutions to place requested VMs optimally is an effective way toward energy usage reduction and green computing.