DQN Approach for Adaptive Self-Healing of VNFs in Cloud-Native Network

The transformation from physical network functions to Virtual Network Functions (VNFs) requires a fundamental design change in how applications and services are tested and assured in a hybrid virtual network. Once VNFs are onboarded onto a cloud network infrastructure, operators need to test them automatically and in real time at the moment of instantiation. This paper analyses the problem of adaptive self-healing of a Virtual Machine (VM) allocated to a VNF using a Deep Reinforcement Learning (DRL) approach. A DRL-based big data collection and analytics engine aggregates probe data and analyses it for troubleshooting and performance management. This engine helps determine corrective (self-healing) actions, such as scaling or migrating VNFs. Hence, we propose a Deep Q-Learning based Deep Q-Network (DQN) mechanism for self-healing VNFs in the virtualized infrastructure manager. Virtual network probes in a closed-loop orchestration automate the VNF and provide analytics for real-time, policy-driven orchestration in an open networking automation platform, using the stochastic gradient descent method for VNF service assurance and network reliability. The proposed DQN/DDQN mechanism optimizes the price and lowers the cost of resource usage by 18% without disrupting the Quality of Service (QoS) provided by the VNF. The resulting adaptive self-healing of VNFs enhances computational performance by 27% compared to other state-of-the-art algorithms.


I. INTRODUCTION
Software Defined Networking (SDN) has engendered the virtualization of applications and networks, culminating in a cloud-native phase. Network Function Virtualisation (NFV) is being adopted in mobile networks with the deployment of 5G [1]. Each cellular network generation has led to the development of new business models [2]. The Virtualized Infrastructure Manager (VIM) manages the entire lifecycle of the software and hardware comprising the NFV Infrastructure (NFVI), and maintains a live inventory and allocation plan of both physical and virtual resources [3]. This approach follows the Management and Orchestration (MANO) function proposed by the European Telecommunications Standards Institute (ETSI). It simplifies service delivery and reduces cost with high-performance lifecycle management. Automation is the key to managing these complex networks and applications across stages such as Physical Network Function (PNF), Virtual Network Function (VNF), and Cloud-native Network Function (CNF), over a massive number and variety of devices and services across many industries [4]. The cloud-native network is a software service that adheres to the design principles of cloud-native network functions without any attached hardware appliances. It includes the CNF as the software component of a network function formerly performed in a physical device, deployed on cloud-native data centres or cloud servers. The transformations in the underlying architecture and technologies lead to complexities in service lifecycle management [5]. Network Intelligence (NI) considers the embedding of Artificial Intelligence (AI) in future networks to accelerate service delivery and operations, leverage Quality of Experience (QoE), and guarantee service availability, better agility, resiliency, faster customization, and security [6]. Network Intelligence technology allows CSPs to capture the details of service- and application-level VNF deployment in the network. NI transforms how network operations are optimized, significantly reducing operating costs. NI is envisioned to manage, pilot, and operate the forthcoming networks built upon SDN, NFV, and the cloud [7].
Machine Learning (ML) for networking enables constant monitoring, and applying ML tools leads to optimizing next-generation networks [8]. ML can exploit the hidden relationships between voluminous input data and complicated system outputs, especially with advanced techniques like deep learning. Other techniques, such as reinforcement learning, can further adapt the learning results and evolve automatically to new environments [9]. Re-instantiating and benchmarking complex services follows standard automation techniques to deliver NFV solutions via ML [10]. Predictive analytics powered by an AI engine enables forecasting by leveraging data, sophisticated algorithms, advanced ML capability, and historical data [11]. AI algorithms monitor the present condition of equipment and help predict failure through data-driven techniques based on the analysis of preceding patterns [12]. These prediction and analysis techniques proactively fix issues with data centres, power lines, cell towers, and equipment at the customer premises [13].
ML and AI can make edge networks more intelligent and show the way for next-generation networks. AI can scale and operate the networks automatically, adapting to new requirements in the model [14]. With this information model, a plug-and-play algorithm applies changes to topology and route optimization as the environment changes. ML is a subset of AI that refers to collective data and pattern analysis, where the software system learns and adapts from continuous experience over time.
The enhancement of NFV has matured into a whole new paradigm with the introduction of advanced orchestration models. Existing systems address the self-healing problem with on-policy methods. Proximal Policy Optimization (PPO) uses a new type of policy-gradient method in which the policy is represented by a neural network, making it easy to implement on the network. Advantage Actor-Critic (A2C) with Generalized Advantage Estimation (GAE) uses a stored trajectory executed in the environment to calculate the estimated advantage function. Trust Region Policy Optimization (TRPO) calculates the weighted probability ratio of the current policy and formulates its optimization using a Kullback-Leibler (KL) divergence constraint, with actions and rewards drawn from a Monte-Carlo trajectory. The proposed DRL algorithm chooses the DQN mechanism over the PPO, A2C+GAE, and TRPO mechanisms because of their scalability issues. The DQN is proposed using Stochastic Gradient Descent (SGD) to calculate the weights of the network. The learning objective uses a target network and an evaluating network. The algorithms were implemented to work on VMs with Apcera installed, trained with data collected through Apcera's API, and simulated on a cloud cluster. We observe that DQN/DDQN outperforms PPO+MC, PPO+GAE, TRPO+MC, and TRPO+GAE among the applied agents. The proposed algorithm is designed for zero human intervention, where the service provider reads through the provisioning of data via the Deep Q-Network (DQN), which decides which public cloud serves the customer best. The self-healing operation restarts VNF applications whenever it detects that an application has crashed or is down, increasing the overall availability of the VNF. This leads to a resilient and fault-tolerant application that can handle changes and perform well in emergencies. The existing solutions are compared with the proposed methodology, and the objectives of the proposed system are described below:
• The proposed system analyses the problem of self-healing VNFs through a DRL mechanism that supports decision-making based on the DQN prediction model.
• VNFs have become a powerful base operation for applying the DRL technique in chaining operations, where service providers can enable/disable services based on QoS.
• The proposed DQN algorithm decides on available instances on VMs, voice services to deploy, type of hardware resources to use, and rapid enabling of services.
• The DQN mechanism minimizes the resource usage of the VNF without disrupting QoS. Adaptive self-healing of VNFs results in faster deployment cycles and lower CapEx and OpEx.
The Open Networking Automation Platform (ONAP) is an organized open-source cloud networking project built with the objective of developing a greater orchestration and automation platform. The major goals of this paper are listed below:
• Provide a real-time operational environment based on AI/ML policy-driven orchestration and automation techniques.
• Manage new services and their resources across the entire life cycle of the network.
• ONAP addresses the industry problem of fragmentation as an initiative that takes the industry towards automation and convergence, enabling an ecosystem between open source and standards.
• Innovation through dis-aggregated services deployed using VNF workloads in containers and virtual machines at the edge, intensively pushing the envelope to 5G.
The major contributions of this paper are listed as follows:
• Six state-of-the-art Deep Reinforcement Learning (DRL) algorithms are examined, with fundamental differences in their properties ranging from off-policy methods such as Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) to on-policy methods such as Proximal Policy Optimization (PPO) and Advantage Actor-Critic (A2C).
• The different policies are compared to a baseline P-Controller in order to evaluate the performance with respect to simpler methods.
• The final policy applied by the agent shows considerable improvements over a simple control algorithm with respect to reward and performance, with multiple experiments with varying loads and configurations tested.
The structure of the remaining paper is sectioned as follows: Section II describes the properties of DRL applied to VNFs in an NFV architecture for managing horizontal autoscaling. Section III deals with the system model and the properties of applying DRL to the problem. Section IV describes how DRL solves the autoscaling problem by applying six state-of-the-art DRL agents to the proposed model. Section V presents the evaluation of results from various experiments and modeling. Lastly, Section VI presents the conclusion and future work.

II. RELATED WORKS
Network virtualization is in place because of the massive influx of devices coming with IoT and 5G applications [15]. Hence, there is tremendous pressure on next-generation infrastructure. Today's network largely comprises purpose-built infrastructure, with each device containing its own management software [16]. The network of tomorrow will be deployed using NFV and SDN. Instead of a separate router, VPN, and firewall on three different hardware pieces, all three can run on the same Intel-architecture-based infrastructure with a network intelligent operations analytics system, as represented in Fig. 1. When software-defined networking is added, a degree of intelligence and flexibility is added to the network that can greatly reduce operating costs.
Traditional infrastructure and new NFV-based infrastructure will need to coexist in the network for a number of years to come. For all this to work, NFV service assurance must be integrated with the service orchestrator responsible for managing the VNF lifecycle. Service assurance analytics is a key input to the orchestrator, driving remedial changes to services/VNFs. Following what open source has already done to transform the operating system (Linux), virtualization infrastructure (OpenStack), and big data (Hadoop), the community started to work in a collaborative manner rather than relying on proprietary solutions [17].

A. OBSERVABILITY BRINGS CLARITY TO CLOUD-NATIVE NETWORK
In microservices, observability in the cloud-native network has become very important. Zero-intervention automation and containerization are important concepts, but microservices bring transparency and assurance to network performance evaluation. In the telecom sector, vendor-specific service assurance solutions observe the network traffic by pulling the data from fibre optic connections via physical taps [18]. Observability of cloud-native networks raises questions such as: How can a physical tap be put on a VM? How can microservices be monitored when thousands of them are deployed on a single VM at a particular time? What happens in the physical infrastructure cannot simply be replicated virtually in the cloud-native network [19]. The open-source cloud community has built a robust ecosystem of tools to provide service assurance comparable to the Fault Management, Configuration, Accounting, Performance, and Security (FCAPS) functionalities in the traditional physical telco cloud.

B. NETWORK AUTOMATION AND OPTIMIZATION
Due to the rapid increase in the number of devices connected to the network, communication networks have become complicated and hard to manage. The deployment of the latest technologies like SD-WAN and services like NFV and SDN has enormously increased complexity [20]. The advancement of automation techniques in network operation is leveraged by allowing network operators to use AI and ML technologies. Collecting network and device data helps predict and pre-empt possible issues in the network and apply fixes to optimize the network's reliability [21]. The service request on the customer portal holds detailed activities such as requests, complaints, interactions, and cross-channel portals. Quantitative and qualitative data are analyzed using various AI, ML, and deep learning techniques [22]. This also uncovers various trends and issues in performance, i.e., based on the device, location, and time zone.

C. 5G WILL BE A TURNING POINT FOR NFV
The ETSI Zero-touch network and Service Management (ZSM) initiative aims to enable largely autonomous networks driven by high-level policies and rules. These networks are capable of self-configuration, self-monitoring, self-healing, and self-optimization without any human intervention, for automated execution of the overall operational processes [23]. This requires a new horizontal and vertical end-to-end architecture framework designed for closed-loop automation and optimized for a data-driven machine learning and artificial intelligence engine for future cloud-native deployments, as illustrated in Fig. 2. The ZSM architecture manages operational data by separating it from the management applications. Efficient access to cross-domain data exposure (e.g., topology, telemetry data) can be leveraged by intelligent network and service capabilities (e.g., AI and ML for automation) [24]. This architectural design helps enable closed-loop automation, i.e., service assurance and process fulfillment at the network and service-management levels in the VNF self-healing process. This results in automated decisions bounded by various policies and rules using a self-optimizing decision-making mechanism. Data has become the lifeblood of bringing automation across cross-domain services. Rapid access to real-time management data has become a key enabler for AI, ML, and closed-loop automation [25]. The rise in data persistence from cross-domain data services allows data to be stored separately from the application and shared amongst other consumers. The data includes various attributes such as performance, trace, configuration, assurance, topology, and inventory data. The ZSM architectural design is meant for closed-loop automation and is optimized with data-driven ML and AI techniques [11].
Closed loop is a feedback-driven operation that helps in continuous adaptation and optimization of network resource utilization and fulfillment of automated service assurance. The analytics are bounded by various policies and rules that determine the operational conditions under which automation is allowed. Based on insights from the research literature, the auto-scaling problem in VNFs is analyzed. The proposed DQN mechanism uses traditional model-free Q-learning to learn a scaling policy and implements an auto-scaling solution based on the available legacy system in the VNF. These approaches demonstrate good results against traditional rule-based auto-scaling solutions. The experimental evaluations are hard to compare because they operate in different environments for implementing RL solutions and modeling the auto-scaling problem. The problem is solved using the DRL mechanism, with the VNFs generating performance measurements as inputs and scaling operations (VNF self-healing) executed based on those measurements. This paper applies six DRL algorithms to an NFV system architectural model for the autoscaling problem in the ONAP platform using closed-loop automation orchestration, as illustrated in Fig. 3. Virtual Probe analytics chooses the DQN approach over RPROP because of the latter's scalability issues. The DQN updates the network through stochastic gradient descent on small batches sampled from stored observations.

III. SYSTEM MODEL

A. PROBLEM DEFINITION
NFV removes the dependency between network function hardware and software by using VNFs. To achieve faster deployment cycles, we need lower Capital Expenditure (CapEx) and dynamic utilization of cloud resources that lowers Operating Expenses (OpEx) [2]. When modeling a VNF for self-healing operations with closed-loop orchestration, reactive action is needed to allocate more resources during an unprecedented rise in load. To make these changes happen, a VNF has to be managed by a system that executes the actions. For now, this management is carried out manually from analytics of the data generated by the VNFs. The systematic procedures in a management platform have a large complexity, which motivates examining the use of DRL in networks by automating the analytics and action selection. The collective input dataset comprises the internal and external load of VNFs deployed in the data centre over four days of data collection. The output comprises the average CPU load of 8 VNFs measured with respect to the packets sent and received on the live network. The computation is performed using a GitHub TensorForce implementation based on the Google Cloud SDK with a local installation of Python 3.5 on the Linux platform. Six state-of-the-art algorithms are compared, and computations are performed using this dataset.
Reinforcement Learning (RL) studies self-learning, adaptable agents that take actions in an environment so as to maximize the rewards resulting from those actions. The action a_t is taken by the agent in an environment based on the current state s_t, reward r_t, and policy π. The policy is evaluated at the intermediate state s_{t+1} and reward r_{t+1}, and the next action is carried out in a closed loop that forms the agent-environment interaction. In essence, reinforcement learning is an optimization problem to maximize the reward over time, where the optimization is based on the environment's state, the reward influence, and the actions performed in changing the states and the reward. The agent exploits current knowledge and maximizes reward by taking greedy actions, while pushing forward the acquisition of new knowledge by taking exploratory actions. Lastly, the agent adapts its policy with respect to the dynamics of state transitions over time.
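A minimal sketch of this closed-loop interaction is given below; the `env` and `agent` interfaces are placeholders assumed purely for illustration and are not the interfaces used in this work.

```python
# Minimal sketch of the agent-environment loop; `env` and `agent` are placeholders
# whose interfaces (reset/step/act/observe) are assumed for illustration only.
def run_episode(env, agent, max_steps=140):
    state = env.reset()                        # initial state s_0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)              # a_t chosen from the policy pi given s_t
        next_state, reward = env.step(action)  # environment returns s_{t+1} and r_{t+1}
        agent.observe(state, action, reward, next_state)  # agent adapts its policy
        state = next_state
        total_reward += reward
    return total_reward
```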

B. VNF MODELLING DURING SCALING (SELF-HEALING) OPERATIONS
In this section, a VNF is modeled with horizontal scaling operations. The states are generic with respect to the environment; translating the states to different counters with various kinds of VNFs is possible. The states are chosen so that the measurements fall into the three classes L_ext, L_in, and p_QoS. The three observable values are divided into loads l, errors e, and allocation u. The loads are considered either internal, L_in, or external, L_ext, and the errors correspond to p_QoS. The observable values are represented in Table 1. To strengthen the MDP properties, the model combines the last k = 5 observed values into one state. Thus, the delayed observation is measured three times for every time step by the RL agents applied to the model. The state at time t therefore holds 15 dimensions, as given in eq. 1.
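As an illustration of how such a state can be assembled, assuming the three observables are simply concatenated over the last k = 5 time steps (the exact composition of eq. 1 may differ), a sketch is:

```python
from collections import deque
import numpy as np

K = 5  # number of past observations combined into one state

history = deque(maxlen=K)

def build_state(l_ext, l_in, p_qos):
    """Append the latest (l_ext, l_in, p_qos) measurement and return the stacked state."""
    history.append(np.array([l_ext, l_in, p_qos], dtype=np.float32))
    if len(history) < K:                       # pad with the oldest observation until full
        pad = [history[0]] * (K - len(history))
        return np.concatenate(pad + list(history))
    return np.concatenate(list(history))       # 15-dimensional state s_t
```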
The scaling actions can be executed on the model via an external entity such as an RL agent, where a_t = 1 represents a scale-out action and a_t = -1 represents a scale-in action. Whenever a scaling action with a_t ≠ 0 is performed, the model is set to a scaling state, where the redistributed loads and the associated work are reflected in the state-transition dynamics described in eq. 2. The duration of this state, n_req, between 3 and 5 time steps, is randomized following the real system. In reality, a VNF configuration and its scaling vary based on the number of VMs allocated by the VNF. The model is configurable with respect to the modeled scaling using the parameters N_vm^min and N_vm^max; therefore, the result of the scaling actions is limited by N_vm^min and N_vm^max.
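A sketch of how the scaling action could be applied is shown below; the VM bounds are illustrative values, not the configuration used in the experiments.

```python
N_VM_MIN, N_VM_MAX = 1, 10   # illustrative bounds; the paper's values are configurable

def apply_scaling(n_vm, action):
    """action in {-1, 0, +1}: scale in, no-op, scale out; result clipped to the VM bounds."""
    return max(N_VM_MIN, min(N_VM_MAX, n_vm + action))
```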

C. DYNAMICS OF STATE-TRANSITION IN DISCRETE MODEL
The state transition determines in detail the internal calculations and samplings that are used as the discrete model traverses state transitions in time [10]. Based on a state transition, the overview of the implemented model and its internal dependencies is adapted to DQN/DDQN for the self-healing of VNFs.

1) CALCULATION OF u_t^vm
The VM utilization rate u_t^vm is calculated as a normalized measurement with respect to the minimum and maximum number of VMs and varies with the number of processes and their noise [48]. The redistribution also shares the load between VMs during scale-in/out processes. The share-update rule, where i ∈ {cpu, ram}, is given in eq. 5.
Here g^sc(l_t^i) is the load added/removed due to redistribution when the scaling action is active, and m defines the number of time steps the system has spent in the scaling state since the last scaling action a_t ≠ 0. Further calculations are performed as in a real system based on the CPU and RAM measurements mentioned in eq. 6.

4) ERROR SAMPLING
The agent optimizes resources while avoiding the negative impact modeled by errors. The measurement of a negative impact is hidden and sampled with probability P, whereas in a real VNF there is a high-risk factor [6]. The errors are defined in eq. 7 and eq. 8:

$$e_t^{cpu} = \begin{cases} 1, & \text{if } l_t^{cpu} > 0.9 \text{ and } U(0,1) > (1 - P_e^{hw}) \\ 0, & \text{else} \end{cases} \qquad (7)$$

$$e_t^{ram} = \begin{cases} 1, & \text{if } l_t^{ram} > 0.95 \text{ and } U(0,1) > (1 - P_e^{hw}) \\ 0, & \text{else} \end{cases} \qquad (8)$$
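A sketch of this sampling is given below; the hardware-error probability P_HW_E is an assumed value chosen only for illustration, since it is not specified in the text.

```python
import random

P_HW_E = 0.05   # illustrative hardware-error probability; not specified in the text

def sample_errors(l_cpu, l_ram):
    """Sample errors as in eqs. 7-8: an error occurs only when the load exceeds its
    threshold and a draw with probability P_HW_E succeeds."""
    e_cpu = 1 if l_cpu > 0.90 and random.random() > (1 - P_HW_E) else 0
    e_ram = 1 if l_ram > 0.95 and random.random() > (1 - P_HW_E) else 0
    return e_cpu, e_ram
```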

IV. PROPOSED SYSTEM
The autoscaling problem is solved using the DRL mechanism, with the VNFs generating performance measurements as inputs and executing scaling operations (VNF self-healing) based on those measurements, as represented in Fig. 4. This paper applies six DRL algorithms to an NFV system architectural model for the autoscaling problem in the ONAP platform using closed-loop automation orchestration. Virtual Probe analytics chooses the DQN approach over RPROP because of the latter's scalability issues. The DQN updates the network through stochastic gradient descent on small batches sampled from stored observations. This paper addresses the following changes in comparison with the previously developed NFQ: 1) Stochastic Gradient Descent (SGD) is used as the update method for calculating the weights of the network. 2) Transitions are sampled collectively in random minibatches for the updates; this method is called experience replay. 3) Two networks are used during learning, i.e., a target network Q(s, a; θ⁻) and an evaluating network Q(s, a; θ). The two-network approach strengthens learning stability: the target network is used to compute the target value, while the evaluating network is the one whose weights are updated, as shown in eq. 9. Every C time steps the two networks are synchronized, i.e., the target network Q(s, a; θ⁻) is set equal to the evaluating network Q(s, a; θ); in between synchronizations, the weights of Q(s, a; θ) are updated with experience replay and SGD (the full procedure is given in Algorithm 1, where the update step can be performed, e.g., with stochastic gradient descent or Adam optimization).

$$L_{DQN}(s_t, a_t; \theta) = \left(r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}) - Q(s_t, a_t; \theta)\right)^2 \qquad (9)$$
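As an illustration of this update, the sketch below uses a linear Q-function and illustrative dimensions and hyperparameters in place of the paper's neural network; the interface and values are assumptions, not the authors' implementation.

```python
import random
import numpy as np

GAMMA, BATCH_SIZE, LR, SYNC_EVERY = 0.99, 32, 0.01, 500   # illustrative hyperparameters
STATE_DIM, N_ACTIONS = 15, 3                               # 15-dim state, actions {-1, 0, +1}

theta = np.zeros((STATE_DIM, N_ACTIONS))        # evaluating (online) network weights, theta
theta_target = theta.copy()                     # target network weights, theta^-
replay_buffer = []                              # stores (s, a, r, s_next) transitions

def q_values(weights, states):
    """Linear Q-function used here purely for illustration; the paper uses a neural network."""
    return states @ weights

def dqn_update(step):
    """One SGD step on a minibatch sampled from the replay buffer (experience replay, eq. 9)."""
    global theta, theta_target
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    s, a, r, s_next = map(np.array, zip(*batch))

    # TD target computed with the frozen target network theta^-
    target = r + GAMMA * q_values(theta_target, s_next).max(axis=1)
    pred = q_values(theta, s)[np.arange(BATCH_SIZE), a]
    td_error = target - pred

    # Stochastic gradient descent on the squared TD error with respect to theta
    for i in range(BATCH_SIZE):
        theta[:, a[i]] += LR * td_error[i] * s[i]

    if step % SYNC_EVERY == 0:                   # every C steps, synchronise the two networks
        theta_target = theta.copy()
```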

A. TRUST REGION POLICY OPTIMIZATION
The ratio of weighted probabilities between the current policy and the changed policy is maximized with respect to θ under a constraint, formulated as the optimization problem shown in eq. 10.
The Kullback-Leibler divergence measurement constrains the policy change, while the future discounted reward Ĝ_t approximates Q^{π_old}(s_t, a_t) from a trajectory experienced in interaction with the environment, i.e., Monte-Carlo trajectories.
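For reference, and under the assumption that eq. 10 follows the usual TRPO formulation, the constrained surrogate objective can be written as

$$\max_{\theta}\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{G}_t\right] \quad \text{s.t.} \quad \hat{\mathbb{E}}_t\!\left[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\right] \le \delta,$$

where Ĝ_t is the Monte-Carlo estimate of Q^{π_old}(s_t, a_t) and δ bounds the allowed policy change.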

B. DQN AND DDQN
The DQN algorithm is proposed with improvements that provide better learning capabilities for handling the autoscaling/self-healing of VNFs in a network. The influential improvements driven by the DQN algorithm are Double DQN (DDQN), prioritized experience replay, and dueling networks. The overview of the implemented model and its internal dependencies, represented by arrows, is given in Fig. 5. The DDQN is the simplest method to implement, as it requires only a minimal change to the loss function, represented in eq. 11. Studies of the action values show that DDQN reduces overestimation compared to standard DQN. The pseudocode with experience replay for DQN and DDQN is given in Algorithm 1.
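Assuming eq. 11 follows the standard Double DQN formulation, the change amounts to selecting the action with the evaluating network and evaluating it with the target network:

$$L_{DDQN}(s_t, a_t; \theta) = \left(r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta);\, \theta^{-}\big) - Q(s_t, a_t; \theta)\right)^2$$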

C. PROXIMAL POLICY OPTIMIZATION (PPO)
The PPO algorithm is a new type of policy-gradient method in which the policy uses a neural network and the optimization follows the TRPO logic, while being easier to implement on the network. Using the KL-divergence idea, PPO optimizes the ratio between different policies and attenuates the ratio difference, as mentioned in eq. 12. A stable update and implementation are performed with PPO in Algorithm 2, resulting in faster convergence; in the algorithm, transitions from T steps are saved with the old policy and a maximizing step in θ is performed on L_π(θ'), e.g., with stochastic gradient ascent or Adam optimization. The corresponding objectives are given in eqs. (13)-(15).
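Assuming eq. 12 follows the standard clipped surrogate objective of PPO, the ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)$ is attenuated as

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],$$

where ε is the clipping parameter and Â_t the estimated advantage.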

D. ADVANTAGE ACTOR-CRITIC (A2C) WITH GAE
The actor-critic method approximates the advantage function using Generalized Advantage Estimation (GAE), as mentioned in Algorithm 3.
The trajectory T is stored and executed in the environment to calculate the estimated advantage function, where V(·) is represented using a neural network as a value function. Both PPO and TRPO are implemented with GAE estimates by replacing Q^{π_old}(s_t, a_t) with the advantage function of A2C, and the corresponding changes to the optimization problems of TRPO and PPO are given in eq. 13 and eq. 14, with the remaining definitions in eqs. (18)-(20).
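For reference, assuming the standard GAE(γ, λ) formulation is used, the estimated advantage over a stored trajectory is

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l},$$

where λ trades off bias against variance in the advantage estimate.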

V. PERFORMANCE EVALUATION
The evaluation section analyzes the task and implements a discrete-time model for VNFs to capture the characteristics of the scaling operation. It also examines the robustness and generalization during the learning process of the six state-of-the-art DRL algorithms. The auto-scaling or self-healing VNF agents are based on workload management on virtual machines and operate automatically without any human intervention.

A. APPLICATION PERFORMANCE MANAGEMENT

1) TENSORFORCE
TensorForce is a new Python framework in which the agents are tuned and trained. TensorForce is based on TensorFlow, allowing for agent modularization and separation of the control logic from the environment. The parameter model is a separate component, hence implemented with a standardized interface and read by all the agents. For the specific autoscaling problem, the agent implementations use off-the-shelf components in TensorForce and need to be tuned to the implemented model. The parameters used in the model for the various agents are listed in Tables 2, 3, 4, and 5.
The trained-agent evaluation is modified to evaluate the agent's performance and robustness with respect to environmental changes. The parameters listed in the tables are fine-tuned to show the dynamics of performing scaling actions under different workloads. Furthermore, the probability of latency is set to 1 for the time it takes to call or send a package. Learning optimal scaling becomes easier in comparison with stochastic states of performance degradation. The reward signal weights are carefully tuned as part of the policy, and good behavior is rewarded. The rewards define the traits of scaling out at high load and scaling in at low load.

2) TRACING
Tracing is performed with a third-party open-source tool in which virtual probes record data logs within the microservices. The virtual probes collect logs and events, capturing data from the microservices of every CNF and from user endpoints, such as TCP RTT, retransmission rate, and DPI inspection. Unlike physical probes, virtual probes do not degrade network performance and generate a wealth of data supporting AI and ML. The evaluation dataset comes from customers, as the VNFs are deployed at the customer premises to interact with other network functions. The dataset is from the Telecom Regulatory Authority of India (TRAI), Government of India, with the assistance of an Indian telecom operator, in order to visualize and analyze it. The complete dataset comprises the internal and external load of a VNF deployed during four data-collection days. The average CPU load of 8 VNFs is measured with respect to the packets sent and received from the live network. The computation is performed using a GitHub TensorForce implementation based on the Google Cloud SDK with a local installation of Python 3.5 on the Linux platform.

B. EXPERIMENTS AND RESULTS
The experimental results are based on the implementation and evaluation described in the previous section. The focus is on the agents' reward, the robustness/generalization of the policies, and the convergence rates applied to the model. Self-healing using DRL on the model uses 140 timesteps to define one episode, with agents trained on the same traffic pattern. The pattern is a sinusoidal function representing the different dynamics of the episodes encountered in traffic, with a period of 140 timesteps and a load varying between 0.10 and 0.95 (a sketch of this pattern is given below). The evaluation is divided into two phases: 1) training-phase evaluation and 2) learned-policy evaluation.
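A sketch of how such a training pattern can be generated is shown below; the phase and exact shape of the sinusoid are assumed for illustration.

```python
import numpy as np

PERIOD, LOAD_MIN, LOAD_MAX = 140, 0.10, 0.95   # one episode = 140 timesteps

def external_load(t):
    """Sinusoidal traffic pattern for training episodes; exact phase and shape assumed."""
    amplitude = (LOAD_MAX - LOAD_MIN) / 2.0
    midpoint = (LOAD_MAX + LOAD_MIN) / 2.0
    return midpoint + amplitude * np.sin(2 * np.pi * t / PERIOD)

episode_loads = [external_load(t) for t in range(PERIOD)]   # one full traffic cycle
```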

1) EXPERIMENT 1: GATHERING STATISTICS OF TRAINING
The training agents are evaluated based on how long each agent, on average, takes to converge. This is an important measurement for agents implemented and deployed on VNFs, since VNFs are slow real-time systems. The average reward in Fig. 6 shows that DQN and DDQN achieve the highest converged reward amongst the agents. The graph shows that PPO A2C has a fast initial increase in reward and leads until about 60000 timesteps. The convergence of all agents happens at approximately 180000 timesteps, corresponding to about 70 days on a real-time system model. DQN and DDQN have the most stable converged value and a small variance once stable. Interestingly, TRPO+GAE does not reduce the variance compared to TRPO+MC enough to see an increase in converged reward. PPO shows a greater impact when combined with A2C, giving faster convergence, smaller variance, and higher reward.
The training results are analyzed to understand why one algorithm performs better than the others. Initially, the episodic rewards show how well each algorithm performs during training. The results show how each algorithm explores new paths and solutions over 1000 episodes. We used an ε-greedy policy with ε decreasing after each episode (sketched below). As ε decreases, so does the number of random actions taken, and the algorithms instead choose actions that maximize the reward. The mean rewards, which concentrate at higher values for DQN than for the comparative algorithms, are listed in Table 7. These rewards are the total returns from the reward function after 285 steps without any random actions. DQN/DDQN has the best training mean and end reward over 1000 episodes; the rest of the algorithms fall behind. From the simulations and from the table, we find that DQN/DDQN is the best among the algorithms. In contrast to episode rewards, the episode cost covers running the VM cloud clusters online. The comparative analysis of the algorithms is carried out based on the mean and end costs after training for 1000 episodes. The training reward of DQN/DDQN outperforms the rest of the algorithms, as listed in Table 8. The VNF agents acted optimally with respect to the policies learned during training, and each experiment was carried out 300 times with different random seeds; the statistics were saved and compared for each VNF agent. For each experiment, the statistics consist of the average reward with percentiles, the average resource utilization, the maximum resource utilization, and the average latency problems for every timestep.
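A sketch of the ε-greedy selection and per-episode decay described above is given below; the decay schedule values are assumed for illustration since they are not reported here.

```python
import random

EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.05, 0.995   # illustrative decay schedule

def epsilon_greedy(q_row, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = len(q_row)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])

def decay(epsilon):
    """Reduce epsilon after each episode so exploration shrinks over training."""
    return max(EPS_MIN, epsilon * EPS_DECAY)
```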
To make a proper comparison between algorithms, speed is the desired attribute for measuring the performance of the cloud infrastructure. The computation time for the various algorithms is listed in Table 8. The DQN/DDQN algorithm takes the minimum time, completing the 1000 episodes of training with the fastest computation, with TRPO+GAE as the baseline for the speed comparison.

2) EXPERIMENT 2: CHANGING THE MODEL PARAMETERS
For this experiment, the best weights amongst the DQN, DDQN, PPO, and PPO A2C agents are saved, with changes applied to the traffic pattern and internal dynamics of the model used in the environments based on the training data. The goal is to evaluate the performance of the agents in critical or new situations, i.e., to measure their generalization properties when the environment changes. The P-Controller serves as the baseline, being the most fundamental control algorithm against which the performance of the other agents is compared. The agents act optimally due to the policy learned during training, and they are evaluated for each state and action without feedback. In Fig. 7, the plot compares statistics with respect to the policy deployed by each agent. The blue line indicates the external load, and the remaining solid lines show the average resource allocation of each agent. The plot also shows the resource utilization, the average latency problems (dashed lines), and the maximum VM utilization (dotted lines) for each timestep.
DQN and DDQN consistently obtain the highest reward during training. In Fig. 8, a customer network pattern is applied for mimicking the traffic. This pattern acts as a training pattern and stretches much longer, over three days in the model, representing 5000 timesteps. In Fig. 9, based on the fixed virtualized resource selection policy deployed by our agents, we conclude that DQN receives the highest reward. In Fig. 10, the graph shows how the process load is updated in two different ways depending on whether the system performs a scaling action.

VI. CONCLUSION AND FUTURE WORK
The six state-of-the-art algorithms are trained for 1000 episodes and evaluated based on performance, rewards, cost, and speed. The PPO+MC agents marginally improve the cost by 1.7%, but they still raise the overhead for resource efficiency during autoscaling operations. TRPO+GAE agents are fairly good at auto-scaling, with a cost improvement of 3.2%. In comparison, our proposed DQN/DDQN learning approach best optimizes the price and lowers the cost by 18.4%. The adaptive self-healing of VNFs enhances the computation performance by about 27%, which is faster than the TRPO+GAE baseline and the other comparative state-of-the-art algorithms. These RL algorithms are developed in Python using the TensorForce framework, and their performance is compared based on cost and stability. The algorithms were implemented to work on VMs with Apcera installed, trained with data collected through Apcera's API, and simulated on a cloud cluster. We observe that DQN/DDQN outperforms PPO+MC, PPO+GAE, TRPO+MC, and TRPO+GAE among the applied agents. We note that TRPO and PPO with GAE estimation show better results than with Monte-Carlo estimation concerning stability and convergence rate. The comparison of DQN with the other agents is strongly based on the relative performance in completing the task. The self-healing of VNFs is solved using DRL, where the cost of development and maintenance has resulted in a performance gain.
A limitation of deploying DRL for VNFs is that traffic patterns differ between customers, which results in uncertainty due to varied configurations and load patterns. The learning performance degrades, and there is a high risk of divergence when the VNF agents need to work in a multi-agent context. Even though we achieved an improvement of 27% compared to all the state-of-the-art algorithms, the results also show that a simpler method based on control theory is competitive. Furthermore, validating DRL over the various configurations, VNF chaining processes, and load patterns is tedious, with a heavy bottleneck in training the policies. In the future, we could embed control methods with classical machine learning properties.

FIGURE 1. A network intelligent operations analytics system for the NFV framework.

FIGURE 2. AI engine for next-generation cloud-native networks.

2) SAMPLING OF PROCESS LOAD l_t^pr
The process-load sampling is driven by the generated external traffic. Based on the generated traffic load, the current process load is calculated in two ways depending on the scaling state. In a real VNF, a scaling action generates additional load on the VM through internal processes that redistribute data and prepare a new VM configuration. With this, the process load is calculated as given in eq. 4:

$$l_t^{pr} = \begin{cases} l_t^{tr} + U(-\eta^{pr},\, \eta^{pr} + \eta^{scale}), & \text{if scaling state} \\ l_t^{tr} + U(-\eta^{pr},\, \eta^{pr}), & \text{else} \end{cases} \qquad (4)$$

The random variable U(a, b) is drawn from a uniform distribution on (a, b), where η^pr, η^scale > 0.

3) SAMPLING LOAD OF l_t^cpu, l_t^ram
The CPU and RAM loads reflect the traffic load and how it is processed and distributed between VMs. The initial values of the CPU and RAM loads, l_0^cpu and l_0^ram, are inspired by the real system.

FIGURE 4. A systematic reinforcement learning schema for self-learning agents.

FIGURE 5. State transition model driven by an external load representing real VNF traffic.

FIGURE 6. Average reward per timestep for the agents.

FIGURE 7. Average reward for the traffic pattern.

FIGURE 8. Traffic load and fluctuations in process load.

FIGURE 9. Automatic adaptation of the virtualized resource selection policy.

FIGURE 10. Visualization of process load during random scaling actions.

FIGURE 11. Visualization of the errors together with the other states.

TABLE 1. List of symbols.