A Hybrid Learning Framework for Service Function Chaining Across Geo-Distributed Data Centers

Service function chaining (SFC) deploys various network functions in geographically distributed data centers and provides interconnect routing among them. Traditional (convex optimization-based) SFC algorithms suffer from limited scalability and accuracy. Recent research has shown the effectiveness of deep reinforcement learning (DRL) in the field of SFC. However, current DRL-based algorithms possess an extremely large action space, which leads to slow convergence and poor scalability. Some researchers relieve this issue by reformulating the SFC problem, which usually results in low utilization and high cost. To address this issue, we develop a hybrid DRL-based framework that decouples VNF deployment and flow routing into different modules. In the proposed framework, a DRL agent is only responsible for learning the VNF deployment policy. We customize the structure of the agent based on deep deterministic policy gradient (DDPG) and adopt several techniques to improve the learning efficiency, such as adaptive parameter noise, the Wolpertinger policy, and prioritized experience replay. The flow routing is conducted in a game-based module (GBM). We design a decentralized routing algorithm for the GBM to address scalability. The end-to-end latency of flows is minimized while the resource capacity and location constraints are satisfied. During the learning process of the proposed framework, the DRL agent improves its deployment policy with the reward from the GBM (the value of the reward depends on the flow routing). Thus, the VNF deployment and flow routing are still jointly optimized. Compared to existing DRL-based algorithms, the proposed hybrid framework achieves a lower cost since 1) the action space is significantly reduced by decoupling the flow routing; 2) the flow routing procedure is more efficient (the GBM exploits model-based information, e.g., the gradient).
Through trace-driven simulations, we verify the efficiency of our algorithm against existing DRL-based algorithms.


I. INTRODUCTION
Traffic flows in packet networks may need to pass through a series of middleboxes (network functions) to acquire various network services. To accommodate network changes (e.g., flow rate variations), network resources should be adaptively scheduled so that ubiquitous services can be provided. From the perspective of network operators, scheduling resources at the hardware level is truly painful. Thus, the service function chaining (SFC) technique is proposed, which utilizes network function virtualization (NFV) and software-defined networking (SDN) paradigms to improve network flexibility [1]. Hardware-based middleboxes are 'softwarized' into virtualized network functions (VNFs) that can run on common servers in edge clouds or data centers. Meanwhile, network operators can schedule the network traffic at flow-level granularity accordingly, which decreases the network cost and ensures a quality user experience.
The goal of SFC is to achieve significant cost reduction compared to traditional service provisioning with dedicated hardware [2]. On-demand deployment and scaling of VNF instances play key roles in cost conservation. As flow demands vary, network operators adaptively scale the number and the placement of VNF instances to save cost (e.g., rental cost) according to the distribution of flows. Meanwhile, flow routing has significant impacts on the quality of service (QoS), e.g., end-to-end delay, as declared in the service level agreement (SLA). Therefore, an SFC algorithm needs to jointly coordinate the placement of VNF instances and the routing of user flows, while the resource capacity and location constraints are satisfied during the scheduling. It has been proven that SFC is a mixed integer programming (MIP) problem with high complexity, which is hard to solve using traditional optimization algorithms [3].
Recently, deep reinforcement learning (DRL) has shown great potential in related areas, such as navigation, traffic engineering, and flow control. Compared to traditional optimization algorithms, DRL has advantages in time-varying systems with complicated environment states. A DRL agent can learn an approximately optimal policy in a short time through trial-and-error interactions. The policy determines the action to be taken given the perceived state of the environment. The authors of [4]-[6] have proven the efficiency of DRL in the field of SFC.
However, existing DRL-based SFC algorithms face several challenges. First, some algorithms do not take the variations of user demands into consideration; the DRL model needs to be re-trained once the flow patterns change. Second, existing DRL-based algorithms have a large action space since the optimization of SFC involves both VNF deployment and flow routing [7]. A large action space leads to a slow learning rate and inaccurate results. Third, some idealized assumptions are introduced to shrink the action space, which meanwhile leads to performance degradation and higher cost.
This paper addresses the above-mentioned issues by proposing an innovative hybrid DRL-based algorithm for SFC deployment. Unlike existing learning-based algorithms, we decouple the flow routing from the DRL agent into a game-based module (GBM). The DRL agent only focuses on the placement of VNF instances, so the action space is significantly reduced. During the learning procedure of the DRL agent, the GBM is responsible for (1) generating a (sub-)optimal routing and (2) feeding back the reward (reflecting the quality of the flow routing given the current deployment policy). According to the reward (which covers the assessment of both deployment and routing), the DRL agent updates its parameterized policy to improve the outcome (the placement of VNF instances). To sum up, a joint optimization of both VNF deployment and flow routing is carried out through the coordination between the DRL agent and the GBM. Note that the optimization of flow routing in the GBM is conducted with a model-based algorithm; model-related information, such as the gradient, is used to improve efficiency. Therefore, the proposed algorithm achieves a lower cost compared to a pure DRL-based algorithm. Our main contributions are summarized as follows.
1) We formulate the SFC problem in a geo-distributed environment. The VNF deployment and flow routing are jointly optimized to minimize the overall cost. A hybrid DRL-based framework is proposed to conduct the optimization.
2) We propose a DRL agent to learn the optimal policy for the VNF deployment. The neural network (NN) structure of the DRL agent is redesigned to produce feasible VNF placements. Adaptive parameter noise and prioritized experience replay are adopted to improve the learning efficiency.
3) We develop a decentralized routing algorithm for the GBM. Since the routing algorithm is carried out in a decentralized manner, the availability of the GBM can be ensured in large-scale networks. Moreover, we prove the convergence and the feasibility of the proposed algorithm.
The remaining parts of this paper are organized as follows. Section 2 reviews the related works in the field of SFC. In Section 3, we describe the network model and the objective of optimization. In Section 4, we detail the structure and learning procedure of the proposed DRL agent. The decentralized routing algorithm for the GBM is illustrated in Section 5. In Section 6, the performance of the proposed algorithms is verified with simulations. Finally, we conclude this paper in Section 7.

II. RELATED WORKS
There have been plenty of works on the deployment of SFC. Initial studies mostly rely on conventional optimization algorithms. To reduce the operation cost and improve service agility, the authors of [8] proposed an eigendecomposition-based algorithm that can dynamically place the VNF instances accordingly. The authors of [9] focused on the reliability issue and presented a suboptimal fast estimation algorithm to improve service quality. In [10], the authors proposed a heuristic deployment algorithm to reduce service delay and enhance the reliability of services. The works mentioned above mainly study the problem in a static scenario.
Some works have also focused on the deployment of SFC in a dynamic environment. The authors of [11] considered the influence of the arrival of new users and the departure of old ones; a column-generation-based algorithm is provided to reduce the resource consumption and signaling overhead during the deployment. The authors of [12] proposed the concept of a 'following chain' to address the reliability issue, which re-arranges the placement of VNF instances based on the mobility pattern of users. The authors of [13] studied scenarios with multiple service chains; a deployment algorithm based on binary integer programming is developed to improve network performance and reduce the resource rental cost. An ILP (integer linear programming)-based flow scheduling algorithm is proposed in [14]; apart from traditional decision variables (e.g., placement and routing), dynamic power control is considered, which can further reduce network energy consumption. Similar work is conducted in [15], in which lightly-loaded servers are deactivated to reduce energy consumption.
Recently, DRL has achieved great success in various fields such as network resource scheduling [16], big data analytics [17], Internet routing [18], and multimedia stream rate regulation [19]. Compared with traditional optimization algorithms, DRL performs better in time-varying systems. The goal of DRL is to learn a strategy through interaction with the environment. The policy is represented by a parameterized neural network (NN); its input is the state (describing the environment) and its output is the corresponding action. DRL obtains knowledge from historical behaviors and ultimately arrives at a better strategy. Once the training is completed, the corresponding results can be obtained directly. Due to the remarkable success of DRL, researchers began to apply it in the field of SFC. Initially, researchers considered the deployment of a single VNF, i.e., the auto-scaling of a VNF. The authors of [20] proposed a learning algorithm based on parallel Q-learning, which strikes a balance between service delay and network cost. The authors of [21] vertically scaled the VNF instance using the Actor-Critic algorithm. To improve resource utilization and service reliability, their algorithm adjusts the number of VNF instances by observing the overheads in the network.
Some authors began to use DRL to solve the VNF placement problem in SFC scenarios. The authors of [22] focused on reliability and proposed a VNF instance placement algorithm based on Deep Q-Network (DQN); the network overhead is reduced while service reliability is guaranteed. Similar work has been done in [23], which optimizes the placement of VNF instances using Q-learning. The authors of [24] considered the selection of the VNF type and proposed a Q-learning-based algorithm to optimize the quality of service. However, these DRL-based algorithms usually have a low learning rate due to the large action space. In our work, we redesign the NN structure of the DRL agent based on DDPG and effectively improve the learning efficiency.
Along with the placement, flow scheduling also needs to be considered in the deployment of SFC. However, the dimension of flow scheduling is much higher than that of VNF placement; using a learning algorithm directly for flow scheduling would result in low learning efficiency. Thus, the authors of [7] proposed an improved algorithm based on DDPG, which relieves the issue by customizing the action exploration and experience replay during learning. The authors of [25] developed an Asynchronous Advantage Actor-Critic (A3C)-based algorithm for SFC deployment; they also assumed that flows with the same source node can only use VNF instances on the same clouds. The authors of [26] proposed a two-stage framework: first, DQN generates several candidate solutions; then, traditional optimization algorithms refine them. The authors of [27] decomposed the SFC deployment into a VNF placement procedure and a flow scheduling procedure, with deep learning (DL) optimizing the two procedures in an isolated manner. Similarly, the SFC deployment is decoupled into two procedures in our work, and we apply a game-based method in the flow scheduling procedure to promote efficiency.

III. SYSTEM MODEL

A. GEO-DISTRIBUTED NFV SYSTEM
A typical geo-distributed NFV system consists of N distinct data centers (N = {1, 2, . . . , N}). NFV providers run a DL-based scheduler for the placement and scaling of each VNF service chain to provide ubiquitous service to users [2]. There can be many network flows between various source regions and destination regions that traverse the same service chain. For ease of presentation, we group the individual network flows that have the same source region and destination region, and refer to the flow group as one flow hereinafter. Each flow then corresponds to a source-destination region pair. Our goal is to place sufficient instances of each VNF in geo-distributed data centers, satisfying the needs of all flows. We consider that instances of the same VNF can be deployed onto different data centers, and different flows can share the same VNF instance.
Different service chains are scheduled by different schedulers, and we focus on one scheduler handling one service chain [6].
The rate of flow i in time slot t, denoted by f_i(t), varies over time. State-of-the-art methods [6], [28] with advanced schemes like RNNs (recurrent neural networks) can predict the flow rate rather precisely. Since flow prediction is not the emphasis of this paper, we assume that f_i(t) can be accurately predicted. The total flow demand in time slot t can be represented as
F(t) = ∑_i f_i(t).
We use x(t) = {x_{m,n}(t)}_{m,n} to describe the overall deployment in time slot t, where x_{m,n}(t) denotes the number of instances of VNF m to be deployed in data center n in time slot t. For simplicity, we use v_{m,n} to abstract the instances of VNF m in data center n owing to their homogeneity. Similarly, e_{n_1,n_2} describes the transmission link between data centers n_1 and n_2.
Note that routing strategies have significant influences on system performance. The set of feasible paths of flow i (originating from n_s^i and ending at n_d^i) is denoted by P_i, and P = ∪_i P_i. A path p ∈ P_i is a set containing the constituent vertices (e.g., v_{m,n}) and edges (e.g., e_{n_1,n_2}). The routing strategies can be represented as y(t) = {y_p^i(t)}_{i,p}, where y^i(t) = {y_p^i(t)}_{p∈P_i} describes the strategy of flow i. We interpret y_p^i(t) ≥ 0 as the amount of flow i choosing path p ∈ P_i in time slot t, with ∑_p y_p^i(t) = f_i(t). Given x(t) and y(t), the aggregated workload on the instances in v_{m,n}, denoted w_{m,n}(t), is
w_{m,n}(t) = ∑_i ∑_{p∈P_i} δ_{m,n,p} y_p^i(t).
Here δ_{m,n,p} is the path-vertex indicator: δ_{m,n,p} = 1 implies that path p traverses v_{m,n}, while δ_{m,n,p} = 0 means v_{m,n} is not on path p. Note that w_{m,n}(t) is equally distributed inside v_{m,n} to achieve a better performance (due to their homogeneity). Let P_m denote the processing capability of one instance of VNF m, which depends on resource capacities of the VM (virtual machine), including CPU, memory, and network I/O [29]. Thus, to make y(t) feasible, w_{m,n}(t) ≤ ξ P_m x_{m,n}(t). Here ξ ∈ (0, 1] is a coefficient for traffic bursts, and the degree of traffic burst is proportional to 1/ξ. We list important notations in Table 1.
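As a concrete illustration of the workload model above, the sketch below (a minimal Python example with a hypothetical data layout: each path is a list of (m, n) vertices) aggregates w_{m,n} from a routing strategy and checks the capacity constraint w_{m,n} ≤ ξ P_m x_{m,n}:

```python
from collections import defaultdict

def aggregate_workload(paths, y):
    """paths[i][p] -> list of (m, n) VNF vertices on the p-th path of flow i;
    y[i][p] -> amount of flow i routed on that path."""
    w = defaultdict(float)
    for i, flow_paths in enumerate(paths):
        for p, vertices in enumerate(flow_paths):
            for (m, n) in vertices:
                w[(m, n)] += y[i][p]
    return dict(w)

def is_feasible(w, x, P, xi=0.8):
    """Check w_{m,n} <= xi * P_m * x_{m,n} for every occupied vertex."""
    return all(load <= xi * P[m] * x.get((m, n), 0)
               for (m, n), load in w.items())
```

For instance, two paths of one flow sharing the vertex v_{1,1} accumulate their rates there, and shrinking the burst coefficient ξ tightens the effective capacity.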

B. COST STRUCTURE
The goal of the NFV provider is to minimize the overall cost when serving the flows. We consider the following costs.
(1) Operational cost. This can be interpreted as the cost of renting VMs or containers for running the instances. Let O_{m,n} be the operational cost of running one instance of VNF m in data center n. The overall operational cost at t can be represented as:
C_oper(t) = ∑_{m∈M} ∑_{n∈N} O_{m,n} x_{m,n}(t).
(2) Deployment cost. When launching a new VNF instance, a deployment cost is incurred for copying a VM image containing the network function to the data center. Let D_{m,n} denote the cost of deploying VNF m in data center n when there was no instance of VNF m deployed in n in the previous time slot. We use the binary variable k_{m,n}(t) to indicate whether any instance of VNF m is deployed in data center n in t (k_{m,n}(t) = 1) or not (k_{m,n}(t) = 0). The total deployment cost in t can be expressed as:
C_dep(t) = ∑_{m∈M} ∑_{n∈N} D_{m,n} [k_{m,n}(t) − k_{m,n}(t−1)]^+,
where [·]^+ = max{·, 0}. For simplicity, we assume k_{m,n}(0) = 0.
(3) Delay cost. The end-to-end delay consists of the transmission latency and the processing latency. Let l_{n_1,n_2} denote the transmission delay from data center n_1 to n_2 (n_1, n_2 ∈ N). The overall transmission delay in time slot t can be computed as:
d^tr(t) = ∑_i ∑_{p∈P_i} y_p^i(t) ∑_{e_{n_1,n_2}∈p} l_{n_1,n_2}.
We assume a constant processing delay of VNF instances, denoted as d^τ_{m,n}. Given a routing strategy y(t), the overall processing delay is
d^pr(t) = ∑_i ∑_{p∈P_i} y_p^i(t) ∑_{v_{m,n}∈p} d^τ_{m,n}.
We multiply the delay by a cost per unit of time, L, to convert the overall delay to a cost:
C_delay(t) = L (d^tr(t) + d^pr(t)).
The overall cost of the geo-distributed NFV system over the entire time span T can then be expressed as
C_all = ∑_{t=1}^{T} (C_oper(t) + C_dep(t) + C_delay(t)).
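A minimal numeric sketch of the per-slot cost (all dictionaries and names are illustrative, and the delay term is assumed to be pre-aggregated into a single number):

```python
def overall_cost_slot(x, x_prev, O, D, total_delay, L=1.0):
    """x, x_prev: instance counts per (m, n) in the current / previous slot;
    O, D: operational / deployment cost per (m, n); L: cost per unit delay."""
    oper = sum(O[k] * cnt for k, cnt in x.items())
    # Deployment cost is charged only when VNF m newly appears in DC n,
    # i.e., the [k_{m,n}(t) - k_{m,n}(t-1)]^+ term.
    dep = sum(D[k] for k, cnt in x.items()
              if cnt > 0 and x_prev.get(k, 0) == 0)
    delay = L * total_delay
    return oper + dep + delay
```

A deployment that keeps instances running across slots thus avoids the one-off deployment cost but keeps paying the operational term.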

C. A HYBRID DEEP LEARNING-BASED SCHEDULER
At the beginning of each time slot t, the NFV provider adjusts the service chain provision and routing strategy to accommodate upcoming flow variations. Both the deployment x(t) and the routing strategy y(t) are dynamically optimized in order to minimize the overall cost. To achieve this goal, a hybrid deep learning-based scheduler is proposed, as shown in Figure 1. Different from existing DL-based schedulers, our proposed scheduler decouples the deployment procedure and the routing optimization into two separate modules, i.e., a DRL module (built for VNF instance placement, x(t)) and a GBM (integrated for routing strategies, y(t)). The former is built on a model-free method, i.e., DDPG, a deep reinforcement learning framework for continuous action spaces. The DRL module improves the deployment policy in a trial-and-error manner. Here, the policy is the rule for determining x(t) under different circumstances (e.g., different f_i(t)). During the learning procedure, the DRL module observes the system state and produces an x(t) using its current policy. A reward signal associated with the current x(t) will then be fed back. Based on the reward, the policy of the DRL module is gradually improved to obtain a smaller C_all. The GBM is designed for routing based on game theory. Given a VNF placement solution (i.e., x(t)) from the DRL module, the GBM is in charge of finding an optimal routing strategy (i.e., y(t)) with minimum C_delay(t). Notice that the GBM is essentially a model-based algorithm that can use model information (i.e., the gradient) to accelerate the learning speed and improve the accuracy. Through interactions with the environment, the proposed framework finally obtains a good policy for both x(t) and y(t) under dynamic network scenarios. The details of the DRL module and the GBM are described in Sections 4 and 5, respectively.
Compared to existing DL-based algorithms (which are mostly built on model-free frameworks, e.g., DQN, AC, DDPG), the proposed hybrid deep learning-based scheduler has advantages in both accuracy and convergence speed. First, in our algorithm, the dimension of the learning in the DRL module is significantly reduced since y(t) is decoupled into the GBM. The reduced dimension makes the learning procedure more efficient, improving both convergence and accuracy. Second, the optimization in our GBM is carried out with a model-based method (based on game theory), which can leverage the model information to accelerate the convergence and improve accuracy. Therefore, our proposed algorithm achieves better performance in both convergence time and accuracy.

IV. A HYBRID LEARNING FRAMEWORK FOR SERVICE FUNCTION CHAINING

A. ONLINE DEEP REINFORCEMENT LEARNING
There are many available DRL algorithms with different features, such as DQN, DDPG, and A3C. We design the DRL module based on the DDPG algorithm, a well-investigated deep reinforcement learning method for continuous action spaces. Compared to other DRL algorithms, DDPG achieves better performance on large-scale problems, which is an admirable feature in our scenario.
The basic DDPG training procedure can be viewed as a combination of policy-based and value-based methods. It learns the optimal policy and its value function through interactions with the environment. The agent is composed of two types of NN: the actor network (i.e., the policy) and the critic network (i.e., the estimated value function). The role of the actor is to define the parameterized policy and generate actions according to the observed network state (e.g., the predicted flow rate, x(t−1)), while the critic is in charge of evaluating current actions considering the reward (derived from the action and C_delay(t) from the GBM). In detail, the critic produces a temporal-difference error (TD-error) which indicates whether current actions are getting better or worse than expected. Then, both the actor and the critic are adjusted accordingly to reduce the TD-error (with the sampled gradient).
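The TD-error the critic produces can be sketched in scalar form as a standard one-step target (`gamma` is an assumed discount factor, not specified in the text):

```python
def td_error(r, q_sa, q_next, gamma=0.99):
    """One-step TD-error: reward plus the discounted value of the next
    state-action pair, minus the critic's current estimate Q(s, a)."""
    return r + gamma * q_next - q_sa
```

A positive TD-error means the action turned out better than the critic expected, pushing the actor toward it; a negative value pushes away.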
In our algorithm, the DRL module decides the deployment x(t) in each time slot t. A straightforward design is to let the DRL agent decide the deployment of all VNFs in one action. However, this design leads to an exponential action space, which incurs significant training costs and slow convergence. Therefore, in each time slot t, we let the agent use M (number of VNF types) steps to complete the deployment x(t). In each step τ , the agent only focuses on the deployment of one VNF m ∈ M.
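The per-VNF decomposition described above can be sketched as follows, with `agent_act` a hypothetical stand-in for the DDPG actor:

```python
def deploy_all_vnfs(agent_act, state, M):
    """Instead of one joint action over all M VNFs (exponential space),
    the agent emits M smaller actions, one VNF per step."""
    x = []
    for m in range(M):
        A = agent_act(state, m)   # deployment of VNF m over the data centers
        x.append(A)
        state = state + [A]       # the partial deployment joins the state
    return x
```

Each step's action only spans the N data centers for a single VNF, so the action dimension per step is N rather than N^M.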

B. DEFINITION OF STATE, ACTION, AND REWARD
Nevertheless, to utilize the DDPG algorithm, we shall first accurately define the state, action, and reward. Upon a state S, a DDPG agent takes an action A and observes a new state S'. As a result, a reward can be calculated to judge the effectiveness of the action. We describe the state, action, and reward associated with our problem as follows.

1) STATE
We initialize the partial deployment x̃_{m,n} to all zeros at the beginning of time slot t. In each step, x̃_{m,n} is continuously updated as x̃_{m,n} ← x̃_{m,n} + A_n, where A is the action decided by the NN in the current step and A_n is its n-th element. To sum up, the state S consists of the predicted flow rates and the partial deployment x̃_{m,n}.

2) ACTION
In each step, the DRL agent produces an action A which is the deployment of the current VNF m over the N distinct data centers, i.e., ∑_{n∈N} A_n = F(t)/(ξ P_m) and A_n ≥ 0, ∀n ∈ N.

3) REWARD
The agent gets a non-zero reward at the end of each time slot. We set the reward as the inverse of the overall cost, in which C_oper(t) and C_dep(t) can be directly obtained inside the DRL module while C_delay(t) is fed back from the GBM.
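A minimal sketch of this reward signal (names are illustrative; the delay term is the value the GBM returns):

```python
def reward(c_oper, c_dep, c_delay):
    """Inverse of the overall slot cost: a cheaper deployment-plus-routing
    outcome yields a larger reward for the DRL agent."""
    return 1.0 / (c_oper + c_dep + c_delay)
```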

C. CUSTOMIZED DDPG ALGORITHM DESIGN
Incorporating the above definitions, we start to design our algorithm based on DDPG. However, there are some limitations of DDPG which prevent a direct implementation. We summarize them as follows. 1) In traditional DDPG, the feasible space of A is an N-dimensional box. However, in our scenario, the action A should be located on a simplex, i.e., ∑_{n∈N} A_n = F(t)/(ξ P_m) and A_n ≥ 0, ∀n ∈ N.
2) DDPG is originally designed for continuous action space while VNF deployment is a discrete decision-making process. Both NN architecture and the exploration mechanism of DDPG should be reinvestigated.
3) Conventional DDPG randomly draws the samples from the replay buffer for training. Sometimes, it may be trapped into bad samples with low efficiency, which leads to slow convergence and bad performance.
To tackle the above-mentioned issues, we customize our algorithm based on DDPG with the following techniques.

1) SOFTMAX LAYER AND WOLPERTINGER POLICY
To make the output of the actor network (i.e., A) feasible, we add an embedded layer (i.e., the Softmax layer), as shown in Figure 2. The Softmax function (also known as the normalized exponential function) takes a vector of N real numbers as input and normalizes it into a probability distribution of N probabilities proportional to the exponentials of the input numbers. By embedding a Softmax layer, we acquire a deployment distribution of VNF instances over the N data centers. Multiplying it by a constant (i.e., F(t)/(ξ P_m)), a relaxed deployment (continuous variable) of the current VNF m can be obtained. Finally, a series of candidate actions can be generated with the Wolpertinger policy, which selects the k most similar actions (compared to the relaxed deployment) using the k-nearest-neighbour algorithm.
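The Softmax scaling and a simplified stand-in for the Wolpertinger lookup can be sketched as follows (here a single greedy rounding replaces the k-NN search over candidate integer actions):

```python
import math

def softmax(logits):
    mx = max(logits)                      # subtract max for stability
    exps = [math.exp(v - mx) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def relaxed_deployment(logits, total):
    """Continuous deployment on the simplex: non-negative, sums to `total`."""
    return [p * total for p in softmax(logits)]

def round_to_simplex(relaxed):
    """Nearest integer deployment keeping the same total (a simple stand-in
    for the k-NN candidate lookup of the Wolpertinger policy)."""
    total = round(sum(relaxed))
    base = [int(v) for v in relaxed]
    deficit = total - sum(base)
    # hand the leftover instances to the largest fractional parts
    order = sorted(range(len(relaxed)),
                   key=lambda i: relaxed[i] - base[i], reverse=True)
    for i in order[:deficit]:
        base[i] += 1
    return base
```

The rounding preserves both simplex properties: non-negativity and the required total number of instances.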

2) ADAPTIVE PARAMETER NOISE
Traditional RL (reinforcement learning) uses action-space noise to change the likelihood of each action the agent might take from one moment to the next. This method is not suitable for our scenario since it might violate the feasibility of A. Therefore, we apply adaptive parameter noise for action exploration, which adds adaptive noise to the parameters of the actor network. Parameter-space noise injects randomness directly into the parameters of the agent, altering the types of decisions it makes such that they always fully depend on what the agent currently senses. Meanwhile, since the structure of the actor network is customized as in Figure 2, the output action always falls in the feasible space of A. Besides, similar to the cases in [7], during the learning procedure we increase the probability of solutions on the boundary of the feasible space.
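A minimal sketch of parameter-space noise with a simple distance-based adaptation rule (the target distance and scaling factor are assumed values, not taken from the paper):

```python
import random

def perturb(weights, sigma, rng):
    """Add Gaussian noise directly to the actor's parameters."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def adapt_sigma(sigma, action_distance, target=0.1, factor=1.01):
    """Grow the noise scale when the perturbed policy barely moves its
    actions; shrink it when the induced change overshoots the target."""
    return sigma * factor if action_distance < target else sigma / factor
```

Because the noise lives in parameter space, the perturbed actor still passes its output through the Softmax layer, so feasibility of A is preserved by construction.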

3) PRIORITIZED EXPERIENCE REPLAY
Traditional RL samples experiences from the replay buffer with equal importance, which ignores the difference in the value of each experience. Therefore, to improve the learning efficiency and avoid local optima, we adopt the technique of prioritized experience replay [30], in which the DRL agent draws experiences from the replay buffer with weights proportional to their TD-error.
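A proportional-sampling sketch of prioritized experience replay (the epsilon term is a common addition that keeps every transition sampleable; rank-based and importance-weighted variants are not shown):

```python
import random

def sample_prioritized(buffer, td_errors, batch_size, rng, eps=1e-3):
    """Draw indices with probability proportional to |TD-error| + eps."""
    weights = [abs(e) + eps for e in td_errors]
    return rng.choices(range(len(buffer)), weights=weights, k=batch_size)
```

Transitions with large TD-error (surprising outcomes) dominate the mini-batches, while eps guarantees low-error transitions are still occasionally revisited.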
Incorporating the above design, we summarize the proposed algorithm in Algorithm 1.

Algorithm 1
Training Procedure for the DRL Module
1: Randomly initialize the critic network Q(S, A|θ^Q) and the actor π(S|θ^π) with weights θ^Q and θ^π
2: Initialize the target networks Q' and π' with weights θ^{Q'} ← θ^Q and θ^{π'} ← θ^π
3: Initialize the replay buffer B
4: for τ = 1 to τ_max do
5:   Initialize the state S
6:   while True do
7:     Apply the adaptive parameter noise to obtain an action A
8:     Adopt the Wolpertinger policy to obtain the placement (i.e., x_{m,n}(t)) for the current VNF m
9:     Execute the placement x_{m,n}(t), observe the reward R (including C_delay(t) from the GBM) and the new state S'
10:    Store the transition (S, A, S', R) into the replay buffer B
11:    if the replay buffer is full then
12:      Apply prioritized experience replay to sample a mini-batch of transitions from the replay buffer B
13:      Update the actor policy with the sampled gradient
14:      Update the target networks for both the actor and the critic
15:    end if
16:    if this episode is finished then
17:      break
18:    end if
19:  end while
20: end for

V. PROJECTION-BASED DECENTRALIZED ALGORITHM FOR ROUTING OPTIMIZATION
The GBM is responsible for seeking an optimal routing strategy y(t) given x(t). Note that the optimization is repeated in every time slot t; we therefore omit the time stamp 't' for simplicity and formulate the problem as follows:
(P1): min_y C_delay(y)
subject to
∑_{p∈P_i} y_p^i = f_i, ∀i ∈ I (9)
y_p^i ≥ 0, ∀i ∈ I, ∀p ∈ P_i (10)
w_{m,n} ≤ ξ P_m x_{m,n}, ∀m ∈ M, ∀n ∈ N (11)
Constraints (9) and (10) give the mathematical integrity of flow routing, while (11) prohibits the overloading of VNF instances. Note that the feasible strategies of flow i depend on the other flows' behavior owing to (11). Thus, (11) is the coupled constraint while (9) and (10) are the local constraints. (P1) is a typical linear programming problem with multiple constraints. Several LP (linear programming) solvers exist, such as the interior point method. However, note that the dimension of Y_i (the feasible region of y^i) is N^M. In SFC scenarios, the solution space of (P1) is thus extremely large. Meanwhile, the speed of solving (P1) matters since it is a part of the learning procedure of the DRL module (responsible for the reward calculation). It is also noticed, however, that a precise solution of (P1) is unnecessary since an approximate answer also works for the learning. Therefore, to address the above-mentioned issue, we design a decentralized algorithm for the GBM in this section.

A. GAME FORMULATION
The routing problem in (P1) can be illustrated in Figure 3. Flows from I distinct sources traverse through VNF instances sequentially. Each edge (e n 1 ,n 2 ) and vertex (v m,n ) in Figure 3 incurs a cost which represents the latency in (6).
We solve the routing problem in Figure 3 with the framework of the selfish non-atomic routing game, G = {I, Y, U }.
Here Y = ∏_i Y_i is the joint strategy space and U = {U_i}_{i∈I} collects the players' cost functions, where y^{-i} denotes the routing strategies of all flows other than i. The cost of player i is its contribution to the overall delay:
U_i(y^i, y^{-i}) = ∑_{p∈P_i} y_p^i (∑_{e_{n_1,n_2}∈p} l_{n_1,n_2} + ∑_{v_{m,n}∈p} d^τ_{m,n}).
Each player in G attempts to minimize its cost U_i(y) by changing y^i. Then, we arrive at the following definition.
Definition (Nash equilibrium of G): Let y* = {(y^i)*, (y^{-i})*} be a strategy profile for the game G. y* is a Nash equilibrium if, for every player i ∈ I,
U_i((y^i)*, (y^{-i})*) ≤ U_i(y^i, (y^{-i})*), ∀y^i ∈ Y_i.
Note that G does not consider the coupled constraint (11). To tackle this issue, we propose an extended game G_ext with I + 1 players. Besides the I players in G, an additional player (i = I + 1) is introduced to cope with (11). The cost function of player i = I + 1 is
J_{I+1}(y, λ) = ∑_m ∑_n λ_{m,n} (ξ P_m x_{m,n} − w_{m,n}(y)), (15)
where λ = {λ_{m,n}}_{m∈M,n∈N} are the decision variables and λ_{m,n} ≥ 0, ∀m ∈ M, n ∈ N. Meanwhile, an auxiliary term is added to the cost function of each player i ∈ I as
J_i(y^i, y^{-i}, λ) = U_i(y) + ∑_m ∑_n λ_{m,n} w_{m,n}(y). (16)
We use z = (y, λ) to denote the strategy profile of the game G_ext, in which each player carries out the following optimization to minimize its individual cost:
min_{y^i ∈ Y_i} J_i(y^i, y^{-i}, λ), ∀i ∈ I; min_{λ ≥ 0} J_{I+1}(y, λ). (17)
Note that in (17) there are no coupled constraints during the optimization.
Definition (Nash equilibrium of G_ext): Let z̄ = (ȳ, λ̄) be a strategy profile of G_ext. Then z̄ is an NE (Nash equilibrium) of the game if no player can reduce its cost J_i by unilaterally changing its own strategy.
Now, we focus on the relationship between G_ext and the solution of (P1).
Theorem 1: Suppose z* = (y*, λ*) is an NE of the game G_ext. Then y* is an optimal solution of (P1).
Proof: We prove the theorem through the equivalence between the game G_ext and the KKT (Karush-Kuhn-Tucker) system of (P1). Owing to the convexity of J_i(z) in each player's own strategy, the players conduct a series of convex optimizations during the game. From the perspective of a player i ∈ I, the KKT conditions of its local problem are satisfied at the point z̄. From the perspective of player i = I + 1, the corresponding conditions hold since λ̄ is an optimal solution of its own problem in (17). Meanwhile, the KKT system of (P1) can be written out at the optimal point. It is clear by inspection that y solves the KKT system of (P1) with multipliers λ, µ, and ν if and only if (y, λ) solves the players' KKT conditions with multipliers µ, ν, and η = P_m − w_{m,n}(y), which completes the proof.
The existence of an NE of G_ext is also guaranteed since (P1) is essentially a convex optimization problem: the objective function C_delay is bounded and the feasible solution space is compact.

B. PROJECTION-BASED DECENTRALIZED ALGORITHM FOR FLOW ROUTING
In this section, we propose a decentralized algorithm to find the optimal flow routing strategy. Theorem 1 has shown that the optimal y of (P1) can be obtained via the NE of G_ext. Note that (P1) arises in a high-dimensional space; conventional game-theoretic schemes (e.g., best-response-based or replicator-dynamics-based iterations) would result in slow convergence and high computational complexity, which is not preferable in our scenario.
To tackle this issue, we propose a projection-based algorithm to find an optimal y of (P1). The proposed algorithm alternately updates y and λ. We assume the presence of a central controller which broadcasts a tentative price λ^(τ) at every step τ of the algorithm. Based on this price and the aggregate of the strategies at the previous step, each agent locally computes y_i^(τ+1). The central controller then updates the price to λ^(τ+1) depending on the congestion of v_{m,n}. The details of the proposed algorithm are described in Algorithm 2.
Algorithm 2 Decentralized Algorithm for Flow Routing
1: Initialize the strategies y_i^(0) and λ^(0).
2: Local: Each player i ∈ I updates its strategy y_i^(τ+1) by a projected-gradient step, where the gradient is D_i^{p,(τ)}.
3: Central: Player i = I + 1 updates λ by a projected-gradient step, where the gradient is D^n_{(τ+1)} = (w_{m,n}(y^{(τ+1)}))_{m∈M} − P_m.
4: Set τ = τ + 1 and repeat Lines 2 and 3 until convergence.

In the proposed algorithm, each player i ∈ I only knows its private information (its local objective function J_i(y_i, y_{−i}, λ) and its local feasible set Y_i) together with the broadcast dual variable λ. This is a preferable feature when the players interact over large-scale networks. Moreover, each iteration involves only projection and gradient operations.
The gradients in Algorithm 2 are easy to obtain. Note that U_i^p(y) is a constant, so player i ∈ I only needs λ to compute its gradient. Meanwhile, the central controller only needs to collect information about the congestion on v_{m,n}. Moreover, the strategy updates of the players i ∈ I can be conducted in parallel, and only M × N values need to be exchanged in each iteration.
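To make the alternating update concrete, the following is a minimal numerical sketch of Algorithm 2 on a toy instance. The quadratic latency cost, the simplex feasible sets Y_i, the capacity P, the flow rates, and the step-size schedule are all illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex {x >= 0, sum x = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def run_algorithm2(I=4, N=3, P=1.5, steps=2000):
    rng = np.random.default_rng(0)
    rates = rng.uniform(0.5, 1.0, size=I)      # toy flow rates (stand-ins)
    y = np.full((I, N), 1.0 / N)                # each flow splits over N links
    lam = np.zeros(N)                           # dual prices, one per link
    for tau in range(steps):
        alpha = 1.0 / (tau + 10)                # diminishing step size
        load = rates @ y                        # aggregate load w_n(y), previous step
        # Local step: each player descends on (quadratic) latency plus the
        # broadcast price, then projects back onto its simplex Y_i.
        for i in range(I):
            grad = rates[i] * (2.0 * load + lam)
            y[i] = project_simplex(y[i] - alpha * grad)
        # Central step: the price ascends on congestion w_n(y) - P, clipped at 0.
        load = rates @ y
        lam = np.maximum(lam + alpha * (load - P), 0.0)
    return y, lam, rates @ y

y, lam, load = run_algorithm2()
print(load)  # per-link aggregate load; roughly balanced and within capacity P
```

With the quadratic cost used here, the projected-gradient iterations balance the per-link loads while the dual price stays at zero whenever the capacity constraint is inactive, mirroring the complementarity argument in Theorem 1.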
Theorem 2: Algorithm 2 converges to an NE of the game G^ext.

Proof: The compact form of the algorithm can be written as follows. First, we prove (26) by contradiction.
If (26) does not hold (i.e., the limit in (26) equals some positive value ξ), we can deduce the relationship in (27). However, for any z ∈ Z, we have (28), where the first inequality holds due to the property of the projection operator (Lemma 3.1 in [31]). By reorganizing (28), we can deduce (29). Note that α_τ² is summable and ∇J(z_τ) is bounded due to the boundedness of the primal variable y and the multiplier λ [31]. Thus, the sum of 2α_τ⟨∇J(z_τ), z_τ − z⟩ is finite, which contradicts (27). Hence, (26) holds. Moreover, (26) implies that there is a limit point z̄ ∈ Z such that ⟨∇J(z̄), z − z̄⟩ ≥ 0, ∀z ∈ Z [31]. Since ∇J(·) is a pseudo-monotone function, we can directly deduce (30) (by the property of pseudo-monotone functions).
Replacing z with z̄ in (29) and applying (30), we obtain (31). Summing (31) over τ ∈ [s, k] yields (32). Taking the limit over k and then over s, we can deduce that ‖z_τ − z̄‖² converges, which means Algorithm 2 finally converges to z̄. Furthermore, note that ⟨∇J(z̄), z − z̄⟩ ≥ 0, ∀z ∈ Z, which is essentially the first-order condition of an NE in (18). Thus, we can conclude that the limit point produced by Algorithm 2 is an NE of the game G^ext.
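For readability, the inequality chain behind the proof can be sketched as follows. This is a reconstruction under the usual projected-gradient assumptions; the paper's (26)–(33) may differ in detail.

```latex
% Hedged reconstruction of the key inequality in the proof of Theorem 2
% (standard projected-gradient analysis).
\begin{align*}
\|z_{\tau+1}-z\|^2
  &= \big\|\Pi_{\mathcal{Z}}\!\big(z_\tau-\alpha_\tau \nabla J(z_\tau)\big)-z\big\|^2\\
  &\le \|z_\tau-\alpha_\tau \nabla J(z_\tau)-z\|^2
  && \text{(non-expansiveness of } \Pi_{\mathcal{Z}}\text{)}\\
  &= \|z_\tau-z\|^2
     -2\alpha_\tau\langle \nabla J(z_\tau),\, z_\tau-z\rangle
     +\alpha_\tau^2\|\nabla J(z_\tau)\|^2 .
\end{align*}
```

Summing this over τ ∈ [s, k] telescopes the left-hand side; the summability of α_τ² and the boundedness of ∇J(z_τ) then force Σ_τ α_τ⟨∇J(z_τ), z_τ − z̄⟩ to be finite, which is exactly the contradiction and convergence argument used above.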

VI. PERFORMANCE EVALUATION

A. SIMULATION SETUP
We evaluate our proposed algorithm through numerical results. The whole time span is one day and each time slot lasts half an hour. The flow pattern is set according to real-world traffic from Huawei Technologies Co., Ltd. The peak hours occur between 9 and 11 PM, when the flow rate is about 720 Mbps. The lowest flow rate is about 160 Mbps and appears between 4 and 6 AM. For simplicity, all the flow rates are normalized within (0, 1]. We use eight Google data center locations to create our data center network (except when testing algorithm performance with different numbers of data centers). The transmission latency between two data centers is proportional to their distance. There are six flows and four types of VNFs in the system by default. The data processing capability of each VNF instance is configured between 600 Mbps and 900 Mbps according to the VNF type. The operational cost falls in (0, 1] and the deployment cost falls in (0, 0.1] according to the Amazon EC2 pricing scheme. We set the weight L of the latency cost to 1.

In the proposed hybrid DRL-based framework, the actor network has three fully-connected hidden layers. Each layer contains 128 neurons with the ReLU (rectified linear unit) activation function. The output layer uses the Softmax function and has N neurons, where N is the number of data centers. A similar structure (other than the output layer) is adopted for the critic network, whose output layer is a single linear neuron. During DRL training, the discount factor γ is set to 0.99. The learning rates of the actor and critic networks are both 0.0001. We use the Adam optimization algorithm for parameter updates. The entropy weight β is set to 5. We build a memory buffer of size 1 × 10^6 and use a batch size of 512 during training.
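The actor structure described above can be sketched as a plain forward pass. The state dimension, the weight initialization, and the random input batch are illustrative assumptions; a real implementation would use a deep-learning framework and train with Adam as configured above.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for a fully-connected network."""
    return [(rng.normal(0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, state):
    """Three ReLU hidden layers followed by a softmax over the N data centers."""
    h = state
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = params[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # softmax output layer

rng = np.random.default_rng(0)
N, state_dim = 8, 32                             # 8 data centers; state size assumed
actor = init_mlp([state_dim, 128, 128, 128, N], rng)
probs = actor_forward(actor, rng.normal(size=(512, state_dim)))  # one batch of 512
print(probs.shape)  # (512, 8); each row is a distribution over data centers
```

The softmax output gives a probability per data center, which is how a continuous DDPG action can be mapped to discrete deployment decisions (e.g., via the Wolpertinger policy mentioned above).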
We compare the proposed hybrid learning framework with the following algorithms:
1) Pure DDPG algorithm (P-DDPG): A DDPG-based algorithm similar to the proposed algorithm. P-DDPG does not use the advanced strategy exploration techniques or prioritized experience replay during learning.
2) DRL-based algorithm (DRL): A state-of-the-art DRL algorithm proposed in [6], in which the authors assumed that each flow can only use the VNF instances in the same data center.
3) Customized DDPG algorithm (C-DDPG): A state-of-the-art DRL algorithm proposed in [7]. In C-DDPG, the actor network produces the VNF placement and flow routing simultaneously.

30.9% lower than P-DDPG, DRL and C-DDPG on average, respectively. Additionally, we can see that, compared with P-DDPG and DRL, the advantage of the hybrid learning algorithm in network cost increases with M. The advantage over P-DDPG owes to the structure of the learning agent and the training strategies in the proposed hybrid learning framework, which enhance the learning efficiency (i.e., the optimality of the VNF instance deployment). This advantage grows with M, which widens the gap in network costs. Compared with DRL, the proposed hybrid learning algorithm permits higher resource utilization, which reduces the network cost. Meanwhile, the simpler training process of our algorithm makes its advantage more prominent.

B. PERFORMANCE
As for C-DDPG, its action space (including the feasible routes of all business flows) grows exponentially with M. Thus, when M is large, the learning efficiency of C-DDPG is low and its results become unstable.

Figure 5 shows the algorithm performance with different numbers of user flows (i.e., I). As the number of flows increases, the deployment cost and the service latency continuously increase. It can be seen that the hybrid learning algorithm always obtains the lowest network cost in all cases. By comparison, the network cost of the hybrid learning algorithm is about 11.9%, 16.7% and 28.9% lower than P-DDPG, DRL and C-DDPG on average, respectively. Moreover, the performance gaps widen with I. The reason is that the scale of the problem increases with I. The hybrid learning algorithm and P-DDPG adopt a game theory-based method for the optimization of flow scheduling, so model information (such as the gradient) can be used to improve their efficiency; the increase in I therefore has a relatively small effect on their learning efficiency. On the other hand, DRL decides the flow paths one by one. The increase in I results in a more complicated training process (more steps per episode), and the dimension of the state space grows rapidly with I, which reduces the learning efficiency and results in larger performance gaps. C-DDPG optimizes the deployment of the service chain from the perspective of flow routing, and its action space increases exponentially with I. Thus, the algorithm is not appropriate when I is large.

Figure 6 plots the algorithm performance with different network sizes (i.e., the number of data centers N). Theoretically, the algorithms can obtain a better solution for the VNF deployment when given a bigger N. Thus, the curves in Figure 6 (i.e., network cost) gradually decrease with N.
It can be seen that among all the algorithms, the proposed hybrid learning algorithm achieves the minimum network cost in all scenarios. Compared with P-DDPG and DRL, its network cost is 11.3% and 20.2% lower on average, respectively. Meanwhile, as N increases, the performance gaps (in percentages) increase rapidly. This shows that the proposed hybrid learning algorithm can work efficiently in large-scale networks and has better scalability. The action space of C-DDPG increases exponentially with N, so its results are unreliable for a bigger N. When N increases from 8 to 10, the network cost under C-DDPG even increases.

It can be seen that the changes in the network cost are basically consistent with the fluctuations of the total demands. At around 5:00 AM, the network cost is the lowest. As the total demands increase, the network cost peaks at around 10:00 PM (i.e., 22:00). With more service flows in the network, more VNF instances need to be deployed, so higher operational and deployment costs are incurred. Therefore, the variations in the network cost and the total demands share a similar tendency. Compared with the other algorithms (P-DDPG, DRL and C-DDPG), the hybrid learning algorithm achieves the minimum network cost; its cost is 8.9%, 18.6% and 29.8% lower than P-DDPG, DRL and C-DDPG on average, respectively. The advantages of the proposed algorithm mainly come from its simple learning structure and efficient flow scheduling algorithm.

Figure 8 plots the variation of network cost under different delay coefficients (i.e., L). Service latency is one of the key indicators that affect the user service experience. By modifying the delay coefficient, the performance characteristics in different scenarios can be examined. It can be seen that in both delay-sensitive and delay-insensitive scenarios, the hybrid learning algorithm can effectively deploy the service chain. This shows that the proposed algorithm has strong adaptability to different scenarios.
Operators can employ the proposed algorithm to serve various latency requirements in different scenarios. On average, the network cost of the proposed algorithm is 11.8%, 19.4% and 29.1% lower than P-DDPG, DRL and C-DDPG, respectively.
In Figure 9, we compare the learning (convergence) speed of the proposed algorithm and DDPG. DQN and A3C are not included in the comparison due to their different learning structures. We can observe that the proposed algorithm converges nearly three times faster than DDPG, while also achieving a lower network cost.

VII. CONCLUSION
In this paper, we propose a hybrid DRL-based learning framework for the SFC problem. The VNF placement and flow routing are jointly optimized to minimize the overall cost, including the operational cost, the deployment cost, and the delay cost. Unlike existing DRL-based algorithms, we decouple flow routing from the learning agent into a game-based module (GBM). The DRL agent, which is designed based on DDPG, only focuses on the optimization of the VNF deployment policy. To further improve the learning efficiency, we customize the NN structure of the DRL agent and adopt several techniques such as adaptive parameter noise, the Wolpertinger policy, and prioritized experience replay. In the GBM, we design a decentralized algorithm for flow routing based on game theory to address scalability. During the learning procedure, the DRL agent adjusts the deployment policy based on the reward generated by the GBM. The learning efficiency is better than that of conventional DRL-based algorithms for two reasons. First, in the proposed algorithm, the action space of the DRL agent is significantly reduced. Second, conventional DRL-based algorithms are essentially model-free methods, whereas in our algorithm the GBM conducts the optimization of flow routing with model-based algorithms; the efficiency is improved since model-based information, such as the gradient, is adopted. Finally, through numerical results, we validate that the proposed algorithm can achieve a lower system cost within limited training episodes compared to existing DRL-based algorithms.