A Deep Reinforcement Learning Based Switch Controller Mapping Strategy in Software Defined Network

Software defined network (SDN) is a promising technology which can reduce network management complexity through the decoupling of the control plane and the data plane. Due to the large number of switches in the data plane, multiple distributed controllers are necessary in the control plane for managing the switches. The switch controller mapping strategy, which identifies the mapping relationships between switches and controllers, is crucial for optimizing network performance. Considering the dynamics of network behavior, it is important and challenging to develop models that reflect the network topology dynamics and to propose methods for solving the long-term network performance optimization. Inspired by recent advances in Artificial Intelligence (AI), in this paper we propose a Deep Reinforcement Learning (DRL) based strategy for solving the switch controller mapping problem. In the proposed mapping strategy, a Markov Decision Process (MDP) formulation is devised and a Deep Q-network (DQN) is employed to maximize the long-term system performance in terms of network latency, load balancing and system stability. Extensive simulations show that the DQN based algorithm achieves the best system stability while maintaining moderate switch controller latency and load-balancing performance, compared with an optimization approach that considers only the current system performance for the mapping decision, and with optimization approaches that generate mapping decisions based purely on latency or on load balancing separately.


I. INTRODUCTION
Software Defined Network (SDN) is an emerging technology in which the control plane and forwarding plane are decoupled. In the control plane, the controllers hold the overall view of the network and can hence perform centralized control of the network. In the forwarding plane, packet forwarding devices (usually OpenFlow switches) handle the data packets' behavior based on the forwarding tables obtained from the controller [1]. The controller can manage the switches' behavior through distributing forwarding rules into the switches. SDN can significantly reduce the complexity of network control and management and promote network innovations, and it has thus become popular in modern network scenarios, such as data centers.
Due to the large number of switches deployed in networks, a single-controller design is impractical. The large amount of traffic and the latency between controller and switch pose challenges for the single-controller design [2]. Accordingly, multiple distributed controllers have been proposed in the literature to improve the reliability and scalability of the control plane [3]. In addition, the implementation of a physically distributed design of the logically centralized control plane has also been investigated [4], [5]. For the distributed control plane, the first basic problem is the controller placement problem of determining the controllers' locations. This problem has been investigated thoroughly in the literature, with the focus on deciding the number of controllers required and their placement locations [3], [6]-[8]. Moreover, after the controllers have been placed, the problem of switch controller mapping needs to be solved. The switch controller mapping identifies the mapping between switches and controllers based on the network environment. Most existing works in this area solve the mapping problem with different objectives, such as balancing the load [9], maximizing the controller resource utilization [10], minimizing the response time [11], or considering the perspectives of both the controller and the switch [12]. However, these existing works solve the switch controller mapping problem based on instantaneous network environment information, such as the controller traffic load, working status and resource utilization.
However, the network topology may be dynamic and change with time. For example, a controller may leave the system due to controller failure. Such time-varying network dynamics can affect the mapping solution: when a switch is attached to a controller that has a high probability of failing in the future, there will be an additional cost for migrating the switch. Considering this, it is better to map a newly joined switch to a controller with a lower failure probability. This poses a challenge to the switch controller mapping problem, since the mapping decision should be devised based not only on the current system status but also on the time-varying characteristics of the network environment. A few research works have been designed to adapt to dynamic network traffic [13]-[15]. However, these works deal with long-term time-average performance optimization for minimizing the system's operation cost by considering only traffic fluctuation. The dynamics of the controllers and switches, such as controllers and switches joining and leaving due to network topology dynamics, have not been taken into consideration.
In this paper, the objective is to tackle the controller switch dynamic mapping problem for optimizing the long-term system performance (cumulative system performance in the long run) by considering the network topology dynamics, including both the controllers' and switches' dynamic behavior. The system performance in terms of the switches' response time, the controllers' load balancing, and system stability in terms of controller failure impact are considered. Since the optimization problem is a multiple-stage decision process, we apply a Markov Decision Process (MDP) to model the problem. We decompose the multi-period decision process into sub-processes, each modeled by the network environment (state), mapping decision (action) and system performance (reward) in the MDP. Then we apply a deep reinforcement learning (DRL) approach to solve the problem. In reinforcement learning (RL), an agent interacts with the environment and adjusts its action policy in order to achieve the optimal solution maximizing the long-term reward (cumulative reward in the long run) of the system [16]. Moreover, DRL has recently been proposed by DeepMind to overcome the limitations of RL in solving problems with a large state space [17]. By taking advantage of deep learning during the learning process, DRL can further solve problems with high dimensional input. Inspired by the powerful tool of DRL, we aim to apply it to solve the long-term optimization problem of dynamic switch controller mapping. The system performance of the objective is modeled as the reward in the MDP, and the agent devises the optimal mapping decision as the action policy. Particularly, we adopt a popular DRL algorithm, Deep Q-network (DQN), in the switch controller mapping design. The contributions of the paper are summarized below.
• The dynamic switch controller mapping problem is proposed to optimize the long-term (cumulative performance in the long run) overall system performance in terms of switches' response time, controllers' load balancing and system stability considering the dynamics of network topology including both controllers' and switches' dynamics.
• The mapping problem is modeled as a MDP, in which the state, action and reward in MDP are formulated. Then a DQN based switch controller mapping algorithm is proposed for solving the long-term optimization problem.
• A simulation platform is established to validate the proposed strategy with different input parameters. The proposed DQN algorithm is compared with a mapping method that does not take the long-term system performance into account. In addition, mapping approaches that consider only the switches' response time or only the controllers' load balancing are also compared.
The rest of the paper is organized as follows. Section II introduces the related work. The system model and design objective are presented in Section III. Section IV presents the DQN based switch controller mapping approach. Numerical results are shown in Section V. Section VI concludes the paper.

II. RELATED WORK
In this section, related work is introduced. We first present the related work on the controller placement and switch controller mapping problems. Then the application of DQN and Q-learning (QL) in network resource management is elaborated.

A. CONTROLLER PLACEMENT
The problem of controller placement is to determine the controllers' locations. In multi-controller SDN systems, the controller system design and the controllers' placement have been investigated in many research works [3], [6]-[8]. In [3], the controller placement design problem has been investigated. The impact of network topology and controller placement on latency was analyzed, and the paper concludes that reaction bounds, metric choices and network topology are the three factors affecting the placement of controllers. In [6], the distributed design of the controller system has been investigated, which makes use of local algorithms from distributed computing for enhancing SDN controller coordination. In [7], a controller placement algorithm is proposed for determining the number and locations of the controllers while satisfying performance metrics such as latency and convergence time. Particularly, the problem is modeled as standard graph theory problems in the forms of K-median and K-center, for which finding the optimal solution is NP-hard. A heuristic K-critical algorithm has been proposed which outperforms the traditional heuristic K-median and K-center solutions.
In [8], heuristic approaches have been proposed to solve the controller placement problem in large scale SDN networks. A Pareto-based optimization framework is proposed for achieving different performance metrics.
To further improve the reliability of SDN, controller placement with reliability considerations is investigated in [18], [19]. In [18], the controller placement policy is determined so as to satisfy a reliability requirement given the failure probabilities of links and switches. The problem is shown to be NP-hard and heuristic solutions are proposed. Similarly, in [19], a fault tolerant controller placement problem was formulated, in which the objective is to identify the placement of controllers that minimizes the cost while satisfying a reliability requirement; a heuristic solution has been proposed. However, the above works focus on the placement of the controllers and do not consider the dynamic mapping between controllers and switches.

B. SWITCH CONTROLLER MAPPING
The switch controller mapping problem is to dynamically assign the switch controller mapping relationships based on the network environment after the controllers have been placed. Several research works have addressed this dynamic mapping problem [9]-[15], [20]. In [9], an elastic distributed controller architecture is proposed, in which the controllers can be dynamically adjusted based on the traffic. The proposed solution monitors the network performance and dynamically migrates switches once the load becomes imbalanced. In [10], [20], the switch controller mapping problem is formulated to maximize the sum of the network utilities of all controllers, and a distributed algorithm is proposed for solving the optimization problem. In [11], the switch controller mapping was modeled to minimize the average response time, and a genetic algorithm has been proposed to solve the optimization problem heuristically. In [12], the switch controller mapping is designed by considering the perspectives of both the controller and the switch. The statuses of the controllers and switches are considered at the same time, a matching list is constructed accordingly, and the mapping decision is made through mutual selection. However, the above research works consider the switch controller mapping problem based only on instantaneous network environment information and instantaneous system performance.
More recently, a few research works have considered long-term performance. In [13], the SDN controller assignment problem is solved to minimize the long-term operating, maintenance and switching cost considering the dynamically changing workload. An algorithm based on randomized fixed horizon control theory has been proposed, in which the long-term optimization is divided into a set of short-term optimization problems. In [14], a distributed switch controller dynamic orchestration problem, including controller activation/deactivation and switch controller mapping, is solved to minimize the long-term system cost. Stochastic optimization theory is applied to solve the multiple timescale optimization problem for adapting to dynamic traffic variation. In [15], a dynamic control plane is proposed to minimize the long-term flow-setup and control plane adaptation cost under dynamic traffic requirements and distributions. A multi-period optimization problem is established, which is decomposed into smaller instances solved by simulated annealing. However, the above works only tackle the problem of minimizing the long-term system cost considering traffic fluctuation. Network dynamics, such as the joining and leaving of controllers and switches due to network failures or topology dynamics, have not been considered.

C. APPLICATION OF DQN/QL IN NETWORK RESOURCE MANAGEMENT
In the process of RL, the learner (agent) interacts with the environment and learns the optimal policy for achieving the maximum long-term cumulative performance. Throughout the process, the agent makes decisions to maximize the expected benefit. Particularly, at each time instant, the agent gathers the state information from the environment, produces an action, and obtains a reward. Then the system transitions to the next state, which together with the reward is used for adjusting the action policy. The above process is repeated to achieve the optimal policy maximizing the long-term reward. The process of RL can be modeled as an MDP, a mathematical framework describing the decision-making process of the agent. An MDP contains four parts: the system state obtained from the environment, the action set, the transition probability from the current system state to the next, and the reward which is immediately obtained when an action is performed.
The most widely used RL algorithm is the Q-learning algorithm proposed in [21]. In the Q-learning algorithm, a two-dimensional (state and action) Q-table is used to record the expected reward. The values of the Q-table are updated at each iteration of the learning process. Based on the Q-table, each system state is associated with a different expected reward for each action, and the action that maximizes the expected reward is selected as the optimal policy.
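The Q-table update described above can be sketched in a few lines of Python. This is a minimal illustration of tabular Q-learning; the toy states, actions and hyperparameter values are our own assumptions, not taken from the paper:

```python
import random
from collections import defaultdict

# Illustrative hyperparameters (assumed values).
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [0, 1]               # e.g., map the switch to controller 0 or controller 1
Q = defaultdict(float)         # Q-table: (state, action) -> expected reward

def choose_action(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

In DQN, the dictionary `Q` above is replaced by a neural network that approximates the same state-action value function.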
DRL [17] further improves the learning time and performance of RL by taking advantage of deep learning during the learning process. Deep learning comprises a family of algorithms which can automatically learn the structure of the data. Particularly, deep learning uses deep neural networks (DNNs), i.e., neural networks with more than one layer, for data modeling. Deep Q-network (DQN) is the most popular DRL algorithm, which trains a DNN as the Q-network. During the training process, a DNN is trained instead of the Q-table used in Q-learning. DQN combines the benefits of both RL and deep learning: the agent can learn to make decisions while dealing with high dimensional input.
DQN/QL has drawn increasing attention in various communication applications [22]. However, the application of DQN/QL to network resource management is limited. We list a number of works applying DQN/QL to network resource management in Table 1. Particularly, we summarize each work in terms of the scenario, system state, action, reward, agent and the algorithm used.
The authors in [23] propose a DRL approach for managing resources at the network edge. Particularly, a case study of determining the VM placement with the consideration of users' mobility is presented. The target is to minimize the operational cost of the process. The location of the VM and the number of users associated with each base station are set as the system state, and the action is the base station to which the VM is going to migrate. The communication costs of both data transmission and VM migration are used for formulating the reward function. The agent is assumed to be an abstract controller in the system, and the DQN method is applied for solving the problem.
In [24], DQN is applied to a joint task offloading and energy allocation problem in mobile edge networks. The computation task queue at the mobile user (MU), the wireless channel state, the energy of the MU, and the MU-base station association constitute the system state. The action set includes the task offloading decision and the energy allocation policy. The reward is a combination of task completion performance (execution and queuing delay, drops, failures) and economic factors. A central controller is assumed to be the agent performing the DQN. In addition, due to the large action space, the authors propose an algorithm based on double DQN.
In [25], a DQN based offloading scheme is proposed for ad-hoc mobile clouds. The end user acts as the agent, learning system information including the queuing information of the users and nearby mobile cloudlets and the distances between users and cloudlets. The agent then determines the number of tasks to be offloaded to each cloudlet. The reward is set as utility minus cost, in order to maximize the sum task rate while minimizing the cost of task delay and economic expense. DQN is used for solving the problem.
In [26], DQN based controller synchronization in distributed SDN is proposed. The controller in each domain acts as an agent making decisions on whether to synchronize with the controllers in other domains. Accordingly, the number of time slots since the last synchronization with the controller in each domain is used as the system state. The domain selected for synchronization is the action. The objective is to minimize the average path cost (APC), which is the link weight assignment over the constructed path. DQN is applied for solving the problem.
In [27], a service placement method in SDN is investigated. The controller runs the algorithm as an agent determining the service placement in switches. The system state is the decision of deleting a service in a switch, and the action set is defined as installing a new service when an old service is deleted. The reward is the benefit of replacing the old service in the switch with a new one. A QL based algorithm is developed for generating the service replacement decisions.
In [28], a QL based multi-objective resource allocation scheme for Network Function Virtualization (NFV) orchestration has been proposed. The NFV controller acts as the agent deciding which node and link to use for virtual network function (VNF) placement and route selection. The reward is a weighted function of node usage frequency together with node and link utilization. A QL based algorithm is used for devising the VNF placement and route selection strategy.

III. SYSTEM MODEL AND DESIGN OBJECTIVE
A. SYSTEM MODEL
The overall system model of switch controller mapping in SDN is shown in Fig. 1. The physical environment consists of a set of controllers and switches. Each controller is connected to a disjoint set of switches through the OpenFlow protocol. We use C = (c_1, c_2, ..., c_i, ..., c_n) to represent the controller set of the network, in which n is the total number of controllers. We use S = (s_1, s_2, ..., s_j, ..., s_m) to represent the switch set of the network. Each switch connects to only one controller, and N_{c_i} denotes the number of switches connected to controller c_i. d_{s_j c_i} represents the distance for mapping switch s_j to controller c_i. The controller system follows a hierarchical SDN control structure [6]. The locations of the controllers have been selected and remain the same once selected, and the number of controllers in the system is fixed. However, the controllers may encounter failures, and the failed controller(s) form a subset of C. In addition, we assume that each controller has a fixed capacity, denoted as W_{c_i}, which limits the number of attached switches. Switches arrive at the system as switch equipment expands, and a switch may also leave the system due to a local failure. A controller can also leave the system due to controller failure; in that case, the switches which were originally connected to the failed controller are remapped. Table 2 summarizes the notations used in the model.
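The entities of this system model can be sketched as simple data structures. The classes and function below are an illustrative assumption of how the model might be represented in code; the attribute names mirror the paper's notation (capacity W_{c_i}, distance d_{s_j c_i}), but the code itself is not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    cid: int
    W: int                      # capacity W_ci: max number of attached switches
    alive: bool = True          # controllers may fail and leave the system
    attached: list = field(default_factory=list)  # switch ids currently mapped here

    def can_accept(self):
        """A controller can accept a switch only if it is alive and below capacity."""
        return self.alive and len(self.attached) < self.W

@dataclass
class Switch:
    sid: int
    d: dict                     # d[cid] = distance d_{sj ci} to each controller
    controller: int = -1        # -1 means unmapped; each switch maps to exactly one controller

def attach(switch, controller):
    """Map a switch to a single controller, respecting capacity and failure status."""
    assert controller.can_accept(), "controller has failed or is at capacity W_ci"
    switch.controller = controller.cid
    controller.attached.append(switch.sid)
```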
The mapping decision module is assumed to be executed on a centralized management plane, e.g., a centralized controller. It obtains the system state information, including both switch features and controller features, from the network environment, i.e., the physical network, and generates the mapping decision. The mapping strategy is then received by the control plane of the network, based on which the switch controller mapping relationship is established.

B. DESIGN OBJECTIVE
Based on the above system model, the dynamic mapping strategy is performed considering the network topology dynamics of both switches and controllers. The objective is to devise the switch controller mapping relationship χ(C) that achieves the optimal long-term cumulative system performance, including the switches' response delay, the controllers' load balancing and the controller failure punishment. Note that during the optimization process, the mapping decision impacts not only the current system performance but also the future system performance. Accordingly, the dynamic mapping process is decomposed into a multiple-stage decision process with time duration T for each period. At each time slot t, we denote B_1(D(t)) as the benefit of the controller response delay for each switch, B_2(U(t)) as the benefit of the controller utilization, and B_3(P(t)) as the benefit of system stability related to controller failure, and we use F() as the overall benefit function. Mathematically, the objective is to maximize the accumulation of the overall benefit function F(B_1(D(t)), B_2(U(t)), B_3(P(t))) over the long-term time period T. The benefits of response delay, controller utilization and system stability quantify the corresponding system performance with specific numerical values, and the system intends to maximize the overall benefit function F() of these values. The numerical value of the system performance is generated as the reward. The details of the overall benefit function F() and the benefit functions B_1(), B_2() and B_3() are given in Section IV, Eqns. (7), (4), (5) and (6). Particularly, the benefit function of response delay is defined as an inverse function of the delay, and the benefit function of controller utilization is an inverse function of the controller utilization.
The benefit function of system stability is an inverse function of the number of switches affected by controller failure. The overall benefit function is a weighted function of B_1(), B_2() and B_3().
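The objective described above can be written compactly as follows. This is a sketch in our own notation: the paper states only that F() is a weighted function of the three benefits, so the weight symbols w_1, w_2, w_3 are assumptions, and the exact functional forms are deferred to Section IV:

```latex
\max_{\chi(C)} \; \sum_{t=1}^{T} F\big(B_1(D(t)),\, B_2(U(t)),\, B_3(P(t))\big),
\qquad
F(\cdot) = w_1\, B_1(D(t)) + w_2\, B_2(U(t)) + w_3\, B_3(P(t)).
```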
In the above problem, the network dynamics model is unknown to the decision module. The objective is to learn the network environment and obtain the optimal mapping control policy. In addition, we have the constraint that each switch can only be mapped to one controller throughout the process.
In the above design, we need to consider the current system performance when making decisions. In addition, we also make mapping decisions considering the future system performance. We illustrate the advantages of considering the long-term system performance in Fig. 2.
Suppose that there are two controllers A and B. At the start of the process (time instant t1), switch c is mapped to controller B according to the mapping strategy. Based on the overall benefit of the system, the reward at t1 is assumed to be 2+1, which is the sum of the rewards for controller switch response delay and load balancing. At time instant t2, another switch d joins the system. Without long-term consideration, the system decides how to map the new switch based on the instantaneous system reward. Suppose connecting switch d to controller B leads to a reward of 1.5+1.5 = 3, while connecting switch d to controller A leads to a reward of 0.5+0.5 = 1 (shown in the orange box), since the delay between switch d and controller A is larger and connecting to A leads to worse load balancing. Considering only the current system reward, the system will choose to map switch d to controller B for the better reward at the current time instant.
However, at the next time instant t3, the network situation may change. For example, a failure may occur in controller B. Then the newly joined switch and the originally connected switches need to reconnect to controller A. Such switch migration reduces the overall system performance, leading to a reduced system reward. Suppose that at time instant t3 no more switches arrive at the system. Due to the controller failure, switches c and d need to be remapped. The reward is now determined by the system stability, which reflects the cost of reconnecting the switches; suppose the reward is −6, since two switches need to be remapped. On the other hand, with consideration of the long-term performance, the system can learn the network dynamics and, at time instant t2, connect the newly joined switch to controller A. This leads to a smaller reward of 1 compared with the former scheme. However, at the next time instant t3, the failure of controller B has less impact on the system, resulting in a larger reward of −3. Summing the cumulative reward over the three time instants, the latter approach leads to better performance (reward = 1) than the former approach (reward = 0). As shown in Fig. 2, the orange curve (considering the long-term system performance) leads to better long-term system performance than the blue curve in the long run.
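The cumulative rewards in this worked example can be checked directly; the per-instant values below are exactly those given in the text for the two policies:

```python
# Rewards per time instant (t1, t2, t3) for the two mapping policies of Fig. 2.
myopic = [2 + 1, 1.5 + 1.5, -6]      # map switch d to controller B at t2
farsighted = [2 + 1, 0.5 + 0.5, -3]  # map switch d to controller A at t2

cumulative_myopic = sum(myopic)          # 3 + 3 - 6
cumulative_farsighted = sum(farsighted)  # 3 + 1 - 3
```

The farsighted policy sacrifices reward at t2 but ends the three-instant horizon ahead, which is the behavior the DQN agent is trained to discover.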

IV. DQN BASED SWITCH CONTROLLER MAPPING APPROACH
In this section, we present the design of the DQN based switch controller mapping approach. Firstly, we present the MDP formulation of the switch controller mapping problem.
Then based on the MDP model, we present the DQN algorithm.

A. MDP FORMULATION
An MDP describes the optimal control problem in a discrete stochastic setting [16]. In our controller/switch mapping problem, the objective is to devise the switch controller mapping relationship that achieves the optimal long-term system performance. The dynamic mapping process is decomposed into a multiple-stage decision process, which is modeled as a discrete stochastic optimal control problem consisting of sequential mapping decisions. Each mapping decision has an impact on the current system performance and also on the subsequent performance. Thereby, an MDP can be applied directly to model the controller/switch mapping problem.
To be more specific, the mapping decision module works as the agent in the MDP. The agent interacts with the SDN network continuously: it makes mapping decisions (actions) and, accordingly, the network status changes and the new network status information is sent back to the agent. The system performance, with specific numerical values, is generated as rewards, which the agent intends to maximize over time through its mapping policy.
In the following, the details of the state S, action A and reward R in the MDP model are presented.
Modeling of the system state S: The system state at time instant t includes two main parts: the controller features and the switch features. We use d_{c_i} to represent the distance of mapping the current switch to each controller c_i. The switch and controller are assumed to be connected through the shortest path. The switches which have already been connected to a controller may leave the system due to switch failures or mobility; the number of switches that have left controller c_i is denoted as l_{c_i}. In addition, we use q to represent the number of switches arriving at the whole system. We use the binary variable s^t_{c_i} to represent the working status of controller c_i at time t, and Δs^t_{c_i} = s^t_{c_i} − s^{t−1}_{c_i} to represent the variation of the working status at t. Particularly, the value of Δs^t_{c_i} can be 0, 1 or −1: 0 represents that the status of the controller remains the same; 1 means a controller which had failed previously becomes available; and −1 means a controller which was available fails at the current time instant t.
We assume that the time duration between t and t+1 is small enough that at most one switch arrival occurs within the time interval. The arrival process of the switches follows a Poisson process. Therefore, q is a binary variable depending on the arrival rate p_t of the Poisson process, which is a time varying parameter. Note that the arrival rate refers to the rate at which newly joined switches are added to the network topology; this corresponds to the network dynamics in terms of switches. When a controller c_f fails at time instant t, the switches connected to the failed controller need to be remapped to other working controllers. During the remapping process, newly arrived switches may still join the topology, so the mapping decision module needs to process both the disconnected switches and the newly arrived switches. In the proposed system, the disconnected switches which were forced to be remapped are processed first; then, mapping decisions for the newly joined switches are generated by the mapping decision module.
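The processing order described above, disconnected switches first and new arrivals second, can be sketched as a simple queue. The function and variable names are illustrative assumptions:

```python
from collections import deque

def build_processing_queue(disconnected, arrivals):
    """Order the switches awaiting a mapping decision: switches disconnected
    by a controller failure are remapped before newly arrived switches."""
    queue = deque(disconnected)   # remap-forced switches go first
    queue.extend(arrivals)        # then the newly joined switches
    return queue
```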
The reason we can guarantee that at most one switch arrives within the time interval is that, for a Poisson arrival process, if the sampling time is small enough, the probability of two or more switch arrivals in the interval can be neglected. In practice, the switch arrival pattern is measured. The interarrival times between switch arrivals can then be obtained, based on which we select the sampling time to be smaller than the minimum interarrival time, guaranteeing that at most one switch arrives within each time interval.
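This claim follows from the Poisson distribution: the probability of two or more arrivals in an interval of length Δt is P(k ≥ 2) = 1 − e^{−λΔt}(1 + λΔt), which vanishes quadratically as Δt shrinks. The sketch below computes this probability; the rate and interval values are example numbers, not from the paper:

```python
import math

def prob_two_or_more(lam, dt):
    """P(k >= 2) for a Poisson process with rate lam over an interval of length dt:
    1 - e^{-lam*dt} * (1 + lam*dt), i.e., one minus P(k=0) minus P(k=1)."""
    x = lam * dt
    return 1.0 - math.exp(-x) * (1.0 + x)
```

For example, with lam = 1 arrival per unit time and dt = 0.01, the probability of a double arrival is below 10^-4, so treating q as binary is a safe approximation.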
Note that although OpenFlow has a mechanism to deal with failed controllers by activating a secondary controller, there are two main differences between the secondary controller approach and the proposed scheme. First, in the secondary controller design, the main controllers are responsible for managing the mapping of switches to the secondary controllers; a secondary controller is activated when the main controller delegates control to it based on its own working information. In the proposed approach, the mapping decision module is a centralized module, which makes mapping decisions based on the whole system environment. Second, in our proposed approach, the mapping module can learn the network environment dynamics and generate the mapping decision based on its interaction with the network for achieving long-term system performance. No research work in the secondary controller mapping area has been designed to achieve this goal.
In summary, the system state obtained by the agent/mapping decision module at time instant t consists of two parts. The parameters n t , N t c i and s t c i describe the controllers' dynamic state, including the number of controllers in the system, the number of switches attached to each controller and the working status of each controller. The parameters l t c i , q t , p t and d t c i relate to the switches' dynamic state, including the leaving and arrival status of the switches and the switches' distances to the controllers. Together, this set of system state parameters models the dynamics of the network topology.
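The state above can be packaged into a feature vector for the Q-network input. This is an illustrative sketch only: the field names and the flattening order are assumptions, since the paper does not specify a concrete encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SystemState:
    # Controller-side dynamics (names mirror the paper's notation)
    n: int              # n^t: number of controllers in the system
    N_c: List[int]      # N^t_{c_i}: switches attached to each controller
    s_c: List[int]      # s^t_{c_i}: controller status (-1 failed, 0 unchanged, 1 up)
    # Switch-side dynamics
    l_c: List[int]      # l^t_{c_i}: switches leaving each controller domain
    q: int              # q^t: 1 if a new switch arrived in this slot, else 0
    p: float            # p^t: time-varying Poisson arrival rate
    d_c: List[float]    # d^t_{c_i}: distance of the new switch to each controller

    def as_vector(self) -> List[float]:
        """Flatten into a fixed-length feature vector for the Q-network."""
        return ([float(self.n)]
                + [float(x) for x in self.N_c]
                + [float(x) for x in self.s_c]
                + [float(x) for x in self.l_c]
                + [float(self.q), self.p]
                + [float(x) for x in self.d_c])
```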
Modeling of the action set A: The action set of the MDP reflects the mapping relationship between the switch and the controller. In this case, the action set a t = {c 1 , c 2 , · · · , c n } is modeled as the set of controllers, and each action selects the controller to which the switch is mapped.
The action set has dimension n, which is the total number of controllers in the system. For ease of implementation, a fixed action set size n is used. However, when a failure occurs, the failed controller cannot be selected, so the actual number of available controllers satisfies n t ≤ n.

Modeling of the reward R: The objectives of the switch controller mapping include three parts: reducing the response time of the switches, balancing the load across the controllers and enhancing system stability by reducing the impact of controller failures. In the DQN design, the algorithm aims to maximize the long-term system reward. The first portion of the reward, which reflects the response time of the switch, is presented in Eqn. (4) as an inverse function of d c i . With this function, the reward r 1 decreases as the distance increases. This guides the DQN module to generate switch controller mappings with smaller response times.
The second portion of the reward addresses load balancing. Here, the aim is to minimize the maximum utilization among the controllers in order to achieve an overall balance of the controllers' load. The reward is represented in Eqn. (5), where the capacity of controller c n is denoted as W c n and γ is a punishment factor for the controller utilization. As the controller utilization approaches 1, the value of r 2 decreases dramatically, so the reward is heavily punished. This avoids overloading controllers and ensures balance among them: if the extreme case of controller utilization is kept under control, the other controllers' utilization must also be under control.
Although the switch only queries the controller for unknown packets, there are several reasons the traffic can still cause high processing delay at a controller. First, new flow arrivals may result in high delay at a controller. It is shown that, in the worst case, a network with 100 switches can see 10M new flow arrivals per second [29]. On the other hand, the NOX SDN controller can handle around 30k flow initiation events per second [30]. Therefore, multiple controllers are needed to handle spikes of new traffic flows. Second, increasing the number of switches beyond a threshold can degrade controller throughput due to increased contention across controller threads, TCP dynamics, and task scheduling overhead within the controller [31]. Third, advanced controller functions, such as congestion control, traffic engineering, load balancing, security and fault tolerance [32], [33], limit the number of switches a controller can manage.
The third portion of the reward is the punishment for failed controllers. If a controller fails, the switches connected to it need to migrate to other controllers, incurring migration cost and QoS violation cost. The punishment for migration is proportional to the total number of switches attached to the failed controllers. Mathematically, this portion of the reward is modeled in Eqn. (6) as a summation over all controllers. If the value of s c n equals −1, the previously available controller c n has failed, so the N c n switches that were attached to it need to be remapped; the reward r 3 is then negative, punishing the overall reward. When the value of s c n equals 0 or 1, there is no punishment, since this indicates that the controller status remains the same or the controller has become available; the previous switch mappings are unchanged, resulting in neither punishment nor benefit.
To summarize, the overall reward function, assuming controller c i is selected, is presented in Eqn. (7), in which η is a function reflecting the benefit of selecting controller c i and is modeled in Eqn. (8). Here, η takes a positive value when the selected controller is working and a negative value when it has failed; by assigning a negative reward to a failed controller, the selection of failed controllers is avoided. In addition, in Eqn. (7), three adjustable weight parameters α, β and δ represent the relative importance of the three portions. Accordingly, the reward can be changed dynamically based on the importance of the three portions in the system design. This adds flexibility, since the reward function can directly reflect changes in the system target.
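The three-part reward can be sketched as follows. The exact functional forms of Eqns. (4), (5) and (8) are not reproduced in this text, so the inverse-distance term, the logarithmic punishment shape (which drops sharply as utilization approaches 1) and the η values below are illustrative assumptions; only the r 3 term and the α, β, δ weighting follow the description directly.

```python
import math

def reward(d_ci, max_util, statuses, attached, selected_failed,
           alpha=1.0, beta=1.0, delta=1.0, gamma=2.0,
           eta_ok=1.0, eta_fail=-10.0):
    """Overall reward for selecting controller c_i (sketch of Eqn. (7)).

    d_ci: distance to the selected controller; max_util: maximum controller
    utilization in (0, 1); statuses/attached: per-controller s_{c_n} and N_{c_n}.
    """
    r1 = 1.0 / d_ci                              # Eqn. (4): some inverse of distance (assumed 1/d)
    r2 = gamma * math.log(1.0 - max_util)        # Eqn. (5): punished sharply as util -> 1 (assumed shape)
    r3 = sum(min(0, s) * n                       # Eqn. (6): punish switches on failed controllers
             for s, n in zip(statuses, attached))
    eta = eta_fail if selected_failed else eta_ok  # Eqn. (8): benefit/penalty of the selection (assumed values)
    return alpha * r1 + beta * r2 + delta * r3 + eta
```

Lower maximum utilization and a working selected controller both raise the reward, matching the design intent described above.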
Based on the reward function r t , the future discounted return R t at time t is defined in Eqn. (9) as [17]

R t = sum from t' = t to T of µ^(t'−t) F(B 1 (D(t')), B 2 (U(t')), B 3 (P(t'))), (9)

which is the sum of the per-step benefit r t = F(B 1 (D(t)), B 2 (U(t)), B 3 (P(t))) discounted by µ at each time-step until the mapping process terminates at T . µ is the discount factor, where 0 ≤ µ ≤ 1, and it reflects the trade-off between the current and future reward.
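The discounted return of Eqn. (9) can be computed directly from a reward trajectory, as in this small sketch:

```python
def discounted_return(rewards, mu=0.9):
    """R_t = sum_k mu^k * r_{t+k} over a finite reward trajectory (Eqn. (9))."""
    R = 0.0
    for k, r in enumerate(rewards):
        R += (mu ** k) * r
    return R
```

For instance, three unit rewards with µ = 0.5 give 1 + 0.5 + 0.25 = 1.75, showing how µ trades off current against future reward.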
Then, the design objective of problem Eqn. (1) can be achieved by generating the policy that attains the maximum expected return Q * (s, a), expressed as [16], [17]

Q * (s, a) = max over π of E[R t | s t = s, a t = a, π], (10)

which is the optimal state-action value function for system state s and action a. π * is the switch controller mapping policy that achieves the maximum expected return Q * (s, a).

B. DQN BASED ALGORITHM
The DQN algorithm contains two parts: an offline training part and an online mapping part. In the offline training part, the DQN algorithm obtains the network state information with the aid of the SDN controllers. The controller and switch behaviors are gathered, based on which a Q-network is trained by the DQN algorithm. During the online mapping process, the trained Q-network is used for generating the online mapping decisions. In the online mapping part, the network environment information is obtained by the SDN controller and modeled as the state parameter s t of the MDP. Then, multiple Q values corresponding to different actions are generated, from which the action with the maximum value is selected. The online mapping process of DQN is fast and efficient, so it can be used for generating real-time mapping decisions. Note that, as a representative machine learning (reinforcement learning) method, DQN can learn the network environment to produce the optimal policy for the MDP problem even when the network dynamics model is unknown. With properly selected training parameters, the DQN algorithm can achieve the optimal policy. Therefore, the proposed DQN based algorithm yields the optimal control policy for solving the MDP problem.
Based on the proposed MDP model, the target of solving the switch controller mapping problem is to find the optimal policy π * that attains Q * (s, a). The optimal policy can generally be found by repeatedly updating the Q value using the following Bellman equation [16]:

Q t+1 (s, a) = E[r + µ max over a' of Q t (s', a') | s, a], (11)

in which s' is the state and a' a possible action at the next time-step t + 1. Such value iteration converges to the optimal Q * as t goes to infinity.
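The value-iteration update of Eqn. (11) can be sketched as a single tabular Q-learning step; the dictionary-based table, the learning rate `lr` and the sample-based form (rather than a full expectation) are standard simplifications, not details from the paper:

```python
def q_update(Q, s, a, r, s_next, actions, mu=0.9, lr=0.5):
    """One sampled Bellman update (Eqn. (11)):
    Q(s,a) <- Q(s,a) + lr * (r + mu * max_{a'} Q(s',a') - Q(s,a))."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    target = r + mu * best_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + lr * (target - old)
    return Q[(s, a)]
```

Repeating the update on the same transition drives Q toward its fixed point, illustrating the convergence claim; as the next paragraph argues, such a table is only tractable for small state spaces.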
In the proposed strategy, we use DQN rather than Q-learning (QL) for two reasons. The first is the high dimension of the system state. In the QL approach, a Q-table is trained during the learning process. The Q-table is a two-dimensional matrix in which the rows correspond to the system states and the columns represent the action space. A Q-table is fast and accurate when the state and action sets are relatively small. However, when they are large, a Q-table is impractical to use and implement. In the dynamic switch controller mapping problem, the system state includes both switch and controller features, and the size of the state space grows exponentially with the number of controllers, so such a large state space cannot be handled by a Q-table. In DQN, a Q-network is trained through a deep learning model, which can describe the data features and learn automatically from the data structure [22]. The Q-network used in the DQN algorithm for switch controller mapping is a DNN, a neural network with two or more hidden layers. Therefore, even if the system state space is large, the Q-network can be trained using only part of it. This solves the problem of the large state space.
The second reason we use DQN is that Q-learning requires the agent to observe all possible system states during the training process to generate the Q-table. However, in the mapping problem, the system state may be random: the switch arrival process and the controller status, as well as their patterns, are unknown. DQN is therefore necessary, since even if the system state is only partially observed during training, actions can still be generated during the decision process using the Q-network.
The overall module design of the DQN-based switch controller mapping is shown in Fig. 3. In the training process, shown in the red dashed line box, the objective is to generate a DNN based Q-network, in which the input is the environment state s and action a, and the output is the value of Q(s, a). The details for generating the Q-network during the training process are further illustrated in Algorithm 1.
Once the Q-network is generated, it can be applied in the switch controller mapping process, as shown in the green dashed line box. During the mapping process, using the trained Q-network with the input network environment state s t and the available actions a 1 t , a 2 t , · · · , a n t , multiple Q values Q(s t , a 1 t ), Q(s t , a 2 t ), · · · , Q(s t , a n t ) are generated. The optimal mapping strategy is then found by choosing the action a * that leads to the maximum value among the Q values. Mathematically, the mapping action is chosen based on Eqn. (12) as

a * = arg max over a of Q(s t , a). (12)

Furthermore, the training algorithm of DQN based switch controller mapping is illustrated in Algorithm 1. At the start of the algorithm, the replay memory D of a given size is initialized, and the Q-network and target Q-network are initialized with parameters θ and θ', respectively. We use a DNN as the Q-network model. The body of the algorithm is then executed for T episodes. Each episode starts from a randomly generated system state, and the inner loop iterates for S steps. The classical ε-greedy algorithm is used for selecting the action: with probability ε, the action a t is chosen randomly from the action set, and with probability 1 − ε, it is chosen by maximizing the value of Q(s t , a t ). In particular, ε starts at 1 at the beginning of training and decreases with the episodes. The reason is that the agent knows little at the beginning of the iterations, so it tends to search randomly among the actions, since the action leading to the best cumulative reward is uncertain. As the iterations proceed, the agent learns more about the network, so it makes more sense to select the action that produces the best Q value.
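The ε-greedy selection and decay schedule described above can be sketched as follows; the linear decay step of 0.001 matches the simulation settings later in the paper, while the function names are illustrative:

```python
import random

def epsilon(step, eps0=1.0, decay=0.001, eps_min=0.0):
    """Linearly decayed exploration rate: starts at eps0 and decreases
    by `decay` per training step until it reaches eps_min."""
    return max(eps_min, eps0 - decay * step)

def select_action(q_values, eps, rng=random):
    """Epsilon-greedy: explore with probability eps, otherwise pick
    a* = argmax_a Q(s_t, a) over the Q-network outputs (Eqn. (12))."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```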
Then, based on the selected a t , the next system state s t+1 is obtained. Accordingly, a four-tuple (s t , a t , r t , s t+1 ) is generated, stored in the memory pool D, and used for training the Q-network. To improve the independence of the data samples, mini-batches of four-tuples are randomly selected from D. Therefore, the DNN is trained using both historical and recently generated data, which improves the independence of the training data. The selected data are then used for training the DNN parameters θ with the stochastic gradient descent method, minimizing the loss function

L(θ) = E[(r + µ max over a' of Q(s', a'; θ') − Q(s, a; θ))^2].

Then, every M steps, the parameter θ' of the target network Q' is updated with the Q-network parameter θ. The reason for updating the Q'-network only every M steps is stability [17]: the parameters of the Q-network are copied to the target Q'-network every M steps, and for the next M steps the Q'-network is used in the stochastic gradient descent targets. This reduces the correlation between the target Q'-network and the estimated Q-network and increases the stability of the DQN algorithm [22].
DQN based switch controller mapping can be applied to a real SDN system naturally, thanks to the separation of the forwarding plane and control plane. During the training process, the local SDN controller can collect basic network environment statistics (switch statistics) through the southbound interface. This information is then delivered to the mapping decision module through the northbound interface. The mapping decision module can either be a centralized server connected to the SDN controllers through the northbound interface, or it can connect to a root controller through the controller northbound interface in a hierarchical SDN structure [6]. In addition, the mapping decision module can obtain local controller statistical information. The mapping decision module then acts as the agent in the MDP for training the Q-network.
During the mapping process, the local controllers and the mapping decision module collect the switch and controller behavior information, respectively. Then, based on the trained Q-network, the mapping decision module acts as the agent for generating the control policy. The mapping policy is distributed through the northbound interface to the local controllers, and finally the mapping relationship is delivered through the southbound interface of the local controllers. The separation of the forwarding and control planes in the SDN system ensures the implementability of the proposed DQN.
Algorithm 1 Training Algorithm of DQN Based Switch Controller Mapping
1: Input: MDP models
2: Output: Switch controller mapping policy
3: Initialize the target Q-network Q' with randomly chosen weight parameters θ'
4: Initialize the Q-network Q with randomly chosen weight parameters θ
5: Initialize the size of the memory pool D
6: for each episode 1, · · · , T do
7:   Randomly select an initial state s 0
8:   for each step 1, · · · , S do
9:     With probability ε, randomly choose an action a t
10:    With probability 1 − ε, choose the action a t = arg max a Q(s t , a|θ)
11:    Based on the chosen action a t , obtain the current reward r t and the next state s t+1
12:    Store (s t , a t , r t , s t+1 ) into the memory pool D
13:    Randomly select mini-batches {(s k , a k , r k , s k+1 )} from D as samples
14:    Perform gradient descent with respect to the parameters θ to minimize the loss function
15:    Every M steps, reset Q' = Q
16:  end for
17: end for
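Algorithm 1 can be sketched end to end as below. This is a minimal illustration under stated assumptions: a toy linear Q-approximator stands in for the paper's DNN, the environment is supplied as a callback, and hyperparameter values are placeholders rather than the paper's settings.

```python
import random

class LinearQ:
    """Toy linear Q-approximator standing in for the DNN Q-network."""
    def __init__(self, n_features, n_actions, lr=0.05):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr

    def q(self, s, a):
        return sum(wi * si for wi, si in zip(self.w[a], s))

    def q_all(self, s):
        return [self.q(s, a) for a in range(len(self.w))]

    def sgd_step(self, s, a, target):
        # One gradient step on (target - Q(s,a; theta))^2  (Algorithm 1, line 14)
        err = target - self.q(s, a)
        self.w[a] = [wi + self.lr * err * si for wi, si in zip(self.w[a], s)]

    def copy_from(self, other):
        self.w = [row[:] for row in other.w]

def train(env_step, init_state, n_features, n_actions,
          episodes=10, steps=50, mu=0.9, M=20, batch=8, memory=500):
    """Training loop of Algorithm 1: env_step(s, a) -> (reward, next_state)."""
    qnet = LinearQ(n_features, n_actions)
    target = LinearQ(n_features, n_actions)
    target.copy_from(qnet)                   # Q' starts as a copy of Q (lines 3-4)
    D, eps, t = [], 1.0, 0                   # replay memory, epsilon, global step
    for _ in range(episodes):                # line 6
        s = init_state()                     # line 7
        for _ in range(steps):               # line 8
            # epsilon-greedy action selection (lines 9-10)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: qnet.q(s, x))
            r, s_next = env_step(s, a)       # line 11
            D.append((s, a, r, s_next))      # line 12
            D[:] = D[-memory:]               # bound the replay memory size
            # mini-batch SGD with targets from the frozen Q' (lines 13-14)
            for sk, ak, rk, sk1 in random.sample(D, min(batch, len(D))):
                y = rk + mu * max(target.q_all(sk1))
                qnet.sgd_step(sk, ak, y)
            s, t, eps = s_next, t + 1, max(0.0, eps - 0.002)
            if t % M == 0:
                target.copy_from(qnet)       # reset Q' = Q every M steps (line 15)
    return qnet
```

On a toy environment where action 0 always yields reward 1 and action 1 yields 0, the trained network learns to rank action 0 above action 1, mirroring how the real system learns to prefer better controller mappings.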

C. ALGORITHM COMPLEXITY
The algorithm complexity of DQN consists of two parts: the offline training part and the online mapping part. In the offline training part, the algorithm complexity is O(T × S), proportional to the total number of training steps; we show in the next section that the algorithm converges within a certain number of training steps. In the online mapping part, the algorithm complexity is O(n × m), where n is the number of controllers and m is the number of switches.

A. SIMULATION SETTINGS
A simulation platform is established for verifying the performance of the proposed framework. The simulated network consists of 4 controllers, and the network topology settings are the same in all scenarios. Each controller connects to a number of switches. Initially, the switch number follows a Gaussian distribution with mean and variance equal to 15 and 12, respectively. The distance between a switch and a controller is uniformly distributed in [1, 15]. The arrival probability of a new switch in the system is assumed to be 0.8. The departure probability of each controller domain is randomly distributed in [0.1, 0.5], and the number of departing switches is randomly selected within the range [1, 4]. The capacity of each controller is randomly selected between [100, 300], considering the newly generated flows that a controller can support [29] and the controller performance degradation when the number of switches exceeds the threshold [31]. The failure probability of each controller is diverse and lies within the range listed in Table 3.
Since the Q-function in DQN needs to be trained by a deep learning approach, the selection of the Q-network model and training parameters is important for approximating the Q-function well. In our simulations, we select a DNN with two hidden layers as the Q-function model. During the training process, the learning rate is set to 0.001. The discount factor µ is set to 0.9. The Q-function updating step M is 100. The size of the memory pool D is 5000. The size of the mini-batch is 24. In the ε-greedy method, the initial value of ε is set to 1, since at the beginning of training the agent knows little about the network and tends to select actions randomly. The value of ε is then reduced by 0.001 after each training step until it reaches 0.
To verify the proposed framework, the proposed DQN approach is compared with four other approaches. 1) Greedy approach. In this approach, the system selects the mapping controller based on the instantaneous system performance using Eqn. (7). 2) Random approach. In this approach, the system randomly maps the switch to a controller. 3) Lowest Latency (LL) approach. In this approach, the system generates the mapping decision purely based on the distance between the switch and the controller. 4) Best Equilibrium (BE) approach. In this approach, the system generates the mapping decision purely based on the load of the controllers: the switch selects the controller leading to the minimum value of the maximum controller utilization max{N c n /W c n }. For all the approaches, the mapping results in the simulation are obtained by averaging over 10 time cycles, each of which contains a data set of 1000 mapping steps.
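The four baselines can be sketched as simple selection rules; the function names and arguments are illustrative, and the BE rule is interpreted (as an assumption) as choosing the controller whose selection minimizes the resulting maximum utilization:

```python
import random

def greedy_map(instant_rewards):
    """Greedy: pick the controller maximizing the instantaneous Eqn. (7) reward."""
    return max(range(len(instant_rewards)), key=lambda c: instant_rewards[c])

def random_map(n_controllers, rng=random):
    """Random: map the switch to a uniformly random controller."""
    return rng.randrange(n_controllers)

def ll_map(distances):
    """Lowest Latency: pick the nearest controller."""
    return min(range(len(distances)), key=lambda c: distances[c])

def be_map(loads, capacities):
    """Best Equilibrium: minimize the resulting max utilization max{N/W}."""
    def util_if(c):
        return max((loads[i] + (1 if i == c else 0)) / capacities[i]
                   for i in range(len(loads)))
    return min(range(len(loads)), key=util_if)
```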
Based on the above simulation platform, we perform extensive simulations and generate results in terms of three aspects: controller switch response latency, system equilibrium and system stability. The controller switch response latency is measured using the controller switch distance. The equilibrium performance is measured as the maximum controller utilization max{N c n /W c n }. The system stability is measured as the number of switches affected by failed controllers, i.e., the sum over c n of min(0, s c n ) × N c n . It can be observed that the average reward approaches its maximum after around 150 episodes, which again indicates that the training process converges within an acceptable number of training episodes. In addition, it can be observed that the value of the average reward in case 2 is much smaller than that of case 1 due to the increased failure probability.
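The two system-level metrics just defined can be computed directly from the per-controller quantities:

```python
def max_utilization(attached, capacities):
    """System equilibrium metric: max_n N_{c_n} / W_{c_n}."""
    return max(n / w for n, w in zip(attached, capacities))

def affected_switches(statuses, attached):
    """System stability metric: sum_n min(0, s_{c_n}) * N_{c_n}.
    More negative means more switches disrupted by controller failures."""
    return sum(min(0, s) * n for s, n in zip(statuses, attached))
```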

C. CONTROLLER SWITCH RESPONSE LATENCY
The cumulative controller switch response delay versus time is presented in Fig. 6. For both cases, the proposed DQN approach (red solid line) performs better than the greedy approach (green solid line), the random approach (blue solid line) and the BE approach (purple solid line). The LL approach (black solid line) shows the smallest latency, since its switch controller mapping is decided purely by the controller switch latency. The DQN approach shows slightly better latency results than the greedy approach. The BE and random approaches perform the worst and are very close, since neither considers controller switch latency.

D. SYSTEM EQUILIBRIUM
The cumulative system equilibrium performance versus time is depicted in Fig. 7. The BE approach yields the lowest system equilibrium value, since its mapping decisions consider only system equilibrium. The greedy approach has a slightly larger system equilibrium value. The random and LL approaches are very close and perform the worst, since no load balancing factor is considered during their mapping processes. The results of the DQN approach lie in the middle, since DQN sacrifices some system equilibrium to achieve system stability; for example, DQN may map switches to controllers with lower failure probability. Therefore, the system equilibrium performance of DQN is not as good as that of the BE and greedy approaches.

E. SYSTEM STABILITY
Simulation results for the cumulative system stability versus time are plotted in Fig. 8. It can be observed that the system stability of the DQN approach is significantly better than that of all other approaches. For example, during the running time, the gap in affected switches between the DQN approach and the BE approach is around 6 in case 1 and around 20 in case 2. The reason is that, by generating mapping decisions based on long-term system performance and learning the system status from the environment, the system can avoid mapping switches to controllers that are likely to fail. This demonstrates the effectiveness of the DQN approach in protecting overall system stability. In addition, it can be observed that the system stability of case 2 is much lower than that of case 1 due to the increased failure probability in case 2.
To summarize, a compared algorithm may outperform the proposed DQN on some specific metric, such as equilibrium performance or response delay. However, the main benefit of the proposed DQN approach is that it balances response delay, controller utilization and system stability, improving the cumulative system performance over the long term. The above simulation results show that the proposed DQN approach achieves the best system stability while maintaining acceptable latency and equilibrium performance compared with the other approaches.

VI. CONCLUSION AND DISCUSSIONS
In this paper, a DRL based approach has been proposed for solving the switch controller mapping problem in order to optimize the long-term network performance. An MDP is modeled and DQN is proposed, which can achieve the optimal policy for long-term system performance in terms of network latency, load balancing and system stability. Through extensive simulations, the proposed DQN method is shown to achieve the best system stability performance while maintaining acceptable latency and system equilibrium performance, compared with the optimization method that considers only the current instantaneous system performance and the optimization methods that consider network latency or load balancing separately.
With the trend toward separating the control and forwarding planes, SDN provides a solid and convenient platform for applying DQN to network optimization. DQN based network optimization fits naturally into the SDN architecture, where the SDN controllers and the mapping decision module collect the basic system status and act as an agent for generating the control policy. The control policy can then be distributed through the south/northbound interfaces to the forwarding engines, which enables the implementation of the proposed DQN with relatively low complexity. Additionally, the ability of DQN to deal with large and uncertain system states sheds light on solving network optimization problems in SDN systems. The proposed DQN approach confirms the effectiveness of the algorithm in terms of long-term average system performance and system stability over traditional optimization approaches. With the development of deep RL theory, other DRL algorithms may provide improved performance over DQN. As future work, we will consider applying these DRL algorithms to the mapping problem in SDN.

ACKNOWLEDGMENT
This article was presented at the 2020 ACM International Conference on Digital Signal Processing.