Deep Q-Learning for Routing Schemes in SDN-Based Data Center Networks

To adapt to the rapid development of cloud computing, big data, and other technologies, data center networks have been combined with software-defined networking (SDN) to make network management more convenient and flexible. Building on this advantage, routing strategies have been studied extensively. However, the strategies deployed in the controller rely mainly on manual design, so optimal solutions are difficult to obtain in a dynamic network environment. Strategies based on artificial intelligence (AI) are therefore being considered. This paper proposes a novel routing strategy based on deep Q-learning (DQL) that generates optimal routing paths autonomously for SDN-based data center networks. To satisfy the different demands of mice-flows and elephant-flows in data center networks, a separate deep Q network is trained for each flow type, targeting low latency and low packet loss rate for mice-flows, and high throughput and low packet loss rate for elephant-flows. Furthermore, considering the distribution of traffic and the limited resources of data center networks and SDN, we choose port rate and flow table utilization to describe the network state. Simulation results show that, compared with Equal-Cost Multipath (ECMP) routing and Selective Randomized Load Balancing (SRL)+FlowFit, the proposed routing scheme reduces both the average delay of mice-flows and the average packet loss rate, while increasing the average throughput of elephant-flows.


I. INTRODUCTION
With the rapid growth in cloud computing, big data, and other technologies, the scale of data center networks is expanding continuously [1]. Traditional networks cannot meet the requirements of existing data center networks because of the difficulties in network management and deployment. The emergence of SDN points to a way of solving this problem. SDN is a novel networking architecture that separates the control plane from the data plane of the forwarding device. Data center networks can benefit from the centralized control of SDN to make intelligent dynamic decisions. In data center networks, routing is an important research topic and has been studied for a long time. With the advantage of SDN, routing strategies can be deployed conveniently and flexibly based on the global view of the network. Going a step further, a recent work proposes self-learning control strategies for SDN networks: it combines SDN with deep reinforcement learning (DRL) and presents a simple example of solving a QoS routing problem with DQL. However, that paper focuses on the design of the architecture, and the specific design of the routing scheme is not explained.
In this paper, we propose a DQL-based routing strategy for SDN-based data center networks. We build two DQNs to make routing decisions intelligently: one for elephant-flows, to achieve low packet loss rate and high throughput, and the other for mice-flows, to achieve low packet loss rate and low latency. In this way, an approximately optimal routing strategy can be obtained. To summarize, we make the following contributions in this paper.
• An intelligent network architecture is built for routing in data centers. According to the traffic characteristics of data center networks, it dynamically generates optimal routing strategies for elephant-flows and mice-flows, respectively.
• A specific design of the DQL algorithm is presented, including the state space, action space and reward function. To better describe the state of the network, we combine the port rate and the flow table occupancy rate of the switches, which reflect the distribution of traffic in the network and the utilization of network resources, including link bandwidth and flow table resources.
• The effectiveness of the proposed routing algorithm is verified by simulations. The results show that the performance of mice-flows improves in terms of packet loss rate and delay, while elephant-flows improve in terms of packet loss rate and throughput.

II. RELATED WORKS
In data center networks, routing has long been studied as a crucial research direction. With the rise of SDN in recent years, many SDN-based routing strategies have been proposed, enabling fine-grained flow control. Load-balancing routing is widely researched in data centers to ensure the sustainability of the network: it tries to guarantee the transmission quality of current flows while reserving some capacity for possible subsequent flows. The works in [2]-[4] focus on balancing load for elephant-flows. [2] selects a path that can accommodate the flow, and [3] chooses the least crowded path. [4] splits elephant-flows and sends them over multiple paths according to dynamically computed ratios. The researchers in [5]-[7] design rerouting schemes to further balance the network load. They periodically determine whether the network load is balanced using a parameter such as the load balance degree; when the parameter exceeds a threshold, flow scheduling or flow splitting is triggered. These schemes can effectively reduce the packet loss rate and increase the throughput of elephant-flows. Note that the load balance mentioned above refers to link load balance. Another kind, flow table load balance, has been proposed recently in consideration of the limited flow table capacity in SDN. [8] introduces flow table load balance for mice-flows, which account for the majority of traffic in data centers, to prevent the packet loss caused by flow table overflow. In addition, considering the low-latency requirement of mice-flows, the following routing schemes have been put forward. [9] picks the path with the lowest delay for mice-flows, and [10] assigns dedicated low-latency paths to them. [11] decreases the delay by reducing the number of flow rules installed to transmit mice-flows.
However, all of these solutions rely on manual design and are not intelligent: when similar traffic patterns occur, the same paths are selected even if those routing strategies have previously resulted in poor network performance. In other words, they lack the ability to learn from previous experience [15]. AI, as a popular tool, offers a solution to this problem. The works in [15], [16] apply deep learning to avoid congestion in different network scenarios. A Convolutional Neural Network (CNN) is trained for each path combination to predict, from the input traffic pattern, whether a chosen path is congested. [12] realizes QoS-adaptive routing based on Q-learning (QL), a typical reinforcement learning algorithm, and puts forward a QoS-aware reward function to direct the learning of the optimal routing. Nevertheless, because of fine-grained flow control and the changing network environment, QL requires a huge storage space to maintain the Q-table. To overcome this defect, DQL, which combines deep learning with QL, has been proposed. In [14], DRL is applied to solve large-scale network control problems, and a simple example of using DQL for QoS routing is presented. That work emphasizes the proposed architecture, which introduces DRL into SDN, but does not describe the design of the routing scheme. DROM [17] and TIDE [18] are recent DRL mechanisms for routing optimization. However, their actions modify link weights, so the optimal routing path can only be obtained indirectly through a shortest path algorithm.

III. SYSTEM ARCHITECTURE
To combine SDN-based data center networks with DQL, we add an AI agent, contained in an AI plane, to the traditional SDN architecture to realize intelligent routing decisions. The proposed system architecture is presented in Figure 1. It contains three planes: the data plane, the control plane and the AI plane. The functions of these three planes are described in detail below.

A. DATA PLANE
The data plane is mainly composed of switches, which focus on packet forwarding and all support the OpenFlow protocol. In particular, Fat-Tree [20], a typical data center network topology, is adopted as the data plane in this paper, as shown in Figure 2. There are multiple paths between the source and destination nodes in the Fat-Tree topology, providing data centers with high bandwidth and good fault tolerance.

B. CONTROL PLANE
The control plane interacts with the data plane through the southbound interface protocol (OpenFlow). The controller obtains the network topology using the Link Layer Discovery Protocol, and it periodically sends state query messages to each switch to acquire switch state information, such as flow table status and port status. When a new flow arrives at the network, the controller first calculates the flow rate and estimates the flow type from the flow statistics. If the flow rate exceeds the threshold (according to previous studies, this value is often set to 5% of the link capacity), we regard the flow as an elephant-flow; otherwise, we regard it as a mice-flow. In addition, network performance measurements (e.g., packet loss rate, delay, throughput) are also collected. All of this information is used to formulate routing strategies. The controller converts the strategies into flow rules and installs them on the corresponding switches.
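The threshold rule for flow classification can be illustrated with a minimal sketch. The function name `classify_flow` and its parameters are hypothetical, and the default ratio of 0.05 is only the commonly cited value mentioned in the text, not a setting confirmed for the controller implementation.

```python
# Hypothetical helper (not from the paper's code): classify a flow by
# comparing its measured rate with a threshold that is expressed as a
# fraction of the link capacity.

def classify_flow(flow_rate, link_capacity, threshold_ratio=0.05):
    """Return 'elephant' if the rate exceeds the threshold, else 'mice'."""
    if link_capacity <= 0:
        raise ValueError("link_capacity must be positive")
    threshold = threshold_ratio * link_capacity
    return "elephant" if flow_rate > threshold else "mice"


# Example: with a link capacity of 100 and a 5% threshold, a flow of
# rate 8 is treated as an elephant-flow and a flow of rate 2 as a mice-flow.
kind_a = classify_flow(8.0, 100.0)
kind_b = classify_flow(2.0, 100.0)
```

The rate and the capacity only need to share the same unit, since the rule depends on their ratio.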

C. AI PLANE
The AI plane is the core part of the system. The AI agent learns the optimal routing scheme autonomously from previous experience, and intelligent routing policies are generated dynamically to improve the performance of mice-flows and elephant-flows. The plane obtains the flow type, network state information and network performance evaluation from the control plane through the northbound interface. For each flow type, the AI agent learns a separate optimal routing strategy from the network state information and the network performance evaluation. In detail, two DQNs are designed, one for mice-flows and one for elephant-flows. Our goal is to achieve low latency and low packet loss for mice-flows, and low packet loss and high throughput for elephant-flows. Through QL, the agent obtains the optimal routing strategy by interacting with the environment. Furthermore, a deep neural network is used to approximate the huge policy table, so that the optimal routing path can be obtained quickly from the input network state and flow type.

IV. PROBLEM FORMULATION
We model the data center network as a directed graph $G = \{V, E\}$, where $V$ represents the set of all switch nodes and $E$ denotes the set of links between switches. The flow table capacity of each switch is $R_m$, and the capacity of each link is $C_m$. During the period from $t_1$ to $t_n$, $n \to +\infty$, we assume that the sets of all mice-flows and all elephant-flows are $F_{mice} = \{f_w \mid w \in [1, p]\}$ and $F_{elephant} = \{f_v \mid v \in [1, q]\}$, where $p$ and $q$ are the numbers of mice-flows and elephant-flows, $p \to +\infty$, $q \to +\infty$. Furthermore, each flow existing in the network at time $t_i$ is described by its source switch, destination switch and bandwidth requirement, and $m$ is the total number of flows in the network at that moment.
VOLUME 8, 2020
For each flow $x$, we also record its average delay, average packet loss rate and average throughput. From these per-flow values, we calculate the following indicators (Equations (1)-(4)): $D_{mice}$ and $P_{mice}$ are the average delay and average packet loss rate of mice-flows during $t_1$ to $t_n$, while $P_{elephant}$ and $T_{elephant}$ are the average packet loss rate and average throughput of elephant-flows during $t_1$ to $t_n$.
According to the traffic characteristics of data center networks, we set the targets of the routing problem as follows:
$$\min D_{mice} \tag{5}$$
$$\min P_{mice} \tag{6}$$
$$\min P_{elephant} \tag{7}$$
$$\max T_{elephant} \tag{8}$$
subject to constraints (9)-(11). Constraint (9) is the classical flow conservation constraint, which ensures that the ingress flows equal the egress flows for each switch. Constraints (10) and (11) represent the capacity limits of links and flow tables, respectively. Furthermore, the link capacity $C_m$ can also be expressed as the minimum of the allowable port rates at both ends of the link. In this paper, the rate of every port is limited to the same value, so the link capacity equals the port capacity.
The routing process is in effect a network state transition process. To better express the network state at a given moment, we represent it by the instantaneous state at that moment together with the states of the most recent n-1 moments. Based on this setting, we approximate the routing process as a Markov Decision Process (MDP), whose parameters are designed as follows.

A. STATE SPACE
Two state objects are considered here: flow table state and port state. We refer to these two states collectively as the network state. As shown in Figure 3, we treat the state of the network as an image and treat different network features as different pixel channels. Here, the channels represent the flow table utilization rate and the port rate of each switch at the current and previous moments. Therefore, the state space consists of the values $FT_{sw_i,t_j}$ and $PS_{p_k,sw_i,t_j}$ for all $i \in [1, n]$, $j \in [1, m]$ and $k \in [1, z]$, where n, m and z are respectively the number of switches, the number of moments and the number of ports of a single switch. $FT_{sw_i,t_j}$ represents the flow table utilization rate of switch i at moment $t_j$ and lies in the range 0 to 1. $PS_{p_k,sw_i,t_j}$ represents the port rate of port k of switch i at moment $t_j$ and does not exceed $ps_{max}$, a fixed value equal to the maximum amount of traffic that a port can pass per second.
In particular, $FT_{sw_i,t_j}$ takes values in a finite set with v elements (v is the flow table capacity), while $PS_{p_k,sw_i,t_j}$ is continuous. To reduce computational complexity, we divide the port rate into several levels so as to discretize it. The number of levels should be neither too large nor too small: it should keep states distinguishable while reducing computational complexity as far as possible. We select ten levels in our scheme, so $PS_{p_k,sw_i,t_j}$ becomes a finite set with ten elements.
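As an illustration of the state description above, the following sketch quantizes port rates into ten levels and assembles a nested-list state over the stored moments. The function names and the data layout are assumptions made for this example; the actual channel arrangement fed to the CNN is not specified in the text.

```python
# Illustrative sketch; names and data layout are assumptions.
NUM_LEVELS = 10  # number of port-rate levels selected in the text


def quantize_port_rate(rate, ps_max, levels=NUM_LEVELS):
    """Map a continuous port rate in [0, ps_max] to a level 0..levels-1."""
    clipped = min(max(rate, 0.0), ps_max)
    level = int(clipped / ps_max * levels)
    return min(level, levels - 1)  # a rate equal to ps_max falls in the top level


def build_state(ft_util, port_rates, ps_max):
    """Assemble the network state over the stored moments.

    ft_util[j][i]       flow table utilization of switch i at moment j (0..1)
    port_rates[j][i][k] rate of port k of switch i at moment j
    Returns one frame per moment; each frame holds, per switch, the flow
    table utilization followed by the quantized rate of every port.
    """
    state = []
    for ft_t, ps_t in zip(ft_util, port_rates):
        frame = []
        for ft_sw, ports in zip(ft_t, ps_t):
            frame.append([ft_sw] + [quantize_port_rate(r, ps_max) for r in ports])
        state.append(frame)
    return state


# Example: one moment, two switches with two ports each, ps_max = 100.
example_state = build_state([[0.5, 0.25]], [[[15.0, 99.0], [0.0, 50.0]]], 100.0)
```

With equal-width levels, two port rates are distinguished only when they differ by roughly one tenth of the maximum rate, which is the trade-off between state differentiation and complexity described above.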

B. ACTION SPACE
The action space can be described as follows:
$$Action = [a_{p_1}, a_{p_2}, \ldots, a_{p_N}]$$
where $p_1$ to $p_N$ are all paths in the network and $a_{p_k} \in \{0, 1\}$.
If $a_{p_k} = 1$, the current flow is assigned to path k; if $a_{p_k} = 0$, it is not. In particular, we consider flows to be indivisible, so each flow can be assigned to only one path. The action therefore satisfies $Action \cdot One = 1$, where One is an N-dimensional column vector of all ones.
In particular, for a given flow (ip_src, ip_dst) in a given state, the executable actions are restricted to the alternative path set between the source and destination servers of the flow. Taking a Fat-Tree topology with parameter k as an example, the maximum number of actions is $(k/2)^2$.
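The action encoding and the bound on the number of actions can be sketched as follows. Both helper names are hypothetical; the sketch only restates the one-hot constraint and the $(k/2)^2$ bound given above.

```python
def max_paths_fat_tree(k):
    """Upper bound (k/2)^2 on the candidate paths between two servers."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    return (k // 2) ** 2


def one_hot_action(path_index, num_paths):
    """Action vector with a single 1 marking the path assigned to the flow."""
    if not 0 <= path_index < num_paths:
        raise IndexError("path_index out of range")
    return [1 if i == path_index else 0 for i in range(num_paths)]


# Example: a flow assigned to the third of four candidate paths.
action = one_hot_action(2, max_paths_fat_tree(4))
```

The sum of the entries of `action` is always 1, which corresponds to the indivisibility constraint on flows.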

C. REWARD FUNCTION
Considering the characteristics of elephant-flows and mice-flows, we formulate different reward functions for the two flow types. For elephant-flows, the goal is to minimize packet loss rate and maximize throughput; for mice-flows, the goal is to minimize packet loss rate and latency. The reward functions are therefore set up as follows. For elephant-flows, the reward combines PLR and TP with weights α and β, where PLR represents the average packet loss rate of elephant-flows in the network and TP is the normalized average throughput of elephant-flows (the average throughput divided by the maximum receiving rate at the receiving end). The normalization brings the two indicators into the same range (0-1) to facilitate a combined evaluation. α and β indicate the importance of the two indicators and satisfy α + β = 1. For mice-flows, the reward combines PLR2 and DL with weights λ and µ, where PLR2 represents the average packet loss rate of mice-flows and DL is the normalized average delay of mice-flows. Both indicators lie between 0 and 1, and λ + µ = 1.
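To make the role of the weights concrete, the sketch below uses one plausible weighted form of the two rewards. The specific expressions are an assumption made for illustration only (they reward low loss, high throughput and low delay, with all indicators in the range 0 to 1) and should not be read as the exact reward functions of the scheme.

```python
def _check_weights(w1, w2):
    """The two weights of a reward must sum to 1."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")


def elephant_reward(plr, tp, alpha=0.5, beta=0.5):
    """ASSUMED form: reward low packet loss and high normalized throughput."""
    _check_weights(alpha, beta)
    return alpha * (1.0 - plr) + beta * tp


def mice_reward(plr2, dl, lam=0.5, mu=0.5):
    """ASSUMED form: reward low packet loss and low normalized delay."""
    _check_weights(lam, mu)
    return lam * (1.0 - plr2) + mu * (1.0 - dl)


# Example with equal weights: 20% loss and 60% normalized throughput.
r_elephant = elephant_reward(0.2, 0.6)
```

Under this assumed form both rewards stay within the range 0 to 1, so the two DQNs are trained on signals of comparable scale.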

V. ALGORITHM DESIGN
RL is a tool for solving MDP problems. QL is a classical value-based RL algorithm. It sacrifices some current gain for long-term gain. Q stands for Q(s, a), the expected return of taking action a (a ∈ A) in a certain state s (s ∈ S). The main idea of the algorithm is to build a Q-table to store the Q values and then select an action that yields a large return according to the Q values. However, in our scenario the state space is too large for a Q-table to fit in finite memory. To address this problem, DQL is adopted here.
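For reference, the tabular form of QL described above can be sketched with the standard Q-learning update rule. The learning rate and discount factor values are placeholders chosen for this example, and the state and action labels are hypothetical.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, next_actions,
             learning_rate=0.1, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q(s, a)."""
    # Best value reachable from the next state (0 if no action is available).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q_table[(state, action)] += learning_rate * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Example: an empty Q-table, one transition with reward 1.
q_table = defaultdict(float)
first_value = q_update(q_table, "s0", "p1", 1.0, "s1", ["p1", "p2"])
```

The table needs one entry per state-action pair, which is why it becomes impractical once the state is built from per-switch, per-port measurements over several moments.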

A. DQL ALGORITHM
In this section, we introduce DQN in detail and show our improvement to DQN for the routing problem. DQL is an algorithm that combines a deep neural network with QL. A deep neural network has good generalization ability and can approximate almost any nonlinear function. Therefore, on top of the QL algorithm, a deep neural network is used to establish the mapping between states and actions, which speeds up the solution and addresses the curse of dimensionality caused by the large system state space.
QL updates the value function by moving the estimated Q value toward a target value. Here, $r + \gamma \max_{a'} Q(s', a'; \theta')$ is the target Q value calculated by QL, while $Q(s, a; \theta)$ is the estimated Q value. Our goal is to bring the estimated Q close to the target Q.
To obtain the two types of Q value, we adopt two independent neural networks with the same structure: the evaluated Q-network and the target Q-network. The former generates the estimated Q from the current state and changes its parameters in each episode to decrease the loss. The latter outputs the Q values of the next state, which are used to calculate the target Q. Its parameters are copied from the evaluated Q-network every fixed number of steps.
To provide training samples, DQN has a replay memory that stores historical experiences. Experiences are selected randomly from the replay memory to train the neural network. In this way, the time correlation between samples is broken and the stability of training is improved.
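A replay memory of this kind can be sketched in a few lines. The class name, the fixed capacity and the tuple layout (s, a, r, s') are choices made for this example rather than details taken from the implementation.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        """Draw transitions uniformly at random, without replacement."""
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


# Example: a memory of capacity 3 keeps only the three newest transitions.
memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.store(step, 0, 0.0, step + 1)
minibatch = memory.sample(2)
```

Uniform random sampling is what breaks the ordering of consecutive transitions; the bounded capacity keeps memory use constant during long training runs.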
We summarize the workflow of DQL in Figure 4. In the routing scenario, the information about the flow arriving at the next moment is unknown, including the flow type and the source and destination IP addresses of the flow, so the available paths for that flow are uncertain. We let $Q_1$ and $Q_2$ be the action value functions for mice-flows and elephant-flows, respectively. We define the alternative path set as $A\_set = \{a\_set_{i,j} \mid i \in [1, m],\ j \in [1, m],\ i \neq j\}$, where m is the number of edge switches and $a\_set_{i,j}$ represents the available paths between the i-th and j-th edge switches. A_set is made up of m(m − 1) sets $a\_set_{i,j}$, and the optional path set for a flow may be any $a\_set_{i,j}$ from A_set. Based on the above settings and considerations, we improve the target Q value to fit our scenario. The new target Q value combines the next-state values of $Q_1$ and $Q_2$, weighted by p and q, where the weight factors p and q are set to the proportions of mice-flows and elephant-flows in the network, respectively. In this way, we handle the fact that the information about the flow arriving at the next moment is unknown, and we can better assess the next state.
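The idea of a mixed target can be illustrated with the following sketch. It assumes one plausible form: because the next flow is unknown, the best Q value is averaged over every candidate path set, and the mice-flow and elephant-flow terms are weighted by p and q. This averaging, the function names and the discount value are assumptions for illustration, not the exact expression used in the scheme.

```python
def expected_best_value(q_values, path_sets):
    """Average, over all candidate path sets, of the best Q value in each set."""
    best_per_set = [max(q_values[a] for a in paths) for paths in path_sets]
    return sum(best_per_set) / len(best_per_set)


def mixed_target(reward, q1_next, q2_next, path_sets, p, q, gamma=0.9):
    """ASSUMED form of the target: r + gamma * (p * V1(s') + q * V2(s')).

    q1_next / q2_next: Q values of all paths in the next state, as given by
    the mice-flow and elephant-flow target networks respectively.
    path_sets: list of path-index lists, one per edge-switch pair.
    """
    v1 = expected_best_value(q1_next, path_sets)
    v2 = expected_best_value(q2_next, path_sets)
    return reward + gamma * (p * v1 + q * v2)


# Example: four paths grouped into two path sets, equal flow proportions.
example_target = mixed_target(1.0, [1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0],
                              [[0, 1], [2, 3]], p=0.5, q=0.5, gamma=1.0)
```

The point of the weighting is that the value of the next state no longer depends on knowing which type of flow, or which source and destination pair, will arrive next.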

B. ROUTING ALGORITHM BASED ON DQL
We train separate DQNs for elephant-flows and mice-flows.
With the advantage of QL, the agent learns the optimal routing path for each state by trial and error while guaranteeing long-term benefits, so the result is more accurate than a manually designed routing scheme. A CNN is used to fit the policy table (the Q-table), so that the agent can obtain the optimal path quickly, saving the memory space and lookup time of a policy table. When a new flow arrives at the network, the controller first determines the type of the flow and obtains all paths between the source and destination servers of the flow from the precomputed alternative path set A_set. The selected paths constitute the optional action set a_set for the flow. The current network state s is then fed into the CNN that serves as the evaluated Q-network, with action value function Q and parameter θ ($CNN_1$, $Q_1$, $\theta_1$ for mice-flows and $CNN_2$, $Q_2$, $\theta_2$ for elephant-flows), which yields the Q values of all paths in A_set for the current state. Next, either a random path or the path with the maximum Q value is selected from the flow's a_set as action a, and flow rules are installed on the switches along that path. Finally, the reward r is calculated, the next network state $s_{t+1}$ is observed, and $(s, a, r, s_{t+1})$ is stored for CNN training. During training, the target Q-network CNN' with action value function Q' and parameter θ' is used together with the CNN ($CNN'_1$, $Q'_1$, $\theta'_1$ for mice-flows and $CNN'_2$, $Q'_2$, $\theta'_2$ for elephant-flows). Once the CNNs are well trained, optimal routing strategies can be generated for flows from the flow type and the current network state.
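The path selection step, restricted to the flow's own candidate set, can be sketched as an epsilon-greedy choice. The exploration probability `epsilon` and the function name are assumptions for this example; the text only states that a path is chosen either randomly or by maximum Q value.

```python
import random


def select_path(q_values, a_set, epsilon, rng=random):
    """Pick a path index from a_set: random with probability epsilon,
    otherwise the candidate with the largest Q value.

    q_values: Q values for all paths in the network (output of the network).
    a_set:    indices of the paths available to the current flow.
    """
    if not a_set:
        raise ValueError("a_set must contain at least one path")
    if rng.random() < epsilon:
        return rng.choice(list(a_set))
    return max(a_set, key=lambda a: q_values[a])


# Example: path 3 has the largest Q value overall but is not available to
# this flow, so the greedy choice falls on path 2.
chosen = select_path([0.1, 0.9, 0.5, 2.0], [0, 2], epsilon=0.0)
```

Masking the choice to `a_set` matters because the network outputs a value for every path, including paths that do not connect the flow's source and destination.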
We present the learning phase and the routing phase of the DQL-based routing algorithm in Algorithm 1 and Algorithm 2. Algorithm 1 is executed at the initial stage of network operation. It constantly updates routing policies by learning from historical experience, and it has high computational complexity because of the large state space and action space. When the learning phase is complete, the trained CNNs are used in Algorithm 2. Because the routing strategies have been obtained in advance through Algorithm 1, they can be generated directly from the CNNs without extra computation, and intelligent routing is thus realized. Since elephant-flows and mice-flows have similar learning processes, Algorithm 1 is a general algorithm for both types of flows.

Algorithm 1 (learning phase)
...
4: ... Otherwise input $s_t$ into CNN to obtain $Q(s_t, a; \theta)$, select $a_t = \arg\max_a Q(s_t, a; \theta)$
5: Calculate the reward $r_t$ and monitor the next network state $s_{t+1}$
6: Store $(s_t, a_t, r_t, s_{t+1})$ in replay memory D
7: Sample a random minibatch of m transitions from D
8: Minimize the loss function and update the parameter θ
9: Set $\theta'_1$ ($\theta'_2$) $= \theta$ every L steps
10: end while

Algorithm 2 (routing phase)
...
2: Determine the type of flow
3: Obtain the alternative path set of the flow (a_set) from A_set
4: if mice-flow then
5: Input $s_t$ into $CNN_1$ to obtain $Q(s_t, a; \theta)$, select $a_t = \arg\max_a Q(s_t, a; \theta)$
6: Install the flow rules according to $a_t$
7: end if
8: if elephant-flow then
9: Similar to the process above, with $CNN_1$ replaced by $CNN_2$
10: end if
11: end while

VI. SIMULATION
A. EXPERIMENT SETUP
We establish the network topology using the Mininet [19] emulator, which provides virtual network elements for easily creating a network that supports the OpenFlow protocol. Ryu [20], an open-source SDN controller, is chosen to operate the network, and Iperf is used to generate the traffic. In addition, we select Fat-Tree, a typical data center network topology, as our topology. For the convenience of simulation, we use a Fat-Tree [21] topology with a parameter of 4 as an example, which contains 20 switches and 16 servers. The link capacity and flow table capacity are set to 100 and 20, respectively. To emulate the traffic in data center networks, we set 20 percent of the flows to be mice-flows and the others to be elephant-flows. The threshold separating the two types of flows is 0.5 percent of the link bandwidth. The duration is 5 s for mice-flows and 30 s for elephant-flows. The parameters of the reward functions are set as follows: α = β = λ = µ = 0.5.
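The size of the k = 4 Fat-Tree can be checked with the standard counting formulas for a k-ary Fat-Tree. The function name and the returned dictionary keys are hypothetical; the last entry counts the ordered pairs of edge switches, which corresponds to the m(m − 1) path sets in A_set.

```python
def fat_tree_size(k):
    """Element counts of a standard k-ary Fat-Tree (k even)."""
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    core = (k // 2) ** 2          # core switches
    aggregation = k * (k // 2)    # k pods with k/2 aggregation switches each
    edge = k * (k // 2)           # k pods with k/2 edge switches each
    return {
        "core": core,
        "aggregation": aggregation,
        "edge": edge,
        "switches": core + aggregation + edge,
        "servers": k ** 3 // 4,
        "edge_pairs": edge * (edge - 1),
    }


# Example: the topology with parameter 4 used in the simulation.
topology = fat_tree_size(4)
```

For k = 4 this gives 4 core, 8 aggregation and 8 edge switches, i.e. 20 switches and 16 servers, in agreement with the setup above.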

B. PERFORMANCE AND RESULTS
To demonstrate the effectiveness of the proposed scheme, experiments were performed under different network loads (0.1, 0.5, 0.9). Taking a load of 0.9 as an example, Figure 5 shows the training process of the AI agent. From Figures 5(a) and 5(b), we find that the average delay of mice-flows and the average packet loss rate decrease as the number of training steps increases, while the average throughput of elephant-flows shows an upward trend over the same interval in Figure 5(c). All indicators level off after a certain number of steps. We record the convergence steps of each indicator under each network load in Table 1. In particular, when the load increases, the indicators stabilize in a shorter time. Note that the average packet loss rate measured in this paper covers all traffic in the network, including mice-flows and elephant-flows.
We compare our scheme with two methods. One is the classic data center network routing algorithm ECMP [22], which allocates flows by polling without considering the network status. The other is SRL+FlowFit [23]. As a routing initialization algorithm, SRL randomly selects two equal-cost shortest paths and takes the one with the lower load as the initial path. FlowFit then periodically monitors the state of the network and reassigns flows to optimal links. As shown in Table 2 to Table 4, under network loads of 0.1 to 0.9 the proposed scheme reduces the delay of mice-flows by an average of 55.08% and 28.16% compared with ECMP and SRL+FlowFit, respectively. The average packet loss rate is 33.17% and 25.5% lower than that of ECMP and SRL+FlowFit, respectively. Meanwhile, the average throughput of elephant-flows is 35.8% and 22.68% higher than that of the two methods. For clarity, we explain how these results are calculated. Taking Table 2 as an example, we calculate the delay reduction of mice-flows achieved by the proposed scheme relative to each other scheme under each load, and then take the average.
In addition, we measure the path computation time of the three methods on our experimental platform: 2.27e-4 s, 4.14e-4 s and 2.53e-3 s for ECMP, SRL+FlowFit and our scheme, respectively. The computation time of our scheme is higher than that of the others because of the dot product operations in the neural networks. We can further reduce this time by improving the neural network structure in future work.
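The averaging procedure described for Table 2 can be sketched as follows, assuming that the per-load reduction is a relative reduction expressed in percent. The function name is hypothetical, and the numbers in the example are placeholders chosen for easy arithmetic; they are not values from the tables.

```python
def mean_relative_reduction(baseline, proposed):
    """Average, over loads, of the percentage reduction relative to the baseline.

    baseline[i] and proposed[i] are the metric values (e.g. mice-flow delay)
    of the compared scheme and the proposed scheme under the i-th load.
    """
    if len(baseline) != len(proposed) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    reductions = [(b - p) / b * 100.0 for b, p in zip(baseline, proposed)]
    return sum(reductions) / len(reductions)


# Placeholder values only: reductions of 50% and 25% average to 37.5%.
example_gain = mean_relative_reduction([10.0, 20.0], [5.0, 15.0])
```

Averaging the per-load percentages gives each load the same weight, regardless of the absolute delay measured under that load.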
Based on the above results, we can infer that the proposed scheme is able to learn the routing strategy through training, and that the trained networks provide an optimized routing strategy that improves network performance. With the help of the two DQNs, low delay and low packet loss are achieved for mice-flows, while high throughput and low packet loss are achieved for elephant-flows. Although the computation is more complex, the improvement in network performance is clear.

VII. CONCLUSION
In this paper, we focus on solving routing problems in SDN-based data center networks. DQL is employed to achieve optimal routing. In the learning phase, we constantly adjust our routing strategies through trial and error and train CNNs to generate the optimal paths, which takes considerable time and computing resources. In the subsequent routing phase, we obtain the optimal routing strategies from the trained CNNs accurately and quickly without extra calculation. For the two types of flows in data center networks, elephant-flows and mice-flows, two DQNs are built to train and generate the corresponding routing strategies. Meanwhile, both the flow table utilization and the port rate are taken into account to describe the network state. We have verified the effectiveness of the proposed mechanism in a simulated data center network. Simulation results show that the proposed routing scheme can not only provide optimized routing strategies intelligently, but also improve network performance. In the future, we will improve the neural network structure to reduce the path computation time. Furthermore, the routing problem will be studied in a more complex scenario, where multiple flows arrive at the network in the same interval.