Traffic Measurement Optimization Based on Reinforcement Learning in Large-Scale ITS-Oriented Backbone Networks

End-to-end network traffic information is the basis of network management for large-scale intelligent transportation system (ITS)-oriented backbone networks. To obtain exact network traffic data, a prevalent approach is to deploy NetFlow or sFlow on all routers of the network. However, this approach not only increases operational expenditures but also adds to the network load. Motivated by this issue, we propose an optimized traffic measurement method based on reinforcement learning, which collects most of the network traffic data by activating NetFlow on only a subset of the router interfaces in a network. We propose an approach to compute the reward, and a modified Q-learning approach is then used to handle the interface-selection problem. The method is evaluated on real data from the Abilene and GÉANT backbone networks. Simulation results show that the proposed method distinctly improves the efficiency of traffic measurement.

it also consumes many resources of routers (e.g., CPU and memory). The main reasons can be summarized as follows:
• Switches and routers used to implement direct measurement need to be updated and maintained in a timely manner, which increases the cost of network operations.
• Constructing the traffic matrix (TM) requires frequent collection of network traffic measurements. Storing this information occupies a large amount of the storage capacity of network nodes.
• The collected network traffic data needs to be transmitted to the network manager for further processing and analysis, which adds communication overhead to the network.
To reduce the high resource consumption of direct measurement, we focus on the problem of traffic measurement optimization in large-scale ITS-oriented backbone networks and propose a method based on reinforcement learning (RL). The details of the traffic measurement optimization are presented in Fig. 1. A novel method based on Q-learning to realize the network measurement optimization is proposed in this paper. Since the routing matrix of an IP backbone network is available via the routing table, we compute the immediate reward (denoted by the matrix R) and the cumulative reward (denoted by the matrix Q) according to the routing matrix. We then obtain a subset of interfaces on which NetFlow has to be activated. By enabling NetFlow on these interfaces, we can measure a large proportion of end-to-end network traffic via a small number of routers. The main contributions of this paper can be summarized as follows:
• Network traffic measurement optimization is an NP-hard problem. Hence, we use the Q-learning algorithm to search for the optimal solution of the traffic measurement optimization model. In detail, an immediate reward is first defined to describe the objective of the model. Then, the solution with the highest cumulative reward is calculated by the Q-learning algorithm.
• To decrease the number of routers activating NetFlow, the correlation of each interface is considered. Meanwhile, we include a weight in the immediate reward so that the number of routers can be decreased markedly.
• An algorithm to select the routers via the cumulative reward is devised. Based on the cumulative reward, this algorithm obtains a specified proportion of the network traffic data.
The remainder of this paper is organized as follows. Section II presents the related work on the traffic measurement optimization problem. Section III introduces the system model of the traffic measurement optimization problem. In Section IV, the traffic measurement optimization method based on RL is proposed. In Section V, we evaluate the proposed method on real network traffic data sets. Section VI summarizes the main work of this paper.

II. RELATED WORK
To reduce the resource consumption of direct measurement, many methods have been proposed to optimize its deployment [11], [14]-[21]. Ghode et al. focused on the problem of traffic measurement optimization in mobile ad-hoc networks and proposed an improved ZRP protocol. They also added energy constraints to the optimization model to make it work efficiently. The authors in [17] proposed a dynamic cooperative monitoring interface-selection mechanism based on social network analysis, which can effectively reduce the network cost of direct measurement. The authors in [18] studied the problem of selecting a set of monitoring interfaces and proposed a Cross-Layer Security Monitoring selection algorithm based on Traffic Prediction (CLSM-TP). The authors in [19] proposed a mechanism that lets a Software Defined Network (SDN) distribute the monitoring load over multiple switches by monitoring traffic in a distributed manner. They distributed the monitoring entries over the switches, assigning the monitoring tasks and eliminating duplicated monitoring entries. This substantially reduces the monitoring overhead on the switches, the controller, and the control channels. Besides, Shin et al. [20] proposed a distributed on-line algorithm for optimal sniffer channel allocation for passive monitoring in multi-channel wireless networks.
By exploiting the advantages of SDN, some novel methods have been proposed to monitor the TM directly. The authors in [22] proposed a flow monitoring algorithm called lonely flow first (LFF). The LFF method first polls the switches carrying traffic flows that pass through only a single switch. The authors also proposed a weight function to decide the polling order, so that the polling mechanism can select the lower-cost option. Besides, Tian Yang [23] presented a new framework for solving the problem of network traffic measurement in SDN-based IP networks. Based on the low-rank and time-varying characteristics of traffic flows, the authors in [24] proposed a novel mechanism for on-line network traffic measurement that utilizes the flow measurement ability of SDN. In [25], an SDN-based fine-grained measurement method for vehicular communication networks was proposed. The method contains two phases. First, a monitoring method measures the statistics of links and flows and obtains coarse-grained measurements from the measured links and flows. Then, an interpolation algorithm is applied to the coarse-grained measurements to obtain the fine-grained network traffic measurements. The work in [26] presented an on-line measurement mechanism for SDN-based data center networks. A comprehensive monitoring system was designed to poll the switches and collect real-time statistics with higher accuracy and lower overhead.
VOLUME 8, 2020
Although many methods have been proposed to reduce the consumption of direct measurement, it remains difficult to obtain a good solution because these methods lack scalability. Thus, we propose a traffic measurement optimization method based on RL, which obtains a large proportion of the traffic data while activating NetFlow on fewer routers.

III. PROBLEM DEFINITION
When network operators make network management decisions, they need to know the current state of the network, such as delay, packet loss rate, bandwidth, and network traffic. Network measurement techniques provide feasible solutions and technical support for network operators to obtain reliable network states. Since network tomography was first applied to TM estimation, many TM estimation methods based on it have emerged. Network tomography uses easily measured network data to estimate the TM. It is also widely used in network delay estimation and other fields. In the field of TM estimation, the network tomography model describes the linear relationship among the link load Y, the routing matrix Θ, and the TM X, which is shown as follows:
$$Y = \Theta X$$
The routing matrix (shown in Fig. 2) describes the transmission path of each OD flow in the network, so the linear relationship between the link load and the TM can be described by the routing matrix. If each element of the routing matrix is represented as Θ(l, o, d), then Θ(l, o, d) = 1 indicates that the flow from node o to node d passes through link l; otherwise it is 0. In a large-scale IP backbone network, the link load is available network data, which can be measured by way of the Simple Network Management Protocol (SNMP) [11], [27]. The routing matrix is built from the routing table in each router. Therefore, for large-scale IP backbone networks, the routing matrix is also available data. Although the network tomography method has the above conveniences, a key problem restricts its application. In detail, for a large-scale IP backbone network with N routers, the number of OD flows is $N^2$ ($N^2 \gg L$), where L is the total number of links in the network.
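To make the scale mismatch concrete, the following minimal sketch builds the routing matrix for a hypothetical three-router line topology and checks that the link loads alone cannot determine the TM. The topology, the routes, the OD volumes, and the helper name `routing_matrix` are illustrative assumptions, not data from the paper. Self-pairs (o = d) are omitted for brevity, so the toy network has N(N-1) rather than N² OD flows.

```python
import numpy as np

# Hypothetical line topology 0 - 1 - 2 with one directed link per direction:
# link 0: 0->1, link 1: 1->0, link 2: 1->2, link 3: 2->1.
links = [(0, 1), (1, 0), (1, 2), (2, 1)]

# Shortest-path route of every OD pair, given as a list of link indices.
routes = {
    (0, 1): [0],
    (0, 2): [0, 2],
    (1, 0): [1],
    (1, 2): [2],
    (2, 0): [3, 1],
    (2, 1): [3],
}
od_pairs = list(routes)


def routing_matrix(n_links, od_pairs, routes):
    """Return the binary matrix whose entry (l, f) is 1 if OD flow f uses link l."""
    theta = np.zeros((n_links, len(od_pairs)), dtype=int)
    for col, od in enumerate(od_pairs):
        for link in routes[od]:
            theta[link, col] = 1
    return theta


theta = routing_matrix(len(links), od_pairs, routes)

# Made-up OD volumes (the unknown X) and the link loads they induce (Y = theta X).
x = np.array([5.0, 2.0, 3.0, 4.0, 1.0, 6.0])
y = theta @ x

# Fewer independent equations than unknowns: the inverse problem is underdetermined.
rank = np.linalg.matrix_rank(theta)
print(f"links={theta.shape[0]}, OD flows={theta.shape[1]}, rank={rank}")
```

Even in this tiny case there are six unknown OD volumes but only four link-load equations, so infinitely many traffic matrices reproduce the same `y`.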
Therefore, the network traffic estimation method based on the network tomography model is an inverse problem with highly ill-posed characteristics (i.e., it is underdetermined).
NetFlow is a solution that can measure all OD flows in the network. The problem of traffic measurement optimization is to select a subset of routers on which to activate NetFlow so that we obtain as much network traffic information as possible. We assume that router n has $K(n)$ interfaces, and the variable $r_{n,k}$ takes 1 if NetFlow is enabled on the k-th interface of the n-th router; otherwise it is 0. Furthermore, the volume of OD flows that can be measured by the k-th interface of the n-th router is $S_{n,k}$. The problem of traffic measurement optimization is then defined in terms of these quantities, where S is all the traffic data in the network, and α is the ratio of traffic that needs to be measured. Generally, some network management functions can be carried out using incomplete network traffic data. In other words, these functions tolerate some error in the network traffic data, which gives rise to the ratio α. This model limits the number of interfaces activating NetFlow. To also keep the number of routers low, the optimization model is further extended with the variable $d_n$, where $d_n = 1$ if NetFlow is enabled on the n-th router; otherwise it is 0.
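For concreteness, one formulation consistent with the definitions above is sketched here. It is an illustrative assumption rather than the authors' exact statement: the additive coverage constraint ignores OD flows that are seen by more than one interface, and the linking constraint $r_{n,k} \le d_n$ is introduced only to tie the interface variables to the router variables.

```latex
% Interface-level model: activate as few interfaces as possible
\min_{r} \; \sum_{n=1}^{N} \sum_{k=1}^{K(n)} r_{n,k}
\quad \text{s.t.} \quad
\sum_{n=1}^{N} \sum_{k=1}^{K(n)} r_{n,k} \, S_{n,k} \ge \alpha S,
\qquad r_{n,k} \in \{0, 1\}

% Router-level model: activate NetFlow on as few routers as possible
\min_{r,\,d} \; \sum_{n=1}^{N} d_n
\quad \text{s.t.} \quad
\sum_{n=1}^{N} \sum_{k=1}^{K(n)} r_{n,k} \, S_{n,k} \ge \alpha S,
\qquad r_{n,k} \le d_n,
\qquad r_{n,k},\, d_n \in \{0, 1\}
```

Both are 0-1 programs. In the second model, every interface that is switched on forces its router variable $d_n$ to 1, so minimizing $\sum_n d_n$ favors concentrating the active interfaces on few routers.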

IV. REINFORCEMENT LEARNING AND OUR METHODOLOGY
A. REINFORCEMENT LEARNING
RL [28], [29] is one of the most prevalent machine learning methods. An RL problem is viewed as a Markov decision process denoted by ⟨S, A, P, R⟩, where S and A are the sets of states and actions of an agent, respectively, and P is the matrix of transition probabilities from the current state to the next state after executing an action. In RL, the agent interacts with the environment to obtain an evaluation value (reward) for its action, and the policy is updated to obtain the maximum reward (see Fig. 3). In a given state s, the agent selects an action a to act on the environment, which causes the environment to change and return an immediate reward R. R(s, a) is the reward function, namely, the evaluation given by the environment after the agent selects the action a according to the policy. After the agent receives the reward R, it takes the next step. Since each action a affects not only the immediate reward but also the cumulative reward after multiple learning steps, the selection policy at each step is chosen to increase the probability of obtaining the maximum reward. After each attempt to reach the destination state, the value of the function Q, which indicates the cumulative reward, is updated. As Q improves, the agent finds the fastest route to the target state.
The main RL algorithms include the TD algorithm, the Q-learning algorithm, the Sarsa algorithm, and the R-learning algorithm. The goal of Q-learning, which is a model-free algorithm, is to learn a policy that makes the value function optimal [28]. The optimal policy for Q-learning makes the agent repeatedly explore the cumulative value function to maximize Q, which can be expressed as:
$$\pi^{*}(s, a) = \arg\max_{a} Q(s, a)$$
where $\pi^{*}(s, a)$ is the optimal policy given the state s and the action a. In other words, once the value function Q has been generated, given a state s, the agent only needs to select a series of actions in the direction in which the value function is largest to obtain the globally optimal solution. The Q-learning iteration is as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left[ R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
where η is the learning factor, $Q(s_t, a_t)$ is the cumulative reward for t steps, $R(s_t, a_t)$ is the one-step reward value, and γ is the discount factor.
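As a small worked example of the standard tabular Q-learning update with learning factor η and discount factor γ, the sketch below applies two updates to a two-state table. The function name `q_update`, the reward values, and the parameter settings (η = 0.5, γ = 0.8) are illustrative assumptions. As in the interface-selection setting, taking an action is assumed to lead deterministically to the given next state.

```python
import numpy as np


def q_update(Q, R, s, a, s_next, eta=0.5, gamma=0.8):
    """Apply one tabular Q-learning update in place and return the new Q[s, a]."""
    # Target: one-step reward plus the discounted best value of the next state.
    target = R[s, a] + gamma * np.max(Q[s_next])
    # Move the current estimate a fraction eta toward the target.
    Q[s, a] += eta * (target - Q[s, a])
    return Q[s, a]


R = np.array([[0.0, 1.0],
              [2.0, 0.0]])      # made-up immediate rewards
Q = np.zeros_like(R)

first = q_update(Q, R, s=0, a=1, s_next=1)    # 0 + 0.5 * (1 + 0.8 * 0   - 0)
second = q_update(Q, R, s=1, a=0, s_next=0)   # 0 + 0.5 * (2 + 0.8 * 0.5 - 0)
print(first, second)
```

Setting η = 1 reduces the update to Q(s, a) = R(s, a) + γ max Q(s', a'), i.e., the estimate is replaced by the target in a single step.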

B. TRAFFIC MEASUREMENT OPTIMIZATION BASED ON REINFORCEMENT LEARNING
The problem of traffic measurement optimization is to select a subset of all routers on which to activate NetFlow so as to collect most (or all) of the network traffic data. We assume that there are L links, N routers, and P interfaces in the network. We refer to each interface as a state i, and the process of selecting the next interface is viewed as an action j.
Each row of the routing matrix shown in Fig. 2 describes all the OD flows passing through the corresponding link. We express each row of the routing matrix by the vector $\theta(l)$, where $l = 1, 2, 3, \dots, L$. When an interface is selected, we want it to differ as much as possible from the previously selected interface. In other words, we prefer to select an interface that carries more unmeasured OD flows. Therefore, we define a reward function $R(l_1, l_2)$ on pairs of links, where $l_1, l_2 \in \{1, 2, \dots, L\}$, so the reward is an $L \times L$ matrix.
An IP backbone network contains two types of links: internal links and external links. When we select an interface in terms of the corresponding link, a crucial problem arises, since an internal link corresponds to two interfaces. The reward is therefore extended to a $P \times P$ matrix, with
$$P = 2 L_{\text{internal}} + L_{\text{external}}$$
where $L_{\text{internal}}$ is the number of internal links and $L_{\text{external}}$ is the number of external links. In this case, each element $R(i, j)$ of the extended reward is calculated from $R(l_1, l_2)$ for $i \in \text{link}(l_1)$ and $j \in \text{link}(l_2)$. Note that $i \in \text{link}(l_1)$ means that i is an interface of the link $l_1$. Besides, when an interface is determined, we also take the number of routers into account and keep it as small as possible. To this end, a weight is included in the reward. We define $P_n$ as the set of interfaces of the router n. The reward $R(i, j)$ then incorporates this weight, where λ is the factor that expresses the impact of the weight. The cumulative reward is calculated by iterating over the immediate reward. In this paper, we use the Q-learning algorithm to compute the cumulative reward. The specific steps of calculating the cumulative reward Q are as follows:

Algorithm 1 Calculating the Cumulative Reward Q
Step 0: Set parameter γ and let Q = 0.
Step 1: Calculate the matrix R according to Eq. 11.
Step 2: Select a starting state i.
Step 3: Check the target state: if it is reached, stop.
Step 4: Select one possible action j from all possible actions of the current state i, and use this action to reach the next state. For the next state, choose the largest Q value over all possible actions. Calculate $Q(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$.
Step 5: Set the next state as the current state, and go to step 3.
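The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the helper `novelty_reward` is a hypothetical stand-in for the immediate reward (it simply counts the OD flows that cross the next link but not the current one), links rather than interfaces are used as states, every state can be reached from every other state, actions are chosen uniformly at random, and each episode is cut off after `max_steps` steps when no target state is given.

```python
import numpy as np


def novelty_reward(theta):
    """Hypothetical immediate reward, NOT the paper's reward definition.

    theta is an (L, F) binary routing matrix. R[l1, l2] counts the OD flows
    that cross link l2 but not link l1, so dissimilar links score higher.
    """
    theta = np.asarray(theta, dtype=float)
    return (1.0 - theta) @ theta.T


def cumulative_reward(R, gamma=0.8, episodes=200, max_steps=50, target=None, seed=0):
    """Estimate the cumulative-reward matrix Q following the steps of Algorithm 1."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    Q = np.zeros((n, n))                          # Step 0: Q = 0
    for _ in range(episodes):
        i = int(rng.integers(n))                  # Step 2: starting state
        for _ in range(max_steps):
            if target is not None and i == target:
                break                             # Step 3: target reached
            j = int(rng.integers(n))              # Step 4: pick an action
            Q[i, j] = R[i, j] + gamma * Q[j].max()
            i = j                                 # Step 5: move to next state
    return Q


# Toy routing matrix: 3 links, 3 OD flows (made-up values).
theta = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 1, 0]])
R = novelty_reward(theta)                         # Step 1
Q = cumulative_reward(R)
print(np.round(Q, 2))
```

Because the rewards are bounded and γ < 1, the entries of `Q` stay below max(R) / (1 - γ); in the toy example, moving from link 0 to the dissimilar link 1 ends up with a higher value than moving to the identical link 2.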
After obtaining the cumulative reward matrix Q, we propose an algorithm for traffic measurement optimization. Our goal is to select several interfaces that characterize a proportion α ∈ [0, 1] of the total network traffic. We denote the total measured traffic amount by t ≥ 0, and the constraint can be written as:
$$t \ge \alpha S \tag{12}$$
The details of the proposed traffic measurement optimization algorithm are described as follows:
Step 1: Calculate the correlation coefficient between every pair of links, and select the link with the largest correlation coefficient with the other links. Select the initial state i according to the relationship between links and interfaces.
Step 2: Check the requirement shown in Eq. 12. If it is satisfied, stop.
Step 3: Select the action j with the highest Q value corresponding to i.
Step 4: Record the selected i and j, and delete columns i and j from the matrix Q to prevent duplicate selection.
Step 5: Record the routers n on which i and j are located.
Step 6: Compute the newly measured traffic amount $t'$.
Step 7: Update the total traffic amount t: $t = t + t'$. Set j as the current state i, and go to Step 2.
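The selection procedure above can be sketched as a greedy walk over Q. The code is an illustration under stated assumptions, not the authors' implementation: the function name `select_interfaces` and all inputs are hypothetical, the starting interface is supplied by the caller (the correlation-based choice of Step 1 is not implemented), the newly measured amount t' is taken to be the volume of OD flows not yet seen by any selected interface, and deleting a column of Q is emulated by setting it to negative infinity.

```python
import numpy as np


def select_interfaces(Q, theta_if, x, router_of, alpha, start):
    """Follow the largest Q values until a share alpha of the traffic is covered.

    Q         : (P, P) cumulative-reward matrix
    theta_if  : (P, F) binary matrix, 1 if OD flow f crosses interface i
    x         : (F,) OD-flow volumes
    router_of : length-P sequence mapping each interface to its router
    """
    Q = np.array(Q, dtype=float)                  # work on a copy
    covered = np.zeros(x.shape[0], dtype=bool)
    selected, routers = [], set()
    total = x.sum()
    i = start
    while True:
        selected.append(i)                        # Step 4: record the interface
        routers.add(router_of[i])                 # Step 5: record its router
        covered |= theta_if[i].astype(bool)       # Step 6: newly measured flows
        t = x[covered].sum()                      # Step 7: total measured so far
        Q[:, i] = -np.inf                         # Step 4: never select i again
        if t >= alpha * total or len(selected) == Q.shape[0]:
            break                                 # Step 2: requirement met
        i = int(np.argmax(Q[i]))                  # Step 3: best next interface
    return selected, sorted(routers), t / total


# Made-up example: 4 interfaces on 2 routers, 5 OD flows.
Q = np.array([[0, 2, 5, 1],
              [3, 0, 4, 1],
              [2, 6, 0, 1],
              [1, 1, 1, 0]])
theta_if = np.array([[1, 1, 0, 0, 0],
                     [0, 1, 1, 0, 0],
                     [0, 0, 0, 1, 1],
                     [1, 0, 0, 0, 0]])
x = np.array([10.0, 20.0, 30.0, 25.0, 15.0])
router_of = [0, 0, 1, 1]

chosen, used_routers, ratio = select_interfaces(Q, theta_if, x, router_of, alpha=0.9, start=0)
print(chosen, used_routers, ratio)
```

Tracking covered flows with a boolean mask avoids counting an OD flow twice when it crosses several selected interfaces, which is one simple way to interpret the traffic update in Steps 6 and 7.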

B. PARAMETER SELECTION AND PERFORMANCE INDICATORS
We first evaluate the sensitivity of the proposed method to its parameters. Figure 6 shows the number of routers activating NetFlow for various weights W and traffic proportions α in the Abilene backbone network. The number of routers decreases as the weight increases. In Fig. 7, we set α = 95% and show the number of routers for various weights and factors. Fig. 7 leads to the same conclusion as Fig. 6. Figures 8 and 9 show the simulation results for the GÉANT backbone network. Compared with the Abilene network, GÉANT is a larger-scale backbone network. Figures 8 and 9 show that the proposed method obtains a better solution there and is much less sensitive to the parameters.
To evaluate the performance of the proposed method, we mainly consider the following performance metrics:
• To minimize the extra overhead, we want the number of routers activating NetFlow to be as small as possible. Hence, when measuring the same proportion of traffic, we use the number of routers activating NetFlow as a metric.
• When we measure the same number of OD flows, we want to obtain as much traffic information as possible. Therefore, the proportion of the total traffic that is measurable is another performance metric used in this paper.
• We also measure the performance using the Normalized Mean Absolute Error (NMAE), which is evaluated over the unestimated set of OD flows, where $X(i, j)$ is an element of the TM and $\hat{X}(i, j)$ is its estimator.
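As a hedged illustration of the third metric, the sketch below uses a widely used form of NMAE: the sum of absolute errors divided by the sum of the true values over a chosen set of OD flows. The function name `nmae`, the boolean `mask` that encodes the set of OD flows, and the sample matrices are assumptions made for this example.

```python
import numpy as np


def nmae(X, X_hat, mask=None):
    """Normalized mean absolute error of X_hat with respect to X.

    mask is an optional boolean array selecting the OD flows to evaluate;
    by default all entries are used.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    if mask is None:
        mask = np.ones(X.shape, dtype=bool)
    denom = np.abs(X[mask]).sum()
    if denom == 0:
        raise ValueError("NMAE is undefined when the true traffic sums to zero")
    return np.abs(X[mask] - X_hat[mask]).sum() / denom


# Made-up 2 x 2 traffic matrix and an estimate of it.
X = np.array([[0.0, 10.0],
              [30.0, 0.0]])
X_hat = np.array([[0.0, 12.0],
                  [27.0, 0.0]])

score = nmae(X, X_hat)                       # (2 + 3) / 40
only_01 = np.array([[False, True],
                    [False, False]])
score_masked = nmae(X, X_hat, only_01)       # 2 / 10
print(score, score_masked)
```

A value of 0 means the estimate matches the true TM on the evaluated flows; larger values mean a larger relative error.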

C. COMPARISON ANALYSIS
To evaluate the performance of the proposed method, we compare our method (RL for short) with the method in [10], in which the authors solve the traffic measurement optimization problem by linear programming (LP for short). When implementing Algorithm 1, we set W = 100. Figures 10 and 11 plot the number of routers activating NetFlow when we collect different proportions of traffic for the Abilene and GÉANT networks, respectively. The x-axis and y-axis are the proportion of traffic and the number of routers, respectively. As shown in Fig. 10, to get 95% of the total traffic information, the proposed method needs only 50% of the routers. In Fig. 11, the proposed method needs to activate fewer than 5 routers to obtain more than 95% of the total traffic information in GÉANT. By contrast, the LP method requires more than 11 and 40 routers in Abilene and GÉANT, respectively. Moreover, we find that RL is good at optimizing a large-scale network. As the above simulations show, the RL method needs fewer routers than the LP method. In addition, solving the LP takes more time in practice, and the process is more complex.
Moreover, we compare our method with four data-driven network traffic measurement approaches, i.e., the iSTAMP method [30], the Bernoulli method (BM) [31], the compressive sensing over graph method (CSOG) [32], and the random walk method (RW) [33]. In [30], the authors proposed an efficient MAB-based flow sampling algorithm to select the most rewarding flows. It is worth noting that the proposed RL method can also select the most important flows in the network. Figures 12 and 13 show the proportion of traffic that can be sampled for the Abilene and GÉANT networks, respectively. The x-axis is the parameter α introduced earlier, and the y-axis is the proportion of traffic. As shown in Fig. 12, when the same number of OD flows is measured, the proposed method obtains more traffic information than iSTAMP, BM, CSOG, and RW. For instance, if we select 80% of the OD flows in the Abilene network, iSTAMP obtains only 35% of the traffic information, whereas the proposed RL method measures nearly 70%. The corresponding values for BM, RW, and CSOG are 14%, 13%, and 8%, respectively. When these methods are applied to the GÉANT network, which is larger than the Abilene network, the advantage is even more obvious (see Fig. 13).
We also compare the NMAE of the proposed algorithm with that of the other four methods, as shown in Figs. 14 and 15. As the number of measured OD flows increases, the NMAE decreases rapidly. Compared with the other four methods, however, the NMAE of the RL algorithm is much smaller. It is worth noting that the NMAE of the RL method stays nearly constant and close to zero as α increases, which suggests that the RL method may be an optimal solution for selecting the most rewarding flows of a network.

VI. CONCLUSION
This paper studies the problem of traffic measurement optimization and proposes an RL-based method to decrease the cost of direct measurement, which is the main contribution of this paper. In our method, we model the problem of traffic measurement optimization as a Markov decision process and propose an RL-based method to select the subset of routers that implements end-to-end network traffic measurement. First, we propose an approach to construct the immediate reward via the routing matrix. Then, an iterative algorithm based on Q-learning is proposed to compute the cumulative reward. By means of the cumulative reward, the optimal subset of routers activating NetFlow is selected to measure most of the network traffic.
We evaluate the performance of the proposed method on real traffic data sets from the Abilene and GÉANT backbone networks. The simulation results indicate that the proposed method can significantly reduce the number of routers activating traffic measurement. In particular, for large-scale backbone networks (e.g., the GÉANT backbone network), the RL method proves to be an excellent solution.
SHENGTAO LI received the M.S. degree in operational research and cybernetics from Ludong University, China, in 2010, and the Ph.D. degree in control theory and control engineering from Northeastern University, China, in 2013. He is currently an Associate Professor with the School of Information Science and Engineering, Shandong Normal University. His research interests include nonlinear systems theory, optimal switch-time control theory of switched stochastic systems, and robust control of systems with time delay.