Flow-Aware Elephant Flow Detection for Software-Defined Networks

Software-defined networking (SDN) separates the network control plane from the packet forwarding plane, which provides comprehensive network-state visibility for better network management and resilience. Traffic classification, particularly for elephant flow detection, can lead to improved flow control and resource provisioning in SDN networks. Existing elephant flow detection techniques use pre-set thresholds that cannot scale with the changes in the traffic concept and distribution. This paper proposes a flow-aware elephant flow detection applied to SDN. The proposed technique employs two classifiers, each respectively on SDN switches and controller, to achieve accurate elephant flow detection efficiently. Moreover, this technique allows sharing the elephant flow classification tasks between the controller and switches. Hence, most mice flows can be filtered in the switches, thus avoiding the need to send large numbers of classification requests and signaling messages to the controller. Experimental findings reveal that the proposed technique outperforms contemporary methods in terms of the running time, accuracy, F-measure, and recall.


I. INTRODUCTION
Software-defined networking (SDN) [1] has generated significant interest in industry and academia in recent years. The most crucial advantage of SDN is the opportunity to provide intelligence in computer networks. SDN capabilities include dynamic updates of the forwarding rules, software-based traffic analysis, and a logically centralized control network with a global view. These features allow the possible adoption of machine learning in network management [2]. However, the continuous growth of data traffic in terms of volume, velocity, and variety has made network traffic engineering a challenging task [3]. An accurate flow detection is vital for establishing appropriate forwarding strategies for various flow types, particularly for elephant flows (EFs) in an SDN environment.
Recent measurements conducted in data center networks [4], [5] have shown that 80% of the total flows take less than a few milliseconds and are less than 10 KB in size (i.e., mice flows, MFs), and that the majority of the traffic volume is accounted for by the top 10% of large flows (i.e., EFs). Any traffic that exceeds a certain threshold per unit time (e.g., 1 MBps) is often also considered an EF [6]. Given the high rate of EFs in network traffic, their effective control and rerouting can potentially improve the SDN network throughput [7]. By contrast, the competition for resources between MFs and EFs causes MFs to receive insufficient bandwidth [8]. Hence, EF detection [9] is an essential aspect of network traffic classification. The SDN controller does not need to process all flows, as it only needs to consider those EFs that severely impact the network performance when performing traffic management. If EFs are not efficiently managed, they can fill the network buffers, thereby leading to queuing delays and packet drops. Thus, EF detection is essential to easing network congestion [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Haris Pervaiz.
Several EF detection techniques [6], [7], [11]-[20] have been previously proposed. However, these techniques are preconfigured with fixed flow size thresholds in the switch, which can result in high rates of false positives and false negatives. Moreover, some methods require periodic extraction of the flow statistics (e.g., [6], [7], [16], [17]) or packet sampling (e.g., [11]-[15]) from SDN switches, which results in a long flow detection latency and heavy controller-switch signalling overhead. Some other techniques require either substantial modifications to the switch hardware (e.g., [18]) or end-host inference (e.g., [19], [20]), which makes adoption in existing SDN deployments difficult. To address the limitations mentioned above, several improved EF detection techniques have been proposed [21]-[24]. However, these techniques suffer from slow convergence for several reasons, including the switch-controller interaction, which requires a high bandwidth and a long detection time. The thresholds of existing detection approaches are usually preconfigured without any consideration of the changing traffic load or distribution in the SDN networks, which may cause a high false detection rate. Flow detection in SDN also needs to be accurate and to operate in real time. Flow detection techniques based on statistical thresholds can operate in real time but with a lower accuracy, while at the same time increasing the controller workload. This problem requires a careful trade-off. When performing SDN flow prediction, a failure to detect an EF can have more severe consequences than misdetecting an MF. To further improve the accuracy of EF detection, the flow characteristics must be fully considered.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This paper presents a flow-aware EF detection technique for SDN. The proposed technique employs a pair of classifiers that run in tandem on the SDN switches and the controller, respectively, to share the tasks of classifying the EFs. Hence, most MFs can be filtered in the switches, and a large number of classification requests and signaling messages can be avoided at the controller. Our solution provides a good trade-off between the overall accuracy and the controller loads, which is critical for real-time traffic flow management. Several experiments have also been conducted on real datasets to measure the improvement in the controller running time, accuracy, F-measure, and recall. The key contributions of this paper are as follows.
• Proposing a flow-aware EF detection technique for SDN that can identify real-time EFs with low timing overhead and high detection accuracy, recall, and F-measure.
• Proposing a switch-side count-min (CM) sketch data stream structure used to filter MFs with commodity OpenvSwitch software. Moreover, the OpenFlow protocol is enhanced with extended signalling messages to handle the CM sketch data processing between switches and the controller side classifiers.
• Evaluating the performance of the classifiers in real-time for EF detection using real traces from the Internet and a data center in a Mininet simulation environment. The performance results show that our proposed technique can significantly improve the running time, accuracy, recall, and F-measure.
The remainder of this paper is organized as follows. A review of previous related studies is presented in Section II. Section III describes the framework design of the proposed EF detection technique. An evaluation of the results and a relevant discussion are detailed in Section IV. Finally, Section V provides some concluding remarks regarding this research and areas of potential future study.

II. BACKGROUND AND RELATED WORK

A. SDN BACKGROUND
The SDN architecture characteristically abstracts the control and data planes as separate entities, as illustrated in Figure 1. Programmability is the key characteristic of an SDN architecture, allowing users to develop their own customized applications. Using the advanced policy applications and services and the programmable application program interfaces (APIs) provided by the north-bound interface, users can develop applications of their choice at the application layer. In addition, the south-bound interface offers a standard API. The SDN controller thus communicates through two interfaces: the north-bound interface towards the applications and the south-bound interface towards the switches, the latter using the OpenFlow protocol [25], [26].
The SDN controller acts as a network operating system that views the network topology state comprehensively and manages OpenFlow switches through a secure communication channel [28]. Its responsibilities include managing and controlling how the switches process flows through the entries in the flow tables. Several variations of SDN (compatible) controllers have been developed, including the NOX controller [29], Ryu controller [30], and Floodlight controller [31]. Centralized control in SDN provides an architectural basis for open network programmability.
By providing a programmable interface for upper-level applications, the control plane can implement complex management functions such as EF detection strategies, load balancing across switches, and global monitoring of the network and its changing needs. OpenFlow switches forward messages based on the flow entries and maintain various counters that record the traffic size or the number of matched data packets. These counters greatly simplify the collection of traffic statistics for EF traffic detection. OpenFlow, the current de facto south-bound protocol, provides numerous control and monitoring mechanisms with which flow management can be implemented flexibly, effectively, and efficiently. With these features, EF detection in SDN has advanced rapidly [9], [32].

B. OVERVIEW OF GENERIC FLOW CONTROL SCHEMES FOR DIFFERENT APPLICATIONS
A mechanism for controlling the flow of data between a pair of nodes is known as flow control. This is achieved by adjusting the transmission and receiving rates of the data. However, to bolster the quality-of-service (QoS) of the network with an improved quality-of-experience (QoE) for users, there is a need for an efficient traffic control strategy to cater for the ever-increasing traffic bandwidth [33].
The decision-making mechanism applied in the SDN architecture is the controller, which controls the functions for all flows within the entire SDN network [34]. Broadly, the flow control modes can be categorized into the coarse-grained [35], and fine-grained [36] controls. Several SDN-based flow control techniques have been proposed to further improve flow control. For example, in terms of traffic classification, Wang et al. [37] dealt with the traffic of unknown applications within the SDN by employing a semi-supervised machine learning approach in the classification of the QoS. The engine was able to run in real-time because only the first several packets of every flow were considered for feature extraction. Periodic polling for EF detection was also suggested in [7], which operates by extracting the per-flow statistics from its edge switches.

C. RELATED WORKS
The current EF detection techniques used in SDN fall into two main categories according to the detecting location: (i) switch-based detection and (ii) host-based detection.

1) SWITCH BASED DETECTION
A real-time EF detection system was proposed in [21]. The proposed method is comprised of two stages according to the statistical thresholding of the flow stream features. The first stage is to detect suspected EFs based on the statistical thresholds of multiple flows. The second stage is to identify EFs from the suspected EF set based on the features from the first few packets, which can offer timely and accurate flow classification. In addition, this approach employs a cost-sensitive learning approach using a C4.5 decision tree for real-time EF detection and the flow metric measurement. Chao et al. [23] presented an EF detection method based on a classification called FlowSeer. FlowSeer uses the features of the first five packets of a traffic flow to detect the EF. In FlowSeer, two classifiers are executed, one on the switch and the other on the controller. The switch-side classifier acts as a filter to remove most of the MFs, whereas the controller-side classifier verifies whether the EFs classified by the switch are genuine.
An EF prediction mechanism was reported in [24] for data center networks to address the characteristic traffic demands within the network. This approach seeks to reduce the overhead associated with the switch-to-controller communication by forecasting the EFs and adapting their routing policies in response to the ever-changing demands on the network. However, this study has some scalability limitations when applied to large-scale networks that are more complex and dynamic, such as multi-tenant cloud networks, large virtualized data centers, and the Internet-of-Things (IoT). Huang et al. [22] proposed the arrangement of a pair of classifiers that run on the controller and the switch, respectively. Owing to the limited computing power of the switch, only rule-based and decision-tree classifiers can be used on the switch side. The controller, coupled with the switch-side classifier, accordingly rewrites the classifier rules and updates the switch flow table.
All the approaches mentioned above can reduce the communication between the switches and the controller by keeping to a minimum the frequency at which EF statistics are transferred for flow setups. Furthermore, reducing the switch-controller communication reduces the workloads of both the controller and the network, which are the inherent overhead of flow-based networking. However, in addition to fixing the threshold value, EF detection on switches also requires modifying the switch hardware. Otherwise, the high detection accuracy of the EF detection system comes at the expense of a high network overhead, i.e., the switch-controller communication needed to detect EFs. Owing to these limitations, achieving a balance among accuracy, timeliness, and cost becomes difficult.

2) HOST-BASED DETECTION
Considering the scalability and timeliness of EF detection, the Mahout architecture [19] deploys a kernel patch in the terminal host to monitor the traffic statistics generated by the host and detect the EFs based on the pre-supposed EF threshold. To reduce the communication overhead, Mahout informs the controller regarding the EF and prescribes an in-band mechanism. Specifically, Mahout uses the differentiated service field of the IP header to mark the elephant stream. When the marked elephant stream reaches the switch, the switch forwards the corresponding packet towards the controller based on the default flow entry. Like Mahout, MicroTE [38] can conduct an analysis of all the network traffic.
By designating the monitoring end-host in each top-of-rack switch, the network traffic is collected, aggregated, and reported to the controller in time. However, due to the invisibility of network traffic generated by the virtual machines in the end-host, virtual traffic monitoring cannot be realized by simply deploying a kernel patch. Based on the monitoring tools such as VSFlow and NetFlow supported by OVS, EMC2 [20] recommends using a hypervisor deployed on the end hosts to collect the traffic statistics. However, the collected data need to be sent to the centralized flow collector for further analysis, which may also result in monitoring traffic overhead.

III. PROPOSED SWITCH-CONTROLLER FOR EF DETECTION
Elephant traffic detection in SDN must be fast, lightweight, and non-intrusive (i.e., its impact on the control plane should be minimal). At the same time, the detector must be able to accurately differentiate EFs from MFs for an effective flow migration based on a specific cost argument.
The EF detection process can be split between the controller and switches. Given the limited computational power of the switches, the classifier on the switch side must be lightweight and designed such that it places more emphasis on a high recall. Therefore, the switch-side classifier can detect most of the EFs at the expense of falsely detecting some of the MFs as EFs. Meanwhile, the controller side has more computational power and thus uses more features. Therefore, the classifier on the controller side must place more emphasis on a higher F-measure and precision, as opposed to the emphasis on recall for the classifier on the switch side. Figure 2 shows the operation of the proposed technique. When a new flow arrives, the switch-side classifier pre-filters the MFs based on the CM sketch algorithm [39]. The controller periodically trains the switch-side CM sketch model, emphasizing an optimal recall rate, which reduces the misdetection of potential (i.e., candidate) EFs. The candidate EFs are forwarded by the switch to the controller, which performs the controller-side part of the process. This part is a practical streaming classification model using a very fast decision tree (VFDT) classifier. Once an EF is recognized, the CM training model is updated and converted into minimal sets of rules, given the limited flow table size of the switch. Figure 3 shows a flow chart of the EF detection.
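The different emphases of the two classifiers can be illustrated with the usual definitions of precision, recall, and F-measure for the EF (positive) class. The following is a minimal sketch; the function name and the confusion counts are hypothetical and are not taken from the experiments in this paper.

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    """Return (precision, recall, F_beta) for the EF (positive) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts: the switch stage misses few EFs but lets many MFs through.
switch_stage = precision_recall_f(tp=95, fp=60, fn=5)
# Hypothetical counts: the controller stage trades a little recall for precision.
controller_stage = precision_recall_f(tp=90, fp=10, fn=10)
```

With these illustrative counts, the switch stage has the higher recall while the controller stage has the higher precision and F-measure, which is the division of labour described above.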
The switch-side classifier is based on the CM sketch algorithm [39]. Because switches have limited computing power, the training of the classifier is achieved by either the controller or an off-line server. The CM sketch reports the state of its performance in terms of delay (i.e., buffer load), total number of packets handled, and list of hashed IP source-destination address pair for the EF candidates. A CM sketch algorithm used in the switch-side provides a quadruple of the hashed IP address, the number of packets, the aggregated packet sizes, and the average delay, which is forwarded to the controller side. This approach also adds to the network traffic overhead as the switch-side classifier needs to communicate with the controller-side classifier.
The controller-side is created using a VFDT classifier based on the study in [40]. Because it is a multi-commodity flow problem, the tree needs to be trained using such data and based on the EF definition. For any given flow identified by an IP address pair, a set of alternative routes can be generated by any routing algorithm. Using the total packet arrival metrics from the switches, the controller predicts the network delay after a candidate EF passes. Based on these metrics, it selects the flow and path, which minimizes the functional cost (with parameters consisting of the maximum network delay and the number of hops). The selected flow and a new path are used to create migration instructions, which are compiled and sent to the switches.
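The path selection step can be sketched as follows. The paper states only that the cost depends on the maximum network delay and the number of hops; the weighted-sum form, the weights, the function names, and the example topology below are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]


def path_cost(path: List[str], link_delay: Dict[Link, float],
              w_delay: float = 1.0, w_hops: float = 0.1) -> float:
    """Assumed cost: weighted sum of the worst predicted link delay and hop count."""
    links = list(zip(path, path[1:]))
    max_delay = max(link_delay[link] for link in links)
    return w_delay * max_delay + w_hops * len(links)


def select_path(candidates: List[List[str]], link_delay: Dict[Link, float]) -> List[str]:
    """Pick the candidate route with the smallest assumed cost."""
    return min(candidates, key=lambda p: path_cost(p, link_delay))


# Hypothetical predicted per-link delays (ms) after the candidate EF is placed.
predicted_delay = {
    ("s1", "s2"): 5.0, ("s2", "s4"): 1.0,
    ("s1", "s3"): 2.0, ("s3", "s4"): 2.0,
    ("s3", "s5"): 1.0, ("s5", "s4"): 1.0,
}
routes = [["s1", "s2", "s4"], ["s1", "s3", "s4"], ["s1", "s3", "s5", "s4"]]
best_route = select_path(routes, predicted_delay)
```

The candidate routes would come from any routing algorithm, as noted above; only the final comparison is shown here.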

A. SELECTION OF EF FEATURES
Before describing the EF detector architecture, we first need to define the EF and its features from the flow statistics.

1) DEFINITION OF THE EFs
Some studies, e.g., [7], [41], define an EF based on bandwidth use exceeding a specified limit for a specified time. The features used to accurately describe these properties are the bandwidth and duration. These can be estimated based on the number of packets arriving per time window. Because the definition does not contain any information on the flow type, flow-type features do not add any additional information regarding the flow. The time window in which the packets are captured includes the information on the average inter-arrival time of the packets.
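A threshold-and-duration definition of this kind can be written down directly. The sketch below is one possible reading, assuming that per-window byte counts are available and that "for a specified time" means consecutive windows; the function name, parameters, and example values are hypothetical.

```python
import math
from typing import Sequence


def is_elephant(bytes_per_window: Sequence[float], window_s: float,
                rate_threshold: float, min_duration_s: float) -> bool:
    """True if the byte rate (bytes/s) stays above rate_threshold for at
    least min_duration_s seconds of consecutive capture windows."""
    needed = max(1, math.ceil(min_duration_s / window_s))
    run = 0
    for window_bytes in bytes_per_window:
        if window_bytes / window_s > rate_threshold:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0  # the rate dropped, so the sustained period restarts
    return False


# Hypothetical example: 1 s windows, 1 MB/s limit, sustained for 3 s.
sustained = is_elephant([2e6, 2e6, 2e6], 1.0, 1e6, 3.0)
bursty = is_elephant([2e6, 5e5, 2e6, 2e6], 1.0, 1e6, 3.0)
```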

2) SELECTION OF FEATURES FOR A FLOW CLASSIFICATION
It is highly desirable to use features that do not exhibit a correlation. Most of the models perform poorly in the presence of multicollinearity, i.e., when other variables can predict one explanatory variable [42], [43]. In addition, redundant variables increase the computational cost in terms of time and storage. The packet characteristics available are flow end-point identifiers i.e., the IP addresses and ports. The OpenFlow protocol indicating the type of flow is often associated with the port numbers. Therefore, the protocol type and port numbers are correlated. Similarly, properties related to Ethernet packets and IP packets are strongly correlated and therefore introduce multicollinearity.
In view of load balancing flows in the data plane, the IP address pair is usually adequate to identify the path of the flow. Port numbers can be included to divide the flows into smaller sub-flows. In a large flow, one sub-flow is likely to be dominant. Because the whole flow may be subjected to rerouting, these sub-flows give little added information but increase the size and complexity of the implementation. An EF is also not characterized by its direction (as per the definition). Most flows are asymmetric and usually dominated by either uplink or downlink traffic (in terms of bandwidth and packets). The dominant link is, by definition, strongly correlated with the total flow. In the implementation, the IP source and destination addresses are hashed to form a flow identifier. The hash value is symmetric for the two IP addresses, and thus gives the same key regardless of direction. Distinguishing between the up-link and down-link packets would double the number of flows and make the estimation computationally more expensive. Because only the dominant flows are of interest, there is no reason to make this distinction.
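A direction-independent flow identifier can be obtained by ordering the two addresses before hashing. The following is a minimal sketch using only the Python standard library; it is not the iptools-based hash used in the implementation, and the function name and the choice of BLAKE2b are assumptions for illustration.

```python
import hashlib
import ipaddress


def flow_key(ip_a: str, ip_b: str) -> int:
    """64-bit flow identifier that is the same for (a, b) and (b, a)."""
    a = int(ipaddress.ip_address(ip_a))
    b = int(ipaddress.ip_address(ip_b))
    lo, hi = sorted((a, b))  # canonical order removes the direction
    digest = hashlib.blake2b(f"{lo}-{hi}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


uplink = flow_key("10.0.0.1", "10.0.0.2")
downlink = flow_key("10.0.0.2", "10.0.0.1")
other = flow_key("10.0.0.1", "10.0.0.3")
```

Both directions of a conversation then update the same counters, so the number of tracked flows is halved compared with a directional key.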
Each additional feature increases the cost of the traffic classification (i.e., time and space) for both the switches and the controller, so the classification feature set should be chosen to maximize orthogonality. Thus, the feature set used for detection of the EF in an SDN environment has the following two attributes.
1) It is easy to extract using commodity OpenFlow switches. These features include the packet inter-arrival time, IP address and port number, packet size per flow, max and mean packet size, flow duration, and other flow statistics [44].
2) It is set up for a fast detection before a flow is concluded. For example, some features such as the frame length can only be obtained after the flow is concluded, thus failing to meet our needs for fast EF classification.
Therefore, we only consider the flow feature sets collected by inspecting the IP header. A flow consists of the IP packets that share the same five-tuple {protocol, src_port, dst_ip, src_ip, dst_port}, with each flow distinguishable by statistical features such as the IP source and destination addresses, frame length, and average round-trip time of a TCP stream (TCP.analysis.ack_RTT) [44].

B. SWITCH-SIDE EF DETECTION
On the switch side, we use a CM sketch [39] to detect EF (heavy hitter) candidates. This method is fast and lightweight. As the CM sketch uses hashed IP address pairs, the IP address pair for the EFs also must be retrieved. The switch does not store or count the total number of flows, only the IP addresses. This sampling can be conducted at a relatively high frequency. After each sampling period, the result is stored in a data structure containing EFs and all packets. The arrival process vectors (containing the EF, and total flows), the flow identifiers (IP address pairs) of the EFs, and the buffer load data are sent to the controller. We next describe how the CM sketch technique works.

C. THE CM SKETCH
The CM sketch (count-min, also called the Cormode-Muthukrishnan sketch) is a data type suitable for counting frequencies, here the frequency of arriving packets associated with a particular flow [39]. An end-to-end flow from a network perspective is defined by an IP address pair (source and destination IP addresses). A hash function of the IP address pair is used as an identifier in the sketch. For this purpose, it is convenient to use the IP range function in the Python package iptools. The hash is generated by the following:

    h = iptools.IpRange(ipsrc, ipdst)
    iphash = hash(h)

The first step in the CM sketch algorithm for finding EFs can now be formulated as finding the heavy hitters in the sketch, i.e., the high-frequency flows among all observed flows. The heavy hitter problem can be formulated as a sequence of point queries to the sketch, each of which returns an approximate frequency for the queried index, which is the hashed pair of IP addresses. For this estimation, the approximation factor $\epsilon$ and failure probability $\delta$ are set at the sketch initialization, such that the estimate $\hat{\upsilon}_i$ of the true frequency $\upsilon_i$ of index $i$ satisfies Equation (1) with probability $1 - \delta$:

$$\hat{\upsilon}_i \le \upsilon_i + \epsilon \|\upsilon\|_1, \tag{1}$$

where $\|\upsilon\|_1 = \sum_{k=1}^{m} \upsilon_k$ is the $L_1$-norm. The $L_1$-norm is essentially unknown initially, and thus the approximation factor is treated as a fraction relative to the number of packets arriving in the switch. To initiate the sketch, the values of $\epsilon$ and $\delta$ determine the size, namely the width $\omega$ and depth $d$ of the sketch, as indicated in Equations (2) and (3), respectively:

$$\omega = \lceil e / \epsilon \rceil, \tag{2}$$

$$d = \lceil \ln(1/\delta) \rceil. \tag{3}$$

Here $e$ is the natural base, although in principle any base greater than 1 can be chosen.
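As a worked example of the sizing step, the standard count-min dimensions (width equal to the ceiling of e divided by the approximation factor, depth equal to the ceiling of the natural logarithm of one over the failure probability, matching the initialization in Algorithm 1) can be computed as below. The function name and the parameter values are illustrative assumptions.

```python
import math


def sketch_dimensions(epsilon: float, delta: float):
    """Width and depth of a count-min sketch for error epsilon and failure probability delta."""
    width = math.ceil(math.e / epsilon)        # omega = ceil(e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))   # d = ceil(ln(1 / delta))
    return width, depth


# Illustrative choice: 1% relative error with 99% confidence.
width, depth = sketch_dimensions(epsilon=0.01, delta=0.01)
```

For these values the sketch needs 272 columns and 5 rows, i.e., 1360 counters per tracked quantity, independent of the number of flows.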
The width and depth determine the size of the sketch, which is $\omega \times d$ words. This is illustrated in Figure 4, where each item $i$ is mapped to one entry in each row $j$ by the hash function $h_j$, and when an update $c_t$ to item $i_t$ arrives, $c_t$ is added to each of these entries. The cost of such an update is only related to the depth $d$ of the matrix. The sketch uses a second hash function to reduce the required space of the sketch. Therefore, the index key is further hashed to fit into the sketch width $\omega$. Given a prime number $p \ge \omega$, the hash function can be chosen using Equation (4):

$$h_{a,b}(x) = \big((a x + b) \bmod p\big) \bmod \omega, \tag{4}$$

where $a \in \{1, \ldots, p - 1\}$ and $b \in \{0, \ldots, p\}$; these functions are known as the c-universal family of hash functions [45].
Since collisions are unavoidable when using a small space to represent a much larger range of values (i.e., the hash of the IP addresses), it follows that $\hat{\upsilon}_i \ge \upsilon_i$ for all $i$. The depth $d$ of the sketch corresponds to $d$ instances using different hash functions, and taking the minimum over the $d$ instances gives the value $\hat{\upsilon}_i$ closest to $\upsilon_i$. The IP addresses cannot be retrieved from their hash values because collisions occur. It is of interest to capture other features related to the flows identified by high-frequency arrivals, i.e., the packet length and the round-trip time, which represents the delay. For this purpose, two additional sketches using the same hash functions as the frequency counting sketch are initiated to aggregate the packet lengths and round-trip times, respectively.
As shown in Algorithm 1, by letting the sketches run for the chosen capture time interval $\tau$, the frequency is estimated by the frequency sketch. Because the IP addresses cannot be reconstructed from the hash, the heavy hitters are found by point queries to the sketch using the hash of the source and destination IP addresses. When the estimated frequency exceeds a set fraction $\phi$ of the $L_1$-norm $\|\upsilon\|_1$, the IP address pairs are saved together with the minimum of the $d$ frequency estimates corresponding to their hash value. For the additional features, the maximum values in the $d$ arrays are used rather than the minimum, to ensure that the worst possible characteristics are captured.

Algorithm 1 Extended CM Sketch
Given: Parameters $\epsilon$, $\delta$, $\tau$, and $\phi$, and a packet capture stream P.
Let: I = Ø be the set of unique IP address pairs.
Initialize: Initiate the three sketches $S_f$, $S_t$, and $S_d$ for frequency, throughput, and delay, respectively, with $\omega = \lceil e/\epsilon \rceil$ and $d = \lceil \ln \delta^{-1} \rceil$, and determine the prime p.
Output: The k heavy hitters, each represented by the source IP, destination IP, frequency (number of packets), throughput (sum of the packet lengths), and average RTT ACK (delay).
Generate hash functions (a, b, ω, p) according to Eq. (4) and set time t = 0;
while t < τ do
    for each incoming packet in P do
        Save the IP address pair in I, indexed by its hash function value;
        Update $S_f$, $S_t$, and $S_d$ using the same hash functions defined initially;
    end
    Update t with a timestamp;
end
if t ≥ τ then
    for all IP address pairs in I do
        Query $S_f$ with the corresponding hash functions (a, b, ω, p) to obtain $\hat{\upsilon}_i$;
        if $\hat{\upsilon}_i > \phi \|\upsilon\|_1$ then
            Save the IP address pair and $\hat{\upsilon}_i$ under the same hash;
            Query $S_t$ and $S_d$;
        end
    end
    Reset t = 0;
end
The output from the sketch is a set of relatively high-frequency flows (a large number of packets per time unit). However, an EF is typically defined as a flow with a large throughput for a specific duration of time. The sketch records the estimated throughput and delay (the latter being a likely effect of such a flow) but does not consider these parameters when filtering out the flows. By running these sketches over short time intervals, aggregation makes it possible to estimate the frequency (i.e., the time aspects of the flow). After each time interval, the extracted candidate EF data are sent to the controller side, and the sketches are reset for the next aggregation interval.
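The procedure of Algorithm 1 can be sketched in Python as follows. This is an illustrative, simplified version and not the switch-side implementation: the class and method names are hypothetical, the flow key is built with the standard library rather than iptools, addresses are assumed to be IPv4, the prime is an arbitrary Mersenne prime larger than any key, and the average RTT is approximated by dividing the aggregated RTT by the estimated packet count.

```python
import ipaddress
import math
import random


class ExtendedCMSketch:
    """Packet-count, byte-count and RTT tables sharing one set of hash functions."""

    def __init__(self, epsilon, delta, prime=2**89 - 1, seed=0):
        self.width = math.ceil(math.e / epsilon)       # omega
        self.depth = math.ceil(math.log(1.0 / delta))  # d
        self.prime = prime
        rng = random.Random(seed)
        # One (a, b) pair per row: h(x) = ((a*x + b) mod p) mod omega.
        self.hashes = [(rng.randrange(1, prime), rng.randrange(0, prime))
                       for _ in range(self.depth)]
        self.reset()

    def reset(self):
        """Clear all state at the start of a capture interval."""
        def table():
            return [[0] * self.width for _ in range(self.depth)]
        self.freq, self.size, self.rtt = table(), table(), table()
        self.total = 0   # L1-norm: packets seen in this interval
        self.pairs = {}  # flow key -> address pair (hashes are one-way)

    @staticmethod
    def flow_key(src, dst):
        """Direction-independent integer key for an IPv4 address pair."""
        lo, hi = sorted((int(ipaddress.IPv4Address(src)),
                         int(ipaddress.IPv4Address(dst))))
        return (lo << 32) | hi

    def _columns(self, key):
        return [((a * key + b) % self.prime) % self.width for a, b in self.hashes]

    def update(self, src, dst, length, rtt):
        key = self.flow_key(src, dst)
        self.pairs.setdefault(key, (src, dst))
        for row, col in enumerate(self._columns(key)):
            self.freq[row][col] += 1
            self.size[row][col] += length
            self.rtt[row][col] += rtt
        self.total += 1

    def heavy_hitters(self, phi):
        """Return (pair, packets, bytes, mean RTT) for flows above phi * L1-norm."""
        hitters = []
        for key, pair in self.pairs.items():
            cells = list(enumerate(self._columns(key)))
            packets = min(self.freq[r][c] for r, c in cells)  # min: tightest count
            if packets > phi * self.total:
                nbytes = max(self.size[r][c] for r, c in cells)   # max: worst case
                rtt_sum = max(self.rtt[r][c] for r, c in cells)
                hitters.append((pair, packets, nbytes, rtt_sum / packets))
        return hitters


# Synthetic interval: one bidirectional large flow and 100 single-packet flows.
sketch = ExtendedCMSketch(epsilon=0.01, delta=0.01)
for i in range(900):
    src, dst = ("10.0.0.1", "10.0.0.2") if i % 2 else ("10.0.0.2", "10.0.0.1")
    sketch.update(src, dst, length=1500, rtt=0.002)
for i in range(100):
    sketch.update("10.1.0.1", f"10.2.0.{i + 1}", length=64, rtt=0.001)
candidates = sketch.heavy_hitters(phi=0.1)
```

After each interval, the candidate list would be sent to the controller and `reset()` called before the next interval, mirroring the reset of t in Algorithm 1.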

D. CONTROLLER-SIDE EF DETECTION
The VFDT is a stream-based data mining classification algorithm that incrementally builds its model as a tree by the division of nodes into a pair of streams of incoming data. The tree expands incrementally as more data arrives. Therefore, the candidate EF data are fed into the VFDT for flow classification based on the aggregated attributes. The VFDT is a suitable method because the classification tree is binary. As with a sketch, the VFDT reads each candidate EF data point x only once and does not require the input data to be stored [40].
In the decision tree, each node represents an implementation of a logical test on a feature from the features of x. At the same time, each leaf indicates a classification from which an appropriate label y is assigned to the incoming data point x as y = VFDT(x). The learning process of the tree is achieved through a successive replacement of each leaf with a node, starting from the root. The Hoeffding tree (HT) [40] algorithm uses the Hoeffding bound (HB) to train the model using the smallest possible number of training samples. The VFDT is made of key elements that include i) an initialization process of a tree beginning with a single leaf, and ii) a growth process of a tree where a repeated splitting check is heuristically carried out using the HB and an evaluation function G(.). Information gain is used in VFDT to represent G(.). Figure 5 shows the flow of operations in the VFDT algorithm [46].
The HB is the basis of the VFDT. For a given sequence of independent random variables $X_1, \ldots, X_n$ bounded within the range $R$, i.e., $0 \le X_i \le R$, the probability of the sample mean $\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$ deviating from its expectation $E(\bar{X})$ by a positive constant $\epsilon$ is related to the sample size $n$ as indicated in Equation (5):

$$P\big(\bar{X} - E(\bar{X}) \ge \epsilon\big) \le \exp\left(-\frac{2 n \epsilon^2}{R^2}\right). \tag{5}$$

The HB states that, given $n$ independent observations of a random variable with sample mean $\bar{r}$ and a pre-defined tolerable estimation error $\delta$, with probability $1 - \delta$ the true mean of the variable is at least $\bar{r} - \epsilon$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}. \tag{6}$$

The VFDT uses the HB to select the attribute on which a leaf is split into a decision node. Let $x_a$ and $x_b$ be the attributes with the highest and second-highest $G(.)$, respectively, and let $\Delta G = G(x_a) - G(x_b) \ge 0$ be their difference. If $\Delta G > \epsilon$, where $\epsilon$ is computed with $n = N$, the number of samples observed in the leaf, then the HB guarantees with probability $1 - \delta$ that $x_a$ is the attribute with the highest value of $G(.)$, and the leaf is converted into a decision node that splits on $x_a$. The HB does not depend on the distribution of $X_i$, which is extremely convenient because the distributions of the traffic features are complex and vary with the application.
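The split test can be made concrete with a short numerical sketch. It assumes the Hoeffding bound in the form epsilon = sqrt(R^2 ln(1/delta) / (2n)) and takes R = 1, which is the range of the information gain for a two-class (EF/MF) problem; the function names, the attribute names, and the gain values are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def split_attribute(gains: Dict[str, float], value_range: float,
                    delta: float, n: int) -> Tuple[Optional[str], float]:
    """Return (attribute to split on or None, epsilon) for a leaf with n samples."""
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    (best, g_a), (_, g_b) = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n)
    return (best if g_a - g_b > eps else None), eps


# Hypothetical information gains of three aggregated flow attributes.
gains = {"bytes": 0.30, "packets": 0.20, "rtt": 0.05}
early_choice, early_eps = split_attribute(gains, 1.0, 1e-7, n=200)
late_choice, late_eps = split_attribute(gains, 1.0, 1e-7, n=2000)
```

With 200 samples the bound (about 0.20) is larger than the gain difference of 0.10, so the leaf waits; with 2000 samples the bound shrinks to about 0.06 and the leaf splits on the best attribute.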
The HT algorithm aims to guarantee that, with high probability, the attribute selected using n examples is the same as that which would be selected using a significantly larger number of examples. One major characteristic of the HT algorithm is that it can guarantee the construction of a tree that is asymptotically arbitrarily close to the one produced by a batch learner. During each step, the attribute with the highest information gain is chosen as the test attribute. The error bound $\epsilon$ decreases with increasing n. When $\epsilon$ falls below the difference in gain between the two attributes with the highest information gain, the node is split on the best attribute, which yields new leaves [40].
In the VFDT, the training sequence uses the EF definition based on the limits in throughput and duration, scaled to the time window used for aggregation as conducted by the CM sketch. The maximum size of the tree is $2^{h+1} - 1$, where h stands for the tree height, which equals the number of attributes. The VFDT thus produces a flow classification that can be used for processing and rerouting.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The experimental setup, results and in-depth discussion of our proposed methods are presented in this section. We also compare our findings with other contemporary methods found in the literature.

A. EXPERIMENT SETUP
The simulation was designed based on machine learning using Python socket programming APIs. The hardware used for the simulation was a Dell Inspiron laptop with a 3.20 GHz Intel i5-4570 CPU and 8 GB of RAM. VirtualBox was used as the virtual environment for loading a Mininet image: an SDN Hub 64-bit tutorial VM image was used to create a VM in VirtualBox with 4 GB of RAM and a 20 GB hard drive. The Mininet image is a modified Ubuntu platform with a range of pre-installed and preconfigured network tools, including Mininet [47] and OpenvSwitch [48]. The Ryu controller [30] was installed and used as the SDN controller for managing the OpenFlow-compatible switches.
The three main steps applied in our experiment are shown in Figure 6. First, the data are divided into training and testing sets. Second, the training dataset is initially classified so that it can handle diverse varieties of traffic before being given to the switch and controller sides of the framework to identify the correct attributes. Third, we train the CM sketch algorithm and the VFDT classifier to differentiate between EF and MF traffic. The performance of the proposed system is then evaluated using a set of evaluation metrics.

1) DATASETS
We evaluate the proposed EF detection method on three real network traffic datasets: MAWI [49], UNI1 [49], and UNI2 [49]. The MAWI dataset (from April 9 to April 20, 2016) was obtained from a world-wide trace. It comes from the daily tracking of trans-Pacific lines (the link was upgraded from 100 Mbps to 1 Gbps with a 150 Mbps committed access rate (CAR)). It contains numerous stochastic factors, which make traffic classification more challenging. For this dataset, we select extensive flows as the significant flows because they dominate Internet traffic. This approach is used in our simulation to measure the EF detection technique. A threshold of 10 MB/s is set for the EF in this experiment [23]. In practice, the number of MFs is usually larger than the number of EFs; the MAWI dataset has approximately 10%-20% EFs. The UNI1 and UNI2 datasets were captured from data centers studied in IMC 2010 [5]. EFs constitute approximately 2.5% and 5% of the UNI1 and UNI2 datasets, respectively.
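To make the threshold-based ground truth concrete, the following is a minimal sketch of how flows could be labelled as EF or MF with a 10 MB/s threshold. The record layout (`FlowRecord`), the interpretation of 1 MB as 10^6 bytes, the use of average throughput over the flow duration, and the handling of zero-duration flows are all assumptions made for illustration; the labelling code used in the experiments is not shown in this paper.

```python
from dataclasses import dataclass

# Assumption: 1 MB = 10**6 bytes, so the threshold is 10 MB/s.
EF_THRESHOLD_BYTES_PER_SEC = 10 * 10**6


@dataclass
class FlowRecord:
    """Hypothetical per-flow summary extracted from a packet trace."""
    flow_id: str
    total_bytes: int
    duration_sec: float


def label_flow(flow: FlowRecord,
               threshold: float = EF_THRESHOLD_BYTES_PER_SEC) -> str:
    """Return "EF" if the average throughput reaches the threshold, else "MF"."""
    if flow.duration_sec <= 0:
        # Assumption: a flow without a measurable duration is treated as a mouse flow.
        return "MF"
    throughput = flow.total_bytes / flow.duration_sec
    return "EF" if throughput >= threshold else "MF"


# Hypothetical flows: 20 MB/s for 30 s versus 0.5 MB/s for 2 s.
bulk = FlowRecord("bulk-transfer", total_bytes=600_000_000, duration_sec=30.0)
short = FlowRecord("short-request", total_bytes=1_000_000, duration_sec=2.0)
labels = [label_flow(bulk), label_flow(short)]
```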

2) EVALUATION METRICS
The performance metrics for the two-step flow classification method are the precision, accuracy, recall, F-measure, and running time. These metrics are all calculated from a confusion matrix.
In the confusion matrix, the true positives (TP) are the actual positive records that are correctly classified, and the true negatives (TN) are the actual negative records that are correctly classified. The false positives (FP) are the misclassified negative records, whereas the false negatives (FN) are the misclassified positive records.
1) The accuracy Acc is defined as the proportion of correctly classified instances within the total number of instances, reported as a percentage:
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$
2) The precision P is the number of true positives divided by the sum of the false and true positives. A higher P reflects a lower number of false positives:
$$P = \frac{TP}{TP + FP} \quad (8)$$
3) The recall R is the number of true positives divided by the sum of the false negatives and true positives. A high R value is desired:
$$R = \frac{TP}{TP + FN} \quad (9)$$
4) The F-measure is the harmonic mean of P and R, which has found widespread use in information retrieval and other supervised machine learning tasks. High F values are desired; F is defined in Equation (10):
$$F = \frac{2 \cdot P \cdot R}{P + R} \quad (10)$$
5) The running time $T_{run}$ is the time taken to run a single experiment from start $T_{start}$ to finish $T_{finish}$:
$$T_{run} = T_{finish} - T_{start} \quad (11)$$
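The metrics listed above follow directly from the four confusion-matrix counts. The following minimal sketch computes them in Python; the function name, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions for illustration and are not results from the experiments.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F-measure from raw counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 100 EFs (90 detected, 10 missed) and 900 MFs
# (all correctly rejected), so precision is 1.0 and recall is 0.9.
metrics = classification_metrics(tp=90, tn=900, fp=0, fn=10)
```

With zero false positives the precision equals 1 regardless of how many EFs are missed, which is why recall and the F-measure are reported alongside it.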

B. EXPERIMENTAL RESULTS OF THE SWITCH-SIDE EF DETECTION
In this subsection, we compare the CM sketch method with C4.5 in terms of accuracy, precision, recall, F-measure, and running time. We then discuss the trade-off of the CM sketch. Table 1 compares the results of the CM sketch [39] and the work proposed in [50] on the MAWI dataset. The results show that the CM sketch method runs eight times faster than C4.5. Moreover, a value of P = 1 indicates zero false positives, and R reaches up to 90%. Our method also outperforms the C4.5 method in accuracy by up to 2.67%. The accuracy improves because the CM sketch estimates the sum of a flow from the hashed bucket with the smallest counter value, which is the bucket least inflated by hash collisions. Whether a flow is a heavy hitter can be determined by checking whether its estimated sum exceeds a certain threshold. The absolute change of a flow over two epochs can similarly be used to verify whether the flow is a heavy hitter.
Moreover, our proposed switch-side CM sketch prefilter operates in real time and has low overhead. The primary rationale behind the proposed technique is to ensure that the switches and the controller share the EF classification task, such that the majority of the MFs are filtered out by the CM sketch on the switches and the number of classification requests to the controller is significantly reduced. Table 2 demonstrates how our proposed flow detection method differs from other similar techniques.

2) SKETCH TRADE-OFF
The CM sketch uses hash functions to count frequencies in sub-linear space, storing the number of occurrences of each element in a stream in a d × w matrix of counters. The parameters d and w determine the trade-off between the accuracy and the space/time constraints. Each of the d rows has an associated hash function. An arriving element is hashed by the function of each row, and the corresponding counter in that row is incremented by 1. Furthermore, the CM sketch solution might be slightly slower because it waits until the sketch has collected an adequate number of packets to form an aggregate to send to the controller side. This time overhead is noticeable if packets arrive at extremely irregular intervals.
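The update and query operations described above can be sketched in a few lines of Python. This is a generic count-min sketch and not the extended sketch used on the switches; the class name, the use of a salted BLAKE2b digest to emulate one hash function per row, and the flow keys and threshold in the example are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    """Generic d x w count-min sketch (illustrative, not the paper's extension)."""

    def __init__(self, depth: int, width: int) -> None:
        self.depth = depth  # d: number of rows, one hash function per row
        self.width = width  # w: number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # Salting the digest with the row number emulates d distinct hash functions.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, key: str, count: int = 1) -> None:
        """Increment one counter per row for the arriving element."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        """Return the smallest of the d counters, which never underestimates."""
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))

    def is_heavy_hitter(self, key: str, threshold: int) -> bool:
        """Flag the flow as an EF candidate when its estimate reaches the threshold."""
        return self.estimate(key) >= threshold


# Hypothetical stream: one flow with 500 packets and 50 single-packet flows.
sketch = CountMinSketch(depth=4, width=64)
for _ in range(500):
    sketch.update("10.0.0.1->10.0.0.2")
for i in range(50):
    sketch.update(f"mouse-{i}")
elephant_estimate = sketch.estimate("10.0.0.1->10.0.0.2")
```

Because collisions can only add to a counter, the estimate is never smaller than the true count; increasing w reduces the size of the overestimate, and increasing d reduces the chance that every row is affected, at the cost of more memory and more hash computations per packet.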

C. EXPERIMENTAL RESULTS OF THE CONTROLLER-SIDE EF DETECTION
In this subsection, we present the results of a set of experiments conducted to validate the performance of our proposed method. First, we present the performance of the controller-side EF detection for the SDN network. We then compare it with other methods in terms of accuracy, precision, recall, F-measure, and running time.

1) CLASSIFICATION ACCURACY
Accuracy is one of the essential classifier metrics. To evaluate our EF detection method, we tested its classification accuracy with several training data sizes ranging from 10,000 to 50,000 on the MAWI dataset. Figure 7 shows the accuracy of our proposed classification for various training sizes. We observed that our EF detection method on the controller side achieves a higher accuracy than the existing EEFD method [21] by up to 0.7%, and than the classification-based EDMAR [22] by up to 0.5%. Figures 8(a) and 8(b) illustrate the precision and recall of our method compared to EDMAR [22], FlowSeer [23], and the Bayes network (BayesNet) [51]. Our method performs better in terms of accuracy, precision, and recall because the controller-side classifier becomes more accurate with an increase in the number of features used. Furthermore, the improvement in the metrics is partly due to the efficiency of the algorithm, early detection, and proper selection of features from those accessible on commodity switches.
In our experiments, the EF detection applied to SDN achieves a recall rate of up to 98.3%. This high recall rate suggests that our method detects most EFs and that only a few EFs are misidentified as MFs. A comparison of the F-measure between our method and other existing methods is shown in Figure 9. It can be observed that the F-measure of our method is over 96.1%, a significant improvement over the other methods.

2) VFDT PERFORMANCE COMPARISON
In terms of the amount of time required by the different methods to detect EFs in SDN, our proposed method detects EFs within the shortest time among the four methods (see Figure 10). Furthermore, our method takes less than a second to detect 10,000 flows, which is adequate for filtering and detecting EFs in the SDN network. The results show that the controller-side classifier achieves a better running time because it is lightweight and does not store any dataset points in memory, making it ideal for the detection of EFs on the controller side. The decision tree model is built incrementally from scratch, which helps to detect EFs at any point. Whenever a new data section arrives, the testing and training phase is carried out to keep the stored data up to date. The classifier does not need to read the entire dataset; instead, it updates the decision tree with the latest incoming and collected statistical attributes, thereby consuming less memory. Furthermore, the switch-side CM sketch classifier filters out about 80% of the flows as MFs, so that only the remaining non-MF candidates, about 20%, are passed to the controller side, further decreasing the controller-side load. These features make the VFDT
Table 3 shows the overall performance of our proposed method. We compared the performance of our EF detection method with that of the EDMAR method [22], FlowSeer [23], and the EEFD method [21]. The experiments were conducted on the MAWI dataset [52], the UNI1 dataset [49], and the UNI2 dataset [49]. We found that our method performs better than the other methods in recall and precision. Moreover, our method also performs better than the other existing methods in terms of F-measure, indicating that our approach achieves a better balance between precision and recall. The results also show that this study achieves a higher accuracy and a better running time.
The EF detection classifier was trained using a training flow, as described in Table 4, which summarizes the experimental results. The table shows the ability and efficiency of our method to detect EFs at the flow level, with an extremely small FN and a high TP for all tests conducted on the switch and controller sides. These results arise because the CM sketch and VFDT methods apply traffic classification differently. By definition, EFs are specified on flows based on their duration and intensity (bandwidth), whereas the packet data carry no information regarding the flow duration. Thus, the CM sketch creates an aggregate of packets to approximate the flow.
However, neither the CM sketch nor the VFDT has any long-term memory. Hence, to identify an EF, it is necessary to estimate the duration by sampling the CM sketch and to use it as an input to the VFDT classifier. Moreover, the improved metrics are due to the efficiency of the lightweight algorithm used on the switch side to filter out most of the flows that are unlikely to be EFs. The experiments showed that the CM sketch algorithm is efficient at estimating the frequencies of candidate EFs, with fast update and query times and low space usage. Finally, the switch side forwards only the remaining potential EFs to the controller. The combined CM-VFDT greatly reduces the controller-side load. This technique proves to be a suitable candidate for introducing an autonomous decision-maker for the detection of EFs in SDN networks.
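The division of work between the switch side and the controller side can be illustrated with a minimal two-stage sketch. The function name, the dictionary of per-flow sketch estimates, the candidate threshold, and the stand-in controller rule are all hypothetical; in the proposed system the second stage is the VFDT classifier, which is replaced here by a simple callable.

```python
from typing import Callable, Dict


def two_stage_detect(flow_estimates: Dict[str, int],
                     candidate_threshold: int,
                     controller_classify: Callable[[str, int], bool]) -> dict:
    """Filter on the switch side, then classify only the forwarded candidates."""
    # Stage 1 (switch side): flows below the threshold are treated as MFs
    # and are never sent to the controller.
    candidates = {flow: est for flow, est in flow_estimates.items()
                  if est >= candidate_threshold}
    # Stage 2 (controller side): only the candidates are classified.
    elephants = [flow for flow, est in candidates.items()
                 if controller_classify(flow, est)]
    forwarded_ratio = (len(candidates) / len(flow_estimates)
                       if flow_estimates else 0.0)
    return {"elephants": elephants, "forwarded_ratio": forwarded_ratio}


# Hypothetical estimates: eight small flows and two EF candidates.
estimates = {f"flow-{i}": 5 for i in range(8)}
estimates["flow-8"] = 600
estimates["flow-9"] = 1500

# Stand-in for the controller-side classifier (assumption, not the VFDT itself).
result = two_stage_detect(estimates, candidate_threshold=500,
                          controller_classify=lambda flow, est: est >= 1000)
```

In this toy input, only two of the ten flows reach the controller, which mirrors the roughly 80%/20% split reported above while keeping the final decision on the controller side.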

V. CONCLUSION
This paper presented a framework that can contribute to real-time traffic flow management in SDN networks. The proposed EF detection combines the switch-side extended CM sketch and the controller-side VFDT classifier to provide real-time EF detection for effective and efficient SDN traffic flow management. The CM-VFDT classifiers can concurrently perform EF detection on commodity OpenFlow-enabled switches and the SDN controller. To detect EF candidates as heavy hitters, the fast and lightweight CM sketch classifier is used on the switches. The sketch extension focuses on four flow features: delay (round-trip time), IP addresses, throughput, and packet count. The experiments were implemented in a Mininet simulation using OpenvSwitch as the OpenFlow switch managed by the Ryu controller. Real traffic datasets, namely MAWI, UNI1, and UNI2, were used to measure the improvement in controller running time, accuracy, F-measure, and recall. Our experimental results show that the proposed EF detection method can achieve up to 98.13% accuracy, with a higher recall rate and F-measure and a better running time than the other works in the comparative study.
Our directions for future work include a performance evaluation of the proposed algorithm in a broader orchestration context. With the growth of SDN deployment expected in the near future, on-line flow classification will become more significant. Other applications that can benefit from flow classification include intrusion detection, load balancing, and bandwidth brokerage, for which the framework may need adaptation to accommodate data from various sensors.
SULEMAN KHAN received the Ph.D. degree (Hons.) in computer science and information technology from the Universiti Malaya, Malaysia, in 2017. He was a Faculty Member of the School of Information Technology, Monash University, Malaysia, from June 2017 to March 2019. He is currently a Faculty Member of the Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne, U.K. He has published more than 50 high-impact research articles in reputed international journals and conferences. His research areas include, but are not limited to, network forensics, software-defined networks, the Internet-of-Things, cloud computing, and vehicular communications.
M. AKHTAR ALI received the Ph.D. degree in computer science from Manchester University, in 2003. He is currently a Faculty Member of the Department of Computer and Information Sciences, Northumbria University, U.K. He served with this university for more than 19 years and has been involved in various research projects. His research interests include data analytics, databases, and machine learning.
MUHAMMAD IMRAN (Member, IEEE) received the Ph.D. degree in information technology from the Universiti Teknologi PETRONAS, Malaysia, in 2011. He is an Associate Professor with the College of Applied Computer Science, King Saud University, Saudi Arabia. His research was financially supported by several grants. He has completed a number of international collaborative research projects with reputable universities. He has published more than 200 research articles in peer-reviewed, well-recognized international conferences and journals. Many of his research articles have been highly cited and frequently downloaded. His research interests include mobile and wireless networks, the Internet of Things, big data analytics, cloud computing, and information security. He was consecutively awarded as an Outstanding Associate Editor of IEEE ACCESS, in 2018 and 2019, in addition to numerous other awards. He has been involved in approximately 100 peer-reviewed international conferences and workshops in various capacities such as a chair, co-chair, and technical program committee member. He served/serving as a Guest Editor for approximately two dozen special issues in journals such as the
He is an Associate Professor in electronics and computer engineering with the Faculty of Engineering, School of Electrical Engineering, Universiti Teknologi Malaysia. His research interests are in many-core system-on-chips, network-on-chip interconnects, domain-specific computer architectures, network processing algorithmics, and network processing accelerators.
VOLUME 8, 2020