InSDN: A Novel SDN Intrusion Dataset

Software-Defined Networking (SDN) was developed to reduce network complexity by controlling and managing the whole network from a centralized location. Today, SDN is widely implemented in many data center network environments. Nevertheless, the emerging technology itself introduces many vulnerabilities and threats that remain challenging for manufacturers to address. Therefore, deploying Intrusion Detection Systems (IDSs) to monitor malicious activities is a crucial part of the network architecture. Although the centralized view of the SDN network creates new opportunities for the implementation of IDSs, the performance of these detection techniques relies on the quality of the training datasets. Unfortunately, there are no publicly available datasets that can be used directly for anomaly detection systems applied in SDN networks. The majority of the published studies use incompatible and outdated datasets, such as the KDD’99 dataset. This manuscript aims to generate an attack-specific SDN dataset and make it publicly available to researchers. To the best of our knowledge, our work is one of the first solutions to produce a comprehensive SDN dataset to verify the performance of intrusion detection systems. The new dataset includes benign traffic and various attack categories that can occur in the different elements of the SDN platform. Further, we demonstrate the use of our proposed dataset by performing an experimental evaluation using eight popular machine-learning-based techniques for IDSs.


I. INTRODUCTION
In conventional distributed networks, both the decision-making functionality (the control plane) and the forwarding of network traffic (the data plane) are implemented within the network devices (e.g., routers or switches). Network operators configure traffic policies (e.g., routing, switching, quality of service) on each device independently.
The associate editor coordinating the review of this manuscript and approving it for publication was Weipeng Jing.
Recently, SDN has come to prominence as a way to solve the inherent problems of conventional distributed networks. The key benefit of SDN is that it makes the network more flexible and easier to manage by decoupling the control plane from the data plane. Thus, the new paradigm can control the entire system from a centralized remote device named the controller. The benefits of SDN encourage many commercial and industrial companies to deploy SDN solutions in their network environments for several reasons, including:
• Separating the control plane from the data plane facilitates network system management. Besides, the network becomes easier to change or update, which reduces human mistakes.
• IT administrators can deploy network devices or upgrade the network infrastructure easily without being tied to a specific vendor.
• The centralized design allows the SDN controller to provide a global view of the whole network.
• Developers can deploy various applications in the upper layer of the SDN system to perform network services in a virtual environment [1].
• The underlying infrastructure devices do not need any programming language. As a result, the operating cost decreases significantly compared to the conventional network.
These enormous benefits are making the SDN market grow continuously. It reached more than $9.5 billion at the end of 2019 [2], and this value is expected to reach $13.8 billion by 2021, as shown in Fig. 1.
FIGURE 1. SDN market size prediction [2].
Despite the numerous benefits of SDN technology, SDN is susceptible to new security threats that attackers can exploit to perform different malicious tasks. If the attacker successfully accesses the SDN controller, the whole system can be exposed to critical threats. Therefore, deploying IDS techniques to detect anomalies in the SDN network traffic is an essential part of the network architecture. Generally, IDSs follow one of two approaches: signature-based or anomaly-based. While signature-based detection is widely used in commercial products due to its high detection rate and low false alarm rate, it fails to discover the new or unknown network attacks that appear daily. In contrast, anomaly-based detection has gained the attention of many academic researchers due to its ability to discover novel attacks.
Despite existing work on anomaly detection systems for SDN networks, there are still many challenges in developing efficient IDSs for SDN. One of the significant challenges is that there is no public dataset that is generated directly from SDN networks and can be used for training and evaluating anomaly detection systems. Most of the research community uses intrusion detection datasets that were generated for conventional networks. However, the virtualized behavior of SDN makes the network susceptible to new attacks, which are different from those found in conventional networks.
Although some previous efforts [3]-[8] have tried to simulate the SDN network and generate an acceptable dataset, the existing datasets outline only a few types of attacks, i.e., they focus on DoS/DDoS threats without considering the other attack classes that exist in the SDN network. In addition, these datasets describe intrusions that can be generated in one element of the SDN network without representing attack vectors in different SDN layers. In this work, we address the lack of available SDN datasets by generating a comprehensive dataset that contains full network traces and reflects Internet traffic. We consider the common attack classes in conventional networks, as well as the new attacks that arise in SDN due to its centralized design. The ultimate goal of this work is to create a public dataset that can be used to evaluate IDSs for the SDN environment. The contributions of this paper are summarized as follows:
• Reviewing and classifying attacks in different SDN layers.
• Studying the limitations of the existing IDS datasets.
• Proposing a virtualized network testbed to generate a new SDN dataset, namely InSDN.
• Generating a significant dataset that covers various attacks that can be found in all SDN elements from the proposed network testbed. Further, the impact of the generated attacks on the different elements of SDN is reviewed. This can help researchers identify potential security holes and propose countermeasures accordingly.
• Demonstrating how to use the new dataset with popular Machine Learning (ML) techniques applied in anomaly detection systems for the SDN network.

II. BACKGROUND
A. LITERATURE REVIEW
This section reviews the existing publicly available datasets generated from conventional networks. These datasets are widely used for intrusion detection in conventional networks, and they have also been used to evaluate ML algorithms designed for anomaly detection in SDN networks.
• KDD'99 [9], [10]: one of the most well-known datasets, which is widely used for evaluating intrusion detection systems. KDD [20]. The dataset has 288602 missing class labels and 203 instances with missing information. In addition, the CICIDS 2017 dataset is extremely large and contains many redundant records that seem to be irrelevant for any IDS training.
• CSE-CIC-IDS2018 [21]: The dataset is the result of a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). It is similar to CICIDS 2017, but it was implemented on the AWS (Amazon Web Services) computing platform. The notion of profiles is used to generate the dataset in a systematic manner. The dataset has two general classes of profiles: B-profiles are used to generate the normal traffic, and M-profiles are used for attack scenarios. The dataset covers the same attack scenarios as the CICIDS 2017 dataset. However, it suffers from the same inherent problems as CICIDS 2017, as well as from the use of synthetic traffic.
In addition to the datasets described above, many data repositories have been published to cover various security domains, such as botnets [22], [23], malware [24], [25], port scans [26], etc. Because the structure and type of those repositories are different, we exclude them from our comparison. More details about the descriptions and properties of these datasets can be found in [27], [28].
An important note is that although all the datasets described above are commonly used for IDS research on SDN-based networks, they were not generated from SDN platforms. This causes a compatibility problem, since the conventional network and SDN are different in nature. In addition, each dataset has its own requirements for security systems, and benchmark datasets should be adapted to the specific environment [27]. This means that the deployment of the attack vectors should consider the new architecture. Besides, each dataset emphasizes different properties. For example, some datasets represent certain attack types, such as DDoS attacks, while other datasets, such as the ISCX2012 dataset, are concerned with label accuracy.
It should also be noted that each attack has different working principles. For example, "IPsweep" and "Portscan" attacks are not considered DDoS attacks by conventional intrusion detection techniques [29]. However, these attacks can be used to generate an extensive amount of network flows and exhaust the resources of SDN components (e.g., the bandwidth of the southbound interface gets saturated). In addition, decoupling the control plane and data plane brings some new threats that are unique to SDN. Thus, selecting improper features can significantly degrade the performance attainable by most well-known classifiers. For illustration, Santos et al. [30] demonstrated that SDN controller attacks yield the worst classification results across different machine learning algorithms. This is because some of the important features used to detect the new SDN attack types are similar to normal traffic patterns due to the unique SDN architecture. For example, the flow duration, which represents the time a connection stays alive (in nanoseconds), is equal to that of normal flows in the case of SDN controller attacks. Although this attribute is widely used in conventional networks to detect different attack types such as bandwidth attacks, it is ineffective for detecting SDN controller attacks.
The comparison between the public datasets and InSDN dataset is described in Table 1, while the information of attack types and their used tools are reported in Table 2.

B. COMPARISON OF EXISTING TESTBEDS WITH PROPOSED MODEL
We created the InSDN testbed to generate a benchmark dataset for SDN. This section compares the InSDN testbed with the existing methods in the literature. A testbed is an environment designed to incorporate real network facilities and real traffic [31].
Although a significant number of samples were generated for normal and attack traffic, the collected data is limited to DDoS flooding attacks.
Amaral et al. (2016) created a testbed with a small topology to collect network traffic data using the OpenFlow protocol for ML-based traffic classification solutions [33]. The authors used the HP VAN SDN controller with a single HP E3800 OpenFlow-enabled switch to generate their dataset. The switch is connected to the non-SDN network to receive copies of the upstream-link traffic through the mirroring port. Two different datasets were created from the testbed to represent various traffic applications such as YouTube, Vimeo, Facebook, LinkedIn, etc. The first dataset is relatively small and was labeled under a controlled environment to represent traffic from eight different applications. The second dataset is unlabeled and contains all traffic data generated from the monitored room. However, the collected data covers only normal application traffic without any representation of attack scenarios. A dataset for intrusion detection should cover both normal and malicious traffic.
Ajaeiya et al. (2017) used the RYU SDN controller with a single Open vSwitch (OVS) for experimental purposes [34]. The authors used publicly available PCAP files, which were collected from different experiments. The TCP Replay tool was used to replay the network traffic into the SDN network, while the Wireshark tool was used to capture the traffic samples. The authors successfully collected 16,624 and 36,654 samples for normal traffic and attack traffic, respectively. The attack samples include the Brute Force credential attack, TCP DoS, ICMP Flood, and port scan traffic. However, their work mainly focuses on replaying previously captured traffic rather than on actual traffic generation.
In 2018, Cheng et al. created a testbed network topology using the Mininet tool on an Ubuntu server [5]. Five hosts are used to create network traffic: two hosts act as bots, two other hosts are dedicated to normal traffic, and the last host represents the victim machine. The Hping3 tool is used to generate different types of flooding attacks such as ICMP flood, UDP flood, and TCP SYN flood. The same tool is also used to create the normal traffic. More than 30000 samples are collected to train the ML-based models. However, their proposed work is limited to DDoS flooding attacks only.
In 2018, Prakash et al. used the Mininet tool to build a topology from four virtual hosts and two virtual switches in order to generate a dataset for ML classification purposes [3]. TCPDump tool is used to collect the network traffic, while the Hping3 tool is utilized to simulate DDoS attacks. In their work, 2000 and 4000 samples are collected for normal and attack traffic, respectively. However, this work focused mainly on DDoS attacks without any consideration of other attacks that can happen in the SDN network.
In 2019, Santos et al. created an SDN testbed to generate the attack traffic dataset for analyzing the performance of some ML techniques [30]. The SDN network was simulated using the POX controller and Mininet tool. Scapy, a packet generation tool, is utilized to generate both normal and malicious traffic. The network topology is composed of a single OpenFlow switch and six hosts. The same hosts are used to generate normal and malicious traffic in two different experiments. The normal traffic represents HTTP and ICMP traffic only, while the malicious traffic is limited to DDoS attacks. The data flow size is 20000 samples, with 10000 for each traffic type.
Similar to the previous work, the SDN testbed in [6] is created to simulate two types of flooding attacks: UDP flooding and SYN flooding. The Scapy tool is used to create 2000 samples for each attack type. Miniedit (a GUI editor for the Mininet emulator) is used on an Ubuntu server to create different virtual hosts for normal and attack scenarios. The authors used Scapy to generate both normal and attack traffic. However, their work is limited to DDoS attacks only.
In 2020, Polat et al. created a testbed using six virtual hosts, two virtual machine (VM) switches, and one OVS switch to generate a normal and malicious dataset for ML training purposes [35]. They used an Ubuntu 18.04 server with 1 GB RAM and 1 CPU on VirtualBox-KVM. The generated dataset has 65,000 samples for DDoS attacks and 64,000 samples for normal traffic, with 12 feature attributes and one class label. The Hping3 tool was used to generate three types of DDoS attack traffic: TCP, UDP, and ICMP flooding. Again, the generated dataset is limited to DDoS attacks only.
Hence, it is clear from the studies mentioned above that there have been no real attempts to generate a comprehensive dataset for the SDN environment. The current works focus only on creating a dataset that can assist researchers in deploying ML techniques to analyze and detect security problems in one element of the SDN network. One of the significant shortcomings of these methods is that the simulated data is limited to one or two activities (mostly DoS/DDoS attacks) without considering the various attack types that can occur in SDN networks. In addition, these works suffer from several concerns, such as outdated data, lack of representation of modern attacks, data corruption, inconsistencies, and limited traffic variety.
The main differences between the previous testbeds and the InSDN testbed are the attack variety and the realism of the traffic traces. We used Kali Linux to perform various attack scenarios and create different attack classes such as DoS, DDoS, Web attacks, Password-Guessing, Botnet, Exploitation, and Probe attacks. Furthermore, the normal traffic covers various popular application services that were not represented in the previous works except [33].

C. ATTACK VECTORS IN SDN ELEMENTS
The centralized design of the SDN architecture introduces new vulnerabilities that expose the SDN network to various types of security threats [36]. In fact, all SDN layers are unavoidably susceptible to different types of attacks. Some of these attacks are specific to SDN, i.e., they result from separating the data and control plane functionality. These attacks can occur in the SDN controller or on the communication channels between the control and data plane devices. Besides, there are various attacks that are common to SDN and conventional networks, i.e., the attacks on the application layer or data plane elements. While some attacks are frequent and have a mild or moderate impact on conventional networks, their impact is escalated in SDN. For example, when an attacker successfully gains unauthorized access to a vulnerable machine or application in a conventional network, only a single machine or a small portion of the network is affected. The attacker needs to escalate his privileges or use the victim machine to start new attacks against different machines or subnets [37]. In SDN, by contrast, an attacker who compromises the controller can affect the whole network. Therefore, there is a need for different mitigation techniques to deal with these attacks. Kreutz et al. (2013) defined various attack vectors that can tamper with the SDN architecture [38], and [39]-[44] consider the security issues in OpenFlow. In this section, we outline the attack vectors that have a critical impact on different elements of SDN. Figure 2 summarizes the attack vectors inside the SDN network. We classify the main attacks against the SDN network into four vectors as follows:
1) Attacks on the data plane. The network elements themselves can be a target of the intruder. The attacker can gain unauthorized access to vulnerable hosts in the SDN network to initiate different attacks. Besides, the attacker can generate malicious traffic, using a host machine or a connected switch, to flood the network components. The main goal of these attacks is to consume the controller resources or the flow-table space of any OpenFlow switch. In addition, the attacker can damage the network resources by deploying a fake switch in the SDN network in order to divert the network traffic or steal information. The intruder can manipulate the flow entries of an OpenFlow switch to reroute legitimate network traffic. Furthermore, he can use the fraudulent switch to produce forged requests to overwhelm the controller or to slow down the network traffic. Additionally, the virtualized view of the SDN network encourages enterprise administrators to implement software switches, such as the OVS switch, in their network infrastructure. Because virtual switches are software-based and run on host servers, they can also be a target for attackers. In contrast, it is significantly more difficult to physically compromise the hardware switches in a conventional network and modify their forwarding tables.
2) Attacks on control plane communication. In the SDN network, the controller manages the data plane devices through communication channels. Logically, each device has a separate channel with the controller, but physically, all these channels share the same physical link. Running a flooding attack from spoofed sources can cause congestion on the channel links. Consequently, breaking down the communication between the controller and the data plane elements can isolate the SDN controller from all network elements. Furthermore, the attacker can exploit the trust between the OpenFlow switches and the controller to launch a man-in-the-middle attack, sniff valuable information, or gain full access to the control plane [45].
3) Attacks on the SDN controller. The controller acts as the brain of the SDN network. Gaining access to or bringing down the SDN controller can consequently disrupt the whole system. In addition, the controller inherits the vulnerabilities of the operating system on which it is installed. In some cases, the attacker can use his own controller and forward the traffic of the nodes based on his own setup. Furthermore, the attacker can control the whole network and create his own policy if he successfully exploits a vulnerable northbound API (i.e., the API that resides between the controller and the application layer).
4) Attacks on the application plane. The attacker can run a malicious application to violate the security policy or to bypass firewall and IDS applications.
Note that attack vectors 2 and 3 are specific to SDN, resulting from decoupling the data and control planes, while attack vectors 1 and 4 are common to both SDN and conventional networks.

D. ATTACK PHASES
The main objective of an attacker is to control the network system by gaining unauthorized access to network resources. He can then steal vital information or disturb the network operation, causing damage to the entire system. There are five attack steps that can be performed by malicious intruders, as follows:
1) Reconnaissance: This is the first step before the attacker initiates his attack. In this phase, the attacker gathers information about the target system, such as IP addresses, operating system versions, running applications, etc.
2) Scanning: The attacker uses the information collected in the reconnaissance phase to discover the system vulnerabilities. Consequently, he can perform different attack scenarios against the target system.
3) Gaining Access: In this phase, the attacker exploits the existing vulnerabilities to gain control of the system. There are several methods to access the target system (e.g., buffer overflow, password cracking, and session hijacking). Once the attacker successfully obtains access to the target system, he can escalate his privileges to gain full access to the victim machine.
4) Maintaining Access: The attacker keeps his system access by installing remote shell connections using Trojans, backdoors, rootkits, etc. He can employ the compromised system for different purposes, such as stealing vital information or starting a new attack against different systems.
5) Clearing Tracks: After gaining access to the target machine, the attacker works to hide any malicious activities in order to avoid detection (e.g., by deleting the system logs).
In this research work, the aforementioned attack steps were carefully examined to generate a more realistic dataset for IDS. In addition, we studied the previous work in [17], [18], [46]-[48] to generate a comprehensive dataset and to take into consideration the setup of the new environment and the different attack methodologies inside the SDN network.

III. PROPOSED SDN ARCHITECTURE
[50], [51] to generate legitimate and malicious network traffic. Mininet is widely used by researchers to create a realistic virtual network with virtual switches, hosts, and links on a single Linux kernel. Further, we used the Damn Vulnerable Web Application (DVWA) software to represent a PHP/MySQL web server for a better description of different attacks inside the SDN network. The DVWA is installed independently of the operating system, using Docker containers on the same OVS host. The OVS switch is configured to function as an L3 switch by combining the OVS software with Linux kernel routing. In this case, all the virtual hosts can communicate with each other across different subnets. Fig. 4 shows the logical topology of the virtual testbed, and Fig. 5 shows its configuration. The following process indicates how to map from L2 switching to L3 switching using the OVS switch.
• Install OVS switch and Mininet software on the same VM.
• Create four adapters in the OVS-VM to represent four different network subnets. In our setup, the created interfaces were named ens38, ens39, ens40, and ens41.
• Create two OVS bridges on the same OpenFlow switch named br1 and br2.
• Assign each data plane interface to its proper bridge. We assigned ens40 to br1 and ens38 to br2. In addition, we assigned the ens41 interface to the S1 bridge, which is created by default on the OVS switch.
• Remove the IP address from each data plane interface or set it to zero. The removed IP address is then assigned to the corresponding bridge. For example, we remove the configured IP address from the ens40 interface and assign it to its connected bridge (br1). The same configuration is performed for the br2 and S1 bridges.
• Connect the Kali Linux VM with the same adapter of br1, and Metasploitable2 Server with the same adapter of br2.
• Enable IP forwarding on the OVS Linux machine.
• Create a Mininet topology that contains four virtual hosts (h1 to h4). The virtual hosts of Mininet are attached to S1 Bridge. The configuration of S1 bridge is similar to previous setups of br1 and br2. We add the IP address of S1 bridge as a default gateway for each virtual host in Mininet topology.
• Connect the ONOS controller to all created bridges (br1, br2, and S1).
• Now, we are able to ping between all hosts in different subnets.
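The bridge-related steps above can be summarized as a short command plan. The following is a minimal sketch, not the authors' actual script: it only builds the `ovs-vsctl`, `ip`, and `sysctl` command strings without executing them. The gateway addresses, the controller endpoint, and the helper names `Bridge` and `l3_setup_commands` are assumptions introduced for illustration; the paper does not list the real addressing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str       # OVS bridge name, e.g. "br1"
    interface: str  # data plane interface attached to the bridge
    cidr: str       # address moved from the interface to the bridge (hypothetical)


def l3_setup_commands(bridges, controller):
    """Return shell commands that mirror the L2-to-L3 mapping steps."""
    cmds = []
    for b in bridges:
        # Create the bridge and attach its data plane interface.
        cmds.append(f"ovs-vsctl --may-exist add-br {b.name}")
        cmds.append(f"ovs-vsctl --may-exist add-port {b.name} {b.interface}")
        # Remove the IP address from the interface and assign it to the bridge.
        cmds.append(f"ip addr flush dev {b.interface}")
        cmds.append(f"ip addr add {b.cidr} dev {b.name}")
        cmds.append(f"ip link set {b.name} up")
        # Point the bridge at the SDN controller.
        cmds.append(f"ovs-vsctl set-controller {b.name} {controller}")
    # Let the Linux kernel route between the bridge subnets.
    cmds.append("sysctl -w net.ipv4.ip_forward=1")
    return cmds


if __name__ == "__main__":
    # All addresses below are placeholders, not values from the paper.
    plan = l3_setup_commands(
        [
            Bridge("br1", "ens40", "192.168.40.1/24"),
            Bridge("br2", "ens38", "192.168.38.1/24"),
            Bridge("S1", "ens41", "192.168.41.1/24"),
        ],
        controller="tcp:192.168.39.10:6653",
    )
    print("\n".join(plan))
```

Printing the plan instead of running it keeps the sketch safe to try on any machine; on a real OVS host the commands would need root privileges and the addressing of the actual testbed.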

IV. METHODOLOGY FOR DATA GENERATION
A. DATASET ATTACK SCENARIOS
This section presents our approach to generating the SDN network traffic data by using different attack scenarios. The centralized view of the SDN network and the separation of the data plane from the control plane create new opportunities for the attacker to carry out various types of attacks compared to the conventional network. The nature of these attacks in SDN is different from those commonly affecting the conventional network [52]. For example, the attacker can generate new malicious traffic to attack the SDN controller or even the communication links between the SDN controller and OpenFlow switches. Furthermore, compromised users can be employed to start new attacks after the traffic flow is established. Besides, the SDN applications can have different vulnerabilities such as buffer overflow, command injection, SQL injection, etc. These vulnerabilities create attack opportunities and help the attacker bypass the authentication mechanism or gain access to the controller by installing a malicious script. If the attacker successfully gains access to the controller, he can start new attacks such as manipulating flow rules, launching DoS attacks, and eavesdropping on the data/control traffic. Table 4 lists the attack classes and the tools used in this virtual environment, as well as the source attack machine and the victim device IPs.
1) DoS attacks: This is one of the most common attacks inside the SDN architecture. It not only damages the victim machine but can also overwhelm the SDN controller resources in a short time. Since the SDN controller is the brain of the SDN network, in the case of DoS attacks the whole system becomes unavailable for legitimate users; the attack turns the entire network into a 'body with no brain'. A DoS attack can flood the victim machine with a huge number of spoofed packets that match no rules in the flow tables of the switches.
Thus, the OpenFlow switch sends these flows to the SDN controller in the form of Packet-In messages for further processing. When the Packet-In message rate exceeds a certain limit, the SDN controller resources can be overwhelmed by a large number of unprocessed packets. There are two main types of DoS attacks [53]:
• Network DoS attacks: The main objective of these attacks is to deny service to benign users by flooding the network bandwidth or the victim machine with a large number of spoofed packets. The attacker often uses different protocols such as UDP, TCP, or ICMP. DoS attacks can also disturb the SDN controller or its channels due to the significant number of packets forwarded to the controller.
VOLUME 8, 2020
• Application DoS attacks: Although these attacks do not require high bandwidth, they can cause serious damage to the target server and consume its resources in a short time. They mainly target the application layer or services such as HTTP. An application-layer attack is not easy to detect, since the intruder is connected to the victim server in an authorized manner.
The InSDN dataset includes several types of DoS attacks that can be launched at different OSI model layers. Kali Linux is used to carry out various DoS attacks against a victim web server, which is represented by the h4 virtual host. Several DoS attacks such as TCP, UDP, and HTTP flood attacks are executed using the Low Orbit Ion Cannon (LOIC) tool. Further, we implemented different slow-rate application DoS attacks, such as Slowloris and slow-rate HTTP POST, using the slowhttptest, HULK, and torshammer tools. In addition, we carried out the TCP and Slowloris based DoS attacks using the Metasploit framework on Kali Linux against the Metasploitable 2 server.
2) DDoS attacks: The InSDN dataset also includes several DDoS attack scenarios such as TCP-SYN flood, UDP flood, and ICMP flood attacks. The Hping3 tool, which is considered one of the most popular publicly available tools, is used for the DDoS attacks, where the attacker machines are h1 and h2, and the victim machines are the h4 web server and the Metasploitable 2 server.
3) Password-Guessing Attacks: These attacks aim to obtain access to the victim machine by breaking the username and password credentials. Two different scenarios of Password-Guessing Attacks are considered in the InSDN dataset. In the first scenario, a dictionary attack is performed by creating a dictionary of possible usernames and passwords and then trying each of them. The attacker machine is Kali Linux, and the victim server is the DVWA web server. The Burp Suite and Hydra tools are used to launch this attack to get the username and password credentials.
In the second scenario, we use an auxiliary scanner tool from the Metasploit framework to discover valid credentials on the Apache Tomcat web server, which runs on the Metasploitable 2 server.
4) Web application attacks: Based on the Symantec report in 2018 [54], one in ten analyzed URLs were vulnerable with malicious code, with a 56% increase compared to 2017. For web application attacks, we implemented the most frequent ones, such as the cross-site scripting (XSS) attack and SQL injection.
• XSS attack: The attacker can bypass the access controls of the client machine by injecting malicious code into a trusted website. Once the client accesses the web application site, the malicious script is executed. As a result, the attacker can obtain sensitive information from the client machine, such as session tokens, cookies, and so on. We tried to gain shell access by preparing a malicious PHP file and uploading it to the vulnerable web server. The msfvenom tool, which combines the Msfpayload and Msfencode tools into a single framework, is used to create the PHP code. Once the client starts to access the vulnerable web server, the uploaded PHP file is executed. As a consequence, the attacker machine can access the infected client using a reverse connection.
• SQL injection attack: The attacker can use malicious queries to manipulate the database behind the web application, allowing the attacker to retrieve the content of the entire SQL database. The attacker can obtain unauthorized access to any web application or sensitive data on the website. The SQL attack in the InSDN dataset is executed using an automatic SQL injection tool (sqlmap) against the DVWA web server. The Burp Suite tool is used to capture the user cookies, which are needed during the SQL injection attack.

5) Probe attacks:
This is the most essential phase for an attacker before starting an attack. The attacker scans the target system to discover information that can assist in exploiting the remote system.
We also demonstrate how these attacks can severely impact the SDN network and easily consume its resources. While some of these attacks are common to conventional networks, others are more specific to SDN. Although the SDN can be afflicted with attacks similar to those in the conventional network, the solutions that are generally applied to current environments are not applicable to SDNs [56]. Decoupling the control plane from the data plane can bring new security threats that have never appeared in the conventional network; for example, all unmatched packets in OpenFlow switches are forwarded to the controller in the form of Packet-In messages. Thus, it is very easy for an intruder or even an end user to poison the network by generating forged messages, which are relayed to the controller. If the SDN switch does not find any matching rule for a received packet in its flow tables, the switch extracts the packet header, encapsulates it using the OpenFlow protocol, and sends it to the controller as a Packet-In message. Then, the controller encapsulates the processed flow and returns it to the OpenFlow switch as a Packet-Out message. The parameters in the Packet-Out message, as shown in Fig. 6, are used to install the flow entry in the OpenFlow switch. The attacker can issue huge numbers of malicious requests, which exhaust the system resources, resulting in degraded controller performance or increased communication overhead.
Although the Transport Layer Security (TLS) protocol is available as an optional means of securing the communication links between the SDN controller and the switches, TLS cannot protect the network from spoofed packets.
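The table-miss behavior described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Controller` and `Switch` classes, the tuple used as a packet header, and the host addresses are hypothetical names introduced for illustration and are not part of the InSDN testbed or of any OpenFlow library. It shows why a legitimate flow costs the controller a single Packet-In, while packets with constantly changing (spoofed) headers cost one Packet-In each.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Toy controller that counts the Packet-In messages it must process."""
    packet_in_count: int = 0

    def handle_packet_in(self, header):
        # Every table miss costs the controller one Packet-In.
        self.packet_in_count += 1
        # The returned dict stands in for the Packet-Out that installs a rule.
        return {"match": header, "action": "forward"}


@dataclass
class Switch:
    """Toy OpenFlow-like switch with an unbounded flow table."""
    controller: Controller
    flow_table: dict = field(default_factory=dict)

    def receive(self, header):
        if header in self.flow_table:
            # A matching rule exists: the packet is handled in the data plane.
            return self.flow_table[header]
        # Table miss: ask the controller, then install the returned rule.
        rule = self.controller.handle_packet_in(header)
        self.flow_table[rule["match"]] = rule["action"]
        return rule["action"]


ctrl = Controller()
sw = Switch(ctrl)

# Legitimate flow: 100 packets with the same header (hypothetical addresses).
for _ in range(100):
    sw.receive(("10.0.0.1", "10.0.0.4"))
legit_packet_ins = ctrl.packet_in_count

# Spoofed flow: 100 packets, each with a different forged source.
for i in range(100):
    sw.receive((f"spoofed-{i}", "10.0.0.4"))
spoofed_packet_ins = ctrl.packet_in_count - legit_packet_ins

print(legit_packet_ins, spoofed_packet_ins)
```

In this toy model the controller load grows linearly with the number of distinct forged headers, which is the effect the saturation attacks discussed in this section rely on.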
This manuscript does not cover all attack types in the SDN context; instead, only the attacks that are relevant to the InSDN dataset are reported in the following paragraphs. Interested readers may refer to previous studies [52], [57]-[60] for more detailed information.
VOLUME 8, 2020
• Data-to-control plane saturation attack [61], [62]: Unlike in the conventional network, packets that match no entry in the flow tables are forwarded to the control plane for forwarding decisions. Since the SDN controller makes the packet forwarding decisions, the attacker can exploit this behavior by launching a dedicated denial-of-service attack to flood the network resources. The attacker can produce an extensive number of table-miss Packet-In messages to exhaust the controller's resources (e.g., CPU, memory) in a short time. This can cause a reduction or complete shutdown of the controller service. As a result, the normal delivery of packets is interrupted.
• Link Flooding Attack (LFA) [63], [64]: The strategy of the LFA in the SDN context is different from that commonly used against conventional networks. The goal of this attack is to disconnect the controller from the data plane elements. A skilled adversary can exploit the continuous communication between the data plane and the control plane to obstruct this communication. For example, the attacker can employ malicious bots to generate normal-looking, low-rate traffic that congests the channel links and impedes legitimate traffic towards the target network. Conventional techniques fail to mitigate this attack due to the centralized strategy of the SDN architecture in managing the network traffic. Besides, because the low-rate LFA traffic mimics normal behavior, it can flood the whole network without being detected [63].
• Flow-Rule Flooding Attack: The attacker can flood the OpenFlow switch by creating a large number of unmatched flows, which triggers the switch to install invalid flow rules in its flow tables. After a while, the flow tables become full, and the OpenFlow switch is not able to install new rules. This can deplete the switch resources and cause exhaustion in the data plane. Besides, rules for the flows of normal users can no longer be installed, and legitimate traffic cannot be forwarded.
• Password-Guessing Attacks [41]: An attacker residing on a non-SDN element can use random or systematic guessing of passwords to achieve unauthorized access to SDN elements. For example, an intruder might be successful in accessing a management console to launch attacks on the network managed by the SDN controller or in the controller itself.
• Remote application exploitation [52]: The attacker can achieve unauthorized access to a victim system or an SDN component by exploiting a software vulnerability in one of the SDN components. For example, the attacker can exploit software vulnerabilities in the application server to gain access to it. If the attacker succeeds in achieving unauthorized access to the application server, the attacker can poison the controller's view of the network topology. Furthermore, the attacker can carry out a variety of other attacks, such as destruction of information, compromise of integrity, diversion of network traffic, exploitation, and unauthorized disclosure.
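The flow-rule flooding attack listed above can be sketched with a capacity-limited flow table. This is a simplified illustration, not the behavior of any particular switch: the `BoundedFlowTable` class, the capacity of 1000 entries, and the host names are assumptions made for the example, and real switches also apply idle and hard timeouts or eviction policies that are ignored here.

```python
class BoundedFlowTable:
    """Toy flow table that rejects new rules once it is full (no eviction)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rules = set()

    def install(self, match):
        """Return True if a rule for `match` is present after the call."""
        if match in self.rules:
            return True
        if len(self.rules) >= self.capacity:
            # Table is full: the new rule cannot be installed.
            return False
        self.rules.add(match)
        return True


table = BoundedFlowTable(capacity=1000)  # hypothetical capacity

# The attacker generates 5000 distinct unmatched flows (hypothetical headers).
attack_installed = sum(table.install(("attacker", port)) for port in range(5000))

# A legitimate user now tries to get a rule installed for a new flow.
legit_ok = table.install(("h3", 80))

print(attack_installed, legit_ok)
```

Once the attacker's rules occupy every slot, the legitimate flow is refused, which mirrors the data-plane exhaustion described in the flow-rule flooding item.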

V. USAGE AND AVAILABILITY

A. DATASET DESCRIPTION
We divided the dataset into three groups based on the traffic types and the target machines. The first group includes normal traffic only. The second group contains the attack traffic that targets the Metasploitable 2 server. The last group covers attacks on the OVS machine. The Tcpdump tool is used to capture the traffic traces for each category at the target machine and at the SDN controller interface. In addition, the CICFlowMeter tool [65] is used to extract the flow features for the InSDN dataset. We chose CICFlowMeter over the many other tools available in the literature, such as Argus and Bro-IDS, because none of those tools specifically considers time-based features [66]. However, different applications have different time constraints, so it is important to calculate the statistical time-related features of the traffic flows. CICFlowMeter was developed by the Canadian Institute for Cybersecurity team and is written in Java; it creates network traffic flows from a PCAP file. The generated flows are bidirectional, where the first packet in the flow determines the flow direction (forward or backward). The output of CICFlowMeter is more than 80 statistical features in CSV file format, such as Protocol, Duration, Number of bytes, Number of packets, etc. The list of extracted features and their details is available in the appendix (Table 13). We collected more than 80 features with 56 categories from our experiments. For simplicity, we divided the features into eight groups as follows:
• Network identifiers attributes: these features contain the common information that is used to define the source and destination of a flow, for example, IP address, port number, and protocol type.
• Packet-based attributes: these features hold the information related to the packets such as the total number of packets in a forward and backward direction.
• Bytes-based attributes: these features hold the information related to the bytes, i.e., the total number of bytes in the forward and backward directions.
• Interarrival time attributes: these features show the information related to the interarrival time in both forward and backward directions.
• Flow timers attributes: these features hold the information related to the time of each flow i.e. active and inactive.
• Flag attributes: these features hold the information related to the flags like SYN Flag, RST Flag, Push flag, etc.
• Flow descriptors attributes: these features contain the traffic flow information (e.g., the number of packets and bytes in both the forward and backward directions).
• Subflow descriptors attributes: these features show the information related to subflows, such as the number of packets and bytes in the forward and backward directions.
For labeling, we use feature information such as the source IP and destination IP. The dataset contains a total of 343,939 instances of normal and attack traffic, of which 68,424 are normal instances and 275,515 are attack instances. Table 5 presents the attack classes for each group with their total sizes. Furthermore, the names of the PCAP files under each attack group are chosen based on the target protocol layer or the tools that were used to create each file.
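As a rough illustration of IP-based labeling and of the class proportions implied by the counts above, consider the following sketch. The attacker addresses and the `label_flow` helper are hypothetical and are not taken from the InSDN testbed; only the instance counts come from the text.

```python
# Hypothetical attacker addresses; the real testbed addresses are not given here.
ATTACKER_IPS = {"192.168.0.10", "192.168.0.11"}


def label_flow(src_ip, dst_ip):
    """Label a bidirectional flow as attack if either endpoint is an attacker."""
    if src_ip in ATTACKER_IPS or dst_ip in ATTACKER_IPS:
        return "Attack"
    return "Normal"


# Instance counts reported for the dataset.
normal_count = 68_424
attack_count = 275_515
total_count = normal_count + attack_count

normal_share = normal_count / total_count
attack_share = attack_count / total_count

print(label_flow("192.168.0.10", "192.168.0.4"))
print(total_count, round(normal_share, 3), round(attack_share, 3))
```

The shares make the imbalance explicit: roughly one flow in five is normal traffic, and the rest is attack traffic.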

B. USAGE NOTES
1) The InSDN dataset includes different attacks that can strike the data, control, and application layers. The source of attacks in the dataset is classified into two categories.
a) Internal: These attacks come from internal users, who have full access to the SDN network. Although internal attacks are rare in production systems, they tend to be more severe and can harm network elements. In many cases, the attacker is not able to target network servers directly since these servers might have a high level of security protection. In this case, the attacker tries to exploit weaknesses of individual users inside the network system, and then starts new attacks on different target servers. In the InSDN dataset, the compromised hosts (i.e., h1 and h2) are used to launch various attacks from inside the SDN network.
b) External: These attacks are commonly launched from an outside network. The attacker mainly targets the SDN network using different malicious activities such as code exploits, DoS, malware, etc. We assume the majority of attacks in the dataset are created from an outside network to mimic real attack scenarios.
2) We predict the effect of the dataset attacks on various SDN elements, which can help to provide better countermeasures. The centralized control element is the main difference between the SDN and the conventional network. In the conventional network, an attack typically affects only one portion of the network, probably related to one vendor, without interrupting the whole network service. However, any damage to the SDN controller can have a severe impact on the entire system. Another concern is that compromising the control messages on the southbound or northbound interface can threaten the whole network system. Therefore, organizations should tackle security issues in the early stages, before implementing their SDN project. Any delay in securing the network can cause service-affecting problems.

VI. LIMITATIONS
1) Although SDN is applied in different network environments, the technology is still under development. Unfortunately, the previous history of SDN attacks is unknown. Therefore, in this work we act like the attacker and anticipate the weaknesses that an attacker would be likely to strike.
2) The InSDN testbed was implemented using only the ONOS SDN controller. Differences in security-related functionality among other controllers are not considered. However, the authors in [68], [69] claim that different controllers can have different security models and, therefore, different countermeasures.
3) SDN can be deployed at different network scales, and it is expected to support more devices and users than the conventional network. Therefore, a single controller is not enough to cover all network nodes and users. In enterprise networks, there are probably several controllers connected together through API interfaces such as eastbound and northbound interfaces. Unfortunately, due to hardware constraints, only a small-scale topology with one SDN controller was considered and implemented. However, a single controller can perform well and achieve the purpose of optimal flow management [70]. In addition, obtaining the dataset using a single controller or multiple controllers does not make a big difference in methodology [35].
4) To generate more intrinsic data for SDN networks, the network topology should be created using physical devices. We tested various attacks and studied their impact on the SDN layers by simulating the SDN network using virtual machines instead of real elements. We are planning to generate a more intrinsic dataset using a physical topology with many connected devices.
5) The InSDN dataset assumes that all attacks are generated by highly skilled attackers. Threats that come from misconfiguration or conflicting flow tables in the switches are not considered.
6) One of the main limitations of the proposed dataset is its high class imbalance. This problem can bias the IDS towards the majority class, causing a high false alarm rate and low evaluation accuracy. However, there are many different techniques to address the problem of imbalanced samples [71]-[74]. One of these techniques is relabeling, for which two different methods can be used: (a) the majority classes can be split to form more classes; (b) two or more minority classes that share the same characteristics can be merged to create one new class. As a result, the imbalance issue can be reduced, and the prevalence ratio is effectively improved.
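The relabeling option (b) in the last limitation can be sketched as follows. The class counts below are toy values chosen only for illustration and are not InSDN statistics; the helper names `merge_classes` and `imbalance_ratio`, the choice of which classes to merge, and the merged class name `Other` are likewise assumptions.

```python
def merge_classes(counts, to_merge, new_name):
    """Return new class counts with the classes in `to_merge` combined."""
    merged = {name: n for name, n in counts.items() if name not in to_merge}
    merged[new_name] = sum(counts[name] for name in to_merge)
    return merged


def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())


# Toy class counts (NOT the real dataset sizes).
toy_counts = {"DoS": 500, "Probe": 400, "Web": 6, "Botnet": 4, "U2R": 2}

merged_counts = merge_classes(toy_counts, {"Web", "Botnet", "U2R"}, "Other")

ratio_before = imbalance_ratio(toy_counts)
ratio_after = imbalance_ratio(merged_counts)

print(merged_counts)
print(ratio_before, round(ratio_after, 2))
```

Merging reduces the gap between the largest and smallest class, at the cost of losing the distinction between the merged attack types, so it only makes sense for classes that share similar characteristics.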

VII. EXPERIMENTAL EVALUATION
This section analyzes eight supervised learning techniques to evaluate the usability and quality of the InSDN dataset. The main objective is to demonstrate the quality of this dataset when it is used for binary classification, i.e., normal versus attack classes. Various performance indicators are used to evaluate the efficiency of the employed supervised learning techniques, such as precision, recall, F-score, and training time.

A. DATASET PRE-PROCESSING STEPS
The first phase before training the IDS models is to pre-process the dataset to make it more suitable for training and to avoid the overfitting problem. A few steps are taken to pre-process the input flows, as follows:
• The InSDN dataset contains socket information such as Source IP, Destination IP, flow ID, etc. All socket features are removed to avoid the overfitting problem, since these features can change from network to network. The final dataset includes 77 features, besides the traffic category.
• The features have different ranges, so they are rescaled to restrict the values to the range between 0 and 1.
• A one-hot encoding scheme is used to convert the string labels to numerical values. In this model, only binary classification is considered, which classifies the input data into malicious and normal groups. Therefore, the normal and malicious labels are encoded as the binary values 0 and 1, respectively.
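The three pre-processing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the column names (`Src IP`, `Flow Duration`, `Label`, and so on) and the label string `Normal` are assumed for the example, the rows are made-up toy records, and min-max rescaling is used as one common way to map values into the range 0 to 1.

```python
# Assumed socket-level column names to drop (hypothetical naming).
SOCKET_COLUMNS = {"Flow ID", "Src IP", "Src Port", "Dst IP", "Dst Port", "Timestamp"}


def preprocess(rows, label_key="Label"):
    """Drop socket columns, min-max scale features, and binarize labels."""
    feature_names = [
        k for k in rows[0] if k not in SOCKET_COLUMNS and k != label_key
    ]
    lo = {k: min(r[k] for r in rows) for k in feature_names}
    hi = {k: max(r[k] for r in rows) for k in feature_names}

    X, y = [], []
    for r in rows:
        # Constant columns are mapped to 0.0 to avoid division by zero.
        X.append([
            (r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
            for k in feature_names
        ])
        # Normal traffic -> 0, any attack category -> 1.
        y.append(0 if r[label_key] == "Normal" else 1)
    return feature_names, X, y


# Toy records, not taken from the dataset.
rows = [
    {"Src IP": "10.0.0.1", "Dst IP": "10.0.0.4", "Flow Duration": 100, "Tot Fwd Pkts": 2, "Label": "Normal"},
    {"Src IP": "10.0.0.2", "Dst IP": "10.0.0.4", "Flow Duration": 300, "Tot Fwd Pkts": 10, "Label": "DoS"},
    {"Src IP": "10.0.0.3", "Dst IP": "10.0.0.4", "Flow Duration": 200, "Tot Fwd Pkts": 6, "Label": "Probe"},
]

names, X, y = preprocess(rows)
print(names, X, y)
```

In a real experiment the minimum and maximum would normally be computed on the training portion only and then reused for the test portion, so that no information leaks from the test data into the scaling step.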

B. SDN SPECIFIC FEATURES
This section focuses on selecting the necessary features that can be directly obtained from the SDN network.
In SDN, only statistical features can be extracted from the SDN controller through OpenFlow calls to the SDN switches (e.g., flow duration, number of packets, number of bytes). In this manuscript, the framework of [75] is used to obtain the SDN specific features. These features can be directly extracted from the SDN controller through API queries or by manual computation based on flow statistics information. Table 7 presents the mapping between the features derived from the SDN environment and the InSDN dataset features. In addition, Table 8 shows extra features that can be obtained by manual computation. The new features include the maximum, minimum, mean, and standard deviation of these values as well as the direction-specific features. These features are essential to identify some particular attacks such as botnets [75]. We selected a subset of 48 features from our dataset, whereas the previous method [75] used a subset of 50 features to train its learning model. That method, however, used the source IP and destination IP in its computation. These two attributes are excluded from our feature selection strategy, since IP addresses can change from network to network. Besides, the same IP address can be assigned to an attacker machine as well as to a normal user. Thus, the IP address cannot distinguish between normal and attack traffic. Table 9 presents all the selected features for the SDN context from the proposed data.
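One plausible way to derive the maximum, minimum, mean, and standard deviation mentioned above from controller statistics is to poll a cumulative counter and summarize its per-interval increments. The sketch below assumes this polling approach; the `interval_stats` helper and the sample counter values are hypothetical, and the population standard deviation is used as one possible choice. It is not necessarily the computation used in [75].

```python
import statistics


def interval_stats(counters):
    """Summarize the increments between successive cumulative counter samples."""
    deltas = [b - a for a, b in zip(counters, counters[1:])]
    return {
        "min": min(deltas),
        "max": max(deltas),
        "mean": statistics.mean(deltas),
        "std": statistics.pstdev(deltas),  # population standard deviation
    }


# Hypothetical cumulative byte counters for one flow, one sample per poll,
# kept separately for the forward and backward directions.
fwd_byte_counters = [0, 100, 300, 600]
bwd_byte_counters = [0, 50, 100, 150]

fwd_stats = interval_stats(fwd_byte_counters)
bwd_stats = interval_stats(bwd_byte_counters)

print(fwd_stats)
print(bwd_stats)
```

Keeping separate counters per direction yields the direction-specific variants of each statistic without needing access to individual packets.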

C. MACHINE LEARNING ANALYSIS TECHNIQUES
This work uses eight common supervised learning algorithms to evaluate the quality of the InSDN dataset. Specifically, we employed three tree-based algorithms: a single Decision Tree (DT) [76], Random Forest (RF) [77], and an Adaptive Boosting (AdaBoost) [78] learner. We also used the k-Nearest Neighbor classifier (KNN) [79], Naive Bayes (NB) [80], and two Support Vector Machine (SVM) [81] based methods: one with a linear kernel (lin-SVM) and one with a radial basis function kernel (rbf-SVM). In addition to these classifiers, a multilayer perceptron model (MLP) is chosen in order to further evaluate the InSDN dataset. The hyper-parameter settings of the MLP are described in Table 10, while default parameters are used for all the other implemented algorithms. All learning classifiers are trained using the cross-validation technique with K = 5, where the data are split into 80% for training and 20% for testing. In our experiments, there is no significant difference in accuracy between K = 5 and K = 10. In addition, using a larger K is computationally expensive and time consuming, especially for large datasets. All the experiments were implemented in the Python programming language using various libraries such as Keras, Scikit-Learn, and Tensorflow. Furthermore, all the experiments were performed on a workstation with the following properties: Intel(R) UHD Graphics 620, I7-8650U CPU @ 1.90GHz (8 cores), 2.1GHz, Windows 10 pro 64-bit with 16 GB of RAM.
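The relationship between K = 5 and the 80%/20% split can be made concrete with a small index-splitting sketch. The `k_fold_indices` helper is written here for illustration only; it does not shuffle or stratify the samples, which a real experiment would typically do, and it is not claimed to be the splitting code used in the experiments.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so that sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=100, k=5))

for train, test in folds:
    print(len(train), len(test))
```

With K = 5, each fold holds out one fifth of the samples for testing and trains on the remaining four fifths, which is where the 80% to 20% ratio comes from; with K = 10 the ratio would be 90% to 10% and twice as many models would be trained.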

D. CLASSIFICATION METRICS
Using the overall accuracy alone does not yield precise comparisons [82], so we use the most important performance indicators to evaluate our proposed model, such as precision, recall, and F-score. These metrics are commonly used in intrusion detection systems and are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 × Precision × Recall / (Precision + Recall)
where True Positive (TP) and True Negative (TN) represent the values that are correctly predicted. In contrast, False Positives (FP) and False Negatives (FN) indicate misclassified events. Furthermore, we considered the training time, which describes how long the classifier algorithm takes to train on the whole data.
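The precision, recall, and F-score mentioned above can be computed directly from the confusion counts using their standard definitions. The sketch below uses made-up toy labels (1 for attack, 0 for normal) purely to illustrate the arithmetic; the `binary_metrics` helper is a name introduced for this example.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, and F-score for the positive (attack) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f_score


# Toy labels: 4 attack flows and 6 normal flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f_score = binary_metrics(y_true, y_pred)
print(precision, recall, f_score)
```

Here three of the four attack flows are detected and one normal flow is flagged by mistake, so precision and recall are both 3/4.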

E. RESULTS AND DISCUSSION
This section discusses in detail the performance evaluation of the InSDN dataset.

1) FULLY-FEATURED VERSION OF THE DATASET
Table 11 shows the performance of the different classifiers using the fully-featured version of our dataset. The overall scores are very high on the DoS/DDoS and probe classes for all classifiers, while the U2R class yields poor performance metrics. This is because both the DoS and Probe categories commonly differ more from normal traffic patterns [83]. In contrast, the U2R attack class has a high similarity to normal connections. In addition, the number of U2R flow records is small compared to the normal flows in the same set. Furthermore, the overall performance of Adaboost and MLP is high for all attack classes, but their training time is relatively long.
Recall and F1-Score on the botnet, web attack, and U2R classes are poor for both the linear and the RBF based SVM. Besides, the recall and F1-score metrics of the rbf-SVM algorithm are low on the password brute-forcing attack type. Furthermore, the performance and training time of the KNN, DT, and RF classifiers are reasonable for all attack classes. These algorithms succeeded in recognizing most of the attacks, but they have low scores on the U2R attack. In contrast, the NB classifier consumes less time in the learning and prediction stages compared to the other classification algorithms, but its performance is significantly low for three attack types: brute-forcing, web attack, and U2R. However, NB improved the results on the botnet attack type compared to the other algorithms. Another interesting finding is that good results on the merged dataset might obfuscate poor performance on the less prevalent attack classes, as the majority of samples belong to the DoS/DDoS and probe attacks. Further, the training time grows with the number of data records. We can notice that rbf-SVM had the longest training time for DoS/DDoS and probe, followed by the MLP and Adaboost classifiers.
Table 12 shows the performance of the various models using the SDN specific-featured version of the dataset. We can see that Adaboost retains high performance scores and stability for all attack classes, followed by the DT and RF classifiers. However, the scores obtained on the U2R attack type are relatively small for the DT and RF classifiers. In addition, the Recall and F1-Score of the KNN algorithm are relatively low on the web attack and U2R classes. We can find that NB consistently had good scores on all metrics for the DDoS and port attack classes, while its performance declined sharply on the botnet, web attack, and U2R classes.
Furthermore, we noted a substantial decline in the performance of SVM on the botnet, password brute-forcing, web attack, and U2R attack classes: the linear and RBF based SVM fail to identify any flow records of the botnet, web, and U2R attacks, and their recall and F1-Score metrics are very poor on the password brute-forcing attack class. Although the performance of SVM (linear and RBF kernel) remains stable on the DoS, DDoS, and probe attacks, its training time is considerably higher than with the fully-featured version of the dataset. Furthermore, the recall score of the MLP algorithm decreases on the password brute-forcing, botnet, and web attack types, while its recall and F1-score are close to zero for the U2R class.

3) STATE OF THE ART RESULT COMPARISON
In this experiment, the InSDN dataset is compared with four publicly available datasets (i.e., KDD'99, NSL-KDD, Kyoto, and CICIDS 2017) by using six machine learning approaches, namely KNN, NB, Adaboost (AB), DT, RF, and rbf-SVM. As shown in Figure 7, the AB and RF classifiers clearly performed well compared to the other algorithms. In addition, the performance of the DT, AB, and RF classifiers remains the same over the various datasets. However, the performance of KNN, NB, and SVM-rbf fluctuates across the various datasets. This implies the power of DT, AB, and RF to detect new attacks.

FIGURE 7. Performance of classification algorithms in terms of global detection rate. We estimate our dataset accuracy compared with other publicly available datasets based on the proposed work in [84].

VIII. CONCLUSION
This paper investigated the challenging problem of dataset availability in the SDN environment. We proposed a new SDN dataset, InSDN, to solve some of the inherent problems of legacy datasets. We considered different attack scenarios that represent real-world scenarios, and discussed the impact of the generated attacks on the different SDN elements. We observed that the SDN can also be afflicted with popular network attacks. However, the SDN network is more sensitive to malicious traffic than conventional environments. In the conventional network, an attack typically affects only a portion of the network, usually belonging to the same vendor, without bringing down the entire network. In the SDN environment, however, compromised switches or end users can flood the SDN controller, causing damage to the whole network.
In the near future, we will extend this work and create a more intrinsic dataset generated from large-scale networks. Moreover, we will consider new attack categories to better represent existing real-world networks.

APPENDIX DATA COLLECTION FEATURES
See Table 13.