Active Learning for Network Traffic Classification: A Technical Study

Network Traffic Classification (NTC) has become an important feature in various network management operations, e.g., Quality of Service (QoS) provisioning and security services. Machine Learning (ML) algorithms, as a popular approach for NTC, promise reasonable classification accuracy and can deal with encrypted traffic. However, ML-based NTC techniques suffer from the shortage of labeled traffic data, which is the case in many real-world applications. This study investigates the applicability of an active form of ML, called Active Learning (AL), in NTC. AL reduces the need for a large number of labeled examples by actively choosing the instances that should be labeled. The study first provides an overview of NTC and its fundamental challenges, along with surveying the literature on ML-based NTC methods. It then introduces the concepts of AL, discusses it in the context of NTC, and reviews the literature in this field. Further, challenges and open issues in AL-based classification of network traffic are discussed. Moreover, as a technical survey, some experiments are conducted to show the broad applicability of AL in NTC. The simulation results show that AL can achieve high accuracy with a small amount of data.


I. INTRODUCTION
During the last decades, emerging networking paradigms, such as the Internet of Things (IoT), have introduced various network management challenges. Given the proliferation of IoT devices and the distinguishing characteristics of IoT traffic, such as heterogeneity, spatio-temporal dependencies, dominating uplink traffic, and low duty-cycle traffic patterns, network management and monitoring have become challenging. Gaining deep insight into such complex networks for performance evaluation and network planning purposes is not a trivial task with respect to processing time, human effort, and computational overhead. Understanding network traffic behavior plays a vital role in a wide variety of network management aspects, e.g., fault management, accounting, security, and network performance management [1]. Some general approaches have been introduced to analyze the behavior of networks and maintain their performance, such as Monitor-Analyze-Plan-Execute (MAPE) and Observe-Orient-Decide-Act (OODA) [2].

Amin Shahraki is with the School of Computer Science, University College Dublin, Ireland. Corresponding author e-mail: am.shahraki@ieee.org. Mahmoud Abbasi was with the Department of Computer Sciences, Islamic Azad University, Mashhad, Iran; e-mail: mahmoud.abbasi@ieee.org. Amir Taherkordi is with the Department of Informatics (IFI), University of Oslo, Norway; e-mail: amirhost@ifi.uio.no. Anca Delia Jurcut is with the Department of Computer Sciences, University College Dublin, Dublin, Ireland; e-mail: anca.jurcut@ucd.ie.
In networking, the process of analyzing network traffic behavior is mainly known as Network Traffic Monitoring and Analysis (NTMA) [3]. NTMA has attracted much interest in recent years and has become an important research topic in the field of communication systems and networks [4]. The importance of NTMA lies in the properties and challenges of modern networking, e.g., heterogeneity, complexity, and dynamicity, resulting in instability in data transmission [5]. NTMA is an essential approach to measure the performance of applications and services, and to discover network inefficiencies. Indeed, NTMA allows us to shed light on the functioning of communication systems and to deal with unexpected events, especially in complex and large-scale networks, such as the Internet.
NTMA applications are generally categorized into eight groups: Network Traffic Classification (NTC), traffic prediction, fault management, network security, traffic routing, congestion control, resource management, and Quality of Service (QoS) and Quality of Experience (QoE) management [6]. In this study, we focus on NTC as an important and open issue in NTMA. NTC refers to techniques for categorizing network traffic into different classes based on their properties. The classification of network traffic is highly beneficial in various network services, from QoS (e.g., traffic policing and shaping) and pricing to malware and intrusion detection [7]. NTC provides detailed knowledge of network traffic, which is very useful for those who investigate changes in traffic characteristics and the long-term requirements of networks [8], e.g., Network Management and Orchestration (NMO) tools and performance management models.
NTC techniques can be broadly grouped into three categories: port-based, payload-based, and flow-based methods [9]. Port-based techniques associate a standard port number with a service or application, while payload-based methods carefully inspect the content of the captured packets to classify them. Last but not least, flow-based techniques utilize network traffic flow characteristics (e.g., round-trip time and inter-arrival times) to associate produced traffic with the related sources. The two latter methods cannot be used in some network types (e.g., Virtual Private Networks (VPNs)), or violate the privacy of users by accessing their personal data. Flow-based techniques are the most common techniques for NTC as, instead of inspecting all packets passing through a given link, they examine network traffic flows or an aggregated form of the packet header information. As a result, the volume of data that needs to be examined is reduced, and encrypted traffic is no longer a problem. Flow-based techniques assume that each application's traffic has almost unique statistical or time-series features that can be utilized by classifiers to categorize both encrypted and regular traffic.
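As a rough illustration of the flow-based view, the following sketch (plain Python; the field names and packet records are hypothetical) groups captured packets into flows by the classic 5-tuple, so that a classifier only has to examine per-flow aggregates rather than packet payloads:

```python
from collections import defaultdict

# Hypothetical captured packets; in practice these would come from a
# capture tool. Only header-level fields are used, never the payload.
packets = [
    {"src": "10.0.0.1", "dst": "8.8.8.8", "sport": 5353, "dport": 53, "proto": "udp", "size": 80},
    {"src": "10.0.0.1", "dst": "1.2.3.4", "sport": 40000, "dport": 443, "proto": "tcp", "size": 1500},
    {"src": "10.0.0.1", "dst": "8.8.8.8", "sport": 5353, "dport": 53, "proto": "udp", "size": 120},
]

# Group packet sizes into flows keyed by the 5-tuple
# (src IP, dst IP, src port, dst port, protocol).
flows = defaultdict(list)
for pkt in packets:
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
    flows[key].append(pkt["size"])
```

Each value in `flows` is then a per-flow sequence from which statistical or time-series features can be derived, which is exactly the information flow-based classifiers consume.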
In flow-based methods, the traffic classifier may leverage Machine Learning (ML) algorithms to automate the classification process, discover different traffic patterns produced by devices, and classify encrypted traffic. Although ML algorithms are powerful techniques for classifying network traffic flows [10], [11], the accuracy of learning-based approaches is limited by their need for a massive number of labeled instances. As the authors in [12] mention, most real-world application data is semi-labeled or unlabeled. Moreover, the data labeling process for ML tasks can be challenging in terms of human effort and cost [13].
Fortunately, Active Learning (AL), as a sub-field of ML, is a promising approach to deal with the need for a huge amount of labeled instances. AL aims to reduce the need for labeled examples by intelligently querying the labels during training. The query goes to the examples that the AL algorithm believes will help build the best model [14]. Therefore, based on the aforementioned challenges, AL can be considered an appropriate and efficient technique for flow-based NTC. Providing a thorough study on the usefulness of AL in NTC and reviewing the state-of-the-art techniques in this field can significantly help the network research community in better adopting AL for the classification of network traffic in various domains. To the best of our knowledge, this is the first and only study that technically reviews the efficiency and importance of AL for NTC along with surveying the literature in this field. In this paper, we study NTC techniques and discuss AL as a useful approach in this field. The main contributions of our work are summarized as follows:
• Discussing NTC techniques and their correlations with ML techniques;
• Reviewing existing work in AL-based NTC;
• Empirically evaluating the performance of AL for NTC purposes;
• Discussing the challenges and future directions in using AL for NTC.
The rest of this paper is structured as follows: In Section II, we review existing survey works on traffic classification techniques. In Section III, we provide an overview of the NTC problem and the use of ML techniques. Then, we devote Section IV to discussing the fundamental elements of AL and query strategies. Next, in Section V, we discuss the advantages of using AL for NTC purposes and carry out a literature review on this topic. In Section VI, we evaluate the performance of AL in NTC. In Section VII, we discuss the challenges and future directions in using AL for NTC, and finally, we conclude the paper in Section VIII.
of network speed in NTC and NTC tools. In [23], Finsterbusch et al. reviewed payload-based NTC based on Deep Packet Inspection (DPI). They also practically analysed the most significant open-source DPI modules to show their performance in terms of accuracy and requirements. Additionally, they provided a guideline on how to design and implement DPI-based NTC modules. In [24], Velan et al. studied NTC models for encrypted network traffic to measure the traffic and improve security, e.g., by detecting anomalies. They reviewed different types of encrypted traffic and how payload-based and feature-based NTC techniques can classify encrypted network traffic. Zhao et al. [7] reviewed the use of NTC in IoT and Machine-to-Machine (M2M) networks. They reviewed current NTC within the IoT context based on the differences between IoT and non-IoT network traffic. By reviewing the literature, the authors showed that in the IoT research area, most NTC techniques are proposed to solve security challenges. The authors in [25] reviewed NTC techniques, i.e., statistics-based, correlation-based, behavior-based, payload-based, and port-based classification. They also quantified classification granularity based on four levels, i.e., the application type layer, protocol layer, application layer, and service layer. Last but not least, they classified network traffic features and the existing public datasets that are commonly used in the proposed NTC techniques.
• Literature reviews on the use of ML in NTC: In one of the earliest studies on the use of ML in NTC, Nguyen et al. [26] reviewed the literature between 2004 and 2007. They studied how ML models can be employed for NTC in IP networks, e.g., clustering approaches, supervised learning approaches, and hybrid approaches.
They also reviewed the literature that compares ML techniques with non-ML techniques for NTC. They mentioned that offline analysis models, e.g., AutoClass, Decision Tree, and Naive Bayes, can achieve a high accuracy of about 99%. They also outlined some critical operational requirements for real-time NTC models compared to offline models. In [21], Singh evaluated unsupervised ML techniques, including K-means and the Expectation Maximization algorithm, for NTC. The results show that the accuracy of K-means is better than that of the Expectation Maximization algorithm. In Table I, a summary of the surveys above is provided based on their vision of NTC, the reviewed solutions, the network type, and the practical evaluation of the studied solutions. As indicated in the table, our survey targets flow-based NTC for use in Internet communications and specifically considers AL as one of the most important ML-based solutions. To the best of our knowledge, our study is one of the rare literature surveys that evaluates such specific ML solutions for NTC, as most existing surveys consider general ML models, e.g., supervised learning solutions, for NTC. Studying AL-based solutions makes our work different from all existing survey works.

III. OVERVIEW ON NTC AND ML
In NTC, one should clarify the goals of classification based on the intended use, such as accounting purposes, malware detection, intrusion detection, QoS provisioning, and identifying types of applications based on the network traffic (e.g., VPN vs. non-VPN traffic, or Tor vs. non-Tor traffic). Indeed, there are different factors that one can use to categorize network traffic, including applications (e.g., Facebook and Hangouts), protocols (e.g., HTTP and BitTorrent), traffic types (e.g., web browsing and chat), browsers (e.g., Firefox and Chrome), operating systems, and websites. Therefore, the purpose is to determine the label of each network flow correctly, e.g., browsing, interactive, or video stream. NTC can be further categorized into online and offline classification. In online NTC, the input traffic needs to be classified in a real-time or near real-time manner (e.g., for QoS provisioning). On the other hand, offline classification is appropriate for applications such as anomaly detection and billing systems. Despite their importance, existing NTC techniques suffer from general networking challenges, as listed below:
• While the literature on traffic classification is mature enough to adapt to old-fashioned networking paradigms, e.g., legacy cellular systems, the dramatic growth and evolution of online applications and services have made traffic classification a non-trivial task. Due to the traffic characteristics of modern networks, e.g., large scale, heterogeneity, multimodal data, and big data, emerging NTC methods must meet strict requirements in terms of system performance, accuracy, and robustness. For example, the vast amount of raw data generated by IoT and cellular devices poses severe challenges to ML-based NTC methods, as they need clean and pre-processed data for training purposes.
• NTC is a multi-factor procedure in which an automated program categorizes the network traffic based on a set of selected features. In other words, feature engineering is a challenge when it comes to using classical ML for traffic classification.
• The recent increase in encrypted network traffic and protocol encapsulation methods limits the effectiveness of many traffic classification techniques, since packet inspection techniques are unable to extract network management information from network traffic. For example, a significant portion of the Internet traffic is associated with Peer-to-Peer (P2P) applications. However, classification of P2P traffic is a difficult task [7], as many P2P applications, such as online video and P2P downloading, use encryption and obfuscation protocols to circumvent the limitations posed by Internet service providers.
To overcome the above challenges, various techniques have been introduced, e.g., graphical techniques, statistical methods, and ML-based methods [24]. In the scope of ML, various port-based, payload-based, and flow-based solutions have been proposed as the most promising solutions for NTC [30], [31]. Multiple steps are needed to build an ML-based network traffic classifier, as presented in [32]. Figure 1 shows a graphical description of all steps. In the rest of this section, we discuss each individual step.

[Figure 1: The steps of building an ML-based network traffic classifier: data gathering, data pre-processing, feature engineering, model selection, and model evaluation.]

A. Data gathering
Since ML algorithms learn to classify the data based on sample datasets, representative data must be collected in the data gathering step. While a few publicly available network traffic datasets have been released, using these to train a traffic classification model can be difficult [33]. In addition, since the behavior of the network traffic is different from one network to another, it is highly recommended to train the ML algorithm for the target network [2]. Additionally, the number of network traffic classes can be high, and it is rather impractical to consider all classes in one public dataset. Furthermore, there are a variety of data gathering and labeling techniques that lead to different feature sets. Hence, in real-world applications, the goal is to use datasets that are tailored to the intended use of NTC, mainly gathered from the target network.

B. Data pre-processing
After gathering, the data must be pre-processed such that it is represented in a form from which the target ML algorithms can discover different patterns. In traffic classification, header data and payload are the two major data structures. These structures often need to be pre-processed because they contain irrelevant or redundant information, such as network management data, which is not needed for traffic classification, e.g., source and destination IP addresses, and protocol information. Moreover, changes in the distribution of packet-level features can occur in real-world environments because of unexpected events like the re-transmission of packets. In short, performing some pre-processing steps, such as packet filtering, elimination of noisy samples, header removal, and data quality assessment, is needed to ease the learning process for the ML algorithms [34].
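As a minimal sketch of such pre-processing, assuming flow records are plain dictionaries with hypothetical field names, the snippet below drops identifier fields (e.g., IP addresses) that are irrelevant for classification and filters out degenerate flows:

```python
def preprocess_flows(flows, min_packets=2, drop_fields=("src_ip", "dst_ip")):
    """Remove identifier fields and discard flows too short to classify."""
    cleaned = []
    for flow in flows:
        if flow.get("num_packets", 0) < min_packets:
            continue  # too short: likely noise or a failed connection attempt
        # keep only fields useful for learning, not for identifying hosts
        cleaned.append({k: v for k, v in flow.items() if k not in drop_fields})
    return cleaned

flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "num_packets": 12, "bytes": 4096},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.4", "num_packets": 1, "bytes": 60},
]
cleaned = preprocess_flows(flows)  # keeps only the first flow, without IPs
```

Real pipelines would add further steps from the text (noise elimination, header removal, quality checks); this only illustrates the filtering pattern.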

C. Feature engineering
Conventional classification solutions, e.g., ML- and statistical-based techniques, need to go through a feature engineering procedure, in which domain knowledge is used to extract features or patterns from the raw data [35], [23]. Feature engineering is a crucial step in ML-based NTC methods because choosing appropriate features can ease the difficulties of the modelling phase, and vice versa [36]. It is worth mentioning that, considering privacy, the risk associated with feature engineering and representation procedures is also crucially important, especially in payload-feature-based techniques. Indeed, there are legal restrictions on using payload-based methods in many environments, or on recognizing all communication protocols. This is mainly due to user privacy policies, as such methods inspect the content of the network packets [37].
Generally, there are four major types of input features for NTC:
• Time series: Considering time-series-related features, one can refer to maximum packet inter-arrival time, maximum number of bytes in a packet, and inter-packet timings. According to [38], the length of the time series (or the number of packets within a flow) has a visible effect on classification accuracy and computational overhead. Specifically, increasing the number of considered packets can improve the classification performance, but at the cost of higher computational overhead. In [38], only the first 20 traffic packets in a flow are used for the experiments. The authors in [39] use the time-series features of packets, e.g., source and destination ports, payload size, and TCP window size (bytes), as input for a semi-supervised model to classify the traffic of five Google services, including Hangout Chat, Hangout Voice Call, YouTube, File Transfer, and Google Play Music. The simulation results show excellent accuracy, despite using a limited number of labeled data samples. This is mainly because they conducted a pre-training step on the entire set of unlabeled network flows in order to learn statistical features, and then re-trained the model using a small labeled dataset for fine-tuning.
• Header: The header of a network packet contains information related to different layers (e.g., the network layer). Features such as the port number and protocol number are widely used as informative features in traffic classification tasks. However, some modern NTC techniques, especially DL-based ones, accept entire packets as the input. For example, in [40] the authors used hexadecimal raw packet headers and convolutional networks to classify Tor/non-Tor traffic. To this end, they utilized TCP/IP headers, especially the first 54 bytes of packets, because TCP is associated with around 90% of all Internet traffic.
• Payload: NTC techniques can also use layer-related information above the transport layer to classify network traffic. As a prime example, in [41] the authors utilize BitTorrent handshake packets on layer 4 to classify BitTorrent traffic, which generates the highest amount of P2P traffic. Moreover, some works use packets related to the Transport Layer Security (TLS) handshake process to identify HTTPS services [42].
• Statistical features: The statistical features of network flows, such as the minimum inter-arrival time and the size of IP packets, can be used for NTC [43]. The main idea behind using statistical features is that the statistical features of the network flows generated by different services or applications are almost unique. Nevertheless, a big challenge with methods that use statistical features is that they are not suitable for online classification. This is mainly because a classifier needs to monitor the entire flow, or a significant part of it, in order to extract statistical features.
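To illustrate the last category, a small sketch that derives common statistical features such as minimum inter-arrival time and mean packet size might look as follows (the input format, a list of (timestamp, size) pairs per flow with timestamps in milliseconds, is a hypothetical convention):

```python
from statistics import mean

def flow_statistics(packets):
    """Compute flow-level statistical features from a list of
    (timestamp_ms, size_bytes) tuples, as flow-based classifiers use."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    iats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "min_iat": min(iats),
        "mean_iat": mean(iats),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "mean_size": mean(sizes),
    }

# A toy flow of four packets (made-up values).
pkts = [(0, 60), (5, 1500), (20, 1500), (25, 60)]
feats = flow_statistics(pkts)
```

Note that, as the text points out, such features can only be computed once the whole flow (or a large part of it) has been observed, which is why purely statistical methods struggle with online classification.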

D. Model selection
Another step towards building a traffic classifier is selecting the right ML model. In the context of ML, choosing a model can carry different meanings, such as the selection of hyperparameters and parameters, as well as algorithm selection. For NTC, several factors can be involved in the selection of the classification model (e.g., model performance, available resources, model complexity, and feature selection). One of the most significant factors is feature selection. This is because there is a direct correlation between the features and the input dimensions of the model, and consequently the computational and memory complexity of the model, which are crucial factors in NTC. This implies that the dimensions and structure of the input data for training purposes should be optimized. Moreover, the selected features directly affect the performance of the final learning task (e.g., classification and regression) and the dimensions of the input data for training. Hence, one should consider the right number of informative features. In the context of traffic classification, it may not be sufficient to consider model performance as the only factor for model selection. Thus, one can also consider other criteria, such as training time and model explainability.
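To sketch this multi-criteria view of model selection, the toy example below (hypothetical one-feature "flows" and deliberately simple classifiers) compares two candidate models on both accuracy and training time rather than accuracy alone:

```python
import time

def majority_classifier(train):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def nearest_neighbor_classifier(train):
    """1-NN on a single numeric feature (e.g., a scaled mean packet size)."""
    def predict(x):
        _, label = min(train, key=lambda item: abs(item[0] - x))
        return label
    return predict

def evaluate(build, train, test):
    """Return (accuracy, training_time) for one candidate model."""
    start = time.perf_counter()
    model = build(train)
    train_time = time.perf_counter() - start
    accuracy = sum(model(x) == y for x, y in test) / len(test)
    return accuracy, train_time

train = [(0.10, "chat"), (0.20, "chat"), (0.90, "video"), (1.00, "video")]
test = [(0.15, "chat"), (0.95, "video")]
results = {
    name: evaluate(build, train, test)
    for name, build in [("majority", majority_classifier),
                        ("1-nn", nearest_neighbor_classifier)]
}
```

In a realistic NTC pipeline the candidates would be full classifiers and the comparison would use cross-validation, but the selection logic, trading off several criteria per model, is the same.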

E. Model Evaluation
Finally, the evaluation of the selected model is the last step in building a network traffic classifier. In this step, the performance of the ML model on unseen data is measured. The ML model should be able to give accurate predictions to be useful for the given task. However, accuracy is not the only evaluation metric for a classification task, and other metrics, such as the confusion matrix, F1 score, and recall, should also be considered. NTC is a classification task, and we use the same metrics to evaluate the performance of the proposed model.
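As a brief illustration, these metrics can be computed directly from the predictions; the example below (with made-up labels for two toy traffic classes) derives accuracy, precision, recall, and F1 for one class:

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and model output for five flows.
y_true = ["voip", "voip", "web", "web", "web"]
y_pred = ["voip", "web", "web", "web", "voip"]
m = binary_metrics(y_true, y_pred, positive="voip")
```

For multi-class NTC, these per-class scores are typically averaged (macro or weighted) across the traffic classes.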

F. Existing Work
Recently, several ML techniques have been proposed for network traffic classification. In this subsection, we categorize existing work in the literature based on the goals of network traffic classification, including identifying applications (also called apps), cyber security purposes, fault detection, website fingerprinting, user activities identification, and operating systems identification. We discuss these goals in more detail in the sequel.
• Mobile apps identification: This goal refers to analyzing and finally identifying the network traffic related to a particular mobile app. Given the ever-increasing number of mobile apps, network administrators and telecommunications companies are actively looking for rigorous methods to secure their infrastructure. App identification based on analyzing the network traffic of mobile apps can assist network administrators with resource management and planning, and app-specific policy establishment (e.g., security policy establishment and access management for a specific app). Furthermore, the identification of apps can help protect smartphone platforms (e.g., Android) against emerging security threats and uncover sensitive apps. Moreover, by app identification, it is possible to forbid the use of some particular apps (e.g., Google+ and Instagram) in an enterprise network [44]. Several papers have been published on app identification. Ajaeiya et al. in [44] present a framework for the classification of Android apps. The proposed framework identifies app traffic from a network viewpoint without adding any overhead on users' mobile phones. Moreover, the authors provide a pre-processing method for traffic flows to extract the most informative features for ML-based techniques. The work in [45] leverages a Variational Autoencoder (VAE) for the identification of mobile apps. The authors claimed that their method is able to label a massive number of instances and extract the features in mobile app traffic automatically. To this end, the authors first transform the mobile app traffic into meaningful images, and then use the VAE as a classifier. Similar work was carried out by Wang et al.
in [46], in which the authors design three DL-based models, including a Stacked Denoising Autoencoder (SDAE), a 1D Convolutional Neural Network (CNN), and a Long Short-Term Memory (LSTM) network, for mobile app identification. The authors in [47] provide a multi-classification scheme for the classification of mobile app traffic. More specifically, they combine several mobile traffic classifiers' decisions (knowledge) to classify traffic samples.
• Website fingerprinting: One line of work proposes Random Bidirectional Padding (RBP), a website fingerprinting obfuscation technique against intelligent fingerprinting attacks. It uses time sampling and random bidirectional packet padding to change the inter-arrival time characteristics of the traffic flow, making the patterns in network packets more complex to identify.
• User activities identification: Such traffic analysis can be used to obtain interesting pieces of information about a specific action that a mobile subscriber carries out on his/her device (e.g., posting a video on Twitter). The identification of user activities may also be performed to get information about a specific activity, such as the length of a message sent by a user within a particular chat application. User activity identification can be utilized by adversaries/researchers to reveal the identity behind an unknown user, e.g., on social media, who prefers to remain anonymous. This can be done by behavioral profiling of the users of a network, which is helpful for identifying reconnaissance within the network. Moreover, such traffic analysis offers a possibility to characterize users' habits in a network, e.g., chatting with friends in the morning and watching video streams in the evening. The user's behavior information can be employed later to detect the user's presence in the network.
• Operating systems identification: The works in [60] and [66] investigate the performance of three well-known operating system fingerprinting techniques, including user-agent analysis, TCP/IP parameter fingerprinting, and specific-domain communication. Performance measures reveal that the method based on user-agents provides better performance than its counterparts.

IV. OVERVIEW ON AL
A supervised ML model learns to discriminate the different traffic classes by being trained on labeled training data. While capturing large quantities of network data is relatively easy, analysing the data with ML techniques can be a very time-consuming, expensive, or human-labor-intensive process. This is mainly because of the complexity of ML techniques, or the shortage of labeled data resulting in inefficient training. In order to reduce the number of needed labeled examples and, consequently, reduce the effect of the ground-truth challenge, AL can be used to facilitate labeling.
AL systems can participate in the gathering and selection of training instances, such that only the most informative examples need to be labeled. Using AL, a learner follows an iterative strategy in which it interacts with an oracle to choose the most useful data instances to be labeled; thereby, it reduces the cost of data labeling by using only a few labeled examples to deliver satisfactory performance in a reasonable time. The AL paradigm is illustrated in Fig. 2, in which the three core components are the query strategy, the annotator, and the ML model. The query strategy is responsible for choosing unlabeled data according to a pre-defined policy. A label is then provided for the selected data by a human/machine annotator, and the data is added to the set of training instances. Afterwards, the model is updated, and the process is repeated as long as new data is available or a stopping criterion is satisfied. Different stopping criteria can be defined to end this iterative process, such as reaching the desired accuracy, a running-time limit, or a maximum number of queries, which can directly affect the performance of AL.
There are mainly two AL scenarios to consider, namely stream-based selective sampling and pool-based sampling (presented in Fig. 3). In the former, the distribution of unlabeled instances is known, and the instances are considered one at a time. The learner observes each instance in sequence and decides whether the instance should be labeled or discarded. AL is a promising technique to alleviate the challenge of streaming-based learning scenarios [67], [68]. AL algorithms designed for streaming scenarios can control the labeling process and gradually perform this process over time [69]. Using this strategy, it is expected that the labeling process stays balanced and that the algorithms detect changes in the stream. In the case of pool-based sampling, a pool of unlabeled data is provided, and the aim of the learner is to select the most informative instances from the pool to be labeled by the annotator. Pool-based sampling is attractive for many real-world learning scenarios, as it is possible to collect a large body of unlabeled data at once. Pool-based sampling presumes that a limited amount of labeled data and a big pool of unlabeled data are available.
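To make the pool-based scenario concrete, the following toy sketch (all data and the one-dimensional threshold "classifier" are hypothetical) repeatedly queries the pool instance closest to the current decision boundary, i.e., the most uncertain one, and asks an oracle for its label:

```python
def fit_threshold(labeled):
    """Place the decision threshold midway between the two observed classes."""
    lo = max(x for x, y in labeled if y == 0)
    hi = min(x for x, y in labeled if y == 1)
    return (lo + hi) / 2

def oracle(x):
    """Ground-truth annotator: class 1 iff the feature is at least 0.5."""
    return 1 if x >= 0.5 else 0

pool = [i / 20 for i in range(1, 20)]   # unlabeled pool: 0.05 .. 0.95
labeled = [(0.0, 0), (1.0, 1)]          # small labeled seed set

for _ in range(5):                      # query budget of five labels
    threshold = fit_threshold(labeled)
    # query strategy: pick the pool instance nearest the boundary
    query = min(pool, key=lambda x: abs(x - threshold))
    pool.remove(query)
    labeled.append((query, oracle(query)))  # annotator provides the label

threshold = fit_threshold(labeled)      # final model after active querying
```

With only five queries the threshold converges close to the true boundary at 0.5, whereas labeling the whole pool would have cost 19 annotations; this is the labeled-data saving the AL loop in Fig. 2 aims at.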

A. Active learning query strategies
The fundamental question in AL is: what is the most effective strategy for querying data instances? In NTC applications, different query strategies can be used based on various network circumstances, e.g., new unknown flows, changes in the behavior of network traffic, and the discovery of unclassified network traffic. We first introduce the most well-known AL query strategies widely used in the literature, and then evaluate their performance in Section VI. Note that in ML terminology, the hypothesis space refers to all possible legal hypotheses, where a hypothesis is a particular computational model that best explains the target data in supervised ML. In active learning settings, a query strategy can search the hypothesis space by testing unlabeled samples to reduce the number of legal hypotheses under consideration.
• Uncertainty sampling (UNC): In UNC, a learner prefers to label the instances about whose class the model is most uncertain. The idea behind this strategy is that the examples on which the model exhibits the highest degree of uncertainty are the most likely to improve the performance of the model over time. Different criteria for measuring uncertainty, also called uncertainty strategies, have been proposed, including posterior probability, smallest margin, and entropy [70]. Entropy is one of the most popular uncertainty strategies in many AL problems.
In an n-class classification problem, assume the estimated probabilities of the n classes are p_1, ..., p_n, respectively. Given the currently labeled data instances, the entropy is defined as E(X) = −∑_{i=1}^{n} p_i log(p_i). Given this expression, a larger value of the entropy means a higher level of uncertainty. Accordingly, this objective function can be treated as a maximization problem: the instance with the largest entropy is queried.
• Query-By-Committee (QBC): In QBC, an AL system consists of a committee of different learners trained on the current labeled data.These learners are then used to make a prediction on the labels of unlabeled data.
The instances for which the committee members disagree the most on the correct label are selected for labeling.
Then, the committee of learners uses the newly labeled examples for training. The QBC strategy creates wider diversity than UNC because it considers the differences in the predictions of several different learners, instead of measuring the uncertainty of labeling with a single learner. However, the technique for measuring the disagreement is often similar for both query strategies [71]. In the QBC strategy, the vote entropy and KL-divergence metrics are usually applied to measure the disagreement. In the literature, two major approaches have been proposed to construct a committee of learners. In the former, one changes the parameters/hyperparameters of a particular model (e.g., by sampling) to generate different models and, consequently, the committee of learners. In the latter, the committee is built from a bag of different learners (i.e., an ensemble of learners).
• Learning Active Learning (LAL): The main idea behind this strategy is to train a regressor that forecasts the expected error reduction (EER) for an instance in a specific learning state. Indeed, this technique formulates the query strategy for unlabeled data as a regression problem. Then, given a trained classifier and its output for a specific unlabeled instance, LAL forecasts the decrease in generalization error that can be achieved by labeling that instance. Interested readers are referred to [72] for details.
• Random: This refers to the conventional supervised learning scheme in which instances are randomly selected to be labeled. Since data labeling is an expensive procedure, random sampling may not lead to the best learner, especially when the query of each sample is costly and, consequently, few labels will finally be available [71].
• Information Density (Density): Uncertainty sampling, QBC, and LAL query strategies are all prone to choosing outliers or unrepresentative instances and, consequently, this can lead to sub-optimal queries. A solution is to use the representativeness of an instance to ensure the selected instances resemble the overall distribution. When deciding whether to query an instance, a combination of representativeness and informativeness is typically used [73]. In the density query strategy, the representativeness of a data instance is often measured by its closeness to all other data instances.
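To make the QBC disagreement measure concrete, the vote-entropy metric mentioned above can be sketched as follows. This is a minimal NumPy illustration written for this survey; the toy committee votes are invented, not taken from any cited work.

```python
import numpy as np

def vote_entropy(committee_votes, n_classes):
    """Vote-entropy disagreement for a committee of classifiers.

    committee_votes: (n_learners, n_samples) array of predicted labels.
    Returns one disagreement score per sample; higher = more disagreement.
    """
    n_learners, n_samples = committee_votes.shape
    scores = np.zeros(n_samples)
    for c in range(n_classes):
        # Fraction of committee members voting for class c, per sample.
        p = (committee_votes == c).sum(axis=0) / n_learners
        # Accumulate -p * log(p), skipping classes that received no votes.
        nz = p > 0
        scores[nz] -= p[nz] * np.log(p[nz])
    return scores

# Toy committee of 3 learners over 4 unlabeled samples:
votes = np.array([[0, 1, 1, 2],
                  [0, 1, 2, 0],
                  [0, 1, 2, 1]])
scores = vote_entropy(votes, n_classes=3)
query_idx = int(np.argmax(scores))  # the sample the committee disagrees on most
```

Samples on which the committee agrees unanimously get a score of zero, while a three-way split yields the maximum score ln 3; the strategy queries the label of the latter.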

V. ACTIVE LEARNING FOR NETWORK TRAFFIC CLASSIFICATION
As explained in Section III, NTC has attracted much interest in recent years and different ML methods have been proposed to solve the NTC problem. However, most of these methods suffer from various challenges, such as requiring a large amount of fully labeled data, the existence of a considerable amount of semi-labeled or unlabeled data in real-world network scenarios, and a complex, costly, and time-consuming methodology for data labeling. Providing labels to data instances is especially challenging for NTC techniques, because one must consider several requirements in terms of traffic data granularity in order to satisfy the desired traffic classification objectives. One can, for example, refer to classes at the application level (e.g., Skype or Facebook), the protocol level (e.g., TCP or HTTP), or the service group level (e.g., browsing or streaming) as typical examples of data granularity [24]. Moreover, updating a traffic classification method is time-consuming. However, updating the models may be needed to increase the accuracy of the method or to recognize new applications, protocols, or protocol versions. The update is essentially performed using new labeled data.
AL is a promising research field in this context, as it greatly reduces the cost of training and dramatically speeds up the learning phase [74]. This helps ML-based traffic classification methods to better satisfy the aforementioned requirements, namely the data requirements, which are eased by attaching labels only to the most informative instances, and the need for updating the model to identify new types of traffic.

A. Advantages of using AL for NTC purposes
AL is potentially a good candidate to perform NTC. Below, we summarize the advantages of using AL techniques in the field of NTC:
• Less data needed for labeling: As mentioned before, most conventional networks generate unlabeled and semi-labeled data. Meanwhile, one of the key challenges in using learning-based techniques for NTMA is the lack of, or limited access to, labeled instances. Moreover, data labeling is often not a straightforward procedure and can raise costs in terms of time, human effort, and computational overhead. In addition, if data labeling is performed manually or by online tools, it can reduce the data quality, since not all data instances are informative. AL can tackle this concern by labeling only the most informative instances. To this end, a comprehensive set of querying strategies has been proposed in AL to determine the quality of instances for labeling [14].
• Dealing with the Theory of Network challenge: As discussed at Internet Engineering Task Force 97 (IETF97)1, networks suffer from the lack of a unified theory that can be applied to all of them; the behaviors of different networks vary based on their topology, equipment, scale, applications, etc. The Theory of Network challenge causes an important problem: ML techniques should be trained for each network separately. AL can be considered a suitable online learning choice in such cases thanks to its ability to learn from a limited number of data samples. This is beneficial for highly dynamic networks with a huge volume of starting and stopping network traffic. AL also allows frequent retraining, which eliminates the necessity of using representative datasets.
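The advantage of labeling only the most informative instances can be sketched with a minimal pool-based AL loop. The classifier, the two-blob "traffic" features standing in for, e.g., WWW vs. P2P flow statistics, and the margin-based uncertainty score below are all simplified stand-ins of our own, not components of any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class "traffic" features: two Gaussian blobs as a stand-in
# for real flow statistics (hypothetical data, for illustration only).
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
y = np.concatenate([np.zeros(200, int), np.ones(200, int)])

labeled = [0, 1, 200, 201]                       # tiny seed set, 2 per class
pool = [i for i in range(len(X)) if i not in labeled]

def fit_centroids(idx):
    # A nearest-centroid "classifier": one mean vector per class.
    return np.array([X[[i for i in idx if y[i] == c]].mean(axis=0) for c in (0, 1)])

def margin(x, cents):
    d = np.linalg.norm(cents - x, axis=1)
    return abs(d[0] - d[1])                      # small margin = uncertain sample

for _ in range(20):                              # query budget of 20 labels
    cents = fit_centroids(labeled)
    # Uncertainty sampling: pick the pool instance closest to the boundary.
    q = min(pool, key=lambda i: margin(X[i], cents))
    pool.remove(q)
    labeled.append(q)                            # the oracle reveals y[q]

cents = fit_centroids(labeled)
pred = np.array([np.argmin(np.linalg.norm(cents - x, axis=1)) for x in X])
accuracy = (pred == y).mean()
```

With only 24 labels out of 400 samples, the actively trained model already classifies the pool accurately, which is the behavior the advantages above describe.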
1 https://www.ietf.org/blog/reflections-ietf-97/

B. Literature Review on using AL in NTC
In this section, we review existing work on the application of AL in NTC.
Torres et al. [80] proposed a botnet detection technique based on AL. The authors provided a novel AL strategy to label network traffic that contains both normal and botnet traffic. The AL strategy is used to create a random forest model that benefits from the user's previously labeled instances. The primary objective of the proposed technique is to help the user in the labeling process. Similarly, the work in [81] employed AL for a security purpose, i.e., malware classification. In this work, SVMs and active learning by learning (ALBL) have been combined to tackle the lack of labeled instances in malware detection. The simulation results reveal that using AL can enhance the performance of classification in terms of accuracy and the quality of labeled instances. In addition, the authors claimed that by using different training algorithms, e.g., Generative Adversarial Networks (GANs), one can solve issues such as the diversity of security-related datasets.
The work in [82] is another attempt to develop an accurate malware detection system. The system is based on AL, where a new Structural Feature Extraction Methodology (SFEM) is introduced to extract features from docx files. The proposed system is able to identify new, unknown malicious docx files. To keep the detection model updatable and identify new malicious files, the system benefits from AL to update and complement the signature database with new, unknown malware.
Common cybersecurity attack vectors, such as viruses, botnets, and malware, are known to Intrusion Detection Systems (IDSs). Nevertheless, malicious users continuously create new attacks that can bypass the IDSs. Analyzing anomalous behaviors calls for a considerable amount of time and effort. Preparing a significant amount of labeled data for the training process is both increasingly costly and inefficient, because of the continuous design of new attacks. In this case, one can use AL to reduce the number of required labeled instances, while increasing the accuracy of anomaly detection. In [83], a semi-supervised IDS has been designed that works effectively with a small number of labeled instances. The proposed learning algorithm for the IDS benefits from two ML techniques, namely Active Learning Support Vector Machine (ASVM) and Fuzzy C-Means clustering. Furthermore, [83] reported that the proposed learning algorithm enables the IDS to add new training instances with minimum computational overhead. Since domain knowledge is required for the annotation of unlabeled instances, adopting new cost-effective labeling techniques is desired. To this end, the work in [84] by Beaugnon et al. developed an interactive labeling strategy, namely ILAB, to assist experts in the labeling process of large intrusion detection datasets. ILAB adopts a divide-and-conquer approach to lower the computation cost. Deka et al. [85] investigated the important role of AL in the selection of more informative instances. Then, they used these instances to train a binary IDS for Distributed Denial of Service (DDoS) attack classification. In addition, since there are massive amounts of traffic in modern networks, a parallel computation method has been employed. The authors referred to the fact that using AL is an efficient technique to keep the cost of labeling down and avoid keeping redundant training instances. Domain-specific anomaly detection has been targeted in [86] by establishing a tripartite AL-based framework to interactively detect anomalies. In this framework, a two-stage algorithm is used for labeling instances. In the first stage, the algorithm selects instances for labeling based on the uncertainty criterion. Then, it uses a technique to evaluate and train multiple annotators with the most appropriate instances. Unlike other related works on AL, this paper investigates human-in-the-loop active ML. Another work that adopted AL has been conducted by Shu et al. [87]. They focus on the investigation of adversarial attacks on ML-based IDSs. More specifically, the authors propose using AL and generative adversarial networks to evaluate the related threats in IDSs that use learning-based techniques. They highlight that current adversarial attack techniques need a massive amount of labeled data for training purposes. They propose using AL as a solution to tackle this problem.
Wassermann et al. in [88] examine two central issues in stream-based ML and online network monitoring: 1) how to learn in dynamic environments in the presence of concept drifts, and 2) how to learn with a small number of labeled data samples and regularly improve a supervised model through new samples. To deal with these issues, the authors propose two stream-based ML algorithms, namely ADAM and Reinforcement Active Learning (RAL). The RAL algorithm is based on AL to reduce the need for ground truth samples in stream-based learning. The authors then use the proposed algorithms in continuous network monitoring for attack detection.
As explained above, most existing works focus on the use of AL in NTC for network security improvement, but there are a few works that use AL-based NTC techniques for other needs. The work in [89] uses AL for P2P traffic classification. The authors proposed P2PTIAL, a new method for P2P traffic classification based on AL, consisting of two parts: an SVM classifier and an uncertainty query strategy. Moreover, to further improve the classification performance, they added filtering and a balancing policy to their method. The main idea behind the filtering is to discard less informative unlabeled samples and consequently save cost in terms of computation and storage space. The authors in [90] use the fusion of AL and semi-supervised learning for industrial fault detection. They propose using AL to improve the performance of the Fisher discriminant analysis method, since it has difficulty when the labeled instances are not satisfactory. The simulation results reveal that AL and semi-supervised methods can complement each other. Similar work has been performed in [91], in which the authors propose an AL method based on Support Vector Data Description (SVDD) for novelty detection in industrial data. The work in [92] introduces an active version of the multi-class SVM algorithm, called cost-sensitive multi-class SVM (CMSVM), to deal with the imbalance problem in network traffic. To this end, CMSVM uses a multi-class SVM technique with AL which dynamically allocates a weight to class labels. The authors claim that the proposed method can reduce the computation load, increase classification accuracy, and alleviate the imbalanced data problem.
A summary of the papers reviewed in this section is provided in Table II. As shown in the table, AL techniques in NTC are mainly used for the improvement of network security.

VI. EMPIRICAL EVALUATION OF AL IN NTC
In this section, as a technical survey, we provide useful insights into the performance of AL and evaluate the most effective and widespread query strategies for AL settings, as the query strategy is the most important aspect of AL in NTC.
As mentioned in Section V, the two important challenges in using ML for NTC are the need for retraining and the shortage of data samples. Our evaluation is conducted based on these challenges. In particular, we evaluate the performance of AL-based classification with respect to training time and the shortage of data samples. To this end, two examples of using the active form of learning for NTC are studied. It should be noted that we do not consider any labeling technique in the performance evaluation. We assume that all existing samples are labeled and, hence, labeling is out of the scope of our performance evaluation. However, in real-world applications, the labeling time should be considered.
In the first example, we implement a stream-based AL model for an NTC task, as shown in Fig. 4a. To this end, we use the Cambridge network traffic dataset as a benchmark dataset and Random Forest as the classifier [30]. The dataset has been captured by the Computer Laboratory of the University of Cambridge at various times of the day from several hosts at three institutions with about 1000 users. This dataset consists of 12 common categories of traffic flow: WWW, MAIL, ATTACK, P2P, SERVICES, DATABASE, INTERACTIVE, MULTIMEDIA, GAMES, FTP-CONTROL, FTP-DATA, and FTP-PASV. In stream-based AL, the unlabeled instances are presented one by one to the model. Then, for each observed instance, the learner queries the annotator for its label if it recognizes the given instance as useful for the model. For example, one can recognize an instance as useful if the model's prediction for it is uncertain, since asking for its label may resolve this uncertainty. Different strategies have been proposed for stream-based AL [93]. Most of these strategies select informative instances according to a single criterion. Considering a single criterion for instance selection may be problematic for AL, especially for stream-based AL, in which a subset of instances selected to be labeled cannot properly reflect the original distribution of the dataset.
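The one-by-one query protocol described above can be sketched as follows. This is an illustrative toy, not our actual Random Forest setup: the online logistic model, the synthetic two-feature stream, the hidden labeling rule, and the uncertainty threshold are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_proba(w, x):
    """Probability of class 1 under a simple logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.zeros(3)            # online logistic classifier; x[2] carries the bias
queries = 0
threshold = 0.2            # query when |p - 0.5| < threshold (uncertain region)

for _ in range(2000):      # synthetic stream standing in for flow features
    x = np.append(rng.normal(0, 1, 2), 1.0)
    true_y = int(x[0] + x[1] > 0)          # hidden ground-truth rule (oracle)
    p = predict_proba(w, x)
    if abs(p - 0.5) < threshold:           # uncertain -> ask the oracle
        queries += 1
        w += 0.5 * (true_y - p) * x        # SGD update with the new label
    # confident instances pass through without being labeled

fraction_queried = queries / 2000
```

As the model becomes confident, the uncertain band around the decision boundary narrows, so only a fraction of the stream is ever labeled, which is exactly the property that makes stream-based AL attractive for NTC.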
As shown in Fig. 4a, we measure the performance of the stream-based AL for NTC. It can be seen that the model provides around 30% accuracy on the initial training dataset. Next, the model chooses useful incoming instances and uses them as new training data. The learning process can be continued until the model reaches the desired accuracy threshold; in our implementation, we set 95% as the threshold. As another example, we consider a pool-based sampling scenario (non-stream) for NTC. Under this scenario, a small set of labeled traffic data Γ and a large set of unlabeled traffic data U are chosen so that |Γ| ≪ |U|.

Table II: Summary of papers on active learning for NTC.

Ref. | NTC challenge | Method | Technical contribution
[80] | Botnet detection | AL + random forest | Proposes a new AL strategy to facilitate the network traffic labeling process, especially for security purposes.
[81] | Malware classification | ALBL + SVMs | Combines AL by learning and SVM to construct a classifier for malware classification with a small number of labeled samples.
[82] | Malware detection | AL + SFEM | Establishes a novel AL-based framework, namely ALDOCX, to identify new, unknown malicious docx files.
[83] | Intrusion detection | AL + ASVM + Fuzzy C-Means | Provides a semi-supervised method for intrusion detection based on an active version of SVM and fuzzy c-means.
[84] | Labeling IDS datasets | AL + divide and conquer | Proposes an interactive labeling technique, called ILAB, for network security purposes, which decreases the cost of labeling in terms of workload.
[85] | DDoS attack classification | AL + SVM | Provides a parallel cumulative ranker method to rate the features of a network traffic dataset, then leverages AL to select the most informative samples for training an SVM classifier.
[86] | Anomaly detection | Tripartite AL | Establishes a tripartite AL framework to detect anomalies in an interactive manner through crowd-sourced labels.
[87] | Adversarial attacks on IDSs | AL + GANs | One of the first works to use AL to deal with the lack of labeled data in the investigation of adversarial threats against ML-based IDSs.
[88] | Attack detection | Reinforcement learning + AL | Proposes two ML algorithms for stream-based environments to tackle common problems in these environments, such as the lack of labeled samples, concept drifts, and keeping the model updated.
[89] | P2P traffic classification | AL + SVM | One of the first works that used AL for P2P traffic classification to tackle the data labeling challenge.
[90] | Fault detection | AL + semi-supervised learning | One of the first works to combine AL with semi-supervised learning to improve the performance of the Fisher discriminant analysis method.
[91] | Novelty detection | AL + SVDD | Proposes using AL to address the weaknesses of SVDD, i.e., poor performance when data is massive and of low quality.
[92] | Traffic classification | AL + multi-class SVM | One of the latest works that adopts AL and a variant of SVM for traffic classification.

As can be seen in Fig. 4b, AL gives better performance than the random sampling method (i.e., a passive learner). In active sampling, instances are selectively chosen from the pool of instances. As mentioned, different query strategies have been proposed to choose the most informative instances. By doing so, one can reduce the need for labeled instances and achieve higher accuracy than a passive model. Fig. 5 shows the evaluation results for the training accuracy ratio and training time ratio of the aforementioned query strategies on three well-known benchmark datasets, i.e., TRAbID [94], VPN-nonVPN [95], and Tor-nonTor [96]. As there are more than three benchmark datasets in this field, we selected these datasets based on their volume of data, to cover small, medium, and big datasets. For each dataset, we first shuffle the data and then select 0.5%, 1%, 2%, 4%, 8%, 16%, 32%, and 64% of the whole dataset as subsets to train and test the model with different query strategies. Moreover, to clarify the figures, we elaborate them in Tables III, IV and V. The X-axis of the figures shows the training time for the full dataset. As an example of how to interpret the tables and figures: when 0.5% of the dataset is injected to train the model using the QBC query strategy, it takes about 3.4 seconds to achieve 74% accuracy. In the tables, we calculate the time and accuracy for each subset using the different query strategies on the various benchmark datasets. In addition, to calculate the Training Accuracy Ratio (TAR) and the Training Time Ratio (TTR), we
use Eq. 1 and Eq. 2 as defined below. Fig. 5a and Table III show the results for TRAbID (the smallest evaluated dataset). The dataset originally has 9159 samples and 43 features, but based on domain knowledge, we selected 41 features for training and testing. As shown in the table and figure, Random is the fastest strategy, but with a small number of samples it cannot achieve high accuracy. UNC achieves considerable accuracy in a reasonable time compared to the other query strategies. LAL is the slowest query strategy and cannot achieve high accuracy with a small number of samples.
Fig. 5b and Table IV show the results for VPN-nonVPN, which has 18759 samples. Based on domain knowledge, we select 22 features out of 24. As with TRAbID, Random and UNC are the fastest query strategies that can achieve considerable accuracy. The most important difference between the VPN-nonVPN and TRAbID results is the training time, not the accuracy: given the same amount of training data, the evaluated query strategies achieve the same accuracy, but their training times differ considerably. Fig. 5c and Table V show the results of the evaluation on the Tor-nonTor dataset. The whole dataset contains 84194 samples and originally 28 features, of which 18 are selected, based on domain knowledge, for testing and training. As for the other datasets, Random and then UNC are the fastest query strategies to achieve high accuracy. LAL is very slow but, compared to the other query strategies, it achieves the same accuracy with the same percentage of the whole dataset. Overall, Random is the fastest query strategy but achieves low accuracy. LAL is the slowest query strategy, but it performs well in terms of accuracy. Comparing time and accuracy, UNC is the most adequate query strategy to achieve appropriate accuracy in a reasonable time. Regarding the performance of AL, the results show that, with an appropriate query strategy, AL can achieve high accuracy within a limited training time. The results also show that AL is appropriate for NTC models, as the models can be retrained very quickly with a limited number of samples.
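The TAR and TTR metrics used above can be computed as follows. The training routine and data below are trivial placeholders of our own (a class-prior "model" on synthetic data), inserted only to show how the two ratios are measured around a real train/select pipeline.

```python
import time
import numpy as np

def timed(fn, *args):
    """Run fn(*args), returning (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Placeholder pipeline (hypothetical): a "model" is just the majority class.
def train(X, y):
    return np.bincount(y, minlength=2).argmax()

def accuracy(model, X, y):
    # The placeholder model ignores X and always predicts its stored class.
    return (y == model).mean()

rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 8))
y = (rng.random(10000) < 0.7).astype(int)

model_full, full_time = timed(train, X, y)
idx = rng.choice(len(X), 500, replace=False)   # the subset a query strategy picked
model_sub, sub_time = timed(train, X[idx], y[idx])
select_time = 0.0  # the query strategy's selection cost would be measured here

TAR = accuracy(model_sub, X, y) / accuracy(model_full, X, y)   # Eq. (1)
TTR = (sub_time + select_time) / full_time                     # Eq. (2)
```

A TAR close to 1 with a TTR well below 1 is the regime in which an AL query strategy pays off: near-full accuracy at a fraction of the training cost.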

VII. CORE CHALLENGES AND CONSIDERATIONS
Although AL can achieve a reasonable level of accuracy using far fewer labeled examples than traditional ML algorithms, there are still some unsolved problems. In this section, we discuss these challenges and identify the relevant open issues and considerations.

A. Challenges
The challenges of using AL for NTC are listed below. Table VI summarizes the challenges and some of the literature that tries to address them.
• Noisy annotation: In conventional ML, the labels are often assumed to be the ground truth without noise; however, when the instances are annotated by humans or machines that are prone to errors, additional considerations must be made [97]. The existence of noisy labels is problematic for the uncertainty sampling strategy, due to the fact that this strategy is intrinsically noise-seeking [98]. In networking, noise can be generated deliberately or accidentally. Intruders can generate packets to mislead NTC systems. For example, in [99] the problem of noise in worm signature extraction algorithms has been investigated. In these algorithms, deliberate noise can prevent building reliable and useful worm signatures, and consequently degrade the performance of IDSs. Noise can also be generated accidentally due to human or automated labeling mistakes. In [100], Donmez et al. proposed proactive learning to deal with noisy labels, as it uses a decision-theoretic method to jointly choose the optimal oracle and instance. In [101] and [102], choosing the most informative instances and combining density-weighted uncertainty sampling with standard uncertainty sampling, respectively, are proposed to address the noisy annotation issue.
• Stopping criteria: The stopping criteria determine when to stop querying unlabeled instances, e.g., upon reaching the desired accuracy, a running time limit, or a maximum number of queries, to name a few. As NTC is sensitive in both the time and accuracy aspects, providing efficient stopping criteria is necessary. For example, in the case of security support for network traffic, accuracy has the higher priority, while multimedia streams are mainly time-sensitive. In the latter case, the adoption of the right stopping criteria can reduce the time complexity of AL techniques for re-training the model. Moreover, many studies on AL have adopted the convergence of the error rate on a set of unseen data samples as the stopping criterion. Nevertheless, providing a new
annotated unseen dataset is in direct contradiction with the goal of AL, since the purpose is to reduce the need for annotation. In addition, if the unseen dataset contains a small number of instances, there is no guarantee that it will be representative of future instances. To alleviate this challenge, the authors in [103], [104], [105], [106] proposed different stopping criteria based on, e.g., the stabilization of forecasts over different iterations, batch-wise learning, etc.
• Outliers in the data: In a canonical definition, an outlier is a data example that differs considerably from the other data points in a dataset; it may occur by chance or due to an experimental error, and it hampers the analysis of the overall behavior of the dataset. Labeling outlier instances by a human/machine annotator may not help in predicting the labels of unlabeled instances, because such instances are unlikely to be representative of the underlying distribution.
• Drifting distribution: Query strategies typically rest on the assumption that the data is identically and independently distributed (i.i.d.); however, the selection of instances based on a non-random distribution, which shifts over time, may lead to divergence between the prior distribution of instances and the current one [103], [76], [111]. Such divergence can affect the performance of the learning task (i.e., classification or regression), since the i.i.d. assumption is no longer valid. In NTC, drifting distributions in the data gathered from the network can appear because of different events, e.g., network congestion, inefficient queuing models in middle nodes, and inefficient packet inspection models.
• Instance selection criteria: In query strategies, the criteria for selecting instances are very important for an AL system, as the system uses the selected instances as a training set. Despite the several query strategies proposed, such as those described in Section IV, there is a lack of research on the role of the annotator (human or machine) in the labeling process. In other words, the annotator plays a passive role in providing a label for an instance, and the lack of understanding about the importance of querying is apparent. In addition, different query strategies can be employed based on the volume of existing samples and the nature of the network traffic.
Based on the Theory of Network challenge, there is no specific policy for using query strategies, as the behaviors of networks differ.
• Labeling method: Although Section VI shows that AL techniques are efficient for NTC, especially in improving the training time and dealing with the shortage of data samples, the main drawback of using AL is the lack of a concrete labeling technique. Due to the concept drift challenge, the model should be retrained frequently. On the other hand, as mentioned above, labeling the selected instances can be performed by a human or a machine, but in real-world applications of NTC, using human labor is almost impossible due to slowness and inefficiency. Considering that in most real-world networking applications data is unlabeled or semi-labeled [12], a machine-based labeling method is highly needed to perform labeling at the speed of the network with high labeling accuracy. The authors in [13] have proposed an interesting idea of combining human and artificial expertise for labeling commits in software development, which has the potential to be applied to NTC and meet the mentioned criteria.
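Among the challenges above, the stopping criterion is the easiest to make concrete. The sketch below shows a stabilization-based stopping rule of the kind discussed under "Stopping criteria"; the window size, tolerance, and budget values are illustrative assumptions, not recommendations from the cited works.

```python
def should_stop(accuracy_history, window=3, eps=0.005, budget=200, spent=0):
    """Stabilization-based stopping rule for an AL loop (illustrative).

    Stop when the last `window` accuracy estimates changed by less than
    `eps`, or when the labeling budget is exhausted.
    """
    if spent >= budget:
        return True                      # no labels left to spend
    if len(accuracy_history) < window:
        return False                     # not enough evidence yet
    recent = accuracy_history[-window:]
    return max(recent) - min(recent) < eps

# The learner keeps querying while accuracy still moves...
assert not should_stop([0.60, 0.70, 0.78])
# ...and stops once the estimates stabilize or the budget runs out.
assert should_stop([0.930, 0.931, 0.932])
assert should_stop([0.5], budget=200, spent=200)
```

In a time-sensitive NTC deployment one would tighten the budget, while in an accuracy-sensitive (e.g., security) deployment one would tighten `eps`, reflecting the time/accuracy trade-off discussed above.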

B. Considerations and open issues
There are some considerations and open issues that should be studied before using AL models in NTC, as discussed below.
• The existing gap between deep learning and AL techniques: Deep learning models are being deployed in a diverse range of real-world applications, such as self-driving cars, anomaly detection, wearables, NTMA [115], and healthcare. However, to achieve the full potential of deep learning in such applications, the lack of labeled data is one of the major barriers. Recent works in the context of AL focus on proposing new or improved algorithms, and only a few of them deal with the problem of intelligent data collection [116]. From a data science perspective, it is desirable to fill this gap. We believe that conducting further research on deep AL [117] and human-in-the-loop ML [118] will narrow the gap. In these branches of ML, intelligent data collection is a step of the ML process. Human-in-the-loop ML can further facilitate the labeling of difficult or new instances that a machine annotator cannot handle.
• NTMA applications in which AL is still missing: As mentioned above, many works deploy AL for security purposes, such as intrusion detection and malware detection. However, there is still a long list of NTMA applications, such as traffic forecasting, fault management, QoS management, and routing, in which AL has not been employed. For instance, to the best of our knowledge, no studies have been conducted on coupling network traffic prediction with AL. This may be due to the dynamic nature of traffic prediction tasks and the difficulty of working with streaming data. The traffic load on communication networks changes over time and across situations. Indeed, the distribution of the traffic load may face the problem of drifting distribution.
• Lack of appropriate public network traffic datasets: NTC involves the categorization of network traffic into several traffic classes, such as WWW, mail, attack, P2P, and multimedia. However, as mentioned above, network security has become the dominant application of AL-based traffic classification. For example, we have found only two papers that have used AL
for P2P traffic classification [119], [89]. This is because there are few relevant public non-security-related traffic datasets. Moreover, the datasets used in these papers are often not updated. The ever-growing popularity of encryption protocols and the number of Internet-based applications call for up-to-date non-security-related traffic datasets for ML techniques. Although the Theory of Network indicates that benchmark datasets cannot be used in all networks, having such datasets can at least help evaluate the performance of ML techniques, e.g., AL in NTC.
• Uncertainty in the performance of AL based on the Theory of Network: Although the performance of AL is evaluated on benchmark datasets, there is no guarantee of such performance in each individual network. Different factors, e.g., weak samples, the ML techniques used, and the complexity of network traffic, can affect the performance of AL in each network. Overfitting can also be a considerable challenge in different networks regarding the use of AL. Although in this study we evaluate the performance of AL in a general manner, its performance should be studied in the real-world target network.
• Resource-constrained networks: Most ML algorithms, including AL, are designed to run on resource-rich devices, but in real-world applications, many devices are resource-poor, e.g., IoT, Edge, and Fog devices. Thus, ML techniques, including AL techniques, need to be optimized to be efficiently executable on such devices. As a solution, different distributed learning techniques have been introduced that use the resources of resource-constrained devices to run ML algorithms in a distributed manner, e.g., federated learning [120].

VIII. CONCLUSION
In this article, active ML for NTC has been investigated. We first provided a background on NTC and AL: the main traffic classification techniques and related issues were summarized, and the fundamental elements of AL were explained. Moreover, the core issues and challenges of AL were introduced. This article can also serve as a reference for how to use AL for NTC. Using simulations, the impact of AL on the learning process was analyzed. Based on the conducted investigations, the benefits of AL for NTC were discussed throughout the paper. We showed that AL provides an excellent opportunity to improve the performance of ML techniques used in NTC, as it can reduce both the training time and the need for labeled data. As the main challenge of networking, network dynamicity forces ML techniques to be retrained frequently, and AL helps retrain the model using a limited number of samples.

Figure 1 :
Figure 1: The main steps in building a network traffic classifier.
(a) Stream-based scenario. (b) Pool-based scenario. (Axes: query iteration vs. accuracy.)

Figure 4 :
Figure 4: Percentage of the classification accuracy for the stream-based and pool-based scenarios on the Cambridge dataset.
TAR = Accuracy of the model trained on the subset / Accuracy of the model trained on the full dataset (1)
TTR = (Subset training time + subset selection time) / Full dataset training time (2)

Figure 5 :
Figure 5: Experimental results on AL using different query strategies on NTC datasets. Each point in the graphs shows what percentage of the dataset has been used for training, ranging from 0.5%, 1%, 2%, 4%, up to 64%, as indicated in Tables III, IV and V.
Pacheco et al. comprehensively surveyed the use of ML techniques in NTC for different cases, e.g., encrypted network traffic. Understanding the challenges of using ML techniques in NTC, they studied reliable label assignment, dynamic feature selection, and the integration of meta-learning processes. They considered these solutions to address several issues, including imbalanced network data, the dynamicity of networks, and online strategies for retraining the ML models. In [28], Gomez et al. compared seven ensemble ML techniques, including OneVsRest, OneVsOne, Error-Correcting Output-Code, the AdaBoost classifier, the Bagging algorithm, Random Forest, and Extremely Randomized Trees, which are all based on decision trees, in NTC. They compared them in terms of model accuracy, latency, and byte accuracy. In [29],

Table I: An overview of existing literature surveys on NTC and ML.
utilize federated learning for malware detection in IoT devices through one supervised model (based on a Multilayer Perceptron (MLP)) and one unsupervised model (based on an autoencoder). To evaluate the framework, they use the N-BaIoT dataset, which models the traffic of IoT systems impacted by malware. In [52], McLaughlin et al. present a DL-based method for Android malware detection using the raw opcode sequence as the input of a CNN model, which can automatically learn the features of malware instances. The authors claimed that the proposed method has a more straightforward training pipeline than previously proposed works (e.g., n-gram-based malware detection). Huang et al. [53] combine an unsupervised spatiotemporal encoder with LSTM to detect abnormal network traffic. The spatial features of the network traffic data are extracted in the first stage by the spatiotemporal model. Then, the obtained features are used to train another LSTM layer for classification. The NSL-KDD dataset was used for the evaluation of the model. Based on the experimental results, the proposed DL model makes intrusion detection significantly more efficient than traditional techniques. Website fingerprinting can help recognize fraudsters and other unusual activities. Moreover, website fingerprinting can be considered a type of traffic analysis attack that allows eavesdroppers to gather information on the victim's activities. Given the importance of website fingerprinting, there is a large body of literature on this topic. In [57], Rahman et al.
leverage the idea of adversarial ML to defend users against website fingerprinting attackers. The authors propose a method to generate adversarial examples that degrade the accuracy of attacks that use learning-based techniques for robust traffic classification. The simulation results show that the proposed method can reduce the accuracy of the state-of-the-art attack by half. The work in [58] focuses on the concept drift problem in static website fingerprinting attacks on the Tor network. The authors point out that static attacks are costly to maintain, both in updating the dataset and in retraining the model. Hence, they introduce AdaWFPA, an adaptive online website fingerprinting attack that leverages adaptive stream mining techniques. Luo et al. in [59]
• Cybersecurity purposes: One of the main goals of traffic classification is detecting security breaches in communication systems, e.g., intrusion detection, malware detection, anomaly detection, and worm detection. Cybersecurity tools/techniques (e.g., intrusion detection systems) aim to defend communication systems from internal/external threats. Traffic classification methods can be used to assess network traffic behavior by detecting malicious traffic flows/links, and then prevent attacks. A large body of work in the literature has focused on ML-based malware and intrusion detection. The authors in [48] propose an intrusion detection approach based on deep neural networks and compare the performance of DL with classical ML classifiers, demonstrating the superiority of DL models. Similarly, in [49], Shone et al. propose a non-symmetric deep auto-encoder-based learning solution for intrusion detection. The autoencoder network is used for learning features in an unsupervised manner. Then, they employ a stacked non-symmetric auto-encoder as a traffic classifier. In [50], Nguyen et al. propose a federated self-learning method to detect anomalies in IoT systems. Similarly, in [51], Rey et al.
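Several of the detectors discussed above, particularly the autoencoder-based ones, share a common decision rule: train on benign traffic only, then flag flows whose reconstruction error is unusually high. The sketch below shows that thresholding step in isolation; the error values and the factor k are illustrative assumptions, and the reconstruction errors themselves would come from whatever autoencoder is in use.

```python
def fit_threshold(train_errors, k=3.0):
    """Set the anomaly threshold from reconstruction errors measured on
    benign training traffic: mean + k standard deviations."""
    n = len(train_errors)
    mean = sum(train_errors) / n
    var = sum((e - mean) ** 2 for e in train_errors) / n
    return mean + k * var ** 0.5

def is_anomalous(reconstruction_error, threshold):
    """A flow the autoencoder cannot reconstruct well is flagged as anomalous."""
    return reconstruction_error > threshold
```

The appeal of this rule is that it needs no labeled attack traffic at all: only benign flows are required to calibrate the threshold, which is exactly the setting where labeled data is scarce.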
• Fault detection: Fault detection is part of a more extensive network management process called fault management. Fault management refers to a set of processes to detect, isolate, and then correct unusual situations in a network. A failure occurs when a system (e.g., an IoT network) cannot adequately provide a service, where a fault is the root cause of the failure. Fault management, especially fault detection, plays an essential role in today's network management (e.g., QoS provisioning). Hence, many works have been conducted to improve the fault management process. In [54], Huang et al. Moreover, they compare the performance of the proposed method with other well-known techniques, e.g., MLP, CNN, and probabilistic neural networks.
• Website fingerprinting: This refers to methods for identifying and collecting data about websites visited by a mobile device, which is essential for the advertising industry, for identifying the characteristics of attacks (e.g., botnets and sniffing), and for protecting users' privacy.
The authors in [63] categorize user activities of the WeChat application by performing a detailed analysis of its encryption protocol, called MMTLS, to find the typical user activities of the application (e.g., advertisement clicks and browsing Moments). Then, they adopt different learning algorithms, such as Naive Bayes, Random Forest, and Logistic Regression, to classify these activities.
The authors in [65] use learning techniques to analyze encrypted mobile traffic, in order to deal with the diversity of app releases, mobile operating systems, and device models, and to identify user actions. The work in [62] focuses on the identification of Instagram user behavior. Unlike previous works that used the statistical features of encrypted traffic, this work provides a new technique based on maximum entropy to obtain more stable traffic features. Classical ML algorithms (i.e., Support Vector Machine (SVM), Random Forest, k-nearest neighbors, and Naive Bayes) and DL algorithms (i.e., MLP and LSTM) are employed for classification purposes. Moreover, the authors propose to use the underlying TCP variant as a practical feature for improving classification accuracy. The authors in [65] compare the performance of ML-based techniques, such as k-nearest neighbors and Decision Tree, with a traditional commercial rule-based strategy for operating system fingerprinting. The simulation results demonstrate the superiority of the learning-based techniques over the traditional method. Lastovicka et al. in
Due to their need for a huge amount of new data samples, most well-known ML techniques become useless in NTC, as the network cannot be left unattended for a long time for retraining purposes. AL is able to (re-)train the models very quickly with high accuracy through continuous provisioning of new labeled instances. This is demonstrated in Section VI, where AL performance is evaluated with regard to the training time.
• Concept Drift: Due to the high dynamicity of computer networks, ML techniques must be re-trained frequently for various reasons, e.g., new network behaviors and new classes of network traffic [75]. In most ML techniques, such as DL, retraining a model from scratch is a resource-intensive task in terms of time and computational power, in addition to the need for new data samples.
• Addressing network theory: In Internet Engineering
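One common way to keep retraining cheap under concept drift is to monitor windowed accuracy and trigger a small AL retraining round only when performance degrades. The sketch below is an illustrative policy of this kind; the window size, accuracy bound, and trigger logic are our assumptions, not part of any surveyed method.

```python
from collections import deque

def drift_monitor(stream, predict, true_label, window=50, min_acc=0.8):
    """Flag concept drift when accuracy over a sliding window drops below
    min_acc. Yields the stream positions at which an AL retraining round
    (querying a small budget of fresh labels) should be started."""
    recent = deque(maxlen=window)
    for i, x in enumerate(stream):
        recent.append(predict(x) == true_label(x))
        if len(recent) == window and sum(recent) / window < min_acc:
            yield i            # trigger an AL round here
            recent.clear()     # reset the window after retraining
```

Because AL only queries the most informative new instances, each triggered round needs a handful of labels rather than a full relabeled dataset, which is what makes frequent retraining viable in a live network.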

Table III: The results on the TRAbID dataset.

Table V: The results on the Tor No-Tor dataset.
can be useful in the case of data scarcity, as the availability of more data can improve training in ML techniques. On the contrary, outlier removal is more efficient in terms of speed. Therefore, based on the target quality requirements, such as resource consumption or delay sensitivity, we can choose the appropriate solution for dealing with outliers.
• Drifting distribution: Most existing AL algorithms work