A Complete Review on the Application of Statistical Methods for Evaluating Internet Traffic Usage

Internet traffic classification aims to identify the kind of Internet traffic. With the rise of traffic encryption and multi-layer data encapsulation, some classic classification methods have lost their strength. In an attempt to increase classification performance, Machine Learning (ML) strategies have gained the interest of the scientific community and have shown promise for the future of traffic classification, mainly in the recognition of encrypted traffic. However, some of these methods have a high computational resource consumption, which makes them unfeasible for classifying large traffic flows or for real-time classification. Methods using statistical analysis have been used to classify real-time traffic or large traffic flows, where the main objective is to find statistical differences among flows or to find a pattern in traffic characteristics through statistical properties that allow traffic classification. The purpose of this work is to address statistical methods for classifying Internet traffic that have been little explored or unexplored in the literature. This work does not focus on statistical methodology in general; it focuses on statistical tools applied to Internet traffic classification. Thus, we provide an overview of statistical distances and divergences previously used, or with potential to be used, in the classification of Internet traffic. Then, we review previous works on Internet traffic classification using statistical methods, namely Euclidean, Bhattacharyya, and Hellinger distances, Jensen-Shannon and Kullback–Leibler (KL) divergences, Support Vector Machines (SVM), Correlation Information (Pearson Correlation), Kolmogorov-Smirnov and Chi-Square tests, and Entropy. We also discuss some open issues and future research directions on Internet traffic classification using statistical methods.

Different techniques are used to deduce the application protocol and correlate traffic properties, such as Machine Learning (ML) algorithms, sets of heuristics, or statistical measures [7]. For example, according to Liu [17], many researchers use ML to perform statistic-based classification. Statistical classification methods can be divided into two categories: parametric and non-parametric methods [18].
We propose and use the taxonomy of classification methods shown in Figure 1. We address statistical-based methods covering both parametric and non-parametric methods. The category of parametric methods includes linear Support Vector Machines (SVM) [19], Euclidean Distance [20], Pearson correlation [21] and Jensen-Shannon Divergence [22]. The category of non-parametric methods includes non-linear SVM [19], Bhattacharyya Distance [23], Hellinger Distance [24], Kullback-Leibler (KL) Divergence [25], Wootters Distance [26], and the Kolmogorov-Smirnov (KS) [27] and Chi-square [27] tests. Classifiers based on parametric methods assume a statistical probability distribution for each class. Non-parametric classifiers, in turn, are used to estimate the statistical probability distribution, or in cases in which the density function is unknown [18].
In this work, we review the classification of Internet traffic based on statistical methods, including classification methods applied "in the dark", observing the main objectives of each survey. We also describe the statistical methods and distances proposed in the literature, both for classification in general and for traffic classification specifically. Specifying the research method is a crucial step in literature reviews [46]. Our study was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology [47], [48].
According to [48], the PRISMA methodology offers an evidence-based collection of items that can be used as a basis for review work. In addition, PRISMA provides a flowchart that allows us to visualize the search strategies and the eligibility of the articles. The PRISMA flowchart describes the flow of information through the different review phases. To present and detail our selection process, we prepared the flowchart shown in Figure 2. The flowchart has three phases: identification, screening, and inclusion. Through the flowchart, we mapped the number of articles identified, included, and excluded, and the reasons for exclusion.
The articles reviewed in this paper were chosen from the 507 most relevant articles found in a search of IEEE Xplore, Elsevier, ACM Digital Library, Google Scholar, and Scopus with the keywords: Internet Traffic Classification, Traffic Classification, Traffic Identification, Encrypted Network Traffic, Network Monitoring, and Statistical Distributions. We also searched for papers with the keywords: Statistical Methods, Statistical Distances, Statistical Analysis, Parametric, and Non-Parametric, in the period from 2011 to 2021. In total, 145 articles were reviewed for this work.
Our inclusion criteria were full papers published in journals, articles written in English, and articles that address features used to classify Internet network traffic. Our exclusion criteria were duplicate articles, whitepapers, articles shorter than two pages, studies that did not address differences between parametric and non-parametric methods, articles that did not address features used to classify Internet network traffic, articles that did not address proposed statistical method solutions, articles written in languages other than English, and studies without full text available. After the keyword search, articles unrelated to the topic were excluded by reading the title and abstract. The articles that remained after applying the inclusion and exclusion criteria were selected for full reading.
In the initial phase, 2129 articles were identified, of which 1622 were excluded because they were duplicates, whitepapers, studies without full text available, or articles written in languages other than English, leaving 507 pre-selected articles. In the screening phase, the abstracts and titles of the articles were read and those unrelated to the topic were excluded, totaling 200 excluded and 307 eligible. Of the 307 articles, 76 were eventually removed as they were too short, with only two pages. Finally, 231 articles were fully read, of which 86 were excluded for not addressing proposed statistical method solutions or differences between parametric and non-parametric methods, leaving 145 that met our eligibility criteria and were included in our study. To present the results of our selection following our eligibility criteria, we generated a chart of the number of articles per year, shown in Figure 3.
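The counts reported for each selection phase can be checked with simple arithmetic. The following is a minimal sketch that only restates the numbers given above; the variable names are illustrative and not part of the PRISMA methodology.

```python
# Article counts taken from the selection process described in the text.
identified = 2129
excluded_initial = 1622    # duplicates, whitepapers, no full text, non-English
excluded_screening = 200   # unrelated to the topic after reading title and abstract
excluded_too_short = 76    # articles with only two pages
excluded_full_read = 86    # no statistical method solutions or parametric/non-parametric discussion

pre_selected = identified - excluded_initial
eligible = pre_selected - excluded_screening
fully_read = eligible - excluded_too_short
included = fully_read - excluded_full_read

print(pre_selected, eligible, fully_read, included)  # 507 307 231 145
```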
This work is structured as follows. Section II reviews the classification process. Section III gives an overview of SVM and statistical methods, focusing on distances and divergences. Section IV presents several methods that classify Internet traffic using statistical methods. Section V provides a discussion and a list of open issues. Section VI concludes the review.

II. PROCESS OF TRAFFIC CLASSIFICATION: OVERVIEW
A. CLASSIFICATION PROCEDURES
This section provides an overview of the traffic classification process, covering Internet traffic categorization, datasets, features, classification approaches, and validation. Collecting data from a network is a critical step, since the collected data serves as input to form a pool of network traffic. Extracting and selecting features is a vital process, as it can impact the efficiency and effectiveness of classification. The approach chosen for traffic identification is essential to the success of the classification, as are the criteria used to evaluate classification performance [3]. Figure 4 shows the classification procedures. Two topics are addressed next: 1) Internet traffic applications, and 2) Dataset.

1) INTERNET TRAFFIC APPLICATIONS
Internet traffic describes the quantity of information or data carried across the Web and by different applications; it can be considered a data flow on the Internet [2]. According to [49], Internet traffic is categorized and described as shown in Table 2. Internet traffic is grouped to form a dataset.
Internet traffic is divided into ten categories: Administration, Communications, Gaming, Filesharing, Marketplaces, Social Networking, Real-Time Entertainment, Storage, Tunneling, and Web Browsing. Each category has a description that characterizes the associated traffic. The Administration category covers services and applications used to administer the network, such as the SNMP and ICMP protocols. Gaming includes PC and console gaming traffic, console download traffic, and game updates, such as Xbox Live and PlayStation traffic. Filesharing includes applications that use peer-to-peer or other distribution protocol models, such as Gnutella, eDonkey, BitTorrent, Newsgroups, and Ares. Web Browsing includes specific websites and Web protocols, such as WAP browsing and HTTP.

2) DATASET
In classification, the dataset is of great importance for evaluating and comparing the performance of different methods. A dataset must contain many diverse samples of each class. Because a model can fit itself to a specific dataset, there is a concern that the dataset may exhibit deterministic behavior. That can happen when a model adjusts itself too closely to a specific type of traffic or to a dataset, either because the traffic reflects no interaction with a group of users or because it reflects interaction with only a small group of users [2]. To ensure diversity, traffic is usually observed at the ISP core: the farther from the destination the traffic is captured, the lower the probability of deterministic behavior [2].
According to [50] and [42], a dataset is collected and used as input for training and classification in an ML classifier. In statistical-based methods, statistical features can be extracted from flows. These features are characteristics or properties of flows calculated over many packets [50]. Different works normally use different features and datasets for classification.
As stated in [45], a pre-processing phase follows data collection to extract the features that will be included in the model. For traffic classification, network flows must be evaluated through their main characteristics (packet inter-arrival time and packet size) and the statistical values of those characteristics (standard deviation, quartiles, minimum, and maximum). A flow is defined as a set of packets that share the same connection parameters: source and destination IP addresses, port numbers, and transport protocol.
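The flow definition and the per-flow statistics above can be illustrated with a short sketch. It assumes a hypothetical packet record layout (source IP, destination IP, source port, destination port, protocol, timestamp in seconds, size in bytes) and treats each direction as a separate flow; the function names `group_flows` and `flow_features` are illustrative only.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical packet records:
# (src_ip, dst_ip, src_port, dst_port, protocol, timestamp_s, size_bytes)
packets = [
    ("10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 0.00, 100),
    ("10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 0.10, 300),
    ("10.0.0.1", "10.0.0.2", 50000, 443, "TCP", 0.40, 200),
    ("10.0.0.3", "10.0.0.2", 50001, 53, "UDP", 0.05, 80),
]

def group_flows(packets):
    """Group packets that share the same five connection parameters."""
    flows = defaultdict(list)
    for src, dst, sport, dport, proto, ts, size in packets:
        flows[(src, dst, sport, dport, proto)].append((ts, size))
    return flows

def flow_features(records):
    """Compute simple packet-size and inter-arrival statistics for one flow."""
    records = sorted(records)  # order by timestamp
    times = [t for t, _ in records]
    sizes = [s for _, s in records]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "packets": len(sizes),
        "size_min": min(sizes),
        "size_max": max(sizes),
        "size_mean": mean(sizes),
        "size_std": pstdev(sizes),
        "iat_mean": mean(gaps) if gaps else 0.0,
    }

flows = group_flows(packets)
features = {key: flow_features(records) for key, records in flows.items()}
```

Each entry of `features` is a candidate input vector for a statistical or ML classifier; other statistics, such as quartiles, could be added in the same way.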
As described in [51], another way of representing Internet traffic is through time series, in which network flows are represented by time series of the packets or bytes communicated. For each flow, three time series are generated: (1) bytes carried by incoming packets, (2) bytes carried by outgoing packets, and (3) bytes carried by both incoming and outgoing packets. A short description of feature selection follows.
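A minimal sketch of the three time series is given below. The text does not state how the series are sampled, so aggregating bytes into fixed-width time bins is an assumption here, as are the record layout and the function name `byte_time_series`.

```python
def byte_time_series(packets, bin_s, n_bins):
    """Build three byte-count series (incoming, outgoing, both) for one flow.

    packets: list of (timestamp_s, direction, size_bytes), direction 'in' or 'out'.
    bin_s:   width of each time bin in seconds (an assumption of this sketch).
    """
    series = {"in": [0] * n_bins, "out": [0] * n_bins, "both": [0] * n_bins}
    for ts, direction, size in packets:
        idx = int(ts // bin_s)
        if 0 <= idx < n_bins:  # ignore packets outside the observed window
            series[direction][idx] += size
            series["both"][idx] += size
    return series

# Hypothetical flow: four packets over three seconds.
flow = [(0.2, "out", 120), (0.7, "in", 1400), (1.1, "in", 1400), (2.5, "out", 60)]
ts = byte_time_series(flow, bin_s=1.0, n_bins=3)
print(ts["in"])    # [1400, 1400, 0]
print(ts["out"])   # [120, 0, 60]
print(ts["both"])  # [1520, 1400, 60]
```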

B. FEATURE SELECTION
Features are central to the methods and approaches used to characterize and classify traffic. Feature selection is an important step in Internet traffic classification. The author of [52] defines it as the process of selecting the smallest set of features needed to reach an accurate classification. Different application categories can be distinguished when the selected features reveal discrepancies in traffic behavior. However, [40] claims that researchers have chosen one or a few features from a set of characteristics to classify different traffic flows based only on a qualitative analysis of the features. For analysis purposes, according to [36], encrypted traffic must be separated into flows based on the five-tuple: transport protocol (User Datagram Protocol (UDP) or TCP), source and destination IP addresses, and source and destination port numbers. The following features are discussed next: 1) Packet-Length Based features, 2) Packet-Ordering Based features, and 3) Packet-Timing Based features.

1) PACKET-LENGTH BASED FEATURES
Packet length is a feature of every network packet and, according to [36] and [53], it is a commonly used type of feature that has proven effective in analyzing encrypted traffic. Packet-length based features include the packet length itself, the cumulative length sequence, and statistics that can be drawn from the flow, such as the minimum, maximum, average, median, variance, standard deviation, relative frequency, kurtosis, and skewness of the packet length.
VOLUME 10, 2022
When obtaining the packet length, the length sequence of the first X packets in each flow can be used as a key feature. The lengths of those X packets can vary considerably from one website to another because of differences in content and protocol parameters, such as those exchanged in the handshake process of Transport Layer Security (TLS)/SSL. Lengths of distinctive packets can therefore be used to distinguish different traffic types.
In a flow, the packet lengths are distributed over intervals that depend on the transport layer and on the Maximum Transmission Unit (MTU). To obtain statistical characteristics of packet length in a flow, the packet lengths can be aggregated into a fixed number of buckets (bins). To obtain the cumulative length sequence while taking the flow direction into account, the length of up-link packets can be defined as negative and the length of down-link packets as positive. The signed lengths are then accumulated to obtain the sequence of the first X cumulative packet lengths. When direction is not distinguished in a bidirectional flow, the packet length can be defined as positive for both up-link and down-link packets. The cumulative length sequence of the first X packets of a flow appears to be a differentiating feature.
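The bucket aggregation and the signed cumulative length sequence can be sketched as follows. The bucket width, the 1500-byte maximum length (a typical Ethernet MTU), and the function names are assumptions made for illustration.

```python
from itertools import accumulate

def cumulative_length_sequence(packets, x):
    """Cumulative lengths of the first x packets.

    packets: list of (direction, length); 'up' lengths count as negative,
    'down' lengths as positive, following the convention in the text.
    """
    signed = [-length if direction == "up" else length
              for direction, length in packets[:x]]
    return list(accumulate(signed))

def length_histogram(lengths, bucket_size, max_length=1500):
    """Aggregate packet lengths into a fixed number of buckets."""
    n_buckets = max_length // bucket_size + 1
    counts = [0] * n_buckets
    for length in lengths:
        # Lengths above max_length fall into the last bucket.
        counts[min(length // bucket_size, n_buckets - 1)] += 1
    return counts

# Hypothetical flow of five packets.
flow = [("up", 200), ("down", 1500), ("down", 1500), ("up", 100), ("down", 400)]
print(cumulative_length_sequence(flow, 4))             # [-200, 1300, 2800, 2700]
print(length_histogram([l for _, l in flow], 500))     # [3, 0, 0, 2]
```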

2) PACKET-ORDERING BASED FEATURES
In some cases, packet lengths are similar or even identical across different encrypted traffic flows. This makes approaches based only on packet length less effective, because the information they use does not discriminate between such flows. In these cases, counters or other techniques based on packet counting can be useful [36].
Several packet counts can be considered, such as the number of up-link and down-link packets in every X packets. We can also count the number of packets that precede each up-link packet. In addition, we can extract a feature that indicates the number of down-link packets between two up-link packets.
According to the literature, burst counting can also be useful. Bursts of up-link packets are one example, since down-link packets are exposed to network delays. For this purpose, the number, maximum, and average of bursts are considered for each flow.
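The counts described above can be sketched as follows, assuming a flow is reduced to its sequence of packet directions and that a burst is a run of consecutive up-link packets. The function name and the returned dictionary keys are illustrative.

```python
from itertools import groupby

def ordering_features(directions, x):
    """Packet-ordering features over the first x packets of a flow.

    directions: list of 'up' / 'down' labels, one per packet, in arrival order.
    """
    first = directions[:x]
    # Lengths of runs of consecutive up-link packets (bursts).
    bursts = [len(list(run)) for d, run in groupby(first) if d == "up"]
    # Number of down-link packets between each pair of consecutive up-link packets.
    up_positions = [i for i, d in enumerate(first) if d == "up"]
    downs_between = [b - a - 1 for a, b in zip(up_positions, up_positions[1:])]
    return {
        "up_count": first.count("up"),
        "down_count": first.count("down"),
        "burst_count": len(bursts),
        "burst_max": max(bursts, default=0),
        "burst_mean": sum(bursts) / len(bursts) if bursts else 0.0,
        "downs_between_ups": downs_between,
    }

# Hypothetical direction sequence of a ten-packet flow.
directions = ["up", "up", "down", "down", "down", "up", "down", "up", "up", "up"]
feats = ordering_features(directions, 10)
print(feats["burst_count"], feats["burst_max"], feats["burst_mean"])  # 3 3 2.0
```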

3) PACKET-TIMING BASED FEATURES
Several pieces of information about packet timestamps can be used to characterize and classify traffic [36], [40]. Inter-Packet Delay is one example. When packets are sent through the network, they receive a timestamp with the date and time. The Inter-Packet Delay is defined as the difference between the timestamps of consecutive packets.
To determine the period of time in which a transmission is concentrated, the number of packets in a time interval is calculated for every series of packets. Timing characteristics generally have limitations, as time distributions are most often assumed to be equal when in reality they are not, and packet timestamps are subject to network fluctuations. Timing features can be combined with control packets, such as ACKs and CTSs, used as reference points. Table 3 presents a summary of packet-based features.
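Both timing features can be sketched in a few lines. The interval width and the function names below are assumptions for illustration; timestamps are taken to be in seconds.

```python
def inter_packet_delays(timestamps):
    """Differences between the timestamps of consecutive packets."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def packets_per_interval(timestamps, interval_s):
    """Number of packets falling in each time interval of width interval_s."""
    counts = {}
    for ts in timestamps:
        idx = int(ts // interval_s)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

# Hypothetical packet timestamps of one flow.
timestamps = [0.0, 0.5, 1.5, 1.75, 4.0]
print(inter_packet_delays(timestamps))        # [0.5, 1.0, 0.25, 2.25]
print(packets_per_interval(timestamps, 1.0))  # {0: 2, 1: 2, 4: 1}
```

In this example the transmission is concentrated in the first two seconds, which the per-interval counts make visible.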
Although several features can be chosen to classify Internet traffic, they do not all have the same level of importance. To capture this, each selected feature can receive a weight that represents its importance. To select only the important sets of features, the author in [40] discusses three methods, the Wrapper method, the Filter method, and the Embedded method, which are briefly described below:
• Wrapper method - Uses machine learning algorithms to rate the performance of different feature subsets and so aid learning. The results are not specific to the ML algorithms used. For this process, Genetic Algorithm (GA), Sequential Forward Selection, Simulated Annealing, Sequential Backward Selection, and Randomized Hill Climbing are used.
• Filter method - Makes an independent evaluation based on data characteristics and relies on specific metrics to rate and select the best subset before learning begins. For this process, the Correlation based Feature Selection (CFS) algorithm is normally used, along with Fast Correlation Based Feature Selection (FCFS) and the Markov Blanket Filter method.
• Embedded method - Performs variable selection as part of the learning procedure and is usually specific to a given learning machine. For this process, decision trees, Naive Bayes, random forests, Support Vector Machines (SVM), and methods based on regularization techniques are normally used.
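To make the idea of a filter method concrete, the sketch below ranks features by the absolute Pearson correlation between each feature and the class label, before any learning takes place. This is a deliberately simple correlation ranking, not the CFS algorithm itself, and the feature names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / sqrt(vx * vy)

def rank_features(features, labels):
    """Rank features by absolute correlation with the class label."""
    scores = {name: abs(pearson(values, labels))
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-flow features and binary class labels.
features = {
    "mean_packet_size": [100, 120, 1400, 1350],
    "flow_duration": [5.0, 1.0, 4.0, 2.0],
}
labels = [0, 0, 1, 1]
ranking = rank_features(features, labels)
print(ranking[0][0])  # mean_packet_size
```

The resulting scores could also serve as the per-feature weights mentioned above.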

C. CLASSIFICATION APPROACHES AND VALIDATION
In the literature, we find four kinds of approaches to Internet traffic classification: port-based approaches, payload-based approaches, ML-based and statistical approaches. We provide in Table 4 a comparison among these kinds of approaches, which we briefly describe in the following.

1) PORT-BASED APPROACHES
The oldest traffic classification method is the port-based approach. According to [11], this method matches the port numbers in the TCP/UDP header against the well-known TCP/UDP port numbers assigned by IANA. It relies on each application being associated with a specific port number [36]; for example, SSH traffic uses port 22 and SMTP uses port 25. Most applications use "known" port numbers so that other hosts can start communication.
During the handshake, a traffic identifier placed in the communication channel, in the middle of the network, watches for SYN packets. SYN packets carry the destination port number and are used during the TCP handshake. The application is recognized by the port number contained in the SYN packet. This is possible because TCP is connection-oriented. Traffic identification through port numbers is also used for UDP, even though this protocol has no connection control packets.
Implementing this method is simple and quick, since it involves no calculations and requires only the port numbers to identify the application. Despite its easy implementation, this approach has limitations that have a strong negative impact on traffic classification. Protocols that use tunnels, random ports, or Network Address Port Translation (NAPT) cannot be identified by this approach [8]. One easy way to escape detection by this method is to use port 80, which is generally open for HTTP traffic.
Other protocols that cannot be identified by port-based approaches include Internet telephony, which uses the Session Initiation Protocol (SIP) with the Real-Time Transport Protocol (RTP) and sometimes uses random port numbers, and P2P protocols, which use random ports or ports associated with other protocols in order to mask their traffic [8].
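A port-based classifier reduces to a table lookup, which also makes its limitations easy to see. The sketch below uses only the port assignments mentioned in the text; the table and function name are illustrative.

```python
# Port-to-application table limited to the examples given in the text.
WELL_KNOWN_PORTS = {22: "SSH", 25: "SMTP", 80: "HTTP"}

def classify_by_port(src_port, dst_port):
    """Return the application associated with a well-known port, if any."""
    for port in (dst_port, src_port):  # the server side is usually the destination
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"

print(classify_by_port(51514, 22))     # SSH
print(classify_by_port(51514, 45678))  # unknown: random ports cannot be identified
print(classify_by_port(51514, 80))     # HTTP, even if another protocol hides on port 80
```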

2) PAYLOAD-BASED APPROACHES
This kind of approach recognizes applications by analyzing the packet payload. The payload is analyzed bit by bit to find pre-defined byte sequences that are characteristic of each application. Once those sequences, called signatures, are found, they are stored and compared against application packets for classification [36]. The main advantages of these methods are their low false-negative rates and highly accurate traffic classification. Their main limitations are the following. A database of application signatures must be developed and maintained. Developing it consumes considerable computational resources, requiring longer processing time and more storage space. The method is ineffective for identifying and classifying traffic whose payloads are encrypted or unavailable, and for recognizing applications that have not yet been mapped. In addition, inspecting packets and traffic raises legal and privacy issues [54], [55].
Approaches based on statistical characteristics have been developed for traffic classification to overcome the limitations of these traditional approaches, and they have caught the attention of researchers. Neural networks and machine learning algorithms have also been used to identify and classify traffic.
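Signature matching can be sketched as a search for known byte sequences at the start of the payload. The signature table below is a small illustrative assumption (the SSH identification string, common HTTP request and response prefixes, and the BitTorrent handshake header), not a real signature database, and matching only payload prefixes is a simplification.

```python
# Illustrative signature table: application -> byte prefixes seen at payload start.
SIGNATURES = {
    "SSH": [b"SSH-"],
    "HTTP": [b"GET ", b"POST ", b"HTTP/1."],
    "BitTorrent": [b"\x13BitTorrent protocol"],
}

def classify_by_payload(payload):
    """Return the first application whose signature matches the payload prefix."""
    for app, prefixes in SIGNATURES.items():
        if any(payload.startswith(prefix) for prefix in prefixes):
            return app
    return "unknown"

print(classify_by_payload(b"SSH-2.0-OpenSSH_8.9\r\n"))        # SSH
print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))   # HTTP
print(classify_by_payload(b"\x17\x03\x03\x00\x45\x8a\x1c"))   # unknown
```

The last call shows the limitation noted above: a payload with no mapped signature, such as an encrypted record, cannot be classified.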

3) ML-BASED APPROACHES
Machine Learning is known for giving computers the capacity to learn without being explicitly programmed. It has been used to prepare machines to work with data more efficiently. Machine learning is divided into unsupervised and supervised learning. In unsupervised learning, information is extracted from unlabeled data. In supervised learning, on the other hand, the information necessarily depends on labeled data. Machine Learning uses data patterns to label things [35], [42], [56].
ML can work with and learn from large volumes of data by using specific algorithms. Tasks such as prediction, regression, and classification over massive quantities of data can be solved with it. Machine Learning can also deal with long and wide data. Long data means that the number of subjects exceeds the number of input variables. Wide data means that the number of input variables exceeds the number of subjects [35], [57], [58].
As pointed out in [56], ML offers different, specific algorithms to solve problems involving data. Choosing the best algorithm depends on what the problem is, which model best suits it, and the number of variables involved.
Besides Internet traffic classification, ML has also been used in network operations and management to optimize resources and improve system performance. In addition, ML can be applied to many other areas, such as marketing, games, digital images, intrusion and malware detection, information security, and data privacy.

4) STATISTICAL APPROACHES
Statistical-based classification uses statistics from the network and transport layers. By using parameters that are independent of the payload and of payload analysis, statistical-based classification methods circumvent the problems of payload inspection, encrypted payloads, and user privacy. They use statistical properties that are unique to protocols, flows, and applications, which helps to differentiate the applications [36], [40].
Examples of valid parameters for statistical-based network classification include packet inter-arrival time, flow duration, and packet size, among others [36], [40]. Besides those parameters, statistical characteristics obtained by tracking packets are captured and used, such as Border Gateway Protocol (BGP) updates and an unexpected rise in packet rate, which can also indicate P2P applications in the network.
Commonly, Machine Learning uses statistical-based strategies to calculate the feature parameters that serve as input data for supervised classification methods, such as SVM [36].
As stated in [59], techniques based on statistical classification are capable of recognizing expected flow behaviors through observation. Statistical methods combined with rule-based methods may offer scalability, adaptability, flexibility, and robustness. Furthermore, statistical measurements can be used to differentiate traffic that exhibits flaws from regular traffic. However, the manual selection of statistical features can compromise the requirements of traffic classification, resulting in lower accuracy.

D. VALIDATION
The validation process consists of testing the results obtained from the classification in order to assess its performance. The classification results are compared with previously known, manually labeled classification results for real data, usually known as ground truth, which makes it possible to compute true positive, false positive, true negative, and false negative rates. A challenge during validation is collecting original data in real time to obtain the ground truth [60].
Many performance measures are used to evaluate if a classification method could achieve the expected performance. Table 5 represents an overview of the metrics used to evaluate traffic classifiers. Metrics widely used are: F-measure [61], Precision, Recall, Specificity, Area Under Curve (AUC), Completeness [52], and F-1 Score [62].
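As an illustration, several of these metrics can be computed directly from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts obtained against the ground truth. The sketch below uses the standard textbook definitions, which are assumed to match those in Table 5; the counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, specificity, and F1 score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Hypothetical counts for one traffic class.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print(metrics["precision"])  # 0.8
```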

III. STATISTICAL METHODS
In this section, we address the concept and properties of the statistical distances and divergence, as well as the SVM method based on statistics and widely used in traffic classification. Table 8 presents an overview of distances and divergences for quantitative (non-negative) data. We group the methods according to parametric and non-parametric approaches.

A. OVERVIEW OF PARAMETRIC AND NON-PARAMETRIC MODELS
In parametric models, the data are assumed to be generated by a probability distribution that has a fixed number, or a fixed set, of parameters, which apply only to the variables. Statistical and learning models that use a fixed number of parameters are therefore considered parametric models. In parametric ML, the number of parameters is fixed regardless of the amount of training data. Examples of parametric models are linear SVM, Pearson correlation (also called correlation information), and Euclidean Distance [18], [63], [64], [65].
Non-parametric models represent data without a defined number of parameters and, when modeling the data, make no assumptions about the probability distribution. Models implemented with this approach do not assume a specific mapping function between input and output data. This kind of model assumes that the parameters are not only adjustable but can also be altered. A non-parametric model also assumes that the larger the quantity of training data, the larger the number of parameters. As a result, a non-parametric model can take longer to train [18], [65], [66]. Table 6 presents a comparison between parametric and non-parametric models. Bhattacharyya Distance, Hellinger Distance, KL Divergence, Wootters Distance, the KS and Chi-square tests, and non-linear SVM are examples of non-parametric models. The following subtopics are addressed next: 1) Statistical Distances and 2) Statistical Divergences.
This section focuses on statistical distances and divergences. A brief description of these kinds of methods follows. Details about the other methods mentioned herein that are not statistical distances or divergences can be found elsewhere: Correlation Information (Pearson correlation) is covered in [67] and [68], the Kolmogorov-Smirnov and Chi-Square tests in [67] and [69], and Shannon entropy in [70] and [71].

1) STATISTICAL DISTANCES
The concept of distance between objects or individuals allows us to interpret many classical techniques of multivariate analysis geometrically, which is equivalent to representing these objects as points in a metric space. In the classification of network traffic [72], the main objective is to find statistical differences between flows, or a pattern in traffic characteristics, through statistical properties. This interpretation is possible because the observed variables are considered to belong to a more general category, not only quantitative variables. It therefore makes sense to calculate the proximity between objects or individuals [72], [73].
As stated in [72], distance calculation is vital to many statistical inferences, whether theoretical or applied. It has also become essential for solving data processing problems such as classification, estimation, detection, regression, model selection, diagnosis, identification, recognition, indexing, and compression. Combining its properties with statistical distance concepts yields an essential instrument for science and data analysis [74].
Through distance computation, it is possible to construct hypothesis tests, study the properties of estimators, and compare classes, objects, and individuals. Furthermore, distance helps the researcher interpret the data, because it is a very intuitive concept that allows easy comprehension and a harmonious representation [74], [75].
In general, we consider two classes of statistical distances: between individuals and between populations. The individuals of each population are characterized by a random vector $X = (X_1, \ldots, X_p)$, which follows a probability distribution $f(x_1, \ldots, x_p; \theta)$. The distance between two individuals $i$, $j$, characterized by the points $x_i$, $x_j$ of $\mathbb{R}^p$, is a non-negative symmetric measure $\delta(x_i, x_j)$, which will depend on $\theta$, where $\theta$ represents the parameters and $p$ is the number of dimensions of $X$. Therefore, $X$ has $n$ observations and $p$ variables.
Moreover, the distance between two populations is measured by the divergence $\delta(\theta_1, \theta_2)$ between the parameters that characterize them. It may also be convenient to introduce the distance $\delta(x_i, \theta)$ between an individual $i$ and the parameters $\theta$. Non-parametric distances can be defined as divergence functionals of the density functions. In some cases they are related to entropy measures.
A distance $\delta$ over a set $\Omega$ is a mapping from $\Omega \times \Omega$ to $\mathbb{R}$ such that each pair $(i, j)$ corresponds to a real number $\delta(i, j) = \delta_{ij}$ fulfilling some of the properties listed in Table 7.
A distance must fulfill at least properties 1, 2, and 3 presented in Table 7. When it fulfills these properties, it is called a dissimilarity. In general, $\delta$ only approximately meets some of the stated properties. It is then a matter of representing $(\Omega, \delta)$ through a model $(V, d)$, approximating $\delta$ by $d$, where $d$ meets the properties that are required. Depending on the representation technique, such as principal component analysis, principal coordinate analysis, proximity analysis, correspondence analysis, or cluster analysis, the distance $d$ can be Euclidean, ultrametric, additive, non-Euclidean, or Riemannian, among others.

2) STATISTICAL DIVERGENCES
Non-parametric measures of divergence between probability distributions are defined as functional expressions, often related to information theory, which measure the degree of discrepancy between any two distributions, not necessarily belonging to the same parametric family. Divergences have applications in statistical inference and in stochastic processes.
Let $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ be two multinomial distributions. The divergence between $q$ and $p$ can be measured as the discrepancy between the quotient $x_i = q_i/p_i$ and 1. Based on the meaning of the $H_\varphi(p)$-entropy, the $\varphi$-divergence of Csiszar is defined between $p$ and $q$, where $\varphi$ is a strictly convex function with $\varphi(1) = 0$, and by Jensen's inequality we have:

$$C_\varphi(p, q) = \sum_{i=1}^{n} p_i \, \varphi\!\left(\frac{q_i}{p_i}\right) \geq \varphi\!\left(\sum_{i=1}^{n} p_i \frac{q_i}{p_i}\right) = \varphi(1) = 0 \qquad (1)$$

Equation (1) reaches the value 0 if and only if $p = q$. It can be taken as a measure of dissimilarity between $p$ and $q$, but in general it is not a distance, as it is not always symmetric or, if it is, it may not satisfy the triangle inequality. Shannon entropy and the $\varphi$-divergence of Csiszar form the information measure known as the KL [25].
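As a hedged illustration of how a $\varphi$-divergence relates to KL, the sketch below evaluates $\sum_i p_i \, \varphi(q_i/p_i)$ for a given convex function $\varphi$. Choosing $\varphi(x) = -\log x$, which is strictly convex with $\varphi(1) = 0$, yields the KL divergence of $p$ from $q$. The function names are illustrative, and all probabilities are assumed to be strictly positive.

```python
from math import log

def csiszar_divergence(p, q, phi):
    """Sum of p_i * phi(q_i / p_i); assumes every p_i and q_i is strictly positive."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * phi(qi / pi) for pi, qi in zip(p, q))

def kl_phi(x):
    """Convex generator phi(x) = -log(x), with phi(1) = 0, which gives KL."""
    return -log(x)

# Hypothetical distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]
kl_pq = csiszar_divergence(p, q, kl_phi)   # sum of p_i * log(p_i / q_i)
kl_qp = csiszar_divergence(q, p, kl_phi)   # differs from kl_pq: not symmetric
kl_pp = csiszar_divergence(p, p, kl_phi)   # zero when both distributions coincide
```

The example also shows why such a divergence is not a distance in general: swapping the arguments changes the value.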

B. PARAMETRIC DISTANCES AND DIVERGENCES
1) EUCLIDEAN DISTANCE
The most familiar distance between two individuals $i$, $j$ is the Euclidean distance. Named after the Greek mathematician Euclid, it is the distance between two points in Euclidean space and is described by the equation [25]:

$$D_E[i, j] = \sqrt{\sum_{k=1}^{p} (x_{ik} - y_{jk})^2}$$

where $D_E[i, j]$ is the distance function, $p$ is the number of components (variables) of each point, $k$ is the summation index, which starts at the first component, $x_{ik}$ is the $k$-th component of the first point, and $y_{jk}$ is the $k$-th component of the second point [76].
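A minimal sketch of the Euclidean distance between two feature vectors follows. The vectors are hypothetical; in practice they could hold per-flow features such as mean packet size and mean inter-arrival time.

```python
from math import sqrt

def euclidean_distance(x, y):
    """Square root of the sum of squared component differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of components")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Python's standard library offers the same computation as `math.dist` (Python 3.8 and later).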

2) JENSEN-SHANNON DIVERGENCE
Jensen-Shannon Divergence (JSD) measures the difference between two probability distributions [77]. It is known as the bounded symmetrization of KL [78].
JSD is a function that allows us to quantify the difference between two, or possibly more, probability distributions [22]. JSD also has the additional advantage of not requiring absolute continuity of the distributions to compare them. Thereby, JSD can be used to compare the distributions of different packet sequences in a network flow, by associating each flow with a probability distribution built from frequencies of appearance.
For two discrete probability distributions $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ with $p_i \geq 0$, $q_i \geq 0$, JSD is represented by [77]:

$$JSD(P \| Q) = \frac{1}{2} \sum_{i=1}^{n} p_i \log\frac{2 p_i}{p_i + q_i} + \frac{1}{2} \sum_{i=1}^{n} q_i \log\frac{2 q_i}{p_i + q_i}$$

JSD equals 0 if and only if $p_i = q_i$ for every i. In this case, they are the same distribution, in other words, the same application. It is a bounded and symmetric measure, with 0 ≤ JSD ≤ log(2), where the upper bound is reached for orthogonal distributions ($p_i q_i = 0$ for every i). Since traffic classification is based on the values of the distances between the application distributions, JSD is used to determine the divergence between two probability distributions P and Q.
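The properties listed above (symmetry, a value of 0 for identical distributions, and the upper bound log(2) for orthogonal distributions) can be checked with a short Python sketch. It uses the standard formulation of JSD as the average KL divergence to the mixture M = (P + Q)/2 with natural logarithms; the function names are illustrative assumptions.

```python
import math

def kl_to_mixture(p, m):
    """KL divergence from p to the mixture m; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Average KL divergence of p and q to their mixture (natural logarithm)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_to_mixture(p, m) + 0.5 * kl_to_mixture(q, m)

same = jensen_shannon_divergence([0.2, 0.8], [0.2, 0.8])
orthogonal = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
print(same)                      # zero for identical distributions
print(orthogonal, math.log(2))   # the upper bound log(2) is reached
```

Because the mixture is strictly positive wherever either distribution is, no absolute continuity condition is needed, in line with the advantage mentioned above.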

C. NON-PARAMETRIC DISTANCES AND DIVERGENCES
1) BHATTACHARYYA DISTANCE
Bhattacharyya Distance, also known as Bhattacharyya divergence, was proposed by the statistician Anil Kumar Bhattacharyya (1943 and 1946) working with Kailath [79]. This distance measures the dissimilarity between two probability distributions. It is closely related to the Bhattacharyya coefficient, which measures the amount of overlap between two statistical population samples [80], [81]. In its first version, Bhattacharyya did not present the calculation explicitly; he used a logarithmic scale.
Bhattacharyya Distance is independent of the distribution function and can be applied to any data set or sample. This characteristic makes the distance appealing for models in which the distribution is undetermined [80].
Bhattacharyya coefficient can be used in classification as a measure of the separability between classes [82], and to determine the relative proximity between samples that are being taken under consideration.
When two probability distributions have similar averages, the Bhattacharyya Distance grows with the difference between their standard deviations; in other words, the bigger the difference between the standard deviations, the bigger the distance. The Bhattacharyya Distance is given by equation 4 [83]:

$$D_B(p, q) = -\ln\left(\sum_{i=1}^{N} \sqrt{p_i \, q_i}\right) \tag{4}$$

where N is the quantity of partitions, and $p_i$ and $q_i$ are the proportions of members of each sample in the i-th partition.
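A minimal Python sketch of the Bhattacharyya coefficient and distance over partitioned (histogram) data follows. It assumes that the counts in each partition are first normalized to proportions, and it returns infinity when the two samples do not overlap at all; all names and counts are hypothetical.

```python
import math

def normalize(counts):
    """Turn per-partition counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def bhattacharyya_coefficient(p, q):
    """Amount of overlap between two discrete distributions (0 to 1)."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def bhattacharyya_distance(p, q):
    coefficient = bhattacharyya_coefficient(p, q)
    # No overlap at all: the distance is unbounded.
    return math.inf if coefficient == 0 else -math.log(coefficient)

# Hypothetical packet-size histograms of two flows (counts per partition).
flow_a = normalize([10, 30, 60])
flow_b = normalize([20, 30, 50])
print(bhattacharyya_distance(flow_a, flow_b))  # small positive value
```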

2) HELLINGER DISTANCE
Hellinger Distance was proposed by the German mathematician Ernst David Hellinger in 1909. It is a statistical divergence used to calculate the dissimilarity between two probability distributions. Hellinger Distance (HD) is related to the Bhattacharyya Distance and is part of the f-divergences family [78]. Studies presented in [84] and [85] showed that Hellinger Distance can be used in classification. Currently, this distance is widely used in machine learning, even as an alternative to methods such as entropy, to detect failures in classifiers [86] and breakpoints in the performance of those classifiers [87]. Furthermore, according to the literature, Hellinger Distance has been used in many parametric models, with considerable success in solving statistical estimation problems [84], [85]. The distance is obtained from two probability distributions p and q as follows [85]:

$$H_D(P, Q) = \sqrt{\sum_{i} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$

Hellinger Distance is non-negative and symmetric, and $H_D(P, Q)$ lies in $[0, \sqrt{2}]$. Note that the higher the Hellinger Distance, the better the differentiation between the probability distributions.
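The following Python sketch illustrates the Hellinger Distance in the unnormalized form whose range is [0, √2], as stated above. The function name and the input distributions are illustrative assumptions.

```python
import math

def hellinger_distance(p, q):
    """Euclidean distance between the square roots of two distributions."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

print(hellinger_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
print(hellinger_distance([1.0, 0.0], [0.0, 1.0]))  # sqrt(2) for disjoint distributions
```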

3) KULLBACK-LEIBLER DIVERGENCE
KL Divergence, also known as relative entropy, was defined by the mathematicians Solomon Kullback and Richard A. Leibler in 1951. It measures the difference between two probability distributions [88], [89], [90], [91]. Through statistical testing, those mathematicians started from the principle that two probability distributions are different, since there is a possibility of differentiation between them. KL measures the information gain and has been used in statistics, especially in Bayesian statistics.
KL is considered a special class of divergence, being an asymmetric measurement of difference or similarity. Therefore, KL allows us to deduce both the difference and the similarity between two distributions [91]. In KL, $p_i$ and $q_i$ are considered probability distributions, and the function is represented by $D_{kl}[p \| q]$.
It can be given by the equation:

$$D_{kl}[p \| q] = \sum_{i=1}^{N} p_i \log\frac{p_i}{q_i}$$

where N defines the quantity of samples. In data processing or classification problems, the result of the function $D_{kl}[p \| q]$ is the expected extra information needed to describe samples drawn from p when the description is based on q. Normally, p represents the real or current distribution of the data, while q represents the class, flow, application, or model [92], [93].
Note that the Jensen-Shannon Divergence is the symmetric version of the KL Divergence [78], [94].
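The asymmetry of KL can be seen in a few lines of Python. This sketch uses natural logarithms and adopts the usual conventions that terms with p_i = 0 contribute nothing and that the divergence is infinite when q_i = 0 while p_i > 0; the names and distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_kl[p || q] with natural logarithms."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf     # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

observed = [0.9, 0.1]   # hypothetical observed flow distribution (p)
model = [0.5, 0.5]      # hypothetical class model (q)
print(kl_divergence(observed, model))
print(kl_divergence(model, observed))  # a different value: KL is not symmetric
```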

4) WOOTTERS DISTANCE
Wootters Distance was proposed by the American physicist William Wootters in 1981, aiming to measure the difference between probabilities in terms of typical fluctuations. The main idea of this distance is to properly consider the statistical fluctuations inherent to any finite sample. It is purely statistical, and the concept can be used in any probabilistic area [95]. Considering two probability distributions p and q, the minimal distance between two points is equivalent to the angle between them, represented by equation 9 [83], [95]. Wootters can also define the similarity between two samples [96]. Given two probability distributions p and q:

$$D_W(p, q) = \arccos\left(\sum_{i} \sqrt{p_i \, q_i}\right) \tag{9}$$

Note that arccos() decreases in [0, 1], and that distances were used to discriminate traffic. Table 8 presents a summary of distances and divergences for quantitative (non-negative) data.
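A minimal Python sketch of the Wootters Distance as the angle between the square-root vectors of two distributions is shown below. The clamping step is an implementation assumption that guards against floating-point rounding pushing the overlap slightly outside [0, 1]; the function name is illustrative.

```python
import math

def wootters_distance(p, q):
    """Angle (in radians) between sqrt(p) and sqrt(q)."""
    overlap = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    overlap = min(1.0, max(0.0, overlap))  # guard against rounding errors
    return math.acos(overlap)

print(wootters_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
print(wootters_distance([1.0, 0.0], [0.0, 1.0]))  # pi / 2 for disjoint distributions
```

Since arccos() decreases in [0, 1], a larger overlap between the two distributions gives a smaller distance.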

D. SUPPORT VECTOR MACHINES
SVM was developed by Vapnik, Guyon, and Hastie [97]. It is based on Statistical Learning Theory and aims to solve pattern classification problems. Statistical Learning Theory gives us mathematical conditions to choose an efficient classifier to train and test a specific set of data. SVM is a supervised method focused on classification and regression. SVM was initially developed for binary classification, capable of recognizing sample patterns in predefined classes [98].
Currently, SVM supports multi-class learning and is used to solve multi-class classification problems. In addition, it has been widely used in the field of artificial intelligence. SVM is responsible for finding the best possible separation boundary between classes/labels for a given set of data that is linearly separable. For SVM, the many separation boundaries that are capable of completely separating classes are called hyperplanes. A decision plane that separates a set of objects with different class members is a hyperplane [99].
An important SVM aspect is the margin, which is seen as the gap between the two lines closest to the class points. The margin is calculated as the perpendicular distance from the hyperplane to the closest points, the support vectors. A good margin is the one which has the greatest distance between classes; a smaller margin is a bad one [99].
SVM seeks to find the best hyperplane for a given data set whose classes are linearly separable. SVM builds a classifier according to a set of patterns that it identifies in the training examples [100].
Classification problems tend to be more elaborate, requiring optimal separation through more complex structures. SVM proposes the classification of new objects (test) based on available data (training). For that, a set of mathematical functions, known as kernels, is used to map the new objects. SVM kernels are divided into two versions, linear and non-linear [92].
Kernel functions are intended to project input feature vectors into a high-dimensional feature space in order to classify problems that are not linearly separable in the original space. This is done because, as the dimensionality of the space increases, the probability that a problem that is not separable in a low-dimensional space becomes linearly separable also increases. However, to obtain a good distribution of the complex problem, a training set with a high number of instances is necessary. SVM-based classification uses the Linear, Radial Basis Function (RBF), Polynomial, and Sigmoid kernel functions [29], [101].
• Linear: it is the scalar product of observations, that is, the sum of the products of the corresponding components of each pair of input vectors.
• RBF: it maps the input space into an infinite-dimensional space. It is the most used kernel in SVM classification.
• Polynomial: this kernel can separate input spaces that are curved or non-linear. It is known for being a generalization of the linear kernel.
• Sigmoid: neural networks use the sigmoid function as the activation function. This kernel is part of the class of differentiable, bounded, and monotonically increasing functions.
Note that the SVM-based classification with the Linear kernel function is considered a parametric model, while the RBF, Polynomial, and Sigmoid kernel functions are considered non-parametric models.
RBF and Polynomial are both suitable kernels for separating application classes with non-linear or curved boundaries. Through this choice, more precise classifiers can be obtained. RBF and Polynomial kernels calculate the separation line in the higher dimension to classify some applications.
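As an illustration of the four kernel functions listed above, the following Python sketch gives one common parameterization of each. The parameter names gamma, coef0, and degree, as well as their default values, are assumptions borrowed from widespread SVM implementations rather than taken from the works reviewed here.

```python
import math

def dot(x, y):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)

# Hypothetical two-feature vectors.
u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                 # 11.0
print(polynomial_kernel(u, v, degree=2))   # (11 + 1) squared
print(rbf_kernel(u, u))                    # 1.0 for identical inputs
```

With degree=1, gamma=1, and coef0=0, the polynomial kernel reduces to the linear kernel, which is one way to read the statement that it generalizes the linear case.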
An important aspect of SVM is the set of regularization parameters that can be used to configure it [29]. One of them is the C parameter, which is the penalty parameter applied to the classification error, or error term, and is used to keep the model regularized. SVM optimization depends on controlling how much error can be tolerated. In this way, the trade-off between misclassification and the decision boundary is controlled. Note that, in the usual soft-margin formulation, the lower the value of C, the wider the hyperplane margin (at the cost of more training errors), and the greater the value of C, the narrower the margin will be [102].
Another parameter that also deserves attention in SVM is the Gamma parameter. With low values of Gamma, the model does not adapt much to the training data set. With higher values, the model adapts closely to the training set. Note that there must be a balance in the Gamma values, because values that are too high can cause overfitting, while values that are too low can make the model too constrained to capture the structure of the data.
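To make the role of the two parameters concrete, the sketch below evaluates the usual soft-margin objective, in which C weights the error (hinge loss) term against the margin term, and shows how Gamma controls how quickly the RBF similarity decays with distance. The weight vector, bias, and data points are hypothetical values chosen only for illustration, not a trained model.

```python
import math

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, for labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    )
    return margin_term + C * hinge

def rbf_similarity(distance, gamma):
    """RBF kernel value for two points separated by the given distance."""
    return math.exp(-gamma * distance ** 2)

# Hypothetical linear classifier and labelled points.
w, b = [1.0, 0.0], 0.0
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1)]

print(soft_margin_objective(w, b, data, C=1.0))   # 1.0
print(soft_margin_objective(w, b, data, C=10.0))  # 5.5, the same error costs more
print(rbf_similarity(1.0, gamma=0.1), rbf_similarity(1.0, gamma=10.0))
```

With a large Gamma, only points very close to each other remain similar, so the decision boundary can follow individual training points.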

IV. CLASSIFICATION OF INTERNET TRAFFIC USING STATISTICAL METHODS
This section addresses the use of statistical methods for Internet traffic classification. Tables 9 to 12 present an overview of previously used statistical methods for Internet traffic classification. To the best of our knowledge, Wootters Distance, addressed in the previous section, has not yet been investigated for Internet traffic classification. Therefore, it is not considered in this section, being left as a possible future research direction in the next section. Some methods have few applications and have been little explored for classification, such as Jensen-Shannon and KL. Others, such as SVM, have been extensively explored in this kind of classification, often presenting good accuracy values. Table 9 details the papers describing distance-based methods for traffic statistical analysis.

1) EUCLIDEAN DISTANCE
Euclidean Distance was addressed in several works found in the literature, including in the implementation of some well-known machine learning algorithms, such as K-Means and Nearest Neighbor (NN). In Table 9, there is a summary of works around this distance. Zhu et al. in [106] proposed a method for classifying an unknown protocol of the application layer based on the Euclidean Distance. In [55], Shi et al. discuss the method of extraction and selection of features for classification, where the K-Means algorithm with Euclidean Distance was used to group the features. Pereira et al. in [103] developed a real-time, flow-based network traffic classification system using the NN technique and Euclidean Distance. The focus of [105] was to use statistical features of network flows to identify the generating application, and the Euclidean Distance was used to test the classification algorithm. Singh in [104] used K-Means, which calculates the distance between objects by using the Euclidean Distance, to group the network traffic applications.
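The combination of the NN technique with the Euclidean Distance can be sketched as follows. This is a generic one-nearest-neighbor illustration and not the method of any of the cited works; the feature choice (mean packet size and mean inter-arrival time), the labels, and the training values are hypothetical. In practice, features on different scales would normally be normalized first.

```python
import math

# Hypothetical labelled flows: (mean packet size in bytes, mean inter-arrival time in ms).
TRAINING_FLOWS = [
    ((1200.0, 5.0), "video"),
    ((1100.0, 8.0), "video"),
    ((150.0, 20.0), "voip"),
    ((180.0, 22.0), "voip"),
]

def nearest_neighbor_label(features, training=TRAINING_FLOWS):
    """Return the label of the training flow closest in Euclidean distance."""
    closest = min(training, key=lambda item: math.dist(features, item[0]))
    return closest[1]

print(nearest_neighbor_label((1000.0, 6.0)))  # video
print(nearest_neighbor_label((200.0, 19.0)))  # voip
```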

2) BHATTACHARYYA DISTANCE
Shah and Dang in [114] used Bhattacharyya Distance to select the features with the highest distance from a test pool. In [110], the temporal analysis of the behavior of the network is established by calculating this same distance. Aiming to calculate the difference between solved and unsolved iEvents that correspond to the traffic density distributions, Zanin [107] also used this distance. In [111], class separability was maximized using the Bhattacharyya Distance algorithm. In [108], the Bhattacharyya Distance is used to quantify the similarity of the probability distributions of Virtual Machine (VM) resource usage. In [109], the Bhattacharyya Distance is used to calculate changes in the color histogram. In [112], Baskoro et al. proposed an algorithm for counting and tracking vehicles using the Bhattacharyya Distance. It is used by Laz in [113] to evaluate detection system performance. In Table 9, there is a summary of works around Bhattacharyya Distance.

3) HELLINGER DISTANCE
In Table 9, there is a summary of works about the use of Hellinger Distance. In [118], the Hellinger Distance was used by Wang et al. to find the deviations among sketches. A sketch is a collection of hash tables; Wang et al. proposed the SkyShield method, which uses the sketch technique to detect anomalies. The Hellinger Distance was used in [121] to perform linear and non-linear transformations aiming to improve accuracy in dataset classification. A derivation of the squared Hellinger Distance was used by Liu et al. in [115]. In [119], Kumari and Thakar proposed an oversampling method based on the Hellinger Distance to identify the minority class in the classification. In [116], it is used to measure the similarity of two probability distributions to implement an attack classifier in a monitoring network. It was also used in [117] in the linear SVM kernel implementation for the classifier training step. In [120], Hellinger Distance is applied to feature value distributions.

4) WOOTTERS DISTANCE
In the search carried out across the databases referred to in this article, covering the period from 2011 to 2022, no applications of Wootters Distance as a classification technique, for feature selection, or as a kernel extension in methods such as SVM were found in the literature. Table 10 details the papers describing divergence-based methods for traffic statistical classification.

1) JENSEN-SHANNON DIVERGENCE
In Table 10, there is a summary of works around the use of JSD. In [123], Zareapoor et al. applied a property of JSD to identify information deviation. In [124], Zhi et al. proposed an Interest Flooding Attack (IFA) resistance mechanism based on JSD. This mechanism can help detect and mitigate flooding attacks on the network. The values obtained from the JSD calculation were used in [125] to select the features. In [126], JSD was used to calculate the distribution similarity between the original discrete attributes and the generated ones, aiming to evaluate the performance of the Anti-Intrusion Detection Autoencoder (AIDAE). In [122], the difference between M1 and M2 (the histograms of two mixture distributions) is quantified using the JSD of bin-placement approaches.

2) KULLBACK-LEIBLER DIVERGENCE
Some works were found in the literature using KL for Internet traffic classification. In Table 10, there is a summary of works about the use of this divergence. Kim et al. in [127] proposed a network classification with a KL criterion. In [128], it was used to detect video clips. KL was also used in other fields of analysis, such as agriculture. In [129], KL is employed to validate the similarity of unknown pixels. In [130], KL is used for the classification of encrypted Internet traffic.

C. SVM
Several SVM applications for traffic classification were found in the literature. In Table 11, there is a summary of works around this statistical method. It was used in [133] with the Linear, Polynomial, Sigmoid, and Radial kernels for traffic classification in Software Defined Networking (SDN). Cao et al. in [134] proposed a real-time training model using SVM. It was also used in [139] with denoising schemes to improve prediction accuracy. In [136], Miao et al. used SVM to optimize feature selection. To distinguish data representing normal network traffic from Distributed Denial of Service (DDoS) flows, Aamir and Zaidi [140] tested different combinations of parameters on SVM. In [143], Sentas et al. developed a video data detection and classification system. Luo et al. in [141] proposed a hybrid optimized Least Square SVM (LSSVM), a model for short-term traffic flow forecasting. Suresh and Srijanee in [145] used SVM to analyze the traffic data pattern and detect anomalies in order to secure high-volume confidential data transmitted over a wireless network. In [142], Xiao used this statistical method combined with KNN to detect traffic incidents. In [144], Dong proposed optimizing the SVM method to improve training speed and classification, using this enhanced SVM, called Cost-Sensitive SVM (CMSVM), to solve imbalance in network traffic identification. Cao and Fang [146] and Syarif et al. [63] optimized the SVM parameters based on the Genetic Algorithm (GA) for Internet traffic classification. Mostafa et al. in [147] proposed a new version of this method, named Relaxed Constraint Support Vector Machines (RSVMs), to optimize classification without needing source or destination IP addresses or port information. In [100], Liu et al. addressed SVM for Traffic Identification and Classification (STIC), aiming to identify applications and focusing on the duration and quality of YouTube streaming.
Aggarwal and Singh in [135] made use of this method to categorize Internet traffic. In [148], a distributed SVM framework was implemented to classify network traffic using Hadoop. In [131], Hao et al. improved a variation of it called Directed Acyclic Graph-Support Vector Machine (DAGSVM) to classify network traffic. In [132], SVM was used to classify network traffic by improving the algorithm to calculate its own feature weights and parameter values for every individual binary classifier. It was also used in [138] to classify large amounts of data. SVM was used in [137] as the basis to implement an optimized model that reduces memory and CPU cost in the training phase, called Incremental SVM (ISVM), and a modified version with an attenuation factor (AISVM). Table 12 details the papers describing various other methods for traffic statistical analysis found in the literature.

1) CORRELATION INFORMATION
In Table 12, there is a summary of works around Correlation Information (Pearson Correlation). Correlation was used in [61] to boost network traffic classification performance. In [135], Aggarwal and Singh used a Bag of Flow (BoF) to model correlation information in traffic flows and SVM to categorize traffic by application. Correlation was also the object of research in [149], which presented a new traffic classification framework. For that, Zhang et al. used the BoF to model traffic flow correlation information. Besides that, they also used a model based on NN. A new classification method that took the network traffic flow correlation into consideration was also proposed by Zhang et al. in [149]. In [151], Zhang et al. considered real traffic and classified the correlated flows together. Dong et al. [150] presented the disadvantages of using Pearson's Correlation Coefficient to measure the relationship between traffic flows. Based on these disadvantages, the authors presented a new proposal to measure correlation quantitatively and accurately.
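For reference, Pearson's Correlation Coefficient between two series of flow measurements can be computed as in the minimal Python sketch below. The function name and the sample series are hypothetical, and the sketch does not reproduce the BoF-based approaches cited above.

```python
import math

def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)

# Hypothetical per-interval byte counts of two flows.
flow_a = [100.0, 200.0, 300.0, 400.0]
flow_b = [110.0, 210.0, 310.0, 410.0]
print(pearson_correlation(flow_a, flow_b))  # close to 1: the flows vary together
```

The coefficient only captures linear relationships, which is one of the limitations that motivates alternative correlation measures.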

2) STATISTICAL KOLMOGOROV-SMIRNOV AND CHI-SQUARE TESTS
Statistical tests such as the Kolmogorov-Smirnov and Chi-Square tests have also been used for traffic classification. In Table 12, there is a summary of works around those tests. Neto et al. [152] represented traffic classes by using empirical distributions that correspond to the signatures of the traffic classes, aiming to develop a classifier based on an "in the dark" mechanism that combined both Kolmogorov-Smirnov and Chi-Square tests. Chi-Square was also used in [153] to test whether a set of data follows a specific distribution with a given degree of confidence.
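The statistics behind both tests can be sketched in a few lines of Python. The code below computes only the test statistics (the maximum gap between two empirical distribution functions, and the sum of squared deviations from the expected counts); turning them into a decision requires critical values or p-values, which are omitted here. The sample values and function names are hypothetical.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical packet sizes of an unknown flow and of a class signature.
print(ks_statistic([60, 70, 1500], [64, 72, 1400]))
# Hypothetical counts per packet-size bin against the counts expected for a class.
print(chi_square_statistic([10, 20, 30], [20, 20, 20]))  # 10.0
```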

3) SHANNON ENTROPY
Gomes et al. in [155] used entropy to emphasize and recognize VoIP P2P traffic flows that belonged to a VoIP session. The developed classifier aimed to identify the flow used in the conversation and focused on the specific characteristics of the voice codec instead of the application used in the VoIP session. In [154], Wang et al. used entropy to classify traffic more deeply. In [156], Zhou et al. used entropy for the evaluation of encrypted traffic classification. In Table 12, there is a summary of works about the use of entropy.
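As a simple illustration of how entropy can serve as a traffic feature, the sketch below computes the Shannon entropy, in bits per byte, of a sequence of bytes. Applying it to packet payloads is an assumption made here for illustration and is not necessarily the feature used in the cited works.

```python
import math
from collections import Counter

def shannon_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy(b"aaaaaaaa"))         # zero: a single repeated byte value
print(shannon_entropy(bytes(range(256))))   # 8 bits: all byte values equally likely
```

Encrypted or compressed payloads tend to have entropy close to the maximum, whereas plain-text protocols usually score lower.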

V. DISCUSSION AND OPEN ISSUES
A. DISCUSSION
Distance and divergence computations are advanced methods of statistical analysis that can be used for classification and, in our context, were used for Internet traffic classification. Through the statistical properties, statistical traffic classification models may be created for a given application. For these methods, a learning phase is sometimes required to build a reference model that can be used to classify traffic.
Statistical classification, also known as logic-based classification, allows traffic identification through statistical attributes of the flow. The packet length and duration, the traffic flow idle time, and the time between packet arrivals are examples of statistical traffic attributes, or flow-level measurements. Statistical classification tends to assume and exploit features that are unique to each application, most of the time using data mining techniques to do so.
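The flow-level attributes mentioned above can be derived from packet timestamps and lengths alone, without inspecting payloads. The following Python sketch assumes a non-empty flow given as a list of (timestamp in seconds, length in bytes) pairs sorted by time; the function name, the dictionary keys, and the example packets are hypothetical.

```python
import statistics

def flow_features(packets):
    """Compute simple statistical attributes of a flow.

    packets: non-empty list of (timestamp_seconds, length_bytes), sorted by timestamp.
    """
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "duration": times[-1] - times[0],
        "mean_length": statistics.mean(lengths),
        "mean_inter_arrival": statistics.mean(gaps) if gaps else 0.0,
        "max_idle": max(gaps) if gaps else 0.0,
    }

example_flow = [(0.0, 100), (0.5, 300), (2.0, 200)]
print(flow_features(example_flow))
```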
Statistical classifiers are lightweight and do not require packet payload analysis. In addition, they can achieve the same precision as other methods found in the literature, even using fewer features. These advantages make them suitable candidates for the most restricted configurations. Also, given the current trend towards flow-level monitors like NetFlow [157], the ability to operate on statistical characteristics only is an advantageous property for classifiers.
As for the computational complexity of statistical methods, Valenti et al. [7] show how tree-based statistical classification can sustain high throughput on off-the-shelf hardware. Figure 5 shows the network visualization map created using the VOSviewer tool. This map was created from the references cited in this article, based on bibliographic data. The data was read from reference manager files (.ris). We chose the co-authorship analysis with the fractional counting method, in which the strength of the document is divided by the total number of authors. We did not ignore documents with a large number of authors. For the generation of our map, we required at least 1 document per author and found 462 different authors and co-authors. For each author, the total number of co-authors was calculated, and the authors with the greatest total link strength were selected for the chart.

B. OPEN ISSUES
In the literature, significant research has been done on traffic classification and on how to improve the performance of the classifiers, but there are still some challenges ahead. Considering the technologies and methods applied, most challenges still lie in classifying encrypted, unknown, and P2P traffic in real time, or in a timely manner, with high precision and low processing power. In this section, we outline some important open research questions that need to be addressed in this field of research, as follows:
• Although SVM has been widely used to classify traffic, traditional traffic classifiers based on SVM have their limitations, among them the high computational cost in terms of memory and CPU, the highly complex training, and the difficulty of operating in real time, which makes real-time and timely classification unfeasible. Possible research directions may include the development of new SVM models to address the above issues, following the work in [137].
• SVM still faces imbalance issues in its training phase. According to [158], solving the imbalance problems in SVM classification remains an open issue.
• SVM performance does not depend strictly on the size of the training data, but on the quantity of Support Vectors (SVs). An open question for research is balancing data volume and complexity because, according to [137], with the increase in training data, computational complexity and the occupation of computational resources will also grow significantly.
• One of the issues to be worked on when implementing an Internet traffic classifier using SVM is choosing the parameter C correctly because, according to [158], the classification is sensitive to C: if it is not chosen correctly, SVM, even when optimized, produces worse classification results.
• Explore the feasibility of the use of the Wootters Distance for encrypted Internet traffic classification, which, to the best of our knowledge, has not yet been investigated.
• Investigate the use of less explored statistical distances and divergences for encrypted Internet traffic classification, namely Bhattacharyya and Hellinger Distances and Jensen-Shannon Divergence. Although these statistical methods have been investigated for network security and intrusion detection, among others, as reported in this work, we did not find specific applications of these methods for the classification of encrypted Internet traffic.
• Explore the combination of the SVM classifier with statistical divergences. In the literature, we find works that combine Euclidean Distance with the K-Means algorithm for classifiers, and Kullback-Leibler combined with SVM. However, we did not find classifiers combined with, for example, Hellinger Distance, Wootters Distance, or Jensen-Shannon Divergence.

VI. CONCLUSION
The main purpose of this work was to explore statistical methods and techniques recently used, or with the potential to be used, in Internet traffic classification. We provided an overview of the Internet traffic classification process as well as an insight into statistical methods of potential interest as classifiers for encrypted Internet traffic, including those methods that have not yet been explored for Internet traffic classification. Then, we reviewed previously used statistical methods for Internet traffic classification, organized by distances, divergences, SVM, and other statistical methods. Through the literature review, we identified that the most used statistical method for traffic classification is SVM. In addition, we also identified several open issues that could be the subject of further research on this topic. More specifically, we identified statistical distances and divergences that have not been much explored for traffic classification. They could be used separately or combined with the SVM classifier in order to address challenging problems such as real-time traffic classification and encrypted traffic classification.
He has experience in the area of probability and statistics, with an emphasis on experimental planning, working mainly on the following topics: regional development, agribusiness and sustainability, biofuels, DEA and multidimensional poverty.
DAMIEN MAGONI (Senior Member, IEEE) received the M.Eng. degree from Télécom Paris, and the M.Sc. and Ph.D. degrees from the University of Strasbourg, in 1999 and 2002, respectively. He has been a Visiting Researcher at various institutions around the world, including the AIST at Tsukuba, the University of Sydney, the University of Michigan at Ann Arbor, and University College Dublin. He has been a Full Professor of computer science at the University of Bordeaux, since 2008. From 2002 to 2008, he was an Associate Professor at the University of Strasbourg. Some of his research has been supported by grants from the European Union, the CNRS, and Science Foundation Ireland. He has co-published over 80 refereed research papers. He also has authored several open-source software for networking research and teaching. His latest contributions are the virtual network device and the network mobilizer, which jointly enable the emulation of mobile networks. His main research interests include computer communications and networking, with a focus on internet architecture, protocols, and applications. He has reviewed for over 20 academic journals and has been in the TPC of numerous high-level conferences. He is a Senior Member of the ACM.