Timely Classification and Verification of Network Traffic Using Gaussian Mixture Models

We present a novel approach for timely classification and verification of network traffic using Gaussian Mixture Models (GMMs). We generate a separate GMM for each class of applications using component-wise expectation-maximization (CEM) to match the network traffic distribution generated by these applications. We apply our models for both traffic classification, where the goal is to identify the source application from which the traffic originates, by evaluating the maximum posterior probability, and for traffic verification, where the goal is to verify whether the application that claims to be the source of the traffic is as expected, by likelihood testing. Our models use only the first initial packets of truncated flows in order to provide more efficient and timely traffic classification and verification. This allows for triggering timely countermeasures before the end of flows. We demonstrate the effectiveness of our approach by experiments on a public dataset collected from a real network. Our traffic classification approach outperforms other state-of-the-art approaches that are based on machine learning, and achieves up to 97.7% flow classification accuracy when using only 9 first initial packets of flows. We show that 96.6% flow classification accuracy can still be obtained when training the GMMs using only 0.5% of all flows. Our traffic verification approach achieves a minimum Half Total Error Rate (HTER) of 7.65% when using only 6 first initial packets of flows.


I. INTRODUCTION
The number of internet applications and the variety of endusers is increasing continuously, as well as the number of online network attacks and advanced generations of malware. This has caused that both classification and verification of network traffic have become more difficult. The task of traffic classification is to identify the source application from which the traffic originates, while the task of traffic verification is to verify whether the application that claims to be the source of the traffic is as expected. Accurate classification and verification is essential, particularly for network operators in network management tasks such as resource allocation, accounting, traffic scheduling, quality of service, lawful interception of IP traffic, and network security. Continuous research on both The associate editor coordinating the review of this manuscript and approving it for publication was Luis Javier Garcia Villalba . traffic classification and traffic verification is required to mitigate these problems.
In this paper, we present a novel generative approach for timely traffic classification and traffic verification. The key idea in our approach is that we generate a separate Gaussian Mixture Model (GMM) for each application class, where an application class is a group of related applications such as e-mail clients. Before generating the GMM for an application class, we first derive feature vectors from the network traffic generated by the applications in the class. The feature vectors include features such as the number of packets in network flows, and statistics on the size of packets and the interarrival times between packets. Subsequently, we apply the feature vectors in a training stage to generate the GMM for the application class. The GMM is a probabilistic model that represents the probability density function of the feature vectors. Hence, the GMM models the probabilistic VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ distribution of feature vectors for that application class. We apply component-wise expectation-maximization (CEM) to generate the GMM [1]. Once the GMMs are generated, we can apply them for both traffic classification and traffic verification. For traffic classification, we monitor network flows in the network traffic, we extract feature vectors, and evaluate which of the GMMs correlates best with these feature vectors by means of a maximum a posteriori probability (MAP) estimate. For traffic verification, we consider network traffic that is labelled with its application class, and we verify whether it matches the GMM of the specified application class by means of likelihood testing.
Our approach can tackle the problem of concept drift in network traffic, which occurs for instance when the number of applications is growing in time. We can simply generate GMMs for the new application classes and add them to the set of GMMs, without the need for keeping all past training data and retraining GMMs. Hence, our approach facilitates smooth adaption for changing traffic profiles.
This paper contributes the following: 1) We extend our prior work on traffic classification and verification and present in more detail the parameters estimation and model selection when training the GMMs [2]. 2) We present a novel set of 59 statistical features that are derived from the first initial packets of truncated flows. We obtain optimal feature subsets for different numbers of first initial packets of flows for both traffic classification and traffic verification, and explore their effectiveness for generating GMMs. 3) We present experimental results to demonstrate that we are able to classify and verify network traffic in a timely way. This may for instance facilitate early detection of zero-day attacks. 4) We demonstrate that our approach outperforms stateof-the-art approaches for traffic classification that are based on machine learning.
The rest of the paper is organized as follows. In Section II we review prior work on traffic classification and traffic verification. In section III we present our approach to construct GMMs that subsequently can be applied for GMMbased traffic classification and verification. In Section IV we present the details of our experimental setups and evaluation results. We conclude the paper in Section V.

II. LITERATURE REVIEW A. TRAFFIC CLASSIFICATION
Traffic flows play a key role in traffic classification. Network traffic generated by different individual applications running on one or multiple hosts can be aggregated and separated by traffic flows. Each flow can be defined as a (unidirectional or bidirectional) sequence of IP packets sharing typically a 5-tuple identifier (source IP address, destination IP address, source port, destination port, protocol type) within a certain period of time. The aim of traffic classification is to map each specific flow to a group of applications of interest.
Traditional classification approaches are port-based or payload-based. In the port-based approach, applications are simply identified by the TCP or UDP port numbers. However, many applications, such as P2P applications and Skype, are increasingly using arbitrary port numbers to avoid detection, which downgrades classification accuracy and renders the use of port-based identification suboptimal. In the payloadbased approach, applications are identified by the actual payloads of packets or the reassembled contents of flows, looking for characteristic signatures of known applications. Payload-based approaches provide high accuracy, however they require high computational cost, privacy is a major concern, and they are likely to fail with encrypted payloads [3].
An alternative approach is to rely on statistical characteristics of traffic flows, regardless of their payloads, which can overcome the limitations of port-based and payload-based approaches [4]. The main assumption behind this approach is that the statistical characteristics (features or discriminators) of network traffic flows are distinct for different applications and can be used to distinguish applications from each other. A lightweight approach that is not hampered by encrypted payloads, is to consider statistical characteristics extracted from packet headers only while ignoring the packet payloads. Relying on completed flows, which requires to wait for the end of flows, is less practical or suboptimal in cases where quick and timely classification is required. This can be relaxed by considering statistical characteristics from truncated flows, in which only the first few packets of flows are considered. It has been shown by multiple researchers that such truncated flows may provide sufficiently distinctive properties that allow to distinguish applications from each other in an early stage of traffic flows [5]- [11]. However, most prior work only considered simple features such as size, direction, and sequence of the first non-zero payload packets of flows [5], [6], [8], [9]. Furthermore, they did not consider TCP control features, such as SYN and ACK, and statistical features, such as mean and variance of packet size and packet interarrival time. Consequently, attackers can evade such classification by simply padding the packet payloads. Li and Moore [7] extracted more sophisticated features from the first few packets of flows, starting from the SYN-packet up to a duration of 5 seconds. However, their features also included the port number pairs, and hence classification results may be biased towards port numbers. Consequently, attackers can evade such classification by performing port masquerading. Liu et al. [10] used 11 statistical features from the first few packets of flows. Garcia and Korhonen [11] captured distributional characteristics of the initial packets of flows considering histogram fractions.
The use of machine learning for traffic classification has been explored extensively, partly due to the availability of off-the-shelf learning algorithms and their efficiency in dealing with statistical information [3]. Several challenges and requirements have to be considered [12]: an appropriate definition of a set of classes is required to specify the classes of network traffic that should be distinguished from each other (e.g. P2P versus web traffic [13]) or to identify specific application protocols [14]; tools are required to extract (near real-time) flow features that have the potential to be deployed in online real-world settings (e.g. TIE [15] and Netmateflowcalc); benchmarking tools (e.g. [16], [17]) or network architectures (e.g. [18]- [21]) are required that generate 'reliable ground truth' [16], [22] aiming at evaluating various traffic classification approaches [23]; measurements should provide byte accuracy in addition to flow accuracy [24]; and the temporal stability of classification schemes should be examined [25]. Most prior work has focused on discriminative algorithms, including Support Vector Machines (SVM) [26] and Decision Trees (DT) [7], where a single conditional distribution model is trained for all classes and used next to discriminate the classes from each other [27]. This however may suffer from concept drift when the number of applications is constantly growing in time. The model has to be updated with the advent of a new class, which requires retraining for which also all past training data likely need to be accessible. This makes discriminative approaches in their current forms suboptimal.
More recently also deep learning has been applied for traffic classification and anomaly detection [28]- [31]. Different deep learning methods vary in their performance, but most of them provide high accuracy and are also well able to detect previously unknown attacks. However, more training data is required to gain sufficient accuracy, and also the time and computational power required to train models is in general much larger when compared to machine learning. If the primary goal is to reduce training time and computational costs, as we consider in this paper, deep learning methods are not preferred.
Also the use of GMM as a generative approach for traffic classification has been explored in prior work [6], [32], [33]. Bernaille et al. [6] proposed a method in which a single GMM is used to approximate the underlying feature distributions of all classes. Moore and Zuev [32] proposed to model the underlying feature distributions of each class by a Gaussian distribution, using a single component for all the classes. Dusi et al. [33] proposed an internet traffic classifier that utilizes class-specific GMMs to approximate the underlying feature distribution of each (type of) application injecting traffic on SSH-encrypted tunnels. These methods assign any unseen observation to the class that maximizes the posterior probability.
Regardless of whether traffic classification is based on class-conditional GMMs or a single class-independent GMM, a model selection technique is needed to provide good performance. Prior GMM-based traffic classifiers build a set of candidate GMMs with different numbers of components and then select the one that provides the best performance [6], [33]. However, training an entire set of GMMs along with the EM-associated problems is a main drawback (see Section III-B2).
The approach proposed by Figueiredo and Jain [34] automatically selects the optimal number and shape of mixture components in a single algorithm. The semi-supervised traffic classification framework presented by Qian et al. [35] utilizes the GMM learning approach proposed by Figueiredo and Jain [34]. However, they only consider a single GMM that models all network traffic, and each GMM component is assigned to an application. Our approach also adopts the GMM learning approach proposed by Figueiredo and Jain [34], but to the best of our knowledge, our approach is the first attempt to automatically find the best GMM for each individual application in traffic classification.

B. TRAFFIC VERIFICATION
Network traffic anomalies are inferred from patterns in network traffic data that deviate from (or do not conform to) the normal network behaviour (profiles) [36]. Signature-based intrusion detection systems match observed traffic against a pre-configured set of (known) intrusion signatures, while anomaly-based intrusion detection systems notice deviations from pre-defined models (profiles) describing normal traffic. The latter may detect new types of (still unknown) intrusions and zero-day attacks, but have to be adapted constantly due to changing profiles. García-Teodoro et al. presented an overview of various anomaly-based intrusion detection systems [37].
Anomaly-based detection relies on the assumption that the statistical characteristics of traffic differ between normal and anomalous traffic. Statistical characteristics of normal network traffic are used to build users' profiles. These statistical characteristics can be extracted from one or more traffic features, such as packet header, payload, flow, bandwidth usage or packet distribution.
Payload-based approaches [38] are probably one of the most promising ways of detecting traffic anomalies that originate from infected applications. These methods have the ability to provide near real-time detection with high accuracy levels. However, as for payload-based classification, also payload-based verification suffers from high computational costs, privacy concerns, and difficulties when dealing with encrypted or otherwise securely encapsulated traffic [39]. Alternative approaches utilise the information available in the non-encrypted IP packet header and analyse statistical characteristics of flows regardless of packet payload.
Much of the research on anomaly-based detection has been devoted to build network normality profiles from aggregated network traffic, and flows are a fundamental object of study [40], [41]. Prior flow-based and payload-based approaches may be able to detect some intrusion-infected traffic, but they are unable to identify the source application responsible for the traffic. Moreover, they cannot detect abnormal traffic generated by an intrusion-infected application whenever such traffic is very similar to the normal traffic of other known applications. To address this problem a detection system needs to evaluate whether traffic is normal for its source application. An anomaly-based detection system therefore should meet two requirements: (i) identify the claimant (source application) that is responsible for the traffic, and (ii) model or profile the genuine traffic of each application present in the network. To meet the first requirement, we proposed architectures that provide a binding between network traffic and source application that allows checking whether a packet/flow claimed by an application conforms to its expected traffic model [42], [43]. For the second requirement, we proposed GMM with automatic learning to model per-application traffic [2], [39], [44]. However, our prior application-specific models were still trained with features obtained from the entire packet flows and were not accurate enough for detecting anomalies in a timely manner before the end of a flow. Moreover, this approach suffered from the difficulty of finding an optimal number of mixture components. We tackled the latter issue by applying GMM learning [2], but the problem of timely detection remained to be addressed. Also Bahrololum and Khaleghi [45] applied GMM in anomaly-based detection, however they only considered full flows. Since sophisticated intrusions may meet their illicit purposes through one single flow, timely detection and prevention is important. In the present paper, we address this issue and explore the effectiveness of truncated flows for different numbers of initial packets in order to provide more efficient and timely detection.

III. GMM DESIGN
We apply a GMM-based approach for both traffic classification and verification. In this section we first describe the GMM used and its parameterization. Next, we describe traffic classification and verification.

A. MODELLING APPLICATIONS
GMMs are an effective way for data analysis and system modelling, including those involving random processes. A GMM is a probabilistic model for representing the probability density function of a population of data points as a weighted mixture of probability density functions of normally distributed sub-populations (components). For each application class a we compute a separate GMM using traffic packets from truncated flows generated by the application(s) in the class. We extract D-dimensional feature vectors x from these flows, and model their probability distribution by a GMM with M components. The GMM of an application class a is defined as: where Here, | a m | denotes the determinant of the covariance matrix, ( a m ) −1 denotes the inverse covariance matrix, and T denotes transposition. The covariance matrix can either be free (unrestricted) or diagonal (restricted). Additionally, the covariance matrix can be constrained to be the same across mixture components.

B. PARAMETERS ESTIMATIONS
For a given set of N feature vectors X = {x 1 , x 2 , . . . , x N } that is extracted from (truncated) flow samples generated by an application class a, the GMM is defined as (assuming that the feature vectors are independent, which may be incorrect but assumed to make the problem tractable): The set of feature vectors is used as a training set to construct the GMM. The goal is to estimate the parameters in a such that the GMM matches the distribution of the training samples. These parameters estimates can be obtained using either the Maximum Likelihood (ML) estimatê or the Maximum A Posteriori (MAP) estimate (given some prior p( a )) through an Expectation-Maximization (EM) algorithm.

1) THE EM ALGORITHM
With feature vector x n we associate a label z n , which is an M -dimensional vector indicating which of the M GMM components produces x n . The EM algorithm interprets X as incomplete data, with the missing part being the corresponding set of labels Z = {z 1 , . . . , z N } in which z n = [z n,1 , . . . , z n,M ] is a binary vector with z n,m = 1 and ∀ p∈{1,...,M }\m : z n,p = 0, meaning that x n was produced by the m th GMM component. Assuming that all x n are independent, the complete data log-likelihood function for the GMM can be written as [46], [47]: The EM algorithm begins with an initial estimate of parametersˆ a (0) and iteratively generates a sequence of estimates {ˆ a (t)|t = 1, 2, . . .} by alternatively applying two steps, namely Expectation (or E-Step) and Maximization (or M-Step), as follows: E-Step computes the expected value of the conditional probability of the log-likelihood, given the observed data X and the current parameter estimatesˆ a (t), yielding the following Q-function: where ≡ E Z |X ,ˆ a (t) . The latter equality is obtained from the linearity of log p(X , Z | a ) with respect to Z (see Equation 6). Since the members of Z are binary, their conditional expectations can be expressed as the posterior probability that x n was produced by the m th GMM component for application class a: M-Step updates the parameter estimates using a MAP estimation according to: The term log p( a ) is removed in the case of ML estimation. The EM algorithm is iterated until either the changes in the estimated parameters, the number of iterations, or the log-likelihood exceed some specified threshold (whichever comes first).

2) ESTIMATING THE NUMBER OF COMPONENTS
Since the data distribution likely differs for different application classes, exploring the best fit for each individual application seems worthwhile. One of the key choices in training a GMM is choosing the number of components M . The general trend observed in literature is to apply a deterministic approach that first builds a set of candidate models for a range of values of M (from M min to M max ), and next selects from the entire set of available models the one that best fulfils a given selection criterion such as the Bayesian Inference Criterion (BIC) [48], the Minimum Description Length (MDL) [49], or the Minimum Message Length (MML) [50]. Although some of these approaches may perform well, building a whole set of candidate models is a major drawback. Moreover, since they essentially utilize the EM algorithm for building the models, they are prone to two major EM-associated problems: sensitivity to initialization of the parameters, meaning that different initial values of parameters may converge to different local optima, and not necessarily the global one, during the maximization process (M-Step); and possible convergence of the sequence of EM parameter estimates to the boundary of the parameter space (where the likelihood is boundless).
To deal with these problems, we adopt the approach proposed by Figueiredo and Jain [34], which facilitates a seamless integration of parameters estimation and model selection. We assume that the number of components of each application class is unknown and can be different for different application classes. The method implements a variant of the EM algorithm, named component-wise EM (CEM) [1], with the aim of minimizing the following MML criterion as the cost function: (10) where K = D + D(D + 1)/2 is the number of parameters specifying each component and M n,z ∈ {1, 2, . . . , M } denotes the number of components whose probability is nonzero. The cost function, for fixed M n,z , is mathematically equivalent to an a posteriori probability density function yielding from a Dirichlet-type prior for ω a m and a flat prior for θ a m 's. With some adjustments to minimize the cost function, the sequence of parameter estimates of the M-step is summarized as follows: for m = 1, 2, . . . , M and ψ n,m given in the E-step. With the use of Equation 11 and allowing M to move from a large number M max to a smaller one M min , the algorithm not only tackles the problem of 'convergence to the boundary of parameter space' by annihilating 'too weak' components, but also becomes less sensitive to the initialization problem [51]. However, the simultaneous (or batch) updating ofω a m 's may cause a failure in the determination of ω a m 's. For a large value of M , all components may not have enough initial support ( N n=1 ψ a n,m < K /2 for m = 1, . . . , M ) and, consequently, all ω a m 's cannot be determined. This problem is avoided by the use of CEM. CEM performs the updating of allω a m and θ a m = {μ a m ,ˆ a m } sequentially, rather than simultaneously. In other words, CEM executes E-step for any single updating ofω a m andθ a m , rather than for batch updating of all parameter estimates. This allows immediate redistribution of the probability mass of a zero-component to the other components. The CEM iteration is continued until the relative decrease in ( a (t), X ) drops below a given threshold ε. This convergence results in a certain value for M . However, a smaller value of M n,z may lead to additional decrease in the value of ( a , X ). For this reason, after reaching convergence with VOLUME 8, 2020 a certain M , the component with smallerω a m is set to zero and CEM is rerun until convergence. This process is repeated while M n,z ≥ M min . At the end, the estimated parameters as well as the number of components are those which minimize ( a , X ).

C. TRAFFIC CLASSIFICATION
In traffic classification we apply the GMMs and assign each observed traffic flow to the most probable application class. Given a set of K application classes Y = a 1 , a 2 , . . . , a K represented by the models a 1 , a 2 , · · · , a K , our objective is to determine the application model which has the maximum posterior probability for feature vector x. Hence, we assign each flow to the class with the highest p( a i |x): Using Bayes' rule, Equation 14 evolves into: The term p(x| a i ) is computed as in Equation 1, and p( a i ) refers to the prior probability of application a i that is derived in the training stage when generating the GMM for application a i . The term p(x) is constant for all applications and can be ignored.

D. TRAFFIC VERIFICATION
In traffic verification we decide whether traffic is normal or abnormal for its source application. In the latter case, we consider the traffic to contain an anomaly. Given a (truncated) flow sample x along with its supposed source application a i , we estimate whether or not x was generated by a i by considering model a i for application a i . We perform a likelihood test by evaluating p(x| a i ) ≥ τ , which reflects whether traffic sample x matches the normal traffic of its supposed source application a i given threshold τ .
The verification decision can be erroneous in two ways: False Acceptance (FA) and False Rejection (FR). FA occurs when an actual anomaly in the traffic is considered as normal traffic; FR occurs when traffic from a clean application is considered as an anomaly. The False Acceptance Rate (FAR) is the percentage of anomalies that are considered as normal (i.e., p(x abnormal | a i ) ≥ τ ); the False Rejection Rate (FRR) is the percentage of clean samples that are considered as abnormal (i.e., p(x normal | a i ) < τ ). We determine the optimal value for τ based on the Equal Error rate (EER) criterion where FAR is equal to FRR, using an evaluation set from the traffic data. We consider two scenarios for determining τ : the Global Threshold (GT) employs a single threshold for all classes; the Class-Specific Threshold (CST) employs separate thresholds for each individual class.

IV. EXPERIMENTAL METHODOLOGY
We performed various experiments to explore the effectiveness of our GMM-based approach for traffic classification and traffic verification. In this section we first describe in detail the dataset used in our experiments. Next, we describe the experimental setup and evaluation metrics. Finally, we present the experimental results.

A. DATASETS DESCRIPTION
We conducted our experiments on the UNIBS-2009 dataset. 1 This dataset was collected from 20 workstations in the campus network of the University of Brescia in Italy on three consecutive working days (September 30, October 1, and October 2, 2009). The data was collected by running tcpdump on the edge router that connects the network to the Internet via a dedicated 100Mb/s uplink. The dataset consists of four files: three PCAP files (containing 27 GB data) and a logfile (groundtruth.log).

1) FLOW DEFINITION
The traffic contains around 79,000 flows that were generated by various types of applications employing web protocols (HTTP and HTTPS), mail protocols (POP3, IMAP4, SMTP, and their SSL variants), Skype, peer-to-peer, and other protocols (FTP, SSH, and MSN).
We focused on the TCP traffic that represents more than 98% of the data in the UNIBS dataset. A flow is defined by a set of bidirectional consecutive packets travelling between two endpoints (defined by source IP address, source port number, destination IP address, and destination port number) through a TCP connection started by a 3-way handshake (SYN, SYN-ACK, and ACK) and terminated by either observing FIN/RST packets or 'no packet seen' for a timeout of 60 s (whichever comes first). The first SYN-packet seen in a flow determines the forward direction, and the source endpoint is assigned as the client. We consider only TCP flows that have at least one packet in each direction and contain at least one non-zero payload packet.

2) FEATURE EXTRACTION
A flow associated with a particular application can be described by a number of statistical properties, or features, parameterizing its behaviour. We modified and employed the Netmate-flowcalc tool 2 to extract the bidirectional flows as defined above, and calculated their statistical feature values from the PCAP files. We checked the order of packets within each flow and ignored out-of-order packets during the feature calculation process. In order to also support truncated flows, we modified Netmate-flowcalc to extract the statistical features of a flow in a PCAP trace after observing the n th packet counting from the SYN-packet. We refer to n as the Reference Packet Count (RPC).
In prior studies on traffic classification using truncated flows, Garcia and Korhonen [11] obtained best accuracy for RPC values ranging from 12 to 15. However, they applied distributional flow features for the classification of video traffic, which only partially relates to our context. Peng et al. [8] experimented with RPC values ranging from 2 to 10, but they only considered packets with non-zero payload. They concluded that using 5 to 7 non-zero payload packets is most effective, since including more packets reduces the time performance of a classifier, while including less packets reduces the accuracy. They also concluded that the effective number of packets varies with network environments. Liu et al. [10] experimented with RPC values ranging from 5 to 10 for the classification of network traffic using statistical features, which more closely resembles our context. They observed that classification accuracy gradually increased when ranging the number of packets from 5 to 8, and slightly decreased when taking 9 and 10 packets. Based on these prior studies, we range RPC values from 3 to 10 in our experiments. We do not consider RPC values larger than 10, also since we aim at timely classification and verification of truncated flows. For comparison, we also explore results obtained when considering complete flows. For each specific value of RPC, we built a separate dataset. Some short-lived flows may not reach the RPC. We still considered such flows in our analysis, since they may carry infected payloads.
In total we derived 59 flow features. Table 1 lists the full feature set, including their abbreviations and indexes. 3 Most of these features are calculated separately for the forward and backward direction. In our experiments the features are extracted from the packets within the RPC range. Features that correspond to packets outside the RPC range (such as features 40-59 that specify the length of individual packets send in the forward direction or the backward direction) are ignored. The statistical features are recomputed when a new packet in a flow is encountered. The features include no information about the payload, and hence the feature set is content-independent, which removes privacy concerns, avoids dealing with encrypted payloads, and provides that features can be calculated at low computational cost. The features also include no information about IP addresses and port numbers, and hence the feature set is site-independent and robust against port-masquerading strategies in which an (unknown) application evades port-based detection by simply changing its destination port number to that of a well-known application.

3) GROUND TRUTH
The logfile in the UNIBS dataset contains all TCP and UDP flows and their starting time associated with the source applications and protocols. Application-level ground truth was assigned to the traffic flows through the architecture proposed by Gringoli et al. [21]. Flows are assigned to the following application categories: Web Browsers, MAILS, P2P, SKYPE, and OTHERS. 3 An idle time of 1 second was used to distinguish sub-flows. A sub-flow ends when no packet is received within 1 second; the next new sub-flow starts with the next packet.  Table 2 summarizes the TCP flows/packets/bytes in the UNIBS dataset per day and per application category. We explicitly show the break-down per day since we use temporal information in our experiments. For instance, we built our GMMs with data from the first day and evaluated the GMMs with data from the next two days. This kind of evaluation clearly separates data for model training and evaluation, and is more practical for real-world scenarios. VOLUME 8, 2020

4) FEATURE SELECTION
We applied feature selection to find a subset of the full feature set that yields the highest performance at the lowest computational cost. Finding the optimal feature subset is a NP-hard problem and diverse feature selection approaches have been proposed in the machine learning literature [52], [53]. We adopted the Sequential Forward Selection (SFS) algorithm for selecting the optimal feature subset. The SFS algorithm takes a greedy approach. It starts with an empty feature set and gradually adds features to the feature set. At each iteration, the feature is selected from the candidate feature set that optimizes an evaluation function. This feature is then added to the feature set and removed from the candidate set. The algorithm continues until the addition of further features does not improve the evaluation function. We selected the SFS algorithm since we aim at minimizing computational cost. The complexity of the SFS algorithm is O(n 2 ), where n is the number of features. More advanced feature selection algorithms could have provided better results, but at larger time complexity [54], [55].

B. EXPERIMENTAL SETUP AND EVALUATION METRICS 1) MODEL DESIGN SETUP
We trained the GMMs using the MATLAB-code as provided by Figueiredo and Jain [34]. 4 In both traffic classification and verification experiments, we allowed the covariance matrices to be free (unrestricted) when training the GMMs. We experimentally observed that unrestricted covariance matrices outperformed covariance matrices that are either diagonal (restricted) or constrained to be the same across mixture components. We set the regularizing factor for the covariance matrices and the stopping threshold ε for CEM iteration to 10 −4 . For both parameters, we experimentally found that performance was rather insensitive to values in the range from 10 −6 to 10 −2 and degraded for values larger than 10 −2 . We set the initial minimum and maximum number of mixture components (M min and M max ) to 1 and 10, respectively.

2) TRAFFIC CLASSIFICATION EXPERIMENTS
In our traffic classification experiments we used the flows from Day 1 for training and the flows from Days 2 and 3 for testing. We trained 4 models for the application classes 4 www.lx.it.pt/∼mtf/mixturecode2.zip MAILS, P2P, WEB Browsers, and SKYPE (see Table 2). We excluded the applications class OTHERS during training since it does not contain traffic of a coherent set of related applications. We also excluded this class during testing, and hence our classification experiments resemble a best-case scenario in which only traffic is considered originating from applications that the models have been trained for.
In our experiments we compared the performance of our GMM-based approach with 6 of the most commonly used supervised machine learning algorithms [56]: Naive-Bayes (NB) [32], [57], [58], Decision Tree (DT) [7], k-Nearest Neighbour (k-NN) [59], Support Vector Machines (SVM) [26], [60], [61], Random Forest (RF) [11], [62], [63], and Gradient Boosted Trees (GBT) [63]. We used scikitlearn 5 to implement all classifiers, except for GBT we used XGBOOST 6 as it is significantly faster. For the NB and DT classifiers we used the default values. For the k-NN classifier, we employed k = 1 since it achieved the highest overall accuracy among k = 3, 5, 7, 9. For the SVM classifier, we employed LibSVM [64] with RBF Kernel function. Yuan et al. [60] showed that this kernel function achieved the best traffic classification accuracy among three other kernel functions, namely LINEAR, POLY, and SIGMOID. For the RF and GBT classifiers, we applied RandomizedSearchCV in scikit-learn to perform a randomized search over 6 hyperparameters, including a range of 10 to 500 for the number of trees. The search involved 2-fold cross-validation over a set of 20 different hyper-parameter settings. We used the average F1-measure to select the best setting, and applied this setting for training the classifier.
Our experiments included the following three steps: 1) We applied the SFS algorithm as shown in Algorithm 1 in order to select the best feature subset. In each iteration we apply k-fold cross-validation in which the training dataset is randomly split into k (approximately) equal disjoint subsets. Each of the subset acts in turn as the evaluation set for the models trained with the other subsets. We apply 2-fold cross-validation to reduce computational cost. At the end of each iteration we add the feature that maximizes the average F1-measures of the models. for all f ∈ candidates do 9: features ← selected ∪ {f } 10: split dataset into k subsets data 1 . . . data k 11:ŷ ← ∅ // observed classification 12: for i = 1 to k do 13: train models with dataset \ data i and features 14  if (F post > F pre ) then 22: selected ← selected ∪ {f } 23: candidates ← candidates \ {f } 24: end if 25: until (F post ≤ F pre ) 26: train models with selected features and dataset 27: test models with test dataset We applied the SFS algorithm for every method (GMM, NB, DT, k-NN, SVM RF, GBT) and all RPC values ranging from 3 to 10. All further classification experiments were conducted using the selected feature subsets. 2) We evaluated the performance of the models on the level of correctly classified flows, packets and bytes, using the Overall Accuracy and the average F1-measure over all classes as measures. The Overall Accuracy determines the percentage of all flows/packets/bytes that are correctly classified. For per-class measures, Precision is the percentage of flows/packets/bytes correctly classified to a class over the total number of flows/packets/bytes assigned to that class, and Recall is the percentage of flows/packets/bytes from a given class that are truly classified to that class. We consider the F1-measure which takes the harmonic mean of both precision and recall as a single accuracy measure (i.e., 2 × (precision × recall)/ (precision + recall)). 3) We examined the impact of the training set size on the classification performance of our GMM-based approach.

3) TRAFFIC VERIFICATION EXPERIMENTS
In our traffic verification experiments using our GMM-based approach we considered three main stages, namely training, evaluation, and testing. We used the flows from Day 1 for training, the flows from Day 2 for evaluation, and the flows from Day 3 for testing. We excluded the applications class OTHERS during training, since it does not contain traffic of a coherent set of related applications. We considered all application classes, including the OTHERS class, during evaluation and testing. Hence our verification experiments resemble a scenario in which also traffic is considered originating from applications that the models have not been trained for. We applied the algorithm as outlined in Algorithm 2. We applied the SFS algorithm to find the optimal subset of features and determined the optimal value for the threshold τ , while aiming to minimize the Equal Error Rate, using the training dataset and the evaluation dataset. As indicated in section III-D, we determined both the Global Threshold (GT) over all classes, and Class-Specific Thresholds (CST) for each individual class. In the testing stage, we tried different values of RPC in order to find the optimal RPC value. We report the Half Total Error Rate (HTER), which takes the average of FAR and FRR as a single measurement. FAR, FRR, EER and HTER range from 0% (best) to 100% (worst).  Figure 1 shows the optimal feature subsets obtained by the SFS algorithm for the different methods and RPC values. It can be seen that the number of selected features per subset VOLUME 8, 2020  Table 1 for the features.) varies lightly, but the selected features per subset vary considerably, not only per method but also per RPC value. The features 7 and 42 are selected most often and are included in half of the subsets, while ten features are included only in a single subset and six features are not selected at all.

Algorithm 2 SFS for Feature Selection and Threshold
The optimal feature subsets provide sufficient classification capabilities. As an example, we illustrate the distinctive capability of the optimal feature subset for the GMM models for RPC = 9 in an 'Andrews plot' [65] and a 'parallel coordinates plot' [66] shown in Figure 2. The Andrews plot represents each feature vector x as a Fourier series where the coefficients are equal to the feature values. The parallel coordinates plot represents each feature vector by the sequence of features plotted against their values. The figure shows the feature vectors from 250 randomly selected flows for each class from Day 1 of the UNIBS dataset. The Fourier series in the example Andrews plot has 5 terms over the interval [0,1]: a constant, two sine terms with periods 1/2 and 1, and two similar cosine terms. The variations in these plots demonstrate that the feature subset sufficiently discriminates the different application categories.
Hereafter, we perform the classification experiments using the optimal feature subsets as shown in Figure 1.  Figure 3a illustrates the Overall Accuracy at the level of flows for different methods as a function of the RPC. It is clear that by increasing RPC, the performance of all approaches except NB improves, which indicates that using more data packets for extracting the flow features provides higher Overall Accuracy. The improvement of GMM is significant and GMM yields the highest Overall Accuracy when RPC ranges from 4 to 10, clearly outperforming all other methods. The Overall Accuracy with GMM increases up to 97.74% when RPC increases from 3 to 9, and slightly decreases for larger RPC. When using complete flows, RF achieves the highest Overall Accuracy. Figure 3b compares the average F1-measures at the flow level for the different methods. Also here, best results are obtained with GMM when RPC ranges from 4 to 10, except for RPC is 6. NB and SVM do not perform well. The performance of all approaches lacks behind when RPC ranges from 3 to 6. We observed that this is mostly due to poor performance for at least one rare class, mostly SKYPE. This can be seen in Table 3a, where the classification results for all classes are shown when RPC = 9. The average F1-measures with GMM increases up to 93.66% when RPC increases from 3 to 10, and slightly increases when using complete flows. RF achieves the highest average F1-measures for complete flows. Figure 3c illustrates the Overall Accuracy at the level of bytes for different methods as a function of RPC. We see that GMM still performs well, but not as outstanding as for the level of flows as shown in Figure 3a. This can be explained by the observation that GMM improved the Overall Accuracy at the flow level by providing better classification of 'mice flows' (mostly belonging to Web Browsers) than other classifiers. Flows with small and large payload sizes are referred as 'mice' and 'elephant' flows, respectively [24]. Figure 3d compares the average F1-measures at the byte level for different methods. GMM provides the best results when RPC is 9 (68.74%) and 10 (69.30%). The F1-measures for all approaches is (far) below 70%. These rather low F1-measure values can be explained by the fact that all approaches have poor performance at the byte level for at least one class. This can be seen in Table 3b, which compares performance at the byte level for all methods when RPC = 9. It is clear that the poor performance results for SKYPE hampered the overall performance results of all classifiers. We observed that this is due to miss-classification of five elephant flows responsible for 96% of all bytes in the test subset. Figure 3e and 3f illustrate the Overall Accuracy and the average F1-measure at the level of packets for different methods as a function of RPC. We observe a behaviour similar to the one at the byte level.
In summary, we conclude that the proposed GMM approach outperforms all other methods at the flow level when RPC ranges from 7 to 10, both for the Overall Accuracy and average F1-measures. The maximum Overall Accuracy (97.74%) is achieved for RPC = 9, and the maximum average F1-measures (93.66%) is achieved for RPC = 10. At the level of bytes and packets, best results are obtained with GMM when RPC ranges from 9 to 10, while RF and GBT perform almost equally well. The maximum Overall Accuracy (98.28% and 98.38%) as well as the maximum average F1-measures (69.30% and 79.57%) is achieved for RPC = 10. Table 3 shows the performance of the different classifiers for RPC = 9 per class in terms of Overall Accuracy and F1-measure for flows, bytes, and packets, respectively.

b: IMPACT OF TRAINING SET SIZE
Since preparing sufficiently large-scale, well-labelled training data is laborious, it is useful to determine how much training data is needed for a desired accuracy. For this reason, we evaluated the sensitivity of the proposed GMM approach for the size of the training set when RPC = 9. Figure 4 presents the Overall Accuracy as the number of the training flows varies (on a logarithmic scale). Each mark is the average result of running the GMM models 10 times. For each run, the models are trained on a different training set of the same size that is randomly selected from Day 1, and tested on the total number of flows obtained from the combination of Day 2 and Day 3. The error-bars in the figure show the standard deviation of the mean. The figure shows that the performance steadily improves and stabilizes as the training set size increases. We observed that 96.58% and 96.89% Overall Accuracy can be achieved with 403 and 1,097 randomly selected flows as training set, respectively, which constitute around 0.5% and 1.5% of all flows. Such small training sets greatly reduce the time required to build the models. We also observed that the classification speed was approximately constant at 3.7 × 10 5 flows/second. These observations make the proposed approach promising for practical uses, where welllabelled training data is scarce or re-training the models is a serious concern. TABLE 3. Performance of the different classifiers for RPC = 9 per class in terms of Overall Accuracy and F1-measure by flow, byte, and packet (best results are marked in bold).     Figure 5a shows results when using the CST -dependent feature subsets (given in second column of Table 4); Figure 5b shows results when using the GT -dependent feature subsets (given in third column of Table 4). Both figures compare HTER for both the evaluation and the test subset.

2) TRAFFIC VERIFICATION RESULTS
The figures show that using GT provides better results than using CST. This is particularly the case when using the GT -dependent feature subsets as shown in Figure 5b, where GT outperforms CST for all RPC values with both the evaluation and the test subset. A HTER of less than 15% is achieved with GT for all values of RPC, except for RPC = 3. Even when using the CST -dependent feature subsets, shown in Figure 5a, GT still outperforms CST in most cases, except for RPC of 4 and 5 on the evaluation subset, and for RPC of 5 and 10 on the test subset.
On the test set, a minimum HTER of 7.79% is achieved with GT for RPC = 7 using the CST -dependent feature subsets (Figure 5a), and 7.65% for RPC = 6 using the GT -dependent feature subsets (Figure 5b). Table 5 compares HTER for the test set per class when using the CST -dependent feature set and RPC = 8, where a minimum HTER of 9.9% is achieved by both GT and CST (see Figure 5a). It clearly shows that while both CST and GT achieved a comparable HTER for MAILS and P2P, CST significantly outperformed GT for Web Browsers and SKYPE. This is as expected, since the goal of CST is to minimize HTER of each individual application, while the goal of GT is to minimize the overall HTER of all applications combined. VOLUME 8, 2020  Table 4.
Hence, while CST yields a better HTER per application, GT still yields a better overall HTER. This is due to the fact that the number of likelihood scores available to determine the optimal GT threshold is larger than the number of likelihood scores available for determining the optimal CST thresholds.

V. CONCLUSION
In this paper we presented an almost real-time traffic classification as well as an equally fast application-aware traffic anomaly detection system based on an original use of GMMs. The learning algorithm used, unlike the basic Expectation-Maximization (EM) algorithm, selects an optimal number of mixture components automatically with a seamless integration of estimating mixture parameters from given multivariate data. The representation of each application (type) is provided by a GMM fitted to the underlying distribution of flowlevel features of that application. The traffic classification approach assigns any flow to the class with the highest posterior probability. The traffic verification approach for anomaly detection is based on class-specific and global thresholding mechanisms, where a threshold is set at the EER operating point to determine whether a flow claimed by an application is genuine. In order to provide a timely operation, only the first initial packets of flows are considered in the learning process. We adopted the SFS feature selection algorithm for selecting the optimal feature subset. We evaluated the effectiveness of our GMM-based approaches by conducting different sets of experiments on a public dataset collected from a real network. In order to provide efficient and timely traffic classification and anomaly detection, the effectiveness of different numbers of first initial packets of flows has been explored and evaluated. Our traffic classification approach considerable improves on other state-of-the-art approaches that are based on machine learning. Our GMM-based approach achieves an Overall Accuracy for flows of 97.74% using 9 initial packets of flows. We observed that 96.6% and 96.9% Overall Accuracy at the flow level, respectively, can be maintained by only using 0.5% and 1.5% of all flows for training the GMMs. Our GMM-based anomaly detection achieved a minimum Half Total Error Rate (HTER) of 7.65% by using only 6 initial packets of flows.
In our future work we intend to extend our GMM-based approach such that it also can be applied in an online non-stationary environment, where both class evolution and concept drift occur in time. We also consider to evaluate our approach with other datasets (when available), and to evaluate related approaches, such as Variational Bayesian model selection [67], as well as deep learning methods.
HARALD VRANKEN received the M.Sc. degree in information technology, the P.D.Eng. degree in information and communication technology, the Ph.D. degree in electrical engineering, and the M.Sc. degree in science education and communication from the Eindhoven University of Technology, in 1992Technology, in , 1994Technology, in , 1998, and 2006, respectively. He was a Senior Scientist with the Philips Research Labs, from 1998 to 2005, where he worked on design-for-testability of digital ICs. Since 2006, he has been with the Open University in the Netherlands, and since 2014, he has also been with Raboud University. He is currently an Associate Professor. He holds several patents and has coauthored over 50 scientific publications. His current research interests include resilience, software security, network security, and energy analysis of cryptocurrency mining.
ANDRÉ ZÚQUETE received the Ph.D. degree in informatics and computer engineering from the Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal, in 2001. He is currently an Assistant Professor with the University of Aveiro, Aveiro, Portugal, a Researcher of IEETA (Institute of Electronics and Informatics Engineering of Aveiro), and a Collaborator of the Instituto de Telecomunicações. His research and development activities are centered on the security in distributed systems, with a focus on the design of security architectures for several specific scenarios (e-Voting, e-Health, e-Government, vehicular networks, etc.). He is a Program Committee Member of several conferences in the areas of security and mobility. He has participated in several national and international projects and did some consulting on the security for Portuguese companies and state Departments. He has dozens of articles published in international forums related with security and mobility and he is the author of a technical book on network security (in Portuguese). He is the Portuguese representative on the IFIP TC11 (Security and Privacy Protection in Information Processing Systems).
ALI MIRI is currently a Full Professor with the School of Computer Science, Ryerson University, Toronto. His research interests include cloud computing and big data, computer networks, digital communication, and security and privacy technologies and their applications. He has authored and coauthored more than 200 refereed articles, six books, and eight patents in these fields. He has chaired over a dozen international conference and workshops, and had served on more than 100 technical program committees. Dr. Miri is a member of the Professional Engineers Ontario.