Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers’ F1-scores over the control by ~7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.


I. INTRODUCTION
Network traffic often contains sensitive user data and private information, so the classification of this traffic is considered a controversial topic. Although network traffic classification can be used for censorship, it is also necessary for Internet Service Providers (ISPs) to guide initiatives such as resource allocation, infrastructure development, improving network security, and other network services [1]. Concerns regarding personal and professional security have led to the proliferation of many traffic anonymization technologies such as Tor, VPNs, Hypertext Transfer Protocol Secure (HTTPS), Transport Layer Security (TLS), and Secure Shell Protocol (SSH). Unfortunately, while these techniques increase privacy for the users, they also make it more difficult for ISPs to scale their network.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Tor and VPNs are some of the most common anonymization protocols. Tor traffic is encrypted through a series of network nodes that the client selects. Each node is encrypted with a unique key and can only interact with the previous and next node. As the layers of encrypted nodes increase, tracing each node and the overall path back to its origin/destination becomes incredibly difficult, resulting in increased user privacy [2]. Similarly, VPNs establish encrypted connections from users' devices to remote servers. All internet traffic is routed through the secure connection with the external server. ISPs and other entities can no longer see the websites and services a user is connecting to; instead, all traffic appears as a connection to the VPN provider [3]. VPN traffic is dependent on the integrity of a third-party service, while Tor is a decentralized system that relies on a community of volunteers.
Classifying anonymized traffic is complicated by the fact that network traffic is inherently imbalanced [4]. While a webpage can be loaded with relatively few packets, Peer-to-Peer (P2P) and streaming services can exchange millions of packets. This means that models trained on network data may perform well during development, but their performance can fail to scale after deployment. Since poorly performing minority classes may be of primary interest to the ISP's future development, it is important to ensure that these models scale well to real-world applications [5].
Data generated through specifically designed algorithms, referred to as synthetic data, has been shown to be a potential solution to the aforementioned imbalanced data problem. Synthetic Minority Oversampling Technique (SMOTE), one of the most prominent synthetic data generation techniques, was introduced in 2002 [6]. Due to the explosion in the application of deep learning, several generative models have also been introduced more recently, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) in 2014 [7], [8]. Since these methods may capture different information from the original data to produce their synthetic samples, particular classification techniques may benefit more from some generative techniques than from others.
In this article, we compare the efficacy of several data generation techniques and analyze their impact on performance across a spectrum of cutting-edge deep and shallow learning models for anonymous traffic classification. Moreover, we show that traffic classification models trained on imbalanced data fail to maintain performance when exposed to balanced traffic data, establishing poor performance for minority classes. Then we show this problem can be addressed when our originally imbalanced data is augmented with synthetic samples. We found deep learning models tended to experience greater variance in performance metrics from the GAN and VAE data, while both deep and shallow learning methods benefit from the data generated by SMOTE.
The existing need for balanced network data for anonymous traffic classification served as motivation for this work [5], [9], [10]. By exploring various generative models, we hope to provide clarity on which synthetic data techniques could be applied in network traffic classification and demonstrate an effective methodology for doing so. The following points outline the novel contributions of this work:
• Assessment and comparison of four prominent synthetic data techniques (CTGAN, CopulaGAN, VAE, and SMOTE) and their ability to generate effective synthetic network traffic samples
• Application of generative techniques scarcely used in this domain, specifically a VAE and two state-of-the-art GAN variants
• Performance evaluation of many boosting, bagging, and deep learning models trained on synthetic network data
• Generation of a new synthetic anonymous network traffic dataset to balance the generally unbalanced network traffic data and enable future synthetic data research
• Improved performance for multiclass classification of eight application types
The remainder of this paper is presented in the following order: Section 2 explores works related to anonymous network traffic and synthetic data applications. Section 3 analyzes the CIC-Darknet2020 dataset [9] used in our experiments. Section 4 gives a brief description of the unique frameworks and architecture used for experimentation. Section 5 provides explainability to our CMU-SynTraffic-2022 dataset. Section 6 presents our experimental methodology, while Section 7 discusses experimental results. Section 8 investigates the limitations of the present work and provides avenues for future research. Finally, Section 9 concludes and summarizes the paper.

II. RELATED WORKS
A. VPN TRAFFIC DETECTION
Draper-Gil et al. [12] studied the ability of time-related features to detect encrypted communications utilizing VPN services. They created the ISCX-VPN2016 dataset, consisting of traffic from several applications such as browsing and streaming, to conduct their experiments. Their classifiers, C4.5 and KNN, were trained to distinguish VPN and non-VPN traffic as well as to classify the traffic type. The paper found that the C4.5 classifier was slightly more effective, obtaining precision and recall scores around 84% when paired with the correct flow-timeout value.
Caicedo-Muñoz et al. [13] integrated a quality of service (QoS) classifier and per-hop behavior (PHB) to the ISCX-VPN2016 dataset. They generated two new datasets: the first dataset contained VPN and non-VPN traffic while the second dataset combined VPN and non-VPN traffic with PHB labels. Bagging and boosting algorithms were most effective on their datasets, achieving accuracies of 94.42% and 92.82%.
Miller et al. [14] captured real VPN and non-VPN network data using Wireshark and NetMate. TCP flow-based features were used to train a multi-layer perceptron classifier to detect OpenVPN and non-VPN traffic. Their neural network model achieved an accuracy of around 94% on the post-training test set.

B. TOR TRAFFIC DETECTION
Lashkari et al. [15] devised a two-layer approach for detecting and classifying Tor traffic. They created a labeled Tor dataset named ISCX-Tor2016 that was published alongside their research. Furthermore, the team extracted time-based features from the dataset and used them as the sole features to train their models. Models were trained on data with different flow lengths, and they found 15 seconds to be the optimal flow time. Their top model achieved a recall and precision of 99% when detecting Tor traffic, and 83% when distinguishing between eight application types.
VOLUME 10, 2022
Huo et al. [16] noted that a large number of parameters need to be calculated to train a network to classify Tor traffic. As calculating these parameters is computationally expensive, they propose a new model that extracts spatial features by CNN layers, gathers temporal features from LSTM layers, then fuses multi-scale features before sending the features to an attention mechanism. They were able to achieve 94.9% accuracy on the ISCX-Tor2016 dataset.
Gurunarayanan et al. [17] performed random oversampling and random undersampling on the ISCX-Tor2016 dataset to detect Tor traffic. The team incorporated Grid Search algorithms as a means of hyperparameter tuning. Their top model, Random Forest, achieved an accuracy of 99%.

C. DATA AUGMENTATION
Jadav et al. [10] trained 15 machine learning classifiers on the CIC-Darknet2020 dataset to differentiate between darknet and benign traffic. They addressed a class imbalance problem between the number of darknet and benign samples by incorporating SMOTE and found Extra Tree and Decision Tree to be the highest performing classifiers, with accuracies of 99%. They recommend that future work investigate deep learning models and multiclass classification with many possible classes.
Guo et al. [5] introduced the Imbalanced Traffic Classification Generative Adversarial Network (ITCGAN) as a solution to imbalanced data in the domain of network traffic classification. Using the ISCX-VPN2016 dataset, they tested ITCGAN against other synthetic data generation techniques, namely Random OverSampling, SMOTE, the ADAptive SYNthetic algorithm, SMOTE+Support Vector Machine, SMOTE+Tomek Links, and a conditional generative adversarial network, all on a 1D-CNN classifier. ITCGAN, SMOTE, and SMOTE+SVM were the only oversampling methods that outperformed the baseline dataset.
Wang et al. [18] proposed a GAN-based methodology named FlowGAN to address the problem of class imbalance in the field of network traffic classification. FlowGAN was trained on traffic from the ISCX-VPN2016 dataset. Real data and synthetic data generated by FlowGAN were concatenated and used to train a multilayer perceptron neural network. In comparison to the unbalanced dataset, FlowGAN increased F1-score by 15.6%. When compared to a balanced dataset, the F1-score increased by 2.12% on average.
Okonkwo et al. [19] applied Convolutional Neural Networks to encrypted traffic classification on the ISCX-Tor2016 and ISCX-VPN2016 datasets. The traffic flows are gathered and reconstructed into flow images, or flowpics. To balance the dataset, data augmentation is used on the flowpics to create new samples. Each flowpic was then sent to the CNN and classified as either HTTPS, VPN, or Tor. Several more CNNs were trained for the tasks of application identification and origin classification, containing four and eight classes respectively. They obtained an average accuracy of ∼93% across all experiments.
Li et al. [20] presented a data augmentation technique utilizing a GAN, VAE, and statistical parameter configuration (SPC) to address the problem of insufficient network data. The GAN's data deviated from the actual traffic with a mean and variance both less than 1.7%. The proposed GAN technique outperformed both the SPC and VAE.

D. ANONYMOUS TRAFFIC DETECTION
Lashkari et al. [11] aggregated the CIC-Darknet2020 dataset by merging two encrypted traffic datasets and introduced DeepImage. DeepImage is a model that creates a gray image by selecting the most important features from a dataset. DeepImage sends the gray image to a two-dimensional convolutional neural network (CNN) to detect and categorize eight types of darknet traffic. Overall, the CNN had an accuracy of 86%; however, the CNN struggled to classify certain traffic types such as browsing traffic.
Gupta et al. [21] expanded upon the body of knowledge by training classifiers to detect three types of traffic: non-VPN/non-Tor traffic, Tor traffic, and VPN traffic. Previous research classified traffic as either VPN vs non-VPN, or Tor vs non-Tor, but not both. Eight machine learning algorithms were trained on the dataset, and XGBoost was the highest performer with an accuracy of 98%.
Iliadis et al. [22] trained five machine learning classifiers on the CIC-Darknet2020 dataset to detect and classify darknet traffic into one of four categories: Tor, non-Tor, VPN, and non-VPN. Their feature importance analysis found "Total Length of Fwd Packet" to be the most vital feature, while they removed the five socket-related features to avoid overfitting. Random Forest performed the best with an accuracy of over 98% on darknet detection and classification.
Al-Omari et al. [23] analyzed the impact of training machine learning algorithms on unique feature groups to differentiate darknet and regular traffic. They settled on four feature groups: "all features" (68 features), "all features without Src Port and Dst Port" (66 features), "selected features" (9 features), "selected features without Src Port and Dst Port" (11 features). Overall, boosting algorithms tended to work the best, and the Ridge-300 classifier had an accuracy of 99.9% on the "selected features" feature set.

E. LIMITATIONS AND KNOWLEDGE GAP OF CURRENT WORKS
Table 1 outlines the current methodologies used to detect and classify network traffic. We find the following gaps in the existing research:
• Network traffic is frequently imbalanced, which is an ongoing problem in network research [9]. While previous research [5], [18], [20] has applied synthetic data generation techniques to balance VPN, Tor, and internet traffic datasets separately, no research has compared the efficacy of many synthetic data generation models against each other on a robust anonymous network traffic dataset.
• Variational autoencoders have proven to be effective in generating synthetic network traffic [20]. No reviewed papers have used a VAE to generate anonymous synthetic traffic samples.
• No known research has applied the state-of-the-art TabNet model [24] to classify anonymous traffic.
• Prior research does not evaluate the effect of synthetic anonymous data samples on the classification of audio streaming, browsing, chat, email, file transfer, p2p, video streaming, and VoIP traffic.
In the present paper, we bridge the aforementioned knowledge gaps by using VAE, GAN, and SMOTE algorithms to generate synthetic anonymous traffic for data balancing. These three data generation techniques have yet to be compared in this domain, so providing a direct comparison can indicate the preferred technique in anonymous network traffic classification scenarios. Moreover, we compare boosting and bagging classifiers to deep learning models in their classification effectiveness when trained on synthetic data.

III. DATASET
A. FEATURE EXTRACTION AND COMPOSITION
The CIC-Darknet2020 dataset provided by the Canadian Institute for Cybersecurity [11] is incorporated in our experiments. The data set was formed through the fusion of two public datasets, namely ISCX-Tor2016 [15] and ISCX-VPN2016 [12], to create an anonymous dataset encompassing regular, Tor, and VPN traffic. Furthermore, the dataset was published in both raw Packet-Capture (PCAP) files and tabular data that was preprocessed by CIC-FlowMeter v4.0 over a predetermined time interval. These tabular samples contain time-based features that capture statistics from the traffic flow such as flow duration and packet inter-arrival times. This is combined with information about the packets' source and destination, the flags declared in their headers, and the time at which the flow was captured, creating a well-encapsulated representation of the traffic flow. We chose this dataset because it contains a relatively large number of application types (8) and samples (117,620) while incorporating both Tor and VPN traffic.
A two-layered approach was used to generate data for the CIC-Darknet2020. Regular and anonymous traffic were synthesized in the first layer, and the traffic was further broken into eight application types: audio streaming (Vimeo and YouTube), browsing (Firefox and Chrome), chat (ICQ, AIM, Skype, Facebook and Hangouts), email (SMTPS, POP3S and IMAPS), file transfer (Skype, FTP over SSH (SFTP) and FTP over SSL (FTPS) using Filezilla and an external service), p2p (uTorrent and Transmission), video streaming (Vimeo and YouTube), and VoIP (Facebook, Skype and Hangouts voice calls) in the second layer. Figure 1 presents the traffic and application type sample ratios.

B. PREPROCESSING AND FEATURE ENGINEERING
Before generating synthetic data, we applied feature selection and data cleaning. Initially, samples containing Inf and NaN values were eliminated from the dataset. Of the 84 features in CIC-Darknet2020, 14 of them contained the value zero (0) for every sample in this dataset. These features were discarded as they do not contribute to the model performance in discriminating various traffic types. Six additional features (Flow-id, Source/Destination IP, Timestamp, and Source/Destination port) were eliminated from the dataset, bringing the total number of features down to 64. The Flow-id feature is of the form (Source IP)-(Destination IP)-(Source Port)-(Destination Port)-(Protocol), therefore it only contains duplicate information. Source/Destination IP is an artifact of the original dataset and does not represent the distribution of IP addresses found on the internet. The Timestamp feature was removed because our classifiers do not account for the order of flows and information about when a flow was initiated will not add any meaningful information. Similarly, Source/Destination Ports are unique non-deterministic identifiers which could result in overfitting.
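The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: rows are represented as dictionaries, the identifier column names in `ID_COLUMNS` and the `Label` column are hypothetical spellings rather than the exact CIC-Darknet2020 headers, and `clean_flows` is a helper name introduced here.

```python
import math

# Identifier-style columns removed in the text. The exact spellings used in
# the published CSV are an assumption made for this sketch.
ID_COLUMNS = {"Flow ID", "Src IP", "Dst IP", "Timestamp", "Src Port", "Dst Port"}


def clean_flows(rows, label_columns=("Label",)):
    """Drop rows with Inf/NaN, then drop all-zero and identifier columns."""
    labels = set(label_columns)

    def is_bad(value):
        return isinstance(value, float) and not math.isfinite(value)

    # Step 1: remove samples that contain Inf or NaN in any column.
    finite_rows = [r for r in rows if not any(is_bad(v) for v in r.values())]
    if not finite_rows:
        return []

    # Step 2: find feature columns that are zero for every remaining sample.
    columns = list(finite_rows[0].keys())
    all_zero = {
        c for c in columns
        if c not in labels and c not in ID_COLUMNS
        and all(r[c] == 0 for r in finite_rows)
    }

    # Step 3: drop the all-zero columns and the identifier columns.
    drop = all_zero | ID_COLUMNS
    return [{c: v for c, v in r.items() if c not in drop} for r in finite_rows]
```

Applied to the real data, the same three steps would be expected to take the 84 original columns down to the 64 reported above (84 minus 14 all-zero features minus 6 identifiers).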

IV. FRAMEWORKS AND ARCHITECTURE
Our experiments incorporate three shallow learning and two deep learning classifiers. Sun et al. [25] note that shallow learning classifiers tend to be more effective at classifying structured data with XGBoost (XGB), Light Gradient Boosting Machines (LGBM), and Random Forest (RF) being top performers. On the other hand, TabNet is a cutting-edge deep learning model that warrants experimentation in the field of anonymous traffic detection. The following section will present the shallow learning classifiers, three synthetic data generation techniques, and an overview of TabNet's architecture.

A. RANDOM FOREST
RF is an ensemble shallow learning classifier composed of a series of decision trees. Each decision tree in the RF model adopts the divide-and-conquer paradigm. Data splits occur at each internal node and decisions are reached at the leaf nodes. RF incorporates techniques such as bagging and randomness to aggregate the predictions of the decision trees and reduce variance and bias that may occur in a single decision tree [26].

B. LIGHTGBM
Microsoft introduced LGBM, a gradient-boosted decision tree, in 2016 as a high-speed tree-based model that can handle large data by growing trees vertically (leaf-wise). LGBM's predecessors tended to be much slower because they use information gain as a guiding heuristic to conduct optimal splits through the use of pre-sorted or histogram-based algorithms. LGBM addresses this concern by using Exclusive Feature Bundling (EFB) and Gradient One-Side Sampling (GOSS). EFB combines mutually exclusive features, while GOSS keeps the samples with large gradients and randomly drops a portion of the samples with small gradients, because larger gradients are usually associated with greater information [27].
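As a rough illustration of the GOSS idea described above, the following sketch keeps the largest-gradient samples, randomly samples from the rest, and re-weights the sampled small-gradient instances. The function `goss_sample` and its parameter names are introduced here for illustration only and are not LightGBM's internal code; the re-weighting factor (1 - a) / b follows the description of GOSS in the LightGBM paper.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by a GOSS-style sampling step."""
    rng = random.Random(seed)
    n = len(gradients)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)

    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]                                   # always kept
    rest = order[n_top:]
    sampled = rng.sample(rest, min(n_other, len(rest)))   # random subset of the rest

    # Amplify the sampled small-gradient instances so that their total
    # contribution approximates that of the full small-gradient set.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * len(top) + [amplify] * len(sampled)
    return indices, weights
```

With 100 samples and the default rates, 20 large-gradient samples are always retained and 10 of the remaining 80 are drawn at random, so each boosting iteration works on 30 samples instead of 100.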

C. XGBOOST
XGB is a gradient-boosted decision tree algorithm built on top of the Gradient Boosting Machine (GBM) framework. XGB optimizes the GBM framework by filling in missing data through the process of sparsity awareness, addressing the overfitting problem by lowering variance and increasing bias through Lasso Regression, using the weighted quantile sketch algorithm to find optimal tree splits, performing depth-first tree pruning, parallelized decision tree construction, and out-of-core computing [28].

D. SYNTHETIC MINORITY OVER-SAMPLING TECHNIQUE (SMOTE)
SMOTE [6] is an oversampling approach that generates samples for minority classes. SMOTE begins by selecting a random sample from the minority class and finding its k-nearest neighbors (the closest minority-class samples in feature space). Out of the k neighbors, a random neighbor is selected and the difference between the two feature vectors is calculated. This difference is multiplied by a random value between 0 and 1 and added to the feature vector of the original sample to generate a new sample, which therefore lies on the line segment between the two points. The process is repeated until a satisfactory number of samples have been generated. Many techniques, such as the adaptive synthetic (ADASYN) sampling approach, have been explored as potential improvements to the original SMOTE algorithm [29].
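The interpolation step can be sketched as follows. This is a minimal, dependency-free illustration of the procedure described above, not the implementation used in the experiments; the function name `smote` and its parameters are chosen here for illustration, and it assumes the minority class contains at least two numeric feature vectors.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random value in [0, 1)
        # new = base + gap * (neighbour - base), applied feature by feature.
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Because every new point is a convex combination of two existing minority samples, the synthetic data never leaves the region spanned by the real minority class.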
A GAN model is composed of two independent models: a generator and a discriminator. The generator and discriminator make up an adversarial network where the generator attempts to synthesize new samples and the discriminator works to identify the synthetic samples. After training, the generator will be able to create samples from noise input that will preserve and correspond to the distribution of the training data [8]. This process was mathematically modeled in [8] as shown in (1):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
This equation is a value function in which the generator, G, is trying to minimize the function and the discriminator, D, is trying to maximize the function. $\mathbb{E}_{x \sim p_{data}(x)}$ and $\mathbb{E}_{z \sim p_z(z)}$ are the expected values over inputs of original data and inputs of noise, respectively. The rest of this equation is adapted from the binary cross entropy function used to model binary classification problems.
Conditional Tabular GAN (CTGAN) is a GAN-inspired architecture that is capable of generating tabular data. CTGAN improves existing models such as table-GAN [30] by incorporating the variational Gaussian mixture model for each column rather than normalizing continuous values between -1 and 1 [31], [32]. CopulaGAN [33] is a variation of CTGAN where the Cumulative Distribution Function transformation is applied via GaussianCopula and it attempts to learn column correlation in a table [34], [35].
A VAE [7] is an autoencoder that specializes in reducing overfitting through regularization. Standard autoencoders work by encoding data into a smaller feature space (with minimal information loss) and then utilizing a decoder to reconstruct an output that is as similar as possible to the original data. The output from the decoder is compared to the initial data and weights are updated through backpropagation to minimize future reconstruction errors. Since autoencoders attempt to train an encoder and decoder with as little loss as possible, they are susceptible to overfitting. Variational Autoencoders address this concern by encoding the input as a distribution over the latent space [36]. After a VAE is trained it can be used to generate new synthetic samples for a dataset.
Equation (2) is the loss function for a variational autoencoder and consists of two terms [7]. The first term, $\mathbb{E}_{z \sim q(z|x_i)}[\log p_\phi(x_i|z)]$, models the reconstruction loss, i.e., a measure of how similar a reconstructed output sample is to the input. The second term is the Kullback-Leibler (KL) divergence, which evaluates the difference between two distributions. Minimizing the KL divergence regularizes the output probability distribution of the encoder. More explicitly, reducing the KL divergence pushes the latent distribution toward a normal distribution.
Optimizing this loss function can prove problematic, as finding the gradient of this equation is not possible in its current form. This is because we are sampling from a random node, which results in an intractable integral. To address this problem, Kingma et al. [7] describe the reparameterization trick. The reparameterization trick is a tool to backpropagate when sampling a random node from a distribution (in the loss function, looking at $q_\phi(z|x_i)$, z is a random variable sampled from a distribution and is the problematic term). To represent z in a deterministic way, it can be written as $z = \mu + \sigma * \epsilon$, where $\epsilon$ is a predetermined sample from a separate distribution $p(\epsilon) = \mathcal{N}(0, 1)$. By substituting z with this new representation, it becomes possible to evaluate the gradient of the loss function.
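For reference, the VAE loss and the reparameterization described above are commonly written as below. This is the standard textbook form under stated assumptions (a standard normal prior and a diagonal Gaussian encoder) and may differ in notation from the authors' equation (2); in particular, $\theta$ is used here for the decoder parameters and $\phi$ for the encoder parameters so that the two can be told apart.

```latex
% Loss for one sample x_i: reconstruction term plus KL regularizer
\mathcal{L}(\theta, \phi; x_i) =
  -\,\mathbb{E}_{z \sim q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid x_i)\,\|\,p(z)\big),
  \qquad p(z) = \mathcal{N}(0, I)

% Reparameterization: the randomness is moved into \epsilon
z = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I)

% Closed-form KL term for a diagonal Gaussian encoder
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2 I)\,\|\,\mathcal{N}(0, I)\big)
  = \tfrac{1}{2} \sum_j \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
```

Once z is expressed through $\mu$, $\sigma$, and an externally sampled $\epsilon$, the gradient with respect to the encoder parameters passes through $\mu$ and $\sigma$ as ordinary deterministic functions.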

G. TABNET
Arik et al. [24] recognized that deep learning models are effective in fields such as image recognition. Since a large portion of existing data is arranged in a tabular format, they proposed the deep tabular data learning architecture named TabNet in late 2020. Through the process of sequential attention, TabNet acts similarly to decision trees while adding interpretability and more efficient learning. At each step, features are passed through a feature transformer composed of a fully connected layer, batch normalization, and a Gated Linear Unit. Next, the features are sent to an attentive transformer made up of a fully connected layer, batch normalization, and sparsemax normalization. The attentive transformer considers feature importance from previous steps to create a mask. The mask determines which features are most suitable to be used by the model. The mask also improves model interpretability because it shows which features TabNet deemed to be the most important [24].
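The sparsemax normalization mentioned above is what makes the mask sparse: unlike softmax, it can assign exactly zero weight to unimportant features. A minimal sketch of sparsemax on a plain list of scores is shown below; it is an illustration only and omits the prior-scale term that TabNet applies before the normalization.

```python
def sparsemax(z):
    """Project a score vector onto the probability simplex (sparse output)."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    k, k_sum = 0, 0.0
    for j, value in enumerate(z_sorted, start=1):
        cumulative += value
        # Support condition: 1 + j * z_(j) > sum of the j largest scores.
        if 1.0 + j * value > cumulative:
            k, k_sum = j, cumulative
    tau = (k_sum - 1.0) / k  # threshold subtracted from every score
    return [max(v - tau, 0.0) for v in z]
```

For the scores `[1.0, 0.8, 0.1]`, the third feature receives a weight of exactly zero while the first two share the full probability mass, which is the behavior that lets the mask single out a few features per decision step.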

V. CMU-SYNTRAFFIC-2022
To facilitate future research on synthetic data and anonymous network traffic, our team has produced the CMU-SynTraffic-2022 dataset containing synthetic data generated in our experiments as well as the real data used to generate it. The synthetic portion of this dataset consists of 432,847 SMOTE, 700,000 CTGAN, 700,000 CopulaGAN and 700,000 VAE samples. CMU-SynTraffic-2022 also contains 117,620 real samples from CIC-Darknet2020 [11] for a total of 2,650,467 samples. In addition to the 64 features present in CIC-Darknet2020, this dataset also contains the data source label (real, CTGAN, CopulaGAN, VAE, SMOTE). The traffic, application, and data source distributions are further visualized in Figure 2.
Our team performed four tests as a preliminary measure of the synthetic data using the SDV Framework [37]: Logistic Detection, chi-squared test (CS Test), Kolmogorov-Smirnov test (KS Test), and Multilayer perceptron (MLP) classifier test. The CS and KS tests determine how closely the distributions of the synthetic data match those of the real data [38], [39]. Logistic Detection trains a classifier to differentiate real and synthetic data, and MLP is simply the F1-score of a multilayer perceptron trained on the synthetic data when classifying by application type [40].
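To make the KS Test concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic D for one continuous feature and returns 1 - D, so that higher values indicate closer distributions. Reporting the score as 1 - D is an assumption about how the metric is scaled, and `ks_complement` is a helper name introduced here rather than a function of the SDV Framework.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        d = max(d, abs(cdf_real - cdf_synth))
    return 1.0 - d
```

A per-dataset score could then be obtained by averaging this value over all continuous columns, while categorical columns would be handled by the chi-squared test instead.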
One noteworthy observation as seen in Figure 3 is that the CopulaGAN model achieved higher logistic detection, CS test, and KS test metrics compared to the other models. This indicates that the distribution of the CopulaGAN synthetic data may more closely match real data. Conversely, CopulaGAN had the lowest MLP F1-score among the models with the VAE model performing the best, even outperforming the original dataset.

VI. EXPERIMENT METHODOLOGY
A. OVERVIEW
Our experiments were conducted in correspondence with Figure 4. Detailed discussions about data collection, data cleaning, and feature selection are presented in the Dataset section. Subsequent sections will outline the experiment scenarios and the remaining stages of the research methodology.

B. EXPERIMENT SCENARIOS
Our experiments are conducted in two distinct scenarios. Each scenario begins by establishing baseline results by training two boosting classifiers, one bagging classifier, and two deep neural networks on an imbalanced dataset consisting of real data. These classifiers are tasked to classify samples from two separate test sets. One of the test sets contains imbalanced sample proportions, while the other test dataset is balanced. The contrast between classifier performance on the imbalanced and the balanced test sets is meant to showcase that classifiers trained on heavily imbalanced datasets may perform poorly when deployed in real-world applications where traffic may not be skewed in the same manner as the training data. These results will be dubbed Imbalanced Control and Balanced Control respectively. Next, an upsampled dataset composed of synthetic and real data is utilized to train the same five classifiers. Performance metrics were gathered to determine whether classifiers trained on synthetic data can differentiate various anonymous traffic types and to determine whether training on synthetic data increases classifier performance. Scenario A emphasizes high-level classification among Tor, VPN, and regular traffic where each traffic type is composed of a basket of eight application types. Scenario B aims to differentiate among eight application types. Both Scenario A and B are trained on the original imbalanced dataset as well as the upsampled datasets using synthetic data generation techniques.

C. TEST AND SEED SPLIT
Before generating synthetic data, the original dataset was divided into seed and test datasets for Scenario A and Scenario B. The Scenario A test dataset consists of 1,950 samples (650 of each traffic type) and the Scenario B test dataset is composed of 4,000 samples (500 samples of each application type). The remaining samples for both scenarios are present in their respective seed datasets. The datasets were limited by the fact that there were only 742 Tor samples (Scenario A) and 572 email samples (Scenario B), so we chose these proportions to ensure that the test sets are balanced while leaving enough samples in the seed dataset for reliable synthetic data generation. The justification for splitting data into a test and seed dataset prior to generating synthetic data is twofold. First, by splitting the dataset in two, none of the information contained in the test dataset is included in the data generation process. If synthetic data is generated based on data in the test dataset, the synthetic data generation algorithms will create new data that is relatively similar to the test data. These new data samples will be incorporated into the training process and may skew the results because the synthetic datasets were generated while knowing what the test dataset looks like [41]. Splitting the datasets before generating new data alleviates this problem.
Second, the purpose of our classifiers is to detect and classify anonymous traffic in the real world. Importantly, our research is not designed to determine classifiers' ability to identify artificially generated anonymous traffic. If we did not perform a split before generating synthetic data, synthetic data would almost certainly be present in the test dataset after conducting a train-test split. Our metrics would be biased, as they would reflect the classifiers' ability to classify synthetic data, making it difficult to predict how the model would perform on real-world traffic.
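The split described in this section can be sketched as a balanced hold-out: a fixed number of real samples per class is reserved for testing, and only the remaining seed samples are passed to the synthetic data generators. The helper `seed_test_split` below is a hypothetical illustration of that ordering, not the authors' code.

```python
import random
from collections import defaultdict


def seed_test_split(samples, labels, n_test_per_class, seed=0):
    """Hold out a balanced test set; all other indices form the seed set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    test_idx = []
    for label, indices in by_class.items():
        if len(indices) <= n_test_per_class:
            raise ValueError(f"class {label!r} has too few samples to hold out")
        test_idx.extend(rng.sample(indices, n_test_per_class))

    held_out = set(test_idx)
    seed_idx = [i for i in range(len(samples)) if i not in held_out]
    return seed_idx, test_idx
```

In Scenario A this would correspond to `n_test_per_class=650`, and in Scenario B to `n_test_per_class=500`. Synthetic generation is then run on the seed indices only, so no generated sample can carry information from the test set.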

D. PERFORMANCE METRICS
Accuracy can give insight into the performance of a classifier, but if the dataset is imbalanced, the model may yield high accuracy on training data and low accuracy in practical application. Therefore, we used four metrics (precision, recall, F1-score, and AUC) to measure model performance. True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are the components of these evaluation metrics.
Precision is the proportion of positive classifications that are correct, and recall is the proportion of actual positives that the model correctly identified. F1-score is the harmonic mean of precision and recall and is used to evaluate the balance between them. ROC curves plot the TP rate against the FP rate. The area under the ROC curve (AUC) and the F1-score are less biased by disproportionate data than accuracy and can better indicate whether a model is overfit.
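As a worked illustration of these definitions, the sketch below computes precision, recall, and F1-score for a single class from its TP, FP, and FN counts. The helper name `precision_recall_f1` and the example counts are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one class: 80 TP, 20 FP, 40 FN.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts, precision is 80/100 = 0.800, recall is 80/120 ≈ 0.667, and F1 is 160/220 ≈ 0.727. True negatives do not enter any of the three quantities, which is one reason they are less inflated by a dominant majority class than accuracy.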

E. DATA AUGMENTATION
Previous research in the field of network traffic classification tends to upsample minority classes to the size of the majority class. Although this is a valid methodology, we wanted to experiment with other upsampling and downsampling proportions to get optimal results. In addition to upsampling each class to the majority, eleven new datasets were generated from the Scenario A and Scenario B seed datasets using SMOTE and random undersampling for the sake of maintaining some of their original proportionality. When used to train our classifiers, the top-performing dataset for Scenario A consisted of 30,000 regular samples, 20,000 VPN samples, and 10,000 Tor samples. The dataset where all classes were upsampled to the majority class (92,659 samples) was the next best performer. For Scenario B, the top-performing dataset contained 30,000 samples of each application type. The second-best dataset upsampled all eight application types to the majority class, which contained 48,020 samples.
With the optimal proportions in mind, SMOTE, the VAE, and the GANs were each employed to generate one new dataset for Scenario A and another for Scenario B using the aforementioned proportions. In total, four datasets were generated for Scenario A (using SMOTE, VAE, CTGAN, and CopulaGAN) and four for Scenario B, for a total of eight new datasets. Each dataset was used to train the five classifiers, which were then evaluated on the test dataset. The results section presents classifiers trained on the 30,000 regular, 20,000 VPN, and 10,000 Tor samples for Scenario A and 30,000 samples of each application type for Scenario B, as these were found to be the top-performing upsampling strategies.
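To make the SMOTE upsampling step more concrete, the following is a simplified, standard-library sketch of SMOTE-style interpolation: each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is not the implementation used in this work; the function name `smote_like`, the toy feature vectors, and the chosen `k` are assumptions for illustration only.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-feature "Tor" seed samples.
tor_seed = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.4], [1.2, 2.1]]
new_tor = smote_like(tor_seed, n_new=6, k=2)
print(len(new_tor))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by that class. Reaching a target such as 10,000 Tor samples would amount to choosing `n_new` as the target count minus the number of real seed samples, while classes above their target would be randomly undersampled instead.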

F. MODEL OPTIMIZATION
The three shallow learning models, Random Forest (RF), XGBoost (XGB), and LightGBM (LGBM), underwent hyperparameter tuning with grid search. LGBM and XGB were subject to variations in their n_estimators, max_depth, min_child_weight, and eval_metric. For RF, we tuned the n_estimators and max_features parameters. Each shallow learning model was trained with variations in these parameters, and the model with the optimal parameters is presented in the results section.
Hyperparameter tuning was applied to TabNet and the DNN as well. Both models use fastai's built-in lr_find() method, which suggests a learning rate according to the valley, slide, steep, or minimum heuristic. Experiments showed that both deep learning models perform best at around 20 epochs; with more epochs, the models begin to overfit. The optimal architecture for the DNN was a 15-layer network with 125 nodes in each layer. A batch size of 64 was found to be optimal.
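The grid search applied to the shallow learners can be sketched in a library-agnostic way as an exhaustive loop over every parameter combination. In the sketch below, the grid values and the `placeholder_score` function are hypothetical stand-ins: the values searched in this work are not listed here, and a real scoring function would train the classifier with the given parameters and return a validation metric such as F1-score.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over parameters named in the text; the values are illustrative.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, 12],
    "min_child_weight": [1, 5],
}

def placeholder_score(params):
    # Stand-in for "train with params, return validation F1"; peaks at 300 / 8 / 1.
    return (
        -abs(params["n_estimators"] - 300) / 1000
        - abs(params["max_depth"] - 8) / 100
        - params["min_child_weight"] / 1000
    )

best_params, best_score = grid_search(param_grid, placeholder_score)
print(best_params)
```

The cost grows multiplicatively with each added parameter (here 3 × 3 × 2 = 18 model fits), which is why exhaustive searches over larger grids quickly become expensive.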

VII. RESULTS
The following section presents the results and findings of the control, Scenario A, and Scenario B experiments. First, we establish that classifiers trained on unbalanced data adapt poorly to diverse data and may overfit to the majority classes. Then, we compare classifier performance in Scenarios A and B using upsampled synthetic data.

A. IMBALANCED VS BALANCED TEST SETS
Figures 5 and 6 depict our classifiers' performance when trained on the imbalanced dataset and tested on balanced and imbalanced test sets for Scenarios A and B. This set of experiments was conducted on the CIC-Darknet2020 dataset without any synthetic data. While all classifiers yielded high metrics when tested on the imbalanced data, there was a pronounced and expected drop in F1 and accuracy when evaluated on balanced data. Notably, the deep learning models experienced the greatest performance reduction on the balanced dataset. These results indicate that anonymous network traffic classifiers tend to overfit due to the skewed nature of the training data, which may make the classifiers infeasible in real-world scenarios. For this reason, all of the following experiments are tested on balanced data and compared to the balanced-tested control unless otherwise stated.
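The gap between the two test sets can be illustrated with a deliberately extreme toy case: a degenerate classifier that always predicts the majority class. The class counts and the `always_majority` helper below are hypothetical and far cruder than the classifiers evaluated in this work, but they show why a score on an imbalanced test set can overstate real-world performance.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def always_majority(n, majority_label="regular"):
    """A degenerate classifier that ignores its input entirely."""
    return [majority_label] * n

# Hypothetical test sets: one skewed like the training data, one balanced.
imbalanced_test = ["regular"] * 90 + ["vpn"] * 7 + ["tor"] * 3
balanced_test = ["regular"] * 30 + ["vpn"] * 30 + ["tor"] * 30

acc_imbalanced = accuracy(imbalanced_test, always_majority(len(imbalanced_test)))
acc_balanced = accuracy(balanced_test, always_majority(len(balanced_test)))
print(f"imbalanced: {acc_imbalanced:.2f}, balanced: {acc_balanced:.2f}")
```

The same predictions score 0.90 on the skewed test set and 0.33 on the balanced one, even though the classifier has learned nothing about the VPN or Tor classes.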

B. SCENARIO A RESULTS
All classifiers in Scenario A achieved F1 and AUC scores greater than 90% whether trained on the real data or the synthetically upsampled data. Table 2 contains the results for the Scenario A experiments and is highlighted to accentuate which training data produced the highest metrics for each given model (green) as well as which techniques outperformed the corresponding control model (blue). Each synthetic upsampling technique was used to create a dataset with 30,000 regular, 20,000 VPN, and 10,000 Tor samples, as this was found to be optimal. Classifiers trained on SMOTE-upsampled data had the highest metrics when compared to the other techniques. TabNet and the DNN experienced a greater F1-score improvement from control to SMOTE than the shallow learning models. Furthermore, classifiers trained on CTGAN, CopulaGAN, and VAE data saw no major improvement or degradation from the control in this scenario.
Independent of upsampling techniques, the shallow learning classifiers experienced low variability and performed better on average than the deep learning models. This could be attributed to the fact that deep learners tend to require more data than shallow learners, and our seed data may not have been sufficient. Moreover, the deep learners may require further hyperparameter optimization and training for more epochs to perform on par with the shallow learners.

C. SCENARIO B RESULTS
The results of the Scenario B experiments (Table 3) showed higher deviations in metrics across synthetic techniques and models due to the larger number of classes. For every sampling technique, we generated a dataset with 30,000 samples of each application type. Once again, SMOTE provided the most promising results, improving over control across all classifiers. The shallow classifiers trained on the other synthetic techniques performed similarly to the baseline classifiers. In contrast, our deep learning models performed poorly when compared to the shallow learners and saw large variations across the different techniques. For instance, TabNet's F1 improved by ∼18% from baseline to SMOTE, whereas the DNN's F1 degraded by ∼16% from baseline to CopulaGAN.
With the exception of SMOTE, all deep and most shallow learning classifiers degraded in performance when trained on upsampled data. There could be a multitude of causes for this discrepancy. One potential reason is that the GANs and the VAE are deep learning-based algorithms and may not have been trained for a sufficient number of epochs. Moreover, they may not have had enough seed data to create representative samples. SMOTE does not require considerable sample data to produce new samples occupying the same feature space as the original data because it is a statistical technique that does not iteratively learn on sample data. Figure 7 illustrates confusion matrices and classification results for LGBM when tested on imbalanced, balanced, and upsampled SMOTE data. From these figures, we can see that "browsing" and "email" were the classes with the greatest improvement in F1-scores when upsampled with SMOTE compared to the balanced control results. It should be noted that "browsing" had the second largest number of original samples while "email" contained the fewest. This suggests that the model was biased towards browsing traffic and had a large number of false positives. After training on SMOTE-upsampled data, the F1-scores for browsing and email classification improved by 5.7% and 8%, respectively. When evaluated against the balanced control, all application types improved in F1-score with the exception of "chat", which did not see any variation in the result.

D. SUMMARY
Across both scenarios, SMOTE was the top-performing generative technique, improving classifier metrics over the control for every classifier. The other balancing techniques performed near baseline for shallow learners, but with additional optimization there may be further improvements. Moreover, we showed that non-SMOTE generative techniques can degrade performance, especially in deep learning classifiers. Table 4 contextualizes our results with prior research. Our traffic results are on par with prior research, while our application results showed a 4% improvement over Lashkari et al.'s [9] application type experiments. Furthermore, we measure model performance using F1-score, as it is less likely than accuracy to mask overfitting to majority classes, and our results show that synthetic data is viable in this domain. Compared to prior studies, by exploring multiple generative techniques to address the imbalanced classes, we are able to optimize model performance.

VIII. LIMITATIONS AND FUTURE WORK
As the scope of this work is to assess the efficacy of synthetic data in the field of anonymous traffic and application categorization, it does not optimize every step in the experimental workflow. Future work may benefit from implementing additional SMOTE variants (such as ADASYN) and from training the CopulaGAN, CTGAN, and VAE for more epochs. Furthermore, the models observed in this study could be refined through greater hyperparameter tuning and by conducting more exhaustive grid searches on the shallow learning classifiers. This work used SMOTE as a baseline to generate 12 new datasets of varying proportions for each class. Testing further upsampling ratios with other synthetic data generation techniques as the guiding heuristic may improve overall performance. Also, additional metrics such as the synthetic data generation time of the discussed methods may be an important consideration during real-world implementation of these classifiers. Future work could use multi-criteria decision making to evaluate the proposed methods based on factors other than accuracy and F1.
While the CIC-Darknet2020 dataset contains a large variety of anonymous traffic and application types, there are still several encryption and anonymity protocols and applications, such as HTTPS, SSH, and SSL/TLS, that are not included within this work. Characterization of those untested types may be necessary for an organization looking to deploy similar models. Moreover, the literature lacks experiments on the optimal flow interval, so further experimentation is encouraged to regenerate the tabular dataset from the raw pcap traffic data over different flow intervals.
The experiments would benefit from larger amounts of real data because TabNet and the DNN were likely impacted more than the shallow learners by the limited number of samples in the seed and test datasets. Furthermore, testing our models with balanced data resulted in a test set with ∼600 samples for each application type, which may not fully encapsulate the variety of real-world data.
The CMU-SynTraffic-2022 dataset provides a multitude of avenues for future research. The dataset is more robust than many available anonymous traffic datasets and may be incorporated into research that characterizes anonymous traffic and application types. Additionally, researchers may use the dataset to judge how their generative model compares to the generative techniques used to create the samples in CMU-SynTraffic-2022, or the samples from each generative technique could be analyzed to determine the impact generative models have on classifiers' performance. CMU-SynTraffic-2022 also provides a means to analyze synthetic data with respect to its source algorithm. For instance, it may be the case that samples generated by one generative technique cluster with samples from another generative technique.

IX. CONCLUSION
In this work, we analyzed the performance of RF, XGB, LGBM, a DNN, and TabNet on a variety of synthetic data generation techniques. First, the classifiers were trained on the imbalanced CIC-Darknet2020 dataset and tasked with classifying samples from imbalanced and balanced test sets. It was demonstrated that, in this experiment, the models experienced performance degradation and struggled to classify minority classes because they were trained on skewed data. Next, four additional datasets were generated using a CTGAN, CopulaGAN, VAE, and SMOTE, and the classifiers were retrained on these datasets to evaluate how each technique alleviates the class imbalance problem. Ultimately, the additional datasets were amalgamated into a complete dataset dubbed CMU-SynTraffic-2022 and open sourced for future synthetic and network traffic research [42], [43].
In our two-phased experiments, Scenario A classified Tor, VPN, and regular traffic while Scenario B aimed to differentiate among eight anonymous application types. In both scenarios, SMOTE consistently provided better metrics than both the control set and the other synthetic datasets. Scenario B also showed that deep learning classifiers are impacted more than shallow learning classifiers when upsampling with synthetic data. Our shallow boosting algorithms (XGB and LGBM) were the top performers among the classifiers.
Due to the inherently imbalanced nature of anonymous network traffic, machine and deep learning classifiers often experience severe performance degradation in real-world applications. Our work demonstrated the viability of synthetic data in the domain of anonymous network traffic classification through a comprehensive comparison of data generation techniques. Furthermore, we addressed the knowledge gap in existing research by directly comparing several generative techniques and their capability to represent real network data. By testing techniques currently unused in this domain (VAE, GAN variants) and observing how different classifier types perform on synthetic data, we help researchers select the most suitable generative technique for network traffic tasks. Through the creation of the CMU-SynTraffic-2022 dataset, we provide a means for further network traffic and synthetic data research. Finally, the methodology presented in this work could be translated to other domains with heavily imbalanced data to potentially improve model performance.
JAMES HALLADAY is currently pursuing the bachelor's degree in math and computer science with Colorado Mesa University. His research interests include the application of machine learning to the domain of cyber security.
NATHAN BRINER is currently pursuing the bachelor's degree in computer science and a minor in mathematics with Colorado Mesa University. He is also the Vice President of CMU's Computer Science Club and the Vice President of Upsilon Pi Epsilon (the International Honor Society for Computing and Informatics) at CMU.