Tabular-to-Image Transformations for the Classification of Anonymous Network Traffic Using Deep Residual Networks

With the meteoric rise in anonymous network traffic data, there is a considerable need for effective automation in traffic identification tasks. Though many shallow and deep machine learning network traffic classification solutions have been proposed, they often rely on tabular data, making them unable to detect complex spatial relationships. However, recent advancements in computer processing power have increased the viability of transforming tabular data into images for training deep convolutional neural networks, turning structured data problems into spatial ones. To identify the most effective methods for representing tabular anonymous network traffic data as images, we compared five deep learning classifiers trained on data from different tabular-to-image algorithms: Image Generator for Tabular Data (IGTD), DeepInsight, vector-of-feature wrapping (normalized and non-normalized), and our newly introduced Binary Image Encoding (BIE) technique, in the classification of eight network application types. Furthermore, we examine whether deep residual models trained on tabular-to-image data can outperform the top-performing shallow learner, XGBoost, at classifying anonymous network traffic. We found that ResNet-50, a pre-trained deep residual network, trained on image datasets generated using IGTD and the novel Binary Image Encoding outperformed XGBoost trained on tabular data. Our ResNet-50 models trained using IGTD and BIE achieved F1-scores of 96.0% and 98.49%, respectively, improving on the baseline of 95.1% achieved by XGBoost.


I. INTRODUCTION
Network traffic classification is crucial for improving network management and security [1]. For instance, real-time applications like video streaming may require lower latency for a better user experience than web browsing [2]. Classifying this traffic can enable better optimization by internet service providers (ISPs) to prioritize real-time applications with more suitable network nodes [1]. Moreover, classifying network traffic may aid in malicious network traffic interception, which local governments typically mandate [3], [4], [5]. Automation of this task has garnered greater interest as the scale of network traffic increases and new threats to network security reveal themselves.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Using the information in each transmitted packet, or in a collection of packets and their metadata called a flow, ISPs can classify traffic based on the application that produced it and optimize their infrastructure to scale to the evolving needs of their customers [6]. Due to the emergence of a suite of encryption and anonymization technologies such as Secure Shell Protocol (SSH), Hypertext Transfer Protocol Secure (HTTPS), The Onion Router (TOR), and Virtual Private Networks (VPNs), it can be difficult to rely on conventional techniques to discover the origin of anonymous and encrypted traffic [7]. To solve this problem, machine learning algorithms have been successfully employed to classify the applications producing network traffic [8], [9], [10].
A promising new machine learning approach to classifying network traffic is transforming structured tabular network traffic data into unstructured images. Audio, visual, and raw packet capture data are examples of unstructured data, whereas structured data is typically numerical or categorical data organized in a tabular form [11]. The basic premise behind tabular-to-image (T2I) transformations is to convert structured data into a form more suitable for deep learning algorithms. Deep learners tend to outperform shallow learners when it comes to the classification of unstructured data, so by converting structured data into unstructured data these deep classifiers can be utilized to potentially improve predictive performance [11]. Convolutional neural networks (CNNs) are one such model that can exploit properties such as locality and order between components of the data [12]. Prior research [9], [13], [14] has demonstrated the effectiveness of CNNs in classifying network traffic when trained on T2I data. Other CNN variants, such as deep residual networks (ResNet), have also yielded high performance in a broad range of classification tasks.
While previous research has established the potential of CNNs trained on T2I data, only a limited number of these techniques have been tested in the network traffic domain [9], [13]. Moreover, there is minimal research comprehensively comparing the effectiveness of these T2I methods. With numerous T2I methods presented throughout several problem domains, determining the best method to use in the network traffic domain can better optimize network traffic classification tasks. These gaps in knowledge served as motivation for the experiments presented in this paper.
In our work, we measure the efficacy of five T2I algorithms in classifying anonymous network traffic by comparing five ResNet-50 classifiers trained on data generated with IGTD, DeepInsight, feature wrapping (normalized and non-normalized), and Binary Image Encoding (BIE), a new T2I method introduced in this work. We also compare these models to the shallow learner, XGBoost, trained on the original tabular data as a baseline comparator. Using this methodology, we can determine whether T2I techniques allow the ResNet classifier to outperform XGBoost and identify the most effective T2I technique for the network traffic problem domain.
The following are the major contributions of our work:
• Introduce BIE, a new T2I algorithm that utilizes data encoded as binary representations of double-precision floating point numbers. We believe BIE can be applied to many problem domains, though it may have specific advantages for network traffic classification.
• Provide a direct comparison of the efficacy among various T2I techniques. An experimental comparison of many T2I algorithms has not been explored in previous research for network traffic classification.
• Apply DeepInsight and IGTD T2I methods to the network traffic domain. These methods were initially introduced and tested on genomic data, and their effectiveness when applied to classifying network traffic has not been previously explored.
• Create an open-source image dataset for further evaluation of the T2I techniques.
• Establish that T2I methods employed with ResNet-50 can outperform shallow classifiers on the classification of anonymous network traffic.
The rest of the article is structured as follows: Section II discusses related T2I and network traffic research and existing knowledge gaps, Section III introduces the datasets used in this work, Section IV introduces the classifiers used in this work, Section V explains the T2I methods that are the focus of this work, Section VI outlines our experiment methodology, Section VII presents the findings of our experiments, Section VIII concerns the limitations and future areas of study, and the paper concludes in Section IX.

II. RELATED WORKS
Our literature review found that the related works could be loosely categorized into three groups. Section II-A groups the related works that primarily focus on anonymous traffic classification regardless of the machine learning algorithms used. Section II-B aggregates works that primarily use CNNs in anonymous network traffic classification, and Section II-C groups works that use tabular-to-image techniques from different problem domains.

A. ANONYMOUS NETWORK TRAFFIC CLASSIFICATION
Anonymous network traffic classifiers categorize network traffic using machine learning models. The following works establish pre-existing approaches used for anonymous traffic classification [13], [14], [15], [16].
Allhusen et al. [15] trained multiple shallow learners to classify darknet and benign internet traffic from the CIC-Darknet2020 (CIC-DN) dataset (see Section III-A). They analyzed the effects of the following feature groups on model performance: all original features after data cleaning, all features excluding source and destination port, 11 manually selected features, and the manually selected features without source and destination port. The Ridge-300 classifier outperformed all other models with an accuracy of 99% on their selected feature set.
Gupta et al. [16] trained an XGBoost model to classify Tor, VPN, and normal traffic from the CIC-DN dataset. The authors selected 36 of the original 83 features to reduce redundancy and increase classification efficiency. They found that XGBoost obtained the highest classification accuracy (98%) compared to seven other classifiers and concluded that the classifier still performed well despite the class imbalance in the dataset.
Lan et al. [13] proposed a deep-learning solution for darknet traffic identification and application classification called DarknetSec. Their method consists of a custom attention-based 1D CNN and is compared to other state-of-the-art classifiers such as VGG19 with RF on the CIC-DN dataset. They found that their proposed method outperformed the other tested classifiers, yielding >92% accuracy on 8 different application classes.
He and Li [14] developed a one-dimensional CNN model to classify anonymous proxy traffic with smaller image sizes. First, they converted two-way and one-way spatiotemporal features to one-dimensional images. Next, they compared their method to other CNN-based models on the ISCXVPN2016 dataset as well as a self-generated dataset using Shadowsocks and Wireshark. Their method attained the same performance (>99%) as other methods while reducing memory storage by over 90% and minimizing computational overhead.

B. NETWORK TRAFFIC CLASSIFICATION WITH CNNs
CNN-based deep learning techniques are becoming a more popular approach to network traffic classification problems as the rapid growth of computation power enables quicker model training [9], [17].
Krupski et al. [17] surveyed 136 papers concerning CNN techniques in the network traffic domain, giving a specific focus on data transformation schemes. They created a taxonomy to categorize different data transformation and CNN techniques, differentiating them by their network data type, data transformation, CNN model structure, and input dimensionality. They found that 2D encodings utilizing feature wrapping and one-hot encoding were some of the most common techniques used in the existing literature.
Lashkari et al. [9] introduced DeepImage, a tabular-to-image pipeline for detecting and classifying Darknet traffic using the CIC-DN dataset. DeepImage synthesizes gray-scaled images composed of the most important features from the dataset. A custom CNN was trained on the images to detect and characterize Darknet traffic with an accuracy of 86% when classifying among eight application types.

C. TABULAR TO IMAGE ALGORITHMS
In order to leverage the strengths of CNNs and improve classifier accuracy on tabular data, Tabular-to-Image (T2I) algorithms were introduced in previous works [11], [17], [18].
Sun et al. [11] proposed SuperTML, a technique for transforming tabular data into image data that can be paired with pre-trained CNNs for advanced classification tasks. SuperTML works by arranging feature values onto a 2D image. Features of greater importance are projected with larger font sizes. Moreover, SuperTML reduces the need for data preprocessing as missing values are projected as '?', and non-numeric values are placed on the image without the need for encoding. They tested SuperTML data with a pre-trained CNN and compared it with XGBoost on three separate datasets. SuperTML performed equally well or outperformed XGBoost in each test.
Buturović and Miljković [18] developed a method for classifying tabular data with CNNs through an image-generating algorithm called Tabular Convolution (TAC). This method treats input vectors as kernels and converts the data into an image using convolutions of a fixed base image. Features are converted to kernels by creating a square matrix with an odd number of rows and columns. If the number of features is not the square of an odd number, the square is either padded or trimmed towards the nearest odd square. TAC was applied to gene expression data and trained using ResNet, and results were compared to shallow classifiers (XGBoost, LightGBM, and Support Vector Machines). TAC outperformed all shallow learning methods, obtaining an accuracy of 91.1% compared to the highest performing shallow learner's accuracy of 89.6%. This result was obtained using several thousand epochs of training, whereas performance similar to non-CNN methods was seen when using 50 epochs. They conclude that the additional computation time required for TAC is negligible on modern computer architecture.
Table 1 compares the research of prior literature that explored the CIC-DN dataset, T2I techniques, or similar classifiers.

D. MOTIVATION AND PURPOSE
From the literature review, we find that insufficient research on T2I methods and the application of these methods for anonymous network traffic classification have left the following gaps in knowledge:
• Research conducted on T2I methods is often only tested in the domain in which the technique was created, such as genomic datasets in the case of IGTD [12] and OmicsMapNet [19].
• Previous image generation techniques applied to network traffic are primarily formed from raw packet capture (PCAP) data as opposed to tabular data [17], [20], [21], [22], [23], [24]. The focus has been on applying CNN architectures to a specific dataset, rather than on the importance of T2I techniques.
• There is minimal research providing direct performance comparisons of several T2I techniques.
We address the gaps in existing research by empirically evaluating T2I methods in the network traffic domain while providing a detailed analysis of five T2I schemes. Among all the methods we explored, two had not previously been applied in the network traffic domain, and BIE is, to our knowledge, a novel technique.

III. DATASET
In this work, we use the CIC-DN dataset balanced with synthetic SMOTE data from the CMU-SynTraffic-2022 (CMU) dataset. This data was then used in conjunction with the five T2I algorithms to create our CMU-SynTraffic2023-ImageDataset (CMU-I). We explain these datasets further in the next subsections.

A. CIC-Darknet2020
Lashkari et al. [9] created the CIC-Darknet2020 (CIC-DN) dataset by combining their ISCX-Tor2016 [25] and ISCX-VPN2016 [26] datasets. The CIC-DN dataset is provided both as raw Packet-Capture (PCAP) files and as tabular data files that were preprocessed over a fixed time interval using CIC-FlowMeter v4.0 [27]. The tabular data samples consist of time-based features such as flow duration as well as statistical features, which makes them highly representative of traffic flows. The dataset consists of eight anonymous traffic application types and contains 117,620 samples encompassing both Tor and VPN traffic. This dataset was chosen for our experiments due to its comprehensive selection of application types, its adequate number of samples for training, and its extensive use in prior works [9], [13], [15], [16], [28].
The eight application types comprising CIC-DN are audio streaming (Vimeo and YouTube), browsing (Firefox and Chrome), chat (ICQ, AIM, Skype, Facebook, and Hangouts), email (SMTPS, POP3S, and IMAPS), file transfer (Skype, FTP over SSH (SFTP) and FTP over SSL (FTPS) using Filezilla and an external service), p2p (uTorrent and Transmission), video streaming (Vimeo and YouTube), and VoIP (Facebook, Skype, and Hangouts voice calls). Class imbalance was an apparent problem with the original dataset, as 47% of the samples are p2p traffic while VoIP and email samples make up less than 1% of the total data. We address this limitation with the synthetic data generation scheme discussed in the following section.

B. DATA BALANCING AND CLEANING
Since models trained on imbalanced datasets often perform poorly in real-world deployment, it was necessary to balance the CIC-DN data to improve model generalizability [30]. In our previous work, we tested the viability of several data generation techniques to synthesize and balance the network traffic data in the CIC-DN dataset [29]. We found Synthetic Minority Oversampling Technique (SMOTE) to be the top-performing upsampling technique, as it improved the F1-score over baseline by 7.5% [29]. The upsampled network data we created was amalgamated and published as CMU-SynTraffic-2022 (CMU) [28]. In this work, we utilize the real network traffic data from the CIC-DN dataset upsampled with synthetic SMOTE data from the CMU dataset as the baseline tabular dataset for our experiments.
From our tabular dataset, 14 zero-valued features and six additional features (Flow-id, Source/Destination IP, Timestamp, and Source/Destination port) were removed, as they either caused the model to overfit or contained duplicate information. This process left 64 features in the resulting dataset. After removing samples containing NaN and Inf values and up-sampling minority classes, our final dataset contained 240,000 samples with 30,000 samples in each class. We used an 80/20 train-test split: 80% of the data was used for training, while the remaining 20% was used for testing the models and calculating our performance metrics.

C. TABULAR-TO-IMAGE DATASETS
We employ each of the five T2I techniques to transform all 240,000 samples into corresponding images. The images are grouped into folders based on their application type. The dataset, dubbed CMU-SynTraffic2023-ImageDataset (CMU-I), is published online [31] for further scrutiny. Section VI-C provides detailed insight into the generated images while providing visualizations of selected samples.

IV. CLASSIFIERS
This section briefly discusses the XGBoost and ResNet-50 classifiers evaluated in our experiments as well as our reasoning for selecting these specific classifiers.

A. XGBoost
XGBoost is an optimized distributed gradient boosting classifier that has become the tool of choice in many machine-learning applications due to its high performance [32], [33]. XGBoost is built upon gradient boosting, which is an ensemble technique that combines the output of multiple weaker machine learning models to produce a more accurate prediction [34]. XGBoost provides L1 and L2 regularization to reduce overfitting and loss; in particular, it enables users to tune various hyperparameters to constrain trees, adjusts the learning rate during the learning process, and provides random sampling techniques [34]. XGBoost was chosen as the baseline classifier as it outperformed all other classifiers in previous similar experiments and is a top-performing classifier over a wide range of experiments [11], [29], [32], [33].
We used grid search to optimize our XGBoost model. Grid search is an algorithm that automatically tries different hyperparameter values during training to find the best-performing combination. Through the use of grid search, we found that our XGBoost model performed optimally with a learning rate of 1, a max depth of 9, and 180 estimators. Our implementation of XGBoost used the default values for L1 and L2 regularization, which were α = 0 and λ = 1, respectively.
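The grid search procedure described above can be sketched as an exhaustive loop over hyperparameter combinations. This is only an illustration: the candidate values in `param_grid` and the `toy_score` function are hypothetical stand-ins (a real run would train XGBoost and return a validation metric such as F1-score for each combination), and the toy score is deliberately shaped to peak at the values reported above.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the highest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values; the grid actually searched is not given in the text.
param_grid = {
    "learning_rate": [0.1, 0.5, 1.0],
    "max_depth": [3, 6, 9],
    "n_estimators": [100, 180, 260],
}


def toy_score(params):
    """Placeholder for 'train a model and return its validation score'."""
    return -(abs(params["learning_rate"] - 1.0)
             + abs(params["max_depth"] - 9)
             + abs(params["n_estimators"] - 180) / 100)


best_params, best_score = grid_search(param_grid, toy_score)
```

Because every combination is evaluated, the cost grows multiplicatively with the number of candidate values per hyperparameter.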

B. ResNet
CNNs have produced high accuracy in image classification, and increasing the depth of these networks can result in improved classification accuracy. However, deeper neural networks are more difficult to train because simply stacking more layers onto a network introduces the problem of vanishing/exploding gradients [35]. As layers are stacked, the partial derivative of the loss function will either approach zero and vanish, or approach infinity, causing it to explode. Neural networks utilize this value during backpropagation to adjust the weights of nodes. With a vanished or exploded partial derivative, the network is unable to learn as it can no longer adequately update the weights in the network.
ResNet is a deep residual neural network introduced to mitigate this drawback through residual learning. The residual aspect of ResNet gives it an advantage over other CNNs because it allows the network to be built with more depth. In a simple deep network, the output from each convolutional layer is passed directly as the input to the next layer, which can cause vanishing/exploding gradients as depth grows. In comparison, ResNet introduces residual connections that enable the network to skip one or more layers. These connections allow information to propagate directly to all layers of the network. As a result, ResNet models have fewer filters and lower complexity than other neural networks, such as VGG nets [36]. ResNet models also show significantly higher accuracy than previous models in the field. In our experiments, we utilize a specific instance of ResNet with 50 layers called ResNet-50 due to its shorter training times and low error while being a top CNN classifier for computer vision [35].
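The idea of a residual (skip) connection can be illustrated with a minimal sketch. It assumes the common formulation in which a block outputs ReLU(F(x) + x); the `transform` argument is a hypothetical stand-in for the block's stacked convolutional layers, and plain Python lists stand in for tensors.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Add the block input back onto the transformed output (skip connection)."""
    fx = transform(x)  # F(x): what the stacked layers compute
    return relu([f + xi for f, xi in zip(fx, x)])


# If the layers contribute nothing (F(x) = 0), the input still passes through.
identity_like = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))
```

Because the input is added back unchanged, the block only has to learn the residual F(x), and the skip path gives gradients a direct route to earlier layers.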
We apply transfer learning to our ResNet models in order to improve performance. Transfer learning involves taking a pre-trained model, removing its output layer, and adding additional layers to be trained for a more specific task (in this case anonymous traffic classification). The advantage of transfer learning is that it can make use of the information learned from the previous, more general training to enhance performance on the new task [37]. Our ResNet model was implemented using the TensorFlow Keras library and pretrained on ImageNet.
Our goal is to compare the top performing shallow model trained on tabular data (XGBoost) with ResNet trained on T2I data.We also provide empirical evidence on whether T2I techniques are a viable approach to multi-class classification of various application types in anonymous traffic detection and categorization problems.

V. TABULAR-TO-IMAGE ARCHITECTURES
CNNs are effective at analyzing data with spatial differences between features. This makes them ideal for application on image and audio datasets where the important information about the data is based on the order of the features [12]. In these cases, the data is homogeneous, which allows CNNs to distinguish spatial differences. However, when it comes to heterogeneous data, such as tabular data, CNNs cannot be directly applied. This limitation inspires the process of transforming tabular data into images to apply CNNs. Applying CNNs to the transformed data can result in superior prediction performance compared to other shallow models trained directly on tabular data. The potential for performance improvement motivates research on the most effective method of converting tabular data to images by evaluating new and pre-existing T2I algorithms. The development of DeepInsight pioneered this transformation of data to images, followed by more effective algorithms like SuperTML, TAC, and IGTD that improve the process.
Sections V-A through V-D explain the T2I algorithms we use in our experiments.

A. FEATURE WRAPPING
Ko et al. [38] introduced a method of converting raw network traffic data into images for classification by wrapping the binary data of a traffic flow into square images. A similar technique was then adapted to transform tabular data into images as a vector of feature wrapping [17], [24]. The vector of feature wrapping technique takes a one-dimensional tabular data sample and normalizes its values before creating a square 2D image. We employ min-max normalization (1) for the feature wrapping technique used in our experiments. Since categorical features were discarded in our dataset, we did not employ the one-hot encoding technique. After normalization, each sample is then split into equal-length subvectors which are stacked on top of each other to form a square image. If a sample doesn't have enough values to form a square image, it can be padded with additional zeros [17], [24]. This feature-wrapping method is illustrated in Figure 1.
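The wrapping steps above can be sketched in a few lines. The sketch assumes the standard min-max formula x' = (x - min) / (max - min) computed per feature across the dataset, and maps constant features to zero; both choices, along with the function names, are illustrative assumptions rather than the exact procedure used in the experiments.

```python
import math


def min_max_normalize(rows):
    """Scale every feature (column) to [0, 1] using its min and max over all rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def wrap_features(sample):
    """Zero-pad a 1-D sample to the next square length and stack it into rows."""
    side = math.isqrt(len(sample))
    if side * side < len(sample):
        side += 1
    padded = list(sample) + [0.0] * (side * side - len(sample))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


rows = [[0, 10, 5], [2, 20, 5], [4, 30, 5]]
normalized = min_max_normalize(rows)
image = wrap_features(normalized[1])  # 3 features -> 2x2 image with one padded zero
```

With 64 features, as in the dataset described earlier, no padding is needed and each sample becomes an 8×8 image.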

B. DeepInsight
To better identify variations in genomic and biological data, Sharma et al. [39] introduced DeepInsight [...] datasets. The performance of the CNNs was then compared to traditional machine learning methods. The DeepInsight system achieved the highest accuracy metrics across each dataset, with an accuracy of 95% on average.

C. IGTD
Zhu et al. [12] introduced Image Generator for Tabular Data (IGTD) to improve existing image generation techniques. The optimization algorithm converts tabular data to images by assigning each feature to a pixel. The assignment is determined by ranking the pairwise distances between features and the pairwise distances between the assigned pixels. The algorithm then minimizes the difference between these two rankings. Pairwise distances are calculated through a distance measure such as the Euclidean distance or the Pearson Correlation Coefficient. This assigns similar features to pixels close to one another and dissimilar features to ones farther apart. The efficiency of this method stems from a greedy iterative process of swapping the pixel assignments of features to best reduce this difference. Unlike DeepInsight, IGTD produces dense image representations where each pixel represents a unique feature. This results in smaller images that take less time when training CNNs. IGTD also does not require domain knowledge and has excellent feature preservation, as closer features are more similar. The size and shape of the generated image can be adjusted, which makes it more applicable to a variety of domains. The authors compared CNNs trained on IGTD images to CNNs trained on DeepInsight and REFINED images using datasets of gene expression profiles of cancer cell lines and molecular descriptors of drugs. CNNs trained on IGTD images provided similar or better prediction performance when compared to the other T2I methods and to models trained on the original tabular data. Despite its origination in a different domain, we wanted to examine IGTD's applicability in the network traffic domain.
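The ranking-and-swapping idea can be illustrated with a deliberately simplified sketch. It is not the reference IGTD implementation: it assumes Euclidean distance for both features and pixels, uses the sum of absolute rank differences as the error, breaks ties arbitrarily, and replaces IGTD's swap-selection rule with a plain hill climb over all pixel pairs. All function names are hypothetical.

```python
from itertools import combinations


def rank_pairs(dist):
    """Map each unordered pair to the rank of its distance (0 = closest)."""
    ordered = sorted(dist, key=dist.get)
    return {pair: rank for rank, pair in enumerate(ordered)}


def feature_distances(columns):
    """Euclidean distance between every pair of feature columns."""
    return {
        (i, j): sum((a - b) ** 2 for a, b in zip(columns[i], columns[j])) ** 0.5
        for i, j in combinations(range(len(columns)), 2)
    }


def pixel_distances(n_rows, n_cols):
    """Euclidean distance between every pair of pixel coordinates."""
    coords = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    return {
        (i, j): ((coords[i][0] - coords[j][0]) ** 2
                 + (coords[i][1] - coords[j][1]) ** 2) ** 0.5
        for i, j in combinations(range(len(coords)), 2)
    }


def assignment_error(order, feat_rank, pix_rank):
    """Total rank mismatch, where order[p] is the feature placed at pixel p."""
    total = 0
    for (p, q), pixel_rank in pix_rank.items():
        f, g = order[p], order[q]
        key = (f, g) if f < g else (g, f)
        total += abs(feat_rank[key] - pixel_rank)
    return total


def greedy_assign(columns, n_rows, n_cols, max_passes=10):
    """Swap pixel assignments whenever a swap lowers the rank mismatch."""
    n = len(columns)
    assert n == n_rows * n_cols, "one feature per pixel"
    feat_rank = rank_pairs(feature_distances(columns))
    pix_rank = rank_pairs(pixel_distances(n_rows, n_cols))
    order = list(range(n))
    best = assignment_error(order, feat_rank, pix_rank)
    for _ in range(max_passes):
        improved = False
        for p, q in combinations(range(n), 2):
            order[p], order[q] = order[q], order[p]
            err = assignment_error(order, feat_rank, pix_rank)
            if err < best:
                best, improved = err, True
            else:
                order[p], order[q] = order[q], order[p]  # undo the swap
        if not improved:
            break
    return order, best


# Four toy feature columns (three samples each) placed on a 2x2 image.
columns = [[0, 0, 0], [10, 10, 10], [0.1, 0, 0], [10, 10.1, 10]]
order, err = greedy_assign(columns, 2, 2)
```

The returned `order` lists which feature occupies each pixel in row-major order; an image for one sample is then produced by writing that sample's feature values into those positions.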

D. BINARY IMAGE ENCODING
Inspired by the one-hot encoding technique, we introduce Binary Image Encoding (BIE), a novel T2I scheme. The one-hot encoding method was originally introduced by Wang et al. [22] and involved converting binary network flow data into a 2D image by applying one-hot encoding on each byte of the sample. The reasoning for this process is that raw network data often does not have an ordering and its values are better represented as categorical features. This is because raw network data consists largely of fields such as protocol types or flags rather than meaningful numerical values [17].
Instead of treating features as unordered categorical values, BIE makes use of the structure of binary representations of floating point values. Fig. 2 outlines a binary encoded double as well as the conversion to a decimal representation. Double precision binary numbers consist of a sign bit, the exponent, which dictates the magnitude of the number, and the mantissa, which represents the significant digits of the value. The technique converts each numerical sample value into a double precision binary string as discussed above. The binary values are then stacked on top of each other to create a two-dimensional matrix to be interpreted as an image where zero values become black pixels and one values become white pixels. This process is illustrated in Fig. 3, which depicts numerical feature values being converted into 64-bit binary strings and then being stacked on top of each other to form the full BIE image. The pseudocode for converting an input sample into an image is provided in Fig. 4.
We believe that representing feature values as vectors of binary encoded floating point numbers could have many benefits for network traffic classification. First, this method does not rely on normalization as many of the previously discussed T2I techniques do. This is advantageous because normalization reduces the range of potential feature values and can also be heavily affected by outliers. Second, a binary floating point string isolates the magnitude of a value (exponent) from the precise value of each digit (mantissa). When differentiating network traffic flows, one important factor to consider is the number of packets exchanged during the flow. For example, video streaming applications will have thousands of packets exchanged in a short time, whereas an email will typically have far fewer. To this end, isolating the magnitude of the value in the image representation may make a classification based on packet quantity in a flow easier. Finally, the method expands the information of each value by partitioning it into meaningful parts, as opposed to IGTD, where each pixel corresponds to a given feature, limiting the image's information by the number of available features.
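The encoding step can be sketched with Python's standard `struct` module. This is a minimal illustration rather than the pseudocode of Fig. 4: it assumes IEEE 754 double precision with the sign bit in the leftmost column (most significant bit first) and 8-bit pixel values of 0 for black and 255 for white; the function names are hypothetical.

```python
import struct


def double_to_bits(value):
    """Return the 64 bits of an IEEE 754 double, most significant bit first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", value))
    return [(as_int >> shift) & 1 for shift in range(63, -1, -1)]


def binary_image_encode(sample):
    """Stack one 64-pixel row per feature: bit 0 -> black (0), bit 1 -> white (255)."""
    return [[bit * 255 for bit in double_to_bits(float(v))] for v in sample]


# Three example feature values; each row is sign (1 bit), exponent (11), mantissa (52).
image = binary_image_encode([1.0, -2.0, 0.0])
```

A sample with 64 features therefore yields a 64×64 image, with no normalization step required.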

VI. RESEARCH METHODOLOGY
Fig. 5 outlines our research methodology. Subsequent sections present the experimental outline, processes for collecting metrics, tabular-to-image algorithmic conversion processes, and model training.

A. EXPERIMENTAL OUTLINE
First, we establish baseline results by training the shallow learning XGBoost classifier on the CMU dataset. The XGBoost classifier was trained to distinguish among eight application types and its performance metrics were recorded.
Next, we generate five (5) new image datasets using each of the T2I algorithms, which are then used to train five ResNet-50 models. After collecting performance metrics on these models, we compare their performance to XGBoost with an eye toward providing empirical evidence of the importance of T2I techniques.

B. PERFORMANCE METRICS
We report accuracy, F1-score, Area Under the Curve (AUC), mean squared error (MSE), mean absolute error (MAE), and tabular-to-image encoding time to evaluate model performance.True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are used to calculate the aforementioned metrics [40].
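As a concrete reading of three of these metrics, the sketch below computes F1-score from TP, FP, and FN counts for a single class, and MAE and MSE from paired actual and predicted values. How the per-class scores are averaged, and what values MAE and MSE are computed over in the multi-class setting, are not specified here, so the inputs are purely illustrative.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference; large errors dominate the result."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


f1 = f1_score(tp=8, fp=2, fn=2)  # precision and recall are both 0.8
mae = mean_absolute_error([0, 1, 2, 3], [0, 1, 2, 7])
mse = mean_squared_error([0, 1, 2, 3], [0, 1, 2, 7])
```

In the example, a single large error of 4 gives an MAE of 1.0 but an MSE of 4.0, which illustrates why MSE is the more outlier-sensitive of the two.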
F1-score is the harmonic mean of precision and recall, where precision is the proportion of positive classifications that are correct and recall is the proportion of actual positive samples that the model correctly identified. ROC curves are a visual representation of the TP rate in relation to the FP rate. The area under the curve (AUC) is calculated by finding the total area under a ROC curve. MAE is simply the average of the absolute differences between the model's predicted values and actual values, whereas MSE is the average of the squared differences between the predicted and actual values. MAE reflects the overall error, giving equal consideration to all data samples. In contrast, MSE is more affected by outliers, so a larger MSE can indicate that there are large outliers, potentially from class confusion. Our loss function is categorical cross entropy, which is a standard loss function for measuring the general fit for multi-class models [41]. F1-score and AUC are less susceptible to imbalanced data and can help determine whether a model is overfitting to training data. We primarily use F1-score to compare model performance. Tabular-to-image encoding time (in seconds) is also reported as it can be an important metric to consider when performing real-time anonymous traffic detection and continuous model learning. Our preliminary experimentation found that image size had minimal effect on model performance; however, conversion time reduces considerably when generating smaller images. All T2I algorithms were given the same input dataset and generated 240,000 corresponding images. The image samples shown in Figures 6, 7, 8, 9, and 10 were generated from the same samples for each of the T2I techniques.
Our novel BIE scheme generates images (Fig. 6) containing 64 rows and 64 columns. Each row is the 64-bit binary representation of one feature, with each column holding one bit, where ones (1) are represented as light pixels and zeros (0) as dark pixels.
Images generated by IGTD (Fig. 7) consist of 8 rows and 8 columns for a total of 64 squares. Each square represents a feature, where a darker value indicates a higher feature value and vice versa. Unlike the other algorithms, DeepInsight does not create a grid-based image. The image (Fig. 8) is constructed as a bounding box that encompasses all the features using the convex hull algorithm. Dark pixels indicate no value, and the lighter the color, the higher the feature value.
Similar to IGTD, each box in the Feature Wrapping images (Fig. 9) contains the value of a single feature. Feature Wrapping Normalized follows the same process, but the features are normalized before encoding (Fig. 10).
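The feature wrapping variants can be sketched as follows. This is an illustration only: the 8 x 8 grid, the row-major wrapping order, and the use of per-feature min-max scaling for the normalized variant are assumptions, as the exact procedure is not specified in this section.

```python
import numpy as np

def feature_wrap(features, side=8):
    """Wrap a flat feature vector into a side x side grid (row-major, assumed)."""
    values = np.asarray(features, dtype=float)
    if values.size != side * side:
        raise ValueError("feature count must equal side * side")
    return values.reshape(side, side)

def min_max_normalize(samples):
    """Scale each feature column to [0, 1] across the dataset (assumed scheme)."""
    data = np.asarray(samples, dtype=float)
    low = data.min(axis=0)
    high = data.max(axis=0)
    span = np.where(high > low, high - low, 1.0)   # avoid division by zero
    return (data - low) / span

# Two toy samples with 64 features each.
dataset = np.arange(128).reshape(2, 64)
plain_image = feature_wrap(dataset[0])                            # non-normalized
normalized_image = feature_wrap(min_max_normalize(dataset)[1])    # normalized
```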

D. ResNet MODEL TRAINING
The system used to train the models ran on Ubuntu 20.04.3 LTS with an Intel i7-7700k CPU and a GTX 1080 GPU. Through experimentation, we found that the stochastic gradient descent optimizer outperformed other optimizers when used with a learning rate of 0.01 and momentum of 0.7. The learning rate is monitored with the ReduceLROnPlateau callback, which reduces the learning rate by a factor of 0.05, down to a minimum learning rate of 0.000002, if the loss fails to improve. The model was set to train for 100 epochs with an early stopping callback to terminate training if the loss did not improve for five consecutive epochs. After each epoch, the model weights were saved and a real-time graph was updated using TensorBoard. Five-fold cross-validation was used during training to ensure that the models are generalizable.
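As a worked example of the learning rate schedule, the sketch below assumes the callback multiplies the current rate by the factor on each plateau and clamps the result at the minimum, which is how the Keras ReduceLROnPlateau callback behaves. The patience of the callback is not stated in the text, so only the sequence of rates is shown; the helper names are hypothetical.

```python
def reduce_on_plateau(lr, factor=0.05, min_lr=2e-6):
    """Apply one plateau-triggered reduction, never going below min_lr."""
    return max(lr * factor, min_lr)

def plateau_schedule(initial_lr=0.01, reductions=4):
    """Return the learning rates seen after successive plateau reductions."""
    rates = [initial_lr]
    for _ in range(reductions):
        rates.append(reduce_on_plateau(rates[-1]))
    return rates

rates = plateau_schedule()
```

Under these assumptions the rate falls from 0.01 to 0.0005, then to 0.000025, and reaches the floor of 0.000002 on the third reduction, after which further plateaus leave it unchanged.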

VII. RESULTS AND DISCUSSION
In this section, we present the results of the five ResNet classifiers trained on the CMU-I image datasets compared to the XGBoost classifier trained on the tabular CMU dataset. Then, we discuss the tradeoffs of the structured data and T2I approaches, providing insights into the viability of each method in a potential real-world deployment.

A. CLASSIFIER RESULTS
Table 1 compares the performance metrics among the five ResNet-50 classifiers trained on each T2I method and XGBoost in the classification of eight application types. Values highlighted in green are the top metric across all classifiers, whereas blue-highlighted values are metrics that exceed those of XGBoost. It can be seen that the proposed Binary Image Encoding is the top-performing method across all measured metrics (excluding image generation time), improving over XGBoost's F1-score by 2.4 percentage points. IGTD was the only other method that achieved higher metrics than the baseline, improving upon its F1-score by approximately 1 percentage point. Figure 11 provides a visual comparison among the evaluated methods.
Figure 12 depicts the differences in image generation time among the T2I methods. Notably, Binary Image Encoding took significantly longer (210 seconds) to convert the 240,000 tabular samples to their corresponding image representations, which amounts to approximately 0.9 ms per sample on average. This could be attributed to the fact that this technique is novel to this work and there may be room for further optimization. DeepInsight also took considerably longer, potentially due to its reliance on the computationally expensive Convex Hull and t-SNE algorithms. The other T2I methods had relatively shorter generation times, taking only 20-30 seconds to produce all 240,000 samples (approximately 0.1 ms per sample).
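The per-sample figures follow directly from the reported totals. The short calculation below reproduces them; the function names are hypothetical, and the throughput figure is a derived quantity rather than a reported result.

```python
N_SAMPLES = 240_000  # images generated by each T2I method

def ms_per_sample(total_seconds, n_samples=N_SAMPLES):
    """Average encoding time per sample in milliseconds."""
    return total_seconds * 1000.0 / n_samples

def samples_per_second(total_seconds, n_samples=N_SAMPLES):
    """Average encoding throughput."""
    return n_samples / total_seconds

bie_ms = ms_per_sample(210)      # 0.875 ms, reported as roughly 0.9 ms
fast_low_ms = ms_per_sample(20)  # about 0.083 ms
fast_high_ms = ms_per_sample(30) # 0.125 ms, so roughly 0.1 ms for the faster methods
```

For a real-time deployment, the same totals imply that BIE encodes a little over 1,100 samples per second on the CPU used here.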

B. OCCLUSION SENSITIVITY ANALYSIS
To better understand our BIE ResNet model's predictions, we employed occlusion sensitivity analysis. Occlusion sensitivity analysis is a popular way to visualize CNNs by blocking out a portion of a predicted image and seeing how the model's confidence is affected [42], [43]. Areas that are more important for the classification of that image should yield lower predicted confidence when covered. This process allows us to create occlusion sensitivity maps, which visualize the parts of BIE images that are important for classification for different classes.
We generated our occlusion sensitivity maps by replacing part of the target image with a gray patch, classifying the image, and then mapping the model's confidence value to that region. We repeated this process for 1,000 images for each class and averaged the values to find the most salient regions. Figure 13 shows the average occlusion sensitivity maps for all classes.
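The procedure described above can be sketched as follows. The patch size, stride, and gray fill value are assumptions, since they are not given in the text, and `predict_proba` is a hypothetical stand-in for the trained classifier that returns one confidence value per class.

```python
import numpy as np

def occlusion_map(image, predict_proba, target_class, patch=8, stride=8, fill=0.5):
    """Map the model's confidence for target_class to each occluded region.

    Lower values mark regions whose occlusion hurts the prediction most.
    """
    height, width = image.shape[:2]
    heat = np.zeros((height, width), dtype=float)
    counts = np.zeros((height, width), dtype=float)
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill  # gray patch
            confidence = predict_proba(occluded)[target_class]
            heat[top:top + patch, left:left + patch] += confidence
            counts[top:top + patch, left:left + patch] += 1
    return heat / np.maximum(counts, 1)

def average_occlusion_map(images, predict_proba, target_class, **kwargs):
    """Average the per-image maps, as done per class in the analysis above."""
    maps = [occlusion_map(img, predict_proba, target_class, **kwargs) for img in images]
    return np.mean(maps, axis=0)

# Toy stand-in model: class-0 confidence depends only on the top-left 8x8 block.
def toy_model(img):
    p = float(img[:8, :8].mean())
    return np.array([p, 1.0 - p])

toy_images = [np.ones((16, 16)) for _ in range(3)]
toy_map = average_occlusion_map(toy_images, toy_model, 0, patch=8, stride=8, fill=0.0)
```

With the toy model, covering the top-left block drops the confidence to zero while covering any other block leaves it at one, so the map isolates the only region the stand-in classifier depends on.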
Observing these figures, we can see that critical regions are distinct among the classes. For instance, we observe that audio streaming is affected most by the group of features in the top half of the image, but only by the part of each feature comprising the exponent and most significant digits of the mantissa. Chat also appears to be most affected by occluding information on the left side of the image, but does not see a significant drop in confidence when the right side is omitted. This could support the idea that for some traffic classes, the magnitude of certain features is the most relevant piece of information for classification. However, the critical regions of other classes like browsing and email are broader and scattered across the entire image.
Figure 14 shows individual traffic samples overlaid with their occlusion map. These images can provide a more detailed look at the critical regions. Once again, we see that audio, browsing, and chat have critical regions on the leftmost side of the image. The lower middle part of the image seems to be the most salient region for the File, P2P, and Video samples.
It should be noted that occlusion sensitivity analysis is highly dependent on patch size, so salient regions may be different when the analysis is conducted with different patch sizes. Additionally, our averaged maps were computed from a relatively small number of images, so they may not represent the entire distribution of the data. Drawing concrete inferences about how BIE images are classified is currently not possible, but these visualizations give a better idea of how the ResNet classifier differentiates between traffic types.

C. DISCUSSION
In this section, we analyze the results of our experiments in the context of prior works while discussing the viability of the T2I techniques.
Both IGTD and DeepInsight achieved competitive metrics, demonstrating their ability to generalize to disparate problem domains. These methods also outperformed feature wrapping, which is a more established T2I technique for network classification problems [17]. While it is hard to draw direct comparisons of metrics to other research (as their data, features, classification objective, or number of classes may differ), our models trained on IGTD and BIE data obtained >96% accuracy in the classification of eight network application types, which exceeds the accuracy reported in any of the reviewed literature using the same dataset and classification task [9], [13].
When considering the applicability of the evaluated T2I methods, there is a definite tradeoff between computation time and accuracy. A real-time detection system would require the additional overhead of transforming collected network traffic into images before it is evaluated by the model. Depending on available computational resources and the amount of network traffic, the added detection latency could make the system unsuitable. However, we have demonstrated that T2I methods can noticeably increase classification accuracy over shallow classifiers. Furthermore, IGTD offers an increase in accuracy while also keeping the image generation time comparatively low.
Online learning, the process of continually updating a model based on new data, can be negatively impacted by the slow training time of deep learning models. Training each of our ResNet models took approximately 2 hours (roughly 30 times longer than training XGBoost) on our hardware. If the proposed ResNet-50 models are deployed for anonymous traffic classification, online learning may not be viable depending on available computational resources.
Our experiments also showed that the choice of T2I method is important to the overall performance of the classification system, as most of the T2I methods failed to improve upon or match the performance of XGBoost. IGTD and BIE also outperform previous similar works [9] and [13], which achieved accuracies of 86% and 92% on the same eight application types. BIE may have performed well for the reasons stated in Section V-D. IGTD has similarities to feature wrapping in the sense that each feature corresponds to a pixel value; however, IGTD is unique in that it correlates features by importance, which may have contributed to its superior performance.
Finally, with a >2% improvement over the baseline classifier (XGBoost) and a >1% improvement over the state-of-the-art T2I technique (IGTD), in a multi-class classification problem of determining various application types in anonymous network traffic data, we argue that our novel BIE scheme is a viable T2I technique in this domain.

VIII. LIMITATIONS AND FUTURE WORK
Due to limitations in computational resources, minimal hyper-parameter tuning was performed. Future works may benefit from experimenting with additional hyper-parameter tuning on the ResNet models and T2I parameter tuning (such as the dimensionality reduction technique used in DeepInsight and the image generation size). Moreover, other pre-trained CNN classifiers (such as ResNet-N with variation in depth, N [35]) should be evaluated and compared to non-pre-trained CNN-based models, in addition to other computer vision techniques such as transformers. Finally, more visualization methods and sensitivity analysis should be applied to BIE-trained models to better understand feature-class relationships.
We trained models to classify eight application types from the Tor and VPN protocols, but there are many anonymous protocols unexplored in this work. For instance, SSL/TLS, SSH, and HTTPS traffic may go undetected or be misclassified by the current models, as these protocols were not included in the training data. Additionally, deep learning models benefit from larger datasets, so model performance may have been impacted by the relatively small dataset. Though synthetic data was generated using SMOTE to alleviate this concern, future work could look into gathering more anonymous network traffic to address this limitation. Nonetheless, two of the CNN models still outperformed the shallow counterpart, possibly mitigating the concern.
The CIC-DN dataset did not provide the flow interval used to generate the tabular dataset from the raw pcap file. A variety of flow intervals should be tested to find the interval that optimizes model performance.
The CMU-I dataset may be used to train or optimize additional classifiers, or as a baseline for other T2I techniques not considered in this work. Furthermore, the experimental workflow used to generate the CMU-I dataset should be applied to other domains to determine the general applicability of the approach. Since BIE is a novel technique, future work should test this method on other datasets and classifiers.

IX. CONCLUSION
This work explored the viability of five T2I methods (IGTD, DeepInsight, vector-of-feature wrapping (normalized and non-normalized), and the novel BIE) and their efficacy in the classification of eight anonymous network application types. These techniques were used to generate five image traffic datasets (CMU-I) for training ResNet-50. To establish baseline results for comparison, the XGBoost classifier was trained on the balanced tabular dataset (CMU). From these experiments, we found that IGTD and the BIE technique introduced in this paper improved classification metrics when compared to XGBoost, with a tradeoff of greater computation time while encoding structured samples into images. As a novel method, BIE shows promising results; however, further evaluation is warranted to establish the general applicability of the technique across other problem domains.
While many network traffic classification schemes based on CNNs have been proposed, only a smaller subset prioritizes data transformation, and even fewer apply image generation techniques to this field. This paper sought to demonstrate the potential of these techniques while also evaluating their real-world practicality and generalizability in the domain of anonymous network data classification. Furthermore, we have published our datasets for further scrutiny in the field.

FIGURE 2. Binary encoded double precision floating point value.

FIGURE 10. Sample images generated using feature wrapping normalized.
The system used for ResNet model training was also equipped with 16 Gigabytes of RAM. All ResNet models were trained on the GPU, while the T2I algorithms were run on the CPU. Each model took on average 7,950 seconds (2 hours, 12 minutes, and 30 seconds) to train for 100 epochs with 192,000 samples in the training dataset. Since the datasets all had the same number of samples and identical image dimensions, the models took approximately the same time to train, varying only by a few seconds. We employed the ResNet-50 model pre-trained on the ImageNet database as our deep learning model. The output of the model was flattened before being passed to a custom fully connected network. The fully connected network consisted of two dense layers with respective output sizes of 512 and 8 units. The first dense layer used a ReLU activation function, while the second layer used the softmax function.
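The custom fully connected network can be sketched as a plain NumPy forward pass. This is an illustration with random weights rather than the trained model. It assumes a backbone output of shape 2 x 2 x 2048 per image, which is what ResNet-50 (32x spatial downsampling, 2048 final channels) would produce for a 64 x 64 input; the input size actually used for each dataset is not stated here, and all names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def head_forward(feature_maps, w1, b1, w2, b2):
    """Flatten -> dense(512, ReLU) -> dense(8, softmax)."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)
    hidden = relu(flat @ w1 + b1)          # first dense layer, 512 units
    return softmax(hidden @ w2 + b2)       # second dense layer, 8 classes

rng = np.random.default_rng(0)
batch, h, w, channels = 4, 2, 2, 2048      # assumed backbone output shape
features = rng.standard_normal((batch, h, w, channels))
w1 = rng.standard_normal((h * w * channels, 512)) * 0.01
b1 = np.zeros(512)
w2 = rng.standard_normal((512, 8)) * 0.01
b2 = np.zeros(8)

probabilities = head_forward(features, w1, b1, w2, b2)
predicted_classes = probabilities.argmax(axis=1)   # one of the 8 application types
```

Each row of `probabilities` sums to one, so the predicted application type is simply the index of the largest entry.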

FIGURE 14. Traffic samples overlaid with their corresponding occlusion map.