Intrusion Detection Using Payload Embeddings

Attacks launched over the Internet often degrade or disrupt the quality of online services. Various Intrusion Detection Systems (IDSs), with or without prevention capabilities, have been proposed to defend networks or hosts against such attacks. While most of these IDSs extract features from packet headers to detect irregularities in network traffic, some others use payloads alongside the headers. In this study, we propose a payload-based intrusion detection scheme, PayloadEmbeddings, that uses byte embeddings of the payloads of network packets. We employ a shallow neural network to generate vector representations for bytes and their corresponding payloads. Our feature extraction technique is coupled with the k-Nearest Neighbours (kNN) algorithm for the classification of packets as intrusive or non-intrusive. In our experiments, we surveyed 34 publicly available datasets and used ten distinct payload-based, labeled intrusion detection datasets to train and evaluate our approach. Our empirical results show that PayloadEmbeddings reaches between 75% and 99% accuracy across all datasets. Finally, we compare our approach to other state-of-the-art and traditional intrusion detection techniques. Our findings suggest that PayloadEmbeddings demonstrates significant advantages over the other techniques on most of the datasets.


I. INTRODUCTION
Intrusion Detection Systems (IDSs) are used as parts of comprehensive defense mechanisms to protect systems from network-based attacks. Depending on the deployment type, IDSs are categorized into two types: Network-based and Host-based IDSs [1]. Network-based Intrusion Detection Systems (NIDSs) are deployed at the network level and inspect incoming and outgoing traffic for any suspicious, abnormal, or malicious activity. Host-based Intrusion Detection Systems (HIDSs) are deployed at each machine in the network to prevent malicious attacks targeting the hosts. HIDSs also help to prevent attacks originating from machines within the network. Based on the detection method, IDSs are categorized into two types: Signature-based and Anomaly-based IDSs [2]. Signature-based Intrusion Detection Systems (SIDSs) protect the network from malicious activities by inspecting packets in a network and comparing them against a known database of attack packet features. SIDSs fail to identify zero-day attacks. Anomaly-based Intrusion Detection Systems (AIDSs) monitor the network traffic, compare it with the expected behavior of the network, and decide whether to generate an alert. AIDSs provide better protection against zero-day attacks, though they are not immediately tolerant to new behaviors. (The associate editor coordinating the review of this manuscript and approving it for publication was Antonio Pecchia.)
Most of the IDSs use packet headers to extract features and apply machine learning methods for feature engineering and classification tasks. Such IDSs can successfully detect header-based attacks, e.g., scanning attacks or probing attacks. Moreover, the attacks that generate high volumes of packets are easily detected by header-based IDSs. However, these IDSs do not inspect the payloads of network packets. Hence, they fail to detect payload-based attacks such as SQL injection, shell-code, and cross-site scripting [3]. Unlike the header-based attacks, attackers send the payload-based attack packets at lower rates, as they try to exploit a vulnerability instead of overwhelming the network.
Prior research has explored different techniques to detect anomalies in network payloads. Many use n-gram based approaches, e.g., PAYL [4] and ANAGRAM [5], which measure the occurrence frequencies of the 256 possible byte values. The frequency distributions of bytes are used to develop a normal payload profile against which incoming payloads are compared to detect attacks. McPAD [6] uses a 2ν-gram technique to extract features from payloads. OCPAD [7], RANGEGRAM [8], and HMMPayl [9] also use n-grams to extract features. Williams et al. [10] extracted 22 traffic-related statistical features from payloads. The authors use correlation- and consistency-based feature selection techniques to reduce the feature space. However, the payload features in these approaches either do not reflect any contextual information for bytes or reflect only limited byte relationships.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently, researchers have been applying deep learning to classify network payloads. Multiple studies have employed Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Long Short-Term Memory (LSTM) networks [11]-[13]. Deep neural networks provide automatic feature extraction without manual feature engineering. However, they require a substantial amount of data and time to build an effective model against network attacks [14]. Additionally, these models must take fixed-size inputs, which requires truncating or padding payloads to a certain length. As a result, contextual and semantic information is lost.
Our main goal is to develop an IDS that can identify known anomalies using the payloads of network packets. To that end, we propose PayloadEmbeddings, a high-level feature extraction technique employing contextual byte information for intrusion detection. Our model uses the payloads of network packets to map bytes to dense vector representations, or embeddings. The model learns byte embeddings from their surrounding (context) bytes. Two bytes with similar contexts will have closer vector representations, which leads to natural groupings among such bytes. Moreover, one can effectively aggregate the byte embeddings to obtain dense vector representations for their payloads. Our model can be deployed at both the network level and the host level.
Word2vec, which has many successful applications in Natural Language Processing (NLP), generates continuous vector representations of words in a high-dimensional space [15], [16]. We exploit the same idea to generate vector representations of bytes in packet payloads. We employ a shallow neural network to generate dense vector representations of bytes. The objective of the neural network is to maximize the log-probability of the neighboring bytes of each byte in the payloads. Next, we aggregate the vector representations of bytes to build vectors representing their corresponding payloads, i.e., payload embeddings. Lastly, we use the vector representations of the payloads as features feeding a k-Nearest Neighbours (kNN) classifier that labels target packets as intrusive (anomalous) or non-intrusive (normal). kNN classifiers can form convoluted decision boundaries, which is useful when the data instances are dispersed and not easily separable.
Moreover, a suitable benchmark dataset is required to train and evaluate intrusion detection systems [17]. Publicly available, labeled datasets with raw payloads are essential for payload-based intrusion detection systems such as ours. Datasets with real network activity containing both attack and normal traffic are quite difficult and expensive to obtain [6]. Over the last few decades, a large number of IDS studies have relied on the DARPA 1998/99 [18], [19] and KDD-Cup 99 datasets. The DARPA 1998/99 datasets have been widely used in various payload-based intrusion detection systems [4]-[6], [9], [13]. However, these datasets have often been criticized for outdated DDoS attacks, artificial attack injections, and redundancy [20], [21]. Finding a suitable dataset with modern real-world attacks and normal traffic is a challenge in payload-based intrusion detection research. We performed an exhaustive search for datasets to evaluate our approach and found ten labeled datasets with payloads that are suitable for this study, out of 34 datasets in total. Specifically, we evaluate our model on the Botnet, CIC DoS, CICIDS-2017, CSIC HTTP 2010, CTU-13, ECML/PKDD 2007, ISCX-2012, ISOT, NDSec-1, and UNSW-NB15 datasets. Detailed descriptions of these datasets are given in Section IV.
Our main contributions in this study are as follows:
• We propose to employ a shallow neural network, Word2Vec, to generate vector representations of bytes, i.e., byte embeddings, from the payloads of network packets. We then utilize the byte embeddings corpus model to generate payload vectors, i.e., PayloadEmbeddings, that can be used as features for the classification task.
• We evaluated PayloadEmbeddings and ten other state-of-the-art and traditional feature extraction techniques on ten appropriate datasets out of 34 publicly available datasets. Additionally, we share the implementation of our proposed technique and the other ten methods online.¹
The rest of the paper is organized as follows. Section II presents the related work. Section III describes the proposed model, PayloadEmbeddings. Section IV presents an overview of the intrusion detection datasets used in this study. Section V presents the experimental settings of PayloadEmbeddings. Section VI describes the analyses of the experimental results. Section VII discusses the limitations of PayloadEmbeddings. Finally, we conclude our work in Section VIII.

II. RELATED WORK
In this section, we categorize the previous studies based on their feature extraction techniques.

A. TRADITIONAL FEATURE EXTRACTION TECHNIQUES 1) N-GRAM AND FREQUENCY
Many payload-based intrusion detection systems employ n-gram based feature extraction techniques. PAYL [4] uses the byte frequency distributions of normal packets to develop a centroid model. Relative frequency is used as a feature vector applying the n-gram (n = 1) approach. PAYL adopts the Mahalanobis Distance (MD) map to compute the consistency of incoming payloads. If a new payload exceeds a threshold, it is classified as anomalous. Wang et al. proposed ANAGRAM [5] to deal with polymorphic blending attacks. ANAGRAM builds a Bloom filter from the n-grams in normal payloads. n-grams are then extracted from incoming payloads to identify new n-grams absent from the Bloom filter. If the occurrence of such n-grams exceeds a certain percentage, the payload is labeled as anomalous. Vidal et al. proposed EsPADA [22] to protect networks against adversarial threats. The approach employs the n-gram technique to extract features from the payloads, and the authors utilized Counting Bloom Filters (CBFs) to store the extracted features. OCPAD [7] also uses the n-gram technique to extract sequences from payloads. The authors propose a knowledge-based data structure named Probability Tree to store the occurrence probability range of n-grams from normal payloads. RANGEGRAM [8] considers the maximum and minimum occurrence frequencies of n-grams to detect zero-day attacks against web traffic. A database is generated from the n-grams of normal traces; if an incoming packet deviates from the database, an alert is generated. McPAD [6] uses a modified n-gram technique to extract features from payloads. The authors adopted the 2ν-gram technique during the training phase, where the occurrence frequency of each pair of bytes separated by a gap of ν is calculated. McPAD generates different feature spaces by varying the value of ν. In the testing phase, multiple one-class Support Vector Machines (SVMs) are used to detect anomalous packets by majority voting.
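To make the 1-gram frequency features used by PAYL-style systems concrete, the following minimal Python sketch (our illustration, not PAYL's actual implementation) computes the relative frequency of each of the 256 possible byte values in a payload:

```python
from collections import Counter

def byte_frequency_vector(payload: bytes) -> list:
    """Relative frequency of each of the 256 byte values (1-gram features)."""
    counts = Counter(payload)          # byte value -> occurrence count
    total = len(payload) or 1          # avoid division by zero on empty payloads
    return [counts.get(b, 0) / total for b in range(256)]

features = byte_frequency_vector(b"GET /index.html HTTP/1.1")
print(sum(features))  # relative frequencies sum to (approximately) 1.0
```

A detector in the PAYL style would then compare such vectors against a profile of normal traffic, e.g., via a Mahalanobis-distance threshold.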
These approaches exploit character frequency distributions in payloads as features, which often do not reflect context-level information.

2) N-GRAM AND NEURAL NETWORK
Packet2Vec [24] utilizes n-grams (n = 2) to extract sequences of bytes from packets. The authors employ Word2Vec to create vector representations for each of the most frequent n-grams. The vectors of the n-grams in each packet are summed and divided by the number of n-grams found in that packet to create a fixed-size packet vector.
Packet2Vec also exploits Word2Vec, similar to PayloadEmbeddings. However, there are some major differences. The vocabulary size of Packet2Vec is 2^16 = 65,536 and the vector length is 128. In PayloadEmbeddings, on the other hand, the vocabulary size and the vector length are only 2^8 = 256 and 10, respectively. As a result, the time complexity and memory consumption of Packet2Vec are considerably higher than those of PayloadEmbeddings.
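A back-of-envelope comparison of the two embedding tables (our calculation from the numbers above; nothing here comes from either paper's code) illustrates the gap:

```python
# Embedding-table sizes as a number of stored floats, assuming one vector
# per vocabulary entry (an illustration, not a measured memory footprint).
packet2vec = 2**16 * 128        # 65,536 bi-gram vectors of length 128
payload_embeddings = 2**8 * 10  # 256 byte vectors of length 10

print(packet2vec // payload_embeddings)  # 3276: over 3,000x more parameters
```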

B. DEEP NEURAL NETWORK BASED TECHNIQUES
HAST-IDS [11] uses a CNN and an LSTM to capture low-level spatial and high-level temporal features from packets, respectively. AEIDS [12] employs an autoencoder to detect outliers in network traffic; the reconstruction error on normal traffic and a modified z-score are used to classify incoming traffic. Liu et al. [13] proposed two models for payload-based intrusion detection, PL-CNN and PL-RNN. The authors utilized deep learning models to learn features from payloads without manual feature engineering. However, their approach is evaluated using the outdated DARPA 1998/99 dataset. As these methods rely heavily on deep learning models for feature extraction and classification, their time complexity is higher than that of other IDS techniques [13]. In contrast, our proposed method employs a shallow neural network (with one hidden layer) only for byte embeddings generation. Hence, our method is computationally faster.

C. OTHER TECHNIQUES
PCNAD [25] considers the specific content of a payload. Content-based Payload Partitioning (CPP) is used to determine payload profiles for different lengths of payloads. Applying the CPP technique, PCNAD uses 62.64% of the full payload length on average. Hosseini et al. [23] proposed a payload-based attribution scheme named CBID (Compressed Bitmap Index and Traffic Downsampling). CBID extracts features from down-sampled traffic using a combination of Bloom filters and a compressed bitmap index table. Jamdagni et al. [26] propose a 3-tier feature selection technique named Iterative Feature Selection Engine (IFSEng). In the first tier, Principal Component Analysis (PCA) is used to analyze the raw data. Tier 2 computes the number of dominant Principal Components (PCs). Finally, a normal model is generated and trained in tier 3. The authors use the Mahalanobis Distance (MD) map to extract correlations between the packets and between the features. As in PAYL [4], the MD map is used to classify payloads as anomalous or normal. Luo et al. [27] developed a combined model with XGBoost and PU learning for web anomaly detection. This method vectorizes HTTP payloads at the byte level using ASCII values to avoid information loss. PU learning trains a binary classifier; Naive Bayes (NB) and Logistic Regression (LR) are used to identify malicious behaviors. The combined model performs best when the vector dimension is 7500. However, due to its multiple stages of payload analysis, the proposed approach is computationally costly and does not preserve contextual relationships between bytes in payloads.
Unlike the previous studies [4]-[8], [11]-[13], [22]-[27], PayloadEmbeddings can extract contextual information from payloads. Each byte in a payload is transformed into a vector space by maximizing the log-probability of its neighboring bytes within a window size of 5, capturing the contextual relations among adjacent bytes. Moreover, the small vocabulary size (256) limits the time complexity, making our approach computationally faster than the others.

III. PAYLOAD EMBEDDINGS CORPUS MODEL
In this section, we first briefly present word2vec models that inspired us to develop byte and payload embeddings for intrusion detection. Then, we introduce our model, PayloadEmbeddings, in detail.
The Continuous Bag of Words (CBOW) and Skip-gram models are together known as word2vec [15]. Word2vec is an efficient natural language processing (NLP) model that learns vector representations of words, or word embeddings. The most practical aspect of word embeddings is that they are dense vector representations of words that capture semantic meaning. The CBOW model predicts a word given its context as input, while the Skip-gram model predicts the context given a word as input. The Paragraph Vector model [28], also known as doc2vec, is an extension of word2vec: a self-supervised learning framework that learns distributed representations of texts of any length. The doc2vec model creates embeddings, or vector representations, for paragraphs. To obtain paragraph vectors, word embeddings are typically aggregated by concatenation or coordinate-wise averaging.
A. BYTE EMBEDDINGS GENERATION
We extend the Skip-gram model to generate a corpus of embeddings for the bytes in packet payloads. We build the byte embeddings corpus model from the output of the hidden layer of the model. Then, for each payload in our dataset, we generate a feature vector using the aggregated vectors of individual bytes from the byte embeddings corpus. Note that we consider complete payloads instead of trimming or padding them to a fixed length during the training of the corpus model. Figure 1 demonstrates the shallow neural network architecture, similar to Skip-gram for NLP, used to generate byte embeddings. The input layer takes a T-dimensional one-hot encoded input byte b_i, where T = 256 is the number of unique bytes in the corpus. The hidden layer consists of k neurons that produce the k-dimensional vector representation of the input byte b_i. The training objective of the model is to generate contextual byte representations for an input byte b_i with respect to its neighboring bytes within a predefined window size s. Considering n training bytes b_1, b_2, ..., b_n, the objective is to maximize the average log-likelihood:

\frac{1}{n} \sum_{i=1}^{n} \sum_{-s \le j \le s,\, j \ne 0} \log p(b_{i+j} \mid b_i)    (1)

where s is the window size of the input byte b_i. For example, if the window size is s = 3, then for each byte b_i we average the log-probabilities of b_{i-3} to b_{i+3}, excluding the center byte b_i itself. Averaging the log-likelihood generates more stable byte representations. The probability p(b_{i+j} \mid b_i) in (1) is defined using the softmax function in (2):

p(b_c \mid b_i) = \frac{\exp\left({v'_{b_c}}^{\top} v_{b_i}\right)}{\sum_{t=1}^{T} \exp\left({v'_{b_t}}^{\top} v_{b_i}\right)}    (2)

where b_c is a context (surrounding) byte, v_b and v'_b are the input and output vector representations of byte b, respectively, and T is the total number of unique bytes in the payload corpus. Note that computing the gradient \nabla \log p(b_{i+j} \mid b_i) of the objective in (1) is expensive in the original skip-gram model. However, our payload corpus contains at most 256 unique bytes, which limits the computational complexity. By taking the output of the hidden layer, we generate a corpus model for the bytes of all payloads in our dataset.
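The following toy numpy sketch (our illustration with randomly initialized embedding matrices, not the trained model) shows how the skip-gram quantities in (1) and (2) are evaluated for a byte sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

T, k, s = 256, 10, 3              # vocabulary size, embedding length, window size
V_in = rng.normal(size=(T, k))    # input (hidden-layer) byte embeddings
V_out = rng.normal(size=(T, k))   # output byte embeddings

def softmax_prob(center, context):
    """p(b_context | b_center) as in equation (2)."""
    scores = V_out @ V_in[center]   # dot product with every output vector
    scores = scores - scores.max()  # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context]

def avg_log_likelihood(payload):
    """Average log-likelihood of a byte sequence, as in equation (1)."""
    total = 0.0
    n = len(payload)
    for i, b in enumerate(payload):
        for j in range(-s, s + 1):
            if j != 0 and 0 <= i + j < n:
                total += np.log(softmax_prob(b, payload[i + j]))
    return total / n

payload = list(b"GET /index.html")  # bytes as integer tokens
print(avg_log_likelihood(payload))  # a negative number; training maximizes it
```

Training adjusts V_in and V_out to raise this average log-likelihood; the rows of V_in then serve as the byte embeddings corpus.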

B. PAYLOAD EMBEDDINGS GENERATION
The low-dimensional feature vector generation approach for payloads is depicted in Figure 2. The figure illustrates the vector aggregation of a payload containing m bytes b_1, b_2, ..., b_m. For each byte b_j, the corresponding vector v_{b_j} of length k is selected from the pre-computed byte embeddings generated by the skip-gram model (Figure 1). The vectors are then averaged to obtain the final payload feature vector of dimension k. We use the coordinate-wise vector aggregation method [16] given in (3):

w[i] = \frac{1}{m} \sum_{j=1}^{m} v_{b_j}[i], \quad i = 1, \ldots, k    (3)

where v_{b_j}[i] denotes the i-th element of the byte embedding of the j-th byte, m is the number of bytes in the payload, and w[i] is the i-th element of the final payload feature vector w. Note that the vector length k equals the feature size. Finally, the payload vectors are used as features and fed into our classification algorithm, k-Nearest Neighbours (kNN).
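The aggregation in (3) amounts to a coordinate-wise mean over the byte vectors of a payload. A minimal sketch, assuming a pre-computed 256 × k byte embedding matrix (randomly initialized here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
byte_embeddings = rng.normal(size=(256, k))  # stand-in for the learned corpus

def payload_vector(payload: bytes) -> np.ndarray:
    """Coordinate-wise average of byte embeddings, as in equation (3)."""
    vecs = byte_embeddings[list(payload)]  # (m, k) matrix of byte vectors
    return vecs.mean(axis=0)               # k-dimensional payload feature w

w = payload_vector(b"GET /login?user=admin")
print(w.shape)  # (10,)
```

Because the mean is taken over however many bytes the payload has, no truncation or padding to a fixed length is needed.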

1) TIME COMPLEXITY
The time complexity for training the byte embeddings corpus model is O(s · (N + N · T)), where s is the context window size, T is the number of unique bytes, and N is the total number of bytes in the training dataset [15], [29]. The time complexity to generate a feature vector, i.e., a payload embedding, is O(m), where m is the number of bytes in a payload in the testing dataset. Note that the byte embeddings are generated only once, in the training stage, and the feature vectors of payloads are computed from the pre-computed byte embeddings as packets arrive during the testing stage.

2) SCALABILITY
In the training stage, we learn byte embeddings from the payloads of the packets in the training dataset. In Section VI-A, we show that the byte embeddings generation process takes 5,300 to 508,000 ms for the different datasets. Byte embeddings model generation is executed offline and only once. Once we learn the byte embeddings, payload embedding vectors are computed online using (3) as new packets arrive. In Section VI-A, we show that payload embedding generation takes between 1.09 ms and 3.46 ms on average over the different datasets. Unlike Intrusion Prevention Systems (IPSs), Intrusion Detection Systems (IDSs) are typically not deployed inline. They passively monitor the network traffic flow and report anomalies in the form of notifications to the administrators. To handle ever-increasing network throughput, modern IDSs take advantage of parallel computing, multi-threading, and GPU-accelerated computing. PayloadEmbeddings can likewise process large amounts of payloads quickly as part of these modern IDSs. Hence, the proposed approach is scalable.

IV. DATASETS
Most IDS datasets contain only the headers of network packets. Many datasets using real traffic either do not have payloads or have had their payloads removed due to privacy and security reasons [17]. The lack of labeled datasets with real network traffic is a significant issue in payload-based anomaly detection research. As a result, most researchers have evaluated payload-based anomaly detection on private datasets. However, such datasets are difficult to collect and sometimes unavailable due to privacy concerns. Since the classification stage of PayloadEmbeddings is supervised, we need well-labeled datasets with raw payloads. In search of suitable datasets, we examined a wide range of IDS datasets. Our search criteria for appropriate datasets were raw, anomalous, and labeled payloads. Table 1 summarizes the features of the ten datasets that we use in this study. These features include a subset of the 15 important features for assessing intrusion detection datasets introduced by Ring et al. [17]: the year of creation, availability (whether the dataset is publicly available, restricted, or private), the format that the dataset is available in, the type of network traffic trace (real, emulated, or synthetic), the size of the dataset, whether the dataset is labeled, whether the dataset contains metadata, payload availability, whether the dataset is balanced, and the type of attacks. In the following, we briefly describe these datasets.

A. BOTNET
Beigi et al. created the Botnet [30] dataset in 2014 using three existing datasets: ISOT [31], ISCX 2012 [32], and CTU-13 [33]. The dataset contains various botnet attacks such as IRC bot, Virut, Black Hole, Neris, Menti, Rbot, Tbot, Murlo, Sogou, Weasel, Nsis, Zeus, and Zeroaccess. Malicious IP addresses are provided as ground truth for labeling the packets. The dataset is provided in a packet format, and payloads are present in the traffic capture. The dataset is publicly available in two parts: a training set and a test set. We use this dataset in our study because it contains labeled payloads.

B. CIC DoS
The Canadian Institute for Cybersecurity created the CIC DoS [34] dataset in 2012. The dataset contains eight different HTTP-based application-layer DoS attacks. The attack traffic was generated using Ddossim, Goldeneye, and Hulk. Normal traffic was taken from the non-attack traffic of the ISCX 2012 dataset. The dataset is recorded in packet format, labeled, and publicly available. Since it meets our criteria of labeled packets with payloads, we use it in this study.

C. CICIDS-2017
Sharafaldin et al. published the CICIDS-2017 [35] dataset. The dataset was created over a duration of 5 days in 2017. It is available in both packet and bi-directional flow-based formats and contains seven common, up-to-date families of attacks: Brute Force, Heartbleed, Botnet, DoS, DDoS, Infiltration, and Web Attack. The authors provide additional metadata about IP addresses and attacks. The dataset is well labeled and publicly available. We use the CICIDS-2017 dataset in our study, as it meets our dataset criteria.

D. CSIC HTTP 2010
The Information Security Institute of CSIC created the CSIC HTTP 2010 dataset [36]. The dataset contains automatically generated web requests. It has three files: a training set with normal traffic, a test set with normal traffic, and a test set with anomalous traffic. The dataset provides only payloads; the packet headers have been removed. Three types of anomalous requests are included: static attacks, dynamic attacks, and unintentional illegal requests. However, the payloads are labeled only as anomalous or normal, not with a specific attack type. The dataset is publicly available. As it meets our dataset criteria, we use it in our study.

E. CTU-13
Garcia et al. [33] created the CTU-13 dataset at CTU University in 2011. Attacks are executed using several botnets such as Neris, Rbot, Menti, Virut, NSIS, Murlo, and Sogou. The dataset is public and available in packet format. Instead of labeling anomalous packets with a class of attack, the authors divided the traffic into 13 different scenarios. However, the contributors removed the background traffic of normal packets due to privacy and security concerns. As a result, only attack packets are available. We use this dataset in our study in spite of the absence of normal payloads. To handle the lack of normal packets, we borrowed normal packets from CICIDS-2017. The details of the sampling process are discussed in Section V-A.

F. ECML/PKDD 2007
The dataset was collected from the ECML-PKDD [37] competition in 2007. The competition was about analyzing web traffic to detect or isolate attack patterns. The dataset contains seven different types of attacks and was generated by recording real traffic. It is publicly available in train and test subsets, provided in a packet-based format that contains payloads. As the packets of this dataset are labeled, we use it in our study.

G. ISCX 2012
Shiravi et al. [32] created the ISCX 2012 dataset. The dataset was captured in an emulated network environment and contains one week of traffic. α and β profiles were created to define attack and normal traffic, respectively. The dataset is publicly available in both packet-based and bidirectional flow-based formats. ISCX 2012 contains four types of attacks: DDoS, SSH Brute Force, Infiltration, and DoS. The dataset contains payloads, as the authors provide the packets in pcap files. We use this dataset in our study.

H. ISOT
Saad et al. [31] published the ISOT dataset in 2010. The malicious traffic was generated from the French chapter of the honeynet project. The dataset is publicly available in packet format. The authors provided the IP addresses of anomalous and normal traffic for labeling. We use this dataset in our study because it meets our criteria for labeled payloads.

I. NDSec-1
Beer et al. [38] created the NDSec-1 dataset in 2016. The synthetic dataset was captured in a packet-based format, and additional log files are provided by the authors. The attacks present in the dataset are brute force, DoS, and web attacks (XSS/SQL injection). NDSec-1 is available on request. The dataset is labeled and payloads are available. Hence, we use this dataset in our study.

J. UNSW-NB15
Moustafa et al. [39] created the UNSW-NB15 dataset in a small emulated environment. The network traffic is captured for more than 31 hours. The IXIA Perfect Storm tool was used to create normal and malicious traffic. It contains different types of attacks such as DoS, generic, exploits, backdoors, reconnaissance, shellcode, and worms. The dataset is publicly available in packet-based and flow-based formats. The authors provided separate train and test sets as part of the dataset. The payload is available in this dataset. We included the UNSW-NB15 dataset in our study.
In summary, we chose the presented ten datasets out of 34 datasets based on whether the dataset is in packet-based format, labeled, and contains anomalous payloads.

V. EXPERIMENTAL DESIGN
In this section, we provide a detailed experimental flow of PayloadEmbeddings. We divide our proposed approach into three stages: Data Preprocessing, Training, and Testing stages as depicted in Figure 3.

A. DATA PREPROCESSING STAGE
Step 0: We use ten datasets in our study: Botnet, CIC DoS, CICIDS-2017, CSIC HTTP 2010, CTU-13, ECML/PKDD 2007, ISCX 2012, ISOT, NDSec-1, and UNSW-NB15. First, we extract the network packets with payloads; packets that do not contain any payload data are discarded. Then, each packet is labeled using the ground truth provided within the datasets. We remove the packets with duplicate payloads based on the destination ports and labels. The packet counts of the resulting datasets are reported in Table 2. The left portion of Table 2 shows that all datasets are imbalanced before sampling. Imbalanced datasets may cause some classifiers to be biased towards the majority class [40]. To overcome this problem, we under-sample the majority classes: we randomly remove instances of the majority classes to balance the attack and normal traffic classes in the datasets. Note that the CTU-13 dataset does not have any normal packets. In order to balance this dataset, we chose normal packets from the CICIDS-2017 dataset; the normal packets used for CTU-13 and the normal packets used for CICIDS-2017 itself are disjoint. The packet count of each dataset after sampling is reported in the right portion of Table 2. After generating the sampled datasets, we divide each of the 10 datasets randomly into two equal parts: the training set and the testing set. The training set is used in Steps 1, 2, and 3 of the training stage. The testing set is used in Steps 4 and 5 of the testing stage.
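The under-sampling step described above can be sketched as follows (a simplified illustration with hypothetical packet identifiers, not the paper's actual preprocessing code):

```python
import random

def undersample(packets, labels, seed=42):
    """Randomly drop majority-class instances until the classes are balanced."""
    rng = random.Random(seed)
    attack = [p for p, y in zip(packets, labels) if y == 1]
    normal = [p for p, y in zip(packets, labels) if y == 0]
    minority = min(len(attack), len(normal))
    attack = rng.sample(attack, minority)   # keep a random subset of each class
    normal = rng.sample(normal, minority)
    sampled = [(p, 1) for p in attack] + [(p, 0) for p in normal]
    rng.shuffle(sampled)
    return sampled

data = undersample([f"pkt{i}" for i in range(100)], [1] * 80 + [0] * 20)
print(len(data))  # 40 packets: 20 attack + 20 normal
```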

B. TRAINING STAGE
Step 1: We generate byte embeddings for payloads by training a shallow neural network (described in Section III-A). We create multiple byte embeddings models for vector lengths from 5 to 50 with intervals of 5, using (1) and (2). During this byte embeddings generation process, we use a window size of 5, meaning that for any input byte, its context bytes are located at a distance of 5 or less. The window size is a hyper-parameter chosen after fine-tuning. We selected the CICIDS-2017 dataset for hyper-parameter tuning because it has a wide range of attack payloads after sampling (see Tables 1 and 2). Figure 4 illustrates the rationale behind using window size 5: all four evaluation metrics range between 95.1% and 95.4% across window sizes. However, increasing the window size also increases the computation time. Hence, we choose window size 5 to keep our approach computationally fast while not degrading performance.
Step 2: The next step after byte embeddings generation is to compute feature vectors, i.e., payload embeddings, for each payload. For each byte embedding vector length, we compute the payload embeddings using (3). The output of this step is multiple payload embeddings models.
Step 3: We train a k-Nearest Neighbours (kNN) classifier using the payload embeddings models created in Step 2. Note that kNN is non-parametric; it uses all the samples of the training data for classification. We experimented with different values of k ranging from 1 to 9. Additionally, we performed 10-fold cross-validation for every training dataset. The experimental results are shown in Figures 2-11 in the supplementary file. Analyzing the figures shows that the performance of PayloadEmbeddings slightly increases, decreases, or fluctuates when k is greater than 3. In fact, larger values of k present lower variance but higher bias. Therefore, we set k to 3 in our experiments.
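A minimal sketch of Step 3 with scikit-learn (which the implementation section names), using synthetic stand-ins for the 10-dimensional payload embeddings rather than real data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for payload embeddings: two well-separated clusters
# playing the roles of normal (0) and attack (1) packets.
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, as chosen above
scores = cross_val_score(knn, X, y, cv=10)  # 10-fold cross-validation
print(scores.mean())  # high accuracy on this well-separated toy data
```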

C. TESTING STAGE
Step 4: In the testing stage, we first generate feature vectors, i.e., payload embeddings, for each payload in the network packets, similar to Step 2, except that the embeddings are computed over the testing set.
Step 5: Finally, the trained kNN classifier uses the payload embeddings created in Step 4 to classify the payloads as anomalous or normal.
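Steps 4 and 5 together amount to embedding the unseen payloads and asking the fitted classifier for labels. A minimal sketch with hypothetical embeddings of vector length 10:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# hypothetical training-stage payload embeddings
X_train = np.vstack([rng.normal(0.0, 0.2, (30, 10)),
                     rng.normal(4.0, 0.2, (30, 10))])
y_train = ["normal"] * 30 + ["anomalous"] * 30
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Step 4: embeddings of two unseen payloads; Step 5: classification
X_test = np.array([[0.1] * 10, [3.9] * 10])
pred = knn.predict(X_test)
```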

D. IMPLEMENTATION DETAILS
We conduct the experiments of this study on a multi-core server with 32 processing cores at 2.5 GHz per core and 512 GB of RAM. We implement our experiments in Python 3.6. To generate byte embeddings, we convert the bytes into integers and use the gensim library; the window size, number of workers, and minimum count parameters are set to 5, 4, and 1, respectively. We use Python's sklearn library for kNN classification and cross-validation. To promote reproducibility in science, we share our implementations and supplementary file online.²

VI. EMPIRICAL VALIDATIONS
In this section, we first present the results of our proposed technique, PayloadEmbeddings, across all datasets. Next, we discuss the results via average byte frequency distributions, confusion matrices, and t-distributed Stochastic Neighbor Embedding (t-SNE) plots. Figure 5 shows the performance of our approach over the ten datasets used in this study. For each dataset, results are reported for vector lengths ranging from 5 to 50 with intervals of 5. Except for the ISOT and NDSec-1 datasets, performance rises sharply when the vector length is increased from 5 to 10; above 10, performance either declines or saturates. Our approach achieves its best performance at different vector lengths over different datasets: the Botnet, CIC DoS, CICIDS-2017, CSIC HTTP 2010, CTU-13, ECML/PKDD, ISCX-2012, ISOT, NDSec-1, and UNSW-NB15 datasets achieve their best performance (in terms of Accuracy, Precision, Recall, and F1-score) for vector lengths of 25, 50, 10, 45, 25, 40, 15, 10, 5, and 50, respectively. However, increasing the vector length from 10 to 50 either improves the performance of PayloadEmbeddings insignificantly (UNSW-NB15, CIC DoS, ECML/PKDD, CSIC HTTP, CTU-13, and Botnet) or degrades it (CICIDS-2017, ISOT, NDSec-1, and ISCX-2012). Therefore, we use vector length 10 throughout our experiments.

A. PERFORMANCE ANALYSIS
In the following we discuss the performance of our approach over ten datasets. PayloadEmbeddings achieves the highest performance on ISOT dataset. The optimal performance is obtained for vector length 10 with accuracy 99%, precision 99%, recall 99% and F1-score 99%. ISOT dataset has only one type of anomaly with payload, i.e., SMTP spam.
The confusion matrix in Table 3 shows that only 8 out of 8284 normal payloads have been misclassified as anomalous and 19 out of 8263 anomalous payloads have been misclassified as normal. To understand the reason behind our high performance on the ISOT dataset, we present the average byte frequencies of anomalous and normal payloads in Figure 6. The frequency of each byte is divided by the total number of bytes in a payload; then, the average byte frequency distribution is computed for normal and anomalous traffic in the dataset. Figure 6 shows that the average byte frequency of anomalous payloads is quite distinguishable from that of normal payloads: normal payloads have uniform-like average byte frequencies, whereas anomalous payloads have fluctuating average byte frequencies. The t-SNE plot for the ISOT dataset is shown in Figure 7. t-distributed Stochastic Neighbor Embedding (t-SNE) is a popular dimensionality reduction technique introduced by Maaten and Hinton [41]; it generates representative visualizations of high-dimensional data in two-dimensional space using non-linear transformations. For the t-SNE plots, we selected 500 random samples from each class to produce sample visualizations of clusters. The t-SNE plot in Figure 7 depicts that anomalous payloads are easily separable from normal payloads in the ISOT dataset. Figure 5 shows that, amongst the ten datasets, our approach performs the worst on the CTU-13 dataset: PayloadEmbeddings achieves 75.15% accuracy, 75.13% recall, and 74.79% F1-score when the vector length is 10. The CTU-13 dataset has 13 scenarios generated from different bots such as Neris, Rbot, Virut, Menti, Murlo, Sogou, and NSIS. Scenarios 1 and 2 include IRC, Spam, and Click-fraud attacks using the Neris bot. Scenario 3 includes IRC, Portscan, and US (compiled and controlled by the authors) attacks using Rbot.
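The Figure 6 statistic can be computed as follows (a sketch; the function name is ours, and the toy payloads are illustrative):

```python
import numpy as np

def avg_byte_frequency(payloads):
    """Normalize each payload's 256-bin byte histogram by its length,
    then average the distributions over the class."""
    dists = [np.bincount(np.frombuffer(p, dtype=np.uint8), minlength=256) / len(p)
             for p in payloads]
    return np.mean(dists, axis=0)

dist = avg_byte_frequency([b"AAAB", b"BBBB"])
```

Each class's curve sums to 1, so normal and anomalous traffic can be compared on the same scale.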
Scenario 4 includes IRC, DDoS, and US attacks also using Rbot. Scenarios 5 and 13 include Spam, Portscans, and HTTP attacks using Virut. Scenario 6 and 8 include only Portscans attacks using Menti and Murlo, respectively. Scenario 7 includes HTTP attacks using Sogou. Scenario 9 includes IRC, Spam, Click-fraud, and Portscans attacks using Neris. Scenarios 10 and 11 include IRC, DDoS, and US attacks using Rbot. Scenario 12 includes P2P attacks using NSIS. The authors provide trace files and ground truth for each of the scenarios separately. However, the labels only indicate the type of bot used to generate anomalous packets instead of the label of each anomalous class. As a result, the confusion matrix in Table 4 has scenario numbers instead of particular attack classes. Also, anomalous packets  from scenarios 3, 10, and 11 do not have payloads. So, we discarded the packets from these scenarios.
By analyzing the confusion matrix in Table 4, we can deduce that our approach classifies anomalous traffic from scenarios 1 and 5 with 100% accuracy. Normal traffic has been classified with 86% accuracy. Anomalous traffic from scenarios 9, 2, 13, and 7 has mostly been misclassified. Scenario 9 has been misclassified as scenario 13 and as normal at rates of 22% and 43%, respectively. Scenario 2 has been misclassified as scenario 4 (29%), scenario 13 (14%), scenario 9 (14%), and normal (14%). Scenarios 13 and 7 have most often been misclassified as normal traffic, at 48% and 30%, respectively. Figures 8 and 9 explain the higher misclassification rates we observe on the CTU-13 dataset. Figure 8 shows that the average byte frequencies of normal and anomalous traffic follow similar patterns, which makes it difficult for PayloadEmbeddings to differentiate them. Figure 9 further shows that the attack classes and normal traffic overlap with each other, making it difficult for the classifier to define an optimal decision boundary. As a result, the performance of our approach suffers.

1) COMPUTATIONAL OVERHEAD
In addition to the time complexity of PayloadEmbeddings presented in Section III, we present its physical execution times. We summarize the execution times of PayloadEmbeddings in Table 5 for vector length 10 and k = 3. Our proposed approach takes 5300 ms to 508000 ms to generate the byte embeddings for the different training datasets; the large variation is due to the number and size of the payloads in the training datasets. Feature computation and classification tasks are carried out in the testing stage. Feature computation refers to computing the payload embeddings from the trained byte embeddings. We compute the average time (in milliseconds) to generate the payload embeddings of each payload and report it in the feature computation step of the testing stage in Table 5. The ISOT dataset is the fastest and the ECML/PKDD dataset is the slowest among all datasets in terms of creating feature vectors for each payload. We employed 10-fold cross-validation to measure and report the performance metrics (as described in Section V-B). On the other hand, we designed a 70%:30% split experiment to measure the average execution time of the classification task per packet. The kNN classifier takes only 0.058 to 0.113 ms per packet, on average, to label it as anomalous or non-anomalous. In addition, we report the execution times of other state-of-the-art methods on the CICIDS-2017 dataset in Table 6. Training time refers to the time needed to build a trained model from the training set of the dataset; testing time refers to the time needed to generate features for and classify an incoming packet. All values are reported in milliseconds.
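A per-packet classification timing measurement like the one in Table 5 can be sketched as follows (synthetic embeddings; the absolute numbers depend on hardware, so no particular value should be expected):

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# stand-in payload embeddings (vector length 10)
X_train = rng.normal(size=(1000, 10))
y_train = rng.integers(0, 2, 1000)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

X_test = rng.normal(size=(200, 10))
t0 = time.perf_counter()
knn.predict(X_test)  # batch prediction amortizes Python overhead
per_packet_ms = (time.perf_counter() - t0) * 1000 / len(X_test)
```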

B. COMPARATIVE ANALYSIS
In the following, we compare the PayloadEmbeddings model with other techniques. We divide the previous studies into three categories: traditional feature extraction based, neural network based, and other state-of-the-art techniques. First, we present a brief overview of the comparison methods together with the details of their experimental settings. Then, we discuss the performance of PayloadEmbeddings against these methods. We summarize the comparative analyses with respect to accuracy, precision, recall, and F1-score over all datasets in Table 7.

1) TRADITIONAL FEATURE EXTRACTION TECHNIQUES
We categorize PAYL [4], McPAD [6], and HMMPayl [9] as traditional feature extraction techniques. According to Table 7, PayloadEmbeddings outperforms PAYL on 8 out of 10 datasets based on accuracy and F1-score, on 7 out of 10 datasets based on precision, and on 9 out of 10 datasets based on recall. The main disadvantage of PAYL is that it does not learn context information from payloads, which is captured by PayloadEmbeddings. Ariu et al. proposed HMMPayl [9], which also applies the n-gram technique to the payloads of normal traffic. A sliding window of n bytes is used to extract sequences from the payload; a subset of these sequences is selected randomly and passed to a Hidden Markov Model (HMM). The authors set up 5 HMM models. In the training stage, each HMM model assigns a probability to the sequences of normal traffic. In the testing phase, the probability estimates of the HMM models are combined using non-trainable combiners; the authors use four combiners: the maximum, the minimum, the mean, and the geometric mean. The combined probability scores are checked against a threshold value to classify a payload as anomalous or normal. According to Table 7, PayloadEmbeddings outperforms HMMPayl on only 5 out of 10 datasets based on accuracy and 6 out of 10 datasets based on precision. However, for the other metrics PayloadEmbeddings shines against HMMPayl: our proposed method outperforms HMMPayl on 9 out of 10 datasets based on recall and F1-score. Similar to PayloadEmbeddings, HMMPayl considers the full payload during feature extraction; as a result, HMMPayl achieves better accuracy and precision than the other comparison methods.
a: AEIDS
AEIDS [12] uses a deep learning architecture, the Autoencoder, to detect low-rate attacks as outliers in a dataset. It is an unsupervised approach trained on normal traffic only. An autoencoder is a neural network that sets its target values equal to its inputs. The authors use the autoencoder together with a statistical thresholding approach to identify outliers via the reconstruction error on normal traffic. They use a modified z-score to calculate the threshold: the reconstruction error of new incoming traffic is transformed into a modified z-score, and the traffic is labeled as anomalous if its modified z-score exceeds 3.5.
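For reference, the modified z-score in its commonly used median/MAD form (Iglewicz and Hoaglin) looks like the sketch below; AEIDS's exact formula may differ, and the reconstruction errors here are made up:

```python
import numpy as np

def modified_z_scores(errors):
    """Modified z-score over reconstruction errors: robust to outliers
    because it uses the median and the median absolute deviation (MAD)."""
    errors = np.asarray(errors, dtype=float)
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return 0.6745 * (errors - med) / mad

errors = [0.10, 0.12, 0.11, 0.09, 0.13, 2.50]  # last value is an outlier
flags = np.abs(modified_z_scores(errors)) > 3.5
```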

b: PayloadEmbeddings VS. AEIDS
Our proposed method, PayloadEmbeddings, outperforms AEIDS on 7 out of 10 datasets based on accuracy, on 9 out of 10 datasets based on precision and F1-score, and on all datasets based on recall. The major reason behind these results is that AEIDS considers only byte frequency distributions to compute reconstruction errors. Additionally, as an unsupervised approach, it does not learn anomalous traffic patterns in the training phase; thus, it suffers during the testing and/or detection phases.

c: HAST
HAST [11] uses deep convolutional neural networks (CNNs) to learn low-level spatial features and long short-term memory (LSTM) networks to learn high-level temporal features from network traffic. HAST has two architectures: HAST-I takes network flows as input and uses only a CNN to learn spatial features, whereas HAST-II takes network packets as input and uses a combination of CNN and LSTM to learn spatial-temporal features. We implemented the HAST-II architecture for a fair comparison, as PayloadEmbeddings uses packet-based data. In HAST-II, each packet is transformed into a two-dimensional image using one-hot encoding (OHE): if the OHE vector is m-dimensional, the first n bytes of a network packet are transformed into an m × n two-dimensional image. The preprocessed data is fed into the HAST-II architecture, which produces a vector as output; a softmax classifier on the final vector identifies the input traffic as normal or anomalous. According to Table 7, PayloadEmbeddings outperforms HAST on 8 out of 10 datasets based on accuracy, on 9 out of 10 datasets based on precision, and on all datasets based on recall and F1-score. HAST considers only the first 100 bytes of each packet and therefore misses valuable patterns lying in the rest of the payload. As PayloadEmbeddings considers the full length of payloads, it yields better classification performance.
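The HAST-II preprocessing described above can be sketched as follows, with m = 256 one-hot rows and n = 100 byte columns; zero-padding short packets is our assumption:

```python
import numpy as np

def packet_to_image(packet, n=100, m=256):
    """One-hot encode the first n bytes of a packet into an m x n
    matrix; columns beyond the packet length stay all-zero."""
    img = np.zeros((m, n), dtype=np.float32)
    for col, byte in enumerate(packet[:n]):
        img[byte, col] = 1.0
    return img

img = packet_to_image(b"GET /")
```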

e: PL-RNN
Liu et al. [13] proposed two deep learning-based models: PL-CNN and PL-RNN. We use the PL-RNN model, as it outperforms PL-CNN in their reported results. For the PL-RNN model, the authors employ an LSTM (Long Short-Term Memory) network consisting of one hidden layer with 128 hidden states. The first n characters of the payloads are used to train the model, where the input size n is set to the average payload length in a particular dataset. Finally, the softmax function in the output layer classifies the payloads.

Goodman et al. [24] proposed Packet2Vec, which utilizes a shallow neural network, i.e., Word2Vec, to generate packet vectors. The authors use n-grams to create a dictionary. With n = 2, the vocabulary size reaches 65,536 (2^16). To reduce the complexity, Packet2Vec limits the vocabulary to the |v| most frequent n-grams, where |v| = 50,000. The vector length of each packet vector is set to 128. Finally, the packet vectors are used as features and classified using a Random Forest (RF).
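Packet2Vec's vocabulary pruning can be sketched as counting byte bigrams and keeping only the most frequent ones (the helper name is ours; the toy packets are illustrative):

```python
from collections import Counter

def top_bigrams(packets, vocab_size=50000):
    """Count byte 2-grams over all packets and return the
    vocab_size most frequent ones as (byte, byte) tuples."""
    counts = Counter()
    for p in packets:
        counts.update(zip(p, p[1:]))  # overlapping byte pairs
    return [gram for gram, _ in counts.most_common(vocab_size)]

vocab = top_bigrams([b"ABAB", b"ABBA"], vocab_size=2)
```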

h: PayloadEmbeddings VS. Packet2Vec
Our proposed method outperforms Packet2Vec on 9 out of 10 datasets based on accuracy and recall, on all datasets based on F1-score, and on 8 out of 10 datasets based on precision. However, Packet2Vec achieves the highest precision on the CTU-13 dataset. Although Packet2Vec also employs Word2Vec, it differs from PayloadEmbeddings in vocabulary size and vector length. Its n-grams (n = 2) were not able to capture contextual information from bytes as well as PayloadEmbeddings does.

3) OTHER STATE-OF-THE-ART TECHNIQUES
We categorize CBID [23], EsPADA [22], and OCPAD [7] as other state-of-the-art techniques. Based on accuracy, EsPADA performs the best on the ISCX-2012 dataset among all datasets. However, PayloadEmbeddings outperforms EsPADA on 8 out of 10 datasets based on accuracy, and the recall and F1-scores show that our method performs better than EsPADA on all datasets but one. EsPADA constructs both normal and adversarial models; it relies on the n-grams of payloads, which fall short of capturing the contextual relationships between bytes.

e: OCPAD
OCPAD [7] generates a feature vector for each payload using the n-gram technique with n = 3. The minimum and maximum occurrence probability of each n-gram is calculated and stored in a Probability Tree. Only normal payloads are used in the training phase. In the testing phase, incoming packets are detected as anomalous if the occurrence probabilities of their n-grams fall outside the ranges stored in the Probability Tree. OCPAD achieves the highest accuracy on the ISCX-2012 dataset among all the methods. However, PayloadEmbeddings outperforms OCPAD on 9 out of 10 datasets based on accuracy, recall, and F1-score, and on all but the CTU-13 and ISCX-2012 datasets based on precision. OCPAD's Probability Tree is constructed from the n-grams of normal payloads only, which falls short of detecting different types of attack patterns correctly.
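A simplified stand-in for OCPAD's detection logic (a flat dictionary of per-3-gram probability ranges; the real structure is a Probability Tree, and the helper names are ours):

```python
def build_ngram_ranges(normal_payloads, n=3):
    """Record the min/max per-payload occurrence probability of each
    n-gram seen in normal traffic."""
    ranges = {}
    for p in normal_payloads:
        grams = [bytes(p[i:i + n]) for i in range(len(p) - n + 1)]
        total = len(grams)
        for g in set(grams):
            prob = grams.count(g) / total
            lo, hi = ranges.get(g, (prob, prob))
            ranges[g] = (min(lo, prob), max(hi, prob))
    return ranges

def is_anomalous(payload, ranges, n=3):
    """Flag a payload whose n-grams are unseen or whose occurrence
    probabilities fall outside the recorded ranges."""
    grams = [bytes(payload[i:i + n]) for i in range(len(payload) - n + 1)]
    total = len(grams)
    for g in set(grams):
        if g not in ranges:
            return True
        lo, hi = ranges[g]
        if not (lo <= grams.count(g) / total <= hi):
            return True
    return False
```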
It is deducible from the comparative analysis that no single method outperforms all others over all datasets. Depending on the byte distributions and the relevance of contextual information, some methods perform better on some datasets. Nevertheless, PayloadEmbeddings outperforms all other techniques on the CIC DoS, CICIDS-2017, ISOT, and UNSW-NB15 datasets based on all four evaluation metrics, and on the Botnet, ECML/PKDD, and ISCX-2012 datasets based on F1-score. Hence, PayloadEmbeddings can be recommended for these seven datasets, where it achieves high performance compared to existing state-of-the-art methods.

VII. LIMITATIONS
PayloadEmbeddings considers only the payloads for anomaly detection and ignores the packet headers. As a result, the proposed method remains vulnerable to attacks that manifest only in packet headers, such as scanning or probing attacks. On the other hand, it can effectively augment existing header-based intrusion detection systems. In addition, PayloadEmbeddings may require retraining as the nature of normal and/or attack traffic patterns changes over time.

VIII. CONCLUSION
In this paper, we propose a payload-based intrusion detection model, PayloadEmbeddings, using vector representations of the bytes in network packet payloads. The proposed model extends the word2vec model from the Natural Language Processing domain. We generate payload embeddings from the vector representations of bytes; these embeddings are then fed to a k-Nearest Neighbors (kNN) classifier to label a network packet as anomalous or non-anomalous. We evaluated our approach over ten datasets selected from the 34 datasets that we assessed. Our experimental results show that PayloadEmbeddings performs well on the ISOT, UNSW-NB15, CICIDS-2017, and CIC DoS datasets with at least 92% accuracy, and achieves above-75% performance figures on the other datasets. Lastly, we compared our approach to ten other state-of-the-art and traditional feature extraction techniques for intrusion detection. We found that no single technique outperforms all others on all datasets; nevertheless, we showed that our approach outperforms the other techniques in terms of accuracy, precision, recall, and F1-score over most of the datasets. We also share our implementations and supplementary file online.

MEHMET ENGIN TOZAL received the Ph.D. degree in computer science from the University of Texas at Dallas, in 2012. He is currently a Francis Patrick Clark/BORSF II Endowed Associate Professor with the School of Computing and Informatics, University of Louisiana at Lafayette, and a member of the Informatics Program. His research interests include complex systems, network security, data/graph analytics, health informatics, Internet topology mapping and modeling, Internet security and reliability, and large-scale graph sampling, summarization and visualization for real-world complex systems.
VIJAY RAGHAVAN is the Alfred and Helen Lamson Endowed Professor in computer science with the Center for Advanced Computer Studies, School of Computing and Informatics, University of Louisiana at Lafayette. His research interests include information retrieval and extraction, data and web mining, multimedia retrieval, data integration, and literature-based discovery. He has published around 275 peer-reviewed research papers. These and other research contributions cumulatively accord him an h-index of 37, based on Google Scholar citations to his publications. He has served as a major adviser for 29 doctoral students and has garnered over $13 million in external funding. He brings substantial technical expertise, interdisciplinary collaboration experience, and management skills to his projects. His service work at the university has included coordinating the Louis Stokes-Alliance for Minority Participation (LS-AMP) program since 2001. He has served as the PC chair, the PC co-chair or a PC member for countless ACM and IEEE conferences.