Model for Detection of Masquerade Attacks Based on Variable-Length Sequences

A masquerader is an attacker who gains illegitimate access to a user's account. Masquerade detection is one of the key problems of intrusion detection systems. Deep learning models that obtained state-of-the-art results in masquerade detection have failed to exhibit very high detection performance when data samples contain limited information. Alternatively, computationally cheaper and more memory-efficient traditional machine learning models suffer from less robust features, which hinders them in achieving high detection performance. The contributions of this article are as follows: we introduce new features of variable-length UNIX command sequences (i.e., weighted occurrence frequencies of different orders) and integrate these features into an extended Markov-chain-based variable-length model. The detection performance of our model is evaluated on three publicly available and free datasets: Schonlau (SEA), Purdue (PU), and Greenberg. The results demonstrate that our model significantly improves the true positive rate (TPR), false positive rate, receiver operating characteristic, and threshold variance compared to the baselines (other Markov-chain-based variable-length models). Furthermore, in terms of the TPR, the proposed method is superior to a state-of-the-art deep learning model that uses a convolutional neural network on the PU and Greenberg datasets and to a state-of-the-art sequence-alignment hidden Markov model on the SEA dataset. Moreover, the proposed method is much more lightweight than the state-of-the-art models in terms of computational and memory complexity, and thus more suitable for real-time masquerade detection.


I. INTRODUCTION
The Internet is an indispensable part of people's lives and has given rise to the rapid growth of e-commerce and online social networks. Owing to the increasing number of users, problems with user trustworthiness and security are becoming increasingly important. An intrusion detection system (IDS) is a software-based application or hardware device that is used to identify malicious behavior in a network [1], [2]. The most common types of attack include masquerading, malware, spyware, denial of service, probe, user-to-root (U2R), and remote-to-user (R2L). In this study, we focused on the problem of masquerade detection. A masquerader is an attacker who gains illegitimate access to a user's account. In other words, masqueraders impersonate legitimate users [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Mohamed Elhoseny.
For example, masqueraders can be insiders of a system who can cause severe harm, such as stealing valuable confidential information or disrupting data integrity. Therefore, masquerade detection is very important for the security and trustworthiness of a system.
To build an IDS, some authors applied traditional machine learning (ML) models, while others used deep learning (DL) models. Such models are used in IDS for misuse detection, anomaly detection, or a combination of these approaches. For misuse detection, assumptions are made about the attacker's strategy, and the training data need to be labeled; hence, supervised learning methods are employed. The main disadvantage of the misuse detection approach is the difficulty of adapting to dynamic changes in attacking behavior. If an attacker's strategy changes, then the detection approach needs to be updated [4]-[6]. In this study, therefore, we opted for the anomaly detection approach. Anomaly detection focuses on finding questionable behavior patterns that deviate from those expected from the data [7]. In the training stage, features are extracted to model normal user behavior. In the detection stage, an audited user's behavior is classified as normal or anomalous (or doubtful, as done by Dash et al. [8]) based on its deviation from the modeled normal behavior.
Many authors have proposed various methods based on anomaly detection for masquerade detection using UNIX commands. Most methods use three publicly available and free UNIX command-line-based datasets: Schonlau (SEA) [9], Purdue (PU) [10], and Greenberg [11]. Two data configurations are used in the literature: truncated and enriched. The truncated data configuration indicates that a dataset contains only command names. In contrast, the enriched data configuration indicates that a dataset consists of command names, parameters, flags, shell metacharacters, and information about the start and end of user sessions.
Schonlau et al. [9] released the SEA dataset with only the truncated data configuration. In contrast, the PU and Greenberg datasets were released with the enriched data configurations. In the literature, PU and Greenberg datasets were used not only with the enriched configuration, but also with the truncated configuration. We name the datasets with the respective data configurations as PU Enriched, PU Truncated, Greenberg Enriched, and Greenberg Truncated.
Most of the related papers used the data settings that yield a normal test set and masquerade test set, as proposed by Schonlau et al. [9] for SEA (SEA Full), Maxion [12] for Greenberg, and Lane and Brodley [10] for PU. We named these data settings as ''Full.'' With Full data settings, PU Truncated, PU Enriched, Greenberg Truncated, and Greenberg Enriched are denoted by PU Truncated Full, PU Enriched Full, Greenberg Truncated Full, and Greenberg Enriched Full, respectively. We describe these datasets with the associated data settings in detail in Section III-A.
The previous work exhibited relatively high detection performance on the datasets with the enriched data configuration but suffered from significantly worse results with the truncated data configuration. DL models that obtained state-of-the-art results in masquerade detection have failed to exhibit very high detection performance when data samples contain limited information. Alternatively, traditional ML models that are more computationally and memory efficient suffer from less robust features, which hinders them in achieving higher detection performance. Therefore, in this study, we focused on experimenting on the datasets with the truncated data configuration only, that is, when the data samples contain limited information.

A. CONTRIBUTIONS
In this article, we make the following five contributions: • We introduce new features of variable-length sequences (i.e., weighted occurrence frequencies of different orders; see Definition 6).
• We extend the Markov chain-based variable-length model of Xiao et al. [4] and integrate the new features into the model (see Sections II-B.5-II-B.7).
• On three different datasets (SEA, PU Truncated, and Greenberg Truncated) with two data settings, our model outperforms the baselines (Xiao et al. [4]) on several metrics, such as the true positive rate (TPR), false positive rate (FPR), receiver operating characteristic (ROC), and threshold variance.
• Our model achieved significant improvement on the TPR metric compared to a state-of-the-art convolutional neural network (CNN) [13] for PU Truncated Full and Greenberg Truncated Full, and a state-of-the-art sequence-alignment hidden Markov model (SA-HMM) [14] for SEA Full.
• Our model requires much less computation and memory usage than the state-of-the-art models and is therefore more suitable for masquerade detection in real time.

B. RELATED WORK
1) IDS BASED ON TRADITIONAL ML MODELS
Oka et al. [19] used the Eigen co-occurrence matrix (ECM) to model normal user behavior and performed masquerade detection based on the inner product between the ECMs of a normal user and the monitored user. For SEA Full, their method obtained a TPR of 72.3% and an FPR of 2.5%. The main drawback of their approach is that calculating the ECM is computationally expensive and uses large amounts of memory. By contrast, Kim and Cha [3] proposed a more lightweight approach by introducing the notion of common commands as a feature and trained a support vector machine (SVM) with a voting engine as a classifier for masquerade detection. They achieved a TPR of 80.1% and an FPR of 9.7% for SEA Full, and a TPR of 94.8% and an FPR of 0% for SEA1v49. Furthermore, for Greenberg Enriched Full, they obtained a TPR of 87.3% and an FPR of 6.4%, whereas for Greenberg Truncated Full they obtained a TPR of 71.1% and an FPR of 6%. To obtain more balanced accuracy than that of Kim and Cha [3], and to reduce the calculation burden of Oka et al.'s [19] ECM, Chen and Aritsugi [20] combined the ECM with an SVM using online updating. They tested their model for one-class (TPR = 62.77%, FPR = 6%) and two-class (TPR = 72.24%, FPR = 3%) classification for SEA Full. Furthermore, Li et al. [21] first extracted the principal features of user behavior from a correlation Eigen matrix by principal component analysis (PCA) and then fed these features into an SVM-based detection system. For SEA Full, they achieved the best TPR (82.6%, with an FPR of 3%) among the SVM-based models reported in the literature. Nonetheless, one disadvantage of the SVM-based models is that they did not demonstrate very good results for SEA Full and Greenberg Truncated Full, and their performance for PU Truncated Full has not been reported in the literature.
Lane and Brodley [22] used two approaches for masquerade detection: Hidden Markov Model (HMM) and instance-based learning. Their HMM-based method obtained a TPR of 62% and FPR of 1.5% for SEA Full. By contrast, Okamoto et al. [23] proposed an immunity-based HMM and obtained a TPR of 60% and FPR of 1% for SEA Full. To enhance the detection performance of an HMM-based approach, Huang and Stamp [24] exploited the positional information of users and proposed a profile HMM (PHMM) that achieved a TPR of 70% and FPR of 5% for SEA Full. Kholidy et al. [25] proposed the data-driven semi-global alignment (DDSGA) approach. In the training phase, a given user's sequence alignment parameters were calculated. In the detection phase, discovering several misalignment sequences indicated masquerade activities. DDSGA achieved a TPR of 88.4% and FPR of 1.7% for SEA Full. To further improve the detection performance, Qiu et al. [14] combined a sequence alignment (SA) module with an HMM and named it SA-HMM. The SA module mitigated the problem of variations in user activity sequences, and the HMM leveraged the positional information between the observations of users. For SEA Full, SA-HMM achieved a state-of-the-art TPR of 94.1%. However, there is still room for improvement. Furthermore, it is worth mentioning that HMM-based models are computationally costly.
In [9], Schonlau et al. applied a Bayes one-step Markov chain approach and obtained a TPR of 69.3% and FPR of 6.7% in detecting masquerade attacks for SEA Full. In the training stage, transition probabilities were calculated from one command to the next. In the detection stage, they checked whether the observed transition probabilities were normal or anomalous. By contrast, Ju and Vardi [26] explored a hybrid multistep Markov chain for profiling normal transition probabilities between commands. They used maximum likelihood estimation to detect masquerade activities obtaining a TPR of 49.3% and FPR of 3.2% for SEA Full. Nonetheless, the TPR results of the methods used in [9] and [26] are low. Tian et al. [27] used variable-length UNIX command sequences to mine user behavioral patterns and employed similarity measure as a classifier examining each command and each sequence of commands. In [28], the authors continued the work of Tian et al. [27], and considered the transition relations between adjacent commands to develop an approach using Markov chains. Although it achieved better detection results, it had several disadvantages, such as too many states, a high computational cost, poor fault tolerance, and poor generalization. To address these problems, Xiao et al. [4] improved upon Tian et al.'s work [27] by significantly reducing the number of states, which decreased memory consumption and computation. Their model used weighted frequencies of variable-length sequences as a feature. Compared with previous Markov chain-based models, it achieved the best results for SEA Full with a TPR of 90% and FPR of 5%. However, the authors did not report experimental results for PU Truncated Full and Greenberg Truncated Full.
In [29], Geng et al. proposed a computationally cheap model for online masquerade detection. The model exploited N-grams to mine normal user sequences and performed the classification based on square root term frequencyinverse document frequency, which was obtained from natural language processing (NLP). The method obtained a TPR of 93.33% and FPR of 8.8% for SEA Full.
Elmasry et al. [30] conducted massive experiments with various DL models, including a Deep Neural Network (DNN), CNN [13], and Long Short-Term Memory (LSTM). The CNN was leveraged as a text classification model, framing the masquerade detection problem as a text classification task. Although the CNN achieved state-of-the-art results for SEA1v49, PU Enriched Full, Greenberg Enriched Full, PU Truncated Full, and Greenberg Truncated Full on multiple evaluation metrics, the TPR results were not very high for SEA Full (87.7%), PU Truncated Full (81.6%), and Greenberg Truncated Full (83.5%). Thus, achieving a higher TPR for SEA Full, PU Truncated Full, and Greenberg Truncated Full, is still a great challenge. Furthermore, the CNN requires intensive computation and large memory usage. Table 1 lists the best results of the previous work.

2) IDS BASED ON DL MODELS
Compared to traditional ML models, DL methods have shown impressive results for IDS at detecting DoS, probe, U2R, and R2L attacks [31]- [35]. Kim et al. [31] applied the LSTM architecture and trained the model with the KDD Cup 1999 dataset. They empirically verified that the DL approach is effective for IDS. Yin et al. [32] used RNNs and developed the model RNN-IDS, which outperformed other ML models, including J48, artificial neural network, random forest (RF), and SVM at both binary and multiclass classification. Meng et al. [33] presented a method based on kernel PCA and LSTM-RNN by integrating data preprocessing, feature extraction, and attack detection into an end-to-end detection system. For the NSL-KDD dataset, their system substantially outperformed several attack detection strategies that use the SVM, neural network, and Bayesian methods. Le et al. [34] explored a model with several components, including an RNN, LSTM, and gated recurrent unit, and discussed which component gave the best result. Abdulhammed et al. [35] employed variational autoencoders and PCA for feature dimensionality reduction. Subsequently, using these features they performed the classification by investigating a variety of ML models, such as RF, Bayesian network, linear discriminant analysis, and quadratic discriminant analysis.
As mentioned in Section I-A, our proposed model extends the Markov chain-based variable-length model of Xiao et al. [4]. Our work significantly differs from the previous work. The main differences are as follows. First, we introduce new features of variable-length sequences, namely the vth-order weighted frequencies. Second, we integrate these features into an extended model whose state space has N^V + 1 states (N^2 + 1 for V = 2), in place of the N + 1 states of Xiao et al.'s [4] model. In Section IV, we verify empirically that integrating these features boosts our proposed model's detection performance compared to the baselines. Moreover, as measured by the TPR metric, our proposed model exceeds the state-of-the-art CNN [13] and SA-HMM [14] systems. Because our model uses less memory and is computationally less intensive than the state-of-the-art models, it is better suited to real-time masquerade detection.
This article is organized as follows. Section II describes the proposed method. Section III provides the experimental settings. Section IV presents the experimental results and discussion. Section V provides the assessment of the computational and memory complexity. Section VI concludes this article and discusses future work.

II. PROPOSED METHOD A. MARKOV CHAIN
In this section, we recall some definitions related to the Markov processes from [36].
Definition 1: A discrete random process {σ_n, n ≥ 1} is said to be a Markov chain with the state space Θ = {1, 2, . . .} if it satisfies the Markov property. That is, for all n ≥ 1 and θ_m ∈ Θ with 1 ≤ m ≤ n + 1,

Pr(σ_{n+1} = θ_{n+1} | σ_n = θ_n, σ_{n−1} = θ_{n−1}, . . . , σ_1 = θ_1) = Pr(σ_{n+1} = θ_{n+1} | σ_n = θ_n).

Definition 2: A Markov chain {σ_n, n ≥ 1} is homogeneous if the conditional probability Pr(σ_{n+1} = j | σ_n = i) does not depend on n for all i, j ∈ Θ.

Definition 3: Let {σ_n, n ≥ 1} be a homogeneous Markov chain, p_ij = Pr(σ_{n+1} = j | σ_n = i), and a_i = Pr(σ_1 = i) for all i, j ∈ Θ. P = [p_ij] is the matrix of the conditional probabilities p_ij and is called the (one-step) transition (probability) matrix. A = [a_i] is a row vector representing the probability mass function of σ_1 and is called the initial (probability) distribution of the Markov chain.
Theorem 1: Let {σ_n, n ≥ 1} be a homogeneous stationary Markov chain and 1 ≤ m ≤ n. Then,

Pr(σ_m = θ_m, σ_{m+1} = θ_{m+1}, . . . , σ_n = θ_n) = a_{θ_m} p_{θ_m θ_{m+1}} p_{θ_{m+1} θ_{m+2}} . . . p_{θ_{n−1} θ_n}.

Fig. 1 presents the architecture of the proposed masquerade detection system. We discuss the details in the next sections.
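Theorem 1 factorizes the probability of a state sequence into the initial probability times a product of one-step transition probabilities. A minimal sketch of this factorization, using a hypothetical two-state chain (the values of A and P are illustrative only):

```python
def sequence_probability(A, P, states):
    """Pr(sigma_m = states[0], ..., sigma_n = states[-1]) for a homogeneous
    stationary Markov chain: a_{theta_m} * p_{theta_m, theta_{m+1}} * ...
    * p_{theta_{n-1}, theta_n} (Theorem 1)."""
    prob = A[states[0]]                 # initial probability of the first state
    for i, j in zip(states, states[1:]):
        prob *= P[i][j]                 # multiply one-step transition probabilities
    return prob

# Hypothetical two-state chain for illustration.
A = [0.6, 0.4]                          # initial distribution
P = [[0.7, 0.3], [0.2, 0.8]]            # one-step transition matrix
print(sequence_probability(A, P, [0, 1, 1]))  # 0.6 * 0.3 * 0.8 ≈ 0.144
```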

B. TRAINING STAGE 1) PREPROCESSING THE DATA
The training data of a normal user is preprocessed into the sequence of command names B = (b_1, . . . , b_r), where b_j ∈ Vocab, 1 ≤ j ≤ r, and Vocab is the vocabulary of command names.

2) EXTRACTING VARIABLE-LENGTH SEQUENCES
We extract variable-length sequences and streams of sequences. The definitions are provided below.
Let l = {l(1), l(2), . . . , l(W)} be the set of W lengths of the variable-length sequences, where W ∈ N (N = {1, 2, . . .} is the set of natural numbers), l(k) ∈ N (1 ≤ k ≤ W), and l(1) < l(2) < . . . < l(W). We define the variable-length sequences generated from the training data B as

S^k_q = (b_q, b_{q+1}, . . . , b_{q+l(k)−1}), 1 ≤ q ≤ r − l(k) + 1.

The length of the sequence S^k_q is equal to l(k) ∈ l. Then, we generate the streams of sequences

S^k = (S^k_1, S^k_2, . . . , S^k_{r−l(k)+1}), 1 ≤ k ≤ W.

We assumed that r > l(W) − 1 to enable the extraction of S^1, S^2, . . . , S^W. A larger W means better detection performance, but more memory and computation are required.
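The extraction step above is a sliding window of each length l(k) over the command sequence. A minimal sketch (the command names are the running example used later in this section):

```python
def extract_streams(B, lengths):
    """For each l(k) in `lengths` (l(1) < ... < l(W)), slide a window of
    size l(k) over the command sequence B, producing the stream
    S^k = (S^k_1, ..., S^k_{r-l(k)+1}) of variable-length sequences."""
    r = len(B)
    return {lk: [tuple(B[q:q + lk]) for q in range(r - lk + 1)]
            for lk in lengths}

B = ["ci", "ci", "ci", "ci", "cd", "ls", "cd", "ls", "cd"]
streams = extract_streams(B, [2, 3])
print(len(streams[2]), len(streams[3]))  # 8 7  (r - l(k) + 1 sequences each)
```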

3) CALCULATING WEIGHTED OCCURRENCE FREQUENCIES OF DIFFERENT ORDERS
We calculate weighted occurrence frequencies of different orders, which are defined below.

Definition 4: Let n^k_q be the number of occurrences of a sequence S^k_q in S^k (1 ≤ k ≤ W, 1 ≤ q ≤ r − l(k) + 1). Then, the occurrence frequency of S^k_q in S^k is defined as follows:

f^k_q = n^k_q / (r − l(k) + 1).

Definition 5: For a given set of weights {e_1(k), 1 ≤ k ≤ W}, we calculate the weighted occurrence frequency of the first order as follows:

wf^k_q = e_1(k) f^k_q.

Note that the weights {e_1(k)} are chosen empirically (see Section III-D) and must satisfy the condition 1 ≤ e_1(1) ≤ e_1(2) ≤ . . . ≤ e_1(W), which is justified in Remark 1 (Section II-B.6). For brevity, we use the term ''weighted frequency'' instead of ''weighted occurrence frequency'' in this article. We generalize Definition 5 to extract more features by introducing the concept of the vth-order weighted frequency in the following definition.
Definition 6: For a given set of weights {e_v(k), 1 ≤ k ≤ W, 1 ≤ v ≤ V}, where e_v(k) ≥ 1 and V is the number of orders, we define the vth-order weighted frequency recursively. The base case, v = 1, is the weighted frequency wf^k_q(1) ≡ wf^k_q of Definition 5. For v ≥ 2,

wf^k_q(v) = e_v(k) f^k_q(v).

Here, f^k_q(v) is computed as follows:

f^k_q(v) = n^k_q(v) / (r − l(k) + 1),

where n^k_q(v) is the number of occurrences of wf^k_q(v − 1) in the stream (wf^k_1(v − 1), . . . , wf^k_{r−l(k)+1}(v − 1)). From Eqs. (4)-(6), it is inferred that n^k_q ≡ n^k_q(1). For example, if B = (b_1, . . . , b_r) = (ci, ci, ci, ci, cd, ls, cd, ls, cd), r = 9, W = 2, l(1) = 2, l(2) = 3, e_1(1) = 2, and e_1(2) = 3, then Table 2 presents the numerical estimates of S^k_q, n^k_q, f^k_q, and wf^k_q (k = 1, 2; 1 ≤ q ≤ r − l(k) + 1), illustrating Definitions 4 and 5. To illustrate Definition 6, which is one of our contributions, Tables 3 and 4 present the corresponding numerical estimates for the higher orders. In Section IV, we empirically verify that incorporating the vth-order weighted frequencies of variable-length sequences as features boosts the detection performance.
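The running example can be checked with a short script implementing Definitions 4 and 5 (first order only, with the weights e_1(1) = 2 and e_1(2) = 3 from the example):

```python
from collections import Counter

def weighted_frequencies(B, lengths, weights):
    """First-order weighted frequencies wf^k_q = e_1(k) * n^k_q / (r - l(k) + 1)
    (Definitions 4 and 5), keyed by distinct short sequence."""
    r = len(B)
    out = {}
    for k, (lk, e1k) in enumerate(zip(lengths, weights), start=1):
        stream = [tuple(B[q:q + lk]) for q in range(r - lk + 1)]
        counts = Counter(stream)                 # n^k_q for each distinct sequence
        total = r - lk + 1
        out[k] = {s: e1k * n / total for s, n in counts.items()}
    return out

B = ["ci", "ci", "ci", "ci", "cd", "ls", "cd", "ls", "cd"]
wf = weighted_frequencies(B, lengths=[2, 3], weights=[2, 3])
print(wf[1][("ci", "ci")])        # 2 * 3/8 = 0.75
print(wf[2][("ci", "ci", "ci")])  # 3 * 2/7 ≈ 0.857
```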

5) SORTING THE VARIABLE-LENGTH SEQUENCES AND DIVIDING THEM INTO SEVERAL SETS
For each v = 1, 2, . . . , V, the following two procedures are performed. First, the sequences in the LGS G are sorted by their vth-order weighted frequencies. Second, the sorted sequences are divided into N sets Φ^v_1, Φ^v_2, . . . , Φ^v_N. This splitting process was introduced by Xiao et al. [4], but their splitting mechanism works for V = 1 only, whereas ours generates N sets for each v = 1, 2, . . . , V. Note that the sets Φ^v_1, . . . , Φ^v_N together contain exactly the same short sequences as G.
Remark 1: Algorithm 1 is similar to Algorithm 1 of Xiao et al. [4] (mining behavioral patterns). The difference is that our algorithm has an additional loop (for v = 1, . . . , V) to mine behavioral patterns based on each feature, i.e., the vth-order weighted frequency (see line 4, Algorithm 1). Note that getWeightedFrequency_v(shortSequence) returns the vth-order weighted frequency of shortSequence. For the short sequences of the training data, this method returns a value greater than 0. However, this is not always the case for the detection data. When the value is 0, Algorithm 1 returns an empty g.
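The mining loop described in Remark 1 can be sketched as follows. The per-order weighted frequencies are modeled here as lookup tables built during training, and the function and variable names are illustrative, not those of the released Java implementation:

```python
def mine_patterns(short_sequences, weighted_freq_tables):
    """For each order v, pick the short sequence with the maximum v-th-order
    weighted frequency; return an empty result when that maximum is 0
    (i.e., the sequence was never seen during training)."""
    g = []
    for table in weighted_freq_tables:          # one table per v = 1..V
        best = max(short_sequences, key=lambda s: table.get(s, 0.0))
        if table.get(best, 0.0) == 0.0:
            return []                           # Algorithm 1 returns an empty g
        g.append(best)
    return g

# One order (V = 1), with illustrative first-order weighted frequencies.
tables = [{("cd", "ls"): 0.5, ("ls", "cd"): 0.5, ("ci", "ci"): 0.75}]
print(mine_patterns([("ci", "ci"), ("cd", "ls")], tables))  # [('ci', 'ci')]
```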
Remark 2: The method Φ^v.getIndex(g_v) returns the index of the set in Φ^v to which g_v belongs. For instance, if g_v ∈ Φ^v_3, then Φ^v.getIndex(g_v) returns 3. If g is not empty, then σ_m is computed as the sum of the per-order indices σ_mv multiplied by coefficients (powers of N), so that the V indices map to one of N^V distinct states; otherwise, σ_m is set to the state NV.
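The combination of the V per-order set indices into a single state can be realized with a base-N encoding; the exact coefficients below are an assumption, since all that is required is that the V indices map to N^V distinct states:

```python
def combine_indices(indices, N):
    """Combine V per-order set indices (each in 1..N) into one state in
    1..N^V via a base-N (mixed-radix) encoding.  The coefficient choice is
    an assumption; any injective mapping onto N^V states would do."""
    state = 0
    for idx in indices:                 # idx plays the role of Phi^v.getIndex(g_v)
        state = state * N + (idx - 1)
    return state + 1                    # states numbered 1..N^V

# V = 2, N = 3: nine distinct combined states.
print(combine_indices([1, 1], 3), combine_indices([3, 3], 3))  # 1 9
```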
Remark 3: Note that none of the states built on the training data is the state NV; therefore, A_NV = 0, P_{NV,σ_m} = 0, and P_{σ_m,NV} = 0 for m = 1, . . . , M. If σ_m = NV, then Pr(σ_m = θ_m, . . . , σ_n = θ_n) = 0. In Algorithm 3, we use N^V + 1 distinct states instead of the N + 1 states used by Xiao et al. [4]. Consequently, in our proposed model, the initial probability distribution vector A and the state transition probability matrix P have N^V + 1 and (N^V + 1)^2 dimensions, respectively, in place of the N + 1 and (N + 1)^2 dimensions in Xiao et al.'s [4] model. This is the only difference between our proposed Algorithm 3 and Algorithm 3 of Xiao et al. [4] (building probability distributions).
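Algorithm 3 (building probability distributions) amounts to counting state occurrences and transitions over the training state sequence and normalizing, with the unseen state NV made absorbing. A minimal sketch under these assumptions (state numbering and normalization details are simplified relative to the paper):

```python
def build_distributions(state_seq, num_states, NV):
    """Estimate the initial distribution A and transition matrix P from a
    training state sequence of length M.  State NV never occurs in training,
    so A[NV] = 0 and its row is made absorbing (P[NV][NV] = 1)."""
    M = len(state_seq)
    A = [0.0] * num_states
    P = [[0.0] * num_states for _ in range(num_states)]
    Y = [0] * num_states                     # occurrence counts per state
    for s in state_seq:
        Y[s] += 1
    for i, j in zip(state_seq, state_seq[1:]):
        P[i][j] += 1                         # transition counts
    for s in range(num_states):
        A[s] = Y[s] / M                      # normalize the occurrence times
        row_total = sum(P[s])
        if row_total > 0:
            P[s] = [c / row_total for c in P[s]]
    P[NV][NV] = 1.0                          # unseen state is absorbing
    return A, P

A, P = build_distributions([0, 1, 0, 1, 1], num_states=3, NV=2)
print(A)        # [0.4, 0.6, 0.0]
print(P[0][1])  # 1.0  (state 0 always transitions to state 1)
```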

C. DETECTION STAGE
The detection stage consists of the following steps borrowed from Xiao et al. [4].

1) PREPROCESSING THE DATA
The testing data of an audited user is preprocessed into the sequence of command names B̃ = (b̃_1, b̃_2, . . . , b̃_r̃), where r̃ is the number of command names. We assume that r̃ > l(W) + w − 1 in order to generate variable-length sequences with the maximum length l(W) over the window w.

4) COMPUTING THE DECISION VALUES
We assumed that, in the detection stage, the commands of a misbehaving user may match the commands of a normal user for a short period of time but will eventually differ over a longer period [4], [6], [8]. Therefore, we can make decisions about the type of behavior by considering w short state sequences, where w is the window size. Furthermore, some low probabilities may be due to a coincidence. To stop these probabilities from producing large values when summed over the window w, assuming that r̃ > l(W) + w − 1, the decision value D(n) is calculated in terms of the probabilities of the short state sequences as follows [4]:

D(n) = (1/w) Σ_{m=n−w+1}^{n} sgn(Pr(σ_m = θ_m, . . . , σ_{m+u−1} = θ_{m+u−1}) − η),

where w ≤ n ≤ M − u + 1, u is the length of a short state sequence, sgn(·) denotes the sign function, and η is a predefined probability threshold.

5) CLASSIFYING THE DECISION VALUES
Suppose that the decision threshold is τ. An audited user's nth behavior is classified as normal or masquerade based on D(n). If D(n) ≥ τ, then the nth behavior is classified as normal; otherwise, it is classified as masquerade. Note that the nth behavior consists of the short state sequences σ_{n−w+1}, σ_{n−w+2}, . . . , σ_n. Normally, a higher threshold τ catches more masquerade behavior but may increase the FPR. In contrast, a lower threshold τ may decrease not only the FPR but also the TPR. Thus, the choice of the threshold value τ depends on the application.
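The windowed decision rule can be sketched as follows, assuming each window position contributes +1 when the corresponding short state-sequence probability is at least η and −1 otherwise (an assumption consistent with decision thresholds τ ranging over [−1, 1]):

```python
def decision_values(seq_probs, w, eta):
    """Windowed decision values: each short state sequence votes +1 when its
    probability is at least eta and -1 otherwise; D(n) averages the votes of
    the last w positions, so D(n) lies in [-1, 1]."""
    votes = [1 if p >= eta else -1 for p in seq_probs]
    return [sum(votes[n - w:n]) / w for n in range(w, len(votes) + 1)]

def classify(D, tau):
    """Rank each windowed behavior as normal (D(n) >= tau) or masquerade."""
    return ["normal" if d >= tau else "masquerade" for d in D]

probs = [0.4, 0.5, 0.01, 0.02, 0.01]    # hypothetical short-sequence probabilities
D = decision_values(probs, w=3, eta=0.1)
print(classify(D, tau=0.0))             # ['normal', 'masquerade', 'masquerade']
```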

D. OUTLINE OF THE PROPOSED MODEL
The training stage of the proposed method consists of the steps described in Sections II-B.1 through II-B.7, and the detection stage consists of the steps described in Sections II-C.1 through II-C.5.

III. EXPERIMENTAL SETTINGS
A. DATASETS
Each dataset was preprocessed into a training set B = (b_1, . . . , b_r) for the training stage (see Section II-B, Section II-B.1) and a testing set B̃ = (b̃_1, b̃_2, . . . , b̃_r̃) for the detection stage (see Section II-C, Section II-C.1), as described below.

1) SCHONLAU DATASET
Schonlau et al. [9] introduced a free UNIX command-line-based dataset called SEA with the truncated data configuration only. The UNIX acct audit tool was used to collect commands from NU_SEA = 50 different users over several months. We used the following data settings proposed by Schonlau et al. [9], which we named ''SEA Full'' for NU_SEA = 50 and ''SEA Shrunk'' for NU_SEA = 4. For each user, the first 5000 commands yielded the training set, and the next 10,000 commands built the test set. Each user's test set is contaminated with the commands of other users at random positions. The positions are given at http://www.schonlau.net/masquerade/masquerade_summary.txt in .txt format, which we used to generate the normal and masquerade test sets.

2) PU DATASET
The PU dataset was introduced by Lane and Brodley [10]; the data were taken from the tcsh history files and stored with the enriched data configuration. The dataset contains session information for each of NU_PU = 8 users. We converted the dataset into the truncated data configuration and used the following data settings proposed by Lane and Brodley [10], which we named ''PU Truncated Full'' for NU_PU = 8 and ''PU Truncated Shrunk'' for NU_PU = 4. For each user, the training set was the first 1500 commands, and the test set was the next 500 commands. We also took 500 commands of each of the other (NU_PU − 1) users, starting from the 1501st command, resulting in 500 × (NU_PU − 1) commands for the masquerade test set.

3) GREENBERG DATASET
Greenberg's dataset [11] was collected through the csh tool and is stored with the enriched data configuration. The dataset contains 168 UNIX users divided into four groups: novice programmers, experienced programmers, computer scientists, and non-programmers. We converted the dataset into the truncated data configuration and used the following data settings proposed by Maxion [12]. We selected users with 2000 to 5000 commands; NU_Greenberg = 50 users satisfied this condition. Then, for each of the NU_Greenberg users, the first 2000 commands were used for the experiments: the first 1000 commands created the training set, and the next 1000 commands yielded the normal test set. We chose 25 users from the remaining 118 users to act as masqueraders. From these users, we randomly selected 300 commands to be the masquerade test set for each of the NU_Greenberg users. Note that each set of 300 commands was distinct. Thus, each user's training set consisted of 1000 commands, the normal test set contained 1000 commands, and the masquerade test set contained 300 commands. We named these data settings ''Greenberg Truncated Full'' for NU_Greenberg = 50 and ''Greenberg Truncated Shrunk'' for NU_Greenberg = 4. Table 5 briefly describes the datasets with the associated data settings.

B. BASELINES
Xiao et al.'s [4] Markov chain-based variable-length method (V = 1) was denoted as ''Markov.'' Our proposed Markov chain-based variable-length model with both first- and second-order weighted frequencies (V = 2) was denoted as ''MarkovF2.'' Note that, for a given number of sets N, Markov uses N + 1 states. In contrast, our proposed MarkovF2 uses N^2 + 1 states (see Section II-B.6). Therefore, we also used Xiao et al.'s [4] Markov model with N^2 + 1 states (i.e., N^2 sets) as an additional baseline and denoted it as ''MarkovN2.'' Thus, the proposed model is MarkovF2; both baselines, Markov and MarkovN2, are from Xiao et al. [4].

C. EVALUATION METRICS
To assess the performance of our model, we used different evaluation metrics, such as the TPR, FPR, ROC, threshold variance, computational complexity, and memory complexity. Let D be a given dataset with specified data settings, and NU be the number of users. Then, for each user_i's masquerade test set (user_i ∈ D, 1 ≤ i ≤ NU), the numbers of true positive and false negative predictions are denoted by TP_i and FN_i, respectively, whereas, for user_i's normal test set, the numbers of false positive and true negative predictions are denoted by FP_i and TN_i, respectively. User_i's true positive rate is

TPR_i = TP_i / (TP_i + FN_i).

A higher TPR_i indicates better masquerade detection. By contrast, a lower FPR_i indicates a better classifier for normal test data, where

FPR_i = FP_i / (FP_i + TN_i).

For D, we calculate TPR_mean = Σ_{i=1}^{NU} TPR_i / NU and FPR_mean = Σ_{i=1}^{NU} FPR_i / NU. We define the ROC as the plot of TPR_mean against FPR_mean for each threshold value τ from −1 to 1 in increments of 0.002.
For each user_i (user_i ∈ D, 1 ≤ i ≤ NU), we denote by threshold_optimal_i the threshold value τ that yields the maximum TPR value (TPR_max_i) under the constraint that the corresponding FPR_i ≤ 0.1. We denote this FPR_i as FPR_optimal_i. Then, the TPR is defined as

TPR = Σ_{i=1}^{NU} TPR_max_i / NU,

and the FPR is defined as

FPR = Σ_{i=1}^{NU} FPR_optimal_i / NU.

The threshold variance is calculated through the following equation:

threshold_variance = Σ_{i=1}^{NU} (threshold_optimal_i − threshold_mean)^2 / NU,

where threshold_mean is the average optimal threshold value, that is,

threshold_mean = Σ_{i=1}^{NU} threshold_optimal_i / NU.

A lower threshold_variance indicates better generalization. We consider threshold_mean to be a recommended threshold value for D.
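The per-user optima are aggregated into the reported TPR, FPR, and threshold statistics as follows; the per-user tuples below are hypothetical values for illustration:

```python
def evaluation_metrics(per_user):
    """per_user: list of (TPR_max_i, FPR_optimal_i, threshold_optimal_i)
    tuples, one per user.  Returns the aggregate TPR, FPR, threshold_mean,
    and threshold_variance (population variance over users)."""
    NU = len(per_user)
    TPR = sum(t for t, _, _ in per_user) / NU
    FPR = sum(f for _, f, _ in per_user) / NU
    thr_mean = sum(th for _, _, th in per_user) / NU
    thr_var = sum((th - thr_mean) ** 2 for _, _, th in per_user) / NU
    return TPR, FPR, thr_mean, thr_var

# Hypothetical per-user optima for NU = 2 users.
TPR, FPR, thr_mean, thr_var = evaluation_metrics([(0.9, 0.05, 0.2), (0.8, 0.07, 0.4)])
print(TPR, FPR, thr_mean, thr_var)  # ≈ 0.85 0.06 0.3 0.01
```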

D. EXPERIMENTAL SETUP
The experiments were implemented in the Java programming language. For reproducibility, the code was deposited on GitHub (https://github.com/kazaros91/Masquerade-Detection-Based-on-Markov-Chain-Using-Variable-Length-Sequences). The experiments were conducted on a MacBook Pro with a 2.9 GHz Intel Core i7 processor (four cores) and 16 GB RAM. We performed a greedy search over the hyperparameters presented in Table 6. Because of space limitations, we provide only the best results obtained through the greedy search.

IV. RESULTS AND DISCUSSION
This section is organized as follows. Section IV-A presents the comparison of our model to the baselines using Shrunk data settings. Section IV-B presents the comparison of our model to the baselines using Full data settings. Section IV-C presents the effect of the window size and number of sets on the detection performance using Full data settings. Section IV-D presents the comparison of our model to state-of-the-art models, such as the CNN [13] and SA-HMM [14], using Full data settings.

A. COMPARISON WITH THE BASELINES USING SHRUNK DATA SETTINGS
For SEA Shrunk, the results are presented in Table 7, and the ROC curves are presented in Fig. 2. Table 7 demonstrates that, for SEA Shrunk, the proposed MarkovF2 obtains noticeably better TPR, FPR, and threshold_variance than Markov and MarkovN2. Fig. 2 shows that, for SEA Shrunk, the proposed MarkovF2 substantially outperforms Markov and MarkovN2 in terms of ROC curves. Note that the ROC curves of MarkovN2 and Markov are quite similar.
For PU Truncated Shrunk, Fig. 3 shows that, in terms of ROC curves, the proposed MarkovF2 significantly outperforms MarkovN2, which itself is better than Markov. For Greenberg Truncated Shrunk, Table 9 presents the results, and Fig. 4 presents the ROC curves. Table 9 demonstrates that the proposed MarkovF2 obtains substantially better TPR, FPR, and threshold_variance than Markov and MarkovN2.
For Greenberg Truncated Shrunk, Fig. 4 shows that the proposed MarkovF2 outperforms Markov and MarkovN2 in terms of ROC curves, and that Markov is better than MarkovN2. When the FPR_mean is in [0, 0.06), the proposed MarkovF2 is superior to Markov and MarkovN2 by significant margins; beyond this range, MarkovF2 still outperforms the baselines, but not very noticeably.

B. COMPARISON WITH THE BASELINES USING FULL DATA SETTINGS
For SEA Full, Table 10 presents the results, and Fig. 5 presents the ROC curves. Table 10 demonstrates that, compared to Markov and MarkovN2, the proposed MarkovF2 obtains remarkably better TPR and threshold_variance, as well as noticeably lower FPR.
For SEA Full, Fig. 5 demonstrates that MarkovF2 has a higher TPR_mean than the baselines (Markov and MarkovN2) by significant margins when the FPR_mean is in [0, 0.06). When the FPR_mean is in [0.06, 0.17], MarkovF2 still outperforms the baselines, but not very noticeably.

C. EFFECT OF WINDOW SIZE AND NUMBER OF SETS ON PERFORMANCE
Recall that w is the window size and N is the number of sets. Figs. 8-10 illustrate the impact of w and N on the TPRs of all models (Markov (a), MarkovN2 (b), and MarkovF2 (c)) for SEA Full, PU Truncated Full, and Greenberg Truncated Full. We identify the following dependencies: for N = 2, 3, 4, the TPR values of all models increase with the window size w, and for all window sizes (w = 21, 36, 51, 81), the TPR values of all models decrease with increasing N. Thus, the highest TPR is achieved at w = 81 and N = 2.

D. COMPARISON TO STATE-OF-THE-ART MODELS
Table 13 presents the best results for the proposed model (MarkovF2), the baselines (Markov and MarkovN2), and the state-of-the-art models (CNN [13] and SA-HMM [14]) for SEA Full, PU Truncated Full, and Greenberg Truncated Full. Although the CNN has a better FPR than all the other models, MarkovF2 has the best TPR. More importantly, MarkovF2 achieves a significant improvement in TPR on the datasets with the truncated data configuration, which means that MarkovF2 is more effective when data samples contain limited information. Table 14 presents the computational and memory complexity of the models. (We did not analyze the computational and memory complexity of SA-HMM because HMM-based methods inherently have higher computational and memory complexity.) Table 15 provides the parameter values used to evaluate the computational and memory complexity. The numerical estimates of the complexity presented in Table 16 assume that the memory unit is 4 bytes. Table 16 indicates that, compared to the CNN [13], MarkovF2 requires approximately 10^9 times fewer ops in the training stage, approximately 10^10 times fewer ops in the detection stage, and approximately 10^6 times fewer memory units. The details are provided in Sections V-A and V-B.

V. COMPUTATIONAL COMPLEXITY AND MEMORY COMPLEXITY
A. MARKOV MODELS
The LGS size is H, the initial probability distribution vector A has size N^V + 1, and the transition probability distribution matrix P has size (N^V + 1)^2. Thus, the memory complexity is given by O(H + (N^V + 1) + (N^V + 1)^2). The memory complexity of Markov (V = 1) is therefore O(H + N^2), that of MarkovN2 (V = 2) is O(H + N^4), and that of our MarkovF2 (V = 2) is O(H + N^4). The best results were achieved when N = 2, which is very small.
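As a rough numeric illustration of the memory-complexity analysis above, the footprint of a V-th-order model can be sketched in Python. The function name and the value H = 100 are illustrative assumptions, not the paper's Table 15 parameters:

```python
def markov_memory_units(H, N, V):
    """Estimate memory units for a V-th-order Markov model over N command sets.

    H            : size of the LGS structure (memory units)
    N**V + 1     : size of the initial probability vector A
    (N**V + 1)^2 : size of the transition probability matrix P
    """
    states = N**V + 1
    return H + states + states**2

# With the best setting N = 2 (H = 100 is a placeholder):
first_order = markov_memory_units(H=100, N=2, V=1)   # Markov:   100 + 3 + 9
second_order = markov_memory_units(H=100, N=2, V=2)  # MarkovF2: 100 + 5 + 25
```

Even at V = 2, the transition matrix holds only 25 entries when N = 2, which is why the model remains so lightweight.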

B. CNN
Let N_conv = 6 be the number of convolutional layers of the CNN [13], K be the kernel size (O(K) = 7), and InChannels = OutChannels = C = 1024, where InChannels and OutChannels are the numbers of input and output channels, respectively, in each convolutional layer. Let N_fully = 3 be the number of fully connected (FC) layers and z = 2048 be the number of neurons in each FC layer. Then, for the CNN [13], the computational complexity in the training stage can be determined as follows. Each convolutional layer performs the convolution operation and pooling, which together require O((r − K + 1) × K × InChannels × OutChannels) + O(1) = O((r − K + 1) × K × C^2) = O(r × K × C^2) ops. Each FC layer performs O(z^2) ops. Thus, each forward pass requires F = O(N_conv × r × K × C^2 + N_fully × z^2) ops. Each backward pass needs B = 2 × F ops, as argued by He and Sun [37]. In total, the computational complexity in the training stage of the CNN [13] is T = O(E × (F + B)) = O(E × 3 × (N_conv × r × K × C^2 + N_fully × z^2)), where E = 30 is the number of epochs. Thus, T = O(540 × r × K × C^2 + 270 × z^2). The detection stage consists of the forward pass only, for which the computational complexity is F = O(N_conv × r × K × C^2 + N_fully × z^2) = O(6 × r × K × C^2 + 3 × z^2). Each convolutional layer of the CNN [13] is a weight tensor in R^(batch_size × 1 × K × InChannels × OutChannels), where batch_size = 64 and InChannels = OutChannels = C = 1024. Subsequently, the memory complexity is O(batch_size × K × C^2). Each FC layer is a weight tensor in R^(batch_size × z × z), and its memory complexity is O(batch_size × z^2). The total memory complexity of the CNN [13] is therefore O(batch_size × (N_conv × K × C^2 + N_fully × z^2)).
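These op counts can be checked with a short script. The following sketch reproduces the training- and detection-stage formulas; the function names and the sequence length r = 100 are illustrative assumptions:

```python
def cnn_training_ops(r, K=7, C=1024, z=2048, n_conv=6, n_fc=3, epochs=30):
    """Training-stage op count: T = E * (F + B), with B = 2 * F (He and Sun [37])."""
    forward = n_conv * r * K * C**2 + n_fc * z**2  # F: one forward pass
    return epochs * 3 * forward                    # F + B = 3 * F

def cnn_detection_ops(r, K=7, C=1024, z=2048, n_conv=6, n_fc=3):
    """Detection-stage op count: a single forward pass."""
    return n_conv * r * K * C**2 + n_fc * z**2
```

With the defaults above, `cnn_training_ops` reduces to 540 × r × K × C^2 + 270 × z^2, matching the closed form in the text.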

VI. CONCLUSION
In this work, the problem of masquerade detection is addressed based on variable-length UNIX command sequences. The contributions are summarized as follows.
• We introduce new features of variable-length sequences, i.e., the weighted occurrence frequencies of different orders.
• We extend the Markov-chain-based model by Xiao et al. [4] and integrate these new features into the extended model.
• For three different datasets (SEA, PU Truncated, and Greenberg Truncated), with both Shrunk and Full data settings, our model outperforms the baselines (Markov and MarkovN2 from Xiao et al. [4]) in terms of the TPR, FPR, ROC, and threshold variance metrics.
• In terms of TPR, the proposed model MarkovF2 outperforms the state-of-the-art DL model CNN [13] for PU Truncated Full and Greenberg Truncated Full, and the state-of-the-art SA-HMM [14] for SEA Full.
• Our model is much more lightweight in terms of computational and memory complexity than the state-of-the-art models. Thus, our model is more suitable for real-time masquerade detection.

One limitation of our model is the selection of the optimal hyperparameter values. In the case of V ≤ 2, the optimal hyperparameter space is quite small (see Table 6). In the future, we plan to test our method with higher-order (V > 2) weighted occurrence frequencies. In this case, the increased number of weight sets enlarges the hyperparameter space, making it more challenging to find optimal hyperparameters. Therefore, we plan to apply a learning algorithm such as the back-propagation neural network on Markov chains proposed by Xiao et al. [38]. However, this will increase the computational cost and memory usage.
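As a rough illustration of the weighted-occurrence-frequency features summarized above, the sketch below combines n-gram frequencies of orders 1..V with per-order weights. The function, weighting scheme, and normalization are simplified assumptions and may differ from the paper's exact definition:

```python
from collections import Counter

def weighted_occurrence_frequencies(commands, weights):
    """Combine n-gram occurrence frequencies of orders 1..V with per-order weights.

    commands: list of UNIX command names (strings).
    weights : weights[v-1] is the weight of order-v n-grams (illustrative values).
    """
    scores = Counter()
    for v, w in enumerate(weights, start=1):
        # All contiguous order-v command sequences in the sample
        ngrams = [tuple(commands[i:i + v]) for i in range(len(commands) - v + 1)]
        if not ngrams:  # sample shorter than the order v
            continue
        counts = Counter(ngrams)
        for gram, c in counts.items():
            scores[gram] += w * c / len(ngrams)  # weighted relative frequency
    return dict(scores)
```

For example, with `weights=[1.0]` the function reduces to plain unigram relative frequencies; adding a second weight folds in order-2 (bigram) information, mirroring the V = 2 setting of MarkovF2.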
As noted in Section I-B, the state-of-the-art CNN is a text classification model applied to the masquerade detection problem. Compared to a variety of RNN- and CNN-based models, capsule neural networks (CapsNets) have demonstrated superior results in terms of accuracy and sample efficiency for a variety of NLP tasks, such as text classification [39]-[41], sentiment analysis [42], relation extraction [43], and question answering [41]. Moreover, CapsNets can be more robust against adversarial attacks such as word shuffling, as shown in [44]. (It should be noted that the original CapsNet architecture was introduced for image classification by Sabour et al. [45].) Thus, another direction for future research is investigating the application of CapsNets to masquerade detection.