Introduction and Motivation
A heterogeneous cloud is a complex platform requiring substantial security infrastructure. According to NIST [1], a cloud platform should have essential characteristics including, but not limited to, on-demand self-service, broad network access, and resource pooling. These features have helped forge cloud computing into a standard for both the private and public sectors. As such, many organizations are utilizing the computational power of the cloud for different tasks to meet growing business needs. Typically, a cloud service provider (CSP) offers Infrastructure as a Service (IaaS), where clients are allowed to ‘rent’ space in the form of virtual machines (VMs) within a data center to facilitate different operational jobs. Clients have the ability to spawn many of these virtual machines on demand. Such a convenient way of utilizing computational resources is derived from the defined essential cloud characteristics. Recently, the number of cloud services being offered, in particular VMs, as well as the number of clients demanding these services, has increased dramatically. This increase has made the cloud a very desirable target for attackers since these resources, if exploited, can be recruited to launch large-scale cybersecurity attacks [2]–[5].
Cloud malware is one of the most common and growing threats, where malicious software is purposely designed to attack VMs running on a cloud IaaS. Although malware is a well-researched challenge [6], [7], its impact is magnified in cloud settings for several underlying reasons: (i) the high demand for cloud resources as well as the increase in the number of clients significantly broadens the attack surface, (ii) several clients lack the ability to properly secure their acquired resources, and (iii) the rise of automated configuration tools (e.g., Puppet, Chef, etc.) further adds to the list of security vulnerabilities. If a VM is spawned using a script that contains a configuration vulnerability (a flaw in security settings, such as failing to auto-encrypt files or to change a default image root password), it could be left prone to attacks. Further, any VM spawned using the same script will most likely have the same weakness. This is particularly true in cases where a client is deploying a large-scale system on the cloud. For example, deploying a web service used by millions of users will typically include multiple web, application, and database servers, which in most cases will all be deployed using the same configuration script. The redundant use of configuration scripts across the servers that make up a web service could allow malware to easily propagate to each server in the web service. Consequently, detecting cloud malware in a real-time, online, and effective manner is an essential task for CSPs.
To address these challenges, numerous malware detection approaches have been proposed [8], [9] and are mostly categorized into static analysis [10], dynamic analysis [7], [8], and online malware detection [9], [11]. Static analysis works by examining the code of an executable and creating a signature for it if it is flagged as malware, whereas dynamic analysis works by running an executable in a closed environment (e.g., a sandbox) and monitoring its behavior. Online malware detection methods focus on constantly monitoring hosts by analyzing normal and malicious behaviors at all times. Static and dynamic analysis methods are well understood in the literature and both have their shortcomings [10], [12]. The static approach falls short against zero-day malware and polymorphic malware, which constantly changes its identifiable features. Such sophisticated malware can evade detection by applying packing and crypting methods to change the way it looks. Dynamic analysis can mitigate the limitations of static analysis since it is based on the behavior of the malware during execution; however, smart malware can detect the presence of sandboxes and cease malicious activities to avoid detection. Additionally, static and dynamic analysis share a fundamental drawback: they aim to detect malware executables before they run on a host. This is not always possible, since malware can get into a host without passing through the static/dynamic detection system. To mitigate the aforementioned drawbacks, online malware detection is used by defining a set of host-wide features to capture benign and malicious behaviors.
Detecting malware in a rapid and effective manner has become a necessity. As such, researchers have utilized machine learning (ML) as a mature and reliable way to perform static, dynamic, and online malware detection. In this paper, we introduce an approach to online cloud malware detection using deep learning (DL). In particular, we demonstrate the effectiveness of using Recurrent Neural Networks (RNNs) for online malware detection by utilizing processes system features of VMs in cloud IaaS environments. Our work is driven by the assumption that many VMs running on the cloud are automatically provisioned to do a specific task. In turn, such VMs will contain a fixed set of processes to achieve this task. Note that processes are dynamic in nature, so other unexpected processes will always be created and deleted. However, a large number of the running processes belong to the fixed set. For example, a single VM configured to host a web service will typically have web server processes (e.g., Apache), database processes (e.g., MySQL), etc. that can be represented as a sequence. Each process in this sequence is represented as a vector of the utilized system features. To this end, we use RNNs to learn the sequence of processes running in a VM and how the presence of malware can disrupt this sequence.
We conducted an analysis of the malware samples, which showed that the majority of the malware was able to change its process name to that of a legitimate system process. Malware was also capable of attaching itself to a legitimate process. For these two reasons, typical whitelisting methods are not effective, and more sophisticated methods are needed. In our previous work [13], we used a simple shallow CNN model, which proved effective but with limited detection accuracy. This was used as a baseline for our more sophisticated RNN approach.
The main contributions in this paper are as follows:
We introduce a novel approach of detecting cloud malware using RNNs by utilizing processes system features. We demonstrate that the set of processes running in a VM can be represented as a sequence of system features. Further, we highlight that RNNs can effectively detect the presence of malware processes within the benign processes sequence.
We provide a comparative analysis of Long Short Term Memory (LSTM) and Bidirectional (BIDI) models in terms of evaluation metrics, along with training and detection time.
We provide an analysis on the effect of using different input representations. Our experiments suggest that both LSTM and BIDI models achieved high performance regardless of the order of system features, whereas, the order of processes within the input sequences impacted the performance by a range of 1-2%.
The remainder of this paper is organized as follows. Section II discusses related works regarding RNNs, malware detection, and cloud computing. Section III describes the approach and methodology of our experiments. Section IV discusses the experimental cloud setup and the results from the RNN models. Section V elaborates on the RNN sensitivity to different input representations, whereas cost analysis is described in Section VI. Section VII provides a discussion and highlights some limitations. Section VIII summarizes the findings and concludes with possible future directions.
Related Work
Behavioral machine learning based malware detection approaches can be divided into dynamic malware detection and online malware detection. An important distinction between the two approaches is that, in dynamic malware detection, an executable (malware or benign) is run in a sandbox and its behavior is captured, whereas in online malware detection, the behavior of the entire system is captured, with particular times being labeled as malicious if malware is running. In this section, we discuss some of the related dynamic and online malware detection works. Further, we sub-categorize these works based on several aspects, including traditional versus deep learning based approaches and whether the work is cloud-specific, as shown in Table 1.
A. Dynamic Malware Detection
There have been several works on dynamic malware detection using traditional machine learning approaches. The works in [14], [16] focused on using system calls as features. Firdausi et al. [14] employed traditional machine learning algorithms such as KNN, Naive Bayes, decision trees, and SVM, whereas Lucket et al. [16] used neural networks. The works in [15], [18] rely on system performance metrics and traditional ML algorithms for malware detection. In addition, Fan et al. [17] built a framework using sequence mining techniques that effectively discovers malicious patterns in malware. This work utilizes a Nearest Neighbor classifier to identify previously unknown malware.
Recently, it has become clear that more sophisticated approaches for malware detection are needed. This is mainly because traditional ML approaches require extensive pre-processing and rigorous feature engineering and representation. As such, recent research efforts have moved towards employing end-to-end deep learning techniques to bypass the feature engineering step. Many research works [8], [19]–[24] aimed to overcome the limitations of traditional ML approaches and employed DL algorithms. The works in [21]–[24] provide malware detection methods based on system calls and RNNs. Others [8], [19], [20] have also used Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) but instead focused on API calls.
However, dynamic analysis has some limitations due to the controlled environment in which the malware runs. In many cases, the malware cannot be analyzed completely due to limited Internet access. Sophisticated malware can detect the presence of a sandbox and immediately terminate any malicious behavior. In addition, most dynamic detection approaches target traditional host-based systems and are not specific to cloud infrastructures (e.g., VMs). Consequently, online malware detection approaches are necessary.
B. Online Malware Detection
The advantages of online malware detection approaches are: (1) they don’t rely on a closed environment, (2) they continuously monitor the VMs, as opposed to dynamic analysis approaches where once an executable is deemed benign it freely runs on the system, and (3) they consider the entire VM behavior as opposed to just an executable behavior.
The authors in [25], [26] utilize performance counters for online malware detection, whereas [27] proposed the use of memory features; however, these works used traditional ML algorithms and targeted traditional host-based systems. To enhance the accuracy of malware detection in the cloud, more cloud-specific techniques have been proposed. Guan et al. [29] proposed an anomaly detection approach for VMs in cloud environments using system calls. They used an ensemble of Bayesian predictors and decision trees. Similarly, Azmandian et al. [28] proposed an intrusion detection system using system calls and traditional ML algorithms including KNN and clustering. Further, Dawson et al. [31] used API calls captured through the hypervisor and a non-linear phase-space algorithm to detect anomalous behavior.
Other works have focused on using features that can only be fetched through the hypervisor. Given that many experimental setups are run within the context of a hypervisor, it is common to see features collected from the hypervisor. Also, such techniques are suitable for implementation by the CSP since they do not require visibility inside the VMs. Watson et al. [2] utilized performance metrics that can be fetched from the hypervisor in order to detect malware. This paper utilized a one-class SVM for malware detection; however, it focused on malware that is known to be highly active. Similarly, Abdelsalam et al. [30] demonstrated a black-box approach to detect malware. This work uses VM-level system and resource utilization features. It worked well in detecting highly active malware with high resource utilization but was not as effective in detecting malware that hides itself with low utilization.
Besides the works that used traditional ML algorithms, others [9], [13], [32], [33] focused on using deep learning algorithms for online malware detection. The authors in [13] extended their work in [30] and introduced a detection method which uses a CNN model with the goal of identifying low-profile malware. This method achieved
In this paper, we primarily focus on online malware detection using RNNs in cloud infrastructures. To the best of our knowledge, this is the first work that uses an RNN-based malware detection approach with performance metrics in an online cloud environment. We provide a novel way of representing a VM’s behavior as a sequence of process performance metrics, as discussed in the next section. Additionally, our work provides an insightful analysis of the RNN sensitivity to different input representations for malware detection.
Methodology
In this section, we explain the methodology used for malware detection in VMs in cloud infrastructure.
A. LSTM Models
RNN is a category of deep learning that can process sequential information such as language translation [36], speech recognition [37], and time series prediction [38]. However, it suffers from two problems. First, RNN struggles with short term memory; this means that long inputs can cause the RNN model to forget earlier information. Second, RNN models are subject to vanishing gradients. This is where the gradient value becomes diminished as the model backpropagates, which leads to the model not learning properly. LSTM was created to resolve these problems [39]. LSTM units contain input, forget, and output gates which control how the information flows into and out of the cell. This allows them to preserve important information and discard any unnecessary data. As shown in Figure 1, LSTM contains
All of these gates help LSTM layers create a reliable model that can leverage all of our sequential data without the worry of losing data or having inaccurate gradients. The first step in the LSTM unit is the forget gate. Data from the previous hidden state (h
B. Bidirectional Models
Bidirectional LSTM models are able to process input in both a forward and a backward manner [40]. Instead of processing the input with only one LSTM layer, past to future, another LSTM layer is added that processes the input starting at the last element and working its way backwards, i.e., future to past. Just like in a normal LSTM layer, each bidirectional LSTM layer is assigned a number of units. This bidirectional methodology allows the model to learn more by analyzing the data from both directions and applying information from future inputs towards its predictions. Once these two layers process their respective data, their outputs are concatenated at each timestep. This type of model is useful when extra context might be needed in order to make accurate predictions. In our case, the bidirectional model can analyze later processes in the sequence and use that information to determine what might be happening at the current process. This creates a model that is well suited to determining whether a machine is infected with malware by analyzing how the machine will behave in the future. Figure 2 depicts the architecture of a bidirectional LSTM layer within an RNN model.
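The per-timestep concatenation described above can be illustrated with a deliberately simplified sketch. It uses a toy scalar tanh recurrence in place of a real LSTM cell; the weights `w` and `u` and the function names are illustrative assumptions, not the models trained in this paper.

```python
import math

def run_recurrence(xs, w=0.5, u=0.8):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}), starting from h = 0."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(xs):
    """Pair the forward (past to future) and backward (future to past) state at each timestep."""
    forward = run_recurrence(xs)
    # Process the reversed sequence, then flip the result so index t aligns with forward[t].
    backward = run_recurrence(xs[::-1])[::-1]
    return [(f, b) for f, b in zip(forward, backward)]

outputs = bidirectional([1.0, 0.0, -1.0, 2.0])
```

Each element of `outputs` holds two values for one timestep: the forward state has only seen earlier inputs, while the backward state has only seen later ones. A bidirectional LSTM layer does the same with vector-valued states, which is why its output width is twice the number of units per direction.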
C. System Features
The system features in Table 2 are the features used to define process behavior. The values are an example of the raw data collected about a single process at a certain time. Further processing of the data is required, such as encoding the strings using one-hot encoding, and the data must be flattened to a 1-dimensional vector before it can be used in an RNN. Most of these features can be obtained by using Virtual Machine Introspection (VMI) tools such as LibVMI to capture snapshots of a VM's memory and, in turn, extracting the required information with memory forensics tools such as Volatility. This set of system features is intended for the sole purpose of demonstrating the validity of our approach; more features could further enhance the accuracy.
D. Unique Processes and RNN Input
System features are collected from all processes running in a VM at a certain time. With many short-lived processes (i.e., processes created and destroyed quickly within each VM), as well as process IDs being reassigned by the operating system, it can be misleading and difficult to learn their behavior. As such, we define “unique processes” (as introduced in [13]) to reduce such dynamism. Unlike a traditional operating system process, which is identified by a “pid”, a unique process is more concerned with the behavior of a process and is identified by a tuple of two elements: the process name and the command used to run the process. Figure 3 shows an example of operating system processes converted to unique processes. Processes sharing the same 2-tuple (e.g., forked processes) are aggregated by taking the average of their measures. This approach also helps in reducing the number of processes in a single sample.
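The aggregation into unique processes can be sketched as follows. This is a minimal illustration, assuming each process record is a dictionary with hypothetical `name`, `cmd`, and `features` fields; the actual record layout and feature set are those of Table 2, not the two made-up numeric values used here.

```python
from collections import defaultdict

def to_unique_processes(processes):
    """Group OS processes by (name, command) and average their numeric features."""
    groups = defaultdict(list)
    for proc in processes:
        groups[(proc["name"], proc["cmd"])].append(proc["features"])
    unique = {}
    for key, feature_rows in groups.items():
        n = len(feature_rows)
        # Column-wise mean across all processes sharing the same 2-tuple.
        unique[key] = [sum(column) / n for column in zip(*feature_rows)]
    return unique

# Hypothetical snapshot: two forked pool workers and one database process.
snapshot = [
    {"pid": 101, "name": "php-fpm", "cmd": "php-fpm: pool www", "features": [2.0, 10.0]},
    {"pid": 102, "name": "php-fpm", "cmd": "php-fpm: pool www", "features": [4.0, 30.0]},
    {"pid": 7, "name": "mysqld", "cmd": "/usr/sbin/mysqld", "features": [1.0, 50.0]},
]
unique = to_unique_processes(snapshot)
```

The pid is deliberately ignored when grouping, so the two pool workers collapse into one unique process whose features are the averages of the originals.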
The collected unique processes’ features will be represented as data samples to be used as input to the RNN models, where each data sample is a sequence of unique processes. We represent a sample $X_{t}$ as follows, where $up_{i}$ denotes the $i$-th of $m$ unique processes and $f_{1}, \ldots, f_{n}$ are its system features:
\begin{align*} X_{t} = up_{1} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \to up_{2} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \cdots \to up_{m} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \end{align*}
Typically, malware infects a VM and creates one or more processes, which disrupt the benign sequence of processes. Depending on the malware, it can attach itself to another process and terminate its own main process to avoid detection, which may turn the behavior of some existing unique processes malicious. As such, a malicious sample includes some malicious processes interspersed within the benign sequence and can be represented as follows, where $mp_{1}, \ldots, mp_{k}$ denote malicious processes:
\begin{align*} X_{t} = up_{1} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \to mp_{1} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \cdots \to up_{m} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \to mp_{k} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{n} \end{bmatrix} \end{align*}
A malware process can hide within the large number of running processes by renaming its process to some commonly used names. However, using the concept of unique process makes it harder for the malware process to hide because the number of unique processes is substantially smaller. Further, a malware process will be more visible since it will be considered a unique process. Our aim is to learn from the sequence of processes (including benign processes that a malware attached to) in a given sample and to identify it as malicious or benign.
Experimental Setup and Results
A. LSTM and BIDI Models Architecture
Our first model is based on LSTM and consists of eight layers. The first three LSTM layers consist of 256, 128, and 64 units, respectively. Each of our LSTM layers is followed by a dropout layer of 10% in order to prevent overfitting. The final layer is an output layer with softmax activation. Since we are using binary classification (i.e., malicious or benign), we only need two output units. Our second RNN model is a bidirectional LSTM. This model consists of four bidirectional LSTM layers of 512, 256, 128, and 64 units, respectively. Each of these layers is followed by a dropout layer of 10%. The output layer for this model consists of two output units and uses a softmax activation. Both of these architectures were chosen for their simplicity, which allows for faster training times. Despite the models’ simplicity, they are still able to perform at a high level.
These models are trained, validated, and tested with a data set that consists of 113 experiments, split into 60% for training, 20% for validation, and 20% for testing. To obtain optimal models, a grid search was used for hyperparameter optimization, mostly with respect to batch sizes (16, 32, and 64) and learning rates (0.0001, 0.00001, and 0.000001).
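A grid search over the batch sizes and learning rates listed above can be sketched as follows. The selection criterion (highest validation accuracy) and the `train_and_validate` callback are assumptions made for illustration; the scorer used here is a placeholder that performs no training and whose peak is arbitrary.

```python
from itertools import product

BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [1e-4, 1e-5, 1e-6]

def grid_search(train_and_validate):
    """Return the (batch_size, learning_rate) pair with the best validation score."""
    best_config, best_score = None, float("-inf")
    for batch_size, learning_rate in product(BATCH_SIZES, LEARNING_RATES):
        score = train_and_validate(batch_size, learning_rate)
        if score > best_score:
            best_config, best_score = (batch_size, learning_rate), score
    return best_config, best_score

def placeholder_scorer(batch_size, learning_rate):
    # Stand-in for "train a model, return its validation accuracy".
    # It is not derived from any experiment.
    return 1.0 - abs(batch_size - 32) / 100 - abs(learning_rate - 1e-5)

best_config, best_score = grid_search(placeholder_scorer)
```

With three batch sizes and three learning rates, the search trains nine configurations per model type.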
B. Experimental Setup
1) Cloud Testbed
Getting accurate system features from the malware experiments is imperative for showcasing near real-world performance. To accomplish this, a cloud testbed running an actual application was used and multiple measures were taken to ensure that the malware shows its true behavior. As shown in Figure 4, the cloud testbed utilized OpenStack, a popular open source cloud platform, and consists of one control node and four compute nodes. The control node handles tasks such as the dashboard, storage, network, identity, and computing. The compute nodes only handle computing services. Each compute node is also supplied with agents for networking, polling, and collecting.
To avoid hindering the malware and to allow it to exhibit its true malicious behavior, all of our experiments were conducted in the wild, with all the VMs connected to the Internet. This is because (i) sophisticated malware typically has the ability to detect the presence of a closed, restricted environment (e.g., a sandbox) and (ii) many malware samples that are controlled by a command and control (C&C) server cease malicious activities upon failing to communicate with their C&C. Also, all antivirus tools and firewalls were disabled.
2) Malware Samples
In total, 113 Linux malware executables were obtained from VirusTotal. To avoid biasing results towards certain malware families, the malware was chosen randomly from various categories (according to VirusTotal) including DoS, DDoS, Backdoor, Trojan, Virus, and Worm, among others.
3) Experiments Deployment
Figure 5 shows an overview of the experiments deployment. The upper dotted box depicts the deployment of a single experiment stack. To simulate a real-world scenario, a commonly used 3-tier web architecture, consisting of web servers, application servers, and a database, was deployed. A front load balancer is deployed to handle and distribute client requests to the appropriate web servers. Web servers are connected to application servers via an internal load balancer that distributes the requests among the application servers. For simplicity, application servers are all connected to a single powerful database server. Further, an auto-scaling policy was implemented based on CPU usage. The same scalability policy is applied to the web and application tiers independently. If the average CPU utilization of all VMs belonging to the web or application tier exceeds 70%, new VMs are spawned and attached to the corresponding load balancer to meet demand. If the CPU utilization falls below 40%, VMs are deleted to reduce resource usage. In our experiments, based on the traffic load, between 2 and 10 servers were spawned in each tier. Random GET/POST requests, denoting clients, were sent to the front load balancer using a multi-process Python script running on a dedicated VM. For integrity of the experiments, the traffic was generated based on an ON/OFF Pareto distribution. This deployment is intended to reflect the real-world dynamic behavior of cloud infrastructures in satisfying tenants’ changing resource requirements.
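The CPU-based scaling rule for one tier can be sketched as a small decision function. The 70% and 40% thresholds come from the description above; adding or removing one VM per decision, and using 2 and 10 as hard lower and upper bounds, are assumptions made for this sketch (the text only reports that between 2 and 10 servers were observed per tier).

```python
def scaling_decision(cpu_utilizations, scale_up_at=70.0, scale_down_at=40.0,
                     min_vms=2, max_vms=10):
    """Return +1 to spawn a VM, -1 to delete one, or 0 to leave the tier unchanged."""
    n_vms = len(cpu_utilizations)
    average = sum(cpu_utilizations) / n_vms
    if average > scale_up_at and n_vms < max_vms:
        return 1
    if average < scale_down_at and n_vms > min_vms:
        return -1
    return 0

web_tier_cpu = [85.0, 78.0, 66.0]  # hypothetical percent CPU per VM in one tier
decision = scaling_decision(web_tier_cpu)
```

Because the rule is evaluated per tier, the web and application tiers can grow or shrink independently of each other.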
The lower part of Figure 5 consists of a main control VM and a data collection VM. The main control VM is responsible for (i) keeping the malware executables in a database, (ii) injecting a single malware into one of the application servers at a certain point in time, and (iii) deploying/destroying an experiment stack. We utilized the OpenStack Heat orchestration service to easily deploy/destroy an experiment stack using
4) RNN Models Training
In total, the experiments resulted in 40,680 collected data samples. This is because the data we collect represent the behavior of all processes in the virtual machine, not just the actual malware executables. Model training was performed at a high-performance computing (HPC) center with four Dell PowerEdge R730 servers, each with one NVIDIA Tesla K80 GPU. The RNN models were built and tested with Python scripts using the Keras API, which is built on top of TensorFlow.
5) RNN Input
As stated in Section III-D, the input to the RNN models is a sequence of vectors, each denoting the features for a particular unique process. In our experiments, the maximum number of unique processes in any experiment is 120, hence, all sequences are padded to be of the same length. The system features (Table 2) collected for each unique process are preprocessed by converting categorical string features to one-hot vectors and standardizing the data values.
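The preprocessing and padding steps can be sketched in plain Python. This is a minimal illustration under several assumptions: the categorical feature `state` and its vocabulary are hypothetical, padding is done by appending all-zero vectors at the end of the sequence, and the standardization statistics are computed on the toy data itself (in practice they would normally be computed on the training set only).

```python
from statistics import mean, pstdev

MAX_UNIQUE_PROCESSES = 120

def one_hot(value, vocabulary):
    """Encode a categorical string as a 0/1 vector over a fixed vocabulary."""
    return [1.0 if value == item else 0.0 for item in vocabulary]

def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

def pad_sequence(sequence, length=MAX_UNIQUE_PROCESSES):
    """Append all-zero feature vectors until the sequence has `length` steps."""
    width = len(sequence[0])
    return sequence + [[0.0] * width for _ in range(length - len(sequence))]

states = ["running", "sleeping"]               # hypothetical categorical feature
raw = [("sleeping", 1.0), ("running", 3.0)]    # (state, cpu_percent) per unique process
z_scores = standardize([cpu for _, cpu in raw])
sample = [one_hot(state, states) + [z] for (state, _), z in zip(raw, z_scores)]
padded = pad_sequence(sample)
```

After padding, every sample has the same shape (120 timesteps by the feature width), which is what allows samples to be batched together for the RNN models.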
C. Evaluation
The performance of our models is measured by four evaluation metrics: accuracy, precision, recall, and F1 score. Here, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
\begin{align*} Accuracy &= \frac{TP+TN}{TP+TN+FP+FN} \\ Precision &= \frac{TP}{TP+FP} \\ Recall &= \frac{TP}{TP+FN} \\ F1~Score &= 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{align*}
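The metric definitions translate directly into code. The sketch below assumes all denominators are non-zero, and the confusion-matrix counts in the usage line are made-up numbers for illustration, not results from our experiments.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only.
metrics = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
```

Precision penalizes benign samples flagged as malicious, while recall penalizes malicious samples that were missed; the F1 score is their harmonic mean.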
D. Results
As stated in Section IV-A, a different malware is used in each of the 113 experiments, and the collected dataset was divided into 60% training, 20% validation, and 20% testing. In order to emphasize the ability of our models to detect zero-day malware, the dataset was split by experiment (i.e., 67 training, 23 validation, and 23 testing experiments). This ensures that the data samples collected from the 23 testing experiments (based on 23 unseen malware) were completely unseen by the RNN models. The training dataset is used to train the RNN models, the validation dataset is used to tune the hyperparameters (e.g., learning rate, batch size, etc.) to get optimal models, and the testing dataset is used to measure the detection ability of the optimized LSTM and BIDI models.
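Splitting at the level of experiments rather than individual samples can be sketched as follows. The shuffling seed and the rounding rule (truncate the training share, then halve the remainder) are assumptions chosen so that 113 experiments yield the 67/23/23 partition stated above.

```python
import random

def split_experiments(experiment_ids, train_frac=0.6, seed=0):
    """Partition experiment IDs into disjoint train, validation, and test lists."""
    ids = list(experiment_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_val = (len(ids) - n_train) // 2
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train_ids, val_ids, test_ids = split_experiments(range(113))
```

All data samples from a given experiment then follow that experiment's ID into exactly one partition, so no sample from a test malware can leak into training.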
To ensure the validity of our results, both LSTM and BIDI models were trained twice (i.e.
Figure 7 depicts the results of our experiment, where the bars shown were produced by calculating different evaluation metrics for each of the LSTM and BIDI optimal models. In our case, the optimal models are identified by the hyperparameters batch size = 32 and learning rate = 1e-5. Both of the
Figure 8 shows the training and validation mean cross entropy loss during the models’ learning progression. Training loss is recorded after each iteration, whereas validation loss is recorded after each epoch. The figure shows that the models were able to properly generalize and learn from the given datasets. The red line indicates the epoch where a particular model scored the highest validation accuracy during the 40 epochs training phase.
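For reference, the mean cross entropy loss for a two-unit softmax output can be computed as in the sketch below. The mapping of class index 0 to benign and 1 to malicious, as well as the probability values, are assumptions made for illustration.

```python
import math

def mean_cross_entropy(probabilities, labels):
    """Average negative log-probability that the model assigns to the true class."""
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

softmax_outputs = [[0.9, 0.1], [0.2, 0.8]]  # made-up per-sample class probabilities
true_labels = [0, 1]                         # assumed: 0 = benign, 1 = malicious
loss = mean_cross_entropy(softmax_outputs, true_labels)
```

The loss approaches zero as the probability assigned to the correct class approaches one, which is the behavior the training and validation curves track.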
RNN Sensitivity to Different Input Representations
In Section III-D, we described how we construct the samples that are used in our experiments. Each sample consists of a sequence of unique processes. However, it is not clear whether the order of unique processes and features in a single sample would affect the RNN models’ ability to learn and generalize effectively. Altering the ordering of the input data can often reveal insights as to how to best train certain models. For instance, the authors in [41] provided an analysis on the effects of input ordering when using CNN models. They used similar process system features for malware detection using CNN models and studied the effects of processes and features ordering in the input, represented as an image (denoting processes
In this section, we provide an analysis of whether the order of the sequence in a single sample (denoted by row models) as well as the order of features (denoted by col models) would affect the results of the RNN models. The key intuition of this analysis lies in the fact that some unique processes might be closely related, and placing them in close proximity in the input sequences might help the models to draw and learn such correlations more easily. For example, consider the two unique processes of the FastCGI Process Manager (FPM): php-fpm: master and its forked pool of processes php-fpm: pool (see Figure 3). Similarly, some system features might be related. For example, CPU usage features such as cpu_user, cpu_sys, cpu_num, and cpu_percent are closely related. As shown in Figure 9, different row orderings are created by randomly changing the sequence of unique processes for all samples. Similarly, different column orderings are created by randomly changing the order of the features that belong to each unique process. Random reordering makes it likely that related processes or features no longer appear in adjacent positions in a given sample.
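The row and column reorderings can be sketched as follows. This illustration assumes that one random permutation is drawn per ordering and applied identically to every sample, and that samples are nested lists indexed as `samples[sample][process][feature]`; both are assumptions, not a description of our exact scripts.

```python
import random

def reorder(samples, seed, axis):
    """Apply one random permutation to the rows (processes) or columns (features) of all samples."""
    rng = random.Random(seed)
    if axis == "row":
        perm = list(range(len(samples[0])))
        rng.shuffle(perm)
        return [[sample[i] for i in perm] for sample in samples]
    perm = list(range(len(samples[0][0])))
    rng.shuffle(perm)
    return [[[row[j] for j in perm] for row in sample] for sample in samples]

# Two toy samples, each with two unique processes and three features.
samples = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
row_variant = reorder(samples, seed=1, axis="row")
col_variant = reorder(samples, seed=1, axis="col")
```

Different seeds produce different orderings, so several row or column variants of the same dataset can be generated and used to train separate models.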
A. Random Ordering Results
In our experiments, we trained four LSTM (i.e. LSTM
Figure 10 shows the results of
Results of LSTM and BIDI (
Mean cross entropy loss for LSTM and BIDI (
Cost Analysis
In this section, we provide a cost analysis with respect to the training time of the LSTM and BIDI models. One problem with training RNN models is choosing the number of training epochs. Training the model for too many epochs can lead to overfitting (even with dropout layers), whereas training for too few epochs may lead to underfitting. In our case, we determined that 40 epochs are sufficient for our models to properly converge. During these 40 epochs, the set of weights that achieved the highest accuracy on the validation data set is recorded and chosen as the optimal model in each case. However, some models converge and learn faster than others.
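Selecting the best set of weights by validation accuracy amounts to finding the epoch with the highest recorded value. The sketch below assumes that ties are resolved in favor of the earliest epoch, and the accuracy history is made up for illustration.

```python
def best_epoch(val_accuracies):
    """Return (1-based epoch, accuracy) of the highest validation accuracy; the earliest epoch wins ties."""
    best_index = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best_index + 1, val_accuracies[best_index]

history = [0.91, 0.95, 0.97, 0.96, 0.97]  # illustrative values only
epoch, accuracy = best_epoch(history)
```

The epoch returned this way also indicates how much of the training budget was actually needed, which is the quantity the cost analysis compares across models.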
Table 5 shows the epochs where the RNN models achieved the highest validation accuracy along with the time taken in seconds and the corresponding loss. In general, the LSTM models converged relatively faster than the BIDI models. The LSTM
As reflected by the results, all our models were able to detect the presence of malware within one input sample.
Discussion and Limitations
In this section, we discuss some rationale about our results as well as limitations and potential future work improvements.
The results in Section IV-D illustrated that both LSTM and BIDI models achieve almost equally high performance. However, it was clear from the results in Section VI that the LSTM models achieved such performance in a shorter amount of training time. Even though the
The results derived from the experiments in Section V-A showed that input representation in terms of the order of unique processes and features is a major concern in terms of detection performance and training cost. The
One limitation in our work lies in the size of our experiments. We conducted 113 experiments each using a different malware executable, but more samples would allow us to obtain a deeper understanding of how our detection models perform against differing malware types. Another limitation in this work is the assumption that a VM is infected by a single malware. In practice, a VM can be infected by multiple malware simultaneously. An analysis of whether our detection system will work as expected during the presence of multiple malware working at once is needed. Further, our work focuses on detecting malware in a single VM. However, in cloud auto scaling architectures, a malware that infects a single VM can propagate to similarly configured VMs fairly quickly. As such, malware propagation as well as multiple malware infections are left to future work. Another limitation to our work is that it is possible for malware to slip in between the averages within a group of processes with the same name. This is a common drawback associated with any methodology that generates meta-stats (e.g., average, standard deviation, etc.). This drawback is confined to the unique process aspect of our approach since this is where we are averaging the measurements of processes in order to reduce the number of features.
Conclusion
In this paper, we introduce an approach of using LSTM and BIDI models for online malware detection based on processes system features. Results showed that both LSTM and BIDI models achieved outstanding performance (over 99%) on the testing dataset; however, the LSTM models required less time than the BIDI models to achieve such performance. Additionally, we analyzed the impact of input representations on our models by conducting random ordering experiments with respect to unique processes and features (i.e.,
In the future, we plan to increase the scale of our experiments by using thousands of malware samples including more malware families. Additionally, we plan to study the impacts of malware propagation to similarly configured VMs in a cloud environment on the robustness of our detection models. We also plan to study the impacts of multiple malware infections to the same VM.
Appendix A
Random Column Orderings
Table 6, Figure 12, and Figure 13 show the remaining results generated by the LSTM and BIDI (col
Appendix B
Random Row Orderings
Table 8, Figure 14, and Figure 15 show the remaining results generated by the LSTM and BIDI (row