Cloud Intrusion Detection Method Based on Stacked Contractive Auto-Encoder and Support Vector Machine

Security issues have resulted in severe damage to the cloud computing environment, adversely affecting its healthy and sustainable development. Intrusion detection is one of the technologies for protecting the cloud computing environment from malicious attacks. However, network traffic in the cloud computing environment is characterized by large scale, high dimensionality, and high redundancy; these characteristics pose serious challenges to the development of cloud intrusion detection systems. Deep learning technology has shown considerable potential for intrusion detection. Therefore, this study aims to use deep learning to extract essential feature representations automatically and realize high detection performance efficiently. An effective stacked contractive autoencoder (SCAE) method is presented for unsupervised feature extraction. By using the SCAE method, better and more robust low-dimensional features can be automatically learned from raw network traffic. A novel cloud intrusion detection system is designed on the basis of the SCAE and the support vector machine (SVM) classification algorithm. The SCAE+SVM approach combines deep and shallow learning techniques and fully exploits their respective advantages to significantly reduce the analytical overhead. Experiments show that the proposed SCAE+SVM method achieves higher detection performance than three other state-of-the-art methods on two well-known intrusion detection evaluation datasets, namely KDD Cup 99 and NSL-KDD.

approaches provide unsatisfactory classification or detection accuracy. The intrusion detection result depends not only on the performance of the classifier but also on the quality of the input data. Network traffic data usually involves high dimensionality and redundancy of features, which can easily cause the curse of dimensionality. Therefore, feature dimensionality reduction is particularly important for effectively improving the performance of the above-mentioned supervised classifiers [7]. It includes two types of techniques: feature subset selection and feature extraction [8]. Feature subset selection works by removing irrelevant or redundant features; the selected subset of features should give the best performance according to some objective function. Many studies [9], [10], [11], [12], [13] have demonstrated that feature selection methods can overcome the "dimensionality curse" and achieve high detection performance in CIDS. Feature extraction maps the original high-dimensional features into low-dimensional features and generates new linear or nonlinear combinations of the original features [8]. Recently, various researchers have demonstrated that deep learning technology has considerable potential for IDS, especially in feature extraction. This study aims to use deep learning techniques to automatically extract essential features from raw network data and input them into a shallow classifier for effective identification of attacks.
The remainder of this paper is organized as follows. Section 2 reviews existing studies. Section 3 presents relevant background information. Section 4 describes the design and training process of the proposed stacked contractive autoencoder (SCAE)-SVM model in detail. Section 5 discusses the design of the CIDS framework. Section 6 presents and analyzes the experimental results. Finally, Section 7 states the conclusions and explores directions for future work.

EXISTING STUDIES
Deep learning approaches are mainly categorized into supervised learning and unsupervised learning; the difference between the two lies in the use of labeled training data. Specifically, the convolutional neural network (CNN) [14], which uses labeled data and employs a special architecture suitable for image recognition, falls under supervised learning. Unsupervised learning methods include the deep belief network (DBN) [15], recurrent neural network (RNN) [16], autoencoder (AE) [17], and their variants. Next, we describe recent studies related to our work; these studies are mainly based on the KDD Cup 99 or NSL-KDD datasets.
Studies on intrusion detection using the NSL-KDD dataset have been reported [18], [19], [20], [21], [22]. Tang et al. [18] used deep neural networks (DNNs) to build an anomaly detection model in the software-defined network (SDN) environment. They trained their model by using 6 basic features taken from the 41 features of the NSL-KDD dataset. Salama et al. [19] used the DBN to extract features for intrusion detection and the SVM to classify the data after dimensionality reduction. Experimental results showed that their hybrid DBN+SVM method improves the detection performance compared with using SVM or DBN as standalone classifiers. Aygun et al. [20] proposed two deep learning-based anomaly detection models using the AE and the denoising AE to detect zero-day attacks with high accuracy. They also used a stochastic approach to determine the threshold value that directly affects the accuracy of the proposed models. Niyaz et al. [21] developed an effective and flexible NIDS referred to as self-taught learning (STL), which combines a sparse AE used for unsupervised dimensionality reduction with softmax regression used to train the classifier. This method achieved satisfactory classification accuracy in 2-class, 5-class, and 23-class classification tasks. Subsequently, Niyaz et al. [22] developed a DDoS intrusion detection system and applied it to the SDN environment. They used the stacked auto-encoder (SAE) for feature reduction and evaluated the detection performance of the SAE-SVM model on network traffic collected from real and private network test beds.
Studies on intrusion detection using the KDD Cup 99 dataset have been reported [23], [24], [25]. Kim et al. [23] specifically targeted advanced persistent threats and proposed a deep neural network (DNN) using 100 hidden units, combined with the rectified linear unit activation function and the ADAM optimizer. Their approach was implemented on a GPU using TensorFlow. Papamartzivanos et al. [24] proposed a novel method that combines the benefits of a sparse AE and the MAPE-K framework to deliver a scalable, self-adaptive, and autonomous misuse IDS. They merged the datasets provided by KDD Cup 99 and NSL-KDD to create a single voluminous dataset. Shone et al. [25] designed a new non-symmetric deep autoencoder (NDAE) model, which, unlike typical AEs, provides non-symmetric data dimensionality reduction. This model was combined with an RF classification algorithm to construct a classifier. This method achieved satisfactory results on the KDD Cup 99 and NSL-KDD datasets.
Studies using private or other public datasets for intrusion detection have also been documented [26], [27], [28]. Loukas et al. [26] used RNN-based deep learning enhanced by long short-term memory (LSTM) to considerably increase intrusion detection accuracy for a robotic vehicle. They demonstrated that their approach achieves high accuracy with considerably more consistency than standard machine learning techniques. Yu et al. [27] proposed a network intrusion detection model based on TCP, UDP, and ICMP sessions, which combines the stacked denoising auto-encoder (SDAE) and a softmax classifier. Comparative experiments showed that the performance of this model is better than that of the DBN and SAE in 2-class and 8-class classification tasks. This method was also validated on other public datasets, such as the UNB ISCX IDS 2012 and CTU-13 datasets. Later, Yu et al. [28] proposed a stacked dilated convolutional autoencoder (DCAE) method, which can automatically learn key features from a large amount of raw network traffic. The DCAE has fewer parameters than fully connected neural networks such as the SAE. However, the limitation of the DCAE model lies in its relatively long training process, and the authors planned to adopt GPU parallelization technology to overcome this problem in the future. Yu et al. trained and tested their model with a private dataset, so the model cannot be directly compared with other schemes.
From the above-mentioned findings, we can conclude that deep learning has been successfully applied to network intrusion detection. However, it remains in its infancy. Most researchers are still combining it with various algorithms through numerous experiments (structure, training, optimization, etc.) to explore the most effective solution. Hence, we believe that our study can make a significant contribution to cloud intrusion detection research.
The contributions of this study can be summarized as follows:
- We propose a novel stacked contractive autoencoder (SCAE)-based deep learning method for network intrusion detection. The SCAE method can automatically learn essential and robust low-dimensional features from raw network traffic and input them into a shallow SVM classifier. Leveraging the respective strengths of deep and shallow learning techniques, their combination can effectively improve detection performance.
- We design a cloud intrusion detection system (CIDS) that uses an SDN framework to collect virtual network traffic from the Xen cloud platform and applies the SCAE+SVM model for feature extraction and classification detection. The proposed system attempts to detect attacks on the data plane and is implemented on the application plane.
- We evaluate our SCAE+SVM IDS model by applying it to two well-known datasets frequently used to evaluate the detection performance of IDS. The experimental results show the effect of the network structure depth and the number of extracted features on the detection performance. In addition, they demonstrate that our model achieves better or similar results compared with existing similar models.

Autoencoder (AE)
An autoencoder (AE) is an unsupervised feature dimensionality reduction technique. Its structure consists of an input layer, a hidden layer, and an output layer, which form an encoder and a decoder. The encoder is used for dimensionality reduction, and the decoder is used for reconstruction, which can be regarded as the reverse process of the encoder.
Let a training dataset D have n samples; then D = {(x^(i), y^(i)) | x^(i) ∈ R^{d_x}, y^(i) ∈ R}_{i=1}^{n}, where each sample x is a d_x-dimensional feature vector and y is the class label. The encoder function f maps the input x into a hidden representation h ∈ R^{d_h}, and the decoder function g maps the hidden representation h back to a reconstruction z ∈ R^{d_x}. When the number of hidden-layer neurons is less than the number of input-layer and output-layer neurons, that is, d_h < d_x, we obtain a compressed vector of the input and thus realize dimensionality reduction. The encoding and decoding processes are defined as

h = f(wx + b),   (1)
z = g(w'h + b'),   (2)

where f and g are non-linear activation functions (typically sigmoid or hyperbolic tangent functions), w and w' are the weight matrices, with w a d_h × d_x matrix and w' = w^T, and b and b' are the bias vectors of d_h and d_x dimensions, respectively.
Here, we use θ = {w, b} and θ' = {w', b'}, where θ and θ' represent the parameters of the encoder and decoder, respectively. The goal of learning is to minimize the reconstruction error between the input x and the output z by adjusting these parameters. The minimization objective function is defined as

J_AE(θ, θ') = Σ_{i=1}^{n} L(x^(i), g_{θ'}(f_θ(x^(i)))),   (3)

where L is the loss function or reconstruction error function, which is typically the mean squared error (4) or the cross-entropy loss (5).
L(x, z) = (1/n) Σ_{i=1}^{n} ||x^(i) − z^(i)||²,   (4)

L(x, z) = −(1/n) Σ_{i=1}^{n} [x^(i) log z^(i) + (1 − x^(i)) log(1 − z^(i))].   (5)

In general, the smaller the reconstruction error, the closer the output z is to the input x, which implies that h is an effective low-dimensional feature representation. However, with the reconstruction criterion alone, the AE may simply learn to copy the input to the output, in which case it cannot effectively extract features. To address this problem, we can adopt strategies such as constraining the representation or corrupting the input by adding noise.
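As an illustration of the basic AE mappings and reconstruction losses described above, the following NumPy sketch runs one encode/decode pass on a toy record. The dimensions, random weights, and tied decoder w' = w^T are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 41, 8                              # input and hidden dimensions (d_h < d_x)
w = rng.normal(scale=0.1, size=(d_h, d_x))    # encoder weights, d_h x d_x
b = np.zeros(d_h)                             # encoder bias
w_p = w.T                                     # tied decoder weights, w' = w^T
b_p = np.zeros(d_x)                           # decoder bias

x = rng.uniform(size=d_x)                     # one (already normalized) traffic record
h = sigmoid(w @ x + b)                        # encoding: h = f(wx + b)
z = sigmoid(w_p @ h + b_p)                    # decoding: z = g(w'h + b')

mse = np.mean((x - z) ** 2)                   # mean squared error loss
ce = -np.mean(x * np.log(z) + (1 - x) * np.log(1 - z))  # cross-entropy loss
print(h.shape, z.shape)
```

Because d_h = 8 < d_x = 41, the hidden vector h is the compressed representation the classifier would later consume.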

Denoising Autoencoder (DAE)
Unlike the conventional AE, the denoising autoencoder (DAE) [29] aims to learn a more effective and robust feature representation from a corrupted input x̃. The DAE first corrupts the input x into x̃ and then feeds x̃ into the autoencoder for denoising; finally, it reconstructs the clean version of x. This process yields the following objective function:

J_DAE(θ, θ') = Σ_{i=1}^{n} E_{x̃^(i) ∼ q(x̃^(i)|x^(i))} [L(x^(i), g_{θ'}(f_θ(x̃^(i))))],   (6)

where the expectation is over the corrupted versions x̃ of samples x obtained from a corruption process q(x̃|x). Common corruption approaches include additive isotropic Gaussian noise (GS), i.e., x̃ = x + ε with ε ∼ N(0, σ²I), and binary masking noise (MN), which randomly sets a fraction of the input features to 0.
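The two corruption processes can be sketched as follows; the noise level σ and masking fraction ν below are illustrative values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=41)                      # one normalized input record

# Additive isotropic Gaussian noise: x_tilde = x + eps, eps ~ N(0, sigma^2 I)
sigma = 0.1
x_gs = x + rng.normal(scale=sigma, size=x.shape)

# Binary masking noise: randomly force a fraction nu of the features to 0
nu = 0.3
mask = rng.uniform(size=x.shape) >= nu        # keep roughly 70% of the features
x_mn = x * mask

print(x_gs.shape, int(np.count_nonzero(x_mn)))
```

The DAE is then trained to reconstruct the clean x from either corrupted version x_gs or x_mn.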

Contractive Autoencoder (CAE)
Rifai et al. followed up on the DAE and proposed the contractive autoencoder (CAE) [30]. The aim of the CAE is likewise to learn a robust feature representation; however, although the DAE and CAE share this purpose, they adopt two distinct methods. The DAE learns a robust feature representation from a relatively intuitive perspective, by randomly adding noise to the input, whereas the CAE learns a robust feature representation from an analytical perspective, through regularization. The minimization objective function is given by

J_CAE(θ, θ') = Σ_{i=1}^{n} L(x^(i), g_{θ'}(f_θ(x^(i)))) + λ ||J_f(x)||²_F,   (7)

where J_f(x) is the Jacobian matrix of partial derivatives of the hidden features h with respect to the input x, and λ is the penalty coefficient used to adjust the proportion of the penalty term in the objective function.
Further, ||J_f(x)||²_F is the squared Frobenius norm (F-norm) of the Jacobian matrix:

||J_f(x)||²_F = Σ_{i=1}^{d_x} Σ_{j=1}^{d_h} (∂h_j(x)/∂x_i)².   (8)

When the activation functions f and g are the sigmoid function, that is, sigmoid(x) = 1/(1 + e^{−x}), the contractive penalty term can easily be computed as

||J_f(x)||²_F = Σ_{j=1}^{d_h} [h_j(1 − h_j)]² Σ_{i=1}^{d_x} w_{ji}²,   (9)

where h_j ∈ h is the j-th element of the hidden representation, d_x and d_h denote the dimensions of the input x and the hidden representation h, respectively, and w_{ji} is the element of the d_h × d_x weight matrix connecting input x and hidden representation h. The overall computational complexity is O(d_x × d_h).
As can be seen, the final objective function consists of two parts. First, the reconstruction error function is used to obtain as much effective information as possible from the input data. Second, the newly added penalty term is used to suppress minor perturbations in the input data, and by introducing the F norm of the Jacobian matrix as the constraint term, the learned feature is made locally invariant.
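The closed-form penalty (9) can be checked numerically against the full Jacobian. The sketch below uses random toy weights (illustrative only) and confirms that both computations agree.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 41, 8
w = rng.normal(scale=0.1, size=(d_h, d_x))    # encoder weights
b = np.zeros(d_h)
x = rng.uniform(size=d_x)

h = sigmoid(w @ x + b)

# Full Jacobian J_f(x) = dh/dx: entry (j, i) is h_j (1 - h_j) * w_{ji}
J = (h * (1 - h))[:, None] * w
frob_sq_direct = np.sum(J ** 2)

# Closed form: sum_j [h_j (1 - h_j)]^2 * sum_i w_{ji}^2, in O(d_x * d_h)
frob_sq_closed = np.sum((h * (1 - h)) ** 2 * np.sum(w ** 2, axis=1))

print(np.isclose(frob_sq_direct, frob_sq_closed))
```

The closed form avoids materializing the d_h × d_x Jacobian, which is what keeps the penalty cheap to evaluate during training.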

Contrastive Analysis
The three technologies described in Sections 3.1 to 3.3 can achieve feature dimensionality reduction. However, the CAE has some advantages compared with the AE and DAE. In general, there are two criteria for good feature representation: (1) good reconstruction of input data, and (2) excellent robustness when the input data is disturbed to a certain extent. The AE satisfies only the first criterion, while the DAE and CAE satisfy both criteria. In some classification tasks, the second criterion is more important. Therefore, the DAE and CAE are more suitable for the task of classification detection.
In addition, there are at least three differences between the DAE and the CAE. First, the CAE penalizes the sensitivity of the features directly rather than indirectly: the DAE encourages a robust reconstruction g(f(x)), so the robustness of f(x) is only partial or indirect, whereas the CAE penalizes f(x) itself rather than g(f(x)). Since the encoder f(x) is what is used for classification, robustness of the extracted features matters more than robustness of the reconstructed inputs. Second, the DAE improves the robustness of feature extraction by adding random noise to the input data, while the CAE achieves robustness against perturbations by calculation; its robustness is analytic rather than stochastic. Third, the CAE can finely control the trade-off between the reconstruction term and the penalty term by setting the hyperparameter λ. For these reasons, we believe that the CAE is a better choice than the DAE for learning useful features and achieving higher classification accuracy.

PROPOSED METHODOLOGY
In this section, we first describe the SCAE used for feature learning. Subsequently, we describe the training process of the SCAE model, and the SVM classifier used for multiclass anomaly detection.

Designing SCAE Model for Feature Learning
In general, deep feedforward networks have many advantages, which are also applicable to the AE and its variants. The SCAE consists of several hidden layers for encoding, and a set of symmetrical layers for decoding in which the output of each layer is fed as the input of the subsequent layer. The detailed structure of the SCAE is presented in Fig. 1. Here, the superscript numbers refer to the hidden layer identity, and the subscript numbers signify the dimension for the layer.
In the encoding phase, the k-th hidden layer of the SCAE learns k-th-order features from the output of the (k-1)-th layer. That is, the first hidden layer learns first-order features from the raw input, the second hidden layer learns second-order features from the first-order features, and subsequent higher layers learn higher-order features. Conversely, in the decoding phase, the (k-1)-th-order features are reconstructed from the output of the k-th layer, and so on, until the input is reconstructed.
Thus, the encoding and decoding processes of the SCAE network can be expressed as

x^k = f(w^k x^{k−1} + b^k), k = 1, 2, …, m,   (10)
z^{k−1} = g(w'^k z^k + b'^k), k = m, m−1, …, 1.   (11)

Here x^0 represents the input data, x^k represents the k-th-order features learned by the k-th hidden layer, and x^m denotes the low-dimensional m-th-order features that will be transported to the classifier, where m is the number of hidden layers, or the depth of the network. We let z^m = x^m and map the m-th-order features layer-wise back to the reconstruction z^0. We use θ^k = {w^k, b^k} and θ'^k = {w'^k, b'^k} to represent the parameters of the k-th encoder and decoder, respectively. Thus, the minimization objective function of the SCAE is expressed as follows:

J_SCAE(θ, θ') = Σ_{i=1}^{n} L(x^{0(i)}, z^{0(i)}) + Σ_{k=1}^{m} λ_k ||J_{f^k}(x^{k−1})||²_F.   (12)

From (12), we can see that the SCAE poses some practical problems. The first is that, compared with the single-hidden-layer case, the objective function is difficult to optimize for a deep network with multiple layers. Thus, we use the greedy layer-wise strategy, in which the learning problem is broken down into simpler steps: several basic CAEs are trained separately and then stacked into a deep neural network (detailed in the next section). Since each layer is trained to be locally contractive, the SCAE is also naturally contractive. Another problem is that the penalty term may be considerably larger than the reconstruction error; hence, we can either set a small λ or modify the penalty term, for example by taking its average value. Here, we use the latter method; thus, the minimization objective function of each CAE network is

J_CAE_k(θ^k, θ'^k) = Σ_{i=1}^{n} L(x^{k−1(i)}, z^{k−1(i)}) + (λ_k / (d_{k−1} d_k)) ||J_{f^k}(x^{k−1})||²_F,   (13)

where d_{k−1} and d_k represent the feature dimensions of the (k−1)-th layer and the k-th layer, respectively, and λ_k is the penalty coefficient of the k-th CAE network. λ_k is set such that the penalty value is smaller than, but of the same order of magnitude as, the reconstruction error, which yields higher classification performance.

Training Process of SCAE
Fundamentally, the exact structure of our deep learning model will be obtained through experiments and training on a large number of structural combinations (i.e., number of hidden layers, number of neurons). Next, we introduce the training process of the SCAE-SVM model in detail, as shown in Fig. 2. The training process can be divided into three stages: unsupervised greedy layer-wise pretraining, unrolling, and supervised fine-tuning.
In the pretraining stage, the greedy layer-wise strategy is used to train a series of basic CAEs separately, and the output of each CAE hidden layer is fed as the input of the next CAE network. Specifically, a CAE network is first trained by minimizing the objective function and the network parameters such as weight and bias are recorded. The second CAE network is then trained by taking the hidden-layer output of the first CAE as input, and so on. Each CAE network is trained and all parameters can be initialized.
After the pretraining of each basic CAE network, the hidden layers of the CAE networks are unrolled and stacked into a deep CAE network (i.e., the SCAE); in other words, only the encoders are retained, while the decoders and their parameters are discarded.
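The pretraining and unrolling stages can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: each layer is trained here with a plain reconstruction loss (the contractive penalty and the fine-tuning stage are omitted for brevity), and the 41-28-16-8 layer sizes follow the structure reported later for the NSL-KDD dataset.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(X, d_h, epochs=200, lr=0.05):
    """Train one basic AE by gradient descent on the reconstruction error
    (contractive penalty omitted); return only the encoder parameters."""
    N, d_in = X.shape
    W = rng.normal(scale=0.1, size=(d_h, d_in)); b = np.zeros(d_h)
    Wp = rng.normal(scale=0.1, size=(d_in, d_h)); bp = np.zeros(d_in)
    for _ in range(epochs):
        H = sigmoid(X @ W.T + b)                 # encode
        Z = H @ Wp.T + bp                        # linear decoder
        dZ = 2.0 * (Z - X) / N                   # gradient of mean sq. error
        dWp, dbp = dZ.T @ H, dZ.sum(axis=0)
        dA = (dZ @ Wp) * H * (1 - H)             # back-prop through sigmoid
        dW, db = dA.T @ X, dA.sum(axis=0)
        W -= lr * dW; b -= lr * db; Wp -= lr * dWp; bp -= lr * dbp
    return W, b

# Greedy layer-wise pretraining: 41 -> 28 -> 16 -> 8
X = rng.uniform(size=(256, 41))                  # toy "preprocessed traffic"
encoders, inp = [], X
for d_h in (28, 16, 8):
    W, b = pretrain_layer(inp, d_h)
    encoders.append((W, b))
    inp = sigmoid(inp @ W.T + b)                 # hidden output feeds next CAE

# Unrolling: keep only the stacked encoders; decoders are discarded
def scae_encode(X):
    for W, b in encoders:
        X = sigmoid(X @ W.T + b)
    return X

features = scae_encode(X)
print(features.shape)                            # low-dimensional features
```

After unrolling, `features` (eight dimensions per record here) is what would be handed to the downstream classifier; in the paper, the stacked encoder is additionally fine-tuned end to end before that hand-off.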
Fine-tuning is the process of further adjusting the initial parameters to obtain an optimal model. For this purpose, we mainly use the multiclass cross-entropy loss function, which measures the difference between the predicted and the actual target probability distributions:

J = −(1/n) Σ_{i=1}^{n} [y^(i) log y'^(i) + (1 − y^(i)) log(1 − y'^(i))],   (14)

where y and y' denote the actual and predicted values of a class, respectively. Here, back-propagation (BP) and the ADAM optimizer are used to minimize the loss, thereby determining the corresponding learning parameters and achieving an optimized model. After completing the three stages mentioned above, the exact structure of the SCAE can be definitively determined. For the NSL-KDD dataset, the structure of the SCAE includes three hidden layers consisting of 28, 16, and 8 neurons, where each neuron represents a feature. These learned low-dimensional features can be used to train the classifier.
However, the classification power of SCAE with multiclass cross-entropy is relatively weak compared to other discriminative models such as the NN, DT, and SVM. SVM shows excellent performance in handling classification problems and has gradually emerged as a popular solution in the field of anomaly detection. Hence, we can combine the deep learning power of SCAE with the shallow learning technique of SVM.

Output Layer: SVM Classifier
SVM is essentially a binary classification model, but attack types in the cloud computing environment are diverse; hence, more than one classifier should be employed. SVM can solve multi-class (m-class) classification problems via two methods: "one-versus-one" (OVO) and "one-versus-all" (OVA) [31]. OVO takes the class-i and class-j samples from the training dataset, labels them as the positive and negative classes, respectively, and constructs m(m−1)/2 binary classifiers. By contrast, OVA takes the class-j samples from the training dataset and labels them as the positive class, while the remaining samples are labeled as the negative class; it thus constructs m binary classifiers.
Obviously, the OVA approach requires fewer binary classifiers. Rifkin and Klautau [32] demonstrated that OVA is preferred to OVO. Hence, we employ the SVM using the OVA approach to construct our classifier.
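A minimal OVA construction can be sketched with scikit-learn, assuming its availability; `OneVsRestClassifier` fits one binary RBF-kernel SVM per class (m = 5 here, matching the 5-class task), and the data below is a synthetic stand-in for the extracted features, not the paper's data.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy stand-in for the 8 SCAE features; 5 classes:
# 0=Normal, 1=DOS, 2=Probe, 3=R2L, 4=U2R
X = np.random.default_rng(4).normal(size=(200, 8))
y = np.arange(200) % 5
X[np.arange(200), y] += 3.0   # shift one coordinate per class for separability

# OVA: one binary RBF-kernel SVM per class, i.e., m = 5 binary classifiers
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma=0.125))
clf.fit(X, y)

scores = clf.decision_function(X)   # one score column per class
pred = clf.predict(X)
print(scores.shape, len(clf.estimators_))
```

Each column of `scores` is the margin of one binary "class versus rest" SVM; the predicted class is the one whose classifier responds most strongly.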

CLOUD INTRUSION DETECTION SYSTEM BASED ON SCAE AND SVM
Here, we use SDN technology to build our CIDS, which decouples the traditional network structure into the data plane, control plane, and application plane. The CIDS framework is shown in Fig. 3. An OpenFlow virtual switch (OVS) is used to forward the virtual network flow; this represents the data plane. A network controller (NC) is used to install the flow table, perform routing control, and collect network traffic; this represents the control plane. The NC configures and manages the OVS according to the OpenFlow protocol. The application plane includes various applications that implement different functions, such as the anomaly detection application. The anomaly detection application performs three main functions: (1) data preprocessing, in which the network traffic is transformed and standardized; (2) classifier training, in which the SCAE+SVM model used for feature extraction and classification detection is trained on the preprocessed network traffic; and (3) attack recognition, in which the trained classifier is used to detect intrusions in the testing dataset or online network traffic.

Data Collection
Usually, data collection is the first and a critical step in intrusion detection. In our study, we use the OVS and NC to collect virtual network traffic data from the Xen cloud platform, as shown in Fig. 4. Each physical machine (PM) consists of a privileged domain named dom0 and a non-privileged domain named domU. PMs are connected by a traditional switch, and VMs are connected by the OVS, which is deployed in dom0. A VM can communicate with another VM in the same or a different PM through the OVS. The OVS is used to forward virtual network flow, while the NC is responsible for routing control and network flow collection. The network traffic obtained by the NC is handed over to the anomaly detection application for intrusion detection.

Data Preprocessing
Data preprocessing mainly includes data transformation and standardization. Data transformation is used to convert nominal features into numeric values. For example, in the NSL-KDD dataset, there are three nominal features: protocol type, service type, and TCP status flag. The attack type is also nominal; we convert it into numeric values as well. For example, 0, 1, 2, 3, and 4 represent Normal, DOS, Probe, R2L, and U2R, respectively.
In addition, to eliminate the bias in favor of features with greater values as well as the large number of sparse features whose values are 0, we use standardization to scale each feature value into a well-proportioned range. Here, we use the Z-score method for standardization.
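The two preprocessing steps can be sketched as follows; the nominal-to-numeric maps are hypothetical examples (the paper specifies only the attack-label encoding, not the encodings of protocol or service values), and the feature matrix is toy data.

```python
import numpy as np

# Hypothetical nominal-to-numeric maps (protocol encodings are illustrative;
# the label encoding 0..4 follows the paper)
PROTOCOL = {"tcp": 0, "udp": 1, "icmp": 2}
LABEL = {"Normal": 0, "DOS": 1, "Probe": 2, "R2L": 3, "U2R": 4}

records = [("tcp", "Normal"), ("udp", "DOS"), ("icmp", "Probe")]
numeric = [(PROTOCOL[p], LABEL[c]) for p, c in records]

# Z-score standardization: scale each feature column to zero mean, unit variance
X = np.array([[0.0, 215.0],
              [0.0, 45076.0],
              [1.0, 0.0]])
mu, sd = X.mean(axis=0), X.std(axis=0)
sd[sd == 0] = 1.0              # guard against constant (e.g., all-zero) columns
X_std = (X - mu) / sd

print(numeric, X_std.mean(axis=0))
```

Z-scoring removes the bias toward large-valued features (such as byte counts) noted above, since every column ends up on the same scale.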

SCAE+SVM Classifier
When building classifiers or other predictors, incorporating feature learning methods can lead to dimensionality reduction and high detection performance. Here, we use the SCAE deep learning algorithm to extract essential features from raw network traffic. Note that the SCAE is pretrained in an unsupervised mode and fine-tuned by a supervised back-propagation algorithm. Once the essential features are extracted, they are used to train the SVM classifier. Here, the SVM classifier exploits the OVA approach to distinguish between normal and abnormal data. We consider SCAE+SVM as a whole, or a black box, and the learned features are not visible.

Attack Detection
After the SCAE+SVM classifier has been trained, we use the trained and saved classifier to detect attacks in the testing data or online traffic. When network traffic is fed to the SCAE+SVM classifier, an output is generated indicating whether the data is normal or an attack (i.e., DOS, Probe, R2L, or U2R). For example, if the classifier considers records to be normal, those records are labeled as Normal and the others as non-Normal. Likewise, if the classifier considers records to be DOS, those records are labeled as DOS and the others as non-DOS (including Normal, Probe, R2L, and U2R), and so on.

EXPERIMENTAL RESULTS AND ANALYSIS
To verify the effectiveness of the proposed SCAE+SVM model, we conduct several experiments. First, we introduce the NSL-KDD [33] and KDD Cup 99 [34] datasets used in this paper. Second, we describe the experimental design and environment, including the classification performance metrics and model parameters. Finally, we present the experimental results and a comparative analysis.

Dataset
In our experiments, we use two well-known intrusion detection evaluation datasets that are widely used to validate CIDS. The KDD Cup 99 dataset contains around five million training data records and two million testing data records. Here, we use around 10% of the records, i.e., 494,021 training data records and 311,029 testing data records, to evaluate the SVM classifier. By removing the records that appear only in the testing data but not in the training data, we are left with 292,300 testing records. Each record consists of 41 different features and is labeled as either normal or an attack. These attacks can be classified into four types: DOS, Probe, R2L, and U2R.
As a revised version of the KDD Cup 99 dataset, the NSL-KDD dataset proposed by Tavallaee et al. [35] contains 125,973 training data records and 22,544 testing data records. Similarly, by removing the records that appear only in the testing data but not in the training data, we are left with 18,794 testing records. Each record in the NSL-KDD dataset is also composed of 41 different features. Table 1 summarizes the exact distribution of the KDD Cup 99 and NSL-KDD datasets.
Although these two datasets have some limitations and 41 is a relatively small number of dimensional features, they are used widely in similar studies. Therefore, we validate the performance of the SCAEþSVM model using these two datasets.

Experimental Design and Environment
Here, we design two groups of experiments to verify the effectiveness of the proposed SCAEþSVM model. These two experiments answer two questions: (1) whether SCAE can extract effective features and achieve the objective of dimensionality reduction, (2) whether the proposed SCAEþSVM model can effectively improve the detection performance of the CIDS.
Therefore, we will demonstrate the ability of the SCAE to generate low-dimensional features as well as the effect of the parameters in the SCAE+SVM model on the intrusion detection efficiency. Then, we will compare the results of the SCAE+SVM method with those of three state-of-the-art approaches and those obtained by two similar methods (see Table 2). The structures of these three approaches are similar to that of the SCAE+SVM (here, the SDAE employs 0.3 times Gaussian noise). We perform three classification tasks, i.e., 2-class, 5-class, and 13-class classification, on the NSL-KDD dataset, and two classification tasks, i.e., 2-class and 5-class classification, on the KDD Cup 99 dataset. Specifically, 2-class classification involves normal and attack data; 5-class classification involves normal data and four types of attack traffic (i.e., DOS, Probe, U2R, and R2L); and 13-class classification involves the classes that have more than 20 entries in the training data in Table 1.
In general, six metrics are used for evaluating the detection performance: accuracy rate (ACC), precision rate (P), recall rate (R), f-measure (F), confusion matrix (M), and receiver operating characteristic (ROC) [18]. They are defined as follows. The accuracy rate, ACC = (TP + TN) / (TP + TN + FP + FN), is the proportion of records that are correctly identified. The precision rate, P = TP / (TP + FP), is the proportion of correctly identified records among all identified attack records. The recall rate, R = TP / (TP + FN), represents the percentage of records of an attack type that are correctly identified. The f-measure, F = 2PR / (P + R), better evaluates the overall performance because it is the harmonic mean of the precision rate and the recall rate. These four metrics can be obtained from a confusion matrix M, in which the leading diagonal elements are the numbers of records correctly predicted, while an off-diagonal element M_{i,j} (i ≠ j) denotes the number of records that actually belong to class i but are incorrectly identified as class j. The ROC curve illustrates the classification performance through the true positive rate (TP) and false positive rate (FP): the larger the area under the ROC curve (AUC), the higher the TP rate and the lower the FP rate.

Table 2 lists the compared methods: STL [21], a sparse auto-encoder plus softmax; S-NDAE [25], stacked non-symmetric auto-encoders plus softmax; and SCAE+SVM, the proposed stacked contractive auto-encoder plus support vector machine.

Similar to most existing deep learning models, the proposed SCAE+SVM model was implemented using TensorFlow running on a 64-bit Ubuntu 16.04 machine with an Intel Core 2 Duo processor (2.7 GHz) and 8 GB RAM. The hyperparameters of the SCAE+SVM model are set as follows.
In the fine-tuning stage, we cascade all the CAE networks to form a 41-28-16-8-5 structure, and optimize all the parameters. The number of epochs is 50, and the learning rate is 0.001.
The SVM classifier uses the Gaussian radial basis function (RBF) kernel. The best values of the penalty coefficient c and the kernel parameter gamma, obtained through grid search over the closed interval [2^-4, 2^4], are 1 and 0.125, respectively.
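The grid search can be sketched as an exhaustive scan over candidate values of c and gamma (powers of two, as is conventional for RBF-SVM tuning), keeping the pair with the best score. The scoring function below is a hypothetical stand-in that peaks at the values reported above; a real run would use cross-validated accuracy of an RBF-kernel SVM, e.g. via scikit-learn's GridSearchCV:

```python
def grid_search(score, c_grid, gamma_grid):
    """Return the (c, gamma) pair maximizing score(c, gamma)."""
    best, best_score = None, float("-inf")
    for c in c_grid:
        for gamma in gamma_grid:
            s = score(c, gamma)
            if s > best_score:
                best, best_score = (c, gamma), s
    return best, best_score

# Candidate values: powers of two from 2^-4 to 2^4
grid = [2.0 ** k for k in range(-4, 5)]

# Hypothetical stand-in score peaking at c=1, gamma=0.125;
# a real score would be cross-validated classification accuracy.
def mock_score(c, gamma):
    return -((c - 1.0) ** 2 + (gamma - 0.125) ** 2)

(best_c, best_gamma), _ = grid_search(mock_score, grid, grid)
# best_c == 1.0, best_gamma == 0.125
```

The nested loop evaluates every combination once, so the cost is |c_grid| * |gamma_grid| model fits; this is affordable here because the SCAE has already reduced the input to 5 dimensions.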

Ability of SCAE to Extract Essential Features
In the previously published article [36], we demonstrated that the classification accuracy does not improve as the number of features increases. We can conclude that the accuracy of a classifier that employs a smaller number of feature sets is similar to or better than that of a classifier that employs all the features. Therefore, we need to extract essential or important features from the original features and build a more effective classifier. Feature learning can achieve this goal by extracting a smaller number of features that best represent the original data.
The exact structure of the SCAE model can be determined through experiments with numerous structural combinations (i.e., numbers of hidden layers and neurons). Next, we discuss how to select an appropriate SCAE model experimentally. The depth of the SCAE network has a significant influence on the effects of feature learning and classification. As the number of network layers increases, better feature representations can be extracted and the classification performance can be improved. However, some researchers [37] have shown that the training time increases considerably with the number of network layers, and additional layers can easily lead to over-fitting. In our experiment, five different SCAE+SVM models were set up on the NSL-KDD dataset; the comparative results in terms of the accuracy rate are shown in Table 3. SCAE1+SVM is a shallow network with a 41-5 structure. Similarly, SCAE2+SVM, SCAE3+SVM, SCAE4+SVM, and SCAE5+SVM have 41-20-5, 41-28-16-5, 41-28-16-8-5, and 41-28-20-12-6-5 structures, respectively. From Table 3, the SCAE4+SVM model shows the best classification performance. Thus, the SCAE can extract optimal features and achieve the objective of dimensionality reduction. The performance of deep structures is better than that of shallow structures because multi-layer mapping units can extract important structural information. From the experimental results, we can observe the following:

Classification Performance of the SCAE+SVM
1) The SVM classifiers combined with different deep learning methods are superior to the standalone shallow SVM. The SAE, SDAE, and SCAE deep learning algorithms all achieve the goal of dimensionality reduction; they not only capture the essential features but also improve the classification accuracy. This demonstrates that combining deep and shallow learning techniques plays to their respective strengths and achieves better detection performance.
2) In the 2-class classification task on the NSL-KDD dataset, the SCAE+SVM model has the highest accuracy rate and precision rate. However, its recall rate and f-measure are higher than those of the SAE+SVM and SDAE+SVM methods but lower than those of the STL method.
3) In the 2-class classification task on the KDD Cup 99 dataset, the accuracy rate, precision rate, recall rate, and f-measure of the SCAE+SVM model are all the highest, albeit only slightly better than those of the other three methods.
4) In the 5-class classification task on the NSL-KDD dataset, the SCAE+SVM model achieves the highest accuracy rate and recall rate among all the models. Its precision rate and f-measure are higher than those of all models except the S-NDAE; however, the AUC value of the S-NDAE method, illustrated later, is lower than that of our SCAE+SVM model.

5) In the 5-class classification task on the KDD Cup 99 dataset, the accuracy rate and recall rate of the SCAE+SVM model are higher than those of the SAE+SVM and SDAE+SVM methods and slightly higher than those of the S-NDAE model. However, its precision rate and f-measure are lower than those of the S-NDAE model.
6) In the 13-class classification task on the NSL-KDD dataset, the SCAE+SVM model achieves the highest accuracy rate and recall rate among all the models. However, its precision rate and f-measure are lower than those of the S-NDAE model.

In summary, the SCAE+SVM method achieves better detection performance than the three state-of-the-art approaches, and its results are better than, or at least match, those of the similar studies compared herein. Table 6 lists the precision rate, recall rate, and f-measure for every class of the 5-class classification task on the NSL-KDD dataset. Our model achieves superior performance in most of the measures, except for the U2R class. The results re-emphasize that our model does not handle the smaller classes, such as "R2L" and "U2R", well; they show lower performance because the data size affects the classification results to some degree.
The confusion matrix of the 5-class classification obtained using our SCAE+SVM model is presented in Table 7. The values on the leading diagonal denote the numbers of correctly classified records in the testing dataset; "R2L" and "U2R" have more samples that are identified incorrectly.

Next, we evaluate the detection performance of the proposed SCAE+SVM model on the 13-class classification task. These 13 classes are those with more than the minimum of 20 entries listed in Table 1. The purpose of this experiment is to determine whether the proposed deep learning model can identify each type of attack with fine granularity and maintain stable detection performance as the number of attack categories increases. The corresponding performance analysis is presented in Table 8. The experimental results show the following:
1) The total detection performance of the SCAE+SVM model is satisfactory and stable, and it achieves the highest level compared with the other methods.
2) All the classes of our model achieve better detection performance except three: buffer_overflow, nmap, and pod. Because the number of testing records of the warezclient class is 0, the detection performance for this attack is 0.

Figs. 5a, 5b, 5c, 5d and 6a, 6b, 6c, 6d show the ROC curves of the four methods in the 5-class classification tasks on the NSL-KDD and KDD Cup 99 datasets. The dotted lines represent the ROC curve of the total classification performance of each method; the other lines represent the five individual classes. A larger area under the ROC curve implies a high true positive rate and a low false positive rate. As can be seen, the area under the ROC curve of the SCAE+SVM model is the largest; thus, it can be considered to have the best performance.
The AUC values of the four methods are listed in Tables 9(a) and 9(b). In the 5-class classification tasks, the AUC values of the SCAE+SVM model on the NSL-KDD and KDD Cup 99 datasets are 0.92 and 0.98, respectively, which implies that the proposed SCAE+SVM method achieves higher detection performance than the other methods. In addition, in the 2-class classification task, the AUC value of the SCAE+SVM on the NSL-KDD dataset is 0.86, again higher than that of the other methods. On the KDD Cup 99 dataset, the AUC values of the SAE+SVM, SDAE+SVM, and SCAE+SVM methods are all 0.98; that is, all achieve high detection performance. Thus, our method performs well in the 2-class classification task. Table 10 lists the AUC values of the 5-class classification task for two methods, S-NDAE and SCAE+SVM, on the NSL-KDD dataset. These results indicate that our proposed method has four AUC values superior to those of the S-NDAE.
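An AUC value is the area under the ROC curve, which can be computed by integrating the (FPR, TPR) points with the trapezoidal rule. A minimal sketch with illustrative points (not the paper's measured curves):

```python
def auc(points):
    """Area under a ROC curve given (fpr, tpr) points,
    integrated with the trapezoidal rule."""
    points = sorted(points)  # order by increasing false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative ROC points from (0, 0) to (1, 1)
roc = [(0.0, 0.0), (0.1, 0.7), (0.3, 0.9), (1.0, 1.0)]
area = auc(roc)  # 0.1*0.35 + 0.2*0.8 + 0.7*0.95 = 0.86
```

A curve hugging the top-left corner (high TPR at low FPR) yields an area close to 1, while the diagonal of a random classifier yields 0.5, which is why AUC summarizes the TPR/FPR trade-off in a single number.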
In the 13-class classification task, the AUC values of the four methods are listed in Table 11. The SCAE+SVM achieves an AUC value of 0.95 on the NSL-KDD dataset; thus, it achieves higher detection performance than the other methods. The ROC curve of our method in the 13-class classification task is shown in Fig. 7.
From the above-mentioned results, it is clear that our proposed SCAE+SVM method has a high true positive rate and a low false positive rate. Thus, our method performs well in the 2-class, 5-class, and 13-class classification tasks.

CONCLUSION
Security concerns not only lead to severe losses in the cloud computing environment but also cause users to lose confidence in cloud computing itself, which inevitably has a serious impact on the healthy and sustainable development of cloud computing. Building a cloud intrusion detection system is one of the solutions for protecting cloud computing from malicious attacks. Recently, researchers have demonstrated that an efficient and effective CIDS can be built by combining a deep learning algorithm for feature extraction with a classifier. In this study, we designed a hybrid system that uses a stacked contractive auto-encoder (SCAE) for feature reduction and the SVM classification algorithm for the detection of malicious attacks. Using the NSL-KDD and KDD Cup 99 intrusion detection datasets, we experimentally demonstrated that the proposed SCAE+SVM-IDS model achieves promising classification performance in terms of six metrics compared with three state-of-the-art methods.

Although the proposed SCAE+SVM-IDS approach has shown encouraging performance, it can be improved by further optimizing the classifier. The SVM classifier cannot effectively recognize some new attacks present in the testing dataset; therefore, designing an optimal classifier requires careful consideration in future studies. As a long-term objective, on the one hand, we aim to reduce the controller's bottleneck and implement a CIDS that can detect different kinds of network attacks; on the other hand, we plan to implement our solution in a real cloud environment to evaluate its performance.
Wenjuan Wang received the master's degree from Zhengzhou University, China, in 2007. She is currently working toward the PhD degree at the PLA Strategic Support Force Information Engineering University. Since 2016, she has been working as an associate professor with the PLA Information Engineering University. Her primary research interests include intrusion detection and the security of cloud computing.
Xuehui Du received the PhD degree from the PLA Information Engineering University in 2011. Since 2010, she has been working as a professor with the PLA Strategic Support Force Information Engineering University. Her current research interests include network security and the security of big data. She has more than 23 years of research and teaching experience.
Dibin Shan received the master's degree from the PLA Information Engineering University, China, in 2008. Since 2011, he has been working as a lecturer with the PLA Strategic Support Force Information Engineering University. His research interests include network security and the security of big data.
Ruoxi Qin received the BS degree in information engineering from the Electronic Engineering Institute, Hefei, China, in 2017. He is currently working toward the master's degree at the PLA Strategic Support Force Information Engineering University. His research interests include image processing, deep learning, and artificial intelligence.
Na Wang received the PhD degree from the PLA Information Engineering University, China, in 2008. Since 2011, she has been working as an associate professor with the PLA Strategic Support Force Information Engineering University. Her research interests include network security.