Communication Protocol Classification Based on LSTM and DBN



I. INTRODUCTION
Nowadays, more and more countries attach importance to electronic countermeasure systems. An electronic countermeasure system is composed of an electronic reconnaissance system, an electronic jamming system, and an electronic defense system [1]. The electronic reconnaissance system obtains the parameters of enemy electronic equipment and analyzes the signal to obtain more information [2]. According to the results of electronic reconnaissance, the electronic jamming system adjusts its jamming parameters, interferes with the enemy communication system, and blocks normal communication [3]. Likewise, the electronic defense system improves its anti-jamming capability by adjusting system parameters [4]. The electronic reconnaissance system is therefore the basis for intelligence analysis, electronic jamming, and electronic defense: the stronger the electronic reconnaissance ability, the more information we can obtain about enemy equipment. However, existing electronic reconnaissance systems can only detect signal-layer parameters such as carrier frequency and bandwidth, and cannot identify the communication protocol [5]. Communication protocols are the ties that operate information networks [6], [7]. If we identify the communication protocol, we can get most of the communication information [8]. In order to improve the electronic reconnaissance ability, we need to study protocol classification algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Guan Gui.
Existing protocol classification methods mainly include methods based on port number, methods based on deep packet inspection, and methods based on deep flow inspection [9]. The method based on port number can classify any protocol that has registered its port number with the Internet Assigned Numbers Authority (IANA) [10]. It is simple, has low complexity, and does not require complicated processing to obtain classification results. In the early stage of Internet development, application layer protocols were few and simple, so classifying protocols by port number was feasible. However, more and more protocols do not register their port numbers with IANA, so fewer protocols can be classified this way [11], [12], and the method is no longer suitable [13].
The method based on deep packet inspection uses a pattern matching algorithm to check whether a data message contains a protocol feature, and classifies the protocol accordingly. Most data messages contain specific protocol features, which makes it convenient to distinguish different protocols, and in most cases these features are open and easy to obtain. Kazemi and Fanian [14] proposed a method for identifying network tunneling protocols that improves on deep packet inspection by overcoming its high processing cost, low efficiency, and high memory and CPU requirements. Their simulation results show that the method can identify network tunneling protocols with high accuracy and low processing cost. Chen and Liao [15] proposed an optimized deep packet inspection algorithm based on regular expressions. To reduce the memory requirement of the deep packet inspection engine, they reduced the number of states of a deterministic finite automaton, and they improved the matching performance of the engine by adopting a hybrid packet matching pattern. Their simulation results show that the memory cost is reduced and the matching speed is improved. Accurate protocol identification requires a good set of application protocol signatures, but building one is very time consuming and demands a high level of expertise. Feng et al. [16] proposed an automatic traffic signature generation method based on the Smith-Waterman algorithm to solve this problem. Their simulation results show that the method is accurate and fast compared with the approach that relies on prior application protocol analysis. Xia and Liu [17] proposed an optimized protocol identification method for high-speed network systems. They profiled a typical network system on an Intel x86-64 platform, including a complete TCP/IP stack and a deep packet inspection protocol identification engine, and analyzed its major hotspots. Their simulation results show that the proposed algorithm eliminates the memory bottleneck and the computing bottleneck. Protocol classification based on deep packet inspection has high accuracy, but its time complexity is also high, so it is not feasible for systems with strict real-time requirements.
The method based on deep flow inspection uses data flow features to classify different protocols, because the data flow features differ when different protocols are used to transmit data. Yang et al. [18] proposed a P2P network traffic classification method using a support vector machine (SVM). The method extracts the traffic features of BitTorrent, PPLive, Skype and MSN and introduces an SVM-based classification framework. The simulation results show that the classification accuracy can reach 91%. Este et al. [19] proposed an SVM-based network traffic classification method that classifies traffic according to packet size and the direction of the data flow, and classifies correctly with only a few hundred training samples. Their simulation results show that the classifiers can be very effective at discriminating traffic generated by different applications, and the classification accuracy for BitTorrent can reach 96.8%. Moore and Zuev [20] proposed an optimized network traffic classification method based on Bayesian analysis. The method extracts 249 features from 10 different kinds of network traffic, and the simulation results show that it has high classification accuracy, reaching 96.8% for Web traffic. Because a data flow has a large number of features, however, it is difficult to select the appropriate features for protocol classification.
The protocol classification methods above all target Internet protocols and are not suitable for wireless communication protocol classification. Existing communication protocol classification methods first identify the modulation scheme, frequency and other signal parameters, then vectorize the parameters according to the entropy, and finally either compare the parameter set with the protocol parameters or use a machine learning method to classify the protocol. However, this approach depends on the identification of the various signal parameters. When the signal parameters are identified accurately, the protocol classification is accurate as well. When the identification accuracy of one or more signal parameters decreases, the protocol classification accuracy decreases seriously because the errors accumulate. In order to improve the communication protocol classification accuracy, we propose a novel communication protocol classification algorithm based on long short-term memory (LSTM) and deep belief network (DBN).
The remaining parts of this paper are organized as follows. In Section II, we introduce DBN and LSTM. In Section III, we propose a protocol classification algorithm based on DBN and evaluate it through simulation experiments. In Section IV, we propose a protocol classification algorithm based on DBN and LSTM and evaluate it through simulation experiments. Conclusions are drawn in Section V.

II. BACKGROUND KNOWLEDGE
A. DBN
Because the network structure of the Boltzmann machine (BM) is very complex, such a network is difficult to realize in practice [21]. In 1986, Smolensky introduced the restricted Boltzmann machine (RBM) [22]. An RBM is a two-layer undirected graphical model composed of a visible layer and a hidden layer, which are connected by ''full connection'' [23]. The RBM models the probability distribution of the visible layer nodes through random features [24]. Unlike in a BM, the connections between nodes in an RBM must satisfy a restriction [25]: nodes in the same layer are not allowed to connect, and nodes in different layers must be fully connected [26], whereas a BM allows nodes in the same layer to connect. The input and output of a standard RBM have only two states, ''0'' and ''1'', where ''0'' represents that the node is inactive and ''1'' represents that the node is active [27]. Because of this restriction, the RBM has a key property: its random hidden nodes are conditionally independent given the observed data, and vice versa [28].
Let $n_v$ and $n_h$ denote the numbers of neurons in the visible layer and the hidden layer. $v = (v_1, v_2, \cdots, v_{n_v})^T$ represents the state vector of the visible layer and $v_i$ represents the state of the ith neuron in the visible layer. $h = (h_1, h_2, \cdots, h_{n_h})^T$ represents the state vector of the hidden layer and $h_j$ represents the state of the jth neuron in the hidden layer. $a = (a_1, a_2, \cdots, a_{n_v})^T \in \mathbb{R}^{n_v}$ represents the offset vector of the visible layer and $a_i$ represents the offset of the ith neuron in the visible layer. $b = (b_1, b_2, \cdots, b_{n_h})^T \in \mathbb{R}^{n_h}$ represents the offset vector of the hidden layer and $b_j$ represents the offset of the jth neuron in the hidden layer. $W = (w_{i,j}) \in \mathbb{R}^{n_h \times n_v}$ represents the weight matrix between the hidden and visible layers, and $w_{i,j}$ represents the connection weight between the ith neuron in the hidden layer and the jth neuron in the visible layer.
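The conditional independence property described above can be illustrated with a small sketch. It assumes the standard binary RBM conditionals, $P(h_i = 1 | v) = \sigma(b_i + \sum_j w_{i,j} v_j)$ and $P(v_j = 1 | h) = \sigma(a_j + \sum_i w_{i,j} h_i)$, which are not written out in the text; the layer sizes, the random weights and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """P(h_i = 1 | v) for every hidden neuron i; W[i][j] links hidden i and visible j."""
    return [sigmoid(b[i] + sum(W[i][j] * v[j] for j in range(len(v))))
            for i in range(len(b))]


def visible_probs(h, W, a):
    """P(v_j = 1 | h) for every visible neuron j."""
    return [sigmoid(a[j] + sum(W[i][j] * h[i] for i in range(len(h))))
            for j in range(len(a))]


def sample_binary(probs, rng):
    """Draw independent 0/1 states, one per neuron."""
    return [1 if rng.random() < p else 0 for p in probs]


rng = random.Random(0)
n_v, n_h = 6, 3  # toy sizes (assumption), far smaller than a real model
W = [[rng.gauss(0.0, 0.1) for _ in range(n_v)] for _ in range(n_h)]
a = [0.0] * n_v  # visible offsets
b = [0.0] * n_h  # hidden offsets

# One Gibbs step: visible -> hidden -> reconstructed visible.
v0 = [1, 0, 1, 1, 0, 0]
p_h = hidden_probs(v0, W, b)
h0 = sample_binary(p_h, rng)
p_v = visible_probs(h0, W, a)
v1 = sample_binary(p_v, rng)
```

Because no two neurons in the same layer are connected, each probability in `hidden_probs` depends only on the visible vector, so the whole hidden layer can be sampled in one pass.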
A DBN is formed by stacking multiple RBMs and is trained greedily, one layer at a time. A typical DBN structure is shown in Fig.2 [29]. Similar to the RBM structure, nodes within the same layer are not allowed to connect, and nodes in adjacent layers are connected by ''full connection''. A DBN can learn the structure of the input data, extract its features step by step through the multilayer RBMs, and use the features to classify signals, images and so on [30]. In other words, DBN is a model for learning to extract deep data features. A DBN can model the joint probability of the visible layer vector $x$ and the $l$ hidden layers. The joint probability is defined as follows:
$$P(x, h^1, \cdots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k | h^{k+1}) \right) P(h^{l-1}, h^l)$$
where $x = h^0$, $P(h^k | h^{k+1})$ is the conditional probability of the visible layer given the hidden layer of the k-layer RBM, and $P(h^{l-1}, h^l)$ is the joint probability of the visible layer and hidden layer of the top RBM.
In order to classify communication protocols, a classifier is usually added after the last layer. Common classifiers include SVM, the logistic regression model and the softmax regression model. However, SVM and logistic regression are, in their basic form, only suitable for binary classification, so we use the softmax regression model, which is an extension of the logistic regression model to multiple classes.
For a given structure, the training of a DBN is completed by training the RBMs layer by layer [31]. The layer-by-layer training process is shown in Fig.3. First, the input data is received and the first layer RBM is trained. After training, the hidden layer responses for all training data are calculated. Then the weights of the first layer are frozen and the responses are used as input to train the next layer; these steps are repeated until all layers are trained. All of this training is done in an unsupervised way. The last step is to treat the DBN as a neural network and use labeled data to fine-tune all the weights acquired during unsupervised training.
First, we use the data $X$ as input to the first layer RBM. The state of the hidden layer is defined as follows:
$$h^1 = \sigma(W^1 X + b^1)$$
where $\sigma$ is the sigmoid function, and $W^1$ and $b^1$ are the weight matrix and hidden offset vector of the first layer RBM. Then, we suppose the DBN consists of $l$ hidden layers and is initialized layer by layer with the greedy algorithm. The state of the ith hidden layer is defined as follows:
$$h^i = \sigma(W^i h^{i-1} + b^i), \quad i = 2, \cdots, l$$
Finally, we use the back propagation algorithm to optimize the weights. In this way, we obtain the optimized weight vector, which is updated as follows:
$$(W, b) \leftarrow (W, b) - \alpha \frac{\partial J(W, b)}{\partial (W, b)}$$
where $J(W, b)$ is the cost function and $\alpha$ is the learning rate.
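The layer-by-layer propagation and the gradient update described above can be sketched as follows. This is a minimal illustration, assuming that each hidden state is the sigmoid of an affine map of the previous layer and that the weights are updated by plain gradient descent with learning rate α. The layer sizes are a scaled-down stand-in for the paper's 512-512-256-128 structure, and the helper names, random initialization and toy gradient are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(x, W, b):
    """h = sigmoid(W x + b) for one layer; W has one row per output neuron."""
    return [sigmoid(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def dbn_forward(x, layers):
    """Propagate x through the stacked layers and return every hidden state."""
    states = []
    h = x
    for W, b in layers:
        h = layer_forward(h, W, b)
        states.append(h)
    return states


def gradient_step(W, grad_W, alpha):
    """One gradient descent update: W <- W - alpha * dJ/dW."""
    return [[w - alpha * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(W, grad_W)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero offsets (an assumed initialization)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
sizes = [8, 8, 4, 2]  # scaled-down stand-in for 512 -> 512 -> 256 -> 128
layers = [init_layer(sizes[k], sizes[k + 1], rng) for k in range(len(sizes) - 1)]
x = [rng.random() for _ in range(sizes[0])]
states = dbn_forward(x, layers)
print([len(h) for h in states])  # one hidden state per layer
```

In greedy pre-training, each entry of `states` would serve as the training input of the next RBM while the earlier layers stay frozen; `gradient_step` corresponds to the supervised fine-tuning stage.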
After feature extraction, the extracted features are used as input to the softmax classifier for classification training. For the training data set $\{(x_1, y_1), \cdots, (x_i, y_i), \cdots, (x_m, y_m)\}$, the probability that data $x_i$ belongs to category $j$ in the softmax regression model is defined as follows:
$$P(y_i = j | x_i; \theta) = \frac{e^{\theta_j^T x_i}}{\sum_{c=1}^{k} e^{\theta_c^T x_i}}$$
where $k$ is the number of categories and the denominator $\sum_{c=1}^{k} e^{\theta_c^T x_i}$ normalizes the probability. It ensures that the probabilities of a sample belonging to each of the categories sum to 1.
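We train the model parameters $\theta$ by minimizing the cost function. The cost function is defined as follows:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log P(y_i = j | x_i; \theta) + \frac{\lambda}{2} \sum_{j=1}^{k} \|\theta_j\|^2$$
where $m$ is the number of training samples, $k$ is the number of categories, $P(y_i = j | x_i; \theta)$ is the softmax probability that $x_i$ belongs to category $j$, and $1\{\cdot\}$ is the indicator function: if the output $j$ is the label $y_i$, then $1\{y_i = j\} = 1$, and otherwise $1\{y_i = j\} = 0$. The second term is the weight attenuation term with coefficient $\lambda$, which is used to prevent the network from over-fitting.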
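As a rough illustration of the softmax classifier and its cost, the sketch below assumes the standard softmax regression form with a cross-entropy loss and an L2 weight attenuation term. The coefficient name `lam`, the toy data and the function names are assumptions made for this example, not values from the paper.

```python
import math


def softmax_probs(theta, x):
    """Probability of each category for sample x; theta has one row per category."""
    scores = [sum(t * x_j for t, x_j in zip(row, x)) for row in theta]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cost(theta, X, y, lam):
    """Average cross-entropy over the samples plus an L2 weight attenuation term."""
    loss = 0.0
    for x_i, y_i in zip(X, y):
        loss -= math.log(softmax_probs(theta, x_i)[y_i])
    decay = 0.5 * lam * sum(t * t for row in theta for t in row)
    return loss / len(X) + decay


# Toy example: 3 categories, 2 features, all-zero parameters.
theta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
X = [[1.0, 2.0], [0.5, -1.0]]
y = [0, 2]
probs = softmax_probs(theta, X[0])
print(probs, cost(theta, X, y, lam=0.01))
```

With all-zero parameters every category is equally likely, so the cost equals log 3, a convenient sanity check before training.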

B. LSTM
A recurrent neural network (RNN) is a network for processing sequence data, and it differs clearly from other networks [32]. In an RNN, the output of the hidden layer not only enters the output layer, but also enters the hidden layer at the next time step, thus affecting the next time step. The structure of RNN is shown in Fig.4 [33]. For convenience of derivation and description, the structure on the left is simplified into the structure on the right.
RNN differs from other multi-layer neural networks in that it has the concept of time sequence: the next time step is affected by the current time step. RNN is easier to understand when the network is expanded along the time sequence [34]. We expand the network over multiple time steps and visualize the connections in acyclic form. The expanded structure of RNN is shown in Fig.5 [35].
The biggest advantage of RNN is that it introduces the time sequence into the neural network, so that the data at the current time step has a direct impact on the data at the next time step. We can set the numbers of input and output layers according to the number of time steps. The number of hidden layers equals the number of time steps, and the hidden layers are connected by recursive feedback. The first data item is the input to the first layer and then affects the second layer; the second data item is the input to the second layer and then affects the third layer. The influence of the input propagates from left to right, and the final output error is fed back from right to left to adjust the weights [36]. The weight adjustment is based on the loss function between the output data and the original data. In order to minimize the loss function, we calculate its partial derivatives with respect to the weights. The calculation starts from the last layer; the weight adjustment of the penultimate layer depends on the last layer, and the same holds for the remaining layers. After passing backward through many layers, the derivative may approach zero, which is called gradient disappearance. As a result, the weights of the earlier layers are hardly adjusted [37]: each training pass adjusts only the weights of the last several layers and cannot achieve complete convergence. Because the front layers are not adjusted, the loss is not completely eliminated [38]. We can set a learning rate α that changes with the loss function to prevent the descent from moving too fast along the gradient and missing the minimum, and we can also set a momentum term that changes inversely with α to avoid falling into a local optimal solution. However, these methods cannot solve the above problem effectively, which motivates the LSTM neural network [39].
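The shrinking of the backpropagated derivative can be made concrete with a small numerical sketch. It assumes a scalar recurrent unit with a sigmoid activation, so that each time step multiplies the gradient by the recurrent weight times the sigmoid derivative, which is at most 0.25. The function names and the chosen weight are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_factor(w, pre_activations):
    """Product of w * sigmoid'(z) over the time steps the gradient passes through."""
    factor = 1.0
    for z in pre_activations:
        factor *= w * sigmoid_grad(z)
    return factor


# With w = 1 and z = 0 every step scales the gradient by exactly 0.25.
for steps in (1, 5, 20, 50):
    print(steps, backprop_factor(1.0, [0.0] * steps))
```

Even in this best case for the sigmoid, the factor after 20 steps is 0.25 to the power 20, which is why the earliest time steps receive almost no weight adjustment.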
LSTM was proposed by Hochreiter and Schmidhuber in 1997 [40] and can overcome the shortcomings of RNN. LSTM consists of recursively connected subnets, which we call memory cells. A memory cell can be thought of as the memory in a computer. Each memory cell contains one or more self-connected memories and three multipliers. These three multipliers are the input gate, output gate and forget gate, which provide write, read and reset operations, respectively [41]. As shown in Fig.6, LSTM is not a single layer neural network, but a multilayer structure whose layers interact in a very special way. In Fig.6, the yellow modules represent learned neural network layers, and the red modules represent pointwise operations, such as vector addition [42].
The key of LSTM is the cell state, shown in red in Fig.7. It conveys information like a conveyor belt, which allows information to pass along unchanged. The most important component of LSTM is the gate, which can add or delete information [43]; a gate lets information pass selectively. As shown in Fig.7, LSTM has three gates that work together. The activation function takes the memory state of the network as input, and we set a specific threshold. If the output is greater than the threshold, we multiply the output of the gate by the result of the current layer and use the product as the input to the next layer. If the output is less than the threshold, the output is forgotten [44].
Fig.8 shows the structure of the LSTM memory cell. The structure of LSTM is almost the same as that of RNN, but the summation unit of the RNN hidden layer is replaced by a memory cell. The memory cell of LSTM can store and access information for a long time, so LSTM can solve the problem of gradient disappearance. A memory cell consists of one unit and three gates, namely the input gate, output gate and forget gate [45]. The three gates control the information transmission of the neuron and how information is distributed to the current neuron and the next neuron. Each gate has an activation function and exerts its control by multiplication [46]. The black dots in Fig.8 represent multiplication. g is the input function, h is the output function, and f is the activation function. We usually use the sigmoid function or the tanh function as the activation function. The equations are as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
We assume that the input of LSTM at time t includes the input layer $x_t$, the hidden layer $h_{t-1}$ calculated by the previous unit, and the memory cell $c_{t-1}$. The output of LSTM at time t includes the hidden layer $h_t$ calculated by the current unit and the memory cell $c_t$. The output of LSTM is calculated as follows:
A. First, we calculate the three gates at time t. Here we do not consider the memory cell $c_{t-1}$. The detailed equations are as follows:
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
where $i_t$ is the input gate at time t, $o_t$ is the output gate at time t, $f_t$ is the forget gate at time t, and $W_{x\cdot}$, $W_{h\cdot}$ and $b_{\cdot}$ denote the input weights, recurrent weights and offsets of the corresponding gate.
B. Then, we calculate the memory cell $c_t$. The detailed equations are as follows:
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where $\tilde{c}_t$ is the memory cell without the forget gate $f_t$ and $\odot$ denotes element-wise multiplication.
C. Finally, we calculate the hidden layer $h_t$ at time t. The detailed equation is as follows:
$$h_t = o_t \odot \tanh(c_t)$$
It can be seen from the above equations that the function of the input gate is to multiply the memory cell without the forget gate and thereby control the input entering the memory cell. The function of the forget gate is to multiply the memory cell of the previous moment and control the attenuation in the memory cell. The function of the output gate is to multiply the memory cell at the current moment to get the output of the hidden layer; it controls the output of the memory cell to the hidden layer and further affects the results of each gate at the next moment. LSTM solves the problem of gradient disappearance: it can save the most meaningful information in the memory cell and overcome the difficulty RNN has in saving long-distance information.
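The three calculation steps (gates, memory cell, hidden layer) can be sketched as a single LSTM time step. This is a minimal sketch assuming the common formulation without peephole connections, in line with the remark that the gates do not consider the previous memory cell. The parameter layout, the dictionary keys and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_x, W_h, b, x_t, h_prev):
    """Compute W_x x_t + W_h h_prev + b, one output per row."""
    return [sum(w * v for w, v in zip(W_x[k], x_t))
            + sum(u * v for u, v in zip(W_h[k], h_prev))
            + b[k]
            for k in range(len(b))]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    # Step A: the three gates.
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]
    # Step B: candidate memory, then the new memory cell.
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Step C: the hidden layer output.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """All-zero parameters, used only to make the example easy to check by hand."""
    def block():
        return ([[0.0] * n_in for _ in range(n_hidden)],
                [[0.0] * n_hidden for _ in range(n_hidden)],
                [0.0] * n_hidden)
    return {name: block() for name in ("i", "f", "o", "c")}


params = zero_params(n_in=2, n_hidden=1)
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], params)
print(h_t, c_t)
```

With all-zero parameters every gate equals 0.5 and the candidate memory is 0, so the forget gate simply halves the previous memory cell, which shows the attenuation role of the forget gate in isolation.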

III. COMMUNICATION PROTOCOL CLASSIFICATION BASED ON DBN
A. DATA SET
With the development of wireless communication, wireless communication protocols have also made great progress. According to the type of communication, wireless communication networks can be divided into wireless wide area network (WWAN), wireless metropolitan area network (WMAN), wireless local area network (WLAN), wireless personal area network (WPAN), low rate wireless personal area network (LR-WPAN) and so on. The wireless communication protocol is the last link of wireless communication.
We need to package the information according to the protocol and transmit the packaged information according to the protocol transmission parameters, so that the receiver can correctly interpret the transmitted data and recover the information. ZigBee, Bluetooth and WiFi are the three most common wireless communication protocols, and we take these three protocols as examples to verify the effectiveness of the algorithm. The I/Q signal data used in this paper is collected at SNRs from −5dB to 10dB. We remove the gaps between signals and only keep the data that contains signal. Table 1 is an example of the collected data.
Because of the uncertainty of the wireless channel, the signal data is greatly affected by SNR. In addition, the collected data is complex-valued, so DBN cannot process it directly. We therefore preprocess the collected data as follows.
(1) Cut the data into segments of 256 points and keep 4 decimal places.
(2) Split each segment from step 1 into its real part and imaginary part to form a 256 × 2 array.
(3) Expand the data processed in step 2 into a 512 × 1 array.
Table 2 shows an example of the data after preprocessing. The numbers of training samples and testing samples are both 48000. Each sample contains 512 data points. Fig.9 shows the training flow chart of the communication protocol classification algorithm based on DBN. The details of the training process are as follows:
(1) After data preprocessing, the data becomes a 512 × 1 array.
(2) The preprocessed data is used as input to train the first layer RBM. After training, we obtain a 512 × 1 feature vector.
(3) Keep the trained parameters of the first layer RBM. The 512 × 1 feature vector is used as input to train the second layer RBM. After training, we obtain a 256 × 1 feature vector.
(4) Keep the trained parameters of the first two RBM layers. The 256 × 1 feature vector is used as input to train the third layer RBM. After training, we obtain a 128 × 1 feature vector.
(5) After training the three RBM layers, we use the BP algorithm to adjust the parameters of the whole network. Finally, we obtain the classification results from the output of the softmax classifier.
The detailed network parameters are shown in Table 3.
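The three preprocessing steps of the data set description (segmenting, splitting into real and imaginary parts, flattening) can be sketched as follows. Two details are assumptions made for this example rather than statements from the paper: an incomplete trailing segment is dropped, and the 256 × 2 array is flattened row by row so that real and imaginary parts alternate. The function name is hypothetical.

```python
def preprocess(iq_samples, segment_len=256, decimals=4):
    """Turn complex I/Q samples into real-valued vectors of length 2 * segment_len."""
    vectors = []
    n_segments = len(iq_samples) // segment_len  # assumption: drop the incomplete tail
    for s in range(n_segments):
        # Step 1: cut a 256-point segment and keep 4 decimal places.
        segment = iq_samples[s * segment_len:(s + 1) * segment_len]
        # Step 2: split into (real, imaginary) pairs, i.e. a 256 x 2 array.
        pairs = [(round(z.real, decimals), round(z.imag, decimals)) for z in segment]
        # Step 3: flatten to 512 values (assumption: row-major, I and Q interleaved).
        vectors.append([value for pair in pairs for value in pair])
    return vectors


# Toy input: 300 identical complex samples give one full segment.
samples = [complex(0.123456, -0.5)] * 300
vectors = preprocess(samples)
print(len(vectors), len(vectors[0]), vectors[0][:2])
```

Each resulting vector has 512 real entries, which matches the input size of the first RBM layer in the training steps above.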

C. EVALUATION INDEX
The classification performance of the algorithm can be quantified by the prediction accuracy on the test samples. If the true label and the label predicted by the classifier are given by $y_i$ and $\hat{y}_i$ respectively, then the test error of $m_{test}$ test samples can be defined as follows:
$$E_{test} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} 1\{\hat{y}_i \neq y_i\}$$
The classification accuracy can be obtained by $1 - E_{test}$. The classification performance can also be measured by other statistics, namely true positive (TP), false positive (FP), false negative (FN) and true negative (TN). For a category A, they are defined as follows:
• If the true label is A and the predicted label is A, then the sample is considered as TP.
• If the true label is not A and the predicted label is A, then the sample is considered as FP.
• If the true label is A and the predicted label is not A, then the sample is considered as FN.
• If the true label is not A and the predicted label is not A, then the sample is considered as TN.
Through the above statistics, we can get the performance evaluation indexes precision (P), recall (R) and $F_1$ score. They are defined as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F_1 = \frac{2PR}{P + R}$$
P, R and $F_1$ score are performance evaluation indicators for each category. We use their equally weighted averages over the categories ($P_{avg}$, $R_{avg}$ and $F_{1avg}$) to quantify the overall performance. The closer $P_{avg}$, $R_{avg}$ and $F_{1avg}$ are to 1, the better the algorithm performs.
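The evaluation indexes can be computed from the true and predicted labels as in the sketch below. It assumes that $F_{1avg}$ is the equally weighted mean of the per-category $F_1$ scores and that a ratio with a zero denominator is reported as 0; both choices, the toy labels and the function names are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, label):
    """Return (TP, FP, FN, TN) for one category."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == label and p == label:
            tp += 1
        elif t != label and p == label:
            fp += 1
        elif t == label and p != label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num, den):
    """Assumption: a ratio with a zero denominator is reported as 0."""
    return num / den if den else 0.0


def class_scores(y_true, y_pred, label):
    """Precision, recall and F1 score for one category."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, label)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


def macro_average(y_true, y_pred, labels):
    """Equally weighted averages (P_avg, R_avg, F1_avg) over the categories."""
    scores = [class_scores(y_true, y_pred, c) for c in labels]
    return tuple(sum(s[k] for s in scores) / len(labels) for k in range(3))


def error_rate(y_true, y_pred):
    """Fraction of test samples whose predicted label differs from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


# Toy labels: 0 = ZigBee, 1 = Bluetooth, 2 = WiFi.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
p_avg, r_avg, f1_avg = macro_average(y_true, y_pred, labels=[0, 1, 2])
accuracy = 1.0 - error_rate(y_true, y_pred)
print(p_avg, r_avg, f1_avg, accuracy)
```

Averaging per category rather than per sample gives each protocol the same weight, regardless of how many test samples it contributes.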

D. SIMULATION RESULT AND ANALYSIS
In this section, we carry out extensive experiments to verify the effectiveness and performance of the proposed algorithm. We use the t-SNE algorithm to visualize the features. Fig.10 shows the visualized features under different SNR. The red number ''0'' represents ZigBee, the blue number ''1'' represents Bluetooth, and the green number ''2'' represents WiFi.
The x axis and y axis have no physical meaning; we focus on how well the extracted features are separated. From Fig.10, we can see that the features extracted by DBN have different discrimination under different SNR. As SNR decreases, the discrimination of the features gradually decreases. When SNR = 10dB, the feature discrimination is good, but a small number of features still overlap with each other. When SNR = −5dB, most of the features overlap with each other and the feature discrimination is poor. Table 4 shows the evaluation indicators under different SNR.
From Table 4, we can see that as SNR increases, $P_{avg}$, $R_{avg}$ and $F_{1avg}$ also increase, and the classification performance of the algorithm gets better and better. When SNR = 10dB, $P_{avg}$ = 0.787, $R_{avg}$ = 0.799 and $F_{1avg}$ = 0.793. At this point, the classification performance is acceptable. When SNR = −5dB, $P_{avg}$ = 0.525, $R_{avg}$ = 0.618 and $F_{1avg}$ = 0.567. At this point, the classification performance is poor. Fig.11 shows the classification accuracy of each protocol and the average classification accuracy.
From Fig.11, we can see that as SNR increases, the classification accuracy also increases. When SNR = 10dB, the classification accuracy of ZigBee is 85%, the classification accuracy of Bluetooth is 83% and the classification accuracy of WiFi is 81%. When SNR = 6dB, the average classification accuracy reaches 80%. However, as SNR increases further, the average classification accuracy does not increase much, and even when SNR = 10dB it does not reach 100%. This is because the input of DBN is the original I/Q signal data, and the features extracted by DBN are not sufficiently discriminative. Fig.12 shows the confusion matrix under different SNR.
From Fig.12, we can see that as SNR increases, the data outside the diagonal becomes less and less. When SNR = 10dB, most of the data is on the diagonal and the classification performance is acceptable. When SNR = −5dB, little data is on the diagonal. But even when SNR = 10dB, there is still some data outside the diagonal. This shows that even when the SNR is high, the classification performance of the algorithm is still not optimal.

IV. COMMUNICATION PROTOCOL CLASSIFICATION BASED ON DBN AND LSTM
A. MODEL FRAME
From the analysis in the previous section, we find that the performance of the protocol classification algorithm based on DBN is still not optimal even when SNR is high. In order to improve the classification performance, we add an LSTM after data preprocessing to further process the data. The advantage of doing so is that the input of DBN then represents the protocol better. In addition, the proposed algorithm does not require complex data processing, and the sample data does not need to obey a particular distribution, which greatly reduces the requirements on the sample data. The structure of the protocol classification algorithm based on LSTM and DBN is shown in Fig.13.
The details of the algorithm process are as follows:
(1) After data preprocessing, the data becomes a 512 × 1 array.
(2) The preprocessed data is used as input to the LSTM. After processing, we obtain a 512 × 1 feature vector.
(3) After training the LSTM, the extracted 512 × 1 feature vector is used as input to train the first layer RBM. After training, we obtain a 512 × 1 feature vector.
(4) Keep the trained parameters of the first layer RBM. The 512 × 1 feature vector is used as input to train the second layer RBM. After training, we obtain a 256 × 1 feature vector.
(5) Keep the trained parameters of the first two RBM layers. The 256 × 1 feature vector is used as input to train the third layer RBM. After training, we obtain a 128 × 1 feature vector.
(6) After training the three RBM layers, we use the BP algorithm to adjust the parameters of the whole network. Finally, we obtain the classification results from the output of the softmax classifier.
The detailed network parameters are shown in Table 5.

B. SIMULATION RESULT AND ANALYSIS
In this section, we carry out extensive experiments to verify the effectiveness and performance of the proposed algorithm. We use the t-SNE algorithm to visualize the features. Fig.14 shows the visualized features under different SNR. When SNR = 10dB, the features do not overlap with each other, and the feature discrimination is very good. When SNR = −5dB, some features overlap with each other. However, compared with the protocol classification algorithm based on DBN, the feature discrimination has been significantly improved. This shows that adding LSTM makes the features more discriminative, which is more conducive to protocol classification. Table 6 shows the evaluation indicators under different SNR.
From Table 6, we can see that as SNR increases, $P_{avg}$, $R_{avg}$ and $F_{1avg}$ also increase, and the classification performance of the algorithm gets better and better. When SNR = 10dB, $P_{avg}$ = 1, $R_{avg}$ = 1 and $F_{1avg}$ = 1. At this point, the classification performance is the best. When SNR = −5dB, $P_{avg}$ = 0.670, $R_{avg}$ = 0.716 and $F_{1avg}$ = 0.693. At this point, the classification performance is poor. However, compared with the protocol classification algorithm based on DBN, the evaluation indicators have increased significantly. Fig.15 shows the classification accuracy of each protocol and the average classification accuracy.
From Fig.15, we can see that as SNR increases, the classification accuracy also increases. When SNR = 10dB, the classification accuracy of ZigBee, Bluetooth and WiFi all reach 100%. When SNR = −5dB, the average classification accuracy is only 47%. However, compared with the protocol classification algorithm based on DBN, the average classification accuracy has increased significantly. Fig.16 shows the confusion matrix under different SNR.
From Fig.16, we can see that as SNR increases, the data outside the diagonal becomes less and less. When SNR = 10dB, all of the data is on the diagonal and the classification performance is the best. When SNR = −5dB, little data is on the diagonal. However, compared with the protocol classification algorithm based on DBN, the amount of data on the diagonal has increased significantly. Based on the above results, we can see that the protocol classification algorithm based on LSTM and DBN performs better than the protocol classification algorithm based on DBN.
We compare the performance of the algorithm proposed in this paper with that of the algorithm proposed in [47]. Fig.17 shows the comparison of the average classification accuracy. LD represents the algorithm proposed in this paper and EBD represents the algorithm proposed in [47].
From Fig.17, we can see that under the same conditions, the average classification accuracy of LD is higher than that of EBD. When SNR = −5dB, the average classification accuracy of LD is 87%, while the average classification accuracy of EBD is only 67%. Table 7 shows the comparison of the evaluation indicators. From Table 7, we can see that the evaluation indicators of LD are higher than those of EBD. This shows that the algorithm proposed in this paper performs better than the algorithm in [47].

V. CONCLUSION
In this paper, we study communication protocol classification and propose a novel protocol classification algorithm based on LSTM and DBN. The proposed algorithm does not require complex data processing, and the sample data does not need to obey a particular distribution, which greatly reduces the requirements on the sample data. We take the three most common communication protocols (ZigBee, Bluetooth and WiFi) as examples and carry out extensive simulation experiments to verify the effectiveness of the proposed algorithm. The simulation results show that when SNR > 6dB, the classification accuracy can reach 90%.