Association analysis and identification of unknown bitstream protocols based on composite feature sets

Concomitant with the rapid development of network communications technology, the analysis of communication protocols has become indispensable in the maintenance of daily network security. Common protocol analysis methods predominantly analyze protocols using known information, such as fixed port numbers; however, these methods have significant limitations. In the current network environment, the proportion of undisclosed protocols is increasing daily. Information about such protocols is difficult to obtain, and conventional analysis sometimes fails because of the particular format of the unknown protocol. Therefore, it is crucial to analyze unknown protocols with little prior knowledge. To solve this problem, this paper proposes a novel protocol identification method in which association analysis and identification of unknown bitstream protocols are first carried out based on composite feature sets. Furthermore, data mining and statistics-related knowledge are applied to realize protocol message-type identification and protocol message-format analysis. The results of experiments conducted on the bitstream protocol dataset verify that the proposed method can accurately identify different message types. Specifically, taking the ICMP and ARP protocols as examples, the proposed method could effectively infer the main features, which is helpful for further protocol information extraction and analysis.


A. MOTIVATION AND BACKGROUND
With the rapid development of communication and network technology, network security protection and maintenance have correspondingly become increasingly crucial. Analysis of network protocols is the basis and premise of information security [1]. In the electronic information warfare environment, the common method to acquire information from a target network is to capture its communications signal and then analyze the acquired bitstream protocol data to obtain any desired intelligence. Therefore, it is important to analyze and recognize unknown bitstream protocols from captured communication data; however, there is a dearth of efficacious research in this area.
Early methods were often realized by identifying fixed features of the protocol, with the core idea of such methods being to use static features for matching [2]. Protocol identification technology based on port number is the earliest studied protocol identification method. For traditional Internet protocols, the use of port numbers for protocol identification is characterized by high accuracy and efficiency [3][4]. However, with the continuous development of the Internet, an increasing number of new protocols have emerged, some of which use registered ports and dynamic ports, so a port number no longer reliably identifies a protocol. To overcome this limitation, protocol data are analyzed and identified at the application layer according to the similarity between protocol data [5]; such analysis generally requires certain prior knowledge [6].
The identification of protocols at the application layer depends on their unique features, which are compared with prior knowledge; however, this technique has significant limitations [7]. In an electronic information warfare environment, the target network typically uses wireless communication for data transmission, and most of the communication protocols are customized [8]. It is difficult to obtain relevant information on unknown protocols to analyze them. Therefore, protocol identification and analysis against a background of zero knowledge is an important research topic in the field of network security and electronic information countermeasures.

B. PROBLEM STATEMENT
Protocol reverse engineering technology in the field of unknown protocol identification primarily includes application-based, execution-trace-based, and network-trace-based methods [9][10][11][12]. The shortcomings of existing protocol reverse engineering methods are summarized in Section II of this paper.
There is currently a distinct lack of knowledge in this area. First, existing algorithms generally require prior knowledge of the protocol data as data input. Second, the data size of the protocol analysis is large and complicated. The analysis of unknown protocols depends on relatively complete datasets, and also introduces the problem of excessive computation. Third, the proportion of manual analyses is too large. Fourth, most of the existing research on protocol reverse engineering is focused on the unknown protocol of the application layer, with less focus being placed on the unknown bitstream protocol in the data link layer.

C. CONTRIBUTION
To solve the existing problems, this paper proposes a method for the association analysis and identification of unknown bitstream protocols based on composite feature sets, introduces data mining and statistics-related knowledge, and realizes protocol message-type identification and protocol message-format analysis. This study focuses on the following aspects:
1) Composite feature sets of unknown protocols are obtained and a feature library is constructed. Some unknown protocols located in the link layer have a feature library and, in their recognition, the fields in the feature library are searched and matched using a pattern-matching algorithm. However, most unknown protocols have no public feature library, and protocol information cannot be obtained. Therefore, this study follows the example of the feature libraries of those link-layer protocols, establishes a feature library for an unknown protocol, and completes protocol recognition based on that library.
2) For feature extraction, an improved version of the FP-growth algorithm, which is commonly used in association rule mining, is used. This study improves the frequent pattern tree to effectively reduce the size of the tree, and further filters the frequent items to reduce the many redundant false rules produced when the system adopts the FP-growth algorithm.
3) Compared with previous research, this research uses a sequence alignment algorithm from bioinformatics to analyze the protocol message format, which can identify not only the address field but also other fields in the message format of the protocol, including the length, check, and sequence number fields.
The remainder of this paper is organized as follows. In Section II, related work is introduced. In Section III, the entire method is introduced, including the overall framework, followed by the composite feature extraction, message-type recognition, and protocol format analysis. Finally, Section IV discusses the experimental analysis, and Section V presents the conclusions of this study.

II. RELATED WORK
Early manual protocol reverse engineering methods can extract all elements of the protocol structure clearly; however, these methods are time-consuming and error-prone [13]. Furthermore, they are not compatible with the rapid increase in the number of new applications and the vast amount of traffic in today's network environment [14][15]. Therefore, automatic protocol reverse engineering methods have been proposed to address these problems. Automatic protocol reverse engineering approaches can be divided into three categories: application-based, execution-trace-based, and network-trace-based methods.
The application-based approach uses program binaries or source code. Caballero et al. [16] proposed a novel approach for automatic protocol reverse engineering based on a dynamic program binary analysis. In practice, however, it is difficult to obtain program binaries or their source code. The execution-trace-based approach must set up an execution monitoring system to keep track of how the program handles messages of unknown protocols. This approach is only made possible by acquiring a program that uses an unknown protocol [17]. However, access to the program of an unknown protocol is rarely possible because of concealment and obfuscation.
Compared to the above two approaches, the network-trace-based approach is more realistic because it only analyzes network traces captured by monitoring the packets of the target protocol, without accessing the program binaries. The primary research in this paper is based on network-trace-based methods, which include natural language processing, bioinformatics, and data mining. The natural language processing method identifies protocol keywords by looking for tags that frequently appear together in messages [18][19]. However, because binary protocols typically pack data more densely, this method is not suitable for inferring binary protocol information. Based on bioinformatics, Netzob [20] has been used for sequence alignment to determine the similarity of messages and to cluster them. The messages were divided into fields; however, multiple sequence alignment is exponentially complex because sequence alignment algorithms take only two messages as input at a time [21].
In contrast to sequence alignment in bioinformatics, data mining techniques may use all messages as inputs simultaneously. In addition, it is vital to know how to optimize the results, so that the results are intuitive and clear. Common data mining algorithms include classification, clustering, and association rule algorithms. For supervised learning, the classification algorithm needs to know some prior knowledge of the classification model. Common classification models include the Bayesian method [22], genetic algorithm, decision tree algorithm [23][24], and neural networks [25].
Protocol recognition based on classification algorithms requires a known protocol class; therefore, it has difficulty identifying unknown protocols. The clustering algorithm requires no prior knowledge and can operate directly on the original sample data [26]. Association rule mining is one of the most mature, important, and active research topics in data mining. Jian et al. [27] used the Apriori algorithm to extract protocol keywords from network traces based on their support rates and variances of positions, reconstructed message formats, and inferred protocol state machines. Ji et al. [28] used a multipattern-matching algorithm to find frequent sequences, extracted keywords based on frequency, and extracted message formats using FP-growth.
The network trace-based approach mentioned above can only analyze the application layer protocol, whereas the unknown bitstream protocol is located in the data link layer. There is little prior knowledge of unknown bitstream protocol data, and several link layer protocol identification problems need to be solved in commercial applications or electronic information warfare. There are no significant achievements in the analysis of such protocols in current studies. Zhang et al. [29] only recognized the address field in the message format, and Zheng et al. [30] could not identify the message types of the protocol.
In contrast to other studies, this study proposes association analysis and identification of unknown bitstream protocols based on composite feature sets, which uses an association rule algorithm to identify and analyze protocols based on composite feature sets. It can identify the message types of the protocol and analyze the message format, including the address field, length field, and verification field.

III. METHOD
This section discusses the traits of the unknown bitstream protocol, proposes an identification and analysis method for the unknown bitstream protocol, and describes the process of composite feature extraction, message-type identification, and message-format analysis.

A. GENERAL FRAMEWORK
The analysis and identification of the unknown protocol mainly comprise three parts: composite feature extraction, message-type identification, and message-format analysis. The core of the unknown protocol recognition model is shown in Fig. 1. First, the unknown bitstream protocol data are used as input. The features constructed from frequent sequences and their offset positions are considered as composite features. After the composite features of the protocol are extracted, a feature library is built. Then, based on the composite feature set in the feature library, the clustering algorithm is used for protocol message clustering, and protocol data of a single message type are obtained. A message-format analysis is then performed, in which the ClustalW [31] algorithm is used to infer the message format of the protocol. Finally, the different fields of the message format are obtained, and the field identification results are taken as the output.

B. COMPOSITE FEATURE EXTRACTION
Compared with text data, bitstream protocol data have a single data form with only two values, 1 and 0, which makes it difficult to obtain semantic information. In addition, the offset position information should be considered in the composite feature extraction. In the context of bitstream protocol recognition, the frequently occurring sequences and their offset positions are considered as the composite features of the protocol. The composite feature extraction method proposed in this paper is shown in Fig. 2. It comprises feature extraction, feature recognition, and the construction of the feature library.

1) FEATURE EXTRACTION

N-GRAM SEGMENTATION
First, the data are divided into unit-length sequences, and then the frequency of each unit-length sequence is counted so that the unit-length sequences can be analyzed. The N-gram model is based on string statistics, where N denotes the length of the unit string of segmentation, whose value has a significant impact on the effectiveness and integrity of the algorithm. For bitstream data of length m, $b_1, b_2, \ldots, b_m$, the N-gram model is used for segmentation, and the partition length is N bytes. The size of N directly affects the accuracy and efficiency of feature extraction; therefore, this study utilizes Zipf's law as the basis for selecting the N value [32].
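The segmentation step can be illustrated with a minimal sketch. It assumes a sliding window of N bytes over each captured frame; the function names `ngram_units` and `unit_frequencies` are hypothetical and are not taken from the paper.

```python
from collections import Counter

def ngram_units(frame: bytes, n: int = 1):
    """Return (offset, unit) pairs for every n-byte window of one frame."""
    return [(i, frame[i:i + n]) for i in range(len(frame) - n + 1)]

def unit_frequencies(frames, n: int = 1) -> Counter:
    """Count how often each n-byte unit occurs across all frames."""
    counts = Counter()
    for frame in frames:
        counts.update(unit for _, unit in ngram_units(frame, n))
    return counts

# Two toy frames that share the leading two bytes (illustrative data only).
frames = [bytes.fromhex("0800aabb"), bytes.fromhex("0800ccdd")]
freqs = unit_frequencies(frames, n=2)
```

The offsets returned by `ngram_units` are kept because the later steps pair each frequent unit with its position.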

FREQUENT SEQUENCE
This study analyzed the traits of the bitstream protocol. From a statistical point of view, a feature of the protocol bitstream sequences has two attributes: it occurs frequently, and it has a specific meaning. Therefore, not all segmented sequences of the protocol units are protocol features. The position information in the protocol is also an important piece of information that can be used for feature extraction. The feature pair constructed from a frequently occurring sequence and its offset position is taken as a protocol composite feature:
$$F_{set} = \{(loc, seq)\} \qquad (1)$$
where $F_{set}$ denotes the set of protocol sequence units with the same offset position, $loc$ denotes the offset position, and $seq$ denotes the frequent sequence.
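A composite feature can be sketched as an (offset, sequence) pair whose support across frames reaches a minimum threshold. This is a minimal illustration under the assumption that support is the fraction of frames containing the sequence at that offset; `composite_features` and its parameters are hypothetical names.

```python
from collections import Counter

def composite_features(frames, n=1, min_sup=0.5):
    """Map each frequent (offset, n-byte sequence) pair to its support."""
    counts = Counter()
    for frame in frames:
        # Each (offset, unit) pair is counted at most once per frame.
        seen = {(i, frame[i:i + n]) for i in range(len(frame) - n + 1)}
        counts.update(seen)
    total = len(frames)
    return {feat: c / total for feat, c in counts.items() if c / total >= min_sup}

frames = [b"\x08\x00\x01", b"\x08\x00\x02", b"\x08\x06\x03"]
features = composite_features(frames, n=1, min_sup=0.6)
```

Because the offset is part of the key, the same byte value at two different positions is treated as two different features.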

2) FEATURE RECOGNITION
The redundancy of the data segmented by the N-gram model is too high. The features of the protocol should be more frequent and have specific meanings. Therefore, this paper proposes feature selection and feature joint steps to filter and splice the sequences of composite elements.

FEATURE SELECTION
In feature selection, an improved FP-growth algorithm is proposed. Based on the FP-growth algorithm, the frequent pattern tree is improved, which effectively reduces the size of the tree and the storage space of the system; the search space of the algorithm is also effectively compressed, and the frequent items are further filtered to reduce the large number of redundant false rules produced by the system when adopting the FP-growth algorithm.
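The association rules of frequent sequences were mined using the improved FP-growth algorithm. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ feature sequences, where $i_k$ denotes the $k$th feature sequence. An association rule is defined as an implication of the form $X \Rightarrow Y$, where $X$ is called the antecedent or left-hand side (LHS) and $Y$ is called the consequent or right-hand side (RHS). Support is the probability of $X$ and indicates the frequency of the frequent sequence:
$$\mathrm{sup}(X) = P(X) \qquad (2)$$
Confidence is the conditional probability $P(Y \mid X)$ and is an indication of how often the rule has been found to be true:
$$\mathrm{conf}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)} \qquad (3)$$
The lift of a rule is the ratio of the observed support to the support expected if $X$ and $Y$ were independent:
$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{sup}(X \cup Y)}{\mathrm{sup}(X)\,\mathrm{sup}(Y)} \qquad (4)$$
The conviction of a rule is defined as follows:
$$\mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{sup}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)} \qquad (5)$$
This can be interpreted as the ratio of the expected frequency with which $X$ appears without $Y$, if $X$ and $Y$ were independent, to the observed frequency with which $X$ appears without $Y$.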
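The four rule measures can be computed directly from a list of transactions. The sketch below uses the standard textbook definitions of support, confidence, lift, and conviction; treating each frame's set of features as one transaction is an assumption, and `rule_metrics` is a hypothetical helper name.

```python
def rule_metrics(transactions, x, y):
    """Support, confidence, lift and conviction of the rule X => Y.

    `transactions` is a list of sets; `x` and `y` are iterables of items.
    """
    n = len(transactions)
    x, y = set(x), set(y)
    sup_x = sum(x <= t for t in transactions) / n
    sup_y = sum(y <= t for t in transactions) / n
    sup_xy = sum((x | y) <= t for t in transactions) / n
    if sup_x == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    conf = sup_xy / sup_x
    lift = conf / sup_y if sup_y else float("inf")
    # Conviction is unbounded when the rule never fails.
    conv = float("inf") if conf == 1 else (1 - sup_y) / (1 - conf)
    return {"support": sup_xy, "confidence": conf, "lift": lift, "conviction": conv}

transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
metrics = rule_metrics(transactions, {"a"}, {"b"})
```

In this toy example the lift is below 1, which indicates that the two items occur together slightly less often than independence would predict.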
In the improved FP-growth algorithm, after the construction of the FP-tree, FP-tree mining is performed. Starting with the frequent patterns of length 1, a conditional pattern base is constructed for each pattern. Then, the conditional FP-tree is constructed and mined recursively. Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated by the conditional FP-tree. This algorithm establishes the FP-tree by scanning the frequent sequence database, captures the association between frequent sequences, and filters out infrequent sequences through the minimum support value (min_sup) [33].
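For orientation, the tree-construction phase of the standard FP-growth algorithm can be sketched as follows. This is the textbook FP-tree, not the improved tree proposed in the paper, and it omits the recursive mining phase; `Node` and `build_fp_tree` are hypothetical names.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree from the items whose support is at least min_sup."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c / n >= min_sup}
    root = Node(None)
    for t in transactions:
        # Keep frequent items only, ordered by descending global count.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent

root, frequent = build_fp_tree([["a", "b", "c"], ["a", "b"], ["a", "d"]], min_sup=0.5)
```

Transactions that share a frequent prefix share a path in the tree, which is why the tree is usually much smaller than the raw transaction list.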

FEATURE JOINT
After the feature selection, the length of the bitstream protocol features is not necessarily fixed, and the filtered composite unit sequences also need to be spliced according to their position relationship. In the splicing process, association rules are used to determine the possibility of splicing, and the splicing of the frequent sequences is completed according to the algorithm flow of the association rules. By analyzing the position difference between unit field sequences and according to association rules, the following definitions are provided for splicing:
Definition 1: Offset position of sequence P. The offset, in bits, of the first character of the sequence P from the start of the frame is defined as the position of P, POS(P). Once the offset position is taken into account, occurrences of the same sequence at different positions are treated as different sequences.
Definition 2: The expression P⇒P' is introduced. In the context of this paper, two correlated sequences P and P' satisfy POS(P)<POS(P'), with P' located at the position following P.
Definition 3: Splicing confidence. For P⇒P', the definition of confidence is changed in the context of this paper to the conditional probability that the successor P' occurs at the adjacent position given the presence of the leader P, where the length of the leader P is expressed by Len(P):
$$\mathrm{conf}(P \Rightarrow P') = \Pr\big(P' \text{ at } POS(P) + Len(P) \mid P \text{ at } POS(P)\big) \qquad (6)$$
The input of the feature joint is the segmented and filtered composite feature, and the minimum confidence threshold. The composite feature contains two parts of information, namely: the frequent sequence and the offset position. According to the above definitions of association rules, if the confidence of two composite feature sequences P and P' is greater than the threshold value, then the association rule between them is valid, and the two are spliced into a long string. Finally, the composite feature set obtained after feature selection and the feature joint is put into the feature library.
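The splicing rule can be illustrated with a short sketch. It assumes byte offsets rather than bit offsets for readability, and estimates the splicing confidence as the fraction of frames containing the leader that also contain the successor immediately after it; `splice_confidence` and `splice` are hypothetical names.

```python
def splice_confidence(frames, p, p_next):
    """Estimate conf(P => P') for two (offset, sequence) composite features."""
    (pos, seq), (pos2, seq2) = p, p_next
    if pos2 != pos + len(seq):          # P' must start right after P
        return 0.0
    with_p = [f for f in frames if f[pos:pos + len(seq)] == seq]
    if not with_p:
        return 0.0
    both = sum(f[pos2:pos2 + len(seq2)] == seq2 for f in with_p)
    return both / len(with_p)

def splice(frames, p, p_next, min_conf):
    """Join P and P' into one longer feature if the rule is confident enough."""
    if splice_confidence(frames, p, p_next) > min_conf:
        return (p[0], p[1] + p_next[1])
    return None

frames = [b"\x08\x00\x45", b"\x08\x00\x46", b"\x08\x06\x00"]
joined = splice(frames, (0, b"\x08"), (1, b"\x00"), min_conf=0.6)
```

Repeating this step over neighbouring features would grow variable-length features from the fixed-length units produced by segmentation.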

C. MESSAGE-TYPE RECOGNITION
Each protocol typically contains a sequence of messages, and each message has a message type. After the feature library has been built as described in the previous section, the message types of the protocol are recognized based on the composite feature set in the feature library. The protocol features are extracted from the feature library, and a clustering algorithm groups the messages, with the Dunn index [34] used to evaluate the result. After the protocol data are vectorized, the vectors are used as variables for clustering with different candidate numbers of message types K, and the Dunn index is used to select the final K value that separates the protocol message types. Finally, protocol data of a single message type are obtained.

1) PROTOCOL VECTORIZATION
The variables of a clustering algorithm are generally selected in one of two ways. The first involves the direct selection of continuous attributes. The second applies to attributes that can only be represented as present or absent, corresponding to "1" or "0", respectively. Based on the composite features selected in this study, the second method was chosen to determine the variables of the protocol data frame: "1" or "0" indicates whether each composite feature appears, the composite features are marked accordingly, and the resulting variables are used for clustering.
When the vectorization operation is performed on a protocol data frame, the offset position in each composite feature is used to align the feature with the corresponding position of the frame. For each composite feature, a value is assigned according to whether the feature's sequence appears at its offset position in the protocol data frame, yielding the vector $M = \{m_1, m_2, \ldots, m_k\}$, where each $m_i$ can only be 0 or 1.
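A minimal sketch of this vectorization, assuming the feature library is a list of (offset, sequence) pairs with byte offsets; `vectorize` is a hypothetical name.

```python
def vectorize(frame: bytes, features):
    """Return a 0/1 vector: 1 where the feature's sequence sits at its offset."""
    return [1 if frame[pos:pos + len(seq)] == seq else 0 for pos, seq in features]

features = [(0, b"\x08"), (1, b"\x00"), (2, b"\xff")]
vector = vectorize(b"\x08\x00\x01", features)
```

Frames shorter than a feature's offset simply yield 0 for that feature, because the slice comes back empty.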

2) THE MEASURE OF SIMILARITY
After vectorization of the protocol data frames is completed, an appropriate similarity measurement method is selected. In this study, the Jaccard distance was chosen as the similarity measure for the two data frame vectors. When Jaccard's similarity coefficient is used, the processing object is usually a binary variable, without considering the size of the actual value, and the calculation efficiency is high [35].
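For two binary vectors, the Jaccard distance is one minus the ratio of positions where both are 1 to positions where at least one is 1. The sketch below uses this standard definition; treating two all-zero vectors as identical (distance 0) is an assumption.

```python
def jaccard_distance(u, v):
    """Jaccard distance between two equal-length 0/1 vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1 - both / either

d = jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0])
```

Positions where both vectors are 0 do not affect the result, which suits sparse feature vectors in which most composite features are absent from any given frame.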

3) CLASS CLUSTER EVALUATION
The Dunn index is also used to select the most reasonable clustering result to obtain the number of message types and is defined as follows:
$$DI = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)} \qquad (8)$$
where $C_i$ and $C_j$ are any two clusters in the clustering result, $\Delta(C_k)$ represents the furthest distance between samples within cluster $C_k$, and $\delta(C_i, C_j)$ represents the distance between the two clusters $C_i$ and $C_j$.
The goal of message clustering is to assign a type to each message. To this end, a metric of similarity is defined between messages and is used to cluster similar messages together. Once all similar messages are clustered, each cluster (and all the corresponding messages) is labeled with a type. As shown in Fig. 3, the message type in message cluster 1 is one type, and there are a total of K message clusters. It is known from the previous section that there are a total of K different types of messages.
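The Dunn index for one candidate clustering can be computed as sketched below. The inter-cluster distance is taken here as the smallest pairwise distance between members of the two clusters, which is one common choice and an assumption about the paper's exact definition; `dunn_index` is a hypothetical name and at least two clusters are required.

```python
from itertools import combinations

def dunn_index(clusters, dist):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    diameter = max(
        (dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    separation = min(
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    return float("inf") if diameter == 0 else separation / diameter

# One-dimensional toy data with absolute difference as the distance.
di = dunn_index([[0, 1], [10, 12]], dist=lambda a, b: abs(a - b))
```

To choose K, one would cluster the frame vectors for several candidate K values and keep the K whose clustering gives the largest index.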

D. MESSAGE FORMAT ANALYSIS
The message type is defined by a message-format specification. The message format specifies the structure of a message, typically as a number of fields. After the message-type recognition of the protocol in the previous section has produced data of a single message type, relevant information needs to be extracted from the protocol by analyzing its message format. With the message clusters taken as input, a multiple sequence alignment algorithm and information entropy correlation theory [36] are used for protocol alignment and field partitioning to infer message formats, including fixed-length and variable-length fields. The ClustalW algorithm is used to identify the different fields of the protocol message format and to separate the length, address, sequence number, and check fields of the protocol. This method divides the field regions of the protocol with little prior knowledge.

1) MESSAGE FORMAT INFERENCE
Each field of the message format is either fixed-length or variable-length. The length value of a fixed-length field is static, and it does not change across multiple instances of the same field. The length value for a fixed-length field is part of the protocol specification and is known a priori for the implementation of the protocol. In contrast, the length of a variable-length field is dynamic; that is, it can change across multiple instances of the same field. The message-format analysis in this study focuses on variable-length fields.
The protocol region should be divided to determine the length of the field. Taking byte as the minimum division unit, the bytes of the same field have certain similarities in the statistical law. The value of each position is regarded as a discrete random variable, and its value is not unique. According to relevant statistical knowledge, information entropy can be used to represent the distribution relationships among variables. This study used the statistical distribution of different bytes as the standard for the region merging between bytes according to the offset position. Finally, according to the correlation coefficient between different bytes, the region of the field was divided.
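The per-byte entropy and a simple region merge can be sketched as follows. Merging adjacent offsets only when their entropies are equal is a simplification of the paper's correlation-based criterion, and `byte_entropy` and `merge_regions` are hypothetical names.

```python
import math
from collections import Counter

def byte_entropy(frames, offset):
    """Shannon entropy (bits) of the byte values observed at one offset."""
    values = [f[offset] for f in frames if offset < len(f)]
    if not values:
        return 0.0
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def merge_regions(entropies, tol=1e-9):
    """Group consecutive offsets whose entropies differ by at most tol."""
    regions, start = [], 0
    for i in range(1, len(entropies) + 1):
        if i == len(entropies) or abs(entropies[i] - entropies[i - 1]) > tol:
            regions.append((start, i - 1))
            start = i
    return regions

frames = [b"\x08\x00", b"\x08\x01", b"\x08\x02", b"\x08\x03"]
entropies = [byte_entropy(frames, i) for i in range(2)]
regions = merge_regions([0.0, 0.0, 2.0, 2.0, 2.0, 1.0])
```

A constant byte has zero entropy, whereas a byte that takes four equally likely values has an entropy of two bits.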

2) FIELD IDENTIFICATION RESULTS
The data input is the protocol data with a single message type. By distinguishing the length and region of the fields of the same message-type protocol data, a message-format analysis of this message type is realized.
With little prior knowledge, the offset position was used to identify the address fields. This method is defined as follows:
Definition 1: Address field candidate set. The set $U(p) = \{s_1, s_2, \ldots, s_n\}$ is defined as the set of all sequences at the offset $p$.
Definition 2: Similarity coefficient. $Sim(p_i, p_j)$ is defined to represent the similarity of the sequences at two offset positions $p_i$ and $p_j$.
The entire protocol dataset can be regarded as a two-dimensional matrix of n*m field regions, where n is the number of protocols and m is the number of field regions. After the two-dimensional matrix of the field regions is obtained, the set of address field candidates at each position is obtained. Finally, a threshold for the address field is set: if the similarity coefficient between two positions is greater than the threshold, the sequences at those two positions are considered highly similar, and the fields with the highest similarity coefficients are taken as the optimal address field solution.
Identification of the other variable-length fields involves extracting the values of the same variable-length field, arranging the data frames in time order, and analyzing how the values change. When the value of a field is related to the length of the data frame, it is recognized as a length field; when the value of a field is increasing or decreasing, it is identified as a sequence number field; and when the information entropy and the number of distinct byte values of a field exceed the threshold values, it is identified as a check field.
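These three rules can be expressed as a small decision function. The sketch is illustrative only: "related to the frame length" is interpreted as a constant difference between frame length and field value, counter wrap-around is ignored, and the thresholds and the name `classify_field` are hypothetical.

```python
def classify_field(values, frame_lengths, entropy,
                   entropy_threshold, distinct_threshold):
    """Label one field from its values in time-ordered frames."""
    distinct = len(set(values))
    # Length field: value tracks the frame length with a constant offset.
    diffs = {length - v for v, length in zip(values, frame_lengths)}
    if len(diffs) == 1 and distinct > 1:
        return "length"
    # Sequence number field: strictly increasing or strictly decreasing.
    steps = [b - a for a, b in zip(values, values[1:])]
    if steps and (all(s > 0 for s in steps) or all(s < 0 for s in steps)):
        return "sequence"
    # Check field: high entropy and many distinct values.
    if entropy > entropy_threshold and distinct > distinct_threshold:
        return "check"
    return "unknown"

label = classify_field([10, 20, 30], [24, 34, 44], entropy=1.58,
                       entropy_threshold=1.5, distinct_threshold=3)
```

The order of the tests matters: a field that grows with the frame length would also look monotonic, so the length rule is checked first.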

IV. EXPERIMENT AND RESULT
To evaluate the accuracy of the unknown bitstream protocol identification method, appropriate protocol datasets were selected and experimental evaluation metrics were designed, and the parameter selection and effect of the above algorithms were verified. In this section, the unit segmentation length and frequent sequence filtering threshold are first determined, then experimental indexes are designed to analyze the clustering results, and finally, the field identification results of protocol message formats are presented.

A. PROTOCOL DATA AND DEVELOPMENT ENVIRONMENT
Two datasets, both captured using the Wireshark software, were used in this study, namely the ICMP and ARP datasets. The configuration and environment used to implement the algorithms in this study were as follows: the integrated development environments were IDEA and PyCharm, and Java and Python were used to implement the algorithms. The operating system was Windows 10 on an Intel(R) Core(TM) i5-8250U CPU @ 1.60 GHz.

B. EVALUATION METRICS
In this study, the datasets were used to conduct protocol analysis so that the classification effect could be verified after the completion of association rule classification. First, the following indicators are introduced. Protocols that belong to a given cluster are called positive classes, and protocols that do not belong to that cluster are called negative classes.
Precision Rate: This characterizes the proportion of instances classified into the positive class that are actually positive.
$$PR = \frac{TP}{TP + FP} \times 100\% \qquad (10)$$
where $TP$ is the number of protocols assigned to the cluster that belong to it and $FP$ is the number of protocols assigned to the cluster that do not belong to it.
Recall Rate: This indicates the proportion of actual positive instances that are classified into the positive class.
$$RR = \frac{TP}{N} \times 100\% \qquad (11)$$
where $N$ is the total number of protocol frames that actually belong to the cluster.
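The two rates can be computed per message type from predicted and actual labels. The sketch below assumes that N counts the frames whose actual type is the one being evaluated; `precision_recall` is a hypothetical name and the labels are toy data.

```python
def precision_recall(predicted, actual, label):
    """Precision and recall rates (in percent) for one message type."""
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))
    n = sum(a == label for a in actual)
    pr = 100 * tp / (tp + fp) if tp + fp else 0.0
    rr = 100 * tp / n if n else 0.0
    return pr, rr

predicted = ["query", "query", "error", "query"]
actual = ["query", "query", "error", "error"]
pr, rr = precision_recall(predicted, actual, "query")
```

Averaging the per-type values over all clusters gives average precision and recall rates of the kind reported for a whole dataset.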

1) SEGMENTATION UNITS
The protocol data were first divided according to the unit length. In this study, the N-gram model was used for the protocol segmentation. During the experiments on the two datasets, with the byte as the basic unit, different segmentation lengths were selected and the frequency statistics of the unit sequences were computed. The unit sequences were then sorted by frequency and assigned natural-number ranks. Finally, the logarithm of the frequency was used as the ordinate and the logarithm of the rank as the abscissa to draw a line chart. If a certain segmentation length made the result conform to the Zipf distribution, this would indicate that the segmentation length was reasonable.
The Zipf distribution curve of the ICMP protocol dataset is shown in Fig. 6; it varies with the value of the unit length n. When n = 1, the curve exhibits the smallest fluctuation and, with an increase in the value of n, the fluctuation of the curve increases. Therefore, for the ICMP protocol, one byte was selected as the unit segmentation length in this study. The Zipf distribution curve of the ARP protocol data is also shown in the figure; for ARP, the unit segmentation length is also 1.
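The rank-frequency points for such a plot can be produced as sketched below. Fitting a least-squares slope is one hypothetical way to summarize the curve and is not necessarily the criterion used in the paper; `zipf_points` and `zipf_slope` are illustrative names.

```python
import math
from collections import Counter

def zipf_points(frames, n=1):
    """Return (log rank, log frequency) points for all n-byte units."""
    counts = Counter(
        frame[i:i + n] for frame in frames for i in range(len(frame) - n + 1)
    )
    ranked = sorted(counts.values(), reverse=True)
    return [(math.log(rank), math.log(freq)) for rank, freq in enumerate(ranked, 1)]

def zipf_slope(points):
    """Least-squares slope of the log-log curve (needs at least two ranks)."""
    k = len(points)
    mean_x = sum(x for x, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

points = zipf_points([b"\x01\x01\x01\x01\x02\x02\x03"], n=1)
slope = zipf_slope(points)
```

Running this for several values of n and comparing how closely each curve follows a straight line mirrors the comparison described for Fig. 6.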

2) SCREENING THRESHOLDS
Following completion of the unit segmentation of the protocol data, the unit sequences need to be screened to filter out data with a lower frequency. The method proposed in this paper is to randomly divide the unit sequence set into two subsets, A and B, set different screening thresholds, and compare the similarity of the two subsets A and B after filtering out the lower-frequency sequences. Section III introduced the improved FP-growth algorithm and the minimum support value; in this experiment, min_sup was used as the screening threshold. As shown in the figure, the similarity between the two filtered sets increases gradually as the threshold increases, and the threshold at which the two subsets first reach their highest similarity was selected as the final screening threshold. For the ICMP protocol, the screening threshold was 0.016, whereas the ARP protocol threshold was 0.019.
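The threshold search can be sketched as follows. The similarity between the two filtered subsets is measured here with the Jaccard coefficient of the surviving unit sets, which is an assumption because the text does not name the measure; the random split is left to the caller, and all function names are hypothetical.

```python
from collections import Counter

def frequent_units(units, threshold):
    """Units whose relative frequency in `units` is at least `threshold`."""
    counts = Counter(units)
    total = len(units)
    return {u for u, c in counts.items() if c / total >= threshold}

def subset_similarity(units_a, units_b, threshold):
    """Jaccard coefficient of the units surviving the filter in A and B."""
    a = frequent_units(units_a, threshold)
    b = frequent_units(units_b, threshold)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

def first_best_threshold(units_a, units_b, candidates):
    """Smallest candidate threshold that reaches the highest similarity."""
    best_t, best_sim = None, -1.0
    for t in sorted(candidates):
        sim = subset_similarity(units_a, units_b, t)
        if sim > best_sim:              # strict '>' keeps the first maximum
            best_t, best_sim = t, sim
    return best_t, best_sim

units_a = ["a"] * 6 + ["b"] * 3 + ["c"]
units_b = ["a"] * 5 + ["b"] * 4 + ["d"]
best = first_best_threshold(units_a, units_b, [0.05, 0.2, 0.5])
```

Low thresholds let rare units through that appear in only one subset, so the similarity rises once those units are filtered out.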

3) NUMBER OF MESSAGE TYPES
The Dunn index is the ratio of the minimum distance between any two protocol feature sets to the maximum distance within any protocol feature set. The larger the value, the better the clustering effect. The Dunn coefficients for different numbers of message clusters are shown in the figure.

FIGURE 8. Dunn coefficient line chart
As can be seen from Fig. 8, for the ICMP protocol, the Dunn coefficient of the clustering result reaches its maximum when the number of message clusters is two. The protocol message data in each cluster produced by hierarchical clustering were compared with the actual data of the dataset. The data in message cluster 1 belong to the inquiry message, and the data in message cluster 2 belong to the error report message. The number of messages in each cluster and the number of actual protocol data frames of each message type were counted, and the precision and recall rates were both 100%, which indicates that the proposed algorithm can distinguish ICMP message types well. Similarly, the Dunn coefficient was calculated for different numbers of clusters on the ARP protocol dataset. When the number of message clusters was eight, the Dunn coefficient reached its maximum value, as shown in the figure. The average recall rate and average precision rate for this protocol were 96.5% and 100%, respectively.
The results on the two datasets demonstrate that the composite feature extraction method can effectively extract composite features that identify different types of protocols, and that the improved FP-growth algorithm can classify the different message types of a protocol.
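The use of the Dunn index to choose the number of message clusters can be illustrated with a short sketch. It assumes that each message is represented as a numeric feature vector and that Euclidean distance is used; neither is stated in this section. The function names dunn_index and best_cluster_count are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: smallest distance between samples of different clusters
    divided by the largest distance between samples of the same cluster."""
    min_separation = math.inf
    max_diameter = 0.0
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        d = math.dist(p, q)  # Euclidean distance (assumed metric)
        if label_p == label_q:
            max_diameter = max(max_diameter, d)
        else:
            min_separation = min(min_separation, d)
    if max_diameter == 0.0 or math.isinf(min_separation):
        raise ValueError("need at least two clusters and one non-trivial cluster")
    return min_separation / max_diameter


def best_cluster_count(points, labelings):
    """Given {k: labels produced with k clusters}, return the k whose
    clustering has the largest Dunn index."""
    return max(labelings, key=lambda k: dunn_index(points, labelings[k]))
```

With labelings obtained by cutting a hierarchical clustering at several cluster counts, best_cluster_count returns the count at which the Dunn coefficient peaks, which is how the values two (ICMP) and eight (ARP) are read off above.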

5) FIELD IDENTIFICATION RESULTS
Protocol data from the two protocol datasets were selected, and the relationship between different bytes was analyzed using information entropy to obtain the specific region segments of the protocol. The protocol format was then inferred from the fixed-length and variable-length fields according to the statistical characteristics of the different fields. We took the first 13 bytes of ICMP as an example to calculate the protocol data statistics and information entropy distribution. In the first 13 bytes of ICMP data, the first six bytes have the same entropy value, the same field values, and the same corresponding frequencies. Therefore, the first six bytes can be grouped into one field. Similarly, bytes 6 to 11 can also be grouped into one field. Table III compares the address fields recognized for the ICMP protocol with the actual address fields under different thresholds. As can be seen from Table III, when the similarity threshold is low, not all address fields can be effectively recognized. With an increase in the threshold, all the address fields can be recognized, and the recognition rate reaches 100%. However, when the threshold is set to 1.0, the similarity requirement is too strict, and more complex IP addresses cannot be identified.
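The entropy-based grouping of adjacent bytes can be sketched as below. This is a simplified illustration under stated assumptions: it merges neighboring byte offsets using the entropy value only, whereas the analysis above also compares field values and their frequencies. The names byte_entropy and group_by_entropy and the tolerance parameter tol are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(frames, offset):
    """Shannon entropy (in bits) of the byte at `offset` across all frames."""
    counts = Counter(frame[offset] for frame in frames)
    total = len(frames)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def group_by_entropy(frames, length, tol=1e-9):
    """Merge adjacent byte offsets whose entropies differ by at most `tol`
    into candidate fields; returns inclusive (start, end) offset pairs."""
    entropies = [byte_entropy(frames, i) for i in range(length)]
    fields = []
    start = 0
    for i in range(1, length):
        if abs(entropies[i] - entropies[i - 1]) > tol:
            fields.append((start, i - 1))
            start = i
    fields.append((start, length - 1))
    return fields
```

For example, calling group_by_entropy(frames, 13) on a list of captured frames (each a bytes object of at least 13 bytes) yields candidate field boundaries for the first 13 bytes, which can then be checked against the value and frequency statistics.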
The variation in the check fields is shown in Fig. 9. The values of the check fields were irregular and evenly distributed, and the entropy value was large. The changes in the sequence number fields are shown in Fig. 10; the values of the sequence number field are arranged in increasing or decreasing order. The field identification results are presented in Fig. 11. In Fig. 11(a), two pairs of address fields, as well as the corresponding fixed-length field, are correctly identified in the protocol. However, for the length field, the offset in the actual format is 16-17, and the offset of the message-type identifier field in the actual format ranges from 34 to 35. The main reason for this is that the value of the message-type identifier field is too high, resulting in inconsistent entropy values for the two bytes. In Fig. 11(b), for the control frame structure, the length is generally fixed, and the immutable feature fields at the head and tail of the frame are considered the leading and ending frame markers. In this study, 0x0800 was determined to be a fixed-length field, whereas it is the network number in the actual frame format. It is inferred that the reason for the error is that the protocol dataset contains few samples, so the network number of hosts communicating within the same network cannot be effectively identified. For the length fields, the actual format contains two independent length fields; because the length of the control frame is fixed, the entropy values of the two fields are the same, and they are thus misjudged as one field.

1) ALGORITHM COMPARISON
To verify the effectiveness of the improved FP-growth algorithm, the performance of the message recognition method in this study was compared with that of the Apriori and FP-growth algorithms. As shown in Fig. 12, the FP-growth algorithm has the lowest efficiency, whereas the proposed algorithm has high precision and recall rates and can distinguish protocol message types. The performance of the improved FP algorithm was also compared with that of the original FP algorithm. The two algorithms mine association rules for transaction database D (as shown in Table IV) to discover frequent sets, as shown in the figure, by scanning 1-5 transactions in the database. Comparing the two algorithms on this example shows that the improved FP-growth algorithm effectively reduces the size of the tree when building the frequent pattern tree (as shown in Table V). Consequently, the required system storage space is reduced, and the search space of the algorithm is effectively compressed.
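To make the notion of tree size concrete, the following sketch builds a standard FP-tree and counts its nodes. It illustrates the textbook construction only and is not the improved variant from Section III; the names Node, build_fp_tree, and count_nodes, and the example transactions, are hypothetical.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and its child nodes."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build a standard FP-tree, keeping only items that occur in at
    least `min_count` transactions."""
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, c in freq.items() if c >= min_count}
    root = Node(None)
    for t in transactions:
        # Order items by descending global frequency (ties broken by item)
        # so that transactions sharing frequent items share a prefix path.
        items = sorted((i for i in set(t) if i in keep),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

Counting nodes in this way gives a simple storage measure: fewer nodes mean a smaller tree to store and a smaller space to search when mining frequent sets.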

2) PRECISION AND RECALL
In the experimental section, the proposed protocol identification results were compared with those of the method proposed by Zhang [29]. Zhang's method also targets unknown bitstream protocols: it discovers protocols from frequent sequences and positions based on clustering and detects address fields based on the similarity of the unit sets at different positions. Tables VI and VII provide details of the frequent sequences and locations of the ICMP and ARP protocols. Frequent locations are listed in descending order of frequency. The frequencies of positions (12), (16), (20), (23), and (34) in Table I and positions (12), (28), and (38) in Table II were all observed to reach 1. These common sequences are the keywords used in the protocol, and the experiment was conducted on the frequent sequences and positions whose frequencies did not reach 1. Figs. 13 and 14 show the precision and recall rates of the two methods for different positions in the ICMP and ARP frequent sequences. In Fig. 13, both methods find ICMP messages with 100% precision and recall rates in the frequent sequence at position (34). However, the precision and recall rates of the proposed method are slightly higher than those of Zhang's method for the other frequent sequences. In Fig. 14, both methods find ARP messages with 100% precision and recall only at sequence "01" in the 21st byte, which is the keyword of ARP. For the other frequent sequences, the precision and recall rates of the proposed method are higher than those of Zhang's method. Therefore, the proposed method is faster and more efficient.
1) Composite feature extraction: It contains two important parts: N-gram generation and feature identification. The computational complexity of the two parts is presented in Table VIII in terms of L, the first constant number of bytes of a protocol data frame; the size of the samples of the protocol dataset; and K, the number of attributes (i.e., features).
As K ≫ L, the overall computational complexity of this phase is O( * ).
2) Protocol message-type identification: It contains three important parts: protocol vectorization, the measure of similarity, and class cluster evaluation. The time complexity of each part is presented in Table VIII in terms of L, the first constant number of bytes of a protocol data frame; the size of the samples used for clustering; and K, the number of attributes (i.e., features). Note that, in practice, K ≫ L. Therefore, the overall computational complexity of this phase is O( * 2 ).
3) Protocol message-format analysis: It contains two important parts: message-format inference and field identification, where L is the first constant number of bytes of a protocol data frame, and K is the number of attributes (i.e., features). The computational complexity of this phase is O( * 2 ) , where is the size of the field, and is the number of bits per field.

V. CONCLUSION
This paper proposed a method for identifying and analyzing unknown bitstream protocols. It addresses the difficulty of obtaining protocol information and specifications when no prior knowledge is available. The proposed method identifies the protocol message type, analyzes the protocol message format, and produces field identification results. Our experimental results showed that it achieved high precision and recall rates on the ICMP and ARP datasets. This is of great significance for the identification of unknown protocols; however, a more comprehensive analysis of the protocol's message-format information is still required. We are currently working toward extending the proposed method to accurately infer semantic information about fixed-length fields.