Research on Communication Network Structure Mining Based on Spectrum Monitoring Data

The physical characteristics of the massive volume of spectrum signals that carry communication information, and the statistical laws of these characteristics, also reflect the communication behavior of the communication individuals and the intelligence related to that behavior. Intercepting and cracking signal content is usually difficult and costly, and in many cases the encrypted content cannot be cracked at all. However, by studying the physical features extracted from spectrum monitoring signals and the statistical laws of these features, it is still possible to uncover the hidden relationships between communication individuals, and even the communication network structure, and thus to analyze their communication behavior. Based on the carrier frequency, bandwidth, power, monitoring time and direction of spectrum monitoring signals, this paper identifies each spectrum signal and studies the distribution characteristics and statistical laws of massive spectrum monitoring signals in the cylindrical coordinate system. Because the spectrum signals generated by a source cluster in power, monitoring time and direction, and because the signals generated by the two parties of a communication are correlated, this paper proposes a method, based on an improved density clustering algorithm, for mining the communication relationships between communication individuals from spectrum monitoring data, and infers and constructs the communication network structure by matching communication individuals with communication relationships. Finally, we analyze the communication network structure mined from the spectrum monitoring data.


II. RELATED WORK
Although a large body of literature has studied spectrum signals in depth, these studies focus mainly on the characteristics and information of the signals themselves, such as the estimation of signal parameters [7]-[9], signal detection [10], [11], anomaly detection based on signal characteristics [12]-[15], monitoring and management of spectrum signals [1]-[3], spectrum sensing [16]-[19], and spectrum decision [20], [21]. For the massive spectrum signals generated by communication, comparatively little work has mined the connections between spectrum signals, the communication relationships between the individuals that generate them, or the behavior characteristics of those individuals.
Existing research on the communication behavior of wireless communication individuals mainly relies on monitoring or eavesdropping to crack the content of the intercepted spectrum signal [22]-[24], and analyzes communication behavior and intention according to the content of the

Since the wireless channel is easily disturbed by the environment and has a high bit error rate, a radio station operating in half-duplex mode usually adopts the stop-and-wait ARQ protocol at the data link layer to ensure reliable data transmission. Fig. 1 shows the time occupied by the sending of an information frame and the reply of a confirmation frame in a communication process based on the stop-and-wait ARQ protocol. The red rectangle indicates the duration of the information frame sent by the station, the green indicates the duration of the confirmation frame (error pattern) sent by the station, the blue indicates the duration of the station receiving the information frame (or the confirmation frame), the yellow indicates the switching time between transmission and reception, and the blank interval indicates the propagation delay $T_d$. Therefore, for a pair of communicating stations, the monitored spectrum signal set is generated jointly by the transmitting station and the receiving station, that is, it corresponds to two sources. In frequency-hopping communication, the carrier frequency changes continuously and no information is transmitted while the channel is being switched. In fixed-frequency communication, by contrast, the carrier frequency remains unchanged, so more data is detected in the same time.

2) THE IMPACT OF SCAN CYCLE ON SPECTRUM MONITORING DATA
For monitoring equipment, the scanning period is affected by the monitoring range and the scanning rate. Different scanning periods yield monitoring data of different densities. Fig. 2 shows the amount of data collected by the monitoring device under different scanning periods, where the green rectangle indicates the duration of the spectrum signal propagating to the monitoring device, and purple and orange indicate the monitoring conditions corresponding to different scanning periods. The shorter the scanning period, the more spectrum data is detected. For fixed-frequency communication, the main factors affecting the data distribution are the propagation delay and the scanning period.
VOLUME 8, 2020
On the other hand, for frequency-hopping communication, more spectrum signals will be missed. To resist interference, the carrier frequency of frequency-hopping communication changes constantly, and this hopping also allows much of the signal to evade the monitoring equipment. Fig. 3 shows the monitoring of frequency-hopping communication. The abscissa is time, in units of scan cycles, and the ordinate is frequency. The yellow rectangle in Fig. 3 indicates the frequency range monitored within one scan bandwidth, and the horizontal lines of different lengths represent different frequency-hopping signals. The monitoring equipment monitors the range 30-90 MHz, and the scanning bandwidth (the height of the yellow rectangle in the figure) is 20 MHz. One scanning period corresponds to three yellow rectangles. Signals in the white area go undetected, so the scanned data has a large number of missing signals. These gaps lead to a lower density and an uneven distribution of the monitoring data. The data processing method must therefore be able to tolerate missing data.
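The effect described above can be illustrated with a small sketch. It assumes, purely for illustration, that the 30-90 MHz range is swept as three consecutive 20 MHz windows of equal dwell time within one scanning period, and that a hop is detected only if its carrier lies in the window being watched at that instant. The function names and the example hop list are hypothetical and are not taken from the paper.

```python
def active_window(t, period, f_min=30.0, f_max=90.0, bandwidth=20.0):
    """Return the (low, high) band in MHz watched at time t.

    Assumption: the windows are visited in ascending order with equal dwell time.
    """
    n_windows = int(round((f_max - f_min) / bandwidth))  # 3 windows for 30-90 MHz
    phase = (t % period) / period                         # position inside the period, in [0, 1)
    index = min(int(phase * n_windows), n_windows - 1)
    low = f_min + index * bandwidth
    return low, low + bandwidth


def is_detected(freq, t, period):
    """A hop at carrier `freq` (MHz) is seen only if the active window covers it."""
    low, high = active_window(t, period)
    return low <= freq < high


def detected_fraction(hops, period):
    """Fraction of (time, frequency) hops that fall inside the watched window."""
    hits = sum(is_detected(freq, t, period) for t, freq in hops)
    return hits / len(hops)


# Hypothetical hop list: (time in arbitrary units, carrier frequency in MHz).
example_hops = [(0.1, 35.0), (1.5, 35.0), (2.9, 85.0), (0.5, 85.0)]
example_fraction = detected_fraction(example_hops, period=3.0)
```

Under these assumptions a hop whose carrier is uniformly distributed over the monitored range is caught only about one time in three, which is consistent with the sparse and uneven data density discussed above.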

B. FEATURE SELECTION
Due to the interaction and transmission of information, there is a communication relationship between communication individuals, which constitutes the basis of the communication network. In order to mine the communication relationship and communication network among sources from the spectrum monitoring data, this paper firstly classifies the spectrum signals by clustering method based on the characteristics of spectrum signals. Each cluster set corresponds to the spectrum signal set generated by the source in their respective communication. Then we replace the source with the clustering set, and determine the communication relationship between the source nodes according to the distribution characteristics of the data of the clustering set in time. Finally, we construct a communication network based on the communication relationship between nodes, so as to mine the communication network from the spectrum monitoring data.

1) SIGNAL POWER
The signal power represents the distance information of the station relative to the monitoring device. Due to the error of the monitoring equipment and the fading during signal propagation, the monitored signal power exhibits a normal distribution. If the position of the station is relatively fixed, the power of the monitored spectral signal exhibits a stable distribution. Even if the radio is moving, the power of the spectrum signal monitored in a short period of time exhibits a stable distribution or change.

2) SIGNAL DIRECTION (ANGLE)
The signal direction (angle) represents the direction information of the station relative to the monitoring device. Due to the error of the monitoring equipment and the fading during signal propagation, the monitored signal direction exhibits a normal distribution. If the position of the station is relatively fixed, the angle of the monitored spectral signal exhibits a stable distribution. Even if the radio is moving, the angle of the spectrum signal monitored in a short period of time exhibits a stable distribution or change.

3) SIGNAL MONITORING TIME
Since wireless communication has a high error rate, the station usually performs error control based on the stop-and-wait ARQ protocol. The sender sends a data frame and the receiver replies with an acknowledgement frame immediately, so the spectrum signals generated by the two parties cover roughly the same time range. In addition, the monitored spectrum signals appear continuously in time, so the spectrum signal exhibits a stream pattern in the time domain.
Features such as carrier frequency, signal bandwidth, signal power, time of signal occurrence and signal direction, extracted from the spectrum monitoring data, carry important information about the spectrum signals and can uniquely identify them. Therefore, we take them as the features for mining communication relationships and inferring the communication network structure.

C. FEATURE REPRESENTATION
Let the spectrum monitoring data set be $X = \{x_1, x_2, \cdots, x_i, \cdots, x_n\}^T$, where $x_i = \{f_i, B_i, \theta_i, P_i, t_i\}$; $f_i$ represents the signal frequency, $B_i$ represents the signal bandwidth, $\theta_i$ represents the signal direction, $P_i$ represents the signal power, and $t_i$ represents the signal monitoring time. In order to study the clustering properties of the spectrum data more intuitively, this paper introduces a cylindrical coordinate system to describe the distribution of the data. Let the data set in this coordinate system be $Y = \{y_1, y_2, \cdots, y_j, \cdots, y_n\}^T$, where $y_j = \{\theta_j, P_j, t_j\}$. Fig. 4 shows the distribution of the spectrum data generated by a pair of communication stations in the cylindrical coordinate system. The data clearly shows clustering and communication directivity. The data set Y is projected onto the polar plane of the cylindrical coordinate system, which gives the data set $Z = \{z_1, z_2, \cdots, z_k, \cdots, z_n\}^T$, where $z_k = \{\theta_k, P_k\}$. Fig. 5 shows the distribution of the data set Z in the polar coordinate system, with the origin representing the position of the monitoring station. The distribution of the data in polar coordinates indicates the position of the radio stations relative to the monitoring equipment and relative to one another. This provides node locations for the construction of the communication network, although they are not real geographical locations.
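As a minimal sketch of this representation, the following code stores Y as an array with columns (θ, P, t), projects it onto the polar plane to obtain Z, and converts Z to planar Cartesian coordinates for plotting or distance computation. The column order, the use of radians, and the function names are assumptions made for the example.

```python
import numpy as np


def project_to_polar(Y):
    """Drop the time column of Y (columns: theta, P, t) to obtain Z (theta, P)."""
    return Y[:, :2]


def polar_to_cartesian(Z):
    """Map (theta, P) to planar (x, y) with x = P cos(theta), y = P sin(theta).

    Assumption: theta is given in radians.
    """
    theta, power = Z[:, 0], Z[:, 1]
    return np.column_stack([power * np.cos(theta), power * np.sin(theta)])


# Two hypothetical monitored signals: (direction, power, monitoring time).
Y_example = np.array([
    [0.0,       2.0, 1.0],
    [np.pi / 2, 3.0, 2.0],
])
Z_example = project_to_polar(Y_example)
xy_example = polar_to_cartesian(Z_example)
```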

IV. COMMUNICATION RELATIONSHIP DISCOVERY
A. COMMUNICATION RELATIONSHIP MINING METHOD
Mining the communication relationship between the sources from the spectrum monitoring data is to classify the monitored spectrum signals according to the characteristics of the signals. In this process, the spectrum signals generated by each communication station during each communication process are separated from the spectrum monitoring data. The spectrum data of the classification set represents the spectrum signal generated by the radio station in a communication. Based on the classification results, the classification sets of spectrum data with similar time range are matched to mine the communication relations.

B. DENSITY CLUSTERING
Signal power, signal direction, and signal monitoring time can uniquely identify the monitored spectrum signal. Because of the propagation delay, the path loss and the error of the monitoring equipment, the monitored spectrum signal has errors in signal power and signal direction. These errors result in an approximately normal distribution of signal power and direction. Signal monitoring time represents the time at which the signal appears. Spectrum monitoring data is acquired by the scanning of the monitoring equipment, and continuous communication makes the monitored spectrum signals continuous in time. The data represented by signal power, signal direction and signal monitoring time exhibits manifold clustering, as shown in Fig. 4. In addition, because the monitored data is incomplete and mixed, clustering is a suitable way to classify it.
The data of data set Y exhibits manifold characteristics. In the time dimension, since the scanning period is constant, the spacing of the generated data is relatively stable, so scaling the spacing of the data in the time dimension does not change the clustering characteristics of the data. We also study the distribution law of the data in the cylindrical coordinate system. Based on the characteristics of this coordinate system, this paper changes the spherical ε-neighborhood of the original OPTICS algorithm to a columnar neighborhood, and the neighborhood is defined as:
where h is the threshold of the time difference between data points, which determines the height of the columnar neighborhood, and ε determines the base area of the columnar neighborhood. After defining the columnar neighborhood $N_\varepsilon(y_j)$, the value of MinPts must be determined, and ε and h must be estimated to fix the range of the neighborhood.
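To make the idea of a columnar neighborhood concrete, the sketch below shows one possible membership test. It is an assumption, not the paper's exact definition: a point belongs to the neighborhood of $y_j$ if its planar Euclidean distance to $y_j$ (after converting direction and power to Cartesian coordinates) is at most ε and its time difference is at most h. This choice is consistent with the cylinder volume 2hπε² used later in this section. The function name is hypothetical.

```python
import numpy as np


def columnar_neighbors(Y, j, eps, h):
    """Indices of points inside the columnar (eps, h)-neighborhood of Y[j].

    Y has columns (theta, P, t), with theta in radians (assumption).
    The base of the column is a disc of radius eps in the (x, y) plane,
    and the column extends h above and below Y[j] along the time axis.
    """
    theta, power, t = Y[:, 0], Y[:, 1], Y[:, 2]
    x, y = power * np.cos(theta), power * np.sin(theta)
    planar = np.hypot(x - x[j], y - y[j])          # distance in the base plane
    inside = (planar <= eps) & (np.abs(t - t[j]) <= h)
    return np.flatnonzero(inside)                   # includes j itself


# Hypothetical data: two nearby signals, one late signal, one from the opposite direction.
Y_demo = np.array([
    [0.0,   1.00, 0.0],
    [0.0,   1.05, 0.5],
    [0.0,   1.05, 5.0],
    [np.pi, 1.00, 0.2],
])
neighbors_demo = columnar_neighbors(Y_demo, j=0, eps=0.1, h=1.0)
```

A density-based algorithm such as OPTICS would then compare the size of this index set with MinPts to decide whether $y_j$ is a core point.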
Daszykowski et al. [30] propose that the selection of the MinPts value in the neighborhood depends on the number of objects in the data. In addition, the distribution characteristics of the data and additional information about the data clusters can also be used to define MinPts.
Based on the preset value of MinPts, we estimate ε and h. Daszykowski et al. [30] optimize the neighborhood radius ε by using a data set that has the same dimension as the research data but is uniformly distributed within the experimental range, regardless of the distribution of objects in the research data set. As shown in Fig. 6, the data set U contains m data points and follows the normal distribution. The data set V is uniformly distributed and has the same dimension, number of data points and experimental range as U. To select the neighborhood radius ε for U, the distance from each object in V to its MinPts-th neighbor is calculated, the m distances are sorted in ascending order, and the distance at the 95% position of the sorted list is selected as ε.
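The estimation procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the reference implementation of [30]: the uniform data set V is drawn inside the bounding box of U, the MinPts-th neighbor is counted without the point itself, and "95%" is interpreted as the 0.95 quantile of the sorted distances. The function name, the seed and the brute-force distance computation are choices made for the example.

```python
import numpy as np


def estimate_eps(U, min_pts, quantile=0.95, seed=0):
    """Estimate eps for data set U from a uniform reference set V.

    V has the same size and dimension as U and is drawn uniformly within
    the bounding box of U (assumed to be the 'experimental range').
    """
    rng = np.random.default_rng(seed)
    low, high = U.min(axis=0), U.max(axis=0)
    V = rng.uniform(low, high, size=U.shape)

    # Brute-force pairwise distances; adequate for a few thousand points.
    diff = V[:, None, :] - V[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist.sort(axis=1)                  # column 0 is the zero distance to itself

    kth = dist[:, min_pts]             # distance to the MinPts-th other point
    return float(np.quantile(kth, quantile))


# Hypothetical normally distributed data set U with m = 200 points.
U_demo = np.random.default_rng(1).normal(size=(200, 2))
eps_demo = estimate_eps(U_demo, min_pts=5)
```

Because V ignores the actual density of U, the resulting ε acts as a baseline: regions of U that are denser than a uniform spread will contain more than MinPts points within this radius.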
Inspired by [30], this paper estimates the columnar neighborhood from a data set that has the same dimension as the research data but is uniformly distributed within the experimental range, while also taking the distribution of the data objects into account. In the cylindrical coordinate system, the data set Y presents a local manifold distribution, and different cluster sets have similar density and distribution characteristics. During communication, the acknowledgment message sent by the receiving station is shorter in duration than the information transmitted by the sending station. Within the same time, fewer signals from the receiving station are monitored, so the density of the receiving station's spectrum data in cylindrical coordinates is lower, as shown in Fig. 4. Because of this difference in density, the columnar neighborhood formed by ε and h is determined from the sparser cluster sets of the receiving stations; such a columnar neighborhood remains valid for the denser data.
Let the spectrum signal set generated by a certain receiving station be $R = \{\theta_i, P_i, t_i\}$, where $i = 1, 2, \cdots, m$. For a more intuitive representation, the data set R is transformed into a three-dimensional Cartesian coordinate system by formula (4) to obtain $R' = \{x_i, y_i, t_i\}$.

$x_i = P_i \cos\theta_i, \quad y_i = P_i \sin\theta_i$ (4)

The range occupied by the data set $R'$ in space is denoted as $V_{R'}$. Let $R''$ be a data set with the same dimensions and experimental range as the data set $R'$, but subject to uniform distribution. The average range occupied by each object in $R''$ can be expressed as $V_{R'}/m$. Here $2h\pi\varepsilon^2$ represents the range occupied by the cylindrical (ε, h)-neighborhood, and $\mathrm{MinPts} \cdot V_{R'}/m$ represents the average range corresponding to MinPts points in the neighborhood of each object. Based on the given MinPts, Eq. (6) determines the relationship between h and ε and the range of the columnar (ε, h)-neighborhood.
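Eq. (6) is not reproduced in this text. As a hedged sketch, assume that it simply equates the two quantities described above, that is, the volume 2hπε² of the columnar neighborhood and the average volume MinPts · V/m available to MinPts points of the uniform data set. Under that assumption, ε follows from a chosen h as shown below. The function name and the example numbers are illustrative only.

```python
import math


def eps_from_h(volume, m, min_pts, h):
    """Solve 2*h*pi*eps**2 = min_pts * volume / m for eps.

    volume : range occupied by the data set (V in the text)
    m      : number of objects in the data set
    min_pts: preset MinPts
    h      : chosen time threshold (half-height of the column)
    The equality itself is an assumption about the form of Eq. (6).
    """
    return math.sqrt(min_pts * volume / (2.0 * math.pi * h * m))


# Hypothetical values: 50 points spread over a volume of 1000 units, MinPts = 5, h = 2.
eps_example = eps_from_h(volume=1000.0, m=50, min_pts=5, h=2.0)
```

The relation makes the trade-off explicit: for a fixed MinPts, a taller column (larger h) requires a smaller base radius ε, and vice versa.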

C. MATCH CLUSTERS TO DETERMINE COMMUNICATION RELATIONSHIPS
The spectrum monitoring data is classified by the improved OPTICS algorithm, and each cluster set represents the set of spectrum signals generated by a station in one communication, as shown in Fig. 4. Under stop-and-wait ARQ, the communicating parties keep exchanging data frames and acknowledgements throughout the communication. Therefore, for two stations with a communication relationship, the generated spectrum signals are distributed similarly in time; that is, the initial signal time and the end time of the two cluster sets are similar. The communication relationship between signal sources can thus be confirmed from the distribution of the signals in time.
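The matching rule described above can be sketched as follows. The tolerance `tol` that decides when two start times or end times are "similar" is a hypothetical parameter, since no threshold is given in the text. The larger cluster is labeled as the sender, because the receiving station only sends short acknowledgements and therefore produces fewer monitored signals.

```python
from itertools import combinations


def match_clusters(clusters, tol):
    """Pair clusters whose start and end times both agree within `tol`.

    clusters: dict mapping a cluster id to a list of monitoring times.
    Returns a list of (sender, receiver) pairs.
    """
    spans = {cid: (min(ts), max(ts), len(ts)) for cid, ts in clusters.items()}
    links = []
    for a, b in combinations(sorted(spans), 2):
        start_a, end_a, n_a = spans[a]
        start_b, end_b, n_b = spans[b]
        if abs(start_a - start_b) <= tol and abs(end_a - end_b) <= tol:
            # The cluster with more monitored signals is taken as the sender.
            sender, receiver = (a, b) if n_a >= n_b else (b, a)
            links.append((sender, receiver))
    return links


# Hypothetical monitoring times (seconds) for three cluster sets.
clusters_demo = {
    "U1": [0.0, 0.5, 1.0, 1.5, 2.0],
    "U2": [0.2, 1.9],
    "U3": [5.0, 5.5, 6.0],
}
links_demo = match_clusters(clusters_demo, tol=0.3)
```

In this example only U1 and U2 overlap in time, so they form one communication relationship with U1 as the sender.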

V. THE NETWORK STRUCTURE MINING AND ANALYSIS
A. CONJECTURE OF COMMUNICATION NETWORK STRUCTURE
Mining the communication network in the spectrum monitoring data is to classify the spectrum monitoring data by clustering method. Then the relative position of the cluster set in the cylindrical coordinate system is taken as the node of the network. Finally, based on the communication relationship between clustering sets, we connect nodes to build the network and record the communication direction.

Algorithm 1 The Communication Relationship Discovery Algorithm
Input: data set $Y = \{y_1, y_2, \cdots, y_j, \cdots, y_n\}^T$, where $y_j = \{\theta_j, d_j, t_j\}$; ε, MinPts, h
Output: Signal spectrum set V corresponding to the communication relationship
        The centroid position $(\theta_i, P_i)$ of the source
        Communication direction
        Communication sequence
1: According to the distance defined by formulas (2) and (3), use the OPTICS algorithm to cluster the data to obtain the clustering set $U = \{U_1, U_2, U_3, \cdots, U_l, \cdots\}$ of spectrum signals
2: Calculate the centroid position $(\theta_i, P_i)$ of the cluster set $U_l$ projected to the polar coordinate system
3: Sort the objects of the cluster set $U_l$ according to time, and extract the initial time and end time of the signals of the cluster set $U_l$
4: Calculate the time range of the data in the cluster set $U_l$
5: Match the cluster sets $U_l$ to discover the communication relationships of information interaction
6: if the initial time of $U_l$ is close to that of $U_j$
7:     if the end time of $U_l$ is close to that of $U_j$
8:         There is a communication relationship between $U_l$ and $U_j$.
9:         Compare the number of data points in $U_l$ and $U_j$: the set with fewer points belongs to the receiver, and the set with more points belongs to the sender.
10:        $V_k = \{U_l, U_j\}$ is the spectrum set corresponding to the communication relationship
11:    end if
12: end if
13: Output the spectrum signal sets corresponding to different communication relationships in the cylindrical coordinate system

In the process of building the communication network, the nodes of the network must first be determined. The data set $Z = \{z_1, z_2, \cdots, z_j, \cdots, z_n\}^T$ represents the direction and power information of each spectral signal in the spectrum monitoring data, where $z_j = \{\theta_j, P_j\}$. In the polar coordinate system, data set Z describes the relative position information of the spectral signals, and the data presents a clustered distribution. The DBSCAN algorithm divides the data set into $Z = \{C_1, C_2, \cdots, C_p, \cdots, C_m, D\}$, where $p = 1, 2, 3, \cdots, m$.
The data distribution of the cluster set $C_p$ represents the relative position of the source in the polar coordinate system, and D is the set of abnormal points. The centroid neighborhood of each cluster set $C_p$ represents the relative position of the source and acts as a node of the communication network. The centroid position $\bar{C}_p$ of the cluster set $C_p = \{c_{p1}, c_{p2}, \cdots, c_{pi}, \cdots, c_{pk}\}$ (where $c_{pi} = (\theta_{pi}, P_{pi})$) is expressed as:

In order to record and study the communication relationships and the gradual change of the communication network in different time periods, we divide the data set Y into $Y = \{Y_1, Y_2, \cdots, Y_i, \cdots\}$ according to the time interval $t_{interval}$. This segmentation of data set Y is necessary: only in this way can we intuitively analyze the communication relationships, communication sequences, network connectivity, paths, and communication directions in different time periods. Based on Algorithm 1, we mine the communication relationships of $Y_i$, record the communication direction and communication sequence, and calculate the relative position $(\theta_l, P_l)$ of the source in polar coordinates, where $l = 1, 2, \cdots$. In order to correctly match the source relative position $(\theta_l, P_l)$ in $Y_i$ with the network node $(\theta_p, P_p)$, we set the neighborhood range of $\bar{C}_p$:

In a region 30 km wide and 30 km deep, 10 radio stations were randomly placed as the experimental information sources; stations D and J carried out fixed-frequency communication, and the other stations carried out frequency-hopping communication. The spectrum range of radio communication is 30-90 MHz, the scanning bandwidth of the monitoring equipment is 20 MHz, and the scanning rate is 80 GHz/s. Fig. 7 shows the distribution of radio stations and monitoring equipment, where blue dots represent radio stations and red dots represent monitoring equipment.
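As a rough worked example of what these settings imply, assume that the scanning period is simply the monitored range divided by the scanning rate, ignoring any retuning or settling time of the equipment. The symbols $T_{\text{scan}}$ and $T_{\text{window}}$ are introduced here for illustration only.

```latex
T_{\text{scan}} \approx \frac{(90 - 30)\ \text{MHz}}{80\ \text{GHz/s}}
              = \frac{60 \times 10^{6}}{80 \times 10^{9}}\ \text{s}
              = 0.75\ \text{ms},
\qquad
T_{\text{window}} \approx \frac{20\ \text{MHz}}{80\ \text{GHz/s}} = 0.25\ \text{ms}
```

Under this assumption, each of the three 20 MHz windows is visited once every 0.75 ms, so a frequency-hopping signal is observable only during the roughly 0.25 ms in which its current carrier lies inside the active window.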
Based on the radio stations and monitoring equipment set up in Fig. 7, we simulated the communication between the radio stations, monitored the spectrum signals through the monitoring equipment, and then mined the communication relationships and communication network from the spectrum monitoring data. Finally, we analyzed the structure of the communication network. In the communication model of the Table 1. Fig. 8 shows the sequence of radio communication in Table 1 over time, where different colors represent the communication between different radio stations, "1" represents the sending state of a radio station, and "2" represents the receiving state of a radio station.
Based on the geographic location of communication nodes (stations) and the simulated communication between them, the actual communication network model is shown in Fig. 9.

B. SPECTRUM MONITORING DATA DESCRIPTION
After preprocessing the spectrum monitoring data, we obtain the data set X for communication behavior research, which contains the following characteristics: signal center frequency point, signal power, signal monitoring time and signal direction. Table 2 shows the format of the spectrum monitoring data set X.

C. ANALYSIS OF EXPERIMENT RESULT
For the data set $Y = \{y_1, y_2, \cdots, y_j, \cdots, y_n\}^T$, where $y_j = \{\theta_j, d_j, t_j\}$, we set $t_{interval} = 8$ s and divide the 56 s of spectral data Y into 7 segments, namely $Y = \{Y_1, Y_2, \cdots, Y_i, \cdots\}$, $i = 1, 2, 3, \cdots, 7$. According to Algorithm 2, this paper mines the structure of the communication network from the spectrum monitoring data. Fig. 10(a) shows the projection of data set Y on polar coordinates, that is, the relative position of each source marked by signal power and direction in polar coordinates. Cluster sets of different colors represent the distribution of signals generated by different sources. Fig. 10(b) shows the relative positions of the centroid neighborhoods of the respective cluster sets of Fig. 10(a), which are used as nodes of the network.
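The segmentation step can be sketched as follows, assuming Y is stored as an array with columns (θ, P, t) and that segments are half-open intervals of length $t_{interval}$ measured from the earliest monitoring time. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def split_by_time(Y, t_interval):
    """Split Y (columns: theta, P, t) into consecutive time segments.

    Segment k holds the rows with t in [t0 + k*t_interval, t0 + (k+1)*t_interval),
    where t0 is the earliest monitoring time (assumption).
    """
    t = Y[:, 2]
    index = np.floor((t - t.min()) / t_interval).astype(int)
    return [Y[index == k] for k in range(index.max() + 1)]


# Synthetic example: one signal every 0.5 s over 56 s of monitoring.
t_demo = np.arange(0, 56, 0.5)
Y_demo_segments_input = np.column_stack(
    [np.zeros_like(t_demo), np.ones_like(t_demo), t_demo]
)
segments_demo = split_by_time(Y_demo_segments_input, t_interval=8.0)
```

With $t_{interval} = 8$ s the 56 s of data fall into 7 segments, matching the division used in this experiment.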
We take $Y_1$ to demonstrate the communication network structure mining. Fig. 11 shows the mining process of the communication relationships and communication network structure of $Y_1$. Fig. 11(a) shows the clustering results of the data set $Y_1$ in the cylindrical coordinates composed of signal power, signal direction and signal monitoring time, where cluster sets of different colors correspond to the spectrum signal sets generated by different radio stations in the communication process. Since the spectrum signals generated by radio stations with a communication relationship are very similar in time, we match the cluster sets according to their time ranges to determine the communication relationships between radio stations. Fig. 11(b) shows the matching result of the cluster sets in Fig. 11(a), where cluster sets with a communication relationship are labeled with the same color. Fig. 11(a) and (b) together show the discovery process of the communication relationships of $Y_1$. Section IV describes the communication relationship mining method in detail, in which the data clustering method is the improved OPTICS algorithm.
To build a communication network structure, we need to identify nodes and edges. We projected the data in Fig. 11(b) into the polar coordinate system to obtain the distribution of the spectral data in the polar coordinate system, as shown in Fig. 11(c). Then we calculated the centroid of each cluster set of Fig. 11(c) and matched it with the coordinates of the nodes in Fig. 10(b). The matched nodes are labeled with the same color in Fig. 11(d). Finally, we connected nodes of the same color according to the communication relationships to form the network topology of $Y_1$, as shown in Fig. 11(d). The arrow indicates the communication direction. Fig. 12 shows the communication network structures corresponding to the different subsets of data set Y. Finally, we merge all the network structure snapshots to form the network structure of Y, as shown in Fig. 13.

D. NETWORK STRUCTURE ANALYSIS
As shown in Fig. 13, in the monitoring range, $v_8$ and $v_9$ are stations that communicate independently and do not communicate with the other nodes, and can be regarded as network F; the other nodes constitute network G. From the comparison of the various figures in Fig. 12, it can be found that the network G is mainly divided into three paths:
Path 1: $v_0 \leftrightarrow v_3 \leftrightarrow v_4 \leftrightarrow v_5 \leftrightarrow v_1$
Path 2: $v_0 \leftrightarrow v_3 \leftrightarrow v_4 \leftrightarrow v_6 \rightarrow v_5$
Path 3: $v_0 \leftrightarrow v_3 \leftrightarrow v_2 \leftrightarrow v_7$
In network G, information transfer starts from node $v_0$, and then passes through $v_3$ to the other nodes. As nodes $v_1$, $v_4$, $v_5$, $v_6$ communicate closely, they constitute the sub-network $G_1$, where $v_4$ is the core node of sub-network $G_1$. $v_2$ and $v_7$ constitute sub-network $G_2$. $v_3$ is the key node connecting the two sub-networks. $v_0$ is the initial node of communication, and $v_1$ and $v_7$ are the terminal nodes of communication. From the statistical analysis of the nodes, d , these nodes can be regarded as important nodes for network communication.
In summary, the network G can be divided into four layers. $v_0$ is the starting point of communication and can be regarded as the highest-level node; $v_3$ is the connection point of sub-networks $G_1$ and $G_2$ and the intermediate node of information exchange, and is the second-level node; $v_4$ and $v_2$ are regarded as the core nodes of the sub-networks, which organize the communication inside each sub-network and serve as the third-level nodes. As terminal nodes, $v_1$, $v_5$, $v_6$ and $v_7$ are the fourth-level nodes of the network.
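As a simple illustration of this kind of node analysis, the sketch below builds an undirected adjacency structure from the three paths of network G listed in this section and counts the distinct neighbors of each node. It is not the statistic used in the paper: link directions and communication counts are ignored, the one-way link from $v_6$ to $v_5$ is treated as undirected, and the function names are hypothetical.

```python
from collections import defaultdict

# The three paths of network G as listed in the text.
PATHS = [
    ["v0", "v3", "v4", "v5", "v1"],
    ["v0", "v3", "v4", "v6", "v5"],
    ["v0", "v3", "v2", "v7"],
]


def build_adjacency(paths):
    """Undirected adjacency sets built from consecutive nodes on each path."""
    adjacency = defaultdict(set)
    for path in paths:
        for a, b in zip(path, path[1:]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def degrees(adjacency):
    """Number of distinct neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


degree_demo = degrees(build_adjacency(PATHS))
```

In this simplified view $v_3$, $v_4$ and $v_5$ each have three neighbors, while $v_0$, $v_1$ and $v_7$ have a single neighbor, which agrees with their roles as initial and terminal nodes.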

VII. CONCLUSION
As a medium carrying communication information, the spectrum signal has physical characteristics, and statistical laws of certain features, that also reflect the communication behavior of the communication individual and the intelligence related to the communication content. Therefore, in-depth research and analysis of massive spectrum data is of great significance. In order to avoid the difficulty and cost of cracking the signal content, this paper mines and analyzes the communication behavior of communication individuals from the physical characteristics of the spectrum monitoring signal and the statistical laws of these characteristics. This paper first discusses the characteristics of spectrum monitoring data, and then obtains the frequency, signal power, signal bandwidth, signal monitoring time, signal direction and other characteristics of the signal from the spectrum monitoring data to uniquely identify the spectrum signal. Then we study the distribution characteristics and statistical laws of the spectrum signals in the cylindrical coordinate system composed of signal power, signal direction and signal monitoring time. Through the spectrum monitoring signal mining method proposed in this paper, the communication relationships and communication network between communication individuals are extracted from the spectrum monitoring data. The experimental results show that the method adapts well to massive spectrum monitoring signals, can mine the communication relationships between source nodes from the spectrum monitoring data, and can infer the communication network structure. Finally, we study individual communication behavior through the statistical analysis of the network structure and the communication volume of the nodes.
The research of this paper realizes the mining of the communication relationship between the sources and the communication network structure from the spectrum monitoring data. Through the analysis of network connectivity, communication direction, communication times, communication order and other characteristics, we obtain the hierarchical structure of the network, the hierarchical position of different communication individuals in the network, and realize the analysis of communication behavior of communication individuals in the monitoring area. The research method proposed in this paper provides a new method for the analysis of spectrum monitoring data, and it can realize the acquisition of hidden intelligence in military communication, investigation and other related fields, and has practical application value.
XINRONG WU received the M.S. degree from the Communication and Engineering Institute, in 1996. She is currently with the Institute of Communications Engineering, Army Engineering University of PLA. She has authored more than 25 scientific articles. Her current research interests include communication networks and network security.
LEI ZHU received the Ph.D. degree from the College of Communications Engineering, PLA University of Science and Technology, China, in 2002. He is currently a Professor of system engineering. He has authored more than 35 scientific articles. His research interests include network planning and system simulation.
LEI WANG received the Ph.D. degree in military operational research from the PLA University of Science and Technology, China, in 2014. He holds a Postdoctoral position in communication and information system and is an Engineer of optimization and system engineering with the College of Communications Engineering, Army Engineering University of PLA, China. He has authored more than 25 scientific articles. His research interests include knowledge engineering, data mining, artificial intelligence, and network planning.
HAOREN FAN received the B.S. degree in computer science and technology from Shihezi University, in 2017. He is currently a Graduate Student with the Army Engineering University of PLA. His research interest includes deep reinforcement learning.
TING PAN received the B.S. degree in information and computing science from the Wuhan University of Science and Technology, in 2017. She is currently a Graduate Student with the Army Engineering University of PLA. Her research interests include data mining and machine learning.