Prufer Coding: A Vectorization Method for Undirected Labeled Graph

The Prufer algorithm is a powerful method for topology vectorization, but the traditional Prufer algorithm can only encode a rootless labeled tree, and no prior work has studied how to apply it to graph vectorization. This paper proposes a vectorization method for undirected labeled graphs based on the Prufer algorithm, including graph encoding and decoding algorithms. Preliminary experiments revealed a particular case that reduces the accuracy of the coding algorithm (when the graph exceeds 150 nodes, the accuracy reaches only about 60%), so a connectivity check mechanism based on the Warshall algorithm is proposed and added to the coding algorithm. Extensive experiments show that the accuracy of the coding algorithm reaches 100% after this mechanism is introduced. The length of the vector generated by the coding algorithm is then analyzed, and the results show that graph vectorization can improve the efficiency of some topology calculations. Finally, the defects of the algorithm are discussed. The most significant defect is that the length of the vector generated by the encoding algorithm is not fixed, which prevents it from being applied to more topological calculations.


I. INTRODUCTION
As the number of nodes in a system grows explosively, the complexity of the connecting network increases. Graph vectorization is introduced to simplify the graph representation and thereby simplify graph topology calculation.
Graph vectorization refers to representing the topology information of a graph as a vector through a graph transformation algorithm. This vector is generally one-dimensional. A graph is a mathematical abstraction that is useful for solving many kinds of problems. If only the connectivity problem needs to be solved, it suffices to discuss the undirected graph model of the system. An undirected graph is a graph in which each edge symbolizes an unordered, transitive relationship between two nodes. Such edges are rendered as straight lines or arcs [1]. A vectorized description of the graph simplifies the process of solving problems related to its connectivity.
Featherstone [2] proposed to use a parent array of undirected graphs to describe the connectivity of the bodies. It uses another two arrays to represent the set of children of related bodies and the set of joints on the path between the related bodies and root. Yazar [3] introduced a one-dimensional vector Pgraph, which is used to describe the connection between linear graph theory bodies and branches.
The associate editor coordinating the review of this manuscript and approving it for publication was Haipeng Yao.
The traditional Prufer algorithm is a method of coding and decoding labeled trees, which can be used to vectorize unrooted trees. It was first proposed by Heinz Prufer in 1918 when he proved Cayley's theorem.
The Prufer sequence is not very widely used, but it can be applied on some special occasions, such as being integrated into the design of the Genetic Algorithm (GA) [4]-[6] and used to solve the Minimum Spanning Tree (MST) problem [7]. Reference [8] proposes a new XML schema matching framework based on Prufer encoding to improve the performance of identifying and discovering complex matches. Reference [9] uses the Prufer code to define a martingale of the tree and establish a concentration result for a specific family of functions over random trees with given degrees. Reference [10] designed a BMEP polytope iterative enumeration algorithm based on the Prufer coding method for rootless labeled trees, combined with the multi-faceted combination algorithm of the Balanced Minimum Evolution Problem (BMEP). The chain structure of the objective supply chain [11] and the branched polymer can be represented as a tree structure. Reference [12] generates random trees with the same degree distribution by repeatedly modifying the Prufer code to create randomly branched polymers. Reference [13] used the Prufer algorithm to encode its proposed skeleton graph model and check the isomorphism of the skeleton graph based on the Prufer sequence, but the skeleton graph proposed in reference [13] is actually a tree structure.
Compared with the tree structure, the graph structure (or the mesh structure) has a broader range of application scenarios. Can the random tree generation method based on the Prufer algorithm proposed in reference [12] be extended to the generation of random graphs? Can the graph isomorphism detection method based on the Prufer algorithm in reference [13] be accurately applied to graph structures? Although scholars have proposed many improvements to the traditional Prufer algorithm, no one has considered applying it to graph vectorization. To realize this, our contributions can be summarized as follows:
a) We propose a method for undirected labeled graph vectorization based on the Prufer algorithm, including graph encoding and decoding algorithms.
b) We propose a method to check the connectivity of the graph based on the Warshall algorithm and introduce an improved approach, then apply it to increase the accuracy of the Prufer algorithm in coding and decoding undirected labeled graphs.
Finally, the application prospects of graph vectorization in topology calculation are analyzed. The process block diagram of the topology vectorization is shown in Figure 1.

II. REVIEW OF THE PRUFER ALGORITHM
A. PRUFER CODING OF ROOTLESS LABELED TREE
Prufer sequence encoding refers to converting a tree into a character string, and decoding refers to converting a character string into a tree [14].
First, we briefly introduce the Prufer coding process for rootless trees. Let T be a tree with n vertices; T is called a labeled tree if the n vertices are distinguished from one another by names such as v_1, v_2, ..., v_n [15]. Assume that the n vertices are simply labeled 1, 2, ..., n. Suppose that the leaf of T with the smallest label is a_1 and that its adjacent node is b_1. Trimming the point a_1 and the edge (a_1, b_1) leaves the remaining tree T_1, in which b_1 may become a leaf. Then search for the leaf with the smallest label in T_1, denoted a_2, whose adjacent node is b_2, and trim a_2 and the edge (a_2, b_2) from T_1. Continue this step n-2 times until only one edge is left. The tree T can then be expressed as the sequence b_1, b_2, ..., b_{n−2}, which is called the Prufer sequence, and this process is called the Prufer coding algorithm.
The coding steps are summarized as follows [16]:
step_1: Cut the leaf nodes and their edges in order from small to large according to vertex labels.
step_2: Record the number of the node that was connected to the leaf node by the trimmed edge.
step_3: Repeat step_1 and step_2 until only two nodes and the edge between them are left in the tree; the algorithm then ends.
The following concrete example illustrates the rootless tree coding and decoding method of the Prufer sequence. The constructed rootless tree is shown in Figure 2. First, according to coding step_1, node 2, which has the smallest label among the leaf nodes, and the edge (2, 3) are cut out to generate a new tree.
According to coding step_2, record the node number 3 adjacent to node 2, so the current Prufer sequence is 3.
Then, according to coding step_3, the above process is repeated until one edge remains. The entire process is shown in Figure 3.
The prufer sequence changes as follows.
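The progression can be reproduced with a minimal Python sketch of the three coding steps. The function name `prufer_encode_tree` and the edge list `TREE_EDGES` are illustrative assumptions: the edge list is assembled from the edges named in the text, (2,3) together with (3,1), (4,5), (6,5), (5,1), (1,7), and is assumed to match Figure 2.

```python
def prufer_encode_tree(edges):
    """Encode a labeled tree (list of undirected edges) as a Prufer sequence."""
    # Build an adjacency map: node -> set of neighbours.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    sequence = []
    history = []  # snapshot of the sequence after every trim
    while len(adj) > 2:
        # step_1: pick the leaf (degree 1) with the smallest label.
        leaf = min(node for node, nbrs in adj.items() if len(nbrs) == 1)
        neighbour = next(iter(adj[leaf]))
        # step_2: record the node adjacent to the trimmed leaf.
        sequence.append(neighbour)
        history.append(list(sequence))
        # Trim the leaf and its edge.
        adj[neighbour].discard(leaf)
        del adj[leaf]
    return sequence, history


# Assumed edge list, assembled from the edges named in the text.
TREE_EDGES = [(2, 3), (3, 1), (4, 5), (6, 5), (5, 1), (1, 7)]

if __name__ == "__main__":
    seq, steps = prufer_encode_tree(TREE_EDGES)
    for step in steps:
        print(step)
```

Under these assumptions the sequence grows one entry per trim and ends with n − 2 = 5 entries for the 7-node tree. The leaf search here is a plain scan, so the sketch runs in O(n^2) rather than the linear time of the optimized algorithms.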

B. PRUFER DECODING OF ROOTLESS LABELED TREE
Provide two sequences 1, 2, ..., n and b_1, b_2, ..., b_{n−2}, which are the sequential sequence (SeqtSeq) and the Prufer sequence (PruferSeq), respectively. The tree T can be decoded from b_1, b_2, ..., b_{n−2}. In the above coding process, since a_1 has already been cut from the tree T when b_1 is recorded, a_1 will not appear in b_2, ..., b_{n−2}, so find the first number in SeqtSeq that does not appear in PruferSeq. This number is a_1. Rebuild the edge (a_1, b_1), then eliminate a_1 from SeqtSeq and b_1 from PruferSeq. Continue the above steps n − 2 times until PruferSeq becomes empty. At this time, SeqtSeq will have two numbers a_k, a_j left, and the edge (a_k, a_j) is the last edge of the tree T. Decoding steps are as follows:
step_1: Construct the SeqtSeq according to the node numbers of the tree. Find the number that is not in PruferSeq and is located on the leftmost side of the SeqtSeq.
According to decoding step_3, the above process is repeated until SeqtSeq has only two numbers left, [1,7]; finally, the edge (1,7) is rebuilt. The changes of SeqtSeq and PruferSeq are shown below.
Edges (3,1), (4,5), (6,5), (5,1), (1,7) are decoded in order, as shown in Figure 5. Finally, we get a tree that is the same as the rootless tree in Figure 2. The Prufer coding and decoding algorithm processes are shown in Figure 6.
Scholars have optimized the Prufer encoding and decoding algorithms and have proposed many improved algorithms. References [16] and [17] propose linear-time algorithms. By exploiting the particular properties of the integer values to be sorted, they reduce the Prufer encoding and decoding problems to integer sorting problems, which improves the efficiency of rootless tree Prufer coding and decoding. Reference [18] uses simple arrays to improve the Prufer algorithm, reducing the time complexity of Prufer coding to O(n). Reference [14] studied a decoding algorithm that scans the Prufer sequence in reverse order and proved that the algorithm runs in linear time without the need for additional data structures or sorting.
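The basic decoding procedure described at the start of this subsection (not the linear-time variants of the references) can be sketched as follows. The function name `prufer_decode_tree` is an illustrative assumption, and the input is assumed to be a valid Prufer sequence of length n − 2 over the labels 1..n.

```python
def prufer_decode_tree(prufer_seq, n):
    """Rebuild the edges of a labeled tree on nodes 1..n from its Prufer sequence."""
    seqt_seq = list(range(1, n + 1))   # SeqtSeq: the sequential sequence
    prufer = list(prufer_seq)          # PruferSeq: working copy
    edges = []
    while prufer:
        # Leftmost number of SeqtSeq that does not appear in PruferSeq.
        a = next(x for x in seqt_seq if x not in prufer)
        b = prufer.pop(0)
        edges.append((a, b))           # rebuild the edge (a, b)
        seqt_seq.remove(a)
    # Two numbers remain in SeqtSeq; they form the last edge.
    edges.append((seqt_seq[0], seqt_seq[1]))
    return edges


if __name__ == "__main__":
    # Sequence obtained by applying the coding steps to the 7-node example.
    print(prufer_decode_tree([3, 1, 5, 5, 1], 7))
```

For the 7-node example the last two numbers left in SeqtSeq are 1 and 7, in agreement with the text. Each membership test scans PruferSeq, so this direct version is quadratic.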

III. PRUFER ALGORITHM FOR UNDIRECTED LABELED GRAPH
Traditional Prufer algorithms can be used to encode and decode a rootless labeled tree. However, graphs are more widely used than rootless trees to solve network problems, so it is necessary to design a method for coding and decoding labeled graphs.
Trees and graphs are both non-linear data structures, but a graph is more abstract and complex than a tree. Unlike a tree, a graph can contain a structure called a cycle, and graph coding needs to focus on solving the coding problem of the cycle. To emphasize the structural nature of the graph, we only discuss the coding and decoding of undirected simple graphs, which contain no parallel edges or self-loops. Figure 7 shows a simple undirected labeled graph G with a cycle structure and a leaf node. Coding graph G exposes a problem: if the Prufer coding method of the rootless tree is used to code graph G, node 4 and edge (1,4) are trimmed first. The remaining nodes 1, 2, and 3 then form a cycle, in which there are no more leaf nodes. In this situation, which node and which edge should be trimmed next? To successfully code the undirected labeled graph, we need to find a suitable way to solve this problem.

A. PRUFER CODING OF UNDIRECTED LABELED GRAPH
First of all, the single-node cropping rule is specified: each cropping step trims the node and all edges connected to it. In the Prufer coding algorithm of the rootless tree, the clipped node is always a leaf node, so there is only one adjacent node to record each time. In a graph, the clipped node may have several adjacent nodes. Therefore, the single-node recording rule is specified: record all adjacent nodes of the clipped node in order, then record the clipped node itself at the end.
Suppose that the n vertices of the undirected labeled graph G are denoted as a_1, a_2, ..., a_n. The coding steps are designed as follows:
step_1: If the current undirected labeled graph has leaf nodes, cut out the leaf node with the smallest label, denoted a_i, as well as the edge (a_i, b_i) formed with its adjacent node b_i. If there is no leaf node left, the node with the smallest sequence number among the remaining nodes, again denoted a_i, is trimmed.
step_2: If the clipped node is a leaf node, only its adjacent node b_i is recorded. If the clipped node is not a leaf node and its degree is j (j ≥ 2), all nodes b_1, b_2, ..., b_j that are connected to a_i through the edges (a_i, b_1), (a_i, b_2), ..., (a_i, b_j) are recorded: assuming b_1 < b_2 < ... < b_j, record them in order from small to large and add the clipped node a_i at the end to generate the sequence [b_1, b_2, ..., b_j, a_i].
step_3: When each trim is complete, a new undirected labeled graph is generated. Repeat step_1 and step_2 until the undirected labeled graph has only two nodes left, at which point the algorithm terminates.
Taking graph G (Figure 7) as an example, its coding process is shown in Figure 8.
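The three coding steps can be sketched in Python as below. This is a minimal sketch without any connectivity check; the function name `prufer_encode_graph` is an illustrative assumption, and the edge list `GRAPH_G` is reconstructed from the textual description of graph G (leaf node 4 attached by edge (1,4), with nodes 1, 2, 3 forming a cycle), so it is assumed rather than copied from Figure 7.

```python
def prufer_encode_graph(edges):
    """Encode a connected simple undirected labeled graph (no connectivity check)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    sequence = []
    while len(adj) > 2:
        leaves = [node for node, nbrs in adj.items() if len(nbrs) == 1]
        if leaves:
            # Leaf case: trim the smallest leaf and record only its neighbour.
            node = min(leaves)
            sequence.extend(adj[node])
        else:
            # No leaf left: trim the smallest remaining node, record its
            # neighbours in ascending order, then the node itself.
            node = min(adj)
            sequence.extend(sorted(adj[node]))
            sequence.append(node)
        # Trim the node together with every edge attached to it.
        for nbr in adj[node]:
            adj[nbr].discard(node)
        del adj[node]
    return sequence


# Assumed edge list for graph G, reconstructed from the text.
GRAPH_G = [(1, 2), (2, 3), (1, 3), (1, 4)]

if __name__ == "__main__":
    print(prufer_encode_graph(GRAPH_G))
```

For `GRAPH_G` the leaf 4 is trimmed first and contributes the single entry 1; the triangle then has no leaf, so node 1 is trimmed and contributes the group 2, 3, 1. Using sets for adjacency keeps the trimming step simple; the sorted call enforces the small-to-large recording order of step_2.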

B. PRUFER DECODING OF UNDIRECTED LABELED GRAPH
The decoding of an undirected labeled graph is the reverse of encoding, so every detail of the coding process must be restored correctly. Due to the complexity of the coding process, several problems should be considered. First, how should the cycle structure be rebuilt? The cycle structure is what distinguishes a graph from a tree. In the coding algorithm, we recorded all the adjacent nodes of the clipped node and recorded that node at the end. Therefore, in the decoding process, we only need to locate that node in the PruferSeq and then rebuild the edges between it and all the numbers in front of it. Second, a leaf node can be rebuilt in the same way as in the decoding algorithm of the rootless labeled tree: find the leftmost number in SeqtSeq that does not appear in PruferSeq, and link it to the leftmost number of PruferSeq to rebuild that edge.
Decoding steps can be designed as follows.
step_1. Find the number a that is located on the leftmost side of SeqtSeq but does not exist in the PruferSeq.
step_7. Repeat the above process until there are only two numbers left in SeqtSeq. Connect the remaining two numbers to rebuild the final edge, and the algorithm ends.
Taking the undirected labeled graph G as an example, we have obtained its PruferSeq above. According to the PruferSeq, the decoding process is shown in Figure 9.
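The sketch below follows only the two rules stated in the prose of this subsection (leaf reconstruction and neighbour-group reconstruction); any detail beyond them is an assumption. In particular, it assumes that when no number of SeqtSeq is absent from PruferSeq, the node that was trimmed is the smallest remaining one, and that no exchange table is involved. The function name `prufer_decode_graph` is hypothetical.

```python
def prufer_decode_graph(prufer_seq, n):
    """Rebuild the edge list of a graph on nodes 1..n from a sequence produced
    by the graph coding rules (assumed reading, no exchange table)."""
    seqt_seq = list(range(1, n + 1))
    prufer = list(prufer_seq)
    edges = []
    while len(seqt_seq) > 2:
        absent = [x for x in seqt_seq if x not in prufer]
        if absent:
            # Leaf case: link the leftmost absent number to the head of PruferSeq.
            node = absent[0]
            edges.append((node, prufer.pop(0)))
        else:
            # Non-leaf case (assumption): the smallest remaining node was trimmed;
            # every number in front of its first occurrence is a neighbour.
            node = seqt_seq[0]
            pos = prufer.index(node)
            edges.extend((node, nbr) for nbr in prufer[:pos])
            del prufer[:pos + 1]
        seqt_seq.remove(node)
    # Connect the remaining two numbers to rebuild the final edge.
    edges.append((seqt_seq[0], seqt_seq[1]))
    return edges


if __name__ == "__main__":
    print(prufer_decode_graph([1, 2, 3, 1], 4))
```

The input `[1, 2, 3, 1]` is the sequence that the coding rules yield for a 4-node graph consisting of a triangle 1-2-3 with leaf 4 attached to node 1. The sketch is only meaningful for sequences whose encoding never split the graph, which is the limitation analyzed later in this section.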
The Prufer coding and decoding flow for the undirected labeled graph is shown in Figure 10. To facilitate experimental verification, the pseudocode of the algorithm is designed as follows:

Input: undirected labeled graph G
Output: Prufer coding sequence PruferSeq.

To describe the algorithm execution process clearly, we take an undirected labeled graph composed of 6 nodes as an example to show the entire process of its coding and decoding, as shown in Figure 11.
The basic operation of the coding algorithm is to determine whether a node is a leaf node; its time complexity is O(n). The outer while loop runs over the remaining nodes, so its time complexity is also O(n), and the time complexity of the coding algorithm is therefore O(n^2). The basic operation of the decoding algorithm is to find the number that is leftmost in SeqtSeq but not in PruferSeq; its time complexity is O(n^2). Taking the outer while loop into account, the time complexity of the decoding algorithm is O(n^3).
The optimal time complexity of the basic operation can reach O(n log n). Meanwhile, the time complexity of the outer while loop can be reduced to O(log n) by selecting an appropriate data storage structure [19], so the optimal time complexities of the coding and decoding algorithms are O(n log n) and O(n log^2 n), respectively.
For undirected labeled graphs of different sizes, a large number of experiments were carried out. The accuracy rate of the codec did not reach 100%, as shown in Table 1.
Analyzing the algorithm execution process to find the causes of the errors, we found that the original graph is divided into two or more graphs in some particular cases. In this situation, the algorithm execution result is wrong, as shown in Figure 12.
In such a situation, the node with the smallest label happens to be the bridge node connecting two subgraphs, and there is currently no leaf node. If we trim such a node, the original graph will be divided into two graphs. To solve this problem, an additional trimming condition is needed.
Such a problem cannot occur when cutting leaf nodes, so it is only necessary to detect whether the current undirected graph will be decomposed into multiple graphs in the second case (cutting non-leaf nodes).
The Warshall algorithm uses the idea of dynamic programming to find transitive closures, which can be used to judge the connectivity of a graph [20]. If we only need to judge the connectivity of an undirected graph, a vector can be introduced to record the reachability from a single node. We know that the n-th power of the adjacency matrix gives the number of paths by which each node can reach another node (including itself) in n hops, so the connectivity detection algorithm can be designed as follows:
If CheckLine becomes an all-one array, the node that we marked can reach every other node; that is, the undirected graph is connected. In the particular case considered here, there is no leaf node at present, so a cycle appears, which accelerates the check. Only in the worst case does the outer loop need to be performed n-2 times. The inner layer checks CheckLine. If we mark the values that have already been changed to 1, so that each round only needs to check the values that were still 0 in the previous round, the time complexity can be reduced to O(log n), so the time complexity of this check algorithm is O(n log n).
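One way to realize a single-node reachability vector is sketched below. It is a breadth-first propagation of the CheckLine marks, named plainly as such; it is not the Warshall transitive-closure computation itself, and the function names `is_connected` and `stays_connected_without` as well as the dictionary-of-sets graph representation are illustrative assumptions.

```python
def is_connected(adj):
    """Return True if the undirected graph {node: set(neighbours)} is connected."""
    nodes = list(adj)
    if not nodes:
        return True
    # check_line[v] == 1 means v is known to be reachable from the marked node.
    check_line = {v: 0 for v in nodes}
    check_line[nodes[0]] = 1
    frontier = [nodes[0]]              # nodes set to 1 in the previous round
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if check_line[v] == 0:  # only entries that are still 0 can change
                    check_line[v] = 1
                    next_frontier.append(v)
        frontier = next_frontier
    # Connected exactly when CheckLine has become an all-one array.
    return all(check_line.values())


def stays_connected_without(adj, node):
    """Check whether the graph remains connected after trimming `node`."""
    reduced = {u: nbrs - {node} for u, nbrs in adj.items() if u != node}
    return is_connected(reduced)


if __name__ == "__main__":
    # Two triangles sharing node 1: trimming node 1 splits the graph.
    bow_tie = {1: {2, 3, 4, 5}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {1, 4}}
    print(stays_connected_without(bow_tie, 1), stays_connected_without(bow_tie, 2))
```

The `bow_tie` graph is a hypothetical instance of the bridge-node situation: there is no leaf, and the smallest label sits on the node that joins the two subgraphs.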
Therefore, the algorithm needs the following improvement: if trimming the current non-leaf node would divide the original graph into multiple graphs, mark and skip this node until a node that does not decompose the original graph is found, exchange it with the smallest marked node, and record this exchange so that it can be recovered when decoding. We introduce a table structure for recording these exchanges and return it at the end of the Prufer coding algorithm; it also serves as an input of the Prufer decoding algorithm to help restore the exchanges.
The improved part can be described as follows:
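As a hedged illustration of the idea (not the authors' pseudocode), the following sketch adds a connectivity test to the non-leaf branch of the coder. The function names, the depth-first reachability helper, and in particular the format of the exchange table, a list of pairs (smallest skipped node, node trimmed instead), are assumptions made for this example. The input graph is assumed to be connected.

```python
def _connected(adj):
    """Reachability sweep from one node over {node: set(neighbours)}."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def prufer_encode_graph_checked(edges):
    """Encode a connected simple graph; return (PruferSeq, exchange table)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    sequence, exchanges = [], []
    while len(adj) > 2:
        leaves = [x for x, nbrs in adj.items() if len(nbrs) == 1]
        if leaves:
            node = min(leaves)
            sequence.extend(adj[node])
        else:
            # Skip candidates whose removal would split the graph.
            candidates = sorted(adj)
            node = next(
                x for x in candidates
                if _connected({u: nb - {x} for u, nb in adj.items() if u != x})
            )
            if node != candidates[0]:
                # Hypothetical table entry: (smallest skipped node, trimmed node).
                exchanges.append((candidates[0], node))
            sequence.extend(sorted(adj[node]))
            sequence.append(node)
        for nbr in adj[node]:
            adj[nbr].discard(node)
        del adj[node]
    return sequence, exchanges


if __name__ == "__main__":
    # Two triangles sharing node 1 (hypothetical bridge-node example).
    print(prufer_encode_graph_checked([(1, 2), (1, 3), (2, 3), (1, 4), (1, 5), (4, 5)]))
```

On the two-triangle example the smallest node 1 is skipped in the first step because trimming it would separate the triangles, and node 2 is trimmed instead. Testing every candidate with a full reachability sweep is the simplest choice here, not the cheapest one.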

IV. ALGORITHM APPLICATION
The algorithm proposed in this paper implements the vectorization of the undirected labeled graph and records its connectivity. Recording the two-dimensional adjacency matrix as a one-dimensional vector can sometimes greatly simplify some graph operations and improve the efficiency of solving graph problems, such as the graph isomorphism judgment problem, as shown in Figure 13.
Graph isomorphism is the most rigorous form of exact graph matching, in which the mapping must be a bijection in both directions [21]. Graph_1 and Graph_2 in Figure 13 are isomorphic because their adjacency matrices are exactly the same. The Prufer sequences obtained according to the algorithm proposed in this paper are also the same, [3,4,1,5,2], and they decode uniquely. If the graph isomorphism analysis is performed based on the adjacency matrix, it takes 10 comparison operations, whereas based on the Prufer sequence it requires only 5 comparison operations, so the efficiency is sharply improved.
When the adjacency matrix is used to store a simple undirected graph, the useful information is distributed in the upper triangle of that matrix, as shown in Figure 14. The length of the Prufer sequence is related to the connectivity of the undirected graph: since the cropping rules ensure that the sequence is always recorded from the cropped edges, the following conclusion can be drawn:
Therefore, when the undirected graph is sparse, the space occupied by the Prufer sequence to store useful information is always far less than n(n − 1)/2, which will be verified in the later experiment.
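As an illustrative check on the sparse-graph claim, the recording rules of Section III-A imply a simple count of the sequence length. This is a derivation under stated assumptions, not a formula quoted from the paper: the symbols m (number of edges) and k (number of trims applied to non-leaf nodes) are introduced here, and the graph is assumed to stay connected throughout coding so that exactly one edge remains at the end.

```latex
% n = number of nodes, m = number of edges,
% k = number of trims applied to non-leaf nodes, 0 <= k <= n - 2.
% A leaf trim removes 1 edge and records 1 entry;
% a non-leaf trim of degree j removes j edges and records j + 1 entries;
% the last edge between the two remaining nodes is not recorded.
\begin{align}
  |\mathrm{PruferSeq}| &= (m - 1) + k, \\
  m - 1 \;\le\; |\mathrm{PruferSeq}| &\le m + n - 3 .
\end{align}
```

For a tree, m = n − 1 and k = 0, which recovers the classical length n − 2. For a sparse graph with m much smaller than n(n − 1)/2, the upper bound m + n − 3 is also much smaller than n(n − 1)/2; for a complete graph, by contrast, every trim is a non-leaf trim and the bound exceeds n(n − 1)/2 once n > 3.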

V. EXPERIMENT
A. ALGORITHM ACCURACY VERIFICATION
To verify the effect of the improvement, more experiments were carried out.
To compare the accuracy of the original algorithm and the improved algorithm, we conducted 500 experiments for each algorithm on undirected labeled graphs with different numbers of nodes. Table 2 shows the accuracy comparison of the two algorithms for different values of Graph_Size. The experimental results show that the improved algorithm always achieves 100% coding and decoding accuracy.
Meanwhile, the experiments show that the accuracy of the original algorithm decreases as the number of nodes increases, because a larger graph increases the probability of the above special case. Although the improvement increases the time cost, it dramatically improves the accuracy. The accuracy of the original algorithm on large-scale graphs (between 100 and 300 nodes) is shown in Table 3.
Excluding the influence of random errors, the accuracy of the algorithm without the connectivity check mechanism falls below 60% when the size reaches about 250 nodes.
The main disadvantage of the algorithm is that, for different topological graph models, the lengths of the vectors generated by the coding algorithm are not the same. If the vectors had the same length, multiple vectors could be stacked into a full matrix (here, a full matrix is one that needs no padding with irrelevant information to keep it aligned). When the matrix method is used to solve topological calculation problems such as subgraph isomorphism, matrix operations can significantly improve efficiency.
Taking two random undirected labeled graphs with 100 nodes as an example, the vectors generated by the algorithm proposed in this paper are shown in Figure 15. Figure 16 shows two experiments in which the topological size ranges from 5 to 100 in steps of 5, with coding performed 50 times at each size to obtain the average length of the vector.
Through data fitting with a power function as the fitting model, the fitted function is approximately l = n^1.237, which grows much more slowly than n(n − 1)/2.
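To make the comparison concrete, the fitted function can be evaluated at the largest size used in the experiment, n = 100. The numbers below are plain arithmetic on the stated fit and hold only for graphs generated as in this experiment.

```latex
% Fitted vector length versus upper-triangle size at n = 100
\begin{align}
  l(100) &= 100^{1.237} = 10^{2.474} \approx 298, \\
  \left.\frac{n(n-1)}{2}\right|_{n=100} &= \frac{100 \cdot 99}{2} = 4950 .
\end{align}
```

At this size the fitted vector length is therefore about 6% of the number of entries in the upper triangle of the adjacency matrix.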

VI. CONCLUSION
This paper has discussed graph vectorization. Prufer coding and decoding algorithms for undirected labeled graphs were proposed, and the algorithm was analyzed and improved according to the experimental results. The final experimental results showed that the algorithm can correctly encode and decode undirected labeled graphs, and the analysis shows that its time complexity is acceptable. The algorithm has promising application scenarios, such as randomly generating graphs that meet certain conditions. Besides, the algorithm provides an idea for the vectorization of graphs, which can simplify some graph operations. However, the length of the vector generated by the algorithm cannot be determined in advance, so it cannot be readily applied to some specific graph analyses. Subsequent research will continue to explore the applicable range of the method.
LIN YANG received the B.Sc. degree in electronic engineering from the National University of Defense Technology, Hefei, China, where he is currently pursuing the M.Sc. degree in electromagnetic countermeasure. His research interests include cyberspace security, network security situational awareness, and artificial intelligence.
YONGJIE WANG received the M.Sc. and Ph.D. degrees from the National University of Defense Technology, Changsha, China. He is currently an Associate Professor of electronic engineering with the National University of Defense Technology, Hefei, China. His research interests include cyberspace security, risk assessment, and information system modeling and simulation. VOLUME 8, 2020