Identifying Key Nodes in Complex Networks Based on Global Structure

Quantitative identification of key nodes in complex networks is of great significance for studying the robustness and vulnerability of complex networks. Although various centralities have been proposed to solve this issue, each approach is limited by its own perspective on what makes a node “key”. In this paper, we propose a novel method to identify key nodes in complex networks based on global structure. Three aspects are considered, namely the shortest path length, the number of shortest paths and the number of non-shortest paths, and three corresponding influence matrices are established. Node efficiency, which reflects the contribution of one node to the information transmission of the entire network, is selected as the initial value of a node’s influence on other nodes, and the comprehensive influence matrix is then constructed to reflect the influence among nodes. The proposed method provides a new measure to identify key nodes in complex networks from the perspective of global network structure, and can obtain more accurate identification results. Four experiments based on the Susceptible–Infected (SI) model are conducted to evaluate the performance of the proposed method, and the results demonstrate its superiority.


I. INTRODUCTION
A complex network is an abstract representation of a real complex system, in which elements are abstracted as nodes and the relationships between elements are abstracted as edges between nodes. Complex networks have heterogeneous topology, so the nodes of a network cannot all have the same importance [1]. Therefore, identifying key nodes in complex networks by quantitative methods and exploiting the properties of key nodes is of great theoretical significance and has broad application prospects [2]. Examples include stopping the spread of rumors and viruses [3], [4], drug targeting and discovery in biomedicine [5], [6], guiding effective information to spread rapidly in the network [7], [8], and deliberate attacks on terrorist and drug networks [9].
With the development of complex networks, a variety of methods have been put forward from different perspectives, such as degree centrality (DC) [10], closeness centrality (CC) [11], betweenness centrality (BC) [12] and Katz centrality [13]. Degree centrality only considers the information of the target node and its neighbors, so it has low time complexity but also low accuracy. Closeness centrality and betweenness centrality are well-known global centralities with high accuracy but high time complexity, so they cannot be applied to large-scale networks. To achieve a balance between high accuracy and low time complexity, Chen et al. [14] came up with a semi-local centrality (LC) based on multi-level neighbor information to rank nodes. This method considers not only the degree of nodes but also the neighborhood information, but does not consider the topological connections among the neighbors. The centralities mentioned above ignore the effect of the location of nodes in the network. Kitsak et al. [15] argued that the influence of a node is related to its location in the network, and proposed K-shell (Ks) decomposition to identify key nodes from this new perspective. The outer nodes are stripped layer by layer, and the nodes in the inner layer are the key nodes in the network. However, this method assigns the same Ks value to a large number of nodes. A series of methods were then put forward to expand and improve K-shell decomposition. Bae and Kim [16] considered that the sum of the Ks values of nodes and their neighbors should be used to measure the importance of nodes. Liu et al. [17] jointly considered the Ks value of the target node and its distance to the node with the maximum Ks value in the network. Zeng and Zhang [18] proposed a mixed degree decomposition method, which takes the degree information of the removed nodes into account and obtains high accuracy in ranking influence. In the field of search engines, there are also methods such as PageRank [19], LeaderRank [20] and HITS [21]. In addition, information entropy has been utilized to rank the influence of nodes in complex networks [22]-[25]. Some researchers [26]-[29] proposed to combine indicators with multi-attribute ranking methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhan Bu.
In this paper, we turn our attention to the global network structure, considering the shortest path length, the number of shortest paths and the number of non-shortest paths. Three corresponding influence matrices (efficiency matrix, direct arrival paths matrix and indirect arrival paths matrix) are constructed and combined to form the comprehensive influence matrix. In addition, node efficiency is selected as the initial value of a node's influence on other nodes. Key nodes can be identified effectively by considering these three aspects together. To evaluate the feasibility and effectiveness of the proposed method, the Susceptible-Infected (SI) model [30] is utilized to simulate the spreading process in the network, and four experiments are conducted on four real datasets. The classical centralities DC, CC, BC, Ks and LC are compared with our method in four aspects, and the experimental results demonstrate the superiority of the proposed method.
The remainder of this paper is organized as follows. In Section II, we introduce some related work and classical centralities. Our method is proposed in Section III. In Section IV, we conduct four experiments to evaluate the performance of the proposed method based on four real datasets. Finally, Section V gives a conclusion.

II. RELATED WORK
Suppose that an undirected and unweighted network G(V, E) consists of n = |V| nodes and m = |E| edges, and the adjacency matrix $A = (a_{ij})_{n \times n}$ is used to describe the network, which is defined as
$$a_{ij} = \begin{cases} 1, & \text{if node } i \text{ and node } j \text{ are connected} \\ 0, & \text{otherwise} \end{cases}$$
Degree centrality [10] is the simplest centrality for identifying key nodes. The degree centrality of a node is defined as the number of its nearest neighbors. A high value of degree centrality indicates that a node is able to affect numerous neighbors directly. It is denoted as
$$DC(i) = \sum_{j=1}^{n} a_{ij}$$
Closeness centrality [11] uses the shortest paths between all pairs of nodes to determine the influence; it is a global centrality with high time complexity. It is defined as the reciprocal of the average shortest distance from a node to the others in the network,
$$CC(i) = \frac{n-1}{\sum_{j \neq i} d_{ij}}$$
wherein $d_{ij}$ represents the distance between node i and node j.
Betweenness centrality [12] regards a node as influential if a large number of shortest paths pass through it. It describes the influence of a node on information flow in the network.
$$BC(i) = \sum_{s \neq i \neq t} \frac{g_{st}(i)}{g_{st}}$$
wherein $g_{st}$ denotes the number of all the shortest paths between node s and node t, and $g_{st}(i)$ represents the number of those shortest paths passing through node i.
K-shell decomposition [15] considers the location of nodes in the network. In this approach, the outer nodes are stripped layer by layer, and the inner nodes have high influence. The decomposition process assigns a Ks value to each node. It can be regarded as a coarse-grained ranking method based on node degree.
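The layer-by-layer stripping used by K-shell decomposition can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; it assumes a simple graph (no self-loops) stored as a dict that maps each node to the set of its neighbors, and the function name `k_shell` and the example graph are hypothetical.

```python
def k_shell(adj):
    """Return a dict node -> Ks value for an undirected simple graph.

    adj: dict mapping each node to the set of its neighbours
         (assumed input format, no self-loops).
    """
    # Work on a copy so the caller's graph is left untouched.
    remaining = {u: set(vs) for u, vs in adj.items()}
    ks = {}
    k = 0
    while remaining:
        # Nodes whose current degree is at most k belong to shell k.
        peel = [u for u, vs in remaining.items() if len(vs) <= k]
        if not peel:
            k += 1  # nothing left to strip at this level, move one layer inward
            continue
        for u in peel:
            ks[u] = k
            # Removing u lowers the degree of its remaining neighbours.
            for v in remaining[u]:
                remaining[v].discard(u)
            del remaining[u]
    return ks


# Hypothetical example: a triangle (1, 2, 3) with a pendant node 4 attached to 3.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
shells = k_shell(example)
```

In this example the pendant node is stripped first and the three triangle nodes end up in the same inner shell, which also illustrates why many nodes can receive the same Ks value.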
Local centrality [14] achieves a trade-off between accuracy and time complexity. It takes the nearest and the next nearest neighbors into consideration.
$$LC(i) = \sum_{j \in \Gamma(i)} Q(j), \qquad Q(j) = \sum_{w \in \Gamma(j)} N(w)$$
wherein $\Gamma(i)$ represents the set of the nearest neighbors of node i and $N(w)$ denotes the total number of the nearest and the next nearest neighbors of node w.
Node efficiency is the average of the reciprocal distances between a node and the other nodes in the network; it represents the ability of a node to reach the other nodes in the network. For node i, it is defined as
$$\frac{1}{n-1} \sum_{j \neq i} \frac{1}{d_{ij}}$$
VOLUME 8, 2020
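As an illustration of the distance-based definitions in this section, the sketch below computes degree centrality, closeness centrality and node efficiency with a breadth-first search. It is a minimal example under stated assumptions: the graph is a dict of neighbor sets, closeness is only meaningful for a connected graph, node efficiency is averaged over the n - 1 other nodes, and all function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj, i):
    # Number of nearest neighbours of node i.
    return len(adj[i])


def closeness_centrality(adj, i):
    # Reciprocal of the average shortest distance (assumes a connected graph).
    dist = bfs_distances(adj, i)
    return (len(adj) - 1) / sum(d for v, d in dist.items() if v != i)


def node_efficiency(adj, i):
    # Average reciprocal distance; unreachable nodes simply contribute 0.
    dist = bfs_distances(adj, i)
    return sum(1 / d for v, d in dist.items() if v != i) / (len(adj) - 1)


# Hypothetical example: the path graph 1 - 2 - 3.
path_graph = {1: {2}, 2: {1, 3}, 3: {2}}
dc_mid = degree_centrality(path_graph, 2)
cc_end = closeness_centrality(path_graph, 1)
eff_end = node_efficiency(path_graph, 1)
```

The middle node of the path has the largest value under all three measures, while the end nodes are penalized by their distance of 2 to the opposite end.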

III. THE PROPOSED METHOD
We turn our attention to the global structure of the network, which can provide a more accurate approach to key node identification. Nodes in a network are not isolated, but are affected and restricted by other nodes. The relationship among nodes can be described by an influence matrix. From the perspective of the information transmission path, the influence among nodes is affected by two factors: the shortest path length and the number of shortest paths [31], [32]. It should be noted that a node can also transmit its importance through non-shortest paths, thus affecting the importance of the node it points to. Therefore, we comprehensively consider these three aspects (the shortest path length, the number of shortest paths and the number of non-shortest paths) and construct three corresponding influence matrices.

A. THE INFLUENCE MATRIX BASED ON EFFICIENCY
According to the theory of spatial autocorrelation [33], the influence between two nodes can be considered inversely proportional to the distance between them. The efficiency between two nodes can be defined as
$$e_{ij} = \frac{1}{d_{ij}}$$
When there is no path between a node pair, we define $d_{ij} = +\infty$, which makes it impossible to use $d_{ij}$ directly. Hence, we use $e_{ij}$ (the reciprocal of $d_{ij}$) to avoid this issue. If there is no path between node i and node j, $e_{ij} = 0$. If the two nodes are connected directly, $e_{ij} = 1$. Then the influence matrix based on efficiency (abbreviated as IME) is established from the pairwise efficiencies $e_{ij}$.
IME can reflect the influence between nodes from the perspective of the shortest path length.
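A minimal sketch of how the pairwise efficiencies could be assembled into a matrix is given below. It assumes the network is supplied as a 0/1 adjacency matrix (list of lists); the function names are hypothetical, and leaving the diagonal at 0 is an assumption of this sketch, not something stated in the text.

```python
from collections import deque

INF = float("inf")


def distance_matrix(A):
    """All-pairs hop distances for a 0/1 adjacency matrix A.

    Unreachable pairs keep the value float('inf').
    """
    n = len(A)
    D = [[INF] * n for _ in range(n)]
    for s in range(n):
        D[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if A[u][v] and D[s][v] == INF:
                    D[s][v] = D[s][u] + 1
                    queue.append(v)
    return D


def efficiency_matrix(A):
    """Matrix of e_ij = 1 / d_ij, with e_ij = 0 when no path exists."""
    D = distance_matrix(A)
    n = len(A)
    # 1 / inf evaluates to 0.0, which matches e_ij = 0 for disconnected pairs.
    # Assumption of this sketch: diagonal entries are set to 0.
    return [[0.0 if i == j else 1 / D[i][j] for j in range(n)] for i in range(n)]


# Hypothetical example: path 0 - 1 - 2 plus an isolated node 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
IME = efficiency_matrix(A)
```

Directly connected nodes get efficiency 1, nodes two hops apart get 0.5, and the isolated node gets 0 with respect to every other node.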

B. THE INFLUENCE MATRIX BASED ON THE NUMBER OF SHORTEST PATHS
In this part, we establish the influence matrix based on the number of shortest paths. The influence between two nodes is also affected by the number of shortest paths. Taking FIGURE 1 as a simple example, we focus on the influence of node 3 and node 10 on node 6. The shortest path length between node 3 and node 6 is 2, which is the same as that between node 10 and node 6. However, the number of shortest paths between node 3 and node 6 is 1 (3-5-6), while the number of shortest paths between node 10 and node 6 is 3 (10-7-6, 10-8-6 and 10-9-6). Hence the influence of node 10 on node 6 is greater than that of node 3 on node 6.
The number of shortest paths between two nodes can be calculated from the adjacency matrix A. For any positive integer k (k ≥ 2), the element $(A^k)_{ij}$ of the k-th power of A can be calculated as
$$(A^k)_{ij} = \sum_{r=1}^{n} (A^{k-1})_{ir} \, a_{rj}$$
The element $(A^k)_{ij}$ represents the total number of paths with length k between node i and node j (counting routes that may revisit nodes). Therefore, the number of shortest paths between node i and node j, that is, the number of paths with length $d_{ij}$ between them, is $(A^{d_{ij}})_{ij}$. Then the influence matrix based on the number of shortest paths (abbreviated as IMS) can be established as (11), whose element for nodes i and j is $(A^{d_{ij}})_{ij}$. IMS reflects the influence between nodes from the perspective of the number of shortest paths. Essentially, an element of IMS represents the ability of nodes to transmit information along the shortest path.
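The shortest-path count read off from powers of the adjacency matrix can be sketched as follows. This is a minimal pure-Python illustration for small networks; the function names and the five-node example graph are hypothetical (the example mirrors a situation with three parallel two-hop routes, it is not the network of FIGURE 1), and entries for identical or disconnected node pairs are left at 0 as an assumption.

```python
from collections import deque

INF = float("inf")


def distance_matrix(A):
    """All-pairs hop distances for a 0/1 adjacency matrix A."""
    n = len(A)
    D = [[INF] * n for _ in range(n)]
    for s in range(n):
        D[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if A[u][v] and D[s][v] == INF:
                    D[s][v] = D[s][u] + 1
                    queue.append(v)
    return D


def matmul(X, Y):
    """Plain square matrix product."""
    n = len(X)
    return [[sum(X[i][r] * Y[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]


def shortest_path_counts(A):
    """counts[i][j] = (A^d_ij)_ij, the number of shortest paths from i to j."""
    n = len(A)
    D = distance_matrix(A)
    finite = [D[i][j] for i in range(n) for j in range(n)
              if i != j and D[i][j] != INF]
    counts = [[0] * n for _ in range(n)]
    power = [row[:] for row in A]  # holds A^k, starting with k = 1
    for k in range(1, max(finite, default=0) + 1):
        if k > 1:
            power = matmul(power, A)  # A^k = A^(k-1) A
        for i in range(n):
            for j in range(n):
                if i != j and D[i][j] == k:
                    counts[i][j] = power[i][j]
    return counts


# Hypothetical graph: node 4 reaches node 0 through nodes 1, 2 and 3.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
counts = shortest_path_counts(A)
```

Each power of A is read only at the entries whose distance equals the current exponent, so longer routes between close nodes never leak into the shortest-path count.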

C. THE INFLUENCE MATRIX BASED ON THE NUMBER OF NON-SHORTEST PATHS
Sometimes, information is not transmitted along the shortest path, so the influence of the number of non-shortest paths also needs to be considered. For example, we focus on the influence of node 3 and node 6 on node 5 in FIGURE 1. The shortest path length, as well as the number of shortest paths, between node 3 and node 5 is the same as that between node 6 and node 5. Next, we consider the number of non-shortest paths. When the path length is 2, there is no path between node 3 and node 5, or between node 6 and node 5. When the path length is 3, the number of paths between node 3 and node 5 is the same as that between node 6 and node 5. When the path length is 4, no path exists between node 3 and node 5, but there are four paths between node 6 and node 5 (6-7-8-6-5, 6-9-8-6-5, 6-8-7-6-5 and 6-8-9-6-5). We can find that the influence of node 6 on node 5 is greater than the influence of node 3 on node 5. Therefore, we consider the influence of the total number of non-shortest paths, which is defined as
$$\sum_{k = d_{ij}+1}^{d_{ij}+d_{ave}} (A^k)_{ij}$$
wherein $d_{ave}$ is the average path length of the network. The path length between two nodes cannot increase indefinitely, and for the sake of calculation we set the maximum path length as $d_{ij} + d_{ave}$. Then the influence matrix based on the number of non-shortest paths (abbreviated as IMN) can be established as (13), whose element for nodes i and j is this total. IMN reflects the influence between nodes from the perspective of the number of non-shortest paths. Essentially, an element of IMN represents the ability of nodes to transmit information along non-shortest paths.
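Counting the non-shortest paths between two nodes, for lengths above the shortest distance and up to the stated maximum, can be sketched as below. This is an illustration under assumptions: the integer parameter `extra` stands in for the average path length (how a fractional average is rounded is not specified in the text), routes may revisit nodes as in the 6-7-8-6-5 example, and the function names and the example graph are hypothetical.

```python
from collections import deque

INF = float("inf")


def distance_matrix(A):
    """All-pairs hop distances for a 0/1 adjacency matrix A."""
    n = len(A)
    D = [[INF] * n for _ in range(n)]
    for s in range(n):
        D[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if A[u][v] and D[s][v] == INF:
                    D[s][v] = D[s][u] + 1
                    queue.append(v)
    return D


def matmul(X, Y):
    """Plain square matrix product."""
    n = len(X)
    return [[sum(X[i][r] * Y[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]


def non_shortest_counts(A, extra):
    """N[i][j] = sum of (A^k)_ij for k = d_ij + 1, ..., d_ij + extra.

    extra: integer stand-in for the average path length (assumption).
    """
    n = len(A)
    D = distance_matrix(A)
    finite = [D[i][j] for i in range(n) for j in range(n)
              if i != j and D[i][j] != INF]
    max_len = max(finite, default=0) + extra
    powers = [None, [row[:] for row in A]]  # powers[k] holds A^k
    for _ in range(2, max_len + 1):
        powers.append(matmul(powers[-1], A))
    N = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = D[i][j]
            if i != j and d != INF:
                N[i][j] = sum(powers[k][i][j]
                              for k in range(d + 1, d + extra + 1))
    return N


# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
N = non_shortest_counts(A, extra=2)
```

For the pendant node 3 and its neighbor 2, there is no route of length 2 but three routes of length 3 (3-2-0-2, 3-2-1-2 and 3-2-3-2), so the entry is 3.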

D. THE COMPREHENSIVE INFLUENCE MATRIX
From the above analysis, we can see that the influence between nodes is affected by the shortest path length, the number of shortest paths and the number of non-shortest paths. Therefore, we combine these three aspects into a single influence matrix IM.
The element $IM_{ij}$ in IM represents the influence of node i on node j. Since node efficiency reflects the contribution of one node to the information transmission of the entire network, node efficiency is selected as the initial value of a node's influence on other nodes, and the comprehensive influence matrix is established accordingly.
The influence of node j is then obtained from the comprehensive influence matrix, and this influence is finally normalized.

IV. EXPERIMENTAL ANALYSIS

A. DATASETS
To evaluate the performance of the proposed method, we choose four datasets of varying sizes as the basis for experimental simulation and analysis. (i) Karate club [34]: a well-known social network of 34 members of a karate club at a US university, available online at "http://www-personal.umich.edu/~mejn/netdata/". (ii) Jazz musicians [35]: a social network among jazz musicians, in which each node represents a jazz musician and each edge indicates that two musicians have collaborated.

B. SI MODEL
We adopt the SI model [30] to simulate the spreading process in the network. Nodes in the SI model have two states: susceptible and infected. Nodes in the susceptible state are infected by nodes in the infected state with a certain probability. Once a node is infected, it never recovers. Suppose that node i is the source node and that the number of infected nodes reaches $n_{it}$ after t (t = 1, 2, ...) time steps. Then the spreading ability of node i, denoted as $I_i(t) = n_{it}/n$, is defined as the ratio of infected nodes to the total number of nodes. The maximum time step is set as t = 50 and the spreading process is repeated over 1000 Monte Carlo simulations. The higher the value of $I_i(t)$, the greater the influence of node i.
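The spreading ability described above can be estimated with a short Monte Carlo simulation. The sketch below is one possible reading of the SI process, not necessarily the exact variant used in the experiments: it assumes that in every time step each infected node independently infects each susceptible neighbor with probability `beta`, and the function names, the seed handling and the example graph are hypothetical.

```python
import random


def si_spread(adj, source, beta, steps, rng):
    """One SI run; returns the infected fraction after each of the time steps."""
    infected = {source}
    n = len(adj)
    curve = []
    for _ in range(steps):
        newly = set()
        for u in infected:
            for v in adj[u]:
                # Assumed rule: each infected-susceptible contact succeeds
                # independently with probability beta.
                if v not in infected and rng.random() < beta:
                    newly.add(v)
        infected |= newly  # infected nodes never recover
        curve.append(len(infected) / n)
    return curve


def spreading_ability(adj, source, beta, steps=50, runs=1000, seed=0):
    """Average infected fraction I_i(t), t = 1..steps, over independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * steps
    for _ in range(runs):
        for t, frac in enumerate(si_spread(adj, source, beta, steps, rng)):
            totals[t] += frac
    return [x / runs for x in totals]


# Hypothetical example: path graph 0 - 1 - 2 - 3 with certain transmission.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
curve = spreading_ability(path, source=0, beta=1.0, steps=3, runs=10)
```

With `beta = 1.0` the infection advances exactly one hop per step along the path, which gives a deterministic check of the bookkeeping; realistic spreading probabilities produce a curve that has to be averaged over many runs.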

C. THE KENDALL'S TAU COEFFICIENT
The Kendall's tau coefficient [38] is a non-parametric statistic used to measure the degree of correspondence between two rankings and to assess the significance of this correspondence. Suppose X and Y are the ranking lists of two centralities, with $x_i$ and $y_i$ the values of node i in X and Y. A pair of distinct nodes i and j is concordant if $(x_i - x_j)(y_i - y_j) > 0$ and discordant if $(x_i - x_j)(y_i - y_j) < 0$. The coefficient is
$$\tau = \frac{n_c - n_d}{0.5\,n(n-1)}$$
wherein $n_c$ and $n_d$ are the numbers of concordant pairs and discordant pairs, respectively. A higher τ value indicates better performance of a centrality.
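The concordant and discordant pair counting behind the Kendall's tau coefficient can be sketched in a few lines. This minimal example assumes the standard form in which tied pairs count as neither concordant nor discordant and the denominator is the total number of pairs; the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau between two equally long score lists x and y."""
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1  # the pair is ordered the same way in both lists
        elif sign < 0:
            n_d += 1  # the pair is ordered in opposite ways
        # sign == 0: tied pair, counted as neither (assumption of this sketch)
    return (n_c - n_d) / (0.5 * n * (n - 1))


tau_same = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
tau_reversed = kendall_tau([1, 2, 3, 4], [40, 30, 20, 10])
tau_mixed = kendall_tau([1, 2, 3], [1, 3, 2])
```

Identical orderings give 1, fully reversed orderings give -1, and the mixed example has two concordant pairs and one discordant pair out of three.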

D. EFFECTIVENESS
We compare the performance of the proposed method with the classical centralities DC, CC, BC, Ks and LC. The SI model [30] is adopted to simulate the spreading process in the network with a certain spreading probability. In addition, the Kendall's tau coefficient is utilized to evaluate the performance of the different methods.

1) EXPERIMENT 1: COMPARE THE TOP-10 NODES USING DIFFERENT METHODS
In this experiment, we focus on the capability of identifying a group of key nodes. The top-10 ranked nodes are considered here. We rank the influence of nodes in the four networks using the five classical centralities and our proposed method, and the actual spreading ability $I = (I_1, I_2, \ldots, I_n)$ (t = 25) ranked by the SI model is adopted for comparison. […] as those of I. DC, BC and LC share 9 nodes with I in the top-10 lists, and Ks shares 8. In Jazz musicians network, the proposed method shares 8 nodes with I in the top-10 lists, which is slightly worse than CC. DC and LC each share 7 nodes with I, BC shares 5, and Ks shares the fewest. In USAir97 network, the proposed method and DC share 9 nodes with I in the top-10 lists, slightly more than BC, CC and LC. Ks shares only 3 nodes with I. As for Email network, the numbers of top-10 nodes that DC, CC, BC, Ks, LC and the proposed method share with I are 7, 8, 7, 0, 6 and 8, respectively. In summary, the proposed method and CC share the most top-10 nodes with I, while Ks shares the fewest.
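The comparison of top-10 lists amounts to counting how many nodes two rankings have in common. A minimal sketch is given below; it assumes each method is represented as a dict from node to score, that ties are broken by node identifier (an arbitrary choice made here), and the function names and the toy scores are hypothetical.

```python
def top_k(scores, k=10):
    """The k highest-scoring nodes; ties broken by node id (assumption)."""
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]


def top_k_overlap(scores_a, scores_b, k=10):
    """Number of nodes that appear in both top-k lists."""
    return len(set(top_k(scores_a, k)) & set(top_k(scores_b, k)))


# Hypothetical scores from a centrality and from a spreading simulation.
centrality = {"a": 3, "b": 2, "c": 1, "d": 0}
spreading = {"a": 1, "b": 5, "c": 0, "d": 4}
shared = top_k_overlap(centrality, spreading, k=2)
```

Here the two top-2 lists are (a, b) and (b, d), so exactly one node is shared.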

2) EXPERIMENT 2: COMPARE THE DISTINGUISHING CAPABILITY OF DIFFERENT METHODS
In experiment 1, we find that some nodes are assigned the same ranking value, so these nodes cannot be distinguished. Ks in particular has the largest number of nodes with the same ranking value in the four networks. The frequency (the ratio of the number of nodes with the same ranking value to the total number of nodes) can be taken as an indicator of the distinguishing capability of a method: a lower frequency indicates better distinguishing capability. The frequency comparison results are presented in FIGURE 2. According to FIGURE 2, in almost all cases the frequency of DC and Ks is much higher than that of the other methods; that is to say, DC and Ks have the worst capability of distinguishing nodes. In contrast, the proposed method has the lowest frequency in nearly all of the four datasets and can further distinguish nodes. In addition, we put forward a parameter, DIS, to further compare the distinguishing capability of different methods. It is defined in terms of $n_s$, the number of nodes with single ranking value. The smaller the DIS value, the better the distinguishing capability of a method. As shown in TABLE 6, the DIS value of the proposed method is the lowest in nearly all of the four networks, the exception being Karate club network, where LC and the proposed method both have the smallest DIS value; the DIS value of DC and Ks is much greater than that of the others. The proposed method has the best distinguishing capability.
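The frequency indicator described above can be computed by grouping nodes by their ranking value. The sketch below is a minimal illustration; it assumes scores are given as a dict from node to ranking value, interprets a node with single ranking value as one whose value is shared with no other node, and does not attempt to reproduce the DIS parameter. All names are hypothetical.

```python
from collections import Counter


def tie_frequencies(scores):
    """Map each ranking value to the fraction of nodes that carry it."""
    n = len(scores)
    counts = Counter(scores.values())
    return {value: c / n for value, c in counts.items()}


def count_unshared_values(scores):
    """Number of nodes whose ranking value is not shared (assumed reading)."""
    counts = Counter(scores.values())
    return sum(1 for value in scores.values() if counts[value] == 1)


# Hypothetical ranking values for four nodes; nodes 1 and 2 are tied.
scores = {1: 3, 2: 3, 3: 1, 4: 2}
freq = tie_frequencies(scores)
n_single = count_unshared_values(scores)
```

A method that assigns many nodes the same value, as a coarse-grained ranking does, produces a few large frequencies, whereas a method with good distinguishing capability produces many small ones.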

3) EXPERIMENT 3: COMPARE THE AVERAGE SPREADING ABILITY OF TOP-10 NODES
In this part, we focus on the spreading ability of a group of nodes. Each node i in a top-10 list is taken as a source node. The average spreading ability of the top-10 nodes is defined as $\bar{I}(t) = \frac{1}{10} \sum_{i=1}^{10} I_i(t)$. The maximum time step is set as t = 50 and the simulation is repeated over 1000 Monte Carlo runs. The average spreading ability of the top-10 nodes ranked by the different methods is presented in FIGURE 3.
As shown in FIGURE 3, in Karate club network the curve of the proposed method overlaps the curve of CC because their top-10 nodes are the same. The number of infected nodes of the proposed method is slightly less than that of LC, but greater than that of the others. In Jazz musicians network, the average spreading ability of the proposed method is slightly stronger than that of CC, LC and DC, and the average spreading ability of BC is the worst. In USAir97 network, the number of infected nodes of the proposed method is similar to that of DC, and slightly greater than that of LC and CC; BC has the fewest infected nodes. As for Email network, the proposed method and CC have similar spreading ability, which is stronger than that of LC and DC, and Ks has the worst spreading ability. In general, the proposed method and CC have similar spreading ability, which is stronger than that of the other methods.

4) EXPERIMENT 4: COMPARE THE CORRELATION BETWEEN SIX METHODS AND THE ACTUAL SPREADING SITUATION
We take the Kendall's tau coefficient as the correlation coefficient between the six methods and the actual spreading situation. The ranking list ε at t = 20 simulated by the SI model is considered as the actual spreading situation, and ε is compared with the ranking lists produced by the six methods. The correlation between each method and the actual spreading situation is examined with the spreading probability varying from 0 to 0.1. As shown in FIGURE 4, in Karate club network, beyond a spreading probability of 0.02 the proposed method and LC have similar τ values, which are greater than those of the other methods; that is to say, they have the strongest correlation with the actual spreading situation. In Jazz musicians network, DC has the highest τ value, followed by the proposed method. Beyond a spreading probability of 0.05, the proposed method and CC have similar τ values. In USAir97 network and Email network, the proposed method has the highest τ value, which indicates that it has the strongest correlation with the actual spreading situation. Based on this comparison, the proposed method has the strongest correlation with the actual spreading situation in nearly all of the four networks, and the correlation of BC is the worst.

V. CONCLUSION
Identification and protection of key nodes in complex networks is an effective way to improve the invulnerability and robustness of a network. To this end, we propose a novel method to identify key nodes based on global structure, which includes three aspects: the shortest path length, the number of shortest paths and the number of non-shortest paths. On this basis, three corresponding influence matrices are established. Since node efficiency reflects the contribution of one node to the information transmission of the entire network, node efficiency is selected as the initial value of a node's influence on other nodes. We combine the three influence matrices with node efficiency and establish the comprehensive influence matrix, from which the influence of each node in the network is obtained. To evaluate the performance of the proposed method, we conduct four experiments based on four real datasets, and five classical centralities are applied to the same datasets for comparison. In addition, we simulate the spreading process using the SI model with 1000 Monte Carlo runs, and the Kendall's tau coefficient is also taken as a parameter to compare the performance of the different methods. Experimental results show that the proposed method has better performance in identifying key nodes.
YUANZHI YANG received the B.S. and M.S. degrees from Air Force Engineering University, Xi'an, China, in 2014 and 2017, respectively, where he is currently pursuing the Ph.D. degree with the Aeronautics Engineering College.
His current research interests include complex networks, failure propagation, and machine learning.
XING WANG is currently a Professor with Air Force Engineering University.
His research interests include electronic countermeasures, machine learning, and artificial intelligence.
YOU CHEN received the Ph.D. degree from Air Force Engineering University, Xi'an, China, in 2011.
He is currently a Lecturer with Air Force Engineering University. His research interests include pattern recognition, data mining, machine learning, and artificial intelligence.
MIN HU received the B.S. degree from the School of Liaoning Petroleum Chemical Industry University, China, in 2017. She is currently pursuing the master's degree with the School of Southwest Petroleum University, China.
Her research interests include theoretical calculation of materials and corrosion and protection.