A Novel Model to Identify the Influential Nodes: Evidence Theory Centrality

The identification of critical nodes in complex networks is an open issue. Many scholars have tried to address it from different perspectives, but their methods are often less effective on some specific graphs or are limited to only one aspect of node importance. Evidence theory can comprehensively consider results from different sources, and Shannon entropy can measure the uncertainty of information. In this paper, we use these two tools to rate the results obtained from different measures and combine them to generate a new ranking result, namely Evidence Theory Centrality (ETC). The susceptible-infected (SI) model and Kendall's tau coefficient are used on six real networks to examine the effectiveness of our method.


I. INTRODUCTION
Recently, with the explosion of data and the rise of the Internet, complex networks have received much attention in many fields [1]-[10], such as time series [11], [12], link prediction [13], [14] and computer science [15]. A complex network has a non-homogeneous topological structure, so the status of each node in the network is different. Nodes with great influence account for only a small part of the network but play a very important role in it, while nodes with little or no influence account for the majority. The identification of critical nodes is important because it can improve the efficiency of modeling work [16]-[27]. The study of measuring node importance is also of practical significance: it can be applied to many areas, such as disease control [28], [29], rumor dynamics [30], [31] and public opinion [32].
Many methods have been proposed to identify the influential nodes in complex networks [33]-[41]. Degree centrality (DC) [42] is the simplest approach but is poor at identifying bridge nodes. As its extension, Chen et al. [43] proposed a method called local rank, which further considers the fourth-order neighbors of nodes. Kitsak et al. [44] hold that the location of a node plays a more important role, so the K-shell method was proposed. But it works effectively only when applied to some networks such as star graphs, and it is too coarse to offer a quantitative measure of node importance. Path-based centralities, such as closeness centrality (CC) [45] and betweenness centrality (BC) [46], are also of great importance. However, BC and CC are not applicable to large networks due to their high time complexity and are not feasible on disconnected graphs. Katz centrality [47] is worth attention because it takes into consideration not only the shortest paths but all paths. Among iterative algorithms, PageRank [48], [49] has been popular ever since it was first proposed by Google; it determines the "page value" by looking at the number and quality of other pages linked to the page. But on disconnected graphs its result can be inaccurate. To remedy this, Lü et al. [50] and Li et al. [51] proposed LeaderRank, which adds a ground node that connects to all other nodes through n bidirectional links. But both methods assume that the probability of jumping from a node to each of its neighbors is the same, which leaves room for improvement. Wei et al. [52] proposed Evidence centrality (EVC), which tries to solve the problem of identifying important nodes with evidence theory, but it is limited to the local characteristics of nodes.
The associate editor coordinating the review of this manuscript and approving it for publication was Moayad Aloqaily.
Evidence theory was first proposed by Dempster and developed by Shafer [53], [54]. As an extension of probability theory, it requires weaker conditions than traditional Bayesian theory. Due to its advantage in handling the uncertainty of information collected from different sources, it has been used in many fields [55], [56]. The concept of entropy originated in physics as a measure of the disorder of a thermodynamic system. In 1948, Shannon pointed out that the amount of information carried by a message is directly related to its uncertainty, which solved the problem of quantifying information [57]. The probability distribution of events and the information content of each event constitute a random variable, and the expectation of this random variable is the information content generated by this distribution, namely entropy [58]-[62].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Most existing methods focus on only one aspect or are limited to specific graphs. In this paper, we abstract the ranking values of different methods into basic probability assignments (BPAs) and use Dempster-Shafer evidence theory to combine them. Furthermore, we adopt Shannon entropy to measure the chaotic degree of each distribution and assign a weight to each method on that basis. In the experimental part, the susceptible-infected (SI) model and Kendall's tau coefficient are used on six real networks to examine the effectiveness of our method, and the time complexity is analyzed.
The rest of the paper is organized as follows. In Section II, some preliminaries including evidence theory and Shannon entropy are briefly introduced. In Section III, the proposed method to identify influential nodes is presented. In Section IV, the SI model and Kendall's tau coefficient are used to illustrate the effectiveness of the proposed method. The conclusion is given in Section V.

II. PRELIMINARIES
In this section, some basic concepts in complex networks, evidence theory and Shannon entropy will be briefly introduced.

A. CENTRALITY MEASURES
A network can be denoted as G = (V, E), where V and E are the sets of nodes and edges, respectively. The centrality measures DC, BC, and CC are defined as follows.
Definition 1: The betweenness centrality (BC) of node i, denoted as $b_i$, is defined as [46]
$$b_i = \sum_{j,k \neq i} \frac{g_{jk}(i)}{g_{jk}} \tag{1}$$
where $g_{jk}$ is the number of binary shortest paths between nodes j and k, and $g_{jk}(i)$ is the number of those paths that go through node i.
Definition 2: The closeness centrality (CC) of node i, denoted as $c_i$, is defined as [42]
$$c_i = \Big(\sum_{j} d_{ij}\Big)^{-1}$$
where $d_{ij}$ denotes the distance between nodes i and j.
Definition 3: The degree centrality (DC) of node i, denoted as $D_i$, is defined as [42]
$$D_i = \sum_{j} x_{ij}$$
where $x_{ij}$ indicates the edge between nodes i and j ($x_{ij} = 1$ if the edge exists and 0 otherwise).
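As a minimal sketch of Definitions 1 to 3, the following Python code computes DC, CC and BC for a small unweighted, undirected, connected graph stored as an adjacency dictionary. The function names and the toy path graph are illustrative assumptions, not part of the original paper. CC is taken as the reciprocal of the sum of hop distances, and BC is accumulated with Brandes' algorithm and counted once per unordered pair (j, k), which is one common reading of the sum in Definition 1.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj):
    """D_i: number of edges attached to node i."""
    return {i: len(adj[i]) for i in adj}


def closeness_centrality(adj):
    """c_i: reciprocal of the summed distances (assumes a connected graph)."""
    result = {}
    for i in adj:
        total = sum(bfs_distances(adj, i).values())
        result[i] = 1.0 / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """b_i via Brandes' algorithm, each unordered pair counted once."""
    bc = {i: 0.0 for i in adj}
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # shortest-path predecessors
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve.
    return {i: b / 2.0 for i, b in bc.items()}


# Toy example (hypothetical): the path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(degree_centrality(path))
print(closeness_centrality(path))
print(betweenness_centrality(path))
```

On this path graph the two inner nodes receive the highest value under all three measures, which is the kind of agreement the fusion step later exploits.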

B. DEMPSTER-SHAFER EVIDENCE THEORY
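Dempster-Shafer (D-S) theory offers a useful tool for handling uncertain information. It is regarded as an extension of Bayesian theory. Some preliminaries of D-S theory are introduced as follows. For additional details about D-S theory, refer to [53], [54].
Definition 4 (Frame of discernment): Let $\Theta$ be the set of mutually exclusive and collectively exhaustive events $A_i$, namely
$$\Theta = \{A_1, A_2, \ldots, A_N\}$$
The power set of $\Theta$, composed of $2^N$ elements, is indicated by $2^\Theta$:
$$2^\Theta = \{\emptyset, \{A_1\}, \ldots, \{A_N\}, \{A_1, A_2\}, \ldots, \Theta\}$$
Definition 5 (Basic probability assignment): A BPA is a mapping $m: 2^\Theta \to [0, 1]$ that satisfies
$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1$$
where m(A) refers to the degree of evidence supporting A.
Definition 6 (Dempster's combination rule): Given two BPAs $m_1$ and $m_2$, the combination rule is defined as follows [63]:
$$m(A) = \frac{1}{1 - K} \sum_{B \cap C = A} m_1(B)\, m_2(C), \quad A \neq \emptyset$$
$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$$
where K measures the conflict between the two BPAs.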
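To make the combination rule concrete, here is a minimal Python sketch of the standard form of Dempster's rule: the products of masses are summed over intersecting focal elements and renormalized by one minus the conflict. The dictionary representation (frozenset to mass), the function name and the two example BPAs are assumptions made for illustration only.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two BPAs, each given as a {frozenset: mass} dictionary."""
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            # Mass that would fall on the empty set is the conflict K.
            conflict += mass_b * mass_c
    if 1.0 - conflict <= 1e-12:
        raise ValueError("the two BPAs are in total conflict")
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


# Hypothetical example on the frame {a, b}.
m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.5, frozenset({"a", "b"}): 0.5}
print(dempster_combine(m1, m2))
```

In this example the conflict is 0.3, so the three surviving masses 0.3, 0.2 and 0.2 are each divided by 0.7.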

C. SHANNON ENTROPY
The concept of entropy originated in physics as a measure of the disorder of a thermodynamic system. In information theory, entropy is a measure of uncertainty. The probability distribution of events and the information content of each event constitute a random variable, and the expectation of this random variable is the entropy of the information content generated by this distribution. In 1948, Shannon [57] introduced the entropy of thermodynamics into information theory, so it is also known as Shannon entropy. It is defined as follows:
$$H = -\sum_{i=1}^{N} p_i \log p_i$$
where N is the number of basic states in a system, and $p_i$ is the probability of state i.
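A short Python sketch of the entropy computation follows, assuming the usual form $H = -\sum_i p_i \log p_i$. The logarithm base is not stated in the text, so the `base` parameter (defaulting to 2, i.e. bits) is an assumption, as is the convention that states with zero probability contribute nothing.

```python
import math


def shannon_entropy(probabilities, base=2):
    """Entropy of a discrete distribution; zero-probability states are skipped."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


# A uniform distribution is the most uncertain one for a given N.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
# A skewed distribution has lower entropy.
print(shannon_entropy([0.7, 0.1, 0.1, 0.1]))
```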

D. SUSCEPTIBLE-INFECTED MODEL
To test the performance of the proposed method (ETC), the SI model [64] is used to examine the spreading ability of the ranked nodes. In this model, every node is in one of two discrete states: (i) susceptible or (ii) infected. Infected nodes stay infected and spread the infection to their susceptible neighbors with probability $\beta = (1/2)^{\alpha}$. F(t) denotes the number of infected nodes at time t and can be viewed as a measure of spreading ability.
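The SI process can be sketched in a few lines of Python. The update rule used here is an assumption, since the text does not spell it out: at every time step each infected node independently tries to infect each of its susceptible neighbors with probability β, and all new infections take effect at the end of the step. The function names and the `runs` and `seed` parameters are illustrative.

```python
import random


def simulate_si(adj, seeds, beta, steps, rng=None):
    """Return F(0), F(1), ..., F(steps) for one SI run."""
    rng = rng or random.Random()
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        infected |= newly_infected  # infected nodes never recover
        history.append(len(infected))
    return history


def average_si(adj, seeds, beta, steps, runs=100, seed=0):
    """Average F(t) over several independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * (steps + 1)
    for _ in range(runs):
        for t, count in enumerate(simulate_si(adj, seeds, beta, steps, rng)):
            totals[t] += count
    return [total / runs for total in totals]


# Hypothetical example: path graph 0 - 1 - 2 - 3, seeded at node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(average_si(path, seeds=[0], beta=0.5, steps=10))
```

Because infected nodes stay infected, every single run yields a non-decreasing F(t), and so does the average.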

E. THE KENDALL'S TAU COEFFICIENT
Kendall's tau coefficient [65] is utilized to measure the correlation between different centrality methods. It considers a set of joint observations from two random variables X and Y. Any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is concordant if the ranks of both elements agree, that is, if both $x_i > x_j$ and $y_i > y_j$, or both $x_i < x_j$ and $y_i < y_j$. The pair is discordant if $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant. The Kendall's tau coefficient τ is defined as
$$\tau = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)}$$
where $n_c$ and $n_d$ denote the numbers of concordant and discordant pairs, respectively, and n is the number of observations. A higher value of τ indicates that a centrality measure generates a more accurate ranked list. The ideal case is τ = 1, where the ranked list generated by the centrality measure is the same as the ranked list generated by the real spreading process.
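The pair counting described above can be sketched directly in Python. This version assumes the common form τ = (n_c − n_d) / (n(n − 1)/2), where n is the number of observations; tied pairs count as neither concordant nor discordant. The function name and sample data are illustrative.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pair counts."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in x or y: neither concordant nor discordant
    return (concordant - discordant) / (0.5 * n * (n - 1))


# Hypothetical scores: a centrality ranking versus simulated spreading ability.
centrality = [0.9, 0.4, 0.7, 0.1]
spreading = [30, 12, 25, 14]
print(kendall_tau(centrality, spreading))
```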

III. PROPOSED METHOD
In this section, we present the proposed method and use a simple example to illustrate it.
Step 1: Choose the methods that participate in the fusion according to the actual need. In the experiments of this paper, we choose DC, BC, and CC.
Step 2: Obtain the ranking values of each method and convert them into the BPA distribution of that method. Here Mi refers to the value distribution of the selected method i, $Mi_j$ refers to the value of node j under method Mi, and n is the number of nodes in the graph.
Step 3: Calculate the Shannon entropy of each BPA distribution and use it to determine the weight of each method. The Shannon entropy reflects the chaotic degree of a distribution, so the weight of a method is inversely proportional to its Shannon entropy. Here Mi is the chosen method, $S_{Mi}$ is its Shannon entropy, S is the sum of the Shannon entropies of the chosen methods, and $W_{Mi}$ is the weight of method Mi.
Step 4: Integrate the BPA distributions with their weights to generate a new BPA distribution m, where $m_j$ is the integrated value of node j, $N_1$ is the number of methods involved in the fusion and $N_2$ denotes the number of nodes in the graph.
Step 5: Use evidence theory to fuse the result $N_1 - 1$ times.
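The five steps can be sketched end to end in Python. This is only one plausible reading, not the authors' exact formulas, and each interpretive choice is an assumption: the BPA of a method is obtained by dividing each node's score by the sum of scores; the weight of a method is taken proportional to the reciprocal of its entropy and normalized to sum to 1; the integrated BPA is the weighted sum of the individual BPAs; and the final fusion combines the integrated BPA with itself $N_1 - 1$ times using Dempster's rule restricted to single-node subsets. All function names are hypothetical.

```python
import math


def to_bpa(scores):
    """Step 2 (assumed): normalise raw scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def shannon_entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)


def entropy_weights(bpas):
    """Step 3 (assumed): weight proportional to 1 / entropy, summing to 1."""
    inverse = [1.0 / shannon_entropy(b) for b in bpas]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_average(bpas, weights):
    """Step 4 (assumed): weighted sum of the per-method BPAs."""
    n = len(bpas[0])
    return [sum(w * b[j] for w, b in zip(weights, bpas)) for j in range(n)]


def fuse_singletons(m1, m2):
    """Dempster's rule when all mass sits on single-node subsets."""
    products = [a * b for a, b in zip(m1, m2)]
    total = sum(products)  # equals 1 - K
    return [p / total for p in products]


def etc_scores(score_lists):
    """Steps 2 to 5 for the chosen methods (Step 1 is the caller's choice)."""
    bpas = [to_bpa(s) for s in score_lists]
    weights = entropy_weights(bpas)
    m = weighted_average(bpas, weights)
    fused = m
    for _ in range(len(bpas) - 1):  # Step 5: fuse N_1 - 1 times
        fused = fuse_singletons(fused, m)
    return fused


# Hypothetical input: DC, BC and CC of the path graph 0 - 1 - 2 - 3.
dc = [1, 2, 2, 1]
bc = [0, 2, 2, 0]
cc = [1 / 6, 1 / 4, 1 / 4, 1 / 6]
print(etc_scores([dc, bc, cc]))
```

Under these assumptions the logarithm base does not affect the weights, because changing the base rescales every entropy by the same factor. The sketch does not handle a method whose scores are all zero or whose BPA is concentrated on a single node (zero entropy); a real implementation would need a rule for those cases.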

A. EXAMPLE EXPLANATION
Take Fig. 1 as an example. Note that, for simplicity, we choose DC, BC, and CC for the fusion, but the method is not limited to these three in real applications. We also tried several method combinations on the graph in Fig. 1; the weight of each method is shown in Table 2, from which we can see that the weights are inversely proportional to the Shannon entropy.

IV. APPLICATION
In this section, six real networks, the SI model and Kendall's tau coefficient are used to compare the proposed method with other classic measures. The methods involved in the fusion are DC, BC, and CC.

A. DATASETS
In this paper, we used six datasets, consisting of sparse graphs and dense graphs, to test the proposed model. All networks used in this paper are undirected. A brief introduction is given in Table 3. |N| and |V| are the numbers of nodes and edges, respectively; <K> and K_max are the average and maximum values of the degree; <W> is the average shortest distance in the network. … is used as the initial infected node set in Fig. 2.

C. THE COMPARISON OF SPREADING ABILITY
If a node is located at an important position, it has a strong infection ability in the complex network. The SI model is deployed to test the infection ability of the nodes identified by different measures. In this paper, the top 10 nodes shown in Table 4 are used as the initial infected nodes, and the rest of the nodes are treated as susceptible. At each time step t, infected nodes infect their susceptible neighbors with spreading rate $\beta = (1/2)^{\alpha}$, and the numbers of infected and susceptible nodes always sum to the number of nodes |N| in the network. The classic methods PageRank, LeaderRank and EVC are compared with our method on six real networks. Each experimental result is the average of 100 independent runs. The results are shown in Fig. 2. The number of infected nodes F(t) increases with the transmission time and eventually reaches a stable value, which is the expected behavior of the SI model and supports the correctness of the experiment. To illustrate the experimental results better, we show only a part of the whole process, because the initial node sets are close to each other.
Because the Karate and AIDS networks are relatively small, their results are strongly affected by the randomness of the infection probability. The plots show the final stage of the process, where ETC achieves the best performance among these methods (after t = 35 in Karate). The performance on Netscience is also the best. It is worth noting that although Netscience is larger than AIDS, the whole Netscience network is infected earlier (t = 85) than the AIDS network (t = 115). This phenomenon is determined by the structure of the graph: if the graph is sparse, the propagation is slow. In Groad, none of the four methods reaches the peak within the selected time period, but ETC spreads the fastest. In the USAir97 network, ETC surpasses the other methods and spreads the fastest before the maximum number of infections is reached. In Blog, as can be seen from Table 4, the nodes identified by every method are the same as ours, so there is little difference among them (the remaining difference is caused by the infection probability). The advantage of our method is its flexibility: it can adjust its result by changing the methods participating in the fusion according to the actual demand. Furthermore, it is not limited by the topological structure of graphs, because if a method performs poorly, our model reduces its weight.

D. EFFECTIVENESS
Kendall's tau coefficient [65] measures the correlation between two variables; here, a higher τ shows that the ranking produced by a centrality measure is more similar to the ranking produced by the standard model (SI model). Different cases are considered in this experiment: the spreading rate β varies from 0.01 to 0.1 to examine τ. The infection process is independently repeated 100 times, and τ is obtained by averaging. In this paper, we compare six methods with the SI model, and the results are shown in Fig. 3. It can be seen that the value of τ is above 0.45 in all six graphs and even reaches 0.7 in Karate, USAir97 and Blog. In Karate, ETC performs best most of the time. In AIDS, Netscience, Groad and USAir97, ETC outperforms all methods except CC. In the Blog network, only DC is better than our method. Note that the rank correlation τ does not change sharply as β varies.
Fig. 4 shows the average infection ability of the top-L nodes ranked by different methods. F(t) should decrease as the number of selected nodes increases. In Karate and USAir97, ETC outperforms the other methods most of the time. In AIDS, Netscience and Groad, our method is better than the other methods except CC. In Blog, the proposed method achieves the best performance except when the number of nodes ranges from 500 to 1000.

E. ANALYSIS OF TIME COMPLEXITY
The time complexity of the original Dempster-Shafer combination is supposed to be $O(n^2)$. Since every BPA in this paper assigns mass only to single-element subsets, the time cost is reduced to $O(n)$. This complexity is lower than that of the majority of existing methods. From Table 5, it can be seen that DC always costs the least time. This result meets our expectation, because DC is the simplest of all the measures for identifying influential nodes. In these six networks, only BC and PageRank can compete with our method. Furthermore, we find that in relatively large networks, BC is inferior to the proposed method.
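The reduction to linear time can be checked with a short derivation. It is a sketch under the stated assumption that both BPAs put mass only on the single-element subsets $\{A_1\}, \ldots, \{A_n\}$, and it uses the standard form of Dempster's rule; the superscript notation $m^{(N_1)}$ is introduced here for illustration.

```latex
% Assumption: m_1 and m_2 assign mass only to the singletons \{A_1\},\dots,\{A_n\}.
% Two singletons intersect only when they are equal, so every cross term
% m_1(\{A_j\}) m_2(\{A_k\}) with j \neq k goes into the conflict K, and
\[
m(\{A_j\}) = \frac{m_1(\{A_j\})\, m_2(\{A_j\})}
                  {\sum_{k=1}^{n} m_1(\{A_k\})\, m_2(\{A_k\})},
\qquad j = 1, \dots, n .
\]
% One combination therefore needs n products and one sum: O(n) instead of O(n^2).
% Combining a single BPA m with itself N_1 - 1 times gives, by induction,
\[
m^{(N_1)}(\{A_j\}) = \frac{m(\{A_j\})^{N_1}}{\sum_{k=1}^{n} m(\{A_k\})^{N_1}} .
\]
```

The self-combination sharpens the distribution: nodes with a larger integrated value gain relative mass with each fusion, while the ordering of the nodes is preserved.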

V. CONCLUSION
In this paper, the identification of influential nodes was discussed. First, we proposed a model to abstract the ranking values of different methods and measure their chaotic degree by Shannon entropy. Second, evidence theory was introduced to fuse the chosen methods. In the application, DC, BC, and CC were selected to participate in the fusion. Rather than specifying fixed methods to fuse, our method provides a flexible framework for selecting the methods to fuse according to the actual situation: it can adjust its result by changing the methods participating in the fusion, and if a method performs badly under specific evaluation criteria, ETC reduces its weight. Finally, we compared the proposed method with other classic methods on six networks, and the experimental results demonstrated the effectiveness of our method, Evidence Theory Centrality. The shortcoming of ETC is that it needs to know the values of the other methods first, which costs more time. This is what we will focus on in future work.