Quantifying the Effect of Community Structures for Link Prediction by Constructing Null Models



I. INTRODUCTION
In the field of complex networks, link prediction is the problem of identifying unknown or spurious links in a network whose structure is uncertain [1], [2]. Link prediction algorithms can be applied to identify unknown interactions and thereby reduce the cost of experiments [3]. For example, the vast majority of interactions among proteins are still unknown in biology [4]-[6], and uncovering them has traditionally required expensive experiments.
Link prediction plays an essential role in various applications of complex networks. For example, it can be used to analyze the evolution of social networks, such as using the historical structure of an online social network to predict whether a pair of friends will comment on each other in the future [7]-[10]. Link prediction can also measure how well network models fit real-life networks [11]. Furthermore, it is helpful for designing recommendation algorithms [12], detecting spurious links [2], analyzing cascading failures [13]-[16], network reconstruction, and node classification [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Zhao.
Existing link prediction algorithms can be divided into two categories: structural similarity algorithms in the network domain [18] and network embedding algorithms from the field of machine learning [19]. Traditionally, link prediction methods in the network domain use micro-scale topological properties, such as link existence [11], link weights [20], common neighbors [21], node degree [17], [22], and clustering coefficient [23], to predict links. In recent years, network embedding has been widely applied to link prediction [19]. It aims to map network data into a low-dimensional space in which the network neighborhood information is maximally preserved. In a previous study, we systematically compared the two categories of prediction algorithms and studied the shortcomings of network embedding algorithms in short-path networks [24]. Those findings imply that algorithms in the network domain can effectively characterize node similarity using micro-scale and meso-scale (e.g., community) structure information, and can therefore outperform network embedding algorithms on short-path networks.
Traditionally, link prediction methods tend to use local topology properties. With this intuition, we modify several basic local structural similarity methods for link prediction: Common Neighbors (CN), Jaccard Similarity, Resource Allocation (RA), and Leicht-Holme-Newman (LHN) [17]. CN is the most basic and simplest of these metrics, so we only present the CN-based method here.
The link prediction method of Common Neighbors (CN) is based on the node similarity of local information proposed by Liben-Nowell and Kleinberg [30]. First, we select a pair of nodes a and b, and let $\Gamma(a, b)$ denote the set of common neighbors of nodes a and b. Then, the score of the link between a and b is equal to the number of common neighbors between them:
$$S^{CN}_{ab} = |\Gamma(a, b)|$$
In this case, a higher CN score means that a link between a and b is more likely.

B. LOCAL COMMUNITY NEIGHBORS (LCN)
Recently, some algorithms have used the information of whether a pair of nodes and their common neighbors are located in the same community [35]-[39] to improve the performance of traditional link prediction methods [25], [26]. Based on CN, they add the number of common neighbors located in the same community to obtain the method of Local Community Neighbors (LCN). Here, C(a) and C(b) are the communities that nodes a and b belong to, respectively. In Equation 2, the first part is the number of common neighbors of nodes a and b, and the second part is the weight of community information.
Because node i has to be a common neighbor of nodes a and b, we call this a local way of using community information. If node i belongs to the community of node a, and node b also belongs to this community, the second part increases by 1.
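Read literally, the description above corresponds to a score of the form $S^{LCN}_{ab} = |\Gamma(a,b)| + |\{i \in \Gamma(a,b) : C(i) = C(a) = C(b)\}|$, where $\Gamma(a,b)$ is the set of common neighbors of a and b. This form is an assumption reconstructed from the prose, not a quotation of Equation 2. A minimal Python sketch under that assumption, with hypothetical function names and a hypothetical toy graph:

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def common_neighbors(adj: Graph, a, b) -> Set[Hashable]:
    """Return the set of nodes adjacent to both a and b."""
    return adj[a] & adj[b]


def cn_score(adj: Graph, a, b) -> int:
    """CN score: the number of common neighbors of a and b."""
    return len(common_neighbors(adj, a, b))


def lcn_score(adj: Graph, comm: Dict[Hashable, int], a, b) -> int:
    """Assumed LCN form: CN plus 1 per common neighbor i with C(i) = C(a) = C(b)."""
    shared = common_neighbors(adj, a, b)
    bonus = sum(1 for i in shared if comm[i] == comm[a] == comm[b])
    return len(shared) + bonus


# Hypothetical toy graph: nodes 1-4 form community 0, node 5 is in community 1.
adj = {
    1: {3, 4, 5},
    2: {3, 4, 5},
    3: {1, 2},
    4: {1, 2},
    5: {1, 2},
}
comm = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}

print(cn_score(adj, 1, 2))         # 3: the common neighbors are 3, 4 and 5
print(lcn_score(adj, comm, 1, 2))  # 5: only 3 and 4 share the community of 1 and 2
```

In this toy graph the community term separates the pair (1, 2) from a pair with the same CN score whose common neighbors lie in other communities.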

C. GLOBAL COMMUNITY NEIGHBORS (GCN)
The LCN algorithm has the following disadvantages. First, this kind of algorithm only uses local community information and does not fully explore global community information. Second, LCN algorithms in previous studies are usually applied to predict unknown links regardless of whether the synthetic or real-life network actually has community structure. Obviously, networks without community structure will not gain any performance improvement from such algorithms. Finally, as far as we know, few studies have systematically examined the detailed effects of community structures on link prediction. In this study, we propose a novel link prediction algorithm that takes advantage of the number of common neighbors in the same community. In addition, the number of neighbors located in the same community is introduced as another way of using community information. We call this method Global Community Neighbors (GCN). Here, node i is a neighbor of node a and C(i) denotes the community that node i belongs to. Likewise, node j is a neighbor of node b and C(j) denotes the community that node j belongs to. The main difference between GCN and LCN is that GCN does not require node i or j to be a common neighbor of nodes a and b. Therefore, GCN imposes fewer constraints than LCN and other CN-based methods [25], [26], and hence it contains more community structure information. Moreover, we introduce different null models to study the role of community structures in link prediction more carefully.
In the GCN algorithm, it is necessary to balance the effects of micro-scale (common neighbors) and meso-scale (common community neighbors) characteristics. To quantify the role of community structures in link prediction, we add a new parameter α to GCN that weights the community information, which gives αGCN. Here we set the range of α to be [−3, 3]. α > 0 means that the information of community structures has a positive effect on link prediction. On the contrary, α < 0 means that the information of community structures has a negative effect on link prediction. If |α| < 1, the weight of community structure information is less than that of common neighbors; if |α| > 1, community information plays a more prominent role than the number of common neighbors.
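The role of α can be illustrated with a small sketch. The exact GCN formula is not reproduced here, so the code assumes one possible reading of the description: the score is the CN count plus α times a community term that counts neighbors i of a with C(i) = C(b) and neighbors j of b with C(j) = C(a), where neither i nor j has to be a common neighbor. The function name `alpha_gcn_score` and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def alpha_gcn_score(adj: Graph, comm: Dict[Hashable, int], a, b,
                    alpha: float = 1.0) -> float:
    """Assumed alpha-GCN reading: CN + alpha * (community term).

    The community term counts neighbors i of a with C(i) = C(b) and
    neighbors j of b with C(j) = C(a); i and j need not be common neighbors.
    This is an illustrative assumption, not the paper's exact equation.
    """
    cn = len(adj[a] & adj[b])
    near_a = sum(1 for i in adj[a] if i != b and comm[i] == comm[b])
    near_b = sum(1 for j in adj[b] if j != a and comm[j] == comm[a])
    return cn + alpha * (near_a + near_b)


# Hypothetical toy graph: nodes 1-4 form community 0, node 5 is in community 1.
adj = {
    1: {3, 5},
    2: {4},
    3: {1, 4},
    4: {2, 3},
    5: {1},
}
comm = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1}

# Nodes 1 and 2 share no neighbor, so CN alone cannot rank this pair.
for alpha in (-1.0, 0.0, 1.0):
    print(alpha, alpha_gcn_score(adj, comm, 1, 2, alpha))
```

With α = 0 the sketch reduces to CN, a positive α rewards pairs whose neighborhoods sit in each other's community, and a negative α penalizes them, which mirrors the interpretation of the sign of α given above.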

A. EVALUATION METRICS FOR LINK PREDICTION
Referring to [9], [20], the standard link prediction problem can be described as follows. A part of the non-observed links constitutes the set $\bar{E}$, and the observed link set $E$ is divided into two parts $E^T$ and $E^P$, where $E^T \cup E^P = E$ and $E^T \cap E^P = \emptyset$. All the links in $E$ have been observed and are known. Typically, 90% of the observed links belong to $E^T$, and $E^P$ contains the remaining 10%. The division into $E^T$ and $E^P$ is arbitrary and is used for scoring purposes [20]. Links in $E^T$ form the training set and are used to compute the link prediction scores, the effectiveness of which is evaluated on the probe set $E^P$.
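The split described above can be sketched as follows. The helper names `split_links` and `non_observed_links` are hypothetical, and the sketch enumerates all non-observed pairs, whereas the text uses only a part of them.

```python
import random
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]


def split_links(edges: Iterable[Edge], probe_fraction: float = 0.1,
                seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Randomly divide the observed links into a training set and a probe set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = round(probe_fraction * len(shuffled))
    return shuffled[n_probe:], shuffled[:n_probe]  # (training, probe)


def non_observed_links(nodes: Iterable[int], edges: Iterable[Edge]) -> List[Edge]:
    """List every node pair that is not an observed link."""
    observed = {frozenset(e) for e in edges}
    ordered = sorted(nodes)
    return [(u, v)
            for i, u in enumerate(ordered)
            for v in ordered[i + 1:]
            if frozenset((u, v)) not in observed]


# Hypothetical toy network: a ring of 20 nodes.
edges = [(i, (i + 1) % 20) for i in range(20)]
train, probe = split_links(edges, probe_fraction=0.1)
non_edges = non_observed_links(range(20), edges)
print(len(train), len(probe), len(non_edges))
```

Scores must then be computed from the training links only, since the probe links play the role of the unknown links to be recovered.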

1) AUC
The Area Under the Receiver Operating Characteristic Curve (AUC), originally applied to evaluate communication schemes, has been widely used to measure prediction accuracy in a variety of settings [40]. Here, we use AUC as one of the measures of link prediction accuracy. Only the information in $E^T$ may be used to compute the scores. We make $m$ independent comparisons, each between the score of a randomly chosen link from $E^P$ and the score of a randomly chosen link from $\bar{E}$. If the score of the link from $E^P$ is larger $m'$ times and the two scores are equal $m''$ times, then
$$AUC = \frac{m' + 0.5\,m''}{m}$$
Generally, AUC values lie in the range [0.5, 1]. An AUC value close to 1 means that the corresponding link prediction method is highly accurate.
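The sampling procedure behind this AUC estimate can be sketched in a few lines. Here $m'$ is the number of comparisons in which the probe link scores higher and $m''$ the number of ties; the function name `auc_by_sampling` and the toy score functions are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Edge = Tuple[int, int]


def auc_by_sampling(score: Callable[[int, int], float], probe: List[Edge],
                    non_observed: List[Edge], m: int = 10000,
                    seed: int = 0) -> float:
    """Estimate AUC from m random (probe link, non-observed link) comparisons."""
    rng = random.Random(seed)
    larger = equal = 0
    for _ in range(m):
        s_probe = score(*rng.choice(probe))
        s_non = score(*rng.choice(non_observed))
        if s_probe > s_non:
            larger += 1   # contributes to m'
        elif s_probe == s_non:
            equal += 1    # contributes to m''
    return (larger + 0.5 * equal) / m


# Hypothetical toy check with one probe link and one non-observed link.
probe = [(0, 1)]
non_observed = [(2, 3)]


def perfect(u, v):
    """Score that always ranks the probe link first."""
    return 1.0 if (u, v) == (0, 1) else 0.0


def constant(u, v):
    """Uninformative score that ties every comparison."""
    return 0.0


print(auc_by_sampling(perfect, probe, non_observed, m=100))   # 1.0
print(auc_by_sampling(constant, probe, non_observed, m=100))  # 0.5
```

The second call shows why 0.5 is the baseline: a score that cannot separate the two sets produces only ties, each of which counts as half a win.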

2) PRECISION
Precision is also widely used to measure performance in designing recommendation algorithms [41] and detecting spurious links [20] in various settings. Precision is defined as the ratio of selected observed links to the number of selected items. Here, we use Precision as the other measure of prediction accuracy. In this study, 10% of the node pairs from $E^P$ and the same number of node pairs from the nonexistent links are selected randomly. We select the top $L$ pairs of nodes (200, about 1% of the node pairs) by their scores in the LFR network.
In other cases, we select 1% of the network size as the value of $L$. If $m$ of the selected pairs are links from $E^P$, then
$$Precision = \frac{m}{L}$$
The values of Precision are in the range [0, 1]. The closer the Precision value is to 1, the better the prediction method performs.
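A minimal sketch of the Precision computation follows, assuming the candidate pairs have already been sampled as described above. The function name `precision_at_L` and the toy scores are hypothetical.

```python
from typing import Callable, List, Tuple

Edge = Tuple[int, int]


def precision_at_L(score: Callable[[int, int], float], probe: List[Edge],
                   non_observed: List[Edge], L: int) -> float:
    """Rank all candidate pairs by score and return the share of probe links in the top L."""
    probe_set = {frozenset(e) for e in probe}
    candidates = list(probe) + list(non_observed)
    ranked = sorted(candidates, key=lambda e: score(*e), reverse=True)
    hits = sum(1 for e in ranked[:L] if frozenset(e) in probe_set)
    return hits / L


# Hypothetical toy scores for two probe links and two nonexistent links.
toy_scores = {(0, 1): 3.0, (2, 3): 1.0, (0, 2): 2.0, (1, 3): 0.0}
probe = [(0, 1), (2, 3)]
non_observed = [(0, 2), (1, 3)]

print(precision_at_L(lambda u, v: toy_scores[(u, v)], probe, non_observed, L=2))  # 0.5
```

Unlike AUC, this measure only looks at the head of the ranking, so it is sensitive to how many high-scoring nonexistent pairs crowd out the probe links.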

B. EXPERIMENTAL RESULTS
The LFR network is one of the most commonly used benchmarks for testing the performance of community detection algorithms [42]. It takes into account not only the distribution of node degree but also the distribution of community size. In this study, to provide an impartial evaluation of the role of links within the same community and links between different communities, we ensure that the two kinds of links are similar in number. Moreover, our LFR network is undirected. Table 1 gives the background details of the LFR benchmark network. In addition to the LFR network in Table 1, we also use a series of LFR benchmark networks whose µ values change from 0.8 to 0.1 to test how community structures affect the performance of link prediction in Sect. IV-D. Furthermore, we use eight real-life networks to verify that GCN can obtain higher performance than LCN and traditional CN-based algorithms in this section.
First, we select a suitable way to use the information of community structures for link prediction. As shown in Fig. 1, there is an obvious difference among the performance of CN, LCN, and GCN. The results of LCN and GCN are both superior to that of CN. That is to say, community structures can improve the accuracy of link prediction. In detail, GCN achieves the best performance for AUC, and it also achieves the highest Precision. In general, the performance of GCN stays ahead of that of CN and LCN. Our results indicate that the number of neighbors located in the same community can substantially alter the result of link prediction. Therefore, in the following sections, we mainly exploit community structure information through the GCN method.
For αGCN, if α > 0, community information plays a positive role in link prediction. On the contrary, if α < 0, the information of community structures plays a negative role. We set the range of α to be [−3, 3]. As shown in Fig. 2, when α > 0, community structures always help to improve the performance. In contrast, a negative weight (α < 0) significantly reduces the prediction accuracy. In addition, for α > 0, the performance shows no obvious fluctuation as α varies. Hence, we suggest that the specific value of the weight factor α does not strongly affect the performance of link prediction as long as it is greater than zero. In the following sections, we set α = 1 and use GCN to inspect the effect of community structures on link prediction.
Apart from the synthetic benchmark networks (i.e., GN and LFR), eight well-studied real-world networks are also adopted to study the effect of community structures, including the networks of Adjectives and nouns [43], Zachary Karate Club [44], Sawmill communication [45], Bottlenose Dolphins [46], Football [47], Polbooks [48], and Co-appearances of characters in Les Miserables [49]. The characteristics of these networks are summarized in Table 2.
TABLE 2. The statistics of eight real-world networks. $N$ represents the number of nodes, $M$ represents the number of edges, $C_n$ is the number of communities, $Q$ is the Modularity index obtained by Infomap, $R_{in}$ is the ratio of the number of inner edges within each community to the total number of edges, AUC% is the percentage improvement of GCN over CN in AUC, and Precision% is the percentage improvement of GCN over CN in Precision.
As illustrated in Table 2, by fully exploring community information, the GCN method can significantly improve the performance of link prediction compared with previous studies. The new framework of link prediction by GCN is superior to the classical CN method. Experimental results on real-life networks show that the highest performance improvement is 31.57% for AUC and 49.10% for Precision. Most link prediction algorithms based on community information are scalable, because community size generally does not increase with network size. To test the effectiveness of our algorithm on large-scale networks, two larger networks were added, namely an LFR network with 5000 nodes and 50245 edges and the Power grid network with 4941 nodes and 6594 edges. The experimental results on these two networks show that the proposed method is robust and scalable.

IV. THE EFFECT OF COMMUNITY INTENSITY

A. EXPERIMENTAL RESULTS OF LFR NETWORKS
We have verified that the information of community structures can significantly improve the accuracy of link prediction algorithms. Therefore, the relationship between community structures and link prediction should be studied more carefully.
Many connections within communities mean that the intensity of community structures is high. In contrast, if there are few connections within each community, the community intensity of the network is relatively low. By adjusting the parameter µ, the strength of community structures can be modified for a series of LFR networks. We conduct experiments on LFR networks in Fig. 3. In these networks, µ changes from 0.8 to 0.1 in steps of 0.1, and a lower µ value means clearer community structures. The performance of GCN is always superior to that of CN, and both increase as community intensity grows in Figs. 3(a) and (b). This suggests that a higher community intensity helps to achieve better accuracy in link prediction. In other words, clearer community structures (i.e., denser links within a community) correspond to higher accuracy for link prediction that is based on the information of community structures.
At the same time, as community intensity increases, the performance improvement obtained by using community information becomes smaller and smaller, as shown in Fig. 3. Although this phenomenon is very interesting and deserves further study, we cannot use these LFR networks to reveal the intrinsic mechanism, because when we vary the mixing parameter µ, the whole topology of the corresponding LFR network is also modified.

B. NULL MODEL OF INCREASING COMMUNITY INTENSITY
The null models proposed in this section can enhance or weaken the community density of an original network while maintaining its micro-scale structural characteristics. Once the community structure of a network has been determined, we study the influence of its community intensity on link prediction. Traditionally, a series of benchmark networks can be constructed, but they do not preserve the micro-scale structures of the original network [50]-[52]. Here, we introduce the first null model, which is similar to the one proposed by Bagrow [53] and enhances the intensity of community structures. Meanwhile, while varying the meso-scale characteristics of the network (i.e., community structures), our null models maintain as far as possible the micro-structures of different orders, such as degree distribution, assortativity, and transitivity.
The process of constructing the null model of increasing community intensity is illustrated in Fig. 4. First, we split a toy network into multiple communities using a classical community detection algorithm (e.g., Infomap). Second, we exchange a pair of external edges between two communities so that they become edges within a community, while keeping micro-scale structures (e.g., degree distribution) unchanged. For example, we can disconnect the two edges A1-B1 and A5-B3. If neither of the node pairs A1-A5 and B1-B3 is connected, we connect both of them. The result of the reconnection is illustrated in Fig. 4(b). The rewiring process is repeated many times until the null model reaches the required increase in community intensity.

C. NULL MODEL OF DECREASING COMMUNITY INTENSITY
To study the influence of decreasing community intensities on link prediction, we introduce the second null model, which can weaken community structures. The construction process is shown in Fig. 5. First, we divide the original network into multiple communities. Second, we exchange a pair of edges within communities so that they become external edges between two communities, while keeping micro-scale structures unchanged. For example, we first disconnect the two edges A1-A4 and B1-B4. If neither of the node pairs A1-B4 and A4-B1 is connected, we connect both of them. The reconnection result is displayed in Fig. 5(b). By repeating the rewiring process, we can obtain a series of null models with different levels of reduced community intensity. In fact, networks with weak community structures can be used to test the robustness of community detection algorithms [33], so this type of null model has more extensive applications.
VOLUME 8, 2020
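The two rewiring procedures of Sects. IV-B and IV-C can be sketched together as degree-preserving double-edge swaps. This is an illustrative sketch rather than the authors' implementation: the function names, the `max_tries` safeguard, and the toy network are assumptions, and only the degree sequence is checked, not higher-order properties.

```python
import random
from collections import Counter


def rewire_community_intensity(edges, comm, n_swaps, increase=True,
                               seed=0, max_tries=100000):
    """Degree-preserving swaps that raise or lower community intensity.

    increase=True : two external edges u-v and x-y with C(u) = C(x) and
                    C(v) = C(y) become the internal edges u-x and v-y.
    increase=False: two internal edges u-v and x-y from different communities
                    become the external edges u-y and v-x.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        u, v = edge_list[i]
        x, y = edge_list[j]
        if len({u, v, x, y}) < 4:
            continue  # the two edges share a node
        if increase:
            if comm[x] != comm[u]:
                x, y = y, x  # orient x-y so that x lies in the community of u
            ok = (comm[u] != comm[v] and comm[x] == comm[u]
                  and comm[y] == comm[v])
            new1, new2 = (u, x), (v, y)
        else:
            ok = (comm[u] == comm[v] and comm[x] == comm[y]
                  and comm[u] != comm[x])
            new1, new2 = (u, y), (v, x)
        if not ok or frozenset(new1) in present or frozenset(new2) in present:
            continue  # wrong edge types, or the swap would create a multi-edge
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {frozenset(new1), frozenset(new2)}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def internal_edges(edges, comm):
    """Count the edges whose two end nodes lie in the same community."""
    return sum(1 for u, v in edges if comm[u] == comm[v])


def degrees(edges):
    """Degree of every node, used to check that swaps preserve degrees."""
    return Counter(n for e in edges for n in e)


# Hypothetical toy network: community A = {0, 1, 2, 3}, community B = {4, 5, 6, 7}.
comm = {n: "A" if n < 4 else "B" for n in range(8)}
edges = [(0, 1), (2, 3), (4, 5), (6, 7), (0, 4), (2, 6), (1, 5), (3, 7)]

stronger = rewire_community_intensity(edges, comm, n_swaps=1, increase=True)
weaker = rewire_community_intensity(edges, comm, n_swaps=1, increase=False)
print(internal_edges(edges, comm), internal_edges(stronger, comm),
      internal_edges(weaker, comm))
```

Each successful swap moves exactly two edges between the internal and external classes, so the number of swaps gives direct control over how far the community intensity is shifted.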

D. EXPERIMENTAL RESULTS OF NULL MODELS
When the links within a community are far denser than the links between two communities, the value of Q is high, which means that the network has distinct community structures. Therefore, if we add inner links to a community and at the same time remove links between two communities, Q will increase. On the contrary, if we remove links within a community and add links between two communities, the value of Q will decrease. In this study, we use the two null models to increase and decrease community intensities (see Sect. IV-B and IV-C). As displayed in Fig. 6, the prediction result of the original LFR network is represented by the red dashed line, and we also show the prediction performance for increasing and decreasing Q values.
The effect of community intensity on AUC and Precision is quite different. For AUC, as Q varies, community information improves the performance in linear proportion [Fig. 6(a)]. However, the improvement of Precision is nonlinear under different Q values [Fig. 6(b)]. When the values of Q are small (Q < 0.5), community structure information can greatly improve the accuracy of link prediction. On the contrary, for large Q values (Q > 0.5), community information has little effect on the prediction performance. This is because the Precision of CN is already high when Q > 0.5, so it cannot be greatly improved by using community information.
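The behavior of Q under such edge changes can be checked with a short sketch. It assumes the standard Newman-Girvan modularity, $Q = \sum_c \left[ l_c / M - (d_c / 2M)^2 \right]$, where $l_c$ is the number of edges inside community c, $d_c$ the total degree of its nodes, and $M$ the number of edges. The text does not spell out the formula, so this choice, the function name, and the toy networks are assumptions.

```python
from collections import defaultdict


def modularity(edges, comm):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # l_c: number of edges inside community c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in community c
    for u, v in edges:
        degree_sum[comm[u]] += 1
        degree_sum[comm[v]] += 1
        if comm[u] == comm[v]:
            internal[comm[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical toy network: two triangles joined by the single edge 2-3.
comm = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
original = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Replace the inner edges 0-1 and 3-4 by the external edges 0-4 and 1-3.
weakened = [(0, 4), (1, 2), (0, 2), (1, 3), (4, 5), (3, 5), (2, 3)]

print(modularity(original, comm))  # 5/14, about 0.357
print(modularity(weakened, comm))  # 1/14, about 0.071
```

Because the swap keeps every node degree fixed, the second term of Q is unchanged and the drop comes entirely from the loss of two inner edges.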

E. THE EFFECT OF COMMUNITY DETECTION ALGORITHMS
Most complex networks have community structures, and community detection has been a hot topic in recent years [54]-[57]. If we adopt different community detection algorithms, the detected communities of the same network may vary greatly, and they correspond to distinct intensities of community structures. Because the detected community information can be fused into link prediction methods, it is important to explore how community detection algorithms affect the performance of link prediction.
In this section, we examine the performance of eight kinds of classical community detection algorithms: Multi-Level is based on the multi-resolution version of modularity [58], FastGreedy is based on the greedy optimization [54], WalkTrap is based on random walks [59], GN is based on edge betweenness [47], Kclique is based on the relationship between nodes and subgraphs [60], InfoMap is based on random walk dynamics [61], Label Propagation (LP) is based on the propagation of labeled nodes [62] and the leading Eigenvector algorithm is based on the leading eigenvector of modularity matrix [63]. These diverse algorithms make it more accurate to analyze the effect of community intensity for link prediction in various complex networks.
As shown in Fig. 7(a), WalkTrap, LP, MultiLevel, and Infomap achieve higher performance for AUC. On the other hand, Kclique, Eigenvector, FastGreedy, and GN do not achieve the desired results. Similar results for Precision are illustrated in Fig. 7(b). Furthermore, it is evident that the performance in AUC and Precision is proportional to the values of Q induced by the detection algorithms. There is a large difference between the performance of the best community detection algorithm and the worst one. Hence, this suggests that the performance of link prediction is not related to the specific division obtained by a community detection algorithm, but depends on the value of Q. An excellent community detection algorithm can achieve a higher value of Q, which means that the inner edges of each community in this partition are denser and the edges between two communities are sparser. The intensity of community structures induced by community detection methods can also affect the prediction performance, so a more accurate community detection can yield higher performance for link prediction. Because Infomap achieves the largest value of Q, we use it to divide every benchmark network into communities for constructing null models.

V. THE EFFECT OF LINK LOCATION

A. PREDICTING LINKS IN A COMMUNITY AND BETWEEN TWO COMMUNITIES
To analyze more clearly how community structures affect the prediction of links within a community and links between two communities, we examine the prediction results for the two kinds of links independently. As shown in Fig. 8, the performance of predicting links within a community is much higher than that of predicting links between two communities. In general, the performance of predicting links within a community is also superior to that of predicting all the links in Fig. 8(a). In addition, the performance of GCN is higher than that of CN. In particular, for Precision in Fig. 8(b), the performance of predicting links within a community is similar to that of all the links, but the performance for links between two communities is very low. Nevertheless, the results suggest that community structure information plays a significant role in predicting the links between two communities.

B. NULL MODEL OF REWIRING EDGES WITHIN A COMMUNITY
Traditional null networks are based on random rewiring, which completely destroys the community structures of the original network [64], [65]. The defining characteristic of community structures is that inner edges are denser than external edges between communities. To study the influence of inner edges on link prediction, we introduce a new null model, which only modifies the inner structure of each community but maintains the number and structural characteristics of the communities.
The construction process of the null model of randomly rewiring edges within a community is shown in Fig. 9. First, we divide the original network into multiple communities using a classical community detection algorithm such as Infomap. Second, while keeping the community structures (external edges between communities) and the overall degree distribution unchanged, we only exchange the inner edges of the communities. For example, we break the two edges A1-A4 and A2-A3 in community A. If the node pairs A1-A3 and A2-A4 are not linked, we connect them. The result of the reconnection is shown in Fig. 9(b). The random reconnection is repeated many times until the null network is completely randomized. Finally, we obtain a special 1k null network that only destroys the inner topology of each community.
As shown in Fig. 10, for the null model of reconnecting edges within each community, the prediction performance for links within a community is similar to that for links between two communities and to that for all the links. In addition, compared with Fig. 8, the performance of predicting links within a community decreases sharply for both AUC and Precision in Fig. 10. Hence, the results suggest that the original community structure strongly benefits the prediction of links within one community. In other words, we confirm that community structures play a crucial role in predicting the links within a community, and these links also have a significant effect on link prediction for the whole network. We emphasize that even in this case, the performance of GCN is always much better than that of CN in Fig. 10, which strongly demonstrates the effectiveness and robustness of the proposed algorithm.

C. NULL MODEL OF REWIRING EDGES BETWEEN COMMUNITIES
To investigate the impact of external edges between communities, we introduce another null model that manipulates only the links between two communities but maintains the number of communities and all the links within each community. The process of constructing the null network by randomly reconnecting edges between two communities is shown in Fig. 11. First, we use a classical community detection algorithm to divide the original network into multiple communities. Second, we only exchange the edges between two communities, keeping the topology within each community and the degree distribution of the entire network unchanged. For example, we break the two edges A1-B1 and A4-B4. If the node pairs A1-B4 and A4-B1 are not linked, we connect them, and the reconnection result is shown in Fig. 11(b). The random reconnection is repeated many times, and we then obtain a special 1k null network that only destroys the external structures of all the communities. The two types of null models depicted in Figs. 9 and 11 can distinguish the role of edges within a community from that of edges between different communities.
For the null network of randomly reconnected edges between communities, the performance of predicting all kinds of links is similar to that of the original LFR network in Fig. 8, so the relevant results are not shown here. This means that the edges between communities have little impact on link prediction, and it further verifies the significance of the links within a community.
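The two location-specific null models of Sects. V-B and V-C differ only in which pairs of edges are allowed to swap, as the following sketch shows. The function names, the random reorientation of the second edge, the `max_tries` safeguard, and the toy network are assumptions made for illustration.

```python
import random
from collections import Counter


def rewire_by_location(edges, comm, n_swaps, within=True, seed=0,
                       max_tries=100000):
    """Degree-preserving swaps u-v, x-y -> u-y, x-v restricted by edge location.

    within=True : all four nodes lie in one community, so only the inner
                  topology of that community changes.
    within=False: both old edges and both new edges are external, so the
                  topology inside every community is untouched.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        u, v = edge_list[i]
        x, y = edge_list[j]
        if rng.random() < 0.5:
            x, y = y, x  # try both orientations of the second edge
        if len({u, v, x, y}) < 4:
            continue  # the two edges share a node
        if within:
            ok = comm[u] == comm[v] == comm[x] == comm[y]
        else:
            ok = (comm[u] != comm[v] and comm[x] != comm[y]
                  and comm[u] != comm[y] and comm[x] != comm[v])
        new1, new2 = (u, y), (x, v)
        if not ok or frozenset(new1) in present or frozenset(new2) in present:
            continue
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {frozenset(new1), frozenset(new2)}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def split_edges(edges, comm):
    """Return (set of internal edges, set of external edges)."""
    inner = {frozenset(e) for e in edges if comm[e[0]] == comm[e[1]]}
    outer = {frozenset(e) for e in edges if comm[e[0]] != comm[e[1]]}
    return inner, outer


def degrees(edges):
    """Degree of every node."""
    return Counter(n for e in edges for n in e)


# Hypothetical toy network: community A = {0, 1, 2, 3}, community B = {4, 5, 6, 7}.
comm = {n: "A" if n < 4 else "B" for n in range(8)}
edges = [(0, 1), (2, 3), (4, 5), (6, 7), (0, 4), (3, 7)]

inner_only = rewire_by_location(edges, comm, n_swaps=1, within=True)
outer_only = rewire_by_location(edges, comm, n_swaps=1, within=False)
print(split_edges(inner_only, comm)[1] == split_edges(edges, comm)[1])  # True
print(split_edges(outer_only, comm)[0] == split_edges(edges, comm)[0])  # True
```

Both variants leave the number of internal and external edges unchanged, so any difference in prediction performance can be attributed to where the randomization took place rather than to a change in community intensity.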

VI. THE EFFECT OF MICRO-SCALE STRUCTURES

A. 0k-3k NULL MODELS FOR MICRO-SCALE STRUCTURES
In general, compared with lower-order null models, the structure of a higher-order null model is closer to the topology of the original network [65]. The null models of all these orders are nested, i.e., 0K ⊇ 1K ⊇ 2K ⊇ 3K. Any higher-order null network encompasses the features of the lower-order null networks [64]. Here, we briefly introduce the null models of different orders. The 0k null model is the simplest, completely randomized version of the original network, which only has the same number of nodes and the same average node degree as the original network, as in Fig. 12(a). Compared with the 0k model, the 1k-3k null models maintain more topological characteristics, so they have been widely used in previous studies. Fig. 12 shows a toy example of the 1k-3k null models. All the null models are generated by the edge reconnection algorithm, which randomizes all the edges under certain constraints [66], [67]. The 1k null model maintains the degree sequence of the original network and is depicted in Fig. 12(b). The original network and its corresponding 2k null model have the same joint degree distribution, that is, the degree values of the two end nodes of each link are the same as in the original network. The 2k null model is depicted in Fig. 12(c). There are two nodes labeled $k_1$ in it, which means that the two nodes have the same degree $k_1$. To keep the joint degree distribution of the original network, the nodes $k_2$ and $k_3$ in the corresponding 2k null network have to keep a node of the same degree ($k_1$) at the other end of each link. Finally, the clustering coefficients of all the nodes in the original network and its 3k null model are the same, and at the same time the degree distribution and joint degree distribution are retained. The 3k null model is shown in Fig. 12(d).
It is important to note that not all the information of the original network is randomized in the null models, so we can use them to explore the impact of different micro-scale and meso-scale network characteristics on link prediction. In each real-life network, as the order of the null model increases, the average number of shuffled edges in the corresponding null models decreases. This means that the diversity of the generated benchmarks decreases, because fewer edges can be rewired, which makes the network structures similar to the original network. Therefore, 1k-3k null models lead to different changes in the original real-life networks, providing diversified test networks for the performance evaluation of link prediction algorithms.
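The nesting of the 0k, 1k, and 2k constraints can be illustrated with a short sketch. It assumes the common double-edge-swap construction, with the 2k case enforced by only swapping partners of equal degree; the 3k constraint on clustering is not implemented. All function names, the `max_tries` safeguard, and the toy network are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node."""
    return Counter(n for e in edges for n in e)


def joint_degrees(edges):
    """Joint degree distribution: how many edges join degrees (k, k')."""
    deg = degrees(edges)
    return Counter(tuple(sorted((deg[u], deg[v]))) for u, v in edges)


def null_0k(nodes, n_edges, seed=0):
    """0k: keep only the number of nodes and edges (hence the average degree).

    n_edges must not exceed the number of possible node pairs.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    chosen = set()
    while len(chosen) < n_edges:
        u, v = rng.sample(nodes, 2)
        chosen.add(frozenset((u, v)))
    return [tuple(sorted(e)) for e in chosen]


def null_swap(edges, n_swaps, order=1, seed=0, max_tries=100000):
    """1k or 2k: swaps u-v, x-y -> u-y, x-v keep every node degree; for
    order=2 the swap is allowed only if deg(v) == deg(y), which also keeps
    the degree pair of every edge."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    deg = degrees(edge_list)
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        u, v = edge_list[i]
        x, y = edge_list[j]
        if rng.random() < 0.5:
            x, y = y, x  # try both orientations of the second edge
        if len({u, v, x, y}) < 4:
            continue
        if order == 2 and deg[v] != deg[y]:
            continue
        new1, new2 = (u, y), (x, v)
        if frozenset(new1) in present or frozenset(new2) in present:
            continue
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {frozenset(new1), frozenset(new2)}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


# Hypothetical toy network: a ring of 8 nodes plus the chord 0-4.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4)]
g0 = null_0k(range(8), len(edges))
g1 = null_swap(edges, n_swaps=5, order=1)
g2 = null_swap(edges, n_swaps=5, order=2)
print(degrees(g1) == degrees(edges), joint_degrees(g2) == joint_degrees(edges))
```

The extra condition in the 2k case rejects many candidate swaps, which matches the observation that fewer edges can be shuffled as the order of the null model increases.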

B. EXPERIMENTAL RESULTS
The 0k-3k null models can be used to verify the effect of different micro-scale attributes (such as degree distribution, assortativity, and transitivity) on link prediction. First, we use 0k-3k null models to randomize the LFR benchmark network. As shown in Fig. 13, the performance of GCN is always higher than that of CN. Here we define $C_{1k}$ as the difference between CN and GCN for the 1k null model in Fig. 13(a), which quantifies the improvement in the accuracy of link prediction after adding community information. Using this approach, we can detect the impact of community structures on link prediction performance in null networks of different orders (i.e., $C_{0k}$, $C_{1k}$, $C_{2k}$, and $C_{3k}$).
Higher-order null models correspond to smaller structural variation from the original network. Meanwhile, as the order of the null model increases, the performance of link prediction improves, which means that retaining more micro-scale characteristics allows us to improve the prediction performance. We attempt to measure the effect of microscopic characteristics of different orders on link prediction in a specific network. For instance, $D_{23}$ represents the performance difference of CN between the 2k and 3k null models. In this way, we can quantify the effect of micro-scale structures of different orders on link prediction ($D_{01}$, $D_{12}$, $D_{23}$, and $D_{3o}$). Specifically, as shown in Fig. 13(b), the performance of the 3k null model is similar to that of the LFR benchmark network. Furthermore, the link prediction accuracy grows sharply when the order of the null model increases to 3k. Hence, for link prediction, maintaining the 3k micro-scale property (i.e., transitivity) is enough to reproduce the behavior of the original network.
TABLE 3. We quantify the effect of community structures on link prediction for null networks of different orders. $C_{0k}$ is the performance difference between CN and GCN for the 0k null model of the LFR benchmark network, $C_{1k}$ is the performance difference for the 1k null model, $C_{2k}$ is the performance difference for the 2k null model, and $C_{3k}$ is the performance difference for the 3k null model. At the same time, we also quantify the effect of micro-scale structures of different orders on link prediction. $D_{01}$ is the performance difference of CN between the 0k and 1k null models, $D_{12}$ is the performance difference between the 1k and 2k null models, $D_{23}$ is the performance difference between the 2k and 3k null models, and $D_{3o}$ is the performance difference between the 3k null model and the original LFR network.
In brief, based on the results of Table 3, we can distinguish the different roles of micro-scale and meso-scale structures in link prediction and reveal the relationship between them. Moreover, the effect of community structures on link prediction can be quantified for null networks of different orders. For example, $C_{1k}$ is the performance difference between CN and GCN for the 1k null model of the LFR benchmark network, and $C_{2k}$ is the performance difference for the 2k null model. In these two kinds of micro-scale null networks, the information of community structures has a strong positive effect, because it can greatly improve prediction performance, as shown in Table 3.
At the same time, the results in Table 3 quantify the effect of micro-scale structures of different orders on link prediction. For instance, $D_{23}$ is the performance difference for CN between the 2k and 3k null models, which shows that transitivity is a significant structural feature that affects link prediction. This framework can also be used to distinguish the different roles of micro-scale and meso-scale structures in other applications of complex networks, such as social recommendation [12], detecting spurious links [2], and node classification [17].
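The two families of differences can be written compactly. The notation below is an assumption introduced for illustration: $P_X(G)$ denotes the performance (AUC or Precision) of method $X$ on network $G$, $G_{nk}$ is the $nk$ null model of the original network $G$, and the sign convention (higher-order or community-aware performance minus the baseline) is also assumed.

```latex
\begin{align}
C_{nk} &= P_{GCN}(G_{nk}) - P_{CN}(G_{nk}), \qquad n \in \{0, 1, 2, 3\}, \\
D_{01} &= P_{CN}(G_{1k}) - P_{CN}(G_{0k}), \\
D_{12} &= P_{CN}(G_{2k}) - P_{CN}(G_{1k}), \\
D_{23} &= P_{CN}(G_{3k}) - P_{CN}(G_{2k}), \\
D_{3o} &= P_{CN}(G) - P_{CN}(G_{3k}), \\
D_{01} + D_{12} + D_{23} + D_{3o} &= P_{CN}(G) - P_{CN}(G_{0k}).
\end{align}
```

Under this convention the $D$ terms telescope, so they split the total gap between the fully randomized network and the original network into the contributions of degree sequence, joint degree distribution, transitivity, and the remaining structure.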

VII. CONCLUSION
In summary, we proposed a succinct algorithm based on extended community common neighbors to improve the performance of link prediction. More importantly, we quantified, for the first time, the role of community structures in link prediction using the proposed null models. Comparing the original network with the null models, our results suggest that the intensity of community structures plays an important role in link prediction. In particular, a higher modularity value corresponds to clearer community structures and hence gives rise to a higher accuracy of link prediction. Even just detecting more structural information of the original network with an efficient community detection algorithm can achieve a higher prediction performance. Our findings have been verified on both synthetic benchmarks and real-life networks.
We also distinguished the role of links within the same community from that of links between two communities. Our study shows that community information plays a crucial role in predicting links within one community, and that links within one community also have a strong impact on predicting all the links, especially the links between two communities. Our study also provides a starting point for a detailed analysis of the correlation between community and micro-scale structures by constructing different kinds of null models. In particular, it reveals the relationship and dependence between this special meso-scale structure (community) and micro-scale structures of different orders (i.e., degree distribution, assortativity, and transitivity) for link prediction. In the future, we will try to extend the proposed framework to signed networks [68] and will apply the information obtained from fuzzy overlapping communities [69] to link prediction.