Empowering Random Walk Link Prediction Algorithms in Complex Networks by Adapted Structural Information

In the link prediction problem, an algorithm running over a network attempts to determine whether a link between two nodes will exist in the future, given that it is not present at the moment. Most link prediction algorithms take into account the structure of the network on which they are applied and, based on this, attempt to predict whether new edges will appear in the network. However, many of them are quite standardized, applying the same concept and parametrization to all networks, and thus do not always achieve good results on every network structure. Algorithms based on Graph Neural Networks (GNNs) are more adaptive to any network structure, but they do not give appreciable results when the only information available is the network structure itself. In this paper, we propose a new approach to this problem that approximates the structure of a complex network by assigning an adjusted weight to additional structural information, which we can then embed into effective algorithms such as local and superposed random walk link prediction. To achieve this goal, we use well-known kernel functions such as sigmoids, whose parameters we fit appropriately with a genetic algorithm to achieve the best possible approximation. To demonstrate the effectiveness of our proposed method, we have compared the precision, AUC and AUPR of our prediction method on eleven selected networks of different structures and properties against seven well-known link prediction algorithms and one more utilizing GNNs. In every case, we improved the results of the random walk algorithms, and in most cases we achieved better results than all employed benchmark algorithms.


I. INTRODUCTION
In network science, complex networks are used quite often in order to model various entities, either physical or technological [1]. Examples of complex networks are the Internet, the WWW, and social, biological, technological or judicial networks, which can be modelled as graphs [2], [3], [4]. Complex networks have non-trivial topological features, exhibiting connection patterns between their entities that are neither purely regular nor purely random. Such features include heavy-tailed degree distributions, high clustering coefficients, assortativity or disassortativity among vertices representing the entities, community structure, hierarchical structure, etc. [5], [6].

The associate editor coordinating the review of this manuscript and approving it for publication was Yilun Shang.
Predicting links in complex networks has been one of the essential needs within the realm of data mining and topology discovery [7], [8] over the past few years. For various types of complex networks, such as biological/metabolic networks and food webs, discovering interactions (i.e., links) is very expensive in the laboratory or the field, resulting in limited existing knowledge of them. For example, one may consider the enormous number of physical interactions between pairs or groups of proteins [9]. Instead of having to find physically (typically at great cost) all possible connections between nodes, one can predict them using link prediction algorithms, thus reducing costs, as long as the link prediction estimates are sufficiently accurate. Also, in social networks, very likely but not yet existing links can be suggested to users as recommendations of potentially promising friendships, helping users find new friends and possibly enhancing their loyalty to the corresponding online network.
In the link prediction problem, we first model the complex network as a graph consisting of nodes connected by edges. A link prediction algorithm over a graph attempts to determine whether a link between two nodes will exist in the future, given that it is not present at the moment. Such an algorithm tries to find the similarities that exist between the nodes of the complex network. If two nodes are very similar to each other, then it is very likely that they will be connected by a link in the future. During link prediction operation, the algorithm scores the missing edges of the graph according to some metric; edges with higher scores are assessed as more likely to be added in the future than those with lower scores. In order to make the best possible estimates, link prediction algorithms try to take advantage of all the available information for each node in the network [10], [11], [12]. However, such information in complex networks is quite limited and usually consists only of the structural characteristics of the network, such as the degree of a node, i.e., its number of neighbors.
In the literature, there exist many algorithms that use the network structure in various ways to make the most accurate predictions. Depending on the logic they use, we can categorize them into the following classes. There are link prediction algorithms that use local heuristic methods, such as common neighbor centrality (CNC), which counts the number of neighbors shared by two nodes [13], [14], [15], the Jaccard Coefficient, which measures the proportion of common neighbors [16], [17], preferential attachment (PA), which uses the product of node degrees [17], [18], the Adamic-Adar Index (AA), which weights common neighbors [19], and Resource Allocation (RA), which is similar to AA but uses a more aggressive downweighting factor [20]. There are also link prediction algorithms that use global heuristic methods, which require knowing the entire network; examples include Katz [21], rooted PageRank [22] and SimRank [23]. Some very effective and simple link prediction methods are based on the Local Random Walk and the Superposed Random Walk [24]. Another interesting approach to the link prediction problem is to learn from graph topology and node/edge features in a unified way using Graph Neural Networks (GNNs). There are two basic link prediction approaches with GNNs: node-based methods [25], [26] and subgraph-based methods [27], [28]. Finally, there are frameworks that attempt to compose the best features of link prediction algorithms in a unified way, e.g., [29].
To be able to calculate a reliable and efficient similarity between two nodes, we need as much information as possible about these nodes. In complex networks, however, the information we have about the nodes is very limited. We usually refer to a node's neighbors through the node's degree. Also, information that is important in one type of network may not be as important in another type of network. For example, the common neighbors of two nodes are very important link prediction information in social networks, but not so important in biological networks. Our approach is to integrate additional information that is useful for link prediction, drawn from what is available, into known and effective link prediction similarity algorithms (local and superposed random walk in this work), in such a way as to take into account the structure of each network to which we will apply the new integrated algorithm. The integrated information we have used is the neighbors of the neighbors of a node. We believe that this fills an important gap in the existing work on link prediction.
In this paper, we use some of the most well-known similarity-based link prediction algorithms as benchmarks, specifically: the Resource Allocation Index, the Adamic-Adar Index, Common Neighbor Centrality, the Jaccard Coefficient and Preferential Attachment. We also use two more very effective similarity-based link prediction algorithms: similarity by local random walk and similarity by superposed random walk. Finally, we have chosen a well-known link prediction algorithm that uses graph neural networks, SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction) [27]. The evaluation methods used in graph neural network platforms were not compatible with ours, so we adjusted the SEAL output to be compatible with our results in order to have objective metrics and comparisons. We explore and compare the performance of all the above approaches on several data sets, while we propose new similarity metrics. In these new metrics we have integrated additional node information in an adapted way that takes into account the structure of the network. The impact of the added information differs from network to network, depending on the network structure. In this way, the new similarities are more efficient and give better link prediction results.
Specifically, we define four node similarities based on the local random walk and the superposed random walk. The innovation lies in the calculation of these similarities, where we add, in an adapted way, more information (the sum of the degrees of a node's neighbors) according to the structure of each network to which they are applied. For the similarity adaptation we use sigmoid functions, namely the logistic and the hyperbolic tangent. For better adaptation, we design a genetic algorithm for the calculation of the adapted sigmoid function's growth parameter. By this adaptation we succeed, in most cases, in obtaining more accurate similarities leading to better prediction results. To test and evaluate the four new similarities, we have performed several experiments on different types of networks.
The general link prediction algorithms that have been proposed so far in the literature apply the same logic to all complex networks (common neighbors, node degree, etc.). As a result, they are not equally effective on all types of complex networks, since the structure of networks is not always the same; on the contrary, it differs in many cases. Graph neural networks work in a different way: they adapt to each network and, by learning its structure, try to give efficient estimates for missing link prediction. However, as shown by the experiments we have done in this paper, their effectiveness is limited and in most cases considerably worse than that of classical algorithms. Our proposal is to approximate the structure of a complex network by giving an adjusted, structure-dependent weight to additional information that we can embed into effective algorithms such as LRW and SRW. To achieve this goal we use well-known kernel functions such as sigmoids, whose parameters we adjust appropriately by a genetic algorithm to achieve the best possible approximation. The proposed method is quite flexible compared to graph neural networks or classical link prediction algorithms, because for a specific network we can choose the most efficient kernel function (logistic or tanh) as well as the most suitable link prediction algorithm (LRW or SRW), in order to get even better results. In this paper, we report the experiments we have done on several types of networks, and their results demonstrate the superiority of the proposed method over the benchmark link prediction algorithms we have used.
The remaining sections of the paper are organized as follows: in section II, we describe the related work in link prediction, the five benchmark link prediction algorithms and the two random walk similarity algorithms. In section III, we present the proposed approach, providing the necessary theoretical background as well as the definitions of the newly suggested similarities. Section IV describes the designed genetic algorithm, while section V provides the evaluation steps of our proposed approach. In section VI, we present the results of our experiments and, finally, in section VII, we conclude the paper and provide directions for future work.

II. RELATED WORK
In the link prediction problem, the considered topology is represented as a graph G = (V, E), where V is the node set and E the edge set. For two nodes u, v ∈ V, e(u, v) ∈ E represents a link between nodes u and v. A link prediction algorithm over G attempts to determine whether a link e(u, v) will exist between nodes u and v in the future, given that it is not present at the moment.
Link prediction has been studied extensively over the previous decade, with a vast relevant literature available. There have already been several surveys regarding the link prediction problem [30], [31], [32]. There are many different link prediction algorithms, owing to the different application objectives and underlying network topologies. There exist link prediction algorithms for PPI (protein-protein interaction) networks [33], [34], [35], [36], [37], for social networks [19], [38], [39], for recommendation systems [40], etc. There are also special algorithms or frameworks that try to combine well-known link prediction algorithms in order to produce better predictions [41], [42].
Among the many link prediction algorithms that exist in the literature, we have chosen five of the most well-known and effective ones for the performed comparisons. These are: Resource Allocation Index [20], Adamic-Adar Index [19], Common Neighbor Centrality [13], [14], [15], Jaccard Coefficient [16], [17], and Preferential Attachment [17], [18]. We describe them in more detail in the following: (i) The Resource Allocation Index of nodes u and v is defined as

s^RA_uv = Σ_{w ∈ N(u) ∩ N(v)} 1/d_w,

where N(u) denotes the neighborhood of node u and d_w the degree of node w. According to RA, the more common neighbors nodes u and v have, the higher the value of this index. The RA index grows even more if the common neighbors of u and v have a small degree, i.e., if they have few edges to other nodes.
(ii) The Adamic-Adar index of nodes u and v is defined as

s^AA_uv = Σ_{w ∈ N(u) ∩ N(v)} 1/log(d_w).

The AA index is very similar to the RA index. The only difference is that, rather than considering the degree of the common neighbors, the logarithm of this degree is considered.
(iii) The Common Neighbor and Centrality index of nodes u and v is defined as

s^CNC_uv = m · |N(u) ∩ N(v)| + (1 − m) · N / dist_uv,

where m is a parameter that varies in the range [0, 1], N denotes the total number of nodes in the network, and dist_uv denotes the shortest distance between u and v. CNC is based on two vital properties of nodes, namely the number of common neighbors and their centrality [1].
(iv) The Jaccard index of nodes u and v is defined as

s^JC_uv = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|.

The JC index takes values between 0 and 1, with 0 indicating no overlap and 1 complete overlap between the neighbor sets.
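The local heuristic indices above can be computed directly from an adjacency structure. The following sketch (not the authors' implementation) uses plain Python dictionaries of neighbor sets; the CNC parameter value m = 0.8 is an illustrative choice, not one taken from the paper.

```python
import math

def shortest_dist(adj, u, v):
    """BFS shortest-path length between u and v; None if unreachable."""
    if u == v:
        return 0
    seen, frontier, d = {u}, {u}, 0
    while frontier:
        d += 1
        frontier = {w for x in frontier for w in adj[x]} - seen
        if v in frontier:
            return d
        seen |= frontier
    return None

def heuristic_scores(adj, u, v, m=0.8):
    """RA, AA, JC, PA and CNC indices for a candidate edge (u, v).

    `adj` maps each node to the set of its neighbors; `m` is the CNC
    trade-off parameter in [0, 1].
    """
    Nu, Nv = adj[u], adj[v]
    common = Nu & Nv
    ra = sum(1.0 / len(adj[w]) for w in common)              # Resource Allocation
    aa = sum(1.0 / math.log(len(adj[w]))                     # Adamic-Adar
             for w in common if len(adj[w]) > 1)
    jc = len(common) / len(Nu | Nv) if (Nu | Nv) else 0.0    # Jaccard
    pa = len(Nu) * len(Nv)                                   # Preferential Attachment
    dist = shortest_dist(adj, u, v)
    cnc = (m * len(common) + (1 - m) * len(adj) / dist) if dist else 0.0  # CNC
    return {"RA": ra, "AA": aa, "JC": jc, "PA": pa, "CNC": cnc}
```

For instance, on a triangle {1, 2, 3} with a pendant node 4 attached to node 3, the pair (1, 4) has the single common neighbor 3 of degree 3, giving RA = 1/3 and JC = 1/2.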
(v) The Preferential Attachment index for nodes u and v is defined as

s^PA_uv = d_u · d_v.

According to PA, nodes that have a very high degree are more likely to attract more neighbors in the future. Another class of link prediction algorithms is based on random walks on a graph G = (V, E). A random walk on a graph is a Markov chain process describing the sequence of nodes visited by a random walker [43], [44]. We can describe this process by the transition probability matrix P. The probability that a random walker staying at node u will walk to node v in the next step is P_uv = a_uv / d_u, where a_uv = 1 if e(u, v) ∈ E and a_uv = 0 if e(u, v) ∉ E. Parameter d_u denotes the degree of node u. Given a random walker starting from node u, and denoting by π_uv(t) the probability that this walker is located at node v after t steps, we define

π_u(t) = P^⊺ π_u(t − 1),

where π_u(0) is an N × 1 vector with the u-th element equal to 1 and the others equal to 0, and ⊺ denotes the matrix transpose. The initial value is usually assigned according to the importance of nodes [45].
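The transition matrix and the t-step walk distribution described above can be sketched as follows (an illustration with a dense list-of-lists matrix, sufficient for small graphs; `order` fixes the node-to-index mapping):

```python
def transition_matrix(adj, order):
    """Row-stochastic matrix with P[u][v] = a_uv / d_u."""
    return [[(1.0 / len(adj[u]) if v in adj[u] else 0.0) for v in order]
            for u in order]

def walk_probs(P, start, t):
    """pi_u(t): start from the indicator vector on index `start` and
    apply the transition matrix t times."""
    n = len(P)
    pi = [0.0] * n
    pi[start] = 1.0
    for _ in range(t):
        pi = [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]
    return pi
```

Each row of P sums to 1, and the walk distribution remains a probability vector at every step.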
The local random walk similarity index is a similarity index based on random walks [24]. According to this link prediction algorithm, we set the initial value of node u proportional to its degree d_u. The similarity index by local random walk is defined as

s^LRW_uv(t) = (d_u / 2|E|) · π_uv(t) + (d_v / 2|E|) · π_vu(t),   (7)

where |E| is the number of edges of the graph. Clearly s^LRW_uv = s^LRW_vu. The fractions in (7) represent the importance of nodes, normalized by the number of graph edges. We have to note that the local random walk similarity index refers to the few-step random walk and not the stationary state, which is characterized by the eigenvector centrality [46], [47]. In the stationary state, the stationary distribution will be π_uv = d_v / 2|E|, and then according to (7) we will obtain a quantity proportional to d_u · d_v, which is the preferential attachment index defined in (5). From the local random walk similarity index, one can define the superposed random walk similarity index [24] as:

s^SRW_uv(t) = Σ_{τ=1}^{t} s^LRW_uv(τ).

It is obvious that s^SRW_uv = s^SRW_vu. The superposed random walk algorithm continuously releases the walkers at the starting point, resulting in a higher similarity between the target node and the nodes nearby. This algorithm gives better results in networks with high clustering or locality.
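The LRW and SRW scores of [24] can be sketched directly from these definitions; the following is a self-contained illustration over adjacency dictionaries, not the authors' code:

```python
def _walk(adj, start, t):
    """t-step occupation probabilities of a walker starting at `start`."""
    pi = {n: 0.0 for n in adj}
    pi[start] = 1.0
    for _ in range(t):
        nxt = {n: 0.0 for n in adj}
        for u, p in pi.items():
            for w in adj[u]:
                nxt[w] += p / len(adj[u])
        pi = nxt
    return pi

def lrw(adj, u, v, t):
    """Local random walk similarity of eq. (7)."""
    two_E = sum(len(nbrs) for nbrs in adj.values())   # 2|E|
    return (len(adj[u]) / two_E) * _walk(adj, u, t)[v] \
         + (len(adj[v]) / two_E) * _walk(adj, v, t)[u]

def srw(adj, u, v, t):
    """Superposed random walk: LRW contributions summed over 1..t steps."""
    return sum(lrw(adj, u, v, tau) for tau in range(1, t + 1))
```

Both scores are symmetric in u and v by construction.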
With the great development of machine learning, several efforts and works have addressed the link prediction problem. Usually, graph neural networks (GNNs) are used, which try to make more accurate link predictions by learning from graph topology and node/edge features in a unified way. Two GNN-based link prediction approaches stand out: node-based and subgraph-based. Node-based methods first try to learn and extract a node representation through a GNN, and then aggregate the pairwise node representations into link representations for link prediction [25], [26]. Subgraph-based methods extract a local subgraph around each target link and learn a subgraph representation through a GNN for link prediction [27], [28]. Subgraph-based methods actually have a higher link representation ability than node-based methods, due to modeling the associations between the two target nodes [48]. A very effective work in this area is the SEAL framework (learning from Subgraphs, Embeddings, and Attributes for Link prediction) [27]. SEAL first extracts enclosing subgraphs around target links to use for prediction. It then applies a node labeling to the enclosing subgraphs to differentiate nodes of different roles within a subgraph. Finally, the labeled subgraphs are fed into a GNN to learn graph structure features (supervised heuristics) for link prediction. These supervised heuristics can capture the ranking information of other link prediction algorithms such as Jaccard, common neighbors, etc.
We use all these link prediction algorithms as benchmark algorithms in our experiments in section VI. We note that the above approaches for link prediction are a small, indicative subset of the vast relevant research, and they were selected strategically due to their employed approaches, performance results and relevance to our approach.
There are other relevant works, e.g., [49], which address similar issues but with a different scope. For instance, [49] proposed MLRW, an extended version of the local random walk based on pure random walking, for solving link prediction in multiplex networks. Information mined from the inter-layer and intra-layer structure of a multiplex network was leveraged to define a biased random walk for finding the probability of the appearance of a new link in one target layer. However, [49] focuses on multi-layer networks, whereas our approach considers single-layer networks.

III. ADAPTED NODE SIMILARITY APPROACH
With the method of finding a similarity index between two nodes using either the Local Random Walk or the Superposed Random Walk, which we have described in detail in section II, it has been shown experimentally that one can achieve quite good results for missing link prediction in complex networks [24]. However, a limitation that prevents even better results is that these methods are applied uniformly to all types of networks. In the calculation of node similarity, an important role is played by the degree of each node that participates in the creation of the edge whose emergence is to be predicted (or not) in the network. If one could enter additional information into the node similarity metric employed, then one could distinguish the nodes of the examined network from each other even more. This would lead to even better results.
In this paper, we examine complex networks for which we have no information other than their structure. Therefore, the ability to extract useful information for each node of the network is quite limited. However, in addition to the degree of each node in the network, we can use the sum of the degrees of its neighboring nodes, i.e., the nodes with which it is directly connected. We normalize this sum by dividing it by 2|E|, where |E| is the number of graph edges, just as the degree of each node is normalized in the initial similarity calculation.
However, if we used this quantity directly, we might have had good link prediction results in some networks and not in others. This is because not all networks have the same structure and properties; instead, they are usually very different from each other [1]. There are many structural properties, such as the distribution of vertex degrees, the diameter of the graph, clustering, graph efficiency, etc., in which networks can differ. A vast number of graph measures exist to describe the variations in graph structure. The most well-known and effective link prediction algorithms in complex networks are applied to all networks regardless of their structure and special features. This results in a link prediction algorithm giving better predictions on some types of networks and worse on others.
For example, the common neighbors and resource allocation index link prediction algorithms, by construction, score high those node pairs that have many common neighbors and score low those that have no or only a limited number of common neighbors. These link prediction algorithms are quite effective in social networks, where it is quite likely that nodes with a large number of common neighbors are connected to each other. But they are not particularly effective in networks in which the number of common neighbors between two nodes is not particularly important or is usually low or zero, such as biological networks.
Our effort is to extend existing efficient link prediction algorithms, such as node similarity based on the Local Random Walk or the Superposed Random Walk, by embedding new information (the sum of the degrees of a node's neighbors) to improve link predictions. The integration of this new information must be done in a way adapted to the structure of the network on which the link prediction algorithm will be applied. With the help of sigmoid functions, we manage to take the added information into account as needed in each type of complex network and to modify the similarity of the nodes properly, in order to obtain better missing link estimations.
At this point, we provide some fundamental definitions and concepts of sigmoid functions. Two of the best known sigmoid functions are the logistic function and the hyperbolic tangent function (tanh), described in the following.

A. LOGISTIC SIGMOID FUNCTION
The logistic sigmoid function or logistic curve is a common S-shaped curve (sigmoid curve) with equation

σ(u) = 1 / (1 + e^(−a·u)),

where e = 2.718281828... is the mathematical constant serving as the base of the natural logarithms and a is the growth parameter (steepness of the curve). The logistic sigmoid function gives output values in (0, 1) and is monotonic. For real values of u in the range −∞ < u < +∞, the curve has two asymptotes, σ(u) = 1 as u → +∞ and σ(u) = 0 as u → −∞. For u = 0, σ(u) = 0.5. In Figure 1, we present the logistic sigmoid function graph for various values of parameter a. We can see how the slope of the curve changes for the various values of parameter a.

B. TANGENT HYPERBOLIC SIGMOID FUNCTION
The hyperbolic tangent function, or simply tanh, is another very well known S-shaped function, with equation

tanh(a·u) = (e^(a·u) − e^(−a·u)) / (e^(a·u) + e^(−a·u)),

where e and a are defined as before. The hyperbolic tangent function takes values in (−1, 1) and is also monotonic. The main difference is that the tanh function pushes the input values towards 1 and −1, instead of the 1 and 0 of the logistic function. For u = 0 it gives the value 0; for positive input it gives positive output and for negative input it gives negative output.
In Figure 1, we show the tanh function graph for various values of parameter a. We can see that it yields output values in (−1, 1), as well as the change of the slope of the curve for the various values of parameter a.

C. RELATION BETWEEN THE LOGISTIC SIGMOID FUNCTION AND TANH FUNCTION
We will prove that the tanh function is just a re-scaled version of the logistic sigmoid function σ.
In Figure 1, we have plotted these functions together. For simplicity, we assume that a = 1. Since the logistic sigmoid function is symmetric around the point (0, 0.5) and returns a value in the range (0, 1), we can write the following relationship:

σ(−u) = 1 − σ(u).

Now, to see the relationship between tanh(u) and σ(u), let us rearrange the tanh function into a similar form:

tanh(u) = (e^u − e^(−u)) / (e^u + e^(−u)) = (1 − e^(−2u)) / (1 + e^(−2u)) = 2 / (1 + e^(−2u)) − 1.

Now, from the logistic sigmoid's perspective, we have:

2σ(2u) − 1 = 2 / (1 + e^(−2u)) − 1 = tanh(u).

Hence, we can conclude that the tanh function is just a re-scaled version of the logistic sigmoid function σ.
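The re-scaling identity derived above can be verified numerically; the short sketch below defines both functions with a growth parameter a and checks tanh(u) = 2σ(2u) − 1 at a few sample points:

```python
import math

def logistic(u, a=1.0):
    """Logistic sigmoid with growth parameter a: 1 / (1 + e^(-a*u))."""
    return 1.0 / (1.0 + math.exp(-a * u))

def tanh_sig(u, a=1.0):
    """Hyperbolic tangent sigmoid with growth parameter a."""
    return math.tanh(a * u)

# The re-scaling identity derived above: tanh(u) = 2*sigma(2u) - 1.
for u in (-3.0, -0.5, 0.0, 0.7, 2.5):
    assert abs(tanh_sig(u) - (2.0 * logistic(2.0 * u) - 1.0)) < 1e-12
```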

D. ENHANCED CUSTOM SIMILARITY
According to the previous definition of similarity based on the local random walk in (7), we define the new enhanced graph custom similarity after t steps for the local random walk as:

customS^LRW_ij(t) = (d_i / 2|E|) · sigmoid(u_i) · π_ij(t) + (d_j / 2|E|) · sigmoid(u_j) · π_ji(t),   (14)

where d_i, d_j are the degrees of nodes i, j, |E| is the number of graph edges, π_ij(t) is the probability that a random walker starting from node i is located at node j after t steps, and sigmoid is a selection of the logistic sigmoid function σ or the tanh function. The inputs u_i, u_j of the sigmoid function are:

u_i = nsd(i) / 2|E|,   u_j = nsd(j) / 2|E|,   (15)

where nsd(i), nsd(j) are the neighbors sum degrees of nodes i, j, respectively. For the superposed random walk our new similarity is:

customS^SRW_ij(t) = Σ_{τ=1}^{t} customS^LRW_ij(τ).   (16)

Equations (14) and (16) are adaptable to any network we want to apply them to. This is due to the sigmoid function formulas and, in particular, to the values of the growth parameter a that we set. In Figure 1, we can see that for different values of the growth parameter a the slope of the S-curve changes, and this has the consequence that the sigmoid function gives different output results. This is a very convenient property of the growth parameter a, because it allows us to control the impact of the new information we embed into the link prediction score calculation, depending on the network to which we are applying the enhanced custom similarity.
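The enhanced custom similarity can be sketched as below. Note that this is our reading of (14) under the assumption that each node's LRW term is weighted by the sigmoid of its normalized neighbors-sum-degrees; the exact placement of the sigmoid factor is an assumption of this sketch, not a certainty of the original equation:

```python
import math

def nsd(adj, i):
    """Neighbors sum degrees: the sum of the degrees of i's neighbors."""
    return sum(len(adj[j]) for j in adj[i])

def _walk(adj, start, t):
    """t-step occupation probabilities of a walker starting at `start`."""
    pi = {n: 0.0 for n in adj}
    pi[start] = 1.0
    for _ in range(t):
        nxt = {n: 0.0 for n in adj}
        for u, p in pi.items():
            for w in adj[u]:
                nxt[w] += p / len(adj[u])
        pi = nxt
    return pi

def custom_lrw(adj, i, j, t, a, sigmoid=None):
    """Enhanced custom LRW similarity (our reading of eq. (14)): each
    node's LRW term is weighted by sigmoid(nsd(node) / 2|E|)."""
    two_E = sum(len(nbrs) for nbrs in adj.values())
    sig = sigmoid or (lambda x: 1.0 / (1.0 + math.exp(-a * x)))
    u_i, u_j = nsd(adj, i) / two_E, nsd(adj, j) / two_E
    return (len(adj[i]) / two_E) * sig(u_i) * _walk(adj, i, t)[j] \
         + (len(adj[j]) / two_E) * sig(u_j) * _walk(adj, j, t)[i]

def custom_srw(adj, i, j, t, a, sigmoid=None):
    """Enhanced custom SRW similarity (eq. (16)): summed over 1..t steps."""
    return sum(custom_lrw(adj, i, j, tau, a, sigmoid)
               for tau in range(1, t + 1))
```

With the sigmoid factor forced to 1, the enhancement collapses back to the plain LRW score, which makes the role of the added term easy to isolate.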
Therefore, for every network to which we want to apply the enhanced custom similarity, we should have previously found the appropriate value of the growth parameter a that gives the best link prediction score results for that specific network. This is because we insert new information according to the network's specific structure and properties. We can think of the growth parameter a as a controller that directs the results of the enhanced custom similarity to optimal regions in the search space.
This requires that the enhanced custom similarity first be trained for the network to which it will be applied, in order to find the optimal value of the growth parameter a, denoted a⋆, which will give us the best link prediction score. The training of the enhanced custom similarity for a specific network is done using a genetic algorithm that we have developed. The genetic algorithm takes as input the network with its nodes and edges and returns as output the value of a⋆ for this network. The description of the genetic algorithm is given in the next section.
We use the two sigmoid functions, applying one at a time to the enhanced custom similarity, in order to have another chance for better link prediction results, even though the tanh function is just a re-scaled version of the logistic sigmoid function σ, as we showed in subsection III-C. There are networks in which the logistic sigmoid function σ gives better link prediction results, networks in which tanh gives better link prediction results, and networks in which both sigmoid functions give roughly the same link prediction results.

IV. ADAPTIVE GENETIC ALGORITHM
In order to be able to find the optimal value of the growth parameter, a⋆, for every graph, we have developed a relevant genetic algorithm. At the start of the algorithm, we provide as input the graph we are analyzing as well as a dictionary containing the connections of each node in the graph. We also give the input parameters needed to execute the genetic algorithm, which are the population size and the number of epochs. Finally, we provide to the genetic algorithm as input the set of edges that we have removed from the original graph, i.e., the train set, and the step number of the random walk. Below, we describe the preliminary steps and the different parts of the genetic algorithm.

A. TRAIN AND TEST GRAPHS AND SETS CREATION
Again, we consider a network as a graph G = (V, E). First, we randomly remove K edges from the graph and store them in a set, denoted as E_totalremoved. The threshold K comes from a percentage of the number of edges in the network, usually more than 500 and less than one third of the number of edges in the graph, so as not to spoil the original graph's structure.
In the experiments we have performed, the number of edges of the networks we have used is over 500, and we removed a small percentage of them (between 5% and 30%). The resulting graph is G′ = (V, E ∖ E_totalremoved). We split E_totalremoved into two equal subsets E_r1, E_r2. We also create another graph as G″ = (V, E ∖ E_r2). After all these steps we have: G′ as G_train and E_r1 as E_train. These two, the train graph and the train edge set, we give as input to our genetic algorithm.
Likewise, we have G″ as G_test and E_r2 as E_test. With these two, the test graph and the test edge set, we can calculate how effective our link prediction is each time, after applying our genetic algorithm.
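The split described above can be sketched as follows; this is an illustrative reconstruction of the procedure (the removed edges are halved into a train probe set E_r1 and a test probe set E_r2, and the two graphs keep all nodes):

```python
import random

def make_train_test(V, E, K, seed=0):
    """Remove K random edges, halve them into E_r1 (train probes) and
    E_r2 (test probes), and return the edge sets of
    G' = (V, E \ E_totalremoved) and G'' = (V, E \ E_r2)."""
    rng = random.Random(seed)
    removed = rng.sample(sorted(E), K)
    E_r1, E_r2 = set(removed[:K // 2]), set(removed[K // 2:])
    G_train_edges = set(E) - E_r1 - E_r2
    G_test_edges = set(E) - E_r2
    return (G_train_edges, E_r1), (G_test_edges, E_r2)
```

Note that the test graph G″ still contains the train probe edges E_r1, while neither graph contains the edges it is asked to predict.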

B. CHROMOSOME REPRESENTATION -POPULATION
We consider a chromosome c as a class object that contains a variable named a ∈ ℜ and a variable named score ∈ Z. Variable a represents the growth parameter of the sigmoid function, which we want to optimize, and variable score is the score achieved with this value of a. We describe the score calculation in section IV-E. Initially, a has a random value and score equals 0. Through the chromosome interface, the genetic algorithm can call functions of the chromosome object to access the chromosome data.
We define as population P_n a finite set of chromosomes such that P_n = {c_i} for 0 ≤ i < n. In the initial state of the genetic algorithm, the population P_n consists of randomly initialized chromosomes.
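The chromosome representation and random initial population can be sketched as below; the search interval for a is an illustrative assumption, not a value taken from the paper:

```python
import random

class Chromosome:
    """A candidate growth parameter `a` together with its achieved score."""
    def __init__(self, a):
        self.a = a        # growth parameter of the sigmoid, to be optimized
        self.score = 0    # score achieved with this value of a (0 initially)

def init_population(n, a_range=(-10.0, 10.0), seed=0):
    """Random initial population P_n of n chromosomes."""
    rng = random.Random(seed)
    return [Chromosome(rng.uniform(*a_range)) for _ in range(n)]
```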

C. FITNESS FUNCTION -OPTIMIZATION
With our genetic algorithm we try to maximize a function f(x), x ∈ P_n. This function returns the score of chromosome x. The problem here is to find the chromosome x_opt such that:

f(x_opt) ≥ f(x), for all x ∈ P_n.

Let us assume, without loss of generality, that f(x) ≥ 0 for all x ∈ P_n. This is an optimization problem. For this problem, we define the fitness function, or objective function, fit for a chromosome x as equivalent to the function f:

fit(x) = f(x) = score of c,

where c corresponds to x.

D. INITIALIZATION PHASE
The genetic algorithm starts by making some essential calculations according to its input. First, in addition to its parameters, the genetic algorithm calculates and keeps in memory a one-dimensional array referred to as SNBD. We define the SNBD (sum of neighbors' degrees) matrix as:

SNBD[i] = Σ_{(i,j) ∈ Neigh(i)} d_j,

where Neigh(i) = {(i, j)} such that, for G = (V, E), i, j ∈ V and (i, j) ∈ E. In the initial phase, the genetic algorithm also calculates and keeps in memory the transition probability matrix M (we have given its definition before) after as many steps as it takes as input, usually 2 steps (because this number of steps usually gives better link prediction results).
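The SNBD precomputation is a single pass over the adjacency structure; a minimal sketch over a dictionary of neighbor sets:

```python
def snbd(adj):
    """Precompute SNBD[i] = sum of the degrees of i's neighbors."""
    return {i: sum(len(adj[j]) for j in adj[i]) for i in adj}
```

Because this table and the t-step transition matrix do not depend on the chromosome being evaluated, computing them once up front removes that cost from every call of the evaluation function.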
We maintain these two matrices in memory to lower the calculation cost every time we need them during the genetic algorithm's execution.

E. CHROMOSOME EVALUATION
For every chromosome c ∈ P_n, the genetic algorithm applies the evaluation function eval(c), which returns the score of chromosome c. The eval function for a chromosome c operates as follows. Let E′ be the set of node pairs whose edges do not belong to G_train = (V, E), that is,

E′ = {(u, v) : u, v ∈ V and (u, v) ∉ E},

and let Eval_Set = {t_i}, for 0 ≤ i < |E′|, where t_i is a tuple (u, v, rval) with u, v ∈ V, (u, v) ∈ E′ and rval the customS^LRW_uv(k) or customS^SRW_uv(k) link prediction estimation value for the edge (u, v) after k steps, where rval ∈ R and k ∈ Z. Let Eval_Set_sorted be the Eval_Set sorted in descending order by rval, and let Eval_Set_topK be the set of tuples (u, v) from the top |E_train| elements of Eval_Set_sorted. We define the score of a chromosome c as the number of train edges recovered among these top-ranked pairs:

score(c) = |Eval_Set_topK ∩ E_train|.

F. SELECTION

In the selection process, we use the roulette method [50], [51], [52], according to the following steps:
• Calculate the fitness value fit(c_i) for each chromosome c_i ∈ P_n, where 0 ≤ i < n.
• Find the total fitness of the current population P n , FIT = n−1 i=0 fit(c i ).• Calculate the probability g i of selection where g i = fit(c i )/FIT .
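The roulette selection steps can be sketched as follows; the chromosome representation (a bare growth parameter) and the fitness function are illustrative assumptions.

```python
import random

def roulette_select(population, fitness, rng=random):
    total = sum(fitness(c) for c in population)        # FIT
    probs = [fitness(c) / total for c in population]   # g_i
    r, cum = rng.random(), 0.0
    for c, g in zip(population, probs):
        cum += g
        if r <= cum:
            return c
    return population[-1]                              # guard against rounding

random.seed(1)
pop = [0.2, 1.0, 3.0]   # chromosomes identified by their growth parameter a
picked = [roulette_select(pop, lambda c: c) for _ in range(1000)]
# The chromosome with a = 3.0 carries ~3/4.2 of the wheel, so it dominates.
print(picked.count(3.0) > picked.count(0.2))
```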
G. CROSSOVER
Crossover [50], [51], [52] is a very important genetic process according to the theory of genetic algorithms. A large part of the total population goes through the crossover process (the probability we have used in the experiments is usually 0.9). With crossover, two individuals of the population exchange genetic material. In our genetic algorithm, we use the arithmetical crossover [51], [52] variant suggested by the literature. Let us assume we have selected, from the selection process, two chromosomes c_1, c_2 with growth parameters a_1, a_2, respectively.
After the crossover process, two offspring c′_1 and c′_2 are produced from c_1 and c_2, with growth parameters a′_1 and a′_2 computed as an arithmetic combination of a_1 and a_2 controlled by b, where b is a small random number in [0, 0.1].
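A minimal sketch of the arithmetical crossover step, assuming the standard convex-blend form of arithmetical crossover with the blend factor b drawn from [0, 0.1] as stated above; the exact offspring formula is an assumption, and the names are illustrative.

```python
import random

def arithmetical_crossover(a1, a2, rng=random):
    # Convex blend of the two growth parameters (assumed form).
    b = rng.uniform(0.0, 0.1)
    a1_new = (1 - b) * a1 + b * a2
    a2_new = (1 - b) * a2 + b * a1
    return a1_new, a2_new

random.seed(0)
c1, c2 = arithmetical_crossover(2.0, 4.0)
# Offspring stay inside [min(a1, a2), max(a1, a2)] and preserve the sum.
print(2.0 <= c1 <= 4.0, 2.0 <= c2 <= 4.0)
```

With b this small, each offspring stays close to its own parent, so crossover makes conservative moves through the search space.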

H. MUTATION
Mutation [50], [51], [52] is also a very important genetic process according to the theory of genetic algorithms. During this process, the genetic algorithm gives the growth parameter a of a selected chromosome c a new random value. With the mutation process, we try to prevent the genetic algorithm from becoming trapped in local extrema. We expect a small part of the population in each generation to undergo mutation, about 15%, according to the settings we have given to the genetic algorithm.

I. REPRODUCTION -ALGORITHM STEPS
At each step, i.e., at each generation, our genetic algorithm passes the population of that generation through the genetic processes we described before. During the application of these genetic processes, new chromosomes are produced in place of the old ones; that is, in each generation we have a new population to manage. Our genetic algorithm follows the elitist mode [51], according to which it boosts the population with the best chromosomes found so far. Thus, it converges easily toward the maximum chromosome scores. After its execution, the genetic algorithm returns the list of the best chromosomes, i.e., the chromosomes with the highest score among the chromosomes of every population. In Algorithm 1, we give our genetic algorithm's pseudo code.
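The overall elitist generational loop can be sketched as follows; the operators here are simplified stand-ins for the selection, crossover and mutation processes defined in the previous subsections, and the toy fitness function is purely illustrative.

```python
import random

def run_ga(eval_fn, epochs=20, pop_size=10, cx_prob=0.9, mut_prob=0.15,
           space=(0.0, 10.0), rng=random):
    pop = [rng.uniform(*space) for _ in range(pop_size)]
    best = max(pop, key=eval_fn)                  # elitist bookkeeping
    for _ in range(epochs):
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = rng.choices(pop, k=2)        # selection stand-in
            if rng.random() < cx_prob:            # arithmetical crossover
                b = rng.uniform(0.0, 0.1)
                p1, p2 = (1 - b) * p1 + b * p2, (1 - b) * p2 + b * p1
            if rng.random() < mut_prob:           # mutation: fresh random value
                p1 = rng.uniform(*space)
            new_pop += [p1, p2]
        new_pop[0] = best                         # elitism: keep best so far
        pop = new_pop[:pop_size]
        best = max(pop, key=eval_fn)
    return best

random.seed(3)
# A toy fitness peaked at a = 2.5.
a_opt = run_ga(lambda a: -(a - 2.5) ** 2)
print(0.0 <= a_opt <= 10.0)
```

Because the elite chromosome is reinserted every generation, the best score found never degrades, which is the convergence behaviour the text describes.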

V. PERFORMANCE EVALUATION
To evaluate the effectiveness of the proposed approach, we applied our model to various graphs. These graphs come either from real networks that we found publicly available from various sources or from artificial network topologies that we constructed with the aid of Python libraries, such as networkx.

A. EVALUATION STEPS
The method we have used to test our model on different types of networks and draw useful conclusions consists of the steps reported below.

The data sets we select or create always have the form of tuples. Each tuple consists of two nodes and represents an edge of the graph. Nodes are represented by natural numbers. To apply all link prediction algorithms to a data set, we only work with strongly connected graphs. In the data sets, some graphs may have more than one connected component; of these connected components we always keep the largest one.
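A minimal sketch of this preprocessing step with NetworkX, for a graph loaded as (node, node) integer tuples:

```python
import networkx as nx

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]    # two components
G = nx.Graph(edges)
# Keep only the largest connected component.
largest = max(nx.connected_components(G), key=len)
G = G.subgraph(largest).copy()
print(sorted(G.nodes()))                     # → [0, 1, 2]
```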

Algorithm 1 Genetic Algorithm Pseudo Code
Input: The train graph: Graph(V, E); the dictionary containing the connections and the neighbors of each node in the graph: Dict; the set of edges that we have removed from the original graph: Removed_edges; the population size: pop_size; the number of epochs: epochs; the step number of a random walk: rwalk_steps; the crossover probability: CROSSOVER_PROB; the mutation probability: MUTATION_PROB
Output: The list of chromosomes, as defined in subsection IV-B, with the highest score among the chromosomes (atoms) of every population: best_chroms_list
Initialisation Phase:
1: Compute the SNBD matrix by Eq. (22)
2: Compute the P^rwalk_steps matrix using Dict
…
prob = random(0, 1)
…
Append the best_chroms_list atoms to P
38: end for
39: return best_chroms_list

2) CREATE TRAIN SETS AND TEST SETS
With the help of Python code we have developed, we randomly remove some edges from the initial data set, typically 5% to 30% of the initial set, depending on the size of the network under examination, to get clearer results. For each edge we remove, we check that the graph remains strongly connected. With these removed edges we create two more sets: the train_omissible_edges set and the test_omissible_edges set. These sets usually have the same number of edges. The remaining graph is simply the train_graph set. We also create the test_graph set, which is the union of the train_graph set and the train_omissible_edges set.
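The train/test construction above can be sketched as follows; the function name and the 20% removal fraction are illustrative choices within the 5%-30% range stated in the text, and for an undirected graph the connectivity check is nx.is_connected.

```python
import random
import networkx as nx

def make_splits(G, fraction=0.2, rng=random):
    G_train = G.copy()
    removed = []
    target = int(fraction * G.number_of_edges())
    candidates = list(G.edges())
    rng.shuffle(candidates)
    for u, v in candidates:
        if len(removed) == target:
            break
        G_train.remove_edge(u, v)
        if nx.is_connected(G_train):
            removed.append((u, v))
        else:                        # removal would disconnect: put it back
            G_train.add_edge(u, v)
    half = len(removed) // 2
    train_om, test_om = removed[:half], removed[half:]
    G_test = G_train.copy()
    G_test.add_edges_from(train_om)  # test_graph = train_graph ∪ train_omissible
    return G_train, train_om, G_test, test_om

random.seed(7)
G = nx.random_regular_graph(4, 30, seed=7)
G_train, train_om, G_test, test_om = make_splits(G)
print(nx.is_connected(G_train))
```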

3) APPLY GENETIC ALGORITHM TO FIND PROPER GROWTH PARAMETER A FOR THE SELECTED NETWORK
We execute the genetic algorithm, giving as input the train_graph set and the train_omissible_edges set. We also set up the genetic algorithm according to the selected sigmoid function (σ or tanh). The genetic algorithm explores the search space and returns the best growth parameter a it has found. It is worth noting that there can be more than one growth parameter a giving the same results on the train sets. We always keep the smallest of these parameters, because if we used one of the larger ones we might not get the best results in the test set evaluation.

4) APPLY ENHANCED CUSTOM SIMILARITY WITH ADAPTED GROWTH PARAMETER A
This is the last and most critical step of the evaluation process. The application and evaluation of the enhanced custom similarity on the test_graph set is done with Python code that we have written. For each edge missing from the test_graph set, we apply the enhanced custom similarity, which returns an estimate for this edge as a real number. We store all estimates for all edges (along with the edge for each estimate) in an evaluation list and then sort this evaluation list in descending order of the enhanced custom similarity estimate. Then we use the test_omissible_edges set corresponding to the test_graph set we are considering. The evaluation of the enhanced custom similarity follows from how many edges in the first |test_omissible_edges| positions of the evaluation list are actually contained in the test_omissible_edges set. The more truly missing edges the evaluation list contains in its first |test_omissible_edges| positions, the more effective is the estimation that the enhanced custom similarity gives us for an edge that does not exist in the graph. With the results we get, we can compute three well-known metrics: Precision, AUC and AUPR.
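The ranking evaluation described above can be sketched as follows; the common-neighbour scorer is a stand-in for the enhanced custom similarity, and the function name is illustrative.

```python
import networkx as nx

def precision_at_missing(G_test, test_omissible, score):
    # Score every edge absent from the test graph, sort descending,
    # and count hits in the top |test_omissible| positions.
    candidates = list(nx.non_edges(G_test))
    ranked = sorted(candidates, key=lambda e: score(*e), reverse=True)
    k = len(test_omissible)
    truly_missing = {frozenset(e) for e in test_omissible}
    hits = sum(1 for e in ranked[:k] if frozenset(e) in truly_missing)
    return hits / k

G = nx.complete_graph(5)
G.remove_edges_from([(0, 1), (2, 3)])
# Common-neighbour count as a stand-in similarity.
cn = lambda u, v: len(list(nx.common_neighbors(G, u, v)))
print(precision_at_missing(G, [(0, 1), (2, 3)], cn))   # → 1.0
```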

B. FRAMEWORK OF PROPOSED METHOD
According to the steps we mentioned in the previous subsection V-A, we summarize in Algorithm 2 the framework of our proposal. Algorithm 2 returns the list E_result, which contains the edges estimated by the algorithm to be missing from the original graph. We can then compute the precision as the fraction whose numerator is the number of edges in E_result that are actually missing from the test graph and whose denominator is the number of edges estimated to be missing by our algorithm, |E_test|.

C. PARAMETERS SETUP
Algorithm 2 shows in pseudo code the steps we take to predict connections in a graph and derive metrics for each prediction. There are several parameters we give to the genetic algorithm we use, as well as ''portions'' of edges that we remove from the original graph in order to create train and test edge sets for training and testing the proposed method. Below we discuss, for each parameter, the logic for setting it in order to obtain good and reliable results in each execution of the proposed method.

1) TOTAL REMOVED EDGES
As we have already mentioned in subsection V-A, from every original graph we remove a percentage of edges between 5% and 30%. We have chosen this range because, on the one hand, we do not want to remove too many edges and thus change the structure of the original graph, and on the other hand, we do not want to remove a very small number of edges, because then we would not get reliable results. Usually, when calculating the similarities, the link prediction algorithms succeed in having better predictions at the higher scores, while as these scores decrease, the quality of the predictions worsens. We observe this in measurements that use a low threshold on the number of missing edges, resulting in accuracy values that are quite optimistic in relation to reality. In our own measurements we have defined as threshold exactly the number of edges we have removed from the original graph, which is large enough, and thus the results we report correspond to reality.

2) POPULATION SIZE -EPOCHS
In most of the experiments we have done, the population size did not need to be large. Usually with 10 atoms (chromosomes) we obtained pretty good results. Also, the number of epochs was between 10 and 30. When we tried to run the genetic algorithm with a large number of atoms and epochs (over 100), we had overfitting. That is, we obtained pretty good results when running the genetic algorithm on the training set but not on the test set.

3) SEARCH SPACE SETUP
According to the implementation of the method we propose, the genetic algorithm we use searches a state space for an optimal value of the growth parameter a. After several experiments we noticed that this value is not particularly large. This is also shown in Table 4, which lists the values of the growth parameter a for various networks. Also, the value of the growth parameter a is not particularly sensitive, i.e., it has enough tolerance to give us optimal results even if we add or subtract small amounts from its final value. With these observations we can considerably reduce the search space for an optimal value of the growth parameter a and make our genetic algorithm more efficient. In practice we usually start with a search space of [0, 10] and then increase or decrease the range of the interval accordingly. The search space is also limited to positive values (a ≥ 0).

4) CROSSOVER PROBABILITY
After several runs of the genetic algorithm, we consider the crossover probability value of 0.95 satisfactory, as it gives good results.

5) MUTATION PROBABILITY
After several runs of the genetic algorithm, we consider that a mutation probability value of 0.15 gives good results and sufficiently prevents the genetic algorithm from converging to a local extremum.

Algorithm 2 Enhanced Custom Similarity Framework Pseudo Code
Input: The network graph G(V, E)
Output: A set of potential missing edges
1: Remove n edges from G and create the edges set

D. ADAPTED ALGORITHM FOR EVERY NETWORK
During every execution of the genetic algorithm we have implemented, we can observe the results of applying the enhanced custom similarity with specific values of the growth parameter a to the network on which we perform link prediction. That is, we can see how many of the edges in the train_omissible_edges set our algorithm finds.
In the executions we have done, we almost always get better or much better link prediction results than the classic similarities, which we have already applied to make the necessary comparisons. Although these results are for the training sets and not for the test sets, we can safely say, after several executions, that if we get good results on the training sets then we will also get good results on the test sets. This is due to the fact that the enhanced custom similarity we propose is adapted, through the sigmoid function and the value of the growth parameter a, to the structure of the network to which we apply it and not to a snapshot of it at some point in time.

E. METRICS
To demonstrate the good link prediction results of our algorithm, as well as to evaluate it correctly in comparison to the other benchmark algorithms we have used, we use three well-known metrics from the literature: AUC, AUPR and Precision.

1) PRECISION
Precision is defined as the ratio of the true positive selected items to the sum of true positive and false positive selected items:

Precision = TP / (TP + FP)

The true positives are the edges that a link prediction algorithm found and that are really missing from the graph, while the denominator corresponds to the number of edges that we had actually removed from the graph beforehand. Clearly, a higher value of precision means higher link prediction accuracy.

2) AUC (AREA UNDER THE CURVE)
Another metric that we have measured and reported in our executions is the Area Under the Curve (AUC) [53]. For link prediction, AUC can be interpreted as the probability that a randomly selected missing edge has a higher score than a randomly selected nonexistent edge. In this work, each time we randomly pick a missing edge and a nonexistent edge and compare their scores; if among n independent comparisons there are n_1 times that the missing edge has a higher score and n_2 times that they have the same score, the AUC value is

AUC = (n_1 + 0.5 n_2) / n

If all the scores came from independent and identical distributions, the AUC would be approximately 0.5. Therefore, the degree to which AUC exceeds 0.5 indicates how much better the link prediction is than random selection. In our work, we have taken AUC measurements with n = 1000.
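The sampled AUC estimate above can be sketched as follows; the toy score values are purely illustrative.

```python
import random

def auc_sample(score, missing, nonexistent, n=1000, rng=random):
    # n random (missing edge, nonexistent edge) comparisons,
    # counting wins (n1) and ties (n2).
    n1 = n2 = 0
    for _ in range(n):
        s_miss = score(rng.choice(missing))
        s_non = score(rng.choice(nonexistent))
        if s_miss > s_non:
            n1 += 1
        elif s_miss == s_non:
            n2 += 1
    return (n1 + 0.5 * n2) / n

random.seed(5)
# Toy scores: missing edges always score 1, nonexistent edges 0 or 1,
# so roughly half the comparisons are wins and half are ties.
auc = auc_sample(lambda e: e, [1, 1], [0, 1])
print(0.7 <= auc <= 0.8)   # expected value 0.75
```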

3) AUPR (AREA UNDER THE PRECISION -RECALL CURVE)
In addition to the two previous standard metrics, we also use the area under the precision-recall curve (AUPR). The precision-recall curve is a threshold curve and is commonly used in the binary classification community to express results. It is especially popular when the class distribution is highly imbalanced and is hence used increasingly often in link prediction evaluation. We have already described the precision metric above. Recall, also known as the true positive rate (TPR), is the percentage of data samples that a machine learning model correctly identifies as belonging to a class of interest (the ''positive class'') out of the total samples for that class. The recall formula is

Recall = TP / (TP + FN)

The area under the precision-recall curve (AUPR) is a scalar and ranges from 0 to 1, with 1 indicating a perfect classifier. We have calculated AUPR with the help of the scikit-learn package and its metrics.average_precision_score utility. Average precision computes the average value of precision over the interval from recall = 0 to recall = 1, with precision = p(r) a function of the recall r:

AUPR = ∫_0^1 p(r) dr

This integral computes the area under the precision-recall curve (AUPR).
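The AUPR computation with scikit-learn's average_precision_score can be sketched as follows; the labels and scores are toy values: each ranked candidate edge is labeled 1 if truly missing, 0 otherwise.

```python
from sklearn.metrics import average_precision_score

y_true = [1, 1, 0, 1, 0, 0]               # 1 = edge really missing
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # similarity estimates
# average_precision_score sums (R_n - R_{n-1}) * P_n over the thresholds,
# a step-wise approximation of the integral of p(r).
print(round(average_precision_score(y_true, y_score), 3))
```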

VI. EXPERIMENTAL RESULTS AND DISCUSSION
To be able to compare metrics from different link prediction algorithms, we used eleven representative networks from disparate fields: CEG and YST are biological networks [54], [55]; SMG is a co-authorship network [54], [55]; HMT [54], [55] and SOCFB [56] are social networks; EML is a network of individuals who shared emails [54], [55]; INF is a network of face-to-face contacts in an exhibition [54], [55]; UAL is an airport traffic network [54], [55]; RNDGEO is a random geometric graph that we constructed with the NetworkX library by calling the function random_geometric_graph(620, 0.2); SMWORLD is a small world graph that we constructed with the NetworkX library by calling the function connected_watts_strogatz_graph(620, 8, 0.9); ECONB is an economic network [56]. Table 1 summarizes the basic topological features of all these networks. The features are: (i) transitivity, which is based on the relative number of triangles in the graph compared to the total number of connected triples of nodes [57]; (ii) average clustering (clustering coefficient), a measure of the number of triangles in a graph [58]; (iii) average shortest path length [59]; (iv) network diameter; (v) network efficiency (average global efficiency); the efficiency of a pair of nodes in a graph is the multiplicative inverse of the shortest path distance between the nodes, and the average global efficiency of a graph is the average efficiency over all pairs of nodes [60]; (vi) average local efficiency, the average of the local efficiencies of each node [60]; (vii) mean degree, which is the average number of edges per node in the graph; (viii) degree Pearson correlation coefficient, the degree assortativity of the graph; assortativity measures the similarity of connections in the graph with respect to the node degree [61], [62].
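The Table 1 features can be computed with NetworkX as sketched below; the karate club graph stands in for one of the eleven networks, and the dictionary keys mirror the table's notation.

```python
import networkx as nx

G = nx.karate_club_graph()   # stand-in for one of the eleven networks
features = {
    "T": nx.transitivity(G),
    "C_aver": nx.average_clustering(G),
    "SPL_aver": nx.average_shortest_path_length(G),
    "D": nx.diameter(G),
    "Eff": nx.global_efficiency(G),
    "LocEff_aver": nx.local_efficiency(G),
    "<K>": 2 * G.number_of_edges() / G.number_of_nodes(),
    "DPC_coeff": nx.degree_pearson_correlation_coefficient(G),
}
print(round(features["<K>"], 2))   # karate club: 78 edges, 34 nodes → 4.59
```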
We compare the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh) with the classic LRW and SRW indices and with five other similarity indices: common neighbor centrality (CNC), preferential attachment (PA), resource allocation (RA), Adamic-Adar (AA) and the Jaccard coefficient (JC). The precision results of all eleven methods on the eleven networks are shown in Table 2 and the AUC results in Table 3. In Table 2 we also show the precision on the train sets in addition to the precision on the test sets for each network. The results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4. All random walk algorithms have been executed with two steps, because after two steps they almost always gave the best results.
In Table 2 we can observe that the random walk link prediction algorithms usually have better precision results than the other algorithms in most of the types of networks we have applied them to. Also, the random walk algorithms that use a sigmoid function have better precision results than those that do not. In particular, compared to the best precision results of the classic random walk algorithms LRW and SRW on the test sets, there is a 19% improvement in the CEG network, 17% in the UAL network, no remarkable improvement in the EML, SMG and RNDGEO networks, a 14% improvement in the SMWORLD network, 7% in the HMT network, 5% in the YST network, 6% in the INF network, 9% in the SOCFB network and, finally, a 16% improvement in the ECONB network.
Another important observation we can make in Table 2 is that the precision results of the random walk algorithms with a sigmoid function on the test sets are almost always better than their precision results on the train sets. This means that, by finding the appropriate value of the growth parameter a after training, the algorithm's calculation is adapted to the structure of the network to which it is applied in order to give a better result. This is very important because it shows that using the concept of an adapted sigmoid function is very beneficial in the link prediction task.
From Table 3 we can observe the different AUC values that we achieve with each of the link prediction algorithms when we apply them to the respective networks. As we can easily observe, the random walk algorithms with a sigmoid function maintain the high values of the classical random walk algorithms and in most cases surpass them. This is further evidence of the stability and completeness of our proposition.
In addition to the classical algorithms we have mentioned, we compared and evaluated our recommended enhanced custom similarity indices with an algorithm that belongs

Notation for Table 1: T is the transitivity. C_aver is the average clustering. SPL_aver is the average shortest path length. D is the network diameter. Eff is the network efficiency. LocEff_aver is the average local efficiency. ⟨K⟩ is the mean degree of the network. DPC_coeff is the degree Pearson correlation coefficient.

TABLE 2.
Precision values comparison between the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh), the classic LRW and SRW indices and five other similarity indices: common neighbor centrality (CNC), preferential attachment (PA), resource allocation (RA), Adamic-Adar (AA) and the Jaccard coefficient (JC). The highest accuracy in each row is shown in bold. The table shows the precision on the train sets in addition to the precision on the test sets for each network. The results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4. All random walk algorithms have been executed with two steps.
to the field of machine learning using neural networks. We chose the SEAL model because it is one of the best link prediction algorithms using GNNs according to the literature and has inspired several works in the deep learning link prediction domain. The evaluation methods used in neural network platforms were not compatible with ours, because there the evaluation is done at training and testing time and is based on arbitrary samples during the operation of the model. To be able to objectively compare our model with the SEAL model, we followed steps similar to those in subsection V-A and worked as follows: we first created and trained the SEAL model with the exact same training and validation sets with which we trained our model; we call them the train_graph set and the train_omissible_edges set, respectively. The union of these two sets is called the test_graph set, and the set which contains all the edges actually missing from the test_graph set is called the test_omissible_edges set.
We then fed the trained SEAL model with every potential edge missing from the test_graph set and took the model's prediction estimate for every such edge. We store all model prediction estimates for all edges (along with the edge for each estimate) in an evaluation list and then sort this evaluation list in descending order of the model prediction estimate. Then we use the test_omissible_edges set

TABLE 3. AUC values comparison between the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh), the classic LRW and SRW indices and five other similarity indices: common neighbor centrality (CNC), preferential attachment (PA), resource allocation (RA), Adamic-Adar (AA) and the Jaccard coefficient (JC). The highest AUC value in each row is shown in bold. The table shows the AUC on the test sets for each network. The results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4. All random walk algorithms have been executed with two steps.

TABLE 4.
Values of the growth parameter a with which the results of the four enhanced custom similarity indices have been calculated for the various networks. For every specific value of the growth parameter a, the corresponding prediction results of the enhanced custom similarity indices are reported in the other tables.

corresponding to the test_graph set we are considering. The evaluation of the trained SEAL model is obtained by counting how many edges in the first |test_omissible_edges| positions of the evaluation list are actually contained in the test_omissible_edges set. The more truly missing edges the evaluation list contains in its first |test_omissible_edges| positions, the more effective is the estimation that the trained SEAL model gives us for an edge that does not exist in the graph. With these results we obtain values comparable to our recommended model's values for the three metrics we use: Precision, AUC and AUPR.
Tables 5, 6 and 7 show the Precision, AUC and AUPR values, respectively. From these tables we can easily observe that the random walk algorithms, whether they use a sigmoid function or not, have better results than the SEAL model in almost all cases. The only case where the SEAL model manages a better result, although not a particularly significant one, is the SMWORLD dataset. More generally, in complex networks where the information we have about the nodes is quite limited, neural networks also have limited prediction capabilities. On the contrary, the classic algorithms, and especially the random walk family of algorithms, give quite remarkable results.
After all the experiments and the measurements we have taken from them, we can confidently say that the proposed method is highly stable and gives very good results in the different types of networks to which we have applied it. That is, it manages to better exploit the structure of the networks on which we apply it each time. The graph neural networks (GNNs) used by the SEAL method also try to exploit the structure of the networks. However, the results they yield are poorer, because there are no other data that could be exploited, since we only deal with complex networks in which we have no information other than their structure. Possibly SEAL or another GNN method would have performed better if additional information were available for each node in the network, which is not the case here. Since SEAL is a prominent GNN algorithm according to the literature, we consider it indicative of the anticipated performance of relevant algorithms and did not perform comparisons with other such algorithms.
The proposed method, as shown by the results of the experiments, indeed manages to improve two quite well-known and efficient algorithms, LRW and SRW; however, it is also limited by them. That is, it enhances their effectiveness to the extent that it manages to be properly integrated into them through the sigmoid function. However, there are also cases, albeit few, such as the SMG network, where the proposed method does not achieve any improvement over the results of the basic algorithms in any of its four versions. More generally, where the two basic algorithms have poor results, the proposed method does not offer much improvement on these results. Another point that needs attention when applying the proposed method to a network is overtraining (or overfitting). We should not exhaustively train our model to give the best results on the training set. It is enough to run the genetic algorithm with a limited number of individuals (as we mention in V-C2) and try to get better results than the basic algorithms we use. In this way we usually get better results on the test set as well. In general, when using genetic algorithms, the correct choice of parameters is important. In section V-C we gave some tips on the correct setup of our model parameters.

Table 5 shows the precision on the test sets for each network; the results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4, and all random walk algorithms have been executed with two steps.

TABLE 6. AUC values comparison between the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh), the classic LRW and SRW indices and the SEAL GNN model. The highest AUC value in each row is shown in bold. The table shows the AUC on the test sets for each network. The results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4. All random walk algorithms have been executed with two steps.

TABLE 7. AUPR values comparison between the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh), the classic LRW and SRW indices and the SEAL GNN model. The highest AUPR value in each row is shown in bold. The table shows the AUPR on the test sets for each network. The results from the four enhanced custom similarity indices for every network have been calculated using the values of the parameter a that we show in Table 4. All random walk algorithms have been executed with two steps.
Extending the previous discussion, our approach faces the same limitations in disconnected networks as LRW and SRW. The latter cannot operate in disconnected networks and, therefore, our approach cannot yield meaningful results there either. The LRW is able to perform in each connected component of a disconnected network, due to its local decision-making. In terms of network density, both approaches can operate in dense and sparse networks, with the results reflecting the corresponding features of the formed topologies, i.e., a sparsely connected component will have a smaller probability of being visited by the random walks than a tightly connected component. Therefore, only the connectivity properties of the network have an impact on the operation and performance of the scheme.
The proposed algorithm runs offline to find a suitable value for the growth parameter a so that it can make good predictions online. Therefore, we are not particularly concerned with achieving optimized performance when processing the data. In networks with hundreds to a few thousands of nodes the execution time is relatively short. In networks with many thousands to millions of nodes the execution time increases considerably, as it does for many algorithms. This is because the proposed algorithm tries to score all the missing edges in the network, whose number becomes very large as the network size increases considerably. To reduce the execution time we can reduce the number of missing edges considered; efficient training can be done even with a smaller subset of them.
For future work, we believe that the proposed method can be improved by integrating additional information into the basic algorithms, besides the sum of the degrees of a node's neighbors. However, to do this effectively we should avoid over-training, as well as the possibility that some pieces of this information will conflict with, or be incompatible with, each other. We could also try to integrate more information with the proposed method in other prediction or classification domains, such as community detection. Finally, we could try to use some new parameterizable function that might achieve a better integration of additional information into known and efficient formulas in various knowledge domains.

VII. CONCLUSION
The missing link prediction problem in complex networks has attracted a lot of attention in recent years. Many link prediction algorithms have been proposed in the literature and each of them gives more or less good prediction results for missing links, depending on the structure and properties of the complex network to which it is applied. In this work we chose the local random walk and superposed random walk similarities, two well-known node similarity link prediction algorithms with quite good prediction results on various types of complex networks. These algorithms use the transition probability and node degree as information to give node similarity for missing link prediction. We embedded additional information into these algorithms, namely the sum of the degrees of each node's neighbors. We tried to adapt this added information to each type of complex network on which we apply the new customized algorithms. To achieve this, we used sigmoid functions with appropriate values of the growth parameter a for each type of complex network. To optimize these values we used a genetic algorithm. The experimental results on various types of complex networks showed that with the new similarities we obtained better prediction performance than before.

FIGURE 1 .
FIGURE 1. Logistic sigmoid function and tanh function graphs for different values of the parameter a. The logistic sigmoid function σ(u) is displayed as solid lines for various values of a, and the tanh(u) function is displayed as dashed lines for various values of a as well. The range of the logistic sigmoid function is (0, 1). The range of the hyperbolic tangent function is (−1, 1). Both are monotonic. We can see how the slope of the curves changes for the various values of the parameter a. The tanh function is just a re-scaled version of the logistic sigmoid function σ.

3: P_current = new Chromosome[pop_size] {# creation of the current chromosome population #}
4: for ch in P_current do
5: ch.a = random()
6: ch.score = eval(ch) as in subsection IV-E
7: end for
8: Let bas be the atom(s) with the highest score in P_current
9: Append bas to best_chroms_list
Reproduction phase:
10: for i = 1 to epochs do
11: Create empty P_new
12: for j = 1 to (|P_current|/2) do
13: chrom1, chrom2 = selection(P_current) as in subsection IV

TABLE 5.
Precision values comparison between the four enhanced custom similarity indices (LRW with σ and tanh, SRW with σ and tanh), the classic LRW and SRW indices and the SEAL GNN model. The highest accuracy in each row is shown in bold.

Compute the best_chroms_list by executing Algorithm 1, giving as arguments G_train, D_train, E_train and the remaining parameters: population size, number of epochs, random walk steps, crossover and mutation probabilities.
Arrange ResultList in descending order by score
13: Create from the first |E_test| tuples in ResultList the edge set E_result
14: return E_result
5: Create D_train and D_test from G_train and G_test respectively, where D_i is a dictionary from graph G_i such that D_i = {node : (neighsnum, neighborslist)}, where node ∈ V, neighsnum is the degree of node, neighborslist is the neighbors list of node, and i ∈ {train, test}

TABLE 1.
Networks Features: the topological features of the eleven example networks. |V| and |E| are the total numbers of nodes and links, respectively.