A Review of Random Walk-Based Method for the Identification of Disease Genes and Disease Modules

Traditional techniques for identifying disease genes and disease modules involve high-cost clinical experiments and unpredictable time consumption for analysis. Network-based computational approaches usually focus on the systematic study of molecular networks to predict the associations between diseases and genes. The random walk-based method is a network-based approach that utilises biological networks for analysis. As random walk models efficiently capture the complex interplay among molecules in diseases, they are extensively applied in network-based biological problem-solving. Despite their widespread use, the fundamentals of random walk and the overall background may not be fully understood, leading to misinterpretation of results. This review aims to cover the fundamental knowledge of random walk models for biological network analysis. It reviews diffusion-based random walk methods for disease gene prediction and disease module identification. The random walk-based disease gene prediction methods are categorised into node classification and link prediction tasks, and the advantages and limitations of each method are detailed. Finally, the potential challenges and research directions for future studies on random walk models are highlighted.


I. INTRODUCTION
Genetic diseases are caused by gene mutations in combination with epigenetic factors or by a chromosomal abnormality [1], [2]. Genetic disorders are a result of improper protein production. The disorders can be divided into three categories, namely single-gene disorders (mutations in a single gene), complex disorders (mutations in two or more genes), and chromosomal disorders (changes in the number or structure of the chromosomes). Factors that affect the diagnosis of genetic disorders include, among others, variability in the phenotypic characteristics and overlapping symptoms with other disorders [2]. Elucidating the relationship between human genetic diseases and their causal genes (or proteins) remains a major public issue [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Conti.
Although traditional techniques for disease gene prediction and disease module identification provide predictive biomarkers and protein complexes through genetic variation studies, these methods are expensive and time- and resource-consuming, as many false positives need to be analysed further [4], [5]. Moreover, traditional techniques that focus on direct associations between genes and diseases, as well as associations between diseases and protein complexes, are not cost-effective. Based on extant literature, biological molecules (genes or proteins) perform their functions collaboratively [6], [7], [8], [9]. Therefore, computational modelling techniques could be more efficient in understanding system-level diseases.
Computational modelling of biological systems uses networks to understand their structure and dynamics [10]. Mapping the specific roles and collaborations of their components onto a wired network graph structure can reveal more helpful information and provide a systematic view [11]. A network-based environment enables efficient tracking of disease-causing factors by trailing network perturbations (e.g. edge or node removals) in the molecular networks [11]. Thus, molecular networks like protein-protein interaction (PPI) networks, gene co-expression (GCE) networks, gene regulatory networks (GRN), and Bayesian networks are efficient and effective for complex data visualisation and interpretation. In such models, the complex interplay is represented by nodes as molecules (e.g. genes, RNA, proteins and metabolites) and edges as relationships between the nodes (e.g. regulatory relationship) [12].
The random walk model is a network-based approach that employs graph-theoretical algorithms to solve biological problems, including disease gene prediction, protein function annotation, and disease module detection. This diffusion-based method uses information encoded in the complete network topology and the placement of all known disease genes for influence propagation in different networks through symmetric diffusion, whereby information flow diffuses through each edge to other nodes in the network. The node weights following the diffusion represent their affinity or closeness to other highly weighted nodes [13]. The random walk model is a useful tool to study the structure of graphs and the relationship between nodes [14]. The underlying assumption of random walk-based methods is that phenotypically similar diseases are caused by functionally related genes that are located close to each other in the molecular networks [15], [16], [17].
Diffusion-based random walk methods have been increasingly enhanced by considering prior information from omics data sources or topological information to calculate the network's node weights or adjacency matrix. For instance, Prioritisation with a Warped Network (PWN) [13] was designed as an enhanced random walk-based method that incorporates both network properties and prior knowledge to quantify the proximity between genes in the network. Hence, such methods are extensively used to complement and enrich existing statistical analyses to solve biological problems.
There are several reviews [2], [15], [18], [19], [20], [21] and benchmark articles [11], [17], [22] published on network-based methods. Most of these published works covered an overview of the existing network embedding methods, ranging from machine learning to graph representation learning methods. However, only a limited number of studies focused on presenting a wide range of diffusion-based random walk methods for disease gene prediction and disease module identification. Moreover, some articles [23], [24], [25] that extensively discussed the random walk models mainly focused on their theoretical definitions and underlying mathematical concepts. Some former surveys [14], [26] reviewed the application of random walk models in solving different biological problems. These articles either covered limited tools or did not assess the various available state-of-the-art network diffusion random walk methods.
To address the abovementioned issues, this study aims to present a comprehensive review of diffusion-based random walk methods leveraging network or graph data for disease gene prediction and disease module identification. A high-level illustration of the pipeline for applying diffusion-based random walk methods to different biomedical tasks is provided. The general concepts and principles of the random walk approaches are introduced, and a classification scheme of computational approaches based on problem definition (i.e. node classification and link prediction) for disease gene prediction is discussed. A list of available random walk-based methods for disease module identification is also provided. The capabilities and limitations of the random walk-based methods are acknowledged to provide a fast and clear introduction to using these promising research tools. Finally, challenges and future directions for the methodological development and applications of random walk-based methods are described.

II. FUNDAMENTALS OF RANDOM WALK
The basic concept of a random walk on a graph begins with a walker placed on a single node or a group of nodes that visits other nodes by taking a series of random steps. At every step, the walker moves to a random neighbour, and a distribution value is calculated for every node in the graph, indicating the likelihood that the walker is present at that node at that particular step. The random walk process is repeated until all nodes in the graph are covered or the distribution converges. At convergence, the distribution value of each node remains constant and is proportional to the time a random walker spends at that node and reflects its distance from the starting nodes [14].
A biological network can be represented as an undirected (e.g. PPI networks, GCE networks) or directed (e.g. GRN, metabolic networks) graph. Given a graph G = (V, E), V is a set of nodes and E is a set of edges. For any node u ∈ V and edge (u, v) ∈ E, B(u) is the set of all nodes that link to node u and |L(v)| is the number of neighbours (outgoing links) of node v. $PR_{t+1}(u)$ is the probability (rank score) of a particular node u at time step t+1, and $PR_t(v)$ is the probability of node v at time step t. The node proximity on a graph can be calculated based on a simple random walk model and some extended random walk models defined as follows.

A. PageRank ALGORITHM
PageRank is an algorithm developed to rank the importance of webpages by employing the link structure of the web [27]. A Markov chain with a primitive transition probability matrix can be built using the hyperlink structure of the web. The stationary vector or PageRank vector is obtained based on the irreducibility of the Markov chain. The values corresponding to each page in the PageRank vector are known as the PageRank scores of the page [28]. This algorithm indicates that a page with important incoming links will produce outgoing links to other pages that are also essential [28]. Thus, PageRank considers the backlinks and propagates the ranking through links: a page ranks high if the sum of the ranks of its backlinks is high [27]. A simplified version of PageRank is defined as [27], [29]:
$$PR_{t+1}(u) = c \sum_{v \in B(u)} \frac{PR_t(v)}{|L(v)|}$$
where u represents a node (web page), B(u) is the set of nodes (pages) that point to node u, and c is a factor used for normalisation. The overall PageRank score is calculated based on ranking all network nodes. However, this simplified PageRank has a rank sink problem, and not all users follow the direct links within a website [27]. In this case, the original PageRank is modified as follows [27], [29]:
$$PR_{t+1}(u) = \frac{1-\alpha}{N} + \alpha \sum_{v \in B(u)} \frac{PR_t(v)}{|L(v)|}$$
where N is the number of nodes in the network and α is a constant in [0, 1] called the damping factor or teleport probability. α can be referred to as the probability of users following the links and 1−α as the PageRank distribution from non-directly linked pages [29].
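The damped PageRank update can be illustrated with a short power-iteration sketch. It assumes the per-node form PR(u) = (1−α)/N + α Σ PR(v)/|L(v)| over v in B(u), a graph in which every node has at least one outgoing link, and a hypothetical four-node graph chosen only for illustration.

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate PR(u) = (1 - alpha)/N + alpha * sum_{v in B(u)} PR(v)/|L(v)|.

    adj[v, u] = 1 means node v links to node u. Assumes every node has at
    least one outgoing link, so no dangling-node correction is applied.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Row v of `transition` spreads PR(v) evenly over the |L(v)| outgoing links.
    transition = adj / adj.sum(axis=1, keepdims=True)
    pr = np.full(n, 1.0 / n)  # uniform starting distribution
    for _ in range(max_iter):
        new_pr = (1 - alpha) / n + alpha * (transition.T @ pr)
        if np.abs(new_pr - pr).sum() < tol:  # stop when the L1 change is tiny
            return new_pr
        pr = new_pr
    return pr

# Hypothetical directed graph: 0->1, 0->2, 1->2, 2->0, 2->3, 3->2
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]])
scores = pagerank(adj)
print(scores.round(3))
```

In this toy graph, node 2 receives links from every other node and therefore ends up with the highest score, which matches the intuition that a page ranks high when the sum of the ranks of its backlinks is high.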

B. PERSONALISED PAGERANK (PPR) ALGORITHM
PPR is a variant of the PageRank algorithm that focuses on the relative significance of a target node concerning the source node in a graph [28]. In the original PageRank, the rank score of a web page is divided evenly over the pages to which it is linked. On the actual web, however, some links may be more critical than others based on the users' preferences. Therefore, PPR was developed to estimate the relevance of nodes according to users' preferences, aiming for personalised search [30].
116368 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
It simulates a random walker that begins at source node u (or simultaneously at a set of source nodes). At each step, the random surfer either jumps to a random out-neighbour node, v, with probability α, or returns to the source node u, represented by $e_u$, according to the probability distribution (user preference distribution) with probability 1−α. The PPR can be defined as [31]:
$$PR_{t+1}(u) = (1-\alpha)\, e_u + \alpha \sum_{v \in B(u)} \frac{PR_t(v)}{|L(v)|}$$
where $e_u$ is the unit vector of node u whose u-th entry is equal to 1, also referred to as the probability vector that contains all other nodes jumping to the node. The difference between PageRank and PPR is that PageRank assumes the random walker returns to any node with uniform probability, while PPR considers the random walker to randomly return to specific states (i.e., query states) [32].
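The contrast with plain PageRank can be made concrete with a minimal sketch in which the restart mass goes entirely to one source node. It assumes the update PR ← (1−α)·e_u + α·(propagation along out-links); the function name and the four-node path graph are hypothetical.

```python
import numpy as np

def personalised_pagerank(adj, source, alpha=0.85, n_iter=300):
    """PPR sketch: follow a link with probability alpha, return to `source` otherwise."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    transition = adj / adj.sum(axis=1, keepdims=True)  # even split over out-links
    e_u = np.zeros(n)
    e_u[source] = 1.0          # unit vector of the source node
    pr = e_u.copy()
    for _ in range(n_iter):
        pr = (1 - alpha) * e_u + alpha * (transition.T @ pr)
    return pr

# Hypothetical undirected path graph 0 - 1 - 2 - 3
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
from_0 = personalised_pagerank(path, source=0)
from_3 = personalised_pagerank(path, source=3)
print(from_0.round(3), from_3.round(3))
```

Because the path is symmetric, the scores computed from node 3 are the mirror image of those computed from node 0, which shows that the ranking is relative to the chosen source rather than global.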

C. PAGERANK WITH PRIORS ALGORITHM
PageRank with Priors is an extension of the PPR algorithm to estimate the relative importance of nodes in a graph based on a set of root nodes. The root set can be represented as the data analyst's prior knowledge or bias based on the nodes in the graph that are deemed essential. Both the PageRank with Priors and PPR algorithms share the similar goal of ranking nodes in a graph, except for the particular context of PageRank and Web pages, which is to bias the standard PageRank rankings in favour of a set of prior topics (root set) [33]. PageRank with Priors defines a prior bias vector used to assign a probability distribution to a set of root nodes, where the root nodes have probabilities of more than zero and all probabilities add up to one [34]. The back probability governs the probability that a walk on the graph will restart at a root node and is expressed here as 1−α. In this context, the root node denotes a known disease gene. Meanwhile, the random surfer lands on any node during this set of walks with a probability of α or ends stochastically at the prior bias nodes ($p_u$) with a probability of 1−α. Mathematically, PageRank with Priors is defined as [33]:
$$PR_{t+1}(u) = (1-\alpha)\, p_u + \alpha \sum_{v \in B(u)} \frac{PR_t(v)}{|L(v)|}$$
where $p_u$ refers to the prior bias of node u. In general, the difference between PageRank with Priors and PPR is that PageRank with Priors allows any weight distribution of nodes associated with a set of defined root nodes (root nodes consist of a prior bias vector). In contrast, PPR assumes a uniform distribution for all the nodes related to a set of topic-specific seed nodes (seed nodes consist of topic-specific identity vectors).
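A small sketch can show how a non-uniform prior bias vector changes the ranking compared with equally weighted root nodes. It assumes the same update as PPR with the restart vector replaced by a normalised prior; the function name, the ring graph, and the prior weights are hypothetical.

```python
import numpy as np

def pagerank_with_priors(adj, prior_weights, alpha=0.7, n_iter=300):
    """Restart on the root nodes in proportion to their prior weights."""
    adj = np.asarray(adj, dtype=float)
    transition = adj / adj.sum(axis=1, keepdims=True)
    p = np.asarray(prior_weights, dtype=float)
    p = p / p.sum()  # prior bias vector: non-negative entries that sum to one
    pr = p.copy()
    for _ in range(n_iter):
        pr = (1 - alpha) * p + alpha * (transition.T @ pr)
    return pr

# Hypothetical ring of five nodes; nodes 0 and 2 are the root nodes.
n = 5
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0

weighted_prior = pagerank_with_priors(ring, [3, 0, 1, 0, 0])  # node 0 trusted more
uniform_prior = pagerank_with_priors(ring, [1, 0, 1, 0, 0])   # equal root weights
print(weighted_prior.round(3), uniform_prior.round(3))
```

With equal weights the two root nodes receive identical scores by symmetry, whereas the weighted prior ranks node 0 above node 2.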

D. RANDOM WALK RESTART (RWR) ALGORITHM
RWR is an improved PageRank algorithm that measures each node's relevance with respect to a given single seed node in a graph [35]. RWR executes a random walker that begins at source node s. At each step, the walker either moves to a neighbouring node along an edge (based on edge weights) with probability α or restarts from the source node s with probability 1−α. RWR can be formally defined as [36], [37]:
$$PR_{t+1} = (1-\alpha)\, s + \alpha\, Q\, PR_t$$
where s is a vector of N entries, all set to 0 except for the single seed node [14], and Q is the normalised adjacency (transition) matrix. Compared to PageRank with Priors, the initial probability vector of RWR is constructed to assign equal probability to each seed node (seed nodes consist of disease genes), whereas PageRank with Priors initialises prior information vectors (e.g., seed nodes incorporating disease similarity information) to a set of defined root nodes.
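The matrix form of RWR lends itself to a compact sketch. The version below assumes a column-normalised transition matrix Q, a restart vector that spreads equal probability over the seed nodes, and the convention used in this section that α is the probability of moving and 1−α the probability of restarting (many papers define the restart probability the other way round). The edge weights are hypothetical.

```python
import numpy as np

def random_walk_with_restart(weights, seeds, alpha=0.7, tol=1e-12, max_iter=10000):
    """Iterate PR <- (1 - alpha) * s + alpha * Q @ PR until the L1 change is below tol."""
    w = np.asarray(weights, dtype=float)
    q = w / w.sum(axis=0, keepdims=True)  # column j holds the moves out of node j
    s = np.zeros(w.shape[0])
    s[list(seeds)] = 1.0 / len(seeds)     # equal probability for every seed node
    pr = s.copy()
    for _ in range(max_iter):
        new_pr = (1 - alpha) * s + alpha * (q @ pr)
        converged = np.abs(new_pr - pr).sum() < tol
        pr = new_pr
        if converged:
            break
    return pr

# Hypothetical symmetric edge weights (e.g. interaction reliability scores)
weights = np.array([[0.0, 0.9, 0.2, 0.0],
                    [0.9, 0.0, 0.5, 0.0],
                    [0.2, 0.5, 0.0, 0.7],
                    [0.0, 0.0, 0.7, 0.0]])
scores = random_walk_with_restart(weights, seeds=[0])
print(scores.round(3))
```

For small networks the same steady state can be obtained directly from the linear system (I − αQ)·PR = (1−α)·s, which is a convenient way to check that the iteration has converged.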

E. WEIGHTED PAGERANK ALGORITHM
Weighted PageRank is an extension of PageRank combined with the RWR algorithm to compute the closeness between any two nodes in a graph. In the original PageRank, the transition of a random walker from a node to its neighbours depends on the number of its neighbours. However, in Weighted PageRank, the computation is performed by iteratively visiting the neighbours with which the edges connecting the node have higher weights [14]. Thus, by reinforcing the weight of interactions, Weighted PageRank can be defined as [38]:
$$PR_{t+1} = (1-\alpha)\, PR_0 + \alpha\, W\, PR_t$$
where $PR_0$ is the initial probability vector, generated by assigning the set of root nodes (known disease genes) an equal probability of being a start node, which sums to 1. All other nodes are designated a value of 0 [34]. W is the normalised adjacency (transition) matrix, whose values depend on the weight of the edges represented by $\sum_{v \in B(u)} w(u, v)$. A weight-adjustment scheme is introduced to adjust the degree of modularity in a biological network. The difference between Weighted PageRank and RWR lies in constructing the transition matrix. Weighted PageRank intensifies the weights of interactions using an efficient parameter (e.g. weight-reinforcement rate parameter) to modularise the network [39], [40]. Meanwhile, RWR naively considers the original interaction weights based on reliability scores in the PPI network [41], [42], [43], gene ontology based on the similarity of genes [44] or the relationship between heterogeneous biomedical concepts [45] for network construction.
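The effect of edge weights on the transition matrix can be sketched as follows. The example assumes that each entry of W is an edge weight divided by the total weight of the edges leaving the same node, and it does not reproduce any particular weight-reinforcement scheme; the three-node star and its weights are hypothetical.

```python
import numpy as np

def weighted_transition(weights):
    """W[v, u] = w(u, v) / sum of the weights on edges leaving u (columns sum to 1)."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum(axis=0, keepdims=True)

def weighted_pagerank(weights, seeds, alpha=0.7, n_iter=300):
    """Iterate PR <- (1 - alpha) * PR_0 + alpha * W @ PR from equally weighted seeds."""
    W = weighted_transition(weights)
    pr0 = np.zeros(W.shape[0])
    pr0[list(seeds)] = 1.0 / len(seeds)
    pr = pr0.copy()
    for _ in range(n_iter):
        pr = (1 - alpha) * pr0 + alpha * (W @ pr)
    return pr

# Hypothetical star around node 0: a strong edge to node 1, a weak edge to node 2
weights = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.0],
                    [0.1, 0.0, 0.0]])
weighted = weighted_pagerank(weights, seeds=[0])
unweighted = weighted_pagerank((weights > 0).astype(float), seeds=[0])
print(weighted.round(3), unweighted.round(3))
```

With the weights ignored, nodes 1 and 2 are indistinguishable; once the weights enter the transition matrix, the walker favours the strongly connected neighbour.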

III. RANDOM WALK ON BIOLOGICAL NETWORKS
Biological networks can be represented as graphs that serve as models of biological systems, where each node is a unit (gene or protein) and each edge indicates the interaction between two units [2]. Biological networks are categorised into homogeneous, heterogeneous, and multiple networks. Random walk models can be implemented in single (homogeneous network) or multi-networks (heterogeneous or two-separated networks) based on the classified category. A brief description of random walking on homogeneous, heterogeneous, and two-separated networks is provided as follows.

A. RANDOM WALK ON HOMOGENEOUS NETWORKS
A homogeneous network is a graph with a single type of nodes and a single type of edges. Gene-gene, PPI, phenotype, and gene expression networks are homogeneous networks [2]. Given a graph G = (V, E), V is the set of nodes and E is the set of edges. Let A (N × N) denote the adjacency matrix of the homogeneous graph, where it has an entry of 1 if two vertices i and j are connected and 0 otherwise. The equation can be represented as:
$$A_{ij} = \begin{cases} 1, & \text{if } (i, j) \in E \\ 0, & \text{otherwise} \end{cases}$$
The normalised adjacency matrix is obtained by dividing each row by the degree of the corresponding node. Formally, the normalised adjacency matrix is defined as [46]:
$$\hat{A}_{ij} = \frac{A_{ij}}{\sum_{k=1}^{N} A_{ik}}$$
where $\hat{A}$ is A with each row normalised, summing up to 1. The computed normalised adjacency matrix is applied in equation (6) to obtain the steady-state probability vector.
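Building and normalising the adjacency matrix can be sketched in a few lines. The helper names and the edge list are hypothetical, and isolated nodes are given an all-zero row as one possible way of avoiding division by zero.

```python
import numpy as np

def adjacency_from_edges(n_nodes, edges):
    """A[i, j] = 1 if nodes i and j are connected, 0 otherwise (undirected)."""
    adj = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    return adj

def row_normalise(adj):
    """Divide each row by the degree of the corresponding node."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1, keepdims=True)
    # Isolated nodes (degree 0) keep an all-zero row instead of dividing by zero.
    return np.divide(adj, degree, out=np.zeros_like(adj), where=degree > 0)

# Hypothetical network with five nodes; node 4 has no interactions
adj = adjacency_from_edges(5, [(0, 1), (1, 2), (1, 3)])
norm = row_normalise(adj)
print(norm.round(3))
```

Node 1 has three neighbours, so each of its outgoing transitions receives a probability of one third, while the rows of nodes 0, 2 and 3 each place all their probability on node 1.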

B. RANDOM WALK ON HETEROGENEOUS NETWORKS
A heterogeneous network refers to a graph consisting of different types of nodes and edges. It is constructed by integrating two or more homogeneous networks with known associations. Some of the common heterogeneous networks include gene-to-phenotype networks, gene-disease networks, phenotype-disease networks, and transcription regulatory networks [2]. For instance, let $A_G$ (N × N) and $A_P$ (M × M) be the adjacency matrices of two input networks. The mapping of these two networks is stored in matrix B (N × M). The integration of the two input networks and their association network forms a heterogeneous network, which is denoted as follows:
$$A = \begin{bmatrix} A_G & B \\ B^T & A_P \end{bmatrix}$$
where $B^T$ is the transpose of matrix B. A random walker iteratively transitions from its current node to a randomly selected neighbour, starting at a given set of seed nodes in subnetworks $A_G$ and $A_P$. During a random walk on a heterogeneous network, the walker is likely to stay in a subnetwork while jumping from one subnetwork to another through their interrelationships at a certain probability [14]. The process is described by an update equation in which M is the transition matrix of the heterogeneous network, consisting of four subnetwork transition matrices and denoted as follows [47]:
$$M = \begin{bmatrix} M_G & M_{GP} \\ M_{PG} & M_P \end{bmatrix}$$
where $M_G$ and $M_P$ are intra-subnetwork transition matrices of networks G and P. $M_{GP}$ and $M_{PG}$ represent the inter-subnetwork transition matrices between networks G and P.
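Let α be the jumping probability between the two subnetworks. When the random walker is in network G, it can jump to network P or stay in network G. If a node is directly linked to network P, the random walker will jump to network P with a probability of α, or move to other nodes in network G with a probability of 1−α. Otherwise, it will not be able to jump to network P and can only move to other nodes in network G. Thus, the inter-subnetwork transition probabilities between networks G and P are described as:
$$(M_{GP})_{ij} = \begin{cases} \alpha B_{ij} / \sum_{k} B_{ik}, & \text{if } \sum_{k} B_{ik} \neq 0 \\ 0, & \text{otherwise} \end{cases}$$
$$(M_{PG})_{ij} = \begin{cases} \alpha B_{ji} / \sum_{k} B_{ki}, & \text{if } \sum_{k} B_{ki} \neq 0 \\ 0, & \text{otherwise} \end{cases}$$
Meanwhile, the intra-subnetwork transition matrices of networks G and P can be defined as:
$$(M_G)_{ij} = \begin{cases} (A_G)_{ij} / \sum_{k} (A_G)_{ik}, & \text{if } \sum_{k} B_{ik} = 0 \\ (1-\alpha)(A_G)_{ij} / \sum_{k} (A_G)_{ik}, & \text{otherwise} \end{cases}$$
$$(M_P)_{ij} = \begin{cases} (A_P)_{ij} / \sum_{k} (A_P)_{ik}, & \text{if } \sum_{k} B_{ki} = 0 \\ (1-\alpha)(A_P)_{ij} / \sum_{k} (A_P)_{ik}, & \text{otherwise} \end{cases}$$
The initial probability that begins with the seed nodes in networks G and P is denoted by $u_0$ and $v_0$, respectively. The initial probability vector of the heterogeneous network is denoted as:
$$PR_0 = \begin{bmatrix} \eta\, u_0 \\ (1-\eta)\, v_0 \end{bmatrix}$$
where parameter η ∈ (0, 1) balances the level of importance of each subnetwork. When η = 0.5, the importance of networks G and P is equal. If η > 0.5, the importance of network G becomes greater than that of network P, and vice versa.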
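A steady-state probability $PR_\infty$ is achieved after several steps, and its components $u_\infty$ and $v_\infty$ correspond to networks G and P, respectively. The nodes in networks G and P are ranked based on the steady-state probabilities $u_\infty$ and $v_\infty$, respectively.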
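The construction of the heterogeneous transition matrix and the subsequent walk can be sketched as below. The sketch assumes that a node linked to the other subnetwork jumps there with probability α and otherwise stays in its own subnetwork, that every node has at least one neighbour in its own subnetwork, and that the walk restarts at the initial vector with a hypothetical `restart` probability, since the restart symbol is not given in this section. All matrices are toy data.

```python
import numpy as np

def row_normalise(m):
    s = m.sum(axis=1, keepdims=True)
    return np.divide(m, s, out=np.zeros_like(m), where=s > 0)

def heterogeneous_transition(a_g, a_p, b, alpha=0.5):
    """Assemble M = [[M_G, M_GP], [M_PG, M_P]] with jumping probability alpha."""
    a_g, a_p, b = (np.asarray(x, dtype=float) for x in (a_g, a_p, b))
    g_linked = b.sum(axis=1, keepdims=True) > 0    # G nodes with a link into P
    p_linked = b.T.sum(axis=1, keepdims=True) > 0  # P nodes with a link into G
    m_gp = alpha * row_normalise(b)
    m_pg = alpha * row_normalise(b.T)
    # Linked nodes keep only 1 - alpha for moves inside their own subnetwork.
    m_g = np.where(g_linked, 1 - alpha, 1.0) * row_normalise(a_g)
    m_p = np.where(p_linked, 1 - alpha, 1.0) * row_normalise(a_p)
    return np.block([[m_g, m_gp], [m_pg, m_p]])

def walk(m, u0, v0, eta=0.5, restart=0.3, n_iter=300):
    """`restart` is a hypothetical name for the probability of returning to PR_0."""
    pr0 = np.concatenate([eta * np.asarray(u0, dtype=float),
                          (1 - eta) * np.asarray(v0, dtype=float)])
    pr = pr0.copy()
    for _ in range(n_iter):
        pr = (1 - restart) * (m.T @ pr) + restart * pr0
    return pr

a_g = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # three nodes in network G
a_p = np.array([[0, 1], [1, 0]])                   # two nodes in network P
b = np.array([[1, 0], [0, 0], [0, 1]])             # G node 1 has no link into P
m = heterogeneous_transition(a_g, a_p, b, alpha=0.5)
pr = walk(m, u0=[1, 0, 0], v0=[1, 0])
u_inf, v_inf = pr[:3], pr[3:]
print(u_inf.round(3), v_inf.round(3))
```

Every row of M sums to one, and the row of the unlinked G node contains no probability of jumping into network P, as described in the text.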

C. RANDOM WALK ON TWO SEPARATED NETWORKS
Random walking on two separated networks can be performed based on a balanced or unbalanced bi-random walk algorithm. A balanced bi-random walk algorithm begins simultaneously with seed nodes in two input networks and walks separately across each network. The potential interrelationships between the nodes in the two networks are explored while walking, following some known and recently updated connections [14]. Mathematically, the process is illustrated by the following equation [48]:
$$PR_{t+1} = \alpha\, G\, PR_t\, P + (1-\alpha)\, A$$
G and P represent the affinity matrices of networks G and P, respectively. A is the known association matrix that acts as prior knowledge to regulate the iteration process. PR is iteratively updated by extending the path in the two networks (achieved by multiplying G on the left and P on the right in each step) [14]. The parameter α regulates the weight of known associations in matrix A.
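A balanced bi-random walk can be sketched under the assumption that each step multiplies the current score matrix by G on the left and P on the right and then mixes the result with the known associations, weighted by α and 1−α. The symmetric degree normalisation of the two similarity matrices is also an assumption, and all matrices are hypothetical.

```python
import numpy as np

def sym_normalise(sim):
    """Assumed normalisation: D^(-1/2) S D^(-1/2) with D the row sums of S."""
    sim = np.asarray(sim, dtype=float)
    d = 1.0 / np.sqrt(sim.sum(axis=1))
    return sim * d[:, None] * d[None, :]

def balanced_birw(g_sim, p_sim, assoc, alpha=0.8, n_steps=200):
    """Iterate R <- alpha * G @ R @ P + (1 - alpha) * A on both networks at once."""
    G, P = sym_normalise(g_sim), sym_normalise(p_sim)
    A = np.asarray(assoc, dtype=float)
    A = A / A.sum()  # known associations act as the prior
    R = A.copy()
    for _ in range(n_steps):
        R = alpha * (G @ R @ P) + (1 - alpha) * A
    return R

g_sim = np.array([[1.0, 0.8, 0.1],   # nodes 0 and 1 of network G are similar
                  [0.8, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
p_sim = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
assoc = np.array([[1, 0],            # only the pair (G node 0, P node 0) is known
                  [0, 0],
                  [0, 0]])
R = balanced_birw(g_sim, p_sim, assoc)
print(R.round(3))
```

In this toy setting the unknown pair involving G node 1, which is similar to the known node 0, scores higher than the pair involving the dissimilar node 2.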
On the other hand, the process can also be taken sequentially on the two networks based on an unbalanced bi-random walk algorithm instead of random walking on two separate networks simultaneously [48]. Theoretically, the random walker employs a different number of steps for the two input networks, eventually converging to a stationary distribution by taking a series of random walking steps separately. Since the two input networks contain different topologies and structures, the optimal number of random walk steps might differ for the two networks. Two parameters, l and r, are introduced to represent the maximal numbers of iterations on network G and network P, respectively. The mathematical definitions comprise an update on network G, an update on network P, and a merged result, where $\lambda_G$ and $\lambda_P$ ensure that the maximal walking steps taken on network G and network P do not exceed the thresholds l and r, respectively.
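One way to read the unbalanced variant is sketched below. It assumes a one-sided update on network G (left multiplication), a one-sided update on network P (right multiplication), and a merged result that averages whichever sides are still active, with indicator values standing in for λ_G and λ_P. These update forms, the normalisation, and the toy matrices are assumptions made for illustration rather than the definitions of the original algorithm.

```python
import numpy as np

def sym_normalise(sim):
    sim = np.asarray(sim, dtype=float)
    d = 1.0 / np.sqrt(sim.sum(axis=1))
    return sim * d[:, None] * d[None, :]

def unbalanced_birw(g_sim, p_sim, assoc, alpha=0.8, l=3, r=2):
    """Walk at most l steps on network G and at most r steps on network P."""
    G, P = sym_normalise(g_sim), sym_normalise(p_sim)
    A = np.asarray(assoc, dtype=float)
    A = A / A.sum()
    R = A.copy()
    for t in range(1, max(l, r) + 1):
        lam_g = 1.0 if t <= l else 0.0  # network G is active up to step l
        lam_p = 1.0 if t <= r else 0.0  # network P is active up to step r
        R_g = alpha * (G @ R) + (1 - alpha) * A  # step on network G only
        R_p = alpha * (R @ P) + (1 - alpha) * A  # step on network P only
        R = (lam_g * R_g + lam_p * R_p) / (lam_g + lam_p)  # merged result
    return R

g_sim = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.1],
                  [0.1, 0.1, 1.0]])
p_sim = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
assoc = np.array([[1, 0], [0, 0], [0, 0]])
R = unbalanced_birw(g_sim, p_sim, assoc, l=3, r=2)
one_step_on_g = unbalanced_birw(g_sim, p_sim, assoc, l=1, r=0)
print(R.round(3))
```

Setting l = 1 and r = 0 reduces the procedure to a single step on network G, which makes the role of the two thresholds easy to inspect.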

IV. DISEASE GENE PREDICTION
Identifying disease-associated genes is a task of predicting the most plausible candidate genes involved in a disease [49].
With the development of high-throughput technologies, genetic mapping approaches emerged to generate candidate disease genes. The traditional genetic mapping methods include linkage analysis and genome-wide association studies (GWAS), which provide chromosomal regions containing tens or even hundreds of candidate genes possibly associated with genetic diseases [50]. However, it may not be possible to experimentally validate the candidate disease genes that lie on the specified genomic intervals. Thus, computational disease gene prioritisation may be an optimal strategy for identifying the most promising candidates among the long list of genes to reduce experimental costs and efforts. Random walk-based methods are network-based computational approaches that represent biological data as a network and apply graph mining techniques to predict disease candidate genes [21]. Their ability to amplify associations between genes that lie in network proximity facilitates the analysis of biological pathways for disease gene prediction. Random walk-based methods can be categorised into node classification and link prediction tasks. Node classification uses the known disease genes to infer the disease label of the unlabeled genes, whereas link prediction uses gene-disease associations to identify the potential disease-causing genes [11]. The following subsections describe the formal definitions of node classification and link prediction tasks and elaborate on the relevant biological applications of random walk methods, together with a comprehensive review of random walk-based methods for disease gene prediction based on the two tasks.

A. NODE CLASSIFICATION
Node classification aims to predict or classify unlabeled nodes (genes with unknown disease associations) in the biological network, with known labels on some nodes (genes with known disease associations). In a homogeneous graph G = (V, E), V refers to the set of nodes/genes and E is the set of relationships between nodes. Let $V_{labeled} \subseteq V$ be the subset of genes labelled as disease-causing genes, and let $V_{unknown} = V \setminus V_{labeled}$ be the set of genes with unlabeled disease associations [11]. Node classification on network G predicts the labels of nodes in $V_{unknown}$ [11]. Similar criteria can also be used to define this node classification task in heterogeneous graphs and multi-view graphs.
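As a minimal sketch of this task, the genes in V_unknown can be ranked by their RWR score with respect to the labelled disease genes. The function name, the placeholder gene identifiers, and the path-shaped network are hypothetical, and α again denotes the probability of moving along an edge.

```python
import numpy as np

def prioritise(genes, edges, disease_genes, alpha=0.7, n_iter=300):
    """Rank the unlabeled genes by their RWR score from the labelled seed genes."""
    index = {g: i for i, g in enumerate(genes)}
    adj = np.zeros((len(genes), len(genes)))
    for a, b in edges:
        adj[index[a], index[b]] = adj[index[b], index[a]] = 1.0
    q = adj / adj.sum(axis=0, keepdims=True)  # column-normalised transitions
    s = np.zeros(len(genes))
    for g in disease_genes:                   # V_labeled acts as the seed set
        s[index[g]] = 1.0 / len(disease_genes)
    pr = s.copy()
    for _ in range(n_iter):
        pr = (1 - alpha) * s + alpha * (q @ pr)
    unknown = [g for g in genes if g not in disease_genes]  # V_unknown
    return sorted(unknown, key=lambda g: pr[index[g]], reverse=True)

# Placeholder identifiers, not real gene symbols
genes = ["g1", "g2", "g3", "g4", "g5"]
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")]
ranking = prioritise(genes, edges, disease_genes=["g1"])
print(ranking)
```

A threshold or a top-k cut on the ranking would then turn the scores into predicted labels for the unlabeled genes.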
Disease gene prioritisation, also known as disease gene association prediction, is one of the popular biological applications of random walk methods for disease gene prediction based on node classification tasks. The prioritisation (or the selection of a smaller subset) of candidate genes is the process of assigning similarity or confidence scores to genes before ranking them based on the probability of being causally related to a disease of interest [51], [52]. Disease gene prioritisation primarily comprises three steps. First, some known disease genes are chosen as seed genes. Then, the positions of the seed genes on their chromosomes are determined based on gene expression profiling, linkage regions, and other chromosomal abnormalities [14]. However, these approaches have identified thousands of candidate genes, most of which are irrelevant to the disease of interest, indicating the need to rank candidate genes using a prioritisation method to identify the most likely disease genes from these candidates.

1) RANDOM WALK METHODS BASED ON NODE CLASSIFICATION
Network-based candidate gene prioritization (ToppNet) [53] adopts the PageRank with Priors algorithm to prioritise disease candidate genes based on their relative importance in the PPI network. A list of known disease-related genes is used as the prior bias vectors to run the PageRank algorithm with different parameter values. PRIoritizatioN and Complex Elucidation (PRINCE) [54] is proposed to prioritise genes and protein complexes associated with a disease of interest. It computes the disease similarity measures of known causal genes as prior probability vectors to run the PageRank with Priors algorithm on a weighted PPI network. Network Propagation with Dual Flow (NPDE) [55] employs a dual-flow PageRank with Priors approach to prioritise candidate disease genes. It aims to analyse the topological associations between disease and essential proteins by assigning positive flow to known disease proteins and negative flow to essential proteins within the PPI network. The empirical results demonstrated that disease genes are not well connected with essential genes, leading to the further conclusion that disease proteins are topologically more important than other proteins in the network.
On the other hand, VAVIEN [56] aims to measure the topological similarity among the protein pairs in the PPI network. It uses the RWR algorithm to construct topological similarity between the seed and candidate proteins based on the proximity of the protein of interest to every other protein in the network. The computed topological profile scores are then used for candidate disease gene prioritisation. Next, Diffusion Profile based on Linear Correlation Coefficient (DP-LCC) [57] is proposed as a diffusion-based method to prioritise candidate disease genes in the PPI network. It constructs separate diffusion profiles for disease genes and candidate genes to compare both profile vectors with the query disease based on a linear correlation coefficient. PRioritization bY protein NeTwork (PRYNT) [58], in turn, employs two closeness-based algorithms, shortest-path and random walk, to prioritise kidney disease genes. The PPI network is contextualised by grouping the proteins within cliques. Multiplying the rank scores computed from both strategies yielded better results than the direct ranking implemented in previous studies.
RWR [41], [59] is proposed to prioritise candidate disease genes based on random walk methods. The calculated rank score reflects the global similarity of candidate genes to known members of a disease-gene family in the PPI network. Degree-Aware Disease Gene Prioritization (DADA) [60] introduces a disease gene prioritisation method based on statistical adjustment to correct degree bias in the conventional RWR algorithm. DADA suggests three reference models: the degree of disease genes (seed nodes), the degree of candidate genes, and the likelihood ratio using eigenvector centrality to adjust the degree distribution of the PPI network. Although these methods successfully identified the loosely connected disease genes, they also created more false negatives for highly connected genes. Neighbour-favoring weight reinforcement (ORIENT) [38] proposed a Weighted PageRank algorithm to prioritise candidate disease genes in the PPI network, improving the conventional RWR algorithm by introducing an efficient parameter to reinforce the weights of interactions close to the known disease genes. The proposed method thoroughly considered the modularity principle through proper neighbour-favoring weight reinforcement.
Directed Random Walk (DRW) [61] applies the RWR algorithm to infer robust pathway biomarkers at the level of functional categories rather than individual genes. It introduces an efficient gene-weighting strategy according to topological importance, effectively enhancing the reproducibility of pathway activities for cancer classification. Significant Directed Random Walk (sDRW) [62] aims to assess the optimal restart probability parameter according to different genomic datasets by introducing an additional weight to enhance the conventional RWR algorithm. Enhanced Directed Random Walk (eDRW+) [63] adopted the RWR algorithm to identify breast cancer prognostic markers from multiclass expression data. This method utilises an analysis of variance (ANOVA) F-test statistic and pathway selection to improve the weight of genes in the directed pathway network. Integrative Directed Random Walk (iDRW) [64] proposed a multi-omics data integration method based on the DRW algorithm for disease gene prediction. It constructs a directed gene-gene interaction graph based on gene expression and copy number alteration. It further defines an effective weight initialisation and gene scoring method to identify topologically important genes and pathways. PWN [17] is a variant of the RWR algorithm that prioritises disease targets based on a combination of internal and external features of network warping: graph curvature and prior knowledge. It generates a weighted asymmetric network from unweighted and undirected networks by computing the edges' Ricci curvature and assigning higher weights to prior knowledge-related edges based on RWR. The final gene scores are obtained via diffusion through the warped network. Figure 2 illustrates the graphical overview of PWN.
On the other hand, BioGraph [45] utilises a stochastic RWR model on an integrated network containing 21 publicly available curated databases for disease gene prioritisation. This data integration and mining platform computes the posterior probability for a given candidate gene prioritisation query to identify genes for hereditary diseases. Simplified Laplacian Normalization-Supervised Random Walk (SLN-SRW) [65] integrates biomedical data from heterogeneous sources to predict disease genes. It proposed a Laplacian normalisation-based supervised random walk algorithm to model an integrated network's edge weights for the prediction of gene-disease relationships. Meanwhile, Driver genes discovery with Improved Random Walk method (Driver_IRW) [66] is a novel method based on the RWR algorithm to identify cancer driver genes by integrating transcriptomic data and interaction networks. This method incorporates transition probabilities and global centrality measures to compute the probability vectors for random walking to seed nodes. Weighted PageRank [67] aims to prioritise type 2 diabetes genes by leveraging the modified PageRank algorithm on bilayer biomolecular networks. It constructs the network based on differential mutual information and ranks the diabetes genes using RWR on the heterogeneous networks. Biological Random Walks (BRW) [68] employs RWR to leverage the integration of multiple biological sources for disease gene prioritisation. This method computes personalisation vectors and an aggregated transition matrix using a convex combination before applying a random walk model to rank genes. Figure 3 illustrates the framework of BRW.
Improved sDRW [69] is an enhanced sDRW algorithm that implements sequential random walks on two biological networks. It aims to enhance the sensitivity of cancer prediction in the conventional sDRW algorithm by introducing a walker network to identify significant genes in both networks. Meanwhile, entropy-based Directed Random Walk (e-DRW) [70] performs RWR on two separate networks to prioritise disease genes. It constructs separate biological networks from different pathway databases to improve the coverage of pathway information for random walking. A robust gene-weighting and pathway activity inference method incorporating an entropy probability parameter is proposed to infer pathway biomarkers at the level of functional categories. Table 1 summarises a collection of random walk-based methods for disease gene prediction based on node classification tasks (refer to Appendix Table S1 for more details).

Link prediction aims to predict unknown links between two sets of nodes (i.e. genes and diseases) based on known associations between the nodes (known disease-gene associations).
In a heterogeneous graph G, denoted as G(U, V, E), U and V represent the sets of genes and diseases, respectively, while E denotes the edges within U, within V, and between U and V (i.e. known disease-gene associations) [11]. Link prediction on network G predicts disease-gene associations by measuring the proximity or similarity between gene nodes and the disease of interest.
Random walk methods have several biological applications for disease gene prediction based on link prediction tasks. These applications include protein function prediction, drug target interaction prediction, microRNA-disease association prediction, and lncRNA-disease association prediction.
Protein function prediction predicts the function of a protein by exploring the protein-function relationship from PPI networks and Functional Interrelationship Networks (FIN). A PPI network is a complex network of associated proteins, whereas a FIN is constructed based on Gene Ontology (GO) term functional similarity. The underlying assumption is that proteins located close to each other in a PPI network tend to share similar functional annotations, and that two similar functions usually co-annotate a common protein [14]. Under this assumption, random walk models diffuse information across the whole network and discover the interrelationships between nodes of different biological networks based on the converged probability distribution.
On the other hand, drug target interaction prediction is a typical link prediction problem that aims to facilitate drug repositioning. Random walk methods serve as a drug repositioning tool for predicting unknown drug targets or drug-disease interactions. Since similar drugs often target similar proteins, several biological networks, such as drug-drug interaction networks and PPI networks, are employed to explore the associations between nodes and solve the prediction problem. When a random walk is run on a heterogeneous network, the drug-drug interaction network can be constructed based on drug chemical structure similarity, and the PPI network can be constructed based on the amino acid sequences of target proteins [14]. Random walk models take the heterogeneous network as input and compute the likelihood of an edge between pairs of proteins and drugs through network diffusion.
Apart from that, microRNAs (miRNAs) are single-stranded non-coding RNAs that play an important role in the pathogenesis of human diseases [71]. Random walk models represent a promising tool for uncovering potential miRNA-disease associations using the constantly accumulating miRNA, disease, and miRNA-disease association data. Under the assumption that functionally related miRNAs are frequently associated with phenotypically similar diseases [71], miRNA-disease associations can be predicted by constructing two subnetworks, a miRNA functional similarity network and a disease phenotype similarity network, bridged by known miRNA-disease associations. The random walker jumps from one subnetwork to the other through these associations with a certain probability to detect miRNA candidates that could potentially be associated with diseases.
Long non-coding RNAs (lncRNAs) are long chains of nucleotides with various biological mechanisms closely related to human diseases, including cancers and degenerative neurological diseases [72]. Based on the hypothesis that functionally similar lncRNAs are likely to be related to diseases with similar phenotypes [73], lncRNA-disease association prediction has rapidly gained attention among researchers seeking to understand the pathogenesis of diseases at a molecular level. By drawing on multiple biological data sources, random walk models can combine disease semantic similarity networks and lncRNA functional similarity networks with known lncRNA-disease associations to predict new lncRNA-disease associations.

1) RANDOM WALK METHODS BASED ON LINK PREDICTION
Random Walk with Restart on Heterogeneous Network (RWRH) [42] is an extended RWR algorithm that prioritises genes and phenotypes simultaneously using known gene-phenotype relationships. Gene-phenotype associations connect the gene and phenotype networks to construct a heterogeneous network. Random Walker on Protein Complex Network (RWPCN) [74] predicts and prioritises disease genes on a heterogeneous network comprising a phenotype similarity network, a protein complex network, and a protein interaction network. It uses protein complexes to aid the inference of gene-phenotype associations for disease gene prioritisation. Figure 4 presents the overall network structure of RWPCN. Meanwhile, Random Walk with Restart on Multiplex-Heterogeneous network (RWR-MH) [75] extends the RWR algorithm to multiplex and heterogeneous networks to prioritise disease genes. A multiplex network is formed by integrating PPI, pathway, and co-expression networks. This multiplex network is further connected to a disease-disease similarity network through gene-disease associations to predict disease-associated genes.
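As an illustration of how a walker can move both within and between the gene and phenotype networks, the following minimal sketch assembles a row-stochastic transition matrix for a two-layer heterogeneous network and runs RWR on it. It is one common construction rather than the exact formulation of any cited method. The function names, the jump probability `jump`, the restart value, and the toy matrices are all assumptions made for illustration, and every node is assumed to have at least one neighbour inside its own network.

```python
def row_normalise(row, scale):
    """Scale a row so that it sums to `scale` (all zeros if the row is empty)."""
    total = sum(row)
    return [scale * x / total for x in row] if total > 0 else [0.0] * len(row)

def heterogeneous_transition(gene_adj, pheno_adj, bipartite, jump=0.5):
    """Row-stochastic transition matrix over genes followed by phenotypes.

    bipartite[i][j] > 0 marks a known association between gene i and phenotype j.
    A node with cross-network links jumps to the other network with probability `jump`.
    """
    n_g, n_p = len(gene_adj), len(pheno_adj)
    rows = []
    for i in range(n_g):
        links = bipartite[i]
        stay = 1.0 - jump if sum(links) > 0 else 1.0
        rows.append(row_normalise(gene_adj[i], stay) + row_normalise(links, jump))
    for j in range(n_p):
        links = [bipartite[i][j] for i in range(n_g)]
        stay = 1.0 - jump if sum(links) > 0 else 1.0
        rows.append(row_normalise(links, jump) + row_normalise(pheno_adj[j], stay))
    return rows

def rwr_heterogeneous(trans, p0, restart=0.7, eps=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * trans^T p + restart * p0 until the L1 change < eps."""
    n = len(trans)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [restart * p0[v] for v in range(n)]
        for u in range(n):
            for v in range(n):
                nxt[v] += (1.0 - restart) * p[u] * trans[u][v]
        diff = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if diff < eps:
            break
    return p

# Hypothetical toy data: gene path g0 - g1 - g2, phenotypes d0 - d1, known pair (g0, d0).
gene_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
pheno_adj = [[0, 1], [1, 0]]
bipartite = [[1, 0], [0, 0], [0, 0]]

trans = heterogeneous_transition(gene_adj, pheno_adj, bipartite, jump=0.5)
seed = [0.5, 0.0, 0.0, 0.5, 0.0]  # equal restart mass on seed gene g0 and seed phenotype d0
scores = rwr_heterogeneous(trans, seed)
print([round(s, 4) for s in scores])
```

In this toy setting, candidate genes would be ranked by the first three entries of `scores` and phenotypes by the last two.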
Laplacian normalisation based Random Walk with Restart on Heterogeneous network (LapRWRH1 and LapRWRH2) [76], a Laplacian normalisation-based RWR algorithm on heterogeneous networks, prioritises disease genes and identifies potential gene-phenotype relationships. Laplacian normalisation is used to normalise the edge weights in the heterogeneous network and the transition probability matrices. Besides that, Network-based Random Walk with Restart on the Heterogeneous network (NRWRH) [77] was developed to infer potential drug-target interactions based on RWR in a heterogeneous network. Drug similarity and protein (target) similarity networks are connected via drug-target interactions for drug-target prediction. Random Walk with Restart on Heterogeneous Network with Multiple Data Sources (RWRH-MDA) [78] operates RWR on a heterogeneous network to predict miRNA-disease associations. The heterogeneous network is constructed from a disease similarity network and a miRNA similarity network connected by known miRNA-disease associations. This heterogeneous network overcomes the limitations of previous methods (i.e. use of only a single dataset, inadequate disease semantic similarity, and overestimation of the predictive accuracy) to identify potential disease-related miRNAs.
Random Walk with Restart on Multigraphs (RWRM) [79] adopts the RWR algorithm to prioritise disease genes based on the proposed Complex Heterogeneous Network (CHN). The CHN model is constructed from a PPI network and a multigraph gene network (i.e. an integration of Biological Process (BP), Cellular Component (CC), and Molecular Function (MF) networks). A phenotype network is then connected to the model as a subgraph to guide the random walk. Two-Rounds Random Walk with Restart based on Multiple Biological Data (TRWR-MB) [80] is an extension of the RWRH algorithm that explores cancer genes on a quadruple-layer heterogeneous network. The network integrates multiple biological data consisting of a PPI network, pathway network, microRNA similarity network, lncRNA similarity network, cancer similarity network, and protein complexes. A two-round RWR is then executed on the network to obtain the final ranking score. RWRH-Malaria [81] was proposed to predict malaria-associated genes based on RWR on cross-species PPI networks for humans and parasites. The network integrates human-human, parasite-parasite (Plasmodium falciparum), and human-parasite protein interactions, using known malaria genes as the seeds to identify candidate malaria genes.
Furthermore, Biased Random Walk with Restart on Multilayer Heterogeneous networks for MiRNA-Disease Association prediction (BRWRMHMDA) [82], an enhanced Biased Random Walk with Restart (BRWR) method, predicts potential miRNA-disease associations using a degree-based RWR on a heterogeneous network. This method designs a multilayer heterogeneous network based on known miRNA-disease associations, disease semantic similarity, miRNA functional similarity, and Gaussian interaction profile kernel similarity for diseases and miRNAs. Biased RWR is then implemented on the degree-based heterogeneous network to obtain potential miRNA-disease associations. Bi-Random Walk (BiRW) [48], on the other hand, employs a bi-random walk algorithm to prioritise disease genes based on paired phenotype-gene associations. It aims to capture the patterns of the phenome-genome association network based on a regularisation framework for graph matching. RWR is performed on the Kronecker product graph between the PPI and phenotype similarity networks using balanced and unbalanced steps. Meanwhile, Unbalanced Bi-Random Walk (UBiRW) [83] applies an unbalanced bi-random walk on a PPI network and a functional interrelationship network to predict protein functions. Unbalanced Random Walk on Three Biological Networks (ThrRW) [84] implements RWR by taking different numbers of random walk steps on three biological networks (a protein interaction network, a domain co-occurrence network, and a functional interrelationship network) to predict functions for unknown proteins. Functional protein information is propagated among the three networks through associations between the nodes in different networks. Three-layer heterogeneous network Combined with unbalanced Random Walk for MiRNA-Disease Association prediction (TCRWMDA) [85] aims to predict potential miRNA-disease associations based on an unbalanced random walk on a three-layer heterogeneous network. To compute the potential association scores between a disease and its associated miRNAs, it takes three different numbers of random walk steps on the lncRNA similarity network, disease similarity network, and miRNA similarity network. Multiple Similarities Fusion based on Unbalanced Bi-Random Walk (MSF-UBRW) [73] uses an unbalanced bi-random walk with multiple similarities fusion to identify lncRNA-disease associations. This method fuses multiple similarities (including functional, Gaussian interaction profile kernel, and linear neighbour similarities) of lncRNAs and diseases to guide different numbers of random walk steps on the lncRNA and disease similarity networks, respectively. Bi-Random Walk and Matrix Completion (BRWMC) [86] is a network-based approach that predicts lncRNA-disease associations based on a bi-random walk and a matrix completion method. It employs RWR to preprocess the known lncRNA-disease association matrix and combines it with the matrix completion method to predict associations between lncRNAs and diseases.
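The idea of taking different numbers of steps on two similarity networks can be sketched as follows. This is a simplified, averaging form of an unbalanced bi-random walk, not the exact update rule of any method cited above. The function names, the decay factor `alpha`, the step counts, and the toy matrices are illustrative assumptions, and both similarity matrices are assumed to be normalised beforehand.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def blend(walked, prior, alpha):
    """Mix the propagated scores with the known associations (restart-like term)."""
    return [[alpha * w + (1.0 - alpha) * p for w, p in zip(w_row, p_row)]
            for w_row, p_row in zip(walked, prior)]

def unbalanced_birw(left_sim, right_sim, assoc, alpha=0.8, left_steps=2, right_steps=3):
    """Walk `left_steps` times on the row network and `right_steps` times on the column network.

    left_sim  : similarity between row entities (e.g. diseases)
    right_sim : similarity between column entities (e.g. genes)
    assoc     : known association matrix (rows x columns)
    """
    scores = [row[:] for row in assoc]
    for t in range(1, max(left_steps, right_steps) + 1):
        updates = []
        if t <= left_steps:   # propagate along the row-side network
            updates.append(blend(matmul(left_sim, scores), assoc, alpha))
        if t <= right_steps:  # propagate along the column-side network
            updates.append(blend(matmul(scores, right_sim), assoc, alpha))
        scores = [[sum(u[i][j] for u in updates) / len(updates)
                   for j in range(len(assoc[0]))] for i in range(len(assoc))]
    return scores

# Hypothetical toy data: 2 diseases x 3 genes, with one known pair (disease 0, gene 0).
disease_sim = [[0.7, 0.3], [0.3, 0.7]]
gene_sim = [[0.6, 0.4, 0.0], [0.4, 0.5, 0.1], [0.0, 0.1, 0.9]]
known = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

for row in unbalanced_birw(disease_sim, gene_sim, known):
    print([round(x, 4) for x in row])
```

Unknown pairs receive non-zero scores through similarity: in this toy example, gene 1 scores higher than gene 2 for disease 0 because it is more similar to the known gene 0.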

V. DISEASE MODULE IDENTIFICATION
Identification of a disease module is also called module inference or graph clustering. It detects a group of genes related to a disease phenotype [87]. Groups of genes that are involved in similar biological functions are called communities, modules, or clusters. Disease module identification is driven by the underlying assumption that disease-related proteins tend to interact closely in biological networks [88]. Traditional techniques for disease module identification focus on a particular protein or biological pathway and are neither economical nor, by definition, able to study the entire system [89]. For this reason, network-based approaches that model the structure and dynamics of biological systems can aid in identifying disease modules in the human interactome. Such approaches offer a comprehensive understanding of the disease mechanisms and pathophenotypes at a system level and direct the search for therapeutic targets [90].
Functional module or protein complex detection is the main biological application of random walk methods for disease module identification. A functional module can be defined as a group of genes or gene products connected by one or more genetic or cellular interactions [91]. Since the interactions of gene products in PPI networks drive biological processes, functional module detection has become a significant biological problem for predicting densely clustered essential proteins and disease genes in biological networks. As the genes or proteins within a cluster are typically densely connected to one another and only loosely connected to the rest of the network, a random walker is more likely to stay within a cluster than to travel between clusters. Based on this concept, various random walk clustering methods were developed to identify functional modules from PPI networks.
Markov Clustering algorithm (MCL) is a network-based computational approach based on the simulation of stochastic flow in graphs [92]. The main idea of the MCL algorithm is that if a random walker starts from a node and randomly travels to a connected node, it is more likely to stay within a cluster than to cross clusters. In general, the MCL algorithm clusters a network in six steps. First, an association matrix is created from an undirected input graph. Second, self-loops are added to each node, and a normalised adjacency matrix is constructed for the network. Third, the adjacency matrix is repeatedly multiplied by itself (expansion) to spread the information flow to other regions of the network. Fourth, the resulting matrix is rescaled using inflation to strengthen strong currents and weaken weak currents. Fifth, the expansion and inflation operations are repeated until a steady state (convergence) is reached. Finally, the resulting matrix is interpreted to discover clusters.
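The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (dense matrices, an expansion power of 2, and illustrative values for the inflation parameter, tolerance, and pruning threshold), not a substitute for an optimised MCL implementation. All function names and the toy graph are hypothetical.

```python
def normalise_columns(m):
    """Make every column sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion: square the matrix so that flow spreads to other regions."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def inflate(m, r):
    """Inflation: raise entries to the power r and renormalise each column."""
    return normalise_columns([[x ** r for x in row] for row in m])

def mcl(adj, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(adj)
    # Steps 1-2: association matrix with self-loops, then column normalisation.
    m = normalise_columns([[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                           for i in range(n)])
    # Steps 3-5: alternate expansion and inflation until the matrix stops changing.
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        change = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if change < tol:
            break
    # Step 6: each non-empty row is an attractor; its non-zero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(toy))
```

A larger inflation value generally yields finer-grained clusters, which is why this parameter is usually tuned for the network at hand.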
Markov Clustering based on Core Attachment on weighted networks (MCL-CAw) [93] was developed as a core-attachment-based refinement method coupled with MCL to identify yeast complexes using weighted PPI networks. It refines the clusters produced by MCL using the core-attachment structure and uses the affinity-scored PPI network to derive meaningful yeast complexes. Meanwhile, Soft Regularised Markov Clustering (SR-MCL) [94] adopts MCL as a base algorithm and iteratively re-executes the clustering operation to identify functional modules in a PPI network. To ensure different clustering results in each execution, it introduces a penalised ratio to control the stochastic flow of each node. Following the clustering algorithm, a post-processing algorithm is applied to remove redundant and low-quality clusters. Another study proposed Markov Clustering [95] as a graph clustering method to identify protein complexes within highly interconnected PPI networks. It optimises the MCL parameter and then computes network modularity and density from the MCL cluster granules to generate protein complexes with high protein interaction.
Next, Repeated Random Walks (RRW) [37] was proposed as an extended RWR algorithm based on repeated random walks on graphs to discover molecular complexes and functional modules within protein interaction networks. The edges in the network are weighted by the strength of functional associations, and the random walk process is repeated to identify overlapping clusters of yeast genes. In another study, Node-Weighted Expansion of clusters of proteins (NWE) [96] was introduced as an enhanced RRW algorithm to detect protein complexes on the PPI network. This method weights the clusters of nodes by the total sum of the weights of all the adjacent edges in the network. Local Protein Community Finder [97], in contrast, applies two local clustering algorithms called Nibble [98] and PageRank-Nibble [99] to discover high-quality communities near a queried protein in a PPI network. This method locally partitions a protein network to identify quality clusters with low conductance and high functional coherence.
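To make the notion of conductance concrete, the sketch below computes the conductance of a node set and uses a sweep over degree-normalised random walk scores to pick a local community around a query node. It is a simplified illustration: a plain power-iteration personalised PageRank stands in for the approximate procedures used by Nibble and PageRank-Nibble, and the function names, the teleport value `alpha`, and the toy graph are assumptions.

```python
def conductance(adj, nodes):
    """Edges leaving the set divided by the smaller of the two volumes (lower is better)."""
    inside = set(nodes)
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 1.0

def personalised_pagerank(adj, seed, alpha=0.15, iters=200):
    """Power iteration with teleport probability alpha back to the seed node."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: (alpha if v == seed else 0.0) for v in adj}
        for u in adj:
            for v in adj[u]:
                nxt[v] += (1.0 - alpha) * p[u] / len(adj[u])
        p = nxt
    return p

def sweep_cut(adj, scores):
    """Order nodes by score/degree and return the prefix with the lowest conductance."""
    order = sorted(adj, key=lambda v: scores[v] / len(adj[v]), reverse=True)
    best, best_phi = None, float("inf")
    for k in range(1, len(order)):
        phi = conductance(adj, order[:k])
        if phi < best_phi:
            best, best_phi = set(order[:k]), phi
    return best, best_phi

# Hypothetical toy PPI network: two triangles (a-b-c and d-e-f) joined by the edge c-d.
ppi = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
community, phi = sweep_cut(ppi, personalised_pagerank(ppi, "a"))
print(sorted(community), round(phi, 4))
```

For the query node `a`, the sweep returns the triangle containing it, which has only one outgoing edge and therefore the lowest conductance among the candidate prefixes.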
Weighted PageRank-Nibble and Core Attachment structure (WPNCA) [100] aims to detect protein complexes from PPI networks using a weighted PageRank-Nibble algorithm and the core-attachment structure. The method assumes that neighbours that tend to form clusters with a node should be assigned higher values. Instead of treating adjacent nodes equally, it assigns them different probabilities weighted by an edge-clustering coefficient. Walktrap [101] is another algorithm based on RWR that detects modules significantly enriched with cancer genes. It develops an integrated network weighted by an average weighting scheme and uses distances to derive transition probability vectors. An efficient scoring method is proposed to partition the clusters and is further customised based on modularity, module size, and maximum module score to guide clustering. Figure 5 presents the flow diagram of Walktrap. A network-based approach [102] was proposed to identify genes and gene modules in breast invasive carcinoma (BRCA) based on RWR on the PPI network. DNA methylation and gene expression data are integrated to calculate the weights of the PPI network using Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA). The detected significant genes are then used for sub-network construction, while a random walk algorithm is applied to discover candidate disease-related modules.
Isolation [91] is a multiplex approach based on random walks for functional module identification. It integrates mRNA expression information and biomedical knowledge to reveal the functional relations of genes. This algorithm transforms the PPI network based on a k-step random walk that enumerates each node to identify clusters with locally optimal isolation. Mutual EXclusion and Coverage based random walk (MEXCOwalk) [103] is a vertex-weighted, edge-weighted random walk-based approach that extracts TCGA pan-cancer driver modules in the PPI network. The edge weights incorporate coverage information and the degree of mutual exclusivity between pairs of gene neighbourhoods in the network for module detection. TOP-down Attachment of Seeds (TOPAS) [89] implements a top-down approach to detect disease modules based on RWR on functional association networks. It seeks to connect the largest number of seed nodes while adding the fewest connectors to the final module. Meanwhile, Active Module Identification using Experimental data and Network Diffusion (AMEND) [104] is an active module identification method that uses network diffusion with the Equivalent Change Index (ECI) to identify a connected subset of genes regulated similarly or oppositely between two experimental conditions. It employs RWR to select genes and create gene weights before applying Heinz (heaviest induced subgraph) to determine the maximum-weight connected subgraph using the node weights derived from RWR. The process is iterated until the highest-scoring network is derived as the final module. Table 3 lists a collection of MCL and random walk-based methods for disease module identification (refer to Appendix Table S3 for more details).

VI. CHALLENGES AND FUTURE DIRECTIONS
The random walk model plays a significant role in solving biological problems, such as ranking nodes in biological networks, measuring similarity or distance between nodes in biological networks, detecting modules from biological networks, and determining interrelationships between nodes from different biological networks [14]. It is a highly efficient algorithm, as it is fast to implement and can be applied to large biological networks. A random walk model can be used to compute the proximity of a node to a set of source nodes and not just a single source node. This property is beneficial when a core set of members of a pathway or complex is known, and queries (or the initial nodes) for candidate members are being conducted on this network [36]. However, some challenges are observed in applying random walk models to solve biological problems based on networks. Therefore, improvements are necessary to increase the computational efficiency and scalability when such models are used extensively for genome-scale biological networks.
Parameters in random walk models are crucial in controlling the performance of the algorithms. As mentioned before, the random walk model iteratively updates the value vector and reaches a steady-state probability vector when the Euclidean distance between the value vectors of two consecutive time steps (PR_t and PR_{t+1}) is less than the threshold ε. Parameter ε acts as a threshold that controls the precision of the value vector in the algorithm. The larger the ε, the faster the convergence of the algorithm [24]. On the other hand, parameter α (also known as the restart probability or back probability) controls the information flow returning to the seed nodes at each iteration of the algorithm. The larger the α, the more likely it is for the nodes close to the seed nodes to be ranked higher, and vice versa [24]. In brief, parameters ε and α not only regulate the number of iterations in the algorithm but also affect the performance in terms of overall accuracy and prediction results.
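The effect of the two parameters can be explored with a small experiment. The sketch below is a minimal RWR implementation in which the Euclidean distance between consecutive probability vectors is compared against ε, and α is the restart probability. The function name, the default values, and the toy path network are assumptions chosen for illustration only.

```python
import math

def rwr(adj, seeds, alpha=0.5, eps=1e-6, max_iter=10_000):
    """Random walk with restart on an undirected graph.

    adj   : dict mapping each node to the list of its neighbours (degree >= 1)
    seeds : set of seed nodes sharing the restart mass equally
    alpha : restart probability
    eps   : threshold on the Euclidean distance between consecutive vectors
    Returns the final probability dict and the number of iterations performed.
    """
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    p = dict(p0)
    for iteration in range(1, max_iter + 1):
        nxt = {v: alpha * p0[v] for v in adj}             # restart term
        for u, neighbours in adj.items():
            share = (1.0 - alpha) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share                           # walk term
        distance = math.sqrt(sum((nxt[v] - p[v]) ** 2 for v in adj))
        p = nxt
        if distance < eps:
            break
    return p, iteration

# Hypothetical toy network: a path g1 - g2 - g3 - g4 - g5 seeded at g1.
toy = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2", "g4"],
       "g4": ["g3", "g5"], "g5": ["g4"]}

for alpha in (0.3, 0.9):
    scores, _ = rwr(toy, {"g1"}, alpha=alpha)
    print("alpha", alpha, {g: round(s, 3) for g, s in scores.items()})

for eps in (1e-2, 1e-8):
    _, n_iter = rwr(toy, {"g1"}, alpha=0.5, eps=eps)
    print("eps", eps, "iterations", n_iter)
```

With a larger α, more probability mass stays on and near the seed; with a larger ε, the loop stops after fewer iterations and returns a less precise vector.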
Besides that, the size of the biological network used for random walking can significantly affect the algorithm's computing time. Random walk models take longer to converge when the network is very large. This implicitly creates a high computational complexity, ultimately limiting in-depth analyses of the networks. Modelling a dynamic network is another challenge for random walking in a biological network. Inherently, biological networks can change with time, context, and complexity [105]. Although many networks contain such temporal information, most studies applied the random walk model on static snapshots of the graph and have largely ignored the temporal dynamics of the network [106]. Thus, biological network construction is important to yield appropriate and meaningful results for large-scale information networks.
With the development of high-throughput techniques, random walk models can be improved to overcome these limitations. The optimal value of the parameter for applications should be determined using theoretical formulas or equations [14]. Additionally, the cost of computing probability vectors should be reduced to improve the efficiency of the algorithms for various machine learning tasks, such as node similarity measurement, link prediction, classification, and clustering. There should be more focus on the construction of biological network dynamics, including dynamical network construction [107], [108], dynamical disease gene prediction [109], and dynamical functional module identification [110], [111], [112]. In the future, more efforts should be made towards designing effective random walk-based methods that work on active subnetworks as well as on the association networks of those subnetworks. Besides, the integration of multi-biological resources and multi-biological networks should be emphasised to improve the application of random walk models in solving biological problems based on networks.

VII. CONCLUSION
Identifying disease genes and disease modules is critical for understanding disease mechanisms and uncovering disease-gene associations. Random walk-based approaches have been widely applied in bioinformatics for solving biological problems based on biological networks. This study reviewed diffusion-based random walk methods that leverage various networks in their problem formulation for disease gene prediction and disease module identification. The basic concepts of the random walk model, including variations of random walk approaches, were introduced for specific applications. This review focused on underscoring the strengths and weaknesses of state-of-the-art random walk methods for disease gene prediction and disease module identification instead of their prediction performance. An organised, up-to-date overview of these computational approaches merits exploitation in research on the genetic causes of human diseases.
Selecting a random walk computational approach for specific biological problems is difficult because it depends on various factors. Some general principles are provided as guidance for potential users of such applications. An important consideration is whether the random walk method can integrate multi-biological resources and networks. Multi-dimensional data can reflect various biological features, while multi-biological networks serve as the framework to capture the complex hierarchical relationships among those biological molecules. Thus, both properties undoubtedly contribute to solving biological problems. However, integrating different biological networks into a heterogeneous or multiplex network may ignore the inherent differences between those networks. In conclusion, an effective random walk-based method should treat biological networks unequally by considering different numbers of walking steps on multiple networks.

FIGURE 1. Pipeline for applying diffusion-based random walk methods to biomedical tasks.

TABLE 1. Random walk-based methods based on node classification.

TABLE 2. Random walk-based methods based on link prediction.
Table 2 presents a collection of random walk-based methods for disease gene prediction based on link prediction tasks (refer to Appendix Table S2 for more information).

TABLE 3. MCL and random walk-based methods for disease module identification.