Identifying Genes Associated With Autism Spectrum Disorders by Random Walk Method With Significance Tests

Autism spectrum disorders (ASD) are generally defined as a development disorder typically characterized by social interaction and communication ailments and stereotyped actions due to combined genetic and environmental factors. Different critical aspects contribute to ASD, and consensus has been reached among autism researchers about its predominant genetic factors. However, the pathogenesis of ASD has not been fully revealed, and a systematic method must be developed to identify the genes related to this disease. Here, we predicted new ASD-associated genes by random walk method on the basis of prior-known ASD genes from AutDB. New genes such as RAC3, AC1 (ADCY1), PKC (PRKC) gamma, EPH receptor A5, WNT3A, calretinin, RAS-R, KLF4, and calpain 3 were found to play an irreplaceable role in ASD pathogenesis.


I. INTRODUCTION
Autism spectrum disorder (ASD) is a widely reported developmental illness with typical symptoms of social interaction and communication disorders and restricted and repetitive behaviors and pathogenesis of combined genetic and environmental factors. According to a global report [1], approximately 24.8 million people suffered from such disease in 2015. Even in developed countries, more than 1.5% of child population is clinically diagnosed with ASD as early as 2017 [1], indicating that this ailment is one of the most mental threatening diseases for people worldwide [1]. ASD pathogenesis has been widely studied, and some critical factors for its initiation and progression have already been identified. This systematic disease involves multiple tissues, organs, and regulatory systems including neural (the brain), immune, and digestive systems [2], [3]. Among these, the neural system is the most important with typical pathological symptoms associated with ASD, and three of its regions are correlated with the pathogenesis of this illness. Cerebellum abnormalities participate in the pathogenesis of ASD.
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou .
Further studies on the communications between cerebellum and parietal lobes via brainstem confirmed that pathogenesis on such region may trigger motor dysfunction and dyspraxia, typical symptoms of ASD [3]. In addition to cerebellum, amygdala regions are also correlated with ASD pathogenesis. An abnormal metabolism pattern in such brain region is correlated with the progression of ASD, especially in the early initiation stage [4].
Some typical physical regions in the neural system are correlated with ASD pathogenesis. Genetic background is essential in studying the initiation and progression of ASD.
Multiple key pathological factors of ASDs have been identified in multiple tissues at the genetic level. One work focused on the neural system, which was introduced with a regional genetic variation induced syndrome named as fragile X syndrome (FXS), the leading genetic pathogenesis for ASD experienced by more than 5% of patients [5]. In addition, patients with FXS share similar symptoms with ASD such as stereotyped behaviors [6], indicating that fragile X syndrome and its typical genetic variation patterns are correlated with ASD by regulating specific behaviors of this illness. In addition to the whole or regional genome abnormality at genetic levels, specific factors, such as the functional molecule oxytocin, are pathologically correlated with ASDs. Oxytocin and its specific system regulate sculpting socio-sexual behaviors. Pathogenesis involving such system has been observed in ASD and is correlated with problems in multiple neural systems including language and social communication, sensory, and cognitive [7]. Genes such as SHANK3 [8] and PTCHD1 at X-chromosome [9] also participate in ASD pathogenesis via regulation of the neural system and contribute to certain typical symptoms of this disease. Other genes related to ASD, such as ITGB3 [10], RORA [11], oxytocin-encoding genes, and arginine-vasopressin receptors [12], are also functionally correlated with the pathogenesis of such disease. The above genes and genetic abnormalities have been functionally correlated with the neural system, thereby validating the key role of genetic factors affecting neural system during ASD. In addition to the regulatory factors involving the neural system, an endogenous glycosylated vitamin D binding protein was found in macrophages, a key component of the focal immune system, and participates in the correction and repair of dysregulated tissues in pathological regions. Hence, this protein may also be abnormally expressed during ASD.
In summary, the comprehensive pathological mechanisms of ASD have not been fully revealed. Although some genes were found to play a role in the development of ASD, studies on genes that affect ASD are still lacking. Therefore, developing a systematic method to identify the genes related to ASD is necessary. Here, we predicted ASD-associated genes by random walk method based on knowledge on prior-known ASD genes from AutDB [13] and biological network from STRING database [14]. Several newly identified genes can be confirmed, thus validating the efficacy and accuracy of our prediction technique.

A. DATA
A total of 990 ASD genes were identified from AutDB (http://autism.mindspec.org/autdb/) database with version in 2018 [13] listed in Supplementary file S1. AutDB manually curated these genes from literature, and each association has corresponding evidence in references. These genes were considered as seed genes to identify novel ASD-related genes on the STRING network [14], a comprehensive functional association network that integrates various databases, literatures, high throughput analysis, and functional experiments.
In 2020, a new version of AutDB updated the ASD gene number to 1141. As shown in Figure 1, 978 ASD genes were also included in the old version of AutDB, whereas 163 genes were newly added. These newly added ASD genes would be used for independent test, which are listed in Supplementary file S2.  problems [15]- [22]. RWR algorithm, as a typical networkbased feature ranking algorithm [23]- [27], can simulate a random walker starting from one or several seed nodes and walking on a network, thus updating the weight (probability) vector of network nodes in an iterative way.
In the mathematical description, P i is the probability vector after the i th updating procedure, and the next new probability vector P i+1 is updated as follows: where A represents the column-wise normalized adjacency matrix of the given network (e.g., PPI network), and λ represents the probability of returning to seed nodes (e.g., restart action of random walk) and indicates the importance of seed nodes, and P 0 is the initial probability vector. The RWR algorithm stops when the probability vector becomes convergent, and output P i+1 as the final result. Each element in this final vector indicates the probability of the corresponding nodes associated with seed nodes (e.g., as ASD-associated genes). A total of 990 prior-known ASD genes were picked up as seed nodes in the RWR algorithm. An initialization vector P 0 was constructed and consisted of 19,247 elements, which correspond to the 990 ASD-associated genes set to 1/990. The other elements were set to zero, and the probability of returning to seed nodes λ was set as 0.8. The algorithm termination rule is P i+1 − P i 1 < 10 −6 , and the genes assigned with probabilities higher than a threshold as 10 −5 were picked up as candidates for further investigations.

C. SIGNIFICANCE TESTS FOR SELECTING ASD-ASSOCIATED GENES
The RWR algorithm can detect candidate ASD-associated genes on the basis of available ASD genes. Owing to the high dependence on the structure of PPI network, RWR algorithm might select some nodes (genes) with special positions in PPI network to introduce some false-positive candidate genes. Increasing the reliability of selected candidate genes on the basis of their important functions is also necessary. Therefore, three significance tests, namely, permutation test, association test, and function test were further applied to rank and select ASD-associated genes with high confidence.

1) PERMUTATION TEST
The permutation test aims to evaluate the probability values produced by RWR algorithm. m Ensembl ID sets with given size were randomly produced, and RWR algorithm was conducted on the PPI network with this set members as seed nodes to ensure that the tested ASD-associated genes from RWR can receive their probability values in a test. After all m sets were tested, each tested ASD-associated gene should have one actual probability value and random m ones. The P value of permutation test can be calculated as follows: where represented the number of randomly produced sets on which candidate gene g received higher probability values than its actual one. A candidate gene with a small P value is likely to be associated with ASD because it has less association with randomly produced sets.

2) ASSOCIATION TEST
Among the candidate ASD-associated genes passing the permutation test, those with strong associations with priorknown ASD genes were selected. In the PPI network, the interacting proteins always share similar functions; hence, each candidate gene g was linked to ASD genes, and the MIS was measured as follows: where I (g, g ) indicates the confidence score of the interaction between g and g . Candidate genes with high MISs are likely to be actual ASD-associated genes.

3) FUNCTION TEST
The validated ASD-associated genes must be highly related to biological processes. The gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways are two popular sources of biological processes. The enrichment theory was employed to quantify the relationship between one gene and GO terms or KEGG pathways under the following assumption: the important candidate ASD-associated genes should exhibit similar relationships with at least one ASD genes on the GO terms and KEGG pathways. For one gene g and one GO term or KEGG pathway T , their relationship defined by enrichment theory was calculated as follows: where N is the total number of genes/proteins in humans, M is the number of genes/proteins annotated to T , n is the number of genes/proteins in H that consist of g and its direct neighbors in the PPI network reported in STRING, and m is the number of genes/proteins that are in H and annotated to T . Our in-house program using R function phyper was developed to calculate this S value. Accordingly, each gene can be encoded into a vector by collecting all outcomes of this function. For gene g, the obtained vector was formulated as E(g). The relationship between two genes g and g on their GO terms and KEGG pathways can be measured by the following: Similar to MIS, another measurement, MFS, can be calculated for each gene g as follows: MFS(g) = max{Q(g, g )|g is a ASD gene} (6) The candidate ASD-associated genes with high MFS, which is important for biological function, must be selected.

III. RESULTS AND DISCUSSION
In this study, based on the RWR algorithm and three significance tests, we designed a computational method to discover novel ASD-associated genes. The entire procedures are illustrated in Figure 2.

A. ASD-ASSOCIATED GENES
Among the 990 ASD genes from AutDB database with 2018 version, RWR analysis assigned weights or probabilities for all genes on the network, and those with probability larger than 10 −5 remained as initial candidates. Three significance tests helped in extracting important genes and controlling false-positive genes. RWR candidates with P-value less than 0.05, MIS larger than 900 (the cutoff of the highest confidence score in STRING), and MFS larger than 0.97 were further selected and are listed in Supplementary file S3. Finally, 147 genes were determined as ASD-associated genes. Some representative genes and their quality measurements are listed in Table 1, and the network interactions consisting of all 147 genes are displayed in Figure 3. It can be observed that most of these genes had strong associations with each other, indicating that they may share similar functions. Thus, if one gene can be confirmed to be ASD, others may also be related to ASD. Functional enrichment analysis indicated that these predicted ASD-associated genes participate in many disease-relevant functions ( Table 2 and Supplementary file S4). As mentioned in Section II.A, 163 newly added ASD genes were picked up for the independent test. A Venn diagram for the predicted ASD genes and newly added genes was plotted, as shown in Figure 4. Clearly, three predicted genes (PRKCA, ACTL6B, and EGR3) were included in the recently updated AutDB in 2020, indicating that our computational method can really discover novel ASD-associated genes.

B. FUNCTIONAL RELEVANCE OF ASD-ASSOCIATED GENES
A total of 147 genes are functionally correlated with ASD as confirmed by recent publications, thus validating the efficacy   and accuracy of our prediction results. Owing to the limitation of the manuscript's length, only the representative genes were selected for further detailed discussion. RAC3 is the first gene in our prediction list. This gene encodes a GTPase of the RAS superfamily and regulates diverse essential biological processes including cell growth, cytoskeletal reorganization, and protein kinase activation [28], especially in the neural system as the major pathological region of ASD. With regard to the correlations between RAC3 and ASD, this predicted gene contributes to the regulation of neuronal development by interacting with a homologue named RAC1 [29], [30]. Considering that the FIGURE 3. STRING network of predicted candidate ASD genes. Most of these genes have strong associations with each other, suggesting that if one gene can be confirmed to be ASD genes, others may also be related to ASD. abnormal development of neural system is one of the major pathogenesis of ASD [31], [32], predicting such gene as a candidate pathogenic factor for ASD is highly reasonable. This finding validates the efficacy and accuracy of our prediction.
In addition to RAC3, the next predicted gene is AC1(ADCY1) encoding a functional adenylate cyclase a group of enzymes that regulates the essential functions of living cells [33]. With regard to the correlations between adenylate cyclase and ASD, reduced sociability as a typical symptom for ASD is correlated with AC1 at the transcrip-tomic regulatory level [33]. Further studies on the whole genome level also identified typical genetic variants located inside the gene body of adenylate cyclase 1 (AC1) from an ASD family, indicating that such gene may also participate in ASD pathogenesis at the genomic level [34]. In summary, gene AC1 from our optimal prediction list is definitely an ASD associated gene, thus validating the efficacy and accuracy of our prediction.
As another typical gene, PKC gamma (PKCG, also known as PRKCG) is involved in multiple cellular signaling pathways in multiple biological systems [35], [36], including the neural system. ASD regression can be attributed to pathological changes in multiple G-protein-coupled signaling pathways including PKCG-associated pathways functioning in the central neural system [37]. Protein kinases such as the candidate gene PKCG are the key regulators in such G-protein coupled signaling pathways. Hence, PKCG can be identified as a potential regulator for ASD.
The next predicted gene is EPH receptor A5 (EPHA5), an ephrin receptor that participates in the development of neural system by interfering with neuron migration and axon targeting [38], [39]. Considering that ASD can also be partially attributed to the abnormal development and dysfunction of the neural system, the candidate gene EPHA5 may also participate in ASD pathogenesis.
The next identified gene is WNT3A, a famous member of the Wnt signaling pathway [40] and is correlated with multiple proliferative biological functions under either physical or pathological conditions. With regard to the internal correlations among the predicted genes WNT3A and ASD, the abnormal alterations at the genomic levels, such as the microdeletions and micro duplications of WNT3A and related genes from the Wnt signaling pathway, have been widely identified in patients with ASD [41]. This finding indicates the potential regulatory role of these genes for ASD. Early brain development [42], [43] and synaptic functioning [44] are also functionally connected to canonical Wnt signaling pathway, thus further confirming our prediction of WNT3A as a candidate ASD-associated gene.
The next gene is CALB2, which encodes an important protein named as calretinin [45]. Identified in the neural system, CALB2 regulates the calcium related metabolism in multiple neuronal and endocrine cells, specifically in the pathogenic region of ASD, the cerebellum. A long-standing interest has been directed for its possible role in the hippocampus of patients with ASD and a selective increase in the density of CALB2-immunoreactive interneurons in the dentate [46]. This finding suggests that CALB2 regulates ASD through its increased expression. And an aberrant splicing variant and non-synonymous SNPs on Ca2+ -dependent activator protein have been identified in patients with ASD [47]. These studies implicate calretinin to function in the neuropathology and behavioral characteristics of patients with developmental disorders. In conclusion, CALB2 can be a critical gene in the ASD development, thereby corresponding with our prediction on its specific role in ASD.
According to our prediction list, RRAS encoding a specific Rap GTP binding protein named as RAS-related is an optimal gene correlated with ASD. RAS-related regulates neural progenitors [48]. With regard to the correlations FIGURE 5. General functional distribution of the predicted ASD-associated genes. VOLUME 8, 2020 between RAS-related and ASD, researchers from NY State Institute for Basic Research in Developmental Disabilities in 2013 reported that this candidate gene and its correlated pathway contribute to the regulation of neuronal cell migration and pathogenesis of ASD [49]. Combined with effective pathways such as RAS signaling pathway, RRAS suppresses the development and maturation of central neural system [49], a major system implicated in the pathogenesis for ASD [50]. This finding validated that our predicted gene is definitely an ASD-associated pathological factor.
KLF4 as another effective predicted gene and participates in the regulation of gene expression as a member of zinc finger DNA-binding proteins. With regard to the detailed correlations between this gene and ASD pathogenesis, its inhibition may further suppress the expression of cyclin B1 and D1 and has been functionally correlated with abnormal axon growth in the central neural system [51]. This finding indicated that kruppel-like factors 4 may pathologically regulate ASD by affecting axon regeneration.
Finally, CAPN3 encodes a member of calpain family, a family of effective calcium-dependent, non-lysosomal cysteine proteases (proteolytic enzymes). Proteins from Calpain family are activated by brain-derived neurotrophic factor (BDNF) and further regulate memory formation [52]. Therefore, this predicted gene may also be critical for BDNF-associated learning and memory procedures. In 2013, researchers from University of Southern California confirmed that calpains including our predicted gene play an irreplaceable role in neural system development via PTEN signaling pathway [52]. Considering the internal correlations between ASD and neural system development, CAPN3 was reasonably predicted as an ASD-associated gene.
To present the functional distribution pattern of our predicted genes, we also applied gene enrichment analyses using R package clusterProfiler to show the general functional distribution of the predicted ASD-associated genes ( Figure 5).

IV. CONCLUSION
ASD-associated genes were determined based on prior-known ASD genes and biological network information by using the random walk method combined with three significance tests. Many new candidate genes associated with ASD were identified and found to be enriched in many disease-related functions and densely interactive on the biological network. Recent literature also supports the function or pathogen relevance of the predicted genes with ASD. Among which, three genes were confirmed as ASD-related in the updated autDB. Therefore, our newly presented computational approach may identify some potential biomarkers for ASD and provide basic research foundation for further studies on the detailed pathological mechanism of ASD.

SUPPLEMENTARY MATERIAL
Supplementary file S1: List of prior-known ASD genes in 2018; Supplementary file S2: List of prior-known ASD genes in 2020; Supplementary file S3: List of candidate ASD-associated genes predicted by RWR analysis; Supplementary file S4: Functional enrichment analysis for candidate ASD-associated genes; Figure S1 Overlap of different versions of autDB; Figure