Identifying Essential Methylation Patterns and Genes Associated With Stroke

Stroke is a serious lethal factor for human beings, and thus its unique pathogenic factors and underlying molecular mechanisms must be thoroughly investigated. Current complicated examination and treatment approaches on stroke reflect the complicated pathogenesis of stroke, which involves two major pathogenic factors: phenotypic characteristic and genetic background. Stroke occurrences with different symptoms and pathogenic characteristics may be induced by genetic variations. However, epigenetic contribution and regulation on stroke pathogenesis have been neglected for a long time and thus environmental influence on stroke onset is often underestimated. In this study, relied on our newly presented computational method, we re-screened out the genome methylation data of stroke patients with different subtypes and identified a group of functional methylated or demethylated genes. Recent reports validated the abnormal methylation status of the all identified genes in the pathogenesis of stroke. The genes were associated with biological functions involved in stroke onset. Further functional enrichment analysis confirmed and summarized the novel specific pathogenic roles of ion binding and focal adhesion in the regulation of stroke at the methylation level.


I. INTRODUCTION
Stroke is a medical emergency with fast onset. In a typical clinic phenotype of stroke, blood supply to the brain is greatly interrupted or reduced [1], [2]. Stroke cases can be divided into two subgroups according to etiological features: ischemic stroke induced by lack of blood flow [3] and hemorrhagic stroke induced by excessive bleeding [4]. Both subtypes can trigger typical stroke symptoms, such as dysfunction in some brain regions and blockage of blood supply to crucial brain regions [5]. In the USA, stroke is the leading cause of death, affecting more than 800,000 people annually [6]. Therefore, given that stroke is a serious lethal factor for human beings, the unique pathogenic factors and underlying molecular mechanisms of stroke must be deeply explored.
The associate editor coordinating the review of this manuscript and approving it for publication was Quan Zou . Clinically, high-risk cases of stroke have various typical symptoms [3], [6], and patients with such symptoms are recommended to take medical treatments. For the accurate clinical diagnosis of stroke, various diagnostic approaches are generally applied, including CT (Computed Tomography), MRI (Magnetic Resonance Imaging), carotid ultrasound, and cerebral angiogram [7], [8]. Moreover, therapy approaches for stroke during follow-up clinical treatments after diagnosis vary among different patients due to their respective pathogenesis. For instance, patients suffering from ischemic stroke are generally recommended to be treated with tissue plasminogen activators within 4.5 hours [9]. By contrast, patients suffering from hemorrhagic stroke are usually treated with direct surgery rather than drugs [10].
Such complicated examination and treatment approaches on stroke actually reflect the complicated pathogenesis of stroke. The two major pathogenic factors of stroke are phenotypic characteristic and genetic background [5]. On the one hand, various phenotypic characteristics including overweight, age, inactive lifestyle, smoking, and drinking have been tightly connected to the stroke onset, according to recent publication reports [11]- [13]. On the other hand, the genetic contribution of stroke has been gradually revealed along with the development of next generation sequencing. Early in 2011, a systematic review on the genetic background of stroke revealed a set of genes with potential functional contribution to the pathogenesis of stroke [14], [15], and these genes were clustered into five groups by onset position, affected tissues and potential biological mechanisms of stroke. Some works suggested that stroke onset is affected by genetic contributions and stroke occurrences with different symptoms and pathogenic characteristics may be induced by genetic variations [16], [17].
Although genetic studies on stroke have increasingly expressed interest in genes and their variants, the epigenetic contribution and regulation on stroke pathogenesis have been neglected for a long time and have not been deeply studied. Therefore, in this study, we re-screened the genome methylation data of stroke patients from a recent publication [18], [19]. Basing on our newly presented computational method, we identified a group of functional methylated or demethylated genes that may participate as potential biomarkers in the pathogenesis of different subtypes of stroke. All predicted the genes and their enriched functions were confirmed to be associated with the stroke cases in recent reports. All in all, all the screened genes identified by our newly presented computational method not only may reflect the methylation pattern of stroke patients but also may contribute to the identification of potential detailed pathogenic mechanisms for stroke initiation and progression.

A. DATASETS
We downloaded the methylation profiles of patients with stroke from Gene Expression Omnibus under accession number of GSE69138 [18], [19]. Previous genome-wide methylation study extracted whole-blood DNA from 404 patients with ischemic stroke, which were distributed across three ischemic stroke subtypes: 132 patients with large-artery atherosclerosis (LAA), 141 patients with small-artery disease (SAD), and 127 patients with cardio embolic (CE). The DNA methylation in CpG sites were measured by Illumina HumanMethylation450 BeadChip array.

B. FEATURE SELECTION
Max relevance feature selection method was first used in the selection of relevant features (i.e., methylation sites), and only relevant features with scores greater than the predefined cutoff were retained. Then, a ranked feature list was obtained by feeding the remaining features into Monte Carlo feature selection (MCFS). Based on the ranked features, increment feature selection (IFS) adopting support vector machine (SVM) as the classifier was further used in the selection of optimum features with the best discrimination performance in classifying samples from the different subtypes of stroke. The whole analysis framework is shown in FIGURE 1. 96670 VOLUME 8, 2020

C. MAX RELEVANCE SCORE
In general, if a feature (i.e. one methylation site) is highly relevant to the outcome variable (i.e. disease phenotype), it is considered important. The maximum relevance score calculates the mutual information (MI) between features and class labels [20]- [22]. The higher the score is, the more important this feature is [23], [24]. The MI of variables x and y is defined as follows: where p(x) and p(y) is the marginal probability density for x and y, respectively, and p(x, y) is the joint probability density.
In this study, we used the MI program integrated in the minimum redundancy and maximum relevance (mRMR) package, to calculate the MI between features and class labels.

D. MONTE CARLO FEATURE SELECTION
Monte Carlo feature selection (MCFS) is a newly proposed feature selection approach based on several decision trees on several bootstrap sets [25]- [27]. Each bootstrap set is randomly produced from an original training set [28]. MCFS first generates t feature subsets, each of which contains m features that are randomly selected from original M features. Then, for each feature subset, p decision trees are generated based on the p bootstrap sample sets, in which samples are represented by features in a given feature subset. The abovementioned procedure is executed t times for the t feature subsets. Accordingly, t * p trees are constructed. The relative importance (RI) score for each feature is measured in terms of the number of observation times of this feature in all constructed trees. MCFS package [28] retrieved from http://www.ipipan.eu/staff/m.draminski/mcfs.html was used in this study. For convenience, default parameters of MCFS were used to execute, i.e. t = 2000 and p = 5. Obviously, such RI score indicates the importance of features relevant to class labels. The higher RI score is, the more important the feature is; thus, features can be further ranked with the decreasing order of their RI scores.

E. INCREMENT FEATURE SELECTION
Not all features are needed to show excellent performance. Thus, IFS [29], [30] was used to select optimum features for a supervised prediction model (i.e. multi-class classifier). It first generates a series of feature subsets with a step interval of 10 according to the ranked features yielded by MCFS. For example, feature subset 1 has only the top 10 features, and feature subset 2 has the top 20 features, and so on. Then, for each feature subset, a supervised classifier (i.e. support vector machine) is trained and evaluated on the samples represented by features in this feature subset.
In the end, we selected the feature subset yielding the best performance during 10-fold cross-validation as the optimum feature (e.g., optimal methylated/demethylated genes).

F. CLASSIFICATION MODEL
Support vector machine (SVM) is a widely applied supervised classification model under a statistical framework [31], [32]. It finds an optimal hyperplane between two classes of samples to make the margin maximum. Clearly, the margin is closely related to generalization error. Furthermore, SVM can efficiently handle linear and nonlinear data, as it can map the original data into a linear space with high dimension through kernel trick. The SVM has an effective mathematical theory for solving convex objective function with the global minimum and is appropriate for data with nonlinear structures. Additionally, it requires appropriate kernels and few tuning parameters. In this study, three groups of samples were considered, so a multi-class SVM adopting one-versus-rest strategy for multiclass classification was adopted. The polynomial function was set as the kernel function, and the regularization parameter was set to 1.

G. PERFORMANCE MEASUREMENT
In this work, there are three classes of samples, so that, a multiclass SVM classifier was learned. The individual accuracies of the three classes and the overall accuracy and Matthews correlation coefficient (MCC) [33], [34] were calculated for objective performance evaluation, combined with 10-fold cross-validation. Given that X is the predicted labels of samples and Y is the true labels, the MCC is calculated as wherex j andȳ j are the average values of members in the j-th column of X and j-th column of Y , respectively. Besides, the accuracy (ACC) for each class can also be estimated by

H. BIOLOGICAL ENRICHMENT ANALYSIS
According to the probe annotation table of Illumina Human-Methylation450 BeadChip [18], [19], our selected methylation probes were mapped onto detailed genes, which were then enriched onto GO and KEGG through a hypergeometric test [35]. The GO and KEGG terms with false discovery rate smaller than 0.05 were considered as relevant biological functions with significant enrichment on our selected features. VOLUME 8, 2020

III. RESULTS
The input features were ranked as 482,421 methylation sites by their maximum relevance scores, and only 19,491 highly relevant features with score of >0.015 were selected. Then, the discriminative features for classifying stroke samples from different disease subgroups were obtained by further refining these relevant features.
The optimal feature combination was determined through a supervised classifier by running IFS with SVM for sample classification. During this process, a series of feature subsets with a step interval of 10 was generated, and the SVM with 10-fold cross-validation were run on the samples consisting of features from one feature subset and validated. This operation was performed in each feature subset. The best MCC (0.892) was obtained when the top 4730 features were used, and the overall accuracy was 0.928. The features along with their importance scores calculated by MCFS are shown in TABLE S1. Performance corresponding to all feature subsets is provided in TABLE S2. An IFS curve is illustrated (FIGURE 2A) with the MCC value (i.e. y) as y-axis and the number of features (i.e. x * 10) as x-axis. We ran IFS in the same way, with random forest as the supervised classifier, to verify the selection of SVM as a supervised classifier in this work. The best MCC value of 0.556 was obtained when the top 790 features were used, and the overall accuracy was 0.703. Performance corresponding to all the feature subsets is provided in TABLE S2 as well. FIGURE 2B illustrates the MCCs that the RFs yielded when the number of features involved varied. The results in TABLE 1 and FIGURE 2 indicated that SVM outperforms RF and is thus a better choice for classifying samples from dissimilar stroke subtypes.

IV. DISCUSSION
Basing on our newly presented computational method in FIGURE 1, we screened a group of functional genes that have abnormal methylation status contributing to stroke pathogenesis. Of note, we adopted a two-stage feature selection procedure, considering many features required filtering. In the first stage, the Max's relevance score (MI value) of each feature was calculated, thereby discarding lots of irrelevant features. Then in the second stage, the RI score of each remaining feature was computed to obtain final relevant features. Thus, the Max's relevance score and MCFS had different functions and were both necessary in our analysis, providing an efficient feature selection procedure for 482,421 methylation sites. The final optimal genes and enriched functional terms are analyzed and discussed below.

A. OPTIMAL MEHTYLATED/DEMETHYLATED GENES
Recent publication confirmed the identified optimal methylated/demethylated genes with high ranks (i.e., the top 5 genes).
TSPY4, encoding a specific functional gene participating in sperm differentiation and proliferation, is a potential stroke-associated gene with respective methylation abnormality [36]. The methylation regulation of TSPY4 and its homologues have been widely identified during tumorigenesis and found to interact with Y chromosome-located oncogenes [37]- [39]. Moreover, the specific gene expression patterns of genes in the Y chromosome may directly participate in the pathogenesis of ischemic stroke [40]. Therefore, indirectly, our predicted TSPY4 with differential methylation may influence the expression of Y chromosome-located genes during the pathogenesis of stroke.
LHB, as a glycoprotein-hormone-encoding gene, has been predicted to be abnormally regulated at the methylation level [41], [42] during the initiation and progression of stroke. As reported, an abnormal sex hormone (e.g., testosterone in the blood) may participate in the pathogenesis of stroke [43]. Given that LHB is the encoding gene of a quiet effective sex hormone (i.e., luteinizing hormone), its methylation status may affect gene/protein expression level and is potentially pathogenic for our objective disease/symptom, stroke.
FAM153C is another identified stroke-associated gene. Binding to its related regulatory factors, FAM153C has only been reported to participate in human height regulation. According to a recent publication on voltage-gated proton channels in sperm, which is functionally connected to FAM153C [44], the proton channels are potential targets of novel drugs with specific side effects in stroke. Therefore, despite the lack of evidence, the methylation status of our predicted gene FAM153C may be functionally related to stroke symptoms.
GRIN2A, encoding a member of the glutamate-gated ion channel, has been predicted to contribute to stroke-related abnormally methylation regulation. A systematic analysis on a typical disease named Landau-Kleffner [45] confirmed that the expression and mutational pattern of GRIN2A may contribute to the pathogenesis of the disease as one of the core driver at the genetic and epigenetic levels. Further studies identified that stroke is one of its major complications of Landau-Kleffner which would be induced by the defected mitochondrial respiratory chain [46]. Therefore, these facts indicate the methylation of GRIN2A has potential to involve in the pathogenesis of stroke.
LHX9, encoding a member of the LIM homeobox gene family, acts as a development-associated transcription factor [47], [48]. The methylation status of LHX9 contributes to the formation of the testicular cord [49]. Therefore, speculating that the abnormal expression/methylation of LHX9 may lead to the endogenous defect of the testicular cord and results in hemorrhagic shock in some cases is reasonable.
As shown in FIGURE 3 and FIGURE S1, the above discussed genes have different methylation distributions in different patient groups. In addition, we also validated the VOLUME 8, 2020 differential expression of these genes between control samples and cardioembolic stroke samples, on an independent dataset of GEO GSE58294 (TABLE S4). Actually, we found FAM153C, GRIN2A, and LHX9 have significantly differential expressions (P value < 0.05), which might be caused by their differential methylation reported here.

B. FUNCTIONAL ENRICHMENT ANALYSIS OF OPTIMAL GENES
All the optimal screened genes directly and indirectly participate in stroke-associated biological processes. For the accurate identification and analysis of these potential pathological processes, gene ontology and KEGG enrichment analyses were performed on the selected optimal genes by the R function phyper, whose results are supplied in TABLE 2 and  TABLE S3.
Two specific GO terms (GO: 0009653 anatomical structure morphogenesis, and GO: 0048856 anatomical structure development biological process) are quite significant for stroke symptoms, indicating the specific role of unique anatomical structure in such processes. For instance, various specific anatomical structures, such as the testicular cord, participate in the pathogenesis of stroke.
Ion binding (described by GO: 0043167) has also been screened out as one of the enriched items. In fact, various publications have systematically confirmed that ion binding processes, such as gating in TRPV1 channels [50] and potassium ion for Na(+)/K(+)-ATPase regulation [51], may directly participate in various complicated pathological processes related to stroke. Another GO-term-like membranebounded organelle has been found to have core biological processes, reflecting the complicated biological mechanisms for the onset of stroke.
We identified a few functional KEGG pathways that are significantly enriched. Among them, hsa04510 (focal adhesion) is the only one that is quite significant for the pathogenesis of stroke. A specific regulator of a focal adhesion, named as carcinoembryonic antigen-related cell adhesion molecule 1, has been identified as a pathogenic factor for stroke [52]. This finding indicates the detailed connection of focal adhesion and our objective disease symptom.

V. CONCLUSION
Based on our newly presented computational method, all the identified genes have been validated to have abnormal methylation status during the pathogenesis of stroke and to be associated with biological functions involved in stroke onset. Further functional enrichment analysis confirmed and summarized the novel specific pathogenic roles of ion binding and focal adhesion for stroke regulated at the methylation level.