Selecting Relevant Genes From Microarray Datasets Using a Random Forest Model

Recent studies have demonstrated that microarray expression data can be used to identify gene regulatory pathways. However, a major challenge is to build an efficient computational model from such large microarray data (genes and micro-RNAs). Therefore, there is an urgent need to reduce the dimensionality of these large datasets using machine learning methods without compromising accuracy. This requires an appropriate machine learning algorithm to select the significant features from these large datasets. In this study, we use a supervised method based on a Random Forest to identify significant features from three microarray datasets covering prenatal nicotine, alcohol, and combined nicotine and alcohol exposure groups in two different cell types (dopamine and non-dopamine neurons). Our approach efficiently reduced the dimensionality of these extremely large microarray datasets. Furthermore, our results indicate that using only the top 20% of features was sufficient to confirm the genetic pathways previously identified using all of the features in the model.


I. INTRODUCTION
Microarrays enable the global screening of gene expression profiles by quantifying the changes in the regulation of thousands of genes [1]. Recently, microarrays have been adopted to identify gene regulation pathways [2] using supervised or unsupervised machine learning methods. In practice, the large number of features limits model reliability and, in many cases, may cause overfitting [3]. To improve the efficiency of gene regulatory network modelling, the dimensionality of the features, including messenger RNAs (mRNAs, genes) and microRNAs (miRNAs), needs to be reduced [4].
There are two different approaches, unsupervised and supervised, to reducing the dimensionality of complex datasets. In unsupervised learning, a large number of samples and features negatively affects the computational performance of the underlying learning algorithm. Dimensionality reduction has therefore been widely used in practice to improve the computational efficiency of unsupervised learning algorithms such as Hill Climb (HC) [5].
Biological systems are inherently complex and nonlinear [19], [20]. Therefore, extracting relevant features and determining the related pathways from these large datasets can be challenging. Furthermore, this requires the use of nonlinear modelling approaches, since linear feature selection methods are inadequate. One nonlinear approach involves tree models [21], which are computationally fast and have been widely adopted as an effective and efficient feature selection solution in cancer diagnostics [22], [23], cancer classification [24], [25], and image-based medical applications.

FIGURE 1. Gini index estimation in random forest is based on representative features (i.e., differentially expressed genes (DEGs) and differentially expressed miRNAs (DEmiRs)). (a) Bayesian Network running time over various numbers of features using bnlearn. †Using 2500 features, a Bayesian Network cannot be built within 24 hours. (b) Running time of performing one iteration of the impurity-based feature importance algorithm with Random Forest. The feature importance algorithm was updated using variant random seeds until the (c) standard deviation and (d) root-mean-square error (RMSE) converge.
In this paper, we performed feature selection using the supervised Random Forest classifier over a collection of expression data (differentially expressed genes (DEGs) and differentially expressed miRNAs (DEmiRs)) obtained from 13 microarrays. These data were previously generated by our group to investigate perinatal exposure to alcohol, nicotine, and both nicotine and alcohol during rat gestational development [29], [30]. The final dataset for this study consisted of 5523 genes and miRNAs for the alcohol exposure dataset, 7863 for the nicotine dataset, and 5613 for the co-exposure (i.e., nicotine and alcohol) dataset. To implement the Random Forest, we assigned the data to three labels (i.e., nicotine/alcohol/dopamine-cell (DA), as shown in Table 1). Among these three labels, samples labelled as ''nicotine'' and/or ''alcohol'' identified whether the pup was prenatally exposed to nicotine and/or alcohol, respectively. The DA label was used to identify dopamine cells. After feature selection, we performed pathway enrichment analysis over both the feature-reduced datasets and the original datasets. We found comparable genetic regulation pathways using both versions of the datasets.

A. ANIMAL EXPERIMENTS
The microarray data was collected from dopaminergic and non-dopaminergic neurons obtained from the rat ventral tegmental area (VTA). All experiments were performed in accordance with the protocols approved by the Institutional Animal Care and Use Committee (IACUC) and the University of Houston Animal Care Operations (ACO). The detailed protocol has been published previously [29]-[32]. Briefly, pregnant Sprague-Dawley (SD) rats (Charles River, Wilmington, MA, USA) were maintained at standard conditions and were given an ad libitum diet. Rats were implanted with a subcutaneous osmotic minipump (Alzet, Cupertino, CA) containing either nicotine (at levels to simulate moderate smoking CITE) or saline. A liquid diet of ethanol was gradually introduced to the pregnant mothers to produce blood alcohol concentrations similar to what is observed in children with fetal alcohol spectrum disorders [33].

FIGURE 2. Confusion matrices for predicting (a) samples from embryos with perinatal alcohol exposure, (b) samples from embryos with perinatal nicotine exposure, and (c) samples from DA cells exposed to XX. The test set consisted of 5000 data points randomly selected from the entire data set. Rows correspond to predicted classes and columns correspond to actual classes. The overall subset accuracy of the algorithm is 0.68 ± 0.15. (d) The rank of the ten most important and ten least important features (DEGs and DEmiRs) from five machine learning trials over the perinatal alcohol dataset, among more than 5000 features.
The fetus was continuously exposed via the placenta to nicotine and/or alcohol ingested by the mother from gestational day 6 (G6) to delivery (around G21-22). After birth, the pups were still exposed to nicotine and/or alcohol via the rat mother's milk.
The brain tissue samples from the pups were pooled for each litter for the alcohol, nicotine, nicotine-alcohol, or saline-treated groups, for a total of n = 13 litter groups. The samples were then dissociated, pelleted, fixed, and labelled with conjugated primary antibodies against the neuronal marker NeuN/Alexa Fluor 488 (NeuN/AF488, ab190195, Abcam, Cambridge, MA, USA) and tyrosine hydroxylase/phycoerythrin (TH/PE, ab209921, Abcam). The labelled cells were sorted on a FACS Aria (LSR II) flow cytometer (BD Biosciences, San Jose, CA, USA) to identify dopamine (DA) and non-dopamine (NDA) neurons.
Following sorting, the total RNA of the cells was extracted using a miRNeasy Micro Kit (Qiagen, Hilden, Germany). The expression levels of mRNA and miRNA were assessed using Agilent SurePrint mRNA and miRNA microarrays (Santa Clara, CA, USA). The raw microarray dataset was then collected from the resulting images using the Feature Extraction Software v12.0.1.

B. RANDOM FOREST
One major challenge in performing feature selection over microarray data is the inability to rely on a feature's (e.g., gene or microRNA) expression level alone, because it is difficult to determine its relevance. Therefore, when performing feature selection, we selected a subset of features that improved the performance (in terms of running time or accuracy) of the machine learning algorithm. This selection is typically considered a nondeterministic polynomial-time hard (NP-hard) problem [34]. The challenge becomes even more pronounced when analyzing microarray data, in part due to the large dimensionality of the features, which can lead to overfitting or lower efficiency. To overcome this challenge, one feature selection method incorporates labels into the data, which converts the unsupervised feature selection process into a supervised one.

TABLE 1. Representative structure of the data (X) and the multi-labelling (y) of the microarray samples used in this study. For the data, each column represents the expression level (adjusted against negative control) of a differentially expressed gene or miRNA. For each sample set (i.e., AlvS, NivS, and NiAlvS), 5523-7863 genes and microRNAs are included. Each row represents a sample, named by its respective treatment method and the origin of the measured cell group. DA Cell: dopamine cells. ADA: dopamine cells exposed to alcohol; NADA: dopamine cells exposed to both nicotine and alcohol; SDA: dopamine cells treated with saline (control); NAND: non-dopamine cells exposed to both nicotine and alcohol; SND: non-dopamine cells treated with saline (control); NDA: dopamine cells exposed to nicotine; NND: non-dopamine cells exposed to nicotine.
Herein, we propose a method that converts unsupervised feature selection into a supervised one, using the Random Forest classifier to analyze microarray data. We combined 13 microarray datasets and created labels that reflected their respective experimental conditions: whether the sample had been exposed to alcohol, whether it had been exposed to nicotine, and whether it was a dopamine cell. We then calculated the Gini index as the feature importance,

Gini = 1 − Σ_{i=1}^{c} p_i^2,

in which p_i represents the relative frequency of class i in the dataset, and c represents the number of classes.
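To make the criterion concrete, here is a minimal plain-Python sketch of the Gini index and of the weighted impurity decrease a Random Forest attributes to a split; the function names and toy labels are illustrative, not part of the study's code:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_i p_i^2, where p_i is the relative
    frequency of class i among the samples reaching a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Weighted impurity decrease achieved by splitting `parent` into
    `left` and `right`. Summing these decreases over all nodes that
    split on a feature yields its impurity-based importance."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini(["DA", "DA"]))   # 0.0
print(gini(["DA", "NDA"]))  # 0.5
print(gini_decrease(["DA", "DA", "NDA", "NDA"], ["DA", "DA"], ["NDA", "NDA"]))  # 0.5
```

A split that perfectly separates the two classes removes all impurity, so its decrease equals the parent impurity, as in the last line.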
Using this supervised method, the Gini index of each feature (miRNA or gene) was used to quantify the likelihood that the Random Forest classifier branches the data into subgroups on that feature.

A. DATA MULTI-LABELING
The genes and miRNAs used in these datasets were the DEGs and DEmiRs from Keller et al. [32]. These DEGs and DEmiRs were identified from microarray data comparing the gene and miRNA expression profiles of VTA DA neurons between treatment groups and their respective controls. This approach has been described previously in more detail [29], [32]. Briefly, this method is based on a q-value < 0.001 (adjusted p-value using Benjamini-Hochberg (BH) correction) and an absolute log2 fold change > 1 (≥ 1 for upregulation, ≤ −1 for downregulation) [32]. The treatment groups were compared to the saline control and labeled as: alcohol (AlvS), nicotine (NivS), or both nicotine and alcohol (NiAlvS). This approach enabled us to perform large-scale gene and miRNA expression profiling in the target neurons of the brain area of interest (i.e., the VTA). Table 1 demonstrates that we organized the samples using three labels: 1) perinatal alcohol exposure, 2) perinatal nicotine exposure, and 3) DA cells, to represent 4 different conditions (nicotine, alcohol, co-exposure, and saline) for each neuron type (DA vs NDA). For each sample set, we performed five trials using different random seeds. In each trial, we performed 5000 iterations that calculated the feature importance using the Random Forest classifier. The feature importance was calculated as the average over all 5000 iterations. The hyperparameters of the Random Forest classifier were optimized using grid search. We used 5000 iterations for each trial to ensure convergence of the average feature importance.
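The selection criterion above (BH-adjusted q-value < 0.001 and absolute log2 fold change ≥ 1) can be sketched in plain Python; `bh_qvalues` and `select_degs` are hypothetical helper names, and the toy inputs are illustrative:

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values):
    q_(i) = min over j >= i of p_(j) * m / j for the sorted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p to the smallest
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        q[i] = prev
    return q

def select_degs(features, pvalues, log2fc, q_cut=0.001, fc_cut=1.0):
    """Keep features with q-value < q_cut and |log2 fold change| >= fc_cut."""
    q = bh_qvalues(pvalues)
    return [f for f, qv, fc in zip(features, q, log2fc)
            if qv < q_cut and abs(fc) >= fc_cut]

# Toy example: only g1 survives both thresholds.
print(select_degs(["g1", "g2", "g3"],
                  pvalues=[1e-6, 0.04, 0.5],
                  log2fc=[2.0, 1.5, 0.1]))  # ['g1']
```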

B. SELECTING RELEVANT FEATURES EFFICIENTLY
We built a Bayesian network to illustrate the running-time advantage of feature selection. The Bayesian network was built using bnlearn [35], with the expression levels of the features as vertices in a directed acyclic graph (DAG); the relationships among the expression levels were predicted as arcs that connected the vertices. A hill climbing (HC) algorithm, which adds one arc per iteration, was used to learn the network structure. As illustrated in Figure 1a, the running time of building a Bayesian Network increases exponentially as the feature size (also described as the number of vertices) increases. This method fully consumed a PC (with 32.0 GB RAM and an Intel Core i7-10875H CPU) and took >24 hours to perform one iteration of Bayesian network prediction when 2500 features were used. We implemented impurity-based feature importance in Random Forest to calculate the relative importance of the DEGs and the DEmiRs. When training a decision tree in a Random Forest, the feature importance can be measured by how much the feature decreases the weighted impurity (or information entropy) across the trees [36]. This process runs in polynomial time. As shown in Figure 1b, the running time of calculating the feature importance using the Random Forest model increases linearly with the feature size. The short running time facilitates the generation of stable feature importance results. As shown in Figures 1c and 1d, the standard deviation and root-mean-square error (RMSE) of the importance of all features converge after a large number of iterations. It typically takes 2500 and 1000 iterations for the standard deviation and the RMSE, respectively, to plateau (defined as the slope of the moving average of 1000 data points approaching 0). To ensure a stable result and to reduce overfitting, we used 5000 iterations with 5-fold cross-validation in the feature importance calculation for all samples.

TABLE 2. KEGG pathways found using the top 20% important DEGs, and the corresponding genes identified in pathway analysis following perinatal nicotine exposure, following perinatal alcohol exposure, and following perinatal nicotine and alcohol exposure. *: Pathways that have been reported in our group's previous publications.
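The importance-averaging step can be sketched with scikit-learn on synthetic data standing in for the expression matrix; `mean_importance`, the toy dimensions, and the scaled-down iteration count (10 instead of 5000) are all illustrative assumptions, not the study's pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mean_importance(X, y, n_iter=50):
    """Average impurity-based (Gini) feature importance over n_iter
    forests trained with different random seeds, a stand-in for the
    multi-iteration averaging described in the text."""
    total = np.zeros(X.shape[1])
    for seed in range(n_iter):
        rf = RandomForestClassifier(n_estimators=50, random_state=seed)
        rf.fit(X, y)
        total += rf.feature_importances_
    return total / n_iter

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))   # 80 samples, 10 "genes"
y = (X[:, 0] > 0).astype(int)   # only feature 0 is informative
imp = mean_importance(X, y, n_iter=10)
print(int(np.argmax(imp)))      # feature 0 dominates the averaged ranking
```

Averaging over seeds smooths out the run-to-run variance of individual forests, which is why the standard deviation and RMSE curves in Figures 1c and 1d eventually plateau.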

C. SELECTING REPRODUCIBLE FEATURES
Using the experimental conditions as the prediction labels, we then performed a grid search over various hyperparameters in the Random Forest model for each of the datasets (i.e., AlvS, NivS, and NiAlvS) to ensure precise predictions. The details of the hyperparameter search can be found in the Methods section. Figures 2a-2c illustrate the confusion matrices for Random Forest based feature importance selection on the DEGs and DEmiRs from the AlvS group. The hyperparameter-tuned model yielded a 93% true positive rate and a 100% true negative rate when predicting whether a neuron had been perinatally exposed to alcohol. For nicotine-treated neurons, the true positive rate was 83% and the true negative rate was 99%. Finally, this model had a 71% true positive rate and a 65% true negative rate when predicting whether a neuron was a DA neuron. The subset accuracy (i.e., the percentage of samples for which all three labels were correctly predicted) was 68 ± 15%. For the NiAlvS and the NivS data groups, the hyperparameter-tuned model yielded similar classification accuracy (data not shown). The method is also stable. As shown in Figure 2d, out of the more than 5000 features in the perinatal alcohol dataset, we tracked the rank of the ten most important (i.e., highest feature importance score) and the ten least important features across five machine learning trials. The important features obtained a high importance score in all five trials, while the least-important features reproducibly obtained a low score in all five trials.
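The subset accuracy used above (all three labels correct for a sample, also called the exact-match ratio) can be sketched in plain Python; the label triples below are hypothetical:

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose (alcohol, nicotine, DA) labels
    are ALL predicted correctly -- the exact-match ratio."""
    assert len(y_true) == len(y_pred)
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)

# (alcohol, nicotine, DA) triples for four hypothetical samples
true = [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 1, 0)]
pred = [(1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 1, 0)]
print(subset_accuracy(true, pred))  # 0.75: one sample misses only the DA label
```

Note that subset accuracy is stricter than per-label accuracy, which is why it can sit well below the individual true positive/negative rates quoted above.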

D. COMPARISON WITH EXISTING KEGG PATHWAYS
To examine the effectiveness of our feature selection method, we selected the top 20% most important DEGs and DEmiRs to estimate gene pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The results are summarized in Table 2. Then, we compared the identified pathways with those identified using the full DEGs and DEmiRs data sets in [29], [31]. Using this proposed method, we were able to detect multiple pathways that had been identified in our previous publications. We identified enriched pathways including Protein processing in endoplasmic reticulum, cGMP-PKG signaling, Osteoclast differentiation, Axon guidance, Arrhythmogenic right ventricular cardiomyopathy (ARVC), cAMP signaling, and Pathways in cancer from the NivS dataset (p < 0.05), similar to our group's previous report which used all the DEGs and DEmiRs [31]. We also identified enriched pathways including Endocytosis, Alcoholism, Oxytocin signaling, Circadian entrainment, Phagosome, Axon guidance, Dopaminergic synapse, Viral carcinogenesis, MAPK signaling pathway, and Melanogenesis from the AlvS dataset (p < 0.05). Using the NiAlvS dataset, we identified enriched pathways including Circadian entrainment, Oxytocin signaling pathway, Dopaminergic synapse, Alcoholism, Choline metabolism in cancer, cGMP-PKG signaling pathway, Lysosome, Pathways in cancer, Proteoglycans in cancer, and Endocytosis (p < 0.05), in agreement with [29].
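Enrichment p-values of this kind are commonly obtained from a one-sided hypergeometric (Fisher-type) test, asking how surprising the overlap between the selected genes and a pathway's gene set is. A plain-Python sketch with illustrative counts (not the study's actual gene numbers):

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """One-sided hypergeometric p-value: probability of drawing k or
    more pathway genes (out of K in the pathway) when n genes are
    selected at random from a background of N genes."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# e.g. 8 of 100 selected genes fall in a 50-gene pathway, background 5000;
# the expected overlap by chance is only 100 * 50 / 5000 = 1 gene.
p = hypergeom_enrichment_p(N=5000, K=50, n=100, k=8)
print(p < 0.05)  # True: far more overlap than expected by chance
```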

E. VALIDATION OF RESULTS WITH EXISTING MODELS
We also analyzed the same top 20% DEGs and DEmiRs data sets in the NivS, AlvS, and NiAlvS groups using commonly used parametric (Pearson [12]) and nonparametric (Spearman [36]) correlation methods, as well as the Random KNN method [37], to identify potential gene pathways using the KEGG pathways.
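For reference, the two correlation measures can be sketched in plain Python (tie handling in the Spearman ranks is omitted for brevity):

```python
def pearson(x, y):
    """Pearson correlation coefficient (parametric, linear)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation (nonparametric): Pearson over the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

# A monotone but nonlinear relationship: Spearman captures it perfectly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(round(spearman(x, y), 3))  # 1.0
```

Because Spearman operates on ranks, it detects monotone nonlinear associations that Pearson understates, while Pearson is the natural choice for strictly linear relationships.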
Using the top 20% DEGs and DEmiRs data sets in the NivS group, the Pearson correlation method was able to identify a few enriched pathways, including cGMP-PKG signalling, Osteoclast differentiation, Axon guidance, Arrhythmogenic right ventricular cardiomyopathy (ARVC), and cAMP signalling (p < 0.05). The Random KNN was only able to identify Protein processing in endoplasmic reticulum (p < 0.05). The Spearman correlation method failed to identify any pathways.
Using the top 20% DEGs and DEmiRs data sets in the AlvS group, the Pearson correlation method was able to identify two pathways, Alcoholism and Viral carcinogenesis (p < 0.05). Additionally, the Spearman correlation was able to identify four pathways, Endocytosis, Phagosome, Axon guidance, and Viral carcinogenesis (p < 0.05). The Random KNN was unable to identify any pathways.
Using the top 20% DEGs and DEmiRs data sets in the NiAlvS group, the Pearson correlation method was able to identify three pathways, Alcoholism, Choline metabolism in cancer, and Proteoglycans in cancer (p < 0.05). The Spearman correlation was able to identify four pathways, cGMP-PKG signalling pathway, Lysosome, Proteoglycans in cancer, and Endocytosis (p < 0.05). Finally, the Random KNN identified two pathways, Pathways in cancer and Proteoglycans in cancer (p < 0.05).
These results indicate that the Random Forest method performed better than the Pearson, Spearman, and Random KNN methods on the same top 20% DEGs and DEmiRs data sets in the NivS, AlvS, and NiAlvS groups for estimating gene pathways using the KEGG pathways. The Random Forest method was able to recover more of the pathways identified using the full DEGs and DEmiRs data sets in each group [29], [31].

IV. DISCUSSION
Building a mathematical model over a large set of features can be a computationally challenging problem. For example, when building a Bayesian network, the expression levels of the DEGs and the DEmiRs are treated as vertices in a directed acyclic graph (DAG), and the relationships (to be predicted) among the DEGs and the DEmiRs are represented by arcs that connect the vertices. Assume that we start predicting the Bayesian Network model with an empty DAG (that is, a DAG that contains all the vertices but no arcs), and that we use a hill climbing (HC) algorithm to add one arc per iteration; each iteration then requires O(N²) model comparisons. Thus, the overall time complexity of the HC algorithm is on the scale of O(cN³) model comparisons [5]. The NP-hardness of learning Bayesian networks has been generally accepted by researchers in the machine-learning community [39]. Similarly, in most machine learning studies using microarray datasets, feature selection is needed.
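A sketch of why one HC iteration costs O(N²) model comparisons: every ordered vertex pair that is not yet an arc and does not create a cycle must be scored. The plain-Python function below (an illustrative helper, not bnlearn's implementation) only enumerates these candidates; the scoring of each one is what the HC search then performs:

```python
from itertools import permutations

def candidate_arcs(n_vertices, arcs):
    """All single-arc additions a hill climber must score in one
    iteration: every ordered pair (u, v) that is not already an arc
    and does not create a cycle -- O(N^2) candidates per iteration."""
    existing = set(arcs)

    def creates_cycle(u, v):
        # Adding u -> v creates a cycle iff v already reaches u.
        stack, seen = [v], set()
        while stack:
            w = stack.pop()
            if w == u:
                return True
            if w not in seen:
                seen.add(w)
                stack.extend(b for a, b in existing if a == w)
        return False

    return [(u, v) for u, v in permutations(range(n_vertices), 2)
            if (u, v) not in existing and not creates_cycle(u, v)]

# Empty 4-vertex DAG: all 4 * 3 = 12 ordered pairs are legal candidates.
print(len(candidate_arcs(4, [])))  # 12
```

Since each of the roughly N² candidates is re-scored on every one of the O(cN) arc-adding iterations, the O(cN³) total above follows.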
Other widely used methods for gene selection include permutation-based feature selection, which randomly shuffles each feature's values and measures the resulting change in model performance. Compared with Gini-based feature selection, the permutation algorithm is usually significantly slower [40], [41]. The Gini-based method may not be effective when the potential predictor variables vary in their scale of measurement or their number of categories [42]. However, this limitation may not apply to the analysis of microarray data, which is homogeneous.
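A minimal sketch of the permutation idea with a trivial stand-in predictor (names and data are illustrative; real implementations such as scikit-learn's `permutation_importance` operate analogously, which is also why the approach is slower -- the model must be re-evaluated for every shuffle):

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Permutation importance of one feature: mean drop in accuracy
    after shuffling that feature's column, which breaks its link to y."""
    rng = random.Random(seed)

    def acc(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = acc(X)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        shuffled = [row[:feature] + [v] + row[feature + 1:]
                    for row, v in zip(X, col)]
        drops.append(base - acc(shuffled))
    return sum(drops) / n_repeats

# Toy model that only looks at feature 0, so feature 1 carries no signal.
predict = lambda row: int(row[0] > 0.5)
X = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.7]]
y = [1, 0, 1, 0]
print(permutation_importance(predict, X, y, feature=1))  # 0.0: feature 1 is ignored
```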
Notably, our method suggested several pathways that were also previously identified by our group using the enriched KEGG pathways (e.g., the Dopaminergic synapse pathway from the AlvS and the NiAlvS datasets) [29]. The Dopaminergic synapse pathway describes the release of the neurotransmitter DA. According to the major hypothesis of drug reinforcement, the reinforcing effect of addiction is believed to be conveyed through the activation of the meso-corticolimbic DA system. Stimulation of VTA DA neurons via alcohol and/or nicotine administration results in the release of DA in the nucleus accumbens (NAc), which is believed to describe how the DA synapse pathway plays a role in the reinforcing effect [29], [43]. The enrichment of this pathway in our current results demonstrates that perinatal alcohol and/or nicotine exposure leads to genetic alterations in VTA DA neurons that are in accordance with addiction mechanisms.
Additionally, we found the Axon guidance pathway in the NivS and the AlvS datasets, which was highlighted in our group's recent work [29]. Axon guidance is an important pathway that regulates how the migration of an axon is directed to a specific target. Following perinatal alcohol and/or nicotine exposure, we found Semaphorin 3F, 4A, 4D, 4G, and 6D, which are genes that belong to the Semaphorin axon guidance pathway [44].

V. SUMMARY
In this study, we proposed a fast and reproducible method using a Random Forest classifier to perform impurity-based feature selection over microarray datasets with large dimensionality. Using our method, we successfully identified the Glutamatergic synapse and the Axon guidance pathways, which were previously reported to be enriched following perinatal nicotine or alcohol-nicotine exposure. In the future, we are interested in performing additional unsupervised mathematical studies, such as Bayesian Network analysis, over the selected important genes and miRNAs.