EnsemPseU: Identifying Pseudouridine Sites With an Ensemble Approach

Pseudouridine (Ψ) is the most prevalent RNA modification, formed from uridine through an isomerization reaction. With the increasing availability of genomic and proteomic samples, computer-aided identification of pseudouridine-synthase-specific Ψ sites is becoming possible. In this paper, we propose an ensemble approach to identify pseudouridine sites, named EnsemPseU. First, five sequence-encoding strategies, namely kmer, binary encoding, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND), were applied to extract sequence information. Then, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Finally, an ensemble algorithm integrating support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) was used to build our prediction model. Upon testing, the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus after chi-square feature selection. Moreover, in both 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model. The source code and data sets are available at https://github.com/biyue1026/EnsemPseU.


I. INTRODUCTION
Pseudouridine (Ψ) is the most prevalent RNA modification, which is formed from uridine through an isomerization reaction. It was discovered as early as the 1950s and was initially studied in noncoding RNA types such as ribosomal RNA (rRNA) and transfer RNA (tRNA) [1], [2]. However, recent reports have shown that Ψ is also present in messenger RNA (mRNA): it has been detected in transcripts obtained from humans, yeast, and the unicellular eukaryotic parasite Toxoplasma gondii [3]-[5]. The production of Ψ, an isomer of uridine, is catalyzed by highly conserved pseudouridine synthases, which detach the uridine residue's base from its sugar, "rotate" it 180° along the N3-C6 axis, and subsequently reattach the base's C5 carbon to the C1 carbon of the sugar. Ψ is considered to play important roles in RNA structure, function, and metabolism [6]. Therefore, the identification of Ψ sites is crucial for revealing the biological principles concerned. (The associate editor coordinating the review of this manuscript and approving it for publication was Jijun Tang.)
Experimental verification plays an important role in identifying Ψ sites, but it is costly and time-consuming. Instead, computational methods have been developed to identify Ψ sites, which have the advantages of rapidity, low cost, and efficiency. In 2015, Li et al. built the first computational model, called PPUS, to predict pseudouridine-synthase-specific Ψ sites in H. sapiens and S. cerevisiae [7]. They used the nucleotides around Ψ as features and employed SVM as the classifier. The following year, Chen et al. developed another model, called iRNA-PseU, to identify Ψ sites in H. sapiens, S. cerevisiae, and M. musculus, also employing SVM as the classifier [8]; however, they incorporated the occurrence frequencies and density distributions of the nucleotides together with their chemical properties into the general form of pseudo k-tuple nucleotide composition (PseKNC) as feature vectors. Based on the same SVM classifier, He et al. developed PseUI in 2018 using hybrid features including nucleotide composition (NC), dinucleotide composition (DC), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP), and position-specific dinucleotide propensity (PSDP) [9]. In 2019, two new models, iPseU-CNN and XG-PseU, were built to identify Ψ sites. iPseU-CNN, proposed by Tahir et al., is a convolutional neural network-based method using one-hot features [10], while XG-PseU, proposed by Liu et al., is an XGBoost-based method with optimal features selected from six feature types, namely nucleotide composition (NC), dinucleotide composition (DNC), trinucleotide composition (TNC), nucleotide chemical property (NCP), nucleotide density (ND), and one-hot encoding [11]. It is worth noting that Liu et al. built the newest benchmark datasets from the latest RMBase v2.0 database [12]. Despite these efforts, the performance of computational models still needs further improvement.
To develop a more effective model for identifying Ψ sites, we propose an ensemble model called EnsemPseU (https://github.com/biyue1026/EnsemPseU) that integrates support vector machine (SVM), extreme gradient boosting (XGBoost), naïve Bayes (NB), k-nearest neighbor (KNN), and random forest (RF) via a majority voting strategy. We collected and applied the latest datasets used in XG-PseU. Then, five encoding methods were employed to extract features, namely kmer, binary, enhanced nucleic acid composition (ENAC), nucleotide chemical property (NCP), and nucleotide density (ND). In addition, chi-square feature selection was used to reduce the feature dimensionality and remove redundant information. Furthermore, 10-fold cross-validation, jackknife, and independent tests were used to evaluate the model performance. Figure 1 displays the overall framework of EnsemPseU.
A detailed statistical summary about these datasets is listed in Supplementary Table S1.

1) KMER
Kmer is a widely used feature extraction approach in bioinformatics and has been applied in various fields, such as the identification of microRNA precursors [14], enhanced regulatory sequence prediction [15], and the phenotypic classification of metagenomic colon cancer reads [16]. For an RNA sequence, kmer features are defined as the frequencies of occurrence of k neighboring nucleic acids. For k = 2, the kmer descriptor can be denoted as

f(t) = N(t) / (L − k + 1), t ∈ {AA, AC, AG, ..., UU},

where N(t) is the number of occurrences of kmer type t and L is the length of the RNA sequence. In this study, k ranges from 1 to 5.
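As a minimal sketch (pure Python; the function name and example sequence are illustrative, not from the paper), the kmer frequencies above can be computed as:

```python
from itertools import product

def kmer_features(seq, k=2):
    """Frequency of each k-mer over the L - k + 1 windows of an RNA sequence."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    total = len(seq) - k + 1
    counts = {t: 0 for t in kmers}
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[t] / total for t in kmers]

features = kmer_features("GUAUCUAGUC", k=2)  # 4^2 = 16 dimensions for k = 2
```

For k from 1 to 5, concatenating the resulting vectors gives 4 + 16 + 64 + 256 + 1024 = 1364 kmer features per sequence.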

3) ENAC
ENAC encoding was first proposed by Chen et al. [17]. It calculates the nucleic acid composition within a fixed-length window that slides continuously from the 5′ to the 3′ terminus of each nucleotide sequence, and it is usually applied to encode nucleotide sequences of equal length. ENAC is described as

ENAC(t, win_r) = N_{t, win_r} / S,

where S is the size of the sliding window, N_{t, win_r} is the number of nucleic acids of type t in sliding window r, t ∈ {A, C, G, U}, and r = 1, 2, ..., L − S + 1. In this work, the sliding window size S is fixed at 5.
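A minimal sketch of the sliding-window composition, assuming (as in the formula above) that each count is normalized by the window size S (the helper name is ours):

```python
def enac_features(seq, S=5):
    """Composition of A, C, G, U in each length-S window sliding 5' -> 3'."""
    alphabet = "ACGU"
    feats = []
    for r in range(len(seq) - S + 1):            # r = 1 ... L - S + 1 windows
        window = seq[r:r + S]
        feats.extend(window.count(t) / S for t in alphabet)
    return feats

features = enac_features("GUAUCUAGUC", S=5)  # 6 windows x 4 nucleotides = 24 values
```

Note that the feature length depends on L, which is why ENAC is applied only to sequences of equal length.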

5) ND
The nucleotide density (ND) encoding integrates the nucleotide frequency information and the distribution of each nucleotide in the RNA sequence [11]. The density d_i of nucleotide N_j at position i in the RNA sequence is defined as

d_i = (1 / |S_i|) Σ_{l=1}^{i} f(N_l),  with f(N_l) = 1 if N_l = N_j and 0 otherwise,

where |S_i| is the length of the sliding substring concerned (the prefix ending at position i) and l is the corresponding position in the sequence.
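A sketch of this prefix-density definition (the helper name and example sequence are ours): the density at position i is how often that position's nucleotide has occurred in the prefix ending there.

```python
def nd_features(seq):
    """d_i: occurrences of the nucleotide at position i within seq[0..i], over i."""
    return [seq[:i].count(nt) / i for i, nt in enumerate(seq, start=1)]

densities = nd_features("UCGU")  # U: 1/1, C: 1/2, G: 1/3, U: 2/4
```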

C. CHI-SQUARE FEATURE SELECTION
The chi-square (χ²) test is often applied in statistics to determine the independence of two events [18]. It can also act as a feature selection method that evaluates the degree of association between features and the class labels [19].
Features with higher chi-square scores are treated as higher-priority features for classification. The basic formula of the chi-square test is

χ² = Σ (A − E)² / E,

where A is the observed value and E is the expected value. For discrete data, the chi-square test can directly test whether two features are related, whereas continuous-valued features must first be discretized into intervals.
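Since the paper builds on scikit-learn, this selection step can be sketched with SelectKBest and the chi2 score function on synthetic data (chi2 requires non-negative feature values; the data below is a stand-in, not the benchmark set):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic non-negative feature matrix: 100 samples, 20 features, binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, size=100)

# Score every feature with chi2 and keep the k highest-scoring ones.
selector = SelectKBest(score_func=chi2, k=5)
X_new = selector.fit_transform(X, y)
```

In EnsemPseU the same call is made with k swept over [100, 1000] on the full encoded feature matrix.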

D. ENSEMBLE MODEL CONSTRUCTION
For the identification of pseudouridine sites, the common practice is to find a suitable classifier that can precisely recognize the target sites. Nevertheless, a single classifier often fails to achieve satisfactory predictive performance. As demonstrated by a series of previous studies, such as protein fold pattern recognition [20], enzyme functional classification [21], protein subcellular location prediction [22], protein-protein binding site identification [23], identification of multiple lysine PTM sites in proteins [24], recombination spot identification [25], and enhancer identification [26], an ensemble predictor that fuses an array of individual predictors via a voting system or other strategy can yield much better predictive performance [25].
Here, five popular classification algorithms (SVM, XGBoost, NB, KNN, and RF) were integrated into an ensemble model based on a majority voting strategy. The prediction label of a test sample is determined by

Y = mode{P_SVM, P_XGBoost, P_NB, P_KNN, P_RF}.

Suppose, for example, that the five classifiers output P_SVM = 1, P_XGBoost = 1, P_NB = 0, P_KNN = 0, and P_RF = 1.
We would then classify the test sample as class 1 according to Y = mode{1, 1, 0, 0, 1}, since the number of 1s exceeds the number of 0s. SVM is a commonly used machine learning algorithm in bioinformatics, which makes linearly inseparable samples in the original space separable by mapping the original sample features from the low-dimensional space into a high-dimensional space [27], [28]. Kernel functions are used to perform this mapping, and many have been developed for different classification scenarios, including the Gaussian radial basis function (RBF) and the linear, polynomial, and sigmoid kernels.
XGBoost is a machine learning algorithm based on gradient tree boosting [29]. It approximates the optimal solution by performing a second-order Taylor expansion of the loss function, and it adds regularization terms to the objective function. In theory, XGBoost is an improvement on the gradient boosting decision tree (GBDT) algorithm and is more effective at avoiding over-fitting.
NB is a simple classification method based on the assumption that attributes are conditionally independent of each other given the target value [30]. In this paper, we used Bernoulli NB as one of the base classifiers; it works under the assumptions that the data features are independent and follow Bernoulli distributions.
KNN is a commonly employed supervised algorithm that classifies samples by computing their similarities/distances to the training samples [31]. The key idea of KNN is that a sample is assigned to a category if most of its k nearest neighbors belong to that category.
RF classifier is a widely employed ensemble classifier that produces multiple decision trees, using a randomly selected subset of training samples and variables [32]. When applying RF, the number of decision trees is an important parameter and should be tested exhaustively based on the specific application or biological question, for optimal predictive performance [17].
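The five base learners and the hard majority vote can be sketched with scikit-learn on synthetic stand-in data. Note one assumption: GradientBoostingClassifier is used here in place of XGBoost to keep the sketch free of extra dependencies; the paper itself uses the xgboost package.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded, chi-square-selected feature matrix.
rng = np.random.default_rng(1)
X = rng.random((120, 10))
y = rng.integers(0, 2, size=120)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(gamma=0.04)),
        ("gbdt", GradientBoostingClassifier(learning_rate=0.1)),  # stand-in for XGBoost
        ("nb", BernoulliNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    voting="hard",  # majority vote: Y = mode of the five predicted labels
)
ensemble.fit(X, y)
preds = ensemble.predict(X)
```

With voting="hard", VotingClassifier takes the mode of the five predicted labels, exactly the Y = mode{...} rule above.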

III. RESULTS AND DISCUSSION
To measure the performance of our model, we used four metrics: sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), which have been used in a series of studies to evaluate the effectiveness of predictors [33]-[35]. These measurements are defined as follows:

Sn = 1 − N^+_− / N^+
Sp = 1 − N^−_+ / N^−
Acc = 1 − (N^+_− + N^−_+) / (N^+ + N^−)
MCC = [1 − (N^+_−/N^+ + N^−_+/N^−)] / sqrt([1 + (N^−_+ − N^+_−)/N^+][1 + (N^+_− − N^−_+)/N^−])

where N^+ represents the total number of positive (Ψ) samples; N^− represents the total number of negative (non-Ψ) samples; N^+_− represents the number of positive samples incorrectly predicted as negative; and N^−_+ represents the number of negative samples incorrectly predicted as positive. 10-fold cross-validation, jackknife, and independent tests were also used to determine the model's generalizability.
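These definitions translate directly into Python (the function and argument names are ours; this formulation of MCC is algebraically equivalent to the usual confusion-matrix form):

```python
import math

def metrics(n_pos, n_neg, fn, fp):
    """Sn, Sp, Acc, MCC in the N^+/N^- notation of the text.

    n_pos, n_neg: total positive / negative samples (N^+, N^-)
    fn: positives predicted negative (N^+_-); fp: negatives predicted positive (N^-_+)
    """
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg)
    )
    return sn, sp, acc, mcc

print(metrics(50, 50, 0, 0))  # perfect classifier -> (1.0, 1.0, 1.0, 1.0)
```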

A. THE RESULTS OF FEATURE SELECTION
In our study, we chose the kmer, binary, ENAC, NCP, and ND encoding schemes to convert nucleotide sequences into numerical features. The total number of features for each nucleotide sequence exceeds 1700 in this case. As previous studies have shown, high-dimensional feature vectors may contain noisy information that degrades model performance. Therefore, it is essential to apply a feature selection approach to remove irrelevant features from the large set of original features. The feature selection process scores each candidate feature using a specific feature selection technique and then selects the best k features [36].
The bottleneck in this approach is deciding the number k of optimal features. Here, we combined the selection of the k value with the parameter optimization of the ensemble model. Specifically, we took values of k ranging over [100, 1000] with a step of 100, and the parameter optimization process was implemented at each k value based on 10-fold cross-validation. We optimized the parameter γ of SVM, the boosting learning rate (r) of XGBoost, the number of neighbors (k) of KNN, and the number of decision trees (n) of RF. The remaining parameters follow the scikit-learn defaults (https://github.com/scikit-learn/scikit-learn). Based on our prior experience, we created 6 × 5 × 5 × 5 = 750 combinations by setting γ ∈ {0.01, 0.02, 0.04, 0.06, 0.08, 0.1}, r ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, k ∈ {1, 3, 5, 7, 9}, and n ∈ {100, 200, 300, 400, 500}.
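The 750-combination grid itself is a simple Cartesian product over the four value sets listed above (variable names here are illustrative):

```python
from itertools import product

gammas = [0.01, 0.02, 0.04, 0.06, 0.08, 0.1]   # SVM gamma
rates = [0.1, 0.2, 0.3, 0.4, 0.5]              # XGBoost learning rate r
neighbors = [1, 3, 5, 7, 9]                    # KNN k
trees = [100, 200, 300, 400, 500]              # RF number of trees n

# Every (gamma, r, k_neighbors, n_trees) combination: 6 * 5 * 5 * 5 = 750.
grid = list(product(gammas, rates, neighbors, trees))
```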
For each k value, the 10-fold cross-validation test was run once for each of the 750 parameter combinations. The optimization results can be seen in the Supplementary file. To estimate the performance of each k value objectively, we calculated the average accuracy over the 750 10-fold cross-validation results. Figure 2 plots the change of accuracy with k.
For H. sapiens, the accuracy reached 60.35% with the 1600-dimensional original features. When k = 400, the accuracy peaked at 65.65%, an improvement of 5.3% over the original features. For S. cerevisiae, the accuracy was 65.99% with the 1720-dimensional original features. At k = 400, the accuracy increased to 72.08%, an improvement of 6.09%. For M. musculus, the accuracy was 66.45% with the 1600-dimensional original features. When k = 400, the accuracy reached 72.00%, an improvement of 5.55%. These accuracy improvements indicate that χ² feature selection can eliminate redundant information while retaining useful information. Therefore, we fixed 400 as the number of optimal features for all three species. After that, we selected the parameter combination with the highest Acc to build the model with k = 400 for each species. See Supplementary Table S2 for the detailed settings.
To find an appropriate feature selection method, we also carried out experiments with two other feature ranking methods, mRMR and F-score, and compared them under 10-fold cross-validation. The comparison results shown in Figure 3 indicate that chi-square gives the best prediction results for all three species. Therefore, we adopted the chi-square feature selection method to construct our predictor.

B. EFFECTIVENESS OF ENSEMBLE MODEL
During our experiments, we chose five different machine learning methods, SVM, XGBoost, NB, KNN, and RF, as base classifiers and integrated them with a majority voting strategy. The parameters for each classifier were determined as described in the above section on feature selection. Then we trained the model for each species. To verify the ensemble's effectiveness, we compared the ensemble model with the individual classifiers under 10-fold cross-validation. The detailed results can be seen in Supplementary Table S3. For more convenient observation, we plotted the accuracy of the ensemble model and the individual models for the three species, as shown in Figure 4.
We then compared EnsemPseU with the latest predictor, XG-PseU, to evaluate its ability to identify Ψ sites. The results obtained by EnsemPseU and XG-PseU under 10-fold cross-validation and independent tests are listed in Table 1 and Table 2, respectively. For H. sapiens, EnsemPseU demonstrated its effectiveness and advantages over XG-PseU in both the 10-fold cross-validation test and the independent test with respect to Sn, Sp, Acc, and MCC. For S. cerevisiae, the 10-fold cross-validation results of EnsemPseU are higher than those of XG-PseU with respect to Sp, Acc, and MCC, though not Sn; moreover, the independent test results of EnsemPseU are higher than those of XG-PseU. For M. musculus, the 10-fold cross-validation results are higher than those of XG-PseU with respect to Sn, Acc, and MCC, though not Sp. In addition, ROC curves for the 10-fold cross-validation were plotted, as shown in Figure 7. EnsemPseU and XG-PseU attained the same AUC (area under the ROC curve) of 0.700 for H. sapiens. Nevertheless, EnsemPseU obtained AUCs of 0.786 and 0.775, better than XG-PseU's 0.74 and 0.77, for S. cerevisiae and M. musculus, respectively.
Since the jackknife test is usually regarded as the most objective test method [8], we also provide jackknife test results for H. sapiens, S. cerevisiae, and M. musculus in Table 3. In conclusion, EnsemPseU showed better performance on most indicators, especially for S. cerevisiae. However, we also noticed that both EnsemPseU and XG-PseU achieved their best MCC for S. cerevisiae and their lowest for H. sapiens, suggesting that sequence conservation varies among species.

IV. CONCLUSION
In this paper, we proposed an ensemble approach that integrates five classifiers: SVM, XGBoost, NB, KNN, and RF. The chi-square test was employed to remove redundant information. Testing revealed that the accuracy improved by 5.3% for H. sapiens, 6.09% for S. cerevisiae, and 5.55% for M. musculus after chi-square feature selection. Moreover, upon evaluation by 10-fold cross-validation and an independent test, our proposed model EnsemPseU outperformed the best existing model.