iRNA-m5C_NB: A Novel Predictor to Identify RNA 5-Methylcytosine Sites Based on the Naive Bayes Classifier

As one of the widespread RNA post-transcriptional modiﬁcations (PTCMs), 5-Methylcytosine (m5C) plays vital roles in better understanding of basic biological mechanisms and major disease treatments. In experiments, traditional high-throughput approaches to ﬁnd m5C sites are usually expensive and laborious. Additionally, facing with a large number of RNA sequences, developing accurate computational methods to distinguish m5C and non-m5C sites is an efﬁcient solution. Here we introduced a novel predictor, called iRNA-m5C_NB, to identify m5C sites in Home sapiens using Naive Bayes (NB) algorithm. In this method, unbalanced dataset Met935 is ﬁrstly analyzed using efﬁcient hybrid-sampling strategy SMOTEEEN. Then top 57 features are selected by the ANOVA F-value from four kinds of well-performance feature extraction techniques, including Bi-proﬁle Bayes (BPB), enhanced Nucleic Acid Composition (ENAC), electron-ion interaction pseudopotentials (EIIP) and mMGap_1. Based on the jackknife test, the evaluated recall for the unbalanced training dataset Met935 is up to 82.81% with MCC of 0.63. And for the independent dataset Test1157, the predictor still shows high recall of 70.06% and MCC of 0.34. It is the ﬁrst m5C predictor constructed using the unbalanced dataset, and the recall scores are increased by 19.82% and 59.23% for jackknife and independent tests compared with the latest tool RNAm5CPred, respectively. We demonstrate that the proposed predictor iRNA-m5C_NB outperforms other state-of-art models, which hopes to be an efﬁcient and reliable method to identify m5C sites.


I. INTRODUCTION
5-Methylcytosine (m5C) is one of the widely spread posttranscriptional modifications (PTCMs) in rRNA, tRNA and mRNA sequences, which has been found in many organisms [1]- [4]. Specifically, m5C can be formed on carbon atom by the catalysis of RNA methyltransferase (such as NSUN2 and DNMT2), where a methyl group is attached in the 5 th position of the cytosine (C) ring [5]. As a research hotpot in recent years, m5C has been discovered in various biological processes, such as tRNA stabilization, rRNA translational fidelity and codon identification [5]- [10]. Meanwhile, it is proved that m5C has important effect on many major human diseases, including breast cancer, autosomal-recessive intel- The associate editor coordinating the review of this manuscript and approving it for publication was Dariusz Mrozek . lectual disability and Dubowitz syndrome [11]- [16]. Therefore, fast and efficient recognition of m5C is the primary task of further researches on biological mechanisms and valuable applications. Although kinds of biological experiments have been proposed to detect m5C sites (i.e. bisulfite treatment [17], [18], m5C RNA immunoprecipitation (m5C-RIP) [19], 5-azacytidine-mediated RNA immunoprecipitation (Aza-IP) [20] as well as methylation iCLIP (miCLIP) [21], it is believed that corresponding costs of time and money are very high. Meanwhile, the number of RNA sequences shows sharp accumulation with the mature sequencing techniques. Therefore, constructing high-performance computational models to predict m5C becomes a reliable method to resolve this problem.
To our best know, totally eight models have been built to recognize RNA m5C sites [22]- [29]. Except the tool PEA-m5C for Arabidopsis thaliana [26], the remaining seven [22]- [25], [27]- [29] are all involved in the identification of human sequences (abbreviated as H. sapiens). The first computational model, m5C-PseDNC, was proposed by Feng et al. using the Support vector machine (SVM). The prediction accuracy is 90.42% over jackknife test, where pseudo dinucleotide composition (PseDNC) features with three physiochemical characteristics (entropy, enthalpy and free energy) were used to encode RNA sequences [22]. Then, RF-based model iRNAm5C-PseDNC was provided by Qiu et al. using PseDNC features with ten important properties considered, where jackknife test achieves high accuracy of 92.37% [23]. Later on, Zhang et al. developed a novel model m5C-HPCR by heuristic nucleotide physicochemical property reduction algorithm (HPCR), in which MCC and AUC are up to 0.859 and 0.962 [24]. Sabooh et al. developed new method pM 5 CS-Comp-mRMR based on the Kmer features (k = 2 ∼ 4). Particularly, feature selection approach Minimum Redundancy and Maximum Relevance (mRMR) was applied to choose effective features, which finally gives the accuracy value of 93.33% [25]. And compressive and cell-specific predictor RNAm5Cfinder was established by Li et al. using binary encoding (BE) features to analyze m5C sites in eight tissues/cell types, corresponding AUC values are both higher than 0.77 and 0.87 [27]. At same time, Lv et al. introduced iRNA-m5C model using four integrated features, including Kmer, BE, pseudo k-tuple nucleotide composition (PseKNC) and Natural Vector (NV) [28]. The jackknife accuracy is up to 92.9%. Very recently, Fang et al. published a new predictor RNAm5CPred based on combination of three nucleotide compositions, namely Kmer, K-spaced nucleotide pair frequencies (KSNPFs, same as mMKGap in this paper) and PseDNC [29], where the recall and MCC are 68.79% and 0.154 over independent test.
The datasets are the most basic and important part for constructing model. For the five tools, namely m5C-PseDNC [22], M5C-HPCR [24], Pm5cs-Comp-Mrmr [25], iRNA-m5C [28] and RNAm5CPred [29], they all used the balanced training dataset Met240 (containing 120 positive and 120 negative instances) collected by Feng et al. [22]. And for iRNAm5C-PseDNC using the unbalanced dataset Met1900 (475 positive and 1425 negative samples), there are large amount of redundant sequences with the accuracy and MCC achieve 92.37% and 0.79. It means that serious overfitting problem is existed in this model [23]. As for the latest predictor RNAm5CPred [29], kinds of sequences datasets (Balanced: Met240; Unbalanced: Met1900, Met935, Train935, Train839, Test96 and Test1157) were all investigated. However, the model was finally constructed using Met240 by comparison of models results based on Met240 and Met935. Finally, the results over independent dataset Test1157 are unsatisfied (R e = 68.79%, S p = 53.70%, P re = 18.19% and MCC = 0.154). Meanwhile, the jackknife performances using unbalanced Met935 are still low (R e = 62.99%, S p = 99.50%, MCC = 0.749, P re = 95.24%), as well as independent test using Test1157 (R e = 10.83%, S p = 93.00%, MCC = 0.050, P re = 19.54%). In general, although the high accuracies (more than 93%) were reported using the balanced dataset over jackknife test, it is an urgent need to construct the high-performance model using the unbalanced data based on the fact that the m5C sites is distributed unbalanced. In another hand, the number of Met240 is so small that it lacks statistics characteristics.
In this paper, we focused on the identification of RNA m5C sites in H. sapiens using the unbalanced dataset Met935 and Test1157. Figure 1 displays the basic flowchart of this work. Based on the training dataset Met935, several unbalanced strategies are firstly tested using the single BPB features, where four algorithms are also applied simultaneously, including RF, SVM, AdaBoost and NB. After preliminary studies, hybrid-sampling technique SMOTEENN and NB algorithms are selected for the next experiments. Then, we investigate the results of five popular sequence representations, where four features (BPB, ENAC, EIIP and mMGap_1) are finally used. The model is finally constructed using the efficient top 57 features selected by the ANOVA F-value.

A. BENCHMARK DATASETS
In the present work, two unbalanced benchmark datasets Met935 and Test1157 are used for cross validation and independent tests, as well as balanced Met240 for nucleotide distribution analysis. As mentioned above, Met240 is the first benchmark dataset collected by Feng et al. [22] to construct m5C sites model. It is obtained from the popular RNA modification database RMBase [1] with 120 positive and 120 negative instances. Met935 is built by Fang et al. [29], which includes 127 positive and 808 negative samples. Specifically, positive sequences are also obtained from RMBase [1], and the negative sequences are obtained from 1425 non-m5C samples in Met1900 collected by Qiu et al [23]. Testing dataset Test1157, containing 157 m5C and 808 non-m5C sequences, is used to evaluate the model performances over independent test, which is selected from Gene Expression Omnibus datasets (GEO) website with gse90963 (https://www.ncbi.nlm.nih.gov/geo/) by Feng et al. [22]. It is noted that the sequence similarity is less than 70.00% using CD-HIT program for the mentioned three datasets [30]. More details can be found in [22], [23], [29], [31]- [34].

B. RNA FEATURE REPRESENTATION
Efficient RNA feature representation is important to building the machine-learning-based predictors. Various state-of-art feature-extraction platforms have been proposed to conveniently encode RNA segments [35]- [38]. In this paper, six kinds of RNA features are applied to determine whether the nucleotide C can be modified.

1) BI-PROFILE BAYES (BPB)
BPB is a popular sequence-encoding technique, which is widely chosen to solve identification subjects in VOLUME 8, 2020 bioinformatics [39]- [48]. In this method, nucleotide distribution properties in positive and negative samples are separately used to represent sequences, which well reflects the sequence position-specific information. Considering a l-length RNA sequence S = R 1 R 2 R 3 . . . R l , BPB features can be formulated as where (p 1 , p 2 , . . . , p l ) and (n l+1 , n l+2 , . . . , n 2l ) give the corresponding nucleotide occurrence probabilities at each location i(i = 1, 2, ..., l) in positive and negative samples, respectively. Considering the nucleotide C always locating in the center of the sequence, we remove two features p l and p 2l for the center C (i.e. p l and p 2l always keep 1.00 for all samples). Therefore, BPB can induces totally 2(l − 1)-dimensional features.

2) KMER
As one well-known vector model, Kmer is simply expressed as the k-tuple or k-neighboring nucleotides composition [26], [28], [35], [49], where f k i indicates the calculated frequencies of i-th k-tuple. Obviously, Kmer will induce a 4 k -dimensional vector. In this work, we set k = 1 ∼ 4 to generate sequence features.

3) ENHANCED NUCLEIC ACID COMPOSITION (ENAC)
In ENAC method, nucleotide frequencies in a length-fixed subsequence are calculated to represent RNA instance, which is usually thought to be an improved version of NAC approach (i.e. Kmer with k = 1). Many subsequences will be obtained when the nucleotide window continuously slides from 5' to the 3' terminus over full RNA segment [50]. If we set the subsequence length as m, a (l − m + 1) × 4)-dimensional ENAC feature vector can be obtained. Here we use the default window length 5 to carry out our research.
In EIIP scheme, the RNA sequence is directly replaced as the related EIIP values [51]. Furthermore, PseEIIP feature can be expressed using the extended average EIIP value of related trinucleotides, Here, EIIP XYZ represents the EIIP value of the i-th trinucleotide XYZ by EIIP XYZ = EIIP X + EIIP Y + EIIP Z (i.e., the sum of three related nucleotides X, Y and Z), and f XYZ is the related frequency of XYZ. These two methods EIIP and PseEIIP form l and 64-dimensional numeric vectors, respectively.

6) PC-PseDNC-GENERAL
The PC-PseDNC-General method is a frequently used encoding technique to predict RNA sites [37], [55]- [57], which successfully incorporates sequential information and physicochemical properties of dinucleotides. It induces (16 + λ)-dimensional features, here f k is the normalized frequency of the k-th dinucleotide (k = 1, 2, . . . , 16), and λ is the highest counted rank of considered RNA sequence correlations. And ω is the associated weighting factor in the range of 0 ∼ 1; θ j indicates the j-tier correlation factor, which well demonstrates the sequence-order correlations of the most contiguous dinucleotides. In this study, we change the λ parameter from 2 to 5 to extract different features.

7) XXKGAP
Similarly, xxKGAP feature is one variation of Kmer method implemented in PyFeat package [36], where the composition of subsequences with k-gaps is used to describe sequences.

C. MACHINE LEARNING ALGORITHMS
The powerful and efficient machine learning platform based on Python language Scikit-learn package [58] was applied to construct model and analyze features. Here four useful classifiers NB, RF, AdaBoost and SVM were used for prediction task with default parameters.

1) NB
NB is a useful supervised classification algorithm based on Bayes' theorem under the ''naive'' assumption [59]- [61], which can be defined as where F = f 1 , f 2 , . . . , f n indicates the object and involved f 1 , f 2 , . . . , f n give the associated features. And c labels the class of samples (positive class: c = +; negative class: c = −). It has been widely used in bioinformatics researches with good performances [62]- [64]. Here we use the Gaus-sianNB algorithm for classification.

2) RF
RF is a widely used tree-based ensemble estimator in bioinformatics [65]- [78]. In this method, the voting results of a number of decision tree classifiers are finally treated as the output prediction performances [79].

3) SVM
SVM is a useful supervised learning algorithm [80], [81], which has been extensively deployed in bioinformatics [82]- [97]. Low-dimensional feature space can be effectively transformed into high-dimensional Hilbert space to find the best margin for hyperplane using the radial basis kernel function (RBF).
In this method, various weaker learners are fitted using bicluster-based classifiers, such as small decision trees. The good prediction performances can be finally generated by integrating those classifiers through weighted vote/sum [101]- [103].

D. FEATURE SELECTION AND VISUALIZATION
In order to analyze importance of different features and simply the model, three feature-selection methods are used to rank associated features, namely AdaBoost, F-value and Chi2 implemented in sklearn toolkit [58]. F-value and Chi2 are the two traditional univariate feature selection approaches, where the best feature is selected using univariate statistical tests [104]- [108]. Specifically, first one calculates the F-value for the all studied samples. And Chi2 selects important features using the chi-squared stats between each non-negative feature and class. Meanwhile, t-Distributed Stochastic Neighbor Embedding (t-SNE) [109] is applied to visualize distribution by reducing the dimension of original high-dimensional data. Similarities between data points are firstly converted into joint probabilities. Then, Kullback-Leibler divergence between those joint probabilities is optimized to illustrate data distribution.

E. UNBALANCED STRATEGY
In this paper, several unbalanced strategies are used to solve the unbalanced problem of training dataset, including resampling methods and ensemble classifiers [49], [110]- [116]. For the hybrid-sampling method SMOTEENN [117], Synthetic Minority Over-sampling Technique (SMOTE) and undersampling method Edited Nearest Neighbours (ENN) are incorporated to balance the dataset. Specifically, SMOTE is first applied to generate new examples in minority class [118] followed by ENN to remove the mixed samples. More details can be found in [119].

F. CRITERIA FOR PERFORMANCES EVALUATION
Although performances of existing tools are finally evaluated over jackknife test, we firstly used 10-fold cross validation (10-fold CV) for preliminary experiments. Then, jackknife and independent tests are used to give objective results. For the 10-fold CV, the training dataset are randomly split into 10 subsets on average. Later, the model is trained using 9 subsets and tested using the remaining one. Repeat this process 10 times until each subset is used once as testing set. The average performance of related 10 folds is used as the final scores of models. As a special case, jackknife test is a special case of k-fold CV, where k is equal to the total number of samples. Based on the 10-fold CV and independent tests, six metrics associated with the confusion matrix, namely recall or sensitivity (R e , S n ), specificity (S p ), accuracy (A CC ) and Matthew's correlation coefficient (MCC), precision (P re ) and F1 score are used to check performances of classification models, which are defined as bellow (7), as shown at the bottom of this page.
Here, TP and TN give the number of predicted true positive and negative instances, whereas FP and FN indicate the number of false positive and negative sequences, respectively. Furthermore, the AUC value (area under ROC curve) is also applied to represent the prediction results, which is no sensitive to the thresholds of predicted probability [120]- [127].

A. ANALYSIS OF NUCLEOTIDE DISTRIBUTION
First of all, the nucleotide distribution characteristics are displayed in Figure 2 for the unbalanced dataset Met935 (Left) and balanced Met240 (Right). The enriched and depleted nucleotides are calculated by the differences of nucleotide frequencies between positive and negative samples (i.e. p i − n l+i at position i, see details in Sec. II). Obviously, there are big differences existed between two datasets. For Met935, corresponding distribution is quietly different in individual place, which can be clearly seen for the nucleotides near the center. For the nucleotide at upstream position -1, C and A are separately enriched in positive and negative samples, respectively. On the contrary, G and U are obviously located in downstream positions 1∼4 and 5∼7, respectively (positive instances), while A, C, U enriched in positions 1∼3 as well as A, C in 4∼6 (negative instances). As for Met240, the distribution is basically uniform and simple. Specifically, nucleotides C and G are widely distributed in positive samples, whereas A and U in negative sequences, except for upstream position -20. Generally, there is obvious differences existed between the unbalanced and balanced datasets, where former is more complex and weaker.
As a supplement, corresponding visualization of these two datasets, i.e. Met935 (Left) and Met240 (Right), is also plotted in Figure 3. Here BPB features are finally transferred into a 2-dimensional vector to conveniently display. For the unbalanced dataset Met935, positive samples are basically placed in the entire feature space, only a few gathers at the bottom right. However, it is simpler and clearer for balanced Met240, where almost positive samples are clustered in the upper right. Considering the fact that m5C and non-m5C sequences are unbalanced, we can demonstrate that the unbalanceddataset-based predictor is more reasonable and accurate, but also more difficult than the model using balanced dataset to diagnose m5C sites.

B. PRELIMINARY RESULTS OF DIFFERENT UNBALANCED STRATEGIES WITH SEVERAL ALGORITHMS
There are many unbalanced strategies overcoming the unbalanced problems and various algorithms constructing models. Using BPB features, we perform several preliminary experiments to investigate different unbalanced approaches using benchmark dataset Met935, including resampling and ensemble techniques. At the same time, four kinds of algorithms, namely NB, RF, SVM and AdaBoost, are separately applied to select the efficient algorithm. There are totally seven metrics (R e or S n , S p , A CC , MCC, AUC, P re and F1) are used to evaluate performances. The best algorithm is mainly decided by combing the recall and specificity scores of training and testing datasets, especially for the recall over independent test. Among those experiments, we found that the combination of SMOTEENN and NB showed best results. Results of several unbalanced techniques and classifiers are listed as following to demonstrate the optimizing process.
Table1 summarizes the prediction performances of six unbalanced strategies using NB method, where the best results are obtained using SMOTEENN approach labeled as superscript a. For the training dataset Met935, it can be seen that all methods show good performances (R e and S p achieve about 80.00%), except for the under-sampling technique ENN with low R e of 52.76%. However, the results are generally unsatisfactory for the testing dataset. Particularly, five models are almost focused on the prediction of negative samples, which ignored the prediction of positive samples with low R e scores, including SMOTE, ADASYN, ENN, SMOTEENN and SMOTETomek. Setting popular over-sampling technique SMOTE as an example, 10-fold CV experiment shows high recall for positive results (R e = 80.69%), however, independent test gives bad score (R e = 45.86%). As for the SMOTEENN, unified and better performances can be found (Met935: R e = 88.59%, S p = 85.83%,  Similarly, Table 2 lists the results of different algorithms based on SMOTEEENN strategy, including NB, RF, SVM and AdaBoost. It can be seen that the calculated results for four classifiers over 10-fold CV are exactly high, while bad VOLUME 8, 2020  results over independent tests. For example, SVM model obtains the high R e of 100.00% on training dataset, however only score of 59.24% on testing dataset. Combined the performances of Met935 and Test1157, we believe that NB is the best candidate to construct prediction model labeled as superscript a. Related R e and S p achieve 88.59%, 85.83% and 65.61%, 71.60% for two datasets, respectively.

C. EVALUATED RESULTS OF SINGLE FEATURES USING NB ALGORITHM
Considering the various sequence features, here we further investigate five feature extraction techniques (see Table 3

D. MODEL OPTIMIZATION AND COMPARISON WITH EXISTING TOOLS
We incorporate four well-performance features, including BPB, ENAC, EIIP and mMGap_1, to build prediction tool. There are totally 285 features concluded with the prediction results (R e = 89.10%, S p = 92.02%; Test1157: R e = 44.59%, S p = 89.50%). It can be found that the prediction performances for negative samples in Testing dataset are not very well. Thus, we applied three useful feature-selection methods F-value, Chi2 and AdaBoost to analyze the feature importance and remove redundant features. Finally, we find that top 57 features based on F-value show the best performances. Especially, the evaluated results between training and testing experiments are relatively unified and higher   In order to conviniently compare performances of different models, we further perform jackknife test for taining dataset Met935. Table 4 lists our results and simultaneously compare with three proposed models, including M5C-HPCR, iRNA-m5C and RNAm5CPred [24], [28], [29]. It is noted that the results of M5C-HPCR [24] and iRNA-m5C [28]  HUAIKUN XIANG is currently an Associate Professor with the Department of Automotive and Transportation, Shenzhen Polytechnic. His research interests include analysis of resident travel characteristics based on the big data, research on the evaluation of safe driving behavior based on the coupling of driver and vehicle, and vulnerability analysis for urban congested road networks. VOLUME 8, 2020