Prediction of Piwi-Interacting RNAs and Their Functions via Convolutional Neural Network

In eukaryotic cells, Piwi-interacting RNAs (piRNAs) are a type of short-chain non-coding RNA molecule that interacts with PIWI proteins. They perform various cellular and genetic functions, such as gene-specific protein translation, expression regulation, and the maintenance and formation of germ cells. Given the prominent contribution of piRNAs in eukaryotic cells, many attempts have been made to identify them computationally; however, the results have been unsatisfactory. It is therefore necessary to develop a computational tool that represents piRNAs accurately. In this regard, an intelligent and highly discriminative deep learning model based on a convolutional neural network, called “piRNA-CNN”, is developed for the prediction of piRNAs. RNA sequences are expressed mathematically using the natural language processing method word2vec in order to obtain prominent, relevant, and highly varied numerical descriptors. The proposed “piRNA-CNN” model yields an accuracy of 93.83% in the first layer, in which the query RNA molecule is predicted as piRNA or non-piRNA. If the query is a piRNA, the second layer identifies whether or not it instructs mRNA deadenylation, achieving an accuracy of 91.19%. These outcomes show that the piRNA-CNN model achieves substantial results compared with the current tools reported in the literature. It is further expected that the proposed predictive tool will assist scientists and researchers in designing improved computational tools.


I. INTRODUCTION
In eukaryotic cells, Piwi-interacting RNAs (piRNAs) are the leading group of short-chain non-coding RNA molecules, with a length of 24-31 nucleotides [1]. piRNAs perform various genomic and cellular functions, including transposon silencing, gene expression regulation, the maintenance and formation of germ cells, and specific protein translation. Numerous studies have revealed that piRNAs are involved in various kinds of cancer; the study of such RNAs is therefore important in areas such as RNA biology and drug development [2]-[4]. Subsequently, Lee et al. and Nishibu et al. applied several experimental methods to determine whether an RNA molecule is a piRNA [5], [6]. However, relying only on laboratory experiments for sequence analysis is inadequate, inefficient, expensive, and in some situations insensitive. Given the importance of piRNAs, computational approaches are essential for analyzing them more precisely and efficiently. Two types of piRNA have been reported: one instructs deadenylation of target mRNA, while the other does not [7]. However, experimental methods have failed to explain explicitly the difference between these two types. Researchers have mainly concentrated on classifying piRNAs and non-piRNAs and have introduced various computational models. Zhang et al. employed a support vector machine (SVM) and a k-mer approach to propose an automated model known as piRNAs predictor [8]. Later on, Wang et al. utilized SVM and transposon interaction for the discrimination of piRNAs [9]. Likewise, Luo et al. used physicochemical properties of RNA [10]. Subsequently, Li et al. adopted ensemble learning for the prediction of piRNAs [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.
Recently, Liu et al. proposed a two-layer automatic model for the prediction of piRNAs and their functional types [1]. Similarly, Li et al. introduced a predictor known as ''piRNAPred'' to identify piRNAs and their function types using a support vector machine [12]. Khan et al. suggested a deep neural network (DNN) based computational model called ''2L-piRNADNN'', utilizing the physicochemical behavior of RNA and di-nucleotide auto covariance as feature extraction techniques [13]. In this study, an intelligent computational model is proposed for discriminating piRNAs and their types, adopting contemporary machine learning and deep learning approaches. In the machine learning approach, two sequence formulation methods of distinct nature are applied to extract numerical features. The extracted feature spaces are then provided to individual learning algorithms. Further, the predictions of the individual learners are merged using a bio-inspired evolutionary genetic algorithm in order to identify the desired class correctly. In the deep learning approach, a word2vec-based feature space is used in combination with the CNN model. The developed model is analyzed on two benchmark datasets to demonstrate its stability and generalization strength. As shown in Tables 2 and 3, the deep learning approach obtained better results than the machine learning approaches and existing methods. The main contributions of this study are as follows:
• The natural language processing method ''word2vec'' is used for expressing RNA sequences.
• Machine learning algorithms are compared with a deep learning algorithm.
• 7-fold cross-validation is applied for assessment.
• Various performance metrics are used for examining the algorithms' performance.
• A high-throughput intelligent computational predictor is developed for piRNAs.

A. BENCHMARK DATASET
For biological systems, Chou's 5-step rules have become a benchmark for introducing a sequence-based statistical predictor [14]. The first and main step is the selection or construction of a valid dataset, suited to the problem, that represents the motif of the target class. Here, the dataset of Liu et al. [1] is selected as the benchmark dataset S. It can be mathematically expressed as:
$$S = S^{+} \cup S^{-}, \qquad S^{+} = S^{+}_{\text{inst}} \cup S^{+}_{\text{non-inst}}$$
where the negative subset $S^{-}$ consists of 1,418 non-piRNA segments; the positive subset $S^{+}$ contains 1,418 piRNA segments; the subset $S^{+}_{\text{inst}}$ is composed of 709 piRNA segments that have the function of instructing target mRNA deadenylation [7]; and the subset $S^{+}_{\text{non-inst}}$ contains the remaining 709 samples, which lack this function. In this study, various machine learning and deep learning algorithms are analyzed in order to develop an intelligent computational model for the identification of Piwi-interacting RNAs and their types. RNA sequences are formulated via both discrete and natural language processing methods. Details are given below:

B. MACHINE LEARNING APPROACH
Here, we used various feature extraction methods and classifiers to work as baselines for comparison with the proposed deep-learning approach.
1) Feature Extraction Methods: The second step of Chou's rules is to express DNA/RNA instances mathematically with an effective numerical formulation that correctly reflects their correlation with the class to be predicted. Machine learning algorithms, however, are designed to operate only on numerical vectors. To obtain purely numeric feature vectors from biological sequences, the discrete feature extraction method pseudo amino acid composition (PseAAC) has been used [15]-[19]. The PseAAC concept has been broadly and rapidly exploited in the area of proteomics. Later on, this concept was extended to RNA/DNA sequences in the form of pseudo K-tuple nucleotide composition (PseKNC) [20]-[28], which is also used for genome analysis. Accordingly, the idea of PseKNC is implemented here for expressing RNA sequences using two discrete methods: di-nucleotide composition (DNC) and tri-nucleotide composition (TNC).
• Di-nucleotide Composition (DNC) is a feature-encoding scheme that expresses an RNA sequence in terms of pairs of consecutive nucleotides. The occurrence frequency of each pair is computed, where N1N2 is the 1st pair, N2N3 is the 2nd pair, and so forth. Finally, a 4 × 4 = 16D feature space is generated [22], [25], [29]. DNC can be mathematically formulated as: $$DNC = [f(AA), f(AC), f(AG), \ldots, f(UU)]^{T}$$ where f(AA) is the frequency of AA; f(AC) is the frequency of AC; f(AG) is the frequency of AG; and so forth; T denotes the transpose.
• Tri-nucleotide Composition (TNC) is another feature-encoding scheme, which represents an RNA sequence in terms of triplets of consecutive nucleotides. The frequency of each triplet is calculated. For example, in an RNA sequence, the first triplet is N1N2N3, the second triplet is N2N3N4, and so forth; consequently, a 4 × 4 × 4 = 64D feature vector is produced [20], [25]. The TNC can be numerically expressed as: $$TNC = [f(AAA), f(AAC), \ldots, f(UUU)]^{T}$$ where f(AAA) is the frequency of AAA in RNA sequences; f(AAC) is the frequency of AAC in RNA sequences; and so forth.
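The two encodings can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the frequencies are assumed to be normalized by the number of overlapping windows, and T is assumed to be mapped to U before counting.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA nucleotides (assumption: any T is mapped to U first)


def kmer_composition(seq, k):
    """Return the normalized frequency of every overlapping k-mer in seq."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous symbols (e.g. N) are skipped
            counts[word] += 1
    return [counts[w] / total for w in kmers]


def dnc(seq):
    """Di-nucleotide composition: 4 x 4 = 16 features."""
    return kmer_composition(seq, 2)


def tnc(seq):
    """Tri-nucleotide composition: 4 x 4 x 4 = 64 features."""
    return kmer_composition(seq, 3)


features = dnc("ACGUACGU")
print(len(features), len(tnc("ACGUACGU")))
```

The features are ordered lexicographically (AA, AC, AG, AU, CA, ...), so the same position always refers to the same pair or triplet across samples.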
2) Classification Algorithms: The next step of Chou's rules is to decide which classification hypotheses to implement in order to carry out the training and prediction process effectively. Here, various supervised learning hypotheses are adopted as the operational engine. These learning hypotheses have been implemented in numerous fields, including pattern recognition, computational biology, data mining, and bioinformatics [15], [26], [27], [30]-[39]. In this study, we applied the following learning algorithms: K-nearest neighbor (KNN), Support Vector Machine (SVM), Probabilistic neural network (PNN), Random forest (RF), and Generalized regression neural network (GRNN). The basic idea of these algorithms has been explained and cited in previous works [16], [25], [40]-[50].
3) Ensemble Learning: Ensemble classification is a learning technique used to enhance the prediction rate of individual learners and to reduce generalization errors. Ensemble classification usually performs better than systems based on an individual learner because the learners compensate for one another's weaknesses [51]-[53]. However, there is no predefined rule for how to combine the learners efficiently, and a number of different approaches have been formulated. The simplest one is to fuse a large number of learners and then choose the optimal combination. Boosting is another ensemble technique, in which a single learner is re-trained iteratively in order to reduce the classification error. Ensemble learning is mostly performed in two ways: majority voting and weighted voting. Majority voting is a simple approach in which the decision is made on the basis of the majority in a pool of inputs. In weighted voting, learners are not treated uniformly: each learner is associated with a weight that is proportional to its performance, so a learner with a high weight has more influence on the learning process. In addition, optimization techniques are utilized in ensemble learning to minimize classification errors. They are employed in two ways: coverage optimization and decision optimization. Coverage optimization is the selection of the optimum subset of learners from the utilized learners. Decision optimization, on the other hand, is the selection of the optimal output by combining the predictions of multiple learners. Learner selection is the process of selecting, from a pool of N learners, the subset of k optimum learners with the highest prediction rate. In this case, the number of possible combinations in the solution space is $\binom{N}{k}$ (N ≥ k), which shows that the solution space can be exponentially large. This issue is resolved by applying a genetic algorithm (GA), an effective bio-inspired tool widely used to overcome the limitations of local search. GA avoids local minima by utilizing crossover and mutation operators, and it seeks an optimum or near-optimum solution by employing probabilistic search techniques in a massive and intricate search space. A few researchers have utilized GA in ensemble learning for learner selection in order to obtain promising results [54]-[56]. In this research, five learning hypotheses of diverse nature, namely KNN, PNN, RF, GRNN, and SVM, are used [18], [57]-[60]. KNN is an example-based learner that operates on the theory of proximity in the values of the attributes [61]. SVM is a powerful operational engine based on statistical learning theory, while PNN is established on Bayes theory [62]. First, the individual learners are trained and their outcomes are saved. Then these outcomes are forwarded to the GA for ensemble learning. The process of the GA is presented as follows:
• Chromosome encoding: The first step of GA is to encode the solution into a chromosome. The length of the chromosome equals the number of learners in the pool, and each learner is assigned a weight of either 1 or 0, where 1 shows that the learner is included in the learning process and 0 denotes that the learner does not take part in learning. For instance, the chromosome S = 10110110 indicates that learners L1, L3, L4, L6, and L7 take part in ensemble learning. In this work, a population size of 100 and 200 generations were used.
• Initial population: After encoding, the GA randomly generates an initial population. Every member of this population encodes a candidate solution to the problem. Once the initial population has been created, every member is evaluated and assigned a fitness value according to the fitness function.
• Fitness function: Each individual is assessed by the fitness function. The fitness value is computed by fusing the predicted outcomes of the learners selected in the ensemble, and the final decision is made on the basis of majority voting. In this study, accuracy is used as the fitness function.
• Selection: A fitness-based methodology is used to select individual solutions, where fitter individuals, as measured by the fitness function, are more likely to be selected. In this study, two chromosomes with high fitness values are selected as parents using roulette wheel-based selection.
• Reproduction: In GA, a new generation (offspring) is produced by using genetic operators such as crossover and/or mutation. Crossover exchanges information between the parents to produce offspring; consequently, the generated offspring may be better than their parents. Here, the m-point crossover operator is used. The mutation operator is used to change the value of one or more genes in the selected chromosome.
• Termination criteria: The GA proceeds to the next generation until the maximum number of generations is reached, and finally the best solution to the problem is returned.
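The GA steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: labels are binary (1 = piRNA, 0 = non-piRNA), ties in the vote go to class 0, two crossover points and a 5% mutation rate are arbitrary choices, and all function names are hypothetical. Only the population size (100), the number of generations (200), accuracy as fitness, roulette-wheel selection, and majority voting come from the text.

```python
import random


def majority_vote(preds, mask):
    """Fuse the binary predictions of the learners switched on in mask."""
    chosen = [p for p, bit in zip(preds, mask) if bit]
    if not chosen:
        return None  # an empty ensemble makes no prediction
    # Ties are resolved in favor of class 0 (an assumption).
    return [1 if sum(votes) * 2 > len(votes) else 0 for votes in zip(*chosen)]


def fitness(mask, preds, labels):
    """Accuracy of the majority-vote ensemble encoded by the chromosome."""
    fused = majority_vote(preds, mask)
    if fused is None:
        return 0.0
    return sum(f == y for f, y in zip(fused, labels)) / len(labels)


def roulette(population, scores, rng):
    """Select one chromosome with probability proportional to its fitness."""
    total = sum(scores)
    if total == 0:
        return rng.choice(population)
    pick, running = rng.uniform(0, total), 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running >= pick:
            return chrom
    return population[-1]


def crossover(a, b, rng, points=2):
    """m-point crossover: switch the source parent at each cut point."""
    cuts = sorted(rng.sample(range(1, len(a)), min(points, len(a) - 1)))
    child, src, prev = [], 0, 0
    for cut in cuts + [len(a)]:
        child.extend((a, b)[src][prev:cut])
        src, prev = 1 - src, cut
    return child


def mutate(chrom, rng, rate=0.05):
    """Flip each gene independently with probability rate."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def ga_select(preds, labels, pop_size=100, generations=200, seed=0):
    """Return the best learner mask found and its fitness."""
    rng = random.Random(seed)
    n = len(preds)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best, best_fit = None, -1.0
    for _ in range(generations):
        scores = [fitness(c, preds, labels) for c in population]
        for chrom, score in zip(population, scores):
            if score > best_fit:
                best, best_fit = chrom[:], score
        population = [
            mutate(crossover(roulette(population, scores, rng),
                             roulette(population, scores, rng), rng), rng)
            for _ in range(pop_size)
        ]
    return best, best_fit
```

Here `preds` holds one list of saved predictions per base learner, and the returned mask plays the role of the chromosome, e.g. `[1, 0, 1, 1, 0]` keeps the first, third, and fourth learners.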
The ensemble of the five base learners can be expressed as:
$$EL = KNN \oplus PNN \oplus RF \oplus GRNN \oplus SVM$$
where the symbol ⊕ represents the merging operator and EL represents ensemble learning. The ensemble merges the five base learners as follows. Assume that the anticipated outcomes of the single learners for the genomic query R are given by $C_1, C_2, C_3, C_4, C_5$, where $C_1, \ldots, C_5$ are the single learners and $S_1, S_2, S_3, S_4, S_5$ are the corresponding predicted classes.
Lastly, the anticipated outcome of the ensemble learner, $GA_{EL}$, is obtained through GA-based majority voting by selecting the class with the maximum (Max) weighted vote, where $x_1, x_2, x_3, x_4, x_5$ are the weights of the learners. The framework of the proposed system is illustrated in Figure 1.

C. DEEP LEARNING APPROACHES
1) Distributed Feature Representation: Natural language processing (NLP) concepts have been adopted by scientists to develop computational models for various biological applications, such as iN6-Methyl (5-step) [63] and alternative splicing site prediction [64]. Therefore, given the significance of NLP models in existing predictors, the distributed feature representation technique word2vec is applied in order to obtain interpretable representations for Piwi-interacting RNAs and their functions. In this work, the genomes are collected from the Genbank of http://hgdownload.soe.ucsc.edu, which are split into '21' chromosomes ''Chr1'', ''Chr2'', ''Chr3'', ''Chr4'', ''Chr5'', . . . , ''X'', and ''Y''. Each chromosome is divided into sentences with a sequence length of 100 nt. The words are created by splitting each sentence into 3-mers. The continuous bag-of-words (CBOW) approach is used to train the word2vec model; CBOW predicts the current word ''w(t)'' from the contiguous words within a predefined window. The training parameters of the proposed word2vec model are listed in Table 1. Finally, after feature extraction with the word2vec model, each extracted feature space has a dimension of n × 100, where ''n'' denotes the number of samples and 100 is the number of features for each sample. Moreover, each sample of length ''L'' is represented as a matrix of size (L − 2) × 100.
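The tokenization described here, and the reason a sample of length L yields L − 2 rows, can be sketched as follows. The embedding table below is a hypothetical stand-in filled with random 100-dimensional vectors; in practice it would come from the trained CBOW word2vec model. The function names and the example sequence are illustrative assumptions.

```python
import random


def to_sentences(chromosome, length=100):
    """Cut a long sequence into consecutive sentences of `length` nt."""
    return [chromosome[i:i + length] for i in range(0, len(chromosome), length)]


def to_kmers(seq, k=3):
    """Split a sequence into overlapping k-mer words (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed(seq, table, dim=100, k=3):
    """Look up every 3-mer of seq; words missing from the table map to zeros."""
    zero = [0.0] * dim
    return [table.get(word, zero) for word in to_kmers(seq, k)]


# Hypothetical stand-in for a trained word2vec table (random vectors).
rng = random.Random(0)
words = [a + b + c for a in "ACGU" for b in "ACGU" for c in "ACGU"]
table = {w: [rng.uniform(-1, 1) for _ in range(100)] for w in words}

seq = "UGAGGUAGUAGGUUGUAUAGUUUUAGG"  # illustrative sequence only
matrix = embed(seq, table)
print(len(seq), len(matrix), len(matrix[0]))
```

Because 3-mers overlap with a stride of one nucleotide, a sequence of length L produces L − 2 words, which gives the (L − 2) × 100 matrix per sample.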

2) Convolutional Neural Network (CNN): A CNN is a deep learning algorithm applied to image processing as well as to sequence-based bioinformatics data [47], [65]-[67]. In this context, a one-dimensional (1D) CNN model is very effective for prediction on bioinformatics datasets. The architecture of the CNN consists of a convolution layer, ReLU layer, max-pooling layer, normalization layer, loss layer, dropout layer, and fully connected layer. The CNN model is tuned over several hyper-parameters: the filter size is chosen from {3, 5, 7, 9}, the number of filters from {10, 12, 14, 16, 18}, and the number of convolution layers from 1 to 3; the padding is ''same'' and the stride is 1; the number of neurons in the dense layer and the dropout probability after the dense and convolution layers are also tuned. The dropout probability is chosen from {0.25, 0.3, 0.35}. The hyper-parameters are selected on the basis of the highest prediction outcomes in terms of sensitivity, specificity, accuracy, MCC, and AUC. Moreover, the normalized class probability of the input data is obtained using the sigmoid( ) function. These operators can be mathematically expressed as follows:
$$ReLU(y) = \max(0, y), \qquad sigmoid(y) = \frac{1}{1 + e^{-y}}$$
where ReLU represents the rectified linear function. In this work, the proposed ''piRNA-CNN'' model was implemented using the Keras library in Python [68]. The batch size is 64 and the number of epochs is 100. The model is trained with the Adam optimizer, and a minimum learning rate of 0.0004 is kept.
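To make the layer operations concrete, the following sketch runs a single forward pass through a convolution with ''same'' padding and stride 1, ReLU, max-pooling, a dense unit, and a sigmoid output. It is a deliberately simplified, single-channel illustration in plain Python and not the Keras model used in the paper; the kernel, the dense weights, and the pooling size of 2 are arbitrary assumptions.

```python
import math


def conv1d_same(x, kernel, bias=0.0):
    """1D convolution (cross-correlation), stride 1, zero 'same' padding."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(x) + [0.0] * (k - 1 - left)
    return [sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias
            for i in range(len(x))]


def relu(values):
    """ReLU(y) = max(0, y), applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def sigmoid(y):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def forward(x, kernel, dense_weights, dense_bias=0.0):
    """Convolution -> ReLU -> max-pooling -> dense unit -> sigmoid."""
    features = max_pool1d(relu(conv1d_same(x, kernel)))
    score = sum(w * f for w, f in zip(dense_weights, features)) + dense_bias
    return sigmoid(score)


probability = forward([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0], [1.0, 1.0])
print(probability)
```

With ''same'' padding the convolution output has the same length as the input, which is why the number of rows is preserved until the pooling layer halves it.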
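The prediction performance is measured in terms of accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC), which are defined as:
$$Acc = \frac{T^{+} + T^{-}}{T^{+} + F^{+} + T^{-} + F^{-}}$$
$$Sn = \frac{T^{+}}{T^{+} + F^{-}}$$
$$Sp = \frac{T^{-}}{T^{-} + F^{+}}$$
$$MCC = \frac{T^{+} T^{-} - F^{+} F^{-}}{\sqrt{(T^{+} + F^{+})(T^{+} + F^{-})(T^{-} + F^{+})(T^{-} + F^{-})}}$$
where $T^{+}$, $F^{-}$, $T^{-}$, and $F^{+}$ indicate true positive, false negative, true negative, and false positive, respectively.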
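The standard definitions of accuracy, sensitivity, specificity, and MCC in terms of T+, F−, T−, and F+ can be computed as in the sketch below. The function name and the example counts are illustrative assumptions, not results from the paper; returning 0 when a denominator is zero is also a convention chosen here.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Acc, Sn, Sp, and MCC from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Illustrative counts only (not taken from the paper's tables).
print(binary_metrics(tp=90, fn=10, tn=80, fp=20))
```

MCC is useful alongside accuracy because it uses all four counts and stays informative when the classes are unbalanced.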

E. CROSS-VALIDATION
In the literature, three cross-validation (CV) methods are commonly used for analysis and prediction: the jackknife test, the K-fold cross-validation (or subsampling) test, and the independent dataset test. Although the jackknife test yields a unique result for a given benchmark dataset, it has high time complexity. In contrast, the K-fold cross-validation test overcomes the complexity issue of the jackknife test while retaining similar characteristics. Therefore, in this study, we adopted the seven-fold CV test to assess the error rates of the proposed ''piRNA-CNN'' model. The feature spaces are split into seven subsets at each layer, i.e., the first layer and the second layer, where one subset is used for testing and the rest are used for training to measure the performance. This procedure is repeated seven times until each subset has been used as a test set once [1], [73]. Thus, each of the seven subsets is singled out in turn to test the model, and the average outcome is taken as the final result.
Table 2 demonstrates the prediction performance of the machine learning and deep learning classification algorithms on various benchmark datasets. For machine learning, the classification algorithms SVM, KNN, RF, GRNN, and PNN, along with ensemble models, are used in combination with the discrete feature spaces. For deep learning, the CNN is used in conjunction with the feature space based on the natural language processing technique. The first-layer results in Table 2 show that the deep learning based approach performs much better than both the individual machine learning approaches and the ensemble model. The accuracy, sensitivity, and specificity of the deep learning approach are 4.98%, 3.28%, and 6.70% higher, respectively, than those of the machine learning approaches.
In the second layer, the predictive outcome of the deep learning method is 5.16% higher than the highest result of the machine learning methods. Finally, the developed model is compared with the current state-of-the-art methods using the 7-fold CV test, as shown in Table 3. The pioneering works on these data are 2L-piRNA, piRNAPred, and 2L-piRNADNN. After empirically examining the outcomes of the developed model and the existing models, it is observed that the accuracy of our developed computational model for the first layer is 2.04% higher than that of existing methods. Similarly, for the second layer, the developed computational model obtained 6.67% higher accuracy than existing methods. Establishing a user-friendly and open-access web predictor provides a practical platform for researchers designing pharmaceutical drugs and is also useful for academia, as established in a series of recent publications [1], [74]-[81].
VOLUME 9, 2021
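The seven-fold procedure described in this section can be sketched as follows. The sketch assumes a simple shuffled, non-stratified split with a fixed seed; whether the original experiments stratified the folds is not stated, and the function names and the `evaluate` callback are hypothetical. The sample count 2,836 is the sum of the 1,418 positive and 1,418 negative samples of the first layer.

```python
import random


def k_fold_indices(n_samples, k=7, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=7, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(2836, k=7))
print(len(splits), [len(test) for _, test in splits])
```

In use, `evaluate` would train the model on the `train` indices and return a metric such as accuracy on the `test` indices; the mean over the seven folds is the reported result.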

IV. CONCLUSION
In this study, an intelligent and high-throughput computational model, ''piRNA-CNN'', was developed for the identification of piRNAs and non-piRNAs. A comparison was drawn between machine learning algorithms and a deep learning algorithm. First, two discrete feature encoding methods, DNC and TNC, are applied to extract numerical values from RNA sequences. These feature spaces are then provided to five machine learning algorithms, and their outcomes are recorded. Furthermore, ensemble learning is adopted to merge the predictions of the individual learners in order to minimize the variance caused by the peculiarities of a single training set. It is shown that ensemble learning with the TNC feature space achieved better outcomes than the individual learners. The ensemble was constructed via GA. In the deep learning approach, RNA sequences are expressed by the natural language processing technique word2vec, and the obtained feature space is provided to the CNN for the prediction of piRNAs. The results demonstrate that the success rate of the CNN-based model is much better than that of the machine learning algorithms.
In conclusion, the obtained outcomes show that the piRNA-CNN model achieves substantial results compared with the current tools reported in the literature. It is further expected that the proposed predictive tool will assist scientists and researchers in designing improved computational tools.