A Robust and Precise ConvNet for small non-coding RNA classification (RPC-snRC)

Functional or non-coding RNAs are attracting increasing attention as they are now considered potentially valuable resources in the development of new drugs intended to cure several human diseases. The identification of drugs targeting the regulatory circuits of functional RNAs depends on knowing their families, a task known as RNA sequence classification. State-of-the-art small non-coding RNA classification methodologies take secondary structural features as input. However, such feature extraction approaches only take global characteristics into account and completely overlook the correlative effect of local structures. Furthermore, secondary structure based approaches incorporate a high-dimensional feature space which proves computationally expensive. This paper proposes a novel Robust and Precise ConvNet (RPC-snRC) methodology which classifies small non-coding RNA sequences into their relevant families by utilizing the primary sequence of RNAs. The RPC-snRC methodology learns hierarchical representations of features by utilizing the positioning and occurrence information of nucleotides. To avoid exploding and vanishing gradient problems, we use an approach similar to DenseNet in which the gradient can flow straight from subsequent layers to previous layers. In order to assess the effectiveness of deeper architectures for small non-coding RNA classification, we also adapted two ResNet architectures with different numbers of layers. Experimental results on a benchmark small non-coding RNA dataset show that the proposed methodology not only outperforms existing small non-coding RNA classification approaches by a significant margin of 10% but also outshines the adapted ResNet architectures.


Introduction
Ribonucleic acid (RNA) serves as a coding template in the creation of proteins, performs various biological functions, and is largely responsible for several diseases such as Alzheimer's disease, cardiovascular disease, cancer, and type 2 diabetes [1,2].
Primarily, RNA is classified as protein coding or non-coding. The roughly 3% of RNA which produces proteins is called protein coding or messenger RNA (mRNA), while the other 97% of RNA is known as non-coding (ncRNA) or functional RNA [3]. The functionality of almost all protein coding RNAs is well known and has been studied extensively over the years because of their essential protein encoding function and active participation in different biological processes, including embryonic development [4]. Non-coding RNAs were considered useless for quite some time; however, with advancements in biological research, it has lately been established that most ncRNAs perform multifarious essential biological processes such as dosage compensation, genomic imprinting, and cell differentiation [5,6]. With the passage of time, the analysis of ncRNA has become even more interesting because of its importance in understanding the phenomena behind human diseases and achieving stable health [5].
Non-coding RNAs differ from each other in terms of length, conformation, and biological cell function. As shown in Figure 1, ncRNAs are typically classified as small non-coding RNAs (sncRNA) or long non-coding RNAs (lncRNA).
The long non-coding RNAs are greater than 200 bp in size [6] and are further split into linear RNAs and circular RNAs.

Figure 1: Overall taxonomy of non-coding RNA families, adapted from [5]. The magenta square at the top layer represents non-coding RNA; the dark green squares at the second layer, namely small non-coding and long non-coding, refer to the subclasses of non-coding RNA. Similarly, the light green squares are the major subclasses of small and long non-coding RNA. At the last level, yellow squares show the types of cis-regulatory sequences, light pink squares show the kinds of gene sequences, and purple squares show the subtypes of intron sequences. On the other hand, navy blue squares reveal the kinds of linear long non-coding RNA sequences, and circular subclasses are shown by red squares.

Linear RNAs are considered noteworthy resources in gene transcription and translation [7]. Circular RNAs have the capability of gene regulation, are strongly connected with complex human diseases like lung cancer, and are considered important in the identification and treatment of tumours [8], [9]. Over the last decade, researchers have proposed various machine and deep learning based methodologies to classify RNAs, since verifying humongous transcriptomes through manual experimentation is a time consuming and extremely expensive task [10]. Extensive research has been done to differentiate protein coding RNA from non-coding RNA. In particular, in order to discriminate long non-coding RNA (lncRNA) from protein coding RNA, or to classify lncRNAs into their corresponding families, diverse machine and deep learning based methodologies have been proposed [11,12,13,14,7,8].
The small ncRNAs possess lengths of around 20-30 bp and are involved in translation, splicing, and regulation of genes [15]. Primarily, small ncRNAs are classified into 13 subclasses, where each subclass has distinct medical and biological significance. For instance, scaRNAs (most of which are functionally and structurally identical to snoRNAs, and a few of which are considered composites of HACA-box and CD-box) can guide modifications in pseudouridylation and methylation.
Likewise, snoRNA and miRNA play a substantial role in cancer development [16].
In addition, miRNAs perform post-transcriptional gene expression regulation and RNA silencing. miRNAs target almost 60% of human genes and play an indispensable role in several biological processes like cell differentiation, proliferation, and death [17], [18], [19], [20]. Studies have proved that miRNAs are involved in diverse and complex human diseases such as cancer, autoimmune, cardiovascular, and neurodegenerative diseases [21]. Similarly, ribosomal RNA (rRNA) is essential for all living organisms. It has an essential role in protein synthesis, and its characteristics are considered extremely valuable for the development of multifarious antibiotics. In addition, 5S ribosomal RNA, another kind of rRNA, exists in the ribosome. Although its function has not been discovered yet, it has been observed that its deletion substantially reduces protein synthesis and also produces detrimental effects on the fitness of the cell [22]. Likewise, 5.8S ribosomal RNA actively participates in protein translocation [23]. It forms covalent connections with tumour suppressor proteins [24], can be used to detect miRNA [25], and helps in understanding other rRNA pathways and processes in the cell [26].
Classification of small non-coding RNAs (sncRNAs) is substantial because of their large number and distinct functions. It can help biologists and clinicians better understand the impact of sncRNAs on the development of various diseases and biological operations. Furthermore, classification of sncRNAs is also important in developing cancer therapeutic strategies [5]. To the best of our knowledge, two computer based (in particular, deep learning based) approaches have reported the highest classification results for small non-coding RNA to date. The first approach, "nRC" by Antonino Fiannaca et al. [27], comprises three fundamental tasks: estimation of secondary structures from the Rfam dataset (a publicly available benchmark dataset containing 8920 samples belonging to 13 sncRNA subclasses), extraction of common substructures, and classification into the 13 known ncRNA classes using a convolutional neural network (CNN). This approach achieved 81% ncRNA classification accuracy.
The second approach, proposed by Emanuele Rossi et al. [28], extracts secondary structural features from the same Rfam database. However, rather than using a simple convolutional neural network, they utilize a graph based convolutional architecture for the extraction of discriminative features and for classification.
To the best of our knowledge, this approach reported the state-of-the-art performance with 85% accurate classification of small non-coding RNA sequences.
Note that the state-of-the-art small non-coding RNA classification approaches take the secondary structure of RNA sequences as input and extract discriminative features by utilizing convolutional layers or graph based methodologies. However, feature extraction methods based on secondary structures usually only consider global characteristics while ignoring the mutual influence of local structures [29]. Such methods risk neglecting important information that is available in the primary sequences but is potentially lost while deriving the secondary structures (on which the final classification is based). Furthermore, secondary structure based methods integrate a high-dimensional feature space which is computationally inefficient [29]. In the current paper, rather than extracting discriminative features from secondary structures, we propose to use the primary RNA sequences directly. We present a Robust and Precise Convolutional neural network for small non-coding RNA Classification (RPC-snRC). The proposed system is an end-to-end small non-coding RNA classification methodology which uses a set of deep convolutional layers for the extraction of discriminative features by utilizing the positioning and occurrence information of various nucleotides in RNA sequences. To evaluate the integrity of the proposed methodology, we perform experiments on the publicly available benchmark dataset provided by Antonino Fiannaca [27]. The proposed system clearly outperforms all existing methods and outshines the previous state-of-the-art method (by Emanuele Rossi et al.
[28]) by a fair 10% margin in terms of different performance metrics including accuracy, precision, recall, and F1 measure.

[35], NONCODE [36], and RNAdb [37] for experimentation. This method utilized amino acid composition, the estimated percentage of exposed residues, peptide length, compositional entropy, homologs found from searches of the mentioned databases, alignment entropy, and the estimated content of secondary structure.
In order to further raise the performance of ncRNA classification, a few researchers explored ensemble approaches, considering the effectiveness of decision trees. For instance, Marasri et al. [38] came up with a hybrid tool for the task of ncRNA classification. They combined an ensemble of several decision trees and a random forest with a logistic regression model to discriminate short and long ncRNA sequences. This tool includes a naive feature, SCORE, which was computed by logistic regression through the combination of five features, i.e., structure, robustness, sequence, modularity, and coding potential. For experimentation, it used multiple datasets including RefSeq [39], Rfam [31], lncRNAdb [40], and the NCBI genome database "GenBank". In the proposed methodology, a set of 369 features was extracted to predict ncRNAs. Amongst these features, discriminative features were acquired through feature selection based on correlation and a genetic algorithm. While logistic regression was utilized to locate relationships among features, sequence similarity was computed using the basic local alignment search tool (BLAST) [41]. A random forest acted as the primary classifier.
The ensemble of several decision trees in the random forest was capable of capturing the heterogeneity of ncRNA subfamilies. This methodology was robust as it exploited composite features which raised the classifier performance. This approach was used to classify both known and unknown ncRNAs. Similarly, Yanni et al. [42] presented a method, namely lncRNA-ID, based on balanced decision trees to identify long ncRNAs. This method utilized multiple sequence alignment and the LncRNADisease database [43] for experimentation.
Furthermore, researchers have also experimented with unsupervised methodologies for ncRNA identification. For example, Yasubumi et al. [44] presented a methodology, namely EnsembleClust, for hierarchical clustering of ncRNAs. This methodology enabled the discovery of new ncRNA families [44] and aided in investigating the functional diversity of ncRNAs. EnsembleClust implemented an unsupervised approach which utilized unlabelled data to construct clusters of ncRNAs on the basis of structural alignment results. As the computation of structural alignment was extremely expensive, approximate algorithms were utilized which considered all possible secondary structures and sequence alignments. In addition, for the sake of accurate clustering, a robust measure was used which considered both primary sequences and secondary structures. EnsembleClust produced better performance when compared with previous approaches such as FOLDALIGN [45], Stem kernel [46], and LocARNA [47]. Moreover, Milad et al. [48] came up with an approach, RNAscClust, to identify ncRNAs. RNAscClust was used to combine RNA sequences through structure conservation and graph oriented motifs [49]. This approach used structural similarities in order to group paralogous RNAs.

Conversely, deep sequencing has also been employed for ncRNA classification.
For instance, Yasubumi et al. [54] presented an approach, SHARAKU, based on deep sequencing for ncRNA classification. SHARAKU incorporated an algorithm which aligned read mapping profiles of ncRNAs from next generation sequencing data. The system also implemented a program for the alignment of read mapping profiles which used decomposition for the sake of folding and aligning RNA sequences at the same time [55]. Profiles of read mapping allowed the detection of common patterns. Secondary structure and sequence information were acquired concurrently in this approach.

Following the triumph of deep learning methodologies in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)¹, researchers became more interested in employing deep learning for diverse computer vision, natural language processing, and bioinformatics tasks [70], [71], [72], [73], [74]. Generally, the aim has been to develop deeper architectures with proper gradient flow among the layers which can learn better hierarchical representations of features.

¹ http://www.image-net.org/challenges/LSVRC/
Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequences are often treated in the same way as traditional text is treated in natural language processing [75]. The term k-mers is used for DNA and RNA sequences, where a group of three or four nucleotides is combined to form a word known as a 3-mer or 4-mer. Today, however, there is a debate about which atom-level representation (a single nucleotide, treated as a character, or k-mers, treated as words) would be the most effective for DNA and RNA sequence analysis tasks. Furthermore, researchers are also working on proteomic and genomic data to provide biomedical pretrained neural word embeddings for different k-mers. This is because neural word embeddings, known as continuous representations of features or words, have played an important role in improving the performance of various NLP tasks. In this regard, Asgari et al. [76] recently provided pretrained neural word embeddings for proteins and genes. However, there are again several open questions about the impact of utilizing pretrained neural word embeddings for DNA and RNA sequence analysis, e.g., will deep architectures learn better features using pretrained word embeddings of proteins and genes?
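As a minimal illustration of the k-mer "word" representation described above (the window size and example sequence here are our own, not taken from the paper):

```python
def kmers(sequence, k=3):
    """Slide a window of size k over a sequence to form overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# An RNA sequence of six nucleotides yields four overlapping 3-mers:
print(kmers("AUGGCU"))  # ['AUG', 'UGG', 'GGC', 'GCU']
```

Each such 3-mer can then be mapped to a one-hot vector or looked up in an embedding table, exactly as words are in NLP pipelines.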
This paper presents a robust and precise convnet based system for small non-coding RNA classification. The proposed system takes raw RNA sequence data as input and utilises convolutional layers for the extraction of discriminative features, which are eventually passed to dense layers for classification. The system does not require any alignment or manual feature extraction; it provides an end-to-end deep learning based system which takes primary RNA sequences as input and provides a class label as output. Furthermore, to provide answers to the above questions, we have performed detailed experimentation on the small non-coding RNA classification dataset with the proposed system and also with two adapted ResNet architectures.

Proposed Methodology
This section briefly describes the proposed RPC-snRC methodology for the classification of small non-coding RNA. We develop a deep classifier in which a mechanism similar to DenseNet is used to enable proper flow of gradient between subsequent and previous layers.

DenseNets
Consider a small non-coding RNA sample x0 that is passed through a convolutional network. The network consists of L layers, each of which performs a non-linear transformation HL(·), where L indicates the layer. HL(·) may be a composite function of operations such as batch normalization [77], rectified linear units (ReLU) [78], pooling [79], or convolution (Conv). We refer to the output of the Lth layer as xL.
Dense connectivity. Traditional feed-forward convolutional networks pass the Lth layer output as input to the (L + 1)th layer, which produces the layer transition xL = HL(xL−1) [70]. ResNets [73] add a skip connection around this transformation (equation 1).

Transition layers. We refer to the layers between blocks that perform convolution and pooling operations as transition layers. The concatenation procedure used in equation 2 is not applicable if the size of the feature maps varies.
In our architecture, we split the network into several tightly linked dense blocks so that the feature maps within a block have the same size. Downsampling is performed through transition layers, each consisting of a batch normalization layer and a convolution layer with kernel size 1, followed by an average pooling layer with kernel size 4.

Growth rate. If each composite function HL(·) produces N feature maps, then the Lth layer will have N0 + N × (L − 1) input feature maps, where N0 denotes the number of channels in the input layer. We refer to the hyperparameter N as the network's growth rate.
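The channel arithmetic above can be checked with a small helper (a sketch; the channel count of 4 and growth rate of 8 below are illustrative assumptions, not the paper's configuration):

```python
def dense_inputs(n0, growth_rate, n_layers):
    """Input feature-map counts per layer in a dense block:
    layer L receives N0 + N * (L - 1) feature maps."""
    return [n0 + growth_rate * (layer - 1) for layer in range(1, n_layers + 1)]

# With 4 input channels (one per nucleotide A, C, G, U) and growth rate 8:
print(dense_inputs(4, 8, 4))  # [4, 12, 20, 28]
```

Because every layer appends its N feature maps to the running concatenation, input width grows linearly with depth, which is what keeps DenseNet-style blocks narrow per layer yet rich in reused features.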

Validation method and evaluation criteria
We perform experimentation on the small non-coding RNA classification dataset manually tagged by Antonino et al. [27]. This is the only publicly available benchmark dataset for this task. It consists of 8920 samples that belong to 13 sncRNA classes.

Accuracy. Accuracy is considered a reasonable metric when the dataset is symmetric, i.e., where the numbers of false negatives and false positives are nearly equal. It computes the ratio of correctly predicted samples to the total samples.

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)
Precision. Precision is the ratio of correctly predicted positive samples to the total predicted positive samples:

Precision = Tp / (Tp + Fp)
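The two metrics above can be computed directly from the confusion matrix counts of Table 4 (the counts in the example call are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

# e.g. 80 true positives, 90 true negatives, 10 false positives, 20 false negatives:
print(accuracy(80, 90, 10, 20))  # 0.85
print(precision(80, 10))         # ~0.889
```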

Experimental setup and Results
We implement the proposed RPC-snRC and ResNet based methodologies in Python using PyTorch [80]. A detailed parametric description of the adapted ResNet based methodologies is summarized in Table 2. Cross entropy is used as the loss function with the Adam [81] optimizer, where the learning rate is initialized to 0.001. In order to reduce training time, an early stopping approach is used. A high-performance NVIDIA GeForce GTX 1080Ti GPU is used for experimentation.
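The early stopping rule can be sketched independently of the training framework; note the patience value below is our assumption, since the paper does not state one:

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.6, 0.65, 0.64, 0.61]  # no improvement after the third epoch
print([stopper.step(v) for v in losses])  # [False, False, False, False, False, True]
```

In the actual training loop, `stopper.step(...)` would be called once per epoch on the validation loss, breaking out of the loop when it returns True.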
Results. This section briefly describes the performance of the proposed RPC-snRC methodology in comparison to existing (nRC [69] and RNAGCN [28]) methodologies on the benchmark small non-coding RNA dataset. The neural architecture based methodology by Emanuele Rossi et al. [28] is of particular interest, as it marked the state-of-the-art performance for small non-coding RNA classification with 85.7% accuracy. The adapted ResNet-18 and ResNet-50 manage to produce peak performances of 91% and 89% by representing RNA sequences as characters with one hot encoding and as 3-mers features with pretrained prot2vec embeddings, respectively. On the other hand, the proposed RPC-snRC classification system significantly outperforms the state-of-the-art methodology as well as the two adapted ResNet architectures in all settings.
RPC-snRC with 3-mers random embedding initialization and pretrained embeddings, in particular, shows that it suffers less from type I and type II errors as compared to the nRC methodology, whose performance seems less stable at the class level.

Conclusion
This paper proposes a novel methodology, named RPC-snRC, which classifies small non-coding RNA sequences into their relevant families by utilizing the positioning and occurrence information of various nucleotides. Experimental results prove that the proposed RPC-snRC methodology is highly robust, as it is biased towards neither false positive nor false negative predictions.
The adapted Res18-snRC and Res50-snRC methodologies also perform better than the previous state-of-the-art methodologies.

RNAscClust used structural similarities to group paralogous RNAs and enabled the clustering of humongous sets of sequences. Sequences were transformed into a graph, where every nucleotide was taken as a graph vertex labelled A, U, G, or C, base pair connections were represented as edges, and further edges encoded the backbone. The structures were compared with one another through graph kernels. This method considered changes of base pairs, which had not been accounted for by previous clustering approaches. For experimentation, the Rfam database of ncRNA sequences was used. The authors reported that the proposed method facilitated accurate clustering, which made it possible to align large clusters efficiently. Considering the promising performance of deep neural networks for diverse natural language processing tasks, researchers have also employed Convolutional Neural Networks (CNNs) to classify ncRNAs. For example, Yasubumi et al. [50] proposed a methodology, CNNClust, to cluster ncRNAs. This technique integrated pairwise alignment of ncRNA sequences. The CNN was trained using positional weight matrices of the underlying sequence motifs. Two kinds of neural word embeddings, one hot encoding and word2vec, were used by CNNClust. Information from secondary structures and read mapping was also utilized in CNNClust. A similarity score matrix was computed for each pair of RNA sequences, and clustering was performed to group highly similar structures.

Figure 2: Proposed RPC-snRC methodology for small non-coding RNA classification. In the figure, (128, 16, 18) indicates that a convolutional layer has 128 kernels, each of width 16 and length 18, and (1, 4) indicates that the kernel width and length are set to 1 and 4, respectively, in a pooling layer. The other annotations have similar meanings.

Figure 2 illustrates the architecture of the proposed methodology along with its noteworthy model parameters. The proposed RPC-snRC methodology is based on three dense modules. Each dense module contains the same number of layers; however, the number of output units doubles in every following dense module. Each dense module first performs batch normalization on the given input, then applies ReLU activation to introduce non-linearity, followed by a convolution operation to extract discriminative features. Finally, it repeats these operations one more time in order to better learn hierarchical representations of the data. Each dense module is followed by a transition layer which performs batch normalization, ReLU activation, convolution with filter size 1 × 1, and max pooling of size 4 to retain discriminative features and discard useless ones. The dense architecture was proposed by Gao Huang et al. [74].

ResNets [73], along with a skip connection strategy, use an identity function to bypass the non-linear transformation, as shown in equation 1:

xL = HL(xL−1) + xL−1    (1)

A benefit of ResNets is that the gradient can flow straight from subsequent layers to previous layers through the identity function. However, the identity function and the output of HL are combined by summation, which can hinder the flow of information in the network. We utilise DenseNet, a distinct connectivity model, to further enhance the information flow between layers. In this model, the Lth layer receives the feature maps of all previous layers, x0, ..., xL−1, as input:

xL = HL([x0, x1, ..., xL−1])    (2)

In equation 2, [x0, ..., xL−1] denotes the concatenation of the feature maps produced in layers 0, ..., L − 1.

Composite function. Following He et al. [73], we define HL(·) as a composite function of three successive operations: batch normalization (BN) [77], followed by the rectified linear unit (ReLU) activation function [78], and a convolution (Conv) layer.
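As a concrete sketch of equations 1 and 2, dense connectivity can be written in PyTorch roughly as follows; the channel counts, kernel size, and growth rate here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One composite function HL: BN -> ReLU -> Conv, as defined in the text."""
    def __init__(self, in_channels, growth_rate, kernel_size=3):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_channels)
        self.conv = nn.Conv1d(in_channels, growth_rate,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        return self.conv(torch.relu(self.bn(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all earlier feature maps (eq. 2)."""
    def __init__(self, n0, growth_rate, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(n0 + i * growth_rate, growth_rate) for i in range(n_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # torch.cat along the channel dimension implements [x0, ..., xL-1]
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

# A batch of 2 one-hot encoded RNA sequences: 4 channels (A, C, G, U), length 64.
x = torch.randn(2, 4, 64)
out = DenseBlock(n0=4, growth_rate=8, n_layers=3)(x)
print(out.shape)  # output channels: N0 + N * n_layers = 4 + 8 * 3 = 28
```

A ResNet-style layer would instead compute `self.conv(...) + x` (equation 1), which requires matching channel counts and mixes information by summation rather than keeping earlier feature maps intact.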
classification system and the two adapted ResNet architectures (ResNet-18 and ResNet-50) for the task of ncRNA classification. It shows the impact of three sequence representation schemes, treating the RNA sequence as a set of characters or as k-mers based words, for both the proposed and adapted methodologies. For experimentation, we have fixed the sequence length to 1180 characters, which is essential for the convolution operation. To make all sequences of equal length, we apply padding where a sequence is shorter than 1180 characters and truncate the extra characters in the opposite case. Experimentation is performed in two different ways. First, the RNA sequence is taken as a set of characters with two different representation schemes, namely one hot vector encoding and random embedding initialization, which are separately fed to the proposed RPC-snRC system. Second, we generate 3-mers of the sequence by sliding a window of size three over the sequence. The k-mers based sequence representation, along with one hot vector encoding, random embedding initialization, and the pretrained word embeddings provided by Asgari et al. [76], is fed to the proposed RPC-snRC system.
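The fixed-length preprocessing described above can be sketched as follows; note the padding symbol "N" is our assumption, as the paper does not state which character is used for padding:

```python
def fix_length(sequence, target=1180, pad_char="N"):
    """Pad short sequences and truncate long ones to a fixed length so that
    every input to the convolutional layers has the same size."""
    if len(sequence) >= target:
        return sequence[:target]          # truncate extra characters
    return sequence + pad_char * (target - len(sequence))  # pad to target

print(len(fix_length("AUGC" * 10)))  # 1180 (padded)
print(len(fix_length("A" * 2000)))   # 1180 (truncated)
```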

Figure 4: Individual class performance of the proposed RPC-snRC methodology and the state-of-the-art nRC [69] methodology on the small ncRNA classification dataset.

Note that our methodology does not require any alignment or manual feature extraction technique, as it provides an end-to-end deep learning system which takes RNA sequences as input and provides class labels as output. One earlier approach integrated deep sequencing for the task of classifying ncRNA. The method utilized protein coding features to discriminate between coding and non-coding RNAs. It incorporated pairwise sequence alignment and used the lncRNA, NONCODE, and NCBI databases for experimentation. In order to improve the performance of small non-coding RNA classification, more recently, Emanuele Rossi [28] proposed a graph convolutional neural network based methodology which also takes secondary structural features as input. This methodology uses graph convolutions for the extraction of discriminative features from the secondary structural features. To the best of our knowledge, this is the latest methodology which has excluded manual feature engineering and produced state-of-the-art performance for small non-coding RNA classification. Table 1 summarizes the state-of-the-art work for RNA sequence classification. In this paper, we propose the RPC-snRC methodology, which takes RNA sequence data as input and utilises convolutional layers for the extraction of discriminative features, which are eventually passed to dense layers for classification.

Table 1: Summary of the previous work for non-coding RNA classification and clustering in terms of the exploited technique, use of sequence alignment information, and the type of features used as input. In the Table, four methodologies, namely EnsembleClust, RNAscClust, SHARAKU, and CNNClust, group non-coding RNAs according to their structural similarities. lncRNA-ID and LncRNANet perform long non-coding RNA identification. Two machine learning based methodologies, namely RNAz and CONC, differentiate between coding and non-coding RNA sequences. The hybrid random forest methodology is used to discriminate between small non-coding and long non-coding RNA sequences. The deep RNN methodology identifies microRNAs. Two deep learning methodologies, namely nRC and RNAGCN, perform classification of small non-coding RNA.

Table 2: Parameter details of the adapted ResNet architectures. Architecture summary of Res18-nRC and Res50-nRC: in both architectures, ncRNA samples first pass through a convolutional layer before the res modules. Both architectures have 4 res modules. Each module of Res18-nRC has 2 basic blocks, where each basic block has two convolutional layers, whereas Res50-nRC has a variable number of bottleneck blocks in each res module, indicated by the number outside the matrix brackets, i.e., the first res module has 3 bottleneck blocks and the second has 4. In the first matrix, (64, 17), 64 represents the number of feature maps and 17 the kernel size.

Table 3: Characteristics of the non-coding RNA classification dataset, where Max-seq length and Min-seq length illustrate the maximum and minimum numbers of nucleotides in the samples of each class. In the training set, each class has 500 samples except the IRES class, which has 320 samples available for training.

A well-known statistical cross validation method, namely 5-fold cross validation, is used to better analyze the behaviour of the proposed model. We have used the training set for training and validation of the proposed model, while the test set is only used for the final evaluation of the model. The training set is split into 5 equal parts: 4 parts are used to train the model and the 5th part is used to validate the trained model. For dual evaluation, the trained model is also evaluated on the held-out test set. The process of training and dual evaluation is repeated five times, where each time the test set remains the same but the next fold is taken as the validation set. Final results are computed by averaging the 5 results produced by the proposed model across the folds.
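The rotation of validation folds described above can be sketched with a plain index split (a simplified sketch assuming the sample count divides evenly into the folds; in practice a library routine such as scikit-learn's KFold could be used instead):

```python
def five_fold_splits(n_samples, n_folds=5):
    """Split sample indices into n_folds equal parts; each part serves once
    as the validation set while the remaining parts train the model."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    splits = []
    for f in range(n_folds):
        val = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        splits.append((train, val))
    return splits

splits = five_fold_splits(100)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 80 20
```

The held-out test set never enters these splits; it is evaluated once per fold, and the five results are averaged for the final figures.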

Table 4: Confusion matrix, where True Positive (Tp) is the count of correctly predicted positive class values, e.g., a sample is a true positive when both the actual and predicted class labels are 'yes'. Similarly, True Negative (Tn) is the accurate prediction of negative class labels. False Positive (Fp) denotes the count of wrongly predicted positive labels, i.e., when the actual class is 'no' but the model predicts 'yes'; similarly, False Negative (Fn) is the wrong prediction of 'no' when the actual class was 'yes'.

Table 5 compares the performance of the state-of-the-art and adapted ResNet based methodologies with the proposed RPC-snRC methodology for the task of small non-coding RNA classification. It also illustrates the performance of the proposed RPC-snRC methodology when the RNA sequence is treated as a set of characters or as 3-mers based features with random and pretrained neural word embeddings. As depicted in Table 5, the renowned methodology proposed by Antonino Fiannaca et al. [27] managed to achieve performance figures of 78%, 77%, 78%, and 77% in terms of accuracy, precision, recall, and F1 measure, respectively. This performance is outperformed by the recent graph convolutional