Assessing Global-Local Secondary Structure Fingerprints to Classify RNA Sequences With Deep Learning

RNA elements that are transcribed but not translated into proteins are called non-coding RNAs (ncRNAs). They play wide-ranging roles in biological processes and disorders. Just like proteins, their structure is often intimately linked to their function. Many examples have been documented where structure is conserved across taxa despite sequence divergence. Thus, structure is often used to identify function. Specifically, the secondary structure is predicted, and ncRNAs with similar structures are assumed to have the same or similar functions. However, a strand of RNA can fold into multiple possible structures, and some strands even fold differently in vivo and in vitro. Furthermore, ncRNAs often function as RNA-protein complexes, which can affect structure. Because of this, we hypothesized that using one structure per sequence may discard information, possibly resulting in poorer classification accuracy. Therefore, we propose using secondary structure fingerprints, comprising two categories: a higher-level representation derived from RNA-As-Graphs (RAG), and free energy fingerprints based on a curated repertoire of small structural motifs. The fingerprints take into account the difference between global and local structural matches. We also evaluated our deep learning architecture with k-mers. By combining our global-local fingerprints with 6-mers, we achieved an accuracy, precision, and recall of 91.04%, 91.10%, and 91.00%, respectively.


INTRODUCTION
Non-coding RNAs (ncRNAs) are RNA molecules that do not code for protein but nevertheless serve several functions in living cells, including the regulation of gene expression at all levels: transcription, splicing, and translation [1]. Naturally, they have been found to play important roles in diseases such as cancer [2], [3], [4] and multiple sclerosis [5], to name a few examples. Consequently, ncRNAs are also central to the development of novel and promising therapeutic approaches [6], [7], [8], [9], [10].
Based on criteria such as function and length, as well as primary, secondary, tertiary, and quaternary structure, they can be further classified into different subclasses [11], including micro RNA (miRNA), transfer RNA (tRNA), CD-box, riboswitch, and small nuclear RNA (snRNA) [12].
One emerging area of research is the classification of ncRNAs given sequence information. This includes distinguishing ncRNAs from protein-coding sequences, identifying whether or not a sequence belongs to a specific ncRNA class, and classifying an ncRNA sequence into its specific class [13]. Prior research in detecting novel ncRNA genes has suggested that no statistically significant, obvious biases are found in ncRNA sequences (as opposed to protein-coding sequences with their open reading frames) [14], [15], which makes identification using sequence information challenging.

BACKGROUND AND RELATED WORK

RNA Secondary Structure
RNAs can fold into structures [16], which provide them with "chemical stability" and "facilitate their biological functions" [17]. Because of this, the functions of RNAs may be inferred from their structural information [17]. In addition, when it comes to ncRNA, two of the possible criteria used to further divide them into types are structural features and functions [11]. As a result, prior studies on ncRNA classification have attempted to utilize such structural features.

Related Work
Previous studies on ncRNA classification have also strived to use machine learning techniques. One such work is nRC [18], where sequences are fed into the IPknot program [19] in order to predict their secondary structures. The predicted secondary structures are then forwarded to the MoSS algorithm to obtain the substructures [20], which are used as inputs to a convolutional neural network (CNN) [18]. A different but related work found that it was effective to use RandomForest on nucleotide distributions in order to distinguish between coding and non-coding RNA, but not the specific class an ncRNA sequence belongs to; while igraph-calculated properties of IPknot-predicted secondary structures of the sequences worked well with RandomForest to predict the latter [21]. Deep learning methods have also been shown to be applicable to the ncRNA classification problem, as demonstrated by circDeep, a deep-learning-based method developed to distinguish between circRNA and other specific classes of ncRNA [22], and by LncRNAnet, an approach developed to distinguish sequences of lncRNA from other types of sequences [23]. Finally, Boukelia et al. proposed an approach integrating multiple sources of information, namely genomic and epigenetic data, and using a CNN to classify ncRNA sequences into different classes [24].

Limitation
A common theme among studies that used secondary structure information as features to perform ncRNA classification is that they rely on a single predicted secondary structure per sequence, in many cases using a secondary structure prediction tool such as IPknot [19]. Although promising results have been achieved with this approach [18], relying solely on a single predicted secondary structure per sequence may discard useful information: 1) a strand of RNA may have more than one possible conformation [16], and using a single predicted secondary structure for each RNA sequence discards these other possible conformations; 2) secondary structure formation may not tend towards the minimum free energy conformation [16]; 3) RNA formation in certain species can stay "in non-equilibrium states" [16]; and 4) in vivo, the presence of "RNA chaperones" may influence strands of RNA to form into certain conformations that would otherwise be difficult to form in vitro [25], and a single predicted secondary structure per sequence may, again, discard these chaperone-influenced in vivo forms. Since this additional structural information is discarded and unused in previous approaches, the resulting classification performance may be suboptimal. Alternatively, taking into account multiple possible secondary structure conformations for a single sequence of RNA may yield improved classification results.
Additionally, previous approaches did not differentiate between global and local structural matches, nor did they utilize such information [18], [26], [27]. Thus, classification performance may be negatively impacted when the data set contains different RNAs that share similar local structures but different global structures [18]. Our work [26], [27] was also impacted by this limitation. This paper presents an extension to address this limitation.

Contributions of This Paper
Given an input sequence, we want to predict its class. For this, we propose the use of secondary structure fingerprints as features. Unlike previous methods, the structural features are not derived from a predicted secondary structure. Rather, each position of the feature vector represents the "odds" that a given secondary structure motif is found in the input sequence. Two categories of RNA secondary structure motifs are evaluated. One category is derived from the RAG database, RNA-As-Graphs [28]. These are higher-order topological motifs. To the best of our knowledge, this is the first application of RAG to derive features for a machine learning application. The second category consists of manually curated motifs; these include GNRA loops, tandem AG pairs, and the Sarcin/Ricin motif. These two sets comprise 47 and 44 motifs, respectively. We extend this approach to introduce global-local secondary structure fingerprints. We show that positional information improves classification accuracy. When tested on the same data set, global-local secondary structure fingerprints slightly outperform ncrna-deep [29], a state-of-the-art method.

Secondary Structure Fingerprints as Features
Huang, Li, and Gribskov have successfully used topological fingerprints to classify experimentally validated secondary structures [30]. In their work, the RNA structures are represented as "XIOS graphs" [31].
Herein, instead of using a single predicted secondary structure per RNA sequence, we propose and evaluate the use of secondary structure fingerprints to circumvent the potential loss of information from other possible secondary structure conformations, as previously discussed.

Secondary Structure Motifs
The secondary structure fingerprints are formed by checking whether specific structural motifs can be formed by the input sequence, or parts of it, scoring any matches that are found, and then using the values derived from these scores. In our study, the structural motifs are derived from two sources: "RNA-As-Graphs" [28] and common small RNA secondary structure motifs that we curated.

RNA-As-Graphs
RNA-As-Graphs (RAG) is a collection of graph-theory-based representations of RNA secondary structure motifs, which catalogues the "topological properties" or "connectivity patterns" of "all RNA structures" [28]. The RAG database includes not only motifs that represent secondary structures of identified RNAs (at the time of the development of the database), but also motifs that are theoretically possible, which is intended to support further research on novel RNAs [28].
The database consists of two types of graphs: tree graphs to represent motifs that do not take secondary structure pseudoknots into account, and the more elaborate dual graphs to represent motifs including the pseudoknots [28]. In addition to taking pseudoknots into account, the dual graph representation includes all of the motif representations made by the former graph type [28]. Therefore, dual graphs are more general than tree graphs. In order to account for pseudoknots, we adopted the dual graph motifs for our fingerprinting method.
A stem in a secondary structure is represented as a vertex in the dual graph representation. Single stranded parts of the structure, including "bulges", "loops", and "junctions", are represented as edges. The complexity of each motif can be measured by the number of vertices the corresponding graph representation has. As this was exploratory work, and to limit the required computational resources, we opted for graph representations of 2 to 5 vertices.
Despite using only dual graph representations of 2 to 5 vertices, the approach still produces a unique fingerprint for a longer or more complex sequence (one that could otherwise adopt a secondary structure represented by a dual graph with more vertices): although the matches and scores against dual graphs with a higher number of vertices are not included, parts of the sequence still match against the graphs with 2 to 5 vertices, with their corresponding scores, so the unique set of scores still results in a unique secondary structure fingerprint for the entire sequence. RAG was seen as a principled approach to include higher-order information.

Common Secondary Structure Motifs
The RAG-based fingerprints focus on higher-level/overall topological structural information. To complement this information, we have also created sets of fingerprints based on commonly known RNA secondary structure motifs. The curated secondary structure motifs used to build the additional fingerprints were extracted from the Rfam 14.1 database [32], the annotated RNA secondary structures of the bpRNA-1m database [33], and the literature on RNA secondary structure motifs.
Certain motifs were further split based on their length and/or sides (5' or 3'). The splitting process was based on distributions of motif lengths/sides extracted from the bpRNA-1m database [33]. This splitting process also ensures that there are no overly common/generic motifs that would produce matches significantly more frequently than other, less generic secondary structure motifs.
The curated set of common secondary structure motifs is as follows: 1) Hairpin with a loop length of 3 nucleotides, 2) Hairpin with a loop length of 4 nucleotides whose sequence is GNRA (the "GNRA tetraloop"), 3) Hairpin with a loop length of 4 nucleotides whose sequence is UNCG (the "UNCG tetraloop"), 4) Hairpin with a loop length of 4 nucleotides whose sequence is UMAC (the "UMAC tetraloop"), 5) Hairpin with a loop length of 4 nucleotides whose sequence is CUUG, 6) Hairpin with a loop length of 4 nucleotides whose sequence is CUCG, 7) Hairpin with a loop length of 4 nucleotides whose sequence is GANC (the "GANC tetraloop"), 8) Hairpin with a loop length of 4 nucleotides whose sequence is ANYA, 9) Hairpin with a loop length of 4 nucleotides with any sequence, 10) Hairpin with a loop length of 5 nucleotides, 11) Hairpin with a loop length of 6 nucleotides, 12) Hairpin with a loop length of 7 nucleotides, 13) Hairpin with a loop length of 8 nucleotides, 14) Hairpin with a loop length of 9 nucleotides, 15) Hairpin with a loop length of 10 nucleotides, 16) Internal loop with a length of 1 to 7 nucleotides on either side (5' or 3'), 17) Internal loop with a length of 1 nucleotide on the 5' side and a length of 1 nucleotide on the 3' side, 18) Internal loop with a length of 2 nucleotides on the 5' side and a length of 1 nucleotide on the 3' side, 19) Internal loop with a length of 1 nucleotide on the 5' side and a length of 2 nucleotides on the 3' side, 20) Internal loop with a length of 2 nucleotides on the 5' side and a length of 2 nucleotides on the 3' side, 21) Internal loop with a length of 3 nucleotides on the 5' side and a length of 3 nucleotides on the 3' side, 22) Internal loop with a length of 3 nucleotides on the 5' side and a length of 4 nucleotides on the 3' side, 23) "C-loop" on the 5' side, 24) "C-loop" on the 3' side, 25) Internal loop with "tandem AG pairs", 26) Internal loop with the "UAA/GAN motif", 27) Bulge with a length of 1 nucleotide on the 5' side, 28) Bulge with a length of 2 nucleotides on the 5' side, 29) Bulge with a length of 3 nucleotides on the 5' side, 30) Bulge with a length of 4 nucleotides on the 5' side, 31) Bulge with a length of 5 nucleotides on the 5' side, 32) Bulge with a length of 6 nucleotides on the 5' side, 33) Bulge with a length of 1 nucleotide on the 3' side, 34) Bulge with a length of 2 nucleotides on the 3' side, 35) Bulge with a length of 3 nucleotides on the 3' side, 36) Bulge with a length of 4 nucleotides on the 3' side, 37) Bulge with the "Sarcin/Ricin" motif on the 5' side, 38) Bulge with the "Sarcin/Ricin" motif on the 3' side, 39) Paired stem with a length of 3 to 4 nucleotides, 40) Paired stem with a length of 5 to 6 nucleotides, 41) Paired stem with a length of 7 to 8 nucleotides, 42) Paired stem with a length of 9 to 10 nucleotides, 43) Paired stem with a length of 11 nucleotides, 44) Paired stem with a length of 12 nucleotides.
See [26] and references therein for further details.

Overview
Given the RAG-based and common secondary structure motifs, the next step to produce the secondary structure fingerprints for an RNA sequence is to find structural matches that could be formed by the sequence and score the matches.
In this study, we used a publicly available program, RNAMotif [34], for this purpose. The program takes a sequence and a "descriptor" [34], and produces a list of matches with their scores, if applicable. Therefore, descriptors representing each of the included RAG graphs and each of the curated common RNA motifs were made. For the RAG-based fingerprints, the "bits" scoring system built into RNAMotif was used, in which matches are scored for their "sequence complexity" [35]. At the time of our study, the latest available version of RNAMotif did not support secondary structure free energy calculation for descriptors that contain pseudoknots. As a result, free energy scoring was not used for the RAG-based fingerprints, as they take pseudoknots into account.
For our common motifs based fingerprints, since no pseudoknots are involved, we chose to utilize free energy based scores; this complements the RAG-based fingerprints. Free energies of matches were calculated using the RNAMotif implementation of Mathews et al.'s free energy calculation [36].
Fig. 1 provides an overview of the steps involved to find matches, produce the scores, and form the secondary structure fingerprint. The steps are further elaborated in the following subsections.

Scoring
For a sequence and either a RAG graph or a curated common structural motif, there can be more than one location within the sequence where the corresponding structure can be formed. To handle such cases, we performed and evaluated two different approaches: 1) use only the score from the best match (i.e., minimum for common motifs based match scores, maximum for RAG-based match scores), and 2) take the minimum, average, and maximum of the match scores and use them together. For the free energy based scores, the minimum free energy represents the most thermodynamically stable match, whereas the maximum free energy represents the least stable match. Thus, the best match is the one with the lowest free energy. On the other hand, for the "bits" scoring scheme [34], the minimum score represents the least significant match, whereas the maximum score represents the most significant and therefore best match. If there is only a single structural match in the sequence, the corresponding score is used; the minimum, average, and maximum scores are then all equal to that single score.
In case of no match, a score of 0 (equivalent to insignificant or accidental matches) is used for our RAG-based fingerprints; whereas for the common motifs based fingerprints, the highest free energy value (representing the poorest/least thermodynamically stable match) that could be produced by the corresponding motif in our data set is used instead.
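To make the two approaches concrete, here is a minimal sketch of the aggregation step; the function name and data layout are ours, not taken from the paper's implementation:

```python
def aggregate_scores(match_scores, no_match_value, best="min"):
    """Collapse the scores of all matches of one motif within one sequence
    into fingerprint entries (hypothetical helper, for illustration only)."""
    if not match_scores:
        # No structural match: 0 for RAG "bits" scores, or the worst
        # (highest) free energy observed for this motif in the data set.
        v = no_match_value
        return {"min": v, "avg": v, "max": v, "best": v}
    lo, hi = min(match_scores), max(match_scores)
    return {
        "min": lo,
        "avg": sum(match_scores) / len(match_scores),
        "max": hi,
        # Free-energy scores: best match = minimum energy (most stable);
        # "bits" scores: best match = maximum score (most significant).
        "best": lo if best == "min" else hi,
    }
```

For free-energy-based motif scores one would call this with `best="min"`; for RAG "bits" scores, with `best="max"`.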

Score Rescaling
The obtained scores that make up the structural fingerprints are then rescaled before being used as deep learning features. For a set of RAG-based scores containing only either the minimum, average, or maximum scores for a single sequence, the scores in the set are linearly rescaled so that the highest value in the set becomes 1, while zero stays as 0. Therefore, the set of scores for that sequence ranges from 0 to 1 if there is at least one match with one of the included RAG graphs. If there are no matches at all with any of the included RAG graphs for a particular sequence, all of the scores are 0. Note that the rescaling processes are performed separately for the minimum scores, average scores, and maximum scores.
When it comes to the common secondary structure free energy fingerprints, for each descriptor and each set of scores (i.e., minimum, average, or maximum), the highest and lowest score values across all of the sequences in the data set are obtained. The scores are then rescaled as follows: the highest finite free energy values (i.e., highly unstable matches) are set to 0, the lowest free energy values (i.e., thermodynamically stable matches) are set to 1, and the free energy values in between for that particular descriptor are scaled linearly between 0 and 1. As with the RAG-based fingerprints, the processes were done separately for the minimum, average, and maximum free energy values. In other words, values in the average and maximum free energy sets do not affect the rescaling of the minimum free energy values.
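The two rescaling schemes can be sketched as follows (a simplified illustration assuming non-negative "bits" scores and per-descriptor free energy extremes computed beforehand; the function names are ours):

```python
def rescale_rag(scores):
    """Linearly rescale one sequence's RAG 'bits' scores so the highest
    value becomes 1 while zero stays 0."""
    top = max(scores)
    if top == 0:  # no match with any included RAG graph
        return [0.0] * len(scores)
    return [s / top for s in scores]

def rescale_free_energy(values, lowest, highest):
    """Map one descriptor's free energies onto [0, 1]: the lowest
    (most stable) energy becomes 1, the highest becomes 0."""
    span = highest - lowest
    if span == 0:
        return [0.0] * len(values)
    return [(highest - v) / span for v in values]
```

As in the text, each function would be applied separately to the minimum, average, and maximum score sets.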

Wobble Pairs
For the common motifs based fingerprints, two sets of matches and free energy scores were obtained: one set of matches and scores that allows the G·U wobble pairs [37], and another set that does not allow wobble pairs. The scores were then used both separately and together, to compare the resulting classification performances.

Differentiating Local and Global Structural Matches
Some classes of ncRNA, such as HACA-box and CD-box, may share similar local structures despite not sharing overall global structures [18].Therefore, the fingerprints need to differentiate between local and global structural matches.
To achieve this, we obtained not only matches and scores from a sequence in its entirety, but also structural matches and scores from parts of the sequence. Specifically, we broke each sequence into 3 parts. When the sequence can be broken into 3 equal parts, the lengths of the first, second, and last parts of the sequence are equal. Otherwise, the second part of the sequence may be 1 or 2 nucleotides longer than the other parts. Secondary structure matches and match scores are then obtained for the broken-down sequences as well (i.e., localized scores), in addition to the full sequence (i.e., global scores). Separate sets of fingerprints comprising only global scores, only local scores, and a combination of global and local scores are then tested to compare how they perform relative to each other. The differences in the process to obtain global scores, local scores, and combined global and local scores are illustrated in Figs. 2, 3, and 4, respectively.
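The three-way split described above can be sketched as follows (a minimal illustration, assuming the middle part absorbs any remainder so that it is at most 2 nucleotides longer than the others):

```python
def split_thirds(seq):
    """Split a sequence into three parts for localized scoring.
    The middle part takes the remainder, so it may be 1-2
    nucleotides longer than the first and last parts."""
    third = len(seq) // 3
    return seq[:third], seq[third:len(seq) - third], seq[len(seq) - third:]
```

Fingerprints are then computed for each part (local scores) and for the full sequence (global scores).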

K-mer Features
In addition to features based on secondary structures, we also used k-mer or nucleotide distributions as features, similarly to [21]. K-mers have also been used as features in other bioinformatics machine learning applications with promising results [38], [39]. We used k-mer features both independently and combined with the secondary structure based features. K-mers of k = 2, 3, 4, 5, and 6 were used separately. We stopped our exploration at k = 6 since we could no longer fit the tensor in GPU memory. Our implementation counts the occurrences of each of the possible nucleotide combinations for the corresponding k, and each of the counts is then divided by the total of the counts. Thus, the sum of the distribution for a single sequence for a particular k is 1. This also means that the k-mer features used as inputs to the neural network do not convey information on the sequence length. While it is possible for sequence length information to affect classification performance, sequence length is not used directly in this work.
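A minimal sketch of the normalized k-mer distribution (our own illustration; the paper's actual implementation may differ, e.g., in how ambiguous bases are handled):

```python
from itertools import product

def kmer_features(seq, k):
    """Normalized k-mer frequency vector over the RNA alphabet.
    Entries sum to 1, so sequence length is not encoded."""
    alphabet = "ACGU"
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases
            counts[kmer] += 1
            total += 1
    # Fixed feature order (sorted k-mers); all zeros if no valid window.
    return [counts[m] / total if total else 0.0 for m in sorted(counts)]
```

For k = 4 this yields a vector of length 256, matching the n = 256 input size mentioned below.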

Deep Learning
A large number of published studies in bioinformatics have taken advantage of deep learning techniques [13], [40], [41]. Deep learning is suitable for problems where the relationships between the features and the results, in this case the RNA sequence features and their corresponding specific RNA class, are unknown or poorly understood [41]. Given that, we expect that deep learning is suitable to learn and infer the relationships between the features (which are representations of "function" and "structural/sequence features", two of the factors that could determine which specific class an RNA sequence belongs to [11]) and the corresponding class.
We used TensorFlow [42] and the Keras API [43] to create our deep learning implementation. After experimenting with different network shapes, configurations, and hyperparameters (including but not limited to choices of activation functions), we found that a network of 5 densely-connected layers produced the best results in our case. The layers are as follows: 1) an input layer that takes the features, with a length of n (e.g., for "RNA-As-Graphs"-based secondary structure fingerprints, n = 47 as there are 47 RNA graphs with 2 to 5 vertices; similarly, n = 256 for 4-mer); 2) a densely-connected layer with n × 3 neurons, with the relu activation function [44]; 3) a densely-connected layer with n × 3 neurons, with the relu activation function [44] and L2 kernel regularization [45] of 0.0001; 4) another densely-connected layer with n × 3 neurons, with the relu activation function [44] and L2 kernel regularization [45] of 0.0001; 5) a densely-connected layer with n × 3 neurons, with the sigmoid activation function [46]; 6) a last densely-connected layer with the number of neurons equal to the number of possible RNA classes used in the study, with the softmax activation function [47]. The sparse categorical cross-entropy function available in the Keras API was chosen as the model's loss function, since it is suitable for our classification problem where each class is represented by an integer [43]. In addition, as the number of sequences differs from one RNA class to another in our data set, class weights were calculated and passed to the model during training. Consequently, the misclassification of a minority RNA class is penalized more by the loss function than the misclassification of a majority RNA class during training.
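A sketch of such a network in Keras; only the layer sizes, activations, regularization, and loss follow the description above, while the optimizer choice and other compilation details are our assumptions:

```python
import tensorflow as tf
from tensorflow import keras

def build_model(n_features, n_classes):
    """Five densely-connected layers on top of an input of length n,
    as described in the text (compilation details are assumptions)."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(n_features * 3, activation="relu"),
        keras.layers.Dense(n_features * 3, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(1e-4)),
        keras.layers.Dense(n_features * 3, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(1e-4)),
        keras.layers.Dense(n_features * 3, activation="sigmoid"),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

During training, a class-weight dictionary would be passed via `model.fit(..., class_weight=...)` to penalize misclassification of minority classes more heavily.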
Each training run lasts 120 epochs, starting with a learning rate of 0.001. Every 40 epochs, the learning rate is multiplied by 0.1 (i.e., reduced by 90%).
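The step-decay schedule amounts to the following (a sketch with 0-indexed epochs; whether the first decay applies exactly at epoch 40 or 41 depends on the implementation):

```python
def learning_rate(epoch, initial=0.001, decay=0.1, step=40):
    """Step-decay schedule: start at 0.001 and multiply by 0.1
    every 40 epochs (epochs counted from 0)."""
    return initial * (decay ** (epoch // step))
```

Over 120 epochs this yields learning rates of 0.001, 0.0001, and 0.00001 for the three 40-epoch phases.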
We wanted to compare how each type or variation of features performs on its own, and when combined with the others. Therefore, the deep learning training and evaluation processes were performed separately for each of these sets of features: 1) RNA-As-Graphs secondary structure fingerprints from global matches using best match scores; 2) RNA-As-Graphs secondary structure fingerprints from local matches using best match scores; 3) A combination of 1 and 2; 4) RNA-As-Graphs secondary structure fingerprints from global matches using minimum, average, and maximum match scores; 5) RNA-As-Graphs secondary structure fingerprints from local matches using minimum, average, and maximum match scores; 6) A combination of 4 and 5; … 21) A combination of 1 and 11 (all global secondary structure fingerprint variations using best match scores); 22) A combination of 2 and 12 (all local secondary structure fingerprint variations using best match scores); 23) A combination of 4 and 18 (all global secondary structure fingerprint variations using minimum, average, and maximum match scores); 24) A combination of 5 and 19 (all local secondary structure fingerprint variations using minimum, average, and maximum match scores); 25) A combination of 3 and 13 (all of our secondary structure fingerprint variations using best match scores); 26) A combination of 6 and 20 (all of our secondary structure fingerprint variations using minimum, average, and maximum match scores); 27) 2-mer; 28) 3-mer; 29) 4-mer; 30) 5-mer; 31) 6-mer; 32) A combination of 25 (all secondary structure fingerprint variations using best match scores) and 2-mer; 33) A combination of 25 and 3-mer; 34) A combination of 25 and 4-mer; 35) A combination of 25 and 5-mer; 36) A combination of 25 and 6-mer; 37) A combination of 26 (all secondary structure fingerprint variations using minimum, average, and maximum match scores) and 2-mer; 38) A combination of 26 and 3-mer; 39) A combination of 26 and 4-mer; 40) A combination of 26 and 5-mer; 41) A combination of 26 and 6-mer. In order to make fair performance comparisons between the different sets of features, the training and evaluation processes used the same neural network configuration described above. While the number of features (n) may differ from one set to another, the number of parameters per feature remains the same.

Training and Evaluation Data Set
We used the sequences and their RNA classes from the Rfam 14.1 data set [32]. Duplicate entries were eliminated, then the secondary structure scores were calculated with RNAMotif [34] to form the secondary structure fingerprints. Not all sequences in the data set could be reasonably processed with RNAMotif, as we learned that longer sequences require larger memory and computational time. We obtained 1,013,200 sequences, their RNA category, and their secondary structure scores that were successfully processed with RNAMotif [34]. We worked with this processed data set to design the deep learning architecture.
Afterwards, in order to remove redundancy in the data set, we used the CD-hit tool [48] to obtain sequences that are only at most 80% similar, a CD-hit threshold that has been used in other studies using RNA sequences as input [49], [50], including a deep learning-related study [51] and another study that uses k-mers [52]. This yielded a non-redundant data set of 63,612 sequences and their fingerprints, which is approximately 6.28% of the full data set.
Since some of the RNA classes in the processed data set (containing only the sequences that RNAMotif could process) are either low in numbers or have no data, we excluded such classes, and ended up using the following classes: Cis-reg, miRNA, CD-box, ribozyme, snRNA, HACA-box, Intron Group II, tRNA, 5S rRNA, sRNA, antisense, and riboswitch. Other studies [18], [21] also did not use all available classes in the Rfam data set. After filtering the classes from the non-redundant data set, we ended up with 59,723 sequences, which were then used to evaluate the performance of our proposed approach.
This data processing pipeline is summarized and illustrated in Fig. 5.

Additional Independent Evaluation Data Set
The models that were trained with Rfam 14.1 data were also evaluated with new sequences from Rfam 14.5 [32]. Only sequences belonging to one of the RNA classes used in training, and whose length does not exceed the maximum sequence length in our primary Rfam 14.1-based data set, were included. There were 28,599 new sequences that met our criteria. These sequences were additionally filtered with CD-hit [48] using the same threshold of 80%. We then ended up with 2,437 sequences.
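The inclusion criteria for the new sequences can be sketched as follows (the record format and names are hypothetical, for illustration only):

```python
def filter_new_sequences(records, allowed_classes, max_length):
    """Keep only new sequences whose class was seen in training and
    whose length does not exceed the longest training sequence.
    Records are (sequence, class) pairs (hypothetical format)."""
    return [(seq, cls) for seq, cls in records
            if cls in allowed_classes and len(seq) <= max_length]
```

The retained sequences would then be passed through CD-hit at the same 80% threshold before evaluation.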

Comparison With ncrna-deep
We compared the classification performance of our approach to a state-of-the-art method, ncrna-deep [29]. The scripts made available by the authors were utilized to build, train, and evaluate the models on the same data set used to evaluate our approach. Different combinations of padding schemas ("new", "constant", "random"), numbers of layers, boundary noise levels, input representations ("Hilbert", "Morton", "Snake", k-mer), and model setups offered by the authors were tested. However, we were not able to use their "improved architecture" with k-mers due to the script's inability to complete the training process on our data set. Thus, the "improved architecture" was only used with the other input representations. Additionally, the Monte Carlo dropout explored by the authors was not used. A total of 5,184 ncrna-deep configurations were tested. Among the results produced by the different configurations, we considered only the best classification performance to compare with the results from our approach.

Classification Performance
We performed 10-fold cross-validation [53] to obtain and assess the classification accuracy. To address the class imbalance, the data splitting takes into account the number of elements in each RNA class. In other words, the distribution of classes is retained in each fold, and none of the folds are missing data from any of the RNA classes. In each training session of a fold, a deep learning model of the previously discussed architecture is trained from scratch on 90% of the data, and the model performance (and the performance of the feature set being used) is then assessed by having it classify the remaining unseen 10% of the data. No early stopping criterion was used; we report the accuracy, precision, and recall of the trained model after 120 epochs. The averaged results across the different folds are shown in Table 1. Note that an untrained model or a random classification would result in an average accuracy of 8.33%, as there are 12 RNA classes into which a sequence could be classified.
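In practice one would typically use a library routine such as scikit-learn's StratifiedKFold for this; a self-contained sketch of the stratified fold assignment (our own illustration) could be:

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Assign each sample index to a fold so that every fold keeps
    (approximately) the class distribution of the full data set."""
    folds = [[] for _ in range(n_folds)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        # Round-robin within each class spreads it evenly over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds
```

Each fold in turn serves as the held-out 10% while the other nine folds form the 90% training split.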

Evaluation With New Sequences From Rfam 14.5
Each of the 10 models trained in our 10-fold validation was used to predict the new sequences obtained from Rfam 14.5, and the results were averaged across the 10 models. These results are shown in S1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2021.3118358/.

Comparison With ncrna-deep
10-Fold Classification Performance
We evaluated Noviello et al.'s approach with 10-fold validation as well, on the same data set and with the same 10-fold split; see Table 2. Among all of the tested configurations, the best in terms of overall accuracy, recall, F1-score, and MCC was the one that utilized 3 layers, the random padding schema, 25 epochs of training with a learning rate of 0.0005, and 1-mer encoding. Its 10-fold accuracy, recall, and F1-score were all 88.20%, while its MCC was 83.93%; this configuration also achieved a precision of 89.33%. A similar configuration (random padding, 20 epochs with a learning rate of 0.001, 3 layers, 1-mer encoding) achieved a slightly higher precision of 89.56%, but its accuracy and recall were slightly lower at 88.07%, its F1-score at 88.14%, and its MCC at 83.75%.
Based on this preliminary comparison, which does not include the use of the "improved model" [29] with k-mer encodings or the Monte Carlo dropout, our approach (particularly the combination of all secondary structure fingerprint types based on best-match scores, together with 6-mer) achieved a slightly higher classification performance on all metrics.

New Data Prediction from Rfam 14.5
We also assessed the classification performance of the trained models with new data from Rfam 14.5, using the same set of new sequences used to assess our Rfam 14.1-trained models.
This turned out to be a challenging data set for both approaches, ours and ncrna-deep. The highest overall accuracy, F1-score, and recall, at 51.59%, 57.94%, and 51.59% respectively, were achieved by a configuration that uses the "new" padding schema, trains for 10 epochs with a learning rate of 0.0005, a 3-layered model, and the 1-mer encoding. In terms of precision and MCC, this configuration reached 71.67% and 31.95% respectively. Meanwhile, the best precision, at 76.19%, and the best MCC, at 32.08%, were achieved by two other configurations.
Unlike in the primary 10-fold validation, Noviello et al.'s approach performed better than ours on this data set.

Discussion
When it comes to secondary structure fingerprints, the results show that our fingerprints based on common secondary structure motifs performed better than the RAG-based fingerprints alone. One possible reason is that the features based on RAG-represented secondary structures are too general. Alternatively, the scores based on free energy might be more informative than the RNAMotif "bit" scores [34] for classification using deep learning. We also noticed that combining different variations of the secondary structure match scores that make up the fingerprints and using them together may result in improved classification performance. For example, using both global and local secondary structure match scores together for the RAG-based fingerprints resulted in higher classification accuracy, recall, and precision compared to using them separately. This applies both to the RAG-based fingerprints that use only the score of the best match and to the fingerprints that use the minimum, average, and maximum scores when there are multiple structural matches per graph for a sequence.
Similarly, using the minimum, average, and maximum free energy scores together, instead of only the minimum free energy (representing the best match), for the common motifs based fingerprints yielded improved results.
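The min/average/max aggregation discussed in the last two paragraphs can be sketched as follows; the helper name and score values are hypothetical, and in the actual fingerprints the inputs would be RNAMotif bit scores or free energies of the matches against one descriptor.

```python
# Hedged sketch of collapsing multiple structural-match scores per
# descriptor into (min, average, max) fingerprint components, as
# described in the text. Function name and values are illustrative.
def aggregate_scores(scores):
    """Collapse a list of match scores into (min, avg, max)."""
    if not scores:  # no match against this descriptor
        return (0.0, 0.0, 0.0)
    return (min(scores), sum(scores) / len(scores), max(scores))

# e.g. three matches of one sequence against one descriptor
fingerprint_part = aggregate_scores([12.5, 8.0, 10.5])
```

Keeping all three statistics, rather than only the best-match score, is what lets the fingerprint retain information about non-optimal conformations.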
Given that positional information had been shown to improve classification when using k-mer features [54], [55], we were curious to see whether positional information could also improve the performance of secondary structure fingerprints. In a previous study [26], when using secondary structure fingerprints that do not take positions into account, the best accuracy, precision, and recall were 64.06%, 73.10%, and 63.90% respectively. Similarly, in this study, the overall best performing secondary structure fingerprints that do not incorporate positional information of the structural matches achieved an accuracy of 68.55%, a recall of 68.40%, and a precision of 72.20%; this was a combination of the common motifs based fingerprints that both allow and disallow wobble pairs, derived from the minimum, average, and maximum free energy of structural matches. Meanwhile, the best performing set of secondary structure fingerprints without any sequence features does take the positions of matches into account. It consists of all of the variations covered in this study, that is, a combination of RAG-based fingerprints from local and global structural matches taking the minimum, average, and maximum match scores, together with common motifs based fingerprints both with and without wobble pairs, from global and local matches, also using the minimum, average, and maximum free energy scores. This achieved a 10-fold validated accuracy of 79.05%, and a 10-fold validated recall and precision of 79.00%.
When it comes to sequence-based features alone, 4-mer performed best, with 86.71% accuracy, 86.80% recall, and 87.10% precision. None of the feature sets consisting only of secondary structure fingerprints managed to exceed this classification performance.
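For reference, the k-mer sequence features discussed here can be sketched as normalized k-mer frequencies over the RNA alphabet; this is an illustrative encoding (the function name is hypothetical, and the paper's exact normalization may differ).

```python
# Minimal sketch of k-mer frequency features for an RNA sequence.
# Illustrative only; not necessarily the encoding used in the paper.
from itertools import product

def kmer_features(seq, k):
    """Return normalized counts for all 4**k k-mers over A, C, G, U."""
    vocab = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = {kmer: 0 for kmer in vocab}
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases
            counts[window] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[kmer] / total for kmer in vocab]

vec = kmer_features("GGGAAACCC", 2)  # 4**2 = 16-dimensional vector
```

Note that the feature dimension grows as 4^k, which is one reason higher k values can dilute the signal when used without additional features.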
Combining the RAG-based fingerprints (both global and local, based on best matches), the common motifs based fingerprints (both global and local, both allowing and disallowing wobble pairs, also based on best matches), and 6-mer yielded the highest 10-fold classification accuracy of 91.04%, recall of 91.10%, and precision of 91.00%. Interestingly, when the k-mers are used independently, without any of the secondary structure fingerprints, the classification performance declines as k increases beyond the best performing 4-mer. However, when combined with the secondary structure fingerprints, the classification performance improved as k increased to 5 and 6.

Current Limitations
Fig. 6 shows the confusion matrices when using the best set of features, (a) without k-mers and (b) with k-mers. As can be seen, the performance varies drastically between the different classes (RNA types), with tRNA and HACA-box showing the best and worst performance respectively. As previously noted, the computational identification of snoRNAs, both CD-box and HACA-box, is particularly difficult [56], [57]. Specialized programs have been developed to identify them in genomic sequences. For instance, snoReport [56], [57] first locates specific sequences: boxes C and D, or H and ACA. These sequence signals are required to be within a specific distance of each other. Constrained secondary structures are then predicted around these boxes. Features are extracted based on the minimum free energy, the number of paired and unpaired nucleotides, and the distances between the boxes. Finally, this information is fed to a support vector machine. In the case of snoStrip, information about the target sequence is also used [58].
In the case of HACA-box snoRNAs, the features used in the current study are not sufficient to classify them reliably. CD-box snoRNAs, however, are well predicted. In fact, CD-box snoRNAs are the type of RNA that benefits the most from combining secondary structure fingerprints and k-mers: a sensitivity increase of 27.10% was observed when combining these two types of features, see Table 3.
From the study, we have also identified another limitation of our approach: the computational resources (primarily memory, especially for longer sequences) required by the RNAMotif program [34] to find matches against the descriptors and to score them. As a result, the data set used for training and evaluation only contains RNA sequences up to 148 nucleotides in length. In turn, some RNA classes with longer sequences (such as Intron Group I) were excluded from the deep learning data set as well. A possible solution for a future study would be to use shorter (but more numerous) secondary structure descriptors to match and score sequences against, leading to reduced memory requirements.

CONCLUSION
We have designed RNA secondary structure fingerprints, based on "RNA-As-Graphs" (RAG) [28] and on common small RNA secondary structure motifs, to be used as machine learning features for the purpose of classifying sequences into specific ncRNA classes. The principle behind the design was to include in the features possible but non-optimal secondary structure conformations that would otherwise be discarded if a single predicted secondary structure were used to build the features. The RAG-based fingerprints are formed by matching and scoring sequences (using RNAMotif "bits" scoring [34], which supports pseudoknots) against graph representations of secondary structures. Meanwhile, the fingerprints based on common motifs are formed by finding the matches of each curated motif against the sequence and taking the free energy values of the matches.
We then used and evaluated the secondary structure fingerprints and k-mers with deep learning, in order to classify ncRNA from the Rfam 14.1 database [32] into their classes.
Our results show that fingerprints based on common secondary structure motifs performed better than the RAG-based fingerprints when the different types of fingerprints are used separately. However, when the different variations of the secondary structure fingerprints are combined and used together, the classification performance typically improved (although not in all cases). We have shown that positional information does improve the performance of our secondary structure fingerprints. The best performing set consisting only of secondary structure fingerprints contains all of the secondary structure fingerprint variations covered in this study used together (i.e., combining global and local RAG-based fingerprints based on minimum, average, and maximum match scores; global and local common motifs based fingerprints using minimum, average, and maximum free energy values without allowing wobble pairs; and global and local common motifs based fingerprints using minimum, average, and maximum free energy values that allow wobble pairs). When used as deep learning features to classify ncRNA into their classes, this set achieved a 10-fold validated accuracy of 79.05%, a recall, precision, and F1-score of 79.00%, and an MCC of 70.76%.
Although sequence-based features used alone outperformed the secondary structure fingerprints, we observed that combining both types of features can result in better classification performance. Specifically, combining 6-mer with all variations of the RAG-based fingerprints and all variations of the common motifs based fingerprints that use best match scores resulted in the highest classification performance in our study: a 10-fold validated accuracy, recall, precision, F1-score, and MCC of 91.04%, 91.10%, 91.00%, 90.90%, and 87.51% respectively. However, our current evaluation of the approach is limited in terms of sequence length, as we found that RNAMotif [34] in our current pipeline requires a large amount of memory for longer sequences, especially with the RAG-based descriptors (possibly due to the inclusion of pseudoknots). A future study that addresses this limitation could further evaluate our proposed approach with longer sequences.

Fig. 1. Overview of the match-finding and scoring process used to form secondary structure fingerprints.

Fig. 6. Confusion matrices for the best combination of features, (a) without k-mers and (b) with k-mers. In both cases, the data corresponds to fold 10 of a 10-fold experiment.

TABLE 2
Comparison of the Global-Local RNA Structure Fingerprints to Noviello et al.'s on 10-Fold Cross Validation. Reported metrics are accuracy, recall, precision, F1-score, and Matthews Correlation Coefficient (MCC).

TABLE 3
Average Sensitivity Per Type, Sorted by the Sensitivity Difference Without and With k-mer Features. Values are averages over 10 folds with standard deviation; Diff. is the difference between the two values.