LNCRI: Long Non-Coding RNA Identifier in Multiple Species

The pervasive transcription of long non-coding RNAs (lncRNAs) in mammalian genomes has changed our protein-centric view of the genome. Identifying lncRNAs is an important step towards discovering their functional roles across species. The rapid development of next-generation sequencing technology has created the opportunity to discover many lncRNA transcripts. However, the cost and time-consuming nature of transcriptomic verification techniques have limited the research community's ability to focus on lncRNA identification. To overcome these challenges, we developed LNCRI (Long Non-Coding RNA Identifier), a novel machine learning (ML)-based tool for the identification of lncRNA transcripts. LNCRI uses weighted k-mer, pseudo nucleotide composition, hexamer usage bias, Fickett score, open reading frame information, UTR regions, and HMMER score as its feature set. LNCRI outperformed existing models in distinguishing lncRNA transcripts from protein-coding mRNA transcripts with high accuracy in human and mouse. LNCRI also outperformed existing tools for cross-species prediction on chimpanzee, monkey, gorilla, orangutan, cow, pig, frog and zebrafish. We applied the SHAP algorithm to demonstrate the importance of the most dominant features used in the model. We believe our tool will help the research community identify lncRNA transcripts with high accuracy. The benchmark datasets and source code are available on GitHub: http://github.com/smusleh/LNCRI.


I. INTRODUCTION
About 2% of the human genome encodes proteins, and the rest consists of non-coding regions that do not ultimately produce proteins [1]. The intricate transcriptional landscape in humans has opened a new paradigm of pervasive transcription, which led to the discovery of novel non-coding RNAs (ncRNAs) and their roles in cellular and functional processes. Long non-coding RNAs (lncRNAs) are defined as ncRNAs longer than 200 nucleotides, and mutations in their sequences and their dysregulation have recently been linked to many diseases [2]. H19 was one of the earliest discovered lncRNAs, and it shares many characteristics with protein-coding genes, including polyadenylation, splicing, localization, and transcription by RNA polymerase II [3]. XIST was among the early examples of lncRNAs and is known to be involved in X-chromosome silencing [4]. Since these pioneering discoveries of H19 and XIST, many lncRNAs have been discovered in human and other mammalian species as well as in plants [5]. LncRNAs, largely dismissed as junk regions for decades [6], are now recognized as cryptic but functionally crucial players in biomedical research. Many other lncRNAs are known to play a significant role in a multitude of human diseases, and readers are referred to [2], [7], [8], [9] for a deeper understanding of their roles in multiple cellular processes and diseases.
The pervasive transcription of lncRNAs in human [10] and mouse [11] has already been established by the FANTOM Consortium in its catalogue of lncRNAs. The FANTOM Consortium proposed over 23K lncRNA genes with highly accurate 5' ends [12]. Recent GENCODE [13] releases (version 37 for human and version 26 for mouse) list 18K and 13K lncRNA genes for human and mouse, respectively. MiTranscriptome has collected over 58K lncRNA genes in human [14]; however, it is not well established whether all of them have functional evidence. NONCODE version 6.0 compiled 96K and 87K lncRNA genes from human and mouse, respectively [15]. These landmark projects show that the discovery of lncRNAs is ongoing. Advances in next-generation sequencing (NGS) techniques have enabled the scientific community to discover a large number of lncRNA transcripts and their cellular roles [12]. Although thousands of lncRNA transcripts have already been discovered across species, a large number of lncRNAs are yet to be identified and annotated in multiple species, and this is a challenging task for several reasons. Firstly, lncRNAs may have biogenesis characteristics similar to those of messenger RNA (mRNA), as both are mainly transcribed by RNA polymerase II [16]. Secondly, lncRNAs can undergo transcriptional and post-transcriptional processes similar to those of mRNA transcripts [17]. Thirdly, mRNA and lncRNA transcripts are similar in transcript length and splicing structure [18], [19], which complicates the lncRNA identification process. Fourthly, the low expression level of lncRNAs also hinders their discovery in multiple cells or tissues [16].
To overcome the limitations of the currently available experimental technologies, many computational methods, mainly machine learning (ML)-based techniques, have been used to distinguish lncRNA transcripts from mRNA transcripts. The common formulation is to treat lncRNA and mRNA transcripts as two classes in a classification framework, and multiple ML-based methods exist for this task. One of the pioneering works in this domain was CONC ("coding or non-coding"), where the authors applied a support vector machine (SVM)-based model to distinguish mRNA transcripts, which are responsible for producing proteins, from non-coding RNA (ncRNA) transcripts [20]. The authors used mammalian and non-mammalian eukaryotic ncRNA transcripts from the RNAdb [21] and NONCODE [5] databases and mRNA transcripts from the SwissProt [22] database to train ML models. Li et al. developed PLEK ("predictor of lncRNAs and mRNAs based on an improved k-mer scheme") to distinguish lncRNA transcripts from mRNA transcripts by introducing an improved k-mer scheme [23]. The authors proposed k-mers (k=1 to 5) with sliding window sizes up to five and a step size of one to encode the whole transcript. The k-mer-based features were then fed into an SVM-based model to classify lncRNA and mRNA transcripts for human, mouse and other vertebrates. Sun et al. developed the CNCI ("Coding-Non-Coding Index") tool to distinguish protein-coding transcripts from lncRNA transcripts based on the intrinsic composition of sequences [24]. The authors proposed a novel sequence encoding based on adjoining nucleotide triplets, representing all possible pairs of consecutive triplets (64*64 possible combinations) in the sequence. Out of six reading frames, the authors selected only the most-like CDS (MLCDS) and used the length and score of the MLCDS as a feature vector. Using the proposed features, an SVM-based ML model was built to classify lncRNA transcripts and protein-coding transcripts. CNIT ("Coding-Non-Coding Identifying Tool"), an updated version of CNCI, was developed for the same purpose with higher accuracy and much faster speed [25]. Han et al. developed LncFinder, an lncRNA recognition system that uses 19 different features summarizing the sequence composition, structural information, and physicochemical properties of the nucleotides to find lncRNAs in human, mouse, chicken, zebrafish, and wheat [26].
Recently, deep learning (DL)-based models have been proposed for the identification of lncRNA transcripts. Baek et al. developed lncRNAnet [27], a DL-based model that combines a convolutional neural network (CNN) and a recurrent neural network (RNN). The RNN was used to detect the intrinsic nature of the input sequence. Variable-length sequences were handled using the bucketing technique [28] along with k-mer embedding vectors, and the network was trained using GloVe embeddings [29] for the identification of lncRNA. Yang et al. developed LncADeep for the identification of partial and full-length lncRNA transcripts by incorporating hand-curated features such as Fickett score, hexamer score, and coding sequence (CDS) length, and then feeding them into a deep belief network-based model [30]. Tripathi et al. proposed DeepLNC, a deep neural network-based model that used k-mers (k=1 to 5) from input sequences as a feature set to classify lncRNA transcripts and mRNA transcripts [31]. Interested readers may consult the reviews in [32], [33], which summarize different ML-based methods that have been proposed to identify lncRNAs.
In this study, we propose LNCRI (Long Non-Coding RNA Identifier), a novel ML-based pipeline to distinguish lncRNA transcripts from mRNA transcripts. To evaluate the prediction performance of LNCRI and compare it with other existing tools, we used benchmark datasets for multiple species. We found that LNCRI outperformed the existing state-of-the-art tools for lncRNA identification in human, mouse and eight other species. Our contributions in this work can be summarized as follows:
1) We have used the largest collection of lncRNA transcripts and mRNA transcripts from GENCODE and RefSeq for the identification of lncRNA transcripts in multiple species.
2) We proposed a novel combination of features representing weighted k-mer, pseudo nucleotide composition, hexamer usage bias, Fickett score, open reading frame information, UTR regions, and HMMER score to distinguish lncRNA transcripts from protein-coding transcripts.
3) We proposed a CatBoost-based model, LNCRI, which achieved the best performance compared to other existing methods across multiple evaluation metrics for human and mouse.
4) LNCRI outperformed other existing models for eight other species in the cross-species transcript prediction task.

II. MATERIALS AND METHODS
A. DATA COLLECTION
We collected all the lncRNA transcripts from GENCODE release 37 for human and release 26 for mouse. The GENCODE database, launched by the National Human Genome Research Institute (NHGRI) under the ENCyclopedia Of DNA Elements (ENCODE) project, is one of the largest and most reliable sources of human and mouse functional elements [34]. GENCODE mainly incorporates four types of functional elements: (a) protein-coding genes, (b) pseudogenes, (c) long non-coding RNA genes, and (d) small non-coding RNA genes [35]. The genes/transcripts in GENCODE were annotated using a computational approach supported by manual annotation and experimental validation. For lncRNA annotation, GENCODE does not apply a strict 200 bp length threshold, although very few annotated lncRNAs fall below this threshold [35]. For protein-coding transcripts, we collected all the transcripts from RefSeq [36]. The RefSeq database, established by the National Center for Biotechnology Information (NCBI) in the United States, provides a comprehensive, well-annotated, non-redundant set of sequences for mRNA transcripts and non-coding RNA transcripts. RefSeq covers sequences from multiple species, including human and mouse. From RefSeq, we considered only the transcripts with clear protein-coding potential (i.e., having an NM_ RefSeq ID). Interested readers may consult [37] for a comparison of the GENCODE and RefSeq annotation pipelines. All the transcript sequences were collected from the genome assembly versions GRCh37 and GRCm10 for human and mouse, respectively. The collection contains 97,482 lncRNA transcripts and 104,760 protein-coding transcripts from human, and 18,833 lncRNA transcripts and 37,907 protein-coding transcripts from mouse.

B. DATA SET PREPROCESSING
From the collected dataset, transcript sequences containing characters other than "A", "C", "G", or "T" (T represents U in the corresponding RNA) were discarded. We then removed duplicate sequences. To reduce redundancy in the collected dataset, we applied CD-HIT [38]. As suggested in [39], sequences with more than 80% similarity according to CD-HIT were dropped to avoid biasing the ML model. Moreover, sequences shorter than 200 nucleotides (nt) or longer than 3000 nt were removed from our analysis, as prescribed in lncRNAnet [27]. To balance the lncRNA and mRNA transcript datasets, we down-sampled the mRNA transcript dataset to be on par with the lncRNA transcript dataset. After all the pre-processing steps, we had 43,839 and 43,383 transcripts from human lncRNA and mRNA, respectively, and 3,295 and 2,828 transcripts from mouse lncRNA and mRNA, respectively. Hereafter we refer to this collection of human and mouse datasets as the permissive dataset. To check whether the CD-HIT cut-off has any significant effect on the performance of the ML model, we generated another dataset for both human and mouse based on a 60% CD-HIT cut-off. This cut-off generated 41,817 and 36,433 transcripts from human lncRNA and mRNA, respectively, as well as 3,292 and 2,381 transcripts from mouse lncRNA and mRNA, respectively. Hereafter we refer to this dataset as the stringent dataset.
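The character, duplicate, and length filters described above, together with the down-sampling step, can be sketched in a few lines of Python. This is a minimal illustration rather than the LNCRI implementation: the function names `preprocess` and `downsample` are hypothetical, upper-casing the input is an assumption, and the CD-HIT redundancy filter is an external tool that is deliberately left out.

```python
import random

VALID = set("ACGT")

def preprocess(seqs, min_len=200, max_len=3000):
    """Keep unique sequences made only of A/C/G/T with min_len <= length <= max_len."""
    seen = set()
    kept = []
    for s in seqs:
        s = s.upper()  # assumption: lower-case (soft-masked) bases are accepted
        if not set(s) <= VALID:
            continue  # discard sequences with other characters, e.g. N
        if not (min_len <= len(s) <= max_len):
            continue  # discard sequences outside the 200-3000 nt range
        if s in seen:
            continue  # discard exact duplicates
        seen.add(s)
        kept.append(s)
    return kept

def downsample(majority, target_size, seed=0):
    """Randomly down-sample the larger class to target_size sequences."""
    if len(majority) <= target_size:
        return list(majority)
    return random.Random(seed).sample(majority, target_size)
```

In this sketch the mRNA set would be passed to `downsample` with `target_size` equal to the number of retained lncRNA sequences; the fixed seed is only there to make the illustration reproducible.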

C. K-MER RELATED FEATURES
For each sequence, we counted the frequencies of mono-, di-, and tri-nucleotides across the whole transcript. We then normalized each k-mer count by the sequence length and calibrated it by the number of possible k-mer combinations. This generated 84 features (4 from mono-, 16 from di-, and 64 from tri-nucleotides) for the development of ML models.
In this normalization, $C_i$ represents the count of the $i$-th k-mer in the transcript and $L$ represents the transcript length. We also checked the observed/expected ratio of nucleotide combinations in lncRNAs and mRNAs, as suggested in [40]. Supplementary File 01 provides the details of the observed/expected ratio of all mono-, di-, and tri-nucleotides.
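As a rough illustration of the 84-dimensional k-mer feature vector, the sketch below counts overlapping mono-, di-, and tri-nucleotides and divides each count by the transcript length. The function name `kmer_features` is hypothetical, counting with a step of one is an assumption, and the calibration by the number of possible k-mers is omitted because its exact form is not given in this text.

```python
from itertools import product

def kmer_features(seq, ks=(1, 2, 3)):
    """Return length-normalised counts of all k-mers for k in ks (84 values for k=1..3)."""
    seq = seq.upper()
    length = len(seq)
    feats = {}
    for k in ks:
        # Fixed ordering of all 4**k possible k-mers, initialised to zero.
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        for i in range(length - k + 1):  # assumption: overlapping windows, step 1
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
        for kmer, c in counts.items():
            # C_i / L; the extra calibration weight is not applied in this sketch.
            feats[kmer] = c / length if length else 0.0
    return feats
```

Returning a dictionary keyed by k-mer keeps the feature order explicit, which makes it easier to line the values up with feature-importance output later on.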

D. PSEUDO K-TUPLE NUCLEOTIDE COMPOSITION (PSEKNC)
The pseudo k-tuple nucleotide composition (PseKNC) reflects the physicochemical properties and sequence-order effects of nucleotides in DNA [41], [42]. The sequence-order information is preserved through the physicochemical properties of the constituent oligonucleotides. The dimension of this feature vector is $(4^k + \lambda)$, where $k$ represents the k-mer size (a positive integer) and $\lambda$ represents the highest counted rank of the correlation along a DNA sequence. In our case, $k=3$ and $\lambda=10$ were used to generate a 74-dimensional feature vector.

E. ORF RELATED INFORMATION
The open reading frame (ORF) is a well-known property of mRNA transcripts. ORF information has been used in multiple previous studies [27], [30] to distinguish lncRNA transcripts from mRNA transcripts. We considered the length of the longest ORF across the three forward frames, starting with the start codon ("ATG") and ending with any of the stop codons ("TAG", "TAA", or "TGA"). We also considered the longest ORF coverage, defined as the ratio of the longest ORF length to the whole transcript length. This provided a total of two features for the development of ML models.
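The two ORF features can be illustrated with a short scan over the three forward frames. This is a sketch under stated assumptions, not the authors' code: the names `longest_orf` and `orf_features` are hypothetical, ORF length is counted in nucleotides including the stop codon, and an ATG without an in-frame stop codon is ignored.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF over the three forward frames.

    end is exclusive and includes the stop codon; (0, 0) means no ORF was found.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best

def orf_features(seq):
    """Return (longest ORF length in nt, ORF coverage of the transcript)."""
    start, end = longest_orf(seq)
    length = end - start
    coverage = length / len(seq) if seq else 0.0
    return length, coverage
```

For example, `orf_features("CCATGAAATTTTAGCC")` finds the ORF `ATGAAATTTTAG` in the third frame and returns a length of 12 nt with a coverage of 0.75.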

F. HEXAMER USAGE BIAS
Hexamer usage has been shown to be effective in discriminating protein-coding from non-coding sequences, because hexamers can capture potential adjacent amino acids based on codons [43]. Inspired by this phenomenon, we calculated the hexamer score as the log-likelihood ratio of a hexamer occurring in coding versus non-coding sequences. For the longest ORF, we calculated the average hexamer score over all the hexamers, as suggested in [30], in the following way:

$$\text{Hexamer score} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{FC(h_i)}{FNC(h_i)}$$

where $n$ is the total number of hexamers in a sequence. $FC(h_i)$ and $FNC(h_i)$, $i=1,2,\ldots,4096$, represent the frequency of hexamer $h_i$ in all the training coding and non-coding sequences, respectively. This provided one feature for the development of ML models.
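The average log-ratio described above can be sketched as follows. The helper names `hexamer_frequencies` and `hexamer_score` are hypothetical, and several details are assumptions made for illustration only: hexamers are taken with a step of one, the natural logarithm is used, and a small pseudocount avoids division by zero for hexamers unseen in either training set.

```python
import math
from collections import Counter

def hexamer_frequencies(seqs):
    """Relative frequency of every overlapping hexamer across a set of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - 5):
            counts[s[i:i + 6]] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}

def hexamer_score(seq, fc, fnc, pseudo=1e-6):
    """Mean log-ratio of coding (fc) to non-coding (fnc) hexamer frequency over seq."""
    seq = seq.upper()
    n = len(seq) - 5  # number of overlapping hexamers in the sequence
    if n <= 0:
        return 0.0
    total = 0.0
    for i in range(n):
        h = seq[i:i + 6]
        # pseudo is an assumed smoothing term for hexamers missing from a table
        total += math.log((fc.get(h, 0.0) + pseudo) / (fnc.get(h, 0.0) + pseudo))
    return total / n
```

In use, `fc` and `fnc` would be built once from the training coding and non-coding sequences, and `hexamer_score` would then be applied to the longest ORF of each transcript. A positive score indicates hexamers that are more typical of coding sequences.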

G. FICKETT SCORE
In 1982, Fickett [44] demonstrated that coding regions may have asymmetric codon bias and nucleotide content that can be used to distinguish protein-coding regions from non-coding regions. For each nucleotide (A, C, G, or T), we calculated the composition favored by the first, second, and third codon positions in a transcript.
Here, $A_1$, $A_2$, and $A_3$ represent the number of A's in the first, second, and third codon positions of a transcript. We calculated the Fickett score for C, G, and T in the same way. This provided four features for the development of ML models.
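To make the idea of position asymmetry concrete, the sketch below computes the classical position statistic from Fickett's method, max(A1, A2, A3) / (min(A1, A2, A3) + 1), for each base. It is offered only as an illustration of the concept: the function name `fickett_position` is hypothetical, codon positions are assumed to be counted from the first nucleotide of the sequence, and the formula and scaling actually used in LNCRI may differ from this classical form.

```python
def fickett_position(seq):
    """Position asymmetry max/(min + 1) for each base over the three codon positions."""
    seq = seq.upper()
    # counts[base][p] = occurrences of base at codon position p (0, 1, 2)
    counts = {b: [0, 0, 0] for b in "ACGT"}
    for i, base in enumerate(seq):
        if base in counts:
            counts[base][i % 3] += 1  # assumption: frame starts at index 0
    return {b: max(c) / (min(c) + 1) for b, c in counts.items()}
```

A base that is spread evenly over the three positions gets a value near one or below, while a base concentrated in one position gets a larger value, which is the asymmetry that coding regions tend to show.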

H. UTR REGIONS
For mRNA transcripts, untranslated regions (UTRs) may contain specific patterns compared to lncRNA transcripts [30]. To capture such characteristics, we first identified the longest ORF. We then considered the region upstream of the start codon and the region downstream of the stop codon as the 5' UTR and 3' UTR, respectively, as suggested in [30]. We calculated the ratio of each UTR length to the transcript length as the UTR coverage, which generated two features. We also computed the GC content of the 5' UTR and 3' UTR, which provided two more features for the development of ML models.
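The four UTR features follow directly from the coordinates of the longest ORF. The sketch below assumes those coordinates are already known and are given as a start index and an exclusive end index just past the stop codon; the names `gc_content` and `utr_features` and the dictionary keys are hypothetical.

```python
def gc_content(region):
    """Fraction of G and C bases in a region (0.0 for an empty region)."""
    region = region.upper()
    if not region:
        return 0.0
    return (region.count("G") + region.count("C")) / len(region)

def utr_features(seq, orf_start, orf_end):
    """Coverage and GC content of the regions flanking the longest ORF.

    orf_start is the index of the start codon; orf_end is the exclusive end
    index just past the stop codon.
    """
    utr5 = seq[:orf_start]   # upstream of the start codon
    utr3 = seq[orf_end:]     # downstream of the stop codon
    n = len(seq)
    return {
        "utr5_coverage": len(utr5) / n if n else 0.0,
        "utr3_coverage": len(utr3) / n if n else 0.0,
        "utr5_gc": gc_content(utr5),
        "utr3_gc": gc_content(utr3),
    }
```

For a transcript with no detectable ORF, the choice of coordinates is a design decision that this sketch leaves to the caller.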

I. CONSERVATION SCORE
As lncRNA genes are less conserved than protein-coding genes, the conservation profile is a good distinguishing feature for separating mRNA transcripts from lncRNA transcripts. We aligned each transcript against Pfam [45] version 34 using HMMER [46]. Based on the alignment, we extracted eleven features for the development of ML models. These include the bit score of the overall sequence alignment and of the matched domain, the e-value of the overall sequence alignment and of the matched domain, the lengths of the query and target sequences, and the mean posterior probability reflecting how reliable the alignment was, among others. Additionally, we used the HMM alignment ratio (the ratio of the length of the aligned region to the length of the input sequence) for the development of ML models.

J. DEVELOPMENT OF CLASSIFICATION MODELS
After the data collection step, we used the selected features to build different classifiers to distinguish lncRNAs from protein-coding transcripts. We used the Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF), eXtreme Gradient Boosting (XGB), and CatBoost algorithms to classify these two types of transcripts. The mRNA and lncRNA transcripts were considered as the positive and the negative set, respectively, for the development of ML models. We used GridSearchCV from the Python scikit-learn library for hyperparameter tuning with five-fold cross-validation. The details of parameter optimization are provided in Supplementary File 02. We applied a five-fold cross-validation technique to evaluate the performance of the model. We used 80% of the data as the training set and the remaining 20% as the test set. We used the following performance evaluation metrics for the models:

$$\text{Sensitivity}\ (S_n) = \frac{tp}{tp + fn}$$

$$\text{Specificity}\ (S_p) = \frac{tn}{tn + fp}$$

$$\text{Accuracy}\ (Acc) = \frac{tp + tn}{tp + tn + fp + fn}$$

where tp, fp, tn and fn represent the number of true positive, false positive, true negative and false negative samples, respectively, predicted by the model.
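The evaluation metrics can be computed from paired true and predicted labels as sketched below. The function name `classification_metrics` is hypothetical; following the text, mRNA is treated as the positive class (label 1) and lncRNA as the negative class (label 0). Specificity and accuracy are given in their standard forms, which is an assumption about the definitions used in the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return sensitivity (Sn), specificity (Sp) and accuracy (Acc)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    sn = tp / (tp + fn) if tp + fn else 0.0   # tp / (tp + fn)
    sp = tn / (tn + fp) if tn + fp else 0.0   # tn / (tn + fp)
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc}
```

With mRNA as the positive class, Sn measures how well mRNA transcripts are recovered and Sp measures how well lncRNA transcripts are recovered.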

III. RESULTS
A. FICKETT SCORE AND HEXAMER SCORE PATTERNS AT THE LNCRNA AND MRNA TRANSCRIPTS
The Fickett score represents the combined effect of nucleotide (nt) composition and codon usage bias [43]. It also reflects the degree to which a nt is favored in codon positions. Figure 1 shows the distribution of Fickett scores for nts in human. The Fickett score for C was relatively high in lncRNA transcripts compared to mRNA transcripts (lncRNA:mRNA = 0.054 ± 0.017 : 0.023 ± 0.016). On the other hand, the Fickett score for T was relatively high in mRNA compared to lncRNA (lncRNA:mRNA = 0.016 ± 0.011 : 0.039 ± 0.032). In summary, this indicates higher codon usage bias for C in lncRNA and for T in mRNA. Figure 2 shows that hexamer scores were relatively higher in mRNA transcripts than in lncRNA transcripts, reflecting the specific hexamer patterns that are prevalent in mRNA transcripts compared to lncRNA transcripts. This aligns with previous studies showing that protein-coding genes contain specific hexamer patterns, which are rare in the rest of the genome [47], [48].

B. SEQUENCE PATTERNS AT THE UTR REGIONS OF LNCRNA AND MRNA TRANSCRIPTS
3' UTRs are generally longer than 5' UTRs [49], and the same pattern is observed for both lncRNA transcripts and mRNA transcripts (Figures 3a and 3b). For both the 5' and 3' UTR, the length (UTR ratio) was relatively high for mRNA transcripts compared to lncRNA transcripts. The GC content in the genic region of lncRNAs [50] and in the promoter region of lncRNA models [33], [48] is not as enriched as in protein-coding genes. However, the UTR regions of lncRNA transcripts are more GC-enriched than those of mRNA transcripts (Figures 3a and 3b). UTRs with high GC content tend to enhance gene regulation ability [49], which is one of the known functions of lncRNAs [51].

C. ORF AND HOMOLOGY OF LNCRNA AND MRNA TRANSCRIPTS
A long putative ORF is highly unlikely to be present in a random sequence, including non-coding sequences, and an ORF longer than 100 codons is usually considered highly likely to be a protein-coding sequence. Accordingly, we observed higher ORF length and coverage in the protein-coding transcripts (Figures 4a and 4b), though some lncRNAs may have long ORFs. This aligns with the known ORF length distribution in the literature [50].
Figure 3a shows the ratios and GC contents at 3' UTR, while Figure 3b shows ratios and GC contents at 5' UTR.
As protein-coding genes are more conserved than lncRNA genes, a homology search using HMMER against the Pfam database yields higher similarity for mRNAs than for lncRNAs, and this pattern is clearly observed in Figures 5a and 5b. Supplementary File 03 provides the distributions of hexamer usage bias, Fickett score, open reading frame, UTR regions, and HMMER score in mouse. Table 1 highlights the performance of ML models on the different types of features used in the model. Based on the ablation study, the CatBoost-based model performed the best among all the models we evaluated. The performance of the XGBoost and CatBoost models was very close, but the CatBoost-based model slightly outperformed the XGBoost model (Table 1).

D. PERFORMANCE OF LNCRI IN HUMAN AND MOUSE DATASETS
Among the feature types, weighted k-mer, PseKNC and UTR-based features had distinguishing capability of ~85%, ~85%, and ~80% Acc, respectively, in both human and mouse (Figure 6). Fickett score (Acc of ~79%:~73% in human:mouse) and hexamer score (Acc of ~76%:~73% in human:mouse) based features showed better performance in human compared to mouse (Figure 6). Interestingly, we observed more distinguishing capability for ORF (Acc of 77%:~84% in human:mouse) and HMMER-based features (Acc of ~88%:~97% in human:mouse) in mouse compared to human (Figure 6). Combining all the features, the CatBoost-based model achieved the best performance, with 93% Sn and 95% Sp for human and 97% Sn and 99% Sp for mouse (Figure 6, Table 1).

E. PERFORMANCE OF LNCRI AND OTHER EXISTING TOOLS ON PERMISSIVE DATASET
We compared the performance of LNCRI against state-of-the-art models for the classification of lncRNA transcripts and mRNA transcripts. We used the benchmark datasets for human and mouse to compare the performance of LNCRI against six other tools: CPC2, CNCI, PLEK, CNIT, CPAT and LncADeep. For human, LNCRI outperformed all the tools for mRNA transcript prediction (Table 2) and performed very close to LncADeep for lncRNA transcript prediction. For mouse, LNCRI outperformed all the tools we compared for both mRNA and lncRNA transcripts.

F. PERFORMANCE OF LNCRI IN CROSS-SPECIES PREDICTION TASK
We also evaluated the performance of LNCRI on multiple species: Danio rerio (Zebrafish), Xenopus tropicalis (Frog), Bos taurus (Cow), Pan troglodytes (Chimpanzee), Sus scrofa (Pig), Macaca mulatta (Monkey), Gorilla gorilla (Gorilla), and Pongo abelii (Orangutan). The benchmark datasets for these species were collected from [23]. For this purpose, we trained the model using the human dataset and fed the other species' datasets into the trained model for inference only. To evaluate the cross-species prediction performance of LNCRI, we compared it against two other tools: CNCI and PLEK. For almost all species, LNCRI outperformed the tools that we tested (Table 3).

G. PERFORMANCE OF LNCRI AND OTHER EXISTING TOOLS ON STRINGENT DATASET
To avoid any bias and to check whether the CD-HIT cut-off has any significant effect on the performance of LNCRI, we generated stringent datasets for both human and mouse based on a 60% CD-HIT cut-off. The performance of LNCRI against other existing tools on the stringent datasets is highlighted in Table 4. We can observe that LNCRI performed at almost the same level on both the permissive and stringent human datasets (Table 2 and Table 4). As with the human permissive dataset, LNCRI and CNIT performed the best for mRNA and lncRNA transcript prediction, respectively, on the stringent dataset (Table 2 and Table 4). For the mouse stringent dataset, the performance of LNCRI and the majority of other existing tools dropped slightly (Table 4) compared to their performance on the permissive dataset (Table 2). However, LNCRI performed the best for the mouse mRNA transcript prediction task with 94.43% Acc. For lncRNA transcript prediction in mouse, the performance of LNCRI (96.05% Acc) was very close to that of the highest performing tool, CNCI (97.04% Acc) (Table 4).
Figure 5a shows domain score distribution, while Figure 5b shows align ratio distribution.

IV. DISCUSSIONS
In this article, we proposed LNCRI, an ML-based model to identify lncRNAs in human, mouse and eight other species. The proposed model considered seven different types of features, namely weighted k-mer, PseKNC, ORF, Fickett score, UTR information, hexamer score, and HMMER features, to distinguish lncRNAs from mRNAs. LNCRI demonstrated better performance in classifying lncRNA transcripts and mRNA transcripts in both human and mouse compared to other existing tools (Table 2, Table 4). As LNCRI performed at almost the same level on both the permissive and the stringent datasets, the proposed model is highly unlikely to be biased. Moreover, LNCRI trained on the human dataset achieved high Acc in the cross-species prediction task for almost all species we tested, indicating the effectiveness of LNCRI in poorly annotated species (Table 3).
To explain the proposed LNCRI model, we investigated the features that contributed most in the CatBoost-based model to distinguishing lncRNA transcripts from mRNA transcripts in human. We used the SHapley Additive exPlanations (SHAP) algorithm [52] to identify the features that contributed most to this task. Figure 7 highlights the top-ranked nine features based on the SHAP values identified from the boosting model. Positive SHAP values for the influential features drive the model towards the mRNA class, whereas negative SHAP values drive the model towards the lncRNA class. Among these dominant features, two were from the weighted k-mer group: TAA and CAC. Higher values of TAA and CAC drive the model towards lncRNA prediction (Figure 7). The obs/exp ratios of TAA and CAC were also higher in lncRNA transcripts compared to mRNA transcripts. Figure 7 also shows that the Fickett scores for the T-, G- and A-base positions were identified as influential features, and the impact of T was more dominant than that of G and A. Hexamer score and ORF length were also influential features, and they hold relatively lower values in lncRNA. Query sequence length (qlen) and conserved region alignment ratio (HMM_align_ratio) were also suggested by the SHAP algorithm as dominant features, indicating the importance of incorporating sequence length and alignment length information into the prediction model.

V. CONCLUSION
This article proposes LNCRI, a novel ML-based model to distinguish lncRNA transcripts from mRNA transcripts in human, mouse, and other species. LNCRI outperformed many of the existing state-of-the-art tools for lncRNA transcript identification in the considered species. Considering the low expression level and evolving annotations of lncRNA, its identification is a challenging task. To overcome the challenges, we have used the most extensive collection of lncRNA and mRNA transcripts to build a highly accurate ML-based model for the task of lncRNA identification. We believe LNCRI will provide more insights into lncRNAome by enabling the discovery of lncRNA transcripts with increased accuracy.

SALEH MUSLEH is a PhD student at the College of Science and Engineering of Hamad Bin Khalifa University. He has considerable experience in applied data science research, discovering and communicating the value that data science brings to organizational decision making and to understanding and optimizing opportunities. His research interests include developing and applying machine learning algorithms to help the bioinformatics research community answer complex questions and make discoveries in bioinformatics big data. His current research focuses on long non-coding RNA (lncRNA) identification using a novel machine learning-based pipeline based purely on sequence information.
DR. MOHAMMAD TARIQUL ISLAM is an Assistant Professor in the Computer Science Department at Southern Connecticut State University. Dr. Islam's primary area of research is computer vision, deep learning, and applied bioinformatics. He has published at notable peer-reviewed conferences and journals, such as Computer Vision and Pattern Recognition, International Conference on Image Processing, International Conference on Bioinformatics and Biomedicine, International Journal on Image and Video Processing, etc. He is the recipient of several grants on the application of deep learning in computer vision and has supervised multiple graduate students in their pursuit of a master's degree.
DR. TANVIR ALAM is an Assistant Professor at the College of Science and Engineering of Hamad Bin Khalifa University. His notable research includes work on the transcriptional regulation of non-coding RNAs and their roles in different diseases. His research also centers on the application of artificial intelligence (AI) to the diagnosis and prognosis of communicable and non-communicable diseases. He is a member of the FANTOM Consortium. He has also served as a reviewer for a number of international conferences and reputed journals.