DL-CRISPR: A Deep Learning Method for Off-Target Activity Prediction in CRISPR/Cas9 With Data Augmentation

The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated (Cas) system is a popular and easy-to-use gene-editing technique, but it carries off-target risk. Cleavage at off-target sites can severely harm cells, so in silico methods are needed to help avoid it. Most existing in silico approaches rely on a relatively small positive dataset, and the data imbalance issue remains unresolved. Moreover, some samples once considered negative were later proved to be positive. It is therefore essential to refresh the dataset and develop more accurate off-target activity prediction programs. In this work, we first extended the current positive dataset and explored the potential differences between positive and negative data based on the new dataset. We then adopted a new data augmentation method to address the data imbalance issue, and used an ensemble strategy to take more negative data into consideration, bringing the model closer to the real scenario while keeping the training data balanced. Finally, we developed DL-CRISPR, a deep learning framework to predict off-target activity in CRISPR/Cas9. DL-CRISPR is evaluated and compared with other state-of-the-art methods on three kinds of datasets: 5-fold cross validation test datasets, putative off-target datasets related to specific single guide RNAs (sgRNAs), and putative off-target datasets related to unseen sgRNAs. DL-CRISPR achieves the best average accuracy, 98.57%, on the 5-fold cross validation datasets and correctly detects more off-targets on datasets related to both seen and unseen sgRNAs.


I. INTRODUCTION
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/CRISPR-associated (Cas) system [1], [2] has been widely used in gene editing [3], [4] due to its specificity and ease of use. It uses a single guide RNA (sgRNA) to guide the Cas9 nuclease to cleave DNA at a specific site, where the sgRNA is composed of a 20-nt protospacer sequence and a 3-nt protospacer adjacent motif (PAM), usually a sequence of NGG. However, the CRISPR/Cas9 system has a potential off-target risk [5]: it may cut unintended sites that carry several nucleotide mismatches. Off-target cleavage can lead to harmful mutations that impair or even kill the cells. Therefore, it is essential to identify the possible off-target sites so that researchers can check for mutations after the genomic cut.
The associate editor coordinating the review of this manuscript and approving it for publication was Minho Jo.
A variety of biological assays are available to find off-targets. Polymerase Chain Reaction (PCR) is the most reliable way. Wang et al. used integrase-defective lentiviral vectors (IDLVs) to measure off-target activity and detected off-targets with frequencies as low as 1% [6], and Ran et al. proposed BLESS, which uses small Cas9 enzymes to improve gene editing efficiency [7]. However, PCR methods are impractical when a large number of sites must be screened to find only a small number of off-targets. Therefore, a range of in vitro and cell-based techniques have been developed for unbiased, genome-wide detection of off-target sites. In vitro methods can detect off-target sites with low mutation frequency: for example, Kim et al. introduced Digenome-seq to profile off-target sites using whole genome sequencing [8], Tsai et al. developed CIRCLE-seq by reducing random reads in Digenome-seq [9], and Cameron et al. presented SITE-seq by enriching and tagging Cas9 cleavage sites in Digenome-seq [10]. Cell-based methods, in turn, can identify mutations in particular cell types and conditions: for instance, GUIDE-seq finds off-target sites by tagging DNA double-strand breaks (DSBs) with small double-stranded oligonucleotides [11], BLISS labels the DSBs directly [12], and TEG-seq enriches the target in GUIDE-seq [13].
Economical and effective in silico approaches have also been presented. Early in silico methods mainly relied on mathematical statistics that consider mismatches between sgRNA and DNA: for example, the MIT score [14] assigned weights according to mismatch position, identity and density between on-targets and off-targets, and the CFD score [15] examined the 1-nt mutations of a substantial number of sgRNAs. In recent years, the boom in machine learning has produced many methods in this area, which have effectively improved off-target prediction accuracy. Abadi et al. developed CRISTA to estimate off-target probability, using a Random Forest to learn a regression model [16]; Listgarten et al. proposed Elevation to predict off-target activities, which relies on scoring and aggregating machine learning models [17]; Peng et al. presented an SVM ensemble learning method to determine the off-target propensity of an sgRNA [18]; and Chuai et al. implemented DeepCRISPR to design optimal sgRNAs as well as predict the off-target profile with deep learning [19].
However, most machine-learning-based off-target activity prediction programs rely on a previously integrated positive dataset [20], and the data imbalance issue in CRISPR/Cas9 still exists, which may introduce bias into the model and make the scenario built on it inconsistent with practice [21].
To solve the above problems, in this work we first collected data from a series of in vitro and cell-based assays to increase the quantity of positive data as well as the model competency. We then applied a novel data augmentation method to the positive data to increase the amount of training data. An ensemble strategy is also employed to bring the model closer to the real scenario while keeping the training data balanced. Finally, combined with a Convolutional Neural Network (CNN), we introduced DL-CRISPR, a deep learning method to evaluate off-target activity in the CRISPR/Cas9 system. We evaluated DL-CRISPR and compared it with other state-of-the-art methods from three aspects: our newly constructed 5-fold cross validation test datasets, putative off-target sites related to two specific sgRNAs, and off-target sites for three unseen sgRNAs on a mouse gene. DL-CRISPR outperforms the other methods on all kinds of datasets: it achieves accuracy (Acc) from 98.40% to 98.72% on the 5-fold cross validation datasets, ranks more off-targets in top positions according to their probability scores in the datasets related to two specific sgRNAs, and detects 4, 39, and 30 more off-targets for the three unseen mouse sgRNAs than other methods.

A. DATASETS
There are three types of off-targets: nucleic acid mismatch with the on-target sequence, nucleic acid deletion from the on-target sequence, and nucleic acid insertion into the on-target sequence. Here we focus only on the mismatch (mutation) type off-targets, which have sequence length 23 and account for a large proportion of all off-targets.
As new experimentally verified data have emerged, we first collected off-target sequences together with their sgRNAs from in vitro [8]-[10], [22], [23] and cell-based genome-wide assays [11]-[13], [24]. Among these assays, three contain PCR validation experiments [8], [22], [23]. Because we hypothesized that in vitro and cell-based assays may not be fully reliable sources of off-target data, owing to differences between their experimental settings and those of PCR assays, we further split the off-targets into two types. We defined reliable off-targets as off-target sequences that are either detected by at least two different in vitro or cell-based studies, or validated by PCR, and defined "detected" off-targets as those that are not validated by PCR and are detected by only one in vitro or cell-based study. With this definition, 957 unique reliable off-targets and 5,435 "detected" off-targets were obtained from the in vitro and cell-based assays. From these we deleted the sgRNA 'TCATCCTCCTGACAATCGATAGG' on gene CCR5_9 together with its off-target site, as there is only 1 sample for this sgRNA. We then downloaded off-targets obtained by PCR assays [18], which gave 215 more samples. We integrated the reliable off-targets obtained above, without repetition, to construct the final positive dataset. The complete positive dataset contains 1,128 unique reliable off-target sequences related to 28 sgRNAs. The process of splitting reliable and "detected" off-targets and the corresponding data amounts are illustrated in Fig. 1, and an overview of the assays from which we collected data is given in Supplementary Table S1.
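The reliable versus "detected" rule described above can be sketched in a few lines. This is only an illustration of the stated definition, not the authors' pipeline: the record layout `(sgRNA, off_target_seq, study_id, pcr_validated)` and the function name `split_off_targets` are assumptions made for this example.

```python
from collections import defaultdict

def split_off_targets(records):
    """Split off-target sites into reliable and "detected" sets.

    records: iterable of (sgRNA, off_target_seq, study_id, pcr_validated)
    tuples (a hypothetical layout chosen for this sketch).
    """
    studies = defaultdict(set)   # (sgRNA, site) -> studies that reported it
    pcr = defaultdict(bool)      # (sgRNA, site) -> validated by PCR anywhere?
    for sgrna, site, study, validated in records:
        key = (sgrna, site)
        studies[key].add(study)
        pcr[key] = pcr[key] or validated
    # Reliable: PCR-validated, or reported by at least two different studies.
    reliable = {k for k in studies if pcr[k] or len(studies[k]) >= 2}
    # "Detected": everything else, i.e. one study and no PCR validation.
    detected = set(studies) - reliable
    return reliable, detected
```

Keying on the (sgRNA, site) pair keeps the sets free of repetition, which mirrors the statement that reliable off-targets were integrated "without repetition".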
For negative data collection, we downloaded the negative dataset from Peng's work [18]. Its samples were found by Cas-OFFinder [25] in the human genome hg38, relate to 29 sgRNAs with no more than 6 mismatches, and exclude those already in that work's positive dataset. Note that two of the 29 sgRNAs share the same 20-nt protospacer sequence (GGGTGGGGGGAGTTTGCTCC) and differ only in their PAM sequences (AGG and TGG). These two sgRNAs are treated as one with PAM sequence NGG in our positive data, so the negative data we downloaded are in accordance with our positive data. However, 379 samples in this negative dataset were later proved to be reliable off-targets and were put into our new positive dataset. In addition, 3,485 samples in the original negative dataset belong to "detected" off-targets as we defined them. To avoid confusion, we deleted these newly identified reliable and "detected" off-targets from the original negative data. Our new negative dataset contains 403,953 samples, whose DNA sites are called no-editing sites.

B. FEATURES
In this paper, the features are all extracted directly from the sequence, as a previous study reported that sequence-derived features are quite dependable for off-target prediction [20]. In addition, according to [14], only the PAMs NGG and NAG have editing efficiency, and almost all experiments used NGG as the PAM because it is much more efficient than NAG. The 3-nt PAM sequence therefore carries little useful sequence composition information, and we do not consider it in feature extraction.
FIGURE 1. The process to split reliable and "detected" off-targets. Nine in vitro and cell-based assays are considered here; their names are shown in the orange box. After processing, 1,128 reliable off-targets and 5,435 "detected" off-targets are obtained.
Suppose $S = (s_1 s_2 \ldots s_j \ldots s_n)$ represents the putative off-target DNA sequence from 5' to 3', and $D = (d_1 d_2 \ldots d_j \ldots d_n)$ represents the corresponding sgRNA (on-target) sequence, where $s_j, d_j \in \{A, T, G, C\}$ and $n$ denotes the sequence length, which equals 20 here. To represent the sequences, we first used one-hot vectors to encode the two raw sequences: the sgRNA and its putative off-target site. Each nucleotide ($s_j$ or $d_j$) is represented as one of (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1), corresponding to A, C, G and T, respectively. Next, we introduced the mismatch position and type feature to extract the mutation information from the raw sequences. In an on-target and putative off-target sequence pair, if $d_j \neq s_j$, position $j$ is regarded as a mismatch. For each mismatch position $j$, the mismatch type must belong to one of the 12 mutation types {AT, AG, AC, TA, TG, TC, GA, GT, GC, CA, CT, CG}, where the first nucleotide is from the on-target site and the second is from the putative off-target site. In this way, the mutation position and type information can be incorporated into a $12 \times 20$ matrix $M$, where $M_{(i,j)} = 1$ means that a mismatch of type $i$ occurs at position $j$ between the sgRNA and its putative off-target, and all other entries of $M$ are 0.
With the above feature extraction schemes, each sequence pair can be represented as a 20 * 20 matrix by concatenating, along the row dimension, the 4 * 20 one-hot features of the on-target sequence, the 4 * 20 one-hot features of the putative off-target sequence, and the 12 * 20 mismatch position and type matrix M (4 + 4 + 12 = 20 rows).
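The encoding described above can be illustrated with a minimal NumPy sketch. It follows the one-hot order (A, C, G, T) and the 12 mismatch types listed in the text; the function names, the layout with bases as rows and positions as columns, and the stacking order (on-target, off-target, then M) are assumptions made for this example rather than details confirmed by the source.

```python
import numpy as np

BASES = "ACGT"  # one-hot order A, C, G, T, as stated in the text
MISMATCH_TYPES = ["AT", "AG", "AC", "TA", "TG", "TC",
                  "GA", "GT", "GC", "CA", "CT", "CG"]

def one_hot(seq):
    """Return a 4 x len(seq) one-hot matrix (rows: A, C, G, T)."""
    m = np.zeros((4, len(seq)), dtype=np.uint8)
    for j, base in enumerate(seq):
        m[BASES.index(base), j] = 1
    return m

def mismatch_matrix(on_target, off_target):
    """Return the 12 x n matrix M with M[i, j] = 1 for mismatch type i at position j."""
    m = np.zeros((len(MISMATCH_TYPES), len(on_target)), dtype=np.uint8)
    for j, (d, s) in enumerate(zip(on_target, off_target)):
        if d != s:
            # First letter comes from the on-target, second from the off-target.
            m[MISMATCH_TYPES.index(d + s), j] = 1
    return m

def encode_pair(on_target, off_target):
    """Stack the three blocks into a 20 x 20 feature matrix; the 3-nt PAM is dropped."""
    on20, off20 = on_target[:20], off_target[:20]
    return np.vstack([one_hot(on20), one_hot(off20), mismatch_matrix(on20, off20)])
```

For a pair with a single G-to-A mismatch at position 1, the resulting matrix has 20 + 20 + 1 nonzero entries: one per position in each one-hot block, plus one entry in M.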

C. DATA AUGMENTATION
Data imbalance is a severe issue in CRISPR/Cas9 off-target prediction. As mentioned above, the negative data can number several hundred thousand samples, while the positive data contain only 1,128. This issue was raised by Gao [21], who called for more effort in this area. Inspired by image augmentation methods, in this work we rotated the feature matrices we built to extend the positive dataset. We rotated each original feature matrix by 90, 180 and 270 degrees. In this way, the data size is augmented to four times the original.
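The rotation step can be sketched with `numpy.rot90`. The source does not state the rotation direction; `np.rot90` rotates counterclockwise, but the set of four matrices (0, 90, 180 and 270 degrees) is the same in either direction. The function name `augment_by_rotation` is chosen here for illustration.

```python
import numpy as np

def augment_by_rotation(x):
    """Return the original square feature matrix plus its 90, 180 and 270 degree rotations."""
    return [np.rot90(x, k) for k in range(4)]

# Each positive sample yields four matrices, so 902 training positives become 3608.
n_augmented = 902 * 4
```

Because the feature matrix is square (20 by 20), every rotation keeps the same shape and can be fed to the same network input.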

D. MODEL CONSTRUCTION
Because our raw sequences are organized as 2D matrices with binary values, and a CNN can learn hierarchical spatial representations, we believe that a multi-layer CNN structure is a suitable way to learn useful information from the feature matrices we built; we adopt 4 CNN layers here. To train the model, we first extended the positive training dataset from 902 to 3,608 samples with our new data augmentation method, then randomly selected the same number of negative samples to keep the training data balanced. Although some authors stated that artificially balancing the dataset may under-represent the real scenario, where CRISPR off-target prediction faces a large number of negative samples given its whole-genome search range [21], we believe that training the model on unbalanced data would introduce bias and give undesired results for positive data. Instead, we adopted an ensemble strategy to bring the model closer to the real scenario. We trained the model 10 times, each time randomly reselecting 3,608 negative samples so that different data can be learned by the network. The final result is obtained by averaging the scores of the ten models. Together, the techniques described above make up DL-CRISPR, a deep learning model for off-target prediction in CRISPR/Cas9. Its working mechanism is illustrated in Fig. 2.
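The balanced ensemble procedure described above can be sketched independently of the network itself. In this sketch `fit` is a hypothetical callable that trains one model on a balanced set and returns a scoring function; the CNN architecture is not reproduced here, and sampling negatives without replacement within each draw is an assumption.

```python
import numpy as np

def train_balanced_ensemble(pos, neg, fit, n_models=10, seed=0):
    """Train n_models models, each on all positives plus a fresh equal-size negative draw.

    pos, neg: arrays of feature matrices; fit(x, y) -> scoring function (hypothetical).
    """
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        # Re-draw as many negatives as there are positives for every model.
        idx = rng.choice(len(neg), size=len(pos), replace=False)
        x = np.concatenate([pos, neg[idx]])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(pos))])
        models.append(fit(x, y))
    return models

def ensemble_score(models, x):
    """Average the per-model scores, as in the final averaging step."""
    return np.mean([m(x) for m in models], axis=0)
```

Each individual model sees balanced data, while the ensemble as a whole is exposed to up to ten times as many negative samples as positives.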

III. RESULTS AND DISCUSSIONS
A. DATA RELIABILITY
As mentioned above, most off-target samples were obtained from in vitro and cell-based assays, and we hypothesized that differences between the experimental settings of these assays and those of PCR assays may yield off-targets that would not occur in practice. To examine this, we designed three experiments whose positive and negative training data are 1) reliable off-targets and "detected" off-targets, 2) "detected" off-targets and negative samples, and 3) reliable off-targets and negative samples. In all three, each dataset is randomly selected with a size of 902 (80% of the reliable off-target dataset). The models built from these three kinds of training data were tested on the remaining 226 reliable off-targets and 226 randomly selected negative samples that do not overlap with the training data. We tested the model 5 times for each group of training data; the results are recorded in Supplementary Table S2. The average Acc when "detected" off-targets are used as negative data, used as positive data, and not used at all is 77.16%, 82.44%, and 93.72%, respectively. The gap of more than 10 percentage points in classification accuracy between using "detected" off-targets as positive data and not using them indicates that reliable and "detected" off-targets differ, which in turn suggests that off-target samples found by only one in vitro or cell-based experiment are not reliable enough to be treated as true off-targets in practice. The results also demonstrate that "detected" off-targets differ from negative samples. However, the gap in Acc between the second and third groups of training data is smaller than that between the first and third groups, which illustrates that the properties of "detected" off-targets are more consistent with reliable off-targets than with the negative data. Even so, the "detected" off-targets are not used as either positive or negative data in our training datasets.

B. DATA ANALYSIS AND FEATURE INVESTIGATION
To show the differences between positive and negative data as well as the effectiveness of our features, the mutation positions and types of the two kinds of data are visualized with line charts and heatmaps in Fig. 3. The line chart shows that the distribution of mutation positions in negative data is much more uniform than in positive data, with the mutation frequency at each position staying at around 5%. In positive data, by contrast, the mutation positions and types show clear preferences. Some positions, such as positions 1-4 and 8, show higher mutation frequencies; in particular, the mutation frequencies at positions 1 and 2 are about twice those in negative data. Other positions, such as 10-16, 19 and 20, seem less likely to mutate. The mutation types at the same position can also differ between the two kinds of data. For instance, at position 1, although the heatmaps show that 'G' is more likely to mutate than other nucleotides in both positive and negative data, 'G' is more likely to mutate to 'A' in positive data but to 'T' in negative data; in addition, 'A' and 'T' show almost no mutation trend in positive data, whereas in negative data 'A', 'C' and 'T' have almost the same probability of mutating. At position 2, the mutation from 'G' to 'A' shows a much higher frequency in positive data, but in negative data it is less noticeable because the mutations 'G-C' and 'G-T' also have very high frequencies. At position 8, 'T' is much more likely to mutate to 'C' in positive data, whereas this rarely happens in negative samples. At position 19, mutation type 'T-C' happens most often in positive data whereas other types rarely happen, while in negative data all mutation types can happen but 'C-G' has the largest probability.
In addition, even for positions that share nearly the same mutation frequencies in positive and negative data, such as positions 5-7, 9, 17 and 18, the mutation types can be different. For example, at position 8, mutation type 'T-C' is most likely to happen in positive data but rarely happens in negative data, where nucleotides 'C' and 'T' are instead more likely to mutate to other nucleotides. At position 18, mutation type 'G-A' is quite noticeable in positive data, whereas in negative data the mutations of all nucleotides have nearly the same probability.
The mutation position trends found above, although based on limited positive data, indicate that in the 20-bp PAM-proximal seed sequence, sites that mismatch the sgRNA in the PAM-distal region and match it in the PAM-proximal region are more likely to be off-target sites; this finding is consistent with previous work [14]. At the same time, the mutation type plots show that mutation type 'G-A' happens quite often in positive data. These findings give us some insight into how to distinguish the different types of data and indicate that our feature matrices contain useful information for predicting off-targets.
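The per-position mutation frequencies discussed above can be computed with a short sketch. It assumes that "frequency" means the share of all mismatches in a dataset that fall at each position; the source does not define it explicitly, but this reading matches the roughly 5% uniform baseline reported for negative data (1/20 positions). The function name is chosen for illustration.

```python
import numpy as np

def mismatch_frequency(pairs, length=20):
    """Return, for each position, its share of all mismatches in the dataset.

    pairs: iterable of (on_target, off_target) strings of at least `length` nt.
    """
    counts = np.zeros(length)
    for on_target, off_target in pairs:
        for j in range(length):
            if on_target[j] != off_target[j]:
                counts[j] += 1
    total = counts.sum()
    # Guard against an empty or mismatch-free dataset.
    return counts / total if total else counts
```

Under this definition a perfectly uniform mismatch distribution gives 0.05 at every position, so values near 0.10 at positions 1 and 2 would correspond to "about twice" the negative baseline.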

C. 5-FOLD CROSS VALIDATION
To better demonstrate the performance of DL-CRISPR and show its robustness, we evaluate it with 5-fold cross validation. The original data were randomly divided into 5 subsets; in each validation, one subset is used as the test set and the others as the training set. As described above, we use ensemble learning to incorporate more negative data while keeping the training data balanced; hence, in each fold, 10 models are constructed and their results are averaged. Off-target prediction for given sgRNAs is related to optimal sgRNA design: some sgRNA design tools can also output possible off-targets and can therefore be used for comparison here. However, most optimal sgRNA selection tools only output the optimal sgRNAs in a gene, and tools that can also output off-targets may require other input information, such as a longer input sequence, e.g. 30 nt before the PAM [26], or epigenetic features of the sequence [19]. Considering these, DL-CRISPR is compared with three recently published machine learning methods: Ensemble SVM [18], CRISTA [16], and Elevation [17]. Elevation is designed as an optimal sgRNA selection tool: it outputs the optimal sgRNAs for an input gene together with their corresponding off-target sites. Because some sgRNAs in our dataset are not output as optimal by Elevation, we cannot obtain its predicted off-targets for these sgRNAs. Therefore, the performance of Elevation is evaluated only on the 13 sgRNAs it found, which are recorded in Supplementary Table S3. In addition, all three of these methods output scores only for predicted positive data, so it is impossible for us to compare the results via ROC and PRC curves.
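The random 5-fold split can be illustrated as follows. This is a generic sketch, not the authors' code: the function name and seed are arbitrary, and how the negative data were partitioned across folds is not specified in the text, so only the split of sample indices is shown.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a random k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# With the 1128 reliable off-targets, folds hold 226 or 225 samples,
# which leaves 902 or 903 positives for training in each fold.
fold_sizes = [len(test) for _, test in k_fold_indices(1128)]
```

The fold sizes are in line with the 902 training and 226 test positives quoted elsewhere in the paper.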
The 5-fold cross validation results for all four methods are shown as box-and-whisker plots in Fig. 4. Box-and-whisker plots convey the center, spread and overall range of a group of data, and clearly indicate skewed distributions and potential outliers; they can therefore give insight into model robustness.
We observe that the overall performance of DL-CRISPR on 5-fold cross validation is the best, even though the test data may already exist in the training data of the other methods. Its average values of Acc, sensitivity (Sn), specificity (Sp), f-score, and harmonic mean (Hm) are 98.57%, 95.57%, 98.58%, 0.9928, and 0.9705, respectively. DL-CRISPR also outperforms the other three methods on the performance metrics in all 5 validations. Although the Acc values of CRISTA are also very high and fall within a narrower spread than those of DL-CRISPR, they are mainly driven by the negative prediction accuracy because of the heavily imbalanced data. In this classification problem, however, we should pay more attention to the positive prediction accuracy, as it provides valuable references for the potential off-target sites. The smallest sensitivity of DL-CRISPR is 95.13%, whereas the largest sensitivity of CRISTA is only 73.89%. Additionally, the average sensitivities of CRISTA, Ensemble SVM and Elevation are 69.32%, 93.53%, and 87.40%, respectively, all of which are worse than DL-CRISPR's 95.66%.
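For reference, the evaluation metrics named above can be computed from confusion-matrix counts as sketched below. The text does not give the formulas, so the standard definitions are assumed here; in particular, Hm is taken to be the harmonic mean of Sn and Sp, an assumption that reproduces the reported 0.9705 from the reported averages of 95.57% and 98.58%. The f-score below is the usual precision/recall F1, which may differ from the definition used by the authors.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two rates."""
    return 2 * a * b / (a + b)

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (definitions assumed, see lead-in)."""
    sn = tp / (tp + fn)              # sensitivity: recall on true off-targets
    sp = tn / (tn + fp)              # specificity: recall on no-editing sites
    precision = tp / (tp + fp)
    return {
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "Sn": sn,
        "Sp": sp,
        "F": harmonic_mean(precision, sn),
        "Hm": harmonic_mean(sn, sp),
    }

# Consistency check against the reported averages (Sn = 95.57%, Sp = 98.58%).
hm_from_reported = harmonic_mean(0.9557, 0.9858)
```

With heavily imbalanced test data, Acc is dominated by Sp, which is why the text stresses Sn when comparing methods.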

D. PERFORMANCE ON SPECIFIC sgRNAs
In this part, we evaluate the ability of DL-CRISPR to identify the most likely off-targets among a large number of putative sites. The model assigns each input sequence a probability score of being an off-target. Whether the experimentally verified samples receive high probability scores, and whether they are ranked in top positions among all potential sites, gives significant insight into the model's effectiveness.
We test DL-CRISPR on putative off-target datasets for specific sgRNAs. We selected two sgRNAs to study, one from gene VEGFA (on-target sequence: GGGTGGGGGGAGTTTGCTCCNGG) and the other from gene EMX1 (on-target sequence: GAGTCCGAGCAGAAGAAGAANGG); we refer to these two sgRNAs as VEGFA and EMX1 in the following description. These two sgRNAs are chosen because they are the most frequently investigated in the nine in vitro and cell-based techniques (9 times for VEGFA and 6 times for EMX1). The model we used is the ensemble model constructed in the first fold of the cross validation above. Only Ensemble SVM and CRISTA are compared here, as Elevation does not output these two sgRNAs as optimal on genes VEGFA and EMX1.
To prepare the data, we first download the putative off-target data from Frock's work [24] for these two sgRNAs from [20]; 41,746 and 76,251 samples are obtained for sgRNAs VEGFA and EMX1, respectively. After deleting those already in the training data (99 for VEGFA and 64 for EMX1), we get 29 and 12 reliable off-targets, and 437 and 92 "detected" off-targets, for VEGFA and EMX1, respectively. To evaluate the methods on the new datasets for the two sgRNAs, we compare the number of reliable off-targets and the number of "detected" off-targets among the m highest scores predicted by each method, where m = 29, 40, 70 for VEGFA and m = 12, 20, 30 for EMX1. These reliable and "detected" off-target sequences are expected to be ranked in top positions with larger probability scores, especially the reliable off-targets. The results are shown in Fig. 5.
From the comparisons, DL-CRISPR sorts one more reliable off-target into the top 29 and top 70 positions for sgRNA VEGFA and into the top 30 positions for sgRNA EMX1. DL-CRISPR also ranks more "detected" off-targets, i.e. no fewer than 6 for sgRNA VEGFA and no fewer than 2 for sgRNA EMX1, into the top m positions compared with CRISTA and Ensemble SVM.
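The top-m comparison used in this subsection amounts to ranking all putative sites by score and counting how many known off-targets land in the first m positions. A minimal sketch, with a hypothetical `scores` dictionary mapping each putative site to its predicted probability:

```python
def count_in_top_m(scores, known_sites, m):
    """Count known off-targets among the m highest-scoring putative sites.

    scores: dict mapping site -> predicted probability (hypothetical input format).
    known_sites: set of reliable (or "detected") off-target sites.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for site in ranked[:m] if site in known_sites)
```

Calling it once with the reliable set and once with the "detected" set, for each m listed above, gives the per-method counts that are compared in Fig. 5.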

E. PERFORMANCE ON UNSEEN sgRNAs
We carry out the studies in this part on three new sgRNAs targeting the mouse Pcsk9 gene: gM (GGCTGATGAGGCCGCACATGTGG), gMH (CAGGTTCCATGGGATGCTCTGGG) and gp (AGCAGCAGCGGCGGCAACAGCGG), with off-target sequences identified by CIRCLE-seq on WT and KI mouse genomic DNA [27]. The three new test datasets contain 166 off-targets for sgRNA gM, 439 for gMH and 3,381 for gp. Since all these off-targets were detected by CIRCLE-seq, they should receive relatively high scores. As these three sgRNAs are new, we reconstruct the model under the DL-CRISPR working mechanism using all positive data in our dataset.
DL-CRISPR identifies 75, 174 and 3,323 sequences from the gM, gMH and gp datasets as off-targets, respectively. The prediction result for gp is in high agreement with the in vitro experiment, while the results for the other two sgRNAs are less consistent. This can be explained by their heatmaps in Fig. 6. The mutations in gp are much more homogeneous: they mainly happen at positions 10, 13 and 17 with mutation type 'GC' or 'AG'. The samples in this dataset are therefore very likely to be assigned to the same class, and they are identified correctly by DL-CRISPR, which illustrates its ability to predict off-targets of an unseen sgRNA. For the other two sgRNAs, however, the mutation positions and types of the off-target sequences vary much more, so the scores may span a wider range and the sequences may be assigned to different classes. Furthermore, by our definition the test samples in this part are "detected" off-targets rather than reliable off-targets, whereas DL-CRISPR is trained with reliable off-targets; hence it may not favor all "detected" off-targets or assign high probability scores to them.
FIGURE 6. Heatmaps for the off-targets found by CIRCLE-seq for the three mouse sgRNAs gM, gMH, and gp.
We also explore the prediction results of Ensemble SVM and CRISTA on the same datasets. Their results, together with those of DL-CRISPR, are shown in TABLE 1. CRISTA outputs 888, 1,410 and 15,953 off-targets in total for the gM, gMH and gp sgRNAs, of which 26, 80 and 2,913 samples are in the off-target datasets identified by CIRCLE-seq. The trends in the prediction results of Ensemble SVM and CRISTA on the three sgRNAs are consistent with what we found for DL-CRISPR above: the off-target sequences related to gp are the easiest to predict, and gMH's mutation sites are the hardest to identify. CRISTA gives the smallest number of correctly predicted off-targets, and Ensemble SVM is slightly worse than DL-CRISPR on these unseen sgRNAs, predicting 4, 39, and 30 fewer sequences in the datasets related to sgRNAs gM, gMH, and gp, respectively. This comparison shows DL-CRISPR to be an effective off-target activity prediction program for unseen sgRNAs.
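To put the counts reported above on a common scale, the fraction of CIRCLE-seq off-targets recovered by each method can be computed directly from the numbers in the text. Only DL-CRISPR and CRISTA are included, since Ensemble SVM is described only by its difference from DL-CRISPR; the dictionary and function names are chosen for this illustration.

```python
# Counts taken from the text: CIRCLE-seq off-targets per sgRNA and how many each method recovers.
CIRCLE_SEQ_TOTAL = {"gM": 166, "gMH": 439, "gp": 3381}
RECOVERED = {
    "DL-CRISPR": {"gM": 75, "gMH": 174, "gp": 3323},
    "CRISTA": {"gM": 26, "gMH": 80, "gp": 2913},
}

def recovered_fraction(method, sgrna):
    """Share of the CIRCLE-seq off-targets for `sgrna` that `method` identifies."""
    return RECOVERED[method][sgrna] / CIRCLE_SEQ_TOTAL[sgrna]

for method in RECOVERED:
    for sgrna in CIRCLE_SEQ_TOTAL:
        print(f"{method:10s} {sgrna:4s} {recovered_fraction(method, sgrna):.1%}")
```

The fractions make the contrast between gp and the other two sgRNAs explicit, which is the pattern the heatmap discussion sets out to explain.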

IV. CONCLUSION
Data imbalance is a severe issue when applying machine learning to the CRISPR/Cas9 system. To address it, we extended the positive dataset and adopted data augmentation to increase the amount of positive training data, and we employed an ensemble strategy to take more negative data into consideration, bringing the model closer to the real scenario while keeping the training data balanced. Based on these strategies, we proposed DL-CRISPR, a deep learning model for off-target activity prediction in CRISPR/Cas9. We first explored the reliability of off-target data and the differences between positive and negative data on our newly extended dataset. Experiments show that off-targets detected by only one in vitro or cell-based assay differ from the reliable off-targets, and that positive data have clear preferences in mutation positions and types compared with negative data. We then tested DL-CRISPR and compared it with three state-of-the-art methods on different types of datasets. DL-CRISPR achieved the best performance on the 5-fold cross validation test datasets, with Acc of at least 98.40% and Sn of at least 95.13%; the generally high values and narrow spread of all evaluation metrics illustrate the robustness of DL-CRISPR. In addition, DL-CRISPR ranked more reliable and "detected" off-targets in top positions according to the probability scores in the datasets related to two specific sgRNAs than the other methods. Furthermore, DL-CRISPR also worked well on off-target prediction for unseen sgRNAs, identifying more of the off-targets detected by CIRCLE-seq than the other methods. In summary, the experimental results in this work demonstrate that DL-CRISPR is an effective and robust off-target activity prediction method for CRISPR/Cas9.

APPENDIX
Supplementary Materials, the data used in this work, and the code for DL-CRISPR are available at https://github.com/yuuuuzhang/DL-CRISPR_offtarget_prediction.
RUI YIN received the B.S. degree in automation from Shandong University, China, in 2013, and the M.Sc. degree in control engineering from Central South University, China, in 2016. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. His research interests include data mining and pattern recognition to make sense of big heterogeneous data for real applications in engineering and biomedical science.
CHEE KEONG KWOH (Senior Member, IEEE) received the bachelor's degree (Hons.) in electrical engineering and the master's degree in industrial system engineering from the National University of Singapore, Singapore, in 1987 and 1991, respectively, and the Ph.D. degree from the Imperial College of Science, Technology and Medicine, University of London, in 1995.
He has been with the School of Computer Engineering, Nanyang Technological University (NTU), since 1993. His research interests include data mining, soft computing and graph-based inference; application areas include bioinformatics and biomedical engineering. He has done significant research work in his research areas and has published many quality international conference and journal articles. He has often been invited as an organizing member, referee, or reviewer for a number of premier conferences and journals, including GIW, IEEE, BIBM, RECOMB, PRIB, BIBE, ICDM, and iCBBE. He is also a member of the Association for Medical and Bioinformatics and the Imperial College Alumni Association of Singapore. He has provided many services to professional bodies in Singapore and was conferred the Public Service Medal by the President of Singapore in 2008. He is an editorial board member of the International Journal of Data Mining and Bioinformatics, the Scientific World Journal, Network Modeling and Analysis in Health Informatics and Bioinformatics, Theoretical Biology Insights, and Bioinformation. He has been a Guest Editor of many journals, such as the Journal of Mechanics in Medicine and Biology, the International Journal on Biomedical and Pharmaceutical Engineering, and others.
VOLUME 8, 2020