BP-GAN: Interpretable Human Branchpoint Prediction Using Attentive Generative Adversarial Networks

Branchpoints (BPs) are essential sequence elements of ribonucleic acids (RNAs) in splicing, which is the process of creating a messenger RNA (mRNA) that is translated into proteins. This study proposes to develop deep neural networks for BP prediction. Extensive previous studies have shown that the existence of BP sites depends on sequence patterns called motifs; hence, the prediction model must accurately explain its decisions in terms of motifs. Existing approaches utilized either handcrafted features for interpretable, though less accurate, predictions or deep neural networks that were accurate but difficult to explain. To address the aforementioned difficulties, the proposed method incorporates 1) generative adversarial networks (GANs) to learn the latent structure of RNA sequences, and 2) an attention mechanism to learn sequence-positional long-term dependency for accurate prediction and interpretation. Our method achieves highly satisfying results in various performance metrics with adequate interpretability. We demonstrated that, without any prior biological knowledge, BP prediction by the proposed method is closely related to three motifs, the consensus sequence surrounding BPs, polypyrimidine tract, and 3’ splice site, that are well-established in molecular biology.


I. INTRODUCTION
The human body has numerous types of cells, such as blood cells, neurons, and liver cells. Even though their functions are different, the cells are created from the same set of deoxyribonucleic acid (DNA) by combining different genes to synthesize a functional gene product, e.g., protein.¹ The process of making pre-messenger ribonucleic acid (pre-mRNA) from DNA, messenger RNA (mRNA) from pre-mRNA, and protein from mRNA is referred to as gene expression.²
The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang.
¹ DNA is a long string of paired chemical molecules called nucleotides that are of four types denoted by A, C, G, and T [1]. Genetic information in DNA is organized into nucleotide sequences called genes. The set of genetic material stored in DNA is called a genome.
² RNA is a molecular transcript of DNA that conveys the genetic information to create proteins. The types of nucleotides in RNA are A, C, G, and U instead of T.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Large intervening sequences called introns are spliced out, and only the flanking sequences called exons are joined together to form the mRNA, as shown in Figure 1(a). Alternative splicing of this kind generates various types of proteins from the same pre-mRNA. As an essential part of gene expression, splicing involves three key nucleotides: the 5' and 3' splice sites (sss) and a BP site. The 5' and 3' sss denote the first (upstream) and last (downstream) nucleotides within an intron, respectively. A BP is the nucleotide to which the 5' ss is joined to form a lariat, which separates the intron from the pre-mRNA. After splicing, each mRNA molecule encodes the instructions to build proteins. Thus, identifying the 5' ss, 3' ss, and BP is crucial for understanding the mechanism of splicing. Mis-spliced mRNA can result in altered proteins with damaged functions, which can cause diseases [2]. The 5' and 3' sss are detected by well-established RNA sequencing techniques [3]; however, it is difficult to identify BPs in sequencing experiments, as the lariats are degraded rapidly and appear rarely. Recent efforts have enriched the lariats in RNA sequencing, building large-scale datasets of genome-wide BP annotations [4], [5]; however, the annotation of almost 130,000 human BPs covered only 40% of all human introns. In view of these difficulties, we propose a deep neural network model to predict the BP for a given RNA sequence, taking the following into consideration. First, BP sites are known to co-exist with motifs, i.e., sequence patterns of up to tens of nucleotides that are typically not readable by humans and difficult to identify; a few ultra-conservative sequence motifs were observed experimentally [6].
Therefore, it is necessary to develop a model that explains its predictions in terms of motifs with high accuracy. Second, the distribution of BPs is highly biased within 20 to 50 nucleotides upstream from the 3' sss, as shown in Figure 1(a) [7], causing a class imbalance in the datasets. The existing approaches were constrained as they relied on either interpretable but inaccurate handcrafted sequential features or deep neural networks of poor interpretability [3], [8]-[12].
We address these issues by using generative adversarial networks (GANs) [13], [14] to learn the latent structure of RNA sequences for BP prediction; we call the resulting model BP-GAN hereafter. The BP-GAN decomposes the underlying structure of input RNA sequences into two types of features: one directly affecting the BP prediction and the other irrelevant to the prediction. We further improve the BP-GAN by adopting a novel integration of an attention mechanism and triplet loss with hard negative mining. The attention mechanism enables the model to learn long-term dependencies within a given input sequence. In addition, it provides a principled way of interpreting which parts of a sequence the BP-GAN attended to. As shown later, the BP-GAN recovers three biologically meaningful motifs. To resolve the problem of class imbalance, we use triplet loss to regularize the BP-GAN during training, which improves the prediction accuracy. In summary, the contributions of this study are as follows:
1) We present the BP-GAN as a novel integration of attention into a GAN, regularized by triplet loss. The deep neural networks in the BP-GAN are carefully designed for BP prediction, achieving excellent results on two large-scale datasets.
2) Our method uses a GAN to learn the structure of latent variables for RNA sequences relevant to BP prediction. GANs have been applied to generative tasks in gene expression [15]-[19]; however, to the best of our knowledge, this is the first GAN-based attempt to address a discriminative task. We believe that our approach is also applicable to other related downstream tasks.
3) As a key benefit, the BP-GAN provides a means to interpret and visualize the rationale behind its decisions by using attention, which was not addressed in previous studies on BP prediction. In particular, we demonstrate that it can recover three biologically meaningful motifs that are strong indicators of BPs.

II. RELATED WORK
First, we present an overview of the deep neural network-based approaches related to RNA splicing. Next, previous efforts on BP prediction are summarized to highlight our contributions.

A. GENERATIVE MODELS
A generative model aims to determine an underlying structure of variables relevant to real use-cases. For example, only small structured subsets of pixel values are plausible exemplar images in the real world. A GAN is a deep generative model with an exceptional capability to learn the structure of parameters and generate samples similar to true data [13]. It has initiated several recent studies in genomics [15]-[19].
In [15], [19], the approaches generated DNA sequences using a GAN. A GAN model in [17] aimed to generate RNA sequences for protein function analysis whereas other works focused on generating protein structures [16] and sequences [18]. Gene expression was investigated to simulate an RNA-seq dataset focusing on the diversity of skin cells [20]. This work interpreted the learned parameters in a biologically meaningful manner that was similar to ours. An approach in [21] used a GAN to learn the probabilistic distribution of gene expression in single cells.

B. ATTENTION MECHANISM
Attention learns inter-/intra-sequence dependencies in sequential modeling [22]-[24]. Encouraged by its success, a number of recent studies in genomics incorporated attention into various prediction models on RNA-protein binding sites [25], gene expression analysis [1], and precursor microRNAs [26]. These approaches primarily focused on attention at the level of hundreds of nucleotides [1], or used recurrent neural networks (RNNs) [26] to learn long-term dependencies between sequence elements [24]. In contrast, we consider an attention mechanism without RNNs, called self-attention or intra-attention, to relate the elements of an input sequence with each other more effectively.
The handcrafted-feature approaches included features such as sequence conservation and positional bias for a support vector machine [8], motifs of an intron sequence for an ensemble of multiple algorithms [9], and a score function calculated from a position-specific scoring matrix (PSSM) and the binding energy of the spliceosome [10]. An approach similar to [10] solved the mixture model of motif interference and polypyrimidine tract. A gradient boosting algorithm was used in [12]. Among the deep neural network-based approaches, [27] pioneered the use of convolutional neural networks (CNNs) with extra annotations of positional information of nucleotides. An approach proposed in [30] was similar to ours in that it visualized motifs for predicting splice sites in DNA sequences; however, it was based on a simple CNN that lagged behind state-of-the-art approaches in terms of prediction accuracy [27]. Recently, an RNN-based approach using bi-directional long short-term memory (LSTM) learned sequential features [3]. This work was extended in [28] by processing input sequences with a CNN before applying them to the LSTM, and it achieved excellent results. The RNN-based approaches used additional information on the binding energy of nucleotides, whereas our approach requires only RNA sequences. To summarize, the existing deep neural network-based approaches are straightforward applications of CNNs or RNNs to BP prediction, and few of the previous studies addressed the interpretability of the predictions. The work in [31] presented comprehensive evaluations of the recent approaches on various performance measures, using in-house RNA datasets as well as public ones.

A. PROBLEM FORMULATION
The BP-GAN takes as its input an intronic region of an RNA sequence of $N$ nucleotides in $\{A, C, G, U\}^N$, as shown in Figure 1(b). We use a one-hot vector $s_i = (0, \ldots, 0, 1, 0, \ldots, 0) \in \mathbb{R}^4$ to represent each type of nucleotide and denote the input sequence by $s = \{s_i\} \in \mathbb{R}^{4 \times N}$. Following this notation, $s_N \in s$ is the last nucleotide at the 3' ss. For training the BP-GAN, the ground truth of BP sites for a given input $s$ is given by $b = \{b_i\} \in \{0, 1\}^N$. A sequence can have multiple BPs [5]. The BP-GAN predicts a vector $\hat{b} = \{\hat{b}_i\} \in [0, 1]^N$ that represents, for each nucleotide in $s$, the probability of being a BP.
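The input and label encoding described above can be illustrated with a minimal sketch. The row order A, C, G, U of the one-hot matrix, the helper names `one_hot_encode` and `bp_labels`, and the toy 9-nucleotide sequence are assumptions made for illustration only; the paper does not specify them.

```python
NUCLEOTIDES = "ACGU"  # row order of the one-hot matrix (assumed, not from the paper)

def one_hot_encode(seq):
    """Return a 4 x N one-hot matrix (a list of 4 rows) for an RNA string."""
    rows = [[0] * len(seq) for _ in NUCLEOTIDES]
    for col, nt in enumerate(seq.upper()):
        # str.index raises ValueError for symbols outside {A, C, G, U}
        rows[NUCLEOTIDES.index(nt)][col] = 1
    return rows

def bp_labels(n, bp_positions):
    """Return b in {0, 1}^N with ones at the given 1-based BP positions."""
    b = [0] * n
    for pos in bp_positions:
        b[pos - 1] = 1
    return b

# Toy example: a 9-nt sequence ending in 'AG', with a BP annotated at position 4.
s = one_hot_encode("CUAACUUAG")
b = bp_labels(len(s[0]), [4])
```

Each column of `s` contains exactly one nonzero entry, and `b` may contain several ones when a sequence has multiple BPs.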

B. LEARNING FEATURE REPRESENTATION USING GAN
A GAN learns generative models by utilizing the discriminative capability of neural networks. A generator $G$ takes a latent variable $z$ sampled from a known prior $P_z$ to generate a sample $s_z = G(z)$. The learning process of a GAN is a minimax game between $G$, parameterized by $\theta_G$, and a discriminator $D$, parameterized by $\theta_D$, over the cost function $J(G, D)$ formulated in (1). The discriminator $D$ tries to decrease $J(G, D)$ by maximizing the probability $D(s)$ that real data $s$, sampled from the real distribution $P_s$, is classified as real, and by classifying fake data $s_z = G(z)$ as fake; this term is denoted the adversarial loss $L_{adv}$. On the contrary, the generator $G$ tries to mislead $D$, i.e., to decrease $1 - D(s_z)$, and thus to increase the cost function.
Figure 1(c) illustrates the structure of the BP-GAN. An input $s$ is encoded into two latent variables $x_c = E_c(s)$ and $x_o = E_o(s)$ with $x_c, x_o \in \mathbb{R}^{N \times d_x}$, where $d_x$ is the dimension of the latent variable corresponding to a sequence element. $x_c$ represents a context feature that is similar in all sequences with the same BP sites. We consider two BP sites to be identical based on their positions, regardless of the type of nucleotide.
$x_o$ describes other aspects of the sequence that are irrelevant to the prediction of BPs. Thus, it is possible to reconstruct $s$ from $x_c$ and $x_o$. To this end, we use a network $R$ with parameters $\theta_R$ that produces $\hat{s} = R(x_c, x_o)$, where $\hat{s}$ is a reconstruction of $s$ penalized by a reconstruction loss $L_{recon}$. The BP sites are predicted by a network $BP$, the key component of the BP-GAN, which uses the context feature $x_c$ of a real sequence $s$ to generate $\hat{b} = BP(x_c)$. We denote the loss of $BP$ by $L_{BP}$. The generator $G$ generates a sequence $s_z = G(z)$ from $z \sim \mathcal{N}(0, 1)$. The BiGAN approach [14] demonstrated that a GAN can be improved by augmenting a feature encoder into the training process. Accordingly, the discriminator $D$ of the BP-GAN additionally ingests the features $x_c$ and $x_o$ along with $b$ to discriminate better between real and fake data. Thus, we rewrite $L_{adv}$ in (1) so that $D$ scores the quadruple $(s, x_c, x_o, b)$ for real data and $(s_z, x_{z,c}, x_{z,o}, BP(x_{z,c}))$ for generated data, where $x_{z,c} = E_c(s_z)$ and $x_{z,o} = E_o(s_z)$.
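As a hedged sketch, the adversarial term could take the following form, with $D$ receiving the sequence, both features, and the BP vector as a quadruple. It is obtained by negating the standard GAN objective of [13] so that $D$ decreases $J$ and $G$ increases it, as described in the text. The binary cross-entropy form of $L_{BP}$ is likewise an assumption, suggested by the cross-entropy loss mentioned in the ablation study. Neither expression is taken from the authors' equations.

```latex
\begin{aligned}
J(G, D) &= -\,\mathbb{E}_{s \sim P_s}\!\left[\log D\big(s, x_c, x_o, b\big)\right]
           -\,\mathbb{E}_{z \sim P_z}\!\left[\log\!\Big(1 - D\big(s_z, x_{z,c}, x_{z,o}, BP(x_{z,c})\big)\Big)\right], \\
L_{BP}  &= -\sum_{i=1}^{N} \Big[\, b_i \log \hat{b}_i + (1 - b_i)\log\big(1 - \hat{b}_i\big) \Big].
\end{aligned}
```

Under this sign convention, a confident and correct discriminator drives both expectations toward zero, while a generator that fools $D$ makes the second term grow.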

C. SEQUENCE-BASED POSITIONAL SELF-ATTENTION
Stacked 1-D residual networks (ResNets) [32] and self-attention blocks make up most of the network architecture of the BP-GAN. As discussed in Section II, self-attention aims to learn long-term dependencies within an input sequence effectively. We used the Transformer [24], a self-attention mechanism that does not use RNNs, and combined two Transformer blocks and three 1-D ResNet blocks into what we refer to as a branchpoint-attention (BP-ATT) block, as shown in Figure 2(a). The Transformer depicted in Figure 2(b) takes the input feature $x \in \mathbb{R}^{N \times d_k}$ corresponding to $s$, where $N$ is the sequence length of $s$ and $d_k$, set to 64, is the dimension of $x$ corresponding to an element of $s$. Given a query $Q$ and a set of key-value pairs $K$-$V$ with $Q, K, V \in \mathbb{R}^{N \times d_k}$ created from $x$, we evaluate the similarity between the key and query. Next, the attention $Att \in \mathbb{R}^{N \times N}$ is given as a scaled dot product by $Att = \mathrm{softmax}(Q K^\top / \sqrt{d_k})$. As positional information is lost in self-attention, we apply a positional encoding operation to the input feature to translate it to a positional matrix $pe \in \mathbb{R}^{N \times d_k}$ as $pe_{pos,2i} = \sin(pos / 100000^{2i/d_k})$, where $pos \in \{1, \ldots, N\}$ is a sequence-positional index and $i \in \{1, \ldots, d_k/2\}$ is the dimensional index of $x$. We combine $h$ results of scaled dot products to attend to different representation spaces jointly. The results are projected by a linear layer into a feature $M$, where $W_i^{*}$ and $W^{o}$ are trainable parameters and $x_{pe} = x + pe$. We then apply layer normalization to $M$ and add it to the input feature via a residual connection, which gives the output of the Transformer block.
In addition to an increase in accuracy, the BP-GAN provides interpretability of its prediction decisions, using the self-attention mechanism as its key strength. The BP-GAN is specifically designed to identify motifs within an input sequence and to predict their locations and attention intensity.
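The scaled dot-product self-attention with positional encoding described in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: identity projections stand in for the trainable matrices, only a single head is shown, the cosine on odd dimensions follows the usual Transformer convention [24], dimensions are indexed from 0, and the constant `base` simply copies the value printed in the text.

```python
import math

def positional_encoding(n, d_k, base=100000.0):
    """Sinusoidal positional matrix pe of shape N x d_k.
    Cosine on odd dimensions is an assumption following [24]."""
    pe = []
    for pos in range(1, n + 1):  # pos in {1, ..., N} as in the text
        row = []
        for dim in range(d_k):
            angle = pos / base ** ((dim - dim % 2) / d_k)
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q, k, v):
    """Return (Att, Att @ V), where Att = softmax(Q K^T / sqrt(d_k)) is N x N."""
    d_k = len(q[0])
    att = [softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k])
           for qi in q]
    out = [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
           for row in att]
    return att, out

# Toy input with N = 5 and d_k = 4; Q = K = V = x + pe (identity projections).
x = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
pe = positional_encoding(5, 4)
x_pe = [[a + b for a, b in zip(xr, pr)] for xr, pr in zip(x, pe)]
att, out = self_attention(x_pe, x_pe, x_pe)
```

Row `att[i]` shows how position `i` distributes its attention over all positions, which is the quantity that the motif interpretation builds on.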
Given an intronic input sequence $s$, the motif interpretation from the attention begins by summing the $h$ attention maps obtained from the last BP-ATT of $E_c$ into a single map $MA_s = \sum_{j=1}^{h} \mathrm{softmax}(Q_j K_j^\top / \sqrt{d_k}) \in \mathbb{R}^{N \times N}$, where $Q_j$ and $K_j$ are the query and key corresponding to attention map $j$, as shown in Figure 2(b). It is likely that the last attention block delivers the most informative attention map. We denote the $i$-th row of $MA_s$ by $MA_{s,i} \in \mathbb{R}^{1 \times N}$. In the context of the Transformer model, $MA_{s,i}$ is an attention vector that represents the extent to which a sequence element $s_i \in s$ attends to other nucleotides. Figure 3(a) shows an example of the three BP-ATT blocks in $E_c$ for an input $s$ whose BP is $s_{49}$. The two horizontal rectangles in the figure depict attention vectors showing how $s_1$ (5' ss) and $s_{49}$ (BP) attend to other nucleotides of the input. Inversely, the vertical rectangles show the extent to which the associated nucleotides are attended to by others. The attention intensity in the vertical rectangles for $s_{49}$ is more distinguishable than that of other sites in the later attention maps. This observation explains why we chose the last BP-ATT for the motif analysis explained below. The example also suggests that $s_{49}$ is an important clue for BP prediction, as nucleotides in the input sequence attend to $s_{49}$ the most.
Next, we define a shift function $sh(x, k)$ that shifts all the elements in a sequence $x$ so that the $k$-th element is relocated to the center of $x$. Nucleotides outside the input sequence can be involved as a result of the shift operation. Given the set of test sequences $S = \{s\}$ under consideration, let $A_c \in \mathbb{R}^{1 \times N}$ be the expected value of the shifted attention vectors, which is given by
$$A_c = \frac{1}{|S|} \sum_{s \in S} sh(MA_{s, c_s}, c_s), \tag{8}$$
where $c_s$ is the positional index to be placed at the center of $s$ and $|S|$ is the cardinality of $S$. Each sequence may have a different $c_s$, for example, when $c_s$ refers to its BP site. Given the predicted BPs $\hat{b}_s$ of sequence $s$, we obtain an aggregated attention map $A_c$ by considering $c_s = \hat{b}_s$. This enables us to identify common motifs to which the predicted BPs attended. Furthermore, if we seek motifs with respect to a specific position in the RNA sequence, e.g., the 3' sss, it suffices to set all $c_s$ identically to $N$. We also define $SL_S \in \mathbb{R}^{4 \times N}$ as a sequence logo [33] corresponding to $S$. Given a collection of aligned sequences, a sequence logo is a graphical representation of the sequence conservation of nucleotides in a strand of DNA/RNA. A sequence logo shows how frequently each type of nucleotide appears at each position along the horizontal axis, as shown in Figure 3(b). A more frequent type is drawn as a larger letter, depicting the consensus sequence and the diversity of sequences. A sequence logo $SL_{S'}$ is then created from the set of shifted sequences $S' = \{sh(s, c_s)\}$. Now, we define $SL^{(W)}_{S'}$ as a weighted sequence logo using the mean attentions obtained from (8) corresponding to $S'$. We now have
$$SL^{(W)}_{S'} = SL_{S'} \, \mathrm{diag}(A_c),$$
where $\mathrm{diag}(A_c)$ is a diagonal matrix with $A_c$ being the diagonal entries.
To summarize, $SL^{(W)}_{S'}$ visualizes the extent to which each of the nucleotides affects the BP sites using input sequences aligned to $c_s$. In Section IV-C, we show that the motifs discovered by the BP-GAN include the consensus sequence⁴ and the polypyrimidine tract, which are well established in previous studies [6].
⁴ The consensus sequence is yUnAy in nucleotide codes, where y is C or U and n is any nucleotide in addition to U and A. Here, A represents a BP site.
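The shift, aggregation, and weighting steps of the motif analysis can be sketched as follows. This is a simplified illustration, not the authors' code: positions shifted in from outside the window are filled with zeros or left empty (an assumption, since the text notes that neighbouring nucleotides may be used instead), the logo is represented by plain per-position frequencies rather than the information-content scaling of [33], and all function names and toy values are hypothetical.

```python
def shift(x, k, fill=0.0):
    """sh(x, k): move the k-th element (1-based) of x to the centre position.
    Positions shifted in from outside are set to `fill` (an assumption)."""
    n = len(x)
    offset = n // 2 - (k - 1)
    out = [fill] * n
    for i, value in enumerate(x):
        j = i + offset
        if 0 <= j < n:
            out[j] = value
    return out

def mean_shifted_attention(att_rows, centers):
    """A_c: mean of the attention vectors after aligning each one to its centre c_s."""
    shifted = [shift(row, c) for row, c in zip(att_rows, centers)]
    return [sum(col) / len(shifted) for col in zip(*shifted)]

def logo_frequencies(seqs, centers, alphabet="ACGU"):
    """4 x N matrix of per-position nucleotide frequencies over the shifted sequences."""
    n = len(seqs[0])
    shifted = [shift(list(s), c, fill=None) for s, c in zip(seqs, centers)]
    return [[sum(1 for t in shifted if t[i] == nt) / len(shifted) for i in range(n)]
            for nt in alphabet]

def weighted_logo(sl, a_c):
    """SL · diag(A_c): scale column i of the logo matrix by A_c[i]."""
    return [[value * weight for value, weight in zip(row, a_c)] for row in sl]

# Toy data: two 5-nt sequences with BPs at positions 4 and 5, and their attention rows.
seqs = ["CUAAC", "UCUAA"]
centers = [4, 5]
att_rows = [[0.1, 0.1, 0.2, 0.5, 0.1], [0.0, 0.1, 0.1, 0.2, 0.6]]
a_c = mean_shifted_attention(att_rows, centers)
sl = logo_frequencies(seqs, centers)
sl_w = weighted_logo(sl, a_c)
```

After alignment, both toy BPs land on the centre column, so that column carries both the highest mean attention and a fully conserved A.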

E. TRAINING THE BP-GAN END-TO-END
Owing to the biased distribution of BPs, there is a class imbalance that may cause overfitting. Therefore, we use a novel regularization based on triplet loss to learn a distance function from three samples, namely, an anchor, a positive, and a negative [34]. This loss ensures that, in feature space, an anchor is closer to a positive than to a negative by at least a margin. Let us define $s_{a,i}$ as an anchor having a BP at $i$, and $s_{p,i}$ and $s_{n,i}$ as a positive and a negative, respectively. The triplet loss $L_{tri}$ is given in (10), where $[\cdot]_+ := \max(0, \cdot)$ takes the positive component of its input, $\alpha$ is the margin, and $x_{*,i} = E_c(s_{*,i})$.
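A minimal sketch of a triplet loss with a margin is given below. The use of the squared Euclidean distance follows the common formulation in [34] and is an assumption here, as is the treatment of each feature as a flat vector; the default margin of 10 is the value of $\alpha$ reported in the implementation details.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=10.0):
    """[d(a, p) - d(a, n) + margin]_+ with d the squared Euclidean distance (assumed)."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

# An easy negative (far from the anchor) contributes nothing,
# whereas a hard negative (close to the anchor) yields a positive loss.
easy = triplet_loss([0.0, 0.0], [1.0, 0.0], [5.0, 0.0])
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [2.0, 0.0])
```

The toy values show why easy negatives stop contributing as training proceeds, which motivates the hard negative mining described next.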
A positive is a sequence with a different nucleotide at the same BP position. As the training proceeds, most of the negatives will be further from an anchor than the positives, making $L_{tri}$ ineffective. Hard negative mining relieves this problem by favoring hard negatives over easy ones when evaluating $L_{tri}$. The selection of a hard negative for an anchor is based on the following assumptions: the features of sequences with a single BP will be more similar when their BP sites are closer, and such sequences will be harder to classify. Likewise, sequences with multiple BPs will be more similar when they share more identical BPs. We considered sequences with a single BP in (10) for simplicity. (See Appendix A for more details on identifying hard negatives.) During training, not all anchors may have their counterparts, i.e., positives and hard negatives, in a mini-batch. We therefore augment the training process with auxiliary mini-batches, in addition to normal mini-batches, to calculate $L_{tri}$. In particular, we collect the triplets found in normal mini-batches and place them in an auxiliary mini-batch. When the auxiliary mini-batch has as many samples as a normal mini-batch, we perform a stochastic gradient descent step on the auxiliary mini-batch using $L_{tri}$.
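The bookkeeping for auxiliary mini-batches can be sketched as a simple buffer. The class name `AuxiliaryBatcher` and its interface are hypothetical; the sketch only illustrates the idea of collecting triplets across normal mini-batches and releasing them once a full batch is available.

```python
class AuxiliaryBatcher:
    """Collects triplets found in normal mini-batches until a full auxiliary batch is ready."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def add(self, triplets):
        """Store new triplets; return a full auxiliary mini-batch if available, else None."""
        self.buffer.extend(triplets)
        if len(self.buffer) >= self.batch_size:
            batch = self.buffer[:self.batch_size]
            self.buffer = self.buffer[self.batch_size:]
            return batch
        return None

# Toy usage with a batch size of 3; each triplet is (anchor, positive, negative).
batcher = AuxiliaryBatcher(batch_size=3)
first = batcher.add([("a1", "p1", "n1"), ("a2", "p2", "n2")])
second = batcher.add([("a3", "p3", "n3"), ("a4", "p4", "n4")])
```

In a training loop, a non-`None` return value would trigger one gradient step on the triplet loss, while leftover triplets stay in the buffer for the next auxiliary batch.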
The entire BP-GAN is differentiable. Thus, the model can be trained end-to-end based on standard backpropagation [35]. The losses defined above are combined into the overall objective in (11), with $L_{tri}$ evaluated on an auxiliary mini-batch. We used the GAN training procedure following [36], except for the auxiliary mini-batches. (See Algorithm 1 in Appendix C.)

A. EXPERIMENTAL SETUP

1) IMPLEMENTATION DETAILS
We used the ADAM optimizer [39] with its parameters $\beta_1$ and $\beta_2$ set to 0.5 and 0.9, respectively, for the training. The size of a mini-batch was set to 512. The learning rate was set to 0.001. The margin $\alpha$ in (10) was set to 10. We set $\lambda$ in (11) to 0.5. During training, an auxiliary mini-batch was found to be added approximately once every five normal mini-batches.

2) DATASET PREPROCESSING
We used two public datasets of RNA sequences extracted from chromosomes 1-22 and X in the reference human genome hg19 with a high-confidence set of BP annotations, referred to as $DS_P$ [5] and $DS_M$ [4]. We set $N = 70$, taking the 3' ss and the 69 nucleotides upstream of it to construct an input sequence as a $4 \times 70$ one-hot matrix. We split the datasets into three sets identical to the settings used in previous studies [3], [8], [12], [28], i.e., chromosome 1 as a test set, chromosomes 2, 3, 4, and 5 as a validation set, and the remaining chromosomes as a training set. Table 1 summarizes the data splits used when training the BP-GAN and the imbalance ratio between BPs and other nucleotides. Figure 6 in the Appendix shows a set of plots for the training and validation data with each of the datasets.
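The preprocessing described above can be sketched as a windowing step followed by a chromosome-based split. The `chr`-prefixed chromosome names, the record layout of `(chromosome, intron)` pairs, and the toy sequences are assumptions made for illustration; the actual datasets carry BP annotations that are omitted here.

```python
TEST_CHROMS = {"chr1"}
VALID_CHROMS = {"chr2", "chr3", "chr4", "chr5"}

def upstream_window(intron, n=70):
    """Return the last n nucleotides of an intron, ending at the 3' ss."""
    if len(intron) < n:
        raise ValueError("intron shorter than the input window")
    return intron[-n:]

def split_by_chromosome(records):
    """Assign each (chromosome, intron) record to the train, valid, or test split."""
    splits = {"train": [], "valid": [], "test": []}
    for chrom, intron in records:
        if chrom in TEST_CHROMS:
            name = "test"
        elif chrom in VALID_CHROMS:
            name = "valid"
        else:
            name = "train"
        splits[name].append(upstream_window(intron))
    return splits

# Toy records; real introns would come from the hg19 annotations.
records = [("chr1", "G" + "CU" * 40 + "AG"), ("chr3", "A" * 70), ("chrX", "C" * 100)]
splits = split_by_chromosome(records)
```

Splitting by chromosome keeps all introns of a gene in the same split, so test sequences never share a locus with training sequences.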

B. QUANTITATIVE EVALUATION OF PREDICTION ACCURACY

1) OVERALL PERFORMANCE USING LARGE SCALE DATASETS
We used six performance metrics: the area under the receiver operating characteristic curve (auROC), area under the precision-recall curve (auPRC), F-score, sensitivity (SE), specificity (SP), and recall at 1 (R@1). We set the decision threshold to 0.5 to measure the sensitivity, specificity, and F-score. A high sensitivity signifies that a model is likely to predict BPs with a higher confidence than the threshold. Similarly, a high specificity indicates that a model is effective at filtering out non-BPs. The auROC and auPRC aggregate over multiple thresholds and are therefore more informative than the single-threshold metrics. We laid emphasis on auPRC over auROC with regard to class imbalance [40]. The ratio of the number of BPs to that of other nucleotides was approximately 1:45 and 1:39 in $DS_M$ and $DS_P$, respectively. Recall at $K$ is another important measure: a large value indicates that the $K$ sites with the highest predicted probabilities are highly likely to be true BPs. This can improve the efficiency of biological experiments by reducing the number of BP candidates to be validated. Table 2 compares the BP-GAN with four approaches, namely, SVM-BP [8], branchpointer [12], LaBranchoR [3], and Nazari et al. [28], including two state-of-the-art studies, on the test sets from the two datasets. Not all performance measures are available for all the methods. For example, no working implementations of SVM-BP [8] and Nazari et al. [28] exist, so we can use only the performance numbers reported in the literature; branchpointer [12] could be applied only to $DS_M$, as it required extra information that was not available for the new dataset.
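The threshold-based metrics and recall at $K$ can be sketched for a single sequence as follows. The definition of recall at $K$ used here, the fraction of true BPs found among the $K$ highest-scoring positions, is one plausible reading of the description above and is an assumption; the toy probabilities and labels are invented for illustration.

```python
def sensitivity_specificity(probs, labels, threshold=0.5):
    """Sensitivity and specificity at a fixed decision threshold.
    Assumes at least one positive and one negative label."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def recall_at_k(probs, true_positions, k):
    """Fraction of true BP positions (0-based) among the k highest-scoring positions."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    hits = len(set(ranked[:k]) & set(true_positions))
    return hits / len(true_positions)

# Toy per-nucleotide predictions for one sequence with a single BP at index 1.
probs = [0.1, 0.8, 0.3, 0.6, 0.05]
labels = [0, 1, 0, 0, 0]
se, sp = sensitivity_specificity(probs, labels)
r_at_1 = recall_at_k(probs, [1], k=1)
```

In this toy case the single false positive at index 3 lowers the specificity but does not affect R@1, because the true BP still has the highest score.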
The BP-GAN in the last row of the table outperforms SVM-BP and branchpointer except for sensitivity. In comparison with the best-performing approaches, LaBranchoR [3] and Nazari et al. [28], the BP-GAN boosts the performance on all metrics except auROC on $DS_P$, which is already saturated in all the approaches. The results indicate that the BP-GAN can be used effectively to identify real BP sites as well as to filter out non-BP sites. Notably, in terms of auPRC, the performance gain of the BP-GAN over Nazari et al. [28] (0.009 on $DS_P$ and 0.014 on $DS_M$) is at least as large as that achieved by Nazari et al. [28] over LaBranchoR [3] (0.009 on $DS_P$ and 0.009 on $DS_M$).
In addition, Table 3 presents evaluations of auPRC restricted to the region [-45, -18], where most BPs are located. The auPRC increased in all the approaches compared with the whole-region evaluation. Over LaBranchoR [3] and Nazari et al. [28], the BP-GAN shows a performance improvement similar to that of the whole-region case in Table 2. We also present results with other chromosomes as test sets in Table 4 to show that the BP-GAN consistently outperforms the other approaches.

2) ABLATION STUDY
The pipeline of the BP-GAN combines several components: GAN, attention, and regularization with triplet loss. We conducted an ablation study on these components, and the results are shown in the bottom half of Table 2. A combination of all the components represents the BP-GAN. For comparison, we built CNN models by using an encoder with a network structure identical to $E_c$ and $E_o$ and the BP predictor $BP$; however, the encoder in the CNN models does not use the disentangled feature representation, unlike the BP-GAN. Moreover, only the cross-entropy loss is used for training these models, without the adversarial term. The GAN models exhibited an improved performance in most of the metrics as compared to the CNN models with the same configurations of attention and regularization. In particular, the largest performance gain was achieved in auPRC. It was not clear whether the attention and triplet loss consistently improved the sensitivity on top of the GAN. The attention improved auPRC significantly in the CNN as well as the GAN models. The performance of the CNN models with attention was comparable with those of LaBranchoR [3] and Nazari et al. [28]. The regularization using triplet loss resulted in an additional, consistent improvement in the performance in most of the metrics on all the datasets. The gain with triplet loss is not as significant as that achieved from the GAN or attention; however, the model generalizes effectively, as expected.
Figure 4(c) and (d) depict sequence logos showing the distribution of the nucleotide types for sequences generated by the generator in the BP-GAN. We sampled 1000 RNA sequences from the generator $G(\cdot)$ of the BP-GAN after training for 80 epochs until convergence. Then, their BPs were predicted by $BP(\cdot)$. We repeated this sampling procedure 10 times and created the sequence logo corresponding to the 10,000 generated sequences. The sequence logos reveal the consensus sequence yUnAy clearly.
This result shows that BP-GAN learnt the features for BP from input sequences successfully.

3) EVALUATION OF PREDICTION OF MULTIPLE BPs
As a significant portion of human introns contains multiple BPs [5], we conducted further investigations concerning their prediction. $DS_P$ was the latest dataset with emphasis on multiple BPs; hence, we compared our approach with LaBranchoR in recall for three splits of the test set of $DS_P$ corresponding to the number of BPs in a sequence, $K$. We considered those cases with not more than three BPs: 4517 sequences with a single BP ($K = 1$), 1618 with dual BPs ($K = 2$), and 593 with triple BPs ($K = 3$). These occupied more than 95% of the dataset. Table 5 shows that the BP-GAN outperforms LaBranchoR in terms of prediction of multiple BPs. On the whole, the performances of the BP-GAN and LaBranchoR increased but converged at larger $K$. The BP-GAN, however, is much superior to LaBranchoR in smaller top-$K$ recommendations. The results imply that the BP-GAN can be used to identify BPs from fewer proposals in a cost-effective manner.

C. INTERPRETATION AND VISUALIZATION OF ATTENTION FOR SEQUENCE MOTIF ANALYSIS
The last set of experiments dealt with the interpretation of the predictions for the identification of motifs and its visualization, as described in Section III-D. We selected two nucleotide sites, namely, the BP and the 3' ss. We considered the union of the test splits of $DS_P$ and $DS_M$ to be $S$ in (8). We set $c_s$ to the BPs and to the 3' sss of the sequences in $S$ and calculated their corresponding weighted sequence logos, which are depicted in Figure 5(a) and Figure 5(b), respectively. In Figure 5(a), we restricted the visualization to the BPs in [−45, −15], which accounted for almost 97% of the predicted sites.
In terms of relative distance from a BP, the interval [−3, 2] of the weighted sequence logo in Figure 5(a) is highlighted by a dashed red rectangle. The motif observed in the interval is identical to the aforementioned consensus sequence in nucleotide codes, established based on previous studies [6]. Figure 5(b) highlights the interval [40, 70] as the distance from the 3' ss in the weighted sequence logo. Similar to the previous case, another well-known sequence motif is marked by the wide red rectangle. This is the polypyrimidine tract, which is rich in C and U. In addition, the BP-GAN affirms that the motif 'AG' at the 3' ss is crucial for BP prediction, which is depicted by the narrow rectangle on the right side in Figure 5(b). Note that the existing approaches are not able to identify such sequence motifs at distal and variable locations from BP sites. The results of the interpretation from attention demonstrate that, without any prior biological knowledge, the BP-GAN gains meaningful insights in terms of prediction of sequence motifs. The sequence logo analysis is an added advantage of the BP-GAN for identifying novel sequence motifs and providing human-interpretable reasoning on its decisions, leading to a better understanding of RNA splicing.

TABLE 5. Comparison of BP-GAN and LaBranchoR [3] in terms of multi-BP prediction in recall at K with three sequence sets of identical BPs. The best scores are highlighted in red.

TABLE 6. Comparison of BP-GAN and Nazari et al. [28] in detecting sequence variants on two datasets.

The proposed method of interpretation using the BP-GAN is general as the effect of any region of RNA sequences on branch prediction can be identified directly. For example, if we enlarge an input sequence, distal motifs from a BP can be identified, if they exist. To the best of our knowledge, none of the previous literature addressed this aspect of BP using deep neural networks. An investigation in this area may be pursued in future.

D. EVALUATION ON VARIANTS
We investigated the effect of sequence variants on the model prediction. We used two public datasets of sequence variants, ClinVar [37] and EpilepsyGene [38]. While ClinVar contains sequence variants and corresponding phenotype annotations for a wide variety of diseases, EpilepsyGene focuses on epilepsy. For comparison with the state-of-the-art work, we performed the evaluation similarly to Nazari et al. [28]. In particular, we selected 18 variants from ClinVar that are known to cause non-functional proteins and aligned the variants to sequences in $DS_P$ to constitute a candidate subset for the evaluation. Then, we modified each of the nucleotides within a distance of [-9, 9] from the predicted BP site and repeatedly applied the resulting variants to the BP-GAN to see whether the predicted BP location changed. Table 6 shows the number of variants that changed the BP predictions of the two methods and matched the associated variants in ClinVar. Overall, the BP-GAN agrees better with the variants in ClinVar than Nazari et al. [28]. In particular, the BP-GAN achieved a sensitivity of 0.38 and a precision of 0.88, compared with 0.22 and 0.57, respectively, for Nazari et al. [28]. This shows that the BP-GAN detected more true variants and fewer false positives than the existing approach.
In addition, we performed a similar evaluation on the EpilepsyGene dataset. Even though the number of variants is smaller than in the case of ClinVar, a similar performance gain is observed.
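The variant scan described in this section can be sketched as a loop over single-nucleotide substitutions around the predicted BP. The function `scan_variants` and the stand-in predictor `toy_predict_bp` are hypothetical; in practice the predictor would return the highest-scoring position from a trained model, and the toy rule below (index of the first A) is used only to make the example self-contained.

```python
def scan_variants(seq, predict_bp, radius=9, alphabet="ACGU"):
    """Substitute every nucleotide within `radius` of the predicted BP and return
    the (position, new_nucleotide) pairs that change the predicted BP position."""
    ref_bp = predict_bp(seq)
    lo = max(0, ref_bp - radius)
    hi = min(len(seq) - 1, ref_bp + radius)
    changed = []
    for pos in range(lo, hi + 1):
        for nt in alphabet:
            if nt == seq[pos]:
                continue  # skip the reference nucleotide
            mutated = seq[:pos] + nt + seq[pos + 1:]
            if predict_bp(mutated) != ref_bp:
                changed.append((pos, nt))
    return changed

def toy_predict_bp(seq):
    """Hypothetical stand-in for a trained model: index of the first 'A', or -1 if none."""
    return seq.find("A")

changed = scan_variants("CCUUACCUU", toy_predict_bp)
```

Substitutions that appear in `changed` would then be compared against known disease-associated variants to count matches, as in the evaluation against ClinVar.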

V. CONCLUSIONS
We introduced the BP-GAN, a novel integration of attention and triplet loss on top of a GAN-based framework for human BP prediction from RNA sequences. Our approach used a disentangled representation of RNA sequences with a GAN to learn BP prediction-aware features. Attention learned intra-sequence dependencies, delivering improved prediction accuracy and interpretability of decisions. Triplet loss addressed the issue of class imbalance using hard negative mining. All these techniques contributed to state-of-the-art performance on the two large-scale datasets. The sequence motifs identified from the attention layers of the BP-GAN agreed with the consensus sequence, polypyrimidine tract, and 3' ss, which are known to essentially affect BP sites during splicing. This enabled us to gain biologically meaningful insights to explain the predictions. To the best of our knowledge, this is the first attempt at applying a GAN-based model to a discriminative task in gene expression. Moreover, the BP-GAN can easily be extended to several other similar tasks.

APPENDIX A HARD NEGATIVE MINING FOR TRIPLET LOSS
Consider a sequence $s$ with $b_s \in \{0, 1\}^N$ being its multiple BPs. If we define $s_{a,i}$ as an anchor with a BP at $i$, a hard negative is selected according to (12), where $\|\cdot\|_1$ is the $L_1$ norm. The upper case in (12) corresponds to sequences of a single BP, whereas the lower case accounts for those of multiple BPs. We set $K$ to 64 in the experiments.
Tables 7-11 describe the entire architecture of the BP-GAN.

APPENDIX C PSEUDO CODE FOR TRAINING BP-GAN END-TO-END
Algorithm 1 shows the training process of the BP-GAN.

Algorithm 1
    ...
     3: Sample a random noise vector $z \sim \mathcal{N}(0, 1)$
     4: $s_z \leftarrow G(z)$
     5: $x_{z,c} \leftarrow E_c(s_z)$
     6: $x_{z,o} \leftarrow E_o(s_z)$
     7: Predict branchpoints $BP(x_{z,c})$
     8: Add a quadruple $(s_z, x_{z,c}, x_{z,o}, BP(x_{z,c}))$ to the current normal mini-batch with the label 0 as fake
     9: end for
    ...
    14: Sample a sequence $s \sim P_s$
    15: $x_c \leftarrow E_c(s)$
    16: $x_o \leftarrow E_o(s)$
    17: Add a quadruple $(s, x_c, x_o, b)$ to the current normal mini-batch, assigning the label 1 as real
    ...
    29: Sample a hard negative $s_n \sim P_s$
    30: $x_{a,c} \leftarrow E_c(s_a)$
    31: $x_{p,c} \leftarrow E_c(s_p)$
    32: $x_{n,c} \leftarrow E_c(s_n)$
    33: end for
    ...
    35: $\theta_{E_c} \leftarrow \theta_{E_c} - \nabla_{\theta_{E_c}} L^{(aux)}_{tri}$
    36: end for

Figure 6 shows two sets of plots for training the BP-GAN with the training and validation data splits, corresponding to the terms in (11) for $DS_P$ and $DS_M$.