Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths), that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporate practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.


INTRODUCTION
M ANY real-world problems require to reconstruct an unknown object from partial data observed from it. Genome assembly is a typical instance of such problem in Bioinformatics: given a set of high-throughput sequencing reads obtained from some genomic sequence, we need to reconstruct the sequence from which the reads originate. A major issue in such problems is that multiple solutions (reconstructions) can explain the observed data, making it difficult to distinguish the correct solution. As such, reporting one arbitrary solution may easily lead to an incorrect answer to the problem. An established way of coping with this issue is to report only partial solutions about which we are "confident" that they are correct. For example, state-of-the-art genome assemblers do not output entire chromosomes, but only contigs, namely genomic fragments that are promised to occur in the original genome.
An algorithmic way of formalizing such "reliable" partial solutions is through the notion of safety, introduced in [1] to model contig assembly. Given a problem P, we say that a partial solution to P is safe if it is common to all solutions to P. Assuming that the real solution is among the solutions to our computational problem P, then safe partial solutions are common also to the real solution, and therefore correct. We would like to report all such safe partial solutions, and in fact, it suffices to focus on all maximal safe partial solutions, namely those such that any other safe partial solution is contained in a maximal one. Safety has precursors in Bioinformatics (reliable regions in sequence alignments [2]), but also in combinatorial optimization (persistency in bipartite matchings [3]). Safety has also been applied to other genome assembly sub-problems such as gap filling [4].
The algorithmic toolkit for safety in genome assembly is quite developed by now. Indeed, a typical computational solution to a genome assembly problem is some type of walk in an assembly graph. For example, if we are sequencing a single bacteria and building an edge-centric de Bruijn graph from the reads, then the solution is a circular edge-covering walk [1]. A safe partial solution (i.e., contig) is thus a walk appearing as a subwalk of all such circular edge-covering walks. This graph-theoretic problem has been shown to admit efficient algorithms computing all their maximal safe walks [5], [6]. These run in time OðjV jjEjÞ, where V and E are sets of vertices and edges, respectively, of the assembly graph. Recently, these results were generalized for a plethora of different genome assembly models in a common framework known as the hydrostructure [7].
Over the last decade, sequencing technologies have been applied to mixed settings, where a sample contains multiple distinct genomic sequences, that may differ in varying degrees. Genome assembly has thus been extended to multiassembly problems [8] asking to reconstruct all such individual genomic sequences from mixed reads sequenced from all of them. Popular instances of such multi-assembly problems ask to reconstruct e.g., the RNA transcripts present in a cell population [8], or the quasi-species of a virus present in infected cells [9]. Although the of concept safety is implicitly used in current multi-assemblers, there exists a lack of formal treatment of this notion, despite its relevance and widespread adoption in the standard genome assembly problem.

Path Covers and Multi-Assembly
Given a directed acyclic graph G (or DAG, for short), we consider a solution to the multi-assembly problem to be a set of paths in G such that every vertex of G appears in at least one path. Such a set of paths is called a path cover of G, and it is present at the core of practical tools for both RNA transcript assembly and viral quasi-species assembly, as we review next.
A common graph used in the multi-assembly of RNA transcripts is a splicing graph, obtained by first identifying exons from the RNA read alignments to the reference. Every exon is then added as a vertex, and every read overlapping two exons indicates a possible splicing junction and is added as an edge between the two exons. The edges are directed "from left to right" (in the reference genome), making the graph a DAG.
An RNA transcript naturally corresponds to a path in such a DAG, thus the set of all RNA transcripts (solution to the RNA transcript assembly) corresponds to a path cover of the DAG. Various criteria have been proposed to define what path covers are actually a solution, e.g., those optimal with respect to some combinatorial optimization problem, or some statistical model. While state-of-the-art methods also include many practical steps and heuristics, they have one of two main approaches at their core. The first approach, used up to some variations and constraints by [10], [11], [12], [13], [14], define a solution as a path cover with a minimum number of paths (minimum path cover, or MPC), or with a number of paths smaller than an upper bound. The second approach, used up to some variations by [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], is based on finding a flow in the splicing graph best explaining the observed read coverage, and then on decomposing the flow into a small number of paths. Such paths also form a path cover, which is now optimal with respect to some flow criteria, not just cardinality.
Path cover models are also used in the assembly of reads sequenced from all viral quasi-species in a sample. Minimum path cover-like methods include [27], [28], and methods based on path covers optimal to some flow criteria include [29], [30].

Our Contributions
We give the first algorithms outputting all maximal safe paths for a natural notion of path cover, which we call constrained path cover. A constrained path cover can have at most a given number ' ! k ¼ widthðGÞ 1 of paths, and its paths are now required to start and end in given vertex sets, S and T , respectively. This notion generalizes that of an MPC, 2 which is a classical combinatorial object, dating back to Dilworth's and Fulkerson's results in the 1950s [31], [32], and appearing in standard textbooks such as [33]. More formally, given a DAG G, sets S; T V , and an integer ' ! k, we say that a path P is safe for G; S; T; ', if for every constrained path cover P of G, P is a subpath of some path of P. When G; S; T and ' are clear from the context, we just say that P is safe. Fig. 1 illustrates these concepts.
Our safe algorithms are obtained using a general "avoidand-test" approach. Intuitively, given a structure to be checked for safety, we transform the graph so that no solution can use the structure fully, while not changing the other properties of the graph. If a solution still exists, then it must avoid the structure, proving it is unsafe. This general approach dates back to finding edges present in all maximum matchings of a bipartite graph [3]. However, safe paths can contain more than one edge, which requires a more complex approach to "avoid-and-test".
We start with an arbitrary minimum-sized constrained path cover P, and we test the safety of every subpath P of a path in P. By definition, all safe paths are subpaths of some path of P. We introduce the reduction of G with respect to P , G P , and we show that P is a safe path if and only if widthðG P Þ > ' and the subpath formed by excluding the last vertex of P is safe. A naive implementation of these ideas leads to a running time of OðkjV j 2 mpcðGÞÞ for computing maximal safe paths, where mpcðGÞ denotes the running time The path cover P on the top is the unique constrained path cover for ' ¼ 3, therefore safe paths are exactly the paths of P. The path cover on the bottom does not have the path P ¼ s; u; v; t, therefore P is not safe for ' ¼ 4.
1. Given a DAG G, its width, denoted widthðGÞ, equals the minimum size of a path cover of G.
2. An MPC is a constrained path cover with ' ¼ k and of computing an MPC of G. However, we improve these running times by using two additional ingredients. First, since we need to output only maximal safe paths, it suffices to do a two-finger scan over each path of P. The scan advances the right finger if the path P between them is safe, or the left finger otherwise. Second, to avoid recomputing an MPC of G P for each different P , we build a flow data structure on top of G, which allows determining whether widthðG P Þ > ' in time Oðmaxð1; k þ m À 'ÞjEjÞ, where m is the number of paths of P containing the last edge of P . Our data structure is based on the concept of "shrinking", previously used to obtain efficient parameterized solutions for the problem of computing an MPC [34], [35]. The application of these two ideas improves the running time of our solution to Oðk 2 jV jjEjÞ.
To test our solution and provide a proof-of-concept study highlighting the potential usefulness of safe paths in RNA transcript assembly, we consider splicing graphs constructed from transcript annotation of different species, including human. As such, the splicing graphs used in our experiments correspond to work in perfect conditions, removing the biases introduced by errors prior to the splicing graph construction [36]. We denote the maximal safe paths in this application as RNA contigs. As a comparison baseline, we also consider a basic notion of safe paths, namely those maximal paths whose internal vertices have in-degree and out-degree equal to one. These are called unitigs and are commonly used in genome assembly [37], [38], [39]. On these datasets, RNA contigs for constrained path covers are up to 8 times longer than unitigs, while being over 99% precise, and take less than 15 seconds to be found. Moreover, if we define the maximum relative coverage of a transcript as the length of the longest RNA contig segment inside it, divided by the length of the transcript, then transcripts have maximum relative coverage up to 80%. Moreover, RNA contigs cover over 70% of the coding sequences of transcripts in most cases.
We hope that our findings introduce a new toolkit and perspective on the notoriously hard transcript assembly problem. We envision that, after enhancements dealing with real data issues, safe paths could be applied into RNA assembly (or multi-assembly) pipelines as follows: By outputting only RNA contigs, in the same way as state-of-the-art genome assemblers output contigs. Our initial results support this, since RNA contigs, albeit not being full transcripts, still have significant length, and at the same time are correct (partial) transcript assembly results. As a preprocessing step to methods that extend given strings into full transcripts. For example, [12], [18] first conservatively assemble the RNA-Seq reads into some longer sequences, and use these to guide the (more heuristic) assembly of full RNA transcripts. By taking the RNA transcripts output by any existing RNA transcript assembler, and marking the maximal transcript substrings that match a safe path. In this way, our method could indicate some parts of an RNA assembly solution that are likely to be correct. The rest of this paper is structured as follows. In Section 2 we develop the theory behind our algorithms reporting safe paths for constrained path covers. In Section 3 we discuss the RNA transcript assembly problem in more detail. We also present the experimental setup and results for RNA contigs, as an application of safe paths. We conclude the paper in Section 4.

SAFE PATHS FOR CONSTRAINED PATH COVERS
As previously discussed, a solution to a multi-assembly problem can be seen as a path cover of a DAG G. Therefore, safe paths are paths in G that are common to all path covers of G. However, this definition of safety is too permissive in formal terms, since it allows for P ¼ fðvÞ : v 2 V g (one path for each vertex v, consisting only of vertex v) to be a solution (path cover), thus yielding maximal safe paths to be isolated vertices. 3 To overcome this problem, we restrict the solution space by including further practical information. We formalize this in the concept of a constrained path cover.
Definition 1 (Constrained path cover). Let G ¼ ðV; EÞ be a DAG, S; T V be two sets of vertices such that sourcesðGÞ S; sinksðGÞ T , 4 and let ' ! widthðGÞ. We say that a path cover P of G is a constrained path cover (for G; S; T; ') if: (a) Every path P 2 P starts at some vertex in S and ends at some vertex in T . (b) The number of paths of P is at most ', jPj '.
Constraint (a) restricts the paths of P to start and end with some of the vertices of given sets of vertices S and T , respectively. We say that the paths of P go from S to T . This restriction incorporates the fact that in some contexts, such as RNA transcript assembly, some candidates of the starting / ending vertices may be identified, e.g., as those vertices whose read coverage present an initial upward / final downward slope [21], [22]. On the other hand, constraint (b) restricts the solution to contain at most ' paths, which is used to relax the minimality condition present in minimum path cover-like multi-assemblers as previously reviewed.
Note that, if S ¼ T ¼ V and ' ¼ 1, then we remove both restrictions and recover the classical definition of path cover. Besides, if instead ' ¼ k, then the constrained path covers correspond to the MPCs of G.
The conditions sourcesðGÞ S; sinksðGÞ T are necessary to ensure that at least one constrained path cover exists (otherwise there exists a vertex that is not reached from S or does not reaches T , which cannot be covered). On the other hand, the condition ' ! widthðGÞ is also necessary, since a constrained path cover is, in particular, a (classical) path cover of G, thus of size at least widthðGÞ.
Moreover, these three conditions (on S; T and ') combined suffice to ensure that at least one constrained path cover exists, which follows from the following clemma.
Lemma 1. Let G be a DAG, S; T V be two sets of vertices such that sourcesðGÞ S; sinksðGÞ T . Then, the minimum number of paths of a path cover with paths from S to T is widthðGÞ.
The proof of Lemma 1 (see Supplemental material 1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/ TCBB.2021.3131203, for the proofs of our lemmas and theorems) also give us a way to compute our initial solution: we first compute an MPC of G and then extend their paths to S and T .
Next, we observe that it is sufficient to consider jSj ¼ jT j ¼ 1 by using the following reduction. We build the graph That is, we add a new unique source vertex pointing to every vertex of S, and a new unique sink vertex pointed from every vertex of T . Note that, there is a one-to-one correspondence between constrained path covers for G; S; T; ' and constrained path covers for G 0 ; fsg; ftg; ' (or just G 0 ; s; t; '), since we can obtain one from the other by adding/removing s and t to/from the paths. As such, we can reduce the computation of safe paths for constrained path covers for G; S; T; ' (or just safe paths for G; S; T; ') by computing safe paths for G; s; t; ' (we abuse of the notation and use G instead of G 0 for simplicity).
To decide whether a path P is safe (for G; s; t; ') we introduce the construction G P , which we call the reduction of G with respect to P . Definition 2. Given a DAG G ¼ ðV; EÞ and a path P ¼ ðu; x p Þ j u 2 N À ðx i Þ n fx iÀ1 g È É : Fig. 2 illustrates G P . The main idea behind this construction is that no path of G P can contain P entirely, since ðx pÀ1 ; x p Þ is removed. However, the usage of every proper suffix of P , except if this is used as the beginning of a path, is emulated by the transitive edges drawn from the in-neighbors of the vertices of the paths (except from the previous vertex on the path, and not for the first vertex) to x p . Proper suffixes of P (except x p ) used as the beginning of a path are not emulated in G P , and adding transitive edges from arbitrary vertices of P to represent these suffixes will ultimately emulate P . To solve this issue we incorporate to our hypothesis the safety of subpaths of P . Specifically, we obtain the following theorem. Theorem 2. Let G ¼ ðV; EÞ be a DAG, s 2 V the unique source of G, t 2 V the unique sink of G, and let P ¼ x 1 ; . . . ; x p be a proper path of G, such that x 1 ; . . . ; x pÀ1 is a safe path for G; s; t; '. Then we have one of two cases: 1) x pÀ1 2 sinksðG P Þ _ x p 2 sourcesðG P Þ, in which case P is a safe path for G; s; t; '.
2) otherwise, P is a safe path for G; s; t; ' if and only if widthðG P Þ > '.
Our first algorithm for computing maximal safe paths (for G; s; t; ') uses Theorem 2 to test the safety of each of the OðkjV j 2 Þ subpaths of the paths of an initial minimum-sized constrained path cover P of G (OðjV j 2 Þ subpaths for each of the k (by Lemma 1) paths of length OðjV jÞ each). We process these subpaths in increasing order of their lengths. When processing a subpath P ¼ x 1 ; . . . ; x p we inductively know whether x 1 ; . . . ; x pÀ1 is safe, and we can proceed accordingly. If x 1 ; . . . ; x pÀ1 is not safe then P is not safe either (by definition of safety). Otherwise, if x 1 ; . . . ; x pÀ1 is safe we use Theorem 2 to test whether P is safe. We build G P , if x pÀ1 2 sinksðG P Þ _ x p 2 sourcesðG P Þ we report P as safe, otherwise, we compute an MPC P 0 of G P and report P as safe if and only if jP 0 j ¼ widthðG P Þ > '. If P is safe we add it to a set of maximal safe paths and remove x 1 ; . . . ; x pÀ1 and x 2 ; . . . ; x p from this set since P invalidates their maximality. Note that with this rule every maximal safe path is captured and every non-maximal safe path is removed at some point. We call this approach unoptimized, and it runs in time OðkjV j 2 mpcðGÞÞ, where mpcðGÞ denotes the running time of computing an MPC.
By running a two-finger algorithm on every path P of P, we can compute the maximal safe paths inside P in time OðjP jÞ ¼ OðjV jÞ. We maintain pointers x and y on vertices of P , and the invariant that the subpath of P between x and y is a safe path. First, we try to extend this subpath by moving y to the next vertex of P . Since we know that our current subpath is safe we use Theorem 2 (as previously explained) to test if the extension is safe. If it is not, we move x (and also y, if x ¼ y) to the next vertex of P and try again. Maximal safe paths reported by this approach are only guaranteed to be maximal within P (the path from which they were reported), but not necessarily maximal in the whole G.
To efficiently report all maximal safe paths of G, we collect the maximal safe paths inside each P 2 P, and then filter those that are subpaths of already reported paths. For this we apply the generalized suffix tree approach used in Step 3 of [40]. This lets us filter the paths contained in another path in total time OðNÞ where N is the sum of the lengths of all paths. Here, N ¼ OðkjV j 2 Þ, since the two-finger algorithm reports OðjV jÞ safe paths each of length OðjV jÞ, for each path the k paths of P. As such, this approach runs in time OðkjV j 2 þ kjV jmpcðGÞÞ ¼ OðkjV jmpcðGÞÞ.
Our last optimization consists in speeding up the second case of the safety test of Theorem 2. Each such test involves computing an MPC of G P , taking mpcðGÞ time. Instead, we use our initial solution P to compute a flow data structure at the beginning of the algorithm. This data structure takes additional OðkjV j þ jEjÞ time to construct, but allows us to test whether widthðG P Þ > ' in time Oðmaxð1; k þ m À 'ÞjEjÞ, Fig. 2. The construction G P from Definition 2, for P ¼ x 1 ; . . . ; x p : the edge ðx pÀ1 ; x p Þ is removed, and (transitive) edges are added from the in-neighbors (except from the path predecessor) of the path vertices (except x 1 ) to x p , shown as dashed.
where m is the number of paths of P containing the last edge of P . In broad terms, the data structure corresponds to a reduction of path cover to a flow network, which has been used before for finding an MPC [34], [35], [41], [42], [43], and the answer to the queries "widthðG P Þ > '?" are based on the concept of shrinking of a flow in this flow reduction [34], [35]. Lemma 3. Let G ¼ ðV; EÞ be a DAG of width k, and P an MPC of G. We can build a data structure in time OðkjV j þ jEjÞ, answering whether widthðG P Þ > ' in time Oðmaxð1; k þ m À 'ÞjEjÞ, for every ' ! k; P ¼ x 1 ; . . . ; x p proper path of G, with m ¼ jfP 2 P j ðx pÀ1 ; x p Þ 2 Pgj.
The proof of Lemma 3 (see Supplemental material 1, available online) shows us that if we are in the second case of Theorem 2, then widthðG P Þ k þ m 2k. As such, if ' ! 2k, then widthðG P Þ > ' will always be false. Using the two-finger algorithm and the flow data structure, we obtain the main theoretical result of this paper.

APPLICATION TO RNA TRANSCRIPT ASSEMBLY
The functioning of the cell is based on the transcription of genes into transcripts, followed by the translation of the transcripts into proteins. This makes the set of transcripts present in a cell (the transcriptome) an important link between DNA and phenotype, and can give information of the current and future state of a cell. High-throughput sequencing of transcripts (RNA-seq) started in 2008 [44], [45], and later proved essential in characterizing gene regulation and function, development and diseases, including cancer [46], [47], [48], [49]. In complex organisms, one gene can produce different transcripts, each in a different number of copies. For example, about 95% of multi-exon genes in humans produce multiple transcripts through alternative splicing [50]. Alternative splicing alters the set of exons transcribed by the gene, but different transcripts can still share exons.
The transcript assembly problem aims to reconstruct the set of transcripts present in a set of RNA-seq reads. While long reads hold the potential to sequence through the full transcripts, thus resolving the full transcript structure, there are inherent biases in the protocols which makes them unable to sequence longer transcripts [51], [52]. In addition, long-read alignment software have been shown to produce inaccurate alignments [53], [54] on which the assembly methods rely on. Moreover, current state-of-the-art long-read transcript assembly methods are still in active development, with precision below 50% on biological data [55]. Due to these limitations, short-read RNA-seq assembly remains a viable option.
The importance of having reliable transcript assembly results is further underlined by two recent related works [64], [65]. However, they tackle the reliability issue with a significantly different approach, as they provide methods to calculate a "confidence" range for the abundance in the sample of a given candidate transcript. In Fig. 3 we show an example of our safety approach applied to an RNA splicing graph of the human transcriptome built from genome annotation. We call RNA contigs the maximal safe path in this context.

Implementation and Experiments
We aim to motivate the use of safe paths as an alternative construct to unitigs for transcript assembly. Therefore, the main purpose of our experiments is to evaluate the construction time, correctness, and length of sequences produced by maximal safe paths compared to the sequences produced by unitigs. We compare the two approaches for RNA of real organisms, but avoid external artifacts introduced by aligning the RNA-seq reads, unsequenced regions, inferring exons from alignments, and constructing the splicing graph, which introduces many biases [36]. We instead build the splicing graph directly from gene annotation As such, the splicing graph used in our experiments corresponds to work in perfect conditions. Removing the biases introduced by errors prior to the splicing graph construction provides us a preliminary evaluation of RNA contigs in the transcript assembly problem. It is also worth noting that our datasets (annotated transcripts) as well as our algorithm are read coverage agnostics. Since the main objective of safe paths is to output information about the transcripts but not about their abundances, we do not consider this characteristic in our evaluation.
Datasets and Input. We started from gene annotation, by considering all transcript annotation from the Ensembl database [66] for different species as detailed in Table 1.
In case two exons E ¼ E 1 E 2 and E 0 ¼ E 2 E 3 from different transcripts have a suffix-prefix overlap E 2 , we replace each occurrence of E in a transcript with E 1 ; E 2 , and similarly for E 0 . This is common in splicing graph construction (see e.g., [17], [67]); the resulting exons are usually called pseudo-exons, but for simplicity we refer to them as exons. Graph vertices correspond to annotated (pseudo-)exons, and edges correspond to exons consecutive in some annotated transcript. Each weakly connected component of this graph is a splicing graph. For each dataset we considered all transcripts on the forward strand and build the corresponding splicing graphs. We also store the coding sequences on each transcript annotated by the corresponding Ensembl dataset. We say that a splicing graph is a trivial instance if it is formed by less than 3 exons or less than 2 transcripts. We filter out trivial graph instances from the dataset for our experiments.
To correlate the results with the complexity of the data, we partitioned all the annotated transcripts based on their and large ( > 5000 bases). We also partitioned coding sequences (1-1000, 1001-2500, > 2500 bases) and splicing graphs (3-15, 16-50, > 50 vertices) in these three categories. 5 Implementations. For each splicing graph, we fixed the sets S and T as the set of exons appearing as first, or as last, respectively, in some annotated transcript. We assume that such S and T can be detected at the splicing graph construction phase. 6 In practice, this could be done as stated in Section 2. The algorithm from Theorem 5 (optimized) was implemented in C++, and uses an implementation of the Edmons-Karp algorithm [68], [69] from the LEMON graph library [70] for theoretical guarantees. To implement the unoptimized version (not two-fingers, nor the flow data structure) we used the LEMON's implementation of the Network simplex algorithm [71], [72] (for better performance). The input data manipulation and evaluation code was written in Python, including the computation of the ST -unitigs discussed later. Our entire code and input datasets are publicly available at https:// github.com/algbio/SafePathsRNAPC. Our C++ code was compiled with optimization flag -O9 using compiler gcc version 8.3.0. The experiments ran on a single thread in a Linux machine (Debian GNU/Linux 10, kernel 4.19.0-10-amd64), with processor Intel (R) Core (TM) i7-8750H @ 2.2 GHz and 16 GB of RAM.
Choice of Parameter ' and its Effect on Correctness. For a splicing graph, let us denote by t the number of true transcripts. By construction and definition, the RNA transcripts of a splicing graph correspond to a constrained path cover with at most ' ¼ t paths. Thus, safe paths for such constrained path covers are also correct, in the sense that they appear in the true RNA transcripts (because they appear in any constrained path cover with at most ' ¼ t paths). However, safe paths for constrained path covers with ' < t may The table also shows the number of transcripts, coding sequences and splicing graphs after filtering trivial graph instances of the problem (see Section 3.1). 5. The grouping was made so that approximately matches the 60% smallest part of the data (small) the next 30% (medium) and the remaining 10% largest part of the data.
6. This assumption is also part of the perfect conditions in which we ran our experiments. not be correct (i.e., may not appear in some true RNA transcript). As such, for each splicing graph, we first compute the smallest size (i.e., number of paths) of a constrained path cover, for the fixed S and T , and we denote this size by k. Since t is unknown in practice, we perform experiments for ' 2 fk; k þ 1; . . . ; 2kg, to evaluate which is a good choice for the parameter '. Note that maximal safe paths for ' ¼ 2k are common to all path covers (of any size) as stated in Observation 4, thus if t > 2k we can interpret the results for ' ¼ t using the results for ' ¼ 2k (even when we did not run the algorithm for ' ¼ t).
Baseline Comparison With ST -Unitigs. We also implemented a standard strategy used by genome assemblers to compute contigs. This involves reporting those paths whose internal vertices have in-degree and out-degree equal to 1 (and with at least one internal vertex), also called unitigs [37]. If no vertex of S or of T appears in a unitig, then the unitig is safe for constrained path covers, for any '. Intuitively, to cover an internal vertex v of a unitig P (which exists by definition), one must arrive from some vertex of S, then entirely traverse the prefix of P until v, and then arrive to some vertex in T , thus entirely traverse the suffix P from v. However, if e.g., some vertex of T appears in P after v, the path in the constrained path cover covering v may stop before reaching the end of P . As such, we say that an ST -unitig is a unitig containing no vertex of S and T as internal. It holds that ST -unitigs are safe and correct for constrained path covers, for any '.
Evaluation Metrics. To evaluate the performance of RNA contig we report several metrics computed in vertex length, that is, number of exons, and in base length, that is, the total number of bases in all the exons considered.
Our first two metrics show the improvement of RNA contigs with respect to ST -unitigs. Since ST -unitigs must be covered by one path, they are subpaths of some maximal safe path. As such, for every ST -unitig we compute its improvement as the longest RNA contig containing it, and report the difference and ratio of their lengths.
From this point onward, by contig we denote both an RNA contig, and an ST -unitig. Our second set of metrics measure the sensitivity of both approaches from different perspectives. First, for every transcript (and every approach) we compute the longest contig segment inside the transcript, that is, the longest path that is a common subpath of the transcript and of some contig. We say that the length of this subpath is the maximum coverage of the transcript. To specifically measure the coverage of coding sequences, we also compute their maximum coverage, and we call it maximum coding coverage. Besides, for every transcript we compute a standard metric used in genome assembly, namely, the e-size [73] of the transcript. In this case, the e-size is the average length of all contigs inside the transcript overlapping a random location of it. More precisely, for every position in a transcript T , we take the average length over all contig segments inside T overlapping that position. The e-size of the transcript is obtained as the average of such averages, over all positions of the transcript. Finally, to normalize our results, we report our sensitivity metrics divided by the length of the corresponding transcript.
Our final metric computes the precision of contigs per splicing graph. We classify a contig as correct if it is a subpath of at least one annotated transcript. The precision over a splicing graph is the total length of the correct contigs, divided by the total contig length. Concrete examples of some of these metrics can be found in Fig. 3.
Safe Paths of a Variation Graph. To demonstrate the time scalability of our algorithms, we computed safe paths of a variation graph common to all path covers (' ¼ 1). Note that this is not related to multi-assembly, the main application envisioned by this paper. However, safe paths in this context can be interpreted as genomic regions common to all individuals represented by the variation graph. We used a variation graph built from the Leukocyte Receptor Complex (LRC) [74], which constitutes one of the most diverse variant spots. We extracted subgraphs of different sizes of the LRC variation graph and run the optimized and unoptimized versions of our algorithm. Sugraphs were taken as subgraphs induced by a consecutive segment of vertices in a topological order, which guarantees the width of the resulting subgraph to be at most the width of the original graph [75].

Results
In this section we show a summary of the results of the metrics described in Section 3.1 across the different datasets. A detailed breakdown of the metrics and running times for each dataset can be found in the Supplemental material 2, available online. For the metrics maximum coverage, e-size and precision, we only show them in base length. We observed that average precision for ' ¼ k is over 75% for every dataset except barley (over 60%) and rice blast (over 44%). On the other hand, the average precision for ' ¼ k þ 1 is over 99% for all species and 100% for ' ¼ t and ' ¼ 2k as expected. Moreover, in general, all metrics are very close (less than 1% of difference at most) for ' ¼ k þ 1, ' ¼ t and ' ¼ 2k, suggesting that safe paths derived with prior knowledge of the number of transcripts (' ¼ t) can be reasonably approximated as safe paths for ' ¼ k þ 1 or safe paths common to all path covers (' ¼ 2k), with a small tradeoff of precision versus coverage. In the next results we fix ' ¼ k þ 1 for RNA contigs. Table 2 shows the improvement of RNA contigs with respect to ST -unitigs. Large graphs have longer RNA contigs, suggesting that maximal safe paths manage to connect paths in different sectors that ST -unitigs are not able to capture with their simple rule. Overall, RNA contigs are from 1.4 (small graphs, human) to 8.1 (large graphs, wheat) times longer in terms of bases, and from 1.17 to 5.6 times longer in terms of vertices (i.e., exons). The most improvement with respect to ST -unitigs occurs in wheat and fruit fly, whereas the least improvement in human, mouse and rice blast, the latter with just a few graphs (see Table 1) with no more than 15 vertices each.
In Table 3 we show e-size and maximum coverage (normalized by transcript length) at transcript level, and in Table 4 we show them at splicing graph level. In terms of precision (Table 4), as discussed before, RNA contigs reach over 99% on average for ' ¼ k þ 1 for all datasets and size of splicing graphs, while ST -unitigs are 100% precise as expected (as well as RNA contigs for ' ¼ 2k, see Supplemental material 2, available online). At transcript level (Table 3) e-size of RNA contigs is over 40% for human, meaning that at a random position on the transcript, the average length of   all the safe "stretches" inside that transcript, and overlapping that location, is about 40% of the length of the transcript length, whereas for wheat and rice blast the e-size is over 60% and 70% respectively and only 22% for barley. At graph level, on small graphs, the longest RNA contig of a transcript is on average over 80% of the transcript length in all species. Table 5 shows maximum coding coverage (normalized by coding sequence length) at coding sequence level. RNA contigs cover over 70% of the coding sequences of transcripts in most cases. and over 48% in general. In terms of these global metrics, RNA contigs are typically between 10-20 percentage points over ST -unitigs. These results suggest that RNA contigs can provide significantly long and correct information about the RNA transcripts of a splicing graph, without any heuristic or complex assembly model. Finding ST -unitigs takes less than 2 seconds, whereas RNA contigs are reported in 32 seconds for the unoptimized variant, and less than 12 seconds for the optimized in the slowest dataset (human, see Table 6). As such, the unoptimized algorithm, being the simplest, is a good initial candidate to be adapted and extended by future practical RNA contig assemblers. The difference of running times between both algorithms becomes clear when working in the subgraph of the variation graph LRC (see Table 7). When considering a subgraph of 10000 vertices the optimized version takes less than two minutes while the unoptimized more than 20 minutes. Finally, for the subgraph of 50000 vertices the optimized version takes less than 1 hour whereas the unoptimized runs in more than 17 hours.

CONCLUSION
The results of this paper are two-fold. On the theory side, we considered a natural generalization of the classical problem of minimum path cover, including more practical constraints, which we called constrained path covers. We devised the first algorithms finding all maximal safe paths for them, through a general "avoid-and-test" framework, that could have applications in other safety problems. We showed how the direct versions of these algorithms can be improved by reusing computation, with the help of a flow data structure and a two-finger computation of maximal safe paths.
On the practical side, we implemented our algorithms and offer these implementations in publicly available repositories for practitioners and further development of our ideas. As an application of our algorithmic ideas, we proposed safe paths for constrained paths covers as a contig model in RNA transcript assembly.
We evaluated the benefits of RNA contigs on transcript annotation (perfect conditions) of various species and observed for the first time that RNA contigs contain significantly long parts of the transcripts. This fact was not obvious from the outset, because in practice the problems may admit many different path covers, which have very little subpaths in common, The Unoptimized column corresponds to the algorithm not using the two-finger approach nor the flow data structure. The Optimized column corresponds to our algorithm using these optimizations. The number in the graph name indicates the number of vertices in the subgraph. and thus very short safe paths overall. However, our results show that RNA contig assembly is indeed a valid approach in transcriptome assembly. We also provided several key computational techniques that can lay at the core of future practical tools. As such, once RNA contigs are possibly enhanced with heuristics (for example from [12], [16], [18], [20], [57]), RNA contig assembly could enjoy the same success as contig assembly does in genome assembly. Finally, our results are not limited to RNA transcript assembly but to any multi-assembly problem whose solution can be modeled with a path cover. For example, in the assembly of reads sequenced from all viral quasi-species in a sample there exist minimum path cover-like methods [27], [28], and methods based on path covers optimal to some flow criteria [29], [30].
Romeo Rizzi received the laurea degree in electronic engineering from the Politecnico di Milano, Milan, Italy, in 1991, and the PhD degree in computational mathematics and informatics from the University of Padova, Padua, Italy, in 1997. Afterwards, he held postdoc and other temporary positions at research centers like CWI (Amsterdam, Holland), BRICS (Aarhus, Denmark) and FBK (Trento, Italy). In 2001, he became assistant professor by the University of Trento. In 2005, he became associate professor by the University of Udine. Currently, he is full professor in operations research with the University of Verona, Italy. His main interests include combinatorial optimization and algorithms. He is an area editor of 4OR and acts as a reviewer for the American Mathematical Society. His papers cover the areas of discrete mathematics, combinatorics, algorithms, bioinformatics, and artificial intelligence. Since 2004, he has intensively acted as a trainer of the Italian team for the iOi.