Flow Decomposition With Subpath Constraints

Flow network decomposition is a natural model for problems where we are given a flow network arising from superimposing a set of weighted paths and would like to recover the underlying data, i.e., decompose the flow into the original paths and their weights. Thus, variations on flow decomposition are often used as subroutines in multiassembly problems such as RNA transcript assembly. In practice, we frequently have access to information beyond flow values in the form of subpaths, and many tools incorporate these heuristically. But despite acknowledging their utility in practice, previous work has not formally addressed the effect of subpath constraints on the accuracy of flow network decomposition approaches. We formalize the flow decomposition with subpath constraints problem, give the first algorithms for it, and study its usefulness for recovering ground truth decompositions. For finding a minimum decomposition, we propose both a heuristic and an FPT algorithm. Experiments on RNA transcript datasets show that for instances with larger solution path sets, the addition of subpath constraints finds 13% more ground truth solutions when minimal decompositions are found exactly, and 30% more ground truth solutions when minimal decompositions are found heuristically.


INTRODUCTION
F LOW networks are useful models in many domains, from transportation planning to computational biology. In some cases, the flow on a graph arises as the superposition of some set of weighted paths, such as trips through a road network, routing of information through a communication network, or paths in a graph encoding mixed reads sequenced from several biological sequences, as in the case of RNA transcripts through a splice graph.
In many such applications, we are actually presented with the inverse problem: given a flow in a graph, we need to recover the initial paths that made up the flow. This problem is also referred to as the flow decomposition (FD) problem. In computational biology, this is a common subroutine in multiassembly problems, such as RNA transcript assembly or viral quasispecies assembly. Prioritizing parsimonious solutions proved to be an accurate assembly method, but it can suffer when there are multiple parsimonious solution to choose from. As such, in this paper we consider a natural generalization of the flow decomposition problem, by assuming that extra information about the initial paths is available in the form of subpath constraints. These are subpaths in the graph that must be followed by at least one path in the flow decomposition; thus, we are looking for flow decompositions with the property that every constraint is a subpath of some decomposition path. We call the resulting problem flow decomposition with subpath constraints (FDSC).
In a version of this work presented at WABI 2021, we left open the question of whether FDSC (not necessarily minimal) can be solved in polynomial time. In this updated work, we give a polynomial time algorithm for FDSC, and present additional discussion on the hardness of two FDSC variants.

Biological Setting
Algorithms that solve variations of the flow decomposition problem are at the heart of most RNA transcript assembly software, including IsoLasso [1], Traph [2], FlipFlop [3], Scallop [4] and StringTie [5]. More recently, flow decomposition methods were used for another multi-assembly problem, namely strain-aware genome assembly, with applications to viral quasispecies assembly [6], [7]. Briefly, flow decomposition methods for sequence assembly work by using reads and their abundances to first construct a flow network whose vertices may represent exons (in the case of an RNA splice graph) or k-mers (in the case of a de Bruijn graph). Edges in the network are present if there is read evidence that some sequence followed the edge (e.g., two exons are consecutive in some transcript). Furthermore, each edge is weighted by the number of reads that support it. With perfect data, we might expect the weights to directly provide a flow in the network; however in practice some adjustment to the weights may be needed to achieve a flow. One such method uses a minimum-cost flow approach for this adjustment [2]. Another approach [8] models the input as an inexact flow network in which edge flows belong to intervals, that are estimated from the data. In all cases, we seek a path decomposition for the flow network that minimizes the number of paths.
In Kloster et al. [9] it was shown that in the case of RNA transcripts, most of the time the "true" transcripts also provide a minimum flow decomposition of the splice graph. However, there can often be more than one solution to the minimum flow decomposition problem; indeed, Kloster et al. found that, when the number of true transcripts is seven, the minimum flow decomposition found corresponds to the true paths in only 80% of the instances of that size, with lower accuracies as the number of true paths increases. In fact, practical methods for RNA assembly methods also have a precision of 50%-60% on some human datasets [5], [10]. Adding subpath constraints to the flow decomposition problem may further restrict the solution space, thus improving RNA assembly accuracy.
In practice, the subpath constraints can be derived from reads overlapping three or more nodes of the flow graph. Long RNA-Seq reads naturally have this property in many cases; however, also short reads can exhibit this behavior in the case of short exons. As we review below, other possible sources of such constraints exist in practice as well, such as from partial assemblies, or super-reads [5] constructed from short reads that can be uniquely extended.
Finally, most of the RNA assembly tools cited above work in a so-called genome-guided setting in which also a reference genome of the studied species is available. This makes the splice graph acyclic (i.e. a DAG). While both the original flow decomposition problem and our variant with subpath constraints can be defined in flow networks with cycles (which would correspond to a de novo assembly setting), in this paper we focus on DAGs only.

Related Work
Finding a flow decomposition with the minimum number of weighted paths is a well-studied problem in computer science. Even when restricted to DAGs, the minimum FD problem is NP-hard [11], and thus various practical approaches to it exist: approximation algorithms [12], [13], [14], [15], [16], FPT algorithms [9], greedy algorithms [11], [17]. By taking the set of subpaths constraints to be empty (or to correspond to all edges of the graph with non-zero flow), it follows that also finding a solution to the FDSC problem with a minimum number of paths is NP-hard.
The idea of improving RNA assembly by multi-edge subpath information is in fact used by several flow-based tools, such as Scallop [4] and StringTie [5]. However, both approaches integrate subpaths in a heuristic manner, with no overall formulation of the computational problem they are solving. The same holds also for the viral quasispecies assembler [6]. Recently, the method TransBorrow [18] uses partial assemblies from different RNA assembly tools, and works by heuristically extending the subpaths they correspond to in a splice graph.
Moreover, our FDSC problem generalizes a related problem on DAGs. Recall that in the minimum path cover (MPC) problem, we are looking for a minimum-cardinality set of path that together cover all nodes of a DAG (e.g., "explain" all exons of a splice graph). The problem is behind early RNA assembly methods such as Cufflinks [19], and early virus quasispecies assembly methods such as ShoRAH [20]. The MPC problem has been extended to include subpath constraints as well [21], [22], [23], [24], by analogously requiring that each constraint is a subpath of some solution path. While these generalizations are polynomially-time solvable, they (together with the initial MPC formulations) are usually unsatisfactory since they ignore the weights of the graph (i.e. the abundances of the reads)-recall that most state-of-the-art RNA assembly methods cited above are flow-based. Moreover, MPCs and MPCs with subpath constraints correspond to restricted classes of flows in some DAG [21], [25], and thus the minimum FDSC problem is a strict generalization of the MPC problem with subpath constraints.

Contributions
In this work, we initiate the formal study of the FDSC problem. This is a natural model for multiassembly problems, as seen by the abundance of methods and tools that incorporate subpath information for improving RNA and viral quasispecies assemblies. However, because finding a minimum solution to the FDSC problem is NP-hard, these methods and tools have focused on either heuristic approaches or a polynomial-time solvable particular version of the problem (MPC) that ignores valuable edge weight information. Here, we make two advances that bring us closer to being able to use the complete version of the problem in practical tools. On the theoretical side, we formalize the problem and give the first algorithm to determine whether an instance is feasible (Theorem 18), and produce a solution if it is. The algorithm works via a reduction to the standard flow decomposition problem where any solution must translate to a solution in the original graph that satisfies all of the subpath constraints. Based on this reduction, in Section 3.2 we also propose a "bridge-reweighting" heuristic algorithm to solve the minimum FDSC problem. Additionally, we give an FPT algorithm for the minimum FDSC problem (Theorem 20), extending the one of Kloster et al. [9]. Finally, in Section 4, to add to the complexity picture around the FDSC problem, we show that two application-oriented FD problems related to FDSC are NP-hard in the strong sense, even without requiring a solution with a minimum number of paths.
We implement both algorithms for FDSC, and perform a proof-of-concept study of their usefulness in RNA assembly. We experiment on a dataset developed by Shao et al. [17] to study their heuristic for the minimum FD problem. The same dataset was then used by Kloster et al. [9], who focused on studying the usefulness of standard minimum flow decompositions in RNA assembly, as explained above. We find that our FDSC algorithms increase our ability to uncover the ground truth RNA transcripts, as more and longer subpath constraints are included in the input. This holds both when minimality is enforced, through our FPT algorithm, and when it is only heuristically sought, through the flow decomposition reduction and an associated path reduction heuristic. For example, when there are seven ground truth transcripts, we increase the accuracy by 13% when an optimal solution is found (via FPT) and 30% when a heuristic solution is found.
Our FDSC algorithm runs in polynomial time (i.e. always finds a solution to the FDSC problem, not necessarily minimum). Though we desire a minimum such path decomposition, algorithms that guarantee such solutions in general may be too slow to be used in practice. Despite a lack of minimality guarantees in our heuristic FDSC algorithm, our experiments show that the addition of subpath constraints yields solutions that approach the accuracy of a minimum decomposition without subpath constraints; thus, our results show that heuristic FDSC is a practical substitute for minimum FD without subpath constraints.

PRELIMINARIES
Since flow weights represent read counts, we restrict attention to integral flow networks and flow decompositions.
Finally, for each v 2 V , there is an s-t path in G that includes v.
Definition 2 (Flow decomposition). A flow decomposition for a flow network G ¼ ðV; E; fÞ consists of a set of s-t paths P ¼ ðP 1 ; . . . ; P k Þ and associated integral flow weights w ¼ ðw 1 ; . . . ; w k Þ with w i 2 N such that for each edge e 2 E, X i:e2P i w i ¼ fðeÞ: We define several problems concerning finding decompositions of flow networks into paths.
Problem 3 (MFD). Given a flow network G ¼ ðV; E; fÞ, the minimum flow decomposition problem is to find a decomposition ðP; wÞ such that jPj is minimized.
Definition 4 (Flow decomposition with subpath constraints). Let G ¼ ðV; E; fÞ be a flow network. Subpath constraints are simple paths R ¼ fR 1 ; . . .; R ' g in G such that no R i R j . A flow decomposition ðP; wÞ satisfies the subpath constraints if and only if 8R i 2 R 9P j 2 Psuch that R i is a subpath of P j ðin short; R i 2 P j Þ:

FDSC Reduces to FD
We now describe a reduction from the FDSC problem to the FD problem. The idea is to convert subpath constraints into edges in an FD instance so that any path decomposition of the FD instance translates into a path decomposition for the FDSC instance that covers the subpath constraints.
Given a flow network G ¼ ðV; E; fÞ with subpath constraints R, we define the overdemand of an edge as d o ðeÞ ¼ maxð0; jfi : e 2 R i gj À fðeÞÞ; (4) and say that e is overdemanded if d o ðeÞ > 0. The FDSC problem ðG; RÞ may be feasible if multiple subpaths covering e are satisfied by a single path in a path decomposition. If no edges are overdemanded, we can give a simple reduction from FDSC to FD by transforming all subpath constraints in the FDSC instance into edges in the FD instance. We address this case in Section 3.1.1 and the case with overdemanded edges in Section 3.1.2.

Instances Without Overdemanded Edges
Lemma 7. Let G ¼ ðV; E; fÞ be a flow network with subpath constraints R such that no edge is overdemanded. Let G 0 ¼ ðV; E 0 ; f 0 Þ be the flow network that results from converting every subpath constraint R i ¼ ½v 1 ; v 2 ; . . . ; v jR i j to a bridge edge e i ¼ ðv 1 ; v jR i j Þ with f 0 ðe i Þ ¼ 1 and subtracting one from the flow values on the edges it covers. That is, for all e 2 E, f 0 ðeÞ ¼ fðeÞ À jfi : e 2 R i gj. G 0 is a flow network.
Proof. Consider building G 0 from G iteratively by converting each subpath constraint into a new edge and subtracting its flow from the edges it covers. At each step, conservation of flow holds. Thus, after the final step, f 0 is a flow on G 0 . Additionally, because no edge is overdemanded, all flow values in f 0 are nonnegative. Thus, G 0 is a flow network. t u then the instance is infeasible. If x ¼ 2, then the instance is feasible and requires three paths to decompose (whereas the associated FD instance without subpath constraints can be decomposed with two paths in both cases). Fig. 2. Illustration of the greedy choice for extending paths used in Algorithm 1. When processing node R 3 in the constraint graph in Fig. 2b, the path from R 2 is extended because the overlap between R 2 's subpath constraint and R 3 's subpath constraint is greater than R 1 's and R 3 's. Proof. We show how to construct a size k solution to FDSC on G from ðP 0 ; wÞ. For each path P 0 2 P 0 , create a new path P by replacing all bridge edges e 0 i with the original sequence of nodes R i . By construction, R i must form a path from the start node of e i to the end node of e i in P , and so P is a valid path from s to t in G. We take P to be the set of all k such paths P . We now must show that ðP; wÞ forms a flow decomposition with subpath constraints for G.
Let e be any edge in G and let R 0 R be the set of subpath constraints containing e. We can divide the paths in P that cover e into two disjoint sets: P B , those that covered bridge edges e i : R i 2 R 0 , and P O , those those that covered the original edge e in G 0 . Because ðP 0 ; wÞ is a flow decomposition for G 0 , each path in P B must have unit weight. Thus, paths in P B contribute jfi : R i 2 R 0 gj to e's flow. On the other hand, paths in P O must cover e's flow in G 0 , which is fðeÞ À jfi : R i 2 R 0 gj. Thus, paths from P B and P O together cover e with exactly fðeÞ units of flow. Additionally, P B must satisfy all of the subpath constraints R 0 , so together P B and P O do as well.
t u Because any FDSC instance without overdemanded edges can be solved by reduction to FD, it follows that all FDSC instances without overdemanded edges are feasible.
Corollary 8.1. Let G ¼ ðV; E; fÞ be a flow network with subpath constraints R. A sufficient condition for a flow decomposition to exist is that no edge is overdemanded.

Resolving Overdemanded Edges
When an FDSC instance has an overdemanded edge, the reduction given above fails, because any overdemanded edge would have a negative flow value after subtracting all of its demands from its original flow. However, if the FDSC instance ðG; RÞ is feasible, it is possible to first transform ðG; RÞ to an FDSC instance ðG; R Ã Þ, where no edge is overdemanded and any path decomposition for ðG; R Ã Þ also provides a feasible path decomposition for ðG; RÞ. By Lemma 7, ðG; R Ã Þ can be solved via reduction to an FD instance. We now show how to obtain ðG; R Ã Þ, if it exists. Lemma 9. Let ðG; RÞ be a feasible FDSC instance with overdemanded edge e and ðP; wÞ be a path decomposition for ðG; RÞ. Let R 0 R be the set of subpath constraints that contain e. There is some P 2 P such that jfR i : Proof. Suppose not. That is, suppose ðP; wÞ is a path decomposition for ðG; RÞ but no path in P covers two or more subpath constraints in R 0 completely. This means that every subpath constraint in R 0 must be satisfied by a different path; call this set of paths P 0 and let the total weight assigned to these paths be w 0 ! jP 0 j ¼ jR 0 j ¼ jfi : e 2 R i gj. As e is overdemanded, we have jfi : e 2 R i gj > fðeÞ. But then w 0 > fðeÞ, contradicting the fact that ðP; wÞ is a path decomposition for ðG; RÞ. t u Inspired by the above lemma, we consider pairs of subpath constraints that may be satisfied by the same path in a decomposition.

Definition 10 (Compatible subpaths). Two subpaths
Definition 11 (Directly-compatible subpaths). Two subpaths R i ; R j 2 R are directly compatible if and only if R i and R j are compatible and there does not exist a subpath R k such that R i and R k are compatible and R k and R j are compatible.
We remark that the directly-compatible relation is just the transitive reduction of the compatible relation.
Lemma 12. Let ðG; RÞ be an FDSC instance with some overdemanded edge e. Then ðP; wÞ is a solution for ðG; RÞ if and only if there exist directly-compatible subpaths R i and R j , each containing e, such that ðP; wÞ is a solution for ðG; R [ fR i [R j g n fR i ; R j gÞ.
Proof. ð!Þ Let ðG; RÞ be a feasible FDSC instance with overdemanded edge e. Let ðP; wÞ be a path decomposition for ðG; RÞ. Let R 0 R be the set of subpath constraints that contain e. By Lemma 9, there exists a P 2 P and R i ; R j 2 R 0 such that R i 6 ¼ R j and R i ; R j 2 P . Since R i and R j both belong to P and overlap (since they each contain e), it follows that they are compatible. If R i and R j are not directly compatible, there must exist some R k such that R i and R k both contain e and are directly compatible. In this case, take R j to be R k . Furthermore, the path P satisfies the subpath constraint R i [ R j , so ðP; wÞ is a feasible solution for ðG; R [ fR i [ R j g n fR i ; R j gÞ.
ð Þ Let R i and R j be directly-compatible subpaths that both contain e. Let ðP; wÞ be a feasible solution to ðG; R [ fR i [ R j g n fR i ; R j gÞ. It follows that there exists a path P 2 P that covers R i [ R j . Clearly, P also covers R i and R j , so ðP; wÞ is also a feasible solution for ðG; RÞ. t u  Lemma 12 suggests that we can determine feasibility by finding combinations of subpath constraints that are satisfied by the same paths. We can then think of merging these subpath constraints together to form an equivalent instance without overdemanded edges. One way to determine feasibility, then, would be to consider every possible way of merging subpaths; however, there are an exponential number of such possibilities.
To do this it polynomial time, we define a new graph built from the subpath constraints, and show that certain path coverings in this graph correspond to valid ways to merge the subpath constraints. If no such path cover exists, then the instance is infeasible.
We first define the new graph.
Definition 13 (Constraint graph). Let ðG; RÞ be an FDSC instance. We define its constraint graphG c as the graph where the vertices are constraints in R and an edge ðR i ; R j Þ indicates that R i and R j are directly compatible.
An example constraint graph for an FDSC instance is shown in Fig. 2.
Let L be the total length of all subpath constraints and recall that ' denotes the number of subpath constraints. We can construct the constraint graph G c in OðL þ ' 3 Þ time by first using Gusfield's algorithm for all pairs suffix-prefix overlaps in OðL þ ' 2 Þ time [26] and then finding the transitive reduction of this relation in Oð' 3 Þ time using Aho's algorithm with standard matrix multiplication [27]. (Note that the original all pairs suffix-prefix overlap relation is acyclic since no subpath constraint is properly contained inside another, and G is a DAG.) Definition 15 (Edge-induced subgraph). Let e be an edge in G. The edge-induced subgraph G c ðeÞ is the subgraph of G c consisting of all vertices R i in G c (and induced edges) such that e 2 R i .
We now show that a certain path cover of the constraint graph corresponds to a valid way to merge subpath constraints.
Lemma 16. Let ðG; RÞ be an FDSC instance and let G c be its constraint graph. ðG; RÞ is feasible if and only if there is a vertex-disjoint path cover P c of G c such that, for every edge e in G, at most fðeÞ paths in P c visit G c ðeÞ.
Proof. ð!Þ Assume ðG; RÞ is feasible. By Lemma 12 and Corollary 12.1, it is possible to merge subpath constraints until no edge is overdemanded. The resulting constraint subpaths are each formed from the union of original constraint subpaths in G that were directly compatible, so each resulting subpath corresponds to a path in G c . Since each original subpath belongs to exactly one of the final subpaths, all such paths provide a vertex-disjoint path cover P c of G c . Let e 2 G. Since e is not overdemanded, the number of paths from P c that visit G c ðeÞ is at most fðeÞ.
ð Þ Let P c be a vertex-disjoint path cover for G c such that, for every edge e in G, at most fðeÞ paths in P c visit G c ðeÞ. Create a new FDSC instance ðG; R 0 Þ as follows: For each path in P c , add a single subpath constraint to R 0 that corresponds to union of the subpath constraints in the path. In ðG; R 0 Þ no edge is overdemanded, since no edge in G had more associated paths in P c than its flow. Thus ðG; R 0 Þ is feasible and has a solution ðP; wÞ. Since any path covering a merged subpath constraint must cover the original un-merged subpath constraints it contains, ðP; wÞ also provides a solution to ðG; RÞ. t u There is a simple greedy strategy (Algorithm 1 and Fig. 2) to find a vertex-disjoint path cover P c of G c that minimizes the number of paths intersecting G c ðeÞ for all edges e in G. if jUj > 0 then 7: u i the u 2 U with greatest number of edges in suffix-prefix overlap with R i 8: Extend the path ending at u i to end at R i 9: else 10: Add new path starting and ending at R i to P c 11: end if 12: end for 13: return P c 14: end Function Lemma 17. In Oð' 2 Þ time, Algorithm 1 finds a vertex-disjoint path cover for G c such that for every edge e in G, the minimum possible number of paths in the cover visit G c ðeÞ.
Proof. Consider any vertex R in G c , with incoming edges from R 1 ; . . . ; R a . Clearly, at most one such edge ðR i ; RÞ can belong to any vertex-disjoint path cover. Observe that for any edge e in G, whether ðR i ; RÞ belongs to a path in the cover only affects the number of cover paths visiting G c ðeÞ provided both R i and R belong to G c ðeÞ; in other words, e belongs to both R i and R. For each incoming edge ðR i ; RÞ in G c , consider the set of e 2 G for which e belongs to both R i and R; these edges form a path in G that is a prefix of R. Thus, choosing ðR i ; RÞ will benefiti.e., not increase-the visitation counts to exactly those edges in this prefix of R. It follows that simply choosing the ðR i ; RÞ that has the longest suffix-prefix overlap will minimize all visitation counts. By considering the vertices in topological order, we can extend paths in Oð1Þ time.
Since G c has ' vertices, we can perform the topological sort in Oð' 2 Þ time. Each edge is examined once during the algorithm, and we assume that suffix-prefix overlaps have been pre-computed during the construction of G c , so the total running time of this step is Oð' 2 Þ. t u Because we can find an optimal path cover in polynomial time, we can check the feasibility of an FDSC instance (and, if it is feasible, find a solution) in polynomial time.
Theorem 18. Let ðG; RÞ be an FDSC instance with jRj ¼ ', total length of all subpath constraints L, and at least one overdemanded edge. In OðL þ ' 3 ) time, we can determine whether ðG; RÞ is feasible, and if so, reduce ðG; RÞ to an equivalent FD instance with at most ' additional edges.
Proof. As discussed above, G c and the suffix-prefix overlaps can be found in OðL þ ' 3 Þ time. We can then use Algorithm 1 to find an optimal vertex-disjoint path cover P c of G c in Oð' 2 Þ time. Because the path cover is optimal, Lemma 16 tells us that ðG; RÞ is feasible if and only if for every edge e of G, at most fðeÞ paths in P c visit G c ðeÞ; this can be checked in OðLÞ time. If the instance is feasible, by Lemma 16, we can merge the subpath constraints in each path found by Algorithm 1, yielding an equivalent FDSC instance with no overdemanded edges. Then, we can use the method of Lemma 7 to transform that FDSC instance into an equivalent FD instance with at most ' additional edges.
t u

A Heuristic Algorithm for MFDSC
In practice, we can run an MFD heuristic algorithm to determine a solution to the FD instance found via the reduction in the previous section. We use greedy-width, first proposed in [11], which greedily chooses the heaviest ("widest") paths in order to decompose the flow. 1 As G 0 is a DAG, a greedy-width path can found in OðjV j þ jEj þ 'Þ time, by standard dynamic programming. In [11] it is shown that at most jEj À jV j þ 2 greedy-width paths can be found, so the total time to find an FD solution is OðjEjðjV j þ jEj þ 'ÞÞ. Translating the FD solution back to the original graph (following Lemma 8) yields a path decomposition for the FDSC problem. However, in applications, we are often interested in finding solution to the MFDSC problem, i.e., finding a solution with the minimum number of paths. The introduction of bridge edges in the reduction described above may lead to more paths being required to decompose the reduced FD instance than the original FDSC instance. This is because we now must find paths through bridge edges, as well as in the original flow network. For this reason, we apply a bridge reweighting heuristic before decomposing the network in order to reduce the number of paths. For some arbitrary ordering of the bridge edges, we do the following: 1) For each bridge edge, find the minimum flow f min over the flow values on the edges of its corresponding subpath constraint in the original network. Since the FDSC is feasible, f min ! 0. 2) Subtract f min from each of the subpath constraint edges, and add f min to the bridge edge. Since the bridge edge starts at the first node of the subpath constraint and ends at the last, flow conservation holds and the mapping of the bridge paths back to the original network again provides a solution to the FDSC instance. Fig. 3 demonstrates the reduction to an FD instance and the bridge reweighting step on an example FDSC instance.

An FPT Scheme for MFDSC
In this section, we describe an extension of Toboggan [9], an FPT algorithm for decomposing DAG flows, to also handle subpath constraints. Toboggan is able to find a k-path decomposition for a flow network G ¼ ðV; E; fÞ, if one exists, in 2 Oðk 2 Þ Á ðjV j þ ÞÞ time, where is the logarithm of the largest flow value present. To solve MFD, Toboggan tests increasing k values until a solution is found. We briefly describe Toboggan's approach and then discuss how to modify the algorithm so that it can also check if an FD solution satisfies subpath constraints.
Toboggan considers the vertices of G in topological order and computes a table T i for each vertex v i using dynamic programming. Table entries are of the form ðg; LÞ, where g indicates how paths from the previous table T iÀ1 are extended, and L is a linear system indicating how the weights of these paths are constrained to satisfy the flow requirements on all edges encountered so far. This linear system can be written as Aw ¼ b, where A is a binary matrix of k columns representing whether each row's edge is covered by each column's path, w is the length k solution vector, and b is the flow on the row's edge. Because there are k weights and all coefficients are integers, each linear system can be reduced to k linearly independent rows. As noted in [9], testing an integer linear system L for feasibility and finding a solution can be done in Oðk 2:5kþoðkÞ jLjÞ time, where jLj is shown to be k Oð1Þ .
When the final vertex in the order is reached, these linear systems indicate the path flow constraints on all edges in G, and so, if a particular system is feasible, the corresponding paths and weights provide an FD solution.
To modify Toboggan to also consider the subpath constraints, for the final table T jV j , we add a second linear system to simultaneously satisfy of the form Aw ! b, where A is an ' Â k binary matrix and b T ¼ ðd 1 ; . . . ; d ' Þ. Here Aði; jÞ 2 f0; 1g indicates whether path P j contains R i . We give an updated version of a lemma [9, Lemma 5] that bounds the number of distinct linear systems in the final table.
Lemma 19. The final table has at most 4 k 2 þ k' 2 k!k k distinct linear systems.
Proof. We follow the proof of [9, Lemma 5]. Since A is an ' Â k binary matrix, there are 2 k' possible systems of the second form. We must multiply this by the number of flow matching systems which was bounded ([9, Lemma 5]) by 4 k 2 k!k k . So, the total number of possible combined linear systems is Theorem 20. Let ðG; RÞ be an FDSC instance with jRj ¼ ' and is the logarithm of the largest flow value in the input. Modifying the Toboggan algorithm as described provides an FPT algorithm for MFDSC with running time 2 Oðk 2 Þ jV jþ 2 Oðk 2 þk'Þ ðk þ 'Þ Oð1Þ .

Proof. Kloster et al. prove ([9, Lemma 4]), that in any table
T i , the number of distinct g values present is at most ffiffi ffi k p ð0:649kÞ k . This implies (following [9,Theorem 7]) that there are at most 1. Greedy-width is a common heuristic algorithm for MFD. Other heuristics such as Catfish [17] could also be substituted. final linear systems L to check for integer solutions. The encoding size of a linear system L is now bounded by ðk þ 'Þ Oð1Þ , where is the logarithm of the largest flow value in the input. Checking feasibility and finding a solution for L can now be done in Oðk 2:5kþoðkÞ ðk þ 'Þ Oð1Þ Þ time, so the total time needed to check all such linear systems is at most ffiffi ffi k p 4 k 2 þ k' 2 0:649 k k! Á Oðk 2:5kþoðkÞ ðk þ 'Þ Oð1Þ Þ Oð4 kðkþ ' 2 Þ k 1:5k 1:765 k k oðkÞ ðk þ 'Þ Oð1Þ Þ; using the fact that k k k! e k . The total running time of the algorithm becomes 2 Oðk 2 Þ jV j þ 2 Oðk 2 þk'Þ ðk þ 'Þ Oð1Þ . t u

HARDNESS OF RELATED FLOW DECOMPOSITION PROBLEMS
In this section we add to the computational complexity picture around the FDSC problem by studying two natural application-oriented variants of it. In contrast to the FDSC problem, where deciding if the instance is feasible can be done in polynomial time (recall Theorem 18), for both of these problems feasibility checking is NP-complete in the strong sense (i.e. even if the input flow values are bounded by a polynomial in the size of the input; in fact, the former one is NP-complete even when all flow values equal 1). Since their application setting is slightly different from the one of the FDSC problem (see our FDSC experiments in Section 5), in this paper we do not give algorithms for them, and leave that for future work. First, we consider the version of the problem where the subpath constraints have associated demands that must be met by the flow assigned to a path that covers them. This problem could arise if we would like a subpath constraint to be covered by at least one flow path of higher weight, since a more heavily-weighted path may be less likely to be result of noisy data.
Definition 21 (Flow decomposition with subpath constraints and demands). Let G ¼ ðV; E; fÞ be a flow network. Let R ¼ fR 1 ; . . .; R ' g be a set of subpath constraints in G, and let D ¼ fd 1 ; . . .; d ' g be a set of demands, where each d i is associated to subpath constraint R i . We say that a flow decomposition ðP; wÞ of G satisfies subpath constraints R and demands D if and only if 8R i 2 R 9P j 2 P such that R i 2 P j and d i w j : We note that FDSC is a special case of FDSCD, where all demands are equal to one. The following proof is very similar to the NP-completeness proof of [11] for MFD. Theorem 23. FDSCD is NP-complete in the strong sense.
Proof. Clearly, FDSCD belongs to NP. For NP-hardness, we reduce from the 3-PARTITION problem. The input of this problem consists of a set of positive integers A ¼ fa 1 ; a 2 ; . . .; a 3q g, where P a2A a ¼ qB and B=4 < a < B=2 for all a 2 A. The question is whether there exists a partition of A into q disjoint subsets, each with exactly three elements summing up to B.
Given an input for 3-PARTITION, we construct the flow network G ¼ ðV; E; fÞ with subpath constraints R with demands D as in Fig. 4. We claim that this is a YES instance for 3-PARTITION if and only if the reduction creates a YES-instance for FDSCD.
ð!Þ Assume È fa j 1 ; a j 2 ; a j 3 j j 2 f1; . . .; qgg É is a proper 3-partition of A. For each j 2 f1; . . .; qg, we build the three flow paths ðe j 1 ; b j Þ, ðe j 2 ; b j Þ, ðe j 3 ; b j Þ, with weights a j 1 ; a j 2 ; a j 3 , respectively. This is possible since fðb j Þ ¼ B ¼ P 3 k¼1 a j k . As such, each subpath constraint ½e i of demand a i is satisfied. ð Þ Let ðP; wÞ be a flow decomposition with subpath constraints R and demands D of G as indicated. Since the demand of each constraint ½e i is a i , and fðe i Þ ¼ a i it follows that each edge e i is used by exactly one flow path of weight a i , and thus that P consists of exactly 3q paths P 1 ; . . .; P 3q , with weights a 1 ; a 2 ; . . .; a 3q , respectively. For each j 2 f1; . . .; qg, consider the flow paths passing thorough b j . Since B=4 < a < B=2 holds for all a 2 A, there are exactly 3 such paths, say P j 1 ; P j 2 ; P j 3 , each of weight a j 1 ; a j 2 ; a j 3 respectively, passing through b j . As such, fa j 1 ; a j 2 ; a j 3 j j 2 f1; . . .; qgg È É is a proper 3-partition of A. Finally, since 3-PARTITION is NP-complete in the strong sense [28], it follows that FDSCD is as well.
t u Our second problem variant is motivated by paired-end reads, which naturally induce pairs of subpath constraints that must be covered by the same flow path (since e.g., they are sequenced from the same RNA transcript).
Paired subpaths constraints are defined to be a set of pairs of simple paths R ¼ fðR 1 ; R 0 1 Þ; . . .; ðR ' ; R 0 ' Þg in G. A flow decomposition ðP; wÞ satisfies the paired subpaths constraints if and only if 8ðR i ; R 0 i Þ 2 R 9P j 2 P such that R i 2 P j and R 0 i 2 P j : (7) Problem 25 (FDPSC). Given a flow network G ¼ ðV; E; fÞ and paired subpaths constraints R, the flow decomposition with paired subpaths constraints problem is to determine if there exists, and if so, find a flow decomposition ðP; wÞ satisfying (7).
Our NP-completeness proof below is similar to the NPcompleteness proof of the minimum path cover problem  1 ; a 2 ; . . .; a 3q g with P a2A a ¼ qB , we construct the flow network with edges e 1 ; . . .; e 3q , b 1 ; . . .; b q . For all i 2 f1; . . .; 3qg, we set fðe i Þ ¼ a i , and for all j 2 f1; . . .; qg we set fðb j Þ ¼ B. For each e i we add the single-edge subpath constraint ½e i with demand d i ¼ a i .
with paired subpath constraints from [29]. Note that hardness holds even if all flow values equal 1. Proof. Clearly, FDPSC belongs to NP. For NP-hardness, we reduce from the problem of deciding whether the chromatic number of an undirected graph G ¼ ðV; EÞ is 3. Given G, we construct the flow network G Ã ¼ ðV Ã ; E Ã ; fÞ, with paired subpaths constraints R, as in Fig. 5. We claim that the chromatic number of G is 3 if and only if ðG Ã ; RÞ is a feasible FDPSC instance.
ð!Þ Let V 1 ; V 2 ; V 3 be a partition of V such that no edge of G has endpoints in the same V k . Then, we can construct a flow decomposition ðP 1 ; P 2 ; P 3 Þ of G Ã , where each path has weight 1, as follows. In the vertex part of G Ã , suppose v i 2 V k ; P k follows the edge labeled v i , and the other two paths follow the other two edges parallel to it, respectively. In the edge part of G Ã , for each edge e ¼ ðv i ; v j Þ, where v i 2 V k i and v j 2 V k j , path P k i follows the edge labeled v 0 i and path P k j follows the edge labeled v 0 j . The remaining path follows the middle unlabeled edge. Clearly, all subpath constraint pairs are satisfied.
ð Þ Any flow decomposition of G Ã has exactly 3 paths, each of weight 1. Let P 1 ; P 2 ; P 3 be a flow decomposition with paired subpaths constraints of ðG Ã ; RÞ. We construct a partition V 1 ; V 2 ; V 3 of V : in the vertex part of G Ã , an edge labeled v i belongs exactly to one P k , and we assign v i to V k . Assume for a contradiction that some edge e ¼ ðv i ; v j Þ of G has endpoints in the same V k . Recall that for e, R contains a constraint pairing the edge labeled v i with the one labeled v 0 i , and one constraint pairing the edge labeled v j with the one labeled v 0 j . Because of these constraints, since P k passes through both v i and v j (in the vertex part of G Ã ), in the edge part of G Ã corresponding to e we have that P k passed through both the edge labeled v 0 i and the one labeled v 0 j , which is a contradiction. t u

EXPERIMENTS
The algorithms described in Section 3 were implemented in Python in a package called Coaster. 2 We refer to the algorithm of Section 3.2 as heuristic MFDSC and Section 3.3 as FPT MFDSC. Experiments were performed on a high performance research cluster, where each run was executed on a single Intel Xeon Ivy Bridge E (3.4 Ghz) or similar CPU. We denote the number of groundtruth paths for an instance by k, and set a CPU time limit of 30 seconds for smaller instances (2 k 8) and 1 hour for larger instances (k ¼ 9; 10). For fairness of comparison, we report only on graph instances that ran to completion for all algorithm and parameter combinations, unless otherwise mentioned.
Overall, we find that heuristic MFDSC completes for all test instances in under one second, and FPT MFDSC completes in under 30 seconds for all instances with k 5, which includes the majority of instances. We give additional details and results in the following sections.

Datasets
As in previous studies on flow decomposition methods for RNA-Seq assembly [8], [9], [17], we use a simulated RNA-Seq dataset from [17] where each instance is a flow network generated by simulating RNA transcripts and their abundances with Flux-Simulator [31]. The original dataset includes human, mouse, and zebrafish genes, but we restrict our attention to instances in the human dataset, which contains 100 independently generated transcriptomes. As in [8], [9], we use only instances with at least two ground truth paths (since a single ground truth path is trivial to decompose). We also restrict the dataset to instances with 10 ground truth paths or fewer, yielding a total of 528,544 instances. Because the transcripts and abundances are known, we have ground truth paths and weights for each splice graph instance. We measure accuracy as the proportion of instances for which the algorithm returns the ground truth set of paths and weights exactly.

Simulating Subpath Constraints
In order to simulate subpath constraints, we take subpaths of the ground truth paths according to two parameters: the number of subpaths ', and a fixed length for all subpaths jRj. As noted in [9, Lemma 8], we can simplify the graph by bypassing any vertex with out-degree or in-degree equal one. We set jRj as the length of subpaths in this contracted graph. To generate subpath constraints that are consistent across experiments, we fix an arbitrary ordering for the ground truth paths for each instance, and take the first jRj edges of the first ' (contracted) paths as the subpaths. We note that the method of generating subpath constraints described here does not yield any overdemanded edges. 2. Coaster is based on the codebase for Toboggan [30] and is available at github.com/msu-alglab/coaster

Accuracy Results
To study the effect of the subpath constraints on the accuracy of the RNA-Seq assembly, we vary ' and jRj independently, letting ' 2 ½0; 4 and jRj 2 f3; 4g. Because instances become more difficult to solve correctly as the number of ground truth paths increases, we separate results by k. Accuracy results for both algorithms are reported in Table 1. For each k value, we also report the percentage of instances that completed for all parameter combinations tested. As already shown by Kloster et al. [9], the MFD solutions found by Coaster for ' ¼ 0 (for them, Toboggan) do correspond to the the ground truth paths and weights most of the time. However, for larger k values, we can see that FPT MFD solutions (without subpath constraints) do not necessarily recover the correct set of paths and weights. For k ¼ 7, for example, only 81% of the optimal decompositions produced by Coaster are the ground truth decomposition that we are seeking. Similarly, we see that FPT MFDSC solutions tend to be correct, with accuracy decreasing as k increased. However, FPT MFDSC has higher accuracy for all parameter combinations than FPT MFD at the same k value. For k ¼ 7, when we add four subpath constraints of length four each, the ground truth decomposition is found 91% of the time, a 13% increase over FPT MFD.
When ' ¼ 0, our heuristic MFDSC algorithm is equivalent to the often-used greedy-width heuristic for MFD; our results show that adding subpath constraints to greedywidth increases its accuracy considerably for larger k values, for example, by 30% when k ¼ 7. The increased accuracy of heuristic MFDSC is also good news for the use of MFDSC in practical methods, since heuristic MFD methods are already commonly used in RNA-Seq tools. In fact, the inclusion of many long subpath constraints makes heuristic MFDSC more accurate than FPT MFD for k values up to 5, which account for 95.6% of the full dataset studied (all k ! 2).
Part of the success of the heuristic MFDSC can be attributed to the fact that it finds optimal solutions in most cases. Without subpath constraints, heuristic MFDSC (i.e. greedywidth MFD) finds an optimal solution in 98.0% and the ground truth solution in 95.3% of instances in our dataset. With two subpath constraints of length 4, that increases to 99.0% and 98.1%, respectively. (For ' > 2, small k values are excluded, so results are not comparable with ' ¼ 0 experiments.) With and without subpath constraints, the vast majority of incorrectly predicted path decompositions are due to the algorithm returning an optimal decomposition of the same size as the ground truth one, but different from it, rather than a too-small optimal decomposition. As found in [9], in nearly all instances, the ground truth path decomposition is also an optimal decomposition. (They find that 0.043% of instances of all ground truth k that ran to completion in 50 seconds had non-optimal ground truth decompositions; we find that 0.100% of instances that completed in 30 seconds for all parameter combinations and had ground truth k less than 9 had non-optimal ground truth decompositions.) However, most instances are solved correctly, so it could be the case that the few instances that are not solved correctly are those that had non-optimal ground truths. This tends not to be the case. Overall, only 0.027% of instances for which the FPT for MFD yields incorrect solutions have nonoptimal ground truth path decompositions. This is dominated by the k ¼ 2 instances, however, for which no instance had a non-optimal ground truth; for k ¼ 3 through k ¼ 8, between 0.1% and 0.3% of instances that were predicted incorrectly had non-optimal ground truths. With many and longer subpath constraints (jRj ¼ 4 and ' ¼ 4), it  For k ¼ 2 through k ¼ 8 we use a CPU time limit of 30 seconds; for k ¼ 9; 10 we use 1 hour. We only report instances that finished in the time limit for all '; jRj, and for both algorithms for each k; the "pc" column reports the percentage of instances that completed for all runs for each k value. The italicized values agree with the ones reported in [9, Fig. 3], with some slight differences due to the fact that we restrict to the human dataset (they studied two additional datasets) and timeout differences.
is still only a very small number -0.052% -of incorrect solutions that have non-optimal ground truth path decompositions. Thus, this implies that the addition of subpath constraints restricts the solution space, allowing the algorithm to return the correct one more frequently and explaining the increase in accuracy when they are included.

Effect of the Bridge Reweighting
To confirm the effectiveness of the bridge reweighting heuristic for MFDSC, as opposed to simply using a path decomposition found by the method of Lemma 8, we measured the accuracy of the FDSC algorithm without bridge reweighthing on the same dataset studied above. In that case, the addition of subpath constraints in our experiments reduces the accuracy of the path decompositions returned, as shown in Table 2. Bridge reweighting allows the maximum flow that can cover a subpath constraint to do so, without introducing extra weight-one paths.

Performance Results
We measured CPU runtime of the implementation for both algorithms using all instances for the given k range, even those that timed out for some experimental conditions. For heuristic MFDSC on 2 k 8 instances, the average, minimum, and maximum runtimes were 0.0059s, 0.00096s, and 0.977s. For FPT instances, they were 0.076s, 0.001s, and 30s (the maximum time allowed). On k ¼ 9; 10 instances, the average, minimum, and maximum runtimes were 0.018s, 0.004s, 0.155s for heuristic MFDSC and 289.3s, 0.932s, and 3600s (the maximum time allowed) for FPT FDSC.
Because of the optimizations made in Toboggan [9] on which our FPT MFDSC implementation is based, memory use is generally very limited even as k increases. We measured the peak memory use for large instances (k ¼ 9; 10) with a timeout limit of 1 hour for all experimental combinations (' and jRj values) for the FPT MFDSC algorithm. All but four instances used under 100MB of memory in all experimental combinations; the average memory use over all instances and all experimental combinations, even those that timed out after an hour, was 47 MB. For most instances, the memory use is dominated by loading the required Python packages (about 40 MB).

DISCUSSION
In this work, we initiate the formal study of the MFDSC problem, which is used as a model in applications such as RNA sequencing and viral quasispecies assembly. We give both a heuristic algorithm, based on a novel reduction to flow decomposition, and an FPT algorithm, which extends the FPT MFD algorithm of Kloster et al. [9]. Through experiments on a previously-studied simulated transcriptomics dataset, we verify the base assumption underlying the use of MFDSC in practical RNA-Seq tools: that the minimumsize path decomposition should correspond to the ground truth set of paths and weights. Additionally, we show that the use of subpath constraints increases accuracy when compared to MFD without subpath constraints. We also find that our heuristic algorithm is practical, completing in less than 1 second for all instances studied, and achieves accuracy levels near those of FPT MFDSC. This is an encouraging result, because while RNA sequencing data tends toward very small ground truth path sets, other multiassembly problems such as viral quasispecies assembly may not -for example, some benchmarking datasets of [32] contain 10 and 15 strains, meaning that MFDSC (or even MFD without subpath constraints) would be intractable without a heuristic.
The research presented here suggests a number of future directions. One is to develop MFDSC algorithms for graphs containing cycles. Though splice graphs for RNA assembly are usually DAGs, graphs for de novo assembly of viral or other genomes would likely contain cycles due to repeated sequences. Another is to explore additional methods to increase the accuracy of the decompositions found. Our experiments show that as the size of ground truth gets large, accuracy decreases because there are multiple optimal solutions to choose from, even with the maximum length and number of subpath constraints that we tested. To increase accuracy, either more subpath constraints are needed (which may be possible, depending on the domain), or additional optimality criteria could be used. For example, the two FDSC variants considered in Section 4 (together with yet to be discovered algorithms for them and possible problem generalizations) could guide the search for such optimality criteria.  If all bridges are kept at weight one, subpath constraints reduce the accuracy of the path decomposition, though less if they are longer.