Ranking top-k trees in tree-based phylogenetic networks

'Tree-based' phylogenetic networks proposed by Francis and Steel have attracted much attention from theoretical biologists in the last few years. At the heart of the definitions of tree-based phylogenetic networks is the notion of 'support trees', about which there are numerous algorithmic problems that are important for evolutionary data analysis. Recently, Hayamizu (arXiv:1811.05849 [math.CO]) proved a structure theorem for tree-based phylogenetic networks and obtained linear-time and linear-delay algorithms for many basic problems on support trees, such as counting, optimisation, and enumeration. In the present paper, we consider the following fundamental problem in statistical data analysis: given a tree-based phylogenetic network $N$ whose arcs are associated with probabilities, construct the top-$k$ ranking of the support trees of $N$ by their likelihood values. We provide a linear-delay (and hence optimal) algorithm for the problem and thus reveal the interesting property of tree-based phylogenetic networks that ranking top-$k$ support trees is as computationally easy as picking $k$ arbitrary support trees.


INTRODUCTION
Although phylogenetic trees have been used as the standard model of evolution, phylogenetic networks have become popular amongst biologists as a tool to describe conflicting signals in data or uncertainty in evolutionary histories [4,6,9]. Therefore, when we wish to reconstruct the phylogenetic tree T on a set X of species from non-tree-like data, a natural idea would be to describe the data using a phylogenetic network N = (V, A) on X and then remove extra arcs to discover an embedding τ = (V, S) of T inside N, where τ is called a 'support tree' of N [6].
However, the above strategy only makes sense when N is 'tree-based', namely, when N is merely a tree with additional arcs [6], which is not always the case [12]. In [6], Francis and Steel provided a linear-time algorithm that finds a support tree of N if N is tree-based and reports that none exists otherwise. Another linear-time algorithm for this decision problem was obtained by Zhang in [13].
While Francis and Steel's work was followed by many studies (e.g., [1,3,4,5,7,11,13]), Hayamizu's recent work [8] significantly advanced our understanding of how tree-based networks could be useful in contemporary phylogenetic analysis. In fact, Hayamizu's structure theorem has yielded a series of linear-time and linear-delay algorithms for many basic problems (e.g., counting, enumeration and optimisation) on support trees, and has thus enabled various data analyses using tree-based phylogenetic networks (see [8] for details).
In the present paper, we consider a so-called 'top-k ranking problem', with the aim of further facilitating the application of tree-based phylogenetic networks. The problem is as follows: given a tree-based phylogenetic network N where each arc a exists in the true evolutionary lineage with probability w(a) > 0, list the top-k support trees of N in non-increasing order of their likelihood values. We note that this problem is an important generalisation of the top-1 ranking problem, which asks for a maximum likelihood support tree of N and can be solved in linear time [8], since nearly optimal support trees can provide more biological insights than the maximum likelihood one alone.
At first glance, ranking top-k support trees may seem more difficult than picking k arbitrary support trees, the latter of which is possible with linear delay [8]; however, in this paper, we provide a linear-delay (i.e., optimal) algorithm for the top-k ranking problem and thus reveal that the above two problems have the same time complexity, which is an interesting property of tree-based phylogenetic networks.

PRELIMINARIES
Throughout this paper, X represents a non-empty finite set of present-day species. All graphs considered here are finite, simple, directed acyclic graphs. For a graph G, V(G) and A(G) denote the sets of vertices and arcs of G, respectively. A graph G is called a subgraph of a graph H if both V(G) ⊆ V(H) and A(G) ⊆ A(H) hold, in which case we write G ⊆ H. When G ⊆ H but G ≠ H, G is called a proper subgraph of H. When G ⊆ H and V(G) = V(H), G is a spanning subgraph of H. Given a graph G and a non-empty subset A′ of A(G), A′ is said to induce the subgraph G[A′] of G, that is, the one whose arc-set is A′ and whose vertex-set consists of all ends of arcs in A′.
is called a leaf of G.

Definition 2.1. A rooted binary phylogenetic X-network is defined to be a finite simple directed acyclic graph N with the following properties: When N has no reticulation vertex, N is called a rooted binary phylogenetic X-tree.

Definition 2.2 ([6]). If a rooted binary phylogenetic X-network N has a spanning tree τ that can be obtained by inserting zero or more vertices into each arc of a rooted binary phylogenetic X-tree T, then N is said to be tree-based and τ is called a support tree of N.

Theorem 2.3 ([6]). Let N be a rooted binary phylogenetic X-network and let S be a subset of A(N). Then, the subgraph N[S] of N is a support tree of N if and only if S satisfies the following three conditions, in which case S is called an 'admissible' arc-set of N. Moreover, there exists a one-to-one correspondence between support trees of N and admissible arc-sets of N.
In this paper, as the conditions in Theorem 2.3 still make sense for any subgraph of N, we consider admissible arc-sets of subgraphs of N.

KNOWN RESULTS: THE STRUCTURE OF SUPPORT TREES
Here, we summarise without proofs the relevant material in Then, any zig-zag trail $Z$ in $N$ is specified by an alternating sequence of (not necessarily distinct) vertices and distinct arcs of $N$, such as From now on, we represent a maximal zig-zag trail $Z$ by a sequence $\langle a_1, \dots, a_{|A(Z)|} \rangle$ of the elements of $A(Z)$ that form the zig-zag trail in this order, assuming that no confusion arises. Then, we can encode an arbitrary arc-induced subgraph of $Z$ by an $|A(Z)|$-dimensional vector. For example, for an N-fence $Z = \langle a_1, a_2, a_3, a_4, a_5 \rangle$, the subgraph of $Z$ induced by the subset $\{a_1, a_3, a_5\} \subseteq A(Z)$ is specified by the vector $(1\ 0\ 1\ 0\ 1) = (1(01)^2)$. With this notation, we can state Hayamizu's structure theorem for tree-based phylogenetic networks, which gives an explicit characterisation of the family $\Omega$ of all admissible arc-sets of $N$ as follows.
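The vector encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `encode_arc_subset` and the string labels for arcs are assumptions introduced here, not notation from [8].

```python
def encode_arc_subset(trail, subset):
    """Return the 0/1 vector marking which arcs of `trail` lie in `subset`.

    `trail` lists the arcs of a maximal zig-zag trail in trail order;
    `subset` is any collection of arcs taken from that trail.
    """
    chosen = set(subset)
    return tuple(1 if arc in chosen else 0 for arc in trail)


# The N-fence example from the text: Z = <a1, ..., a5>, subset {a1, a3, a5}.
trail = ("a1", "a2", "a3", "a4", "a5")
vector = encode_arc_subset(trail, {"a1", "a3", "a5"})
print(vector)
```

Tuples are used so that two encodings can be compared lexicographically with Python's built-in ordering.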

TOP-k SUPPORT TREE RANKING PROBLEM
Given a tree-based phylogenetic $X$-network $N$ where each arc $a$ is chosen with probability $w(a) \in (0, 1]$, we can assign a ranking number to each support tree $\tau \in \Omega$ of $N$ by the likelihood value $f(\tau) := \prod_{a \in A(\tau)} w(a)$. In principle, the top-$k$ support tree ranking problem for $N$ asks for an ordered set $\langle \tau^{(1)}, \dots, \tau^{(k)} \rangle$ of $k$ support trees of $N$ such that $f(\tau^{(1)}) \ge \cdots \ge f(\tau^{(k)}) \ge f(\tau)$ holds for any support tree $\tau$ of $N$ other than $\tau^{(i)}$ ($i = 1, \dots, k$). However, such a ranking is not unique in general, since there can be 'ties' in the collection $\Omega$ of support trees of $N$ as well as in the family $\Omega_i$ of admissible arc-sets of each maximal zig-zag trail $Z_i$ in $N$. For convenience, we ensure the uniqueness of the ranking by using the lexicographical order $\le_{\mathrm{lex}}$ on vectors as follows.
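As a small illustration of the likelihood value, the following Python sketch evaluates the product of arc probabilities over the arcs of a support tree. The arc labels and probabilities are made-up inputs, and the helper names `likelihood` and `log_likelihood` are hypothetical. The logarithmic variant is an addition for numerical convenience only; because the logarithm is increasing, it orders support trees the same way up to floating-point rounding.

```python
import math


def likelihood(arcs, w):
    """Product of the arc probabilities w[a] over the given arcs."""
    return math.prod(w[a] for a in arcs)


def log_likelihood(arcs, w):
    """Sum of log-probabilities; ranks trees the same way as likelihood."""
    return sum(math.log(w[a]) for a in arcs)


# Hypothetical arc probabilities, each in (0, 1].
w = {"a1": 0.5, "a2": 0.8, "a3": 1.0}
tree_arcs = ["a1", "a2", "a3"]
print(likelihood(tree_arcs, w), log_likelihood(tree_arcs, w))
```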
Assume that $N$ is a tree-based phylogenetic $X$-network with $\Omega = \prod_{i=1}^{d} \Omega_i$ as in Theorem 3.1 and that $Z_i$ is any maximal zig-zag trail in $N$. We define the local ranking for $Z_i$ to be a totally ordered set $(\Omega_i, \le^*)$ such that for any Note that the elements of $\Omega_i$ are $|A(Z_i)|$-dimensional vectors and any two of them are comparable lexicographically. From now on, we identify the $j$-th element of $(\Omega_i, \le^*)$ with its local ranking number $j \in \{1, \dots, |\Omega_i|\}$ in order to write $\Omega = \prod_{i=1}^{d} \{1, \dots, |\Omega_i|\}$. Then, the elements of $\Omega$ are again vectors of the same dimension, and so we can break ties by using $\le_{\mathrm{lex}}$ as before. Abusing the notation $\le^*$ slightly, we call the totally ordered set $(\Omega, \le^*)$ the support tree ranking (for $N$). For any $k \in \mathbb{N}$ with $k \le |\Omega|$, the top-$k$ support tree ranking (for $N$) is defined to be the unique subsequence consisting of the first $k$ elements of $(\Omega, \le^*)$. Note that for any $k \in \mathbb{N}$, one can determine in $O(|A(N)|)$ time whether or not $k \le |\Omega|$ holds [8].
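One way to picture a local ranking is the Python sketch below, which sorts 0/1 vectors by non-increasing likelihood and breaks ties lexicographically. Two points are assumptions made here rather than statements from the text: the direction of the lexicographic tie-break (smaller vector first), and the sample vectors, which are arbitrary and not checked for admissibility. The name `local_ranking` is hypothetical.

```python
import math


def local_ranking(vectors, weights):
    """Sort 0/1 vectors by non-increasing likelihood, ties by lex order.

    `weights[t]` is the probability of the t-th arc of the trail.
    Assumption: among equal likelihoods, the lexicographically smaller
    vector is ranked first.
    """
    def value(v):
        return math.prod(wt for bit, wt in zip(v, weights) if bit)

    return sorted(vectors, key=lambda v: (-value(v), v))


# Hypothetical trail with three arcs and three candidate arc-sets.
weights = (0.5, 0.5, 0.9)
ranked = local_ranking([(1, 1, 0), (1, 0, 1), (0, 1, 1)], weights)
# Position j (1-based) in `ranked` plays the role of the local ranking number.
for j, v in enumerate(ranked, start=1):
    print(j, v)
```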

Problem 4.1 (Top-$k$ support tree ranking problem).
Input: A tree-based phylogenetic $X$-network $N$ with associated probability $w : A(N) \to (0, 1]$ and $k \in \mathbb{N}$ not exceeding the number $|\Omega|$ of support trees of $N$.
Output: The top-$k$ support tree ranking $\langle \tau^{(1)}, \dots, \tau^{(k)} \rangle$ for $N$.

Proposition 5.2. Let 〈τ
Let $e_i$ be the unit vector whose $i$-th component is one and whose other components are all zero. Also, for each $\tau \in \Omega \setminus \{\tau^{(1)}\}$, let $\mathrm{id}(\tau)$ be the first index $i$ such that the $i$-th component of $\tau$ is strictly greater than one, and let $e(\tau) := e_{\mathrm{id}(\tau)}$. For example, $\tau = (1\ 1\ 1\ 5\ 8)$ gives $e(\tau) = (0\ 0\ 0\ 1\ 0)$. Then, we have the next lemma, which is illustrated in Figure 1.
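The definitions of id(τ) and e(τ) translate directly into code. The following Python sketch uses hypothetical function names (`first_index_above_one`, `unit_vector`, `e_of`) and represents τ as a tuple of local ranking numbers with 1-based indices, as in the text.

```python
def first_index_above_one(tau):
    """1-based index of the first component of tau strictly greater than one."""
    for i, component in enumerate(tau, start=1):
        if component > 1:
            return i
    # Only the all-ones vector has no such index, and id is not defined for it.
    raise ValueError("id(tau) is undefined for the all-ones vector")


def unit_vector(i, dim):
    """Vector of length dim whose i-th (1-based) component is one."""
    return tuple(1 if j == i else 0 for j in range(1, dim + 1))


def e_of(tau):
    """The unit vector e(tau) := e_{id(tau)}."""
    return unit_vector(first_index_above_one(tau), len(tau))


print(e_of((1, 1, 1, 5, 8)))
```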

Lemma 5.4. Let $\Gamma$ be the graph as in Lemma 5.3 and let $Q_j$ be a subset of $V(\Gamma)$ that is recursively defined by
Then, for each $j \in [1, k]$, we have $\tau^{(j)} \in Q_j$ and $\tau^{(\ell)} \notin Q_j$ for all $\ell < j$.
In order to analyse the running time of the above algorithm, let us review some basics of a priority queue, which is a data structure for maintaining objects that are prioritised by their associated values. In its most basic form, a priority queue supports the operations called INSERT and DELETE-MIN, where the former refers to adding a new object, and the latter to detecting and deleting the one with the highest priority [2]. Implemented with a binary heap, each of these operations can be performed in $O(\log n)$ time, where $n$ denotes the number of the elements in the priority queue [2].

Proof. As Equation (1) implies that $|Q_{j+1}| - |Q_j| \le 1$ holds for any $j \in [1, k-1]$, $|Q_j| \le k$ holds for any $j \in [1, k]$. Then, if we keep the elements of each $Q_j$ in a priority queue, $O(\log k)$ time suffices to return $\tau^{(j)}$ and to delete $\tau^{(j)}$ from $Q_j$. Also, once $\mathrm{child}^*(\tau^{(j)})$ and $\mathrm{sibling}^*(\tau^{(j)})$ have been obtained, inserting the two elements requires $O(\log k)$ time. We note that $O(\log k) \le O(|A(N)|)$ follows from $k \le 2^{|A(N)|}$. By Proposition 5.1, for each $j \in [1, k-1]$, one can compute $\{\mathrm{child}^*(\tau^{(j)}), \mathrm{sibling}^*(\tau^{(j)})\}$ in $\sum_{i=1}^{d} O(|A(Z_i)|)$ time, which equals $O(|A(N)|)$ time as $\{Z_1, \dots, Z_d\}$ is a decomposition of $N$. Hence, our algorithm can return $\tau^{(1)}, \dots, \tau^{(k)}$ one after the other in such a way that the delay between two consecutive outputs is $O(|A(N)|)$ time. This completes the proof.
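To make the role of the priority queue concrete, here is a minimal Python sketch of a generic best-first search over a product of locally ranked lists, using the standard `heapq` binary heap. It is not the algorithm analysed above: instead of the child* and sibling* candidates, it pushes every one-step successor and uses a `seen` set to avoid duplicates, so it does not achieve the same delay bound. The function name `top_k_products` and the input layout are assumptions made for illustration.

```python
import heapq
import math


def top_k_products(local_values, k):
    """Return up to k (index_vector, value) pairs, best value first.

    local_values[i] lists the likelihoods of the i-th local ranking in
    non-increasing order; an index vector holds 1-based local ranking
    numbers, and its value is the product of the selected likelihoods.
    """
    d = len(local_values)

    def value(idx):
        return math.prod(local_values[i][j - 1] for i, j in enumerate(idx))

    start = (1,) * d
    heap = [(-value(start), start)]  # min-heap on negated value
    seen = {start}
    results = []
    while heap and len(results) < k:
        neg, idx = heapq.heappop(heap)  # DELETE-MIN
        results.append((idx, -neg))
        for i in range(d):
            if idx[i] < len(local_values[i]):
                nxt = idx[:i] + (idx[i] + 1,) + idx[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-value(nxt), nxt))  # INSERT
    return results


# Hypothetical local rankings for two trails.
ranking = top_k_products([[0.9, 0.5], [0.8, 0.4]], 3)
print([idx for idx, _ in ranking])
```

Because increasing a local ranking number never increases the product, every index vector is pushed before any vector with a smaller value is popped, which is what allows the heap to emit results in non-increasing order.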
Finally, we make two remarks. First, $\Omega(k|A(N)|)$ time is required to output $k$ distinct support trees of $N$ as each support tree has size $\Omega(|A(N)|)$. Therefore, the running time of our algorithm (as well as that of the enumeration algorithm in [8]) is $\Theta(k|A(N)|)$, which guarantees the optimality of those algorithms. Second, as is common in the literature (e.g., [10]), it would be natural to wonder about the time complexity of an analogue of Problem 4.1 that only asks for outputting a sequence of the differences between $\tau^{(j-1)}$ and $\tau^{(j)}$; however, we note that this problem still requires $\Omega(k|A(N)|)$ time in the worst case because the size of each difference can be $\Omega(|A(N)|)$. To illustrate this, consider a tree-based phylogenetic $X$-network $N$ that is decomposed into maximal fences, each of which has only one admissible arc-set, and $c$ crowns, each of which has size $\Omega(|A(N)|/c)$. The difference between any two support trees has size $\Omega(|A(N)|/c)$, which equals $\Omega(|A(N)|)$ if $c$ is a constant.