Coded-MPMC: One-to-Many Transfer Using Multipath Multicast With Sender Coding

One-to-many transfers in a fast and efficient manner are essential to meet the growing need for duplicating, migrating, or sharing bulk data among servers in a datacenter and across geographically distributed datacenters. Some existing works utilize multiple multicast trees for a one-to-many transfer request to increase network link utilization and its transfer throughput. However, since those schemes do not fully utilize the max-flow value of transmission from a single sender to each recipient, there is room for each recipient to retrieve data more quickly. Therefore, assuming fully-controlled networks with full-duplex links, we pose a problem to find a set of multicast flows with an allocation of block-wise transmissions by which each of multiple recipients with diverse max-flow values from the sender can utilize its own max-flow value. Based on that, assuming a sender-side coding capability on file blocks, we design a schedule of block transmissions over multiple phases by which each recipient can achieve a lower-bound of its file retrieval completion time, i.e., the file size divided by its own max-flow value. This paper presents the coded Multipath Multicast (Coded-MPMC) for one-to-many transfers with heuristic procedures to find a desired set of multicast flows on which block transmissions are scheduled. Through extensive simulations on large-scale real-world network topologies and different types of randomly-generated synthetic topologies, the proposed method is shown to design a desired schedule efficiently. A preliminary implementation on OpenFlow is also reported to show the fundamental feasibility of Coded-MPMC.


I. INTRODUCTION
With the popularization of cloud, distributed computing, and contents delivery networking technologies, there is a growing need for duplicating, migrating, or sharing bulk data among distributed servers or sites. The edge-cloud computing for emerging IoT technologies will further accelerate the need for delivery of a large file to a larger number of servers distributed over geographically-wider locations. Therefore, fast and efficient one-to-many transfers are demanded, in which a single sender delivers a file to multiple recipients over network paths not only in a datacenter but across geographically distributed datacenters [1].
For one-to-many transfers, transferring a file by ''reliable'' multicast is essentially adopted to avoid wasting link capacity on the links shared by many paths to recipients and to avoid overloading the sender. However, when a single multicast tree is used (e.g., [2]), the completion time for each recipient to retrieve a file is prolonged by the recipient at the worst location. To address this limitation, the use of multiple multicast trees with appropriately grouped recipients was considered [3]. For one-to-many transfers in a datacenter, a P2P-based flexible data dissemination scheme was proposed focusing on typical datacenter network topologies such as the FatTree and Multi-Routed Tree [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Christian Esposito.
The strong need for fast transfers of bulk data among globally-distributed datacenters has led to the deployment of dedicated high-speed inter-datacenter backbone networks (e.g., [5]-[7]), which are centrally managed based on software-defined networking (SDN) technology including OpenFlow. Recent studies often utilize multiple multicast trees realized by flexible and adaptive path routing based on SDN technology [8]-[10].
On the other hand, data transfers from a single sender to a single recipient leveraging multiple network paths have emerged and attracted attention for diverse applications [11].
For example, a multipath transmission scheme was proposed using a fountain code to encode transmission data and achieve higher reliability without retransmission [12]. The effective use of Multipath TCP (MPTCP) inter datacenters has also been studied [13]. File transfers on multiple network paths inter datacenters with fair resource allocation have been studied based on ''multi-commodity max-flow problem'' [14].
Block/packet-level coding techniques have been extensively studied in one-to-one, one-to-many, and many-to-many information dissemination to improve throughput, delay, and/or resilience in both multicast (e.g., [15], [16]) and multipath (e.g., [12]) settings. Coding not only at a sender node but also at network-internal relay nodes is often referred to as Network Coding, which can be used to maximize the information dissemination rate at the sender as remarked in the seminal work [17]. However, in practice, it is difficult to implement a network coding system on each network-internal relay node (i.e., switch) as well as to find an optimal coding scheme on a given network topology.
Motivated by these observations, we proposed the Multipath Multicast (MPMC) to transfer a large file from a single sender to multiple recipients on a bandwidth-guaranteed and fully-controlled OpenFlow network with full-duplex wired links [18]. A typical scenario to which MPMC will be applied is as follows. One-to-many transfers are performed one by one in a scheduled manner on a centrally-managed network; while a one-to-many transfer is in progress, no other one-to-many transfer is performed on the network unless some residual network resources are available. In such a scenario, a one-to-many file transfer task from a sender should be performed as quickly as possible, followed by a next transfer task from a different sender. Our essential interest is how to use multiple multicast flows from the sender to all or a part of the recipients in transferring a single file, so as to maximally exploit the transmission capacities reserved for a specific one-to-many transfer in full-duplex links, i.e., to realize the max-flow value from the sender to each recipient. To do so, different parts of the file should be concurrently transmitted to the same recipient on multiple paths to fully utilize the link capacities, and the same data should be concurrently transmitted to different recipients on a multicast tree to efficiently utilize the link capacities.
We define the retrieval completion time (RCT) of a recipient as the duration from the time when the sender starts the file transfer until the recipient completes the file retrieval. Given a sender and a set of recipients in one-to-many transfer, RCTs generally differ among recipients because the location of recipients in relation to the sender's location is heterogeneous in terms of network topologies and link capacities. Note that the flow completion time (FCT) [19] is often adopted as a performance metric in research on TCP flows and a similar concept to RCT. However, we adopt the RCT in this paper as a performance metric to emphasize the time of each recipient's file retrieval completion; this is because our scheme deals with multiple multicast transmissions over multiple phases to deliver a file to multiple recipients and thus, the RCT is not directly bound with a single flow. Figure 1 illustrates the advantage of MPMC transfer compared to multicast transfer. A file of size L is transferred from a single sender s to two recipients r 1 and r 2 over a network with relay nodes (i.e., OpenFlow switches) and bidirectional (i.e., full-duplex) links. In the network, the link capacity between relay nodes is one in each direction, and that between a relay node and a sender/recipient is sufficiently large.
In the multicast transfer as shown in Fig. 1(a), a single tree from s to $r_1$ and $r_2$ is constructed to transfer efficiently, and each recipient retrieves the file in RCT L using a single multicast flow whose flow value is one. On the other hand, MPMC transfer minimizes the RCT of each recipient by using multipath transfer to increase the available link capacity and multicast transfer to decrease the link capacity consumed by multiple recipients, as shown in Fig. 1(b). In MPMC transfer, the file is divided into two original blocks ''a'' and ''b'', each of which is transmitted on a different multicast tree, and each recipient retrieves the file in RCT L/2 at the same time, which is twice as fast as in the multicast transfer and is equal to the theoretical RCT lower-bound. Note that the full-duplex transmission is essential, as shown in the link between nodes 1 and 2 in Fig. 1.
Our goal is to design a schedule of block transmissions on an appropriate set of multicast flows by which each of multiple recipients completes the file retrieval in its own lower-bound time, i.e., the file size divided by its own max-flow value from the sender. However, as described in Sec. III-A, such a schedule does not always exist when only original blocks are transmitted in MPMC (referred to as Basic-MPMC). Therefore, we have proposed an extension of Basic-MPMC with a sender-side coding scheme, referred to as Coded-MPMC [20]. In Coded-MPMC, a necessary number of coded blocks are generated by the sender from the original blocks of a file by using maximum distance separable (MDS) coding. Both the original and coded blocks can be transmitted over one or multiple phases. Note that, since we assume standard OpenFlow networks as the target of our scheme, we do not consider network coding and focus only on sender coding. For the sender coding, throughout this paper, we adopt a systematic Reed-Solomon (RS) coding as an example of MDS coding to explain the Coded-MPMC scheme.
In this paper, our contributions are summarized as follows: 1) We formulate the Coded-MPMC problem in which multicast, multipath, and coded data transfers are integrated, and investigate a desired set of unit multicast flows with the allocation of blocks (called globally bandwidth-efficient block allocation; GBE-BA) essentially required to design a schedule over multiple phases of Coded-MPMC that minimizes the RCT of every recipient; such a schedule is optimal in terms of the RCTs of all recipients. 2) To find a GBE-BA for the 1 st phase, we develop a method to construct a desired set of unit multicast flows by combining heuristic procedures. 3) Through simulation-based evaluations on large-scale realistic network topologies in [21] and random network topologies generated by [22], we demonstrate that the proposed method can design an optimal Coded-MPMC schedule in a wide range of network topologies. 4) We successfully implemented and tested our Coded-MPMC system on a wide-area OpenFlow testbed network, which suggests its fundamental feasibility. To the best of our knowledge, for a single one-to-many file transfer to multiple recipients on an arbitrary network topology with full-duplex wired links, no other work has fully exploited the max-flow value of each recipient and achieved the theoretical RCT lower-bound of every recipient.
The rest of the paper is organized as follows. In Sec. II, we introduce related work on transmission scheduling using single or multiple multicast trees in SDN. In Sec. III, we show the necessity of Coded-MPMC through a critical example of Basic-MPMC. In Sec. IV, we formulate the problem and present the Coded-MPMC scheduling. In Sec. V and VI, we propose a method with heuristic procedures to design an appropriate schedule in Coded-MPMC and check if the schedule is optimal through extensive simulations on a variety of topologies. In Sec. VII, an implementation of Coded-MPMC is introduced. Finally, we provide discussion on Coded-MPMC in Sec. VIII, and conclude in Sec. IX.

II. RELATED WORK
Avalanche [23] utilizes a minimal-sized multicast tree for a multicast request in typical datacenter topologies, which enables one-to-many transfers to all recipients with fewer resources. DCCAST [24] constructs a minimal-weighted multicast tree over a network graph, where each link is assigned a weight according to the total load of the already scheduled requests and a given new request. Such an assignment of weights makes it possible to construct a minimal-sized multicast tree that avoids highly loaded links and allows multiple multicast requests to coexist on the network at the same time. However, since these approaches use only a single multicast tree for a multicast request, the transfer throughput is limited by a bottleneck link with the lowest capacity in the multicast tree or with congestion caused by competing flows.
Using multiple multicast trees for a multicast request increases network link utilization to improve the transfer throughput. In [25], each transfer request with a demanded deadline is completed by maximizing the transfer throughput of all requests using multiple multicast trees. For each request, transmission rates and the number of utilized trees are calculated based on the proposed linear programming formulation. In QuickCast [26], the recipients of a single one-to-many file transfer are partitioned into several groups using multiple multicast trees according to transmission rate allocation, which can efficiently and quickly accommodate coexisting multiple file transfers over the network with a series of file transfer requests dynamically arriving.
In recent works, multicast transfer schemes are also studied over reconfigurable datacenter networks with circuit switches that can reconfigure links and fifth-generation (5G) networks. DaRTree [27] dynamically changes specified links over a reconfigurable network, and allocates various rates to each multicast request per discrete time unit, i.e., timeslot, to complete the data transfer by the deadline. Timeslots are fully and efficiently utilized, and transfer throughput to each recipient is maximized. The aim of [28] is to reduce provisioning costs of Network Functions (NFs) with the capability of multicast replication, by minimizing the number of deployed NFs. When there are multicast requests that cannot meet the required transfer throughput between NFs in the designed single multicast tree, the scheduler allows multipath transfers between those NFs, which reduces the installation of extra NFs, and meets the demand.
Contrary to these multicast-based schemes and settings, which focus on scheduling multiple one-to-many transfer requests [24]-[28], MPMC deals with a single one-to-many transfer request and aims to design a block-wise transmission schedule using appropriate multiple multicast trees so that each recipient can retrieve the file using its own max-flow value on a static full-duplex network. Based on Basic-MPMC [18], we developed variants in two directions: a coding-based one [29] and a gossiping-based one [30]. In [29], coded blocks are generated by combining two or more original blocks using exclusive OR (XOR) coding based on a reactive and opportunistic algorithm and are transmitted to recipients to recover (i.e., decode) their unreceived blocks. In [30], one or more blocks are transmitted not only from the sender but also from the recipients that have already retrieved all blocks to a recipient that has not yet received all blocks, to make use of unused residual links efficiently. Following them, a preliminary version of Coded-MPMC [20] was developed that can outperform the previous two variants, but it was also recognized that the preliminary Coded-MPMC is still unable to design an optimal schedule in some real topologies; therefore we focus on Coded-MPMC.

III. LIMITATION OF BASIC-MPMC
A. CRITICAL EXAMPLE
In both Basic-MPMC and Coded-MPMC, equally-sized blocks are transmitted to multiple recipients using multiple multicast trees over one or more phases. In Fig. 2, there are sender s, recipient $r_3$ with the max-flow value of three, and three recipients $r_1$, $r_2$, and $r_4$, each with the max-flow value of two. A file to transfer is divided into six blocks {a, b, c, d, e, f} because LCM(2, 3) = 6. In any desired setting, in the 1st phase, $r_3$ should complete the file retrieval by receiving all six blocks, while $r_1$, $r_2$, and $r_4$ should receive four blocks using two multicast flows. In the 2nd phase, $r_1$, $r_2$, and $r_4$ should receive two unreceived blocks to complete the file retrieval. Figure 2(a) shows the set of multicast flows from s to some/all recipients and the set of received blocks at the end of the 1st phase. Since $r_1$, $r_2$, and $r_4$ have not completed the file retrievals, Basic-MPMC proceeds to the 2nd phase. Figure 2(b) shows the set of multicast flows from s to some/all uncompleted recipients and the set of received blocks at the end of the 2nd phase. The number of blocks transmitted on each multicast flow is one; $r_1$ and $r_2$ receive blocks {e, f} using two multicast flows and complete file retrievals, but $r_4$ cannot receive block {d} using the residual link (3 → 4). This is because the blocks received by those recipients are different in the 1st phase; $r_1$ and $r_2$ have {a, b, c, d}, whereas $r_4$ has {a, b, e, f}. Note that, if there were a desired set of multicast flows by which $r_1$, $r_2$, and $r_4$ receive the same blocks in the 1st phase, those recipients could complete at the same time in the 2nd phase. However, it is impossible to construct such multicast flows for the 1st phase. This is because those three multicast flows should deliver mutually different blocks so that recipient $r_3$ receives all six different blocks.
Therefore, different pairs of two multicast flows deliver different sets of four blocks, so that at least one of the three recipients ($r_1$, $r_2$, and $r_4$) receives a set of blocks different from the other two recipients. Figure 2(c) shows the time chart of Basic-MPMC transfer in each link. Letting L be the file size, $r_3$ completes file retrieval in its lower-bound RCT L/3, and $r_1$ and $r_2$ complete in their lower-bound RCT L/2. However, $r_4$ cannot receive block {d} in the 2nd phase and therefore completes in RCT 2L/3, which is not its lower-bound.

B. CODED-MPMC SOLUTION TO CRITICAL EXAMPLE
In Coded-MPMC, by generating a set of coded blocks from the original blocks using RS coding, both original and coded blocks can be transmitted. Each recipient can retrieve the file when it receives a necessary number of different blocks that are either original or coded. Figure 3 shows an example of Coded-MPMC transfer in the same network as Fig. 2. The 1st phase is the same as that for Basic-MPMC in Fig. 2(a). Figure 3(a) shows the set of multicast flows from s to some/all recipients and the set of received blocks at the end of the 2nd phase in Coded-MPMC. In the 2nd phase, transmitting one coded block {g} enables all recipients left uncompleted after the 1st phase to retrieve six different blocks using three multicast flows. As shown in Fig. 3(b), in Coded-MPMC, $r_4$ also completes file retrieval in its lower-bound RCT L/2; this is an optimal schedule.
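The decoding step in this example can be illustrated with a small sketch. It is not the paper's RS implementation: as a stand-in, it uses a single XOR parity block, which is the simplest MDS code when exactly one coded block is needed (any six of the seven blocks determine the file). The block contents and the assumption that $r_4$ receives {d, g} in the 2nd phase are hypothetical choices made only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Six original blocks with made-up contents (4 bytes each).
originals = {name: bytes([i] * 4) for i, name in enumerate("abcdef", start=1)}

# Single coded block g = a ^ b ^ c ^ d ^ e ^ f (stand-in for a systematic RS parity block).
g = xor_blocks(list(originals.values()))

# Assumed outcome for r4: it holds {a, b, e, f} after the 1st phase
# and receives {d, g} in the 2nd phase, i.e., six different blocks in total.
received = {name: originals[name] for name in "abefd"}
received["g"] = g

# The only original block r4 never received.
missing = (set(originals) - set(received)).pop()

# XOR of the six received blocks cancels everything except the missing block.
recovered = xor_blocks(list(received.values()))
```

With more than one coded block, a single parity no longer suffices, which is where an RS code as adopted in the paper would be required.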

A. NETWORK MODEL
We target one-to-many transfers on a fully-controlled network consisting of network nodes (i.e., routers or switches) with bandwidth-guaranteed full-duplex wired links connecting them. There are a sender host and multiple recipient hosts, which are connected to the network. Each of those hosts is assumed to be attached to a different node through a link with a sufficiently large transmission capacity. In other words, no bottleneck of file transfers lies in the links attaching the sender/recipient hosts.
VOLUME 9, 2021
Our targeted communication network with full-duplex links is modeled as a connected and symmetric directed graph (digraph) D(V, A); V represents a set of nodes, and A ⊂ V × V represents physical directed connections of adjacent nodes. Let c(u, v) be a positive integer value representing the capacity of link (u, v) ∈ A; that is, the capacity of each link is a multiple of the unit capacity 1. Digraph D(V, A, C) is defined in association with static link capacities (mapping $C : (u, v) \mapsto c(u, v)$). All links are assumed to be static, error-free, and mutually independent to simplify the file transfer process proposed in this paper.
Let N = |V| be the number of nodes. All nodes are numbered (as index) from 0 to (N − 1). Let s be the index to which the sender host is attached; R be the set of indexes of nodes to which the recipient hosts are attached. Let K = |R| be the number of recipients; K ≤ N − 1. In reality, a file is transferred by a sender host attached to node s, and is received by each recipient host attached to a node in R. However, since each host is attached to a corresponding node with a sufficiently large capacity, we can consider s as a sender and R as a recipient set hereinafter.
On given D(V, A, C), a path is defined as a cascaded sequence of links $(u, w_1), (w_1, w_2), \ldots, (w_k, v)$ from node u to node v without loop; its length is 1 if $w_1 = v$, and is (k + 1) otherwise. Since only non-loop (non-cycle) paths are considered, all nodes are mutually different; that is, u, $w_j$ (j = 1, 2, ..., k), and v are different nodes if the path length is k + 1. A unit flow $f_{u,v}$ is a data flow from u to v along a path from u to v consuming the unit capacity at each link on the path. If a node is passed by a flow, it is said to belong to the flow. Note that hosts attached to any node belonging to the flow (except for u) can receive data from u on the flow. A set of unit flows $\{f^{(1)}, f^{(2)}, \ldots\}$ is called compatible if the number of the unit flows traversing link (p, q) does not exceed its capacity c(p, q) for any link (p, q) traversed by at least one of those flows. Compatible flows can transfer data simultaneously and independently. In the special case that all links have the unit capacity (i.e., c(p, q) = 1), a compatible set is equivalent to an arc-disjoint set of paths. Furthermore, for a data transfer from u to v, multiple paths can be utilized simultaneously. The maximum number of compatible unit flows (along multiple paths in general) from u to v is called the max-flow value m(u, v), which is the maximally possible aggregate capacity of a concurrent transfer on a set of unit flows. Finding the max-flow value m(u, v) and the max-flow set (a set of m(u, v) unit flows from u to v) is known as the max-flow problem. Note that, for a given source-sink pair (u, v), the max-flow set is not unique in general.
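As a concrete illustration of the max-flow value m(u, v) and the resulting RCT lower bound, the following is a minimal sketch using the textbook Edmonds-Karp algorithm. The paper does not state which max-flow algorithm it uses, so this choice, the function names, and the 5-node topology are assumptions made only for this example.

```python
from collections import deque
from fractions import Fraction


def max_flow_value(capacity, source, sink):
    """Edmonds-Karp: return the max-flow value m(source, sink).

    capacity maps directed links (u, v) to positive integer capacities.
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    residual = {source: {}, sink: {}}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)  # reverse arc used to cancel flow
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Collect the links of the path and push its bottleneck capacity.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


def symmetric(links):
    """Full-duplex links: the same capacity in each direction."""
    return {arc: c for (u, v), c in links.items() for arc in ((u, v), (v, u))}


# Hypothetical topology with unit-capacity full-duplex links; node 0 is the sender.
capacity = symmetric({(0, 1): 1, (0, 2): 1, (1, 2): 1, (1, 3): 1, (2, 3): 1, (3, 4): 1})
file_size = 6
m_03 = max_flow_value(capacity, 0, 3)
m_04 = max_flow_value(capacity, 0, 4)
# RCT lower bound: file size divided by the recipient's own max-flow value.
rct_bounds = {3: Fraction(file_size, m_03), 4: Fraction(file_size, m_04)}
```

Here node 3 is reachable over two arc-disjoint paths while node 4 sits behind a single unit-capacity link, so the two recipients have different lower bounds, which is the heterogeneity the scheduling has to handle.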
A unit multicast flow $F_{u,W}$ from node u to a node set W is a multicasting data flow from u to W along a multicast tree $T_{u,W}(V', A')$ consuming the unit capacity at each link on the multicast tree. Note that any host attached to a node (either ∈ W or ∈ V \ W) on tree $T_{u,W}$ can receive data on the flow. A set of unit multicast flows $\{F^{(1)}, F^{(2)}, \ldots\}$ is called compatible if the number of the unit flows using link (p, q) does not exceed its capacity c(p, q) for any link (p, q) used by at least one of those flows. Compatible multicast flows can transfer data simultaneously and independently. In the special case that all links have the unit capacity (i.e., c(p, q) = 1), a compatible set is equivalent to an arc-disjoint set of directed trees.
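The compatibility condition amounts to a per-link load count, as the following minimal sketch shows. Representing each unit (multicast) flow by the set of directed links of its path or tree is an assumption of this example, as are the function name and the 3-node topology.

```python
from collections import Counter


def is_compatible(flows, capacity):
    """Return True if the unit (multicast) flows can run simultaneously.

    Each flow is given as the set of directed links (p, q) of its path or tree;
    every flow consumes one unit of capacity on each of its links.
    """
    load = Counter(link for links in flows for link in set(links))
    return all(load[link] <= capacity.get(link, 0) for link in load)


# Hypothetical triangle of unit-capacity full-duplex links.
capacity = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (0, 2): 1, (2, 0): 1}
tree_a = {(0, 1), (1, 2)}  # reaches nodes 1 and 2 via node 1
tree_b = {(0, 2), (2, 1)}  # reaches nodes 2 and 1 via node 2
tree_c = {(0, 1)}          # competes with tree_a for link (0, 1)

both_directions_ok = is_compatible([tree_a, tree_b], capacity)
shared_link_ok = is_compatible([tree_a, tree_c], capacity)
```

The first pair uses the link between nodes 1 and 2 in opposite directions, which is allowed on a full-duplex link; the second pair overloads the directed link (0, 1).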

B. MODEL OF BASIC-/CODED-MPMC
On given connected and symmetric digraph D(V, A, C), sender s ∈ V, and recipient set R ⊂ V, the MPMC (D(V, A, C), s, R) problem is to design a schedule of one-to-many transfer on an appropriate set of unit multicast flows rooted at s for distributing the file to each of the recipients (∈ R) as quickly as possible.
The retrieval completion time (RCT or RCT[r]) of recipient r is defined as the duration from the time when sender s starts sending the file to the time when recipient r has received the entire contents of the file. Note that the propagation delay at the link and the processing delay at the node are ignored.
The RCT inherently varies among recipients due to the heterogeneity of their locations in the network. The lower bound of RCT[r] is $L / m(s, r)$ unit time, that is, the file size L divided by the max-flow value m(s, r) (by defining the unit capacity as the unit data size per unit time). A schedule is called ''optimal'' if and only if RCT[r] is equal to its lower-bound time for every recipient r.
Let us classify all recipients by their own max-flow values. Assuming that those max-flow values m(s, r) take h different values $m_1 > m_2 > \ldots > m_h$, the recipient set M(j) is defined as the set of all recipients whose max-flow value from the targeted sender is $m_j$. We call it ''the j-th group'', which depends on the sender location. Therefore, in any optimal schedule, for j = 1, 2, ..., h, every recipient in M(j) completes the file retrieval at time $L / m_j$.
To design an optimal schedule, the 1st phase is defined as the period from the start time of transfer to the completion time of retrieval by all the recipients in the 1st group. For j = 2, 3, ..., h, the j-th phase is the period from the completion time of retrieval by recipients in M(j − 1) to the completion time of retrieval by recipients in M(j). Thus, in any optimal schedule, the duration of the 1st phase is $L / m_1$ and that of the j-th phase is $L / m_j - L / m_{j-1}$. The file transfer proceeds in such a phase-by-phase manner until all recipients have retrieved the entire file if there are multiple recipient groups. Let $S_r(j)$ be the set of all blocks that are scheduled to be received by recipient r before the j-th phase. At the beginning of the 1st phase in any schedule, all recipients have no block, i.e., $\forall r \in R\ (S_r(1) = \emptyset)$. On the other hand, at the beginning of the j-th phase (j > 1) in any ''optimal'' schedule, the number $|S_r(j)|$ of blocks to be received in the previous phases by recipient r should be $d \cdot m_i / m_{j-1}$ for any recipient r ∈ M(i) (i ≥ j), where d is the number of original blocks of the file.
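The phase durations and block counts above can be checked numerically with a short sketch. It assumes that the number of blocks d is the least common multiple of the distinct max-flow values, generalizing the LCM(2, 3) = 6 choice of the critical example; the function name and the returned layout are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def phase_plan(file_size, max_flow_values):
    """Durations of the phases and blocks held at the start of each phase."""
    m = sorted(set(max_flow_values), reverse=True)  # m_1 > m_2 > ... > m_h
    d = reduce(lambda a, b: a * b // gcd(a, b), m)  # assumed number of blocks
    plan = []
    for j, m_j in enumerate(m, start=1):
        end = Fraction(file_size, m_j)
        start = Fraction(file_size, m[j - 2]) if j > 1 else Fraction(0)
        # |S_r(j)| = d * m_i / m_{j-1} for every group i >= j (zero in the 1st phase).
        held = {
            m_i: d * Fraction(m_i, m[j - 2]) if j > 1 else Fraction(0)
            for m_i in m[j - 1:]
        }
        plan.append({"phase": j, "duration": end - start, "held": held})
    return d, plan


# Critical example: one recipient with max-flow value 3, three recipients with value 2.
d, plan = phase_plan(6, [3, 2, 2, 2])
```

For the critical example this gives d = 6, a 1st phase of duration L/3 and a 2nd phase of duration L/2 − L/3, with the recipients of the 2nd group holding four blocks when the 2nd phase starts.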
Let R(j) be the set of all uncompleted recipients at the beginning of the j-th phase, i.e., recipients who need to receive additional blocks in the j-th phase. For the j-th phase (j = 1, 2, ..., h), a block allocation $\sigma_j = \{(F^{(k)}_{s,W_k}(V_k, A_k), B^{(k)}) \mid k = 1, 2, \ldots, t\}$ is determined as follows:
• we determine a set $\{F^{(k)}_{s,W_k}(V_k, A_k) \mid k = 1, 2, \ldots, t\}$ of unit multicast flows that are compatible and rooted at s with some leaf node set consisting of all or a part of uncompleted recipients (i.e., $W_k \subset R(j)$), and
• for each k = 1, 2, ..., t, we determine a set $B^{(k)}$ of blocks to be transmitted on the k-th unit multicast flow $F^{(k)}_{s,W_k}$.
In Basic-MPMC, $B^{(k)}$ consists of original blocks only, while in Coded-MPMC, $B^{(k)}$ can include coded blocks.
Block allocation σ for a phase is called ''bandwidth-efficient (BE) for recipient r'' if r receives its maximally possible number of blocks on the max-flow value m(s, r) of unit flows in that phase, that is, r can fully utilize a max-flow set from s to r. Block allocation σ for a phase is called ''globally bandwidth-efficient'' (GBE) if it is BE for every uncompleted recipient in that phase. A GBE block allocation should satisfy the following condition(s).
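(C1) In the 1st phase, any recipient r in the i-th group (i = 1, 2, ..., h) should receive different blocks using the max-flow value $m_i$ of unit multicast flows that are all or a part of $\{F^{(k)}_{s,W_k} \mid k = 1, 2, \ldots, t\}$. In other words, for any recipient r in the i-th group, the number of unit multicast flows passing r should be $m_i$, i.e., $|\{k \mid r \in V_k\}| = m_i$ if $r \in M(i)$.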
For any subsequent phase, the block allocation should consider the set S r (j) of blocks that are scheduled to be received by r in previous phases.
(C2) In the j-th phase (j > 1), any recipient r in the i-th group (i = j, j + 1, ..., h) should receive different ''unreceived'' blocks that are not in $S_r(j)$, using the max-flow value $m_i$ of unit multicast flows that are all or a part of $\{F^{(k)}_{s,W_k} \mid k = 1, 2, \ldots, t\}$. That is, for any uncompleted recipient r, if r is a recipient in the i-th group, the number of unit multicast flows passing r should be $m_i$, i.e., $|\{k \mid r \in V_k\}| = m_i$ if $r \in M(i)$. Note that the number of blocks to be transmitted on each unit multicast flow should be $d / m_j - d / m_{j-1}$. The number t of unit multicast flows in the j-th phase should be equal to or greater than $m_j$. Furthermore, if r belongs to the k-th unit multicast flow $F^{(k)}$, $F^{(k)}$ should not transmit any block that is scheduled to be received by r in previous phases, i.e., $B^{(k)} \cap S_r(j) = \emptyset$.
A schedule of MPMC consists of a series $(\sigma_1, \sigma_2, \ldots, \sigma_h)$ of block allocations from the 1st phase to the last (h-th) phase. If all block allocations are GBE, the schedule is optimal.
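The GBE conditions for one phase can be expressed as a simple check, sketched below. The data layout (each unit multicast flow given as a pair of the recipients on it and the blocks it carries) and the two 2nd-phase allocations are hypothetical; they are meant to mirror the critical example, not to reproduce the exact flows of Figs. 2 and 3.

```python
def is_gbe(allocation, max_flow, received_before):
    """Check a block allocation against the GBE conditions for one phase.

    allocation: list of (recipients_on_flow, blocks) pairs, one per unit multicast flow.
    max_flow: dict mapping each uncompleted recipient r to m(s, r).
    received_before: dict mapping r to S_r(j); use empty sets in the 1st phase.
    """
    for r, m_r in max_flow.items():
        on_r = [blocks for recipients, blocks in allocation if r in recipients]
        # r must lie on exactly m(s, r) unit multicast flows.
        if len(on_r) != m_r:
            return False
        new_blocks = [b for blocks in on_r for b in blocks]
        # Blocks arriving on different flows must be mutually different ...
        if len(set(new_blocks)) != len(new_blocks):
            return False
        # ... and must not repeat blocks received in previous phases.
        if set(new_blocks) & received_before.get(r, set()):
            return False
    return True


max_flow = {"r1": 2, "r2": 2, "r4": 2}
held = {"r1": set("abcd"), "r2": set("abcd"), "r4": set("abef")}

# Assumed 2nd phase with one coded block g on the flow shared by all three recipients.
coded = [({"r1", "r2", "r4"}, {"g"}), ({"r1", "r2"}, {"e"}), ({"r4"}, {"c"})]
# Assumed 2nd phase with original blocks only: the shared flow repeats a block held by r4.
basic = [({"r1", "r2", "r4"}, {"e"}), ({"r1", "r2"}, {"f"}), ({"r4"}, {"d"})]

coded_is_gbe = is_gbe(coded, max_flow, held)
basic_is_gbe = is_gbe(basic, max_flow, held)
```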
In Basic-MPMC, only original blocks are transmitted, and each recipient can retrieve the file when it receives all d original blocks. Therefore, in a simple but critical example shown in Sec. III-A, any block allocation cannot satisfy the required condition for the 2 nd phase. In other words, even if we find a GBE block allocation for the 1 st phase, it is not guaranteed to design an optimal schedule.
In Coded-MPMC, on the other hand, as shown in Sec. III-B, both original and coded blocks are transmitted and each recipient can retrieve the file when it receives any d different blocks that are either original or coded. As explained in the next subsection, if we find a GBE block allocation for the 1st phase, we can always design an optimal schedule.
In the j-th phase (j > 1), block allocation $\sigma_j$ is derived from the block allocations for past phases $(\sigma_1, \sigma_2, \ldots, \sigma_{j-1})$ as shown in Alg. 1 to distribute the coded and/or original blocks to all the uncompleted recipients R(j) so that all the recipients in the j-th group M(j) complete the file retrieval.
Note that, if block allocation $\sigma$ is GBE for the j-th phase (i.e., BE for $\forall r \in R(j)$), we can simply derive $\sigma'$ that is GBE for the (j+1)-th phase (i.e., BE for $\forall r \in R(j+1)$). All or a part of the multiple unit multicast flows $F^{(1)}_{s,W_1}, F^{(2)}_{s,W_2}, \ldots, F^{(t)}_{s,W_t}$ that cover the recipient set R(j) in $\sigma$ can be used in $\sigma'$ to cover the recipient set $R(j+1) \subset R(j)$, and a set of coded blocks that is not used in the previous phases can be allocated to the unit multicast flows.

Algorithm 1 Block Allocation Algorithm in the j-th Phase to Reuse Already Allocated Blocks
Input: $\sigma_{j-1} = \{(F^{(k)}_{s,W_k}(V_k, A_k), B^{(k)}) \mid k = 1, 2, \ldots, t\}$, R, R(j), $\{S_r(j) \mid r \in R\}$, d, Set C of all possible coded blocks, Set $C^*$ of already allocated coded blocks
Output: $\sigma_j$
1-2: ... ▷ Set of blocks allocated in the previous phases
3: $\sigma_j \leftarrow \emptyset$;
4: n ← 0;
5: for k = 1, 2, ..., t do
6:   U ← ... ▷ Set of uncompleted recipients in $F^{(k)}$
7:   if $U \neq \emptyset$ then
8:     n ← n + 1;
9:     ...
10:    $W_n$, $V_n$, and $A_n$ are reduced so that $W_n \subset U$ on $F^{(n)}_{s,W_n}(V_n, A_n)$;
11:    ...
12:    ... ▷ Set of blocks already allocated in this phase
13:    $d' \leftarrow |X \setminus X^*|$;
14:    if $d' \geq d^*$ then
15:      $B^{(n)} \leftarrow d^*$ blocks selected from $X \setminus X^*$;
16:    else if $d' > 0$ then
17:      $B^{(n)} \leftarrow$ All $d'$ blocks in $X \setminus X^*$ and $(d^* - d')$ coded blocks selected from C;
18:      Remove the coded blocks from C and add them to $C^*$;
19:    else
20:      $B^{(n)} \leftarrow d^*$ coded blocks selected from $C \setminus X^*$;
21:      Remove the coded blocks from C and add them to $C^*$;
22:    end if
23:    $\sigma_j \leftarrow \sigma_j \cup \{(F^{(n)}_{s,W_n}(V_n, A_n), B^{(n)})\}$;
24:   end if
25: end for
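The reuse-first idea of Alg. 1 can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes that the reusable set X for a flow consists of previously allocated blocks that no uncompleted recipient on that flow has received, that blocks are picked in sorted order, and that each flow is represented only by its recipient set (tree pruning is not modeled). All names are hypothetical.

```python
def allocate_phase(prev_alloc, uncompleted, received, per_flow, fresh_coded):
    """Reuse-first block allocation for one phase (hedged reading of Alg. 1).

    prev_alloc: list of (recipients_on_flow, blocks) pairs from the previous phase.
    uncompleted: recipients still missing blocks, i.e., R(j).
    received: dict mapping each uncompleted recipient r to S_r(j).
    per_flow: number of blocks each unit multicast flow carries in this phase (d*).
    fresh_coded: coded blocks not used so far (C), consumed in order.
    """
    previous = set().union(*(blocks for _, blocks in prev_alloc))
    used_now = set()          # X*: blocks already allocated in this phase
    coded = list(fresh_coded)
    new_alloc = []
    for recipients, _ in prev_alloc:
        u = set(recipients) & set(uncompleted)
        if not u:
            continue          # flow serves no uncompleted recipient
        held = set().union(*(received[r] for r in u))
        # Assumed X \ X*: old blocks that are new to every recipient on this flow.
        reusable = sorted(previous - held - used_now)
        chosen = reusable[:per_flow]
        while len(chosen) < per_flow:
            chosen.append(coded.pop(0))   # fall back to a fresh coded block
        used_now.update(chosen)
        new_alloc.append((u, set(chosen)))
    return new_alloc


# 1st phase of the critical example: r3 is on all three flows.
sigma_1 = [
    ({"r1", "r2", "r3", "r4"}, {"a", "b"}),
    ({"r1", "r2", "r3"}, {"c", "d"}),
    ({"r3", "r4"}, {"e", "f"}),
]
received = {"r1": set("abcd"), "r2": set("abcd"), "r4": set("abef")}
sigma_2 = allocate_phase(sigma_1, {"r1", "r2", "r4"}, received, 1, ["g", "h"])
blocks_2 = [blocks for _, blocks in sigma_2]
```

Under these assumptions, only the flow shared by all three uncompleted recipients needs a coded block, which agrees with the single coded block {g} used in Sec. III-B.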
A trivial approach to guarantee that every recipient receives blocks that were not received before is to use completely new (coded) blocks in each phase. However, there is a chance to reuse some blocks that were received by a set of recipients before but not received by another set of uncompleted recipients. Algorithm 1 is used to select blocks in determining which unit multicast flow transmits which block set. This process aims to reduce the number of coded blocks to generate by reusing already delivered blocks.

Algorithm 2 (excerpt)
for all $(f_x, F^{(y)})$ in Z do ▷ Block allocation for r
  ...
22:   Update $W_y$;
23: end for
24: if $|Z| < m_i$ then ▷ BA for r is non-BE
25:   if repeat count of deprioritized reallocation for r does not exceed threshold value then ▷ Sec. V-B

V. GLOBALLY BANDWIDTH-EFFICIENT BLOCK ALLOCATION (GBE-BA)
A. PROCEDURE TO FIND GBE-BA
To design an optimal schedule, we should find a GBE block allocation $\sigma_1 = \{(F^{(k)}_{s,W_k}(V_k, A_k), B^{(k)}) \mid k = 1, 2, \ldots, m_1\}$ for the 1st phase that satisfies the condition C1 defined in the previous section. However, since finding such $\sigma_1$ is not trivial because of the combinatorial explosion as the size of the network increases, we propose a heuristic approach (Alg. 2) to construct an appropriate $\sigma_1$ that is GBE.
The basic idea is to determine a block allocation (BA) first to unit flows from sender s to a recipient who has the largest max-flow value m 1 , i.e., a recipient who should complete the file retrieval first. For subsequent recipients, a per-recipient construction of a max-flow set with block allocation is repeated in descending order of the recipient's max-flow value until the block allocation is performed for all recipients in a one-by-one manner. In this process, the max-flow set for the target recipient should be constructed to be compatible with the previous recipients' max-flow sets. To do so, each unit flow for the target recipient is constructed by extending or branching some existing unit flow in an already constructed max-flow set for some previous recipient, which eventually results in incrementally constructing m 1 unit multicast flows rooted at sender s. Note that this extension is not unique for some branch nodes. Since all recipients belonging to the same unit multicast flow receive the same set of blocks, this extension directly determines the blocks allocated to the target recipient. Therefore, to make a BE-BA for the target recipient, each flow in the max-flow set should correspond to (i.e., be extended from) each different unit multicast flow by solving ''bipartite matching problem'', to allocate mutually different sets of blocks to all unit flows in the max-flow set as described in Sec. V-C.
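The matching step described above can be sketched with a standard augmenting-path bipartite matching. The paper does not say which matching algorithm is used, so this choice is an assumption, as are the candidate lists (which existing unit multicast flow each new unit flow could be extended from) and all names. A matching of size $m_i$ corresponds to a BE-BA for the target recipient.

```python
def match_flows(candidates):
    """Match each unit flow to a distinct existing unit multicast flow.

    candidates[x] lists the multicast flows y that unit flow x could extend.
    Returns a dict x -> y describing a maximum matching.
    """
    owner = {}  # multicast flow y -> unit flow x currently matched to it

    def try_assign(x, seen):
        for y in candidates[x]:
            if y in seen:
                continue
            seen.add(y)
            # Take y if it is free, or if its current owner can move elsewhere.
            if y not in owner or try_assign(owner[y], seen):
                owner[y] = x
                return True
        return False

    for x in candidates:
        try_assign(x, set())
    return {x: y for y, x in owner.items()}


# Hypothetical recipient with max-flow value 3 and three existing multicast flows.
feasible = match_flows({"f1": ["F1", "F2"], "f2": ["F1"], "f3": ["F2", "F3"]})
# Hypothetical recipient with max-flow value 2 whose unit flows both extend only F1.
infeasible = match_flows({"f1": ["F1"], "f2": ["F1"]})
```

In the first case every unit flow obtains its own multicast flow, so the recipient receives mutually different block sets; in the second case the matching is smaller than the max-flow value, so the allocation for that recipient is non-BE and a retrial would be needed.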
Since the above approach strongly depends on the order of recipients for block allocation, we need to control the order appropriately. In particular, the order among multiple (possibly a large number of) recipients in the j th group M(j) is of essential importance to find a GBE-BA. Therefore, we introduce dynamic changes of the recipient order by allowing retrials in selecting a target recipient as described in Sec. V-B.
Furthermore, since we found that such heuristic controls of the order of recipients alone are insufficient to find a GBE-BA in more than half of examples of the realistic network topologies in Table 1 in Sec. VI-A, we also introduce dynamic changes of the max-flow set by allowing retrials in selecting unit flows in the max-flow set for a target recipient as described in Sec. V-D. Both the change of the order of recipients and the change of the max-flow set for a recipient are necessary to find a GBE-BA as shown in Table 3 in Sec. VI-B.
Lastly, note that, although it rarely happens, we also found that a BE-BA for a recipient with a larger max-flow value can prevent a posterior BA from being BE for another recipient with a smaller max-flow value. This cannot be detected by checking if a current BA is BE for every recipient with the same max-flow value. Therefore, in response to such a special case, we optionally check a necessary condition to finally realize a GBE-BA from the current BA configuration. This checking (Alg. 3) is introduced in the Appendix because it is not necessary in almost all cases.

B. CONTROL OF RECIPIENTS ORDER
In case that multiple recipients have the same max-flow value, a simple approach is to randomly change the order among such tied recipients and select the best order by trial and error. However, in general, it is challenging to examine all possible orderings of tied recipients to find a GBE-BA in complex large-scale network topologies with a large number of recipients.
Algorithm 3 (excerpt):
    ...
    if a past σ 1 is checked as Fail for r before then
        exit;    (stop finding a GBE-BA)
    else
        σ 1 ← old σ 1 ;
        r is pushed to the tail of BAorder[i];    (in Alg. 2 next time, a new max-flow set P different from the failed P will be selected)
    end if
    end if

To efficiently find a GBE-BA, we adopt a heuristic control of the recipient order of block allocation among recipients with the same max-flow value proposed in [20]. Concretely, we prioritize the recipients using a max-flow set that consists of flows along shorter-length paths. The idea comes from the fact that a BA for a recipient should have a small effect on BAs for the subsequent recipients. As explained in Sec. V-C, by allocating blocks to the max-flow set for a recipient, some links are newly allocated blocks. Allocating blocks to a link reduces the chance to newly allocate different necessary blocks to that link, or reduces the residual capacity of that link, for a subsequent recipient. Therefore, block allocation for a recipient who involves a small number of links is preferentially performed as follows.

Path-Length Ascending Order Allocation: For recipients with the same max-flow value, the block allocation is performed in ascending order of the total number of links traversed by unit flows in the max-flow set for each recipient. If there are multiple recipients whose max-flow sets traverse the same number of links, the recipients are compared in terms of the maximum length of the paths traversed by the unit flows in their max-flow sets. Then, the block allocation is performed in ascending order of the maximum path length. If a tie still remains among some recipients, we use a pre-defined order, the ''recipient order parameter,'' which is randomly given in simulation.
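The path-length ascending order can be expressed as a three-level sort key. The following is a minimal sketch, not the authors' implementation: the function name, the representation of a unit flow as a node sequence, and the sample max-flow sets and order parameters are all hypothetical.

```python
def path_length_key(recipient, max_flow_sets, order_param):
    """Sort key: (total links, longest path, pre-defined order parameter)."""
    paths = max_flow_sets[recipient]
    hops = [len(path) - 1 for path in paths]  # links traversed by each unit flow
    return (sum(hops), max(hops), order_param[recipient])


# Hypothetical max-flow sets for four recipients with the same max-flow value 2.
# Each unit flow is written as a node sequence starting at sender "s".
max_flow_sets = {
    "r1": [["s", "a", "r1"], ["s", "b", "c", "r1"]],       # 5 links, longest 3
    "r2": [["s", "a", "r2"], ["s", "r2"]],                 # 3 links, longest 2
    "r3": [["s", "b", "r3"], ["s", "c", "d", "r3"]],       # 5 links, longest 3
    "r4": [["s", "r4"], ["s", "a", "b", "c", "r4"]],       # 5 links, longest 4
}
order_param = {"r1": 2, "r2": 0, "r3": 1, "r4": 3}  # assumed random tie-breaker

ba_order = sorted(
    max_flow_sets,
    key=lambda r: path_length_key(r, max_flow_sets, order_param),
)
print(ba_order)
```

In this toy input, r2 comes first because it uses the fewest links, r3 and r1 tie on both path criteria and are separated by the order parameter, and r4 comes last because of its longer maximum path.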
However, since this heuristic control alone does not always provide a good result, we allow a retrial for a recipient by dynamically postponing its block allocation as follows.

Deprioritized Reallocation: When the block allocation for recipient r is performed but the resulting BA is non-BE for r, its block allocation is postponed until the block allocation for all other recipients with the same max-flow value has been performed. Such postponed retrials for r can be repeated until either a resulting BA becomes BE for r or the number of retrials for r exceeds a pre-defined threshold value.
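The postponement logic can be sketched as a queue in which a recipient with a non-BE result is pushed to the tail and retried later. This is an illustrative sketch only: `allocate_group` and the callback `try_allocate` are hypothetical names, and the callback stands in for the actual block allocation of Sec. V-C.

```python
from collections import deque


def allocate_group(order, try_allocate, threshold=1):
    """Process one group of recipients with the same max-flow value.

    try_allocate(r, done) is a hypothetical callback that returns True when
    the resulting BA is BE for r, given the recipients already allocated.
    """
    queue = deque(order)
    retries = {r: 0 for r in order}
    done, failed = [], []
    while queue:
        r = queue.popleft()
        if try_allocate(r, done):
            done.append(r)
        elif retries[r] < threshold:
            retries[r] += 1
            queue.append(r)  # postpone r behind the remaining recipients
        else:
            failed.append(r)  # retrial budget exhausted
    return done, failed


# Toy rule (assumption): "b" only obtains a BE result after "c" is allocated.
done, failed = allocate_group(
    ["a", "b", "c"], lambda r, done: r != "b" or "c" in done
)
print(done, failed)
```

With the threshold set to 1, each recipient gets at most one postponed retrial, which matches the setting used in the simulation section.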

C. BLOCK ALLOCATION FOR EACH RECIPIENT
To each (target) unit flow in a max-flow set for recipient r, a set of blocks should be allocated so as to be compatible with the max-flow sets for the previous recipients. This is equivalent to selecting an appropriate unit multicast flow from the already-constructed ones so that the target flow is branched or extended from the selected multicast flow. Note that if recipient r already belongs to an already-constructed unit multicast flow, a unit flow along this unit multicast flow should be included in the max-flow set for r. Let F be the set of all unit multicast flows to each of which a set of blocks are already allocated; P be the max-flow set for r with the max-flow value of m i to each of which a set of blocks will be allocated. The block allocation procedure should find m i different unit multicast flows in F so that each of m i flows in the max-flow set can be extended from (or already included in) them.
To find m i appropriate unit multicast flows for recipient r, we first check which flow f x in the max-flow set P can be extended from which unit multicast flow in F to deliver the blocks to r. This is done by tracing back f x from r to s to list possible branching nodes. Based on this check, a bipartite graph B(P, F, Q) is created consisting of the max-flow set P = {f 1 , f 2 , . . . , f m i } and the set F = {F (1) , F (2) , . . . , F (m 1 ) } of unit multicast flows as two sets of vertices, and set Q of directed edges from P to F. A directed edge from f x to F (y) represents that r can belong to unit multicast flow F (y) via unit flow f x to receive block set B (y) . There are two cases: either F (y) can be newly extended for r by using residual links traversed by f x , or f x has already been included in F (y) (i.e., r already belongs to F (y) ) due to an extension of F (y) for previous recipients.
In the former case, let w be a branching node (i.e., w is the intersection of unit multicast flow F (y) and unit flow f x ). The links (or, more exactly, one unit of capacity on the links) in the segment from w to r on flow f x are newly used to extend a multicast flow with an allocation of set B (y) of blocks. Following the same idea as the path-length ascending order allocation, we reduce the number of such newly allocated links so that more links with unused capacity remain for the subsequent recipients. To do so, the number of links in the segment from w to r on f x is assigned to the edge in Q as the cost, and a desired matching from P to F is determined by using a standard algorithm for minimum-cost bipartite matching. Note that the above matching procedure may change the actual route of flow f x . For example, suppose the original f x (before matching) from s to r flows along multicast flow F (i) and branches at node v toward r, and that it meets another multicast flow F (j) at node w between v and r. If f x is matched to F (j) in the matching, the modified f x (after matching) from s to r flows along F (j) and branches at node w toward r. As a special case, nodes v and w can be the same node if the node (v = w) is an intersection of multicast flows F (i) and F (j) . In our simulation, this special case often happened.
Let Z be the resulting set of ''flow''-''multicast'' matching that minimizes the cost, from which a BA for r is directly derived. Note that the number of flow-multicast matching |Z| is equal to m i if and only if the BA is BE for r. In other words, if |Z| < m i , the BA is non-BE.
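The matching step and the BE test |Z| = m i can be illustrated with a small minimum-cost bipartite matching. The sketch below is one possible realization, using unit-capacity min-cost flow with successive shortest paths; the paper only refers to a standard algorithm. The function name, the sample flows, and the costs are hypothetical, and a cost of 0 for an already-included flow is an assumption.

```python
def min_cost_matching(flows, trees, cost):
    """Return a maximum-cardinality, minimum-cost matching {flow: tree}.

    cost maps (flow, tree) -> number of newly used links; a missing pair
    means the flow cannot be extended from that unit multicast flow.
    """
    p, q = len(flows), len(trees)
    src, dst, n = 0, p + q + 1, p + q + 2
    f_id = {f: 1 + i for i, f in enumerate(flows)}
    t_id = {t: 1 + p + j for j, t in enumerate(trees)}
    graph = [[] for _ in range(n)]  # edge = [to, capacity, cost, reverse index]

    def add_edge(u, v, c):
        graph[u].append([v, 1, c, len(graph[v])])
        graph[v].append([u, 0, -c, len(graph[u]) - 1])

    for f in flows:
        add_edge(src, f_id[f], 0)
    for t in trees:
        add_edge(t_id[t], dst, 0)
    for (f, t), c in cost.items():
        add_edge(f_id[f], t_id[t], c)

    inf = float("inf")
    while True:
        # Bellman-Ford shortest path in the residual graph.
        dist, prev = [inf] * n, [None] * n
        dist[src] = 0
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for k, (v, cap, c, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v], prev[v], updated = dist[u] + c, (u, k), True
            if not updated:
                break
        if dist[dst] == inf:
            break  # no augmenting path: the matching is maximum
        v = dst
        while v != src:  # push one unit of flow along the path
            u, k = prev[v]
            graph[u][k][1] -= 1
            graph[v][graph[u][k][3]][1] += 1
            v = u

    name = {i: t for t, i in t_id.items()}
    return {f: name[v] for f in flows
            for v, cap, _c, _rev in graph[f_id[f]] if v in name and cap == 0}


flows, trees = ["f1", "f2", "f3"], ["F1", "F2", "F3", "F4"]
cost = {("f1", "F1"): 0, ("f1", "F2"): 2, ("f2", "F2"): 1,
        ("f2", "F3"): 3, ("f3", "F2"): 2}
Z = min_cost_matching(flows, trees, cost)
print(Z, "BE" if len(Z) == len(flows) else "non-BE")
```

Here f2 is moved from its cheapest option to a costlier one so that f3 can also be matched, which gives |Z| = 3 = m i . If the edge from f2 to F3 were absent, only two flows could be matched and the BA would be non-BE.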

D. SELECTION OF MAX-FLOW SET
At different steps in the block allocation for recipient r, a max-flow set P from s to r should be given. At first, a max-flow set is used to decide the order of recipients in the path-length ascending order allocation. This initial max-flow set is found using a standard algorithm with a breadth-first search (BFS).
Then, in allocating blocks for r, a max-flow set from s to r used here may be different from the initial one in general. The pseudo-senders are defined as nodes included in the intersection of all the unit multicast flows in F, i.e., the nodes belonging to all unit multicast flows. From the original D(V, A, C), we define D(V * , A * , C * ) by virtually connecting super-source s * to each of all pseudo-senders with a sufficient capacity. A max-flow set from super-source s * to r is computed on D(V * , A * , C * ) also using a standard algorithm with BFS. From the resulting virtual max-flow set from s * to r, we can derive a max-flow set from s to r. The derived max-flow set is expected to be better than the above initial max-flow set in terms of reduction of the number of links newly used in allocating blocks for r, which will increase the number of potential links available in allocating blocks for other subsequent recipients. Furthermore, the max-flow set from s to r may also be modified in the matching process as described in Sec. V-C.

However, if a resulting BA on a given max-flow set cannot be BE for r, we should select another different max-flow set for r either by changing one unit flow in the max-flow set or by newly generating a different max-flow set as follows. Suppose a non-BE BA for r with the max-flow value of m i is derived from F (the set of unit multicast flows), P (the current max-flow set), and Z (the flow-multicast matching from P to F).

Max-Flow Set Reselection:
In case that only a single flow f among all m i flows in P is unmatched in Z, i.e., unallocated by the block allocation, the flow f is changed to a new flow. Let P Z be P \ {f }, the set of all (m i − 1) flows matched in Z; F Z be the set of all unit multicast flows with only one directed edge from some vertex in P on bipartite graph B(P, F, Q), i.e., the set of multicast flows that are uniquely determined to be extended; and F c Z be F \ F Z , the set of remaining unit multicast flows not included in F Z . To make a BE-BA, a new flow replacing f should extend one of the unit multicast flows in F c Z . Therefore, we define D(V * , A * , C * ) from D(V, A, C) as follows. C * is defined by reducing the link capacities used by each multicast flow in F and also reducing the link capacities newly used to extend each of the (m i − 1) unit flows in P Z from each corresponding matched multicast flow in F. To define (V * , A * ), virtual super-source s * is connected to each node belonging to at least one of the unit multicast flows in F c Z with sufficient capacity. Then, the shortest virtual path to r from s * on D(V * , A * , C * ) is computed by a standard algorithm; if it exists, it is guaranteed to be compatible with all (m i − 1) unit flows in P Z . Finally, a new unit flow from s to r should be determined from the obtained virtual path, which is combined with the (m i − 1) flows in P Z in order to construct a new max-flow set P on the original D(V, A, C). Note that this new unit flow is not always uniquely determined. Let w be the node next to s * in the virtual path. If w belongs to multiple unit multicast flows, the new unit flow is not determined in the route from s to w because it can follow any one of the multiple possible unit multicast flows from s. However, this is not a problem. After the max-flow set reselection, we always do the block allocation based on bipartite graph matching from P to F, which finally resolves such non-uniqueness.
On the other hand, in case that more than one flow is unmatched in flow-multicast matching Z, a new max-flow set different from P is reselected. Let P Z be the set of all flows matched in Z; and P c Z be P \ P Z , the set of all unmatched flows. To make a BE-BA, a new max-flow set should exclude the previously unmatched flows in P c Z as much as possible. Therefore, from D(V, A, C), we define D(V * , A * , C * ) by virtually connecting super-source s * to each node belonging to all (i.e., the intersection of) unit multicast flows in F with a sufficient capacity. We also consider D(V * , A * , C * ) with ''link cost'' by assigning a sufficiently large cost to each link that is included in some flow in P c Z but not included in any flow in P Z . This process discourages the previously unmatched unit flows from being reselected. Then, a new virtual max-flow set to r from s * on D(V * , A * , C * ) with link cost is computed by a standard algorithm for the minimum-cost max-flow problem. Finally, a new max-flow set from s to r consisting of m i compatible unit flows on the original D(V, A, C) is determined depending on which multicast flows in F are selected to accommodate the virtual max-flow set.
Note that the average and/or the maximum path length of a new max-flow set may be larger than those of the original max-flow set. However, a new max-flow set can introduce a chance to find a BA that is BE for r.

VI. SIMULATION EVALUATION

A. SIMULATION SETUP
Through simulation evaluation, we verify that our proposed block allocation algorithm (Alg. 2) can realize a GBE-BA for the 1 st phase for various network topologies listed in Tables 1 and 2, and investigate more details of processes in the subsequent phases to show the effectiveness of the Coded-MPMC. These tables show realistic network topologies consisting of forty or more nodes in ''The Internet Topology Zoo'' [21] and three types of random network topologies generated by ''NetworkX Python package'' [22], respectively.
Each instance of network topologies in Table 1 or 2 is given as input for the simulation. We regard each link in the topology as a full-duplex link with both uniform and non-uniform settings for the link capacity. In the uniform case, the link capacity of each link is 1. In the non-uniform case, the link capacity takes one of the values (1, 2, 3) or one of the values (1, 4, 10), corresponding to the number of duplicate edges defined in the GML file in [21].
In addition to the topology information, locations of a single sender (s) and recipients should be given as inputs. In this evaluation, we consider only the case that all nodes except for the sender are recipients, i.e., R = V\{s}; that is, in reality, every node is directly connected to either sender s or a recipient host with sufficiently large link capacity. This is because an optimal schedule in the MPMC (D(V, A, C), s, V \ {s}) problem can easily lead to an optimal schedule in the MPMC (D(V, A, C), s, R) problem for any R ⊂ V \ {s}, as noted in Sec. IV-B.
The recipient order parameter for tie-break in the recipients with the same max-flow value is randomly given and sorted by path-length ascending order allocation. The threshold value for the maximal repeat count of deprioritized reallocation is set to 1 in this simulation. The obtained results suggest a single chance for such postponement is enough.

B. SIMULATION RESULTS
The essential result is that our scheme successfully found a GBE-BA in the 1 st phase and generated an optimal schedule for every instance of realistic network topologies in Table 1 and random network topologies in Table 2 generated by ten random-seed values, regardless of the position of the sender and the uniformness of link capacities as explained in Sec. VI-A. This fact suggests that the Coded-MPMC actually benefits a sufficiently wide range of heterogeneous scenarios.
We illustrate examples of our simulation results in a topology with many high-degree nodes ''UUNET'' (Fig. 4), a large-scale topology with more than one hundred nodes ''Interroute'' (Fig. 5), and a random network ''Barabasi-Albert model'' (Fig. 6); for those network topologies, GBE-BAs were not found by our previously proposed method [20]. Table 3 shows a part of simulation results. ''Sender ID'' means the ID of the node defined in [21] and [22] to which the sender is directly connected.
First, we compare the average RCT over all recipients (the average value of all recipients' RCTs) in Coded-MPMC to that in the ST scheme, where ST means a simple multicast transfer using a single multicast tree in which the flow value to each recipient is one. In all cases, we see that the ratio of the average RCT in Coded-MPMC to that in the ST is about half or less, which shows the strong advantage of Coded-MPMC that can fully utilize the max-flow value of each recipient. Since the max-flow values vary, the average RCT differs depending on the sender location in each topology.
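The relation between the max-flow values and the average RCT ratio can be made concrete with a small calculation. The sketch below assumes that each recipient in Coded-MPMC achieves its lower-bound RCT (file size divided by its max-flow value) and that the ST scheme delivers with a flow value of one, so the ratio reduces to the mean of 1/m. The function name and the list of max-flow values are hypothetical and are not taken from Table 3.

```python
from fractions import Fraction


def average_rct_ratio(max_flow_values):
    """Average RCT of Coded-MPMC divided by that of ST.

    Assumes RCT = file_size / m for Coded-MPMC and file_size / 1 for ST,
    so the file size cancels and the ratio is the mean of 1/m.
    """
    total = sum(Fraction(1, m) for m in max_flow_values)
    return total / len(max_flow_values)


# Hypothetical max-flow values of six recipients.
ratio = average_rct_ratio([4, 4, 3, 2, 2, 1])
print(ratio, float(ratio))
```

Under these assumptions the ratio depends only on the distribution of max-flow values, which is why it changes with the sender location.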
Next, we focus on the number of original blocks and coded blocks. Since the number of original blocks is the least common multiple of all max-flow values, the number of original blocks also depends on the sender location in each topology. For example, when the sender ID is 9 in UUNET, the file to transfer is divided into 1260 blocks since there are recipients with the max-flow value of 10, 9, 7, 6, 5, 4, 3, 2, and 1. The number of coded blocks used to generate the optimal schedule in Coded-MPMC is computed using Alg. 1, which reuses already distributed blocks as much as possible. Table 3 also shows that the use of Alg. 1 reduces the necessary number of coded blocks compared to using only new coded blocks without reusing already distributed blocks in all phases. This reduction in the necessary number of coded blocks benefits the coding process in real implementation to reduce the computational cost. The number of blocks is also discussed later in Sec. VIII-A.
We show examples of a GBE-BA in UUNET and Interroute when the sender ID is 9 and 54, respectively. In Fig. 7, ten unit multicast flows are utilized where each flow distributes 126 original blocks in the 1 st phase. In Fig. 8, six unit multicast flows are utilized where each flow distributes 2 original blocks in the 1 st phase. Then, these multicast flows are used in all phases until uncompleted recipients retrieve the entire file.
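The block counts quoted for UUNET follow directly from the least common multiple of the max-flow values. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the per-flow count is computed as the number of original blocks divided by the largest max-flow value, which is consistent with the ten flows of 126 blocks reported above.

```python
from functools import reduce
from math import gcd


def num_original_blocks(max_flow_values):
    """Least common multiple of all max-flow values."""
    return reduce(lambda a, b: a * b // gcd(a, b), max_flow_values, 1)


# Max-flow values reported for UUNET with sender ID 9.
values = [10, 9, 7, 6, 5, 4, 3, 2, 1]
d = num_original_blocks(values)
blocks_per_flow = d // max(values)  # original blocks per unit multicast flow in the 1st phase
print(d, blocks_per_flow)
```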
Finally, we demonstrate the necessity of our heuristics in the block allocation algorithms. Table 3 also shows how many times the deprioritized reallocation and the max-flow set reselection (described in Sec. V-B and Sec. V-D) were performed to find a GBE-BA in the 1 st phase.
Note that we did not provide any specific performance comparison of Coded-MPMC to the existing schemes such as QuickCast [26] or DaRTree [27] described in Sec. II. This is because those schemes focus on allocating network resources for multiple requests or demands for different one-to-many file transfers simultaneously happening, while Coded-MPMC focuses on a single dedicated request for a single one-to-many file transfer. Thus, it is obvious that Coded-MPMC outperforms those schemes in terms of minimizing each recipient's RCT if there is only a single one-to-many file transfer with our assumption.

VII. PROTOTYPE IMPLEMENTATION
The prototype of Coded-MPMC is implemented on a sender, recipients, an OpenFlow controller (OFC) system with the Ryu framework [35], and OpenFlow switches (OFSs) with the Open vSwitch switch OS [36]. MDS coding at the sender and recipients is implemented by using the library [37] of a systematic Reed-Solomon coding. Steps ① to ⑦ in Fig. 9 show the procedural order for the Coded-MPMC on an OpenFlow network.
Before starting a file transfer, the OFC floods Link Layer Discovery Protocol (LLDP) packets to OFSs to get information on the network topology. Then, the OFC queries the port information from OFSs and gets the bandwidth speed at the link-up of the port (①). Each recipient uses Membership Reports of Internet Group Management Protocol (IGMP) to join as a recipient, by which the OFC can know the locations of the recipients (②).
Next, a sender sends a request with the file size to the OFC (③), which triggers the OFC to generate a schedule of the Coded-MPMC transfer from the file size and the topology information with the sender and the recipients (④). Flow entries for OFSs are created based on the generated schedule, including the UDP transmission port number, the phase number, and the block number, which identify each of multiple multicast flows. The OFC installs those flow entries into OFSs. Besides, the OFC notifies the sender and the recipients of the necessary information on the schedule (⑤). Based on the schedule information, the sender generates a necessary number of the coded blocks to be transmitted (⑥). If the file size is not a multiple of the division number of the file, zero-padding is applied to create equally-sized blocks.
After starting block transmissions, when a recipient has received a necessary number of blocks, the recipient sends the retrieval completion notification to the sender, so that the sender knows which recipients have completed. Then, the recipient decodes its blocks and restores the entire file.
In this implementation, a simple packet retransmission mechanism is adopted against packet-loss during a file transfer (⑦). Each recipient detects that some packets in a block are lost by using the sequence number of received packets and a timer. When a recipient detects packet-loss, unicast retransmission is performed on the shortest path from the sender to the recipient in parallel to multicast transmissions of packets of the same or a different block.
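The zero-padding applied by the sender when the file size is not a multiple of the division number can be sketched as follows. This is an illustration only and not the prototype's code; the function name and the sample data are hypothetical, and a real implementation would also need to convey the original file size so that the padding can be removed after decoding.

```python
def split_into_blocks(data: bytes, d: int):
    """Split data into d equally-sized blocks, zero-padding the tail."""
    block_size = -(-len(data) // d)  # ceiling division
    padded = data.ljust(block_size * d, b"\x00")
    return [padded[i * block_size:(i + 1) * block_size] for i in range(d)]


# A 10-byte file divided into d = 4 blocks of 3 bytes each (2 bytes of padding).
blocks = split_into_blocks(b"0123456789", 4)
print(blocks)
```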
In [38], we verified our prototype of Coded-MPMC on a global OpenFlow testbed network, consisting of our in-lab OpenFlow network and a wide-area OpenFlow testbed network ''RISE'' [39], which is extended from Fukuoka and Tokyo in Japan to Seattle in the U.S. As a result of the verifications, each recipient at globally distributed locations could receive all blocks to be retrieved in an RCT very close (less than 102%) to its lower-bound when packet-loss does not occur. Besides, we could check that the retransmission scheme worked correctly when some recipients detected packet-loss. These facts imply the basic correctness of the implementation.

As shown in Table 3, our proposed block allocation algorithm (Alg. 2) can realize a GBE-BA for the 1 st phase, but the increase of encoding and decoding times with the increase of a necessary number of coded blocks is not considered. Furthermore, if we use Reed-Solomon (RS) coding on GF(2^8), the total number of the original and coded blocks is limited to 255. Using RS coding on GF(2^16) or GF(2^32) may relax the limit, but requires longer encoding and decoding times.
To reduce the total number of the original and coded blocks, apart from the effort of block reuse in Alg. 1, it is essential to reduce the flow values of some recipients so that the least common multiple of all max-flow values stays small. Therefore, we consider a semi-optimal schedule in which the max-flow values to some recipients are intentionally reduced at the expense of giving up the exact lower-bound RCTs of those recipients and accepting semi-optimality. Table 4 shows the ratio of the average RCT over all recipients in the Coded-MPMC to that in the ST by an optimal schedule (''before'') and by a semi-optimal schedule (''after'') in which the max-flow values to some recipients are reduced. As a result, the total number of original blocks and coded blocks is limited to less than 255 in UUNET and Barabasi-Albert model with the sender ID of 9 and 3, respectively. At the same time, the results also show that the performance degradations by the semi-optimal schedule of the Coded-MPMC are negligible in terms of the average RCT.
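The effect of reducing some flow values on the number of original blocks can be illustrated numerically. In the sketch below, the starting values are those reported for UUNET with sender ID 9, but the particular reduction (9 to 6 and 7 to 6) is a hypothetical choice made only for illustration and is not the reduction used for Table 4. Only original blocks are counted; coded blocks are not modeled.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm_all(values):
    """Least common multiple of a list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)


before = [10, 9, 7, 6, 5, 4, 3, 2, 1]
reduction = {9: 6, 7: 6}  # hypothetical: recipients with flow 9 or 7 use 6 flows
after = [reduction.get(m, m) for m in before]

d_before, d_after = lcm_all(before), lcm_all(after)
# RCT stretch of an affected recipient relative to its lower bound.
stretch = {m: Fraction(m, reduction[m]) for m in reduction}
print(d_before, d_after, stretch)
```

In this toy case the number of original blocks drops from 1260 to 60, while the affected recipients see their RCT grow by factors of 3/2 and 7/6 relative to their lower bounds; whether that is acceptable depends on how many recipients are affected.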

B. STEINER TREE PACKING AND GBE-BA
The Steiner Tree Packing problem [40] is a classical problem that asks for the maximum number of edge-disjoint trees from a single sender s to a recipient set R over directed or undirected networks. There are many variations, but in general, we consider a set of trees rooted at s, each containing (and reaching) all the recipients. When such multiple trees are found, the sender can maximize the same flow rate for all recipients. In contrast, although it can be considered as a variant of the Steiner Tree Packing, finding a GBE-BA in this paper assumes a special rich network condition that all links are full-duplex but requires a hard combinatorial condition that each recipient r ∈ R is included in a necessary number of different multicast trees rooted at s, so as to maximize the different flow rate for each different recipient simultaneously over all multiple recipient groups. Besides, since the Steiner Tree Packing problem is known to be NP-hard and the requirement of finding a GBE-BA is more complex, finding a GBE-BA is not a trivial task. In this paper, although the existence of a GBE-BA for an arbitrary network topology is not theoretically proven, we show experimentally that such a GBE-BA exists in a wide range of network topologies.

IX. CONCLUSION
We focus on a single one-to-many file transfer with a single sender and multiple recipients for fully-controlled networks with full-duplex wired links. In particular, the coded Multipath Multicast (Coded-MPMC) is introduced in which a file is divided into an appropriate number d of equally-sized original blocks, and both the original blocks and the coded blocks generated from the original blocks are transmitted to the recipients using multiple multicast trees. In Coded-MPMC, each recipient can retrieve the entire file by receiving any d different original or coded blocks.
What we have shown in this paper is how to design an optimal schedule of block-wise transmissions over multiple phases in Coded-MPMC so that every recipient can retrieve the file in its lower-bound time. For any optimal schedule, we need to find a set of multicast flows on which appropriate blocks are transmitted so as to realize the max-flow value from the sender to each recipient simultaneously in each phase. We call it a globally bandwidth-efficient block allocation (GBE-BA). We have developed a heuristic method to find a GBE-BA and verified it through extensive simulations with many large-scale real-world network topologies as well as different types of synthetic topologies randomly generated. Furthermore, we briefly reported an OpenFlow-based prototype implementation of Coded-MPMC that was verified on a wide-area OpenFlow testbed network.
Note that the present one-to-many transfer scenario has a limitation in concurrent processing, i.e., only one sender is allowed in the network during the period of transmitting bulk data. However, we believe that Coded-MPMC can also contribute to many-to-many file transfers in a centrally-managed network where multiple tasks from different senders are scheduled concurrently. Therefore, in future work, we will try to extend the Coded-MPMC approach to deal with multiple one-to-many file transfers that happen at the same time or at different overlapped times, at the sacrifice of giving up the theoretical lower-bounds of file retrieval completion times.

NECESSARY CONDITION CHECK FOR GBE-BA
In this Appendix, the explanation assumes all links (edges) in both directions have the unit capacity of 1 for simplicity. A block allocation (BA) in the 1 st phase is BE for recipient r when r belongs to m(s, r) different multicast flows. Therefore, any set of cut edges from s to r should be finally allocated to at least m(s, r) different multicast flows. This fact suggests that before the block allocation for recipient r is performed, we can know that the current BA configuration will be non-BE at least for r by checking a necessary condition. For example, let E be a set of cut edges from s to r. Suppose a number n 1 of different multicast flows have been allocated on some links in E by previous block allocations and the number of (free) links in E on which no multicast flows are allocated before is n 2 . Then, if (n 1 + n 2 ) is less than m(s, r), the current BA cannot be BE for r.

FIGURE 10. BA σ 1 after the block allocations for 6 (a) and 9 (b)(c) when block allocation order of recipients in the 1 st group BAorder[1] is [10,6,9].
We describe an example in which the block allocation for a recipient in the 1 st group constructs a BA that will be non-BE for a subsequent recipient in the 2 nd group in Fig. 10, where the sender ID is 0 and the block allocation order of recipients in the 1 st group is [10,6,9]. All nodes except for 0 are recipients, and the max-flow values m(s, r) are 2, 3, or 4; there are LCM (2, 3, 4) = 12 blocks {a, b, . . ., l} to transmit. Figures 10(a) and (b) show the BA σ 1 just after the block allocations for 6 and just after the block allocation for 9, respectively. Each color represents a multicast flow with a set of allocated blocks. In the BA in Fig. 10(b), blocks {a, b, c} are allocated to a link (10 → 9), and blocks {g, h, i} are allocated to links (10 → 14 → 13 → 9). Then, we see that 14 cannot retrieve the unreceived blocks using the residual links (11 → 15 → 14) because only two different block sets {j, k, l} and {g, h, i} are already allocated to the set {(13 → 14), (10 → 14), (10 → 11), (7 → 11)} of cut edges to node 14. Thus, the BA σ 1 will be non-BE for 14.
In this example, if the block allocation for 9 is performed by reversing the blocks allocated to (10 → 9) and to (10 → 14 → 13 → 9) within the same max-flow set, the BA can become BE for 14. However, in most cases, the BA is non-BE for a subsequent recipient unless another max-flow set is reselected. As shown in Fig. 10(c), by using another max-flow set for 9, the above condition can be avoided for 14. However, this condition is not practically usable because it is computationally too costly to investigate all sets of cut edges to each of all subsequent recipients after the block allocation for the current recipient. Therefore, we adopt a weaker check on the following simpler condition.

CheckBlockAllocation(): This procedure is performed every time after the block allocation for r is completed to check if the current BA prevents the future BA from becoming BE for any subsequent recipient. Let r′ be a subsequent recipient, m be the max-flow value from s to r′, and F * be the set of all multicast flows to which r′ already belongs. First, we define D(V * , A * , C * ) by reducing the link capacities used in F * .
Then, max-flow value m * from s to r′ on the reduced network D(V * , A * , C * ) is computed to find possible flows other than F * on D(V, A, C). If m * + |F * | < m, there is no room to extend (m − |F * |) different multicast flows to r′ and the future BA cannot be BE for r′; thus Fail is returned. If m * + |F * | = m for every subsequent recipient, Pass is returned.
After Fail is returned, as shown in Alg. 3, the current block allocation for r is canceled and postponed. In Alg. 2 next time, a new max-flow set P * for r different from the failed P will be selected by finding a max-flow set on the graph on which a sufficiently large cost is assigned to the links used by the failed block allocation for r.
In the case of Fig. 10(b), after the block allocation for r = 9, we check r′ = 14 and see that the future block allocation is non-BE for 14 because m = 3, |F * | = 2, and m * = 0; therefore, this block allocation for 9 is canceled. Later, for the next trial, a sufficiently large cost is assigned to the links (10 → 9) and (10 → 14 → 13 → 9) to select a new max-flow set of 9 that is not harmful to 14, such as the one in Fig. 10(c).
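The weaker check can be sketched with a BFS-based max-flow computation on the reduced network. The code below is a minimal illustration for a single subsequent recipient, not Alg. 3 itself: the function names, the representation of a multicast flow as a list of directed links, and the five-link toy topology are all hypothetical and unrelated to Fig. 10.

```python
from collections import deque


def max_flow_value(capacity, s, t):
    """Edmonds-Karp (BFS augmenting paths) on directed link capacities."""
    res = {}
    for (u, v), c in capacity.items():
        res.setdefault(u, {}).setdefault(v, 0)
        res.setdefault(v, {}).setdefault(u, 0)
        res[u][v] += c
    total = 0
    while s in res and t in res:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        total += bottleneck
    return total


def check_block_allocation(capacity, joined_flows, s, r_next, m):
    """Return "Fail" if m* + |F*| < m for the subsequent recipient r_next."""
    reduced = dict(capacity)
    for flow in joined_flows:          # F*: multicast flows r_next already belongs to
        for link in flow:
            reduced[link] -= 1         # remove the capacity used by F*
    m_star = max_flow_value(reduced, s, r_next)
    return "Fail" if m_star + len(joined_flows) < m else "Pass"


# Toy full-duplex topology with unit capacities; max-flow from s to r is 2.
pairs = [("s", "a"), ("s", "b"), ("a", "r"), ("b", "r"), ("a", "b")]
capacity = {link: 1 for u, v in pairs for link in ((u, v), (v, u))}

ok = check_block_allocation(
    capacity, [[("s", "a"), ("a", "r"), ("a", "b")]], "s", "r", m=2)
bad = check_block_allocation(
    capacity, [[("s", "a"), ("a", "r"), ("s", "b")]], "s", "r", m=2)
print(ok, bad)
```

In the second call, the single multicast flow that already reaches r occupies both outgoing links of the sender, so no second, different flow can be extended to r and the check fails.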
We recognize that it is still inefficient to examine the condition for every subsequent recipient after the block allocation for the current recipient; the total number of checks is proportional to K^2, where K is the number of recipients. Therefore, we optionally use this check as Alg. 3 by inserting it into Alg. 2.