Graph-based Approach for Buffer-aware Timing Analysis of Heterogeneous Wormhole NoCs under Bursty Traffic

This paper addresses the problem of worst-case timing analysis of heterogeneous wormhole NoCs, i.e., routers with different buffer sizes and transmission speeds, when consecutive packet queuing (CPQ) occurs. The latter means that there are several consecutive packets of one flow queuing in the network. This scenario happens in the case of bursty traffic but also for non-schedulable traffic. Conducting such an analysis is known to be a challenging issue due to the sophisticated congestion patterns when enabling backpressure mechanisms. We tackle this problem through extending the applicability domain of our previous work for computing maximum delay bounds using Network Calculus, called Buffer-aware worst-case Timing Analysis (BATA). We propose a new Graph-based approach to improve the analysis of indirect blocking due to backpressure, while capturing the CPQ effect and keeping the information about dependencies between flows. Furthermore, the introduced approach improves the computation of indirect-blocking delay bounds in terms of complexity and ensures the safety of these bounds even for non-schedulable traffic. We provide further insights into the tightness and complexity issues of worst-case delay bounds yielded by the extended BATA with the Graph-based approach, denoted G-BATA. Our assessments show that the complexity has decreased by up to 100 times while offering an average tightness ratio of 71%, with reference to the basic BATA. Finally, we evaluate the yielded improvements with G-BATA for a realistic use case against a recent state-of-the-art approach. This evaluation shows the applicability of G-BATA under more general assumptions and the impact of such a feature on the tightness and computation time.


I. INTRODUCTION
Networks-on-chip (NoCs) have become the standard interconnect for manycore architectures because of their high throughput and low latency capabilities. Most NoCs use wormhole routing [1], [2] to transmit packets over the network: each packet is split into constant-length words called flits, and each flit is forwarded from router to router without having to wait for the remaining flits. Compared to store-and-forward (S&F) mechanisms, wormhole routing drastically reduces the storage buffers needed at each router [3], as well as the contention-free end-to-end delay of a packet, which becomes almost insensitive to the packet path length. On the other hand, wormhole routing complicates the possible congestion patterns, since a packet waiting for a resource to be freed can occupy several input buffers of routers along its path, thus introducing indirect blocking delays due to the buffer backpressure¹ [4].
Hence, an appropriate timing analysis, taking into account these phenomena, has to be considered to provide safe delay bounds in wormhole NoCs.
Various timing analysis approaches for such NoCs have been proposed in the literature, and a detailed qualitative benchmarking can be found in [5]. The most relevant ones can broadly be categorized into three main classes: Scheduling Theory-based ([4], [6]-[9]), Compositional Performance Analysis (CPA)-based ([10], [11]) and Network Calculus-based ([12]-[15]). However, these existing approaches suffer from some limitations, which are mainly due to:
• considering specific assumptions, such as: (i) distinct priorities and a unique virtual channel assignment for each traffic flow in a router [4], [6], [9]; (ii) a priority-share policy, but with a number of Virtual Channels (VCs) at least equal to the number of traffic priority levels, as in [7], [8], [10], [12], or to the maximum number of contentions along the NoC [16];
• ignoring the buffer backpressure phenomena, such as in [7], [11], [13], [14], [17];
• ignoring the flows serialization phenomena² along the flow path by conducting an iterative response time computation, commonly used in Scheduling Theory and CPA, which generally leads to pessimistic delay bounds.
Hence, to cope with these identified limitations, we have proposed in [5] a timing analysis using Network Calculus [18], referred to as Buffer-Aware Worst-case Timing Analysis (BATA) from this point on. The main idea of BATA consists in enhancing the accuracy of delay bounds in wormhole NoCs by considering: (i) the flows serialization phenomena along the path of a flow of interest (foi), by accounting for the impact of interfering flows only at the first convergence point; (ii) refined interference patterns for the foi accounting for the limited buffer size, by quantifying the way a packet can spread on a NoC with small buffers.
Moreover, BATA is applicable to a wide range of wormhole NoCs: (i) routers implement a fixed priority arbitration of VCs; (ii) a VC can be assigned to an arbitrary number of traffic classes with different priority levels (VC sharing); (iii) each traffic class may contain an arbitrary number of flows (priority sharing).
¹ A logical mechanism to control the flow on a communication channel and avoid buffer overflow.
² The pipelined behavior of networks implies that the interference between flows along their shared subpaths should be counted only once, i.e., at their first convergence point.

arXiv:1911.02430v1 [cs.PF] 6 Nov 2019
Nevertheless, this approach, along with many other state-of-the-art approaches for the timing analysis of NoCs taking backpressure into account, considers only Constant Bit Rate (CBR) traffic, i.e., one fixed-length packet within a minimum inter-arrival time. However, some traffic types, such as real-time audio, video and bursty data streams, do not fulfill the CBR model. With such traffic, there can be more than one packet of the same flow consecutively queueing in the network. This scenario is referred to hereafter as consecutive-packet-queueing (CPQ). Assuming CPQ makes it possible to consider bursty traffic flows, i.e., flows that can inject several consecutive packets in the NoC, but also to cover scenarios where the network load is high or the traffic is non-schedulable, so that a packet of one flow is delayed enough to impact the next injected packet of the same flow. The impact of the CPQ assumption on the interference patterns has been revealed in [19]. Moreover, further insights into the computation issues of the worst-case delay bounds yielded by BATA have been provided in [20]. The results reveal that BATA provides good delay bounds for medium-scale configurations within less than one hour, but its complexity increases dramatically for large-scale configurations.
In this paper, we extend the applicability domain of BATA to ensure that the computed delay bounds remain safe without any assumption on CPQ, in addition to considering heterogeneous NoCs, i.e. buffer sizes, link capacities and processing delays may differ from one router to another. Furthermore, we cope with the complexity issue of BATA to enable the timing analysis of large-scale configurations within a more reasonable time.
Contributions: we introduce a new Graph-based approach to improve the analysis of indirect blocking due to backpressure, while capturing the CPQ effect, for heterogeneous NoCs. This approach, denoted G-BATA for Graph-based Buffer-Aware Timing Analysis, also decreases the complexity of the timing analysis process. Furthermore, we provide deeper insights into the tightness and complexity issues of worst-case delay bounds yielded by G-BATA, when varying different system parameters. Our assessments show that the computation times with G-BATA are 10 to 100 times lower than with BATA. Moreover, the average measured tightness ratio (achieved worst-case delay using simulation over analytical worst-case delay bound) of G-BATA is 71%. Finally, we evaluate the yielded improvements with G-BATA for an automotive use case against a recent state-of-the-art approach. This evaluation shows the applicability of G-BATA under more general assumptions than the state-of-the-art approach and the impact of such a feature on the tightness and computation time.
The remainder of this paper is organized as follows. We first present the problem statement in Section II. Then, we detail the system model and some preliminaries in Section III. Section IV details our new approach, G-BATA, to handle heterogeneous NoCs and the impact of backpressure under CPQ assumption. Finally, we evaluate the complexity and tightness of our approach in Section V, and the yielded improvements with G-BATA for a realistic use case against a recent state-of-the-art approach in Section VI.

II. PROBLEM STATEMENT

A. Illustrative Example of CPQ Effect
The key element to take into account the backpressure phenomenon induced by limited buffer size is how packets can spread in the network when stalled. We consider an illustrative example to better understand the impact of the buffer size on the packet spreading, and consequently on the indirect blocking set (Figure 1). We make the following assumptions: (i) each buffer can store only one flit; (ii) all flows have 3-flit-long packets; (iii) all flows are mapped to the same VC; (iv) the foi is flow 1. We assume there is a packet A of flow 3 that has just been injected into the NoC and granted the use of the North output port of R6. Simultaneously, a packet B of flow 2 is requesting the same output, but as A is already using it, B has to wait. B is stored in input buffers of R6, R5 and R4. Finally, a packet C of flow 1 has reached R3 and now requests output port East of R3. However, the West input buffer of R4 is occupied by the tail flit of B. Hence, C has to wait. In that case, A indirectly blocks C, which means flow 3 can impact the transmission of flow 1 even though they do not share resources.
Now consider a second configuration (Figure 2). Flow 3 has a packet A that has just been injected in the network and is using output port North of R7. As before, flow 2 has a packet B in the network that competes with A and has to wait. This time, however, the output requested by B is one hop further on the path of flow 2. As a result, B is stored in input buffers of R7, R6 and R5. Finally, flow 1 has injected a packet C into the NoC. Since B is stalling one hop further than before on its path, C can request output port East of R3, use input buffer West of R4, and reach its destination without contention.
An approach that does not consider buffer sizes would predict that flow 3 could impact flow 1 regardless of the configuration. However, the second example shows that such an assumption is pessimistic and can be avoided by taking buffer size into account. This illustrates the impact of the buffer size on packet spreading, and how buffer size reduces the section of the path on which a blocked packet can in turn block another one. Still considering the same example, recalled in Figure 3, we notice that our analysis assumes there can only be one packet of flow 2 stalling in the network. Should there be an additional packet of flow 2 queueing right after the first one, the analysis would be different. We call such a scenario "consecutive packet queueing" (CPQ).
To see how this limits the applicability of the BATA approach, consider the packet configuration shown in Figure 3. As before, a packet A of flow 3 is being transmitted. It requested and was granted output port North of R7. Flow 2 has a packet B also requesting output port North of R7 but, as A is using it, B has to wait. B is stored in buffers in R5, R6 and R7. Moreover, there is an additional packet of flow 2, C, right behind B. It was granted the use of output port East of R3 and is waiting at R4 for the next input buffer of its path to be available. Finally, flow 1 has a packet D requesting the output port East of R3. Since C is already using this output, D has to wait.
Thus, packet D has to wait until packet A releases output port North of R7 to be able to move. This means flow 3 can indirectly block flow 1, a scenario that the BATA approach did not cover. CPQ can happen when considering bursty traffic, i.e., flows that can generate and inject a burst of several packets one after the other. Examples of such flows include real-time audio and video streams. It can also occur when a packet of a periodic CBR flow experiences enough congestion for the next packet to "catch up" with it.

B. Identified Extensions of BATA
To cover the CPQ assumption, we introduce the new Interference Graph approach, G-BATA (Graph-based BATA), which extends BATA with the following features:
• Generic system model to cover more general traffic patterns and heterogeneous NoC architectures;
• Improved analysis of the backpressure phenomenon through refining the indirect blocking set computation;
• Indirect blocking latency analysis taking into account the refined indirect blocking set.
Each identified extension is detailed in the following sections and illustrated through an example.

III. PRELIMINARIES AND SYSTEM MODEL
In this section, we detail the considered system model based on Network Calculus. First, we present the main concepts of Network Calculus that are used in this paper. Afterwards, we describe the network and flow models. Finally, we introduce the main definitions to cover the characteristics of heterogeneous NoCs with wormhole routing. The notations will be introduced as they are needed and are also gathered in Table I. As a general rule, upper indexes of a notation $X$ refer to a node or a subset of nodes, while lower indexes refer to a flow: $X^r_f$ means "$X$ at node $r$ for flow $f$".

A. Network Calculus Background
Network Calculus describes data flows by means of cumulative functions, defined as the number of transmitted bits during the time interval $[0, t]$. Consider a system $S$ receiving an input data flow with a Cumulative Arrival Function (CAF), $A(t)$, and putting out the same data flow with a Cumulative Departure Function (CDF), $D(t)$. To compute upper bounds on the worst-case delay and backlog, we need to introduce the maximum arrival curve, which provides an upper bound on the number of events, e.g., bits or packets, observed during any interval of time.
Definition 3.1: (Arrival Curve) [18] A function $\alpha$ is an arrival curve for a data flow with the CAF $A$, iff:
$$\forall t, s \geq 0,\ s \leq t: \quad A(t) - A(s) \leq \alpha(t - s)$$
A widely used curve is the leaky bucket curve, which guarantees a maximum burst $\sigma$ and a maximum rate $\rho$, i.e., the traffic flow is $(\sigma, \rho)$-constrained. In this case, the arrival curve is defined as $\gamma_{\sigma,\rho}(t) = \sigma + \rho \cdot t$ for $t > 0$. Furthermore, we need to guarantee a minimum offered service within crossed nodes through the concept of minimum service curve.
Definition 3.2: (Simple Minimum Service Curve) [18] The function $\beta$ is the simple service curve for a data flow with the CAF $A$ and the CDF $D$, iff:
$$\forall t \geq 0: \quad D(t) \geq \inf_{0 \leq s \leq t} \left( A(s) + \beta(t - s) \right)$$
To define the leftover service curve for a flow crossing a node implementing aggregate scheduling, we need the strict service curve property:
Definition 3.3: (Strict service curve) [18] The function $\beta$ is a strict service curve for a data flow with the CDF $D(t)$, if for any backlogged period³ $]s, t]$, $D(t) - D(s) \geq \beta(t - s)$.
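As a minimal sketch of Definition 3.1, the following Python snippet checks whether a leaky bucket curve is an arrival curve for a sampled cumulative arrival function. The function names, the per-cycle sampling and the example trace are assumptions made for illustration; they are not part of the original analysis.

```python
from typing import Callable, Sequence


def leaky_bucket(sigma: float, rho: float) -> Callable[[float], float]:
    """Return gamma_{sigma,rho}: t -> sigma + rho * t for t > 0, and 0 otherwise."""
    def alpha(t: float) -> float:
        return sigma + rho * t if t > 0 else 0.0
    return alpha


def is_arrival_curve(caf: Sequence[float], alpha: Callable[[float], float]) -> bool:
    """Check A(t) - A(s) <= alpha(t - s) for every sampled pair 0 <= s <= t.

    caf[t] is the cumulative number of flits sent during [0, t],
    sampled once per cycle (an assumption of this sketch).
    """
    n = len(caf)
    return all(
        caf[t] - caf[s] <= alpha(t - s)
        for s in range(n)
        for t in range(s, n)
    )


# Hypothetical trace: a 3-flit packet released at cycle 1 and another at cycle 5.
trace = [0, 3, 3, 3, 3, 6, 6, 6, 6]

fits = is_arrival_curve(trace, leaky_bucket(sigma=3, rho=0.75))
too_small = is_arrival_curve(trace, leaky_bucket(sigma=2, rho=0.75))
```

With this trace, a burst of 3 flits is enough to cover each packet release, whereas a burst of 2 flits is violated as soon as the first packet arrives.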
Knowing the arrival and service curves, one can compute the upper bounds on performance metrics for a data flow, according to the following theorem.
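TABLE I (excerpt): Notations
• $P_f[k]$: the $(k+1)$-th node of the path of $f$
• $subpath(P_k, P_l)$: the subpath of flow $k$ relatively to flow $l$ after $dv(P_k, P_l)$
• the convergence node of $P_k$ and $P_l$
• $dv(P_k, P_l)$: the divergence node of $P_k$ and $P_l$
• $f \supset r$: flow $f$ crosses node $r$
• $F \supset r$: there is a flow $f \in F$ such that $f \supset r$

Theorem 3.1: (Performance Bounds) Consider a flow constrained by an arrival curve $\alpha$ crossing a system $S$ that offers a service curve $\beta$, then:
• the delay is bounded by $h(\alpha, \beta)$⁴;
• the backlog is bounded by $v(\alpha, \beta)$⁵;
• the output flow is constrained by the arrival curve $\alpha^*(t) = \sup_{s \geq 0} \left( \alpha(t + s) - \beta(s) \right)$.
In the case of a leaky bucket arrival curve and a rate-latency service curve, the computation of these bounds is greatly simplified. The delay and backlog are bounded by $\frac{\sigma}{R} + T$ and $\sigma + \rho \cdot T$, respectively; and the output arrival curve is $\sigma + \rho \cdot (T + t)$.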
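The closed-form bounds of the leaky bucket / rate-latency case can be sketched as follows. The class and function names are hypothetical, and the check that the arrival rate does not exceed the service rate is the usual Network Calculus stability condition, added here as an explicit assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeakyBucket:
    sigma: float  # burst, in flits
    rho: float    # rate, in flits per cycle


@dataclass(frozen=True)
class RateLatency:
    rate: float     # R, in flits per cycle
    latency: float  # T, in cycles


def _check_stable(alpha: LeakyBucket, beta: RateLatency) -> None:
    # Assumption of this sketch: the bounds are finite only if rho <= R.
    if alpha.rho > beta.rate:
        raise ValueError("arrival rate exceeds service rate: bounds are infinite")


def delay_bound(alpha: LeakyBucket, beta: RateLatency) -> float:
    """Delay bound sigma / R + T."""
    _check_stable(alpha, beta)
    return alpha.sigma / beta.rate + beta.latency


def backlog_bound(alpha: LeakyBucket, beta: RateLatency) -> float:
    """Backlog bound sigma + rho * T."""
    _check_stable(alpha, beta)
    return alpha.sigma + alpha.rho * beta.latency


def output_curve(alpha: LeakyBucket, beta: RateLatency) -> LeakyBucket:
    """Output arrival curve sigma + rho * (T + t): the burst grows by rho * T."""
    _check_stable(alpha, beta)
    return LeakyBucket(alpha.sigma + alpha.rho * beta.latency, alpha.rho)


# Hypothetical values: a 6-flit burst at 0.25 flit/cycle through a node with R = 1, T = 2.
alpha = LeakyBucket(sigma=6, rho=0.25)
beta = RateLatency(rate=1, latency=2)
```

Propagating `output_curve` from node to node is what makes the burst of an interfering flow grow along its path before it reaches a convergence point.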
Finally, we need the following results concerning the end-to-end service curve of a flow of interest (foi) accounting for flows serialization effects in feed-forward networks, based on the Pay Multiplexing Only Once (PMOO) principle [21], under non-preemptive Fixed Priority (FP) multiplexing.
Theorem 3.2: The service curve offered to a flow of interest $f$ along its path $P_f$, in a network under non-preemptive FP multiplexing with strict service curve nodes of the rate-latency type $\beta_{R,T}$ and leaky bucket constrained arrival curves $\alpha_{\sigma,\rho}$, is a rate-latency curve with a rate $R^{P_f}$ and a latency $T^{P_f}$, where the required notations are defined in Table I.
⁴ $h(f, g)$: the maximum horizontal distance between $f$ and $g$.
⁵ $v(f, g)$: the maximum vertical distance between $f$ and $g$.

B. Network Model
Our model can apply to an arbitrary NoC topology as long as the flows are routed in a deterministic, deadlock-free way (see [1]), and in such a way that flows that interfere on their path do not interfere again after they diverge. Nonetheless, we consider the commonly used 2D-mesh topology with input-buffered routers and XY-routing, known for their simplicity and high scalability. Besides, XY-routing is widely used in COTS architectures (e.g. [22]).
We consider typical input-buffered 2D-mesh routers with 5 pairs of input-output ports, namely North (N), South (S), West (W), East (E) and Local (L), as shown in Figure 4. Output-buffered routers have buffers located at the output ports instead of the input ports but remain similar otherwise.
Fig. 4: Typical 2D-mesh router
It is worth noticing that NoCs using output-buffered routers can be modeled similarly to NoCs using input-buffered routers. The idea is that, from a flow point of view, whether the buffer is located at the input or at the output does not change the number of buffers and links crossed by the flow on its path, as introduced in [ ].
We also allow modeling heterogeneous NoC architectures. For instance, we can specify distinct buffer sizes, link and router capacities and processing delay values on a single NoC.
The considered wormhole NoC routers are similar to the architecture presented in [24], illustrated in Figure 5 (top). They implement a priority-based arbitration of VCs and enable flit-level preemption through VCs. The latter can happen if a flow from a higher priority VC asks for an output that is being used by the flow of interest (foi). Hence, when the flit being transmitted finishes its transmission, the higher priority flow is granted the use of the output while the foi waits. Moreover, each VC has a specific input buffer and supports many traffic classes, i.e., VCs sharing, and many traffic flows may be mapped on the same priority-level, i.e., priority sharing. Finally, the implemented VCs enable the bypass mechanism, illustrated in Figure 6. If the foi gets blocked at some point (for instance, flow 1 in Figure 6), flows from lower priority VCs sharing upstream outputs with the foi (for instance, flow 2 in Figure 6) can bypass it, but they will be preempted again when the downstream blocking of the foi disappears.
We consider an arbitrary service policy to serve flows belonging to the same VC within the router, i.e., these flows can be from the same traffic class or from different traffic classes mapped on the same VC. This assumption allows us to cover the worst-case behaviors of different service policies, such as FIFO and Round Robin (RR) policies. Hence, we model such a wormhole NoC router as a set of independent hierarchical multiplexers, where each one represents an output port as shown in Figure 5 (bottom). The first arbitration level is based on a blind (arbitrary) service policy to serve all the flows mapped on the same VC level and coming from different geographical inputs. The second level implements a preemptive Fixed Priority (FP) policy to serve the flows mapped on different VC levels and going out from the same output port. It is worth noticing that the independence of the different output ports is guaranteed in our model, due to the integration of the flows serialization phenomena. The latter allows ignoring the interference between the flows entering a router through the same input and exiting through different outputs, since these flows have necessarily arrived through the same output of the previous router, where we have already taken into account their interference.
Fig. 6: Bypass mechanism (a higher priority flit can move; once the buffer is full, packet 1 is blocked; packet 2 can move while packet 1 is blocked)
Each router-output pair $r$ (that we will refer to as a node from now on) has a processing capacity that we model using a rate-latency service curve $\beta^r(t) = R^r (t - T^r)^+$, where $R^r$ represents the minimal processing rate of the router for this output (typically expressed in flits per cycle, fpc) and $T^r$ the maximal delay experienced by any flit crossing the router before being processed (commonly called routing delay, which takes one or a few cycles).

C. Flow Model
The characteristics of each traffic flow $f \in F$ are modeled with the following leaky bucket arrival curve, which covers many different traffic arrival patterns, such as CBR or bursty traffic with or without jitter:
$$\alpha_f(t) = \sigma_f + \rho_f \cdot t$$
This arrival curve integrates the maximal packet length $L_f$ (payload and header in flits), the period or minimal inter-arrival time $P_f$ (in cycles; a scalar, distinct from the path of $f$ introduced below), the burst $b_f$ (number of packets the flow may release consecutively) and the release jitter $J_f$ (in cycles) in the following way:
$$\rho_f = \frac{L_f}{P_f}, \qquad \sigma_f = b_f \cdot L_f + J_f \cdot \frac{L_f}{P_f}$$
For each flow $f$, its path $P_f$ is the list of nodes (router-outputs) crossed by $f$ from source to destination. Moreover, for any $k$ in the appropriate range, $P_f[k]$ denotes the $(k+1)$-th node of the path of flow $f$ (starting at index 0). Therefore, for any $r \in P_f$, the propagated arrival curve of flow $f$ from its initial source until the node $r$, computed based on Th. 3.1, will be denoted:
$$\alpha_f^r(t) = \sigma_f^r + \rho_f \cdot t$$
The end-to-end service curve granted to flow $f$ on its whole path will be denoted:
$$\beta_f(t) = R_f \cdot (t - T_f)^+$$
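The mapping from flow parameters to a leaky bucket curve can be sketched as below. The formula used, $\sigma_f = b_f L_f + J_f L_f / P_f$ and $\rho_f = L_f / P_f$, is an assumption of this sketch; it is consistent with the worked example of Section IV-B, where a burst of 2 packets without jitter gives $\sigma = 2L$. The function name is hypothetical.

```python
from typing import Tuple


def leaky_bucket_params(
    length: float, period: float, burst: int = 1, jitter: float = 0.0
) -> Tuple[float, float]:
    """Return (sigma, rho) for a flow.

    length: maximal packet length L_f, in flits
    period: period or minimal inter-arrival time P_f, in cycles
    burst:  number of packets b_f the flow may release consecutively
    jitter: release jitter J_f, in cycles
    """
    if period <= 0:
        raise ValueError("period must be positive")
    rho = length / period                  # long-term rate, flits per cycle
    sigma = burst * length + jitter * rho  # assumed burst formula, in flits
    return sigma, rho


# Hypothetical flow: 3-flit packets every 12 cycles, bursts of 2 packets, no jitter.
sigma, rho = leaky_bucket_params(length=3, period=12, burst=2)
```

A CBR flow corresponds to `burst=1` and `jitter=0`, in which case the burst reduces to one packet length.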

D. Preliminaries
Consider two flows $k$ and $l$ that are directly interfering with one another, with paths $P_k$ and $P_l$, and let $dv(P_k, P_l)$ be the last node they share:
$$dv(P_k, P_l) = P_k\left[\max\{i : P_k[i] \in P_l\}\right]$$
Suppose the path of $l$ continues after this node. Even if the head flit of $l$ is not stored in a router of $P_k \cap P_l$, the limited buffer size available in each router can lead to storing the tail flit of $l$ in a router of $P_k \cap P_l$ under contention. In that case, $l$ blocks $k$.
Therefore, we need to quantify the way a packet of flow $f$ spreads into the network when it is blocked and stored in buffers. Here, we assume node $r$ has a buffer size of $B^r$ to model heterogeneous architectures.
Definition 3.4: Consider a flow $f$ of maximum packet length $L_f$ flits. The spread index of $f$ at node $i$, denoted $N_f^i$, is defined as follows:
$$N_f^i = \min\left\{ l \geq 0 : L_f \leq \sum_{j=0}^{l-1} B^{P_f[i+j]} \right\}$$
where $B^r$ is the buffer size at node $r$ in flits. $N_f^i$ is the number of buffers needed to store one packet of flow $f$ from node $P_f[i]$ onwards on the path of $f$.
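The spread index can be computed with a single pass over the buffers along the path. The sketch below follows the verbal definition (number of buffers needed to store one packet from a given path index onwards); the function name and the list-based representation of buffer sizes are assumptions, and the count is truncated when the path ends before the packet fits.

```python
from typing import Sequence


def spread_index(buffers_on_path: Sequence[int], i: int, packet_length: int) -> int:
    """Number of consecutive buffers, starting at path index i, needed to store one packet.

    buffers_on_path[j] is the buffer size (in flits) of the j-th node of the path.
    If the path ends before the whole packet fits, the remaining node count is returned.
    """
    stored = 0
    count = 0
    for size in buffers_on_path[i:]:
        if stored >= packet_length:
            break
        stored += size
        count += 1
    return count


# Homogeneous NoC with 1-flit buffers: a 3-flit packet spreads over 3 nodes.
homogeneous = spread_index([1, 1, 1, 1, 1], i=1, packet_length=3)
# Heterogeneous NoC: buffers of 2 and 4 flits hold a 3-flit packet in 2 nodes.
heterogeneous = spread_index([2, 4, 1], i=0, packet_length=3)
```

Larger buffers give a smaller spread index, which is why buffer size directly shortens the section of the path on which a stalled packet can block other flows.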
Using this notion and the last intuitive example, we call the section of the path of flow $k$ from $dv(P_k, P_l)$ through (at most) $N_k^{dv(P_k, P_l)}$ nodes the "subpath of $k$ relatively to $l$":
Definition 3.5: The subpath of a flow $k$ relatively to a flow $l$ is:
$$subpath(P_k, P_l) = \left[ P_k[Last(P_k, P_l) + 1], \ldots, P_k\left[Last(P_k, P_l) + N_k^{Last(P_k, P_l)+1}\right] \right]$$
where $Last(P_k, P_l) = \max\{n, P_k[n] \in P_l\}$ is the index of the last node shared by $k$ and $l$ along $P_k$, i.e., $P_k[Last(P_k, P_l)] = dv(P_k, P_l)$.
We can extend this notion and define, in a similar fashion, the subpath of any flow $k$ relatively to a subpath $S_l \subset P_l$ of any flow $l$ (with $l \neq k$ or $l = k$). The previous notation still holds:
Definition 3.6: The subpath of a flow $k$ relatively to any subpath $S_l$ of any flow $l$ is:
$$subpath(P_k, S_l) = \left[ P_k[Last(P_k, S_l) + 1], \ldots, P_k\left[Last(P_k, S_l) + N_k^{Last(P_k, S_l)+1}\right] \right]$$
where $Last(P_k, S_l) = \max\{n, P_k[n] \in S_l\}$ is the index along $P_k$ of the last node shared by $k$ and $l$ within $S_l$. By abuse of notation, we write $subpath(k, l)$ to refer to $subpath(P_k, P_l)$, and similarly $subpath(k, S_l)$ to refer to $subpath(P_k, S_l)$.
It is worth noticing that if $P_l$ ends before reaching the $N_l^{Last(P_k, P_l)+1}$-th node after $dv(P_k, P_l)$, then we ignore the out-of-range indexes. The notion of subpath is illustrated in Figure 7 for the foi $k$ and a spread index for the interfering flow $l$ right after node $dv(P_k, P_l)$ equal to 3, i.e., $N_l^{Last(P_k, P_l)+1} = 3$.

Fig. 7: Subpath illustration for the foi k
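A minimal sketch of the subpath computation, assuming the reading in which the subpath of $k$ relatively to a set of nodes is taken along the path of $k$, right after the last node of $k$ found in that set, and spans as many nodes as one stalled packet of $k$ needs. The function name, the node labels and the dictionary of buffer sizes are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Node = Hashable


def subpath(
    path_k: Sequence[Node],
    shared: Sequence[Node],
    packet_length: int,
    buffer_size: Dict[Node, int],
) -> List[Node]:
    """Nodes of path_k, right after its last node in `shared`, that one packet of k can occupy.

    Returns an empty list when path_k and `shared` have no node in common.
    Out-of-range indexes are ignored: the result stops at the end of path_k.
    """
    shared_set = set(shared)
    last = max((n for n, node in enumerate(path_k) if node in shared_set), default=None)
    if last is None:
        return []
    result: List[Node] = []
    stored = 0
    for node in path_k[last + 1:]:
        if stored >= packet_length:
            break
        result.append(node)
        stored += buffer_size[node]
    return result


# Hypothetical configuration with 1-flit buffers and 3-flit packets.
path_k = ["a", "b", "c", "d", "e", "f", "g"]
path_l = ["x", "b", "c", "y"]
buffers = {node: 1 for node in path_k + path_l}
sub = subpath(path_k, path_l, packet_length=3, buffer_size=buffers)
```

Passing a subpath of another flow, or of the same flow, as `shared` gives the generalization of Definition 3.6.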
We will also need the following definitions.
, that is, all flows with a priority lower than or equal to (resp. higher than or equal to) that of $f$, $f$ excluded.
Definition 3.9: The indirect blocking set of a flow $f$ is the set of flows that do not physically share any resource with the foi, but cause a delay to the foi because they impact (directly or indirectly) at least one flow of $DB_f$. It is denoted $IB_f$ and contains pairs of the form {flow id, subpath} to specify, for each flow, where a packet of that flow can cause blocking that may propagate to the foi through backpressure.
It is worth noticing that Definition 3.9 is slightly different from the one used in the Scheduling Theory approaches [7], [8], where there is a distinction between indirect blocking, due to same-priority flows, and indirect interference, due to higher priority flows. In our approach, we only consider flows belonging to the same VC as the foi to compute the indirect blocking set, since the impact of higher priority flows is already integrated in our model as follows:
• if a higher priority flow blocking our foi gets blocked, the foi can bypass it. In this case, we take into account the extra processing delay needed to allocate the shared resource to the foi. On the other hand, the buffer backpressure will only propagate among flows from the same class, as illustrated in Figure 6;
• the influence of higher priority flows on the flows that have the same priority as the foi and induce the indirect blocking is modeled through the granted end-to-end service curve of each one of these flows, at the rate and latency levels, as explained in Section IV-B.

IV. GRAPH-BASED APPROACH FOR BUFFER-AWARE TIMING ANALYSIS
In this section, we first present an overview of G-BATA with the steps needed to compute the end-to-end delay bound (Section IV-A). We then recall the direct blocking latency computation (Section IV-B), propose a graph-based approach to compute the IB set of flows under CPQ scenarios (Section IV-C), and detail the new method to compute $T_{IB}$ (Section IV-D). We illustrate each step on an example.

A. Overview
To get a bound on the end-to-end latency for a flow $f$, we first need to compute its end-to-end service curve. The end-to-end service curve of $f$ is denoted:
$$\beta_f(t) = R_f \cdot (t - T_f)^+$$
where $T_f$ is the sum of:
• $T^{P_f}$, the "base latency", that any flit of $f$ experiences along its path due only to the technological latencies of the crossed routers;
• $T_{DB}$, the maximum direct blocking latency, due to flows in $DB_f$;
• $T_{IB}$, the maximum indirect blocking latency, due to flows in $IB_f$.
To compute the bound on the end-to-end latency for the foi $f$, we proceed according to Algorithm 1, following these main steps:
1) We compute $T^{P_f}$ (Line 2) and the direct blocking latency $T_{DB}$ (Lines 4 to 10);
2) We compute the indirect blocking set $IB_f$ (Line 11);
3) We compute $T_{IB}$ (Lines 12 to 17);
4) From there on, knowing the initial arrival curve of $f$, $\alpha_f$, and its end-to-end service curve $\beta_f$ (Line 18), we compute the end-to-end delay bound using Theorem 3.1 as follows:
$$\frac{\sigma_f^{P_f[0]}}{R_f} + T_f$$
where $\sigma_f^{P_f[0]}$ is the burst of the initial arrival curve of $f$ (the arrival curve of $f$ at node $P_f[0]$, i.e., the first node of its path).
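The last step can be sketched numerically as follows, assuming the end-to-end service curve is a rate-latency curve whose latency is the sum of the three components listed above. The function name and all numerical values are hypothetical; the three latencies are taken as inputs and are not computed here.

```python
def end_to_end_delay_bound(
    sigma0: float, rate: float, t_base: float, t_db: float, t_ib: float
) -> float:
    """Delay bound sigma0 / R_f + T_f, with T_f = base + direct + indirect blocking latency.

    sigma0: burst of the initial arrival curve of the foi, in flits
    rate:   rate of the end-to-end service curve, in flits per cycle
    """
    if rate <= 0:
        raise ValueError("service rate must be positive")
    t_f = t_base + t_db + t_ib  # total latency of the end-to-end service curve
    return sigma0 / rate + t_f


# Hypothetical foi: 6-flit burst, unit rate, 4 cycles of base latency,
# 10 cycles of direct blocking and 5 cycles of indirect blocking.
bound = end_to_end_delay_bound(sigma0=6, rate=1, t_base=4, t_db=10, t_ib=5)
```

Since the three latency terms simply add up, a CPQ scenario that enlarges the indirect blocking set increases the bound only through the indirect blocking term.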
The steps impacted by CPQ scenarios are steps 2 and 3. The remaining steps are the same as in the BATA approach introduced in [5]. Therefore, for the sake of self-containment, we only recall herein the main idea of step 1 (more details can be found in [5]) and focus on steps 2 and 3, which illustrate the introduced graph-based approach to cope with the CPQ assumption.
Algorithm 1 Computing the end-to-end service curve for a flow f
endToEndServiceCurve(f, P_f)
    ...
    α_k ← initial arrival curve of k
    // Now add the latency over the subpath to T_IB:
    ...

B. Step 1: Direct Blocking Latency Computation
The direct blocking latency takes into account the impact of flows sharing resources with the foi. We used PMOO [21] to account for flow serialization phenomena when computing the maximum direct blocking latency. As introduced in [5], it is defined in the following Theorem.
Theorem 4.1: (Maximum Direct Blocking Latency) The maximum direct blocking latency for a foi $f$ along its path $P_f$, in a NoC under flit-level preemptive FP multiplexing with strict service curve nodes of the rate-latency type $\beta_{R,T}$ and leaky bucket constrained arrival curves $\alpha_{\sigma,\rho}$ is equal to: where: The proof can be found in [5].

Application
We now detail the computations on the example of Figure 3, for the foi 1. We assume all routers have a service curve $\beta(t) = R (t - T)^+$, and each flow $i$ has a packet length $L_i = L$ and the initial arrival curve $\alpha_i(t) = \sigma + \rho t$. We also consider that all flows have a burst $b = 2$ and no jitter, therefore $\sigma = 2L$. All flows are mapped to the same VC, thus $T_{hp} = T_{lp} = 0$. We then have : Hence :
It is important to notice that when we compute the direct blocking latency $T_{DB}$ of the foi, we need to know the burst of interfering flows at their convergence point with the foi. Thus we need to compute, for each interfering flow, its service curve from its source to the aforementioned convergence point.
The end-to-end service curve computation is thus a recursive process (cf. Algorithm 1). The recursion terminates because each call to endToEndServiceCurve() is made upstream of the current convergence point.

C. Step 2: Indirect Blocking Set Computation
To handle the CPQ assumption, we introduce two modifications. First, we allow computing the subpath of any flow $f$ relatively to a subpath $S_f \subset P_f$ of $f$ itself, to model several packets of the same flow queuing in the network. Second, we use a graph structure to maintain the dependency information between the subpaths. By doing so, we are able to know how each subpath was computed, and we can also explore all possible interference patterns more easily.
Each vertex corresponds to a subpath of a flow and holds the following information:
• fkey: the flow identifier;
• path: the subpath;
• dependencies: the list of all edges $(v, u)$ where $v$ is the current vertex and $u$ is such that v.path is the subpath of flow v.fkey relatively to subpath u.path;
• dependents: the list of all edges $(w, v)$ where $v$ is the current vertex and $w$ is such that w.path is the subpath of flow w.fkey relatively to subpath v.path.
The two functions to construct the graph are detailed in Algorithms 2 and 3. The main steps are as follows:
1) We create a graph with one vertex corresponding to the foi (Line 1);
2) We compute all subpaths relatively to the foi and create a vertex depending on the foi's vertex for each non-empty subpath (Lines 2 and 7);
3) We add these vertices to the graph, making sure there are no duplicates and merging the dependencies of the new vertex with the existing one if needed (Line 5);
4) We iterate these steps on each new vertex, in a breadth-first manner, until no new vertex is created (loop on Line 3).
Once the graph is created, the indirect blocking set of $f$ consists of the pairs $(k, sub_k)$ from all vertices such that $k \notin DB_f \cup \{f\}$. In other words, these vertices correspond to flows that do not directly interfere with the foi $f$.
Algorithm 2 (surviving lines)
    4: for v ∈ L_0 do
    5:     addVertex(G_f, v)
Algorithm 3 addVertex(G, v)
    1: if G already contains a vertex w with the same fkey and path as v then
    2:     merge v with w
    3: else
    4:     add v to G
    5: end if
The computational complexity of Algorithm 2, when considering a flow set $F$ on the NoC, is denoted as $C(|F|)$ and is defined in the following property.
Property: The computational complexity of Algorithm 2 is
$$C(|F|) = O\left( \max_{f \in F} |P_f| \cdot \sum_{f \in F} |P_f| \right)$$
and can be roughly bounded as follows:
$$C(|F|) = O\left( |F| \cdot \left( \max_{f \in F} |P_f| \right)^2 \right)$$
Proof: We first notice that vertices of the graph are defined only by their flow index and subpath. For a flow $f$, there are $|P_f|$ possible subpaths (each of them starting at a different node of the path of $f$). Therefore, there are at most $\sum_{f \in F} |P_f|$ distinct subpaths for the flow set $F$. We can thus bound the number of vertices of the computed graph. For each of these vertices, the algorithm computes all possible subpaths relatively to the current vertex's subpath (in the main loop of getNextVertices()).
Assume this subpath is S and that we have a preprocessed dictionary listing, for every node, the indexes of flows using this node (footnote 7). Although we wrote the secondary loop of getNextVertices() as a loop over all flows in F for clarity reasons, all we have to do to get all possible subpaths relatively to S is run through the nodes of S and check for intersection with another flow's path. Comparing the indexes of the current node with those of the previous node, we can find divergence nodes of contending flows relatively to S. We assume that, knowing the divergence point of a contending flow relatively to S, it takes a constant time to find its subpath (we only need to compute the spread index).
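The preprocessing and the node-by-node comparison described in this paragraph can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the function names `build_node_index` and `divergence_nodes`, the representation of a path as a list of node identifiers, and the convention that a flow still present on the last node of S diverges there are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Set

Node = Hashable


def build_node_index(paths: Dict[int, Sequence[Node]]) -> Dict[Node, Set[int]]:
    """Preprocessing: map every node to the set of flow indexes using it."""
    index: Dict[Node, Set[int]] = defaultdict(set)
    for flow, path in paths.items():
        for node in path:
            index[node].add(flow)
    return index


def divergence_nodes(subpath: Sequence[Node],
                     index: Dict[Node, Set[int]]) -> Dict[int, Node]:
    """Walk along S once and record, for each flow met on S, the last node
    it shares with S (assumed convention for the 'divergence node')."""
    divergence: Dict[int, Node] = {}
    prev_flows: Set[int] = set()
    prev_node = None
    for node in subpath:
        flows_here = set(index.get(node, ()))
        # Flows seen on the previous node but not on this one left S there.
        for flow in prev_flows - flows_here:
            divergence[flow] = prev_node
        prev_flows, prev_node = flows_here, node
    # Flows still present on the last node of S diverge at that node.
    for flow in prev_flows:
        divergence[flow] = prev_node
    return divergence


# Hypothetical three-flow configuration; S is the path of flow 1.
paths = {1: [0, 1, 2, 3], 2: [1, 2, 6], 3: [2, 3, 7]}
node_index = build_node_index(paths)
result = divergence_nodes(paths[1], node_index)
```

The walk visits each node of S once, which matches the O(max |P_f|) cost per subpath used in the proof, provided the set comparisons are treated as constant-time operations.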
Thus, the complexity of finding all subpaths relatively to any subpath is O(max_{f∈F} |P_f|), hence the final result. The last bound is found by bounding each path length of the sum by the maximal path length in the whole flow set.
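The counting argument of the proof can be summarized in one line. The display below is a reconstruction from the two steps of the proof (maximum number of vertices, then cost per vertex); it is not a verbatim copy of the property statement.

```latex
\[
C(|\mathcal{F}|)
  = O\Big(
      \underbrace{\sum_{f \in \mathcal{F}} |P_f|}_{\text{max. number of vertices}}
      \cdot
      \underbrace{\max_{f \in \mathcal{F}} |P_f|}_{\text{cost per vertex}}
    \Big)
  \subseteq
  O\Big( |\mathcal{F}| \cdot \big( \max_{f \in \mathcal{F}} |P_f| \big)^{2} \Big)
\]
```

The second form is obtained by replacing every |P_f| in the sum with the maximal path length, as stated at the end of the proof.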
We can account for more than one packet of the same flow stalling in the network because we allow the subpath of a flow to be computed relatively to a subpath of that very same flow.
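The construction steps and the extraction of the indirect blocking set can be sketched as follows. This is an illustrative sketch under stated assumptions, not Algorithms 2 and 3 themselves: the subpath computation (which depends on the spread index) is abstracted behind a hypothetical callback `next_subpaths`, and the names `Vertex`, `build_graph` and `indirect_blocking_set` are chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

Path = Tuple[int, ...]
Key = Tuple[int, Path]  # (flow identifier, subpath) identifies a vertex


@dataclass
class Vertex:
    fkey: int
    path: Path
    dependencies: List[Key] = field(default_factory=list)  # edges (v, u)
    dependents: List[Key] = field(default_factory=list)    # edges (w, v)

    @property
    def key(self) -> Key:
        return (self.fkey, self.path)


def build_graph(foi: int, foi_path: Sequence[int],
                next_subpaths: Callable[[Vertex], Iterable[Tuple[int, Sequence[int]]]]
                ) -> Dict[Key, Vertex]:
    """Breadth-first construction starting from the vertex of the foi."""
    root = Vertex(foi, tuple(foi_path))
    graph: Dict[Key, Vertex] = {root.key: root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for fkey, sub in next_subpaths(u):
            sub = tuple(sub)
            if not sub:            # only non-empty subpaths create vertices
                continue
            v = graph.get((fkey, sub))
            if v is None:          # new vertex: add it and explore it later
                v = Vertex(fkey, sub)
                graph[v.key] = v
                queue.append(v)
            if u.key not in v.dependencies:  # merge with an existing vertex
                v.dependencies.append(u.key)
                u.dependents.append(v.key)
    return graph


def indirect_blocking_set(graph: Dict[Key, Vertex], foi: int,
                          direct_blockers: Set[int]) -> List[Key]:
    """Keep the (flow, subpath) pairs whose flow is neither the foi nor in DB_f."""
    excluded = set(direct_blockers) | {foi}
    return [key for key in graph if key[0] not in excluded]


# Hypothetical subpath table standing in for the real subpath computation.
table = {
    (1, (0, 1, 2)): [(2, (1, 2, 3))],
    (2, (1, 2, 3)): [(3, (3, 4)), (2, (3,)), (4, ())],
    (2, (3,)): [(3, (3, 4))],
}
g = build_graph(1, [0, 1, 2], lambda v: table.get(v.key, []))
ib = indirect_blocking_set(g, foi=1, direct_blockers={2})
```

Keying vertices by (flow identifier, subpath) is what allows two vertices of the same flow to coexist, here (2, (1, 2, 3)) and (2, (3,)), which is how several consecutive packets of one flow are represented.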

Application
We now apply the algorithm to the configuration of Figure 3: 1) starting from flow 1, we create vertex v_1 with index 1 and path P_1 and we call getNextVertices() on

5) we call getNextVertices() on [v_3]. It returns the empty list [] and the algorithm terminates. The subpaths corresponding to the computed vertices are represented on Figure 8. The final graph is the following:
Footnote 7: We do run such a preprocessing on the configuration.
When using the G-BATA approach, we take into account the possible queuing of several packets of each flow through the consideration of multiple consecutive subpaths for one flow. Therefore, when computing T_IB, the main difference compared to the BATA approach is that, for each {flow index, subpath} pair of the derived IB set, we do not need to compute the arrival curve at the beginning of the subpath and instead use the initial arrival curve of one packet of the corresponding flow. Having several consecutive subpaths for the same indirectly interfering flow allows a burst of more than one packet to be taken into account.
Computation of the indirect blocking latency T_IB is done using the following theorem: The maximum indirect blocking latency for a foi f along its path P_f, in a NoC under flit-level preemptive FP multiplexing with strict service curve nodes of the rate-latency type β_{R,T} and leaky-bucket constrained arrival curves α_{σ,ρ}, is as follows: where:
Proof: For any pair {j, subP_j} ∈ IB_f, a packet of flow j will impact the foi f during the maximum time it occupies the associated subpath subP_j, ∆t_j^max. Hence, a safe upper bound on the indirect blocking latency is as follows: On the other hand, for any pair {j, subP_j} ∈ IB_f, ∆t_j^max is upper bounded by the end-to-end delay bound of one packet of flow j along its associated subpath subP_j, D_j^{subP_j}, which infers the following: Based on Theorem 3.1, the delay bound of flow j, D_j^{subP_j}, is computed as the maximum horizontal distance between:
• the maximum arrival curve for a single packet of flow j at the input of the subpath subP_j, α. We consider one packet per subpath. This is due to the fact that each subpath holds one packet (from the definition of the spread index). The multiple packets are taken into account through the multiple consecutive subpaths of the same flow. Thus, the considered arrival curve is the initial arrival curve of flow j with b_j equal to one, that is with a burst equal to L_j + J_j ρ_j;
• the service curve granted to flow j by its VC along subP_j, β_j^{subP_j}, called VC-service curve, when ignoring the same-priority flows (which are already included in IB_f). The latter condition is due to the pipelined behavior of the network, where the same-priority flows sharing subP_j are served one after another if they need shared resources. Hence, the impact of the flows with the same priority as flow j is already integrated within the sum expressed in Eq. (9).
To compute the granted service curve β_j^{subP_j} for each flow j ∈ IB_f along subP_j, we apply the existing Theorem 3.2, when: • ignoring the same-priority flows in sp(j), thus all shp(j) will become hp(j) and slp(j) will become lp(j) in Eqs. Hence, we obtain R_j^{subP_j} and T_j^{subP_j} described in Eqs. (8a) and (8b), respectively. Consequently, the maximum indirect blocking latency in Eq. (9) can be re-written as follows:
It is worth noticing that, compared to the BATA approach, we do not need to propagate the arrival curves of flows in IB_f at the beginning of the subpaths when computing T_IB. Consequently, our new approach G-BATA does not need to compute service curves upstream of the subpaths, which decreases the number of recursive calls to endToEndServiceCurve() in Algorithm 1. We will evaluate the associated complexity gain in our computational analysis in Section V-A.
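The computation described by the proof can be illustrated numerically. The sketch below relies on the standard Network Calculus result that the horizontal distance between a leaky-bucket arrival curve with burst σ and rate ρ and a rate-latency service curve β_{R,T} is T + σ/R when ρ ≤ R, and it sums one such term per pair of the IB set. The class and function names are hypothetical, and the values of R_j^{subP_j} and T_j^{subP_j} are treated as given inputs (in the paper they come from Theorem 3.2, which is not re-implemented here).

```python
from typing import Iterable, NamedTuple


class SubpathTerm(NamedTuple):
    """One {flow j, subpath subP_j} pair of the IB set (illustrative fields)."""
    packet_length: float    # L_j, in flits
    jitter: float           # J_j, in cycles
    rate: float             # rho_j, in flits per cycle
    service_rate: float     # R_j^{subP_j}, assumed given
    service_latency: float  # T_j^{subP_j}, assumed given


def single_packet_burst(term: SubpathTerm) -> float:
    """Burst of the initial arrival curve with b_j = 1: L_j + J_j * rho_j."""
    return term.packet_length + term.jitter * term.rate


def delay_bound(term: SubpathTerm) -> float:
    """Horizontal distance between the leaky-bucket and rate-latency curves."""
    if term.rate > term.service_rate:
        raise ValueError("no finite bound: arrival rate exceeds service rate")
    return term.service_latency + single_packet_burst(term) / term.service_rate


def indirect_blocking_latency(ib_set: Iterable[SubpathTerm]) -> float:
    """Sum of the single-packet delay bounds over all pairs of the IB set."""
    return sum(delay_bound(term) for term in ib_set)


# Hypothetical values, for illustration only.
ib_set = [
    SubpathTerm(packet_length=16, jitter=0, rate=0.04,
                service_rate=1.0, service_latency=5.0),
    SubpathTerm(packet_length=16, jitter=10, rate=0.1,
                service_rate=0.5, service_latency=3.0),
]
t_ib = indirect_blocking_latency(ib_set)
```

Because each term uses the initial single-packet burst, no arrival curve has to be propagated to the start of a subpath, which is the source of the complexity gain discussed above.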
V. PERFORMANCE EVALUATION
In this section, we first analyse the computational effort of G-BATA, particularly on heavy configurations, with reference to BATA. Afterwards, we conduct a sensitivity analysis of the proposed approach when varying the system parameters and analyse their effect on the end-to-end delay bound. Finally, we assess the tightness of the derived bounds, using the insights we got from the sensitivity analysis.

A. Computational Analysis
In this section, we study the computational aspect of G-BATA. We first run G-BATA on configurations with 4, 8, 16, 32, 48, 64, 80, 96 and 128 flows on an 8 × 8 NoC and compare it with BATA. We randomly generated 20 such configurations for each number of flows N. To do so, we randomly pick 2N (x-coordinate, y-coordinate) couples, where each coordinate is uniformly chosen in the specified range (here, from 0 to 7). We use N of these couples for source cores and the other N for destination cores. We set a time limit of two hours for the analysis.
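The generation procedure can be sketched as follows. This is a minimal sketch, assuming Python's standard pseudo-random generator; the function name `random_configuration` and the `seed` parameter are introduced for the example, and the text does not say whether a flow whose source equals its destination is filtered out, so no filtering is done here.

```python
import random
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def random_configuration(n_flows: int, mesh_size: int = 8,
                         seed: Optional[int] = None) -> List[Tuple[Coord, Coord]]:
    """Draw 2N uniform (x, y) couples; the first N are sources, the last N destinations."""
    rng = random.Random(seed)
    couples = [(rng.randrange(mesh_size), rng.randrange(mesh_size))
               for _ in range(2 * n_flows)]
    sources, destinations = couples[:n_flows], couples[n_flows:]
    # Note: a source may coincide with its destination (not filtered here).
    return list(zip(sources, destinations))


# 20 configurations for each number of flows, as in the experiment description.
flow_counts = [4, 8, 16, 32, 48, 64, 80, 96, 128]
configurations = {n: [random_configuration(n, seed=1000 * n + i) for i in range(20)]
                  for n in flow_counts}
```

Fixing one seed per configuration is a design choice of this sketch that makes each flow set reproducible across the BATA and G-BATA runs.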
For each configuration, we will focus on the following complexity metrics, which give an idea of the cost of analyzing a configuration:
• ∆t, the total analysis runtime;
• ∆t_IB, the duration of the IB analysis (for BATA, determining the IB set; for G-BATA, constructing the interference graph);
• ∆t_e2e, the duration of all end-to-end delay bounds computation;
• N_e2e, the number of calls to the function endToEndServiceCurve();
• N_iter, the number of calls to a representative IB analysis function: for BATA, the number of iterations needed to compute all subpaths in the IB set (denoted "while iterations" on Figure 10); for G-BATA, the number of calls to the function getNextVertices().
We begin the comparative study by plotting the total analysis runtime ∆t as well as the duration of IB analysis ∆t_IB as functions of the number of flows (Figure 9). The first thing we can notice, on the left graph, is that BATA takes more time than G-BATA, especially for flow sets of more than 32 flows. For instance, the total analysis of 48-flow configurations is on average 766 times faster with G-BATA than with BATA. There were no timeouts for G-BATA, whereas BATA timed out for most configurations with 64 flows or more.
However, we expect the IB analysis part of BATA approach to be computationally less expensive than G-BATA. Since the IB analysis is independent from the end-to-end service curve and delay bound computation, we were able to do it with no time-outs. We have plotted the runtimes of IB analysis part vs flow number for the two approaches to check this intuition (right graph of Figure 9). The result is very explicit: IB analysis of BATA is faster than G-BATA. For instance, on 48-flow configurations, BATA is on average 5.7 times faster than G-BATA.
In an attempt to be more platform-independent, we have used other metrics than runtimes to estimate the complexity of analyses. To do so, we counted the number of calls of relevant functions. For the end-to-end delay bounds computations, we counted the total number of calls to the function endToEndServiceCurve() in Algorithm 1, which is used in both approaches. For the IB analysis part, the two approaches are significantly different; thus, we counted the number of iterations of the while loop for BATA and the number of calls to the function addVertex() for G-BATA when this function creates a new vertex. The number of calls to addVertex() in G-BATA is roughly the equivalent of the number of while iterations of BATA.
We gathered the results in Figure 10. We plotted two graphs: one for the service curve computation (left), the other one for the IB analysis (right). The results match what the runtime graphs showed: G-BATA is much faster on the end-to-end service curve computation, while BATA is faster on IB analysis. More precisely, for the total analysis of 48-flow configurations, BATA performs on average 1883 times as many calls to endToEndServiceCurve() as G-BATA does. For the IB analysis, G-BATA performs on average 1.5 times as many IB analysis iterations as BATA does.
We then performed additional experiments on randomly generated configurations for the G-BATA approach, on an 8 × 8 NoC, with a number of flows from 20 to 800, to study how well the new method scales on large flow sets. As before, we perform the analysis and measure the total runtime, the runtime of the IB analysis and the runtime of the service curve computation. We plot the results on Figure 11. What comes out of this additional study is that G-BATA analysis scales well: without parallelization, on a laptop powered by an Intel Core i7 processor, computing end-to-end delay bounds for each of the 800 flows takes around 7200 seconds in the worst case (2 hours), i.e. around 9 seconds per flow, as shown on the left graph of Figure 11. Moreover, the IB analysis is the most computationally expensive phase: for the 800-flow configurations, it represents on average 97.1% of the total runtime, as illustrated on the right graph of Figure 11.
Key points: The G-BATA approach scales much better than the BATA approach. The difference is especially visible for flow sets of 32 and 48 flows, where the average runtime of the total analysis for BATA is 10 to 100 times higher than for G-BATA. For bigger configurations, we have not been able to get much comparative information, as running one analysis with BATA takes more than two hours. Moreover, the G-BATA approach performs well on heavy configurations (600 and 800 flows) with an average total runtime of 2647 and 6935 seconds, respectively. Finally, we notice that, depending on the approach, the most computationally expensive step is either the indirect blocking analysis (G-BATA) or the service curve computation (BATA). The latter is what limits the scalability of the BATA approach for large flow sets.
From these illustrated results, we can notice that for a given number of flows, runtimes can vary significantly from one configuration to another. For instance, on the left graph of Figure 9, for 48-flow configurations, runtimes differ by up to 57% and up to 99% for G-BATA and BATA, respectively. Hence, the configuration complexity seems to not only depend on the number of flows, but also on at least another hidden parameter.
In an attempt to better understand what are the configuration parameters impacting the approach complexity, we define two congestion indexes.
The value of one such index is specific to one flow. Hence, to quantify how complex a configuration is, we introduce the following average indexes: • |F|, the number of flows of the configuration;
It is worth noticing that these communication patterns favor direct and indirect blocking, which impact the introduced direct and indirect blocking indexes. We perform the same analysis as before and compare the results we get for both approaches, on these constrained configurations (referred to as "constrained") and the previous 4-, 8-, 16- and 32-flow configurations.
We first plot total runtime as a function of flow number, and the average curve ( Figure 13). We notice that for both G-BATA and BATA approaches, there is a noticeable difference between the constrained and the uniformly distributed configurations. For a given number of flows, constrained sets generally require greater runtimes than the previous sets. We did not include the plots of other runtimes (IB analysis and service curve computation) vs flow number for the two configuration types, but they exhibit the same trend as total runtimes.
Hence, to better understand the correlation between the runtime and the congestion pattern, we focus on the 32-flow configurations and we plot, for both approaches, all points (x, y) where:
• x is the average DB index (resp. IB index) of the configuration, I_DB (resp. I_IB);
• y is the total analysis runtime.
The results are gathered in Figure 14. For both approaches, we notice that the runtime tends to increase with the average congestion index (direct or indirect). We conclude that a higher average congestion index (direct or indirect) tends to characterize configurations that require a higher computation time.
Moreover, the average IB index does not bring more insights than the average DB index on how computationally expensive the analysis of a configuration may be. So, given that it is computationally more expensive to compute the average IB index than the average DB index, especially for G-BATA approach, we conclude that average DB index is a good configuration indicator to quantify the complexity of a configuration in addition to the number of flows.
Key points: Although there is a correlation between the number of flows of one set and the runtime needed to perform its analysis, we find that it is not sufficient to characterise how long the timing analysis may take. In that respect, we propose two configuration indicators to refine the quantitative aspect of the complexity of a flow set: the average DB and IB indexes. We show that both are adequate complementary configuration parameters. Nonetheless, the average IB index is computationally more expensive while not bringing much more information. Hence, the DB index and the size of the flow set are considered as sufficient to characterize a configuration complexity.

B. Sensitivity Analysis
In this section, we study the impact of different parameters on the end-to-end delay bounds yielded by G-BATA. For the sensitivity analysis, we will analyze the end-to-end delay bounds when varying the following parameters:
• buffer size for values 1, 2, 3, 4, 6, 8, 12, 16, 32, 48, 64 flits;
• total packet length (including header) for values 2, 4, 8, 16, 64, 96, 128 flits;
• flow rate for values between 1% and 40% of the total link capacity (so that the total utilization rate on any link remains below 100%).
To achieve this aim, we consider the configuration described on Figure 15. This configuration remains quite simple but exhibits sophisticated indirect blocking patterns. We assume periodic flows with no jitter having the same period and packet length, and consider the following parameters:
• each router can handle one flit per cycle and it takes one cycle for one flit to be forwarded from the input of a router to the input of the next router, i.e., for any node r, T_r = 1 cycle and R_r = 1 flit/cycle;
• all the flows are mapped on the same VC;
• our flow of interest is flow 1.
To better highlight the impact of the various parameters on G-BATA in reference to BATA, we display the results of G-BATA along with the existing results obtained with BATA. Figure 16 illustrates the end-to-end delay bounds of the foi when varying buffer size. For the left graph, we keep each flow rate constant at 4% of the total bandwidth; whereas for the right graph, we keep each flow packet length at 16 flits.
First, on both graphs, we notice an opposite trend between G-BATA and BATA approaches. The former predicts that delay bounds increase when buffer size increases, whereas the latter predicts that delay bounds decrease. This is mainly due to the variation of the spread index of flows and its impact on each approach.
For BATA, a larger buffer lowers the spread index and thus shortens the subpaths, which generally makes the IB set smaller: as this approach does not consider CPQ, reducing the length of a subpath reduces the possibility that this subpath intersects with the paths of other flows. Consequently, the derived IB latency tends to decrease, as well as the end-to-end delay bound.
For G-BATA approach, however, the interference graph takes CPQ into account, and in that respect, the number of consecutive packets is not bounded. Therefore, reducing the size of the subpaths increases their number. The extracted IB set thus contains more subpaths of smaller size. Consequently, there are more terms in the indirect blocking delay sum (Equation 7), which may increase the end-to-end delay bound.
Second, we notice that with both approaches, the end-to-end delay bounds increase with the packet length and rate. Moreover, we observe that past a certain value of buffer size, the end-to-end delay bounds remain constant. This corresponds to the IB set remaining constant once buffers are large enough to hold one packet (spread index of 1 for all flows).
Finally, on the right graph, we notice that BATA is more sensitive to rate than G-BATA: for buffer sizes below 6 flits, BATA predicts delay bounds between 327 and 1178 cycles, while G-BATA gives delay bounds between 357 and 486 cycles.
Key points: Although increasing buffer size may improve end-to-end delay bounds when no CPQ happens (under BATA), we find that it does not favorably impact the end-to-end delay bound when CPQ can occur and the number of consecutive packets is not bounded.
Next, we focus on the packet length impact on the end-to-end delay bound for G-BATA and BATA, as illustrated on Figures 17 and 18, respectively. For clarity reasons, we plotted separate graphs for the two approaches. On each figure, the left graphs present results when the buffer size is constant (4 flits) and the right ones when the rate of each flow is constant (4% of the link capacity).
The first observation we can make from all graphs is that the delay bounds evolve in an almost linear manner with the packet length. For instance, on the right G-BATA graph, with 8 flits of buffer size and packet length equal to 64, 96 and 128 flits, the ratios of end-to-end delay bound to packet length are 20.9, 20.7 and 20.6, respectively.
Still on the same right graph, we observe further interesting aspects:
• At a given packet length, the buffer size has a limited impact on the end-to-end delay bounds. For instance, for a packet length of 64 flits, the delay bounds increase by less than 30% when the buffer size increases by 480%;
• For packet lengths that are significantly larger than the buffer size, the delay bound remains constant regardless of the buffer size, e.g., this is the case for a packet length of 128 flits.
Similar observations can be made for BATA approach.
However, looking at the left graphs for BATA and G-BATA, we notice that BATA is more sensitive to rate variations than G-BATA: for a packet of 64 flits, when the rate increases from 2% to 40%, the end-to-end delay bound yielded by BATA increases from 1226 cycles to 4630 cycles (+278%) while the delay bound predicted by G-BATA increases only from 1326 cycles to 1698 cycles (+28%).
Key points: at a given rate and packet length, we observe that buffer size has a limited impact on the end-to-end delay bound, and this observation is valid for both G-BATA and BATA approaches. We also notice that the evolution of the delay bound with the packet length follows an almost linear trend, for both approaches as well. Finally, we further confirm that BATA is more sensitive to rate variations than G-BATA, especially for large packet lengths.
We now focus on the impact of the flow rate on end-to-end delay bounds (Figure 19). The left graph represents the evolution of delay bounds when packet length is fixed (16 flits) for different values of buffer size, and the right graph shows the evolution of delay bounds with a fixed buffer size (4 flits) and values of packet length from 2 to 64 flits. As expected, with both approaches, the end-to-end delay bound increases with the rate. What is more interesting is that delay bounds with the G-BATA approach increase much less rapidly than with the BATA approach for buffers of 1 and 8 flits: at a 40% flow rate, BATA gives bounds that are 26% to 162% greater than bounds given by the G-BATA approach (left graph on Figure 19). Therefore, we can confirm one more time that BATA is more sensitive to rate variations than G-BATA.
Although there is generally no strict order between the bounds given by the two approaches, for instance for B = 8 flits, we can notice a trend regarding the relative position of the bounds: BATA predicts smaller bounds than G-BATA for large buffer sizes and small rates, and the trend is opposite for small buffer sizes, especially as the rate increases. When the rate of flow ρ increases with all other parameters constant, the propagated burst of an arrival curve increases by ρ · T per node with a service curve latency of T . Results obtained with BATA are especially impacted by this burst propagation since the burst is propagated at the beginning of the subpaths when computing T IB . This explains why BATA-predicted bounds increase faster than graph-predicted bounds when increasing the rate.
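The burst growth mentioned above follows from the standard Network Calculus output bound for a leaky-bucket flow crossing a rate-latency server. The derivation below is a generic sketch, assuming ρ ≤ R at every node and a sequence of n nodes with latencies T_1, ..., T_n; the symbols n, T_i and σ_n are introduced here for illustration.

```latex
\begin{align*}
\alpha(t) &= \sigma + \rho\, t, \qquad
\beta_{R,T}(t) = R\,(t - T)^{+}, \qquad \rho \le R \\
\alpha^{*}(t) &= (\alpha \oslash \beta_{R,T})(t) = \sigma + \rho\, T + \rho\, t \\
\sigma_{n} &= \sigma + \rho \sum_{i=1}^{n} T_{i}
\end{align*}
```

Since the added burst is proportional to ρ, a bound that relies on bursts propagated to the beginning of the subpaths grows faster with the rate than one that uses the initial single-packet burst.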
Key points: Both approaches predict an increase of the end-to-end delay bound with the rate, however this increase is significantly different depending on the approach. Burst propagation at the beginning of subpaths in BATA approach leads to important bound increase when the flow rate is high. For instance, the computed bounds are up to 275% higher with BATA than with G-BATA at 40% flow rate.

C. Tightness Analysis
To assess the tightness of the delay bounds yielded by G-BATA, we consider herein simulation results using the Noxim simulator engine [25]. We have configured Noxim to control the traffic pattern using the provided traffic pattern file option. For each flow, we have specified:
• the source and destination cores;
• pir, packet injection rate, i.e. the rate at which packets are sent when the flow is active;
• por, probability of retransmission, i.e. the probability one packet will be retransmitted (in our context, this parameter is always 0);
• t_on, the time the flow wakes up, i.e. starts transmitting packets with the packet injection rate;
• t_off, the time the flow goes to sleep, i.e. stops transmitting;
• P, the period of the flow.
Moreover, since we want to simulate a deterministic flow behavior to approach the worst-case scenario, we use the following parameters for each flow:
• Maximal packet injection rate: 1.0;
• Minimal probability of retransmission: 0.0.
We also have to pick t_on, that is, determine at what time within its period the flow is going to wake up from its inactivity and start sending packets. Since we want the flow to be periodic, we set its active period to be as short as possible so that we ensure it wakes up, sends exactly one packet, and goes to sleep until the next period. To create different contention scenarios and try approaching the worst case of end-to-end delays, we randomly choose a value of t_on for each flow and perform simulations with uniformly distributed values of offsets for each flow. We generate 40000 different traffic configurations for each set of parameters.
We simulate the configuration of Figure 15, when varying buffer sizes in 4, 8 and 16 flits, and flow rates in 8% and 32% of the total available bandwidth. We run each flow configuration many times while varying the flows' offsets. We extract the worst-case end-to-end delay found by the simulator over all the simulations, and for each flow f, we compute the corresponding "tightness ratio" τ_f, that is the ratio of the achievable worst-case delay D_WC and the worst-case delay bound D_f: τ_f = D_WC / D_f. The obtained results are gathered in Table II. We also recall the computed tightness ratios obtained with BATA, detailed in [20]. Additionally, we computed and included the congestion indexes associated with the G-BATA approach. We only displayed results for buffer sizes 4 and 16.
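The offset sampling and the tightness ratio can be sketched as follows. This is an illustrative sketch only: it does not call Noxim, the function names `random_offsets` and `tightness_ratio` are hypothetical, and the delay values in the usage example are made up to show the arithmetic.

```python
import random
from typing import Dict, Iterable, Optional


def random_offsets(periods: Dict[int, float],
                   seed: Optional[int] = None) -> Dict[int, float]:
    """Draw one wake-up time t_on per flow, uniformly within its period."""
    rng = random.Random(seed)
    return {flow: rng.uniform(0.0, period) for flow, period in periods.items()}


def tightness_ratio(simulated_delays: Iterable[float], delay_bound: float) -> float:
    """Ratio of the worst delay observed in simulation to the analytic bound.

    A value above 1 would mean the bound is not safe for the simulated scenario.
    """
    return max(simulated_delays) / delay_bound


# Hypothetical per-flow periods (cycles) and observed delays (cycles).
offsets = random_offsets({1: 400.0, 2: 400.0, 3: 400.0}, seed=7)
tau = tightness_ratio([300.0, 412.0, 390.0], delay_bound=500.0)
```

Because only a finite number of offset combinations is simulated, the observed worst case is a lower bound on the true worst case, so the ratio computed this way can only underestimate the actual tightness.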
We notice that the lower the congestion indexes are, the greater the tightness is. Low congestion indexes mean that the contending possibilities are reduced. Hence the worst case is simpler to find and thus more likely to be achieved or approached with randomly chosen offsets. We stress that there are many possibilities for the wake-up time of each flow, and that our series of simulations may not have been able to approach or achieve the worst case for every flow.
For a buffer size of 4 flits and a flow rate of 8%, G-BATA and BATA give similar results (with a slightly better average tightness for BATA). However, for a 32% rate, G-BATA gives tighter bounds. For 16 flits of buffer size, BATA gives tighter results for both rates. However, we want to stress that in this case, for a 32% rate, we might not be able to verify that no CPQ can occur. Thus the results yielded by BATA should be taken with caution.
Key points: On the tested configuration, with 4-flit-large buffers and at 8% flow rate, both models give similar results. With the same buffer size and a higher rate (32%), G-BATA gives tighter results than BATA, showing that BATA tends to be pessimistic for high flow rates. With larger buffer  sizes, BATA performs better, but when flow rates are high, BATA might not be applicable. Overall, the tightness is good. G-BATA averages at 72% when the buffer size is 4 flits and 56% for 16 flits, whereas BATA averages at 59% and 81%, respectively. For flows subject to the more complex congestion patterns, the worst-case may not have been approached as closely as for flows undergoing little to no interference, hence the derived tightness ratio is smaller. This conjecture is supported by the fact that the measured tightness is lower for flows with higher congestion indexes.

D. Discussion
In order to determine whether BATA or G-BATA should be used, we propose a decision-making graph ( Figure 20). The first choices regard the system characteristics. If the traffic is non-CBR, or if the platform is heterogeneous, BATA is not applicable, thus G-BATA should be used. With CBR traffic and homogeneous platforms, BATA may be used provided CPQ does not occur, i.e. provided one packet of a flow cannot catch up on the previously injected one.
However, due to the computational complexity of BATA for large flowsets, the analysis with BATA may take a long time. Therefore we recommend the use of G-BATA for configurations with more than 80-100 flows. The main interest of using BATA when the appropriate assumptions are verified is that it may give tighter results than G-BATA in some cases, e.g. when buffer size is large compared to packet lengths.
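The selection rules stated in this discussion can be sketched as a small function. This is a reading of the prose only, not a transcription of Figure 20; the function name `choose_analysis`, its parameters, and the default threshold of 80 flows (the lower end of the 80-100 range given above) are assumptions made for the example.

```python
def choose_analysis(cbr_traffic: bool,
                    homogeneous_platform: bool,
                    cpq_possible: bool,
                    n_flows: int,
                    large_flowset_threshold: int = 80) -> str:
    """Return the approach suggested by the discussion for a given system."""
    # BATA is not applicable to non-CBR traffic or heterogeneous platforms.
    if not cbr_traffic or not homogeneous_platform:
        return "G-BATA"
    # BATA requires that no consecutive packet queuing (CPQ) can occur.
    if cpq_possible:
        return "G-BATA"
    # For large flow sets, the BATA analysis may take too long.
    if n_flows > large_flowset_threshold:
        return "G-BATA"
    # Otherwise BATA is applicable and may give tighter bounds in some cases.
    return "BATA"
```

For example, a homogeneous platform with CBR traffic, no CPQ and 32 flows maps to BATA, whereas the same system with 200 flows maps to G-BATA.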

VI. AUTOMOTIVE CASE STUDY
We now perform our analysis on the case study proposed in [26] and used in [9]. The chosen application is the control of an autonomous vehicle. It features several tasks in charge of processing data from the sensors, managing the obstacle database, controlling the actuators, etc. Various data flows are exchanged between these tasks.
Further description of the application can be found in [26]. We took the same 33 tasks mapped on a 4 × 4 2D-mesh NoC, and the same mapping of the 38 data flows between tasks, routed in an XY fashion.
The parameters used are the following:
• The duration of a cycle is 0.5 ns;
• All routers have a technological latency of 3 cycles;
• The link capacity is one flit per cycle;
• Flows' priority assignment follows a rate monotonic policy;
• Each router supports 4 Virtual Channels with no priority-sharing and no VC-sharing, i.e., one flow per VC;
• To compare our results to the ones in [9], we performed the analysis for different buffer sizes (2, 100 and 1000000 flits, the last being large enough to assume buffer size is infinite).
All flows have a different priority. As they are mapped to VCs in such a way that at each router, all VCs are non-shared, there is no indirect blocking. Thus, we expect BATA and G-BATA to give the exact same results for the worst-case delay bounds, which we checked was the case. We then plotted comparative graphs on Figure 21 and computed the average tightness of our approach (Table III), using results from simulations performed by Nikolić et al. [9]. The average tightness ratios for the G-BATA approach with buffer sizes 2, 100 and infinite are 64%, 67% and 71%, respectively.
We first notice that our approach gives similar results to [9]. To further quantify the similarity of the results, we subtracted the tightness ratios obtained by the two approaches on each bound to obtain what we call "tightness difference", denoted ∆τ. For a given flow: ∆τ = τ_G-BATA − τ_ST, where τ_G-BATA is the tightness ratio of the bound yielded by G-BATA, and τ_ST is the tightness ratio of the bound yielded by the method of [9]. The tightness difference ∆τ is positive when G-BATA gives the tighter bound and negative otherwise. We synthesized the differences in Table III. We computed the minimum, maximum and average tightness difference. Even though they are based on fundamentally different theories, we can notice that both approaches yield very close results, giving credit to both models.
Authors in [9] have shown that 4 VCs are needed to find a mapping of flows to VCs that ensures each flow has exclusive use of the VC within each router, which greatly simplifies the  computation. However, having only one flow per VC at each node can raise scalability problems: with larger and/or less favorable configurations, ensuring each flow has the exclusive use of a VC within each router would require a number of different VCs that is not reasonable any more.
In that respect, we want to stress that our model allows priority sharing and VC sharing (several flows sharing priority levels and VCs). Therefore, we have performed another analysis on the same configuration using only 2 VCs, with the following priority mapping:
• Flows 1 to 19 have the higher priority and are mapped to VC0;
• Flows 20 to 38 have the lower priority and are mapped to VC1.
We also analyze a configuration with only 1 shared VC.
We have plotted the results with the different VC configurations on Figure 22. We only displayed the results for a buffer size of 2 flits, but the trend is similar with other sizes. To get an insight into the impact of reducing the number of VCs on delay bounds, we also computed, for each flow and for each n VC configuration, the relative increase of the worst-case delay bounds compared to the delay bound with 4 VCs, as follows: inc_n = (delay with n VCs − delay with 4 VCs) / (delay with 4 VCs).
The results are shown in Table IV. First, as we can notice from Fig. 22, all flows have delay bounds less than their periods (the shortest period is 40 ms); thus they remain schedulable. When we reduce the number of VCs, the computed delay bound either increases or remains the same for all flows. When mapping all the flows to the same priority level, two factors impact the end-to-end delay bound. First, more interference patterns will be possible, especially considering that the configuration with 4 VCs did not allow for any indirect blocking. Hence, additional delay will impact all the flows. Second, it is likely that highest priority flows will suffer from additional delay because the arbitration provides equal fairness to all flows. Conversely, the lower priority flows will suffer less or not at all from that redistributed fairness, and may even experience lower delays due to competition with other flows. In the considered case, however, the additional complexity of the indirect blocking prevails and no flow experiences a decrease of its end-to-end delay bound.
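The relative increase defined above can be computed per flow and then averaged, as in the following sketch. The function names and the delay values are hypothetical and only illustrate the arithmetic; they are not results from Table IV.

```python
from statistics import mean
from typing import Dict


def relative_increase(delay_n_vcs: float, delay_4_vcs: float) -> float:
    """inc_n = (delay with n VCs - delay with 4 VCs) / (delay with 4 VCs)."""
    return (delay_n_vcs - delay_4_vcs) / delay_4_vcs


def increases_per_flow(bounds_n: Dict[int, float],
                       bounds_4: Dict[int, float]) -> Dict[int, float]:
    """Relative increase of the delay bound for every flow present in both maps."""
    return {f: relative_increase(bounds_n[f], bounds_4[f])
            for f in bounds_4 if f in bounds_n}


# Hypothetical delay bounds (in cycles) for three flows.
bounds_4_vcs = {1: 100.0, 2: 200.0, 3: 400.0}
bounds_2_vcs = {1: 150.0, 2: 200.0, 3: 500.0}
inc = increases_per_flow(bounds_2_vcs, bounds_4_vcs)
average_inc = mean(inc.values())
```

A value of 0 means the bound is unchanged when VCs are removed, and a value of 1 means it has doubled.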
Table IV: Relative increase of the worst-case end-to-end delay bounds for B = 2, for the 2-VC and 1-VC configurations vs. the 4-VC configuration, for the G-BATA approach

Moreover, as shown in Table IV, the average bound increase stays reasonable (up to a few times the original bound) when the number of available VCs is divided by up to 4. Hence, G-BATA yields noticeable improvements to decrease the platform complexity (fewer Virtual Channels) while guaranteeing schedulability, in comparison to the state-of-the-art method in [9].
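The relative increase reported in Table IV and the schedulability check against flow periods can be sketched as below. The function names are hypothetical, and the delay values are made-up placeholders chosen only to exercise the code; they are not results from this paper. The check assumes, as in the discussion above, that a flow is schedulable when its delay bound is strictly less than its period.

```python
from typing import Dict


def relative_increase(delay_n_vcs: float, delay_4_vcs: float) -> float:
    """inc_n = (delay with n VCs - delay with 4 VCs) / delay with 4 VCs."""
    if delay_4_vcs <= 0:
        raise ValueError("reference delay bound must be positive")
    return (delay_n_vcs - delay_4_vcs) / delay_4_vcs


def average_increase(bounds_n: Dict[int, float], bounds_4: Dict[int, float]) -> float:
    """Mean of the per-flow relative increases over all flows."""
    incs = [relative_increase(bounds_n[f], bounds_4[f]) for f in bounds_4]
    return sum(incs) / len(incs)


def is_schedulable(bounds: Dict[int, float], periods: Dict[int, float]) -> bool:
    """Assumption: a flow is schedulable if its delay bound is below its period."""
    return all(bounds[f] < periods[f] for f in bounds)


# Placeholder values in ms (NOT taken from the paper), keyed by flow id.
example_bounds_4vc = {1: 2.0, 2: 4.0}
example_bounds_2vc = {1: 3.0, 2: 4.0}
example_periods = {1: 40.0, 2: 40.0}
```

For the placeholder data, flow 1 has a relative increase of 0.5 (50%), flow 2 has no increase, and both configurations remain schedulable under the 40 ms periods.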
Finally, we provide some insights into the runtime of G-BATA under different VC configurations. For each buffer size and number of VCs, we measured the runtime of G-BATA and summarized our results in Table V. We notice that runtimes with non-shared VCs are on the order of 10 to 100 times lower than runtimes with 1 and 2 VCs. This confirms our conclusions regarding the inherent complexity of G-BATA, shown in Section V-A, when enabling the priority-sharing and VC-sharing assumptions.
When no VC is shared between several flows, the IB latency is zero, and consequently, computing the end-to-end service curve is faster. On the contrary, when VCs are shared, there are (i) additional recursive calls to the end-to-end service curve function, needed to compute the IB latency; and (ii) a more complex interference graph to construct. Therefore, we can expect an increase in the analysis duration. In the studied cases, G-BATA still completes in a very reasonable time (one second or less).

VII. CONCLUSIONS AND PERSPECTIVES
Starting from the observation that bursty traffic is not covered by our previously published BATA model, we aimed at 4 [5] to model heterogeneous architectures and exhibited a CPQ scenario to explain the idea of our extension. Then, we proposed a new approach, G-BATA, improving the indirect blocking analysis based on dependency graphs to capture interference patterns involving CPQ. Following this, we adapted the indirect blocking latency computation to take into account the new way of modeling indirect blocking patterns, and consequently decreased the number of recursive calls needed to compute end-to-end service curves.
Finally, we evaluated our approach on several aspects: (i) we studied the sensitivity of the model to input parameters such as router buffer size, flow rate, and packet length. We found that increasing buffer size does not reduce the end-to-end delay bound, and that for a given flow rate, sending small packets is more worst-case-efficient than sending big packets. We also found that BATA bounds are generally more pessimistic than G-BATA bounds when flow rates increase; (ii) we evaluated the scalability of our approaches. We showed that BATA hardly scales beyond configurations of 50 to 100 flows, whereas G-BATA is able to analyze 800-flow configurations in a reasonable time (below 10 seconds per flow); (iii) we evaluated the tightness of the bound given by the model on a test case and achieved an average tightness ratio of 71%, without; (iv) we confronted our model with a realistic case study to further check the correctness of the bounds and the efficiency of G-BATA in comparison to a state-of-the-art approach.
In future work, we plan to focus on related problems such as Software/Hardware mapping. We would like to include our dependency graph-based approach in Design Space Exploration techniques for manycore platforms. This would also allow us to tackle more complex case studies and improve the system performance.