Exploring Multi-Reader Buffers and Channel Placement During Dataflow Network Mapping to Heterogeneous Many-Core Systems

This paper presents an approach for reducing the memory requirements of periodically executed dataflow applications while minimizing the period when deployed on a many-core target. Often, implementations of dataflow applications suffer from data duplication if identical data has to be processed by multiple actors. In fact, multi-cast (also called fork) actors can produce huge memory overheads when storing and communicating copies of the same data. As a remedy, so-called Multi-Reader Buffers (MRBs) can be utilized to forward identical data to multiple actors in a First In First Out (FIFO) manner while storing each data item only once. However, using MRBs may increase the achievable period due to contention when accessing the shared data. This paper proposes a novel multi-objective design space exploration approach that selectively replaces multi-cast actors with MRBs and explores actor and FIFO channel mappings to find trade-offs between the objectives of period, memory footprint, and core cost. In distinction to the state of the art, our approach (i) considers memory-size constraints for on-chip memories, (ii) considers hierarchical memories to implement the buffers, e.g., tile-local memories, (iii) supports heterogeneous many-core platforms, i.e., core-type dependent actor execution times, and (iv) optimizes the buffer placement and overall scheduling to minimize the execution period by proposing a novel combined actor and communication scheduling heuristic for period minimization called Communication-Aware Periodic Scheduling on Heterogeneous Many-core Systems (CAPS-HMS). Our results show that the explored Pareto fronts improve a hypervolume indicator over a reference approach by up to 66 % for small to mid-size applications and 90 % for large applications. Moreover, selectively replacing multi-cast actors with corresponding MRBs proves to be always superior to never or always replacing them.
Finally, it is shown that the quality of the explored Pareto fronts does not degrade when replacing the efficient scheduling heuristic CAPS-HMS by an Integer Linear Program (ILP) solver that requires orders of magnitude higher solver times and thus cannot be applied to large scale dataflow network problems.


I. INTRODUCTION
Modern many-core systems provide ample computational power due to a large number of available cores. To exploit them, applications should exhibit sufficient concurrency to fully utilize all cores. Imperative programming languages are often considered poorly suited for developing concurrent applications [1]. Hence, applications should be specified using a Model of Computation (MoC) that explicitly expresses concurrency, e.g., using a dataflow MoC [2], where an application is represented by a Dataflow Graph (DFG). DFG vertices represent actors, and edges represent First In First Out (FIFO) channels transmitting tokens.
Actors thereby specify an application's computations. The dynamics of a dataflow graph are given by the notion of firings. An actor is called enabled for firing (execution) if enough tokens have accumulated at its input channels. Per firing, it consumes tokens on its input channels and produces tokens at its outputs according to a set of firing rules. In so-called marked graphs [3], the firing rule is that at least one token must exist at each input of an actor for it to be enabled for firing. Per firing, the actor consumes exactly one token on each of its inputs and produces exactly one token at each of its outputs.
One application domain well suited to dataflow modeling is image processing. Generally, an image processing application consists of a graph of image processing filters, where each filter operates on its input and produces transformed image data at its outputs. Each filter of an image processing application can be naturally modeled by an actor.

Fig. 1. On the left, an application graph g_A consisting of a set of actors a_i ∈ A communicating over a set of FIFO channels c_j ∈ C is shown. Channel capacities in terms of tokens γ(c_j) are illustrated by white boxes. The token size in bytes ϕ(c_j) and the number of initial tokens δ(c_j), e.g., one initial token for channel c_1 (black dot), are also illustrated. On the right, a heterogeneous four-tile many-core architecture is modeled by an architecture graph g_R. Processor cores are denoted p_i, and tiles are denoted T_j. Each core p_i ∈ P can, in principle, access any core-local memory q_{p_j} ∈ Q_P, any tile-local memory q_{T_j} ∈ Q_T, as well as the global memory q_global. Dashed arcs represent mapping options from actors to cores and channels to memories. To exemplify, mapping edges are illustrated for the actors a_3 and a_5 as well as the channel c_4. However, to reduce visual clutter, only the resources of tile T_1 and the global memory are shown as targets for these mappings. In the proposed approach, actors can be mapped in principle (light red arcs) to all cores of a type that supports the execution of the actor, e.g., cores of type ϑ_1 for actor a_3 and cores of type ϑ_2 or ϑ_3 for actor a_4. In contrast, channels can generally be mapped (light green arcs) to any memory.
In order to map a dataflow graph such as exemplified in Fig. 1 (left) with explicit modeling of actors and channels onto a many-core target such as shown in Fig. 1 (right), the actors must be mapped to individual cores, and the channels must be mapped to memories of the target architecture [4,5]. Moreover, a schedule needs to be determined for the actor executions as well as the transport of data from and to the allocated channel memories so as to achieve a short period on the one hand while reducing the required memory footprint and core count on the other hand.
One problem of dataflow MoCs, however, is that they do not allow multiple actors to read data from the same channel. Instead, individual channels must be created for the multiple readers, which causes copying and data overheads by introducing so-called multi-cast actors. The only purpose of these multi-cast actors is to read the data from the producer and copy it to all consumer actors [6][7][8]. An example of such a multi-cast actor is the actor a_2 highlighted in Fig. 1 (left). Apart from a high memory footprint, these actors also typically cause a large amount of communication.
To avoid copying and the resulting data duplication, [9] recently introduced a concept called Multi-Reader Buffer (MRB), which is a channel that has one writer and multiple readers and behaves as if each reader had a dedicated channel carrying the very same data, but stores each token only once, e.g., see Fig. 2b. A minimal memory footprint for an application results when each multi-cast actor (and its connected FIFOs) is replaced by an MRB. But, as was stated in [9], this buffer replacement scheme may impact the minimum achievable period. Unfortunately, the approach presented in [9] is also limited in (i) being restricted to simple bus-based homogeneous architectures and in (ii) ignoring the capacities of the on-chip memories. Often, many-core systems, particularly Multi-Processor Systems-on-a-Chip (MP-SoCs), are composed of different core types and have dozens of cores with constrained on-chip memories connected via a hierarchical interconnect, e.g., see Fig. 1 (right).
Coping with these deficiencies, this paper contributes a multi-objective Design Space Exploration (DSE) approach that (i) considers memory-size constraints for all on-chip memories, (ii) explicitly models memory hierarchies, (iii) supports heterogeneous many-core platforms, i.e., core-type dependent actor execution times, and (iv) optimizes the buffer placement and overall scheduling to minimize the period (i.e., maximizing application throughput) by proposing a novel modulo scheduling-based heuristic named Communication-Aware Periodic Scheduling on Heterogeneous Many-core Systems (CAPS-HMS). CAPS-HMS periodically schedules actors on cores and read/write operations on hierarchical interconnect topologies in a very efficient way. To analyze the trade-off between memory footprint and period, our exploration also selectively replaces multi-cast actors by MRBs and explores actor and FIFO channel mappings on top of finding a periodic actor and memory access schedule. In addition to the minimization of the memory footprint and the period, the (weighted) number of allocated cores is also minimized.
The paper is structured as follows: Section II presents the fundamental models of applications and architectures and introduces the notion of multi-cast actors, all forming a so-called specification graph that serves as input to our optimization problem. Then, Section III introduces the design space of selective MRB replacements, actor and channel bindings, as well as actor and communication scheduling. In Section IV, a population-based DSE approach is presented in which a Multi-Objective Evolutionary Algorithm (MOEA) is used to explore the design space to find Pareto-optimal implementations. Section V presents two alternative approaches for the combined periodic actor and buffer access scheduling problem: (i) an exact formulation based on an Integer Linear Program (ILP) and (ii) our novel scheduling heuristic called CAPS-HMS for the combined periodic scheduling of actors and communications between actors. Even though the ILP modulo scheduling approach performs well in terms of solution times for small to mid-sized applications, the CAPS-HMS heuristic provides superior results when tackling large applications because the ILP runs into timeouts, or solution times would become prohibitively long.
To evaluate the overall approach, experimental results are reported in Section VI for three applications in terms of the quality of the found non-dominated sets of solutions. It is shown that, for both CAPS-HMS and the ILP alternative, improvements in the quality of found solutions can be achieved when selectively replacing the multi-cast actors with MRBs when exploring the mappings of a dataflow streaming application to heterogeneous many-core architectures. In particular, for small to mid-size applications, the reported improvements range from 28 % to 66 % in the hypervolume score. For the largest benchmark application, the reported improvement is even 90 % in the same hypervolume score. Moreover, when comparing the Pareto front quality of the CAPS-HMS heuristic against the front achieved using the exact ILP approach, the observed degradation of CAPS-HMS turns out to be minor for the small and mid-sized applications used in the experiments: CAPS-HMS is inferior by just 7 % in terms of hypervolume compared to the ILP. For large applications and with increasing complexity of the target architecture, however, the ILP solution times become prohibitively long. There, the fast CAPS-HMS outperforms the ILP by 67 % in hypervolume for our large test application. Section VII presents related work, and Section VIII concludes the paper.

II. FUNDAMENTALS
The problem of mapping applications to many-core targets is often described by a specification graph [5,10,11] composed of (i) an application graph, (ii) an architecture graph, and (iii) a set of mapping edges that will be explained in the following.

A. APPLICATION GRAPH
An application is modeled as a bipartite graph of actors A and channels C connected by edges E, called an application graph g_A. According to the assumed marked graph semantics, each actor consumes exactly one token from each input channel and produces one token on each output channel upon firing.

B. MULTI-CAST ACTORS
Generally, an application graph might contain so-called multi-cast actors a_m ∈ A_M ⊂ A, e.g., actor a_2 ∈ A_M in Fig. 2a.
In each firing of a multi-cast actor a_m, one token is consumed from its input channel c_in, and for each output channel c_out, one token is produced containing copied data of the consumed token. To exemplify, actor a_2 copies each token consumed from input channel c_1 to actors a_3 and a_4 by producing for each output channel c_out ∈ {c_2, c_3} a token containing identical data. Formally, each multi-cast actor a_m ∈ A_M has exactly one input channel and multiple output channels, as specified in Eq. (1). The size of the tokens contained in the input and output channels must be identical (see Eq. (2)), e.g., ϕ(c_3) = ϕ(c_2) = ϕ(c_1). Finally, there must not be any initial tokens in the output channels, and the channel capacity of all output channels must be identical (see Eq. (3)), e.g., δ(c_2) = δ(c_3) = 0 and γ(c_2) = γ(c_3). Each multi-cast actor represents a memory footprint reduction opportunity by replacing it and its adjacent channels with an MRB, as shown and explained in the caption of Fig. 2.
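The three structural conditions on multi-cast actors can be sketched as a small check. This is only an illustration: the function name, the dictionary-based encoding of ϕ, δ, and γ, and the concrete numbers are assumptions and not taken from the paper, and "multiple output channels" is read here as at least two.

```python
def is_multicast_actor(inputs, outputs, phi, delta, gamma):
    """Check the conditions stated for a multi-cast actor (assumed encoding)."""
    if len(inputs) != 1 or len(outputs) < 2:
        return False  # exactly one input channel, multiple output channels
    c_in = inputs[0]
    same_token_size = all(phi[c] == phi[c_in] for c in outputs)
    no_initial_tokens = all(delta[c] == 0 for c in outputs)
    same_capacity = len({gamma[c] for c in outputs}) == 1
    return same_token_size and no_initial_tokens and same_capacity


# Hypothetical annotations for actor a_2 (values chosen for illustration only).
phi = {"c1": 4, "c2": 4, "c3": 4}      # token sizes in bytes
delta = {"c1": 1, "c2": 0, "c3": 0}    # initial tokens
gamma = {"c1": 1, "c2": 3, "c3": 3}    # channel capacities in tokens
a2_is_multicast = is_multicast_actor(["c1"], ["c2", "c3"], phi, delta, gamma)
```

Only actors passing such a check are candidates for replacement by an MRB.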

C. MULTI-READER BUFFER REALIZATION
A concept for unifying multiple FIFOs carrying identical data was first introduced by [12] (there called broadcast-FIFO). However, the proposed broadcast-FIFO has slightly altered semantics compared to the behavior of the multiple point-to-point FIFO channels it replaces. A concept preserving FIFO semantics called Multi-Reader Buffer (MRB) has then been presented in [9], which will be explained in the following.
By definition, an MRB c_m has one writer a_w and multiple readers a_r ∈ {a | (c_m, a) ∈ E}. For the example shown in Fig. 2b, the MRB c_{1,2,3} has the writer a_1, and the actors a_3 and a_4 are its readers. An MRB has a write index ω_{c_m} ∈ {0, 1, ..., γ(c_m) − 1} that indicates the next position in c_m's buffer to be filled with the next token produced by the writer. Moreover, for each reader a_r, there is a read index ρ_{c_m,a_r} ∈ {−1, 0, 1, ..., γ(c_m) − 1} that indicates the position in c_m's buffer from which the reader will consume the next token. The special value −1 denotes that c_m is empty from a_r's perspective.
Then, the number of available tokens T(c_m, a_r) from the perspective of a reader a_r and the number of free places F(c_m) in c_m from the perspective of the writer a_w can be determined as follows:
\[ T(c_m, a_r) = \begin{cases} 0 & \text{if } \rho_{c_m,a_r} = -1 \\ ((\omega_{c_m} - \rho_{c_m,a_r} - 1) \bmod \gamma(c_m)) + 1 & \text{otherwise} \end{cases} \]
\[ F(c_m) = \gamma(c_m) - \max_{a_r} T(c_m, a_r) \]
It is worth noting that the MRB realization presented here can support even multi-rate dataflow. To demonstrate this, assume that the writer a_w produces ψ(a_w) tokens and a reader a_r consumes κ(a_r) tokens upon firing. Naturally, the writer a_w can only fire when F(c_m) ≥ ψ(a_w) holds. Similarly, T(c_m, a_r) ≥ κ(a_r) must be satisfied for a reader a_r to fire.
When firing actor a_w, each read index ρ_{c_m,a_r} with value −1 (i.e., indicating that the MRB is empty from a_r's perspective) is set to the value ω_{c_m} (Eq. (4)). Then, Eq. (5) is applied, which advances the write index ω_{c_m} by the number of produced tokens (modulo the capacity γ(c_m)).
Accordingly, upon each firing of a reader a_r, the corresponding read index ρ_{c_m,a_r} is advanced by the number of consumed tokens (modulo γ(c_m)) and set to −1 if it thereby reaches the write index ω_{c_m}, i.e., if the MRB has become empty from a_r's perspective. To exemplify, the MRB's read and write indices after various firings of the connected actors a_1, a_3, and a_4 are depicted in Fig. 3. There, actors a_1, a_3, and a_4 are, respectively, associated with the write index ω_{c_{1,2,3}} and the read indices ρ_{c_{1,2,3},a_3} and ρ_{c_{1,2,3},a_4}.
Next, assume actor a_1 fires three times, resulting in the state shown in Fig. 3b. There, the write index ω_{c_{1,2,3}} has advanced to 3, pointing to the next free place in the MRB's buffer. The read indices ρ_{c_{1,2,3},a_3} and ρ_{c_{1,2,3},a_4} have been updated during the first firing of actor a_1 from −1 to 0, pointing to the first token contained in the MRB. At this point (see Fig. 3b), we can also perform read operations. Before firing a reader, we need to verify that sufficient tokens exist to be consumed by the reader. For instance, we are able to fire actor a_3 because T(c_{1,2,3}, a_3) = ((3 − 0 − 1) mod 4) + 1 = 3 ≥ 1. After firing the sequence a_3, a_3, a_3, a_1, the resulting state is shown in Fig. 3c. There, the readers track different information about the state of the MRB. The reader a_3 points to ρ_{c_{1,2,3},a_3} = 3 and observes T(c_{1,2,3}, a_3) = ((0 − 3 − 1) mod 4) + 1 = 1 token, whereas reader a_4 points to ρ_{c_{1,2,3},a_4} = 0 and observes T(c_{1,2,3}, a_4) = ((0 − 0 − 1) mod 4) + 1 = 4 tokens. From the perspective of the writer a_1, the MRB is full. At this point (see Fig. 3c), let the firing sequence a_4, a_3 be observed.
The resulting state of the MRB is shown in Fig. 3d. From the perspective of a_3, the MRB is empty, i.e., ρ_{c_{1,2,3},a_3} is −1.
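The index bookkeeping walked through above can be replayed with a minimal sketch of an MRB. The class and method names are hypothetical, and the read-index update rule (advance modulo the capacity, reset to −1 on reaching the write index) and the free-place computation (capacity minus the token count of the slowest reader) are inferred from the worked example rather than copied from [9].

```python
class MultiReaderBuffer:
    """Index bookkeeping of an MRB: one write index, one read index per reader."""

    def __init__(self, capacity, readers):
        self.capacity = capacity                     # gamma(c_m)
        self.write_index = 0                         # omega_{c_m}
        self.read_index = {r: -1 for r in readers}   # rho_{c_m,a_r}; -1 = empty

    def available(self, reader):
        """T(c_m, a_r): number of tokens visible to one reader."""
        rho = self.read_index[reader]
        if rho == -1:
            return 0
        return (self.write_index - rho - 1) % self.capacity + 1

    def free(self):
        """F(c_m): free places seen by the writer (assumed: slowest reader decides)."""
        return self.capacity - max(self.available(r) for r in self.read_index)

    def write(self, produced=1):
        if self.free() < produced:
            raise RuntimeError("writer is not enabled")
        for reader, rho in self.read_index.items():
            if rho == -1:                            # reader saw an empty buffer
                self.read_index[reader] = self.write_index
        self.write_index = (self.write_index + produced) % self.capacity

    def read(self, reader, consumed=1):
        if self.available(reader) < consumed:
            raise RuntimeError("reader is not enabled")
        rho = (self.read_index[reader] + consumed) % self.capacity
        # Reaching the write index means the reader has consumed everything.
        self.read_index[reader] = -1 if rho == self.write_index else rho


mrb = MultiReaderBuffer(capacity=4, readers=["a3", "a4"])
for _ in range(3):
    mrb.write()                 # a_1 fires three times (state of Fig. 3b)
for _ in range(3):
    mrb.read("a3")              # a_3 fires three times
mrb.write()                     # a_1 fires once more (state of Fig. 3c)
tokens_fig_3c = (mrb.available("a3"), mrb.available("a4"), mrb.free())
mrb.read("a4")
mrb.read("a3")                  # state of Fig. 3d
```

Only the indices are modeled here; a real implementation would additionally hold the token storage and synchronize concurrent accesses.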

D. ARCHITECTURE GRAPH
A heterogeneous many-core target architecture, e.g., as depicted in the right part of Fig. 1, can be modeled formally by an abstract architecture graph:
Definition 2.2 (Architecture Graph): An architecture graph g_R is a tuple (R, L) composed of a set of vertices R modeling hardware resources and a set of edges L ⊆ R × R denoting communication links between resources.
Here, the set of vertices R = P ∪ Q ∪ H represents the resources of the architecture where each p ∈ P denotes a core, each q ∈ Q a memory, and each h ∈ H an interconnect. The set of cores P is partitioned into sets P_{ϑ_1}, P_{ϑ_2}, ..., P_{ϑ_|Θ|}. Each set P_ϑ describes the set of cores of identical core type ϑ ∈ Θ.
The set of memory resources Q = Q_P ∪ Q_T ∪ {q_global} can be partitioned into core-local memories (q_{p_i} ∈ Q_P), tile-local memories (q_{T_j} ∈ Q_T), and the global memory (q_global). Each core p_i ∈ P has a core-local memory q_{p_i} reachable via a link (p_i, q_{p_i}) ∈ L. Each memory q ∈ Q has a capacity W_q, which denotes the number of bytes that can be stored in the memory.
The set of interconnects H is partitioned into the Network-on-Chip (NoC) (h_NoC ∈ H) and a set of crossbars h_T ∈ H_T = H \ {h_NoC}. Each interconnect h ∈ H is annotated with its bandwidth B_h, which is used to calculate data transfer delays. The time required to transport η bytes of data over a crossbar h_T can be calculated as η/B_{h_T}.
Resources of a given architecture, excluding the NoC and the global memory (q_global), i.e., processors, core-local memories, tile-local memories, and crossbars, are organized as a set of tiles T. Each tile T ∈ T consists of a set of cores and their core-local memories, a tile-local memory, and a tile crossbar connecting the cores and memories of the tile. As each resource belongs to exactly one tile, tiles are (i) disjoint, i.e., ∀T_i, T_j ∈ T : T_i ∩ T_j = ∅ where i ≠ j, and (ii) covering, i.e., ⋃_{T ∈ T} T = R \ {h_NoC, q_global}. Intra-tile communication is provided by links connecting each core and memory of the respective tile via the tile crossbar. To exemplify, consider the tile T_1 presented in Fig. 1. It is composed of six cores {p_1, ..., p_6}, six core-local memories {q_{p_1}, ..., q_{p_6}}, the tile-local memory q_{T_1}, and the tile crossbar h_{T_1}. Each core in tile T_1 has an exclusive communication link with its corresponding core-local memory, i.e., there exists a link (p_i, q_{p_i}) that connects core p_i with its memory q_{p_i}. Moreover, each memory of the tile can be reached via the tile crossbar h_{T_1}. If core p_1 sends data to p_4, such data will traverse the tile crossbar h_{T_1} via the links (p_1, h_{T_1}) and (h_{T_1}, q_{p_4}) to be stored in the core-local memory q_{p_4} of core p_4.
For inter-tile communication, links are provided that connect each tile to the NoC (h_NoC), which in turn is connected to the global memory (q_global).
The set of resources involved in a data transfer between a core p and a memory q will be denoted by a routing function R : P × Q → P(R), where P(R) denotes the power set of R, as explained in the following.
In the simplest case, a data transfer happens between a core p_i and its local memory q_{p_i}. Then, no interconnect resources are involved, i.e., R(p_i, q_{p_i}) = {p_i, q_{p_i}}.
Else, if the core p and the memory q share the same tile (∃T_j ∈ T : p, q ∈ T_j), an intra-tile data transfer is performed. In this case, the data transfer only traverses the tile crossbar h_{T_j}, i.e., R(p, q) = {p, h_{T_j}, q}.
Otherwise, if the core p and the memory q are located in different tiles, i.e., p ∈ T_j, q ∈ T_k, and T_j ≠ T_k, an inter-tile transfer is needed. Then, the data needs to travel over the tile crossbar h_{T_j} of the tile containing the core p, the NoC interconnect h_NoC, and the tile crossbar h_{T_k} of the tile containing the memory q, i.e., R(p, q) = {p, h_{T_j}, h_NoC, h_{T_k}, q}.
In all other cases, the global memory is used, and the involved interconnect resources are the tile crossbar h_{T_j} and the NoC, i.e., R(p, q_global) = {p, h_{T_j}, h_NoC, q_global}.
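The four routing cases can be summarized in a short sketch. The string identifiers, the two lookup tables, and the small two-tile excerpt are hypothetical choices made for illustration; only the case distinction itself follows the text.

```python
# Hypothetical excerpt of a two-tile architecture (names chosen for illustration).
tile_of = {
    "p1": "T1", "p4": "T1", "q_p1": "T1", "q_p4": "T1", "q_T1": "T1",
    "p7": "T2", "q_p7": "T2", "q_T2": "T2",
}
local_memory_of = {"p1": "q_p1", "p4": "q_p4", "p7": "q_p7"}


def route(core, memory):
    """Return the set of resources R(p, q) involved in a transfer."""
    if local_memory_of[core] == memory:
        return {core, memory}                        # core-local access
    core_xbar = "h_" + tile_of[core]
    if memory == "q_global":
        return {core, core_xbar, "h_NoC", memory}    # global memory access
    if tile_of[memory] == tile_of[core]:
        return {core, core_xbar, memory}             # intra-tile transfer
    mem_xbar = "h_" + tile_of[memory]
    return {core, core_xbar, "h_NoC", mem_xbar, memory}  # inter-tile transfer


intra_tile = route("p1", "q_p4")   # data sent from p_1 to the memory of p_4
```

Returning a set rather than an ordered path mirrors the definition of R, which only records which resources are occupied by a transfer.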

E. SPECIFICATION GRAPH
To perform explorations of allocations and mappings of actors to cores as well as channels to memories, a specification finally contains a set of mapping edges M = M_A ∪ M_C, composed of a set of potential mappings M_A of actors to cores and a set of potential mappings M_C = C × Q of channels to memories. These mapping edges specify that every memory can store each channel and that each actor a can be mapped to every core p ∈ P_ϑ of a type ϑ that can execute the actor a. With these definitions, a specification graph can be defined as follows:
Definition 2.3 (Specification Graph): A specification graph g_S is a tuple (V_S, E_S) composed of a set of vertices V_S and a set of edges E_S. The set of vertices V_S = A ∪ C ∪ R is formed from the union of vertices of the application graph g_A and the architecture graph g_R. Similarly, the set of edges E_S = E ∪ L ∪ M is formed from the union of edges of both graphs and the set of mapping edges.
Figure 1 illustrates an example of an application graph, an architecture graph, and an exemplified set of actor-to-core and channel-to-memory mappings.

III. DEFINITION OF THE DESIGN SPACE
This section introduces the design space of selective MRB replacements, formalizes the concept of actor and channel bindings, and illustrates the principles of actor and communication scheduling.

A. SELECTIVE MRB REPLACEMENT
As discussed in Section II-B, each multi-cast actor represents an opportunity for memory footprint reduction by replacing it and its adjacent channels with an MRB, as shown in Fig. 2. However, replacing a multi-cast actor with an MRB may also lead to an increase in the execution period [9], which is defined as the time interval between two successive iterations of execution of a given application graph. Hence, which multi-cast actors are replaced by MRBs needs to be explored to trade between period and memory footprint, both to be minimized. For this purpose, we define a multi-cast actor replacement function ξ that decides for each multi-cast actor a_m ∈ A_M whether it is replaced by an MRB. Formally, the replacement of selected multi-cast actors with MRBs for a given application graph g_A (e.g., as illustrated in Fig. 2a) can be realized by a graph transformation as detailed in Algorithm 1, leading to a transformed application graph g_Ã (e.g., as shown in Fig. 2b), where the selected multi-cast actors and the channels connected to them have been replaced by their corresponding MRBs.

B. ACTOR AND CHANNEL BINDINGS
Next, determining an implementation of a transformed application graph g_Ã on an architecture requires a binding (i) of each actor to a processor, which is described by a set β_A ⊆ M_A called actor bindings, and (ii) of each channel to a memory, which is described by a set β_C ⊆ M_C called channel bindings. Moreover, each actor must be bound to exactly one core (see Eq. (6)) and each channel to exactly one memory (see Eq. (7)). Finally, the channels bound to a memory q ∈ Q must not exceed its capacity (see Eq. (8)). A set of feasible bindings β = β_A ∪ β_C must satisfy Eqs. (6) to (8).
2 Remember, τ(a, ϑ) = ⊥ denotes that an actor a cannot be mapped to a particular core type ϑ.
The number of cores α(ϑ) allocated for a given core type ϑ can then be implicitly derived from the actor bindings β_A; together, these numbers form the allocation α.
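The feasibility conditions on bindings and the derived allocation can be sketched as follows. The function names, the encoding of bindings as sets of pairs, the numeric values, and the core types are assumptions for illustration; in particular, the footprint of a channel is assumed here to be its capacity in tokens times its token size, γ(c) · ϕ(c).

```python
from collections import Counter


def bound_exactly_once(items, binding):
    """Each item must occur in exactly one (item, resource) pair."""
    counts = Counter(item for item, _ in binding)
    return all(counts[item] == 1 for item in items)


def is_feasible(actors, channels, beta_a, beta_c, gamma, phi, capacity):
    """Unique actor and channel bindings plus memory capacity constraints."""
    usage = Counter()
    for channel, memory in beta_c:
        usage[memory] += gamma[channel] * phi[channel]   # assumed footprint in bytes
    return (bound_exactly_once(actors, beta_a)
            and bound_exactly_once(channels, beta_c)
            and all(usage[q] <= capacity[q] for q in usage))


def allocation(beta_a, core_type):
    """alpha(theta): number of cores of each type used by at least one actor."""
    used_cores = {core for _, core in beta_a}
    return Counter(core_type[core] for core in used_cores)


# Bindings of the running example; all numeric values and core types are hypothetical.
beta_a = {("a3", "p1"), ("a4", "p2"), ("a1", "p3"), ("a5", "p3")}
beta_c = {("c4", "q_p1"), ("c5", "q_p2"), ("c_m", "q_p3")}
gamma = {"c4": 1, "c5": 1, "c_m": 4}
phi = {"c4": 4, "c5": 4, "c_m": 4}
capacity = {"q_p1": 16, "q_p2": 16, "q_p3": 16}
feasible = is_feasible({"a1", "a3", "a4", "a5"}, {"c4", "c5", "c_m"},
                       beta_a, beta_c, gamma, phi, capacity)
alpha = allocation(beta_a, {"p1": "theta1", "p2": "theta1", "p3": "theta2"})
```

Note that two actors sharing core p_3 contribute only one allocated core, which is why the allocation is derived from the set of used cores.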
While, in principle, each channel c ∈ C can be bound to any memory q ∈ Q, it makes sense to constrain the design space to be explored such that a channel will not be bound to a core-local memory of a core that does not access the channel data at all. Similarly, tile-local memories of tiles containing no core accessing the channel data can also be excluded. As a result, only five binding alternatives exist for each channel: (PROD) the core-local memory q_{p_prod} of the core p_prod producing the data, (TILE-PROD) the tile-local memory q_{T_prod} of the tile T_prod containing the core producing the data, (CONS) the core-local memory q_{p_cons} of the core p_cons consuming the data, (TILE-CONS) the tile-local memory q_{T_cons} of the tile T_cons containing the core consuming the data, or (GLOBAL) the global memory q_global. In the following, these five options are represented by a channel decision function C_d : C → {GLOBAL, TILE-PROD, TILE-CONS, PROD, CONS}, which shall be explored rather than exploring channel bindings directly. Concrete channel bindings β_C can then be determined via Algorithm 2 from the channel decisions, channel capacities, and actor bindings in such a way that Eqs. (7) and (8) are satisfied. Algorithm 2 determines for each channel c ∈ C a concrete binding according to the channel decision C_d(c) in case memory capacities W_q are not exceeded. Otherwise, a fallback solution is determined according to the case statements. A feasible binding is always found for each channel c ∈ C because c can, as a last resort, be bound to the global memory q_global, which is assumed to be large enough to store all the buffer data related to the channels of a given application.

Fig. 4. Example of a schedule with period P = 8 time steps (right) for the transformed application graph g_Ã from Fig. 2b where actors a_1 and a_5 are bound to core p_3, actor a_3 and channel c_4 are bound to core p_1 and its core-local memory q_{p_1}, as well as actor a_4 and channel c_5 are bound to core p_2 and its core-local memory q_{p_2}. For better visualization, we use c_m to refer to the MRB c_{1,2,3}, which is bound to the core-local memory q_{p_3} of core p_3 (left). The light red boxes in the Gantt chart shown to the right denote actor executions, while the light green boxes denote read operations, e.g., the light green box containing (c_4, a_5) denotes a read of a token contained in channel c_4 by the actor a_5. The data dependencies of the application graph g_Ã are depicted by the solid and dotted-dashed directed edges in the Gantt chart. For example, the solid directed edge from actor a_1 over the read communication (c_m, a_4) to actor a_4 represents the data dependency between actors a_1 and a_4 communicated via the MRB c_m. The Gantt chart does not depict the corresponding write (a_1, c_m) as the MRB c_m is bound to the core-local memory q_{p_3} of the core p_3, to which actor a_1 is bound. Thus, the write communication is assumed to be part of the execution of actor a_1 itself.
Algorithm 2 derives the channel bindings. For the running example in Fig. 4, we obtain β_C = {(c_4, q_{p_1}), (c_5, q_{p_2}), (c_m, q_{p_3})} from the channel decisions and the actor bindings β_A = {(a_3, p_1), (a_4, p_2), (a_1, p_3), (a_5, p_3)}. Algorithm 2 thereby prefers to bind channels to core-local memories. If the core-local memory (q_{p_3} in the running example) did not have sufficient capacity to accommodate the MRB channel c_m, Algorithm 2 would bind c_m to the tile-local memory q_{T_1}, and if q_{T_1} also had insufficient capacity, the channel c_m would finally be bound to the global memory q_global.
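The fallback behavior described for Algorithm 2 can be illustrated with a simplified sketch. It is not the paper's Algorithm 2: the function name, the table of fallback chains, the processing order of channels, and the byte sizes are assumptions, and the candidate memory for each option is assumed to be precomputed from the actor bindings.

```python
# Fallback chains as described in the text: core-local, then tile-local, then global.
FALLBACK = {
    "PROD": ["PROD", "TILE-PROD", "GLOBAL"],
    "TILE-PROD": ["TILE-PROD", "GLOBAL"],
    "CONS": ["CONS", "TILE-CONS", "GLOBAL"],
    "TILE-CONS": ["TILE-CONS", "GLOBAL"],
    "GLOBAL": ["GLOBAL"],
}


def bind_channels(decisions, candidates, size, capacity):
    """Derive channel bindings from channel decisions (simplified sketch).

    decisions:  channel -> decision, e.g. "PROD"
    candidates: channel -> {option: memory} (assumed precomputed from actor bindings)
    size:       channel -> required bytes
    capacity:   memory -> available bytes (global memory assumed unbounded)
    """
    free = dict(capacity)
    binding = {}
    for channel, decision in decisions.items():
        for option in FALLBACK[decision]:
            memory = candidates[channel][option]
            if option == "GLOBAL":
                binding[channel] = memory        # always fits by assumption
                break
            if size[channel] <= free[memory]:
                free[memory] -= size[channel]    # reserve the space
                binding[channel] = memory
                break
    return binding


candidates = {"c_m": {"PROD": "q_p3", "TILE-PROD": "q_T1", "GLOBAL": "q_global"}}
fits_locally = bind_channels({"c_m": "PROD"}, candidates, {"c_m": 64},
                             {"q_p3": 128, "q_T1": 256})
falls_back = bind_channels({"c_m": "PROD"}, candidates, {"c_m": 64},
                           {"q_p3": 32, "q_T1": 256})
```

Because channels are processed one after another, earlier channels may occupy a memory that a later channel would have preferred; how the original algorithm orders channels is not stated in the text.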

C. PERIODIC SCHEDULING OF ACTORS AND COMMUNICATION
In the following, we consider the optimization and generation of static periodic schedules with an assumed uninterrupted execution of actors and communications. We also assume that each actor executes on the same core for each iteration of the dataflow graph. As it is assumed that the underlying DFG of a given application graph g_Ã is a marked graph [3], for each actor a ∈ g_Ã.A, read (c, a) ∈ g_Ã.E, as well as write (a, c) ∈ g_Ã.E operation, we need to determine exactly one start time s_a, s_(c,a), and s_(a,c), respectively, which repeats with the period P. Thus, the actors and edges of the application graph g_Ã together define the set of tasks to be scheduled, i.e., t ∈ T = g_Ã.A ∪ g_Ã.E.
For example, consider the schedule with a period of P = 7 depicted in Fig. 5 with actor start times as follows: s_{a_1} = 0, s_{a_2} = 1, s_{a_3} = 3, s_{a_4} = 4, and s_{a_5} = 13. Note that the start time of actor a_5 is greater than the period. Therefore, the firing of actor a_5 depicted in the schedule at time step 6 belongs to the previous iteration. Naturally, start times also need to be determined for the read and write operations, e.g., the write and read operations shown in the schedule. The read and write operations with assumed zero communication time (i.e., read and write operations not involving any interconnect resource), which are not depicted in the schedule, have the following start times: s_(a_1,c_1) = s_(c_1,a_2) = 1 (i.e., after actor a_1 has finished and before actor a_2 starts), s_(c_2,a_3) = 3 (i.e., before actor a_3 starts), s_(c_3,a_4) = 4 (i.e., before actor a_4 starts), s_(a_3,c_4) = 10 (i.e., after actor a_3 finishes), and s_(a_4,c_5) = 11 (i.e., after actor a_4 finishes). Furthermore, for each actor a, its execution time is denoted by τ_a, derivable from the actor bindings β_A as follows:
\[ \tau_a = \tau(a, \vartheta) \text{ where } \vartheta \in \Theta \text{ such that } \beta_A(a) \in P_\vartheta \qquad (10) \]
For example, the actor execution times τ_{a_1} = τ_{a_2} = τ_{a_5} = 1 and τ_{a_3} = τ_{a_4} = 7 correspond to those depicted in the schedule shown in Fig. 5.
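How start times wrap around the period can be made concrete with a small sketch. The helper names are hypothetical and integer time steps are assumed; the start times and execution times are those given in the example above.

```python
def occupied_slots(start, duration, period):
    """Time steps (modulo the period) during which a task occupies its resource."""
    return {(start + k) % period for k in range(duration)}


def conflict_free(tasks, period):
    """tasks: iterable of (start, duration) pairs sharing one resource."""
    seen = set()
    for start, duration in tasks:
        slots = occupied_slots(start, duration, period)
        if duration > period or slots & seen:
            return False
        seen |= slots
    return True


P = 7
slot_a5 = 13 % P                  # position of a_5 inside the periodic Gantt chart
iteration_lag_a5 = 13 // P        # the depicted firing belongs to an earlier iteration
actors_on_p3 = [(0, 1), (1, 1), (13, 1)]   # (start, execution time) of a_1, a_2, a_5
p3_ok = conflict_free(actors_on_p3, P)
slots_a3 = occupied_slots(3, 7, P)         # a_3 keeps core p_1 busy for a full period
```

Such a modulo check only covers resource conflicts; data dependencies between tasks have to be verified separately.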
The time required for one token to be read from, respectively, written to channel c by actor a is denoted by τ_(c,a) and τ_(a,c). In the following, these times are derived from the token size ϕ(c) and the interconnect bandwidth B_h of the interconnect h with the minimal bandwidth that is traversed by the communication. As a consequence, read and write operations that do not traverse at least one interconnect resource have zero communication time, e.g., τ_(a_1,c_1) = τ_(c_1,a_2) = τ_(c_2,a_3) = τ_(a_3,c_4) = τ_(c_3,a_4) = τ_(a_4,c_5) = 0 for the actor and channel bindings given in Fig. 5. Such communication operations directly access a core-local memory q_{p_i} from the corresponding core p_i. In this case, the communication is assumed to be part of the execution of the actor performing the read or write operation. In other cases, the traversed interconnect resource h with the minimal bandwidth B_h leads to a non-zero communication time. In the example above, τ_(a_2,c_2) = τ_(a_2,c_3) = τ_(c_4,a_5) = τ_(c_5,a_5) = 1, as visualized in the schedule in Fig. 5.
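A minimal sketch of this derivation is given below. The function name and the concrete token size and bandwidth values are hypothetical, and the result is left unrounded because the text does not state whether communication times are rounded to whole time steps.

```python
def communication_time(token_bytes, route, bandwidth):
    """Token size divided by the minimal bandwidth among traversed interconnects.

    route:     set of resources involved in the transfer
    bandwidth: interconnect -> bytes per time step (cores and memories are absent)
    """
    traversed = [bandwidth[r] for r in route if r in bandwidth]
    if not traversed:
        return 0                      # core-local access: part of the actor execution
    return token_bytes / min(traversed)


# Hypothetical values: 4-byte tokens, crossbar faster than or equal to the NoC.
bandwidth = {"h_T1": 4, "h_T2": 4, "h_NoC": 2}
local_access = communication_time(4, {"p3", "q_p3"}, bandwidth)
intra_tile = communication_time(4, {"p3", "h_T1", "q_p1"}, bandwidth)
inter_tile = communication_time(4, {"p3", "h_T1", "h_NoC", "h_T2", "q_p7"}, bandwidth)
```

With these assumed values, an intra-tile transfer of one token takes one time step, while the slower NoC dominates the inter-tile transfer.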
Finally, let A_r and T_r denote the set of actors and tasks, respectively, mapped to a resource r. Formally, A_r can be derived from the set of actor bindings β_A as follows:
\[ A_r = \{ a \mid (a, r) \in \beta_A \} \]
For a fully formal definition of T_r, we extend the domain of the routing function R to also contain all edges e ∈ g_Ã.E of the application graph g_Ã. Given the bindings β_A and β_C, let the set of resources involved in a write operation e = (a, c) or read operation e = (c, a) be denoted by
\[ R(e) = R(\beta_A(a), \beta_C(c)) \]
With this extension, T_r is given by:
\[ T_r = A_r \cup \{ e \in g_{\tilde{A}}.E \mid r \in R(e) \} \]
For example, the set of all actors bound to core p_3, as shown in Fig. 5, is given by A_{p_3} = {a_1, a_2, a_5}. Including read and write operations executed by core p_3 results in the set T_{p_3} = {a_1, a_2, a_5, (a_1, c_1), (c_1, a_2), (a_2, c_2), (a_2, c_3), (c_4, a_5), (c_5, a_5)}. Note that the write (a_1, c_1) and the read (c_1, a_2) are not shown in the schedule depicted in Fig. 5, as these are assumed to have zero communication times. Moreover, read and write operations are, in general, bound to multiple resources, as they are bound to the core where the data is produced or consumed as well as all traversed interconnect resources, e.g., the read and write operations (a_2, c_2), (a_2, c_3), (c_4, a_5), and (c_5, a_5) are not only executed by core p_3 but are also traversing the interconnect h_{T_1}, i.e., T_{h_{T_1}} = {(a_2, c_2), (a_2, c_3), (c_4, a_5), (c_5, a_5)}.
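The task sets of the example can be reproduced with a short sketch. The function name and the encoding of edges as tuples are assumptions; the routes are written out by hand for the bindings described for Fig. 5 instead of being computed by a routing function.

```python
def tasks_on(resource, actor_binding, edge_routes):
    """T_r: actors bound to r plus all read/write edges whose route contains r."""
    actors = {a for a, core in actor_binding.items() if core == resource}
    edges = {e for e, route in edge_routes.items() if resource in route}
    return actors | edges


# Bindings of Fig. 5: a_1, a_2, a_5 on p_3; a_3 on p_1; a_4 on p_2.
actor_binding = {"a1": "p3", "a2": "p3", "a5": "p3", "a3": "p1", "a4": "p2"}
edge_routes = {
    ("a1", "c1"): {"p3", "q_p3"},
    ("c1", "a2"): {"p3", "q_p3"},
    ("a2", "c2"): {"p3", "h_T1", "q_p1"},
    ("a2", "c3"): {"p3", "h_T1", "q_p2"},
    ("c2", "a3"): {"p1", "q_p1"},
    ("c3", "a4"): {"p2", "q_p2"},
    ("a3", "c4"): {"p1", "q_p1"},
    ("a4", "c5"): {"p2", "q_p2"},
    ("c4", "a5"): {"p3", "h_T1", "q_p1"},
    ("c5", "a5"): {"p3", "h_T1", "q_p2"},
}
tasks_p3 = tasks_on("p3", actor_binding, edge_routes)
tasks_crossbar = tasks_on("h_T1", actor_binding, edge_routes)
```

The crossbar carries no actors, so its task set consists of communication operations only.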

D. TRADE-OFFS BETWEEN THE MINIMIZATION OF MEMORY FOOTPRINT AND THE ACHIEVABLE PERIOD
Replacing a multi-cast actor and its adjacent channels with an MRB has as its primary purpose the reduction of the memory footprint (see Fig. 2). Moreover, this transformation removes both the need to execute the multi-cast actor and its communication. Nonetheless, there are cases where an MRB replacement is detrimental to (i.e., it increases) the execution period P. To illustrate this, Figs. 4 and 5 present two periodic schedules obtained from the specification shown in Fig. 1. One can see that the schedule shown in Fig. 4 utilizing an MRB has a longer period, i.e., P = 8, than the schedule with period P = 7 depicted in Fig. 5, where the multi-cast actor a 2 has been retained. The timings in Figs. 4 and 5 are chosen for illustrative purposes to demonstrate the impact of MRBs and the existing trade-off in the specification. In both schedules, the same actor-to-core binding is assumed for actors a 1 , a 3 , a 4 , and a 5 , i.e., actors a 1 and a 5 are bound to core p 3 , actor a 3 is bound to core p 1 , and actor a 4 is bound to core p 2 . Moreover, channels c 4 and c 5 are bound to the core-local memories q p 1 and q p 2 , respectively.
Fig. 5. Schedule with a period of P = 7 (shown to the right) for the application graph g A from Fig. 2a. Actor a 3 is bound to core p 1 , actor a 4 is bound to core p 2 , and actors a 1 , a 2 , and a 5 are bound to core p 3 . Channels c 2 and c 4 are bound to the core-local memory q p 1 , channels c 3 and c 5 are bound to core-local memory q p 2 , and channel c 1 is bound to core-local memory q p 3 (shown to the left). The light red boxes and the light violet box (for the multi-cast actor a 2 ) in the Gantt chart shown to the right denote actor executions, the green boxes represent write operations, and the light green boxes indicate read operations. To exemplify, the green box containing (a 2 , c 2 ) represents a write of a token to channel c 2 by the actor a 2 , and the light green box containing (c 4 , a 5 ) indicates a read of a token contained in channel c 4 by the actor a 5 . Similarly to Fig. 4, the data dependencies of the application graph g A are depicted by the solid and dotted dashed directed edges in the Gantt chart. The read and write from and to channel c 1 are not shown in the Gantt chart as both the write (a 1 , c 1 ) of actor a 1 to channel c 1 and the read (c 1 , a 2 ) of actor a 2 from channel c 1 access the core-local memory q p 3 of the core p 3 that executes both actors a 1 and a 2 . Thus, the corresponding write and read communication times are zero, i.e., τ (a 1 ,c 1 ) = τ (c 1 ,a 2 ) = 0, as the communication is assumed to be part of the execution of the actors themselves. The same situation holds for the read from channel c 2 and write to channel c 4 by actor a 3 as well as the read from channel c 3 and write to channel c 5 by actor a 4 , i.e., τ (c 2 ,a 3 ) = τ (a 3 ,c 4 ) = τ (c 3 ,a 4 ) = τ (a 4 ,c 5 ) = 0.
As mentioned previously, the illustrated schedules differ in whether they employ an MRB or retain the multi-cast actor a 2 . To exemplify, in Fig. 4, the MRB c m mapped to the core-local memory q p 3 replaces the multi-cast actor a 2 and its connected channels c 1 , c 2 , and c 3 . Thus, both actors a 3 and a 4 have to read from memory q p 3 (i.e., the reads (c m , a 3 ) and (c m , a 4 )), resulting in an additional delay of 1 time unit, increasing the period to P = 8. Moreover, binding the MRB c m to either core-local memory q p 1 or core-local memory q p 2 does not improve the situation as, respectively, actor a 4 or actor a 3 has to perform a read, delaying its execution by 1 time unit. For the example, the only way to obtain a schedule with a period of P = 7 is to have copies of the output data of actor a 1 in both core-local memories q p 1 and q p 2 (e.g., as shown in Fig. 5), but this is the exact situation that is prevented when employing an MRB, as MRBs are used to avoid any data duplication. Hence, no schedule with a period of P = 7 exists when an MRB replaces the multi-cast actor a 2 .
In contrast, the schedule depicted in Fig. 5 retains the multi-cast actor a 2 (bound to core p 3 ) and its connected channels c 1 , c 2 , and c 3 . Channel c 1 is bound to core-local memory q p 3 , while channels c 2 and c 3 are bound to core-local memories q p 1 and q p 2 , respectively. Thus, the input data needed to fire actors a 3 and a 4 are already contained in the core-local memories (i.e., q p 1 and q p 2 ) of the cores the actors are bound to (i.e., p 1 and p 2 ). Moreover, their output channels (c 4 and c 5 ) are also bound to these core-local memories. Thus, the core p 1 can execute actor a 3 without any read or write overhead. The same holds for core p 2 and its bound actor a 4 . Instead, the communication overhead to move the input and output data of actors a 3 and a 4 to and from the core-local memories q p 1 and q p 2 is borne by core p 3 , which was under-utilized in the schedule depicted in Fig. 4. Core p 3 executes the multi-cast actor a 2 , which provides the input data of actors a 3 and a 4 via the writes (a 2 , c 2 ) and (a 2 , c 3 ), and the actor a 5 (also bound to core p 3 ) fetches the output data of actors a 3 and a 4 via the reads (c 4 , a 5 ) and (c 5 , a 5 ). This enables a schedule of period P = 7, as the cores p 1 and p 2 are no longer burdened with any communication overhead.
Moreover, this also demonstrates that the channel decisions must be explored to obtain this optimal period of P = 7. Otherwise, in case of a fixed channel decision, the actors a 3 and a 4 would need to execute a communication operation, e.g., a read operation when the data stays at the producer (PROD) or a write operation when the data has to be moved to the consumer (CONS). Only with the channel decisions used in Fig. 5, i.e., CONS for channels c 2 and c 3 and PROD for channels c 4 and c 5 , are the actors a 3 and a 4 free of such communication operations.
In summary, replacing every multi-cast actor with an MRB enables minimal memory footprint implementations, but this may increase the minimal achievable period. Thus, minimal period implementations require both optimization of the actor and channel bindings as well as a selective decision for each multi-cast actor on whether or not to perform MRB replacement. In the following, we present our design space exploration approach to minimize the execution period, memory footprint, and core cost.

IV. DESIGN SPACE EXPLORATION
Allocation of resources, binding, and scheduling a DFG onto a heterogeneous many-core system is a Multi-objective Optimization Problem (MOP) [5,10], and trade-offs exist and shall be explored between different objectives, e.g., execution period, memory footprint, and core cost. In general, there is no single best solution but a set of Pareto-optimal solutions that trade the different objectives against each other.
Moreover, the introduced design space of bindings and schedules is huge even for small applications and a modest number of processors, memories, and communication resources, such as the example shown in Fig. 1. Thus, finding the actual set of Pareto-optimal solutions is an intractable problem that can only be approximated via heuristics. For this purpose, many state-of-the-art Electronic System Level (ESL) design flows employ meta-heuristic optimization techniques based on MOEAs [5,7,15]. The advantage of such population-based techniques is that the search space is sampled in parallel and that not only one compromise solution but an approximation of the Pareto-front is found after several generations of offspring as a result of the DSE. However, whereas MOEAs have been shown to provide quite good results for allocation and binding problems [5,10], it is difficult to find good encodings for feasible schedules of operations.
Indeed, pure meta-heuristic optimization techniques, while applicable to a broad domain of problems, are often too generic. This general applicability can be traded for a better optimization performance, e.g., quality of found solutions or required runtime to obtain these solutions, by employing problem-specific heuristics. Hence, it is beneficial to integrate problem knowledge into meta-heuristic optimization techniques, restricting their general applicability to a particular domain but improving optimization performance.
In this paper, we propose a new hybrid DSE approach in which the exploration of the design space is split between (i) a MOEA that explores the space of the multi-cast actor replacement function ξ (encoded as a binary string), the channel decision function C d (integer encoding), and the set of actor bindings β A (integer encoding), and (ii) a specialized scheduling algorithm that is applied to find a schedule minimizing the execution period P for a given solution candidate. This so-called hybrid decoding process is illustrated in Fig. 6.
For decoding, we first propose an exact formulation for the related scheduling problem and subsequently introduce our heuristic CAPS-HMS. The ILP-based decoding will obtain a schedule with minimal period for a given set of actor bindings and channel decisions but may suffer from long evaluation times. In contrast, our heuristic CAPS-HMS will allow for a much faster evaluation of solution candidates but does not guarantee to find the exact minimal period.
In both alternative approaches, Algorithm 1 is applied first to compute a transformed application graph g Ã containing the MRBs decided by the DSE via the multi-cast actor replacement function ξ . This function ξ , the channel decision function C d , and the set of actor bindings β A together form the genotype G . In both cases, the genotype will be decoded into the phenotype representing the period P, the set of actor and channel bindings β , and the channel capacity function γ. Based on the phenotype, the evaluators finally determine the quality of the solution candidate under evaluation with respect to the design objectives. For our mapping and scheduling problem, the objectives are the minimization of (i) the execution period P, (ii) the memory footprint M F = ∑ c∈g Ã.C γ(c) • ϕ(c), and (iii) the core cost K = ∑ ϑ ∈Θ α(ϑ ) • K ϑ .³
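The two sum-based objectives can be sketched in a few lines of Python. The data layout and the reading of α(ϑ) as the number of allocated cores of type ϑ are assumptions made for illustration; the numeric values in the example are hypothetical, apart from the relative core costs 1.5, 1.0, and 0.5 used in the experiments.

```python
def memory_footprint(channels):
    """Sum of capacity (in tokens) times token size over all channels."""
    return sum(capacity * token_size for capacity, token_size in channels)


def core_cost(allocated, cost_per_type):
    """Sum of allocated cores per type times the cost of that core type.

    allocated: core type -> number of allocated cores (assumed meaning of alpha)
    cost_per_type: core type -> relative core cost
    """
    return sum(count * cost_per_type[ctype] for ctype, count in allocated.items())


# Hypothetical candidate: two channels and three allocated cores.
channels = [(1, 100), (2, 50)]               # (capacity, token size in bytes)
costs = {"theta1": 1.5, "theta2": 1.0, "theta3": 0.5}
allocated = {"theta1": 2, "theta3": 1}

footprint = memory_footprint(channels)       # 1*100 + 2*50
cost = core_cost(allocated, costs)           # 2*1.5 + 1*0.5
```

The period P, the third objective, is not a closed-form sum and has to be obtained from the decoder.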

V. DECODING
In the following, we present and later evaluate two decoding approaches: (i) an integer linear program and (ii) a novel periodic scheduling heuristic called CAPS-HMS that targets heterogeneous multi-core platforms with hierarchical memory organizations and integrates the scheduling of actors with the scheduling of the communication between them.

A. ILP-BASED DECODING
First, we explain our ILP-based decoding approach, as shown in Algorithm 3. This algorithm decodes the genotype G into the corresponding phenotype (P, β , γ), as shown in Fig. 6.
Note that the scheduling via ILP is performed in a loop (Lines 2 to 6). The reason is that for an ILP-derived schedule, the channel capacities might need to be increased to execute this schedule (Line 5), and the channel bindings might need to be modified in consequence to accommodate the enlarged channels (Line 3). The loop terminates when all channels fit into the memories they are bound to (Line 6).
Fig. 6. Overview of our hybrid DSE approach using MOEAs. The instance creator generates random genotypes for an initial population that forms the starting point of the iterative optimization process. A genotype G is the genetic representation of a solution candidate. For each (new) solution candidate, update executes a user-defined decoder, i.e., either decodeViaILP or decodeViaHeuristic, depending on whether the ILP-based or heuristic-based approach is used, and then applies evaluator functions on this candidate. The decoder transforms the genotype into the phenotype representing the solution candidate's characteristics of interest, e.g., the period P, the bindings β , and the channel capacities γ. Based on the phenotype, the evaluators determine the quality of the solution candidate under evaluation with respect to the design objectives, e.g., period P, memory footprint M F , and core cost K. From the resulting population, the selector chooses parents with superior solution quality. Finally, the recombinator generates offspring by recombining and mutating the genotypes of the selected parents. Our approach has been realized using the DSE framework OpenDSE [13] and its underlying MOEA-based optimization framework Opt4J [14].
The objective of the ILP itself is the minimization of the execution period P (Eq. (14)). Moreover, for each task t ∈ T, the ILP determines a start time s t (Eq. (15)). Equations (16) to (18) encode the data dependencies of the application graph g Ã. In particular, Eq. (16) denotes that a token cannot be read from a channel c before it has been written into it, also considering the number of initial tokens δ (c) of the channel. Equation (17) ensures that each actor can only start after all its reads from ingoing edges have been performed, and Eq. (18) enforces that each actor write can only start after its actor computation has finished. Equation (19) guarantees for each resource r that all tasks t ∈ T r mapped to this resource are executed within a time interval of duration P. Finally, to ensure a feasible schedule, the ILP must enforce that tasks mapped to the same resource have non-overlapping executions. For this purpose, binary sequentialization variables e t,t ′ are introduced for each pair of tasks that share a resource (Eq. (20)). Here, e t,t ′ = 1 denotes that task t must finish before task t ′ is started. Thus, exactly one of e t,t ′ and e t ′ ,t must be one (Eq. (21)). These variables are then used to sequentialize the communication over the interconnects (Eq. (22)) and the actor executions performed by the cores (Eq. (23)). In these equations, D ≫ P is a value much greater than the execution period that is used to disable the sequentialization constraint that task t must finish before task t ′ is started in the case that e t,t ′ = 0. The sequentialization of actors mapped to the same core (Eq. (23)) is enforced indirectly by constraining that all write tasks t ∈ OUT (a) of actor a are finished before the read tasks t ′ ∈ IN (a ′ ) of actor a ′ are started. This ensures that all reads of an actor, then the actor itself, and finally, all writes of the actor are executed in sequence without reads and writes of other actors being interspersed into this sequence.
However, if actor a is a sink actor (i.e., has no output edges) or actor a ′ is a source actor (i.e., has no input edges), a simple definition of OUT (a) and IN (a ′ ) as the set of all output edges of actor a, respectively, the set of all input edges of actor a ′ would fail to enforce the sequentialization that actor a is completed before actor a ′ fires. To handle these cases, OUT (a) returns the set containing only the actor a itself when this actor is a sink. Conversely, IN (a ′ ) returns the set containing only the actor a ′ itself when this actor is a source. Formally, OUT (a) = E O (a) if E O (a) ≠ ∅ and OUT (a) = {a} otherwise; likewise, IN (a ′ ) = E I (a ′ ) if E I (a ′ ) ≠ ∅ and IN (a ′ ) = {a ′ } otherwise. Here, E O (a) = {(â, ĉ) ∈ g Ã.E O | â = a} denotes the set of all output edges (i.e., write operations) of actor a and, correspondingly, E I (a ′ ) = {(ĉ, â) ∈ g Ã.E I | â = a ′ } denotes the set of all input edges (i.e., read operations) of actor a ′ .
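The special handling of sink and source actors can be illustrated with a small Python sketch. The function names and the tuple encoding of edges are assumptions for illustration; the edge lists reproduce the five-actor example graph as it is described in the text.

```python
def out_tasks(actor, write_edges):
    """Write operations of the actor, or the actor itself if it is a sink."""
    edges = {e for e in write_edges if e[0] == actor}
    return edges if edges else {actor}


def in_tasks(actor, read_edges):
    """Read operations of the actor, or the actor itself if it is a source."""
    edges = {e for e in read_edges if e[1] == actor}
    return edges if edges else {actor}


# Write edges are (actor, channel) pairs, read edges are (channel, actor) pairs.
writes = [("a1", "c1"), ("a2", "c2"), ("a2", "c3"), ("a3", "c4"), ("a4", "c5")]
reads = [("c1", "a2"), ("c2", "a3"), ("c3", "a4"), ("c4", "a5"), ("c5", "a5")]

sink_case = out_tasks("a5", writes)      # a5 has no writes, so {"a5"}
source_case = in_tasks("a1", reads)      # a1 has no reads, so {"a1"}
regular_case = out_tasks("a2", writes)   # both writes of the multi-cast actor
```

Returning the actor itself keeps the sequentialization constraint well defined even when one of the two actors has no communication tasks on the relevant side.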

B. HEURISTIC-BASED DECODING
To speed up evaluation during exploration, we propose an alternative heuristic-based decoding outlined in Algorithm 4. This algorithm decodes the input genotype G into the corresponding phenotype (P, β , γ), as shown in Fig. 6. First, we determine an initial set of channel bindings β C in Line 2. Note that channels may need to be remapped later on (Line 10) if it turns out that channel capacities need to be increased (Line 7) to accommodate the found schedule and at least one channel no longer fits into the memory it is bound to (checked in Line 8). After initial channel bindings have been determined in Line 2, a lower bound for the period P is derived in Line 3 from the resource utilization of cores and interconnects. Consider Fig. 7 as an example, where bindings and timings are chosen for illustrative purposes with a communication time of one for all reads and writes, i.e., τ t = 1 ∀t ∈ E. The bottleneck resource in this example is the crossbar h T 1 involved in five reads and five writes, leading to a lower bound of 10 for the period P.
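A utilization-based lower bound of this kind can be sketched as the maximum total busy time over all resources. The function below is a minimal Python illustration; its name, the dictionary layout, and the task names are hypothetical, and the load on the core is invented to contrast with the crossbar from the example.

```python
def period_lower_bound(tasks_on_resource, duration):
    """Largest total busy time over all cores and interconnects."""
    return max(
        (sum(duration[t] for t in tasks) for tasks in tasks_on_resource.values()),
        default=0,
    )


# Hypothetical load: a crossbar carrying five reads and five writes of one
# time unit each, plus a lightly loaded core.
reads = [f"r{i}" for i in range(5)]
writes = [f"w{i}" for i in range(5)]
duration = {t: 1 for t in reads + writes}
duration["a1"] = 3

tasks_on_resource = {
    "h_T1": reads + writes,        # crossbar: 10 time units of traffic
    "p1": ["a1", "r0", "w0"],      # core: 3 + 1 + 1 = 5 time units
}

bound = period_lower_bound(tasks_on_resource, duration)
```

No schedule can have a period below this value, because the bottleneck resource must process all of its tasks once per period.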
A concrete schedule is calculated by the proposed scheduler Communication-Aware Periodic Scheduling on Heterogeneous Many-core Systems (CAPS-HMS) depicted in Algorithm 5. CAPS-HMS is called with an application g Ã, actor and channel bindings β A and β C , and a candidate period P. If a schedule with period P is found, true is returned, false otherwise. This is used by the loop in Lines 5 to 6 of Algorithm 4 to successively increase the period until a schedule is found. As discussed previously, channel capacities may need to be enlarged to accommodate the found schedule, possibly resulting in a need to remap channels no longer fitting into memory, necessitating a rescheduling with the updated channel bindings, as is done by the while loop in Lines 4 to 10. Otherwise, as soon as a schedule with a feasible period P is found and all channels fit into the memory they are bound to, Line 9 terminates the loop. Then, the resulting phenotype (P, β , γ) is returned in Line 11.
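The inner period search can be sketched as a simple loop around a scheduler callback. This is an illustration under assumptions: the step size of one time unit, the `max_period` guard, and the callback `try_schedule` (standing in for a scheduler that returns true or false for a candidate period) are not taken from the paper.

```python
def find_period(lower_bound, try_schedule, max_period):
    """Increase the candidate period until the scheduler succeeds.

    try_schedule: callable taking a candidate period and returning True if a
                  schedule with that period was found (hypothetical stand-in)
    max_period:   assumed safety limit so that the loop always terminates
    """
    period = lower_bound
    while period <= max_period:
        if try_schedule(period):
            return period
        period += 1  # assumed increment of one time unit per attempt
    return None


# Stub scheduler that only succeeds for periods of at least 12 time units.
found = find_period(10, lambda p: p >= 12, max_period=100)
```

In the heuristic decoder, such a search would additionally be wrapped in the outer loop that enlarges channel capacities and remaps channels when they no longer fit into their memories.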
CAPS-HMS shown in Algorithm 5 follows a greedy strategy, where tasks are scheduled as soon as possible on the resources they are bound to. All tasks are assigned a start time of execution within a given interval [0, P[, i.e., from 0 (included) to P (excluded). Ultimately, this interval will contain tasks from different iterations to optimize resource utilization. To obtain a schedule within the interval [0, P[, CAPS-HMS schedules one iteration of the application graph g Ã, thereby wrapping task executions finishing later than the period P back into the schedule interval [0, P[ through modulo P computation. Assuming the task t is executed in the interval [s t , s t + τ t [, then in the schedule interval [0, P[, it will occupy the time region given by f wrap (P, s t , τ t ) = {t mod P | s t ≤ t < s t + τ t }. For example, the execution of actor a 3 in the schedule depicted in Fig. 7 (to the right) is from 8 to 11, but it is wrapped into the schedule interval [0, 10[ with f wrap (10, 8, 3) = [8, 10[ ∪ [0, 1[.
During scheduling, the resource utilization of each core or interconnect resource r ∈ R \ Q is tracked by a corresponding utilization set U r ⊆ [0, P[ that contains all time intervals already occupied with scheduled tasks. Initially, all resources are free, i.e., the utilization sets are assigned the empty set (Line 2 in Algorithm 5). For example, in the state depicted by the partial schedule shown in the middle of Fig. 7, the actors a 1 , a 2 , and a 3 and all their read and write operations have already been scheduled. In this state, the heuristic is trying to schedule actor a 4 with its read and write operations while observing the current utilization sets.
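The modulo wrapping and the check against a utilization set can be sketched in Python as follows. The sketch assumes integer time steps, so that an occupied region is represented as a set of unit time slots; the helper `is_free` and the sample utilization set are hypothetical.

```python
def f_wrap(period, start, duration):
    """Unit time slots occupied in [0, period) by a task running in
    [start, start + duration), wrapped around by modulo computation."""
    return {t % period for t in range(start, start + duration)}


def is_free(utilization, period, start, duration):
    """True if the wrapped task does not overlap already occupied slots."""
    return utilization.isdisjoint(f_wrap(period, start, duration))


# A task running from 8 to 11 with period 10 occupies slots 8 and 9 and
# wraps around into slot 0.
slots = f_wrap(10, 8, 3)

# Hypothetical utilization set of a resource: slots 0 to 2 are occupied.
busy = {0, 1, 2}
conflict = not is_free(busy, 10, 8, 3)   # slot 0 collides after wrapping
fits = is_free(busy, 10, 3, 5)           # slots 3 to 7 are still free
```

The wrapped slot 0 shows why tasks from different iterations can end up sharing one schedule interval and why the overlap check must be done after wrapping.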
The goal of the scheduling heuristic CAPS-HMS is to assign to each task t ∈ T the earliest possible start time s t that conforms with the given bindings and satisfies the data dependencies. Channel capacities are not considered during scheduling but are adjusted in Algorithm 4 to accommodate the found schedule. The start times are initialized with zero at algorithm start (Line 3 in Algorithm 5) as, later on, the heuristic only delays start times to conform to data dependencies and resource constraints. CAPS-HMS considers for each actor a priority given by the topological sorting of g Ã (see Line 4). During scheduling, the heuristic keeps track of actors to be scheduled with the list L of ready actors, which is initialized in Line 5 as all actors that are initially ready to be fired, e.g., because they are source actors or there is at least one initial token contained in all input channels of the actor. Before any actor is selected, the ready list L must be sorted in descending order using the previously assigned priority.
Actor scheduling is performed by the loop in Lines 6 to 24, which either succeeds (Line 25) when there are no longer any actors to be scheduled, i.e., L = ∅, or fails (Line 24) when an actor cannot be scheduled within the schedule interval [0, P[ due to insufficient free time remaining on at least one resource to schedule the actor and its read and write operations. This failure is indicated by the error flag ϖ checked in Line 23. Within the scheduling loop, an actor a to be scheduled is selected from the ready list L, and the core p onto which it is bound is derived from the bindings β A (Line 8). Then, the time τ ′ a that an actor a, including its communication tasks, requires to be scheduled on core p is computed. Once actor a has been scheduled, all actors a ′ that have been enabled by firing actor a are added to the ready list (Line 21). Finally, the foreach loop is terminated in Line 22 to continue scheduling the next actor until all the actors have been scheduled (Line 25) or there is insufficient free time remaining on at least one resource to schedule all actors and their read and write operations (Line 24).
We will see in Section VI that, although our heuristic scheduler CAPS-HMS does not guarantee to determine a schedule of minimal period P for a given combination of graph, channel decision function, and actor bindings, it requires much less execution time than the ILP scheduling approach presented in Section V-A. When comparing the resulting Pareto front qualities, we will also show that the degradation is small for many test applications. Particularly for large applications and complex target architectures, the ILP solution times can become prohibitively long.

VI. RESULTS
In the following, we conduct a series of different DSE experiments as shown in Fig. 6 to assess the effectiveness of our proposed ILP and CAPS-HMS heuristic in generating high-quality implementations when mapping dataflow applications onto the heterogeneous many-cores shown in Fig. 1. For each exploration, we employed the OpenDSE [13] framework using the NSGA-II elitist genetic algorithm [17] with a population size of 100 individuals, each generation generating 25 new individuals and the crossover rate set to 0.95. To measure the effects of selectively introducing MRBs, we implemented and compared three different exploration strategies: Reference, MRB Always , and MRB Explore . The genotype for the Reference strategy is G = (C d , β A ). The multi-cast actor replacement function ξ is the all-zeros function. Thus, no multi-cast actor is replaced (i.e., g Ã = g A ). In contrast, MRB Always also uses the genotype G = (C d , β A ) but assumes the all-ones function for ξ . Thus, each multi-cast actor is replaced by its corresponding MRB. Finally, strategy MRB Explore selectively explores for each multi-cast actor the choice of its replacement by an MRB by using the complete genotype G = (ξ , C d , β A ). Here, the binary string ξ is determined during the optimization loop (see Fig. 6).
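The difference between the three strategies lies only in how the replacement function ξ is obtained, which the following Python sketch makes explicit. The function name, the strategy labels, and the list encoding of ξ (one bit per multi-cast actor) are assumptions for illustration.

```python
def replacement_function(strategy, explored_bits):
    """Return one replacement bit per multi-cast actor for a strategy.

    explored_bits: bits proposed by the optimizer, one per multi-cast actor;
                   only the exploring strategy actually uses their values.
    """
    count = len(explored_bits)
    if strategy == "Reference":
        return [0] * count            # all-zeros: retain every multi-cast actor
    if strategy == "MRB_Always":
        return [1] * count            # all-ones: replace every multi-cast actor
    if strategy == "MRB_Explore":
        return list(explored_bits)    # replacement decided per actor
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical application with four multi-cast actors.
bits = [1, 0, 0, 1]
reference = replacement_function("Reference", bits)
always = replacement_function("MRB_Always", bits)
explore = replacement_function("MRB_Explore", bits)
```

Because the first two strategies fix ξ, their genotypes only need to carry the channel decisions and the actor bindings.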
Orthogonal to the replacement of multi-cast actors by MRBs, we also vary how the genotype of each implementation is decoded. Here, we observe the effects of decoding via an ILP (see Section V-A) or using CAPS-HMS (see Section V-B). Both return a phenotype (P, M F , K) composed of the period of the modulo schedule, the memory footprint, and the core cost of an implementation. Such a phenotype is used to evaluate the quality of each implementation. In the following, the combinations of strategy and decoder result in six approaches. The approaches named Reference ILP , MRB ILP Always , and MRB ILP Explore explore the effects of introducing MRBs when each genotype is decoded using the ILP-based decoder. Conversely, the approaches named Reference CAPS-HMS , MRB CAPS-HMS Always , and MRB CAPS-HMS Explore use CAPS-HMS to decode the genotype.
The architecture used for our experiments (shown in Fig. 1) contains 24 cores organized into four tiles: T 1 , T 2 , T 3 , and T 4 . Inter-tile communication is supported via a network-on-chip h NoC . A global memory q global provides off-chip storage. Internally, each tile comprises six cores, each connected to its corresponding local memory. Each core is of one of three core types: ϑ 1 , ϑ 2 , or ϑ 3 . For our experiments, the respective relative core costs have been chosen as K ϑ 1 = 1.5, K ϑ 2 = 1.0, and K ϑ 3 = 0.5. Faster cores are usually more expensive than slower ones. Thus, the slowest processors in the architecture are those of type ϑ 3 , and the fastest processors in the architecture are those of type ϑ 1 . The relative core costs thus approximately correlate to the speedup between the cores of different types, i.e., cores of type ϑ 1 are 3× faster than cores of type ϑ 3 , and cores of type ϑ 2 are 2× faster than cores of type ϑ 3 . Moreover, each tile supports intra-tile communication via a crossbar h T and a tile-local memory q T . To observe the effects of the approaches under observation in a realistic environment, we constrain the size of each memory and the bandwidth of each interconnect resource. Accordingly, the core-local and tile-local memories can store up to 2.5 MiB and 50 MiB, respectively. We assume the global memory to be large enough to store all channels of the explored applications. Last, the bandwidth of each crossbar is 8 GiB/s, and the NoC bandwidth is 4 GiB/s.
We assume that each actor in the application can be mapped to any core in the architecture, and each channel might potentially be mapped to any memory. The optimization loop of the DSE explores the actor-to-core bindings β A , whereas channel-to-memory bindings β C are then determined using Algorithm 2 (see Section III-B).
As discussed, the objectives to be minimized are the execution period P (see Section V), the memory footprint M F , and the cost K of allocated cores. We quantify the memory footprint of each application g Ã after decoding as follows:
M F = ∑ c∈g Ã.C γ(c) • ϕ(c)   (24)
This corresponds to the sum over all channels of the product of the token size (ϕ) and the adjusted channel capacity (γ). We calculate the core cost K of each implementation after decoding as given below:
K = ∑ ϑ ∈Θ α(ϑ ) • K ϑ   (25)
As target applications, Table 1 presents a benchmark composed of three real-world image processing applications obtained from self-developed Matlab/Simulink test cases [6]. Shown in the table are also the number of actors, the number of channels, and the number of multi-cast actors contained in each application. Table 1 also shows for each application two memory footprints, M F and M Fmin , with the following semantics: M F represents the minimal memory footprint of each application when all multi-cast actors are retained, while M Fmin represents the minimal memory footprint when each multi-cast actor is replaced by a corresponding MRB. To calculate both memory footprints M F and M Fmin , we use Eq. (24) and assume a channel capacity of exactly one token for all channels, i.e., ∀c ∈ C : γ(c) = 1.
Finally, as our applications are all acyclic, they are transformed in such a way that there is at least one initial token per channel, i.e., ∀c ∈ C : δ (c) ≥ 1, allowing lower execution periods to be reached.

A. QUALITY OF FOUND IMPLEMENTATIONS
A Multi-objective Optimization Problem (MOP) generally does not have a single optimal solution due to the conflicting objectives. Instead, there exists a set of Pareto-optimal solutions. The set of all such solutions is known as the Pareto-front. As discussed previously, finding the actual Pareto-front of the MOP considered in this paper is an intractable problem that can only be approximated. To obtain a good approximation of the Pareto-front for each application, the Pareto-fronts found by all exploration runs for a given application utilizing all six considered combinations of exploration and decoding strategy are combined into a reference Pareto-front. This reference Pareto-front S Ref can be seen as the closest approximation of the actual Pareto-front achieved. The quality of each approach for each application can then be evaluated by comparing the Pareto-front approximations found by the five DSE runs performed for this application and approach combination against the application's reference Pareto-front. To facilitate such a comparison, quality measures are required for Pareto-front approximations that condense characteristics such as proximity to the reference Pareto-front (the closer, the better) and diversity into a single measure [18]. For this purpose, we use the hypervolume [19] quality measure and normalize the reference Pareto-front S Ref and each Pareto-front S found by an approach to only contain objective values between zero and one, i.e., S Ref , S ⊂ [0, 1] d where the number of objectives is given by d = 3. This normalization ensures that each objective is weighted equally in the hypervolume quality measure.
Then, given a (normalized) Pareto-front S ⊂ [0, 1] d , the hypervolume of S is the measure of the region weakly dominated⁴ by S and bounded above by the reference point 1, i.e., the measure Λ(•) of the set {q ∈ [0, 1] d | ∃p ∈ S : p weakly dominates q}.
There, Λ(•) denotes the Lebesgue measure [20]. The greater the hypervolume score is, the better a Pareto-front approximation S is considered to be.
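For small normalized fronts, the hypervolume with respect to the reference point 1 can be computed exactly by inclusion-exclusion over the boxes spanned by each point and the reference point. The Python sketch below illustrates the measure itself, not the implementation used in the experiments; its runtime grows exponentially with the number of points, so it is only meant for tiny examples.

```python
from itertools import combinations
from math import prod


def hypervolume(front, ref=1.0):
    """Volume of the region weakly dominated by the front and bounded
    above by the reference point (ref, ..., ref), for minimization."""
    front = list(front)
    total = 0.0
    for k in range(1, len(front) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(front, k):
            # The boxes of all points in the subset intersect in the box
            # whose lower corner is the component-wise maximum.
            corner = [max(p[i] for p in subset) for i in range(len(subset[0]))]
            total += sign * prod(ref - c for c in corner)
    return total


single = hypervolume([(0.5, 0.5, 0.5)])               # one box of side 0.5
pair = hypervolume([(0.5, 0.0, 0.0), (0.0, 0.5, 0.0)])  # 0.5 + 0.5 - 0.25
```

A front consisting only of the origin dominates the whole unit cube and therefore reaches the maximal score of one.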
For each considered application and approach under investigation, five independent DSE runs were performed. To make the comparison of the approaches feasible and fair, each DSE run was given a maximum number of 2,500 generations, which is sufficient for all approaches to reach stagnation, i.e., no or very little further progress could be observed if the exploration ran longer. In each generation of the DSE, the set of non-dominated solutions⁵ found so far is recorded. Thus, for a given application, approach, and generation i, there exist exactly five sets S ≤i of non-dominated solutions found until generation i, one for each DSE run. To evaluate the quality of each approach for each application, we average the relative hypervolume score over the five DSE runs (Eq. (27)). Fig. 8 presents for each explored application and approach the averaged relative hypervolume score, as defined by Eq. (27). There, the approaches implementing Reference, MRB Always , and MRB Explore correspond to dashed, dashed-dotted, and solid traces, respectively. Moreover, we distinguish approaches using the ILP decoder (Reference ILP , MRB ILP Always , and MRB ILP Explore ) and approaches using CAPS-HMS (Reference CAPS-HMS , MRB CAPS-HMS Always , and MRB CAPS-HMS Explore ) colored in red and blue, respectively. In the following, we discuss the obtained results.
⁴ A point p ∈ R d weakly dominates a point q ∈ R d if p i ≤ q i for all 1 ≤ i ≤ d.
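Recording the non-dominated solutions found so far amounts to filtering out every point that is weakly dominated by a different point. A minimal Python sketch of such a filter is given below; the function names are hypothetical, and the objective vectors in the example are invented (period, memory footprint, core cost, all to be minimized).

```python
def weakly_dominates(p, q):
    """True if p is no worse than q in every objective (minimization)."""
    return all(pi <= qi for pi, qi in zip(p, q))


def non_dominated(points):
    """Keep the points that are not weakly dominated by a different point."""
    unique = list(dict.fromkeys(points))  # drop duplicates, keep order
    return [
        p for p in unique
        if not any(q != p and weakly_dominates(q, p) for q in unique)
    ]


# Hypothetical objective vectors given as tuples.
candidates = [(1, 2, 3), (2, 2, 3), (3, 1, 1), (1, 2, 3)]
front = non_dominated(candidates)
```

Here, the second candidate is discarded because the first one is at least as good in every objective, while the third survives because it trades a longer period for lower memory footprint and cost.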
Key Observations: First, we confirm our expectation that the replacement of multi-cast actors by MRBs results in better solutions according to the design objectives. The results presented in Figs. 8 and 9 show that regardless of the chosen decoding approach, either ILP (see solid red lines) or the CAPS-HMS heuristic (see solid blue lines), the selective exploration of MRB replacements performed by the MRB Explore strategy delivers better quality solutions in terms of the hypervolume score compared to the respective Reference approach. These improvements range from 28 % for the small Sobel application to 90 % for the large multicamera application.
Next, it can be observed that the advantage of the MRB Explore strategy over the MRB Always strategy grows with the number of multi-cast actors in the application. For example, for the Sobel application containing only 1 multi-cast actor, the hypervolume score is almost identical, but for the Sobel 4 application containing 4 multi-cast actors, the MRB ILP Explore approach improves upon the MRB ILP Always approach by 6 %. For the large multicamera application with 23 multi-cast actors, the improvement of MRB CAPS-HMS Explore compared to MRB CAPS-HMS Always is as high as 20 %.
Finally, it can be observed that the ILP-based decoder is superior to the CAPS-HMS heuristic for small to mid-sized applications, i.e., Sobel and Sobel 4 , where the MRB CAPS-HMS Explore approach is only slightly inferior, by 7 % and 5 %, respectively, in terms of the hypervolume score compared to the MRB ILP Explore approach. In contrast, the MRB CAPS-HMS Explore approach is superior for the large multicamera application by 67 %. This observation can be explained by the fact that (see filled triangles). In contrast, the memory footprints of the non-dominated solutions found by the Reference CAPS-HMS approach (see circle symbols) vary between 55 and 90 MiB. Third, the shortest-period solutions for the Sobel 4 and multicamera applications are characterized by a filled triangle symbol. Moreover, when examining shortest-period solutions for a given memory footprint, one can observe that almost all of these are found by the strategy MRB Explore (filled triangles). This validates our assertion from Section III-D that there are cases where an MRB replacement is detrimental to (i.e., it increases) the execution period. Thus, we can conclude that for mid- to large-size applications containing a non-negligible number of multi-cast actors, the selective replacement of these multi-cast actors by MRBs may lead to shorter periods compared to not including any MRB or replacing all multi-cast actors with MRBs.

B. EXPLORATION TIME
Another essential feature for evaluating a DSE approach is the exploration time. In the context of DSE, the evaluation time is crucial because a DSE run may require thousands of design point evaluations [21]. Table 2 presents the exploration time in seconds when using the ILP decoder and the CAPS-HMS decoder after 2,500 generations. We also present the speedup ratio, comparing the time of the much faster heuristic-based CAPS-HMS decoder against the ILP. The speedup for each DSE approach is calculated as the ratio of the exploration time of the ILP decoder to the exploration time of the CAPS-HMS decoder.
Key Observations: In general, we can observe that the ILP-based decoder requires significantly more time to perform the exploration of the design space for a given number of generations when compared to the CAPS-HMS decoder. Even for the small Sobel application, the ILP-based approach Reference ILP takes 16.02 hours to complete 2,500 generations, while the approach Reference CAPS-HMS only requires 7.38 minutes. The reported speedup range of CAPS-HMS is 125× to 149× for the Sobel application. However, both approaches require more time to explore mid- to large-size applications. For Sobel 4, the CAPS-HMS decoder requires between 27.19 and 42.28 minutes, while the ILP decoder requires between 20 and 22.68 hours to perform 2,500 generations of the optimization loop. Accordingly, for Sobel 4, the speedup range of the CAPS-HMS decoder is between 28× and 50×. For the largest application, multicamera, the exploration time varies between 1.60 and 4.54 hours for CAPS-HMS. In contrast, the ILP decoder takes between 14.62 and 17.80 hours. There, the reported speedup of CAPS-HMS ranges between 4× and 9×. Note that the reported speedup range is lower compared to the other applications because the ILP-based decoder has a timeout of three seconds. In summary, the ILP-based decoder is best suited for small to mid-size applications, as it is then able to find a minimal period schedule for any given binding of actors to cores and channels to memories. In contrast, the proposed CAPS-HMS heuristic is the preferable solution for realistically sized applications, as the ILP solving time explodes with an increasing number of variables.
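As a plausibility check of the reported numbers, the speedup can be written as a simple ratio of exploration times. The symbols $t_{\mathrm{ILP}}$ and $t_{\mathrm{CAPS\text{-}HMS}}$ are introduced here only for illustration and are an assumed notation, not necessarily the one used in the original equation.

```latex
\[
  \text{speedup} \;=\; \frac{t_{\mathrm{ILP}}}{t_{\mathrm{CAPS\text{-}HMS}}},
  \qquad
  \text{e.g., Reference, Sobel:}\quad
  \frac{16.02\,\mathrm{h} \cdot 60\,\mathrm{min/h}}{7.38\,\mathrm{min}}
  \;=\; \frac{961.2\,\mathrm{min}}{7.38\,\mathrm{min}}
  \;\approx\; 130 .
\]
```

Under this assumption, the value of roughly 130× for the Reference approach falls inside the 125× to 149× range reported for the Sobel application.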

VII. RELATED WORK
Approaches for optimizing parallel implementations of applications specified as dataflow networks [22] perform multi-objective optimization of conflicting design objectives, e.g., throughput and number of allocated cores. For example, approaches such as [15,23] optimize the throughput of dataflow applications and the number of allocated cores in a given architecture. However, these approaches do not consider any memory footprint evaluation of implementations or the generation of periodic schedules during DSE.
In the following, we categorize the related work as approaches performing memory footprint minimization and approaches generating periodic schedules.

A. MEMORY FOOTPRINT MINIMIZATION
Approaches for memory footprint minimization can be classified into two main categories: (i) approaches minimizing the size of FIFOs and (ii) approaches implementing memory-reuse strategies that allow different FIFOs to be mapped into overlapping memory spaces or track individual token lifetimes to exploit memory footprint reductions over the execution of an application. In the first category, techniques such as FIFO sizing have been widely studied to reduce the memory footprint of Synchronous Dataflow (SDF) applications [24][25][26]. Such approaches determine the minimal buffer size of an SDF application under throughput constraints. However, those approaches do not consider any memory-reuse strategy because each buffer is studied as a separate unit allocated in memory, and no shared memory address space is considered. In the second category, the approach presented in [27] derives overlapping memory allocations for individual tokens communicated during the execution of an SDF graph. However, it assumes no overlap between iterations, i.e., an execution period only contains actor firings of a single iteration. Thus, the achievable minimal period is severely constrained. Apart from performing an agnostic memory footprint minimization, some approaches exploit the knowledge about the application and actor characteristics. For instance, dataflow frameworks [8,12,28] targeting image processing apply memory minimization strategies based on the behavior of a set of specialized actors performing operations like multi-cast, fork, and join of data. In particular, the memory minimization strategy described in [12] merges all outgoing buffers of a multi-cast actor by replacing them with a broadcast FIFO that supports a single writer but multiple readers. However, no other design objectives apart from memory footprint are explored. In this paper, we propose a holistic approach that considers not only the minimization of memory footprint but also the mapping and scheduling of communication channels and actors onto heterogeneous many-core architectures as well as the number of allocated CPUs as exploration objectives.

B. SCHEDULING
There exist approaches for communication-aware scheduling of Directed Acyclic Graphs (DAGs) targeting many-cores that can be classified according to the utilized scheduling method: heuristic-based (e.g., list scheduling [29] and clustering scheduling [30]) or meta-heuristics-based (e.g., genetic algorithms [31][32][33], simulated annealing [34], and particle swarm [35]), to mention a few. Although these approaches are able to take communication scheduling into account, their optimization goal is to minimize the schedule make-span, i.e., the latency of a single iteration. Thus, minimum-period schedules are not achievable by the mentioned approaches. Moreover, the communications on the DAGs are often not explicitly specified, but rather abstracted by a Communication-to-Computation Ratio (CCR), i.e., no explicit communications over interconnect resources in the target architecture are modeled.
When analyzing dataflow, scheduling strategies applied at compile time are beneficial [15]. For example, Self-Timed Execution (STE) [24,36] simulates the execution of a dataflow graph by using so-called state transformations. The state of a DFG is encoded as a set of variables representing the current state of the system. Changes during the execution of a system (e.g., an actor consuming/producing tokens from/to a channel) are represented by state transformations. During the simulation of the system, the visited states are recorded until a periodic pattern emerges, which corresponds to the periodic schedule of the DFG. However, STE does not consider any communication in the scheduling. As a remedy, [37,38] proposed an extension to STE by including communication delay in the model. However, these works can only derive schedules for MPSoC architectures with a single bus and a global memory. Thus, the model assumes a single resource to schedule the communication at a fixed bandwidth. This is different from our approach, which is able to target heterogeneous many-core architectures composed of a hierarchical organization of cores, memories, and interconnects.
Last, modulo scheduling is a well-known loop scheduling technique applied in compiler optimizations as well as to periodic scheduling of DAGs on fine-grained [39][40][41][42][43] and coarse-grained architectures [44][45][46][47]. There, applications are modeled as DAGs, and hardware units such as adders, multipliers, and accelerators are used to modulo-schedule a hardware implementation of an iterative application [43]. For example, approaches such as [39,42] use modulo scheduling in combination with loop unrolling during high-level synthesis. Approaches such as [45,47] perform loop unrolling of applications composed of tasks mapped to the processing elements of coarse-grained architectures. However, these approaches ignore the scheduling of communications, i.e., transfers of data from cores to memories and from memories to cores over communication resources such as buses or NoCs.
In contrast, this paper considers an explorative approach to map and schedule dataflow specifications on heterogeneous multicore architectures by considering both the scheduling of actors and the communications between actors. Our approach targets heterogeneous many-core architectures where cores of different kinds might exist in the same architecture, and complex communications are explicitly modeled, mapped, and scheduled on interconnect resources and memories respecting a hierarchical tile organization. As illustrated, the mapping of both actors to cores and data buffers in channels to memories, including processor-local memory, tile-local memory, and global memory, is explored during a DSE. For each solution candidate, a periodic schedule is then optimized either using an ILP formulation or an efficient scheduling heuristic called CAPS-HMS.

VIII. CONCLUSIONS
As a first contribution, this paper introduces the concept of Multi-Reader Buffers (MRBs) as a memory-efficient implementation of multi-cast actors and their replacement as a graph transformation. Rather than replicating produced tokens for all readers, an MRB stores each token only once, and the token stays alive until the last reader has consumed it. MRB replacement provides minimal buffer implementations obtained by replacing all multi-cast actors in an application with MRBs. However, replacing multi-cast actors with MRBs may increase the execution period (i.e., reduce the throughput) due to communication contention when accessing shared data.
To properly examine these trade-offs, as our second contribution, we propose a multi-objective Design Space Exploration (DSE) approach that selectively decides the replacement of multi-cast actors with MRBs and explores actor and FIFO channel mappings to trade memory footprint, core cost, and period of schedules. It is shown that selectively replacing multi-cast actors with MRBs improves the quality of the found solutions, measured by a hypervolume indicator, by 28 % to 90 %.
Moreover, as our third contribution, we proposed and compared two scheduling approaches that are used to determine a periodic schedule for the actors as well as the read/write accesses to buffers for each explored design point during the DSE. The first is an ILP formulation that delivers the exact minimum period given an application binding. This ILP formulation performs well in terms of solution times for small to mid-sized applications. The second is a fast CAPS-HMS heuristic approach that performs particularly well when tackling large applications. It has been shown that for the small and mid-sized applications used in the experiments, our proposed CAPS-HMS is only slightly inferior, by up to 7 % in terms of hypervolume, compared to the ILP. However, for large applications and complex target architectures, the ILP solution times can become prohibitively long. There, the fast CAPS-HMS outperforms the ILP by 67 % in hypervolume for our largest test application. Finally, the presented DSE approach is distinguished from the state-of-the-art by considering (i) constraints in the memory size of each on-chip memory, (ii) memory hierarchies, (iii) support of heterogeneous many-core platforms, and (iv) optimization of buffer placement and overall scheduling to minimize the period.

FIGURE 1. On the left, an application graph $g_A$ consisting of a set of actors $a_i \in A$ communicating over a set of FIFO channels $c_j \in C$ is shown. Channel capacities in terms of tokens $\gamma(c_j)$ are illustrated by white boxes. The token size in bytes $\phi(c_j)$ and the number of initial tokens $\delta(c_j)$, e.g., one initial token for channel $c_1$ (black dot), are also illustrated. On the right, a heterogeneous four-tile many-core architecture is modeled by an architecture graph $g_R$. Processor cores are denoted $p_i$, and tiles are denoted $T_j$. Each core $p_i \in P$ can, in principle, access any core-local memory $q_{p_j} \in Q_P$, any tile-local memory $q_{T_j} \in Q_T$, as well as the global memory $q_{\mathrm{global}}$. Dashed arcs represent mapping options from actors to cores and channels to memories. To exemplify, mapping edges are illustrated for the actors $a_3$ and $a_5$ as well as the channel $c_4$. However, to reduce visual clutter, only the resources of tile $T_1$ and the global memory are shown as targets for these mappings. In the proposed approach, actors can be mapped in principle (light red arcs) to all cores of a type that supports the execution of the actor, e.g., cores of type $\vartheta_1$ for actor $a_3$ and cores of type $\vartheta_2$ or $\vartheta_3$ for actor $a_4$. In contrast, channels can generally be mapped (light green arcs) to any memory.

Definition 2.1 (Application Graph): An application graph $g_A = (A \cup C, E)$ is a bipartite graph with its vertices partitioned into a set of actors $A$ and a set of channels $C$. Such an application graph can be derived from a DFG by explicitly modeling the FIFO channels as vertices. The delay function $\delta : C \to \mathbb{N}_0$, capacity function $\gamma : C \to \mathbb{N}$, and size function $\phi : C \to \mathbb{N}$, respectively, assign each channel a number of initial tokens, a maximal number of tokens that can be stored, and the token size in bytes. The set of directed edges $E = E_O \cup E_I$ describes the flow of data between actors and channels and is partitioned into actor outgoing ($E_O \subseteq A \times C$) and actor incoming ($E_I \subseteq C \times A$) edges. Throughout this paper, we assume marked graph semantics [3] of the application graph. Finally, the function $\tau : A \times \Theta \to \mathbb{N} \cup \{\bot\}$ represents the execution time $\tau(a, \vartheta)$ of an actor $a$ when mapped on a core of type $\vartheta \in \Theta$. The $\bot$ value indicates that an actor $a$ cannot be mapped to a particular core type $\vartheta$. In Figs. 1 and 2a, an example of an application graph $g_A$ consisting of five actors $A = \{a_1, \ldots, a_5\}$ communicating via five channels $C = \{c_1, \ldots, c_5\}$ is given. Each communication channel $c \in C$ is annotated with its corresponding number of initial tokens $\delta(c)$, capacity $\gamma(c)$, and size of each token $\phi(c)$.
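The definition above can be mirrored by a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `ApplicationGraph`, its field names, and the example values are hypothetical, `None` stands in for the $\bot$ value, and the helper `channel_bytes` assumes that a channel occupies capacity times token size bytes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class ApplicationGraph:
    actors: Set[str]                      # A
    channels: Set[str]                    # C
    out_edges: Set[Tuple[str, str]]       # E_O, a subset of A x C
    in_edges: Set[Tuple[str, str]]        # E_I, a subset of C x A
    delay: Dict[str, int]                 # delta: initial tokens per channel
    capacity: Dict[str, int]              # gamma: max. tokens per channel
    token_size: Dict[str, int]            # phi: token size in bytes
    # tau: (actor, core type) -> execution time, None models "bottom"
    exec_time: Dict[Tuple[str, str], Optional[int]]

    def validate(self) -> None:
        """Check bipartiteness and the value ranges of delta, gamma, phi."""
        assert self.actors.isdisjoint(self.channels)
        assert all(a in self.actors and c in self.channels
                   for a, c in self.out_edges)
        assert all(c in self.channels and a in self.actors
                   for c, a in self.in_edges)
        for c in self.channels:
            assert self.delay[c] >= 0
            assert self.capacity[c] >= 1
            assert self.token_size[c] >= 1

    def can_map(self, actor: str, core_type: str) -> bool:
        """True if the actor has an execution time on this core type."""
        return self.exec_time.get((actor, core_type)) is not None

    def channel_bytes(self, channel: str) -> int:
        """Assumed buffer size: capacity (tokens) times token size (bytes)."""
        return self.capacity[channel] * self.token_size[channel]


# Hypothetical two-actor example: a1 -> c1 -> a2.
g = ApplicationGraph(
    actors={"a1", "a2"},
    channels={"c1"},
    out_edges={("a1", "c1")},
    in_edges={("c1", "a2")},
    delay={"c1": 1},
    capacity={"c1": 2},
    token_size={"c1": 1024},
    exec_time={("a1", "t1"): 5, ("a1", "t2"): None, ("a2", "t2"): 7},
)
g.validate()
```

Storing channels as vertices rather than as edges keeps the per-channel annotations (initial tokens, capacity, token size) in one place, which is convenient when channels are later bound to memories.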

FIGURE 3. MRB with one write index (pointer) indicating the location of the next token to be written. Moreover, each reading actor requires an index pointing to the position of the next token to read.
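The indexing scheme described in this caption can be sketched as a ring buffer with a single writer position and one read position per reader. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the class `MultiReaderBuffer` and its methods are hypothetical, positions are kept as running counters whose value modulo the capacity gives the slot index, and the reader names in the usage lines are placeholders.

```python
from typing import Any, Dict, Iterable, List


class MultiReaderBuffer:
    """Single-writer, multi-reader FIFO storing each token only once."""

    def __init__(self, capacity: int, readers: Iterable[str]) -> None:
        self.capacity = capacity
        self.slots: List[Any] = [None] * capacity
        self.written = 0  # total tokens written; write index = written % capacity
        # Tokens consumed so far per reader; read index = count % capacity.
        self.read_count: Dict[str, int] = {r: 0 for r in readers}
        if capacity < 1 or not self.read_count:
            raise ValueError("need capacity >= 1 and at least one reader")

    def free_slots(self) -> int:
        # A slot is released only after the slowest reader has consumed it.
        return self.capacity - (self.written - min(self.read_count.values()))

    def available(self, reader: str) -> int:
        return self.written - self.read_count[reader]

    def write(self, token: Any) -> None:
        if self.free_slots() == 0:
            raise BufferError("MRB full: slowest reader still holds a token")
        self.slots[self.written % self.capacity] = token
        self.written += 1

    def read(self, reader: str) -> Any:
        if self.available(reader) == 0:
            raise BufferError("no unread token for reader " + reader)
        token = self.slots[self.read_count[reader] % self.capacity]
        self.read_count[reader] += 1
        return token


# Usage with two hypothetical readers sharing one buffer of two tokens.
mrb = MultiReaderBuffer(capacity=2, readers=["r1", "r2"])
mrb.write("t0")
mrb.write("t1")
first_for_r1 = mrb.read("r1")       # r1 advances, r2 has not read yet
free_after_one_reader = mrb.free_slots()
first_for_r2 = mrb.read("r2")       # now the slot of "t0" is released
free_after_all_readers = mrb.free_slots()
```

The usage shows the property that matters for the memory footprint: a slot becomes writable again only once every reader has consumed its token, so no per-reader copies are needed.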

Algorithm 3: ILP-based Decoding
Function decodeViaILP($g_{\tilde{A}}$, $C_d$, $\beta_A$)
Input: Application graph $g_{\tilde{A}}$, channel decision function $C_d$, and the set of actor bindings $\beta_A$
Output: Period $P$, set of bindings $\beta$, and the channel capacity function $\gamma$
do

FIGURE 10. Union of the Pareto fronts of the last generation obtained for the presented applications after 2,500 generations using the ILP-based decoder. Filled points are non-dominated solutions of the union of the three Pareto fronts. The period $P$ is presented in a logarithmic scale for better visualization.

FIGURE 11. Union of the Pareto fronts of the last generation obtained for the presented applications after 2,500 generations using the heuristic-based (CAPS-HMS) decoder. Filled points are non-dominated solutions of the union of the three Pareto fronts. The period $P$ is presented in a logarithmic scale for better visualization.
11, and $s_{(c_5,a_5)} = 12$ for the

Algorithm 2: Determine Channel Bindings $\beta_C$
Function determineChannelBindings($C_d$, $\gamma$, $\beta_A$)
Input: Channel decision function $C_d$, channel capacity function $\gamma$, and the set of actor bindings $\beta_A$
Output: The set of channel bindings $\beta_C$
$w_q \leftarrow 0 \;\; \forall q \in Q$ // Start memory usage $w_q$ from 0
$P$ such that $(a_{cons}, p_{cons}) \in \beta_A$ // Derive $p_{cons}$
$a_{prod} \in A$ such that $(a_{prod}, c) \in E$ // Derive $a_{prod}$
$p_{prod} \in P$ such that $(a_{prod}, p_{prod}) \in \beta_A$ // Derive $p_{prod}$

TABLE 1. Applications investigated during DSE runs. $M_F$ corresponds to the minimal memory footprint in case all multi-cast actors are retained, while $M_{F_{\min}}$ denotes the minimal memory footprint when each multi-cast actor is replaced by a corresponding MRB.

TABLE 2. Exploration time comparison of the CAPS-HMS decoder against the ILP decoder for running 2,500 generations.