MRCN: Throughput-Oriented Multicast Routing for Customized Network-on-Chips

The relentless proliferation of Big Data and artificial intelligence has compelled computing platform architectures to evolve into heterogeneous multicores for greater energy efficiency. A customized network-on-chip (NoC) supporting interconnection diversity is pivotal for the asymmetric data-access traffic requirements of modern heterogeneous multicore system-on-chip (SoC). A significant portion of on-chip data access comprises single-source multi-destination (SSMD) traffic, which supports barrier synchronization, multi-threading, cache coherency protocols, and deep neural network (DNN) acceleration. By amortizing SSMD traffic, multicast routing is essential for effectively utilizing communication bandwidth. One of the primary concerns in supporting multicast routing in NoCs is to circumvent the additional deadlock conditions caused by branch operations among the active routers. However, it is challenging to implement throughput-optimized multicast routing in irregular topology-based NoCs because the deadlock conditions become highly complicated, and the Hamiltonian path required to apply the labeling rule may not exist. Two important observations were identified regarding multicast routing in customized NoCs: 1) Even if the NoC lacks a Hamiltonian path, deadlock freedom can be guaranteed by restricting branch operations to a specific destination. 2) The variable path diversity of a custom topology can be leveraged in routing path allocation and branching. Based on these properties, this study proposes a deadlock-free and throughput-enhanced multicast routing for customized NoCs (MRCN). MRCN ensures deadlock freedom by utilizing extended routing and router labeling rules. Furthermore, destination router partitioning and traffic-aware adaptive branching are incorporated to reduce packet routing hops and disperse channel traffic.
The effectiveness of MRCN was verified using Noxim, a well-known cycle-accurate NoC simulator, under various topologies and traffic patterns. The simulations revealed that MRCN improved average latency by 13.98 % and throughput by 12.16 % over previous multicast routings in customized NoCs under saturated traffic conditions.


INTRODUCTION
Because the computational demands driven by Big Data and artificial intelligence are increasing remarkably, heterogeneous multicores composed of domain-specialized devices are gaining popularity for higher energy efficiency [1], [2], [3]. The other central axis of the computing platform is the interconnect architecture; thus, a customized network-on-chip (NoC) providing enriched communication diversity is required to cope with the massive traffic in heterogeneous multicore system-on-chip (SoC) [4], [5].
Multicast routing, which enables data transmission to a group of destinations, has the potential to amortize SSMD traffic significantly [13], [14]. The branch operation that replicates and delivers packets from the input buffer to multiple output ports inside the router is crucial for multicast routing in NoCs. Multicast packets are merged and distributed at the source router; therefore, the channel load can be diminished.
We identified two findings to resolve the multicast problem in customized NoCs: 1) Even if a Hamiltonian path does not exist, restricting branch operations to a specific destination allows labeling-rule-compliant routing to operate deadlock-free. 2) The variable path diversity of an irregular topology can be leveraged in routing path allocation and branching.
Inspired by these observations, this study proposes a deadlock-free and throughput-enhanced multicast routing for customized NoCs (MRCN) while minimizing additional hardware overhead. MRCN ensures deadlock freedom by defining extended routing rules and labeling routers on the path involving the maximum number of routers. Furthermore, the destination router partitioning method and traffic-aware adaptive branching reduce packet routing hops and amortize unbalanced network load.
The remainder of this paper is organized as follows. Section 2 introduces multicast-enabled NoC studies related to MRCN and emphasizes the contribution of this study in contrast to them. In Section 3, MRCN is described in the following order: dedicated router architecture, labeled path searching, destination router partitioning, and adaptive branching. Section 4 validates the effectiveness of MRCN through Noxim, a prevalent cycle-accurate NoC simulator, under various traffic patterns. Finally, Section 5 concludes the paper.

BACKGROUND AND RELATED WORK
Branch operations for multicast routing can cause deadlock conditions not observed in unicasting-only cases [15], [16]. Coffman et al. proved that deadlock occurs if and only if four conditions, namely mutual exclusion, hold-and-wait, no preemption, and circular wait, hold simultaneously [25]. These four conditions can be adapted to NoCs as follows: 1) Mutual exclusion: Each output port is not shareable. 2) Hold-and-wait: A packet holds one output port and waits for another. 3) No preemption: Output ports cannot be preempted. 4) Circular wait: A circular chain of packets exists. Because mutual exclusion and no preemption are common in packet-switched NoCs, deadlock freedom can be guaranteed only by preventing hold-and-wait or circular wait. Fig. 1 illustrates an example of multicast deadlock in an NoC. The dashed line indicates a blocked port request, while a solid line indicates a granted and assigned port request. Packet A requests a packet transmission to the southern port of router 2, which is currently held by packet B. Likewise, packet B requests a packet transmission to the southern port of router 4, which is currently held by packet A. Each packet holds one output port and waits for another held by the other packet; thus, neither packet can propagate, resulting in a multicast deadlock. This type of deadlock is caused by inevitable packet branching owing to multicast routing.
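The circular-wait condition above can be checked mechanically on a wait-for graph of port requests. The following is a minimal sketch (the name `has_circular_wait` and the graph encoding are illustrative, not from the paper): a depth-first search that reports a back edge, i.e., a cycle of packets each waiting on a port held by the next.

```python
# Sketch: detecting circular wait in a wait-for graph of port requests.
def has_circular_wait(wait_for):
    """wait_for: dict mapping a packet to the set of packets that hold
    output ports it is waiting for. Returns True if a cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wait_for}

    def dfs(p):
        color[p] = GRAY
        for q in wait_for.get(p, ()):
            if color.get(q, WHITE) == GRAY:
                return True          # back edge found: circular wait
            if color.get(q, WHITE) == WHITE and dfs(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in wait_for)

# The Fig. 1 scenario: packet A waits on B's port and B waits on A's.
print(has_circular_wait({"A": {"B"}, "B": {"A"}}))  # True
```

Routers cannot run such a global check at runtime, which is why the deadlock-avoidance schemes surveyed below instead remove one Coffman condition by construction.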
Tree-based approaches provide minimal routing with the source node as the root and all destination nodes as leaves [24]. The source node establishes a minimal spanning tree and sends a single multicast message across the tree. The packet must be forwarded in all directions simultaneously to duplicate flits at the branching-point router. The contention of multicast packets increases the likelihood of multicast deadlock occurrence, thereby inducing hold-and-wait. Tree-based techniques eliminate hold-and-wait by deploying additional hardware resources.
Cut-through switched routers [18] avoided hold-and-wait by implementing a first-in-first-out (FIFO) buffer large enough to store the largest packet. The hardware tree-based multicast routing algorithm (HTA) [19] implemented a deadlock queue in each router to prevent deadlocks. The recursive partitioning multicast (RPM) [20] and the improved minimal multicast routing (IMM) [21] employed two virtual channels (VCs) to transmit up and down streams separately to avoid deadlocks. IMM further built a spanning tree with fully-utilized shared routes to offer a minimal multicast path. Adaptive partition-based multicast (APBM) [22] adopted more VCs and dedicated turn models, north-last and west-last. SmartFork [23] partitioned the output ports of the router into several groups: intra-group ports were serviced serially, while inter-group ports were serviced in parallel. The VCs in SmartFork were embraced to target any desired point on the multicast performance spectrum. The multicast router using buffer sharing (MRBS) [24] achieved deadlock-free minimal-path routing with low area overhead by exploiting the spatial diversity of the input buffer. MRBS managed all input buffers in a register-file style to utilize the buffer space efficiently instead of dedicating a single buffer per input port.
Path-based routing assigns consecutive labels to all routers, allowing packets to be routed in ascending or descending order of the labels. Owing to this simplicity in avoiding circular wait, label-based routing has been widely adopted in mesh and torus topologies, where ordered labeling is possible.
Dual-path routing (DP) [13] classified destination routers according to whether their labels were higher or lower than that of the source router: higher-labeled destinations were routed in ascending order and lower-labeled destinations in descending order. Although DP achieved deadlock-free multicast routing by avoiding circular wait, it might cause a worst-case critical path visiting all routers.
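The DP destination grouping described above can be sketched in a few lines, assuming integer router labels (the function name `dual_path_split` is illustrative): destinations labeled above the source are visited in ascending order along one path, the rest in descending order along the other.

```python
# Sketch of dual-path (DP) destination grouping by label.
def dual_path_split(source_label, dest_labels):
    """Split destinations into an ascending-order and a descending-order
    visit list relative to the source router's label."""
    ascending = sorted(d for d in dest_labels if d > source_label)
    descending = sorted((d for d in dest_labels if d < source_label),
                        reverse=True)
    return ascending, descending

up, down = dual_path_split(8, [2, 5, 10, 13])
print(up, down)  # [10, 13] [5, 2]
```

Because each path only ever moves monotonically through the label order, no circular chain of channel dependencies can form, which is the deadlock-freedom argument the partitioning schemes below inherit.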
The label-based routing rule, applied to grouped destinations with proper partitioning, satisfies deadlock-freedom in every partitioned group while avoiding the worst-case routing path passing through all routers [14], [15]. Accordingly, several path-based approaches concerning destination partitioning have been proposed.
The multi-path (MP) algorithm, also presented in [3], divided nodes into four disjoint rectangular subsets depending on the location of the source node. By allowing only routing paths within each subset, MP achieved a lower total hop count than DP, whereas critical paths were not reduced significantly owing to unbalanced subset sizes. The column-path (CP) algorithm [14] partitioned nodes into up to 2k subsets, where k is the number of columns in the mesh. CP generated more balanced subsets and shorter routing paths than MP with a high degree of parallelism. The recursive partitioning (RP) [15] method balanced the number of nodes included in each subset to minimize hop counts. However, both CP and RP could cause congestion in some clusters of nodes because they ignore the locality of traffic loads.
The Hamiltonian adaptive multicast unicast model (HAMUM) [16] permitted more routing paths by using adaptive MP (AMP) and adaptive CP (ACP) techniques, which resulted in less congested paths and thus enhanced throughput compared with MP and CP. The Hamiltonian-based odd-even (HOE) turn model [17] maximized the degree of adaptiveness and maintained deadlock freedom without additional VCs. The hybrid routing algorithm (HRA) [26] attempted to combine the benefits of both path- and tree-based routings by adaptively determining the branching policy according to the buffer usage.
Whereas HRA and the path-based approaches incorporated the regularity of NoC topologies to reduce routing hop counts and alleviate congestion, they cannot be directly applied to customized NoCs owing to the increased complexity of finding the Hamiltonian path required by label-based routing rules. Although previous topology-independent tree-based methods could be easily adapted to customized NoCs, their vulnerability to contention remains owing to the ignorance of topological characteristics. MRCN enables path-based routing in customized NoCs through extended routing rules and reduces network congestion by embracing the path diversity of the given topology.

MRCN
The primary goal of MRCN is to ensure deadlock-free and throughput-enhanced multicast routing in customized NoCs. To this end, we devise labeled path searching, destination router partitioning, and adaptive branching algorithms that guarantee deadlock freedom and boost performance. Fig. 3 presents a schematic diagram of each process of MRCN and the corresponding improvements. The definitions of the topology graph and the notations utilized in MRCN are described in Table 1.

MRCN Router Architecture
As shown in Fig. 4, the MRCN routers operate with a five-stage pipeline: buffer write (BW), route computation (RC), switch allocation (SA), switch traversal (ST), and link traversal (LT), which are successively run on the input buffer, RC logic, switch allocator, crossbar switch, and links, respectively. The switch allocator and crossbar switch are modified to support adaptive branching, and the extended RC logic allows for sophisticated route computations in MRCN.
The input buffer stores the incoming packet in the BW stage and delivers the destination list extracted from the head flit to the RC logic. Owing to path-based routing, MRCN routers require smaller input buffers than tree-based approaches that employ VC schemes and cut-through switching.
In the RC stage, the output ports of the packet and the destination set in each direction are determined based on the adaptive branching algorithm. The RC logic acquires the credit signal from the adjacent routers to reflect the traffic condition and decides whether to detour the labeled path depending on the credit conditions of adjacent routers.
In the SA stage, the switch allocator manages the packet requests to the desired output ports through a round-robin policy and continues until the request of the head flit for the desired output port is accepted. The arbitration function is extended to support parallel packet propagation to multiple output ports, as required for adaptive branching.
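The round-robin arbitration described above can be sketched as a simple grant function, assuming one grant per output port per cycle (names are illustrative, not the paper's RTL):

```python
# Sketch of a round-robin grant for the switch-allocation (SA) stage.
def round_robin_grant(requests, last_grant):
    """requests: list of bools, one per input port; last_grant: index of the
    input granted last cycle. Returns the next granted index, or None."""
    n = len(requests)
    for i in range(1, n + 1):
        idx = (last_grant + i) % n   # scan starting after the last winner
        if requests[idx]:
            return idx
    return None

print(round_robin_grant([True, False, True], 0))  # 2
```

Starting the scan just after the previous winner is what makes the policy fair: every requesting input is granted within `n` cycles.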
After the SA, the granted packets enter the ST stage and are routed to the corresponding output ports via the crossbar switch. The input buffer invalidates the processed flit and sends a credit signal to the upstream router to indicate the availability. Next, the packet passing through the crossbar switch is traversed to the input port of the downstream router in the LT stage.

Labeled Path Searching
Labeled path-based routing is widely adopted for regular topology-based NoCs having a Hamiltonian path owing to the benefit of deadlock-freedom without demanding additional hardware resources. For instance, in the 4×4 mesh depicted in Fig. 5a, the Hamiltonian path of seamlessly connected red arrows can be revealed, avoiding circular wait. Nonetheless, a customized NoC depicted in Fig. 5b, which has no Hamiltonian path, is incompatible with existing labeled path-based routings. Deadlock occurs if and only if all Coffman conditions hold concurrently; therefore, deadlock freedom is achieved if at least one of these conditions is permanently removed. Circular wait for packets passing through unlabeled routers can be prevented by supplementing the following two routing rules: 1) a packet destined for an unlabeled router is imposed as a unicast packet, and 2) a packet from an unlabeled router is transferred to the nearest labeled router with legacy path-based routing. The extended routing rules allow unlabeled routers, and thus legacy path-based routing can be utilized by restricting packet branching. Extra traffic caused by the inevitable unicast packets can be relieved by maximizing the number of labeled routers. Fig. 5c depicts one of the labeled paths with the maximum number of labeled routers in the customized NoC of Fig. 5b. In this case, packets destined for the only unlabeled router, R u1 , are transmitted in a unicast manner. Accordingly, the proposed labeled path searching (LPS) algorithm aims to find a path that includes as many routers as possible to minimize unlabeled routers, as illustrated in Fig. 5c.
The objective of the LPS algorithm is to label as many unlabeled routers as possible, which can be accomplished by maximizing the link complexity among unlabeled routers. Generally, as the link complexity of the topology increases, the number of path-labeling cases to be explored and the probability of determining a labeled path with the highest hop count also increase. When the router with the lowest degree is labeled preferentially, the link complexity of the remaining unlabeled routers in the sub-topology is maximized. Therefore, LPS attempts to maximize the number of labeled routers by preferentially labeling the routers with the lowest degree.
Algorithm 1 describes the LPS method for the maximal labeled path. First, the router with the lowest degree is selected as the first router R 1 of the labeled path (lines 1-7). Next, the router with the lowest degree among the unlabeled routers adjacent to R 1 is chosen as the next labeled router (lines 11-20). The router labeling process is recursive (lines 8-9), and by labeling the adjacent routers with the lowest degree first, the connection to the remaining unlabeled routers is maximized (line 12). If the labeled path encompasses all routers, a Hamiltonian path exists, and thus no unicast packet to unlabeled routers is generated.
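The lowest-degree-first idea behind Algorithm 1 can be sketched as a greedy walk over an adjacency dict. This is a simplification under stated assumptions: the paper's LPS also recurses over sub-topologies and applies multi-hop tie-breaking, both omitted here, and `label_path` is an illustrative name.

```python
# Minimal sketch of lowest-degree-first path labeling (simplified LPS idea).
def label_path(adj):
    """adj: undirected adjacency dict {router: [neighbors]}.
    Returns routers in ascending label order; routers not in the
    returned path remain unlabeled."""
    degree = {r: len(n) for r, n in adj.items()}
    start = min(adj, key=lambda r: degree[r])     # lowest-degree first router
    path, labeled = [start], {start}
    while True:
        cand = [n for n in adj[path[-1]] if n not in labeled]
        if not cand:
            break                                  # path can grow no further
        nxt = min(cand, key=lambda r: degree[r])   # extend toward low degree
        path.append(nxt)
        labeled.add(nxt)
    return path

# A 1x4 line topology: the labeled path covers every router (Hamiltonian).
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(label_path(line))  # [0, 1, 2, 3]
```

When the path covers all routers, a Hamiltonian path exists and no forced unicast packets arise; otherwise the leftover routers fall under the extended routing rules of Section 3.2.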
Lines 5-9 describe the selection criteria for multiple routers having the same degree. In cases where several routers share the lowest degree, the one with the fewest total degree of directly connected routers is chosen. If multiple such routers remain, the one with the fewest total degree of routers at two hops is selected as the first router to be labeled. Similarly, if more than one router with the minimum sum of degrees at two hops still remains, the selection criteria are extended by increasing the hop count by one. If multiple routers with the minimum degree sum remain after the searching algorithm reaches the maximum hop count, all of them have the highest priority, and the first labeled router is arbitrarily selected among them. In the example of Fig. 5c, R 1 , R 15 , and R u1 have a minimum degree of one. The sums of degrees of routers one hop away from R 1 , R 15 , and R u1 are four, four, and five, respectively; therefore, the first router to be labeled is selected between R 1 and R 15 . The total degrees of routers two hops away from R 1 and R 15 are 13 and 14, respectively, and R 1 is eventually selected as the first router of the labeled path.
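The multi-hop tie-breaking rule above can be sketched as follows, assuming an undirected adjacency dict; `ring_degree_sum` and `break_tie` are illustrative helper names, not the paper's notation:

```python
# Sketch of the tie-break for the first labeled router: among routers
# sharing the minimum degree, compare the summed degree of routers exactly
# h hops away, increasing h until the tie breaks.
from collections import deque

def ring_degree_sum(adj, root, h):
    """Sum of degrees of routers exactly h hops from root (via BFS)."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(len(adj[r]) for r, d in dist.items() if d == h)

def break_tie(adj, candidates, max_hops):
    for h in range(1, max_hops + 1):
        best = min(ring_degree_sum(adj, c, h) for c in candidates)
        candidates = [c for c in candidates
                      if ring_degree_sum(adj, c, h) == best]
        if len(candidates) == 1:
            break
    return candidates[0]   # arbitrary pick if still tied at max_hops
```

For example, in a topology where routers 0, 2, and 4 all have degree one, the one whose single neighbor has the lowest degree wins at h = 1, mirroring the Fig. 5c walkthrough.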
In the LPS of MRCN, both ascending and descending orders are allowed in path labeling. Nevertheless, labeling in ascending order is more straightforward than in descending order owing to the data structure utilized by LPS. Because the number of labeled routers is not determined prior to path labeling, LPS declares labeled routers through stack-based dynamic memory allocation. The elements of the stack structure and the push operation represent labeled-router information and path labeling, respectively. Because the top index increases as elements are pushed onto the stack, LPS labels the routers in ascending order, with the stack indices matching the labels of the routers.
While LPS ensures deadlock freedom, it must also overcome the additional challenges of preventing the critical path that passes through all routers and minimizing unicast packets. Therefore, destination router partitioning and adaptive branching are introduced to resolve these two problems.

Destination Router Partitioning
Destination router partitioning (DRP) aims to remove the worst-case path visiting all routers in customized NoCs, thereby improving overall throughput. As mentioned in Section 2, appropriate router clustering can shrink the critical path of the entire NoC to the critical path of the largest cluster [14], [15]. To reduce the size of the maximal cluster that determines the longest labeled path, DRP attempts to balance the cluster sizes. The number of clusters equals the number of routers adjacent to a given source router. Algorithm 2 describes the DRP method in detail. A router in each cluster adjacent to a given source router is uniquely assigned as an entrance router. Each cluster takes turns appending one of the unaffiliated routers to reduce the variance in the number of routers between clusters. The newly appended router in each cluster must conform to the labeling order (either ascending or descending) with respect to the entrance router. A cluster containing an entrance router with a label higher (or lower) than the source can only contain unaffiliated routers with a label higher (or lower) than the source router. An unlabeled entrance router can include only adjacent unlabeled routers in a cluster.
To ensure maximum uniformity in the number of routers between clusters, it is desirable to affiliate the remaining routers in ascending order of the number of neighboring clusters. Therefore, each cluster prioritizes unaffiliated routers with the fewest connections to other clusters.
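The round-robin cluster growth described above can be sketched as follows. This is a simplified sketch: the label-order constraints on appended routers are omitted, clusters simply absorb, in turn, the free router most strongly linked to them, and all names are illustrative.

```python
# Sketch of DRP-style cluster growth (label-order constraints omitted).
def partition(adj, source):
    """adj: undirected adjacency dict; source: source router.
    Seeds one cluster per neighbor of the source (entrance routers),
    then grows clusters round-robin from the unaffiliated set."""
    clusters = [[e] for e in adj[source]]
    free = set(adj) - {source} - set(adj[source])   # unaffiliated set F
    progress = True
    while free and progress:
        progress = False
        for cl in clusters:
            # count links between each free router and this cluster
            links = {r: sum(1 for m in cl if r in adj[m]) for r in free}
            linked = {r: n for r, n in links.items() if n > 0}
            if not linked:
                continue            # this cluster has no mergeable router
            best = max(linked, key=linked.get)
            cl.append(best)
            free.discard(best)
            progress = True
    return clusters
```

On the line topology `0-1-2-3` with source 1, the entrances are 0 and 2, and router 3 can only join the cluster of 2, reproducing the "devoid of mergeable routers" case described for P 1 in Fig. 6.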
For routing all source and destination pairs, DRP is performed deterministically by sequentially selecting all routers as source nodes. Fig. 6 illustrates an example of DRP with R 8 and R 10 as source routers, respectively. In Fig. 6a, R 2 , R 7 , and R 9 adjacent to R 8 are designated as entrance routers of clusters P 1 , P 2 , and P 3 , respectively. Using the labeling rules, we place R 1 , R 6 , and R 11 into clusters P 1 , P 2 , and P 3 , respectively, based on their connectivity to other clusters. Since P 1 is entirely devoid of mergeable routers, the remaining unconnected routers must exist in either P 2 or P 3 . Finally, the resulting clusters comprise P 1 = {R 1 , R 2 }, P 2 = {R 3 , R 4 , R 5 , R 6 , R 7 }, and P 3 = {R 9 , R 10 , R 11 , R 12 , R 13 , R 14 , R 15 , R u1 }.
In Fig. 6b, R 7 , R 9 , and R 11 adjacent to R 10 are designated as entrance routers for clusters P 1 , P 2 , and P 3 , respectively. Using the labeling rules, we place R 6 , R 8 , and R 12 into clusters P 1 , P 2 , and P 3 , respectively, based on their connectivity to other clusters. Since P 1 is entirely devoid of mergeable routers, the remaining unconnected routers must exist in either P 2 or P 3 . Finally, the resulting clusters comprise P 1 = {R 1 , R 2 , R 3 , R 4 , R 5 , R 6 , R 7 }, P 2 = {R 8 , R 9 }, and P 3 = {R 11 , R 12 , R 13 , R 14 , R 15 , R u1 }.

Adaptive Branching
Adaptive branching (ADB) utilizes the path diversity inside each partitioned cluster to disperse network traffic. Alternative paths can be created apart from the labeled path in the cluster. However, since the alternative paths are not labeled regularly, the Coffman conditions cannot always be avoided. We define label-detouring conditions (LDCs) that prevent both circular wait and hold-and-wait, thus allowing alternative-path routing only in these cases.
Label-Detouring Condition 1 (LDC 1)
The packet is destined to the downstream router in the detouring direction, and the input buffer of the downstream router in the detouring direction is empty.

Label-Detouring Condition 2 (LDC 2)
There is a destination whose label is either higher or lower than both the current router and the downstream router in the detouring direction, and the entire packet can be stored in the available input buffer space of the downstream router in the detouring direction. Satisfying LDC 1 and LDC 2 avoids circular wait and hold-and-wait, respectively, thus allowing detouring without deadlock. LDC 1 enables packets to branch toward a downstream router that is itself a destination; because such branching packets are ejected directly at the destination router, they never experience circular wait. LDC 2 supports branching when there is enough input buffer space in the downstream router to hold the entire packet; hold-and-wait is never activated for packets that have been completely stored in the downstream router. In addition, the resulting alternative path can reduce packet latency because it has fewer hops than the labeled path.
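The two conditions can be sketched as predicate functions, assuming the router can observe the downstream node's destination membership, labels, and free buffer space (all names and signatures are illustrative, not the paper's hardware interface):

```python
# Sketch of the two label-detouring condition (LDC) checks.
def ldc1(downstream, destinations, downstream_buffer_used):
    """LDC 1: the downstream router in the detouring direction is itself a
    destination and its input buffer is empty."""
    return downstream in destinations and downstream_buffer_used == 0

def ldc2(current_label, downstream_label, dest_labels,
         packet_size, downstream_buffer_free):
    """LDC 2: some destination lies beyond both routers in label order, and
    the whole packet fits in the downstream router's free buffer space."""
    lo, hi = sorted((current_label, downstream_label))
    beyond = any(d < lo or d > hi for d in dest_labels)
    return beyond and packet_size <= downstream_buffer_free

print(ldc1(11, {10, 11, 12}, 0))    # True: R11 is a destination, buffer empty
print(ldc2(9, 11, [12, 14], 4, 8))  # True: destinations beyond R11, packet fits
```

Note that LDC 2 deliberately rejects destinations whose labels fall between the current and downstream routers; such packets must also follow the labeled path, as described next.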
If there exists any destination with a label valued between the current router and the downstream router of the detouring direction, the packet must also be propagated through the labeled path to avoid circular wait. Accordingly, the packet is branched to both a labeled path and a label-detouring path, thus reducing total hop counts. Fig. 7a depicts a label-detouring path in the ADB-enabled NoC of Fig. 6, where R 8 is the source router. If either LDC is satisfied, the packets can be routed via the label-detouring path. For example, for a packet whose source is R 8 and destinations are R 10 , R 11 , R 12 , R 13 , R 14 , and R 15 as shown in Fig. 7b, if R 9 satisfies LDC 1 and LDC 2, the packet is routed to R 11 and R 12 through label-detouring paths. Algorithm 3 describes the route computation with ADB for multicast packets in the router. The packet is ejected to the corresponding output port if the current router is included in the packet destinations (lines 1-4). If the current router is the source router, packets are transmitted for each cluster partitioned by the DRP algorithm (lines 5-11). If the LDCs are satisfied for the detouring-direction output ports, the router forwards the packet to the corresponding ports (lines 12-48).
The detailed NoC configurations are presented in Table 2. All implemented routing approaches were simulated on an 8×8 structured mesh with 64 nodes. The prevalent Hamiltonian path of DP [13] was employed in the mesh without unlabeled routers, and the LPS algorithm in Section 3.2 was applied only in the customized NoC.
The connectivity of the customized NoC was generated through task graphs for free (TGFF) [29], which is commonly used to build a heterogeneous multicore communication task graph (CTG), and a fault-tolerant topology generation (FTTG) [30], which optimizes the resulting irregular topology. As a result, 20 distinct customized NoC models with 16, 25, 36, 49, 64, and 81 communication nodes were constructed. Furthermore, the generated custom topology models were classified based on whether the average router degree was greater than five. Then, we conducted a correlation analysis for each network size with the corresponding mesh topology model (average degree of five). Because ACP, HOECP, and HRA were dedicated to mesh-based NoCs, they were excluded from the customized NoC simulations.
The stimulus was generated using uniform random, CTG-based, and Rent's rule [31] patterns to reflect various traffic conditions. Uniform random traffic arbitrarily produces unicast and multicast packets depending on the injection ratio for all destinations. CTG-based traffic injects packets according to the traffic rate between each communication node in the task graphs generated by FTTG [30]. Rent's rule traffic is inspired by the observation that "communication nodes are mapped in a direction to reduce hop count" [31], creating a realistic traffic pattern that intensifies the traffic between nearby nodes.
A bimodal distribution of packets was assumed, with 70 % of the packets being minimal and 10 % being maximal, to reflect the realistic packet-switching pattern of the heterogeneous multicore platform, in accordance with [32]. In addition, guided by prior research [33] that reported real-world multicast percentages, we adopted two different scenarios: 5 % multicast traffic (low multicast intensity) and 30 % multicast traffic (high multicast intensity). For DNN acceleration, a scenario based on the traffic model of DNN+NeuroSim [34] was utilized in the evaluation. DNN+NeuroSim is a benchmark for training and inference behavior on the CIFAR-10 and ImageNet datasets, generating massive SSMD traffic from a single memory node to multiple computing units. The simultaneous multi-threading (SMT) scenario of Gem5 [35] was used to generate traffic benchmarks, including configuration, synchronization, and cache-coherence operations. All router models under consideration were implemented using industry-standard Verilog HDL [36] and DesignCompiler [37] with the SAED 32 nm standard cell library [38] to assess the area overhead. The saturation point is defined as the injection rate at which the latency is twice that at the zero-load [3]. As the injection rate increased, the networks saturated in the order of HTA, MRBS, CTR, SmartFork, ACP, HOECP, MRCN, and HRA. Adaptive routing in MRCN reduced contention by allocating alternative paths according to the traffic conditions, resulting in a 23.24 % higher saturation point than the tree-based approaches. In addition, MRCN achieved a saturation point 12.41 % higher than the other path-based approaches, which focus on mesh topologies, by avoiding contention efficiently. As an exception, the saturation point of MRCN was lower than that of the mesh-optimized HRA: HRA showed a slightly better saturation point than MRCN in the 8×8 mesh-based NoC at every injection rate.
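The saturation-point rule quoted above (the injection rate at which latency doubles the zero-load latency) can be recovered from a measured latency curve by interpolation; this helper is an illustrative sketch, not part of Noxim:

```python
# Sketch: locating the saturation point on a measured latency curve.
def saturation_point(samples):
    """samples: list of (injection_rate, avg_latency) pairs, rate-ascending.
    Returns the rate where latency reaches twice the zero-load latency,
    linearly interpolated, or None if never reached."""
    zero_load = samples[0][1]
    target = 2.0 * zero_load
    for (r0, l0), (r1, l1) in zip(samples, samples[1:]):
        if l1 >= target:
            if l1 == l0:
                return r1
            # linear interpolation between the two bracketing samples
            return r0 + (r1 - r0) * (target - l0) / (l1 - l0)
    return None

curve = [(0.01, 20.0), (0.02, 22.0), (0.03, 30.0), (0.04, 60.0)]
print(saturation_point(curve))  # ~0.0333 (latency crosses 40.0 here)
```

Interpolating between the two bracketing samples avoids over- or under-stating the saturation point when the simulator sweeps the injection rate in coarse steps.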
MRCN is a scalable multicast routing algorithm that can be applied to both regular and customized topologies, whereas HRA is optimized for mesh topology.

Average Latency
In the 8×8 mesh network, HRA is likely to provide superior throughput. The hybrid path balancing method (HPBM) in HRA optimizes the path of branched packets by calculating the total transmission hop count from the branching point to the destinations based on the geometrical properties of the mesh with coordinate-based analysis. In contrast, coordinate-based characteristics limited to the mesh cannot be applied to the ADB of MRCN; instead, we defined labeled-path detouring conditions that can be applied to all topologies. This difference in determining the multicast routing path in the mesh topology explains the superior saturation point of HRA.
NoC traffic models that impose a high communication load on a specific path tend to saturate at a lower packet injection rate than other traffic models. Because the concentrated load renders links on that path unavailable, the NoC quickly reaches the limit of congestion it can tolerate. Uniform random traffic generates relatively even contention across channels. In CTG-based traffic, packets are concentrated on specific paths, so congestion frequently occurs on the links where communication paths corresponding to the task graph overlap. Rent's rule traffic generates traffic-intensive links among high-degree routers, creating wide traffic hotspots in the areas surrounding these routers. Therefore, uniform random traffic, with its distributed channel load, exhibits lower average latency and the highest saturation injection rate among the traffic models, whereas CTG-based traffic, which concentrates the channel load on specific paths, shows a relatively low saturation point.
The saturation points for meshes of varying network sizes are presented in Table 3. In the 9×9 mesh, the improvement of MRCN over the tree-based approaches, which was 5.61 mPKTS/Cycle/Node in the 4×4 mesh, increases to 13.23 mPKTS/Cycle/Node. In addition, the improvement of MRCN over HOECP increases from 0.61 mPKTS/Cycle/Node in the 4×4 mesh to 1.72 mPKTS/Cycle/Node in the 9×9 mesh. The results imply that ADB in MRCN significantly enhances NoC performance by preventing frequent contention as the network expands. Fig. 9 depicts the average latency according to the injection rate in the 64-node customized NoCs with a degree lower than five. As the injection rate increased, the networks saturated in the order of HTA, MRBS, CTR, SmartFork, and MRCN. Adaptive routing in MRCN reduced contention by allocating alternative paths according to the traffic conditions, resulting in a 19.61 % higher saturation point than the tree-based approaches.
CTR and MRBS showed similar average latencies and saturation points because both adopt tree-based minimal routing. When contention arises, CTR stores packets in the input buffer of a single channel, whereas MRBS uses multiple channels. These different buffer management methods cause a slight variation in the number of cycles required to transmit buffered data to the destination node; for this reason, even with the same simulation benchmarks, the CTR-based NoC shows a similar but not identical average latency to the MRBS-based NoC. Table 4 lists the saturation points for the customized NoCs with a degree lower than five across network sizes. In the 81-node networks, MRCN's improvement over the tree-based approaches, which was 3.02 mPKTS/Cycle/Node in the 16-node networks, increases to 9.84 mPKTS/Cycle/Node. Fig. 10 depicts the average latency as a function of the injection rate in the 64-node customized NoCs with a degree higher than five. As the injection rate increased, the networks saturated in the order of HTA, MRBS, CTR, SmartFork, and MRCN. Adaptive routing in MRCN reduced contention by allocating alternative paths according to the traffic conditions, resulting in a 27.22 % higher saturation point than the tree-based approaches. Table 5 lists the saturation points for the customized NoCs with a degree higher than five across network sizes. In the 81-node networks, MRCN's improvement over the tree-based approaches, which was 1.98 mPKTS/Cycle/Node in the 16-node networks, increases to 13.61 mPKTS/Cycle/Node. These results imply that ADB in MRCN significantly enhances NoC performance by preventing frequent contention as the network expands.
At the same injection rate, both the average latency of the network and the latency improvement of MRCN increased as the average degree of the NoC was raised. This result implies that MRCN avoids contention by exploiting the path diversity of the customized NoC to the fullest extent possible in topologies with a high degree and link complexity.
The results reveal that, as the injection rate increases, all other multicast routings in the customized NoCs reach their saturation points earlier than MRCN, indicating that MRCN improves throughput and average latency over the prior art at high injection rates.
At 0.065 PKTS/Cycle/Node, the throughput gap between MRCN and the tree-based approaches widened as the traffic distribution became asymmetric. The throughput gap between SmartFork and MRCN was 21.06 % and 49.92 % larger under CTG-based and Rent's rule traffic, respectively, than under uniform random traffic. Even when contention is concentrated on a single spot, ADB in MRCN minimizes throughput degradation.
MRBS provides higher throughput than MRCN at a low injection rate of 5 mPKTS/Cycle/Node. Minimal tree-based routings, including MRBS, maximize throughput under contention-free conditions. On the other hand, MRCN, which adapts to traffic conditions, is not easily saturated when contention is frequent. Owing to this difference, MRCN shows higher throughput in Figs. 11d and 11f, which involve a high multicast intensity of 30 % and intensive traffic on a specific link, whereas MRBS shows higher throughput under traffic conditions with little contention. Table 6 compares the areas of the 5-port routers of CTR, HTA, ACP, HOECP, HRA, SmartFork, MRBS, and MRCN, synthesized with Synopsys® Design Compiler [37] under the SAED 32 nm library [38]. CTR employs an input buffer large enough to store entire packets for cut-through routing, whereas SmartFork employs multiple VCs. These additional input buffers make the CTR and SmartFork routers 75.48 % and 75.55 % larger than the MRCN router, respectively. The other routers have the same input buffer size, but their RC logic varies depending on the routing approach. The complexity of the route computation algorithm increases in the order of ACP, HOECP, HRA, and MRCN, correspondingly raising the RC logic area. Although MRCN is a multicast routing solution that is not limited to a mesh and exhibits higher complexity than the mesh-optimized routing methods, the overall router area differences were less than 4.36 %.

Energy Overhead
An additional simulation was conducted using the energy estimation function provided by Noxim. First, the power estimation function of Synopsys® Design Compiler and the power model of the SAED 32 nm design kit were applied. In addition, a switching activity interchange format (SAIF) file, including toggle counts and signal retention times, was extracted from the average-delay simulation results. Using the SAIF file for signal-transition-aware simulation further improves the power estimation accuracy. The power model of the multicast routers was converted into an energy model using Orion 3.0 [39] and integrated into Noxim. The simulation was conducted on an 8×8 mesh at an injection rate of 30 mPKTS/Cycle/Node to keep these variables controlled.
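The conversion from a power model to an energy model along the lines of this flow can be sketched as follows. This is a generic static-plus-dynamic accounting, not the actual Orion 3.0 interface, and all numeric figures are hypothetical, for illustration only.

```python
def router_energy(p_leakage_w, e_dyn_per_toggle_j, toggle_count, sim_time_s):
    """Estimate router energy as static leakage integrated over the run
    plus dynamic energy per signal toggle.

    Toggle counts would come from a SAIF-style activity trace; the split
    into leakage power and per-toggle dynamic energy is an assumption
    modeled on Orion-style router power estimation.
    """
    static = p_leakage_w * sim_time_s            # leakage power * runtime
    dynamic = e_dyn_per_toggle_j * toggle_count  # per-toggle switching energy
    return static + dynamic

# Hypothetical 32 nm router figures.
e = router_energy(p_leakage_w=1.5e-3, e_dyn_per_toggle_j=0.8e-12,
                  toggle_count=2_000_000, sim_time_s=1e-3)
print(f"{e * 1e6:.3f} uJ")  # prints "3.100 uJ"
```

Under this split, logic that must stay active every cycle (such as VC allocation) contributes continuously to the static term, whereas logic activated only on packet-header arrival contributes mainly through its toggle count, which matches the comparison drawn in the results below.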
The energy estimation results are presented in Table 7. MRCN showed the highest energy efficiency among the multicast routings applicable to customized NoCs. This result indicates that the energy overhead of the route computation logic for path-based routing is lower than that of the additional hardware resources, such as VCs and the DRU, used to avoid multicast deadlock. The VC allocation and deadlock recovery schemes of the tree-based approaches consume considerable energy because they must remain active in every cycle while the chip operates. On the other hand, the route computation of MRCN operates only when a packet header is stored, resulting in relatively low energy consumption compared with the tree-based approaches.