MRBS: An Area-Efficient Multicast Router for Network-on-Chip using Buffer Sharing

Network-on-chip (NoC) has become the mainstream fabric architecture for chip multiprocessor (CMP) design. Owing to the market-driven advancement of modern applications in CMP, multicast traffic is aggressively increasing to support barrier synchronization, multithreading, and cache coherence protocols. Although multicast by branching of packets in the NoC router facilitates shortest path routing, additional branching-induced deadlocks must be circumvented. Existing NoC studies on deadlock-free minimal path routing in multicast traffic have typically deployed additional virtual channels or large buffers to hold entire packets, thereby significantly increasing the router area. Focusing on the area-efficient solution while sustaining the performance, we propose a novel multicast router using buffer sharing (MRBS) to guarantee deadlock-free multicast routing by exploiting the spatial diversity of the input buffer. MRBS ensures minimal path routing without requiring additional virtual channels or large buffers to hold entire packets. Extensive experiments were conducted by varying the buffer, packet, and network sizes, as well as the number of destinations per packet, under random multicast traffic with diverse injection rates. Simulation results show that MRBS achieves a 39.3 % improvement in the area-delay product on average for various network sizes compared to the conventional tree-based router.

Such approaches share a shortcoming: the input buffer in every router must be at least as large as the maximum multicast packet length, which leads to significant area overhead. A typical NoC router spends a notable portion of its area on buffers; thus, this overhead becomes more severe as the required communication bandwidth increases [7]. Conventional deadlock-free multicast routers are therefore inefficient in terms of buffer space utilization. We leverage this observation to propose a low-cost multicast router using buffer sharing (MRBS) that guarantees deadlock freedom without employing additional virtual channels or large buffers to hold entire packets.
This study explores deadlock-free multicast router architectures that enhance area efficiency while sustaining performance. To utilize buffer space efficiently, especially under multicast traffic, the proposed router manages all input buffers in a router as a register file instead of dedicating a single buffer to each input port. Sharing buffer resources enables the following contributions. (1) A buffer-sharing router architecture that maximizes resource utilization. (2) Resolution of multicast deadlocks in a tree-based router while sustaining minimal paths. (3) An exploration of the design-space trade-offs, focusing on the area-delay product of MRBS relative to a conventional tree-based router.
The remainder of this paper is organized as follows. Section II presents the background and related work. Section III describes the proposed technique in terms of the router architecture and the deadlock recovery procedure. Simulation results and analysis under various conditions are provided in Section IV. Finally, concluding remarks are drawn in Section V.

II. BACKGROUND & RELATED WORK
Various efforts have been made to improve the multicast performance of NoCs, which can largely be categorized into tree-based methods that support branch operations inside a router and path-based ones without branch operations.

A. MULTICAST ROUTING
Path-based multicast routes packets using node labeling, where the XY coordinates of each node determine its label. The path-based method divides all nodes into groups based on these labels and sends a duplicate packet to each group that includes one or more destination nodes, as shown in Fig. 1 (a).
The dual-path routing method [8] divides destinations into two groups: one with labels smaller than the source node's and the other with larger labels. Depending on the node labeling, packets are sent to the higher- or lower-label group. Several path-based algorithms use different partitioning techniques to reduce path lengths. Multipath routing improves on the dual-path method by segmenting the network into four rectangular subset groups based on the source node location; nodes in the same subgroup are assigned to the same path, preventing long routing paths. The Hamiltonian adaptive multicast unicast model by Ebrahimi et al. [9] encompasses more routing paths, increasing the probability that a packet chooses a less congested path.
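To make the partitioning concrete, the dual-path split can be sketched as follows. This is an illustrative Python model, not code from the cited works; the serpentine ("snake") Hamiltonian labeling and the helper names are assumptions.

```python
# Illustrative sketch of dual-path destination partitioning on a mesh,
# assuming a serpentine Hamiltonian node labeling (an assumption here).

def hamiltonian_label(x, y, width):
    """Label nodes along a serpentine Hamiltonian path over the mesh."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def dual_path_partition(src, dests, width):
    """Split destinations into high-label and low-label groups relative
    to the source node, as in dual-path multicast routing."""
    src_label = hamiltonian_label(*src, width)
    high = [d for d in dests if hamiltonian_label(*d, width) > src_label]
    low = [d for d in dests if hamiltonian_label(*d, width) < src_label]
    return high, low
```

One copy of the packet would then traverse the path toward the high-label group and another toward the low-label group.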
In tree-based routing, the source node builds a spanning tree and sends a single multicast message down the tree. Fig. 1 (b) shows a simple tree-based routing example. When routing multi-destination packets, the source node is the root and the destination nodes are leaves. Using shortest-path routing, the packets are branched onto different links at intermediate nodes. To duplicate flits at an intermediate node, the packet must be forwarded simultaneously in all requested directions. A blocked path prevents a flit from being duplicated, increasing the risk of a multicast deadlock; in other words, all tree-based methods are vulnerable to multicast deadlocks.
The hardware tree-based multicast routing algorithm [10] implements a deadlock queue in each router. Recursive partitioning multicast [11] and improved minimal multicast routing [12] use dual virtual channels to avoid deadlock. The authors of [13] resolved the cyclic-dependency deadlock by adaptively selecting a routing path based on the remaining buffer size of the downstream router.
Path-based routing without branch operations is advantageous for avoiding multicast deadlocks. However, path-based routing suffers from non-minimal routing paths caused by the unbalanced partitioning that arises when the source node is located at a network boundary. Tree-based multicast routing has low latency but must solve multicast deadlocks through additional hardware resources [14]. Conventional approaches have eliminated deadlock by employing FIFO buffers deep enough to hold the largest packet or virtual channels that prevent input buffer preemption. However, these methods dramatically increase the input buffer size, which occupies most of the router area. MRBS reduces the input buffer size through a buffer sharing scheme while preserving the low latency that is the strength of the tree-based approach.

B. DEADLOCK IN MULTICAST ROUTING
In wormhole-switched NoCs, contention between packets causes head-of-line (HoL) blocking, leading to performance deterioration or communication failure; thus, deadlock freedom is a vital design objective. A deadlock in unicast is mainly caused by a cyclic dependency between packets and is easily avoided by restricting the turn model at the routing stage [15]. In multicast routing, however, additional types of deadlocks owing to the branching of packets must also be handled. Fig. 2 shows an example of a multicast deadlock. The dashed lines indicate port requests that are not yet granted, and the solid lines represent granted port requests. Packet A requests the southern port of Router 2, which is blocked by packet B, whereas packet B demands the southern port of Router 3, which is held by packet A. Because packet A branches in Router 1, one of its paths blocks packet B while the other is blocked by packet B. In this situation, packet A's request for the southern output port can be granted, whereas its request for the eastern output port cannot, resulting in a multicast deadlock.
Coffman et al. [16] showed that a deadlock occurs if and only if four conditions, namely mutual exclusion, hold and wait, no preemption, and circular wait, are fulfilled simultaneously. These four conditions can be rephrased as:
• Mutual exclusion: each output port is not sharable
• Hold and wait: a packet holds one output port while waiting for another
• No preemption: output ports cannot be preempted
• Circular wait: a circular chain of waiting packets exists
Conventional tree-based routers dismiss the hold-and-wait condition by enlarging the buffer size, which results in area overhead. In contrast, MRBS releases the hold-and-wait condition without requiring large buffers to hold entire packets.
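The circular-wait condition in the Fig. 2 scenario can be illustrated with a small wait-for-graph check. This is a hypothetical Python model for exposition only; the graph representation is an assumption and not part of the MRBS hardware.

```python
# Illustrative check for Coffman's circular-wait condition using a
# wait-for mapping {packet: packet it waits on} (a modeling assumption).

def has_circular_wait(wait_for):
    """Return True if the wait-for mapping contains a cycle."""
    for start in wait_for:
        seen = set()
        node = start
        while node in wait_for:
            if node in seen:
                return True  # circular chain of waiting packets
            seen.add(node)
            node = wait_for[node]
    return False
```

In Fig. 2, packet A waits on packet B and packet B waits on packet A, so the mapping `{"A": "B", "B": "A"}` contains a cycle and the network is deadlocked.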
A long packet transmission distance increases the probability of deadlock: the longer the transmission distance, the more frequently HoL blocking occurs [17], [18]. Because HoL blocking creates a hold-and-wait condition, one of Coffman's conditions, it can result in deadlock.

C. BUFFER SHARING
A number of studies have examined NoC routers with a centralized shared input buffer [19]–[21] or a distributed input buffer sharing scheme [22], [23]. A centralized shared input buffer is a single buffer shared by all input ports. Distributed input buffer sharing allows other input ports to share, through multiplexing, a distributed input buffer dedicated to each input port. Li et al. designed a router with a centralized shared buffer and proposed a packet merging mechanism to improve buffer space utilization and network throughput for a dataflow architecture [19]. Bhardwaj et al. introduced a continuous-time replication strategy in an asynchronous NoC and improved throughput while minimizing power and area overhead using a centralized shared buffer [20]. A shared-queue router architecture was designed to maximize buffer utilization and boost network throughput [21]. Flexible buffering mechanisms were proposed to improve average throughput in 3D NoCs [22]: the first increases the utilization of underutilized buffers based on the number of free slots available in each port, prioritizing the least occupied buffers, and the second employs a simple priority order for each buffer. The Flexible router [23] lets other buffers handle requests to a busy buffer, increasing the saturation rate while sustaining area.
MRBS targets a different buffer size spectrum than the routers in [19]–[23] while using a distributed input buffer sharing scheme. When each input buffer is smaller than the maximum packet size, MRBS still guarantees area-efficient deadlock freedom. The other studies, in contrast, sought to improve performance with buffers larger than the maximum packet length; because they used cut-through switching, which does not cause this type of deadlock, deadlock was not considered, and their routers occupy a considerably larger area than MRBS. Overall, our work differs from previous studies in that it dramatically reduces the buffer size using a deadlock recovery scheme and adopts distributed input buffer sharing, which enables faster packet access than a centralized shared buffer.

III. MRBS
The main concept behind MRBS is to leverage the spatial diversity of the input buffer for multicast deadlock recovery while maintaining minimal path routing. A router architecture (Section III.A) and a deadlock recovery scheme (Section III.B) are presented to realize this idea.

A. ROUTER ARCHITECTURE
Fig. 3 illustrates the proposed multicast router architecture. Instead of using dedicated deadlock queues [24], [25], we utilize the input buffer as a deadlock queue only when a multicast deadlock is detected. The versatile input buffer incorporates an input port selector and a deadlock recovery control unit (DRU) that enable multicast deadlock recovery to facilitate buffer sharing. The input port selector redirects the data arriving at a specific input port so that it can be forwarded to another available input buffer. The DRU controls the recovery procedure and coordinates the handshaking signals (i.e., D_req, D_gnt, Rdy, and Done) between adjacent routers as well as the control signals for the input port selector. All handshaking signals are one-hot encoded to reduce decoding delay. The associated notations are described in Table 1. To explain the multicast deadlock and recovery operation of MRBS, the trunk router R_tr, twig router R_tw, and greenleaf buffer B_gl are defined according to the branching flow of a multicast packet.
• Trunk router (R_tr): A router acts as trunk router R_tr when a multicast packet branches inside it, and corresponds to Router 1 in Fig. 2. When a multicast deadlock occurs, some of the output ports of R_tr are blocked, creating the hold-and-wait situation in which branched packets cannot be sent to the other output ports. The DRU in R_tr performs deadlock detection and blocked packet forwarding to dismiss the hold-and-wait situation.
• Twig router (R_tw): A router acts as twig router R_tw when it receives the branched multicast packet, and corresponds to Router 2 in Fig. 2. When a multicast deadlock occurs, branched packets cannot propagate because other packets preempt the output port of R_tw. The DRU in R_tw performs buffer sharing and flit reordering using the input port selector.
• Greenleaf buffer (B_gl): A greenleaf buffer B_gl is an available input buffer in R_tw selected to hold part of the multicast packet sent from R_tr. By forwarding the rest of the packet from R_tr into B_gl, the hold-and-wait situation in R_tr can be released. B_gl is determined through arbitration among the input buffers, excluding the buffer that caused the deadlock. As shown in Fig. 2, there are four B_gl candidates in a five-port router, excluding the west input buffer of R_tw (Router 2).

B. MULTICAST DEADLOCK RECOVERY PROCEDURE
Since MRBS does not modify the pipeline operations and additional operations are involved only when a multicast deadlock occurs, the extra cycles are negligible considering the rate of deadlock occurrence. In addition, we devote our efforts to minimizing the additional cycles caused by the input port selector and DRU logic. The multicast deadlock recovery scheme using buffer sharing is applied simultaneously to multiple adjacent routers. This section describes deadlock detection (step 1) and flit forwarding (step 2) on R_tr, and input buffer sharing (step 3) and flit reordering (step 4) on R_tw. Fig. 4 depicts the overall deadlock detection and recovery procedure.

1) Multicast Deadlock Detection
Multicast deadlock detection in MRBS is heuristic and timer-based: all input buffers are equipped with timers, and a deadlock is detected when a timer reaches a certain threshold. Thus, not only deadlocks but also contention caused by HoL blocking can be repaired immediately.

FIGURE 4. Multicast deadlock recovery procedure
We set this threshold to the number of cycles within which R_tw can resolve the worst-case contention by itself: when packets are stored in all input buffers of R_tw and all output ports are available, T_thres is the maximum number of cycles required to remove the contention in R_tw, determined by the maximum packet length PL_max, the number of ports N_P, the buffer depth D_buff, and the propagation latency L_prop. If HoL blocking is not eliminated within this time span, the DRU determines that R_tw has failed to route the packet and initiates the deadlock recovery procedure. Since we conducted the simulation in a mesh network with a maximum packet length of 10 flits, PL_max, N_P, D_buff, and L_prop were set to 10, 5, 5, and 5, respectively. Therefore, MRBS effectively handles all possible traffic congestion caused by multicast packets. Because MRBS detects all time-out flits that occur during branch operations, it encompasses all multicast deadlock cases. Whether a time-out flit causes a multicast deadlock can be identified by checking that the following two conditions are satisfied simultaneously.
• Condition 1: the time-out flit must request more than one output port.
• Condition 2: the time-out flit must be of a body or tail type.
Condition 1 expresses the requirement for multicast packets: because MRBS detects only multicast deadlocks, unbranched packets are treated as congestion. Condition 2 reflects that a multicast deadlock always occurs after the packet header is branched. Therefore, a time-out flit satisfying both Conditions 1 and 2 causes a multicast deadlock, and a router containing such a time-out flit becomes an R_tr.
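The two-condition classification can be sketched as follows. This is an illustrative Python model, not the DRU hardware; the flit fields and port sets are assumptions, while the threshold of 50 cycles follows the simulation setup in Section IV.

```python
# Sketch of timer-based multicast deadlock classification (Section III.B.1).
# T_THRES follows the simulation setup; flit fields are modeling assumptions.

T_THRES = 50  # cycles: worst-case contention that R_tw can resolve itself

def classify_timeout(timer, requested_ports, flit_type):
    """Classify a buffered flit once its timer is inspected."""
    if timer < T_THRES:
        return "no timeout"
    # Condition 1: more than one requested output port (multicast).
    # Condition 2: body or tail flit (header already branched).
    if len(requested_ports) > 1 and flit_type in ("body", "tail"):
        return "multicast deadlock"  # this router becomes R_tr
    return "congestion"              # unbranched packet: treated as congestion
```

A timed-out body flit requesting two output ports is thus flagged as a multicast deadlock, whereas a timed-out flit with a single requested port is treated as ordinary congestion.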

2) Flit Forwarding
MRBS minimizes additional congestion during buffer sharing by reducing the number of R_tw routers. This is accomplished by exploiting the buffer state of the downstream routers. When a multicast deadlock is detected, the DRU sends DR_req only to those output ports, among the ports requested by the multicast packet, whose credit status is full. A request to a port whose credit status is not full can still be granted and is thus not the source of the deadlock. The routers that receive DR_req become R_tw. DR_req is created by the DRU and maintained until the tail flit of the multicast packet is removed from the buffer.
R_tw sends DR_gnt to R_tr through the setup procedure described in step 3. An R_tr that has received DR_gnt forwards a multicast flit, that is, a flit of the multicast packet that previously could not be forwarded because the credit status was full. This is done by sending a forward signal to the input buffer holding the multicast flits when the DR_gnt corresponding to DR_req is detected. The request of the multicast flit in the input buffer that receives the forward signal is always granted regardless of credit status. The forward signal is maintained until the tail flit of the multicast packet is forwarded. Algorithm 1 presents the above R_tr operation of MRBS.
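The credit-based targeting of DR_req can be sketched as below. This is a hypothetical Python model; the port names and the credit map are illustrative assumptions.

```python
# Sketch of step 2: DR_req is raised only toward requested output ports
# whose downstream buffer is out of credits (modeling assumption: a credit
# count of 0 means the downstream input buffer is full).

def select_drreq_ports(requested_ports, credits):
    """Return the requested output ports with no credits left; the
    downstream routers behind these ports become R_tw."""
    return {p for p in requested_ports if credits[p] == 0}
```

Ports with remaining credits are excluded because their requests can still be granted through the normal pipeline.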

3) Input Buffer Sharing
An R_tw that has received DR_req must select B_gl and send DR_gnt to R_tr through a setup procedure. The setup procedure for B_gl is essential to prevent the worst case of losing packets. This procedure is controlled by the DRU and proceeds in the following order. (1) When DR_req is detected, R_tw selects B_gl through arbitration (in a round-robin manner) among the remaining buffers, excluding the input buffer that caused the deadlock. An empty buffer (detected via a flag from the input buffer) has the highest priority, which minimizes the deadlock recovery cycles.
(2) Rdy is sent to the router, or to the network interface when B_gl is the local input buffer, connected to B_gl so that it stops receiving further packets. (3) When Rdy is detected at that router or network interface, its switch allocator is controlled so that the corresponding output port grants no packet, and Done is sent to R_tw. The delay added by the setup procedure is approximately two cycles (depending on the packet length pre-allocated to B_gl). The input port selector of R_tw must change the input port so that multicast flits forwarded from R_tr are stored in B_gl. Therefore, when DR_gnt is sent to R_tr, information on the initially allocated buffer and B_gl is recorded in the DRU in the form of a table (the buffer sharing table), and the input port selector changes from the idle state (in which all flits reside in their correct input buffers) to the recovery state. The input port selector works like a crossbar switch controlled by the buffer sharing table in the DRU. When the head flit of the initially allocated buffer is removed, the input port selector returns to the idle state. Fig. 5 shows an example of the input port selector operating in the recovery state: the multicast flit that would have been stored in the initially allocated buffer is migrated into B_gl.
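The B_gl arbitration of step (1) can be sketched as below. This is an illustrative Python model; the occupancy map, buffer names, and round-robin pointer handling are assumptions, not the DRU implementation.

```python
# Sketch of B_gl selection in step 3: exclude the deadlock-causing input
# buffer, give empty buffers top priority, otherwise pick round-robin.
# Buffer names and occupancy representation are modeling assumptions.

def select_greenleaf(buffers, occupancy, excluded, rr_pointer):
    """Pick the greenleaf buffer B_gl among candidate input buffers."""
    candidates = [b for b in buffers if b != excluded]
    empty = [b for b in candidates if occupancy[b] == 0]
    if empty:
        return empty[0]  # an empty buffer minimizes recovery cycles
    # Round-robin arbitration among the remaining (non-empty) candidates.
    return candidates[rr_pointer % len(candidates)]
```

In the Fig. 2 configuration, the west input buffer of Router 2 would be the excluded buffer, leaving four candidates in a five-port router.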

4) Flit Reordering
The branch operation is completed by forwarding multicast flits from R_tr to B_gl of R_tw. Finally, the flits divided between the initially allocated buffer and B_gl in R_tw must be reordered. After the multicast flits in the initially allocated buffer are granted access to the output port, flit reordering is performed so that the flits in B_gl follow them in the original packet order.
Fig. 6 shows an example of a multicast deadlock that occurs under XY multicast routing. Router 3, where the branch operation of packet B occurs, becomes R_tr, and Router 4 becomes R_tw. B_gl of R_tw is determined to be the local input buffer, and the multicast flits are then forwarded from R_tr. Because the branch operation of packet B is completed, the branch operation of packet A becomes possible, and the multicast deadlock is broken. Finally, the recovery procedure is completed by reordering packet B in R_tw.
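The reordering of step 4 amounts to draining the initially allocated buffer first and then B_gl, so the packet leaves R_tw in its original flit order. The sketch below is an illustrative Python model; the buffer contents are placeholders.

```python
# Sketch of step 4: the packet head sits in the initially allocated
# buffer and the remainder in B_gl; replaying the original buffer first
# restores the original flit order (contents here are placeholders).

def reorder(initial_buffer, greenleaf_buffer):
    """Replay the split multicast packet in its original flit order."""
    return list(initial_buffer) + list(greenleaf_buffer)
```

For example, a packet split as head and first body flit in the original buffer and the remaining body and tail flits in B_gl is emitted as one contiguous packet.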

IV. SIMULATION RESULTS AND ANALYSIS

A. SIMULATION SETUP
To validate the effectiveness of MRBS, the increase in average latency is analyzed relative to the decrease in area. A baseline conventional tree-based router (CTR) and MRBS were modeled in SystemVerilog [26] at the register-transfer level. We used Synopsys VCS, a tool commonly used for SoC design in industry and academia, to perform cycle-accurate simulation. The CTR was modeled with the stall handling, branch operation, and bit-string-encoded packet format of state-of-the-art tree-based multicast routers [6], [11], [12], [13], [14].
Using the widely accepted 2D mesh NoC topology, we compare the latency of the CTR- and MRBS-based NoCs. The simulation was conducted while increasing the network size from 4 × 4 to 10 × 10 to analyze latency and area trends with network size. The widely known deterministic XY routing is applied to both to exclude the effect of the routing algorithm.
The NoC and traffic configurations are listed in Table 2. As the synthetic traffic pattern, we evaluated uniform random traffic. The injected traffic consists of two packet types to mimic realistic system scenarios: two-flit short packets (such as request packets in a CMP) and longer 10-flit packets (such as response packets carrying a cache line). We assume a bimodal distribution in which 80 % of the packets are short and the rest are long, following [27]. In addition, guided by prior research [28] that reported real-world multicast percentages, we investigated two scenarios: (a) 5 % multicast traffic (low multicast intensity) and (b) 30 % multicast traffic (heavy multicast intensity). For both scenarios, each multicast packet is sent to at most 25 % of the network nodes, in line with observations from real applications.
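The synthetic traffic mix above can be sketched as a small generator. This is an illustrative Python model; the RNG usage and packet fields are assumptions, not the simulator's actual implementation.

```python
# Sketch of the synthetic traffic mix: 80 % two-flit packets, 20 %
# ten-flit packets, a configurable multicast fraction, and multicast
# destination sets covering at most 25 % of the nodes. Field names and
# RNG handling are modeling assumptions.

import random

def make_packet(rng, num_nodes, multicast_ratio):
    """Generate one packet descriptor under the bimodal length model."""
    length = 2 if rng.random() < 0.8 else 10  # short vs. long packet
    if rng.random() < multicast_ratio:
        max_dests = max(1, num_nodes // 4)    # up to 25 % of the nodes
        n_dests = rng.randint(1, max_dests)
    else:
        n_dests = 1                           # unicast packet
    dests = rng.sample(range(num_nodes), n_dests)
    return {"length": length, "dests": dests}
```

For the 30 % multicast scenario on a 64-node mesh, one would call `make_packet(rng, 64, 0.3)` per injection event.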
For the experiments with cache-coherence traffic, we injected traffic that mimics the behavior of (c) HyperTransport [29] (a directory-based protocol) and (d) Token Coherence [30] (a snooping-based protocol) in a 64-node (8 × 8) CMP. Broadcast traffic makes up a large portion of the injected traffic in the HyperTransport and Token Coherence protocols, accounting for 6 % and 15 %, respectively [27].
Because the CTR must accommodate all 10 flits of the maximum-length packet, each of its input channels has a buffer size of 10. In contrast, since MRBS embraces the buffer sharing scheme, each input channel has a buffer size of 5 (half that of the CTR) and includes a DRU and input port selector to support the proposed deadlock recovery process. As mentioned in Section III.B, T_thres (50 cycles) is determined from the worst-case contention, in which maximum-length packets from all input ports simultaneously request one output port.
Each router was synthesized with Synopsys Design Compiler using the SAED 32 nm standard cell library to extract timing and area information. We evaluated the average latency under each traffic pattern and the area overhead of each router in the NoC configurations. We validated the improvement of MRBS by deriving the area-delay product (ADP), which captures the trade-off between the two. Table 3 shows the saturation point of the CTR-based NoC at different network sizes and the average latencies of the CTR- and MRBS-based NoCs at the corresponding injection rate. The saturation point is the injection rate at which the latency is twice the zero-load latency [3], where the zero-load latency is the latency required to send a packet from its source node to all of its destination nodes with no other packets in the network. The saturation point of the MRBS-based NoC was approximately 3 % lower than that of the CTR-based NoC on average. This negligible difference in saturation point, despite halving the buffer size, indicates that the buffer capacity of the existing CTR-based NoC was not efficiently utilized by the actual multicast traffic. For a fair comparison, the average latency was measured at the same injection rate, set to the saturation point of the CTR-based NoC. Fig. 7 depicts the average latency under synthetic uniform random traffic. At multicast intensities of 5 % and 30 %, the average latency of the MRBS-based NoC showed 5.0 % and 5.5 % overhead compared to the CTR-based NoC, respectively. Under cache-coherence traffic, the average latency of the MRBS-based NoC increased by 6.2 % and 7.1 % over that of the CTR-based NoC, respectively, as shown in Fig. 8. Owing to this permissible latency difference, the router area reduction of the MRBS-based NoC significantly improves the ADP.
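The saturation-point criterion used above can be expressed as a simple scan over the latency curve. This is an illustrative Python sketch; the injection rates and latencies below are hypothetical values, not measured data.

```python
# Sketch of the saturation-point criterion: the first injection rate at
# which the measured latency reaches twice the zero-load latency.
# The latency curve passed in is hypothetical example data.

def saturation_point(rates, latencies, zero_load):
    """Return the first injection rate whose latency >= 2 * zero-load."""
    for rate, lat in zip(rates, latencies):
        if lat >= 2 * zero_load:
            return rate
    return None  # network never saturates over the swept rates
```

A network with a zero-load latency of 20 cycles is thus considered saturated once its average latency first reaches 40 cycles.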

B. LATENCY ANALYSIS
Since the deadlock recovery of MRBS is triggered by the time-out of a blocked multicast packet, it is also activated by long-lasting contention. Flit forwarding (step 2 in Section III.B) frees the input buffers occupied by the multicast packet in R_tr and resolves packet contention. This contention resolution mostly compensates for the average latency increase caused by the reduced input buffer size of MRBS.

C. AREA & POWER ANALYSIS
Table 4 lists the areas of MRBS and the CTR. Compared to the CTR, which holds 10 flits in each input buffer, MRBS, whose capacity is 5 flits, has an input buffer that is 51366 µm² smaller. Its route computation (RC) logic, switch allocator (SA), and crossbar switch (X-bar) are together 12368 µm² smaller, mainly owing to the simpler SA credit control for the downstream router. The added DRU and input port selector consume 1374 µm². As a result, the total area of MRBS is 42.1 % smaller than that of the CTR. This shows that the area overhead of the DRU and input port selector is negligible because the buffer occupies a significant portion of the total area. The power consumption of the CTR and MRBS is shown in Table 5. MRBS consumes 41.0 % less power than the CTR. The input buffer in MRBS is half the size of that in the CTR, which yields a 40.6 % reduction in register power. The smaller credit control logic in the SA of MRBS reduces its power dissipation by 42.3 %. The additional power consumed by the DRU is only 0.6 %.

D. AREA-DELAY PRODUCT
ADP is a performance index for evaluating an NoC considering the trade-off between average data transmission latency and area overhead [31]. Fig. 9 shows the normalized ADP of the MRBS-based and CTR-based NoCs for different multicast traffic ratios. The MRBS-based NoC shows 4.5 % (low multicast intensity) and 6.1 % (high multicast intensity) higher average latency than the CTR-based NoC. In contrast, the area of the MRBS-based NoC is 42.1 % lower than that of the CTR-based NoC, regardless of network size. As a result, the MRBS-based NoC has (a) 39.3 % and (b) 38.3 % lower ADP than the CTR-based NoC.
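As a cross-check, the reported ADP gain follows directly from the relative numbers above. The Python sketch below multiplies the 42.1 % area reduction by the 4.5 % latency overhead (low multicast intensity) and lands close to the reported 39.3 % improvement; the normalization to a CTR baseline of 1.0 is an assumption for illustration.

```python
# Cross-check of the normalized area-delay product (ADP) comparison,
# using the reported relative area and latency figures.

def adp(area, delay):
    """Area-delay product: smaller is better."""
    return area * delay

ctr_adp = adp(1.0, 1.0)                   # normalized CTR baseline
mrbs_adp = adp(1.0 - 0.421, 1.0 + 0.045)  # 42.1 % less area, 4.5 % more delay
improvement = 1.0 - mrbs_adp / ctr_adp    # fractional ADP reduction
```

The computed improvement is approximately 0.395, consistent with the 39.3 % figure within rounding of the reported percentages.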

E. ENERGY-DELAY PRODUCT
We performed additional simulations to obtain the energy-delay product (EDP) of the CTR- and MRBS-based networks at various network sizes. A switching activity interchange format (SAIF) file, containing toggle counts and retention times for each signal, was extracted from the average-delay simulation. Using the SAIF file for signal-transition-aware simulation enables more precise power estimation. The EDP comparison between the MRBS-based and CTR-based NoCs is depicted in Fig. 10. The EDPs of the MRBS-based approach are (a) 28.5 % and (b) 26.6 % lower than those of the CTR-based approach in an 8 × 8 2D mesh with uniform random multicast traffic.

F. SCALABILITY
Scalability measures the ability of a system to scale its performance and cost in response to changes in the application. In this study, scalability is expressed as the change in the ratio of average latency and area as the network size increases. We evaluated scalability based on the average latency for each network size in Table 3 and the router area overhead in Table 4. As the number of nodes on each side increases, the difference in average latency between the two routers increases linearly, and the area increases quadratically. Accordingly, the latency and area ratios change linearly, as shown in Fig. 11. Since the average latency of the MRBS-based NoC increases only slightly at each network size, MRBS enables an area-scalable design without performance degradation.

V. CONCLUSION
This paper proposed a novel multicast router with a deadlock recovery scheme using buffer sharing. Branch operations for tree-based routing have limited the design of area-efficient multicast routers. MRBS guarantees deadlock-free minimal path routing with low area overhead by exploiting the spatial diversity of the input buffer. Simulations were conducted under widely used multicast traffic. MRBS achieves a 39.3 % lower ADP than a conventional tree-based router while sustaining performance. Simulations with various network sizes showed that the MRBS-based NoC is more scalable than the CTR-based NoC. These results demonstrate that MRBS makes multicast-enabled NoCs area-efficient and scalable.