Designing Nonblocking Networks With a General Topology

Conventional theory for designing strictly nonblocking networks, such as crossbars or Clos networks, assumes that these networks have a centralized topology. Many new applications, however, require networks to have a distributed topology, like 2-D mesh or torus. In this paper, we present a new theoretical framework for designing such nonblocking networks. The framework is based on a linear programming formulation originally proposed for solving the hose-model traffic routing problem. The main difference, however, is that the link bandwidths in our problem must be discrete. This makes the problem much more challenging. We show how to apply the developed theorems to tackle the problem of designing bufferless NoCs (networks-on-chip). The proposed bufferless NoCs are deadlock/livelock-free and consume significantly less power than their buffered counterparts. We also present a multi-slice technique to reduce node capacity variations. This can make the proposed NoC architecture more cost efficient in a VLSI (very large-scale integration) implementation. In addition, we offer a detailed delay/throughput performance evaluation of the proposed bufferless NoC in the paper.


I. INTRODUCTION
Conventional nonblocking networks, such as crossbars and Clos networks, are designed for a centralized topology. However, new applications require networks with a distributed topology. One example is networks-on-chip (NoCs) [1]-[3], which often use a mesh or torus topology [4]-[6]. Most conventional NoCs adopt a buffered architecture, but deadlocks and the inability to handle adversarial traffic patterns are well-known problems in buffered NoCs [7]. A buffered NoC has another major drawback: buffers consume a large amount of power and silicon real estate [8]-[11]. The authors of [8] showed that removing buffers in their proposed architecture could lower power consumption by 40% and reduce silicon area by 70%. A buffered architecture is also incompatible with emerging optical NoC technologies, such as silicon photonic rings [12], [13], which are bufferless by nature.
However, bufferless NoCs have their own problems. For example, one such architecture is based on deflection routing [8], [9]: if the outgoing link is busy, the router simply sends the packet to any available output port. This obviously degrades routing efficiency, and the throughput of the network may drop even as the input load increases. To prevent this from happening, sophisticated congestion control and routing arbitration schemes are required. Another bufferless NoC is based on a circuit/packet hybrid architecture [14]-[16], where best-effort traffic and circuit setup commands are sent through the packet network while quality-guaranteed traffic is transmitted through the circuit network. One problem with this architecture is that, when an intended circuit is not available, the network must be reconfigured, and this process takes time. The frequency of reconfiguration depends on the degree of blocking of the network. If reconfiguration is frequent, performance can be severely degraded.

(The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.)
In theory, all these problems associated with a bufferless architecture would disappear if the entire bufferless NoC could perform like a nonblocking crossbar switch. Under this condition, no traffic pattern could degrade the performance of the network, and no deadlock or livelock could occur. The only problem is that there is no theory available for the design of a nonblocking NoC with a distributed topology, such as a mesh, ring, or torus. Filling that gap is the focus of this paper.

(VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Here, we present a theoretical framework for the design of nonblocking networks with a general topology. We use bufferless NoCs to illustrate the applications of this new type of network, as being nonblocking is essential for such NoCs. Our contributions are summarized as follows:
1. We develop the theory for designing nonblocking networks with a general topology, i.e., networks in which the entire network performs like a crossbar switch.
2. We show how to apply the theory to optical and electronic NoC design. Such bufferless NoCs are deadlock/livelock-free and congestion-free.
The rest of the paper is organized as follows. Section II presents the theoretical foundation of the paper. Section III presents an application of the theory to bufferless NoC design. Section IV presents a multi-slice technique to balance node capacities, and Section V compares the performance of the new NoC with that of a conventional buffered NoC. We conclude in Section VI.

II. THEORETICAL FOUNDATION
The theory for designing a nonblocking network with a general topology is given below. Various topologies, such as 2-D mesh [17] and torus [7], have been proposed for NoCs. Our theory is general and applies to all of these topologies.

A. HOSE MODEL NETWORKS
A network can be considered as a graph G(V, E), where V is the set of nodes and E is the set of edges (links). We list several additional notations used in our formulation in Table 1.
The theoretical foundation of the design of a nonblocking bufferless NoC is related to work on the hose-model traffic pattern [18], [19]. Traditionally, the traffic demand specification is given in the form of a traffic matrix {t_ij}, where t_ij is the amount of traffic from node i to node j, with i, j ∈ V. In a nonblocking network, each edge node has an ingress and an egress traffic limit. As long as the traffic admitted to the network does not exceed these limits, the network will never be congested.
In a hose-model network, by contrast, the traffic demand constraint is given in the form of row and column sums:

∑_{j∈V} t_ij ≤ α_i,  ∑_{j∈V} t_ji ≤ β_i,  ∀i ∈ V,  (1)

where (α_i, β_i) represent the amounts of ingress and egress traffic admissible at processor i (see Fig. 1). In practice, α_i and β_i are usually set to the capacity, denoted by γ, of the link connecting a processor and the NoC. Note that in the model above, traffic loads are considered continuous; this is obviously not true in a nonblocking circuit network. A network designed according to (1) will be congestion-free as long as the total ingress and egress traffic of each node does not exceed the constraints specified by (α_i, β_i), regardless of the traffic demand pattern {t_ij}. In such a network, a new flow from node i to node j can always be added as long as the total ingress traffic at node i is no more than α_i and the total egress traffic at node j is no more than β_j. Thus, there is no need to check the status of internal links. In other words, a new flow will never be blocked inside the network, provided that the ingress and egress nodes have the capacity to accommodate the flow. This property is analogous to the nonblocking property in switching. Networks designed this way are thus called nonblocking networks in [19]-[21].
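The hose-model admissibility test in (1) is straightforward to check programmatically. The following minimal sketch (function and variable names are ours, not from the paper) verifies that a traffic matrix satisfies the row- and column-sum constraints:

```python
def hose_admissible(T, alpha, beta):
    """Check whether traffic matrix T = {t_ij} satisfies the hose-model
    constraints (1): every row sum is at most alpha[i] (ingress limit)
    and every column sum is at most beta[j] (egress limit)."""
    n = len(T)
    row_ok = all(sum(T[i]) <= alpha[i] for i in range(n))
    col_ok = all(sum(T[i][j] for i in range(n)) <= beta[j] for j in range(n))
    return row_ok and col_ok

# Example: a 3-node network with unit ingress/egress limits (gamma = 1).
T = [[0.0, 0.4, 0.5],
     [0.3, 0.0, 0.2],
     [0.6, 0.1, 0.0]]
print(hose_admissible(T, [1, 1, 1], [1, 1, 1]))  # True: all sums <= 1
```

Any matrix passing this test is a legal demand; the nonblocking property guarantees all such matrices are routable simultaneously.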
Prior works on the hose model assume continuous bandwidths. This assumption fails in a discrete-bandwidth circuit switching network, where setting up a path consumes one channel from each link along the path. In the following, we therefore show how to use the hose-model concept to design a nonblocking circuit switching network.

B. INITIAL FORMULATION
As mentioned in Sec. I, our goal is to design a general-topology network that performs like a crossbar switch, so we set γ = 1 in the discussion below. The formulation for designing a minimum-cost nonblocking network is given in (2). Note that C_total = ∑_{e∈E} µ_e · c_e is the total cost of the network, where c_e is the capacity of link e and µ_e is a cost-scaling constant associated with link e. Implicitly, we assume that the silicon area, which includes nodes and links, for implementing link capacity c_e is linearly proportional to c_e.

min C_total  (2a)

subject to, for a given V and E:

∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = 1, if v = i, ∀i, j ∈ V,  (2b)
∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = −1, if v = j, ∀i, j ∈ V,  (2c)
∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = 0, otherwise,  (2d)
∑_{i,j∈V} t_ij · x_e^ij ≤ c_e, ∀e ∈ E, ∀T ∈ D,  (2e)
∑_{e∈Adj(v)} c_e ≤ M_nd, ∀v ∈ V,  (2f)

where Out(v), In(v), and Adj(v) denote the outgoing, incoming, and all links of node v, respectively. Constraints (2b-d) enforce flow conservation, and (2e) means that the total traffic routed through a link cannot exceed the link capacity. In (2e), D is the set of all legal traffic matrices; that is, every traffic matrix {t_ij} is subject to the following constraint:

∑_{j∈V} t_ij ≤ 1,  ∑_{j∈V} t_ji ≤ 1,  t_ij ∈ {0, 1}, ∀i ∈ V.  (3)

M_nd in (2f) is a given constant. Although this constraint is not required, we use it to limit the maximum capacity of each node; equivalently, it can reduce the node capacity variance. To select the value of M_nd, we can set M_nd to infinity initially and solve the formulation. Based on the derived maximum node capacity, we can then reduce M_nd gradually and observe its impact on the total cost and the node capacity variance. More discussion is provided in Sec. IV. The formulation of (2), however, cannot be solved directly because there are too many traffic matrices T satisfying (3), and they all need to be included in (2e). In the following, we show how to tackle this problem.
Theorem 1: Relaxing t_ij ∈ {0, 1} to t_ij ∈ [0, 1] (i.e., t_ij ∈ R) will not change the result of (2).
Proof: Consider the following three cases. (i) Assume t_ij ∈ {0, 1}. Let x_e^ij denote the derived routing and C_min the solution of (2). (ii) Assume t_ij ∈ [0, 1]. Let x̃_e^ij be the derived routing and C̃_min the solution of (2). (iii) Assume t_ij ∈ [0, 1] and that the network uses routing x_e^ij defined in (i). Let Ĉ_min be the minimum total capacity required to support all traffic patterns.
We are going to prove the following three statements:
(a) C̃_min ≤ Ĉ_min is true.
(b) C_min ≤ C̃_min must hold.
(c) Ĉ_min ≤ C_min is true.
With (a)-(c) above, it is easy to see that C_min = C̃_min must hold.
We now show that all three statements are true.
(a) This statement is trivially true because under t_ij ∈ [0, 1], the optimal routing is x̃_e^ij, not x_e^ij. Thus the derived Ĉ_min is not optimal, which means C̃_min ≤ Ĉ_min holds.
(b) Every valid traffic matrix in (2e) under t_ij ∈ {0, 1} is also valid under t_ij ∈ [0, 1], which means that the constraint set under t_ij ∈ {0, 1} is only a subset of the constraint set under t_ij ∈ [0, 1]. Thus the optimum of the formulation under t_ij ∈ [0, 1] can be no better than that under t_ij ∈ {0, 1}, which means C_min ≤ C̃_min must hold.
(c) Assume the network uses routing x_e^ij derived in (i). Note that the network is intended to be a crossbar-like nonblocking network, and it is a single-path network since x_e^ij ∈ {0, 1}. Let <p, q> denote the path from source node p to destination node q. For a given link e, let S_e be the set of source nodes and D_e the set of destination nodes of all paths <p, q> passing through e. Note that p ∈ S_e and q ∈ D_e do not imply that <p, q> passes through e. For example, if <1, 3> and <2, 4> are all the paths passing through e, then S_e = {1, 2} and D_e = {3, 4}, but <2, 3> does not pass through e.
We now construct a bipartite graph G_e with S_e and D_e. Let p ∈ S_e and q ∈ D_e. We add an edge l_pq between p and q if <p, q> passes through link e, and let z_pq be the weight assigned to l_pq. A legal traffic pattern (i.e., a traffic matrix satisfying (3)) routed through e can be viewed as a matching in the bipartite graph G_e, and the amount of traffic in the pattern equals the total weight of the matching. Assume z_pq ∈ {0, 1}. The maximum total weight can be calculated as

max ∑_{l_pq ∈ G_e} z_pq, s.t. ∑_q z_pq ≤ 1 ∀p ∈ S_e, ∑_p z_pq ≤ 1 ∀q ∈ D_e, z_pq ∈ {0, 1}.  (4a)

In (iii), the same routing x_e^ij is used, but z_pq ∈ [0, 1]. Under this condition, a source node can be connected to multiple destination nodes simultaneously and vice versa. The maximum amount of traffic that can be routed through e can now be calculated through the following fractional matching formulation:

max ∑_{l_pq ∈ G_e} z_pq, s.t. ∑_q z_pq ≤ 1 ∀p ∈ S_e, ∑_p z_pq ≤ 1 ∀q ∈ D_e, z_pq ∈ [0, 1].  (4b)

Comparing (4a) and (4b), we can see that (4a) is an integral matching formulation and (4b) a fractional one. From graph theory [22], the maximum integral matching of a bipartite graph equals its maximum fractional matching. Therefore, the link capacity of e derived in (i) can support any traffic pattern under the assumption of (iii). Thus, Ĉ_min ≤ C_min must hold.
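Statement (c) hinges on computing the worst-case load that hose-admissible 0/1 traffic can place on a link, which is exactly a maximum matching over the paths crossing that link. A minimal sketch using Kuhn's augmenting-path algorithm (all names are ours):

```python
def worst_case_link_load(paths_through_e):
    """Maximum matching in the bipartite graph G_e whose edges are the
    <p, q> pairs routed through link e; this equals the worst-case
    traffic on e under the 0/1 hose constraints of (3)."""
    sources = {p for p, _ in paths_through_e}
    adj = {p: [q for (a, q) in paths_through_e if a == p] for p in sources}
    match = {}  # destination -> matched source

    def augment(p, seen):
        # Try to give source p a destination, re-routing earlier
        # matches along an augmenting path if necessary.
        for q in adj[p]:
            if q not in seen:
                seen.add(q)
                if q not in match or augment(match[q], seen):
                    match[q] = p
                    return True
        return False

    return sum(augment(p, set()) for p in sources)

# The example from the proof: <1,3> and <2,4> pass through e.
print(worst_case_link_load([(1, 3), (2, 4)]))  # 2: both paths can be active
```

If two paths share a source or a destination, the hose constraints forbid them from being simultaneously active at full rate, and the matching size drops accordingly.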

C. FINAL FORMULATION
With Theorem 1, we can assume t_ij ∈ [0, 1] and derive the same result. Once we assume t_ij ∈ [0, 1], we can use the technique described in [23] to remove the requirement of exhaustively listing T in (2).
Theorem 2: Routing x_e^ij is feasible (i.e., satisfies (2e)) if and only if there exist non-negative π_e(i) and λ_e(j) for all e ∈ E and i, j ∈ V such that

∑_{i∈V} π_e(i) + ∑_{j∈V} λ_e(j) ≤ c_e, ∀e ∈ E,
π_e(i) + λ_e(j) ≥ x_e^ij, ∀e ∈ E, ∀i, j ∈ V.

The proof is similar to that in [23] and is thus omitted here.
Note that in the above, π_e(i) and λ_e(j) are the auxiliary variables introduced by the duality transformation of the linear programming (LP) formulation that maximizes the traffic routed through link e over all legal traffic matrices:

max ∑_{i,j∈V} t_ij · x_e^ij, s.t. ∑_{j∈V} t_ij ≤ 1, ∑_{j∈V} t_ji ≤ 1, ∀i ∈ V, t_ij ≥ 0.

With Theorem 2 and π_e(i) and λ_e(j), we derive the final mixed-integer LP formulation as follows:

min C_total  (5a)
s.t. ∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = 1, if v = i, ∀i, j ∈ V,  (5b)
∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = −1, if v = j, ∀i, j ∈ V,  (5c)
∑_{e∈Out(v)} x_e^ij − ∑_{e∈In(v)} x_e^ij = 0, otherwise,  (5d)
∑_{i∈V} π_e(i) + ∑_{j∈V} λ_e(j) ≤ c_e, ∀e ∈ E,  (5e)
π_e(i) + λ_e(j) ≥ x_e^ij, ∀e ∈ E, ∀i, j ∈ V,  (5f)
∑_{e∈Adj(v)} c_e ≤ M_nd, ∀v ∈ V,  (5g)
x_e^ij ∈ {0, 1}, π_e(i), λ_e(j), c_e ≥ 0, ∀e ∈ E, ∀i, j ∈ V.  (5h)

In (5), only x_e^ij is discrete; all other variables are real.

III. APPLICATION OF NONBLOCKING THEORY: BUFFERLESS NoCs
In this section, we use bufferless NoCs as an example application of the nonblocking theory presented in Section II.
Although the theory is for designing a network that performs like a crossbar switch, the bufferless NoC application, as shown below, is for a packet switch.

A. ROUTER NODE DESIGN
A packet in NoCs is usually divided into flits (flow control digits), which form the basic data-transmission and buffering units at the link layer. The term ''phit'' (physical unit), on the other hand, represents the number of bits that can be transferred in parallel in a single cycle. As with many NoC designs, we assume that the flit size is equal to the phit size.
In other words, the entire flit is transmitted in one clock cycle.
If an NoC uses a buffered approach [2], [3], different flits can be held in different routers, with the head flit carrying the routing information. Packets from different input ports may be headed for the same output port, and the head flit of a packet may be blocked in a router. When that happens, flow control signals are sent to upstream routers, and the subsequent flits may be buffered in different routers. However, the NoC proposed in this paper uses a bufferless approach. Each router holds only one flit. After one clock cycle, the entire packet moves one hop toward the destination. The entire NoC is nonblocking in the sense that, if a destination is free, a packet from a source station can definitely reach the destination node without being blocked inside the network (more details are provided below).
The hardware implementation of the router node of our bufferless NoC is quite simple. It only comprises a crossbar and a controller. No virtual channels, flow control, deadlock prevention mechanisms, or adaptive routing are needed. Since more than 60% of NoCs adopt a mesh or torus topology, we focus on these two in our discussion below.
In a bufferless NoC, a router node of a 2-D mesh or torus topology has at most four bi-directional links plus one link connecting to the local processor. The link connecting to the local processor contains only one channel, but the other links may have multiple channels. Each channel consists of two sub-channels, DATA and ACK:
DATA: The sub-channel for sending data. The width of a channel is the same as the width of a flit.
ACK: A 1-bit sub-channel for sending the ACK signal back to the source. Its direction is opposite to that of the DATA sub-channel.
The first bit of each flit is the VALID bit (Fig. 2b) -1 for valid and 0 for idle flits. A change from 0 to 1 in the VALID bit indicates that the first flit of a packet arrives, and a change from 1 to 0 means the end of the current packet transmission.
There is a unique path between any two processors. The specification of the entire path is given in the head flit(s) of a packet; this style of routing is usually called source routing. It requires 2 bits to specify the routing direction at each node (north, south, east, west), and the ''hop count'' field indicates how many routers the packet needs to pass through. The ''hops passed'' field indicates how many hops the packet has already traversed (this field is incremented by each router). When hop count = hops passed, the arriving packet will be handed to the local processor. For a large NoC, the path information may be carried by more than one flit, with the same format used for each subsequent flit. In this case, when hop count = hops passed in a flit, the flit is popped out by the router and the next flit is used for routing.
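The header fields above can be sketched in code. The exact bit layout is not specified in the text, so the packing below (2 bits per hop, with a hypothetical direction encoding) is only illustrative:

```python
# Hypothetical 2-bit direction codes; the paper does not fix this mapping.
DIRS = {'N': 0, 'S': 1, 'E': 2, 'W': 3}
CODES = {v: k for k, v in DIRS.items()}

def encode_route(path):
    """Pack a list of per-hop directions into (hop_count, dir_bits),
    2 bits per hop, as in the source-routing header described above."""
    bits = 0
    for i, d in enumerate(path):
        bits |= DIRS[d] << (2 * i)
    return len(path), bits

def next_direction(dir_bits, hops_passed):
    """Direction the current router should use.  Each router increments
    hops_passed; when hops_passed == hop_count, deliver locally."""
    return CODES[(dir_bits >> (2 * hops_passed)) & 0b11]

hop_count, bits = encode_route(['E', 'E', 'N'])
print(hop_count, next_direction(bits, 0), next_direction(bits, 2))  # 3 E N
```

A router thus needs only a shift and a mask to make its forwarding decision, which is consistent with the minimal crossbar-plus-controller node.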
A channel has two states: Idle and Busy. When a new packet arrives, the controller of the crossbar in Fig. 2 will select an Idle channel and set up the state of the crossbar accordingly. The packet is sent out after a one-cycle delay. The selected output channel is marked Busy, and the corresponding 1-bit ACK channel (in the reverse direction) will also be turned on. If no channel of the selected link is available, the packet is dropped. Because the NoC is nonblocking, packet dropping means that the destination is already occupied.
When a source processor sends out a packet, it also starts a timer. If no ACK (from the ACK channel) arrives before the timer expires, the transmission fails (blocked at the destination, not inside the network), and the source processor stops the transmission immediately by changing the VALID bit from 1 to 0. This sets all the reserved channels along the path back to the Idle state. The penalty for a failed transmission is the time-out interval, which is quite small compared to a packet length (see the discussion below). A failed transmission will be retried a short while later. No overhead is incurred if the transmission succeeds.
The time-out interval is determined by the number of hops of a path. This information is available because there is a unique path to each destination processor. Each packet is delayed by 1 flit cycle per node. If the total hop count is k, the length of the time-out interval is around k + Δ cycles, where Δ is the time for the ACK to travel back through the ACK channel (which is an analog channel and does not contain flip-flops). Δ may be larger than 1 flit cycle due to the capacitive loading of the ACK channel, but it should still be small. To simplify our discussion, we assume Δ is 2. In a 2-D torus NoC with n² processors, the average hop count is around n. So for a 36-processor multicore system, the time-out interval is around 8 flit cycles. In other words, a processor will waste 8 flit cycles for a failed transmission. This overhead is further reduced by another factor, K, in our final architecture discussed in Sec. IV, where K is the number of slices used to implement the NoC. If K = 4, then the average path setup overhead is about 2 conventional flit cycles.

[Fig. 3: Solutions of (5) with (5a) replaced by (5a′). Although the two unidirectional links (one for sending and one for receiving) between any two nodes in (b) have the same bandwidth, this is not the case in general.]
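The time-out bookkeeping above amounts to simple arithmetic; a small sketch (with Δ = 2 as assumed in the text, and K the slice count of Sec. IV):

```python
def timeout_cycles(hop_count, ack_delay=2):
    """Time-out interval: one flit cycle of delay per hop plus the ACK
    propagation time (assumed to be 2 flit cycles in the text)."""
    return hop_count + ack_delay

K = 4  # number of slices in the final architecture (Sec. IV)
print(timeout_cycles(6))      # 8 flit cycles for an average path in a 6x6 torus
print(timeout_cycles(6) / K)  # ~2 conventional flit cycles with 4 slices
```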
From the discussion above, it is easy to see that the bufferless architecture is deadlock-free and livelock-free because a failed transmission does not hold up any network resources. Also, several lines of a flit in conventional NoCs are used for carrying flow control signals and virtual channel (VC) numbers. This type of overhead does not exist in the proposed bufferless architecture.

B. COST FUNCTION AND RESULTANT TOPOLOGY
As can be seen in Fig. 2, a bufferless router has no buffers and is much simpler than its counterpart in a buffered NoC. The silicon area required for a bufferless NoC is dominated by the total link capacity cost, C_total, which is expressed as

C_total = ω ∑_{e∈E} c_e · d_e,  (5a′)

where ω is the wire pitch, c_e is the capacity of link e, and d_e is the physical length of link e. Since ω is fixed for a given fabrication technology, it can be dropped from the formulation by setting ω to 1. We can use (5), with (5a) replaced by (5a′), to design a bufferless NoC. Consider the conventional mesh network shown in Fig. 3a. The formulation of (5) with (5a) replaced by (5a′) produces many solutions, and one is given in Fig. 3b, where c_e is labeled on each unidirectional link. Since this resulting topology is a tree, we can easily verify that this solution is nonblocking. Note that there are many other solutions that look very different from the one shown in Fig. 3.
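For a tree solution such as the one in Fig. 3b, the nonblocking capacity requirement can be verified directly: with unit hose limits, the worst-case load on an edge is the smaller of the node counts on its two sides, since at most that many unit flows can cross the cut in one direction. A sketch assuming d_e = 1 (the example trees are ours, not the one in Fig. 3b):

```python
def tree_capacities(n, edges):
    """Required per-direction capacity of each edge of a tree on nodes
    0..n-1 under unit hose limits: min(|side A|, |side B|) of the cut."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def side_size(block, start):
        # Number of nodes reachable from `start` without crossing `block`.
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and y != block:
                    seen.add(y)
                    stack.append(y)
        return len(seen)

    return {(u, v): min(side_size(v, u), side_size(u, v)) for u, v in edges}

star = tree_capacities(4, [(0, 1), (0, 2), (0, 3)])
path = tree_capacities(4, [(0, 1), (1, 2), (2, 3)])
print(sum(2 * c for c in star.values()))  # 6: two unidirectional links per edge
print(sum(2 * c for c in path.values()))  # 8: a path tree costs more than a star
```

The comparison illustrates why the optimization can produce trees that look quite different from the underlying mesh: the cut structure, not the hop count, drives the cost.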
One issue with the result is that the node capacity variance is quite large. Every node in a bufferless NoC is just a crossbar. To simplify the VLSI implementation, it is preferable to lay out just one node (the largest node) and use it for all nodes. But this can lead to inefficient utilization of silicon area if the NoC has a large node capacity variance. This issue is tackled in the next section.

IV. MULTI-SLICE ARCHITECTURE
To tackle the node capacity variance, we present a multi-slice architecture below. As in Sec. III, we focus on torus networks, which are north-south and east-west symmetric (see Fig. 4). This symmetry is important for the multi-slice architecture presented below.

A. NODE CAPACITY VARIANCE
Node capacity variance can lead to inefficient VLSI implementation. Let n_max × n_max be the size of the largest node. The efficiency, in terms of the fraction of crosspoints not wasted, of using the layout of this node for all nodes can be represented as

η = (∑_i n_i²) / (M · n_max²),  (6)

where M is the total number of nodes and n_i² represents the number of crosspoints required in node i. We call the formula of (6) the node capacity variance indicator: the larger the value, the smaller the variance.
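As a quick sketch, the indicator in (6), which we take to be η = Σᵢ nᵢ² / (M · n_max²) consistent with the description that larger values mean smaller variance, can be computed as follows:

```python
def capacity_variance_indicator(node_sizes):
    """Formula (6): fraction of crosspoints actually used when the
    largest node's crossbar layout is replicated for every node.
    1.0 means all nodes are the same size (no waste)."""
    n_max = max(node_sizes)
    M = len(node_sizes)
    return sum(n * n for n in node_sizes) / (M * n_max * n_max)

print(capacity_variance_indicator([4, 4, 4, 4]))  # 1.0: perfectly balanced
print(capacity_variance_indicator([2, 2, 2, 8]))  # 0.296875: large variance
```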

B. MULTI-SLICE ARCHITECTURES
In a multi-slice architecture, we divide the network into K independent slices, and the width of each channel in a slice is only 1/K the width of the channel in a single-slice architecture. We then combine the K independent slices into the final solution. The advantage of a multi-slice architecture is threefold: (i) it reduces the node capacity variance; (ii) it reduces the path set-up overhead by a factor of K in a K-slice architecture; and (iii) it reduces the impact of a link failure (a single link failure only removes 1/K of the total bandwidth in a K-slice network).

[Figure caption: A router with four bi-directional links connecting to other routers and one link connecting to a node processor has a total node capacity of 10 (= (2 + 3 + 5 + 6 + 4 × 6)/4).]
To minimize the node capacity variance in a multi-slice NoC, how to combine the slices becomes important. Let A, B, C, and D represent the four slices of the solution shown in Fig. 5a. Each slice is a torus, although the capacities of some links are zero. We can horizontally or vertically shift or rotate each slice and then combine them. The result is still a torus due to the symmetry possessed by a torus network.
Denote the shift and rotation operations on a slice by a two-tuple (O_S, R_S), S ∈ {A, B, C, D}, where O_S represents the shift and R_S the (clockwise) rotation. For an n × n network, we use the two-tuple (c, r), 0 ≤ c, r ≤ n − 1, to represent O_S, meaning the network is shifted c columns to the right and r rows downward. R_S can assume four values: 0°, 90°, 180°, or 270°. An example is given in Fig. 5, where the slice of a 4 × 4 torus network given in Fig. 5b is obtained by applying ((2, 0), 0°) to the slice given in Fig. 5a. By changing the (O_S, R_S) of each slice, we intend to find the locations of the four slices such that the node capacity variance is minimized (Fig. 5c). Note that the torus network given in Fig. 5c combines the four slices S ∈ {A, B, C, D}, each of which is obtained by changing the (O_S, R_S) of the torus given in Fig. 5a.
For 8 × 8 networks, testing all possible locations of the different slices can be too time consuming. Thus, we use a coarse-resolution approach to tackle this problem. For example, the x and y coordinates range from 0 to 7 in an 8 × 8 network. We can divide the x and y axes into four regions with boundary points at 0, 2, 4, 6, and 8. As a result, we only need to consider 16 square regions on the plane, each square containing four points of the original plane (see Fig. 6). We randomly select a point as the location of a slice in each region.
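The (O_S, R_S) search can be prototyped on grids of node capacities. The sketch below (names and the toy 2 × 2 example are ours) applies cyclic shifts and clockwise quarter-turn rotations, both of which preserve the torus structure, and sums the slices:

```python
def rotate(grid):
    """Rotate an n x n grid of node capacities 90 degrees clockwise."""
    n = len(grid)
    return [[grid[n - 1 - c][r] for c in range(n)] for r in range(n)]

def shift(grid, cols, rows):
    """Cyclically shift `cols` columns right and `rows` rows down;
    legal on a torus because wrap-around links make it symmetric."""
    n = len(grid)
    return [[grid[(r - rows) % n][(c - cols) % n] for c in range(n)]
            for r in range(n)]

def combine(slices, ops):
    """Apply (O_S, R_S) = ((cols, rows), quarter_turns) to each slice and
    sum the per-node capacities.  Illustrative: real slices carry link
    capacities, but node capacities convey the balancing idea."""
    n = len(slices[0])
    total = [[0] * n for _ in range(n)]
    for grid, ((cols, rows), quarter_turns) in zip(slices, ops):
        for _ in range(quarter_turns):
            grid = rotate(grid)
        grid = shift(grid, cols, rows)
        for r in range(n):
            for c in range(n):
                total[r][c] += grid[r][c]
    return total

# Four copies of an unbalanced slice, shifted to all four positions,
# combine into a perfectly balanced 2 x 2 network.
s = [[1, 0], [0, 0]]
ops = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 0)]
print(combine([s, s, s, s], ops))  # [[1, 1], [1, 1]]
```

A coarse-resolution search would simply iterate `ops` over a subsampled set of shift/rotation tuples and keep the combination maximizing the indicator of (6).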

C. COMPARISON IN TERMS OF NODE CAPACITY VARIANCE
We first consider a single-slice architecture. Table 2 presents the results for 2-D torus networks (Fig. 4), where the total link cost is computed from (5), with (5a) replaced by (5a′), using d_e = 1 for a link between adjacent nodes and d_e = n − 1 for a link between the first and the last node in an n × n torus. We can see that the node capacity variance derived by (5) is large. The total link cost of a nonblocking bufferless NoC in this case is even smaller than that of a conventional buffered NoC. Note that the node complexity of the former is much lower and its performance much better (see Sec. V).
For multi-slice architectures, the results of four-slice 4 × 4 and 6 × 6 NoCs are given in Tables 3 and 4, respectively, where the total link cost is computed from (5), with (5a) replaced by (5a′) and d_e = 1. A node in a multi-slice NoC contains multiple nodes from different slices. Although they are independent, we can still combine them into one crossbar switch (with some configuration logic inside, of course). An interesting result is provided by the 4 × 4 torus NoC. With M_nd = 24, all nodes in the bufferless 4-slice torus NoC (Fig. 5c) and the conventional buffered torus NoC (Fig. 4) have the same capacity (Fig. 7). Furthermore, the two networks have the same total link cost.

V. DELAY/THROUGHPUT PERFORMANCES
In addition to computing, telecom applications are gaining importance for multi-core systems [24]. For such applications, a trace-based simulation (i.e., a system simulation driven by traces of program execution or system component accesses for the purpose of performance prediction) is meaningless. Therefore, we use traffic-pattern-based simulation to compare the performance of the proposed bufferless NoC with that of a conventional buffered NoC. In the simulation, we use a 4-slice architecture. A node processor of such an NoC can transmit 4 packets simultaneously. Although the transmission time of each packet is 4 times that of a packet in the 1-slice architecture, the overall latency, as shown below, is dominated by the queuing delay in the network. The routing for the NoC has already been described in Sec. III.
Many sophisticated adaptive routing schemes and congestion avoidance and deadlock prevention algorithms have been proposed for buffered NoCs, but many are very complicated. We choose simple XY routing, the simplest among all deadlock-free routing schemes, for the buffered network and compare its performance with that of our bufferless network. The simulation details of the buffered NoC are as follows:
1. It uses XY routing, first routing a packet in the x dimension and then in the y dimension.
2. Credit-based flow control is deployed between neighboring nodes.
3. Each link contains four virtual channels, and 16 flit buffers are dedicated to each virtual channel.
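For reference, the XY routing used for the buffered baseline can be sketched as below (the direction labels and coordinate convention are our assumptions):

```python
def xy_route(src, dst):
    """XY (dimension-ordered) routing on a 2-D mesh: move along x until
    the column matches, then along y.  Deadlock-free because a packet
    never turns from the y dimension back to the x dimension."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dx > sx else -1
    for _ in range(sx, dx, step):          # x-dimension hops first
        hops.append('E' if step == 1 else 'W')
    step = 1 if dy > sy else -1
    for _ in range(sy, dy, step):          # then y-dimension hops
        hops.append('N' if step == 1 else 'S')
    return hops

print(xy_route((0, 0), (2, 1)))  # ['E', 'E', 'N']
```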
To avoid head-of-line (HOL) blocking, virtual output queues (VOQs) are implemented in each processor. Each VOQ holds the packets for one destination, and a processor serves its VOQs in a round-robin fashion.
Since the bufferless NoC is nonblocking, a packet from a VOQ will not be blocked provided that the destination of the packet is free. Packets arrive in a Poisson stream, and their lengths follow a negative exponential distribution. The mean packet length obviously depends on the application, and there is no consensus on this issue (for example, it is assumed to be 8 flits in [25] and between 128 and 1024 flits in [26]). For our simulation below, we assume the mean packet length is 8 flits [25]. The bufferless NoC will perform even better, relative to the results presented below, when the mean packet length is larger than 8 flits. (In Table 5, (x, y) represents the coordinates of a node, and n refers to an n × n network.)
The bandwidth of the link connecting a processor to a router of the buffered NoC is set to 1, and the bandwidth of the same link is set to 1/4 in the bufferless NoC, as we use a 4-slice architecture in the discussion below; the total link bandwidth thus remains equal to 1. The horizontal axis in Figs. 8 and 9 represents how much traffic we send through that link, and the vertical axis represents the total latency, i.e., the interval between the arrival time of a packet and the time when the last bit of the packet reaches the destination. The latency values are first collected in clock cycles and then converted to packet transmission times so that the relative magnitude of the delay can be seen. Note that the transmission time presented in the figures is the time to transmit one packet (of average length) in a K-slice NoC. The same time unit is used for both the conventional buffered NoC (1-slice) and the bufferless NoC (4-slice). This explains why, in the delay/throughput curves of Figs. 8 and 9, when the traffic load is close to 0, the total latency of the conventional buffered network (1-slice) is always lower than that of the bufferless NoC (4-slice). Finally, it should be pointed out that in a buffered NoC, several lines of each flit must be used to carry flow control signals and VC numbers; this overhead is ignored in our comparison. Fig. 8 compares the delay/throughput performances of the 4 × 4 torus NoCs (4-slice) under uniform and non-uniform traffic loads. For the latter, we use the four non-uniform traffic patterns discussed in [27], whose definitions are given in Table 5. If we use the knee points of these curves as the reference points for comparing the maximum throughput, Fig. 8a shows that the throughput gain of the proposed bufferless NoC over the conventional buffered NoC is about 21% (0.75 versus 0.62) under a uniform traffic load, and can be as high as 100% under a non-uniform load (Fig. 8b-d).
These results should be considered alongside the fact that the nodes of a bufferless NoC are much simpler. Fig. 9 shows the delay/throughput performances of the 8 × 8 torus NoCs (4-slice). The throughput of the bufferless NoC is about 1.4 times that of the conventional buffered NoC when traffic is uniform (see Fig. 9a). It should also be pointed out, however, that the total link capacity required for the bufferless network (see Sec. III) is about 1.8 times that of the buffered NoC (not considering node complexity). For non-uniform loads, Figs. 9b and 9c show that the capacity gain of the new bufferless NoC can be as high as 400% of that of the conventional buffered NoC. Considering the simplicity of a bufferless node, these figures clearly demonstrate the advantage of the new approach.

VI. CONCLUSION
In this paper, we have presented a theory for the construction of nonblocking networks with a general topology. We used bufferless NoCs to demonstrate an application of the new theory. The presented bufferless NoC is nonblocking, deadlock-free, livelock-free, and congestion-free. It has a simple node design and a throughput close to 100%. Its performance is traffic-pattern independent, which greatly simplifies task mapping for a multi-core system. The bufferless nature of the new architecture also achieves great savings in power consumption and silicon area.
The developed theorems can also be applied to a wavelength selective switch (WSS)-based optical network, which also adopts a distributed topology. A workstation can use different wavelengths for different destinations. But such a WSS network is often a blocking network, and its routing and wavelength assignment (RWA) problem is NP-hard. Using the theorems developed in the paper, we can derive a nonblocking WSS-based network in which the RWA problem is greatly simplified.