HFBN: An Energy Efficient High Performance Hierarchical Interconnection Network for Exascale Supercomputer

Supercomputers strive to be eco-friendly by improving the power efficiency (GFlops/watt) of their main components, such as interconnection networks, processors, and shared memory. The reason is that exascale systems, now on the horizon, require a 1000-fold performance improvement over petascale computers, and energy efficiency has emerged as the key factor in reaching exascale. The main contribution of this paper is a new hierarchical interconnection network, evaluated through simulations at the scale of a million processing cores and targeted at exascale systems. Our network achieves both high network performance and low power consumption compared with conventional networks, making it highly suitable for exaFLOPS systems. On the other hand, the performance-per-watt metric used for the TOP500 list does not reflect the overall performance of a given system. Hence, one possible path to next-generation exascale performance is to redesign the "Interconnection Network", which is responsible both for the intercommunication between CPUs and for much of a supercomputer's power consumption. This paper presents a redesigned, energy-efficient interconnection network that mitigates high power consumption, long wiring lengths, and low bandwidth. Our network (HFBN) has been compared against the Tofu network; at 1M cores, HFBN obtains about 87.26% better energy efficiency with uniform traffic, about 86.32% with perfect shuffle traffic, and about 92.98% with bit-complement traffic at zero load latency.


I. INTRODUCTION
Recent supercomputers need to be eco-friendly, and ensuring the eco-friendly use of their resources requires redesigning, manufacturing, using, and disposing of computing equipment in ways that reduce environmental impact. Lower energy consumption reduces both the operational cost and the environmental impact of powering the computer [8]. Moreover, exascale computing power is clearly required to combat future-generation challenges. To fight against COVID-19, the world's fastest supercomputer, Fugaku, is being used to determine the effectiveness of various drugs [2]. Fugaku uses the Tofu interconnect with 7,299,072 cores and achieves about 415 PFLOPS while drawing about 28,335 kW of power [3]. On the other hand, the 5DTorus network used in the Blue Gene/Q supercomputer requires 6.6 MW of electrical power to achieve 20 PF/s with 1.57M processor cores; scaled up, such a design would require about 330 MW of electrical power at exascale. Therefore, energy efficiency is the most important issue for supercomputers, alongside other constraints such as inadequate static network performance, low scalability, low throughput, and large network latency [4], [15], [19].
Network performance and power consumption of supercomputers are heavily affected by their interconnection networks and processing nodes. Consequently, an interconnection network is an obvious requirement for every supercomputer. Since every computer chip has limited processing power, sequential processors are not a suitable choice. For example, an Intel Core i7-3630QM processor, with 4 cores on a 22nm fabrication process, achieves about 76.8 GFlops at 45W of electrical power; reaching exascale would require about 13 million such processors connected together. On the other hand, in MPC systems the wiring complexity of the network, for on-chip as well as off-chip connections, is a major concern due to power usage and high network latency. The K-Computer requires a total cable length of about 1,000 kilometers [25]. Moreover, an InfiniBand QDR 40Gbps switch typically requires about 1W of electrical power per link [26]. Friedman shows that 3D NoCs require less power than 2D NoCs thanks to shorter vertical links [7]. However, on-chip networks consume about 50% of total chip power, and off-chip bandwidth is limited by the total number of outgoing physical links [36]. Hence, the interconnection network has a huge impact on building exascale systems.
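As a quick sanity check of the 13-million figure, dividing one exaFLOPS by the per-processor throughput quoted above gives:

10^18 FLOPS / (76.8 × 10^9 FLOPS per processor) ≈ 1.3 × 10^7 ≈ 13 million processors.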
Energy consumption is dominated by the cost of data movement. The most critical problem for 3D networks is massive heat generation. 3D networks also require far more off-chip connections than 2D networks (about 50% more), and their cost and power usage are much higher. These considerations make a 2D NoC architecture the obvious choice for exascale supercomputing; indeed, the Sunway supercomputer adopted a 2DMesh network with exascale systems in view [6], [11]. Hierarchical networks are preferable to flat networks because modern MPC systems are themselves hierarchically designed, yet the dynamic communication performance of many hierarchical interconnection networks is not sufficient to support an exascale MPC network. Hence, in this research we consider a 2D NoC based hierarchical network (HFBN) as the interconnect for next-generation exascale systems. The rest of the paper describes the architecture of HFBN, reviews the routing algorithm, presents the static performance analysis, evaluates the dynamic communication performance of HFBN, and finally estimates the power and energy usage of HFBN.

II. RELATED WORKS
Chip Multiprocessors (CMPs) usually adopt flat interconnects such as Mesh and Torus, which consume an increasing fraction of chip power [11], [12]. Moreover, as technology advances and voltage continues to scale down, static power consumes a large fraction of total power. Hence, reducing total power usage is increasingly important for energy-proportional computing. Energy efficiency means reducing the energy required to deliver suitable performance. The power usage effectiveness (PUE) of the Swiss Supercomputer (CSCS) datacenter was 1.8 in 2012 [9]; the current PUE is about 1.25, a factor of 1.5 improvement.

FIGURE 1. Existing network topology for system shares [5].

Despite these modern advancements, the biggest concern for supercomputers is power dissipation. The Tianhe-1A supercomputer consumes 4.04 megawatts (MW) of electrical power; at about $0.10/kWh, this costs roughly $400 per hour and about $3.5 million per year [37].
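The yearly figure follows directly from the quoted power draw and tariff:

4.04 MW = 4,040 kW; 4,040 kWh × $0.10/kWh ≈ $404 per hour; $404/h × 8,760 h/year ≈ $3.54 million per year.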
The system cost of MPC systems is highly correlated with the number of wiring interconnects [52]. High-degree flat networks have high wiring complexity, which increases system performance but also increases system cost. Modern supercomputers such as Blue Gene/Q adopted a high-degree network (5DTorus) as their interconnect [10]. The top-ranked supercomputer of 2021, Fugaku, achieving 442 petaFLOPS on the LINPACK benchmark, adopted the Tofu interconnect D to connect 158,976 nodes, each with a Fujitsu A64FX CPU (48+4 cores) [3]. Figure 1 shows the current landscape of existing interconnects for massively parallel computer (MPC) systems [5]. The figure confirms that the most widely used network for MPC systems is the Tree network, which raises serious concerns about network performance. Figure 2 shows a cost analysis for chip-chip links (level-2 links) and intra-rack links (level-3 links) for 4096 nodes. We consider electrical links at the inter-chip level and optical links at the intra-rack level. The analysis shows that a 2DMesh requires about 90.39% less cost for designing level-2 and level-3 off-chip links than a 4DTorus network. On the other hand, the 2DMesh network suffers from low performance and high network congestion. Hence, the motivation for this research is not only to maximize network performance but also to minimize network power usage.
In our previous paper, we considered a three-dimensional on-chip network with two dimensions at the off-chip level, called 3D-TTN [16]. The main difference between HFBN and 3D-TTN is that HFBN is based on a 2D structure at the basic module (BM) level, whereas 3D-TTN uses a 3D structure there. Various observations indicate that 2D structures are less complex and more suitable for current computer systems; even the Sunway supercomputer used a 2DMesh network [6]. Moreover, Figure 2 shows that the link cost of 2D networks is much lower than that of 3D networks, which motivated us to adopt a new 2D-based hierarchical network in this paper.

III. ARCHITECTURE OF HIERARCHICAL FLATTENED BUTTERFLY NETWORK (HFBN)
HFBN is a hierarchical network: it maintains different topological patterns at different levels of the network structure. The lowest network level (the level-1 network) is defined as the basic module of HFBN, where each core maintains a fixed radix, similar to the 2D flattened butterfly architecture [32]. The upper levels of HFBN use a 2DTorus network. Hence, we name our network the ''Hierarchical Flattened Butterfly Network (HFBN)''. This section defines the architectural pattern of HFBN for both on-chip and off-chip connections. HFBN maintains a particular higher-level link pattern along with the 2DTorus upper-level connectivity. The requirements of an exascale system can be met by interconnecting hundreds of millions of cores, which is certainly possible with HFBN. However, HFBN requires pre-defined port assignments for its upper-level connectivity. Figure 3 illustrates the interconnection philosophy of HFBN. We define HFBN through the structure of each network level, with the governing equations given below.

Topological Definition: An HFBN(m, L, q) network, by definition, is built with a constant radix (similar to the 2D flattened butterfly network) at the lowest level of the network, followed by 2DTorus interconnection at the upper levels (with the particular connectivity considerations described below); here L is the level of the hierarchy, q is the number of paired connections for each higher level, and m is any positive integer indicating the size of the basic module.
An HFBN(m, L, q) follows exact definitions for each level of connectivity.

Definition of HFBN Basic Module (BM): the lowest network level is a (2^m × 2^m) array of cores, where m is any positive integer.

Definition of HFBN Upper-level Connection: L_max = 2(2^m − 1)/q + 1 is the maximum network level, and Q_max = 2(2^m − 1) is the maximum possible paired connectivity for any value of m, with 1 ≤ q ≤ Q_max. Depending on the value of q, HFBN(m, L, q) takes one of two configurations: if (2(2^m − 1)) mod q = 0, all exterior cores are used for upper-level connectivity; if (2(2^m − 1)) mod q != 0, some exterior cores remain free.
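The size rules above can be checked mechanically. The following short Python sketch (our illustration, not the authors' code; the function name is ours) evaluates Q_max, L_max, and the configuration test for given m and q:

    def hfbn_parameters(m: int, q: int):
        q_max = 2 * (2**m - 1)              # maximum paired connectivity for this m
        assert 1 <= q <= q_max
        l_max = -(-q_max // q) + 1          # L_max = ceil(2(2^m - 1)/q) + 1
        all_ports_used = (q_max % q == 0)   # otherwise some exterior cores stay free
        return q_max, l_max, all_ports_used

    print(hfbn_parameters(2, 1))            # (6, 7, True): HFBN(2, L, 1) scales to level 7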

A. LINK CONNECTIVITY
Basic module cores require two digits for their addresses: first the Y-index, then the X-index. More generally, in a Level-L HFBN the core address is represented by a string of 2L digits, two per level, (a_{2L−1}, a_{2L−2}, ..., a_1, a_0). At Level-1 the core address is (a_1, a_0), where a_1 gives the core position along the Y-axis and a_0 gives the position along the X-axis. Higher-level networks are two-dimensional networks, so we take the first digit of each pair as the row index and the second as the column index. Now, if the address of a core N_1 included in BM_1 is represented in this form, the higher-level links are defined as follows:
• Link for higher-level vertical connections: the vertically connected partner core DV of a source core x.
• Link for higher-level horizontal connections: the horizontally connected partner core DH of a source core x.
For the higher-level links, BMN (BMN = 2^m × 2^m) is the number of cores in a basic module and L is the level number of the corresponding level. For m = 2, BMN = 16. Here, x denotes the source core number. The highest level of network obtainable from a (2^m × 2^m) BM is L_max = 2(2^m − 1)/q + 1. Finally, DV and DH give the core numbers that are vertically and horizontally connected, respectively, with the source core x. Algorithm 2 shows the port assignment for the upper-level connectivity of a particular basic module (the flowchart of this algorithm (Appendix A) is given in Figure 22 in Appendix B). This algorithm considers all exterior cores of the on-chip network to be interconnected with other on-chip modules. Algorithm 2 takes m and q as inputs; L_max is then calculated from m and q. The function HIGHERLevel_HFBN(m, q) allocates particular cores in the basic module to high-level ports, given m and the number of paired connections per level. Since the high-level ports are provided by the exterior cores of each basic module, the algorithm assigns every possible port position for the higher-level connectivity. In the initialization step, L_max is computed and BMAX is set to the number of cores in each X or Y direction.

B. BASIC MODULE (BM) OF A HFBN
HFBN(m, L, q) uses only six intra-chip links per core for the interconnection of the basic module. Hence, the on-chip design of HFBN differs distinctly from the flattened butterfly. The basic module design of HFBN(2, L, q) follows the same pattern as the flattened butterfly network; since HFBN maintains a constant node degree, its link pattern diverges from the flattened butterfly when m is greater than 2 (the link connectivity of the basic module was defined above). The lowest level, HFBN(m, 1, q), is considered the ''Basic Module'' (BM). HFBN uses (2^m × 2^m) cores as its basic module, so m = 2 yields sixteen cores at the BM level. Figure 3 also shows the basic module of HFBN(2, L, q). The network is based on a two-dimensional architecture, so BM connectivity runs along the X and Y directions.
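For m = 2, the row/column connectivity of the BM can be enumerated directly. The sketch below (illustrative Python, assuming plain 2D flattened-butterfly connectivity inside the BM) lists the intra-chip neighbours of a core:

    def bm_neighbors(y: int, x: int, m: int = 2):
        size = 2**m
        row = [(y, c) for c in range(size) if c != x]   # all cores in the same row
        col = [(r, x) for r in range(size) if r != y]   # all cores in the same column
        return row + col

    print(len(bm_neighbors(1, 2)))   # 6 intra-chip links per core when m = 2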

C. UPPER LEVEL OF A HFBN
Integrating a large number of on-chip links benefits network performance and remains cost-effective. The off-chip level, however, is completely different: per-link cost and power requirements add directly to the total system cost. Hence, we adopt a hierarchical design for next-generation supercomputer architecture, where links of a given level interconnect networks of that level. The higher levels of HFBN use 2DTorus interconnection over recursive patterns of the immediately lower level of subnetworks. A level-2 (node-level) HFBN therefore consists of 2^{2m} level-1 networks; that is, an HFBN(2, L, q) has 16 level-1 networks (basic modules) per complete level-2 network. Figure 4 shows the formation of a single level-3 network from 16 level-2 networks and 256 level-1 networks of HFBN(2, L, q). We state several lemmas below to improve the readability of the network setup.
To achieve high performance, HFBN must use all of its free ports; hence, a large number of paired connections is highly effective. q is the number of paired connections for each higher level and Q_max is the maximum paired value for any m, defined as Q_max = 2(2^m − 1); q ranges over the possible divisors of Q_max. For m = 2, Q_max = 2(2^2 − 1) = 6. In addition, the highest network level of HFBN is L_max = ceil(2(2^m − 1)/q) + 1; hence HFBN(2, L, 1) can be constructed up to a maximum of seven network levels, L_max = (2(2^2 − 1)/1) + 1 = 7. The number of paired connections q governs the number of outgoing and incoming connections at each off-chip level: increasing q increases the number of in/out connections and decreases the maximum network level. Figure 5 shows the architectural design of HFBN(2, 3, 3) and HFBN(2, 2, 6); the figures also show the choice of off-chip connectivity of a particular core with the paired connection number (1 to 3 in Figure 5(a) and 1 to 6 in Figure 5(b)).

Lemma 3.2:
The total number of cores at each level of HFBN is N = 2^{2mL}. HFBN maintains a fixed number of cores at the basic module level (2^m × 2^m) and builds each upper level from 2^{2m} immediate lower-level subnetworks, which together yield the total number of interconnected cores. An HFBN(2, 3, q) has a network size of 4096 cores. Table 1 generalizes the architectural parameters of HFBN(m, L, q), and Table 2 compares the various levels of HFBN for m = 2 with different q values. To simplify the result analysis, we consider the HFBN(2, L, 1) class in this paper.
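For instance, the 4096-core figure follows directly from the lemma:

N = 2^(2mL) = 2^(2·2·3) = 2^12 = 4096 cores for HFBN(2, 3, q).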

D. NUMBER OF LINKS AT VARIOUS NETWORK LEVELS OF A HFBN
The router layout is one of the greatest concerns for an interconnection network [25], [34]. Hence, the number of on-chip as well as off-chip connections is a major concern in designing an exascale system. Figure 4 shows the hierarchical structure of HFBN(2, L, q), where the level-1 network is built at the chip level, the level-2 network is used at the node level, and the level-3 network at the rack level. The number of interconnecting links at the various layers of HFBN is defined by Equation 2.
Here, N_BM is the number of basic modules at the current level, IL_1 is the number of inner level-1 links, and OL_i is the number of outer i-th level links. A level-1 HFBN(2, 1, 1) network requires 48 links for its BM. We generalize this equation in Table 3. Table 4 shows the total number of links for HFBN, which needs just over 800 links at the node layer. Table 4 also compares the link counts of networks such as 2DMesh and 2DTorus against HFBN; it shows that 2DMesh and 2DTorus require many more links than HFBN at the higher levels. HFBN(2, 3, 1) requires about 18.75% fewer interconnecting links than 2DTorus and about 17.46% fewer than 2DMesh.
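The 48-link BM figure can be reproduced under the flattened-butterfly assumption that each of the 2^m rows and 2^m columns forms a clique; the following Python sketch (ours, not the authors' code) counts the level-1 links:

    from math import comb

    def bm_links(m: int = 2):
        size = 2**m
        return 2 * size * comb(size, 2)   # row cliques + column cliques

    print(bm_links(2))   # 48 level-1 links, matching HFBN(2, 1, 1) above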

IV. ROUTING ALGORITHM FOR HFBN
The modern supercomputer BlueGene/L uses deterministic routing along with adaptive routing. Hence, in our performance analysis we also consider a simple deterministic routing (dimension-order routing, DOR) for the HFBN(2, L, 1) class. Dimension-order routing keeps routing the packet in the same dimension until the remaining distance in that dimension becomes zero. For HFBN, routing Algorithm 1 can be subdivided into two parts: one part covers BM_routing and the other covers higher-level routing (Routing_HFBN). If a packet is destined for another BM, the source core sends the packet to the outlet_core of the next interconnected BM at the current network level; receiving_core is then used to track the new source core address after the BM transfer completes. Suppose the source core address is s = (s_1, s_0), routing in the Y, X order at the higher levels as well as in the level-1 networks; the routing tag is then t = (t_1, t_0). In the Routing_HFBN function, outlet_x and outlet_y are the functions that return the x coordinate s_0 and the y coordinate s_1 of the core for which the link (s, d, l, dα) exists, where l is the level (2 ≤ l ≤ L), d is the dimension (d ∈ {V, H}), and α is the direction (α ∈ {+, −}).
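A minimal Python sketch of the dimension-order step (our illustration; the names are ours, not the simulator's API) shows how the routing tag is consumed, resolving the x offset first and then the y offset as in Algorithm 1:

    def dor_moves(src, dst):
        (s1, s0), (d1, d0) = src, dst
        t1, t0 = d1 - s1, d0 - s0            # routing tag (t1, t0)
        moves = []
        while t0 != 0:                       # resolve the x dimension first
            moves.append('+x' if t0 > 0 else '-x')
            t0 += -1 if t0 > 0 else 1
        while t1 != 0:                       # then the y dimension
            moves.append('+y' if t1 > 0 else '-y')
            t1 += -1 if t1 > 0 else 1
        return moves

    print(dor_moves((0, 0), (2, 3)))         # ['+x', '+x', '+x', '+y', '+y']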
Deadlock-free Routing for HFBN: Packet routing must be deadlock-free; otherwise a packet may never reach its destination core and will delay the delivery of other packets, which in turn drastically reduces dynamic communication performance. In this section, we study deadlock-freedom for HFBN. The HFBN(2, 1, 1) network maintains core-to-core connections along each x and y (row/column) direction, so no wraparound routing is required within HFBN(2, 1, 1); its routing is therefore similar to that of a Mesh network, which requires only one VC. Hence, a single VC suffices for HFBN(2, 1, 1). The off-chip part of HFBN, on the other hand, is built on the 2DTorus arrangement and requires 2 VCs for deadlock-freedom. In summary, HFBN requires 2 VCs for deadlock-free routing. Using this number of VCs, we prove the deadlock-freedom of HFBN based on the routing paths, which are divided into multiple states.
• State 1: Transfer of the packet from the source core to the outlet core within the source BM (intra-BM).
• State 2: Transfer of the packet between BMs over the higher-level links (inter-BM).
• State 3: Transfer of the packet from the receiving core to the destination core within the destination BM (intra-BM).

Lemma 4.1: If a message is routed in the order y → x in a 2DMesh network, then the network is deadlock-free with 1 virtual channel (VC) [33].

Lemma 4.2: If a message is routed in the order y → x in a 2DTorus network, then the network is deadlock-free with 2 virtual channels (VCs) [33].
Theorem 4.1: An HFBN is deadlock-free with 2 VCs.

Proof: The BM of HFBN(2, L, q) follows the flattened butterfly connection. Hence, this network level requires only 1 VC, as established by Lemma 4.1. The higher-level network, however, is designed with toroidal connections, so it requires 2 VCs for deadlock-free upper-level routing. In terms of the routing states, state-1 and state-3 require 1 VC for HFBN(2, L, q), while state-2 and its sub-phases traverse the toroidal connectivity and thus require 2 VCs. In summary, HFBN is deadlock-free with two virtual channels.

Algorithm 1 Routing Algorithm for HFBN(2, L, 1)
begin
  BM_routing(s1, s0, outlet_core_y, outlet_core_x);
  if (routedir = positive) then send the packet to the next BM;
  else move the packet to the previous BM; endif;
  while (t0 != 0) do
    if (t0 > 0) move packet to +x direction w.r.t. destination; t0 = t0 - 1; endif;
    if (t0 < 0) move packet to -x direction w.r.t. destination; t0 = t0 + 1; endif;
  endwhile;
  while (t1 != 0) do
    if (t1 > 0) move packet to +y direction w.r.t. destination; t1 = t1 - 1; endif;
    if (t1 < 0) move packet to -y direction w.r.t. destination; t1 = t1 + 1; endif;
  endwhile;
end;

V. STATIC NETWORK PERFORMANCE
Static network performance characterizes a network's capability without considering packet movement; it is therefore useful in the initial choice of a network. A good network ensures lower cost, lower degree, lower congestion, higher connectivity, and a higher fault-tolerance rate than others [45], [46]. The node degree is defined as the maximum number of physical outgoing links from a core. Since each core of the HFBN network has a maximum of eight outgoing links, the degree of HFBN is eight. Table 6 shows the node degrees of the various networks. In this section, we compare parameters such as diameter, average distance, and cost. We consider network performance up to the level-5 network, HFBN(2, 5, 1), using an SGI supercomputer running OpenMP parallel programs on 6 cores with 16 threads. Table 5 shows the simulation environment for the static network performance.

A. DIAMETER PERFORMANCE
The diameter (maximum hop count) is the maximum number of channels a packet must cross, over all source-destination core pairs, along its shortest path. The static diameter does not consider channel faults. A low diameter implies low communication delay [29], so a low diameter is preferable for any interconnection network. Equation 3 gives the diameter evaluation for HFBN(m, L, q), and Table 7 shows the formulation calculated from Equation 3. Figure 7 compares the diameter of HFBN(2, L, 1) against various networks. The simulation shows that the diameter of HFBN(2, L, 1) is much better than that of various hierarchical networks (such as TTN [17] and TESH [13]). Compared with conventional networks such as 2DMesh and 2DTorus, HFBN(2, 5, 1) shows much better results; HFBN(2, 5, 1) achieves about 34.15% better diameter than the TTN network. The diameter of HFBN can also be evaluated as the sum of the per-level routing distances, D = D_s + Σ_i (D_si + D_i) + D_d, where D_s is the distance to the highest-level outgoing core, D_si is the distance for the next level of routing, D_i is the distance for the corresponding level of routing, and D_d is the distance from the last level-2 core to the destination core, i.e., the routing required in the destination on-chip network. Table 7 shows this formulation for HFBN up to the level-4 network; for a level-4 network, we need the value of D_s (routing in the starting on-chip network), the values of D_si and D_i (which depend on each inter-level routing, i.e., the distance for moving packets from a lower network level to a higher one or vice versa), and finally D_d for routing in the destination on-chip network.

B. AVERAGE DISTANCE
Diameter analysis considers the routing of a single packet along the maximum shortest path [31], [44], whereas the average distance (average hop count) considers packets broadcast from each core to every other core. Hence, a short average path matters even more than a low diameter. The average distance is the mean distance between all distinct pairs of cores; a small average distance allows small communication latency. The average distance of a graph G is defined by Equation 4, d̄(G) = (1/(n(n − 1))) Σ_{x≠y} d(x, y), where n is the total number of cores in the network and d(x, y) is the shortest-path distance between the distinct pair x and y. Figure 8 shows the average distance of various networks, confirming that HFBN is superior to TTN, 2DMesh, 2DTorus, RTTM, and TESH. For message transfers between nodes of a higher-level RTTM, many packets must pass through the 2DMesh basic module. The high hop count of a 2D mesh results in a high hop distance for those source-destination pairs, and many distinct node pairs of the higher-level RTTM traverse the 2DMesh basic modules; these high hop distances over many pairs incur a high average distance. Together with the other static parameters, such as diameter, and the dynamic network performance, it is evident that HFBN is more suitable than the RTTM network. Table 8 shows the static parameter analysis for HFBN compared with RTTM and other on-chip networks with 16 cores.
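Equation 4 can be evaluated directly on any small topology by breadth-first search. The sketch below (our illustration in Python, applied to a plain 4 × 4 2DMesh rather than HFBN) computes the average distance over all distinct ordered pairs:

    from collections import deque
    from itertools import product

    def mesh_neighbors(y, x, size=4):
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < size and 0 <= nx < size:
                yield ny, nx

    def average_distance(size=4):
        nodes = list(product(range(size), repeat=2))
        total = 0
        for s in nodes:                       # BFS from every source core
            dist = {s: 0}
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in mesh_neighbors(*u, size):
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            total += sum(dist.values())
        n = len(nodes)
        return total / (n * (n - 1))          # Equation 4

    print(average_distance())                 # ~2.67 for a 4x4 2DMesh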

C. STATIC COST PERFORMANCE
Cost analysis is vital for interconnection networks because it considers the product of node degree and diameter. Cost is inter-related with inter-node distance, message traffic density, and fault tolerance. The node degree of a network is the maximum number of physical outgoing channels from a single core; the node degree of HFBN is 8. The network radix is the number of inter-router channels plus the number of cores connected to a single router; hence the network radix of HFBN is 9 (8 links connect to other routers, and a single link connects the single core). Our network is, however, flexible enough to connect multiple cores to a single router, which allows high network scalability for HFBN. Figure 9 shows the cost analysis for HFBN(2, L, 1): the network outperforms the 2DTorus and is even much better than the TTN network. RTTM [14], owing to its low node degree, shows a low-cost performance almost similar to HFBN.

VI. EVALUATION ON EFFICIENT ENERGY USAGE
Efficient energy usage is an important consideration for MPC systems. As modern MPCs are highly affected by power consumption, efficient energy usage relates system performance to power usage, which is a key feature in the field of interconnection networks. For example, in an Alpha 21364 microprocessor, the integrated routers and links consume about 20% of total chip power (about 25W of a 125W total) [23]. Efficient energy usage means reducing the power required to maintain suitable network performance. Here, performance means the dynamic communication performance (DCP) of the corresponding network, and power usage means network power usage. In this section, all DCP graphs count data flits only in the accepted throughput. Equation 5, NEU = ATT × NTPU, defines the network energy usage. We take a single clock cycle as 1 ns (the system clock is 1 GHz); network energy usage is therefore the product of the average flit transfer time and the total power used to transmit the flits. Efficient energy usage is then the reduction in network energy usage when comparing two networks at the same relative request probability (r). In the DCP analysis, packets are transmitted with request probability r during the simulation clock cycles. We use a wormhole simulator specially designed for hierarchical networks for the DCP analysis [35]. Electrical power is considered up to the inter-chip level (256 cores), using the Orion energy model [27]; the electrical power analysis covers various traffic patterns (generated with the Garnet 1.0 simulator [28] using the default table-based routing). This analysis gives the required power usage for a single electrical module; to obtain the total electrical power, we multiply the single-module power by the total number of electrical modules. For the optical power, we use the fixed data-driven power of an intra-rack link (0.0101 watts) [41], [50] and an inter-rack link (0.035 watts) [41], together with the per-module power of a gigabit interface converter (GBIC, FG-TRAN-SFP28-SR, 1.2 watts) [38] for the optical off-chip connectivity. The total optical power usage is the number of optical links (intra-rack and inter-rack) multiplied by their per-link power, plus the required number of GBIC modules multiplied by their module power. In this section, we analyze three traffic patterns under various simulation conditions with respect to the number of computing cores.
Here, NEU denotes network energy usage, ATT the average transfer time, and NTPU the network total power usage. Since high-degree networks require a large number of off-chip interconnects, they are not suitable for large MPC systems, mainly because of the required power usage. The large-scale analysis considers two cases: a 65K-core analysis and a 1M-core analysis, each evaluated under various traffic patterns. This paper also shows the energy analysis of the Tofu network, whose basic (on-chip) module has 12 cores; we therefore take 240 cores at the inter-chip level and 4,080 cores per rack for Tofu (65,280 cores in total for the 65K analysis and 1,044,480 cores for the 1M analysis), whereas the other networks use 256 cores at their inter-chip level and 4,096 cores per rack. For the RTTM network, we take a = 4, with each upper-level network built from (4 × 4) = 16 lower-level subnetworks with twisted torus connectivity, so that every case has the same number of cores.
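A short sketch of this bookkeeping (illustrative Python; the function names are ours) combines Equation 5 with the per-link optical figures quoted above:

    def network_energy(att_cycles: float, ntpu_watts: float) -> float:
        # Equation 5: NEU = ATT x NTPU, with 1 cycle = 1 ns (1 GHz clock)
        return (att_cycles * 1e-9) * ntpu_watts          # joules

    def optical_power(n_intra: int, n_inter: int, n_gbic: int) -> float:
        # per-link data-driven power [41] plus per-GBIC module power [38]
        return n_intra * 0.0101 + n_inter * 0.035 + n_gbic * 1.2   # watts

    print(network_energy(att_cycles=120, ntpu_watts=50))   # e.g. 6e-06 J per transfer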

A. DYNAMIC COMMUNICATION PERFORMANCE
Network performance depends heavily on the traffic pattern, and the applications running on an MPC system are likewise strongly affected by it. The dynamic communication performance of a network is measured under various traffic patterns and characterized by latency and throughput. Latency is the time taken by a single packet to reach its destination core from its source core. Network throughput is the rate at which the network can deliver packets, i.e., the maximum amount of information delivered per unit of time through the network. Latency can be defined as T = T_h + T_s, where T_h is the header latency, the time for the header message to traverse the network, and T_s is the serialization latency, the time for a packet of length L to cross a channel with bandwidth b, i.e., T_s = L/b.

B. DEFINITION OF VARIOUS TRAFFIC PATTERNS
Network load strongly influences performance. The traffic pattern governs the choice of source and destination cores in the network and can be random or non-random. This paper considers the following non-uniform traffic patterns, along with the uniform traffic pattern, for the dynamic communication performance (DCP) analysis.

Uniform: every core sends messages to every other core with equal probability, i.e., the source and destination are randomly selected for each generated message.
Bit-complement: a fixed source-destination pair for every message; each core sends messages to the core whose address is the one's complement of its own.

Perfect shuffle: each core sends messages to the core whose address is its own address rotated left by one bit.
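The destination functions for these patterns can be written compactly. The following Python sketch is our illustration (assuming b-bit core addresses; the names are not the simulator's):

    import random

    def uniform_dst(src: int, n_cores: int) -> int:
        return random.randrange(n_cores)                  # random destination

    def bit_complement_dst(src: int, b: int) -> int:
        return ~src & ((1 << b) - 1)                      # one's complement of the address

    def perfect_shuffle_dst(src: int, b: int) -> int:
        return ((src << 1) | (src >> (b - 1))) & ((1 << b) - 1)   # rotate address left 1 bit

    print(bit_complement_dst(0b0011, 4))                  # 12 (0b1100)
    print(perfect_shuffle_dst(0b1001, 4))                 # 3  (0b0011)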

C. CONSIDERATION ON DYNAMIC COMMUNICATION PERFORMANCE
Dynamic communication performance (DCP) considers the latency and throughput of each network: the zero-load latency, the saturation load, and the maximum number of packets that can be delivered per unit time. Accepted throughput (flits/cycle/core) is the number of flits received at the destination cores, normalized by the total number of cores and the total simulation cycles; here, the DCP graphs count data flits only in the accepted throughput. The average transfer time (measured in clock cycles) is the average delivery time over all packets delivered within the given simulation time. To evaluate the dynamic communication performance of HFBN, we used a specially designed simulator [49], built for hierarchical networks, with the facility of explicitly changing the packet ID when the source router changes; this makes it possible to select a particular virtual channel on different links to keep the network deadlock-free. HFBN(2, 1, q) requires only 1 VC for deadlock-free routing, but the off-chip level of HFBN requires 2 VCs because of its torus connections. HFBN uses DOR routing for its dynamic performance, and we likewise use simple dimension-order routing with wormhole flow control for the other networks. We used the Intel compiler with mcmodel=medium for the 1M-core analysis; the 65K-core DCP results were obtained with a Visual C++ 2017 compilation.
Flow control governs the allocation of resources to packets. In the DCP analysis, the packet is the key unit for assessing network capability; it follows a specific route to the destination core. The key network resources are channels and buffers: channels provide the connectivity, and buffers hold packets temporarily at the cores. In the DCP analysis we use wormhole flow control, which requires little buffering and, most importantly, makes latency largely independent of message distance. In wormhole routing, each message is divided into packets, which are further divided into flits. Flits come in two kinds: the header flit holds the routing information, and the data flits follow the header flit through the network. The DCP analysis considers only deterministic routing for each network. Deterministic routing, also called oblivious routing, always uses the same routing path between a given source and destination pair, even when multiple paths exist.
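As a concrete illustration of this decomposition (our own sketch; the sizes follow the 6-data-flit packets used later in the power analysis, and the header contents are placeholders):

    def packetize(message: bytes, data_flits_per_packet: int = 6, flit_bytes: int = 1):
        payload = data_flits_per_packet * flit_bytes
        packets = []
        for i in range(0, len(message), payload):
            header_flit = {'route': 'filled in by the router'}   # routing information
            data_flits = list(message[i:i + payload])            # data follows the header
            packets.append((header_flit, data_flits))
        return packets

    print(len(packetize(b'hello world!')))   # 2 packets of up to 6 data flits each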

D. ESTIMATION OF POWER CONSUMPTION
Power consumption is the major concern for exascale systems. Modern supercomputers are heavily affected by on-chip as well as off-chip power usage. The powerful Tianhe-2 supercomputer requires 24MW of electrical power to achieve 33.86 petaFlops with more than 3 million cores [37]. At this rate, the power required at exascale would be comparable to that of a nuclear power plant, which is unrealistic.

1) ASSUMPTION FOR THE POWER MODEL
The power consumption of an MPC system depends on various components, such as the network system, processors, memory modules, and the cooling system; among these, the network has a high impact on total power usage. The 16-tile MIT RAW on-chip network consumes about 36% of total chip power [39], so on-chip power is the most important factor in estimating total power usage. At the rack level, the required power per link is typically about 1W, the bandwidth exceeds GB/s, and the cost is very high [26]. H. Wang et al. compare the power of high-speed electrical and optical interconnects for inter-chip communication [40], so off-chip power estimation is also required for the total power analysis. In this paper, we use the fixed data-driven power of an intra-rack link (0.0101 watts) [41] and an inter-rack link (0.035 watts) [41], together with the per-module power of a gigabit interface converter (GBIC, FG-TRAN-SFP28-SR, 1.2 watts) [38] for the optical off-chip connectivity. Our power model therefore uses electrical power at the inter-chip level and optical power at the intra-rack and inter-rack levels:

P_total = P_electrical + P_optical
P_electrical = P_router + P_link + P_clock
P_optical = P_optical_link + P_GBIC

2) ELECTRICAL POWER MODEL
Our power model is based on the Orion energy model [27] with a 65nm fabrication process, and we use the GARNET 1.0 NoC simulator [28] to analyze dynamic power consumption. The power usage of the inter-chip network comprises dynamic and leakage power. The router power covers the router buffers, the local and global arbiters, and the crossbar traversal. The dynamic energy model of the router is given by Equation 8, where C is the capacitance, V is the supply voltage, and α is the switching factor; we use the default buffered input capacitance of 7.8e-15 F for the 65nm fabrication process with HVT transistors [27]. The dynamic power of the channels is caused by the charging and discharging of capacitive loads, formulated as P_dy_link = α C_l V_dd^2 f_clk, where C_l is the load capacitance and V_dd is the supply voltage. We consider 6 header flits along with 6 data flits for the electrical power analysis; the header flits are merged with the data flits (8 bits per data flit and 8 bits per header flit), so the total is 6 flits per packet for the electrical power evaluation. To obtain the total electrical power, we multiply the simulated power usage by the total number of inter-chip modules. The routing for this inter-chip analysis uses the default table-based routing.

E. EFFICIENT ENERGY USAGE ANALYSIS (1M Cores)
The 1M-core analysis considers up to the level-5 network, HFBN(2, 5, 1), starting with the Table 9 parameters for the 1M traffic analysis. We evaluated the power simulation under the same traffic conditions with respect to Table 10 and Table 11. The power analysis uses the parameters in Table 10 with the same accepted throughput (used as the injection rate in the Garnet 1.0 simulator [28]) taken as the dynamic communication performance parameter. In this section, we take 256 cores at the inter-chip level and 1M cores as the total number of simulated cores.
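Before turning to the per-traffic results, the electrical power model above can be summarized in a short sketch (illustrative Python; the α, V_dd, and f_clk values are assumptions, while the capacitance is the default from [27]):

    def dynamic_power(alpha: float, c_farads: float, vdd: float, f_hz: float) -> float:
        # Orion-style dynamic power: P = alpha * C * Vdd^2 * f_clk
        return alpha * c_farads * vdd**2 * f_hz

    # one buffered input at the default 65nm capacitance (7.8e-15 F)
    print(dynamic_power(alpha=0.5, c_farads=7.8e-15, vdd=1.0, f_hz=1e9))   # 3.9e-06 W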

1) UNIFORM TRAFFIC
For the 1M-core energy usage evaluation, Table 9 gives the traffic parameters and Table 10 gives the power parameters for the electrical analysis (Table 11 covers the optical link power usage). Figure 11 shows the energy analysis for the MMN, RTTM, 2DMesh, 2DTorus, HFBN, TTN, and TESH networks under the uniform traffic of Figure 10. Figure 11 shows that HFBN obtains about 23.49% better efficiency than TTN and, compared with the Tofu(40,32,68,3,2,2) network, about 87.26% better efficiency at zero-load latency. The analysis also confirms that both the zero-load latency and the network saturation rate of HFBN are superior to those of every other network.
2) PERFECT SHUFFLE TRAFFIC
Figure 13 shows the energy usage for perfect shuffle traffic, again confirming the superiority of HFBN over every other network. At zero-load latency, the 2DTorus network shows the worst energy usage of all networks, due to its high network latency (shown in Figure 12) and its higher number of off-chip interconnects compared with the 2DMesh network. In this case, HFBN achieves about 47.07% better efficiency than TTN before network saturation, while the Tofu(40,32,68,3,2,2) network shows an efficiency gap of about 86.32% relative to HFBN at zero-load latency.

3) BIT-COMPLEMENT TRAFFIC
As the full-system simulator (gem5) has limited scalability [42], [43], we also consider the communication characteristics of the NAS parallel benchmarks [48] with their Message Passing Interface (MPI) implementation [47].
In static communication, the compiled-communication technique uses the compiler's knowledge of the application's communication requirements and of the underlying network structure, allowing communications to be optimized significantly at compile time [48]. The communication patterns of MPI programs can be subdivided into three types: static communications, dynamic communications, and dynamically analyzable communications. Static communications are those whose source and destination cores are determined at compile time.
Dynamically analyzable communications select the source and destination cores at runtime without incurring excessive overheads, while dynamic communications are resolved only at runtime. However, the majority of communications in scientific programs are static. Hence, in this part of the traffic analysis we consider the static communication pattern of MPI_Send, where all source-destination pairs are determined at compile time; specifically, we use the bit-complement traffic pattern with fixed sources and destinations. Table 9 shows the parameters for this traffic analysis. Figure 14 shows the performance analysis for the various networks, and Figure 15 shows that HFBN obtains about 30.76% better efficiency than TTN under the bit-complement traffic pattern.

F. EFFICIENT ENERGY USAGE ANALYSIS (65K Cores)
The 65K-core analysis considers up to the level-4 network, HFBN(2, 4, 1), starting with the Table 12 parameters for the 65K traffic analysis. We evaluated the power simulation under the same traffic conditions with respect to Table 13 and Table 11. The power analysis uses the parameters in Table 13 with the same accepted throughput (used as the injection rate in the Garnet 1.0 simulator [28]) taken as the dynamic communication performance parameter. In this section, we take 256 cores at the inter-chip level and 65,536 cores as the total number of simulated cores (except for the Tofu network).

1) UNIFORM TRAFFIC
For the 65K analysis, we consider electrical power up to the inter-chip level and optical power at the intra-rack and inter-rack levels. Figure 16 shows the traffic analysis for this case; using Table 11 (optical connectivity) and Table 13 (electrical connectivity), we obtain the power usage. Figure 17 shows the energy analysis for 65K uniform traffic, where HFBN achieves about 22.24% better efficiency than TTN.

2) PERFECT SHUFFLE TRAFFIC
The zero-load latency of HINs is always lower than that of the conventional networks. The perfect shuffle traffic analysis with 65K cores uses the same traffic parameters as Table 12, and the power-analysis parameters (Tables 13 and 11) are likewise the same for the network energy usage. Figure 19 shows the energy usage for perfect shuffle traffic, again confirming the superiority of HFBN over every other network. The 2DMesh and 2DTorus networks show the worst efficiency among the 2D networks due to their high network latency, shown in Figure 18. The analysis shows that HFBN achieves up to 18.39% better efficiency than TTN (Figure 19). Compared with the Tofu(20,16,17,3,2,2) network, even at a different network size (65,280 cores), HFBN achieves about 78.13% better efficiency at zero-load latency.

3) BIT-COMPLEMENT TRAFFIC
Figure 21 shows the energy usage for bit-complement traffic, which again confirms the superiority of HFBN over every other network. The 2DMesh and 2DTorus networks provide the worst efficiency among the 2D networks due to their high network latency, shown in Figure 20. The analysis shows that HFBN achieves up to 22.36% better energy efficiency than TTN (Figure 21). Compared with the Tofu network, even at a different network size, HFBN achieves much better energy efficiency (72.09%) at zero-load latency.