Rapid Topology Generation and Core Mapping of Optical Network-on-Chip for Heterogeneous Computing Platform

The explosive growth of deep learning (DL)-based artificial intelligence (AI) applications demands computing capabilities that cannot be achieved with traditional standalone CPU computing. Therefore, heavy mission-critical DL kernel computing currently relies on a heterogeneous computing (HGC) platform that integrates CPUs, GPUs, and accelerators, as well as substantial data storage elements. However, the metallic electrical interconnects in existing manycore platforms cannot sustain the massively increasing bandwidth demand of big-data-driven AI applications. Incorporating an optical network-on-chip (ONoC) to provide ultrahigh bandwidth, we propose a rapid topology generation and core mapping of ONoC (REGO) for energy-efficient HGC multicore architectures. The genetic algorithm (GA)-based REGO incorporates the structural characteristics of the optical router into the fitness function and thus balances the trade-off between the required throughput, optical signal-to-noise ratio (OSNR), and total energy consumption. Furthermore, the crossover step accelerates convergence by suppressing randomness in the GA, thus significantly reducing the excessive running time caused by the NP-hard nature of the problem. On average, the ONoC generated through REGO demonstrates an increase of 63.29 % and 22.80 % in throughput and a decrease of 50.24 % and 9.56 % in energy per bit, in the VGG-16 and VGG-19 compared with the conventional mesh- and torus-topology-based ONoCs, respectively.


I. INTRODUCTION
Deep learning (DL), a class of machine learning algorithms, trains a nonlinear function approximator represented by a deep neural network (DNN) architecture using input-output pairs of training data [1]. The primary goal of DL is to improve accuracy by learning the weights through backward propagation of errors (backpropagation). Repetitive operations that occur while learning errors in backpropagation require extremely high parallelism and vector-matrix operations. Therefore, a heterogeneous computing (HGC) platform that combines various types of processors and dedicated accelerators is required instead of a legacy CPU-based architecture [2]. In addition, an ultra-wideband on-chip network infrastructure is essential for handling excessively heavy data traffic.
The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.
Network-on-chip (NoC) is a scalable solution for on-chip communication infrastructure that can handle the ever-increasing number of processor cores integrated on a single chip. However, despite the continuing progress in transistor miniaturization, the challenging problems in the back-end-of-the-line (BEOL) fabrication steps that form the interconnect layer using metallic interconnects impede the expansion of the on-chip communication bandwidth. An optical NoC (ONoC) based on silicon photonics is being actively investigated as an alternative to electrical NoCs (ENoCs). Semiconductor companies such as IBM, Intel, and Mellanox have developed several optical devices and interface technologies which can be deployed in ONoCs [3]-[5]. In addition, in a recent study, AyarLabs developed TeraPHY, a highly integrated smart photoelectric chiplet capable of operating with a bandwidth of several tens of Tb/s in an ASIC, CPU, and FPGA package [6]. In [7], the authors compared the electrical mesh (EMesh) with four different types of ONoCs, including their own LumiNOC, under fair simulation conditions. All NoCs under comparison were organized in 64 tiles comprising corresponding core and router pairs operating at 5 GHz using a 22 nm CMOS process technology library. Table 1 shows a comparison of ENoC with four different ONoC architectures in an experimental environment established by Cheng. Corona, Flexishare, and Clos have 26, 3, and 1.5 times higher throughput and 14, 5, and 24 times higher power efficiency compared with EMesh, respectively. LumiNOC, which is an application-specific ONoC architecture, achieved 34 times higher throughput per watt compared to EMesh.
VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In summary, ONoC is still far from mass production; nonetheless, it has the potential to provide a significantly higher bandwidth and energy efficiency compared to that of the ENoC. Therefore, a rapid topology generation and core mapping of an ONoC (REGO) is proposed to effectively assess the unique DNN traffic patterns and HGC architecture. Primarily, irregular topology generation and core mapping are NP-hard problems that require a tremendous computational load. Furthermore, to minimize the laser power consumption of an ONoC, a large number of worst-case optical signal-to-noise ratio (OSNR) calculations with dynamically varying optical signal propagation models in various routing candidate paths must be involved. The resonance structure of micro-ring resonators (MRs) requires large-scale iterative calculations until the optical signal is stabilized in each optical router. The computational complexity increases exponentially with system augmentations along with the memory footprint [12]. Therefore, a high-speed algorithm that can achieve irregular topology generation and core mapping of ONoCs that satisfies a given design goal within a reasonable time is essential.
For the design space exploration of NoCs, two meta-heuristics, particle swarm optimization (PSO) [13] and the genetic algorithm (GA) [14], are commonly used. However, PSO has three drawbacks when applied to topology generation in NoCs. Because PSO searches for the optimal solution only by moving toward the current best solution, 1) it is affected more strongly by the initial population than the GA, and 2) it is more likely to become trapped in local minima [15]. 3) The network constraints for topology generation impose a huge computational load on the process of changing the locations of particles in PSO.
For these reasons, PSO is mainly used for core and application mapping [16]-[19]. The GA is a parallel and global optimization problem-solving technique that mimics natural selection and genetic inheritance [20]. The GA is frequently used in environments where the best solution must be found within an acceptable time [21]. Several studies have been conducted on GA-based irregular topology generation and core mapping to optimize power and performance under highly variable data traffic environments [22]-[24].
The GA is a population-based optimization heuristic that finds a solution through the iteration of selection, crossover, and mutation steps from the initial population of a group of chromosomes. The GA heuristic has been widely adopted for NP-hard problems in architecture space exploration including network topology and core mapping techniques. Three drawbacks are usually mentioned in the discussion of GA. First, GA is prone to a local minima problem unless it randomly generates an initialization population for genetic diversity. Second, an in-depth consideration of the fitness function in the evolution phase is required for the accuracy and convergence speed of the algorithm. Finally, it is difficult to assign an optimization problem to genetic data, such as chromosomes and genes. To alleviate the first and second drawbacks, the links of routers are randomly generated to maximize the randomness of the initial population in the initialization phase of the REGO, and the fitness function is defined to reflect the ONoC design characteristics. Meanwhile, because the throughput and OSNR are significantly affected by the internal connectivity and core mapping in the ONoCs, the objectives of the fitness function and the basic elements of the GA can be properly adapted to the searching scheme for ONoC topology solutions. The third drawback can thus be resolved. Therefore, the GA was selected as a framework to determine the optimal topology and core mapping solution for the ONoC implementation of the HGC platform.
While an electrical router consists of symmetric crossbar switches, an optical router comprising waveguides and MR switches is configured asymmetrically to minimize insertion loss and crosstalk noise. It is widely known that the OSNR dominates the power consumption of ONoCs and strongly depends on insertion loss and crosstalk noise. Furthermore, the insertion loss and crosstalk noise are significantly affected by the number of ports of the optical router according to the arrangement of optical elements [25]. Therefore, REGO considers OSNR variations based on the optical element configuration in the fitness calculation to optimize the trade-off between the data throughput and power consumption in HGC platform.
Furthermore, this study focuses on accelerating the convergence speed by applying the crossover that reflects the structural change of the routers depending on the OSNR and the number of ports. In the crossover step of the REGO, a fitness variation-based genetic data exchange scheme is applied to avoid unnecessary searching for chromosomes which violate ONoC constraints. Core mapping refers to the method of allocating cores to a given NoC topology in a specific order. The topology generation arranges the links of the routers and allocates appropriate routers according to the number of ports in the arranged network. In this process, the aggregation of router connections implies the location of cores, thus allowing topology generation to incorporate core mapping of ONoC with a small computational burden. In addition, considering different types of processor cores that show different capabilities and characteristics in an irregular topology, core mapping must be jointly performed with topology generation to meet the design objectives. Consequently, the consolidated topology generation and core mapping of the REGO are natural and beneficial for jointly optimizing the design objective with given constraints.
The remainder of this paper is organized as follows. Related work and background are described in Section II. Section III presents the main algorithm flow of GA-based REGO and describes how the properties of the optical elements are applied to the technique. The simulation results and analysis under various conditions are described in Section IV. Finally, the conclusions are drawn in Section V.

II. RELATED WORK
A. GA-BASED ENoC TOPOLOGY GENERATION
Core mapping has been studied for optimizing the power and performance of the target application in a regular topology where the topology is predetermined for routing efficiency and scalability. A variety of core mapping schemes for optimizing power consumption and performance in NoCs have been proposed [26]-[28]. In [26], a core mapping technique for NoCs using reinforcement learning in an HGC platform without a test set prepared in advance was proposed. In addition, Tahir et al. proposed a congestion-aware core mapping scheme using betweenness centrality that can identify highly loaded NoC links [27]. However, these studies rely on lightweight algorithms that simply swap the locations of cores, and such algorithms are difficult to apply to topology generation, which requires consideration of various network conditions.
Existing studies on GA-based ENoC topology generation have mainly focused on minimizing power consumption or maximizing throughput. Leary et al. proposed a GA-based topology generation for application-specific ENoC [29]. Herein, the authors achieved 30% lower total power consumption than deterministic heuristic techniques by considering the system-level floorplan with wire-length constraints along with the power consumption due to physical links. In [30], the power consumption and router resources were minimized while meeting bandwidth constraints through GA-based floorplan-aware topology synthesis. In addition, a GA-based mapping and routing (GAMR) approach was proposed for low energy design of ENoCs under bandwidth constraints [22]. The GAMR automatically mapped the cores of a given application onto the ENoC and generated deterministic deadlock-free minimal routing paths.
In contrast, GA-based topology generation schemes have been suggested for application specific ENoCs that have been pursued to improve throughput [23], [31]. These studies considered the required throughput of the given applications in the fitness function and evolution phase of the GA. Although the ENoC topology generation techniques mentioned so far can be partially adapted to ONoCs, additional considerations are essential because of the fundamentally different signal characteristics and interconnection medium. Moreover, the worst-case OSNR calculation is mandatory to determine the minimum required laser source power, which dominates the overall power consumption and performance.

B. ONOC ARCHITECTURE FOR HGC PLATFORM
Studies on irregular-topology-based ENoCs have been extended to accommodate optical interconnection thus achieving ultra-high bandwidth required for handling ever-increasing big data and/or DNN acceleration. Ahmed et al. proposed PHENIC 3D-ONoC, a silicon photonic 3D-NoC architecture for heterogeneous many-core system-on-chips (MCSoCs) [32]. The PHENIC 3D-ONoC is composed of an electronic control network (ECN) for path reservation, which can configure optical routers, and a number of photonic communication networks (PCNs), thereby providing approximately 10 % improvement in throughput compared to conventional 2D-mesh based ENoC. In [33], SHARP (shared heterogeneous architecture with reconfigurable photonic network-on-chip) showed 34 % more throughput and 25 % less energy consumption per bit compared to the mesh-based ENoC. SHARP clusters CPU and GPU cores around the same router and dynamically allocated bandwidth between CPU and GPU cores through single-writer multiple-reader (SWMR) crossbars according to the application requirement. Although PHENIC 3D-ONoC and SHARP deployed full optical data paths for exploiting the low power and broadband advantages of the optical interconnect, the potential capabilities of irregular topologies were not addressed.
With the explosive growth of big data-based DL applications, diverse architectural approaches to satisfy the enormous throughput and energy efficiency requirements are being actively pursued. For optical signal detection, the ONoC laser source sets the power margin considering the OSNR, photodetector sensitivity, and laser wall-plug efficiency ($L_e$). The MR heater for resonance wavelength tuning is reported to account for approximately 20 % of the total power consumption of the various ONoC topologies [34]. While the photodetector sensitivity and $L_e$ are hardware constraints that are not controllable, the OSNR and the number of MR heaters, which significantly affect the total power consumption of ONoCs, have a strong correlation with the implementation style. Consequently, both the OSNR and the number of MR heaters must be assessed in the process of GA-based topology generation and core mapping.
Fig. 1 depicts the overall procedure of GA-based REGO. First, three types of inputs are defined: ONoC parameters, GA parameters, and an application task graph. The ONoC parameters include the system-level ONoC specification such as configurable router types and loss coefficients of optical elements. The GA parameters indicate the constants required for initialization and evolution such as population size, selection probability, and the fractional ratio of crossover. The REGO receives as inputs an application task graph including the number of cores and ONoC parameters, which further include the available router structure and the loss and noise factors of the optical elements. Thus, the REGO can accommodate various router structures and optical elements because it calculates the worst-case OSNR through loss and noise parameters obtained in advance from the parameters of optical routers and elements.
A gene, which is a primitive element of the GA, is mapped to a router, including its structure and the link information connecting it to the adjacent routers and the corresponding core. Next, the genes are gathered to form a chromosome corresponding to the entire ONoC topology. All chromosomes go through the initialization phase, which creates a random population. They then enter the evolution phase with genetic information containing the fractional ratio accounting for the number of chromosomes classified into selection, crossover, and mutation. For each iteration of the evolution phase, the fitness of all chromosomes is calculated by the fitness function. The objectives of the fitness function include features of an ONoC such as the OSNR and the total number of MRs. When the convergence condition is satisfied, the best population is obtained from the REGO, which implies an irregular topology-based ONoC solution.

III. REGO
While the initialization phase attempts to maximize randomness, the fitness function and evolution phase consider the characteristics of the optical elements to optimize the performance and energy efficiency. The REGO finds a fitness-based solution by incorporating the crucial information relating to the OSNR, throughput, and MR heaters into the fitness function thereby optimizing both throughput and power consumption. In addition, unreliable factors caused by improper router connections can be reduced by attempting to crossover with regard to fitness variations of the objectives.

A. PROBLEM DEFINITION AND TERMINOLOGY
The GA-based REGO for the HGC platform satisfies the following two constraints to ensure path validity between the connected routers:
• Constraint 1. All routers and cores in a chromosome must be guaranteed to be connected.
• Constraint 2. A direct network in which each core forms a pair with only a single router is mandatory.
Table 2 describes the notations used in the REGO. The gene, an element of the set $G$, represents a single router with connectivity information. Each chromosome represents the entire ONoC topology as a group of genes and a routing table. Fig. 2 depicts an example configuration of a single chromosome that allows routers with four and five ports. Chromosomes have connectivity and routing table information for all routers in the network. Connectivity indicates the router number connected to each router $r_i$ and the corresponding port number. The routing table stores the input ports, output ports, and MR control signals for the source node $v_i$ and destination node $v_j$. The cardinality of $R$ is the same as the number of cores $|V|$, because each core is coupled to a single router by Constraint 2. REGO aims to find the best population with the highest fitness in $TG$, the task graph of the HGC platform, from the chromosome set $C_{set}$, whose elements are independently generated. Because the worst-case OSNR of the ONoC determines the laser power consumption while guaranteeing the required performance, it is necessary to set up a fitness function suitable for the ONoC environment. Therefore, the worst-case OSNR is an essential objective when assessing the fitness function [35].
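The chromosome encoding described above can be pictured with a small data-structure sketch. The class and field names below (`Gene`, `Chromosome`, `links`, `num_ports`) are illustrative assumptions rather than the authors' implementation. The routing table is omitted for brevity, and the port count is assumed to include the core link.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Gene:
    """One router r_i with its connectivity (hypothetical field names)."""
    router_id: int
    core_id: int                                # Constraint 2: one core per router
    links: set = field(default_factory=set)     # gl: ids of adjacent routers

    @property
    def num_ports(self) -> int:
        """nrp: router-to-router links plus the single core link (assumed)."""
        return len(self.links) + 1


@dataclass
class Chromosome:
    """A whole ONoC topology: one gene per router, so |R| = |V|."""
    genes: list

    def connect(self, a: int, b: int) -> None:
        """Add a bidirectional link between routers a and b."""
        self.genes[a].links.add(b)
        self.genes[b].links.add(a)

    def is_connected(self) -> bool:
        """Constraint 1: every router is reachable from router 0 (BFS)."""
        seen, queue = {0}, deque([0])
        while queue:
            for nb in self.genes[queue.popleft()].links:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        return len(seen) == len(self.genes)


# Usage: a 4-router ring, each router paired with the core of the same index.
ring = Chromosome([Gene(router_id=i, core_id=i) for i in range(4)])
for i in range(4):
    ring.connect(i, (i + 1) % 4)
```

Storing links as a set per router keeps the connectivity symmetric and makes the port count a cheap derived quantity, which is convenient when port bounds are checked repeatedly.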
As mentioned above, the MR heaters for tuning the resonance wavelength of the MRs account for approximately 20 % of the total power consumption of the ONoC, and the number of MR heaters is identical to the number of MRs [34]. If a router with a large number of ports is used as a building block, the number of required MRs increases, whereas the hop count in the longest path decreases. This relationship indicates that a trade-off exists between the number of MRs and the worst-case OSNR. Therefore, we separate the worst-case OSNR and the number of MRs into different objectives in the fitness calculation regarding power minimization. We incorporate a fitness function $F(X, I)$ comprising the importance factors $I_i$ and $M$ objectives $O_i(X)$ for multi-objective optimization proposed in [36].
Because the HGC platform for deep learning requires high throughput with low power consumption, we focused on optimizing power and throughput. Therefore, power and throughput were considered as objectives of the fitness function in REGO. A modified fitness function of a chromosome $f(C)$ based on (1) is introduced in the REGO, which utilizes the throughput $O_{thr}$, worst-case OSNR $O_{snr}$, and number of MRs $O_{mr}$ as objectives: where $I_i$ indicates the importance factor of the $i$-th objective.
Adjusting the importance factors in the fitness function facilitates determining the optimized ONoC topology in terms of energy efficiency and throughput. Every element comprising the fitness function is scaled to an identical range for the uniform application of the importance factors.
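To make the role of the importance factors concrete, the following is a minimal sketch that assumes the fitness is a weighted sum of the three objectives after each has been scaled to [0, 1]. The weighted-sum form, the min-max scaling, the inversion of the MR count (fewer MRs treated as better), and the bounds are all assumptions for illustration. The default weights 0.5, 0.4, and 0.1 are the importance factors reported in Section IV.

```python
def scale(value, lo, hi, higher_is_better=True):
    """Min-max scale a raw objective into [0, 1]; 1 is the preferred end."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    s = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return s if higher_is_better else 1.0 - s


def fitness(throughput, osnr_db, num_mrs, bounds, importance=(0.5, 0.4, 0.1)):
    """Assumed weighted sum of scaled throughput, worst-case OSNR, MR count."""
    i_thr, i_snr, i_mr = importance
    o_thr = scale(throughput, *bounds["thr"])
    o_snr = scale(osnr_db, *bounds["snr"])
    # Fewer MRs means fewer heaters, so the scaled value is inverted.
    o_mr = scale(num_mrs, *bounds["mr"], higher_is_better=False)
    return i_thr * o_thr + i_snr * o_snr + i_mr * o_mr


# Hypothetical scaling bounds, chosen only for this example.
bounds = {"thr": (0.0, 100.0), "snr": (0.0, 20.0), "mr": (100, 400)}
best = fitness(100.0, 20.0, 100, bounds)
middle = fitness(50.0, 10.0, 250, bounds)
worst = fitness(0.0, 0.0, 400, bounds)
```

Because every objective lives in the same range, changing an importance factor shifts the balance between throughput and power without one objective dominating merely because of its units.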

B. INITIALIZATION PHASE
Population initialization is closely related to the convergence speed and quality of the final solution. Random initialization is commonly used to generate an initial population when the genetic information is not known in advance [37]. The GA guarantees randomness in the initialization by maximizing the diversity of genes in the chromosomes.
Algorithm 1 presents the process of the initialization phase of REGO. Each chromosome is sequentially initialized with a constraint that establishes a connected network. Because the number of available router ports is limited by the type of router permitted in the ONoC design, the process of randomly connecting routers is repeated until the number of ports of all routers reaches $p_{min}$, the minimum number of ports. After connecting all routers in $C_i$, the path validity of each signal path $sp$ is checked. If $SP_i$ has an invalid path, adjusting $gl$ to ensure network connectivity might harm the randomness of genetic data as well as increase computational complexity. Therefore, in this case, the REGO abandons the entire router connection and repeats the above process.

Algorithm 1 Initialization Phase
In/Output: Chromosome set C_set
for i = 1 to |C_set| do
    while !(C_i constitutes connected network) do
        remove all connectivity of G_i in C_i
        for j = 1 to |R| do
            randomly select r_k of R in C_i, k ≠ j, where nrp_k < p_max, until nrp_j ≥ p_min
            update gl_j and gl_k
            if (p_min ≤ nrp_k && nrp_k ≤ p_max) then
                update gt_j and gt_k
            end if
        end for
        check path validity of sp_jk in C_i where 1 ≤ j, k ≤ |V|
    end while
    generate routing table RT_i of C_i
    sort C_i (∈ C_set) in descending order of fitness value f(C_i)
end for
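The connect-and-retry idea of the initialization phase can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the port count is taken to include the core link, router types and routing tables are omitted, and a `max_attempts` guard (not part of Algorithm 1) is added so the sketch always terminates.

```python
import random


def is_connected(links):
    """Depth-first check that every router is reachable from router 0."""
    seen, stack = {0}, [0]
    while stack:
        for nb in links[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(links)


def random_topology(n_routers, p_min, p_max, rng, max_attempts=10_000):
    """Randomly link routers until each has at least p_min ports.

    On a dead end or a disconnected result, all links are discarded and
    the attempt restarts, mirroring the 'abandon and repeat' rule.
    """
    for _ in range(max_attempts):
        links = [set() for _ in range(n_routers)]
        dead_end = False
        for j in range(n_routers):
            while len(links[j]) + 1 < p_min:          # +1 for the core link
                candidates = [k for k in range(n_routers)
                              if k != j and k not in links[j]
                              and len(links[k]) + 1 < p_max]
                if not candidates:
                    dead_end = True
                    break
                k = rng.choice(candidates)
                links[j].add(k)
                links[k].add(j)
            if dead_end:
                break
        if not dead_end and is_connected(links):
            return links
    raise RuntimeError("no valid topology found within max_attempts")


# Usage: 16 routers with four- and five-port routers allowed.
topology = random_topology(16, p_min=4, p_max=5, rng=random.Random(0))
```

Restarting from scratch, rather than repairing a failed attempt, keeps the sampled topologies unbiased at the cost of some wasted attempts.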

C. EVOLUTION PHASE
The procedure of the evolution phase in REGO is depicted in Fig. 3. The selection, crossover, and mutation steps of the evolution phase are performed according to λ, ξ, and µ, respectively. In the selection step, chromosomes with high fitness are propagated to the next generation with selection probability $p_s$. In the crossover step, chromosomes with higher fitness than the previous generation are generated through the exchange of genetic data between chromosomes. In the mutation step, randomness is assigned to the chromosome set by transforming the genetic data in a random range of the chromosomes.
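As a small illustration of how the three ratios could divide a population among the steps, the sketch below splits a population into selection, crossover, and mutation shares. The function name and the population size of 100 are hypothetical. The ratios 0.35, 0.6, and 0.05 are the values of λ, ξ, and µ reported in Section IV, and treating them as fractions that sum to one is an assumption.

```python
def split_population(pop_size, lam, xi, mu):
    """Return (selection, crossover, mutation) chromosome counts."""
    if abs(lam + xi + mu - 1.0) > 1e-9:
        raise ValueError("ratios are assumed to sum to 1")
    n_sel = round(pop_size * lam)
    n_cross = round(pop_size * xi)
    n_mut = pop_size - n_sel - n_cross     # remainder goes to mutation
    return n_sel, n_cross, n_mut


counts = split_population(100, lam=0.35, xi=0.60, mu=0.05)
```

Assigning the remainder to the last step guarantees that the three counts always add up to the population size, even when rounding is involved.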

1) SELECTION
The selection step in the REGO propagates chromosomes with high fitness values to the next generation. Even if the fitness value of a chromosome is low, that chromosome might contain a dominant gene that elevates the fitness value; thus, all chromosomes should be given an opportunity to be preserved for the next generation. In REGO, the chromosomes are initialized in a serial manner and then sorted according to their fitness values. Herein, the chromosomes pre-sorted in the initialization phase dramatically reduce the computational complexity of tournament selection.
Thus, the REGO uses tournament selection, which assigns ranks based on the fitness value and selects with a selection probability $p_s$. In the crossover and mutation of REGO, only the chromosomes with modified fitness values need to be re-sorted, and thus the computational complexity is decreased compared to the initial sorting. Furthermore, tournament selection offers the benefit of reducing the enormous calculation time required to find the worst-case OSNR.
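The benefit of a pre-sorted population can be seen in a short sketch of probabilistic tournament selection. The function name and the tournament size `k` are assumptions introduced for illustration, and the acceptance rule (the best contestant wins with probability `p_s`, otherwise the next one is tried) is the textbook variant rather than a confirmed detail of REGO.

```python
import random


def tournament_select(sorted_pop, k, p_s, rng):
    """Pick one chromosome from a population sorted by descending fitness.

    Because the population is pre-sorted, a contestant's index is its rank,
    so no fitness value has to be re-evaluated during the tournament.
    """
    ranks = sorted(rng.sample(range(len(sorted_pop)), k))   # lower = fitter
    for rank in ranks[:-1]:
        if rng.random() < p_s:
            return sorted_pop[rank]
    return sorted_pop[ranks[-1]]          # fall back to the weakest contestant


# Usage with placeholder chromosome labels, fittest first.
population = ["c0", "c1", "c2", "c3", "c4", "c5"]
winner = tournament_select(population, k=3, p_s=0.8, rng=random.Random(42))
```

With `p_s` below one, weaker chromosomes retain a small chance of surviving, which matches the stated aim of giving every chromosome an opportunity to be preserved.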

2) CROSSOVER
After completing the selection step, the chromosomes of the remaining $C_{set}$ are selected for crossover according to the crossover rate ξ. In the crossover step of the REGO, the genetic data $gt$ and $gl$ representing the type and connectivity of the optical router are exchanged between selected chromosomes. To comply with the basic constraints of REGO in Section III.A, all chromosomes must satisfy the following three crossover conditions (COCs): The REGO adopts a two-point crossover method that exchanges a single router to avoid overlapping cases that violate the above COCs. If COC 1 is not satisfied after the crossover step, the exchanged router can be regarded as the dominant gene that determines the connectivity of the entire network. Forcing the crossover by connecting the remaining isolated routers to satisfy COC 1 invalidates the effect of the crossover because most of the connectivity must be migrated from the previous generation. Therefore, when a case that violates COC 1 occurs, a different gene or chromosome is newly selected for crossover in the REGO.
Chromosomes that violate COCs 2 or 3 frequently appear during the evolution phase iterations. Selecting only genes and chromosomes that fulfill all three COCs for crossover severely reduces the diversity of the chromosome set. Therefore, REGO should search for alternative routers that satisfy only COCs 2 and 3. The alternative routers must maintain the connection properties that affect the fitness of the router originally intended to be connected. Fig. 4 illustrates the architecture of a Cygnus router, which is a 5 × 5 optical non-blocking router, where the MR state is ms and the input-to-output ratio in the lookup

Similarly, when the MR state is $ms$ and the noise power LUT is $LUT[i][ms][n][j]$, the output noise power of the $j$-th port of the Cygnus router, $P^{out}_{n,j}$, can be calculated using (4) according to $P^{in}_{n,i}$, which is the input noise power of the $i$-th port. Assuming that the signal and noise power of every input port are commonly $P^{in}_{s}$ and $P^{in}_{n}$, respectively, the OSNR of the $j$-th output port, $OSNR_j$, is calculated using (5). It should be noted that the term $P^{in}_{n} / P^{in}_{s}$ in the denominator of (5) is much less than 1 to guarantee signal reliability. Thus, the minimum OSNR, insertion loss, and crosstalk coefficient in the Cygnus router calculated using (5) are 13.44, −0.69, and −14.13 dB, respectively, which increased in proportion to the number of Cygnus routers on the routing path.
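The three reported Cygnus figures can be cross-checked with simple decibel arithmetic. The sketch below assumes that, once the small noise-to-signal input term is neglected, the router-level OSNR reduces to the ratio of the signal transmission (insertion loss) to the crosstalk coefficient, so in decibels it is their difference. This simplified relation and the function names are assumptions, not a reproduction of (5).

```python
import math


def osnr_db(insertion_loss_db, crosstalk_db):
    """Assumed router-level OSNR in dB: insertion loss minus crosstalk."""
    return insertion_loss_db - crosstalk_db


def db_to_linear(db):
    """Convert a power ratio from decibels to a linear ratio."""
    return 10.0 ** (db / 10.0)


# Values reported for the Cygnus router.
cygnus_osnr = osnr_db(-0.69, -14.13)

# The same check in the linear domain: a ratio of two power gains.
linear_ratio = db_to_linear(-0.69) / db_to_linear(-14.13)
cygnus_osnr_from_linear = 10.0 * math.log10(linear_ratio)
```

Under this assumption the reported values are mutually consistent: −0.69 dB − (−14.13 dB) gives the stated minimum OSNR of 13.44 dB.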
The packet delay $P_{delay}$, ignoring data collisions, is expressed as (6), where $OR_{bw}$, $OR_{h}$, $ER_{clk}$, and $ER_{pipe}$ indicate the link bandwidth, the number of hops of the routing path, the operating clock frequency, and the number of pipeline stages of the electrical router, respectively.
Because the packet delay increases as $OR_{h}$ increases, the throughput contribution to the fitness function decreases. Consequently, the router substituted to avoid COC violations must be placed adjacent to the router originally intended to be connected, considering both the OSNR and throughput.

Algorithm 2 Crossover Step
In/Output: Chromosome set C_set
select chromosomes for crossover in C_set based on ξ
select a random pair (C_i, C_j) of the selected chromosomes (i, j ∈ |C_set|)
randomly select r_k of R
C_tmp_i = C_i, C_tmp_j = C_j
while !(COC 1) do
    randomly select r_k of R
    replace gl_k and gt_k in C_i with gl_k and gt_k in C_tmp_j
    replace gl_k and gt_k in C_j with gl_k and gt_k in C_tmp_i
    if gl_k violates COC 2 or 3 then
        replace gl_k with the closest router from gl_k causing minimum fitness variation in RT
    end if
    check path validity of sp_jk in C_i and C_j where 1 ≤ j, k ≤ |V|
end while
if both C_i and C_j satisfy network connectivity then
    update gt_k according to nrp_k
    mark C_i and C_j to avoid redundant crossover
else
    recover C_i and C_j based on C_tmp_i and C_tmp_j
end if

Algorithm 2 describes the behavior of the crossover step in the REGO. The REGO, based on two-point crossover, randomly selects a single router that contains the genetic information to be exchanged. Chromosomes $C_i$ and $C_j$ selected for crossover are stored as temporary variables for exchange and recovery. The router link $gl_k$ violating a COC is replaced by a valid router link with regard to the fitness variation. Fig. 5 shows an example of the crossover step in the REGO, where $r_5$ is selected as the target router to be exchanged between $C_1$ and $C_2$ when $p_{min}$ is four and $p_{max}$ is five in a 16-core ONoC. $C_2$ can accept the genetic data of $C_1$ without violating the COCs, whereas three violations of COCs 2 and 3 occur in $C_1$ in the crossover step. As $gl_5$ of $C_1$ is replaced with $gl_5$ of $C_2$, the number of ports of $r_4$ in $C_1$, $nrp_4$, becomes three including the link with the core, which is less than $p_{min}$. In addition, $nrp_2$ and $nrp_{12}$ become larger than $p_{max}$ as they were equal to $p_{max}$ before the crossover.
FIGURE 5. Example of crossover step.
Accordingly, $r_4$, which has an insufficient number of ports, is connected to the target router $r_5$ to satisfy the port constraint. The available port in $r_5$ intended to be assigned for connecting to $r_{12}$ is reallocated to link $r_4$. Thus, the COC 2 and COC 3 violations caused by $r_4$ and $r_{12}$, respectively, are simultaneously resolved. Finally, $r_2$ is replaced by $r_6$, which is adjacent to $r_2$, as revealed by the routing table $RT$. In this way, the crossover step of REGO reduces fitness variation by minimizing the inevitable gene modification when resolving COC violations. Consequently, the convergence speed in the evolution phase is accelerated by the crossover along with the tournament selection in REGO.
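The kind of check that drives this repair can be sketched as a function that reports which conditions a topology violates. The meaning assigned to each COC below is inferred from the surrounding discussion and should be read as an assumption: COC 1 as network connectivity, COC 2 as the lower port bound $p_{min}$, and COC 3 as the upper port bound $p_{max}$. The function and key names are hypothetical.

```python
def coc_violations(links, p_min, p_max):
    """Report assumed COC violations for an adjacency list of router links."""
    n = len(links)
    seen, stack = {0}, [0]
    while stack:                                  # reachability from router 0
        for nb in links[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    ports = [len(nbrs) + 1 for nbrs in links]     # +1 for the core link
    return {
        "coc1_disconnected": len(seen) != n,
        "coc2_too_few_ports": [r for r in range(n) if ports[r] < p_min],
        "coc3_too_many_ports": [r for r in range(n) if ports[r] > p_max],
    }


def make_ring(n):
    """Ring topology: every router has two router links plus its core."""
    return [{(i - 1) % n, (i + 1) % n} for i in range(n)]


# Usage: start from a valid 5-router ring, then disturb it as a crossover might.
topo = make_ring(5)
clean = coc_violations(topo, p_min=3, p_max=3)

topo[0].add(2); topo[2].add(0)                    # extra link: 0 and 2 overflow
topo[3].discard(4); topo[4].discard(3)            # lost link: 3 and 4 underflow
broken = coc_violations(topo, p_min=3, p_max=3)
```

Returning the offending routers, rather than a single pass or fail flag, is what allows a repair step to target only the routers that need a replacement link.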

3) MUTATION
Mutation is a unique way of assigning additional randomness to the chromosome set, unlike selection and crossover, which depend on the diversity of the initial population [38]. In each chromosome of the remaining $C_{set}$ chosen according to the mutation rate µ after the selection and crossover steps, an independent random range of genes in $G$ is selected to be mutated. The REGO maximizes the randomness of the chromosome set by randomly modifying the selected genes, which include the type and connectivity of the optical router. Fig. 6 shows an example of a mutation in $C_x$. Mutation starts by searching for link candidates to be added or deleted. REGO generates a mutated chromosome by randomly selecting some candidates. When the mutated $C_x$ satisfies Constraint 2, $C_x$ is replaced with the mutated chromosome. To comply with Constraint 2 of Section III.A, each mutation confirms the path validity of the chromosome containing the modified genes.
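A minimal sketch of the add-or-delete mutation follows. It toggles a few randomly chosen router pairs and accepts the result only if the topology is still valid, otherwise the original chromosome is kept. The function names, the number of toggled links, and the inclusion of a port-bound check alongside the connectivity check are assumptions made for illustration.

```python
import random


def connected(links):
    """Depth-first check that every router is reachable from router 0."""
    seen, stack = {0}, [0]
    while stack:
        for nb in links[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(links)


def mutate(links, p_min, p_max, rng, n_changes=2):
    """Toggle n_changes random links; keep the mutant only if it is valid."""
    trial = [set(nbrs) for nbrs in links]         # work on a copy
    for _ in range(n_changes):
        a, b = rng.sample(range(len(trial)), 2)
        if b in trial[a]:                         # existing link: delete it
            trial[a].remove(b)
            trial[b].remove(a)
        else:                                     # missing link: add it
            trial[a].add(b)
            trial[b].add(a)
    ports_ok = all(p_min <= len(nbrs) + 1 <= p_max for nbrs in trial)
    return trial if ports_ok and connected(trial) else links


# Usage: mutate a 6-router ring that allows three to five ports per router.
base = [{(i - 1) % 6, (i + 1) % 6} for i in range(6)]
mutated = mutate(base, p_min=3, p_max=5, rng=random.Random(7))
```

Rejecting invalid mutants outright is the simplest policy; it keeps every chromosome in the set feasible without a separate repair step.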

IV. EVALUATION
A SystemC-based cycle-accurate simulator was built with ONoC parameters extracted through the linear optical device model (LODM) proposed in [12]. The visual geometry group (VGG), an academic group focused on computer vision at Oxford University, was dedicated to developing VGG-16 and VGG-19 16-layer and 19-layer deep convolutional networks, respectively [39]. We adopted the target application model of HGC as VGG-16 and VGG-19 [39], which are widely applied DNN models. Through the simulation, the throughput, latency, and energy efficiency of the derived ONoC topology and the convergence speed of the REGO were measured. To demonstrate the algorithmic complexity of the REGO, we compare the convergence speed of the REGO against the discrete binary PSO (BPSO)-based topology generation method and GA-based random mapping method that attempts to connect to the router randomly in the case of COC violation. We constructed the discrete BPSO-based topology generation method by extending the discrete BPSO-based core mapping method with the chaotic disturbance proposed in [16]. The extended discrete BPSO-based topology generation method uses the same fitness function, population size, and initial population as the REGO for a fair comparison. The velocity vector of the extended BPSO-based topology generation method includes the addition and deletion of connections between the routers, instead of swapping the position of the core. We assumed the maximum velocity of the extended discrete BPSO as three links, and applied chaotic disturbance when the movement of the location vector was inevitable owing to the network constraints of Section III.A. The random mapping method consists of the same GA parameters and processes as those of REGO, except for the responses of COC violation cases. We prepared 20 different initial populations to prevent inaccurate convergence speed derivation due to coincidence.
To analyze the contribution of REGO, the throughput, latency, and energy consumption of the irregular topology generated through REGO are compared with those of the regular mesh and torus topologies. Typically, the throughput in an NoC is defined as the amount of data transmitted per unit time. Accordingly, we assessed the throughput as the total data movement per unit time after all data transmission was completed in the ONoC under the traffic of the target application. Latency per bit, which accounts for data collision cases in real traffic, is obtained as the reciprocal of the throughput. Energy per bit is calculated by multiplying the power consumption, which depends on the configuration of the ONoC, by the latency per bit.
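The three metric definitions above can be written out as a short Python sketch. The function name `onoc_metrics`, its parameter names, and the numeric inputs are hypothetical and chosen only to show the arithmetic; they are not measurements from the simulator.

```python
def onoc_metrics(total_bits, completion_time_s, power_w):
    """Compute throughput, latency per bit, and energy per bit.

    total_bits        : total data moved by the application traffic (bits)
    completion_time_s : time until all transmissions complete (seconds)
    power_w           : power consumption of the ONoC configuration (watts)
    """
    throughput_bps = total_bits / completion_time_s   # data per unit time
    latency_per_bit_s = 1.0 / throughput_bps          # reciprocal of throughput
    energy_per_bit_j = power_w * latency_per_bit_s    # power x latency per bit
    return throughput_bps, latency_per_bit_s, energy_per_bit_j


# Hypothetical inputs: 8 Gbit moved in 0.5 s by a network drawing 2 W.
throughput, latency, energy = onoc_metrics(8e9, 0.5, 2.0)
print(f"{throughput:.3e} bit/s, {latency:.3e} s/bit, {energy:.3e} J/bit")
```

Because energy per bit is the product of power and latency per bit, a topology that shortens the completion time lowers the energy per bit even when its power consumption is unchanged.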
The specifications of the ONoC optical devices are listed in Table 3. In the mesh and torus, a Cygnus-based non-blocking optical router with four and five I/O ports was deployed [43]. A three-stage pipelined electrical router with an operating clock frequency of 1 GHz was assumed to control the optical layer.
According to [34], the MR heating power is a quarter of the laser power. Taking this relationship into account, we set the importance factors I 1 , I 2 , and I 3 of (2) in Section III.A to 0.5, 0.4, and 0.1, respectively. In a GA, ξ in the range of 60 %-90 % and µ in the range of 1 %-5 % can rapidly obtain a feasible solution [44]. Accordingly, λ, ξ , and µ were set to 0.35, 0.6, and 0.05, respectively. The OSNR is analyzed using optical elements with the parameters listed in Table 4 [45]- [49]. First, the router-level OSNR was calculated using the optical crossbar implemented in Verilog-AMS. Next, the transport-level OSNR was analyzed through EWOSA [35], a high-speed OSNR analysis method. The configuration of the HGC system, which consists of multicore CPUs, GPUs, and memory controllers (MCs), is presented in Table 5. Each GPU core consists of four unified shaders, and the last-level cache (LLC) is assumed to be an L2 cache shared between all CPU and GPU cores.
Fig. 7 shows the convergence of REGO, the GA-based random mapping method, and the discrete BPSO as a function of the iteration number of the evolution phase. In VGG-16, the fitness values of REGO, the GA-based random mapping method, and the discrete BPSO converged on average at the 3635th, 5672nd, and 1180th iterations, respectively. In VGG-19, they converged on average at the 4065th, 4580th, and 989th iterations, respectively. Thus, REGO converged on average 1.56 and 1.13 times faster than the GA-based random mapping method in VGG-16 and VGG-19, respectively. The discrete BPSO converged faster than REGO in most cases. However, because the discrete BPSO searches only in the direction of the past and current best particles, its search range was narrow and it settled on a local best solution. As a result, the average converged fitness of REGO was 0.003 and 0.011 higher than that of the random mapping method and the discrete BPSO, respectively.
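The reported speed-up factors follow directly from the average convergence iterations quoted above. The short Python check below reproduces that arithmetic; the dictionary layout and the helper name `speedup` are illustrative only.

```python
# Average iteration at which the fitness converged, as reported in the text.
convergence_iterations = {
    "VGG-16": {"REGO": 3635, "random_mapping": 5672, "BPSO": 1180},
    "VGG-19": {"REGO": 4065, "random_mapping": 4580, "BPSO": 989},
}


def speedup(baseline_iterations, method_iterations):
    """Ratio of iterations needed by the baseline to those needed by the method."""
    return baseline_iterations / method_iterations


for model, iters in convergence_iterations.items():
    ratio = speedup(iters["random_mapping"], iters["REGO"])
    print(model, round(ratio, 2))
# VGG-16 1.56
# VGG-19 1.13
```

This ratio counts iterations only and does not account for differences in runtime per iteration between the methods.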

A. CONVERGENCE SPEED ANALYSIS
The average CPU runtime per iteration of REGO, GA-based random mapping method, and discrete BPSO was and O(N population · TC EWOSA ), respectively. These results stem from the significant computational complexity of the discrete BPSO, which calculates the fitness for all particles. Moreover, the discrete BPSO was slowed down by the additional searches, performed for every particle, for router links that could be added or removed.
As shown in Fig. 7, when deriving the best population, crossing the dominant genes of existing chromosomes is more likely to produce a population with high fitness than randomly generated genes, such as those introduced by mutation. These results reflect the fitness value variation of genetic data that is covered by the crossover step of REGO but not by the random mapping method.
Table 6 compares the irregular topology generated through REGO with the regular mesh and torus topologies in terms of throughput, worst-case OSNR, number of MRs, and fitness. Increasing the number of links in the ONoC can improve the path diversity of packet transmission. High path diversity is beneficial to throughput; however, it may have an adverse effect on OSNR because of the additional noise. For this reason, the torus-based ONoC exhibited higher throughput and lower worst-case OSNR than the mesh-based ONoC, as shown in Table 6.

B. THROUGHPUT AND ENERGY EFFICIENCY ANALYSIS
The irregular topology produced by REGO showed a maximum throughput improvement of 117.30 % and 78.09 % in HGC-1 and HGC-2, respectively, compared to the conventional mesh topology (68.29 % and 51.76 % on average). In addition, compared to the torus topology, the throughput improvement of REGO for HGC-1 and HGC-2 was up to 23.80 % and 17.80 %, respectively (12.53 % and 10.46 % on average). At the crossover step of the GA in REGO, the dominant genes are exchanged while the existing properties related to OSNR and throughput are maintained; consequently, the throughput of the ONoC increases significantly.
The average fitness value of the topology obtained by REGO was 4.72 % and 22.77 % higher than that of the mesh and torus, respectively. These results indicate that an irregular topology-based solution is required to optimize the multiple objectives desired in an ONoC for the HGC platform.
Figs. 8 (a) and (b) show the normalized latencies of HGC-1 and HGC-2, respectively. The irregular topology obtained from REGO had 60.03 % and 11.49 % lower average latencies than the mesh and torus topologies, respectively. The irregular topology optimized for the application has a structural advantage over the existing regular topologies in terms of latency. Moreover, the lower latency of the irregular topology-based ONoC contributed to an increase in energy efficiency.
Figs. 9 (a) and (b) show the average energy per bit for HGC-1 and HGC-2, respectively. According to Table 6, in the 16-core network running the VGG-16 application, the mesh topology has an approximately 2.14 dB higher worst-case OSNR than the topology obtained from REGO. However, Fig. 9 shows that the energy per bit of the network generated by REGO is 58.10 % lower than that of the mesh-based network, because the mesh-based ONoC has a high latency with a fixed number of MRs. The mesh topology has a relatively large MR heating power owing to the adoption of a fixed-structure router, and a 31.16 % difference in bit latency significantly affects the energy per bit. Because this difference is noticeable in the 32-core network, where the number of MRs increases, the 32-core solution obtained from REGO in HGC-1 has, on average, 50.25 % and 9.91 % lower energy per bit than the mesh-based and torus-based networks, respectively. In conclusion, the ONoC generated by REGO can go beyond regular topology-based networks for DNNs in terms of both energy efficiency and throughput.

V. CONCLUSION
In this paper, we proposed REGO, a GA-based method that optimizes the throughput and energy efficiency of the ONoC required by the HGC platform. The ONoCs produced by REGO achieved 63.29 % and 22.80 % higher throughput than mesh- and torus-based ONoCs, respectively, with high energy efficiency. Moreover, the crossover step in REGO accelerated the convergence of the fitness value through quick recovery from design constraint violations. REGO optimized the trade-off between throughput and power consumption while considering OSNR variations based on the parameters of the optical elements.