Impact of Application Characteristics on Laser Energy Fluctuation in Integrated Photonic Switching Systems

Laser energy cost is one of the primary energy budgets in integrated photonic switching systems. Traditional photonic switch testing injects random traffic into the switches, which yields only a “static” laser energy cost result. However, the laser's energy-per-bit performance fluctuates with the process mapping scenarios of applications. In this paper, we conduct experiments to study the influence of an application's process mapping on the photonic switches' injection traces. We then build a model that connects the traces to the laser's energy-per-bit performance. To obtain the energy fluctuation results quickly and accurately, we propose a heuristic-based energy boundary searching methodology that incorporates this model. We also analyze the speedup and convergence of the methodology. Two photonic switches are studied under five kinds of application traces. Compared with the enumeration method, the study shows a searching speedup of over 60% and an accuracy of 90% in most cases, and the lasers' energy fluctuations vary from nearly 0% to over 150%. We further analyze the factors inducing such large differences in fluctuation, and we propose and discuss a qualitative criterion that predicts their magnitude.

computing systems. Leveraging wavelength routing, division multiplexing, and fabricating techniques, photonic switches have been a promising candidate to be integrated with chiplets or built alone as a single chip for distributed computing systems [1].
Recently, many photonic switches have been proposed, built, and tested [2], [3], [4], [5], [6], [7], [8]. Besides the communication performance, the optical energy cost should also be carefully considered, especially when the targeted system is energy-sensitive [9], [10]. Studies have shown that the laser energy cost is the main part of the total photonic energy budget [11]. The laser energy cost can be studied at the device level and at the system level. At the device level, researchers focus on the insertion loss and crosstalk performance by injecting randomly modulated optical signals into one or several of the input ports [2], [7], [8]. Tunable lasers, modulators, photonic detectors, and oscillators are used to obtain the details of eye diagrams and the spectrum power loss. It can be observed that different optical paths have different path insertion losses. These variations determine the minimum laser power injected for each path. At the system level, energy per bit is another vital metric. The energy per bit is determined by both the application traffic and the photonic switches. Therefore, besides testing the characteristics of the photonic switches at the device level, the application traffic that the photonic switches carry should also be studied in combination with the device-level characteristics to obtain the energy-per-bit performance.
We focus on studying the influence of multi-core/node parallel application characteristics on the laser's energy-per-bit performance. During the execution of parallel applications, multiple processes are generated and allocated to different cores or nodes. Unlike the randomly generated test traffic in the device-level studies, these processes produce different amounts of traffic. Therefore, the amount of traffic injected into each of the switch's ports differs. Furthermore, we found that the mapping of processes to cores or nodes is not fixed, which leads to different mapping scenarios. These different scenarios, the traffic variation among processes, and the differences in path insertion losses lead to the fluctuation of the laser energy-per-bit performance. To the best of our knowledge, the fluctuating characteristic of the energy per bit in integrated photonic switching systems has not been fully studied. Fig. 1 shows the workflow that is based on our proposed tools, models, algorithm, and prediction criterion. This work can be regarded as an off-line evaluation and optimization tool that can be used jointly with other simulation tools or based on realistic photonic switch testing results. The whole workflow can evaluate whether a photonic switch can be deployed in an application-specific computing system, considering the energy cost requirement and the application characteristics. Meanwhile, this work also facilitates the optimization of process mapping and system energy consumption. Furthermore, the prediction criterion discussed in the manuscript helps to determine which application will cause the largest range of energy fluctuations.

Fig. 1. The workflow is based on our proposed tools, models, algorithm, and prediction criterion.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The structure of this paper is as follows. In Section II, experiments that study the characteristics of parallel applications are shown. The traces generated by application processes are extracted and expressed in a communication matrix. We then study the influence of process mapping scenarios on the communication matrix. In Section III, to integrate the application traces into the study of the laser energy-per-bit performance, we build an energy-per-bit model that jointly considers the communication matrix, the process mapping scenarios, and the device characteristics. We also propose a heuristic methodology to efficiently evaluate the laser energy-per-bit fluctuation caused by process mapping scenarios. In Section IV, two photonic switches and five communication matrices are used as study cases to show the accuracy and speedup of the proposed methodology. In Section V, based on the evaluation results, we discuss the influence of the application traffic characteristics and the magnitude of the energy fluctuation. A qualitative prediction criterion is provided. The influence of crosstalk on the energy-per-bit performance is also discussed. The related work is shown in Section VI, and the conclusion is drawn in Section VII.

A. Overview of the Integrated Photonic Switching Systems
Since the Message Passing Interface (MPI) communication library is widely used for parallel programming in HPC and AI systems, in this work we take MPI-based applications as examples to study the parallel-application traces. Fig. 2 shows an overview of the integrated photonic switching system. We also depict the relationship between the application, the processes, and the proposed communication matrix. In an MPI-based application, the execution of the application is divided into several processes, and these processes are allocated to different cores in one computation node or in several computation nodes, according to the MPI allocation strategy and the operating system kernel. The execution of processes in cores/nodes finally generates the traces injected into the photonic switch in the system. Fig. 3 shows the details of the hardware structures we focus on in this work. At the hardware level, we study the scenarios where a photonic switch connects all computation nodes. The hardware scenarios that we focus on are similar to the All-to-All mode in Xian Xiao's work [5] or the intra-region communication mode in Zheng Cao's work [12]. The photonic switch can be built based on microring resonators (MRR), Mach-Zehnder interferometers (MZI), or arrayed waveguide grating routers (AWGR) [13], [14]. A computation node includes one or more CPU cores. If multiple cores are included in a computation node, a data hub is used to connect all cores. The intra-node data transmission is handled by the hub. The inter-node data is first queued in the data hub and then sent out of the computation node in order.
The computation node is regarded as the injection source of the photonic switch. The routing characteristics of the photonic switch determine the laser management. As we only focus on the one-photonic-switch scenarios, we consider two main routing schemes. For the wavelength routing cases shown in Fig. 3(a) and (b), similar to the setup in [5], each computation node includes several laser sources which generate different wavelengths. Before transmitting through the photonic switch, the data from the computation node will be modulated to a specific wavelength and then injected into the photonic switch. In another case, a photonic switching structure can also work based on a single wavelength. Fig. 3(c) and (d) show the cases where only one laser source is used for each computation node. To avoid path contention and set up multiple optical paths, arbitration of electronic data from each computation node is needed, and the photonic switch also requires a configuration process [14].

B. Trace Extraction and Communication Matrix
Extraction of the application's traces is done in two steps. First, a tool named dumpi [15] is used to record the communication generated by each process. There are two types of communication that a process can generate: point-to-point communication and collective communication. Collective communication can be regarded as a set of point-to-point communications executed in chronological order. To collect traces from all processes, we expand the collective communication and merge it with point-to-point communication in the next step. To capture the characteristics of an application, we build the Communication Matrix (CM) based on the extracted traces. For illustration simplicity, we introduce the concept of a communication matrix based on the first type of connection in Fig. 3(a). Our definition of the communication matrix can be easily extended to the second type by accumulating the related elements that correspond to the cores in the same cluster. The Communication Matrix records the total number of communications between each pair of cores/nodes when the application runs once. For simplicity, in the rest of the paper, we use the term cores to denote the hardware that processes are allocated to. If application $k$ runs in an $N$-core system, the Communication Matrix is defined as (1). The value $M_{i,j}$ represents the amount of communication from Core $i$ to Core $j$ during the whole execution time. For simplicity, in the following sections, we use $CM_k = [M_{i,j}]_{N \times N}$ to express the Communication Matrix.
The $CM_k$ can be used to describe the traffic that is injected into each of the photonic switch's input ports. As we have mentioned, since processes execute different functions and operations, the elements $M_{i,j}$ in $CM_k$ may differ from each other. Therefore, cores may inject different amounts of data through the input ports of a photonic switch.
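The construction of the matrix can be sketched as follows. This is a minimal illustration, not the tool chain used in the paper: the record format `(src_core, dst_core)`, the helper names, and the flat expansion of a broadcast into root-to-all sends are assumptions made for the example.

```python
from collections import Counter


def expand_broadcast(root, n_cores):
    """Expand one broadcast into point-to-point sends (assumed flat expansion)."""
    return [(root, dst) for dst in range(n_cores) if dst != root]


def build_communication_matrix(records, n_cores):
    """Count messages per (source core, destination core) pair.

    records: iterable of (src_core, dst_core) tuples, one per message.
    Returns an n_cores x n_cores list of lists where entry [i][j] is the
    number of messages sent from core i to core j.
    """
    counts = Counter(records)
    return [[counts.get((i, j), 0) for j in range(n_cores)]
            for i in range(n_cores)]


# Three point-to-point messages plus one broadcast rooted at core 1.
records = [(0, 1), (0, 1), (2, 0)] + expand_broadcast(1, 3)
cm = build_communication_matrix(records, 3)
total_communication = sum(sum(row) for row in cm)
```

Each row of `cm` then describes the traffic that one core injects into its input port of the switch.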
It should also be noted that the Communication Matrix represents the communication between cores, and it also reflects the communication between processes. However, one process can be mapped to different cores if the application runs again. We will discuss the relationship between the process mapping scenarios and the transformation of the Communication Matrix in the following subsection.

C. The Effect of Process Mapping Scenarios on Communication Matrix
We conducted experiments to study the mapping of processes to cores in a multi-core system. The results show that when the application is run several times, the same process can be allocated to different cores. For example, if two processes need to be allocated to two cores, and these two cores are in the same state, then process 1 might be mapped to core 1; if the application is rerun, it may be mapped to core 2 instead. Different mapping scenarios lead to the transformation of the communication matrix, which further changes the amount of data injected into each input port of the photonic switch.
Since the mapping scenarios are generated according to the MPI process allocation strategy and the operating system, the transformations of the communication matrix follow rules. The MPI communication allocation strategy and the scheduling of the operating system operate with a relatively fixed mechanism. Therefore, we hold the view that the process mapping scenarios follow three basic rules:
- Rule 1: An application will be divided into a fixed number of processes, so the total communication amount between cores will also be fixed.
- Rule 2: A process will generate a fixed number of communications between two specific cores.
- Rule 3: The operating system will always select a core with the lowest workload to execute a newly added process.
The rules mentioned above generate different process mapping scenarios. A change of the process mapping scenario accordingly causes a transformation of the communication matrix. Fig. 4 shows a running example of an application to illustrate how the above three rules influence the Communication Matrix. In this example, four processes need to be allocated to three cores. Each process generates a broadcast operation. In process mapping scenario 1, processes 1, 2, and 3 are allocated to cores 1, 2, and 3, respectively. Process 4 is allocated randomly since all cores have the same workload in this example (Rule 3). Therefore, process 4 might be allocated to core 3 in process mapping scenario 1 or to core 2 in mapping scenario 2. The CMs generated in these two scenarios are also shown in this figure. From this example, it can be observed that the total amount of communication in the two CMs is the same (Rule 1). The transformation of a CM generates another CM (Rule 2).
In summary, one process mapping scenario corresponds to one communication matrix. During the matrix's transformation, the value of each element is unchanged, but the position of each value changes. Furthermore, for an $n$-core system supporting $n$ MPI processes, the number of process mapping scenarios is $n!$. Let $CM_k^q$ be the communication matrix obtained under process mapping scenario $q$ of application $k$. We then define a communication matrix set $\{CM_k^q\}_{n!}$, which represents all the communication matrices generated by all possible process mapping scenarios. In detail, $M_{i,j}^{q}$ will be further used to denote the communication amount from Core $i$ to Core $j$ under process mapping scenario $q$.
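The transformation described above can be illustrated with a short sketch. It assumes one process per core and represents a mapping scenario as a sequence `core_seq`, where `core_seq[p]` is the core that process `p` lands on; the function name and the sample values are hypothetical.

```python
from itertools import permutations
from math import factorial


def remap(cm, core_seq):
    """Move entry [a][b] of the base matrix to [core_seq[a]][core_seq[b]].

    Values are preserved; only their positions change.
    """
    n = len(cm)
    out = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            out[core_seq[a]][core_seq[b]] = cm[a][b]
    return out


base = [[0, 5, 1],
        [2, 0, 0],
        [0, 7, 0]]

# One matrix per mapping scenario: n! matrices for an n-core system.
scenarios = [remap(base, q) for q in permutations(range(len(base)))]
totals = {sum(sum(row) for row in m) for m in scenarios}
```

Here `len(scenarios)` equals `factorial(3)`, and `totals` contains a single value because every scenario carries the same total communication amount.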

A. The Energy Per Bit Model
A photonic switch consists of paths connecting different input and output ports. The path that starts from Input $Port_i$ and ends at Output $Port_j$ is defined as $Path_{i,j}$. The energy cost of a specific wavelength generated by a laser for $Path_{i,j}$ is defined as $E_{laser}^{i,j}$ and expressed by (2). $P_{laser}^{i,j}$ is the laser power cost for $Path_{i,j}$. $Det_{sen}$ (dBm) is defined as the minimum power the detector should receive to demodulate the optical signal. $Loss_{total}^{i,j}$ (dB) is the total optical insertion loss of $Path_{i,j}$, and $t_{run}^{i,j}$ is the running time of the corresponding laser and wavelength. It is assumed that the laser is only powered when data needs to be transmitted [16], [17]. Therefore, $t_{run}^{i,j}$ is determined by the number of packets transmitted in this path, the packet size, and the modulation rate. $t_{run}^{i,j}$ is expressed by (3). $S_{pkt}$ represents the average packet size transmitted through the photonic switch. $M_{i,j}^{q}$ is the number of packets transmitted through $Path_{i,j}$ under process mapping scenario $q$. $M_{i,j}^{q}$ can be obtained from $CM_k^q$. $Mod$ is the modulation rate of each modulator. It should be noted that the laser power should be large enough to overcome the optical path loss and exceed the detector sensitivity at the destination. Furthermore, the minimum laser power for each path varies because the optical loss in each path (shown as $Loss_{total}^{i,j}$ in (2)) is different.
In an $N$-core computing system with an $N \times N$ photonic switch integrated, the total laser energy cost for the communication of all cores is defined by (4). The lasers' energy-per-bit performance of the system is further expressed by (5). It is assumed that the communication inside Core $n$ does not need the photonic switch. For one application, the terms $Det_{sen}$, $S_{pkt}$, $Mod$, and $\sum_{i}^{N}\sum_{j}^{N} M_{i,j}^{q}$ can be regarded as constants.
Observing (4) and (5), it can be found that, for one application, the energy-per-bit performance of the whole system fluctuates because of the different combinations of $M_{i,j}^{q}$ and $Loss_{total}^{i,j}$. In essence, the different combinations arise from the different process mapping scenarios. Therefore, the laser's energy-per-bit boundary can be found by identifying the two specific process mapping scenarios in which the maximum and the minimum energy costs occur.
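A minimal numerical sketch of this model is given below. The exact forms of (2) to (5) are not reproduced in this text, so the code assumes the standard link-budget reading of the definitions above: the laser power in mW is $10^{(Det_{sen} + Loss_{total}^{i,j})/10}$, the running time is $S_{pkt} \cdot M_{i,j}^{q} / Mod$, and the energy per bit is the total energy divided by the total number of transmitted bits. All function names and sample values are hypothetical.

```python
def laser_power_mw(det_sen_dbm, loss_db):
    """Minimum laser power (mW) so the detector still sees det_sen_dbm."""
    return 10 ** ((det_sen_dbm + loss_db) / 10)


def energy_per_bit(cm, loss_db, det_sen_dbm, s_pkt_bits, mod_bps):
    """Laser energy per bit in mJ/bit (mW x s = mJ) for one mapping scenario.

    cm[i][j]      packets sent from core i to core j under this scenario
    loss_db[i][j] total insertion loss (dB) of the path from port i to port j
    """
    n = len(cm)
    total_energy = 0.0
    total_bits = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # intra-core traffic does not use the switch
            t_run = s_pkt_bits * cm[i][j] / mod_bps
            total_energy += laser_power_mw(det_sen_dbm, loss_db[i][j]) * t_run
            total_bits += s_pkt_bits * cm[i][j]
    return total_energy / total_bits


cm_a = [[0, 90], [10, 0]]   # heavy traffic on the low-loss path
cm_b = [[0, 10], [90, 0]]   # same values, positions exchanged
loss = [[0.0, 3.0], [9.0, 0.0]]
epb_a = energy_per_bit(cm_a, loss, det_sen_dbm=-20.0, s_pkt_bits=512, mod_bps=10e9)
epb_b = energy_per_bit(cm_b, loss, det_sen_dbm=-20.0, s_pkt_bits=512, mod_bps=10e9)
```

Under these assumptions `epb_b` exceeds `epb_a`, because the second scenario places most of the traffic on the path with the higher insertion loss.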
Finding the above-mentioned two process mapping scenarios and the energy cost boundary based on the energy-per-bit model may require significant computation overhead. An intuitive method is to iterate over every process mapping scenario, use the energy model to calculate each scenario's energy-per-bit result, and compare all the results. However, for a system with $N$ processes and $N$ cores, the number of process mapping scenarios is $N!$, and for each process mapping scenario, $N(N-1)$ computations are needed. The computation cost increases dramatically with the number of cores/processes, which makes enumeration unacceptable. A searching methodology with high searching speed and accuracy is needed to find the two targeted process mapping scenarios. We propose a heuristic methodology for the search process to solve these issues.
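The enumeration baseline and its cost can be sketched for a small system. The objective below is an assumed form (a traffic-weighted mean of the linear path loss, evaluated over the $N(N-1)$ off-diagonal pairs); the matrices are made-up sample data, and `core_seq[p]` is taken to be the core that process `p` is mapped to.

```python
from itertools import permutations
from math import factorial


def pro_loss(cm, loss_db, core_seq):
    """Traffic-weighted mean linear path loss for one mapping (assumed form)."""
    n = len(cm)
    weighted = 0.0
    total = 0
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            i, j = core_seq[a], core_seq[b]
            weighted += 10 ** (loss_db[i][j] / 10) * cm[a][b]
            total += cm[a][b]
    return weighted / total


def enumerate_bounds(cm, loss_db):
    """Score all N! mappings and return the (score, mapping) extremes."""
    scored = [(pro_loss(cm, loss_db, q), q)
              for q in permutations(range(len(cm)))]
    return min(scored), max(scored)


cm = [[0, 9, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
loss_db = [[0, 1, 2, 3], [1, 0, 4, 2], [2, 4, 0, 1], [3, 2, 1, 0]]
(lo, q_min), (hi, q_max) = enumerate_bounds(cm, loss_db)

# Cost of full enumeration for an 8-port switch: 8! mappings x 8*7 terms each.
evaluations_8_port = factorial(8) * 8 * 7
```

For four cores only 24 mappings are scored, whereas `evaluations_8_port` already exceeds two million term evaluations, which is the growth the heuristic search is meant to avoid.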

B. Workflow of the Heuristic Methodology
For simplicity, each process mapping scenario is expressed as $q$, and all mapping scenarios form a mapping set named $Q$. $q_{min}$ and $q_{max}$ represent the two specific mapping scenarios that cause the lower and upper boundaries of the laser energy cost, respectively.
Searching for the laser energy cost boundary can be expressed by the mathematical description in (6). For an application $k$, the target is to find $CM_k^{q_{min}}$ and $CM_k^{q_{max}}$. The objective function ($Pro\_Loss$), which is part of (5), can be regarded as the expectation of the path loss when a packet is injected into a photonic switch. To find the target $q_{min}$ and $q_{max}$ in $Q$ with an acceptable time cost, we choose a modified genetic algorithm to solve the problem, in which the whole process imitates genetic inheritance and environmental selection [18].
The heuristic-based methodology is shown in Algorithm 1. Each process mapping scenario $q$ in $Q$ is regarded as an individual. At the beginning of the algorithm, a part of $Q$ is selected as the first generation. In each generation, every individual's $Pro\_Loss_q$ is calculated and used to choose the elites. The elites and some other individuals in the current generation form a mating pool. After the hybridization and mutation processes, the next generation's individuals are generated. The above-mentioned process is called one iteration, and during each iteration, the individuals are ranked, selected, hybridized, and mutated. Several iterations are executed until the algorithm converges to a target value.
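The loop described above can be sketched as follows. This is an illustrative skeleton rather than Algorithm 1 itself: the pool composition (elites plus an equal number of randomly chosen non-elites), the fixed generation count as stopping rule, the two operators, and the toy cost function are all assumptions. The sketch minimizes the cost; searching for the maximum would negate it.

```python
import random


def prefix_crossover(a, b, rng):
    """Copy a random-length prefix of a, then fill with b's remaining cores."""
    point = rng.randrange(1, len(a))
    head = a[:point]
    return head + [g for g in b if g not in head]


def swap_mutation(seq, rng, rate=0.05):
    """With probability rate, exchange two positions of the sequence."""
    seq = list(seq)
    if rng.random() < rate:
        i, j = rng.sample(range(len(seq)), 2)
        seq[i], seq[j] = seq[j], seq[i]
    return seq


def search(cost, n_cores, n_individuals, n_elites, n_generations, rng):
    """Rank, select, hybridize, and mutate for a fixed number of generations."""
    population = [rng.sample(range(n_cores), n_cores)
                  for _ in range(n_individuals)]
    for _ in range(n_generations):
        ranked = sorted(population, key=cost)
        elites = ranked[:n_elites]
        pool = elites + rng.sample(ranked[n_elites:], n_elites)
        children = []
        while len(children) < n_individuals - n_elites:
            a, b = rng.sample(pool, 2)
            children.append(swap_mutation(prefix_crossover(a, b, rng), rng))
        population = elites + children  # elites survive unchanged
    return min(population, key=cost)


def toy_cost(seq):
    """Stand-in objective; a real run would evaluate Pro_Loss for the mapping."""
    return sum(pos * core for pos, core in enumerate(seq))


best = search(toy_cost, n_cores=6, n_individuals=20, n_elites=4,
              n_generations=30, rng=random.Random(0))
```

Because the elites are carried over unchanged, the best cost in the population never gets worse from one generation to the next.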
In the genetic algorithm, a gene sequence is used to reflect the characteristics of an individual. Since each individual represents a process mapping scenario, we use an ordered core sequence as the gene sequence, and the sequence represents a process mapping scenario $q$. Fig. 5 shows the relationship between an ordered core sequence, the Communication Matrix, and $q$. In this example, the system has four cores, and six chronological processes need to be assigned. In process mapping scenario 1, process 1 to process 4 are assigned to Core 1 to Core 4, respectively. Then, according to the workload balance strategy of the operating system and communication library, process 5 and process 6 are allocated to Core 1 and Core 2. This process mapping scenario corresponds to a specific Communication Matrix and a core sequence [Core 1, Core 2, Core 3, Core 4]. When the application is rerun, in case 2, another process mapping scenario 2 might occur. In this case, the first four processes are allocated to cores in another way. First, process 1 is mapped to Core 4. This mapping is possible because assigning process 1 to any of the cores makes no difference at the beginning of the mapping process. Then process 2 to process 4 are mapped to Core 2, Core 1, and Core 3, respectively. After that, process 5 and process 6 are allocated to Core 2 and Core 4, according to the workload balance between cores. The Communication Matrix is then transformed from the original one to a new one, which is shown in Fig. 5. By comparing the row ids of the two elements with the same value in these two matrices, a new core sequence [Core 4, Core 2, Core 1, Core 3] can be obtained. So a specific core sequence can represent a specific Communication Matrix or a process mapping scenario. In our methodology, for an $N$-core system, the initial core sequence is [Core 1, Core 2, ..., Core $N$]. Other transformations of the Communication Matrix and core sequence are derived from the initial one.
The gene sequences of individuals need hybridization and mutation processes in each iteration to converge to the final result. Fig. 6 shows an example of these two processes. For illustration simplicity, we take the eight-core case as an example. In this figure, $x_i$, $i \in [0, 7]$, represents a core, and $i$ is the core's id. In the hybridization process, a hybridization point is randomly selected for both individuals' gene sequences (that is, the two ordered core sequences). The gene subsequences of individual 1 and individual 2 before the hybridization point are copied to the two new individuals' genes. The two new individuals' gene subsequences after the hybridization point are then refilled according to the regulations shown in Fig. 6(a).
In Fig. 6(a), the hybridization point is set after the third element of a gene sequence. The first three genes of individual 1 are used as the first three genes of new individual 1. Similarly, new individual 2's first three genes are inherited from individual 2. The generation process of new individual 2 is the same as that of new individual 1. In this way, part of the gene sequence from the parents is passed to the children, which provides the possibility of inheriting a wanted gene subsequence between generations. The mutation process is shown in Fig. 6(b). There is a possibility that part of the gene will mutate. The mutation process is defined as the exchange of two cores in the core sequence. This definition is based on the fact that each core should be listed only once in the sequence, so the mutation of one core influences another core in the sequence. For example, when core $x_5$ in position 2 is mutated to core $x_7$, the positions of core $x_5$ and core $x_7$ are exchanged. It should be noted that the mutation can happen at any position in the sequence with a customized mutation rate, and the id of the core is changed randomly.
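The two operators can be sketched on eight-core sequences. The refill regulation of Fig. 6(a) is not reproduced in the text, so the sketch assumes a common permutation-preserving choice: the positions after the hybridization point are filled with the cores missing from the copied prefix, in the order they appear in the other parent. Function names and sample sequences are hypothetical.

```python
import random


def hybridize(parent_a, parent_b, point):
    """Keep parent_a's genes before point; refill the rest from parent_b.

    Assumed refill rule: remaining cores in parent_b's order, skipping any
    core already present, so each core still appears exactly once.
    """
    head = list(parent_a[:point])
    tail = [core for core in parent_b if core not in head]
    return head + tail


def mutate(seq, rate, rng):
    """At each position, with probability rate, swap with a random position."""
    seq = list(seq)
    for pos in range(len(seq)):
        if rng.random() < rate:
            other = rng.randrange(len(seq))
            seq[pos], seq[other] = seq[other], seq[pos]
    return seq


individual_1 = [0, 1, 2, 3, 4, 5, 6, 7]
individual_2 = [7, 6, 5, 4, 3, 2, 1, 0]

# Hybridization point after the third element.
new_individual_1 = hybridize(individual_1, individual_2, 3)
new_individual_2 = hybridize(individual_2, individual_1, 3)

mutated = mutate(new_individual_1, rate=0.5, rng=random.Random(0))
```

Swapping rather than overwriting is what keeps `mutated` a valid core sequence, in line with the requirement that each core is listed only once.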

A. The Selected Photonic Switches for Evaluation
We chose two reported 8-port photonic switches, fabricated and tested by other researchers, to study the laser energy cost fluctuation. The structures of the two switches, referred to as PS-1 and PS-2, are shown in Fig. 7(a) [4] and 7(b) [2], [14], respectively. The dual-microresonator structure is used as the basic switching element (SE) of both structures [2]. To realize switching, an SE has two states: the BAR state and the CROSS state. In the BAR state, the wavelength is coupled by one of the two microresonators, and in the CROSS state, the wavelength passes through the two microresonators. The structure in Fig. 7(a) is a wavelength-routing-based switching structure, which corresponds to the hardware shown in Fig. 3(a) and (b). Seven wavelengths are used for each port. The structure in Fig. 7(b) corresponds to the hardware shown in Fig. 3(c) and (d). Each port of the structure in Fig. 7(b) uses only one wavelength. It trades off non-blocking connectivity for a reduced number of SEs in the switch. To avoid contention, data arbitration in the electronic domain is needed before optical transmission. Furthermore, compared with PS-1, PS-2 needs to set up three SEs to build an optical path.
To show the comprehensive device-level characteristics, researchers should obtain each path's insertion loss based on tests of these devices. However, obtaining all device-level data from these fabrication and test papers is difficult. Most of the papers only provide the insertion loss results of several paths. As we focus on the influence of the application-level characteristics on the laser energy fluctuation, we use the simplified model shown in (7) to express the insertion loss of each path in a photonic switch. $N_c$, $N_{bar}$, and $N_{cross}$ are the number of waveguide crossings, the number of SEs in the BAR state, and the number of SEs in the CROSS state along a transmission path. The waveguide crossing loss and the SE's insertion loss in the BAR and CROSS states are based on the data shown in [2]. These losses are the main device-level factors that we considered in the evaluation. Network-level and device-level parameters are shown in Table I.
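A sketch of such a simplified loss model is shown below. It assumes that (7) is a plain linear sum of per-element losses in dB, which is the natural reading of the three counts defined above. The per-element loss values in the example are placeholders, not the values of [2] or Table I.

```python
def path_loss_db(n_crossings, n_bar, n_cross,
                 loss_crossing_db, loss_bar_db, loss_cross_db):
    """Total insertion loss (dB) of one path as a sum of element losses.

    n_crossings  number of waveguide crossings on the path
    n_bar        number of switching elements traversed in the BAR state
    n_cross      number of switching elements traversed in the CROSS state
    """
    return (n_crossings * loss_crossing_db
            + n_bar * loss_bar_db
            + n_cross * loss_cross_db)


# Placeholder per-element losses (dB), chosen only for illustration.
example_loss = path_loss_db(n_crossings=2, n_bar=1, n_cross=2,
                            loss_crossing_db=0.1, loss_bar_db=1.0,
                            loss_cross_db=0.5)
```

Evaluating this function for every input/output pair yields the loss matrix that the energy model combines with the communication matrix.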

B. Applications Characteristics
We select five basic communication matrices at the application level to study the proposed searching methodology. These five matrices are obtained from four applications. The settings of the applications and communication matrices are shown in Table II. The selection of these applications shows different characteristics from the application level, and additionally, it also provides a wide range of test scenarios for our proposed methodology.
The first two applications are from the PARSEC benchmark. The other two applications have been run in a high-performance computing center named EXMATEX [21]. Furthermore, to generate communication matrices with different communication amounts, the application named blackscholes from the PARSEC benchmark is run with two different system configurations: a 64-core system and an 8-core system. x264 from the PARSEC benchmark is run in an 8-core system. CMC and LULESH from EXMATEX are run in a 64-core system. As we have mentioned, running an application once generates a communication matrix. Each element in a communication matrix represents the number of communications between two cores/nodes. Therefore, we define the total communication amount as the sum of all elements in a communication matrix, that is, the total number of communications for all cores. For the applications we selected, the total communication amount varies from $10^4$ to $10^9$.
To fit the network structure, all communication matrices are processed to the same $8 \times 8$ size. Since some of the applications are run in 64-core systems, in these situations we assume that eight cores form a cluster, and each of the 8 clusters uses one input port and one output port of the switching fabric. Fig. 8 shows the five $8 \times 8$ communication matrices generated by the applications. In each sub-figure, a heat map is used to represent the communication amount from every input port to every output port. It is based on the trace extraction and communication matrix generation methods introduced in Section II. Therefore, each heat map corresponds to running an application once, and it also corresponds to one communication matrix in the communication matrix set. It can be observed that hot spots exist in the communication matrices, which indicates that some pairs of cores or nodes carry out more communication than others. The patterns of the communication matrices of merged blackscholes, original blackscholes, and x264 are relatively random, while those of CMC and LULESH are more regular. Furthermore, compared with LULESH's, the pattern of CMC's communication matrix is more asymmetric.
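The reduction from a 64-core matrix to an $8 \times 8$ matrix can be sketched as a block sum. The sketch assumes that a cluster consists of eight consecutively numbered cores, which the text does not specify; the function name and the all-ones sample matrix are hypothetical.

```python
def merge_clusters(cm, cluster_size):
    """Sum a core-level matrix into a cluster-level matrix.

    Cores i and j belong to clusters i // cluster_size and j // cluster_size
    (assumed contiguous numbering). Traffic between cores of the same
    cluster accumulates on the diagonal and does not cross the switch.
    """
    n_clusters = len(cm) // cluster_size
    merged = [[0] * n_clusters for _ in range(n_clusters)]
    for i, row in enumerate(cm):
        for j, value in enumerate(row):
            merged[i // cluster_size][j // cluster_size] += value
    return merged


# A 64-core matrix with one message between every ordered pair of cores.
cm_64 = [[1] * 64 for _ in range(64)]
cm_8 = merge_clusters(cm_64, cluster_size=8)
```

The total communication amount is preserved by the merge, so the merged matrix can be compared directly with matrices obtained from 8-core runs.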

C. Settings and Speedup of the Proposed Heuristic Methodology
In addition to the above settings at the application and device levels, the heuristic methodology has parameters to be configured. In a genetic algorithm, the number of individuals in one generation, the number of elite individuals in each generation, and the mutation rate need to be set. Generally, the setting of these parameters is empirical, and different combinations of parameters determine whether the searching process will converge. The parameter configurations are shown in Table III. We use eight parameter configurations to study the influence of the parameters on the proposed methodology: four configurations for the $\max(Pro\_Loss)$ search and four for the $\min(Pro\_Loss)$ search. Only one parameter is changed between two adjacent configurations listed in the table.
As has been mentioned, for an 8-port switching fabric, there are $8!$ process mapping scenarios in total. The calculation amount is defined as the number of process mapping scenarios that need to be calculated. Therefore, the calculation amount of the enumeration methodology is $8!$. For our proposed methodology, we analyze the calculation amount statistically.
Since the proposed methodology includes a heuristic algorithm, the searching process may end at a locally optimal point. To evaluate the probability that our proposed methodology reaches the globally optimal point, we run Algorithm 1 ten times for each set of parameter configurations and each application. In total, there are (2 photonic switches) × (5 applications) × (8 sets of parameter configurations per application) × (10 runs per set of parameter configurations) = 800 runs. In each run, the algorithm stops after 100 generations.
We refer to every run as an evaluation case. According to the statistics, all evaluation cases converged within fewer than 100 generations. However, not all of them converged to the globally optimal value. Assume that the probability of getting the globally optimal value is $P_{once}$ when running the proposed methodology once. Our target is to let the probability of obtaining the globally optimal value exceed $P$ after $k$ independent executions of our methodology. Considering that $k$ is an integer, the minimum number of runs $k_{min}$ is determined by (8).
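The reasoning behind $k_{min}$ can be illustrated with a short sketch. It assumes that the runs are independent, so that the probability of reaching the global optimum at least once in $k$ runs is $1 - (1 - P_{once})^k$, and that $k_{min}$ is the smallest integer for which this value reaches the target $P$. Whether (8) uses a strict or non-strict inequality is not visible in the text; the function name and sample probabilities are hypothetical.

```python
def min_runs(p_once, p_target):
    """Smallest k with 1 - (1 - p_once)**k >= p_target (independent runs)."""
    if not 0.0 < p_once <= 1.0:
        raise ValueError("p_once must be in (0, 1]")
    if not 0.0 <= p_target < 1.0:
        raise ValueError("p_target must be in [0, 1)")
    k = 1
    while 1.0 - (1.0 - p_once) ** k < p_target:
        k += 1
    return k


# If a single run finds the global optimum half of the time:
k_for_80 = min_runs(p_once=0.5, p_target=0.8)
k_for_90 = min_runs(p_once=0.5, p_target=0.9)
```

A lower single-run success probability therefore translates directly into more repeated runs and a larger total calculation amount.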
In each execution round of our methodology, on average, $N_{generation}$ generations are required to reach the globally optimal point, and $N_{individual}$ individuals are calculated in each generation. Since each individual is regarded as a process mapping scenario, the total calculation amount, named $Comp_{proposed}$, can be expressed as (9).
The calculation time reduction rate (or speedup rate) $R$ is defined as (10). $R$ is obtained by comparing the calculation amount of our proposed methodology ($Comp_{proposed}$) with that of the enumeration methodology ($Comp_{enumeration}$). If $Comp_{proposed} < Comp_{enumeration}$, $R$ shows the percentage of time that can be saved by our methodology compared with the enumeration methodology. As $Comp_{proposed} \to 0$, $R \to 1$, which is the ideal speedup. Conversely, if $Comp_{proposed} > Comp_{enumeration}$, $R$ represents the percentage of time that is additionally needed compared with the enumeration methodology. Fig. 9 shows the convergence speedup results, compared with the method that enumerates all process mapping scenarios and ranks them. Two photonic switches are studied with different searching settings and different applications. b1 to b4 and w1 to w4 are the settings shown in Table III. Light-colored bars show the time reduction rate (speedup) when reaching a $P = 80\%$ probability of obtaining the optimal results, and dark-colored bars show the rate when reaching a $P = 90\%$ probability. The performance of the searching methodology varies between photonic switches. Most of the cases shown in Fig. 9(a) and (b) reach a 60% speedup, and in some cases the speedup exceeds 90%. It should also be noted that in very few cases the proposed methodology performs worse than enumeration and costs more time, which motivates a further study of the selection of the searching parameters of the proposed methodology.
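A worked example of the calculation amount and the reduction rate is sketched below. It assumes that (9) is the product of the number of runs, the generations per run, and the individuals per generation, and that (10) has the form $R = 1 - Comp_{proposed} / Comp_{enumeration}$, which matches the limiting behavior described above. The input numbers are made up for illustration and are not results from Fig. 9.

```python
from math import factorial


def comp_proposed(k_min, n_generation, n_individual):
    """Mapping scenarios evaluated over k_min runs (assumed form of (9))."""
    return k_min * n_generation * n_individual


def reduction_rate(comp_prop, comp_enum):
    """Fraction of evaluations saved versus enumeration (assumed form of (10)).

    Positive values mean a saving; negative values mean the heuristic
    evaluated more scenarios than full enumeration would have.
    """
    return 1.0 - comp_prop / comp_enum


comp_enum = factorial(8)  # 8-port switch: 8! mapping scenarios

saving = reduction_rate(comp_proposed(3, 30, 100), comp_enum)
overrun = reduction_rate(comp_proposed(5, 100, 100), comp_enum)
```

With these sample inputs, `saving` is about 0.78, while `overrun` is negative because 50,000 evaluations exceed the 40,320 needed for enumeration.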
In our evaluation study, four different settings are used for each of the best and worst process mapping scenario searches. Fig. 10 shows the influence of the different settings on the speedup performance. The settings labeled 100-10-0.005 and 100-20-0.01 show a much lower deviation and a relatively higher average reduction rate. These findings indicate that setting a reasonable ratio of the number of elites to the mutation rate improves both the searching speed and the accuracy. Furthermore, we find that the overall performance of the proposed methodology on PS-2 is better than on PS-1. By analyzing the path insertion losses in PS-1 and PS-2, we find that the variation range of the path insertion loss in PS-1 is smaller than that in PS-2. For a more "uniform" path loss distribution, the time cost of searching for the optimal values is much higher.

D. Energy Fluctuation Results
After obtaining the best and worst mapping scenarios with the proposed methodology, the energy per bit performance considering all lasers is calculated based on (5) and the parameters shown in Table I. Fig. 11 shows the energy per bit results and the fluctuation rate of the energy cost generated by the lasers in each photonic switch. The energy fluctuation rate $K$ of each application is calculated based on (11). The fluctuation rates vary considerably across applications. Energy fluctuation is negligible for applications such as Original Blackschole and x264. However, for CMC and LULESH, the laser fluctuation rate reaches over 37% and 16%, respectively, in PS-1. For PS-2, the process mapping of these two applications can lead to energy fluctuations of over 150% and 40%, respectively. The results indicate that special attention should be paid to the influence of certain applications if the system is sensitive to energy cost.
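Equation (11) is not reproduced in this text. A minimal sketch of the fluctuation rate, assuming it is the gap between the worst and best energy per bit normalized by the best-case value, is given below. This normalization is an assumption; it is consistent with rates above 100% being reported, which a normalization by the worst-case value could not produce. The function name and the example energies are hypothetical.

```python
def fluctuation_rate(e_best: float, e_worst: float) -> float:
    """Relative gap between worst- and best-mapping energy per bit."""
    if e_best <= 0:
        raise ValueError("e_best must be positive")
    if e_worst < e_best:
        raise ValueError("e_worst must not be smaller than e_best")
    return (e_worst - e_best) / e_best


# Illustrative energies in arbitrary units (not measured values):
# a worst mapping costing 2.5x the best mapping gives a 150% fluctuation.
k_example = fluctuation_rate(e_best=2.0, e_worst=5.0)
```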
For a system that runs several applications one after another, it is useful to know which application will cause the most significant energy fluctuations before evaluating all applications with our proposed searching methodology. Therefore, a metric that describes how strongly an application can cause energy cost fluctuations is desirable. The large differences shown in the results motivate us to study the relationship between the application characteristics and the energy cost fluctuation results. The related study is presented in the discussion section below.

V. DISCUSSION
A. Prediction of Energy Fluctuations Based on Application Characteristics
Based on our analysis, the communication matrix of an application provides evidence of the relationship between the application characteristics and the energy fluctuations. Observing (11), if (12) holds, then we get (13). Since $E^{worst}_{perbit}$ and $E^{best}_{perbit}$ are defined based on (5), (13) can be further expressed as (14), where $P(i, j) = 10^{L^{i,j}_{path}}$ is the total path insertion loss from input port $i$ to output port $j$ in a photonic switch, $M^{i,j}_{q,worst}$ is the number of packets transmitted from input port $i$ to output port $j$ under the worst process mapping scenario, and $M^{i,j}_{q,best}$ is that under the best scenario. It should be noted that, although the process mapping scenarios are different, $M^{i,j}_{q,best}$ and $M^{i,j}_{q,worst}$ can always be regarded as two values obtained from one communication matrix. This characteristic is guaranteed by the rules shown in Section II-B. Therefore, for a specific $\alpha$, for (13) to hold, at least two elements in one communication matrix must meet the requirement in (15). In other words, for an application and one of its communication matrices, the more pairs of elements that meet the requirement in (15), the more likely the application is to generate significant laser energy fluctuations.
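Equations (12) and (13) are not reproduced in this text. As a hedged reconstruction of the first step only, assume that (11) normalizes the gap between the worst and best energy per bit by the best-case value (an assumption, consistent with fluctuation rates above 100% being reported) and that $E^{best}_{perbit} > 0$. Requiring the fluctuation rate $K$ to reach a threshold $\alpha$ is then equivalent to bounding the worst-to-best energy ratio:

```latex
K = \frac{E^{worst}_{perbit} - E^{best}_{perbit}}{E^{best}_{perbit}} \ge \alpha
\quad\Longleftrightarrow\quad
\frac{E^{worst}_{perbit}}{E^{best}_{perbit}} \ge 1 + \alpha
```

For instance, $\alpha = 0.5$ corresponds to the worst mapping costing at least 1.5 times the energy per bit of the best mapping. Substituting (5) into both energies then gives the form expressed in terms of $P(i, j)$ and the packet counts.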
Based on the above analysis, we select (15) as the criterion to study different applications. For each of the applications shown in Table II, we study the elements in each of its communication matrices and count the total number of element pairs that fulfill (15). $\alpha$ is swept from 0.1 to 2, and the statistical results are shown in Fig. 12. For any given $\alpha$, the value of CMC's curve is higher than that of any other application. The higher the value of an application in this figure, the more likely it is that the application's process mapping will lead to a significant laser energy fluctuation. Fig. 12 is consistent with the energy fluctuation results shown in Fig. 11. Therefore, (15) can be used as a metric to decide which application should be studied first with respect to the energy fluctuation it generates. Although this metric is not quantitative, (15) helps identify the applications that need the most attention. Our future work will focus on building a quantitative expression to predict energy fluctuation based on the application characteristics.
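The counting procedure described above can be sketched as follows. Because the requirement in (15) is not reproduced in this text, the sketch takes the pair test as a caller-supplied function `meets`. The function `ratio_placeholder` is a purely hypothetical stand-in used only to make the example runnable, and the matrix values are illustrative, not taken from any of the studied applications.

```python
from itertools import combinations
from typing import Callable, Sequence


def count_pairs(matrix: Sequence[Sequence[float]], alpha: float,
                meets: Callable[[float, float, float], bool]) -> int:
    """Count unordered pairs of matrix elements that satisfy meets(a, b, alpha)."""
    elements = [value for row in matrix for value in row]
    return sum(1 for a, b in combinations(elements, 2) if meets(a, b, alpha))


def ratio_placeholder(a: float, b: float, alpha: float) -> bool:
    """HYPOTHETICAL stand-in for (15): the larger element is at least
    (1 + alpha) times the smaller, non-zero one. Not the paper's criterion."""
    low, high = min(a, b), max(a, b)
    return low > 0 and high >= (1.0 + alpha) * low


# Illustrative 3 x 3 communication matrix (packets sent from row to column).
example = [[0, 10, 12],
           [10, 0, 25],
           [12, 25, 0]]

# Sweep alpha over 0.1, 0.2, ..., 2.0 as in the text and record one count per step.
alphas = [round(0.1 * step, 1) for step in range(1, 21)]
curve = [count_pairs(example, alpha, ratio_placeholder) for alpha in alphas]
```

Plotting `curve` against `alphas` for several applications would give curves of the kind compared in Fig. 12; swapping in the actual requirement from (15) only requires replacing the `meets` argument.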

B. Influence of Crosstalk on the Laser Energy Fluctuations
Crosstalk between optical signals may influence the bit error rate (BER). Studies have shown that, in the presence of crosstalk, more power is needed for the main signal to be detected at the same bit error rate [5], [22]. Similar to the influence of different path insertion losses, crosstalk at different output ports will also lead to variations in the lasers' power and energy per bit performance. Observing (5), the term $Det_{sen}$ is no longer a constant if the crosstalk power penalty is considered. For a photonic switch, by testing different communication scenarios and measuring the power at each output port, the minimum main signal power that each path's detector should receive, denoted $Det^{i,j}_{sen}$, can be obtained. Then (5) is changed to (16) accordingly.
To search for the best and worst process mapping scenarios based on our methodology, we can use (17) to express the objective function with crosstalk being considered.
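The objective functions in (16) and (17) are not reproduced in this text. The sketch below illustrates the idea under stated assumptions: each path carries a weight equal to a per-path detector requirement multiplied by the linear path loss, the objective is the traffic-weighted sum of these weights under a given process-to-port mapping, and the insertion loss is given in dB (hence the division by 10). All names, the dB convention, and the numbers are hypothetical. The exhaustive search corresponds to the enumeration baseline, not to the heuristic methodology.

```python
from itertools import permutations


def path_weights(loss_db, det_sen):
    """Per-path weight: detector power requirement scaled by linear path loss."""
    n = len(loss_db)
    return [[det_sen[i][j] * 10 ** (loss_db[i][j] / 10.0) for j in range(n)]
            for i in range(n)]


def objective(comm, weight, mapping):
    """Traffic-weighted cost when process a is placed on port mapping[a]."""
    n = len(comm)
    return sum(weight[mapping[a]][mapping[b]] * comm[a][b]
               for a in range(n) for b in range(n) if a != b)


def best_and_worst(comm, weight):
    """Enumerate every mapping; return (cost, mapping) for the best and worst."""
    n = len(comm)
    scored = [(objective(comm, weight, p), p) for p in permutations(range(n))]
    return min(scored), max(scored)


# Illustrative 3-port example: only process 0 sends packets, to process 1.
comm = [[0, 100, 0], [0, 0, 0], [0, 0, 0]]
loss_db = [[0, 10, 0], [0, 0, 0], [0, 0, 0]]      # path 0 -> 1 is 10 dB lossier
uniform_sen = [[1.0] * 3 for _ in range(3)]       # constant detector requirement

best, worst = best_and_worst(comm, path_weights(loss_db, uniform_sen))
```

With a constant detector requirement the weights reduce to the path losses alone, which corresponds to the crosstalk-free case; supplying a port-dependent `det_sen` matrix measured under crosstalk changes the weights and therefore possibly the best and worst mappings.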

VI. RELATED WORK
Several studies have focused on optimizing application mapping on the optical on-chip network. Most works are based on the core graph (or knowledge graph), and heuristic algorithms are also widely used.
Edoardo Fusella et al. proposed several algorithms that automatically map the IP cores onto a generic mesh-based photonic NoC architecture such that the worst-case crosstalk or the laser power consumption is minimized [23]. Hui Li et al. focused on the reliability and wavelength assignment for silicon photonic interconnects, optimizing the application mapping with an ant colony algorithm [24]. The core graph, also called the communication graph, is used in these studies. In a core graph, an application is divided into several IP cores, and the core graph records each IP core's ID and the communication between each pair of IP cores. The recorded communication can be the total number of communication flits, the bandwidth, and so on [25], [26]. Compared with communication matrices, core graphs are usually fixed in size for each application, potentially limiting their use in the study of scalable distributed parallel systems. In this paper, we provide a generic approach to generate the communication matrices by extracting MPI-based application traces from real computing systems. This approach and the related analysis make it possible to generate communication matrices of different sizes as the system changes. For example, considering the scalability of a parallel application, more cores can participate in its execution when the system scales up: running an application on an 8-core system and on a 64-core system yields two different communication matrices, one 8 × 8 and the other 64 × 64.
Rashid Aligholipour et al. proposed an application mapping optimization method to decrease the number of turns through an electrical router-based interconnection network [27]. Instead of using a core graph, traces generated from Netrace are used [28]. Although Netrace reflects the cores' communication in a multi-core system, it can only provide traces for a 64-core system, which limits its use in the study of scalable systems. Wei Hu et al. proposed a task mapping optimization algorithm to realize power saving in an electrical mesh network [26]. This work generates tasks and traffic based on a full system simulator, a dynamic task mapping strategy, and an on-chip network simulator. The generation of traces and traffic can be adjusted with the scale of the system. However, how the task graph and the traffic are generated is unclear. Compared with these works, we provide a trace extraction method and a qualitative analysis method that can predict which application will generate the most significant energy fluctuation. The prediction tool will also facilitate the study of the influence of applications' process mapping on system energy cost.
Lei Guo et al. proposed an algorithm based on a genetic algorithm. Application mapping in a 3D optical on-chip network is optimized with crosstalk noise and thermal sensitivity being considered [29]. A simulated annealing algorithm is used to improve the search performance of the genetic algorithm. However, the algorithm parameters and their influence on the algorithm's speedup and accuracy are not discussed in depth. Compared with that work, the study of the corresponding parameter settings in this paper provides more guidance for the algorithm's initialization.

VII. CONCLUSION
In this paper, we focus on the impact of application characteristics on laser energy cost fluctuation. We analyze and mathematically describe the application characteristics as communication matrices and study the influence of process mapping scenarios on the communication matrices. The laser energy fluctuation caused by the application characteristics is studied based on a proposed heuristic methodology and an energy model. A prediction criterion is also proposed and discussed to find the application leading to the most significant energy fluctuation. Together, the prediction criterion, the abstraction of application characteristics, the energy per bit model, and the heuristic fluctuation analysis methodology provide a better understanding of the relationship between applications and laser energy cost in computing systems.