Multi-Objective Optimization for Bandwidth-Limited Federated Learning in Wireless Edge Systems

This paper studies a bandwidth-limited federated learning (FL) system in which an access point (AP) acts as a central server for aggregation and energy-constrained user equipments (UEs) with limited computation capabilities (e.g., Internet of Things devices) perform local training. Owing to the limited bandwidth in wireless edge systems, only a subset of UEs can participate in each FL training round. Selecting different UEs affects FL performance, and the selected UEs need to allocate their computing resources effectively. In wireless edge FL systems, it is important to simultaneously accelerate FL training and reduce the computing-communication energy consumption. To this end, we formulate a multi-objective optimization problem (MOP). In the MOP, the model training convergence is difficult to calculate accurately. Moreover, the MOP is a combinatorial optimization problem with high-dimensional mixed-integer variables, which is proved to be NP-hard. To address these challenges, a multi-objective evolutionary algorithm for the bandwidth-limited FL system (MOEA-FL) is proposed to obtain a Pareto optimal solution set. In MOEA-FL, an age-of-update-loss (AoUL) method is first proposed to transform the original global loss function into a convergence reference function. Then, MOEA-FL divides the MOP into $N$ single-objective subproblems by the Tchebycheff approach and optimizes the subproblems simultaneously by evolving a population. Extensive experiments have been carried out on the MNIST dataset and a medical dataset called TissueMNIST for both the i.i.d. and non-i.i.d. data settings. Experimental results demonstrate that MOEA-FL outperforms the compared algorithms and verify its robustness and scalability.

collaboratively train a shared global model while keeping their privacy-sensitive data on edge IoT devices such as user equipments (UEs). Since local training data with private information are never shared, data privacy leakage can be effectively prevented in FL.
As for FL in wireless edge systems, the shared global model is transmitted over limited bandwidth, so only a fraction of UEs can be selected to perform the FL training process. Owing to the heterogeneity of UEs, picking different UEs in each training round affects FL performance. As a result, researchers have focused on UE selection for FL. For instance, Huang et al. [7] proposed a stochastic UE selection strategy to deal with the tradeoff between model training convergence and training stability. Xu and Wang [8] studied a wireless FL system in which UE selection and bandwidth allocation were jointly optimized to achieve long-term resource allocation under UE energy constraints. The authors in [9] modeled the UE selection problem as a Lyapunov optimization problem and proposed a fairness algorithm named RBCS-F to solve it while guaranteeing fairness. Yang et al. [10] considered a joint UE selection and beamforming design method to maximize the UE selection utility under mean-squared-error requirements for each global model aggregation.
Since UEs are energy-constrained, it is important for them to allocate their computing resources effectively. Thus, computing resource (i.e., CPU frequency) allocation affects the performance of FL systems. For instance, allocating less computing resource (i.e., a lower CPU frequency) to update the local models reduces a UE's energy consumption but slows model training convergence. Therefore, computing resource allocation for FL has received extensive attention. Kim et al. [11] considered a novel joint dataset and computing resource allocation scheme for FL systems and established an energy consumption minimization problem under a training-completion-time requirement. Huang et al. [12] proposed a model training convergence minimization problem under an energy consumption constraint, in which the CPU frequencies and phase shifts are jointly optimized. Tran et al. [13] adjusted the CPU frequency and communication power to balance the tradeoff between energy consumption and model training convergence under a local model accuracy constraint. Yang et al. [14] jointly considered CPU frequency, communication power, and bandwidth allocation to minimize the total energy consumption and learning time.
However, the above-mentioned studies establish single-objective optimization problems in which either the model training convergence or the energy consumption is considered as the objective function. In fact, both are key performance metrics of an FL system. Meanwhile, few works jointly consider the UE selection decisions and the CPU frequencies, which leads to a combinatorial optimization problem with mixed-integer variables. As a result, we formulate a multi-objective optimization problem (MOP) with mixed-integer variables in a bandwidth-limited FL system to minimize the model training convergence and the energy consumption simultaneously by jointly optimizing the UE selection decisions and CPU frequencies. Dealing with the MOP poses three challenges: the first is the difficulty of calculating the model training convergence accurately, the second is the multi-objective characteristic, and the third is the combinatorial optimization property. To address these challenges, a multi-objective evolutionary algorithm for bandwidth-limited FL in wireless edge systems, called MOEA-FL, is proposed.
As an early attempt, we consider multi-objective optimization for FL in this work to jointly improve the energy consumption and the training performance. The main contributions of this work are summarized as follows. The rest of this paper is organized as follows. In Section II, the system model and problem formulation are introduced. Section III describes the details of our proposed method. The experimental studies are given in Section IV. Finally, Section V concludes this paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As shown in Fig. 1, we consider a wireless edge FL system including a single-antenna AP as a central server for aggregation and a set of $K$ single-antenna energy-constrained UEs as IoT devices. In the studied system, UE $k$ has its own local dataset $\mathcal{D}_k$ with $D_k = |\mathcal{D}_k|$ labeled data samples $\{(s_n, l_n)\}_{n=1}^{D_k}$, where $(s_n, l_n)$ denotes an input-output data sample consisting of a feature $s_n$ and its corresponding label $l_n$. Due to the limited bandwidth, only a subset of UEs is selected to participate in the training process; we denote the set of selected UEs by $\mathcal{K}'$. Each selected UE $k$ ($k \in \mathcal{K}'$) utilizes its local dataset $\mathcal{D}_k$ to train the model parameters $w$ shared by the AP. The goal of local training is to find an optimal $w$ by minimizing the local loss function $G_k(w)$, which is written as
$$G_k(w) = \frac{1}{D_k} \sum_{n=1}^{D_k} g_n(s_n, l_n; w),$$
where $g_n(s_n, l_n; w)$ represents the loss on the data sample $(s_n, l_n)$. After local training, the global model parameters are updated. Therefore, the goal of the FL system is to find optimal global model parameters by minimizing the global loss function
$$G(w) = \sum_{k} \frac{D_k}{D} G_k(w),$$
where $D$ is the total number of data samples. The overall workflow of the FL system, consisting of $T$ training rounds, is as follows:
• First, the AP initializes the global model parameters $w_1$. Then, at the $t$-th ($t \in \mathcal{T} = \{1, 2, \ldots, T\}$) training round, the AP selects UEs to participate in the training process and shares the global model parameters $w_t$ with the selected UEs.
• After receiving $w_t$, each selected UE $k$ ($k \in \mathcal{K}'$) uses its local dataset $\mathcal{D}_k$ to train its local model parameters $w_{k,t}$, initialized as $w_{k,t} = w_t$. The training aim is to minimize $G_k(w_{k,t})$ by the stochastic gradient descent (SGD) algorithm, whose update is $w^*_{k,t} = w_{k,t} - \delta \nabla G_k(w_{k,t})$, where $\delta = 0.1$ is the learning rate. Each selected UE $k$ then sends its updated local model parameters $w^*_{k,t}$ to the AP.
• The AP aggregates the received $w^*_{k,t}$ and then updates $w_t$, which is denoted as
$$w_{t+1} = \sum_{k \in \mathcal{K}'} \frac{D_k}{\sum_{j \in \mathcal{K}'} D_j}\, w^*_{k,t}.$$
Next, we introduce the computation model and the communication model, respectively.
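The round structure above (broadcast, local SGD, weighted aggregation) can be sketched as follows. This is an illustrative NumPy toy in which a quadratic per-UE loss stands in for $G_k$; the function names are ours, not the paper's:

```python
import numpy as np

def local_sgd(w, grad_fn, steps, lr=0.1):
    """Run `steps` SGD iterations: w <- w - lr * grad(G_k)(w)."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

def aggregate(local_models, sample_counts):
    """FedAvg-style aggregation: weight each w*_k by D_k / sum(D_k)."""
    total = sum(sample_counts)
    return sum((d / total) * w for w, d in zip(local_models, sample_counts))

# Toy per-UE loss G_k(w) = ||w - c_k||^2, so grad G_k(w) = 2(w - c_k).
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_global = np.zeros(2)
locals_ = [local_sgd(w_global, lambda w, c=c: 2 * (w - c), steps=50)
           for c in centers]
w_global = aggregate(locals_, sample_counts=[100, 100])
```

With equal sample counts the aggregate converges toward the average of the two local minimizers, mirroring how the AP's weighted average trades off heterogeneous local objectives.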

A. COMPUTATION MODEL
In the studied system, each UE has limited computation capability to perform local training. Specifically, the number of CPU cycles required to process one data sample $(s_n, l_n)$ is denoted by $c_k$, a constant measured offline [15]. Hence, UE $k$ requires $c_k D_k$ CPU cycles for one local training iteration. The computing latency in the $t$-th training round is
$$t^{cmp}_{k,t} = \frac{\kappa c_k D_k}{f_{k,t}},$$
where $f_{k,t}$ is the CPU frequency of UE $k$ in the $t$-th training round and $\kappa$ is the number of local training iterations. Similarly, the computing energy consumption in the $t$-th training round can be expressed as
$$e^{cmp}_{k,t} = \kappa \alpha_k c_k D_k f_{k,t}^2,$$
where $\alpha_k$ represents the effective capacitance coefficient of UE $k$.
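The two expressions above are straightforward to compute. The sketch below (our own helper names) also makes the frequency tradeoff concrete: doubling $f_{k,t}$ halves the latency but quadruples the energy.

```python
def computing_latency(kappa, c_k, d_k, f_k):
    """t_cmp = kappa * c_k * D_k / f_k (seconds per round)."""
    return kappa * c_k * d_k / f_k

def computing_energy(kappa, alpha_k, c_k, d_k, f_k):
    """e_cmp = kappa * alpha_k * c_k * D_k * f_k^2 (joules per round)."""
    return kappa * alpha_k * c_k * d_k * f_k ** 2
```

For example, with $\kappa = 5$, $c_k = 20$ cycles/sample, $D_k = 1000$ samples, moving from 1 GHz to 2 GHz halves `computing_latency` and quadruples `computing_energy`, which is exactly the conflict the MOP balances.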

B. COMMUNICATION MODEL
In this paper, we only consider the uplink communication from the UEs to the AP [16]. In addition, orthogonal frequency-division multiple access (OFDMA) is utilized in the uplink, where the AP allocates $O$ subchannels with equal orthogonal bandwidth to the selected UEs. Each subchannel has the same bandwidth, denoted as $b_0 = B/O$. The number of subchannels is smaller than the total number of UEs because of the limited bandwidth.
To describe the size of the shared global model parameters $w_t$, we introduce a constant $J$, similar to [17]. Each selected UE $k$ is assigned a subchannel, for which the channel gain $h_{k,t}$ and the bandwidth $b_k$ are obtained in the $t$-th training round. Therefore, the achievable transmission data rate of UE $k$ can be expressed as
$$r_{k,t} = b_k \log_2\!\left(1 + \frac{p_k h_{k,t}}{\sigma^2}\right),$$
where $b_k = b_0$ is the bandwidth of the subchannel assigned by the AP, $p_k$ represents the transmit power of UE $k$, and $\sigma^2$ represents the Gaussian channel noise power. To obtain the minimum transmit power $p_k$, the achievable transmission data rate $r_{k,t}$ is set to the threshold transmission data rate $R_k$. As a result, the communication latency of UE $k$ in the $t$-th training round can be expressed as
$$t^{com}_{k,t} = \frac{J}{R_k}.$$
Then, the communication energy consumption of UE $k$ in the $t$-th training round can be given by
$$e^{com}_{k,t} = \left(p_k + p^0_k\right) t^{com}_{k,t},$$
where $p^0_k$ is a power constant whose value is set as in [18], [19].
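Setting $r_{k,t} = R_k$ and inverting the rate expression for the minimum transmit power can be sketched as follows. The helper names are ours, and the circuit-power energy form is our assumption based on the description above:

```python
import math

def achievable_rate(b, p, h, sigma2):
    """Shannon rate r = b * log2(1 + p*h/sigma^2) in bits/s."""
    return b * math.log2(1 + p * h / sigma2)

def min_transmit_power(R_k, b, h, sigma2):
    """Invert the rate expression at r = R_k: the smallest power that
    meets the threshold rate on this subchannel."""
    return (2 ** (R_k / b) - 1) * sigma2 / h

def comm_latency(J_bits, R_k):
    """t_com = J / R_k: time to upload the J-bit model update."""
    return J_bits / R_k

def comm_energy(p_k, p0_k, t_com):
    """e_com = (p_k + p0_k) * t_com (circuit-power form, our assumption)."""
    return (p_k + p0_k) * t_com
```

Plugging the recovered power back into `achievable_rate` reproduces the threshold rate, confirming the inversion.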

C. PROBLEM FORMULATION
The total energy consumption in the $t$-th training round is composed of the computing and communication energy consumption of all UEs, which is expressed as
$$E_t = \sum_{k=1}^{K} o_{k,t}\left(e^{cmp}_{k,t} + e^{com}_{k,t}\right),$$
where $o_{k,t}$ represents the UE selection decision for UE $k$ in the $t$-th training round. If UE $k$ is selected to participate in the $t$-th training round, $o_{k,t} = 1$ and the bandwidth of a subchannel can be assigned to UE $k$; otherwise, $o_{k,t} = 0$ and no bandwidth is assigned. Due to the limited bandwidth, only a subset of UEs participates in the $t$-th training round. Here, we assume that each subchannel is assigned to at most one UE, so the number of selected UEs is bounded by the number of subchannels:
$$\sum_{k=1}^{K} o_{k,t} \le O.$$
In this paper, since the UEs are energy-constrained, they need to allocate their computing resources (i.e., CPU frequencies) effectively. In wireless edge FL systems, we aim to build an effective multi-objective optimization model for improving the performance of the studied FL system, in which the model training convergence and the energy consumption are minimized simultaneously by jointly optimizing the UE selection decisions and the CPU frequencies. In particular, the model training convergence is denoted as the weighted sum of the global loss function. Thus, a MOP with mixed-integer variables (i.e., the UE selection decisions and CPU frequencies) is formulated as problem $P_1$ subject to constraints C1-C4, where C1 states the lower and upper bounds of the CPU frequency of UE $k$; C2 indicates whether UE $k$ ($k \in \mathcal{K}$) is selected to participate in the training process; C3 ensures that the number of selected UEs does not exceed the number of subchannels in the $t$-th training round due to the limited bandwidth; and C4 guarantees that UE $k$ satisfies the latency limitation in each training round.

A. MOTIVATION
Firstly, the model training convergence depends on the global loss function $G(w_t)$. However, it is difficult to calculate the value of $G(w_t)$ in $P_1$ accurately [20]. It is also hard to describe the relationship between $o_{k,t}$ and $w_t$ in $P_1$, because $w_t$ changes during the training process. As a result, it is important both to calculate the objective $G(w_t)$ accurately and to build a relationship between $o_{k,t}$ and $w_t$. It is also clear that $P_1$ is a MOP in which the objectives $E$ and $G(w_t)$ conflict with each other. For example, if $E$ attains its minimum, then fewer UEs are selected to participate in the training process or smaller CPU frequencies are used; in this case, model convergence becomes slow and the value of $G(w_t)$ could increase. To handle a MOP, current approaches generally transform it into a single-objective optimization problem by the linear weighting method. However, these approaches cannot avoid being affected by subjectivity and preference [21], [22]. In addition, multi-objective approaches can find a Pareto optimal solution set rather than a single preference-dependent optimal solution found by single-objective approaches [23]. Thus, it is necessary to find a multi-objective approach to tackle the MOP.
Meanwhile, $P_1$ is a typical combinatorial optimization problem, in which the CPU frequencies are continuous variables while the UE selection decisions are discrete, and the dimension of these variables is high. In addition, the combinatorial optimization problem $P_1$ is NP-hard. Although traditional deterministic algorithms can seek the optimal variables simultaneously, they incur a high computational burden [24], and it is difficult for them to obtain the optimal solution within a reasonable time [25]. Therefore, another approach needs to be considered to solve the combinatorial optimization problem.
As a result, the following three challenges should be addressed: • How to describe the model training convergence when the value of $G(w_t)$ is difficult to calculate accurately? • How to design a multi-objective method to deal with the MOP? • How to tackle the combinatorial optimization problem with mixed-integer variables?

B. MOEA-FL
To address the above challenges, a multi-objective evolutionary algorithm (MOEA-FL) for bandwidth-limited FL in wireless edge systems is proposed to solve the MOP $P_1$. In MOEA-FL, a transformation approach is first designed to convert $P_1$ into a form that simplifies calculation. Then, similar to [26], [27], MOEA-FL utilizes a decomposition approach to divide the transformed MOP. Finally, MOEA-FL performs the optimization process, including the mutation, crossover, and selection operators, to search for a Pareto optimal solution set of the MOP with mixed-integer variables. Next, we briefly describe the details. Problem transformation: Before performing the main procedure, $P_1$ needs to be transformed by MOEA-FL to simplify calculation. We propose the concept of a convergence reference (CR), which constructs a CR function to select useful UEs and improve model convergence. The value of $G(w_t)$ is then replaced by the CR value, which makes the model training convergence easy to calculate. The CR function is based on the age-of-update-loss (AoUL) method, which combines the Age-of-Update (AoU) method with the training loss method.
Similar to [17], [28], [29], the AoU method utilizes the age of information of the delivered dynamic contents as a metric to assess system performance. In detail, the AoU method leverages the model training times to describe the convergence value: the AoU of UE $k$ is reset when UE $k$ is selected in a training round and increases by one otherwise. In addition, the training loss method records the training loss for all UEs, in which the training loss of UE $k$ in the $t$-th training round is denoted as $L_{k,t} = G_k(w_t)$. Combining the AoU method and the training loss method yields the AoUL method, which formulates the CR function $C_t$ in the $t$-th training round, where $\tau$ is a constant. Therefore, $P_1$ can be transformed into $P_2$. Problem decomposition: MOEA-FL decomposes $P_2$ into single-objective subproblems based on weight vectors. In detail, a set of $N$ uniformly distributed weight vectors $\{\lambda^1, \ldots, \lambda^N\}$, indexed by $\mathcal{N} = \{1, 2, \ldots, N\}$, is first generated by a simplex-lattice design. Each weight vector $\lambda^i$ ($\forall i \in \mathcal{N}$) has components drawn from $\{0/M, 1/M, \ldots, M/M\}$ with $\sum_m \lambda^i_m = 1$, where $M$ is a pre-defined parameter that controls the number of weight vectors. Therefore, the number of weight vectors is $N = C^{q-1}_{M+q-1}$, where $q = 2$ is the number of objectives in the MOP. These weight vectors represent the $N$ divided single-objective subproblems.
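The simplex-lattice weight vectors are easy to enumerate. The sketch below (names of our choosing, written generally in the number of objectives) reproduces the count $N = C^{q-1}_{M+q-1}$, which reduces to $M + 1$ vectors for $q = 2$:

```python
def weight_vectors(M, q=2):
    """Enumerate all (m_1/M, ..., m_q/M) with non-negative integer m_j
    summing to M: the simplex-lattice design of uniformly spread weights."""
    vecs = []
    def rec(prefix, remaining, slots):
        if slots == 1:
            vecs.append(prefix + [remaining / M])  # last component is fixed
            return
        for m in range(remaining + 1):
            rec(prefix + [m / M], remaining - m, slots - 1)
    rec([], M, q)
    return vecs
```

For $q = 2$ and $M = 10$ this yields the 11 vectors $(0, 1), (0.1, 0.9), \ldots, (1, 0)$, each defining one scalarized subproblem.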
In this paper, the $i$-th subproblem, corresponding to the weight vector $\lambda^i$, obtains an optimal CR value and energy consumption in each training round. The total optimal CR value and energy consumption for the $i$-th subproblem are obtained by summing the per-round optimal CR values and energy consumption for $\lambda^i$; the corresponding UE selection decisions and CPU frequencies represent the optimal solution of the $i$-th subproblem. The optimal solution of the $i$-th subproblem is a Pareto optimal solution of $P_2$. As a result, we can obtain different Pareto optimal solutions by adjusting the weight vector $\lambda^i$. Finally, a Pareto optimal solution set is obtained to solve $P_2$.
Algorithm 1 General Framework of MOEA-FL
Input: Initial CR value $C^i = 0$ and energy consumption $E^i = 0$ for each weight vector $\lambda^i$ ($i \in \mathcal{N}$);
1: Initialize the global model parameters $w$ and transform $P_1$ into $P_2$;
2: Initialize a uniform spread of weight vectors to divide $P_2$ into $N$ single-objective subproblems;
3: for each training round $t$ ($t \in \mathcal{T}$) do
4:   Input the weight vectors to perform the optimization process based on Algorithm 2;
5:   Execute the local training process and transmission;
6:   for each weight vector $\lambda^i$ ($i \in \mathcal{N}$) do
7:     Obtain an optimal solution including $o^{i*}_{k,t}$ and $f^{i*}_{k,t}$;
8:     Obtain the corresponding $C^i_t$ and $E^i_t$ in the $t$-th training round for $o^{i*}_{k,t}$ and $f^{i*}_{k,t}$;
9:     Update the total CR value and energy consumption;
10:  end for
11:  Perform the global aggregation and update $w_t$;
12: end for
Output: A Pareto optimal solution set for all the weight vectors, and the corresponding minimal $C^i$ ($\forall i \in \mathcal{N}$) and $E^i$ ($\forall i \in \mathcal{N}$).
The general framework of MOEA-FL is presented in Algorithm 1. In detail, the global model parameters $w$ are first initialized in line 1. Meanwhile, $P_1$ is transformed into $P_2$ by the AoUL method. After that, a uniform spread of weight vectors $\{\lambda^1, \lambda^2, \ldots, \lambda^N\}$ is initialized to divide $P_2$ into $N$ single-objective subproblems. Then, in each training round, the weight vectors are input to perform the optimization process via Algorithm 2 in line 4. After that, the local model training and transmission are executed in line 5. Next, for each subproblem based on $\lambda^i$, the optimal UE selection decisions and CPU frequencies are obtained in line 7. Based on the optimal variables, the corresponding $C^i_t$ and $E^i_t$ are calculated in line 8, and then the total CR value and energy consumption are updated in line 9, respectively. Finally, the global aggregation is executed and $w_t$ is updated in line 11. When the FL training process ends, a Pareto optimal solution set, the corresponding minimal CR value, and the minimal energy consumption for all the weight vectors are obtained.

C. OPTIMIZATION PROCESS
In this part, we briefly introduce the optimization process in each training round, which aims to find the optimal UE selection decisions and CPU frequencies, and thereby obtain the corresponding minimal CR value and energy consumption for each subproblem. Here, the differential evolution (DE) algorithm is leveraged to complete the optimization process by evolving a population. The DE algorithm, proposed by Storn and Price [30], is a simple

Algorithm 2 Optimization Process
Input: The weight vectors $\lambda^i$ ($\forall i \in \mathcal{N}$), the control parameters CR, F, and ε, and the neighbor size H;
1: Step 1: Initialization
2: Initialize the candidate Pareto optimal solution set CP;
3: Find the H closest weight vectors to $\lambda^i$ by Euclidean distance and initialize the neighbors B(i);
4: Initialize the population $X^1 = \{x^1_1, x^1_2, \ldots, x^1_N\}$;
5: Initialize $z^*_i$ and $FV^i = F(x_i)$ for the $i$-th objective;
6: Step 2: Update
7: for generation $g = 1 : g_{max}$ do
8:   for each weight vector $\lambda^i$ ($i \in \mathcal{N}$) do
9:     Generate a random number γ from [0, 1] and set P based on (17);
10:    Perform the mutation and crossover operations to obtain a new solution y; then fix the invalid elements to produce y′;
11:    Perform the fitness evaluation to obtain F(y′) and $g^{te}(y'|\lambda^i, z)$ for the $i$-th subproblem;
12:    Update the reference point $z^*$ based on (24);
13:    Update the neighboring solutions and FV based on (25) and (26);
14:    Update the CP by the domination relationship between each solution in the CP and y′;
15:  end for
16: end for
Output: The candidate Pareto optimal solution set CP with the corresponding minimal CR values and energy consumption for all the weight vectors.
and efficient population-based evolutionary algorithm with a variety of advantages, including ease of implementation, a simple structure, and powerful search capability. The DE algorithm randomly generates an initial population, where each individual is encoded by the mixed-integer variables (i.e., UE selection decisions and CPU frequencies), and then performs the mutation, crossover, and selection operators to obtain the optimal solution iteratively. These procedures of the DE algorithm constitute the optimization process, whose details are introduced next. Initialization: First, the candidate Pareto optimal solution set CP is initialized as ∅ to store the non-dominated solutions found during the search. Then, we initialize the neighbors B(i) and the closest weight vectors for $\lambda^i$, respectively. The Euclidean distances between any two weight vectors are calculated to find the H closest weight vectors, denoted as $\{\lambda^{i_1}, \lambda^{i_2}, \ldots, \lambda^{i_H}\}$, for $\lambda^i$. Here, H is set to 30. The neighbors B(i) for $\lambda^i$ ($i \in \mathcal{N}$) are then initialized as $\{i_1, \ldots, i_H\}$.
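The neighbor construction can be sketched as below (helper name ours): compute the pairwise Euclidean distances among the weight vectors and keep the $H$ nearest indices for each one.

```python
import numpy as np

def init_neighbors(weights, H):
    """B(i): indices of the H weight vectors closest to lambda^i,
    measured by Euclidean distance (each vector is its own nearest)."""
    W = np.asarray(weights)
    dists = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)
    return [[int(j) for j in np.argsort(row)[:H]] for row in dists]
```

Note that $\lambda^i$ is always its own nearest neighbor, so each B(i) contains $i$ itself, matching the convention that a subproblem may mate within its own neighborhood.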
After that, the population is initialized. In this paper, to jointly optimize the UE selection decisions and CPU frequencies, we randomly generate an initial population $X^g = \{x^g_1, \ldots, x^g_N\}$, where each individual represents a solution corresponding to a weight vector. Since $P_2$ has mixed-integer variables, including the UE selection decisions (discrete) and the CPU frequencies (continuous), they need to be encoded into each individual, as shown in Fig. 2. From Fig. 2, each individual is denoted as
$$x^g_i = \left[f^g_{i,1}, \ldots, f^g_{i,K},\ o^g_{i,1}, \ldots, o^g_{i,K}\right],$$
where $f^g_{i,k} = f_{min} + rand \cdot (f_{max} - f_{min})$ and $o^g_{i,k} = rand$ represent the CPU frequency and UE selection decision for UE $k$ in individual $i$ of generation $g$, respectively; $rand$ denotes a random number uniformly distributed over [0, 1]; and $N$ is the population size. The dimension of $x^g_i$ is $2K$. After that, the reference point $z^*$ and FV are initialized, respectively. Here, FV is utilized to describe the F-value of $x_i$. Then, the population starts the update for optimization.
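A minimal sketch of this mixed-variable encoding, assuming the layout of Fig. 2 (first the $K$ CPU frequencies, then the $K$ selection genes; function name ours):

```python
import numpy as np
rng = np.random.default_rng(0)

def init_population(N, K, f_min, f_max):
    """Each individual = K continuous CPU frequencies in [f_min, f_max]
    followed by K selection genes in [0, 1] (later repaired to 0/1
    decisions satisfying the subchannel constraint)."""
    freqs = f_min + rng.random((N, K)) * (f_max - f_min)
    sels = rng.random((N, K))
    return np.concatenate([freqs, sels], axis=1)  # dimension 2*K
```

Keeping the selection genes continuous during evolution and repairing them to feasible 0/1 decisions only at evaluation time is a common way to let DE's real-valued operators act on a mixed-integer problem.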
Update: In order to increase the diversity of the offspring population, it is reasonable to also select individuals outside the neighbor as mating parents. Thus, we first select individuals from the generated population to determine the selection range of the mating parents. In detail, a number $\gamma$ is randomly generated from [0, 1] to set the mating pool $P$ for the $i$-th subproblem, which can be expressed as
$$P = \begin{cases} B(i), & \gamma \le \varepsilon, \\ \{1, 2, \ldots, N\}, & \text{otherwise}, \end{cases}$$
where $\varepsilon$ represents the probability of choosing the neighbor. If $\gamma \le \varepsilon$ for the $i$-th subproblem, the neighbor B(i) is utilized as the candidate set for selecting parents; otherwise, the entire population is chosen as the candidate set. Then, the mutation and crossover operations are employed to generate a new solution based on $P$.
The mutation operator of the DE algorithm is performed to obtain a mutant vector $v^g_i$. We employ a commonly used mutation operator named DE/current-to-rand/1 as follows [31]:
$$v^g_i = x^g_i + F\left(x^g_{r1} - x^g_i\right) + F\left(x^g_{r2} - x^g_{r3}\right),$$
where $r1$, $r2$, and $r3$ are three distinct integers randomly selected from $P$ that also differ from $i$, and $F$ is the scaling factor. To increase the potential diversity of the population, a crossover operator is executed on $v^g_i$ and $x^g_i$ to obtain $u^g_i$. In this paper, we utilize the commonly used binomial crossover, which is expressed as follows [32]:
$$u^g_{i,n} = \begin{cases} v^g_{i,n}, & \text{if } rand_n \le CR \text{ or } n = n_{rand}, \\ x^g_{i,n}, & \text{otherwise}, \end{cases}$$
where $n_{rand}$ is a random integer selected from $[1, 2K]$ to ensure that $u^g_i$ differs from $x^g_i$ in at least one dimension, $rand_n$ is a random number uniformly distributed over [0, 1] for each $n$, and CR represents the crossover control parameter. The generated solution $u^g_i$ is denoted as $y$, and the invalid elements of $y$ are fixed to produce $y'$.
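The two operators can be sketched as follows. This assumes the common form of DE/current-to-rand/1 in which the same factor $F$ scales both difference vectors; function names are ours:

```python
import numpy as np
rng = np.random.default_rng(1)

def mutate_current_to_rand_1(X, i, pool, F):
    """v_i = x_i + F*(x_r1 - x_i) + F*(x_r2 - x_r3), with r1, r2, r3
    distinct indices drawn from the mating pool P, all different from i."""
    r1, r2, r3 = rng.choice([j for j in pool if j != i], size=3, replace=False)
    return X[i] + F * (X[r1] - X[i]) + F * (X[r2] - X[r3])

def binomial_crossover(x, v, CR):
    """u_n = v_n if rand_n <= CR or n == n_rand, else x_n."""
    D = len(x)
    n_rand = rng.integers(D)
    mask = rng.random(D) <= CR
    mask[n_rand] = True  # guarantee at least one component from the mutant
    return np.where(mask, v, x)
```

After crossover, out-of-range frequencies and selection genes violating the subchannel constraint would be repaired, corresponding to the "fix the invalid elements" step above.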
After that, we evaluate the fitness values of the two objective functions for each individual $x^g_i$ in $X^g$ in the $t$-th training round.
The fitness value $F_1(x^g_i)$ is the CR value $C_t$ and $F_2(x^g_i)$ is the energy consumption $E_t$ under the decoded variables, where $o^{i,g}_{k,t}$ represents the UE selection decision of individual $i$ in generation $g$ for UE $k$ in the $t$-th training round. Therefore, the fitness function for the weight vector $\lambda^i$ in the $t$-th training round is $F(x^g_i) = \left(F_1(x^g_i), F_2(x^g_i)\right)$. Then, we decompose the transformed MOP into $N$ single-objective subproblems by a decomposition approach named the Tchebycheff method [33], in which the $i$-th subproblem is defined as
$$\min_{x}\ g^{te}\!\left(x \mid \lambda^i, z^*\right) = \max_{1 \le j \le 2} \left\{ \lambda^i_j \left| F_j(x) - z^*_j \right| \right\},$$
where $z^* = (z^*_1, z^*_2)$ represents the reference point with $z^*_j = \min_x F_j(x)$ for the $j$-th objective function. Fig. 3 illustrates how the Tchebycheff method uses the evenly distributed weight vectors to find the Pareto optimal solutions (the red points).
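The Tchebycheff scalarization is a one-liner. The sketch below (our function name) evaluates $g^{te}$ for a given weight vector and reference point:

```python
def tchebycheff(F_vals, lam, z_star):
    """g_te(x | lambda, z*) = max_j lambda_j * |F_j(x) - z*_j|."""
    return max(l * abs(f - z) for f, l, z in zip(F_vals, lam, z_star))
```

Setting a weight component to zero makes the subproblem ignore that objective entirely, which is why the evenly spread weight vectors of Fig. 3 trace out different points along the Pareto front.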
After the fitness evaluation, the reference point, the neighboring solutions, and the population are updated, respectively. To be specific, the reference point for the $j$-th objective function is updated by
$$z^*_j = \min\left\{z^*_j,\ F_j(y')\right\}, \quad j = 1, 2.$$
Then, for each index $h$ ($h \in B(i) = \{i_1, \ldots, i_H\}$), the neighboring solution and its F-value for $\lambda^i$ are updated by
$$x_h = y', \quad FV^h = F(y'), \quad \text{if } g^{te}\!\left(y' \mid \lambda^h, z^*\right) \le g^{te}\!\left(x_h \mid \lambda^h, z^*\right).$$
Finally, the CP is updated by the domination relationship between each solution in the CP and the new solution $y'$: if any solution in the CP dominates $y'$, then $y'$ is discarded; otherwise, $y'$ is added to the CP and the solutions dominated by $y'$ are removed.
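The CP maintenance rule described above can be sketched as follows (minimization is assumed in both objectives; function names are ours):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective and
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_cp(cp, y_new):
    """Keep CP non-dominated: discard y_new if any member dominates it;
    otherwise insert it and drop the members it dominates."""
    if any(dominates(old, y_new) for old in cp):
        return cp
    return [old for old in cp if not dominates(y_new, old)] + [y_new]
```

This invariant (CP holds only mutually non-dominated points) is what makes the final CP a candidate Pareto optimal solution set.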
The optimization process is summarized in Algorithm 2, where the weight vectors are the inputs from Algorithm 1, together with the control parameters CR, F, and ε and the neighbor size H. The main optimization process contains two steps. In the initialization step, the candidate Pareto optimal set CP is initialized as ∅ in line 2. Then, the Euclidean distances between any two weight vectors are calculated to find the H closest weight vectors for $\lambda^i$, and the neighbors B(i) for $\lambda^i$ are initialized in line 3. Next, we initialize the population, in which each individual represents a solution of a decomposed subproblem corresponding to a weight vector and is encoded as a combination of UE selection decisions and CPU frequencies. After that, the reference point and the F-value are initialized for each objective in line 5. Then, the second step evolves the population for each weight vector to achieve optimization. In detail, a number γ is randomly generated from [0, 1] to set P for the mating selection in line 9. Then, the mutation and crossover operations are employed to generate a new solution y. After that, we fix the invalid elements of y to generate a new solution y′ in line 10. The fitness evaluation is then performed to obtain the fitness value F(y′) and the value of $g^{te}(y'|\lambda^i, z)$ for each weight vector $\lambda^i$. Next, the reference point $z^*$ for each objective function is updated based on (24) in line 12. The neighboring solutions for $\lambda^i$ and FV are updated in line 13, whose update principles are shown in (25) and (26), respectively. Finally, the CP is updated by the domination relationship between each solution in the CP and the new solution y′. When the stopping criterion is reached (i.e., g reaches $g_{max}$), we output the candidate Pareto optimal solution set CP with the corresponding minimal CR values and energy consumption for all the weight vectors.
For each $\lambda^i$, across all the FL training rounds there is a Pareto optimal solution $x^*_i$, which also represents the Pareto optimal solution of the $i$-th subproblem. The solution of each subproblem is likewise a Pareto optimal solution of $P_2$. As a result, we can obtain different Pareto optimal solutions by altering the weight vector. Finally, the obtained Pareto optimal solutions solve $P_2$.

IV. EXPERIMENTAL STUDY
A. EXPERIMENTAL SETTINGS
The parameters of the studied wireless edge FL system are set as follows [17], [34]. The channel gain follows a path-loss model, where $PL_0 = 30$ dB represents the path loss at the reference distance $d_0 = 1$ m and $d$ is the distance from the signal transmitter to the signal receiver. In addition, the sizes of the total model parameters and the total dataset are set to 86.6 KB and 47.04 MB, respectively. The size of the local dataset $D_k$ is set based on the CPU frequency of UE $k$. There is a latency limitation $T_0 = 1$ s in each FL training round [35]. In the DE algorithm, CR and F are both set to 0.9, the same as in other papers [31], [34]. Meanwhile, the population size N is set to 30. The remaining parameters are summarized in Table 1.
In the FL training process, the task is to classify handwritten digits using the MNIST dataset [36], whose training set includes 60,000 samples and test set contains 10,000 samples. MNIST is used under two data settings: i.i.d. and non-i.i.d. For the i.i.d. data setting, the data distribution over UEs is balanced and uniform. For the non-i.i.d. data setting, the dataset distribution over UEs is unbalanced, meaning that the dataset size varies greatly between different UEs [37]. As shown in [37], [38], [39], the performance of FL systems is very sensitive to non-i.i.d. data distributions, so experiments under the non-i.i.d. setting can evaluate the robustness of the proposed algorithm. To this end, we conduct experiments on different datasets under the non-i.i.d. setting to verify the robustness of the proposed MOEA-FL. To train on the MNIST dataset, the training model is designed as a multi-layer perceptron (MLP), which consists of one hidden layer with 50 neurons and utilizes the Tanh function as the activation function [40].
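An unbalanced non-i.i.d. partition can be sketched as below. This Dirichlet-based split is purely illustrative (the exact partition scheme follows [37]); the function name and the concentration parameter are our assumptions:

```python
import numpy as np
rng = np.random.default_rng(42)

def unbalanced_split(num_samples, num_ues, min_share=0.02):
    """Draw unequal local dataset sizes D_k from a Dirichlet prior, with
    a floor so no UE is left (nearly) empty. Illustrative only."""
    shares = rng.dirichlet(np.ones(num_ues) * 0.5)  # small alpha -> skewed
    shares = np.maximum(shares, min_share)
    shares /= shares.sum()
    sizes = (shares * num_samples).astype(int)
    sizes[0] += num_samples - sizes.sum()  # absorb the rounding remainder
    return sizes
```

Small Dirichlet concentration values produce the large size disparities between UEs that the unbalanced setting calls for.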
In order to verify the effectiveness of MOEA-FL, the following six algorithms, including three multi-objective approaches and three single objective approaches, are adopted as baselines:
• FL with a non-dominated sorting genetic algorithm II (FL-NSGAII): The objective function is the same as in MOEA-FL. The optimization method is the well-known multi-objective evolutionary algorithm NSGA-II [41], [42].
• FL with a strength Pareto evolutionary algorithm (FL-SPEA): The objective function is the same as in MOEA-FL. The optimization method is another well-known multi-objective evolutionary algorithm, SPEA [43], [44].
• FL with multi-objective particle swarm optimization (FL-MOPSO): The objective function is the same as in MOEA-FL. The optimization method is a multi-objective variant of PSO [45], [46].
• FL with the sliding DE policy (FL-SDE) [17]: It has a single objective function $F = C_t + \xi E_t$ $(\forall t \in T)$, where $\xi = 5$. The optimization method is a sliding differential evolution-based scheduling policy.
• FedCS [47]: This state-of-the-art work considers the impact of UE selection to minimize the model training convergence but does not consider energy consumption. Here, its objective function is $F = C_t$ $(\forall t \in T)$. Other settings are the same as in MOEA-FL.
• FEDL [13]: This state-of-the-art work considers energy consumption but does not involve UE selection. Here, its objective function is $F = E_t$ $(\forall t \in T)$. Other settings are the same as in MOEA-FL.
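The three single objective baselines can be viewed as fixed scalarizations of the two objectives $C_t$ (convergence) and $E_t$ (energy). A minimal sketch, with function names of our own choosing:

```python
def fl_sde_objective(c_t, e_t, xi=5.0):
    """FL-SDE: F = C_t + xi * E_t, with xi = 5 in the experiments."""
    return c_t + xi * e_t

def fedcs_objective(c_t, e_t):
    """FedCS: F = C_t (convergence only; energy is ignored)."""
    return c_t

def fedl_objective(c_t, e_t):
    """FEDL: F = E_t (energy only; no UE selection)."""
    return e_t
```

Each scalarization commits to one fixed trade-off, which is why each single objective baseline produces exactly one solution rather than a Pareto set.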

B. PERFORMANCE EVALUATION
Firstly, we discuss the convergence of MOEA-FL and introduce a commonly used metric, the hypervolume (HV), to evaluate it. The HVs are calculated from the Pareto optimal solution sets obtained on the MNIST dataset for the two data settings, and the results are shown in Fig. 4 and Fig. 5. In Fig. 4, we can observe that the HV on the MNIST dataset for the i.i.d data setting increases at first and then fluctuates within a certain range from about the 50th generation. Similarly, in Fig. 5, the HV for the non-i.i.d data setting increases at the beginning and then starts to fluctuate within a certain range at about the 50th generation. These results illustrate that MOEA-FL can find the Pareto optimal solution of each subproblem after 50 generations on the MNIST dataset. As a result, MOP reaches the Pareto optimal solution set once the Pareto optimal solution of each subproblem is found.

Next, we discuss the accuracy of the FL system under MOEA-FL. Since the multi-objective approaches each obtain a Pareto optimal solution set, there are $N$ curves for each of them. In contrast, each single objective method (FL-SDE, FedCS, or FEDL) has only one optimal solution and hence one corresponding curve. Such a curve always reflects a certain preference and may coincide with one of the curves obtained by the multi-objective methods. Therefore, we only compare the four multi-objective approaches. To compare them, we first take the mean value of the Pareto optimal solution set obtained by MOEA-FL, FL-NSGAII, FL-SPEA, and FL-MOPSO, respectively. Then, the solution closest to the mean value and its corresponding accuracy curve are selected. The results are shown in Fig. 6 and Fig. 7. In Fig. 6, we can observe that the accuracy of MOEA-FL reaches 91.71% and that of FL-NSGAII reaches 90.48%, while the accuracy of FL-SPEA is 89.81% and that of FL-MOPSO is 88.36%.
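For reference, the HV of a bi-objective minimization front can be computed by sorting the front and accumulating the rectangles it dominates relative to a reference point. This is a generic sketch of the standard 2-D indicator, not tied to the paper's exact implementation:

```python
def hypervolume_2d(front, ref):
    """HV of a bi-objective minimization front w.r.t. a reference point.

    `front` holds mutually non-dominated (f1, f2) points; `ref` must be
    worse than every point in both objectives."""
    pts = sorted(front)                  # ascending f1, so f2 descends
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)   # dominated slab
        prev_f2 = f2
    return hv
```

For example, the front {(1, 3), (2, 2), (3, 1)} with reference point (4, 4) has HV 6, the area of the three stacked rectangles between the front and the reference point.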
Meanwhile, MOEA-FL converges faster than the other three multi-objective approaches. In Fig. 7, among the four multi-objective approaches, MOEA-FL achieves the highest accuracy and the fastest convergence for the non-i.i.d data setting.
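The representative-curve selection used above (take the mean of each Pareto set, then pick the nearest solution) can be sketched as follows; the use of Euclidean distance is our assumption:

```python
import numpy as np

def closest_to_mean(pareto_front):
    """Index of the objective vector nearest (Euclidean) to the front's mean."""
    pts = np.asarray(pareto_front, dtype=float)
    mean = pts.mean(axis=0)                          # per-objective mean
    return int(np.argmin(np.linalg.norm(pts - mean, axis=1)))
```

This picks a solution with a "balanced" trade-off for each algorithm, making the four accuracy curves comparable.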
In addition, the solutions of the seven approaches in the last generation on the MNIST dataset are presented in Fig. 8 and Fig. 9. We can observe that MOEA-FL reaches the approximate Pareto optimal solution set on the MNIST dataset for both the i.i.d and the non-i.i.d data setting. Since FedCS, FEDL, and FL-SDE each have a single objective function, each of them yields only one optimal solution in the last generation. The objective function of FedCS, FEDL, or FL-SDE is a subproblem of MOP, and their optimal solutions are part of the approximate Pareto optimal solution set generated by MOEA-FL. As for the multi-objective approaches, including FL-NSGAII, FL-SPEA, and FL-MOPSO, their performance is not better than that of MOEA-FL. This result may be because they retain non-dominated solutions that should be discarded, which affects the search process.
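The comparison in Figs. 8 and 9 rests on Pareto dominance. For minimization, a standard dominance check and non-dominated filter look like:

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

A single objective optimum, such as that of FedCS or FEDL, is one extreme point of this non-dominated set, which is why it appears as a single marker on the plotted fronts.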
To further illustrate the effectiveness of MOEA-FL, we conduct a study on a medical case, the TissueMNIST dataset [48], which consists of kidney cortex microscope images used for multiple disease classification. The TissueMNIST dataset contains 236,386 human kidney cortex cells segmented from three reference tissue specimens and divided into eight categories. As with the MNIST dataset, the accuracy of MOEA-FL, FL-NSGAII, FL-SPEA, and FL-MOPSO is measured on the TissueMNIST dataset for both the i.i.d and the non-i.i.d data setting. The results are shown in Fig. 10 and Fig. 11. It can be clearly observed that MOEA-FL achieves higher accuracy than the other three multi-objective baselines on the TissueMNIST dataset for both data settings. In detail, MOEA-FL reaches 88.97% for the i.i.d data setting and 87.88% for the non-i.i.d data setting, while FL-NSGAII only reaches 83.66% and 82.23%, respectively. In addition, the accuracy of FL-SPEA and FL-MOPSO is 81.33% and 79.81% for the i.i.d data setting, and 79.13% and 77.17% for the non-i.i.d data setting.
Then, the solutions of the seven approaches in the last generation on the TissueMNIST dataset are presented in Fig. 12 and Fig. 13. The results show that MOEA-FL reaches the approximate Pareto optimal solution set on the TissueMNIST dataset for both the i.i.d and the non-i.i.d data setting. As with the MNIST dataset, FedCS, FEDL, and FL-SDE each yield only one optimal solution in the last generation, and their solutions are part of the approximate Pareto optimal solution set generated by MOEA-FL. The performance of MOEA-FL is better than that of the other three multi-objective approaches for both data settings.
Then, we discuss the effectiveness of the UE selection strategy. In current research, UE selection strategies are roughly divided into three categories: all selection [49], random selection [47], and UE selection with a specific strategy, such as a greedy algorithm [50]. To this end, we compare the energy consumption under a certain model convergence constraint for the above UE selection strategies and the UE selection strategy of MOEA-FL on the two datasets for both data settings. The result is shown in Fig. 14. We can observe that the UE selection strategy of MOEA-FL consumes less energy than the other UE selection strategies, followed by UE selection with the greedy algorithm. Since all selection chooses every UE in each training round, it has the highest energy consumption among the four strategies.

The last experiment evaluates the scalability of MOEA-FL. In detail, we increase the number of UEs from 10 to 100 and run MOEA-FL. The result is presented in Fig. 15, from which we can observe that energy consumption first increases until the number of UEs exceeds 50. This is because MOEA-FL can select more UEs to participate in the FL training process. After the number of UEs exceeds 50, energy consumption starts to fluctuate within a small range. This is because, due to bandwidth limitations, MOEA-FL selects at most $M$ UEs to participate in the FL training process and optimizes their energy consumption. Therefore, MOEA-FL can effectively find the optimal energy consumption as the number of UEs changes.
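For contrast with the baselines above, one common form of greedy UE selection ranks UEs by a utility-per-energy ratio and keeps the top $M$ within the bandwidth budget. This is only an illustrative criterion; the greedy algorithm of [50] may rank UEs differently:

```python
def greedy_select(energy, utility, m):
    """Pick up to m UEs with the best utility-per-energy ratio.

    `energy` and `utility` are per-UE estimates for one training round;
    the ratio criterion is our assumption for illustration."""
    ranked = sorted(range(len(energy)),
                    key=lambda i: utility[i] / energy[i],
                    reverse=True)
    return sorted(ranked[:m])   # indices of the selected UEs
```

All selection corresponds to `m = len(energy)`, which explains its highest energy cost, while random selection ignores the ratios entirely.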

C. DISCUSSION
Further extensions can be carried out along the following three lines. Firstly, the priority of the training task is uniform across UEs in this paper. The study and results can be extended to the case of different task priorities among UEs. For instance, UEs with higher priority should be selected first due to the importance of their training tasks. To this end, it is necessary to study the UE selection order in the case of different UE priorities. Secondly, the work can be extended to the case of massive UEs in FL systems, which leads to more complex MOPs and reduces the efficiency of MOEA-FL. A large-scale multi-objective approach can be developed to deal with such complex MOPs, e.g., large-scale multi-objective FL systems.
Thirdly, this paper considers an FL system in terrestrial wireless edge networks. The work can be extended to non-terrestrial networks to overcome connectivity challenges such as terrain limitations, long communication distances, or technical difficulties. To achieve high-rate, reliable, low-latency, and massive connectivity, non-terrestrial networks, including low-earth-orbit satellite systems, unmanned aerial vehicles, and high-altitude platforms, have become important next-generation wireless communication technologies and have been applied in diverse fields. Therefore, we will integrate FL systems with non-terrestrial wireless edge networks to improve system performance in future work. With this combination, the inherent characteristics of the new network paradigm and novel algorithmic solutions need to be investigated.

V. CONCLUSION
In this paper, we studied a bandwidth-limited wireless edge FL system for which a MOP was formulated. In MOP, the model training convergence and energy consumption were minimized simultaneously by jointly considering UE selection and computing resource allocation. To address MOP, MOEA-FL was proposed, in which the global loss function was replaced with the CR function by the AoUL method. Next, MOEA-FL divided MOP into $N$ single objective subproblems based on the Tchebycheff approach and optimized them simultaneously. Then, MOEA-FL introduced the DE algorithm to encode the mixed-integer variables. The results showed that MOEA-FL outperformed the other algorithms on the MNIST dataset and a medical case, the TissueMNIST dataset, for both the i.i.d and the non-i.i.d data setting.