Federated Learning Resource Optimization and Client Selection for Total Energy Minimization Under Outage, Latency, and Bandwidth Constraints With Partial or No CSI

We consider the problem of minimizing the total energy consumption due to the computation and communication tasks of federated learning (FL) under bandwidth and latency constraints. To avoid channel state information (CSI) feedback to the transmitter, we adopt outage probability as an additional constraint in the energy minimization problem. First, we define a feasibility metric based on the system design parameters to exclude slow clients (stragglers). Then, we propose a novel client selection algorithm, after excluding stragglers, based on dividing the remaining clients into clusters, where clients within the same cluster collaborate in a communication round to train their local models. For each communication round, one client cluster is selected in a round-robin fashion. Furthermore, we formulate and solve a resource allocation problem to optimize the transmit power, clock frequency, allocated bandwidth, and communication latency of clients within each cluster to minimize the total energy subject to total bandwidth, latency, and outage constraints. Moreover, we extend our FL design framework to the case of no CSI at both the client and server ends using differential transmission to eliminate CSI estimation pilot overhead and complexity at comparable total energy consumption and learning accuracy to coherent transmission. We test our proposed algorithms over MNIST and Fashion-MNIST datasets in iid and non-iid settings. Our proposed client selection algorithm reduces the number of participating clients per communication round by 41% compared to the baseline while maintaining the same learning accuracy. Moreover, our results demonstrate that increasing the number of receive antennas at the server from one to four can reduce the number of communication rounds required to reach a predetermined testing accuracy level by up to 53% for the Fashion-MNIST dataset.

limited communication bandwidth together with latency and privacy constraints.
Federated learning (FL) [1] has recently emerged as a powerful alternative approach to solve these problems by enabling edge devices, such as the clients considered in this paper, to collaboratively train the learning model using real-time data. In FL, the server orchestrates the participation of clients in the training process while keeping their data local. Specifically, the server updates a global model by averaging local models that are computed from local data and transmitted by the participating clients. These global and local model updates are iterated until convergence.
The transmission of high-dimensional models over wireless links is very challenging due to the scarcity of radio resources and the uncertainty of wireless channels. Moreover, FL performance depends on the reliability of the wireless links, which we quantify in this paper by the desired outage probability level, and on the communication resources (transmit power and bandwidth) of each FL client. These considerations motivate us to design reliable, low-latency, multi-access, privacy-preserving, and energy-efficient resource allocation schemes by jointly considering edge learning and wireless communication design aspects.
In this paper, we investigate an FL approach where clients train a learning model using their local data and only upload their updated model parameters to the server [1]. The server then averages the model parameters from all participating clients and returns an updated global model to them. The main benefits of the FL approach are summarized as follows. First, it facilitates collaborative learning while ensuring data privacy, since only local model updates are sent to the server. Second, FL makes it possible to gain insights from non-local data, thereby capturing more patterns. This is especially important for clients with small datasets, which can significantly degrade learning performance. The ability to leverage global data while only processing local data makes FL attractive to resource-constrained clients with low processing capabilities. Finally, FL is resilient and offers rapid recovery for clients experiencing faults. Another key FL advantage is the ease of deployment over existing wireless networks, where clients communicate their model parameters to the server over wireless links, making the FL performance dependent on the communication resources.
Most existing FL works assume the availability of channel state information (CSI) at the transmitter, so that each client can adapt its transmission rate to the state of the fading channel between the client and the server. In this paper, we assume that CSI is either: (i) partially available only at the server side (receiver), to eliminate the overhead and latency of feeding back CSI to the transmitter; or (ii) available at neither the client nor the server, to eliminate the pilot overhead and the complexity of CSI estimation. We quantify the wireless link reliability using the outage probability metric and incorporate it into the joint communication and computation energy minimization problem under an FL communication-round latency constraint. In addition, we assume that the server is equipped with multiple receive antennas to improve reliability and decrease the outage probability of the wireless link.

A. RELATED WORKS
Since Google proposed FL in [1], the number of advances and challenges in FL has greatly increased [2]. One of these challenges is the impact of distributing the wireless network resources across the selected clients on the learning accuracy under energy consumption or latency constraints. The authors of [3] studied the trade-off between the computation and communication latency on one hand, and between the training and client energy consumption on the other hand. Then, by formulating this trade-off as an optimization problem, [3] characterized the impact of the computation and communication latency on the client's energy consumption, learning time, and accuracy. The work in [4] proposed an iterative algorithm to minimize the total energy consumption under latency constraints by jointly optimizing the learning accuracy and communication latency. Given a latency constraint, the authors of [5] utilized a Golden-section search method to minimize the total energy consumption by finding the optimum communication latency and subsequently the optimal power level, bandwidth allocation, and CPU frequency. This work also assumed the availability of CSI at each client (transmitter side) to compute the beamforming weights that maximize the achievable data rate. The authors of [6] introduced a control algorithm to achieve a trade-off between local training updates and global parameter aggregation to minimize the loss function under a given resource budget constraint. Similarly, the batch size and communication resource allocation parameters are jointly optimized to maximize the learning efficiency under bandwidth and latency constraints in [7].
In addition, client scheduling is considered to be one of the main factors in optimizing communication resources and maximizing learning accuracy. The authors of [8] derived a closed-form criterion to select the clients with better wireless channels and computation capabilities in order to reduce the total energy consumption. Scheduling as many clients as the resource budget allows, so as to improve the learning accuracy, is considered in [9]. However, the authors of [10] found that scheduling few clients during the early communication rounds and more clients during the late communication rounds both improved learning accuracy and reduced energy consumption. Following a different approach, the authors of [11] relied on the staleness of the received local updates when scheduling the clients to accelerate the FL convergence rate.
The authors of [12] considered three scheduling policies: random, round-robin, and proportionally fair scheduling, where in the proportionally fair policy a client is selected if the ratio of its instantaneous signal-to-noise ratio (SNR) to its average SNR is high. The authors of [13] jointly optimized the computation and communication resources to minimize the training time in a cell-free massive MIMO scenario. The work in [14] considered CSI imperfection at the transmitter side when scheduling clients and optimizing resource allocation. The authors of [15] proposed an algorithm (called GREED) to schedule the clients with the lowest energy consumption to minimize the total energy consumption per communication round while maximizing the number of selected clients under the limited bandwidth constraint. All of the above works, and others in [16], [17], [18], assumed the availability of CSI at the server side. A few works, such as [19], [20], [21], assumed that the CSI is available only at the clients' side. We are not aware of any FL work that assumes no CSI at both the client and server ends. Table 1 summarizes the key differences between this work and the key related works mentioned above.

B. CONTRIBUTIONS
Our proposed approach is distinct from the existing literature in the following key aspects.
• We quantify the wireless link reliability with multiple receive antennas using the outage probability metric and incorporate it into the joint communication and computation energy minimization problem under an FL communication-round latency constraint and a total bandwidth constraint.
• We find the minimum required bandwidth allocation for each client to ensure the feasibility of the formulated resource allocation problem given the total bandwidth and latency constraints. Then, we derive a relationship between the minimum allocated bandwidth and both the number of CPU cycles required to train the local model and the distance between the client and the server.
• We define a feasibility metric based on the system parameters to exclude the slow clients (stragglers) that are unable to finish the FL training task within the latency deadline and/or to meet the outage constraint given the total bandwidth and maximum transmit power constraints.
• We propose a novel client selection algorithm that divides the remaining clients into clusters, allowing clients within each cluster to train their local models in the same communication round.
• We formulate and solve a resource allocation problem to optimize the transmit power, clock frequency, allocated bandwidth, and communication latency of each participating client within each cluster to minimize the total (communication plus computation) energy subject to total bandwidth, latency, and outage constraints.
• To eliminate the pilot overhead and complexity of CSI estimation, we extend our energy-efficient FL design framework to the case of no CSI at either the client or the server side by adopting differential wireless transmission.
• We evaluate our proposed algorithms on the MNIST and Fashion-MNIST datasets to quantify the impact of the feasibility metric threshold on convergence speed and learning accuracy, as well as the impact of the number of receive antennas at the server on the FL convergence speed and hence the total energy consumption.

The rest of this paper is organized as follows. Our model and assumptions are introduced in Section II, where we define the computation and communication energy consumption models and the latency and outage constraints. We formulate the resource optimization problem in Section III. We describe our proposed algorithms in Section IV, where we introduce our new client selection algorithm, resource optimization problem, and proposed overall green FL design framework for the cases of receiver-side-only CSI and no CSI at either side. Extensive simulation results evaluating the performance of our proposed client selection and resource allocation algorithms are presented in Section V. Conclusions are drawn in Section VI. The key notation used in the paper is summarized in Table 2.

II. SYSTEM MODEL
Our focus is on the uplink multiple-access wireless communication links from the clients to the server. The downlink is not considered because the server is assumed to have much more power and bandwidth resources than the clients; hence, the downlink broadcast wireless communication link is assumed error-free. Moreover, the computational task at the server, namely, local model averaging, is much simpler than the more computationally-intensive local model training at the clients. In this section, we describe the federated learning model as well as the clients' computation and communication models.

A. FEDERATED LEARNING MODEL
We consider a single server and K clients. Each client k utilizes its local training dataset D_k containing D_k data samples. For each dataset D_k = {x_{k,i}, y_{k,i}}_{i=1}^{D_k}, x_{k,i} represents the input sample to the neural network of client k and y_{k,i} is the corresponding output. The index k is used to emphasize the heterogeneous nature of the local data, computation, and communication resources of the clients (e.g., different sizes and qualities of datasets, different processor speeds and architectures, different communication transmit powers and bandwidths, etc.). The ultimate goal of the FL process is to minimize a loss function defined as

F(W) = Σ_{k=1}^{K} (D_k / D) F_k(W),   (1)

where F_k(W) is the local loss function, W ∈ R^n is the model parameters vector with dimension n, and D = Σ_{k=1}^{K} D_k is the total number of data samples of all clients. Figure 1 depicts the communication round of the FL algorithm, which consists of the following key steps.
1) The server selects J out of K clients to participate in the FL communication round due to certain constraints (e.g., limited bandwidth) and then sends the global model parameters to these selected clients.
2) Each client j ∈ J computes the local gradient using its local dataset to minimize the local loss function as follows:

g_j[t] = ∇F_j(W[t]),   (2)

where, at communication round t, g_j[t] is the local gradient of the j-th client.
3) For each batch, the local model weights are updated as follows:

W_j[t] ← W_j[t] − η g_j[t],   (3)

where η is the learning rate.
4) Instead of transmitting its local data, each client transmits its local model weights to the server using frequency division multiple access (FDMA) over the wireless link [22].
5) After receiving the weights from all participating clients, the server computes their weighted average

W[t + 1] = Σ_{j=1}^{J} (D_j / D) W_j[t],   (4)

and sends it back to the clients, who use this global model average to update their local model weights.
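The local update and the size-weighted averaging of Equation (4) can be sketched in a few lines. The following toy example uses our own function names, and quadratic local losses stand in for neural-network training; it is a sketch of the round structure, not the paper's implementation:

```python
import numpy as np

def local_update(w, grad_fn, eta, n_local):
    """Run n_local gradient steps on a copy of the global weights."""
    w = w.copy()
    for _ in range(n_local):
        w -= eta * grad_fn(w)  # w <- w - eta * g (local gradient step)
    return w

def federated_round(w_global, clients, eta=0.1, n_local=5):
    """One communication round: local training, then the D_j/D-weighted
    average of Equation (4) at the server."""
    updates, sizes = [], []
    for grad_fn, d_size in clients:
        updates.append(local_update(w_global, grad_fn, eta, n_local))
        sizes.append(d_size)
    sizes = np.asarray(sizes, dtype=float)
    return np.average(updates, axis=0, weights=sizes / sizes.sum())
```

With quadratic losses, the iteration converges to the dataset-size-weighted mixture of the clients' local optima, mirroring the fixed point of the averaging rule.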
To perform model averaging, the server must wait for all participating client model updates. This global update synchronization requirement causes latency due to the idle time wasted while waiting for straggler clients with slower processors or those who require a large number of computation cycles. To address this issue, we impose a maximum latency constraint that a client must meet in order to participate in the global model update.

B. CLIENT COMPUTATION MODEL
In our analysis, to limit energy consumption, we assume that the client's device is equipped with a CPU only. The CPU energy consumption (in Joules) at the j-th client per communication round is given by

E_j^comp = C_eff N_l N_c,j L_j f_j^2,   (5)

where f_j is the client's CPU frequency in Hz, C_eff is an effective capacitance parameter that depends on the client's CPU chip [3], N_c,j is the number of CPU cycles required per data bit for the j-th client [3], N_l is the number of local iterations, and L_j = Σ_{i=1}^{D_j} ζ_i is the size of the local dataset D_j in bits, where ζ_i is the size of each input sample in bits. In addition, the computation latency for the j-th client per communication round is

τ_j^comp = N_l N_c,j L_j / f_j.   (6)

Therefore, increasing the CPU frequency involves a trade-off between computation latency and energy consumption.
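The computation model above reduces to two one-line formulas: energy scales with the total cycle count times the square of the clock frequency, and latency is the cycle count divided by the frequency. A minimal sketch (our own function names; parameter values elsewhere in the paper are illustrative here):

```python
def comp_energy(f_hz, c_eff, n_cycles_per_bit, n_local, l_bits):
    """CPU energy per round: total cycles x C_eff x f^2 (Joules)."""
    total_cycles = n_local * n_cycles_per_bit * l_bits
    return c_eff * total_cycles * f_hz ** 2

def comp_latency(f_hz, n_cycles_per_bit, n_local, l_bits):
    """Computation latency per round: total cycles / f (seconds)."""
    return n_local * n_cycles_per_bit * l_bits / f_hz
```

The trade-off is visible directly: doubling the clock frequency quadruples the computation energy while halving the computation latency.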

C. CLIENT WIRELESS COMMUNICATION MODEL
To simplify the analysis, we assume slow client mobility when communicating with the server and we model the wireless links between the clients and the server as quasi-static flat-fading Rayleigh channels. We assume a single antenna at the client and N r antennas at the server. Unlike previous work, such as [8], we do not assume CSI knowledge at the transmitter side but only at the receiver (server) side. In Section IV-C, we will extend our FL design framework to the case of no CSI knowledge at both the client and server. We denote the transmit power level, the communication latency, and the bandwidth allocated to the j th client by P j , τ comm j , and B j , respectively.
As a wireless link reliability metric, we adopt the outage probability, defined as the probability that the instantaneous receive SNR of the uplink falls below the minimum SNR required to support a target data rate [22]. We denote the distance between the server and client j by d_j and the corresponding distance path loss by PL_j. The outage probability of the j-th client's wireless link to the server, for the quasi-static Rayleigh fading channel model with receive-antenna diversity and additive white Gaussian noise (AWGN) of power spectral density N_o (Watts/Hz), is well approximated by [22]

P_out,j ≈ (1 / N_r!) [ (2^{R_j/B_j} − 1) N_o B_j PL_j / P_j ]^{N_r},   (7)

where the exponent N_r is the achieved spatial diversity order, and the uplink data rate of the j-th client is

R_j = L_g / τ_j^comm,   (8)

where L_g is the size of the model parameters in bits. Since the dimensions of the local FL models (i.e., the sizes of the weight and bias vectors) are assumed to be the same for all clients, the size of the model parameters vector that needs to be uploaded to the server is fixed. The energy consumed by the j-th client in transmitting its model parameters per communication round is

E_j^comm = P_j τ_j^comm.   (9)

Increasing a client's transmit power reduces the outage probability of its wireless link to the server but increases its energy consumption. In addition, the higher the transmit power, the lower the communication latency needed to meet the maximum outage probability level, which, in turn, decreases the client's energy consumption as in (9).
In the next section, we will formulate these trade-offs mathematically.
In summary, the total energy consumption of all participating clients per communication round is

E_total = Σ_{j=1}^{J} ( E_j^comp + E_j^comm ),   (10)

where J is the total number of FL participating clients per communication round.
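The communication-side quantities can be sketched likewise. The following helpers (our own names) assume the standard high-SNR diversity approximation for N_r-branch reception over Rayleigh fading, with the transmission energy as in (9):

```python
import math

def outage_prob(p_tx, bw_hz, rate_bps, pl_lin, n0, n_r):
    """High-SNR outage approximation for N_r-branch Rayleigh diversity:
    (snr_threshold / snr_average)^N_r / N_r! ."""
    snr_avg = p_tx / (n0 * bw_hz * pl_lin)    # average receive SNR
    snr_th = 2 ** (rate_bps / bw_hz) - 1      # SNR needed for the target rate
    return (snr_th / snr_avg) ** n_r / math.factorial(n_r)

def comm_energy(p_tx, tau_comm):
    """Transmission energy per round: power x communication latency."""
    return p_tx * tau_comm
```

As the text notes, adding receive antennas steepens the outage curve (diversity order N_r), while with a single antenna outage falls only inversely with transmit power.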

III. RESOURCE OPTIMIZATION PROBLEM FORMULATION
In this section, we start by formulating the total energy minimization problem and its assumed constraints. Then, we describe our proposed algorithm for straggler detection and exclusion from the FL update.

A. CONSTRAINED TOTAL ENERGY MINIMIZATION PROBLEM
Our goal is to minimize the total energy consumption per communication round of all clients by optimizing the transmit power level, communication latency, CPU frequency, and bandwidth; i.e.,

min_{P_j, τ_j^comm, f_j, B_j}  Σ_{j=1}^{J} ( E_j^comp + E_j^comm )   (11)

subject to

P_j,min ≤ P_j ≤ P_j,max,  ∀ j,   (12)
f_j,min ≤ f_j ≤ f_j,max,  ∀ j,   (13)
τ_j^comp + τ_j^comm ≤ τ_max,  ∀ j,   (14)
Σ_{j=1}^{J} B_j ≤ B,   (15)
P_out,j ≤ P_out,max,  ∀ j,   (16)

where P_j,min and P_j,max are the minimum and maximum allowable client transmit power levels. Likewise, f_j,min and f_j,max are the minimum and maximum allowable client CPU frequencies.
In addition, P out,max is the maximum outage probability of the wireless link. To ensure the feasibility of the aforementioned optimization problem given the latency constraint in (14) and the total bandwidth constraint (15), we first detect the stragglers, who cannot meet these constraints, and then exclude them before solving the resource allocation optimization problem in (11).

B. STRAGGLER DETECTION AND EXCLUSION
In the FL system, a straggler is a client that requires more computation and/or communication resources than are available to finish the local training cycle and to send the local model weights back to the server. For example, a client with limited computation capability (low CPU frequency) may need more time than the per-communication-round latency constraint in (14) to finish its local training. In this case, this client might delay or halt the whole training cycle. Therefore, it is essential to identify the stragglers and exclude them before starting the FL training process.
To do so, we define lower bounds on both the communication and computation latency for each client and compare these bounds to the total latency constraint. If the sum of these bounds exceeds the total latency constraint, the client is considered a straggler and excluded from the FL process.
Taking the constraints in (12)-(15) into consideration, the lower bound on the computation latency, denoted by τ_j^comp,LB, is related to the maximum CPU frequency f_j,max as follows:

τ_j^comp,LB = N_l N_c,j L_j / f_j,max.   (17)

If τ_j^comp,LB ≥ τ_max, the client is unable to train its local model within the latency constraint even at its maximum computational capability. Hence, this client should be excluded from the FL update process. On the other hand, if τ_j^comp,LB < τ_max, the client is able to train its local model within the deadline.
In addition to the computation latency, it is also necessary to ensure that the communication task can be completed within the remaining time given the total bandwidth B. The maximum communication latency is given by

τ_j,max^comm = τ_max − τ_j^comp,LB.   (18)

Similarly, a lower bound on the communication latency is obtained by setting P_j = P_j,max and B_j = B in the constraint in (16) as follows:

τ_j^comm,LB = L_g / ( B log_2( 1 + (N_r! P_out,max)^{1/N_r} P_j,max / (N_o B PL_j) ) ).   (19)

If τ_j^comm,LB > τ_j,max^comm, then client j should be excluded from the FL update process. Finally, by substituting (19) and (17) into (14), a lower bound on client j's total latency is given by

τ_j^LB = τ_j^comp,LB + τ_j^comm,LB.   (20)

Intuitively, τ_j^LB < τ_max must hold to maintain the feasibility of (11).
Definition 1 (Feasibility Metric): For each client, the feasibility metric is the ratio between the lower bound on the client's total latency and the maximum latency constraint:

λ_j = τ_j^LB / τ_max,   (21)

where λ_j > 0. We consider clients with λ_j ≥ 1 as stragglers and exclude them from the FL training process. We emphasize that clients with λ_j close to one consume more computation and communication resources, and therefore more energy, than clients with λ_j close to zero.
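The straggler-exclusion rule of Definition 1 reduces to a one-line filter. A minimal sketch (our own routine; the per-client latency lower bounds are assumed precomputed as above):

```python
def feasibility_metric(tau_comp_lb, tau_comm_lb, tau_max):
    """lambda_j as in (21): total-latency lower bound over the deadline."""
    return (tau_comp_lb + tau_comm_lb) / tau_max

def exclude_stragglers(clients, tau_max):
    """Keep only clients with lambda_j < 1, i.e., feasible within the deadline.
    Each client is a dict with 'tau_comp_lb' and 'tau_comm_lb' entries."""
    return [c for c in clients
            if feasibility_metric(c["tau_comp_lb"], c["tau_comm_lb"], tau_max) < 1.0]
```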

IV. PROPOSED FL FRAMEWORK
In this section, we start by describing our proposed criterion for dividing the non-straggler clients into clusters. Clients within each cluster transmit their local model weights to the server within a communication round. Cluster selection is done in a round-robin fashion. Then, within each cluster, we present our proposed solution to the constrained resource allocation problem to minimize the total client energy.

A. CLIENT SELECTION AND CLUSTERING CRITERION
After the stragglers are excluded, we schedule the remaining clients to participate in the FL training cycle. However, only a fraction of the non-straggler clients can transmit their local model weights in a communication round due to the scarce bandwidth resources and stringent latency constraints. To maintain the feasibility of (11), we divide the available (i.e., non-straggler) clients into clusters, where clients within the same cluster can transmit their local model weights given the latency and bandwidth constraints. First, we find the minimum required bandwidth per client, B_j,min, given the constraints in (13) and (14). By following the approach in [8] and setting P_j = P_j,max and τ_j^comm = τ_j,max^comm in (16), we obtain B_j,min as the solution of

B_j,min log_2( 1 + (N_r! P_out,max)^{1/N_r} P_j,max / (N_o B_j,min PL_j) ) = L_g / τ_j,max^comm.   (22)

Allocating a smaller bandwidth than B_j,min to a client would force the client either to transmit at a power level larger than P_j,max or to train the local model at a CPU frequency larger than f_j,max, which violates the constraints in (13) and (14).
Second, we demonstrate the impact of the number of CPU cycles needed to train the local model and the distance between the client and the server on B j,min . As shown in Figure 2, clients that require more CPU cycles for training or are located farther from the server are allocated more bandwidth. This is because a higher number of CPU cycles for training leads to a longer training time which, in turn, leaves less room for communication latency. To meet the latency constraint, it is necessary for these clients to be allocated more bandwidth to increase their achievable rates. Similarly, clients located farther from the server need more bandwidth to compensate for path loss and meet the power constraint.
We propose a client selection algorithm that divides the available clients into clusters based on B_j,min such that Σ_j B_j,min ≤ B within each cluster, in order to maintain the feasibility of (11). In addition, we require Σ_j B_j,min to be almost the same over each cluster to maintain fairness in dividing the resources across the clusters.
The main idea behind our proposed algorithm is that a client with a large B_min should share the same cluster with a client with a small B_min, since the variability of B_min enables stacking more clients per cluster (which, in turn, improves the FL training accuracy). To accomplish this goal, we model the client selection problem as a k-partition problem in which the allocated bandwidth per cluster must not exceed the total bandwidth while maintaining almost the same number of clients per cluster. As this problem is NP-complete, we implement a recursive algorithm to solve it: the procedure recursively assigns clients to clusters and, once the sum of the minimum bandwidths of each cluster is less than the total bandwidth, it returns the clusters with assigned clients and the number of created clusters. We summarize the client selection steps in Algorithm 1.
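As a simplified stand-in for the recursive routine of Algorithm 1, the following greedy best-fit sketch pairs large-B_min clients with small-B_min ones under the per-cluster bandwidth budget; it does not enforce equal cluster sizes, so it is an illustration of the packing idea rather than the paper's exact procedure:

```python
def cluster_clients(b_min, b_total):
    """Greedy best-fit: sort clients by B_min descending and place each into
    the cluster with the most spare bandwidth, opening a new cluster when
    none can accommodate the client."""
    order = sorted(range(len(b_min)), key=lambda j: -b_min[j])
    clusters, loads = [], []
    for j in order:
        fits = [i for i in range(len(clusters)) if loads[i] + b_min[j] <= b_total]
        if fits:
            i = min(fits, key=lambda i: loads[i])  # most spare room first
            clusters[i].append(j)
            loads[i] += b_min[j]
        else:
            clusters.append([j])
            loads.append(b_min[j])
    return clusters
```

Every client is assigned exactly once, and each cluster's bandwidth demand stays within the total budget, preserving the feasibility condition of (11).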

B. RESOURCE OPTIMIZATION PROBLEM SOLUTION
Within each cluster, the clients train their local models and transmit the updated models back to the server simultaneously using FDMA. Given the formulated optimization problem in (11) and its constraints, we need to optimally assign the available resources to the clients within the same cluster to minimize the total energy consumption towards the goal of a green FL framework. According to the constraint in (16), it is clear that to achieve a lower outage probability level, we should increase P_j as much as possible. However, this will increase the total energy in (10). Thus, the optimal transmit power level of each client j, given the bandwidth B_j and communication latency τ_j^comm, is the level that achieves P_out,max in (16), which is given by

P_j = ( 2^{L_g/(τ_j^comm B_j)} − 1 ) N_o B_j PL_j / (N_r! P_out,max)^{1/N_r}.   (23)

Similarly, the client should operate at the lowest CPU frequency that minimizes the total energy consumption without violating the latency constraint in (14). Therefore, the optimal CPU frequency of each client j must satisfy (14) with equality and is given by

f_j = N_l N_c,j L_j / (τ_max − τ_j^comm).   (24)

Substituting (23) and (24) into (10), we arrive at a new objective function that minimizes the total energy consumption over only two variables instead of four, namely, the bandwidth and communication latency:

E_total = Σ_{j=1}^{J} [ C_eff (N_l N_c,j L_j)^3 / (τ_max − τ_j^comm)^2 + τ_j^comm ( 2^{L_g/(τ_j^comm B_j)} − 1 ) N_o B_j PL_j / (N_r! P_out,max)^{1/N_r} ].   (25a)

The new objective function is the sum of two convex functions of τ_j^comm and B_j, which makes the objective function convex as well. To find the optimum values of τ_j^comm and B_j, we define the Lagrange function of (25a) as

L = E_total + μ ( Σ_{j=1}^{J} B_j − B ),   (26)

where μ is a Lagrange multiplier. First, we find τ_j^opt,comm by setting

∂L / ∂τ_j^comm = 0.   (27)

A closed-form expression for the optimum τ_j^comm does not seem feasible. Hence, we use the bisection method to solve (27) numerically.
Similarly, we can find the optimum bandwidth by setting

∂L / ∂B_j = 0.   (28)

A closed-form expression (29) for the optimum bandwidth B_j^opt in terms of the Lambert W function follows the derivation in [8]; the constant μ is found by applying the total bandwidth constraint Σ_{j=1}^{J} max(B_j^opt, B_j,min) = B, after which P_j and f_j are obtained from (23) and (24), respectively. However, due to the coupling between Equations (27) and (29), it is difficult to directly obtain the globally optimal solution to (25a). To overcome this challenge, we propose the following iterative algorithm.
• The server initializes the communication latency with the maximum communication latency τ_j,max^comm and creates a variable, initialized to zero, to track the total energy over the iterations.
• A loop is initiated to find the global minimum of the total energy consumption.
• Using the communication latency, the bandwidth allocation is calculated using Equation (29).
• The resulting bandwidth, along with the communication latency, is used to compute the transmit power level for this iteration using Equation (23).
• The communication latency is then used to find the CPU frequency f_j using Equation (24).
• The total energy for this iteration is evaluated and compared to the total energy from the previous iteration. If the difference is within the allowable tolerance ε, the algorithm terminates with the optimal values of P_j, f_j, B_j, and τ_j^comm. Otherwise, the communication latency is updated based on the calculated bandwidth using Equation (27), and the loop continues.
Algorithm 2 lists the pseudo-code implementing the above steps.
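A single-client version of this procedure can be sketched by substituting (23) and (24) into the objective and searching over the communication latency alone, with the bandwidth held fixed; ternary search replaces the bisection and Lambert-W steps here, so this is an illustration of the energy trade-off rather than Algorithm 2 itself (function names and parameter values are ours):

```python
import math

def round_energy(tau_comm, bw, l_g, pl_lin, n0, p_out_max, n_r,
                 c_eff, n_local, n_cyc, l_bits, tau_max):
    """Per-client energy versus communication latency: the transmit power is
    set to just meet the outage target, as in (23), and the CPU clock to just
    meet the latency deadline, as in (24)."""
    exp = min(l_g / (tau_comm * bw), 500.0)          # clamp to avoid float overflow
    snr_th = 2.0 ** exp - 1.0
    p_tx = snr_th * n0 * bw * pl_lin / (math.factorial(n_r) * p_out_max) ** (1.0 / n_r)
    f = n_local * n_cyc * l_bits / (tau_max - tau_comm)   # CPU frequency (Hz)
    return p_tx * tau_comm + c_eff * n_local * n_cyc * l_bits * f ** 2

def optimize_latency(bw, **kw):
    """Ternary search for the latency minimizing the convex per-client
    objective, analogous to (25a) with B fixed."""
    lo, hi = 1e-4, kw["tau_max"] - 1e-4
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if round_energy(m1, bw, **kw) < round_energy(m2, bw, **kw):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

Shrinking the latency inflates the transmit power exponentially (communication energy), while stretching it forces a higher CPU clock (computation energy), so the minimum lies strictly inside the interval.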

C. NO CSI AT THE CLIENT OR THE SERVER
In this subsection, we extend our proposed energy-efficient FL resource optimization algorithm to the case where the CSI is unavailable at the server side (receiver side) as well. The motivation here is to eliminate the pilot overhead and complexity of CSI estimation.
Towards this goal, we adopt Differential Phase Shift Keying (DPSK) based wireless transmission, where the data is modulated through the phase change between two adjacent transmitted symbols according to the following rule [24]:

x_s = u_s x_{s−1},   (30)

where s is the symbol time index, u_s is the transmitted information symbol at time index s, and x_s is the transmitted DPSK symbol. We assume the initial condition x_{−1} = 1.
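The differential encoding rule above, together with non-coherent detection at the server, can be sketched as a toy DBPSK simulation over a quasi-static Rayleigh channel (binary u_s ∈ {±1}; the noise level, antenna count, and routine names are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def dpsk_encode(u):
    """Differential encoding: x_s = u_s x_{s-1}, with reference x_{-1} = 1."""
    return np.concatenate([[1.0 + 0j], np.cumprod(u)])

def dpsk_detect(y):
    """Non-coherent differential detection over N_r antennas, no CSI needed:
    decision statistic Re{ y_s^H y_{s-1} }, valid while the channel stays
    constant over adjacent symbols."""
    corr = np.sum(y[1:] * y[:-1].conj(), axis=1).real
    return np.where(corr >= 0, 1.0, -1.0)

# DBPSK over one quasi-static Rayleigh realization, N_r = 4 receive antennas
u = rng.choice([-1.0, 1.0], size=200)                       # information symbols
x = dpsk_encode(u)
h = (rng.normal(size=4) + 1j * rng.normal(size=4)) / np.sqrt(2)
noise = 0.05 * (rng.normal(size=(201, 4)) + 1j * rng.normal(size=(201, 4)))
y = x[:, None] * h[None, :] + noise                          # y_s = h x_s + n_s
u_hat = dpsk_detect(y)
```

The detector multiplies adjacent received vectors, so the unknown channel phase cancels; summing over the antennas provides the receive diversity without any channel estimate.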
The received signal at the server side is

y_s = h x_s + n_s,   (31)

where y_s is the N_r × 1 received signal vector, h is the N_r × 1 channel vector, and n_s is the N_r × 1 noise vector at time index s. Assuming that the channel does not change over at least two successive symbols, differential detection is performed at the receiver (server) without the need for CSI (see Appendix A). The decision-point SNR of client j computed at the server is given by

γ_j^D = P_j ||h||^2 / (2 N_o B_j PL_j).   (32)

Consequently, the j-th client's outage probability for differential transmission is given by

P_out,j^D ≈ (1 / N_r!) [ 2 ( 2^{R_j/B_j} − 1 ) N_o B_j PL_j / P_j ]^{N_r}.   (33)

Proof: see Appendix B. Therefore, to find the optimum transmit power, communication latency, and bandwidth for differential transmission, we modify Equations (23), (27), and (29) by incorporating P_out,j^D, which amounts to dividing the factor (N_r! P_out,max)^{1/N_r} by two to account for the 3-dB SNR penalty of differential detection.

Algorithm 3 Proposed FL Framework
1: The server collects metadata from the connected clients
2: Find τ_j^LB using (20)
3: Exclude the stragglers based on λ_j using (21)
4: Apply Algorithm 1 to divide the remaining non-straggler clients into clusters
5: for each cluster in parallel do
6: Apply the resource allocation procedure in Algorithm 2
7: for t = 0 to T communication rounds do
8: The server selects a cluster in a round-robin fashion
9: The server sends the global model to the selected cluster's clients
10: Each client trains his local model
11: Each client sends the updated weights to the server
12: The server aggregates the weights to generate the global model

D. PROPOSED FL FRAMEWORK ALGORITHM
To put everything together, we propose the following outage-constrained green FL framework.
• The server collects metadata, such as CPU frequencies and dataset sizes, from the connected clients. We assume that the open radio access network (O-RAN) has a near-real-time RAN intelligent controller (near-RT RIC) to collect the necessary metadata from the connected clients via the E2 interface. The collected data is then stored in the RIC database [25], [26].
• For each client, the server computes the lower bound on the total latency using (20). Then, it detects the stragglers based on the feasibility-metric threshold defined in (21).
• The server applies Algorithm 1 for client selection to divide the remaining (i.e., non-straggler) clients into clusters. Then, over each cluster, the server applies the resource allocation algorithm in Algorithm 2 to allocate the optimal resources (power, CPU frequency, bandwidth, and communication latency) to each client so as to minimize the total energy consumption.
• Finally, the server runs the FL training cycle, selecting one cluster per communication round in a round-robin fashion.
The proposed FL framework is summarized in Algorithm 3.

V. SIMULATION RESULTS
The simulation parameters are set as follows unless specified otherwise. We assume that the total number of clients is 100, uniformly distributed in a 500 m × 500 m area with the server at the center. We adopt the path loss model 128.1 + 37.6 log10(d), where d is the distance between the client and the server in kilometers (km). In addition, we assume an FDMA uplink, where the total bandwidth (B) is equal to 20 MHz. The AWGN PSD is N_o = −174 dBm/Hz, the maximum outage level is P_out,max = 0.1, N_r = 4 receive antennas at the server, and the maximum latency per communication round is τ_max = 400 ms. Other system parameters are f_j,max = 2 GHz, P_j,max = 30 dBm, and C_eff = 2 × 10^−28; the number of CPU cycles per bit N_c,j is assumed to be a uniformly distributed random variable over the interval [10, 40] cycles/bit.

FIGURE 4. Clients' feasibility metric at τ_max = 400 ms and P_out,max = 0.1.
We tested our proposed client selection algorithm on the MNIST [27] and Fashion-MNIST [28] datasets. For both datasets, the total number of training and testing samples is 60,000 and 10,000, respectively. We consider the imbalanced local dataset case, where the number of local data samples for client j, denoted by L_j, is assumed to be a uniformly distributed random variable over the interval [100, 900]. To evaluate the accuracy of the global model, we keep the testing data samples on the server side. In addition, we test our proposed algorithms in both independent and identically distributed (iid) and non-iid settings. In the iid setting, each client is randomly assigned samples from all labels in the dataset, while in the non-iid setting, each client is only assigned samples from two or three labels according to its number of data samples. Fig. 5 shows the distribution of different labels for five clients in the iid and non-iid settings.
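The iid/non-iid split described above can be sketched as follows. This is a minimal version under assumptions: `labels_per_client` is an illustrative parameter (our experiments restrict each client to two or three labels).

```python
import numpy as np

def partition(labels, num_clients, samples_per_client, iid=True,
              labels_per_client=2, seed=0):
    """Assign sample indices to clients: iid draws from all labels;
    non-iid restricts each client to `labels_per_client` labels."""
    rng = np.random.default_rng(seed)
    assignment = {}
    for j in range(num_clients):
        if iid:
            pool = np.arange(len(labels))           # all samples are candidates
        else:
            chosen = rng.choice(np.unique(labels),
                                size=labels_per_client, replace=False)
            pool = np.where(np.isin(labels, chosen))[0]
        assignment[j] = rng.choice(pool, size=samples_per_client[j],
                                   replace=False)
    return assignment
```

In the non-iid case, every client's local dataset contains samples from at most `labels_per_client` distinct labels.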
We train a convolutional neural network [29] (CNN) model with two 3 × 3 convolution layers where each layer has 64 channels, and these convolution layers are followed by a 2 × 2 max pooling layer. Then, we have two fully-connected layers with 384 and 192 units, respectively.
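Assuming 28 × 28 grayscale inputs, 'valid' (unpadded) convolutions, biases everywhere, and a final 10-way output layer (the padding scheme and output layer are our assumptions, not stated above), the size of this CNN can be counted as:

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution layer."""
    return (k * k * c_in + 1) * c_out

def cnn_param_count(h=28, w=28, c=1):
    """Parameter count of the CNN described in the text (assumptions:
    'valid' padding, a 10-way output layer, biases in every layer)."""
    p = conv_params(3, c, 64)                      # conv1: 3x3, 64 channels
    h, w, c = h - 2, w - 2, 64
    p += conv_params(3, c, 64)                     # conv2: 3x3, 64 channels
    h, w = h - 2, w - 2
    h, w = h // 2, w // 2                          # 2x2 max pooling
    flat = h * w * c                               # flattened feature map
    p += (flat + 1) * 384                          # fully connected, 384 units
    p += (384 + 1) * 192                           # fully connected, 192 units
    p += (192 + 1) * 10                            # assumed 10-way output layer
    return p
```

Under these assumptions the model has roughly 3.65 million parameters, dominated by the first fully-connected layer; this is the weight vector each scheduled client uploads per round.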
We use the mini-batch stochastic gradient descent (SGD) algorithm to train the local models at the clients, where the number of local iterations N_l and the batch size are 4 and 16, respectively [30]. For each experiment, we average the performance over 10 trials with different random seeds. Fig. 4 shows a single realization of the clients' feasibility metric. We define the straggler threshold to ensure the feasibility of the problem in (11); in other words, the threshold excludes the clients who cannot meet the constraints in (12)–(15). Thus, by excluding those clients, we ensure that the FL process can be executed using the available resources within the latency constraint, and the threshold also allows us to detect the stragglers. For instance, in Fig. 4(a), the clients whose indices are 37 (circle), 45 (diamond), and 87 (square) are stragglers: clients 37 and 45 cannot finish the training task within τ_max, as shown in Fig. 4(b), while client 87 needs more than the total available bandwidth to send back his updated weights to the server, as indicated in Fig. 4(c).
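The straggler test can be illustrated with a simplified, hypothetical feasibility metric (the paper's exact metric in (14)–(15) is not reproduced here): we take the worse of the latency ratio and the bandwidth ratio, so a metric above λ flags either failure mode seen in Fig. 4(b) or 4(c).

```python
def feasibility_metric(tau_comp_min, tau_comm_min, b_min, tau_max, b_total):
    """Hypothetical feasibility score: a client is infeasible if its best-case
    latency exceeds tau_max or its minimum bandwidth exceeds the total B."""
    return max((tau_comp_min + tau_comm_min) / tau_max, b_min / b_total)

def exclude_stragglers(clients, lam=1.0):
    """Keep only clients whose feasibility metric is at most lambda."""
    return [c for c in clients if feasibility_metric(**c["params"]) <= lam]
```

A client that is too slow (like clients 37 and 45) or too bandwidth-hungry (like client 87) is excluded under λ = 1.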

A. STRAGGLER DETECTION AND EXCLUSION
We also show in Fig. 6 the impact of different feasibility metric thresholds on the convergence of the learning accuracy. Increasing the threshold from 0.4 to 1.0 improves the convergence rate by 23.5%. Specifically, for λ = 0.4, it takes 34 communication rounds to reach 85% learning accuracy, while it takes only 26 communication rounds for λ = 1 to reach the same learning accuracy level. In addition, at λ = 1, the learning accuracy improves by 2.5% in the final communication rounds.

B. CLIENT SELECTION
Clients who fall below the feasibility threshold (i.e., non-stragglers) are candidates for the client selection algorithm, which divides them into clusters according to Algorithm 1.
In Fig. 7, we demonstrate the relationship between the feasibility threshold and the average number of clusters and clients per cluster. As the feasibility threshold decreases, a larger number of clients are excluded, leading to a decrease in the number of clusters and an increase in the number of clients per cluster. This is because clients who require larger bandwidth allocations are excluded, resulting in fewer clusters with more clients in each cluster. Next, we compare our proposed client selection algorithm with the Federated Client Selection (FedCS) algorithm [9] and the GREED algorithm [15]. In the FedCS algorithm, the server randomly selects a subset of clients each communication round and then sorts them in ascending order based on their minimum bandwidth allocation B_min. The server schedules as many clients as the total bandwidth permits, such that Σ_j B_min,j ≤ B. In the GREED algorithm, the server schedules the clients who have the lowest energy consumption. For a fair comparison, we assume that no green (i.e., harvested) energy is available, since our framework does not harvest energy. To evaluate the performance of our proposed client selection algorithm in a more realistic environment, we test it in both independent and identically distributed (iid) and non-iid settings as described in Fig. 5.
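The FedCS baseline as described above can be sketched as follows; the size of the randomly pre-selected subset is an assumption (not specified above).

```python
import random

def fedcs_schedule(clients, b_total, subset_size=20, rng=None):
    """FedCS-style scheduling sketch: randomly pre-select a subset, sort it by
    minimum bandwidth B_min in ascending order, then admit clients while the
    running sum of B_min stays within the total bandwidth B."""
    rng = rng or random.Random(0)
    subset = rng.sample(clients, k=min(len(clients), subset_size))
    subset.sort(key=lambda c: c["b_min"])          # ascending B_min
    scheduled, used = [], 0.0
    for c in subset:
        if used + c["b_min"] <= b_total:           # keep Σ B_min ≤ B
            scheduled.append(c)
            used += c["b_min"]
    return scheduled
```

Sorting by B_min first is exactly why FedCS packs many low-bandwidth clients into each round, which is the behavior compared against in Fig. 8.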
First, we show in Fig. 8 the average number of clients per communication round for both our proposed algorithm and the FedCS algorithm. We observe that our proposed algorithm schedules 41% fewer clients per communication round compared to the FedCS algorithm. This is because, in the FedCS algorithm, the server prioritizes the clients with small B_min allocations in order to schedule as many clients as possible within the given bandwidth.
Second, we show the accumulated total energy consumption across the communication rounds in Fig. 9. To find the total energy consumption per communication round, we apply Algorithm 2 over the scheduled clients to assign the optimum resource allocation parameters to each client. The accumulated total energy consumption of our proposed algorithm is around 40% smaller at the last communication round compared to the FedCS algorithm, since we schedule fewer clients per communication round, as indicated in Fig. 8.
Next, we compare the test accuracy of both algorithms. In Fig. 10, we show the test accuracy in the iid setting for the MNIST and Fashion-MNIST datasets. Although our algorithm schedules fewer clients per communication round than FedCS, we achieve a slightly better test accuracy on the Fashion-MNIST dataset and maintain almost the same accuracy on the MNIST dataset. Similarly, in Fig. 11, we show the test accuracy for the non-iid case. Here, the test accuracy curve of our proposed algorithm exhibits the same behavior on the Fashion-MNIST and MNIST datasets. It is worth noting that although we schedule fewer clients per communication round, we achieve slightly better accuracy or at least maintain the same accuracy as the FedCS algorithm. Compared to the GREED algorithm, our algorithm achieves higher or similar test accuracy in the iid scenario and much better test accuracy in the non-iid scenario. In addition, our proposed algorithm converges faster than GREED, resulting in less energy consumption to reach a given learning accuracy level. For instance, on the non-iid Fashion-MNIST dataset, the GREED algorithm takes 228 communication rounds to reach 77% testing accuracy, consuming around 335 J, while our proposed algorithm takes 103 communication rounds to reach the same testing accuracy and consumes only around 193 J. This means that our proposed algorithm reduces the total energy consumption by 42% to reach the same learning accuracy as the GREED algorithm.

C. RESOURCE ALLOCATION
After excluding the stragglers, we divide the remaining clients into clusters according to Algorithm 1 and then optimize the resource allocation within each cluster according to Algorithm 2. It is worth mentioning that in most cases, Σ_j B_j,min < B after assigning clients to a cluster. This means that some bandwidth remains available, and we allocate it to certain clients, which are selected to minimize the total energy consumption per cluster according to Algorithm 2. For example, in Fig. 12(a), we show how the resource allocation algorithm redistributes the unallocated bandwidth (i.e., B − Σ_j B_j,min), which consequently reduces the assigned CPU frequency and the transmit power level, as shown in Figs. 12(b) and 12(c), respectively, to minimize the total energy consumption. This energy saving is possible because B_min was calculated based on the maximum CPU frequency f_max and the maximum transmit power level P_max. Figs. 12(d) and 12(e) show the communication and computation energy reduction, respectively, before and after applying our proposed resource allocation algorithm. Fig. 12(f) shows that our proposed algorithm saves from 32% to 45% of the total energy consumption per cluster compared to the baseline case of assigning the maximum communication and computation resources to each client, while satisfying the resource constraints. We note that, under our resource allocation algorithm, each client operates at the maximum outage probability, since a lower outage probability would require allocating more transmit power to the client, which increases the energy consumption. Next, we compare our resource allocation algorithm with the baseline case of uniform bandwidth allocation in Fig. 13. In uniform bandwidth allocation, the total bandwidth is divided equally over the J scheduled clients in the current communication round. To do this, we enforce the condition B/J ≥ B_min at Step 14 of Algorithm 1.
This condition ensures that the allocated bandwidth is greater than or equal to B min of any client in the cluster to maintain the feasibility of (11).
We observe from Fig. 13 that our optimum bandwidth allocation reduces the accumulated total energy consumption by 16% compared to uniform bandwidth allocation. Another advantage over the uniform bandwidth allocation is that we can utilize the available bandwidth efficiently as we assign the required bandwidth for each client.
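The leftover-bandwidth redistribution discussed in this subsection can be illustrated with a greedy sketch. This is not the paper's Algorithm 2 (which solves the resource allocation problem exactly); `energy` is a hypothetical per-client total-energy function that decreases as allocated bandwidth grows.

```python
def redistribute(b_min, b_total, energy, step=1e5):
    """Greedy sketch: start every client at its B_min and repeatedly hand a
    small chunk of the unallocated bandwidth B - sum(B_min) to the client
    whose total energy drops the most, until the bandwidth is exhausted."""
    b = list(b_min)
    leftover = b_total - sum(b)
    while leftover >= step:
        # energy saving each client would get from one extra chunk
        gains = [energy(j, b[j]) - energy(j, b[j] + step) for j in range(len(b))]
        j = max(range(len(b)), key=gains.__getitem__)
        if gains[j] <= 0:
            break                                  # no client benefits anymore
        b[j] += step
        leftover -= step
    return b
```

With a toy energy model E_j(B_j) = c_j / B_j, the leftover chunk goes to the client with the larger energy coefficient.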
The impact of the number of receive antennas at the server on the convergence speed of the testing accuracy is shown in Table 4. The convergence speed is defined as the number of communication rounds required to reach a certain accuracy threshold. It is clear that increasing the number of receive antennas N_r from one to four significantly reduces the required number of communication rounds. This is because more receive antennas improve the system reliability and decrease the outage probability, which relaxes the constraints on the system resources and allows more clients to be scheduled per communication round, thereby improving learning accuracy. For example, in the case of non-iid data distribution, increasing the number of receive antennas from one to four reduces the number of communication rounds by 42% and 53% for the MNIST and Fashion-MNIST datasets, respectively.
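The convergence-speed measure used in Table 4 is simply the first communication round at which the test-accuracy history crosses the target; a minimal helper:

```python
def rounds_to_accuracy(acc_history, target):
    """Number of communication rounds needed for the test accuracy to first
    reach `target`; None if the target is never reached."""
    for t, acc in enumerate(acc_history, start=1):
        if acc >= target:
            return t
    return None
```

Applied to two runs (e.g., N_r = 1 vs. N_r = 4), the relative reduction in rounds is `1 - r4 / r1`.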
Finally, we investigate the impact of differential transmission (which does not require CSI at the server or the clients) on resource allocation and model accuracy. As observed in Fig. 14, the accumulated total energy consumption when using differential transmission is around 8% lower than in the coherent transmission case, where CSI is required at the receiver. Initially, this result might seem surprising because of the well-known 3 dB SNR penalty of differential transmission: differential transmission consumes more communication energy per client to meet the same outage probability constraint as coherent transmission. However, it also means that the selected clients are allocated more bandwidth, which decreases the total number of clients per cluster, given the maximum client transmit power level constraint, as depicted in Fig. 15.
It is also worth mentioning that the well-known 3 dB SNR gap between coherent and differential PSK does not translate into a 3 dB energy gap here. (For DPSK, the effective noise variance is doubled compared to coherent PSK, resulting in a 3 dB SNR gap between them at a given error rate.) In fact, the average total (communication plus computation) energy consumption difference between the coherent and differential cases is less than 3 dB for Clients 1 to 6, because scheduling one less client in the cluster in the differential case makes sufficient bandwidth available to reduce the energy consumption. Moreover, for Client 7, the total energy consumption of coherent PSK is actually higher than that of differential PSK, because the computation energy increase in the former exceeds the communication energy increase in the latter. This makes differential transmission an appealing choice for green federated learning, especially for mobile clients, where accurate CSI estimation and tracking becomes more challenging.
It is also interesting to note that the reduction in the total number of clients per cluster has a negligible impact on the test accuracy, since the average number of clients per cluster is only decreased by one client as shown in Figure 16.

VI. CONCLUSION
In this paper, we proposed a new energy-efficient framework for federated learning under outage, latency, and bandwidth constraints for the two cases of partial CSI at the server only and no CSI. Our framework consists of three steps. In the first step, we determined the minimum required bandwidth to ensure the feasibility of our formulated resource allocation problem given the total bandwidth and latency constraints. Then, we studied the relation between the minimum allocated bandwidth, the required number of CPU cycles to train the local model, and the distance between the client and the server. Given the limited resources and the maximum outage constraint, we defined a client feasibility metric based on the system parameters to detect the stragglers.
In the second step of our framework, the non-straggler clients are grouped into clusters in a fair manner, and only a single client cluster uses FDMA to communicate with the server per communication round to save total energy. In the following communication rounds, the other client clusters participate in the global FL update in a round-robin fashion. In the third step of our framework, we formulated and solved a resource allocation problem over each cluster to optimally assign the communication and computation resources to each client within the cluster to minimize his total energy consumption under the outage, bandwidth, and latency constraints.
We applied our proposed client selection and resource optimization algorithms to the MNIST and Fashion-MNIST datasets in iid and non-iid settings. We studied the impact of the client feasibility threshold in terms of convergence speed and learning accuracy. Our proposed client selection algorithm can schedule a smaller number of clients per communication round compared to the baseline case, which leads to a reduction in the total energy consumption while still maintaining the same learning accuracy. In addition, we demonstrated the total energy saving advantage of our proposed optimal bandwidth allocation over uniform bandwidth allocation across the FL communication rounds. Then, we quantified the impact of increasing the number of receive antennas at the server on the convergence speed at a given learning accuracy. For example, with four receive antennas at the server, the training speed became 42% faster than with one receive antenna for the non-iid case of the MNIST dataset at a testing accuracy of 97.64%. Finally, we extended our resource allocation algorithm to the case of no CSI at both the server and client ends to avoid the pilot overhead and complexity of CSI estimation while still enjoying comparable energy consumption and learning accuracy.
An interesting direction for future research includes investigating the impact of quantizing the model parameters on the total energy minimization in the case of DPSK, as this can lead to a significant decrease in the communication overhead at the price of reduced test accuracy. Another direction is to enforce fairness between the clients in terms of energy consumption. Here, we can consider non-uniform quantization between clients subject to learning accuracy constraints where clients with large local datasets may quantize their models using fewer bits than clients with small datasets.

APPENDIX A DECISION-POINT SNR OF DIFFERENTIAL TRANSMISSION
The received signal at the server side is
$$\mathbf{y}_s = \mathbf{h}\,x_s + \mathbf{n}_s,$$
where $\mathbf{y}_s$ is the $N_r \times 1$ received signal vector, $\mathbf{h}$ is the $N_r \times 1$ channel vector, $x_s$ is the transmitted DPSK symbol, and $\mathbf{n}_s$ is the $N_r \times 1$ noise vector at time index $s$.
To demodulate the received symbols differentially, we multiply the received signal at time index $s$ by the conjugate of the received signal at time index $s-1$:
$$z_s = \mathbf{y}_{s-1}^{H}\mathbf{y}_s = \|\mathbf{h}\|^2 x_{s-1}^{*} x_s + x_{s-1}^{*}\,\mathbf{h}^{H}\mathbf{n}_s + x_s\,\mathbf{n}_{s-1}^{H}\mathbf{h} + \mathbf{n}_{s-1}^{H}\mathbf{n}_s,$$
where $\|\mathbf{h}\|^2 = |h_1|^2 + |h_2|^2 + \cdots + |h_{N_r}|^2$. With differential encoding $x_s = x_{s-1} u_s$, the signal energy at the decision point is
$$E_{\text{sig}} = \left(|h_1|^2 + |h_2|^2 + \cdots + |h_{N_r}|^2\right)^2 |x_{s-1}|^4 |u_s|^2, \tag{40}$$
where $|u_s|^2 = 1$ because the information symbols are drawn from a PSK signal constellation. Similarly, the noise energy is given by
$$E_{\text{noise}} = 2\,\|\mathbf{h}\|^2 E_x \sigma_n^2 + N_r \sigma_n^4,$$
where $\sigma_n^2$ is the AWGN noise variance and $E_x = |x_{s-1}|^2$. Therefore, the decision-point SNR is given by
$$\gamma_D = \frac{\|\mathbf{h}\|^4 E_x^2}{2\,\|\mathbf{h}\|^2 E_x \sigma_n^2 + N_r \sigma_n^4} = \frac{\|\mathbf{h}\|^4 \rho^2}{2\,\|\mathbf{h}\|^2 \rho + N_r},$$
where $\rho = \frac{E_x}{\sigma_n^2}$ is the average SNR. At high SNR, the decision-point SNR can be approximated as
$$\gamma_D \approx \frac{1}{2}\,\|\mathbf{h}\|^2 \rho,$$
which shows the 3 dB SNR penalty (due to the factor of $\frac{1}{2}$) relative to coherent transmission.
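The decision-point SNR and its 3 dB high-SNR penalty can be checked numerically. The closed form below is our reconstruction of the derivation above (it keeps the noise-noise cross term, with variance $N_r \sigma_n^4$), not a verbatim copy of the paper's equation.

```python
def decision_snr_diff(h2, rho, n_r):
    """Reconstructed decision-point SNR of differential detection, in units
    of sigma_n^2: signal energy h2^2 * rho^2 over noise 2*h2*rho + N_r,
    where h2 = ||h||^2 and rho = E_x / sigma_n^2."""
    return h2**2 * rho**2 / (2.0 * h2 * rho + n_r)

def decision_snr_coherent(h2, rho):
    """Decision-point SNR of coherent maximum-ratio combining."""
    return h2 * rho
```

At high average SNR the ratio of the two approaches 1/2, i.e., the 3 dB penalty of differential detection.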

APPENDIX B OUTAGE PROBABILITY OF DIFFERENTIAL TRANSMISSION
Starting with the definition of the outage probability,
$$P_{\text{out}} = \Pr\{R < R_{\text{target}}\} = \Pr\left\{B \log_2\left(1 + \gamma_D\right) < R_{\text{target}}\right\},$$
where $R_{\text{target}}$ is the target rate and $R$ is the achievable rate. By substituting $\rho = \frac{P}{B N_o}$ and $|h|^2 = |h_1|^2 + |h_2|^2 + \cdots + |h_{N_r}|^2$ in (53), the outage probability becomes
$$P^{D}_{\text{out}} = \Pr\left\{|h_1|^2 + |h_2|^2 + \cdots + |h_{N_r}|^2 < \frac{2\left(2^{R_{\text{target}}/B} - 1\right) B N_o}{P}\right\}.$$
The channels between the clients and the server are assumed to be quasi-static Rayleigh fading, which means that the sum of squares of $2N_r$ independent real Gaussian random variables, $(|h_1|^2 + |h_2|^2 + \cdots + |h_{N_r}|^2)$, is distributed as Chi-square with $2N_r$ degrees of freedom. Therefore, the outage probability of differential transmission is given by
$$P^{D}_{\text{out}} = 1 - e^{-x}\sum_{k=0}^{N_r - 1} \frac{x^{k}}{k!}, \qquad x = \frac{2\left(2^{R_{\text{target}}/B} - 1\right) B N_o}{P}.$$
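The outage expression and the small-$x$ approximation $P_{\text{out}} \approx x^{N_r}/N_r!$ behind the $(N_r!\,P_{\text{out,max}})$ factor of Section IV can be evaluated as follows. This is a reconstruction assuming unit-mean per-antenna channel gains; path loss would rescale $x$.

```python
import math

def outage_diff(rho, r_target, b, n_r):
    """Reconstructed outage probability of differential transmission:
    Pr{ ||h||^2 < x } with x = 2*(2^(R/B) - 1)/rho, where ||h||^2 is a sum
    of N_r unit-mean exponentials (chi-square with 2*N_r DOF, normalized)."""
    x = 2.0 * (2.0 ** (r_target / b) - 1.0) / rho
    return 1.0 - math.exp(-x) * sum(x**k / math.factorial(k) for k in range(n_r))

def outage_approx(rho, r_target, b, n_r):
    """High-SNR (small-x) approximation P_out ~ x^N_r / N_r!."""
    x = 2.0 * (2.0 ** (r_target / b) - 1.0) / rho
    return x**n_r / math.factorial(n_r)
```

For small $x$ the approximation tracks the exact chi-square tail closely, which is what lets the outage constraint be inverted into a closed-form power requirement.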