Energy-Efficient Massive MIMO for Federated Learning: Transmission Designs and Resource Allocations

This work proposes novel synchronous, asynchronous, and session-based designs for energy-efficient massive multiple-input multiple-output networks to support federated learning (FL). The synchronous design relies on strict synchronization among users when executing each FL communication round, while the asynchronous design allows more flexibility for users to save energy by using lower computing frequencies. The session-based design splits the downlink and uplink phases in each FL communication round into separate sessions. In this design, we assign users such that one of the participating users in each session finishes its transmission and does not join the next session. As such, more power and degrees of freedom will be allocated to unfinished users, resulting in higher rates, lower transmission times, and hence, higher energy efficiency. In all three designs, we use zero-forcing processing for both uplink and downlink, and develop algorithms that optimize user assignment, time allocation, power, and computing frequencies to minimize the energy consumption at the base station and users, while guaranteeing a predefined maximum execution time of each FL communication round.


I. INTRODUCTION
Over the past few decades, communication systems with the Internet and mobile telephony have brought much convenience to human life [1]-[3]. Recently, the rapid development of artificial intelligence has contributed to the modernization of our world with a wide range of applications such as smart cities and autonomous cars [4]-[6]. However, current communication systems are also facing big challenges. Specifically, since users (UEs) need to send their data over a shared medium, their data privacy can be compromised, as has already happened [7].
At the same time, mobile data traffic is anticipated to increase dramatically during 2020-26, at up to 32% per year [8]. This in turn has led to concerns about energy consumption and carbon emissions, to which communication systems are projected to contribute significantly [9]. For instance, according to the report [10], the information and communication technology sector was estimated to account for 1.4% of global carbon emissions in 2015. More importantly, this portion is likely to grow in the future as the number of internet-of-things devices grows exponentially. Therefore, it is critical for future communication systems not only to be integrated with machine learning applications, but also to preserve privacy and be energy-efficient.
Federated learning (FL) is a distributed learning framework that offers high privacy and communication efficiency [11]-[14]. In particular, no raw data are shared during the learning process. An FL process is jointly implemented by several UEs and a central server. First, the central server sends a global model update to all the UEs. Each UE uses this model update, along with its private training data, to compute its own local learning model update. The UEs then send their local updates back to the central server, which uses them to recompute the global model update. This process is repeated until a certain level of learning accuracy is reached. Here, since the size of the model updates sent over the network is much smaller than that of the raw data, communication efficiency is much improved.
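The round structure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the learning task is a noiseless least-squares problem, the server aggregates by plain averaging of the local updates, and the names `local_update` and `fl_round` are hypothetical. None of these choices is prescribed by the FL frameworks cited in this paper.

```python
import numpy as np

def local_update(w_global, X, y, lr=0.1, num_local_rounds=5):
    """Run a few gradient steps on private data and return the model change."""
    w = w_global.copy()
    for _ in range(num_local_rounds):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient (assumed task)
        w -= lr * grad
    return w - w_global                      # only the update leaves the UE, not the data

def fl_round(w_global, datasets):
    """One communication round: broadcast, local computation, upload, aggregation."""
    updates = [local_update(w_global, X, y) for X, y in datasets]
    return w_global + np.mean(updates, axis=0)   # assumed aggregation rule: averaging

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
datasets = []
for _ in range(4):                           # four UEs, each with private samples
    X = rng.standard_normal((50, 2))
    datasets.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(200):                         # repeat rounds until the model is accurate
    w = fl_round(w, datasets)
```

Each round moves only model-sized vectors between the UEs and the server, which is the source of the communication efficiency noted above.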

A. Review of Related Literature
In the literature, only a few works study energy-efficient implementations of FL over wireless networks, e.g., [15]-[21] and references therein. These papers can be categorized into learning-oriented and communication-oriented directions. The learning-oriented direction seeks learning solutions to reduce the energy consumed in the networks. In particular, [15] proposes an FL algorithm that adapts the compression parameters to minimize energy consumption at UEs. The work of [16] proposes a novel joint dataset and computation management scheme that trades off between learning accuracy and energy consumption for energy-efficient FL in mobile edge computing. Reference [17] introduces a federated meta-learning algorithm together with a resource allocation scheme to jointly improve convergence rate and minimize energy cost. Finally, [18] develops a SignSGD-based FL algorithm where local processing and communication parameters are chosen to achieve a desired balance between learning performance and energy consumption.
The communication-oriented direction does not propose new FL algorithms, but rather develops communication protocols and system designs to reduce the energy consumption of an FL process run over a wireless network [19]-[22]. Compared to the learning-oriented direction, the communication-oriented direction gives more insight into how FL should be implemented at the physical layer. Specifically, [19] minimizes energy consumption at user devices by optimally allocating bandwidth, power, and computing frequency. Reference [20] proposes another resource allocation algorithm for FL networks, in which each user is equipped with a CPU-GPU platform for heterogeneous computing. The authors in [22] propose a joint communication and learning framework that improves the learning performance while keeping the energy consumption acceptable on each user device. The work of [21] designs a network with unmanned aerial vehicles and wireless powered communications to provide an energy-efficient FL solution.

B. Research Gap and Main Contributions
The ongoing research efforts in the communication-oriented direction have mainly used frequency-division multiple access (FDMA) to support FL. The drawback of FDMA networks is that the spectral and energy efficiencies are very low when the channel is shared by many users. It is therefore desirable to propose a novel network design to implement FL frameworks with a much higher energy efficiency.
This research gap in the literature has motivated us to consider a massive multiple-input multiple-output (mMIMO) network to implement wireless FL in an energy-efficient manner. The use of massive MIMO to support FL has been shown to be very efficient [23]-[28], compared to conventional FDMA or time-division multiple access (TDMA) schemes. The main reasons for this are: (i) massive MIMO can simultaneously serve many users; (ii) massive MIMO offers huge spectral efficiencies, and hence, can significantly reduce the training time; and (iii) massive MIMO provides high energy efficiency [29]. As a result, massive MIMO fits well with federated learning applications that require a large number of energy-efficient and low-latency transmissions between user devices and the server at the same time (e.g., a camera network of augmented reality users in the same cell building a model for object detection and classification, a vehicular network of clients equipped with various sensors building a model for image classification [11]).
The specific contributions of this paper are summarized as follows:
• To support FL over wireless networks, we propose to use mMIMO and let each FL communication round be executed within one large-scale coherence time¹. Owing to a high array gain and multiplexing gain, mMIMO can offer very high data rates to all UEs simultaneously in the same frequency band [30]. Therefore, it is expected to guarantee a stable operation during each communication round (and hence the whole FL process).
• We introduce three novel transmission designs for the steps within one FL communication round. The downlink (DL) transmission, the computation at the UEs, and the uplink (UL) transmission are implemented in a synchronous, asynchronous, or session-based manner. The synchronous design strictly synchronizes UEs in each step of one FL communication round. The asynchronous design allows more flexibility for UEs to save energy by using lower computing frequencies. The session-based design splits the DL and UL steps into separate sessions. The UEs are then assigned such that one of the participating UEs in each session completes its transmission and does not join subsequent sessions. This design allows more power to be allocated to the uncompleted UEs. This results in higher rates, lower transmission times, and higher energy efficiency. In all three designs, both DL and UL transmissions use a dedicated pilot assignment scheme for channel estimation and zero-forcing (ZF) processing.
• For each proposed transmission design, we formulate a problem of optimizing user assignment, time allocation, transmit power, and computing frequency to minimize the total energy consumption in each FL communication round, subject to a quality-of-service constraint. The formulated problems are challenging due to their nonconvex and combinatorial (mixed-integer) nature. Existing solutions to problems in standard massive MIMO systems cannot be used in a straightforward manner to solve the formulated problems. As such, we propose novel algorithms that are proven to converge to stationary points, i.e., Fritz John and Karush-Kuhn-Tucker solutions, of the formulated problems.
• We show by numerical results that our proposed designs significantly reduce the energy consumption per FL communication round compared to heuristic baseline schemes. The presented numerical results also confirm that the session-based design outperforms the synchronous and asynchronous designs.
It is noted that the idea of the proposed synchronous design is similar to the transmission scheme in [23]. However, the resource allocation algorithm in [23] for minimizing the FL training time in a cell-free massive MIMO network cannot be straightforwardly applied to solve the more complex problem of minimizing the energy consumption of an mMIMO network, as treated in this work. On the other hand, the proposed synchronous and asynchronous designs are different from those in [31]. They use dedicated pilot assignment and ZF processing for each UE, while those in [31] use co-pilot assignment and ZF processing for each group of UEs. These key distinctions result in major differences in the respective problem formulations and algorithms for resource allocation.
Notation: We use boldface symbols for vectors and capitalized boldface symbols for matrices. $\mathbb{R}^d$ denotes the space of real vectors of length $d$. $\mathbf{X}^*$ and $\mathbf{X}^H$ represent the conjugate and conjugate transpose of a matrix $\mathbf{X}$, respectively. $\mathcal{CN}(\mathbf{0}, \mathbf{Q})$ denotes the circularly symmetric complex Gaussian distribution with zero mean and covariance $\mathbf{Q}$. $\mathbb{E}\{x\}$ denotes the expected value of a random variable $x$.

II. NOVEL MASSIVE MIMO DESIGNS TO SUPPORT FEDERATED LEARNING NETWORKS
In this work, we focus on the optimization of communication resources in a massive MIMO wireless network that supports FL applications. Specifically, we consider the use of a standard FL algorithm and develop optimized transmission designs that support this FL framework. We consider a network that supports FL algorithms with a synchronous aggregation mode². In general, such an FL network includes a group of UEs and a central server. Each FL communication round includes $K$ UEs and the following four basic steps [33]-[38]:
(S1) The central server sends a global update to the UEs.
(S2) Each UE computes its local update using the received global update and its local data.
(S3) Each UE sends its local update to the central server.
(S4) The central server recomputes the global update using all the received local updates.
The above process repeats until a certain level of learning accuracy is attained. Details on the local and global updates along with their associated computations are thoroughly discussed in [33]-[38]. We assume that before our proposed schemes are undertaken, all the UEs that participate in each FL communication round have sufficient computational capabilities to update their models. This assumption is widely accepted in the literature on wireless network designs for supporting federated learning, e.g., [19]-[21], [39], [40] and references therein.
We note that the aggregation of local updates of the UEs can be performed by two approaches. The first approach makes the aggregation in the digital domain [19]-[21], [39], [40], and is called DigComp. The second approach leverages the signal superposition property to aggregate in the analog domain and is called over-the-air computation (AirComp) [24], [41]-[46]. While DigComp leverages the capability of traditional digital transmission in wireless systems that are deployed and standardized, AirComp is an emerging approach which is still under basic development and not yet supported by cellular systems [47]. Most existing works using AirComp require the UEs to acquire CSI, which in itself is a very challenging task. Research on wireless network designs using AirComp without CSI acquisition is still in its infancy [24], [42], [43]. In this work, we follow the DigComp approach and propose energy-efficient transmission designs for massive MIMO systems to support FL. The topic of using the AirComp approach for energy-efficient transmission designs to support FL is left for future work.

² FL algorithms with the synchronous aggregation mode wait to receive all local model updates sent from users before aggregation, while the FL algorithms with the asynchronous aggregation mode do not. The FL algorithms with synchronous aggregation normally outperform the FL algorithms operating with asynchronous aggregation in terms of convergence rate and accuracy. Research on improvement of learning performance of the FL algorithms with asynchronous aggregation is still in its infancy, while FL algorithms with synchronous aggregation are well studied [32]. Therefore, our paper focuses on transmission protocols supporting FL with synchronous aggregation [33]-[38].

A. Proposed Transmission Designs to Support Federated Learning Networks
To support FL in the network, we propose to use mMIMO technology where the BS acts as the central server. Accordingly, Steps (S1) and (S3) of each FL communication round take place over the DL and UL of the mMIMO system, respectively. Each FL communication round is assumed to be executed within a channel large-scale coherence time, which is a reasonable assumption for typical network scenarios [23], [25], [48]. Under this assumption, we propose the following transmission schemes³ to support Steps (S1)-(S3) of each FL communication round:
1) Synchronous Design: As shown in Fig. 1(a), the synchronous design requires a certain degree of synchronization among the UEs when executing the steps of one FL communication round. In particular, the UEs are synchronized so that Steps (S2) and (S3) each start simultaneously at all UEs. The UEs' rates are taken to be the achievable rates when the transmissions of all the UEs are active.
2) Asynchronous Design: Compared with the synchronous design, the asynchronous design uses the same rate assignment scheme. The DL (UL) rate of each user is kept fixed for the whole DL (UL) mode. However, the asynchronous design has a different transmission protocol. The asynchronous design only requires the UEs to start Step (S1) simultaneously. As shown in Fig. 1(b), UEs have more flexibility in executing Steps (S1)-(S3). This is because they can transmit their local model updates in Step (S3) immediately after they complete Step (S2), as long as their UL transmission is performed during the BS UL mode. Thus, the UEs in the asynchronous design need not wait for other UEs, as is the case in the synchronous design. Instead, they can use the waiting time to compute their local model updates with a lower clock frequency to save energy. Also, thanks to the flexible synchronization requirement among the UEs, the asynchronous design has a significantly lower signalling overhead compared to the synchronous design, especially when the number of UEs is large.
3) Session-based Design: In the asynchronous design, the DL (UL) rate of each user is kept fixed for the whole DL (UL) duration. This is not efficient because, for each mode, after some time, some users may complete their transmissions. Hence, other users can increase their rates owing to the reduced level of interference and increased availability of power (on DL). Based on this observation, we propose the session-based design in Fig. 1(c). Here, instead of using one single session for each of Steps (S1) and (S3), we use multiple sessions to serve UEs in Steps (S1) and (S3). After each session, one user completes its transmission, and the rates of other users are adapted accordingly. Since there are fewer UEs competing for power in each session, more power can be allocated to the UEs that have not yet completed their transmissions. In addition, the inter-user interference is reduced, which leads to higher rates, faster transmission and better energy efficiency compared to the other designs.

³ UE selection could be beneficial for improving the energy efficiency of the system, especially in the case that some UEs have very bad channel conditions. However, UE selection reduces the number of UEs that participate in the FL process, and hence, would affect the FL performance (i.e., test accuracy) [41]. Since we mainly focus on the communication aspects in a standard FL framework, we do not incorporate the UE selection process into our proposed transmission designs, but assume that all $K$ UEs participate in each FL communication round. This assumption is made in much of the literature on wireless network design for support of federated learning, e.g., [19]-[21], [39], [40]. More importantly, although we do not take into account the UE selection part in the transmission designs, our proposed transmission schemes can still be used to support FL frameworks that have UE selection in their FL algorithms. Specifically, in each communication round of such FL algorithms, different values of $K$ and different UEs can be selected from a larger pool of UEs using the UE selection scheme in the FL algorithm. Then, our optimization problems can be reformulated for the given new $K$ UEs without any changes in their mathematical structure.

III. SYSTEM MODELS
This section provides detailed system models for the proposed designs. As discussed in Section II, because the synchronous and asynchronous designs use the same rate assignment scheme, their system models are similar. On the other hand, as can be seen from Fig. 1, the asynchronous design is a special case of the session-based design with a single session. Based on these observations, the system model of the session-based design is therefore provided as a general model, followed by the specific models for the asynchronous and synchronous designs.
In the considered mMIMO model, a BS equipped with $M$ antennas serves $K$ UEs, each equipped with a single antenna, at the same time and in the same frequency bands, using time-division duplexing. The channel vector from a UE $k$ to the BS is denoted by $\mathbf{g}_k = (\beta_k)^{1/2} \tilde{\mathbf{g}}_k$, where $\beta_k$ and $\tilde{\mathbf{g}}_k \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_M)$ are the corresponding large-scale fading coefficient and small-scale fading coefficient vector, respectively. In this work, we consider low mobility scenarios with a large coherence interval $\tau_c$. Each FL communication round is executed in one large-scale coherence time [23] (see Fig. 2). The DL transmission for the global update in Step (S1) and the UL transmission for the local update in Step (S3) span multiple (small-scale) coherence times.
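As a minimal sketch of the channel model above, the snippet below draws one realization of the channel vectors by scaling i.i.d. unit-variance complex Gaussian entries with the square root of the large-scale fading coefficients. The function name `sample_channels` and the numerical values of the coefficients are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_channels(beta, num_antennas, rng):
    """Draw one realization of the M x K channel matrix [g_1, ..., g_K]."""
    beta = np.asarray(beta, dtype=float)
    shape = (num_antennas, beta.size)
    # Small-scale fading: i.i.d. CN(0, 1) entries (variance 1/2 per real dimension).
    g_tilde = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)
    # Large-scale fading scales column k by sqrt(beta_k).
    return np.sqrt(beta) * g_tilde

rng = np.random.default_rng(1)
beta = np.array([1.0, 0.25, 0.01])           # hypothetical large-scale coefficients
G = sample_channels(beta, num_antennas=20000, rng=rng)
avg_gain = np.mean(np.abs(G) ** 2, axis=0)   # per-UE average gain, close to beta
```

The large number of antennas is used here only so that the empirical per-UE gain visibly concentrates around the large-scale coefficient.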

1) Step (S1): The BS sends the parameter vector to all the UEs in $K$ sessions. Each coherence block of this step involves two phases: UL channel estimation and DL payload data transmission. Define an indicator $a_{k,i}$ as
$$a_{k,i} \triangleq \begin{cases} 1, & \text{if UE } k \text{ is served in session } i, \\ 0, & \text{otherwise.} \end{cases}$$
Let $\mathcal{K}_i \triangleq \{k \mid a_{k,i} = 1\}$ be the set of $K_i = \sum_{k \in \mathcal{K}} a_{k,i}$ UEs served in a session $i \in \mathcal{K}$. The indicators are constrained to make sure that all the UEs are served in session 1. Also, in each of the subsequent sessions, one UE is instructed to finish its transmission such that it does not join the next sessions. Doing this helps the UEs that are yet to finish their transmissions in that they get assigned more power and experience a lower level of inter-user interference, which translates into higher data rates.
Here, the asynchronous and synchronous designs are considered as the same special case in which all the UEs are served in a single session, $i = 1$, and $a_{k,1} = 1$, $\forall k$.
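UL channel estimation: For each coherence block of length $\tau_c$, each UE sends its dedicated pilot of length $\tau_{d,p}$ to the BS. We assume that the pilots of all UEs are pairwise orthogonal, which requires $\tau_{d,p} \geq K$.⁴ At the BS, the channel $\mathbf{g}_k$ between a UE $k$ and the BS is estimated by using the received pilots and minimum mean-square error (MMSE) estimation. The MMSE estimate $\hat{\mathbf{g}}_k$ of $\mathbf{g}_k$ is distributed as $\mathcal{CN}(\mathbf{0}, \hat{\sigma}_k^2 \mathbf{I}_M)$, where $\hat{\sigma}_k^2 = \frac{\tau_{d,p} \rho_p \beta_k^2}{\tau_{d,p} \rho_p \beta_k + 1}$ and $\rho_p$ is the normalized transmit power of each pilot symbol [30, (3.8)]. We also denote by $\hat{\mathbf{G}}_i \triangleq [\ldots, \hat{\mathbf{g}}_k, \ldots]$, $\forall k \in \mathcal{K}_i$, the matrix obtained by stacking the channel estimates of all participating UEs in a session $i$.
DL payload data transmission: We assume that the BS uses a unicast scheme and ZF precoding to transmit the global training update to the $K$ UEs. Let $s_{d,k,i}$, where $\mathbb{E}\{|s_{d,k,i}|^2\} = 1$, be the symbol intended for a UE $k$ in a session $i$. The BS transmits these symbols using ZF precoding vectors formed from $\hat{\mathbf{G}}_i$, where $\eta_{k,i}$ is a power control coefficient, $\mathbf{e}_{k,K_i}$ is the $k$-th column of $\mathbf{I}_{K_i}$, and $\rho_d$ is the maximum normalized transmit power at the BS. Note that ZF requires $M \geq K_i$. The transmitted power at the BS must meet the average normalized power constraint
$$\sum_{k \in \mathcal{K}} \eta_{k,i} \leq 1, \ \forall i.$$
Here, we have to ensure that no power is allocated to the UEs that are not served in the session $i$, i.e., $\eta_{k,i} = 0$ if $a_{k,i} = 0$. The achievable rate of the UE $k$ in the session $i$ is given by
$$R_{d,k,i}(\boldsymbol{\eta}_i) = \frac{\tau_c - \tau_{d,p}}{\tau_c} B \log_2\big(1 + \mathrm{SINR}_{d,k,i}(\boldsymbol{\eta}_i)\big),$$
where $B$ is the transmission bandwidth, $\boldsymbol{\eta}_i \triangleq \{\eta_{k,i}\}_{k \in \mathcal{K}}$, and
$$\mathrm{SINR}_{d,k,i}(\boldsymbol{\eta}_i) = \frac{(M - K_i) \rho_d \hat{\sigma}_k^2 \eta_{k,i}}{\rho_d (\beta_k - \hat{\sigma}_k^2) \sum_{\ell \in \mathcal{K}} \eta_{\ell,i} + 1}$$
is the effective DL signal-to-interference-plus-noise ratio (SINR) [30, (3.56)].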
Similarly, the power constraint at the BS and the achievable rate at the UE $k$ in the asynchronous and synchronous designs are given as
$$\sum_{k \in \mathcal{K}} \eta_k \leq 1,$$
$$R_{d,k}(\boldsymbol{\eta}) = \frac{\tau_c - \tau_{d,p}}{\tau_c} B \log_2\big(1 + \mathrm{SINR}_{d,k}(\boldsymbol{\eta})\big),$$
where $\boldsymbol{\eta} \triangleq \{\eta_k\}_{k \in \mathcal{K}}$ are the power control coefficients, and
$$\mathrm{SINR}_{d,k}(\boldsymbol{\eta}) = \frac{(M - K) \rho_d \hat{\sigma}_k^2 \eta_k}{\rho_d (\beta_k - \hat{\sigma}_k^2) \sum_{\ell \in \mathcal{K}} \eta_\ell + 1}.$$
The same global training update can be coded differently for different UEs to improve the spectral efficiency of the DL transmission. Specifically, it can be transmitted by either a multicast scheme or a unicast scheme [50]. As shown in Fig. 5 of [50], in a massive MIMO system where the same message is sent to all users, the scheme using unicast and ZF is recommended in almost all cases, except when the coherence interval is short (small $\tau_c$) or the number of antennas $M$ at the BS is small. In our paper, we consider low mobility scenarios (i.e., large $\tau_c$) with a large value of $M$. Therefore, we choose unicast and ZF precoding for our transmission scheme. We verify the advantage of this choice over the multicast scheme in Fig. 3, which compares the unicast scheme and the multicast schemes in a single group of UEs. From the figure, in terms of per-UE rates, the unicast scheme with dedicated pilots significantly outperforms the multicast counterparts under both dedicated pilot and co-pilot assignment designs. We also note that the difference in the global training update for each user is the difference of the symbols that encode the same global training update for different users. There is no change in the FL model of the standard FL framework discussed in Section II-A. On the other hand, ZF precoding, while simple, performs very close to the optimal precoding in massive MIMO [30], [51]. That is why ZF precoding is employed in this paper, to achieve both simplicity and good performance.
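To illustrate why finishing a UE early helps the remaining ones, the following sketch evaluates per-UE DL rates under ZF. It assumes the standard textbook closed form for ZF with MMSE channel estimates and orthogonal pilots (the expression the text cites as [30, (3.56)]), together with a pre-log factor that discounts the pilot overhead. The function names and all numerical values are hypothetical, and the normalization of the coefficients follows the textbook convention rather than any setting reported in this paper.

```python
import numpy as np

def estimate_variance(beta, tau_p, rho_p):
    """MMSE estimate variance per UE for orthogonal pilots (assumed textbook form)."""
    beta = np.asarray(beta, dtype=float)
    return tau_p * rho_p * beta**2 / (tau_p * rho_p * beta + 1.0)

def dl_zf_rates(eta, beta, num_antennas, rho_d, tau_c, tau_p, rho_p, bandwidth):
    """Per-UE DL rates in bit/s for one session; UEs with eta == 0 are not served."""
    eta = np.asarray(eta, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if np.any(eta < 0) or eta.sum() > 1.0 + 1e-12:
        raise ValueError("eta must be nonnegative and sum to at most 1")
    num_served = int(np.count_nonzero(eta))          # plays the role of K_i
    if num_antennas < num_served:
        raise ValueError("ZF needs at least as many antennas as served UEs")
    sigma2 = estimate_variance(beta, tau_p, rho_p)
    # Residual interference caused by channel estimation errors.
    interference = rho_d * (beta - sigma2) * eta.sum()
    sinr = (num_antennas - num_served) * rho_d * sigma2 * eta / (interference + 1.0)
    prelog = (tau_c - tau_p) / tau_c                 # fraction of each block left for data
    return prelog * bandwidth * np.log2(1.0 + sinr)

params = dict(beta=[1.0, 0.5, 0.1], num_antennas=64, rho_d=10.0,
              tau_c=200, tau_p=3, rho_p=1.0, bandwidth=20e6)
rates_all = dl_zf_rates(eta=[1 / 3, 1 / 3, 1 / 3], **params)   # all three UEs served
rates_two = dl_zf_rates(eta=[0.0, 0.5, 0.5], **params)         # first UE has finished
```

Once the first UE drops out, the freed power and the extra degree of freedom raise the rates of the two remaining UEs, which is the effect the session-based design exploits.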
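DL delay: Let $S_d$ and $S_{d,k,i}$ be the size of the global model update and the size of the split data of the update intended for a UE $k$ in a session $i$, respectively. Then, we have
$$S_d = \sum_{i \in \mathcal{K}} S_{d,k,i}, \ \forall k. \quad (7)$$
Let $t_{d,i}$ be the length (in seconds) of the session $i$. Then from Fig. 1(c), the transmission time $t_{d,k,i}$ to the UE $k \in \mathcal{K}$ in the session $i$ of the session-based design is given by
$$t_{d,k,i}(a_{k,i}, t_{d,i}) = a_{k,i} t_{d,i}, \ \forall k, i, \quad (8)$$
$$S_{d,k,i} = R_{d,k,i}(\boldsymbol{\eta}_i) t_{d,i}, \ \forall k, i. \quad (9)$$
Clearly, (9) also implies that ($S_{d,k,i} = 0$, if $a_{k,i} = 0$), $\forall k, i$, which ensures that no data is sent to the UEs not served in session $i$. The transmission time to UE $k \in \mathcal{K}$ in the asynchronous and synchronous designs is expressed as $t_{d,k}(\boldsymbol{\eta}) = S_d / R_{d,k}(\boldsymbol{\eta})$, $\forall k$.
Energy consumption for the DL transmission: Denote by $N_0$ the noise power. The energy consumption for transmitting the global update or its split data to a UE $k$ is the product of the transmit power $\rho_d N_0 \eta_k$ or $\rho_d N_0 \eta_{k,i}$ and the transmission time to the UE $k$. Therefore, the total energy consumption for transmission by the BS in a session $i$ of the session-based design is $E_{d,i} = \sum_{k \in \mathcal{K}} \rho_d N_0 \eta_{k,i} t_{d,k,i}$, $\forall i$, and that in the asynchronous and synchronous designs is $E_d = \sum_{k \in \mathcal{K}} \rho_d N_0 \eta_k t_{d,k}$. Here, $\mathbf{a}_i \triangleq \{a_{k,i}\}_{k \in \mathcal{K}}$ and $\mathbf{t}_d \triangleq \{t_{d,i}\}_{i \in \mathcal{K}}$.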
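The delay and energy relations of this step can be sketched numerically. The example assumes that every UE must receive the full update across its sessions and that the data delivered in a session equals rate times session length, so that for fixed rates the session lengths solve a $K \times K$ linear system. The per-session BS energy is taken as transmit power times session length, summed over the served UEs. The function names and all numbers are hypothetical.

```python
import numpy as np

def session_durations(rates, data_size):
    """Solve sum_i rates[k, i] * t[i] = data_size for the session lengths t."""
    rates = np.asarray(rates, dtype=float)
    rhs = np.full(rates.shape[0], float(data_size))
    return np.linalg.solve(rates, rhs)

def bs_energy_per_session(eta, served, durations, rho_d, noise_power):
    """BS energy per session: power spent on the served UEs times the session length."""
    eta = np.asarray(eta, dtype=float)
    served = np.asarray(served, dtype=float)
    power = rho_d * noise_power * (eta * served).sum(axis=0)   # watts in each session
    return power * np.asarray(durations, dtype=float)

# rates[k, i]: rate of UE k in session i (bit/s); zero once the UE has finished.
rates = 1e6 * np.array([[6.0, 0.0, 0.0],
                        [3.0, 4.0, 0.0],
                        [2.0, 3.0, 6.0]])
served = rates > 0
eta = np.array([[0.2, 0.0, 0.0],
                [0.3, 0.4, 0.0],
                [0.5, 0.6, 1.0]])            # power control; each column sums to 1
t_d = session_durations(rates, data_size=12e6)
energy = bs_energy_per_session(eta, served, t_d, rho_d=1e12, noise_power=1e-12)
```

In this toy setting UE 0 finishes after the first session, UE 1 after the second, and UE 2 after the third, with each remaining UE enjoying a higher rate in later sessions.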

2) Step (S2): After receiving the global update, each UE uses its local data set to execute $L$ local computing rounds in order to compute its local update. The model of this step is used in all the proposed designs.
Local computation: Let $c_k$ (cycles/sample) be the number of processing cycles for a UE $k$ to process one data sample [37]. Denote by $D_k$ (samples) and $f_k$ (cycles/s) the size of the local data set and the processing frequency of the UE $k$, respectively. The computation time at the UE $k$ is then given by $t_{C,k}(f_k) = L D_k c_k / f_k$ [37].
Energy consumption for local computing at the UEs: The energy consumed by the UE $k$ to compute its local training update is given as $E_{C,k}(f_k) = L \frac{\alpha}{2} c_k D_k f_k^2$, where $\frac{\alpha}{2}$ is the effective capacitance coefficient of the UEs' computing chipset [23], [37].

3) Step (S3): Define an indicator $b_{k,j}$ as
$$b_{k,j} \triangleq \begin{cases} 1, & \text{if UE } k \text{ sends its data in session } j, \\ 0, & \text{otherwise.} \end{cases}$$
Let $\mathcal{N}_j \triangleq \{k \mid b_{k,j} = 1\}$ be the set of $N_j = \sum_{k \in \mathcal{K}} b_{k,j}$ participating UEs in a session $j \in \mathcal{K}$. Here, we have to guarantee that all the UEs finish their transmissions in the last session $K$ and that each session has one more UE sending its data than the previous one. Doing this helps the UEs that start their transmissions earlier. They can have more power, which yields higher achievable rates, lower delays, and thus, potentially lower transmission energy in each FL communication round. Note that in the asynchronous and synchronous designs, there is only one session $j = K$, and hence, $\mathcal{N}_j = \mathcal{K}$, $N_j = K$, and $\{b_{k,j}\}$ are not variables but constants, i.e., $b_{k,K} = b_k = 1$, $\forall k$.
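Returning to the local computation model of Step (S2), the trade-off between computing time and computing energy can be sketched as below. The sketch assumes the commonly used CPU model in which the computation time is the total number of cycles divided by the clock frequency, $L D_k c_k / f_k$, and the energy is $L \frac{\alpha}{2} c_k D_k f_k^2$. The function names and the numerical values (including the capacitance coefficient) are hypothetical.

```python
def computation_time(num_rounds, num_samples, cycles_per_sample, freq_hz):
    """Seconds needed for L local rounds over D_k samples at clock frequency f_k."""
    return num_rounds * num_samples * cycles_per_sample / freq_hz

def computation_energy(num_rounds, num_samples, cycles_per_sample, freq_hz, alpha):
    """Joules consumed: L * (alpha / 2) * c_k * D_k * f_k**2."""
    return num_rounds * 0.5 * alpha * cycles_per_sample * num_samples * freq_hz**2

ue = dict(num_rounds=5, num_samples=1000, cycles_per_sample=2e4)   # hypothetical UE
t_fast = computation_time(freq_hz=1e9, **ue)
t_slow = computation_time(freq_hz=0.5e9, **ue)
e_fast = computation_energy(freq_hz=1e9, alpha=2e-28, **ue)
e_slow = computation_energy(freq_hz=0.5e9, alpha=2e-28, **ue)
```

Halving the clock frequency doubles the computing time but cuts the computing energy to a quarter, which is why a UE with spare time in its round budget can save energy by computing more slowly.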
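Uplink channel estimation: In each coherence block, each UE sends its pilot of length $\tau_{u,p}$ to the BS. We assume that the pilots of all the UEs are pairwise orthogonal, which requires the pilot lengths to satisfy $\tau_{u,p} \geq N_j$. The MMSE estimate $\bar{\mathbf{g}}_k$ of $\mathbf{g}_k$ is distributed according to $\mathcal{CN}(\mathbf{0}, \bar{\sigma}_k^2 \mathbf{I}_M)$, where $\bar{\sigma}_k^2 = \frac{\tau_{u,p} \rho_p \beta_k^2}{\tau_{u,p} \rho_p \beta_k + 1}$.
UL payload data transmission: After computing the local update, a UE $k$ encodes this update into symbols denoted by $s_{u,k,j}$, where $\mathbb{E}\{|s_{u,k,j}|^2\} = 1$, and sends the baseband signal $x_{u,k,j} = \sqrt{\rho_u \zeta_{k,j}} s_{u,k,j}$ to the BS, where $\rho_u$ is the maximum normalized transmit power at each UE and $\zeta_{k,j}$ is a power control coefficient. This signal is subject to the average transmit power constraint, $\mathbb{E}\{|x_{u,k,j}|^2\} \leq \rho_u$, which can be expressed as
$$0 \leq \zeta_{k,j} \leq 1, \ \forall k, j.$$
Here, we have to ensure that the UEs not sending data in session $j$ are not allocated power, i.e., $\zeta_{k,j} = 0$ if $b_{k,j} = 0$. After receiving data from all UEs, the BS uses the estimated channels and ZF combining to detect the UEs' message symbols. The ZF receiver requires $M \geq N_j$. The achievable rate (bps) of UE $k$ is given by
$$R_{u,k,j}(\boldsymbol{\zeta}_j) = \frac{\tau_c - \tau_{u,p}}{\tau_c} B \log_2\big(1 + \mathrm{SINR}_{u,k,j}(\boldsymbol{\zeta}_j)\big),$$
where $\boldsymbol{\zeta}_j \triangleq \{\zeta_{k,j}\}_{k \in \mathcal{K}}$, and
$$\mathrm{SINR}_{u,k,j}(\boldsymbol{\zeta}_j) = \frac{(M - N_j) \rho_u \bar{\sigma}_k^2 \zeta_{k,j}}{\rho_u \sum_{\ell \in \mathcal{K}} (\beta_\ell - \bar{\sigma}_\ell^2) \zeta_{\ell,j} + 1}$$
is the effective uplink SINR [30, (3.29)]. Similarly, the power constraint at the UEs and the achievable rate of the UE $k$ in the asynchronous and synchronous designs are given by
$$0 \leq \zeta_k \leq 1, \ \forall k,$$
$$R_{u,k}(\boldsymbol{\zeta}) = \frac{\tau_c - \tau_{u,p}}{\tau_c} B \log_2\big(1 + \mathrm{SINR}_{u,k}(\boldsymbol{\zeta})\big),$$
where $\boldsymbol{\zeta} \triangleq \{\zeta_k\}_{k \in \mathcal{K}}$ are power control coefficients, and
$$\mathrm{SINR}_{u,k}(\boldsymbol{\zeta}) = \frac{(M - K) \rho_u \bar{\sigma}_k^2 \zeta_k}{\rho_u \sum_{\ell \in \mathcal{K}} (\beta_\ell - \bar{\sigma}_\ell^2) \zeta_\ell + 1}.$$
UL delay: Let $S_u$ and $S_{u,k,j}$ be the size of the local model update and the size of the split data of this update in a session $j$, respectively. Then, we have
$$S_u = \sum_{j \in \mathcal{K}} S_{u,k,j}, \ \forall k. \quad (16)$$
Since the transmission time $t_{u,j}$ from every participating UE to the BS in the session $j$ is the same, the transmission time $t_{u,k,j}$ from a UE $k \in \mathcal{K}$ in the session-based design is given by
$$t_{u,k,j}(b_{k,j}, t_{u,j}) = b_{k,j} t_{u,j}, \ \forall k, j, \quad (17)$$
$$S_{u,k,j} = R_{u,k,j}(\boldsymbol{\zeta}_j) t_{u,j}, \ \forall k, j. \quad (18)$$
Here, (18) also implies ($S_{u,k,j} = 0$, if $b_{k,j} = 0$), $\forall k, j$, which ensures that the UEs not participating in session $j$ do not send any data. The transmission time from the UE $k \in \mathcal{K}$ in the asynchronous and synchronous designs is $t_{u,k}(\boldsymbol{\zeta}) = S_u / R_{u,k}(\boldsymbol{\zeta})$, $\forall k$.
Energy consumption for the UL transmission: The energy consumption for the UL transmission at a UE is the product of the UL power and the transmission time. In particular, the energy consumption at a UE $k$ in a session $j$ of the session-based design is given by $E_{u,k,j} = \rho_u N_0 \zeta_{k,j} t_{u,k,j}$, $\forall k, j$, and that of both the asynchronous and synchronous designs is expressed as $E_{u,k} = \rho_u N_0 \zeta_k t_{u,k}$, $\forall k$.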
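A hedged numerical sketch of the UL model follows. It assumes the standard textbook closed form for the UL ZF SINR with MMSE channel estimates (the expression the text cites as [30, (3.29)]), a pre-log factor for pilot overhead, and the single-session case in which each UE transmits the whole update at a fixed rate. The function names, the way the estimate variances are generated, and all numerical values are hypothetical.

```python
import numpy as np

def ul_zf_rates(zeta, beta, sigma2, num_antennas, rho_u, tau_c, tau_p, bandwidth):
    """Per-UE UL rates in bit/s with ZF combining; UEs with zeta == 0 stay silent."""
    zeta = np.asarray(zeta, dtype=float)
    beta = np.asarray(beta, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    if np.any(zeta < 0) or np.any(zeta > 1):
        raise ValueError("each zeta must lie in [0, 1]")
    num_active = int(np.count_nonzero(zeta))         # plays the role of N_j
    if num_antennas < num_active:
        raise ValueError("ZF needs at least as many antennas as active UEs")
    # Residual interference from the estimation errors of all active UEs.
    interference = rho_u * np.sum((beta - sigma2) * zeta)
    sinr = (num_antennas - num_active) * rho_u * sigma2 * zeta / (interference + 1.0)
    return (tau_c - tau_p) / tau_c * bandwidth * np.log2(1.0 + sinr)

def ul_energy(zeta, rates, data_size, rho_u, noise_power):
    """Energy per UE: transmit power rho_u * N0 * zeta times the time data_size / rate."""
    times = data_size / np.asarray(rates, dtype=float)
    return rho_u * noise_power * np.asarray(zeta, dtype=float) * times

beta = np.array([1.0, 0.5, 0.1])
sigma2 = 3 * beta**2 / (3 * beta + 1.0)      # assumed estimate variances (tau_p=3, rho_p=1)
zeta = np.array([0.2, 0.5, 1.0])
common = dict(zeta=zeta, beta=beta, sigma2=sigma2, rho_u=10.0,
              tau_c=200, tau_p=3, bandwidth=20e6)
rates_64 = ul_zf_rates(num_antennas=64, **common)
rates_128 = ul_zf_rates(num_antennas=128, **common)
energy_64 = ul_energy(zeta, rates_64, data_size=1e6, rho_u=10.0, noise_power=1e-2)
energy_128 = ul_energy(zeta, rates_128, data_size=1e6, rho_u=10.0, noise_power=1e-2)
```

With the transmit powers held fixed, doubling the number of BS antennas raises every UE's rate and therefore shortens its transmission and lowers its UL energy.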

4) Step (S4): In this step, the BS recomputes the global update using all the received local updates. This step is executed at the BS and does not affect our transmission designs. The computational capability of the central server (i.e., the BS) is much higher than that of each UE, and Step (S4) typically entails the application of a simple aggregation rule such as summing up the model updates. Therefore, the time required for computing the global update in Step (S4) is assumed negligible. Consequently, the computation time of Step (S4) is ignored in the problem formulation and solution in the subsequent sections.

A. Problem Formulation
In this work, we aim to (i) improve the energy efficiency of the proposed FL-enabled mMIMO networks by minimizing the total energy consumption in one FL communication round, and (ii) keep the execution time of each round below a quality-of-service threshold. Here, the total energy consumption of one FL communication round includes the energy consumption for transmission and local computation at both the BS and the UEs. Thus, the total energy consumption of one FL communication round in the session-based design is the sum of the BS transmission energy over all DL sessions, the local computing energy over all UEs, and the UL transmission energy over all UEs and UL sessions. The problem of optimizing user assignment $(\mathbf{a}, \mathbf{b})$, data size $(\mathbf{S}_d, \mathbf{S}_u)$, time allocation $(\mathbf{t}_d, \mathbf{t}_u)$, power $(\boldsymbol{\eta}, \boldsymbol{\zeta})$, and computing frequency $\mathbf{f}$, to minimize the total energy consumption of one FL communication round in the session-based design is formulated as problem (19), in which this total energy is minimized over $\mathbf{x}$ subject to (1)-(4), (7), (9), (10)-(13), (16), (18), and the additional constraints (19c)-(19f), where $\mathbf{x} \triangleq \{\mathbf{a}, \mathbf{b}, \boldsymbol{\eta}, \boldsymbol{\zeta}, \mathbf{f}, \mathbf{S}_d, \mathbf{S}_u, \mathbf{t}_d, \mathbf{t}_u\}$, $\mathbf{S}_d \triangleq \{S_{d,k,i}\}$, $\mathbf{S}_u \triangleq \{S_{u,k,j}\}$, $\forall k, i, j$. Here, (19f) is introduced to ensure that all the UEs send their local updates during the UL mode of the BS. The right-hand side of (19f) corresponds to the first UE that finishes its DL transmission and local computation, while the left-hand side corresponds to the last UE to finish its DL transmission. Constraints (19d) and (19e) take into account the time consumption in each FL communication round. These constraints make sure that the time consumption of each FL communication round does not exceed the threshold $t_{\mathrm{QoS}}$, in order to ensure a target level of quality of service. Note that the study of the optimal trade-off between the time and energy consumption such as [52] is interesting but beyond the scope of our paper, and hence, is left for future work.
Remark 2. Similar to many existing works that follow the DigComp approach (such as [19]-[21], [39], [40]), our intention is to design energy-efficient wireless networks to support standard FL. Also, we do not combine massive MIMO and FL to create a new learning framework. We focus on the communication aspects, and more specifically the schemes for users to receive, compute, and transmit their model updates. On one hand, our proposed schemes do not require any changes to or even assumptions on the learning algorithm. As such, the learning performance (including convergence rates) of any standard FL framework (e.g., those in [33]-[38]) implemented over massive MIMO systems using our proposed schemes remains unchanged. The complexity of the existing FL algorithm to be implemented on the proposed massive MIMO networks does not increase. On the other hand, transmitting and receiving FL model updates is nothing but transmitting and receiving data between user devices and the base station. Therefore, the complexity of a massive MIMO network used to support FL is similar to that of a current 5G massive MIMO network with the same system configuration.
Proposition 1. The following statements hold: (50) corresponding to $\lambda$ converge to 0 as $\lambda \to +\infty$. (ii) Problem (49) has the following property min and therefore, it is equivalent to (50) at the optimal solution $\lambda^* \geq 0$ of the sup-min problem in (51).
Proof. See the Appendix.

A. Problem Formulation
Similarly, the total energy consumption of one FL communication round in the asynchronous and synchronous designs is obtained by summing the DL transmission, local computing, and UL transmission energies over all UEs.
1) Optimization Problem for Asynchronous Design: The problem of optimizing power $(\boldsymbol{\eta}, \boldsymbol{\zeta})$ and computing frequency $\mathbf{f}$ to minimize the total energy consumption of one FL communication round in the asynchronous design is formulated as problem (66), whose constraints include (5), (14), and (19c).
2) Optimization Problem for Synchronous Design: Similarly, the problem of optimizing power $(\boldsymbol{\eta}, \boldsymbol{\zeta})$ and computing frequency $\mathbf{f}$ to minimize the total energy consumption of one FL communication round in the synchronous design is formulated as problem (67), whose constraints include (5), (14), (19c), (66b), and (67b). Here, the constraint (67b) captures the nature of the "step-by-step" scheme, i.e., every UE needs to wait for all the UEs to finish one step before starting the next step, as seen in Fig. 1(a). Compared to (67b), the constraints (19d) and (66c) provide more flexibility in allocating the available time in Steps (S1)-(S3) to each UE. This is because the UEs in the asynchronous and session-based schemes need not wait for other UEs to start a new step.
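The extra flexibility of the asynchronous timing can be sketched with a toy calculation. The sketch assumes that the synchronous design budgets the round as the slowest DL time plus a common computing window plus the slowest UL time, whereas the asynchronous design budgets each UE individually. The requirement that UL transmissions fall within the BS UL mode is ignored for simplicity. The function names and numbers are hypothetical, and the computing energy follows the assumed model of $\frac{\alpha}{2}$ times total cycles times frequency squared.

```python
import numpy as np

def computing_windows(t_dl, t_ul, t_qos):
    """Time available for local computation per UE under the two timing rules."""
    t_dl = np.asarray(t_dl, dtype=float)
    t_ul = np.asarray(t_ul, dtype=float)
    # Synchronous: everyone waits for the slowest DL and budgets for the slowest UL.
    sync = np.full(t_dl.shape, t_qos - t_dl.max() - t_ul.max())
    # Asynchronous (assumed per-UE budget): each UE only accounts for its own times.
    asyn = t_qos - t_dl - t_ul
    return sync, asyn

def computing_energy(total_cycles, window, alpha):
    """Energy when the UE runs just fast enough to finish within its window."""
    freq = total_cycles / window
    return 0.5 * alpha * total_cycles * freq**2

t_dl = [0.2, 0.5, 1.0]                       # per-UE DL times in seconds
t_ul = [0.3, 0.6, 1.2]                       # per-UE UL times in seconds
cycles = np.array([1e8, 1e8, 1e8])           # total cycles per UE for all local rounds
sync_win, asyn_win = computing_windows(t_dl, t_ul, t_qos=4.0)
e_sync = computing_energy(cycles, sync_win, alpha=2e-28)
e_asyn = computing_energy(cycles, asyn_win, alpha=2e-28)
```

UEs with short transmission times gain a longer computing window in the asynchronous case, so they can run at a lower frequency and spend less computing energy, while the slowest UE is no worse off.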
Following the same procedure in Section IV-B, the concave lower bounds of Then, constraints (68b) and (68c) can be approximated by the following convex constraints Solve (74) to obtain its optimal solution y y y *

B. Results and Discussion
As discussed in Remark 2, our paper focuses on the communication aspects rather than the learning aspects of the implementation of FL over wireless networks. Therefore, our simulation results do not include datasets or learning performance (e.g., convergence speed, training loss, and test accuracy), which is similar to many existing DigComp works in the literature such as [19]-[21], [39], [40].
1) Effectiveness of the Proposed Schemes: First, we evaluate the convergence behavior of our proposed Algorithms 1 and 2. Fig. 4 shows that Algorithm 1 converges within 60 iterations for the session-based scheme, while Algorithm 2 converges within 30 iterations for the asynchronous and synchronous schemes. It should be noted that each iteration of Algorithm 1 or 2 involves solving simple convex programs, i.e., (65), (74), and (75).
Next, since we are aware of no other existing work that studies energy-efficient massive MIMO networks for supporting FL, we compare the proposed session-based scheme (OPT SB), asynchronous scheme (OPT Asyn), and synchronous scheme (OPT Syn) with the following heuristic schemes:
• HEU SB (Heuristic session-based scheme): In each session, a UE that has a less favorable link condition (i.e., smaller large-scale fading coefficient) is allocated more power to meet the required execution time of one FL communication round. First, since all UEs participate in the DL session 1 and the UL session $K$, we let $a_{k,1} = b_{k,K} = 1, \forall k$, and take the power allocated to a UE $k$ in the DL session 1 to be η and the transmit power of a UE $k$ in the UL session . Now, in each DL session $i$, $\forall i \neq 1$, since the UE that has the highest data rate finishes its transmission earlier than the other UEs and does not join the subsequent sessions, we choose a and $R_{u,k,j}$ is obtained by using the given $\boldsymbol{\zeta}_{k,j}$ and (15). Then, the DL power allocated to a UE $k$ in a DL session respectively, be the matrices of the DL and UL rates, where Let $\boldsymbol{t}_d \in \mathbb{R}^{1 \times K}$ and $\boldsymbol{t}_u \in \mathbb{R}^{1 \times K}$, respectively, be the row vectors comprising the transmission times of the DL and UL sessions, where $[\boldsymbol{t}_d]_i = t_{d,i}$ and $[\boldsymbol{t}_u]_j = t_{u,j}$. Let $\boldsymbol{1} \in \mathbb{R}^{K \times 1}$ be a column vector of all ones. Then, from (7) and (9), we have $\boldsymbol{t}_d = (\boldsymbol{R}_d^{-1} S_d \boldsymbol{1})^T$. Similarly, from (16) and (18), we have The DL and UL data sizes in each session are calculated according to (9) and (18). The processing frequencies are $- \sum_{j \in \mathcal{K}} b_{k,j} t_{u,j}, \forall k$.
• HEU Asyn (Heuristic asynchronous scheme): The idea of heuristic power allocation in HEU SB is applied to the asynchronous scheme. In particular, the DL power to all the UEs is η , $\forall k$.
• HEU Syn (Heuristic synchronous scheme): This scheme is similar to HEU Asyn, except that the processing frequencies are instead set as
All the following results are obtained by averaging over 200 channel realizations.
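The relation $\boldsymbol{t}_d = (\boldsymbol{R}_d^{-1} S_d \boldsymbol{1})^T$ used in HEU SB amounts to solving a $K \times K$ linear system for the session durations. The following is a minimal sketch of that step, not the authors' implementation. It assumes that entry $(k, i)$ of the rate matrix holds the rate of UE $k$ in session $i$ (zero if the UE has already finished), and the function names `solve_linear` and `session_times` as well as the numerical rates are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b] with float entries.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rate matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def session_times(rates, data_size):
    """Session durations t such that every UE k receives data_size in total,
    i.e., sum_i rates[k][i] * t[i] = data_size (assumed row/column layout)."""
    return solve_linear(rates, [data_size] * len(rates))


# Hypothetical example: K = 3 UEs, rates in Mbit/s, data size 8 Mbit (1 MB).
# UE 1 finishes after session 1, UE 2 after session 2, UE 3 after session 3.
rates_dl = [
    [8.0, 0.0, 0.0],
    [4.0, 8.0, 0.0],
    [2.0, 4.0, 8.0],
]
times_dl = session_times(rates_dl, 8.0)
print(times_dl)
```

With one UE finishing per session, the assumed rate matrix is triangular after ordering the UEs, so the system always has a unique solution when the rates of the finishing UEs are nonzero.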
Fig. 5 compares the total energy consumption in an FL communication round by all the considered schemes. As seen, our proposed schemes significantly outperform the heuristic schemes. In particular, OPT SB, OPT Asyn, and OPT Syn reduce the total energy consumption by more than 80% in both cases of $K = 10$ and $K = 5$. These results show the significant advantage of joint optimization of user assignment, data size, time, transmit power, and computing frequencies over the heuristic schemes.
2) Comparison of the Proposed Schemes: Fig. 6(a) shows that the session-based scheme is the best performer while the synchronous scheme is the worst. Compared to OPT Syn, the total energy consumption by OPT SB is reduced by up to 29%, while that figure for OPT Asyn is 6%. To gain more insights into this result, the total energy consumption for local computing, $E_{C,\mathrm{total}} \triangleq \sum_{k \in \mathcal{K}} E_{C,k}(f_k)$, of all considered schemes is shown in Fig. 6(b), and the total energy consumption for transmission, $E_x - E_{C,\mathrm{total}}$, is shown in Fig. 6(c), where $x \in \{\mathrm{SB}, \mathrm{Asyn}, \mathrm{Syn}\}$. First, it can be seen that the energy consumption for local computing and that for transmission of OPT Asyn are both smaller than those of OPT Syn. This is because the UEs in the asynchronous scheme do not wait for other UEs to finish each step. As they have more time available, they can save more energy by using a lower transmit power and a lower computing frequency than the UEs in the synchronous scheme. However, the gap between OPT Asyn and OPT Syn is small because the transmission designs of the asynchronous and synchronous schemes are the same. In contrast, the session-based scheme uses a more energy-efficient transmission design in which power is not allocated to the UEs that have finished transmission. As a result, compared to the asynchronous and synchronous schemes, the energy consumption for transmission by the session-based scheme is reduced by up to 73%, as shown in Fig. 6(c). This substantial reduction compensates for the small increase (i.e., 15%) in the energy consumption for local computing, making the overall energy consumption by the session-based scheme noticeably lower than that by the asynchronous and synchronous schemes.
3) Impact of the Number of Antennas on the Total Energy Consumption: Fig. 6(a) also shows that using a large number of antennas corresponds to a reduction of up to 40% in the total energy consumption in one FL communication round. This is because with more antennas, the data rate is higher for the same power level. Thus, the transmission time is shortened, which leads to the reduction in transmission energy; see Fig. 6(c). This also results in more time for local computing, a lower required computing frequency, and, in turn, a reduction in the energy required for local computing, as shown in Fig. 6(b). This result shows the importance of massive MIMO technology to support FL.
4) Impact of $t_{\mathrm{QoS}}$ on the Total Energy Consumption of One FL Communication Round: Fig. 7 shows that increasing $t_{\mathrm{QoS}}$ leads to a dramatic decrease of up to 79% in the total energy consumption. This is because when $t_{\mathrm{QoS}}$ increases, the transmit power and computing frequency required to satisfy the quality-of-service constraint are lower. In turn, they result in a reduction in energy consumption for both transmission and computing. Fig. 7 also shows that when increasing $t_{\mathrm{QoS}}$, compared with OPT Syn, the energy consumption by OPT SB is reduced by even more: from 21% for $t_{\mathrm{QoS}} = 1$ s to 71% for $t_{\mathrm{QoS}} = 4$ s, while the total energy consumption of OPT Asyn and OPT Syn is almost the same. This result confirms the significant advantage of the session-based transmission design over the conventional transmission designs used in the asynchronous and synchronous schemes.
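The link between available time and computing energy can be illustrated numerically. The sketch below assumes a commonly used model that is not restated in this section: computing time $t_C = L c_k D_k / f_k$ and computing energy $E_C = L (\alpha/2) c_k D_k f_k^2$. Under that assumption, a UE that is given twice the time can halve its frequency and spends a quarter of the energy. The function names are hypothetical, and the default parameter values are the ones quoted in the simulation setup ($L = 5$, $c_k = 20$ cycles/sample, $D_k = 10^4$ samples, $\alpha = 5 \times 10^{-21}$).

```python
def min_frequency(time_s, L=5, c=20.0, D=1e4):
    """Lowest computing frequency (cycles/s) that finishes L local
    iterations over D samples within time_s seconds (assumed model)."""
    return L * c * D / time_s


def computing_energy(f, L=5, c=20.0, D=1e4, alpha=5e-21):
    """Local computing energy in joules under the assumed model
    E = L * (alpha / 2) * c * D * f^2."""
    return L * (alpha / 2) * c * D * f ** 2


# A UE with 0.5 s available for computing versus one with 1.0 s available.
e_tight = computing_energy(min_frequency(0.5))
e_loose = computing_energy(min_frequency(1.0))
print(e_tight / e_loose)
```

The quadratic dependence on frequency is why even a modest amount of extra computing time, as in the asynchronous scheme, translates into energy savings.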

VII. CONCLUSION
In this paper, we proposed novel synchronous, asynchronous, and session-based communication designs for massive MIMO networks to support FL. Targeting the minimization of total energy consumption per FL communication round, we formulated design problems that jointly optimize UE assignments, time allocations, transmit powers, and computing frequencies. Relying on successive convex approximation techniques, we developed novel algorithms to solve the formulated problems. Numerical results showed that our proposed designs significantly reduced the total energy consumption per FL communication round compared to baseline schemes.
In terms of energy savings, the session-based design was the preferred choice to support FL as it outperformed the synchronous and asynchronous designs. For future work, it would be interesting to study the combination of massive MIMO and intelligent reconfigurable surfaces to improve the network coverage, as well as to take into account UE selection to improve the energy efficiency of massive MIMO systems to support FL.

APPENDIX
Following the arguments in [53], [54], let $E(\lambda)$ be the optimal value at the optimal solution of problem (50) corresponding to $\lambda$. For ease of presentation, we use $E$ for $E_{sb}(\boldsymbol{f}, \boldsymbol{v}_d, \boldsymbol{v}_u)$. Also, since $(\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{f}, \boldsymbol{v}_d, \boldsymbol{v}_u, \boldsymbol{r}_d, \boldsymbol{r}_u, \boldsymbol{S}_d, \boldsymbol{S}_u, \boldsymbol{t}_d, \boldsymbol{t}_u, \boldsymbol{t}_d, \boldsymbol{t}_u, \lambda)$ is a subset of variables in $\boldsymbol{x}$, we use $L(\boldsymbol{x}, \lambda)$ instead of $L(\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{f}, \boldsymbol{v}_d, \boldsymbol{v}_u, \boldsymbol{r}_d, \boldsymbol{r}_u, \boldsymbol{S}_d, \boldsymbol{S}_u, \boldsymbol{t}_d, \boldsymbol{t}_u, \boldsymbol{t}_d, \boldsymbol{t}_u, \lambda)$. Let $E^*$ be the optimal value of problem (49). Then $E^* < +\infty$ since $F$ is compact. Due to a duality gap between the optimal value of problem (49) and the optimal value of its dual problem, we have
$\sup_{\lambda \ge 0} E(\lambda) = \sup_{\lambda \ge 0} \min_{\boldsymbol{x} \in F} L(\boldsymbol{x}, \lambda) \le E^* = \min_{\boldsymbol{x} \in F} \max_{\lambda \ge 0} L(\boldsymbol{x}, \lambda)$,
which implies that
$V_{2,\lambda} \triangleq \sum_{k \in \mathcal{K}} \sum_{i \in \mathcal{K}} \big((r_{d,k,i})_\lambda (t_{d,k,i})_\lambda - (S_{d,k,i})_\lambda\big)$, $V_{3,\lambda} \triangleq \sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{K}} \big((r_{u,k,j})_\lambda (t_{u,k,j})_\lambda - (S_{u,k,j})_\lambda\big)$, $V_{4,\lambda} \triangleq \sum_{k \in \mathcal{K}} \big(t_\lambda - \sum_{i \in \mathcal{K}} (t_{d,k,i})_\lambda - t_{C,k}((f_k)_\lambda) - \sum_{i \in \mathcal{K}} (t_{u,k,i})_\lambda\big)$ be the values of $V_1, V_2, V_3, V_4$ at the values $\boldsymbol{f}_\lambda, \boldsymbol{a}_\lambda, \boldsymbol{b}_\lambda, (\boldsymbol{r}_d)_\lambda, (\boldsymbol{r}_u)_\lambda, (\boldsymbol{S}_d)_\lambda, (\boldsymbol{S}_u)_\lambda, (\boldsymbol{t}_d)_\lambda, (\boldsymbol{t}_u)_\lambda, (\boldsymbol{t}_d)_\lambda, (\boldsymbol{t}_u)_\lambda, t_\lambda$ corresponding to $\lambda$. Then $V_{1,\lambda}, V_{2,\lambda}, V_{3,\lambda}, V_{4,\lambda} \ge 0, \forall \lambda$. Let $V_\lambda \triangleq \gamma_1 V_{1,\lambda} + \gamma_2 V_{2,\lambda} + \gamma_3 V_{3,\lambda} + \gamma_4 V_{4,\lambda}$. Denote by $E_\lambda$ the value of $E$ corresponding to $\lambda$. Let $0 \le \lambda_1 < \lambda_2$. Because $E(\lambda_1)$ and $E(\lambda_2)$ are the optimal values of (50) corresponding to $\lambda_1$ and $\lambda_2$, we have

(S1) The central server sends a global update to the UEs.
(S2) Each UE updates its local learning problem with the global update and its local data, and then computes a local update by solving the local problem.
(S3) Each UE sends its local update to the central server.
(S4) The central server recomputes the global update by aggregating the received local updates from all the UEs.
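As a toy illustration of Steps (S1)-(S4), the sketch below runs one FL communication round on a scalar model. Everything specific in it is an assumption made for illustration only, since the schemes in this paper make no assumption on the learning algorithm: the local problem is a squared-error fit solved approximately by a single gradient step, the aggregation in (S4) is a plain average, and the names `local_update` and `fl_round` are hypothetical.

```python
def local_update(global_w, data, lr=0.1):
    """(S2) One gradient step on the local loss mean((w - x)^2),
    starting from the received global model (illustrative choice)."""
    grad = sum(2.0 * (global_w - x) for x in data) / len(data)
    return global_w - lr * grad


def fl_round(global_w, ue_datasets):
    """One FL communication round over all UEs."""
    # (S1) The server sends global_w to every UE (modeled by passing it in).
    # (S2) Each UE computes its local update from its own data.
    local_ws = [local_update(global_w, data) for data in ue_datasets]
    # (S3) Each UE sends its local update back to the server.
    # (S4) The server aggregates; plain averaging is assumed here.
    return sum(local_ws) / len(local_ws)


# Hypothetical data held by two UEs.
new_w = fl_round(0.0, [[1.0, 3.0], [5.0]])
print(new_w)
```

In the network considered here, (S1) and (S3) are the downlink and uplink transmissions whose time and energy are optimized, while (S2) determines the local computing time and energy.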

Fig. 1. Illustration of one FL communication round over the considered mMIMO network with three UEs.

Fig. 2. Operation of one FL communication round in the considered massive MIMO network.

Fig. 3. Comparison of downlink per-UE rates in a single group of UEs between the unicast ZF scheme with a dedicated pilot design and the multicast schemes with dedicated and co-pilot pilot designs. Here, $M = 75$, $K = 10$, $\eta_k = 1/K, \forall k$. All other parameters are the same as in our simulation results (see Section VI-A).

3) Step (S3): In this step, the local model updates are transmitted from the UEs to the BS through $K$ sessions. Each coherence block of this step involves two phases: channel estimation and uplink payload data transmission. Define the indicator $b_{k,j}$ as $b_{k,j}$

A. Network Setup and Parameter Settings
We consider an mMIMO network in a square of $D \times D$, where the BS is located at the center and the UEs are located randomly within the square. We choose $D = 0.25$ km, and set $\tau_c = 200$ samples. The large-scale fading coefficients, $\beta_k$, are modeled in the same manner as [60]: $\beta_k\,[\mathrm{dB}] = -148.1 - 37.6 \log_{10}\left(\frac{d_k}{1\ \mathrm{km}}\right) + z_k$, where $d_k \ge 35$ m is the distance between a UE $k$ and the BS, and $z_k$ is a shadow fading coefficient modeled according to a log-normal distribution with zero mean and 7-dB standard deviation. We choose $B = 20$ MHz, $\tau_{d,p} = \tau_{u,p} = K$, $S_d = S_u = 1$ MB, noise power $\sigma_0^2 = -92$ dBm, $L = 5$, $f_{\max} = 5 \times 10^9$ cycles/s, $D_k = 10^4$ samples, $c_k = 20$ cycles/sample [37], for all $k$, and $\alpha = 5 \times 10^{-21}$. The maximum transmit powers of the BS, the UEs, and the UL pilot sequences are 10 W, 0.2 W, and 0.2 W, respectively. The maximum transmit powers $\rho_d$, $\rho_u$, and $\rho_p$ are these values normalized by the noise power.
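The large-scale fading model and the noise-power normalization above can be checked with a few lines of code. This is a minimal sketch under the stated parameters; the function names are hypothetical, and the shadow fading term $z_k$ is passed in as a fixed value in dB rather than drawn at random so that the output is reproducible.

```python
import math


def large_scale_fading_db(distance_km, shadow_db=0.0):
    """beta_k in dB: -148.1 - 37.6 * log10(d_k / 1 km) + z_k."""
    if distance_km < 0.035:
        raise ValueError("the setup assumes d_k >= 35 m")
    return -148.1 - 37.6 * math.log10(distance_km) + shadow_db


def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((p_dbm - 30) / 10)


def normalize_power(p_watt, noise_dbm=-92.0):
    """Transmit power divided by the noise power (dimensionless)."""
    return p_watt / dbm_to_watt(noise_dbm)


beta_db = large_scale_fading_db(0.1)  # UE at 100 m from the BS, z_k = 0 dB
rho_d = normalize_power(10.0)         # BS maximum power of 10 W
rho_u = normalize_power(0.2)          # UE maximum power of 0.2 W
print(beta_db, rho_d, rho_u)
```

To mimic the simulation setup, `shadow_db` could be drawn with `random.gauss(0.0, 7.0)`, which matches a log-normal shadowing term with zero mean and 7-dB standard deviation when expressed in dB.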

Fig. 7. Impact of $t_{\mathrm{QoS}}$ on the total energy consumption of one FL communication round. Here, $M = 75$.