Federated Learning Over-the-Air by Retransmissions

Motivated by the increasing computational capabilities of wireless devices, as well as unprecedented levels of user- and device-generated data, new distributed machine learning (ML) methods have emerged. In the wireless community, Federated Learning (FL) is of particular interest due to its communication efficiency and its ability to deal with the problem of non-IID data. FL training can be accelerated by a wireless communication method called Over-the-Air Computation (AirComp), which harnesses the interference of simultaneous uplink transmissions to efficiently aggregate model updates. However, since AirComp utilizes analog communication, it introduces inevitable estimation errors. In this paper, we study the impact of such estimation errors on the convergence of FL and propose retransmissions as a method to improve FL accuracy over resource-constrained wireless networks. First, we derive the optimal AirComp power control scheme with retransmissions over static channels. Then, we investigate the performance of Over-the-Air FL with retransmissions and find two upper bounds on the FL loss function. Numerical results demonstrate that the power control scheme offers significant reductions in mean squared error. Additionally, we provide simulation results on MNIST classification with a deep neural network that reveal significant improvements in classification accuracy for low-SNR scenarios.

Index Terms: Federated learning, over-the-air computation, retransmissions.

I. INTRODUCTION
The data collection rate in wireless devices is growing at an exceptional speed due to the increasing adoption of smartphones, tablets, and Internet of Things (IoT) devices [1], [2]. These devices are expected to provide a broad range of Artificial Intelligence (AI) services in Sixth Generation (6G) networks, such as predictive healthcare [3], search-and-rescue drones [4], and environmental monitoring [5]. As a consequence, new distributed machine learning methods, such as Federated Learning (FL), have become essential to enable privacy-preserving and communication-efficient model training [6]. A recent survey on open problems of FL argues that communication is often a primary bottleneck for FL because wireless links operate at low rates that can be both expensive and unreliable [7]. Communication-efficient FL is investigated thoroughly in [8], where various compression techniques such as quantization, random rotation, and sub-sampling are evaluated. In [9], [10], and [11], it is established that new wireless methods can greatly improve the communication efficiency of edge AI.
A novel approach to wireless communication, called Over-the-Air Computation (AirComp), has recently been adapted to support Machine Learning (ML) services [12], [13]. AirComp is an analog communication scheme in which all users transmit simultaneously over the same frequency band, thereby deliberately creating interference. This interference is leveraged to compute a function of the transmitted messages by utilizing the superposition property of the wireless channel [14]. By appropriately precoding the transmitted signals and postcoding the received signal, all nomographic functions can be calculated over the air [15]. In FL, the central server is interested in collecting the arithmetic mean of model updates from the participating devices. Since the arithmetic mean is a nomographic function, AirComp is a suitable communication solution [11].
Compared to conventional point-to-point digital communications, AirComp is attractive from a communication-efficiency standpoint, with throughput gains approximately proportional to the number of users [12]. The reason for this drastic improvement is that the entire wireless spectrum can be utilized concurrently by all devices, rather than dividing it and allocating smaller resource blocks to each device. Additionally, AirComp obfuscates the participating users since the central server directly receives the arithmetic mean rather than the individual model updates, thereby enhancing privacy [16].
Currently, AirComp is reliant upon specialized hardware and fine synchronization that might be difficult to achieve in practice [17]. Additionally, AirComp is unable to guarantee perfect reconstruction of the transmitted messages at the receiver. Shannon's "fundamental theorem for a discrete channel with noise" establishes that for any degree of noise contamination, it is possible to communicate discrete data with an arbitrarily small frequency of errors [18]. However, to achieve a non-zero communication rate, redundant information must be transmitted in the form of a code. Since the information transmitted in AirComp is not discrete, existing codes do not appear to be directly applicable. Instead, AirComp settles for estimating the desired function as closely as possible, while retaining some non-zero estimation error [19]. In [20], it is proven that these errors harm the convergence properties of FL, both the rate of convergence and the post-convergence loss.
In the current AirComp literature, the main way of reducing the estimation error is to optimize the transmission powers. In [19] and [21], the authors propose a closed-form power control scheme that minimizes the mean squared error (MSE) between the received signal and the desired function of the sources' messages under a peak transmission power constraint. For the case of multiple antennas, no closed-form power control scheme has been found, but [13] develops a strong heuristic by using a difference-of-convex-functions representation of the problem. In [22] and [23], the multi-antenna problem is coupled with wireless power transfer to improve the battery life of participating IoT devices. To further improve the power control, [24] proposes a gradient-statistics-aware scheme that learns statistical properties of the model updates to improve the AirComp estimation error. In [25], the temporal structure of gradient sparsity is leveraged to develop a Bayesian prior that improves the estimation. Another common approach is to incorporate intelligent reflective surfaces with AirComp to reach substantially lower estimation errors [26], [27], [28].
As a general pattern, none of these works offers a way to trade off communication resources for improved estimation. In digital communications, such communication-estimation trade-offs are the main way to reduce errors. For instance, it is standard to adaptively control the modulation order and coding rate to compensate for poor channels [29]. Unfortunately, none of these approaches are directly compatible with AirComp since the communication is analog. In this paper, we take a first step towards enabling this communication-estimation trade-off for Over-the-Air federated learning with a system we call AirReComp. The contributions of this paper are summarized as follows.
• A power control scheme for AirReComp is proposed. The proposed scheme is proven to be globally optimal in terms of MSE between the estimated and desired function, given assumptions on the first and second moments of the local model updates.
• Upper bounds on the FL loss function are derived for single-epoch Lipschitz-smooth functions, both for the strongly convex and convex case.
• To further support the feasibility of AirReComp under non-convex functions, we provide numerical results with Deep Neural Networks (DNNs). These results suggest that AirReComp can beat state-of-the-art Over-the-Air FL in terms of classification accuracy.
The remainder of the paper is organized as follows. Section II introduces the system model. Section III presents and solves the power control problem to minimize the MSE between the desired and received sum. Section IV provides worst-case analyses on the performance of AirReComp in terms of two upper bounds on the FL loss function. In Section V, the proposed AirReComp scheme and the convergence bound are numerically evaluated for non-convex and convex loss functions. Finally, Section VI concludes the paper and discusses future work.
Notation: $z$ is a scalar, $\mathbf{z}$ is a vector, and $\mathbf{Z}$ is a matrix. Element $i$ of vector $\mathbf{z}$ is expressed as $z^{(i)}$. To denote element-wise operations of vectors, we overload the scalar equivalent, e.g., $\mathbf{x}/\mathbf{y}$ is the element-wise division of $\mathbf{x}$ and $\mathbf{y}$. $\bar{z}$ denotes the complex conjugate of $z$, and $\hat{z}$ denotes an estimate of $z$.

II. SYSTEM MODEL
In this section, we describe the system model and the AirReComp algorithm. We consider a distributed ML system consisting of $K$ single-antenna user devices, each carrying a distinct dataset $D_k$, and a single-antenna parameter server (PS) which can be reached by all devices in a single hop. The objective of the system is to solve optimization problem (1), i.e., to minimize the global loss function using the datasets at the user devices. The vector $w \in \mathbb{R}^d$ is the $d \times 1$ parameter vector that defines the ML model, $F(w)$ denotes the global loss function, and $F_k(w)$ is a local loss function.
The uplink wireless channel is modeled as a block-fading multiple access channel (MAC) with additive noise [30]. If the $K$ users simultaneously transmit vectors $x_k \in \mathbb{R}^d$ over the MAC, the PS receives
$$y = \sum_{k=1}^{K} h_k x_k + z, \tag{2}$$
where $h_k \in \mathbb{C}$ denotes the channel coefficient from device $k$ to the PS and $z \in \mathbb{C}^d$ denotes additive white Gaussian noise (AWGN) with variance $\sigma_z^2$. Additionally, we consider retransmissions over this channel, where we assume that the coherence time of the wireless channel is long enough to accommodate $M$ uplink transmissions. If the PS aggregates the result of these transmissions, we get
$$\sum_{m=1}^{M} y_m = M \sum_{k=1}^{K} h_k x_k + \sum_{m=1}^{M} z_m, \tag{3}$$
where the desired signal strength is increased by a factor $M$, while the independent noise terms $z_m$ do not add coherently, so their aggregate grows more slowly than the signal. These kinds of static fading channels exist in several practical wireless applications, such as industrial communications. As a conservative example, consider an IEEE 802.11 factory wireless sensor network with coherence times of around 100 ms [31]. Such a network provides at least $L = 10$ parallel communication channels [32] and has a symbol period of less than $T = 10$ µs [33]. Considering a small neural network with $d = 10{,}000$ parameters, it takes $MdT/L = 10M$ ms to perform $M$ uplink transmissions, which accommodates $M = 10$ transmissions within the coherence time. For other scenarios with fast-fading channels, we refer to our recent work [34].
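The coherence-time example above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the numbers quoted in the text; the function names are hypothetical and chosen for illustration.

```python
import math

def uplink_time_s(M, d, T, L):
    """Time to send a d-parameter update M times over L parallel channels,
    one parameter per symbol, with symbol period T (seconds)."""
    return M * d * T / L

def max_retransmissions(coherence_s, d, T, L):
    """Largest M such that M uplink transmissions fit in one coherence interval."""
    per_transmission = d * T / L
    # The small offset guards against floating-point round-off at exact multiples.
    return math.floor(coherence_s / per_transmission + 1e-9)

# Values from the text: d = 10,000 parameters, T = 10 us, L = 10 channels,
# coherence time of about 100 ms.
d, T, L, coherence = 10_000, 10e-6, 10, 100e-3
print(uplink_time_s(1, d, T, L))                 # about 0.01 s, i.e. 10 ms per transmission
print(max_retransmissions(coherence, d, T, L))   # 10
```

Under these numbers a single transmission occupies about 10 ms, so ten transmissions fill the 100 ms coherence interval.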
For simplicity, we assume error-free broadcast transmission in the downlink, which is an acceptable approximation for most practical scenarios since the PS generally has much greater communication capability than the user devices [35].

A. Federated Learning Algorithm
FL is an iterative algorithm to solve (1), where each iteration is denoted a communication round and consists of downlink broadcast, model training at the user devices, and uplink aggregation.
Communication round $n$ starts when the PS broadcasts the global model $w_n$ to all user devices in the downlink. Upon receiving the model, user device $k$ solves the local problem (4), where $u_i$ denotes one training sample and $l(w, u_i)$ is the sample-wise loss function. Generally, (4) cannot be solved exactly. Instead, each device runs $E$ epochs of gradient descent to approximately solve (4), as given in (5). The PS then aggregates the local model updates into the global model update $\Delta w_n$, following [6], as given in (6). Finally, the PS concludes the communication round by generating the next iteration of the model parameters. The algorithm repeats for $N$ communication rounds until $w_N$ is generated as the final model.
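The round structure described above (broadcast, local training, aggregation of the mean update) can be sketched as follows. This is an illustrative toy with error-free aggregation, not the paper's exact formulation: the least-squares loss, the definition of the model update as the change in local weights, and all names are assumptions made for the example.

```python
import numpy as np

def local_update(w, X, y, beta, epochs):
    """Run `epochs` full-batch gradient steps on a least-squares loss (assumed
    loss, for illustration) and return the resulting change in the weights."""
    w_local = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w_local - y) / len(y)
        w_local -= beta * grad
    return w_local - w

def global_loss(w, data):
    """Average of the local least-squares losses."""
    return np.mean([0.5 * np.mean((X @ w - y) ** 2) for X, y in data])

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
data = []
for _ in range(4):                          # K = 4 devices, each with its own dataset
    X = rng.normal(size=(50, 5))
    data.append((X, X @ w_true))            # noiseless labels keep the toy simple

w = np.zeros(5)
loss_start = global_loss(w, data)
for n in range(30):                         # N = 30 communication rounds
    # Downlink broadcast of w, then E = 2 local epochs on every device.
    updates = [local_update(w, X, y, beta=0.1, epochs=2) for X, y in data]
    # Uplink aggregation: the PS applies the arithmetic mean of the updates.
    w = w + np.mean(updates, axis=0)
loss_end = global_loss(w, data)
print(loss_start, loss_end)
```

In the over-the-air setting, only the aggregation line changes: the exact mean is replaced by a noisy estimate of it.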

B. Over-the-Air Computation Protocol
In the uplink aggregation step of FL, see (6), the PS reconstructs the sum of $K$ model updates. In this section, we describe how this is achieved by AirComp. To start, the model updates are embedded into the transmit signals $x_{n,k}$ as in (8), where $h_{n,k}$ is the channel coefficient of device $k$ for communication round $n$ and $p_{n,k}$ is its transmission power. All devices transmit $x_{n,k}$ simultaneously over the MAC (2), and the PS receives the superposition of these signals plus noise. Ideally, the transmission powers would be chosen as $p_{n,k} = 1/|h_{n,k}|^2$, which would completely compensate for the fading effect. However, with a natural constraint on the maximum transmission power, $p_{n,k} = 1/|h_{n,k}|^2$ might be impossible to achieve. Because of this limitation and due to the additive noise, the PS can never perfectly reconstruct $\Delta w_n$. Instead, it estimates $\Delta w_n$ by dividing the received signal by a post-transmission scalar $\sqrt{\eta_n}$ and the number of devices $K$. In practice, the division by $\sqrt{\eta_n} K$ takes place in the baseband of the PS, i.e., it is an operation in the digital hardware of the receiver. Coupled with the transmission powers, $\sqrt{\eta_n}$ has an important role. We see that the ideal choice of the transmission powers is now $p_{n,k} = \eta_n / |h_{n,k}|^2$. As such, the selection of a small $\eta_n$ will reduce the amount of energy required to invert a channel and thereby reduce the fading error. However, lowering $\eta_n$ will also increase the relative power of the noise. Therefore, the post-transmission scalar $\sqrt{\eta_n}$ will play the role of a trade-off parameter between the fading error and the noise-induced error [19].
In this work, we propose AirReComp, which considers retransmissions in the uplink aggregation step. Specifically, the devices transmit the same values in the uplink $M$ times such that the signal part of (10) combines constructively, while the additive noise is different for each transmission. After receiving $M$ values, the PS forms its estimate by calculating their arithmetic mean $y_n$. Next, the PS takes the real part of $y_n$ to reduce the power of the noise, which gives the estimate in (12). With appropriate choices of $p_{n,k}$ and $\eta_n$ (elaborated upon in Section III), the estimate described in (12) can be a close estimate of $\Delta w_n$. However, note that due to the analog modulation protocol, the norm of the model update $\|\Delta \hat{w}_n\|$ depends on the transmission powers $p_{n,k}$. To ensure that the transmission protocol does not affect the length of the global update step, the PS updates the model as in (13), where $c_1$ is a normalization constant. This particular choice of normalization is motivated by the fact that $E[\Delta \hat{w}_n/(c_1/K)] = \beta \Delta w_n$, given that the model updates are independent and identically distributed (IID). See Appendix A, (43) for details.
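To make the uplink steps concrete, the following sketch simulates one AirReComp aggregation: precoding, $M$ noisy superimposed transmissions, averaging, taking the real part, and post-scaling. Several details are assumptions rather than the paper's definitions: the precoder is taken to be phase inversion with amplitude $\sqrt{p_k}$, the powers follow a simple truncated channel inversion $p_k = \min(P_{\max}, \eta/|h_k|^2)$ with a hand-picked $\eta$, and the complex noise has total variance $\sigma_z^2$ per element.

```python
import numpy as np

def airrecomp_estimate(delta_w, h, p, eta, M, sigma_z, rng):
    """Estimate the row-mean of delta_w (K x d) from M noisy over-the-air sums."""
    K, d = delta_w.shape
    # Assumed precoding: undo the channel phase and scale the amplitude by sqrt(p_k).
    x = (np.sqrt(p) * np.conj(h) / np.abs(h))[:, None] * delta_w
    y_sum = np.zeros(d, dtype=complex)
    for _ in range(M):
        # Same transmit signal every time, fresh complex Gaussian noise.
        z = sigma_z * (rng.normal(size=d) + 1j * rng.normal(size=d)) / np.sqrt(2)
        y_sum += h @ x + z                  # superposition over the MAC
    y_mean = y_sum / M                      # arithmetic mean of the M received values
    # Keep the real part and apply the post-transmission scaling.
    return np.real(y_mean) / (np.sqrt(eta) * K)

rng = np.random.default_rng(1)
K, d, P_max, eta = 20, 1000, 1.0, 0.2       # eta is hand-picked, not optimized
h = (rng.normal(size=K) + 1j * rng.normal(size=K)) / np.sqrt(2)
p = np.minimum(P_max, eta / np.abs(h) ** 2) # assumed truncated channel inversion
delta_w = rng.normal(size=(K, d))
target = delta_w.mean(axis=0)

mse = {M: np.mean((airrecomp_estimate(delta_w, h, p, eta, M, 1.0, rng) - target) ** 2)
       for M in (1, 8)}
print(mse)
```

Because the signal part is identical across retransmissions, only the noise contribution shrinks with $M$; the fading error caused by power-limited devices is unchanged for a fixed $\eta$.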
The whole AirReComp process is summarized in Algorithm 1.
Algorithm 1
...
Device k: $x_{n,k}$ ← Equation (8)
10: for each m = 1, 2, ..., M do
11: all devices simultaneously:
...
$\Delta \hat{w}_n$ ← Equation (12)
17: $w_{n+1}$ ← Equation (13)
18: end for
Remark 1: The expected transmission power of device $k$ is $E[\|x_{n,k}\|^2] = p_{n,k} E[\|\Delta w_{n,k}\|^2]$, where $x_{n,k}$ is defined in (8). Since the expected power is not $p_{n,k}$, a maximum transmission power constraint of $P$ leads to $p_{n,k} \leq P/(\max_k E[\|\Delta w_{n,k}\|^2])$. Throughout this paper, we refer to this maximum value as $P_{\max,n} := P/(\max_k E[\|\Delta w_{n,k}\|^2])$.
Remark 2: While retransmissions improve the ability to accurately estimate the global model update, the total training time is increased significantly. If the slowest device consumes approximately $T_c$ seconds to solve (5) and $T_u$ seconds for uplink communication, the total time spent in one communication round will be $T_c + M T_u$, which is roughly proportional to $M$ for communication-constrained systems. However, in the convergence bounds and numerical results we demonstrate that in low-SNR scenarios, this additional cost can be necessary to achieve sufficient performance.
Note that $T_c$ corresponds to the slowest device due to the straggler problem of FL, which could potentially be improved through the use of coded computing [36] or by introducing a relay [37], which is not considered in this work. In practice, the time for transmission could be estimated using information about the wireless protocol [38] and the computational time could be estimated using standard formulas relating to the computational capacity of the devices [39].

III. POWER CONTROL
In this section, we consider a power control problem to minimize the mean-squared estimation error defined in (17), where the expectation is taken over the AWGN, and $\Delta w_n$ and $\Delta \hat{w}_n$ are defined in (6) and (12), respectively. For mathematical tractability, as done in the related literature, we assume that these model updates are IID, zero mean, and have unit variance [19], [21]. To perform the minimization, we seek the optimal choice of the transmission powers $\sqrt{p_{n,k}}$ and the post-transmission scalar $\sqrt{\eta_n}$. Since we consider static fading coefficients, the power control problem only has to be solved once per communication round (the same solution is reused for $M$ transmissions). To model the limited transmission power of the devices, we consider the constraint $p_{n,k} \leq P_{\max,n}$ for all $k$, where $P_{\max,n}$ is defined in Remark 1. The minimization of (17) under this constraint is formulated as problem (19), where the subscript $n$ has been omitted for brevity. Note that the number of transmissions $M$ is given as an input parameter and is selected before the power control problem is solved.
Proposition 1: Problem (19) has a unique solution. The optimal post-transmission scalar $\eta^*$ is given by the solution to the $K$ subproblems in (20), where the candidate values $\eta_k$ are given in (21). The optimal transmission powers are given in (22).
The proof of Proposition 1 follows the proof in [19] and is omitted from this paper.
Remark 3: From (21), we see that the post-transmission scalar $\eta_n^*$ assumes a lower value when more retransmissions are used. As we increase the number of retransmissions, the signal-to-noise ratio (SNR) increases and, consequently, the noise-induced error reduces. Therefore, the fading error becomes dominant and the optimal post-transmission scalar $\eta_n^*$ is lowered to improve it.
Corollary 1: The optimal transmission powers $p_k$, given in Proposition 1, are decreasing in $M$.
Proof: From (21), it is clear that all $\eta_{n,k}$ are strictly decreasing in $M$. The post-transmission scalar $\eta_n$ is selected according to (20) as the smallest of the $K$ different $\eta_{n,k}$, and is therefore also decreasing in $M$. The transmission powers $p_{n,k}$ are selected according to (22), from which it is clear that $p_{n,k}$ is increasing in $\eta_n$, and therefore decreasing in $M$.
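The monotonicity in Remark 3 and Corollary 1 can be explored numerically without the closed form. The sketch below is an assumption-laden stand-in for Proposition 1: it assumes truncated channel inversion $p_k = \min(P_{\max}, \eta/|h_k|^2)$, unit-variance updates, and a noise contribution of $\sigma_z^2/(2M\eta)$ after averaging and taking the real part, and it replaces the closed-form solution with a grid search over $\eta$. All function names are hypothetical.

```python
import numpy as np

def mse_times_K2(eta, h_abs2, M, sigma_z2, P_max):
    """K^2 times the per-element MSE under the assumed model:
    fading error of power-limited devices plus averaged noise."""
    p = np.minimum(P_max, eta / h_abs2)
    fading = np.sum((np.sqrt(p * h_abs2 / eta) - 1.0) ** 2)
    noise = sigma_z2 / (2.0 * M * eta)
    return fading + noise

def best_eta(h_abs2, M, sigma_z2, P_max, grid):
    """Grid search for the post-transmission scalar (stand-in for Proposition 1)."""
    errors = [mse_times_K2(eta, h_abs2, M, sigma_z2, P_max) for eta in grid]
    return grid[int(np.argmin(errors))]

rng = np.random.default_rng(2)
h_abs2 = rng.exponential(size=20)           # |h_k|^2 under unit Rayleigh fading
grid = np.logspace(-3, 1, 400)
Ms = (1, 2, 4, 8)
etas = [best_eta(h_abs2, M, 1.0, 1.0, grid) for M in Ms]
powers = [np.minimum(1.0, eta / h_abs2) for eta in etas]

for M, eta, p in zip(Ms, etas, powers):
    print(M, eta, p.mean())
```

In this toy model the selected $\eta$ and every device's power are non-increasing in $M$: a larger $M$ shrinks the noise term, so the search favors a smaller $\eta$ that reduces the fading error.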

IV. CONVERGENCE ANALYSIS
In this section, we analyze the learning performance of Algorithm 1. For the analysis, we assume that there is only one epoch of local training in each communication round ($E = 1$). Additionally, we assume that the channels remain static for the entire duration of the training process. As a result, we drop the $n$ index in the channel coefficients $h_k$, the transmission powers $p_k$, and the post-transmission scalar $\eta$. The effect of dynamic channels is evaluated numerically in Section V. The performance is measured as the FL loss gap at communication round $n$, defined in (23). We derive two upper bounds on this loss gap, one for strongly convex functions and one for convex functions. For both bounds, we use the following well-known lemma [20], [40].
Lemma 1: Let $F(x): \mathbb{R}^d \to \mathbb{R}$ be a convex function with $L$-Lipschitz gradient. Then, the following inequality holds:
Additionally, we make an assumption regarding the similarity of the local model updates $\Delta w_{n,k}$ and the global model update $\Delta w_n$ [24], [30].
Assumption 1: The local model updates $\Delta w_{n,k}$ are assumed to be independent and unbiased estimates of the global model update $\Delta w_n$.
The local gradients and the global gradient are in general different. The difference has coordinate-bounded variance [30], and as a consequence, the model update difference can be bounded as in (27), where $\Delta w_{n,k}^{(i)}$ is the $i$-th element of $\Delta w_{n,k}$, and $(\sigma^{(i)})^2$ are the element-wise upper bounds. We will also use $\sigma \in \mathbb{R}^d$ to denote the vector of variance bounds.

A. Strongly-Convex Loss
In this subsection, we assume that the FL loss is $\mu$-strongly convex. For such a loss, we use the following lemma [20], [40]:
Lemma 2: Let $F(x): \mathbb{R}^d \to \mathbb{R}$ be a $\mu$-strongly convex function with $L$-Lipschitz gradient. Then, the following inequality holds: $(\nabla F(x) - \nabla F(y))^T (x - y)$ ...
Before we are ready to state the upper bound, we must also assert that the local step size $\beta$ has been selected to be sufficiently small for convergence.
Assumption 2: Let the local step size $\beta$ be selected according to (29), where $p_k$ is determined according to Proposition 1.
Note that even though the power normalization step (13) makes the expected gradient norm independent of the power control parameters, the squared norm is still dependent on them. Therefore, the step length $\beta$ must be selected with respect to the power, as seen in (29). We are now ready to give the first upper bound on the FL loss function (23). Given Assumptions 1 and 2, the update described in (12) and (13) converges according to Proposition 2.
Proposition 2: Let the constants $c_2$ and $c_3$ be defined as in (30) and (31). Then the FL loss is upper bounded as in (32), where $r_0 = \|w_0 - w^*\|$ is the distance between the initial weight vector and the optimal one, $\sigma$ is a vector of the coordinate-bounded variances from (27), and $d$ is the number of model parameters.
Proof: The proof is provided in Appendix A.
We refer to the first term on the RHS of (32) as the diminishing term because it approaches zero as $n \to \infty$. Along the same line, we refer to the other term as the post-convergence term because it remains non-zero even as $n \to \infty$. From (32), we know that the convergence rate of the diminishing term is $O(c_2^n)$, typically called linear convergence. Implications of Proposition 2 are given in Section IV-C.

B. Convex Loss
In this subsection, we relax the assumption on strong convexity and develop a bound for Lipschitz smooth and convex loss functions.For this bound, we need a different guarantee on the fixed step size than for the strongly convex case.
Assumption 3: The fixed step size $\beta$ is selected to satisfy:
Proposition 3: Consider Assumptions 1 and 3. Then the FL loss is upper bounded as in (34), where $c_3$ is defined in (31).
Proof: The proof is provided in Appendix B.
From (34), it is clear that the convergence rate of the diminishing term is $O(1/n)$, typically called sub-linear convergence.

C. Discussion on Propositions 2 and 3
Since the propositions are upper bounds, we discuss the worst-case properties of the FL loss using AirReComp. We are specifically interested in the impact of the number of retransmissions.
1) Diminishing Term: In both bounds, the diminishing term is unaffected by $M$. Similar results can be seen in the optimization literature. For instance, an illuminating parallel can be drawn to the convergence bound for mini-batch gradient descent (GD) [41], given in (35), where $\mathrm{var}(v)$ is the variance introduced by the random selection of samples. Similar to our system, where the variance of the gradient is reduced by adding retransmissions, $\mathrm{var}(v)$ in mini-batch GD is reduced by increasing the batch size. As seen in (35), the effect of this variance reduction is only reflected in the post-convergence term, just as $M$ only shows up in the post-convergence terms of (32) and (34).
2) Final Error: Since both bounds have post-convergence terms, the algorithm does not converge to a local optimum. Instead, the algorithm converges to a region of optimality, where the expected remaining loss gap is given by the post-convergence terms. There are two reasons why AirReComp does not converge exactly. Firstly, the channel noise (characterized by $\sigma_z$) causes unavoidable errors which prevent exact convergence. Secondly, the difference between local and global model updates (characterized by $\sigma$) causes a global model update that differs from what is achieved in centralized gradient descent. This result aligns with what was found in [30].
We now take a closer look at the post-convergence terms of the two bounds. Since we are interested in the impact of retransmissions, we focus on the terms that are affected by $M$. To start, consider the post-convergence terms of (32) and (34), which can be expressed in a common form with a constant $C$, where $C := L/(2(1 - c_2))$ and $C := (2 + L)/2$ for (32) and (34), respectively. It is worth noting that the first term, caused by the gradient difference $\sigma$, cannot be completely eliminated, even with perfect communication. In fact, by the Cauchy-Schwarz inequality, we can lower-bound the first term by a quantity that is completely unaffected by the communication scheme.
As for the noise-induced term, we see an improvement of order $O(1/M)$. However, from Corollary 1, we know that this order is slightly diminished by the fact that the transmission powers $p_k$ are decreasing in $M$. Since the relationship between $p_k$ and $M$ cannot be stated in closed form, we analyze the post-convergence terms further in the numerical results.
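The split into a diminishing term and a post-convergence term can be illustrated on a one-dimensional toy problem. The sketch below is not the bound of Proposition 2 or 3: it assumes the loss $F(w) = w^2/2$ and an update $w_{n+1} = w_n - \beta(w_n + v_n)$, where $v_n$ is zero-mean noise with variance `noise_var / M`, and tracks the exact second moment $e_n = E[w_n^2]$. The noise variance is held fixed here, so the floor scales exactly as $1/M$, whereas in AirReComp the powers also change with $M$.

```python
def second_moment(n_rounds, beta, noise_var, M, e0=1.0):
    """Iterate e_{n+1} = (1 - beta)^2 e_n + beta^2 noise_var / M,
    the exact second-moment recursion of the assumed toy update."""
    e = e0
    for _ in range(n_rounds):
        e = (1.0 - beta) ** 2 * e + beta ** 2 * noise_var / M
    return e

def error_floor(beta, noise_var, M):
    """Fixed point of the recursion: the post-convergence level."""
    return beta * noise_var / (M * (2.0 - beta))

for M in (1, 2, 4, 8):
    print(M, second_moment(500, 0.1, 1.0, M), error_floor(0.1, 1.0, M))
```

The geometric factor $(1-\beta)^2$ governs how fast the initial error decays and does not depend on $M$, while the floor is inversely proportional to $M$, mirroring the roles of the two terms discussed above.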

V. NUMERICAL RESULTS
The performance of the proposed AirReComp system is now evaluated in terms of the model update estimation error, federated learning loss, and classification accuracy. Specifically, we have four goals with this section:
• To demonstrate the need for retransmission-aware power control, by comparing our proposed solution with the state-of-the-art single-transmission schemes proposed in [19], [20], and [21] and a perfect communication baseline;
• To demonstrate that introducing retransmissions is also beneficial for non-convex loss functions. Note that our analytical results assumed convex loss functions;
• To demonstrate that the proposed method is viable both when the channels change every communication round (as assumed in Section II) and when the channels remain static for the entire training process (as assumed in Section IV);
• To demonstrate the rate at which the post-convergence terms of the bounds we developed in Section IV decrease in $M$.

A. Power Control
In this subsection, we wish to evaluate the impact of our proposed power control scheme on the estimation error of the global update $\Delta w_k$. Specifically, we compare the MSE of the estimate for different choices of $M$, and compare the AirReComp power control scheme to the baseline solutions of [19] and [21], where the power control algorithm is unaware of the number of retransmissions. For this, we consider the transmission of randomly generated scalars instead of running a complete FL simulation setup. For this simulation, we consider $K = 20$ users and varying noise powers $\sigma_z^2$. To simulate the network, we generate channel coefficients according to unit Rayleigh fading $h_k \sim \mathcal{N}(0, 1/2) + j\mathcal{N}(0, 1/2)$ and additive noise components as $z \sim \mathcal{N}(0, \sigma_z^2)$. The transmitted scalars $\Delta w_k$ are generated according to the unit normal distribution, which matches the assumption in Section III. The maximum transmission power is selected as $P_{\max} = P = 1$, according to Remark 1. The PS estimate of the arithmetic mean $\Delta \hat{w}$ is generated according to (13), where the transmission powers $p_k$ and the post-transmission scalar $\sqrt{\eta}$ are selected according to Proposition 1. Upon calculating the estimate, it is compared to the true arithmetic mean of $\Delta w_k$ via the squared error. This process is repeated 20,000 times for each value of $\sigma_z^2$. The resulting MSEs are averaged to form the plot in Fig. 1. In the left plot of the figure, the estimation errors using different numbers of transmissions $M$ are illustrated. The plot demonstrates that the mean squared estimation error is approximately linear in the variance of the additive noise, regardless of $M$. In the right plot of Fig. 1, we compare the AirReComp power control scheme (Aware) to the retransmission-unaware scheme proposed in [19] and [21], which is the optimal power control scheme for a single transmission (Unaware). The numerical results demonstrate that AirReComp has a significantly lower estimation error when $M > 1$. The gap between AirReComp and the baseline also increases with $M$, which demonstrates the importance of designing the power control scheme with retransmissions in mind. From the left plot, it is clear that the reduction in estimation error is worse than proportional to $M$. Instead, the system using $M = 8$ achieves approximately three times lower estimation error than the baseline of $M = 1$. Compared to using a forward error-correcting code, this result is significantly worse. However, since such codes are not compatible with analog communication, retransmissions are a good first step towards enabling a communication-estimation trade-off.
Fig. 1. Left: ... is compared to using retransmissions ($M > 1$). Note that even though SNR scales linearly with the number of transmissions, the estimation error is not reduced as drastically. Right: The estimation error of optimal retransmission-aware power control is compared to a retransmission-unaware baseline. The results demonstrate the importance of designing the power control scheme with retransmissions in mind.
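The aware-versus-unaware comparison can be imitated with a small script. This is a sketch under stated assumptions, not a reproduction of Fig. 1: it assumes truncated channel inversion $p_k = \min(P_{\max}, \eta/|h_k|^2)$, evaluates the expected MSE analytically (unit-variance scalars, noise contribution $\sigma_z^2/(2M\eta)$) instead of by sampling, and selects $\eta$ by grid search. The "unaware" scheme picks $\eta$ as if $M = 1$ but is then used with $M$ transmissions. All names are hypothetical.

```python
import numpy as np

def mse(eta, h_abs2, M, sigma_z2, P_max=1.0):
    """Expected per-element MSE of the estimated mean under the assumed model."""
    p = np.minimum(P_max, eta / h_abs2)
    fading = np.sum((np.sqrt(p * h_abs2 / eta) - 1.0) ** 2)
    return (fading + sigma_z2 / (2.0 * M * eta)) / len(h_abs2) ** 2

def compare(M, sigma_z2=1.0, K=20, trials=100, seed=3):
    """Average MSE over random channels for aware and unaware choices of eta."""
    rng = np.random.default_rng(seed)
    grid = np.logspace(-3, 1, 100)
    aware, unaware = [], []
    for _ in range(trials):
        h_abs2 = rng.exponential(size=K)    # |h_k|^2 for unit Rayleigh fading
        eta_aware = min(grid, key=lambda e: mse(e, h_abs2, M, sigma_z2))
        eta_unaware = min(grid, key=lambda e: mse(e, h_abs2, 1, sigma_z2))
        aware.append(mse(eta_aware, h_abs2, M, sigma_z2))
        unaware.append(mse(eta_unaware, h_abs2, M, sigma_z2))
    return float(np.mean(aware)), float(np.mean(unaware))

results = {M: compare(M) for M in (1, 2, 4, 8)}
for M, (a, u) in results.items():
    print(M, a, u)
```

By construction the two schemes coincide for $M = 1$, and the aware choice can never be worse than the unaware one on the same grid; how large the gap becomes for $M > 1$ depends on the noise level and the channel draws.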

B. Federated Learning Convergence
In this subsection, we have two goals: to verify that the post-convergence classification accuracy is increasing in $M$ for non-convex loss functions and to demonstrate the level of improvement compared to other baselines. For the FL simulation, the network setup is identical to Section V-A, except that $\sigma_z^2$ is fixed for each simulation and $K = 10$. The ML task is multi-class classification on the MNIST dataset [42] with $|D_k| = 6000$ training samples per user device. The classifier is a DNN which consists of an input layer of 784 nodes, a hidden layer with 10 neurons, and an output layer of 10 neurons. The network is trained with a static learning rate of $\beta = 0.1$, ReLU activation, sparse categorical cross-entropy loss, L2 regularization with $\epsilon = 10^{-5}$, and without dropout. We run 2 epochs ($E = 2$) per communication round, for $N = 50$ rounds. The whole training process is repeated 10 times for each considered value of $M$; these results are then averaged to get the plots in Figs. 2a and 2b.
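As a quick sanity check on the model size, the parameter count of the described network can be computed directly. The sketch assumes fully-connected layers with one bias per neuron; the helper name is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

layers = [784, 10, 10]                      # input, hidden, and output widths
d = sum(dense_params(a, b) for a, b in zip(layers[:-1], layers[1:]))
print(d)                                    # 7960
```

The result, $784 \cdot 10 + 10 + 10 \cdot 10 + 10 = 7960$, is the number of values each device transmits per uplink transmission.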
AirReComp is compared to two baseline solutions. The first baseline (max power) is based on the scheme proposed in [20]. This scheme is a maximum power transmission scheme that does not require any channel information for the devices and thus allows for simple implementation. The second baseline (error-free) considers the case where there is no noise or fading, so that the server retrieves perfect copies of the model updates. Comparing AirReComp with the error-free baseline quantifies the performance gap caused by estimation errors of the model update aggregation.
In Figs. 2a and 2b, the results of two simulations with $\sigma_z = 1$ and $\sigma_z = 2$ are presented. We wish to highlight that these simulations correspond to low-SNR scenarios, because even though $p_k$ has a maximum value of $P_{\max,n} = 1$, the actual transmission power is much lower, as mentioned in Remark 1. In our simulations, we measured the average update norm to be $E[\|w_{n,k}\|^2] = 329$. Because this is lower than the number of parameters in the model ($d = 7960$), the average signal strength is less than 1. As a result, the average SNR was −5.3 dB and −11.3 dB for $\sigma_z = 1$ and $\sigma_z = 2$, respectively.
These results clearly demonstrate that the classification accuracy is improved as additional retransmissions are introduced, at least in low-SNR scenarios. While the convergence analysis in Section IV only holds for convex loss functions, these results show that the method can also offer benefits for more complicated non-convex models, such as DNNs. Specifically, for $\sigma_z = 1$, the system with $M = 1$ achieved an average classification accuracy of 58%, while the best system with $M = 8$ achieved an average classification accuracy of 88%. The poor result of 58% is largely due to the low SNR of the network. This can also be seen by comparing Figs. 2a and 2b, where the latter has a significantly wider gap between $M = 1$ and $M = 8$.
Fig. 2. Federated Learning performance with AirReComp. We consider $K = 20$ devices and train fully-connected DNNs over a multiple access channel with fading. We consider AirReComp with $M = 1, 2, 4, 8$ and two baselines. The first baseline corresponds to the max-power system and the second corresponds to the error-free system. Both plots correspond to low-SNR systems with varying levels of noise.
If we compare our proposal to the baselines, it is clear that the max-power baseline performs close to the system with M = 1. This is unexpected, since the baseline does not perform any power control and should therefore experience a worse MSE. However, it could potentially be explained by the assumption that E[(∆w)^2] = 1 in our power control scheme, which does not hold in practice. Alternatively, if the MSE improvement between the max-power baseline and M = 1 is minor, it might not have a noticeable effect on the classification accuracy when training with IID data, as suggested in [24].
When comparing to the error-free baseline, we notice that our proposal with M = 8 transmissions achieves 6% worse classification accuracy than perfect communication. This highlights the issue of estimation errors in FL performance and suggests that further improvements are necessary for low-SNR scenarios. One could always increase M, but at some point the increased communication cost makes digital communication a better alternative.
Fig. 3. Federated Learning performance with AirReComp. This simulation uses the same setup as in Fig. 2 except that channels remain static throughout the training process. While the performance is not identical, the results demonstrate that the overall trend matches that of dynamic channels which change between communication rounds.
Finally, we provide a simulation for the case of static channels. The simulation setup is identical to that of Fig. 2 except that the same channel coefficients are used for all N communication rounds. As illustrated in Fig. 3, the classification accuracies are slightly worse for all systems (except for the error-free system), but the overall trend matches that of Fig. 2. A possible explanation for the performance decline is that, with static channels, any device that experiences a poor channel coefficient will consistently contribute less to the global update. Therefore, the knowledge contained in its dataset will be underrepresented, leading to model drift between its local model and the global model. In the dynamic case, by contrast, new channels are experienced in each communication round, so the model drift would be corrected whenever a better channel is sampled.
Fig. 4. The relative decline of the post-convergence terms in (32) and (34). The ∥σ∥^2-term refers to the term caused by the difference between the local and the global model updates; this term cannot be completely eliminated even with perfect communication, in contrast to the σ_z^2-term caused by the additive noise.
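The static-versus-dynamic explanation given for Fig. 3 can be illustrated with a toy model. The sketch below is not the paper's power control scheme: it assumes, purely for illustration, that the share of device k in a round's aggregate is proportional to its Rayleigh channel magnitude |h_k|, and compares how evenly the time-averaged shares are spread over N = 50 rounds when channels are held fixed versus resampled each round. All function names and the weighting rule are hypothetical.

```python
import math
import random


def rayleigh_magnitude(rng):
    """Draw |h| for h ~ CN(0, 1), i.e., unit average channel power."""
    re = rng.gauss(0.0, math.sqrt(0.5))
    im = rng.gauss(0.0, math.sqrt(0.5))
    return math.hypot(re, im)


def mean_shares(num_devices, num_rounds, static, rng):
    """Time-averaged contribution share of each device.

    Assumption (not from the paper): in each round, device k contributes
    with share |h_k| / sum_j |h_j|.
    """
    totals = [0.0] * num_devices
    fixed = [rayleigh_magnitude(rng) for _ in range(num_devices)]
    for _ in range(num_rounds):
        if static:
            gains = fixed  # same channel in every round
        else:
            gains = [rayleigh_magnitude(rng) for _ in range(num_devices)]
        norm = sum(gains)
        for k, gain in enumerate(gains):
            totals[k] += gain / norm
    return [total / num_rounds for total in totals]


rng = random.Random(0)
static_shares = mean_shares(10, 50, True, rng)
dynamic_shares = mean_shares(10, 50, False, rng)

static_spread = max(static_shares) - min(static_shares)
dynamic_spread = max(dynamic_shares) - min(dynamic_shares)
print(static_spread, dynamic_spread)
```

Under this simplified weighting, the spread between the most and least represented device is much larger with static channels, which is consistent with the underrepresentation argument above.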

C. Convergence Bounds
In Section IV, we developed two bounds on the FL loss function and illustrated that the post-convergence terms are expected to decrease in M. However, due to the lack of a closed-form expression for the relationship between the transmission powers p_k and the number of uplink transmissions M, we were unable to provide the rate of decline with respect to M. Instead, we demonstrate the decline of these terms in this section. Specifically, there are two post-convergence terms for each bound, as expressed in (36).
In practice, the three learning-related variables ∥σ∥^2, L, and µ of the bounds are difficult to estimate. Therefore, rather than attempting to evaluate the absolute magnitude of the post-convergence terms, we look at the relative decline with respect to M. We define the relative decline as the quotient of the post-convergence term for M transmissions and the same term evaluated for one transmission, given by relative decline(∥σ∥^2) where p_k(M) is the transmission power of device k evaluated according to Proposition 1. For the simulation, we use the same network setup with K = 20 devices, Rayleigh fading, and σ_z = 1. We simulated the terms for 1,000 random realizations of the channels and averaged to get the results in Fig. 4.
As displayed in Fig. 4, the error caused by the difference between local and global gradients is hardly affected by introducing additional retransmissions. This is to be expected, since the only improvement comes from the slight decrease in transmission powers that follows from an increased M, as highlighted in Corollary 1. The noise-induced error, however, is significantly improved, nearly at the order of O(1/M), but with a gap due to the decreased transmission powers, as discussed in Section IV-C.
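The O(1/M) reference behaviour of the noise-induced term can be illustrated with a small Monte Carlo experiment. The sketch below assumes that the receiver simply averages M receptions of the same signal and that the transmission powers do not change with M; both are simplifying assumptions rather than the AirReComp estimator, and the second is exactly what the gap mentioned above departs from. The function name is hypothetical.

```python
import random
import statistics


def averaged_noise_variance(num_tx, sigma_z, trials, rng):
    """Empirical variance of the residual noise after averaging
    num_tx independent receptions with noise std sigma_z."""
    residuals = []
    for _ in range(trials):
        noise = [rng.gauss(0.0, sigma_z) for _ in range(num_tx)]
        residuals.append(sum(noise) / num_tx)
    return statistics.pvariance(residuals)


rng = random.Random(1)
baseline = averaged_noise_variance(1, 1.0, 20000, rng)

relative_decline = {}
for m in (1, 2, 4, 8):
    variance = averaged_noise_variance(m, 1.0, 20000, rng)
    relative_decline[m] = variance / baseline  # close to 1/m

print(relative_decline)
```

With fixed powers the ratio follows 1/M closely; in the setting of Fig. 4 the decline is somewhat slower because the optimal transmission powers decrease as M grows.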

VI. CONCLUSION
In this paper, we propose retransmissions for Over-the-Air FL, in a system we call AirReComp. Arguably, this is the first work to enable a trade-off between communication resources and convergence speed for Over-the-Air FL. To improve the estimation error of AirReComp, we find a closed-form solution for optimal power control in the uplink. This power control solution shows that the number of retransmissions must be known by the transmitters to realize the MMSE estimator. We also prove two upper bounds on the FL loss for the AirReComp system, for both strongly convex and convex loss functions. These bounds show that the post-convergence error of FL is strictly decreasing in the number of retransmissions, while the convergence rate is unaffected even though the estimation error of the updates is decreased. This contradicts the findings of earlier works on AirComp [20], [30]. The reason is that those works do not normalize the update step, and thus the transmission scheme directly impacts the learning rate. We numerically verify the improved post-convergence performance for non-convex loss functions by training DNNs with AirReComp. The simulations also demonstrate that AirReComp can significantly outperform single uplink transmissions as well as full-power baselines.
There is interesting open work on the reduction of estimation errors for Over-the-Air FL, including:
• Gradient statistics for power control. In a recent work [24], the authors proposed that online estimation of gradient statistics can significantly improve the power control of over-the-air FL. They found the optimal power control algorithm given that these statistics are known, but only for one-shot transmission. By combining this result with AirReComp, one could avoid the assumption that E[(∆w)^2] = 1 and find the optimal power control scheme for more realistic assumptions.
• The consideration of fast-fading channels and diversity gains for AirReComp. In this work, we consider a network with static channels, which restricts the improvements of retransmissions to reducing the power of the noise. In a fast-fading scenario, the problem changes substantially, especially the power control problem described in Section III. We have taken a first step in this direction in [34], where we show that by exploiting the ergodicity of the fast-fading channel, one can probabilistically guarantee unbiased over-the-air computation under peak transmission power constraints.
• The consideration of non-IID data distributions. In this work, we only consider IID data distributions both in the analytical and numerical results. It is likely that the importance of improved estimation is more pronounced when the datasets are non-IID, because the suppression of any individual update should cause greater harm to the convergence. There has been some work on over-the-air FL with non-IID data [43] but, as far as we are aware, no work that considers the gradient estimation.
• Other methods of controlling the estimation error for Over-the-Air FL. For instance, one could consider the possibility of a distributed channel code. With analog communication, this appears to be inapplicable, but with recent ideas of one-bit digital Over-the-Air Computation, there might be possibilities to explore in this direction [44], [45]. Additionally, one could consider combining the retransmission scheme with device selection for further improvements, as suggested in [46].
• Tradeoff between transmission power and retransmissions. Instead of focusing on adding retransmissions to improve the estimation error, one could consider changing the transmission power. Similar to the tradeoff explored in this work, there is a tradeoff between transmission power and convergence rate. Especially for low-powered IoT devices, it would be interesting to analyze how much the transmission power could be reduced without significantly harming the FL performance.

APPENDIX A PROOF OF PROPOSITION 2
We start the proof by expressing the distance between the optimal global model w* and the current global model w_n at communication round n as
This distance can be related to the FL loss function via Lemma 1 and Lemma 2. The plan for the proof is to utilize this relationship to form the upper bound. Before we get to that stage, we need to introduce the impact of AirReComp on the model update. To do so, we use (13) with (41) to express
where ∆ŵ_n is the model update from (12) and c_1 is defined in (14). Next, we take the expectation of (42) with respect to ∆w_{n,k} and z_m. To do that, we first need to determine E[∆ŵ_n] and E[∥∆ŵ_n∥^2]. Beginning with E[∆ŵ_n], we use (12) to get
which has been simplified using Assumption 1; the final equality holds since we assume there is only one epoch (E = 1), so the model update is the gradient of the global loss function. Next, we find E[∥∆ŵ_n∥^2], once again using (12)
The first term of (44) can be upper-bounded by the Cauchy-Schwarz inequality as follows
Then, we use our assumption on the local model updates from (27) to get
With E[∆ŵ_n] and E[∥∆ŵ_n∥^2] evaluated in (43) and (46), we go back to the model distance. Taking the expectation on both sides of (42) yields
Now we are ready to introduce the FL loss by utilizing strong convexity and Lipschitz smoothness. We do this by rewriting Lemma 2 to
where we have utilized ∇F(w*) = 0 for the final term on the RHS. Combining (47) and (48) yields
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Since this expression is getting long, we use three constants c_2, c_3, and c_4 to simplify it. The model distance is then
where c_2 and c_3 were defined in (30) and (31), respectively. Because of our choice of learning rate in Assumption 2, we have the following inequality for c_4
Since c_4 is less than zero, we can rewrite our bound in (50) as
At this point, the bound is almost complete. The only thing that remains is to find an inequality comparing E[r_n^2] and E[r_0^2] instead of comparing two adjacent communication rounds. As such, we reduce the communication round counter by one and replace E[r_n^2] in (52) to get
By induction we have
Then we apply
Finally, we utilize convexity and Lipschitz smoothness from (24) to relate the LHS of (55) to the FL loss, which yields
which is the bound from Proposition 2.
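The induction step used above follows a standard pattern: a one-step recursion of the form a_{n+1} = ρ a_n + b with 0 < ρ < 1 unrolls into a geometric series. The sketch below checks this pattern numerically. Here `rho` and `b` are hypothetical stand-ins for the contraction factor and additive term of the proof (which are built from constants such as c_2 and c_3); the numerical values are arbitrary and are not taken from the paper.

```python
def unroll(rho, b, a0, n):
    """Apply the one-step recursion a <- rho * a + b for n rounds."""
    a = a0
    for _ in range(n):
        a = rho * a + b
    return a


def closed_form(rho, b, a0, n):
    """Geometric-series solution: rho^n * a0 + b * (1 - rho^n) / (1 - rho)."""
    return rho ** n * a0 + b * (1.0 - rho ** n) / (1.0 - rho)


# Arbitrary illustrative values (assumptions, not the paper's constants).
RHO, B, A0 = 0.9, 0.05, 4.0

iterated = unroll(RHO, B, A0, 30)
direct = closed_form(RHO, B, A0, 30)
post_convergence = B / (1.0 - RHO)  # limit of the recursion as n grows

print(iterated, direct, post_convergence)
```

The first part of the closed form vanishes geometrically with n, while b / (1 − ρ) remains; this mirrors the split of the bound into a convergence term and post-convergence terms.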

APPENDIX B PROOF OF PROPOSITION 3
Just as in the first proof, we utilize the properties of convexity and Lipschitz smoothness to relate the distance between the optimal global model w* and the current global model w_n to the FL loss function. In contrast to the first proof, we use these properties immediately. Specifically, we start with Lemma 1 and take the expectation on both sides to get
Then, we add the global update via (13) to get
Next, (43) gives us
We insert the bound for E[∥∆ŵ_n∥^2] from (46) to get
Next, we recognize c_3 from (31) and substitute it into the bound
This expression can be simplified by using Assumption 3 to bound the second term on the RHS
Next, we are going to upper bound the first term on the RHS of (62). We use the following standard property of convexity (see equation 2.1.2 from [40]):
Plug this into (62)
Then (65) becomes
Now, just like in the previous proof, we want to form a bound with respect to r_0^2. However, instead of using induction, we use a telescoping sum. To set it up, we start by taking a sum of (67) over n communication rounds to get
The sum n i=1 E r 2 i−1 − r2 i can be rewritten as
and the middle terms ] will be upper bounded by a constant. We develop this bound next. To start, we plug in the definition of r 2 i+1 and r2 i+1 into E[r 2 i+1 − r2 i+1 ]:
Applying (13) and doing some algebra yields
Then we apply (43) to get
Next, we insert ∥∆ŵ_i∥^2 from (44), apply Assumption 1, and do some algebra, which yields
Since we assume E = 1, we have
Finally, we apply the Cauchy-Schwarz inequality to get
That concludes the upper bound on E r 2 i+1 − r2 i+1, so we plug it back into (68) to get
which is the bound from Proposition 3.
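The telescoping argument used in this proof relies on the identity that a sum of consecutive differences collapses to the difference between the first and last element. The following sketch verifies the identity on an arbitrary sequence; the values are illustrative placeholders and do not correspond to the distances r_i^2 of the proof.

```python
import math


def telescoped_sum(values):
    """Sum of consecutive differences values[i-1] - values[i], i = 1..n."""
    return sum(values[i - 1] - values[i] for i in range(1, len(values)))


# Arbitrary illustrative sequence (an assumption, not data from the paper).
sequence = [5.0, 3.5, 3.0, 2.2, 2.0]

collapsed = telescoped_sum(sequence)
endpoints = sequence[0] - sequence[-1]

print(collapsed, endpoints)
```

All intermediate elements cancel, so only the first and last values of the sequence survive in the sum.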

Fig. 1. Estimation error evaluation of AirReComp. We consider K = 20 devices and evaluate the squared estimation error. Left: The estimation error of a single transmission (M = 1) is compared to using retransmissions (M > 1). Note that even though SNR scales linearly with the number of transmissions, the estimation error is not reduced as drastically. Right: The estimation error of optimal retransmission-aware power control is compared to a retransmission-unaware baseline. The results demonstrate the importance of designing the power control scheme with retransmissions in mind.

r2 n is positive, we can add one to the RHS of (76) without breaking the inequality. Similarly, we can add c_3 / Σ_{k=1}^{K} E√p_k |h_k|^2 to the RHS, which together yields (77). Finally, we note that E[F(w_n)] ≤ E[F(w_i)] for all i ≤ n. Therefore E[F(w_n)] − F(w*) ≤ (1/n) Σ_{i=1}^{n} E[F(w_i)] − F(w*)

Federated Learning Over-the-Air by Retransmissions. Henrik Hellström, Graduate Student Member, IEEE, Viktoria Fodor, Member, IEEE, and Carlo Fischione, Senior Member, IEEE

the first communication round is based on the global model, i.e., w_{n,k}(0) = w_n, and β is the step size. After executing E epochs, device k calculates a local model update as ∆w_{n,k} = w_n − w_{n,k}(E). After all local model updates have been computed, they are transmitted in the uplink to the PS. At the PS, the local model updates are aggregated to form a global model update. In this paper, we consider the original FedAvg update

• The consideration of non-IID data distributions. In this work, we only consider IID data distributions both in