Beamforming and Device Selection Design in Federated Learning With Over-the-Air Aggregation

Federated learning (FL) with over-the-air computation can efficiently utilize the communication bandwidth but is susceptible to analog aggregation error. Excluding those devices with weak channel conditions can reduce the aggregation error, but it also limits the amount of local training data for FL, which can reduce the training convergence rate. In this work, we jointly design uplink receiver beamforming and device selection for over-the-air FL over time-varying wireless channels to maximize the training convergence rate. We reformulate this stochastic optimization problem into a mixed-integer program using an upper bound on the global training loss over communication rounds. We then propose a Greedy Spatial Device Selection (GSDS) approach, which uses a sequential procedure to select devices based on a measure capturing both the channel strength and the channel correlation to the selected devices. We show that given the selected devices, the receiver beamforming optimization problem is equivalent to downlink single-group multicast beamforming. To reduce the computational complexity, we also propose an Alternating-optimization-based Device Selection and Beamforming (ADSBF) approach, which solves the receiver beamforming and device selection subproblems alternatingly. In particular, despite the device selection being an integer problem, we are able to develop an efficient algorithm to find its optimal solution. Simulation results with real-world image classification demonstrate that our proposed methods achieve faster convergence with significantly lower computational complexity than existing alternatives. Furthermore, although ADSBF shows marginally inferior performance to GSDS, it offers the advantage of lower computational complexity when the number of devices is large.


I. INTRODUCTION
Federated learning (FL) is an effective distributed machine learning technique that allows multiple devices to collaboratively learn a global model using their local datasets [1], [2]. A parameter server needs to aggregate the local model updates from the devices to perform a global model update. However, in wireless FL, information exchange between the devices and the server can create stress on the limited communication resources, especially when a substantial number of devices participate in the process. In such a scenario, the devices may not be able to send their updates simultaneously via conventional orthogonal multiple access over the limited available bandwidth.
In order to improve communication efficiency in wireless FL, analog aggregation of the local models has been proposed [3]. In this approach, the devices simultaneously transmit their local models using analog modulation over a shared wireless uplink channel, which naturally results in model aggregation at the receiver by superposition. Such over-the-air computation has attracted growing interest due to its advantages of efficient utilization of bandwidth and reduced communication latency over the conventional approach of orthogonal multiple access [4]-[10]. Some recent works have also developed real-life prototypes for FL with over-the-air aggregation [11]-[14]. However, over-the-air computation is susceptible to noise, which can cause significant aggregation errors that propagate over the FL computation and communication iterations. Furthermore, the quality of aggregation is disproportionately affected by devices with weak channel conditions, since the devices with strong channel conditions have to reduce their transmit power in order to align their transmitted signal amplitude with that of devices experiencing weak channels. This adjustment leads to a lower received signal-to-noise ratio (SNR) [2]. Although excluding devices with weak channels can reduce the aggregation error, it can also harm the learning performance as a result of the reduced size of training data. Therefore, we need to carefully design an effective method for device selection that can properly trade off these two effects to improve the overall FL training performance.
⋆ University of Toronto, † Ontario Tech University, ‡ Ericsson Canada
Device selection for over-the-air FL was first considered in [3] and later in [15] and [16]. In [3], a distance-based device selection method within a cell was proposed to increase the received SNR at the base station (BS). In [15], the authors proposed to only select the devices with strong channel conditions in the learning process to improve the convergence of the model training. However, how to design a proper threshold on the channel strength for device selection was not discussed. In [16], under imperfect downlink channel conditions for model broadcasting, device selection and transmit power at both devices and the server were jointly optimized to minimize the global training loss. All these works assume a single antenna at the server for communication, and thus multi-antenna receiver processing was not considered.
In current wireless networks, the server is typically equipped with multiple antennas, where beamforming techniques can be applied to enhance the signal strength and reduce the noise in over-the-air computation [17], [18]. It was demonstrated in [19] that the method in [18] can be applied to improve FL performance. However, the study in [19] did not consider device selection. Joint receiver beamforming and device selection was studied in [20]-[22]. In [20], the joint design aimed to maximize the number of selected devices while limiting the communication error by a target threshold, and a difference-of-convex-function (DC) programming method was proposed to solve the joint optimization problem. For the same problem considered in [20], the authors of [21] introduced a low-complexity method to design the receiver beamforming and device selection jointly. FL via a reflective intelligent surface (RIS)-assisted wireless system was considered in [22], where device selection, receiver beamforming, and RIS phase shift were jointly considered to minimize an upper bound on the steady-state expected global loss as the training time approaches infinity. In their proposed scheme, the successive convex approximation (SCA) method was used to design receiver beamforming, and Gibbs sampling was used for device selection.
However, there are some limitations in these existing methods. First, the design and analysis of these works all assume that the channel states remain unchanged for all communication rounds during the entire FL training process, which is unrealistic in practical systems. Also, for [20] and [21], the design objective of maximizing the number of selected devices does not directly measure the joint impact of device selection and imperfect communication on FL training performance. The main challenge of this approach is that it is unclear how to properly set the target threshold for the communication error, which represents the proper trade-off between the communication error and the impact of device selection on the FL training convergence. Although [22] directly uses the global training loss as the design objective, the proposed Gibbs sampling method has high computational complexity as the number of devices grows, which is undesirable for implementation in practical systems, especially when device selection needs to be updated in each communication round. It is important to design a low-complexity algorithm for device selection and receiver beamforming that effectively improves over-the-air FL training performance.
Given the above issues, in this work, we consider FL with over-the-air aggregation. Aiming at improving the training convergence rate for FL, we jointly design uplink receiver beamforming and device selection to minimize the global training loss after an arbitrary number $T$ of communication rounds, subject to a per-device average transmit power constraint. Unlike the existing works [20]-[22], we formulate our design problem assuming the channel states between the server and the devices can change over communication rounds, and the device selection and beamforming solutions are computed in each round based on the current channel state information. The main contributions of this paper are summarized as follows:
• The formulated joint receiver beamforming and device selection problem is a challenging finite time-horizon stochastic optimization problem. By analyzing the training procedure, we obtain an upper bound for the global training loss function after $T$ communication rounds. To improve the convergence rate of FL, we design receiver beamforming and device selection to minimize this upper bound on the global loss.
• The reformulated joint optimization problem is a mixed-integer programming problem that presents significant challenges. We first propose a Greedy Spatial Device Selection (GSDS) approach to obtain a solution. GSDS uses a greedy method to select the devices and then solves the corresponding beamforming optimization problem among the selected devices. In particular, GSDS uses a sequential procedure to add devices to the set of selected devices based on a metric that measures the channel strength and the channel correlation to the selected devices. Given the selected devices, we show that the optimization problem with respect to (w.r.t.) receiver beamforming is equivalent to downlink single-group multicast beamforming, for which we apply an SCA method to obtain a solution. The overall computational complexity of GSDS is shown to grow with the number of devices $M$ as $O(M^3)$.
• To reduce the computational complexity, we further devise an Alternating-optimization-based Device Selection and Beamforming (ADSBF) approach. ADSBF employs the alternating optimization technique to break the joint optimization problem into two subproblems w.r.t. receiver beamforming and device selection and solve them alternatingly. We show that the receiver beamforming subproblem is the same as that in GSDS and can be solved in the same manner. For the device selection subproblem, despite being an integer problem that generally is difficult to solve, we are able to develop an efficient algorithm in our problem to find the optimal solution with computational complexity $O(M \log(M))$.
• We test the proposed GSDS and ADSBF with real-world image classification tasks using the MNIST and CIFAR-10 datasets. Our simulation results demonstrate that both GSDS and ADSBF outperform existing methods for beamforming and device selection design in terms of the training convergence rate. They also have significantly lower computational complexity compared with the existing methods. This shows the effectiveness of our proposed approaches in providing a proper tradeoff between the impact of noisy communication and the amount of training data. Between the two approaches, GSDS performs slightly better than ADSBF as it leads to faster convergence. However, the run time of ADSBF is significantly lower than that of GSDS when the number of devices is large.
The rest of this paper is organized as follows. In Section II, we present the system model and problem formulation for over-the-air FL. In Section III, we reformulate the problem via training convergence analysis. In Section IV, we propose the GSDS and ADSBF approaches to obtain the solution. Simulation results are provided in Section V, followed by the conclusion in Section VI.

A. FL System
We consider a wireless network consisting of a server and $M$ local devices. The set of devices is denoted by $\mathcal{M}$. Device $m$ has a local training dataset of size $K_m$, denoted by $\mathcal{D}_m = \{(\mathbf{x}_{m,k}, y_{m,k}) : 1 \le k \le K_m\}$, where $\mathbf{x}_{m,k}$ is the $k$-th data feature vector, and $y_{m,k}$ is its corresponding label. The devices aim to collaboratively train a global model at the server that can predict the true labels of data feature vectors from all devices while keeping their local datasets private. The empirical local training loss function at device $m$ is defined as
$$F_m(\mathbf{w}; \mathcal{D}_m) = \frac{1}{K_m} \sum_{k=1}^{K_m} l(\mathbf{w}; \mathbf{x}_{m,k}, y_{m,k}), \tag{1}$$
where $\mathbf{w} \in \mathbb{R}^D$ is the global model parameter vector, and $l(\cdot)$ is the sample-wise training loss function associated with each data sample. The global training loss function is defined as the weighted sum of the local loss functions over all devices, given by
$$F(\mathbf{w}) = \frac{1}{K} \sum_{m=1}^{M} K_m F_m(\mathbf{w}; \mathcal{D}_m), \tag{2}$$
where $K = \sum_{m=1}^{M} K_m$ is the total number of training samples over all devices.
We follow the general Federated Stochastic Gradient Descent (FedSGD) approach for the iterative model training in FL, where the server updates the global model parameters based on an aggregation of the gradients of all devices' local loss functions [23]. The learning objective is to find the optimal global model $\mathbf{w}^\star$ that minimizes the global training loss function $F(\mathbf{w})$. We call each iteration of the global model update a communication round. In communication round $t$, the following steps are performed:
1) Device selection: The server selects a subset of devices to contribute to the training of the model. The set of selected devices in round $t$ is denoted by $\mathcal{M}^s_t \subseteq \mathcal{M}$.
2) Downlink phase: The server broadcasts the model parameter vector $\mathbf{w}_t$ to all devices and notifies the selected devices.
3) Local gradient computation: Each selected device $m$ computes the gradient of its local loss function, given by $\mathbf{g}_{m,t} = \nabla F_m(\mathbf{w}_t; \mathcal{D}_m)$, where $\nabla F_m(\mathbf{w}_t; \mathcal{D}_m)$ is the gradient of $F_m(\cdot)$ at $\mathbf{w}_t$.
4) Uplink phase: The selected devices send their local gradients to the server through their uplink wireless channels.
5) Global model update: The server computes a weighted aggregation of local gradients to update the global model. In the ideal scenario where the local gradients can be received at the server accurately, the weighted aggregation $\mathbf{r}_t \triangleq \sum_{m \in \mathcal{M}^s_t} K_m \mathbf{g}_{m,t} \in \mathbb{R}^D$ is used to update $\mathbf{w}_t$. However, in practice, the received complex baseband signal processed at the server, $\hat{\mathbf{r}}_t \in \mathbb{C}^D$, is imperfect due to the noisy communication channels. Thus, the server updates the global model as
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \lambda \frac{\Re[\hat{\mathbf{r}}_t]}{\sum_{m \in \mathcal{M}^s_t} K_m}, \tag{3}$$
where $\lambda$ is the learning rate, and $\Re[\cdot]$ represents the real part of a complex variable.
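The five steps above can be sketched for the ideal, error-free case as follows. This is a minimal illustration only: the squared-error sample loss, the synthetic data, and the names `local_gradient`, `fedsgd_round`, and `global_loss` are assumptions made for the example and do not come from the paper. The aggregate is normalized by the total data size of the selected devices, which is also an assumption about the update rule.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the local loss (1/K_m) * sum_k 0.5 * (x_k^T w - y_k)^2."""
    return X.T @ (X @ w - y) / len(y)

def fedsgd_round(w, datasets, selected, lr):
    """One communication round with ideal (error-free) aggregation.

    datasets: list of (X_m, y_m) pairs, one per device.
    selected: indices of the devices chosen in this round.
    """
    # r_t = sum over selected devices of K_m * g_{m,t}
    r = sum(len(datasets[m][1]) * local_gradient(w, *datasets[m]) for m in selected)
    k_sel = sum(len(datasets[m][1]) for m in selected)
    return w - lr * r / k_sel

def global_loss(w, datasets):
    """F(w): local losses weighted by K_m / K."""
    total = sum(len(y) for _, y in datasets)
    return sum(0.5 * np.sum((X @ w - y) ** 2) for X, y in datasets) / total

# Synthetic example: three devices with different dataset sizes.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
datasets = []
for k_m in (20, 50, 30):
    X = rng.normal(size=(k_m, 5))
    datasets.append((X, X @ w_true))

w = np.zeros(5)
loss_start = global_loss(w, datasets)
for _ in range(200):
    w = fedsgd_round(w, datasets, selected=[0, 1, 2], lr=0.1)
loss_end = global_loss(w, datasets)
```

Passing a strict subset in `selected` shows the effect of device selection in isolation: the update then follows the gradient of the selected devices' data only.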

B. FL with Over-the-Air Analog Aggregation
We assume the server is equipped with $N$ antennas, and each device has a single antenna. The uplink channel between device $m$ and the server in communication round $t$ is denoted by $\mathbf{h}_{m,t} \in \mathbb{C}^N$. We consider over-the-air analog aggregation over the multiple access channel to efficiently obtain the aggregated local gradients at the server [3]. The FL system with over-the-air aggregation is shown in Fig. 1. In each communication round, the devices send their local gradients to the server simultaneously over the same frequency band. Due to the superposition of the transmitted signals, the server receives the weighted sum of local gradients.
At each communication round $t$, the local gradient of each selected device is transmitted over $D$ channel uses (or symbol durations). Specifically, the local gradient vector $\mathbf{g}_{m,t}$ at the selected device $m$ is first normalized by its local gradient normalization scalar $v_{m,t}$, and then adjusted by the transmit weight $a_{m,t} \in \mathbb{C}$ for transmission. Thus, the transmitted signal for the $d$-th entry of the local gradient $g_{m,t}[d]$, which is denoted by $q_{m,t}[d]$, is given by
$$q_{m,t}[d] = a_{m,t}\,\frac{g_{m,t}[d]}{v_{m,t}}. \tag{4}$$
To facilitate the receiver processing at the server, each selected device $m$ sends its local gradient normalization scalar via the uplink signaling channel. We assume the signaling channel is a separate digital channel and the reception is perfect.
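Let $\mathbf{q}_{m,t} = [q_{m,t}[1], \ldots, q_{m,t}[D]]^T$ denote the corresponding transmitted signal vector for $\mathbf{g}_{m,t}$. From (4), the average transmit power used to send each entry in $\mathbf{g}_{m,t}$ from device $m$ in communication round $t$ is
$$\frac{1}{D}\|\mathbf{q}_{m,t}\|^2 = \frac{|a_{m,t}|^2\,\|\mathbf{g}_{m,t}\|^2}{D\,v_{m,t}^2} = |a_{m,t}|^2,$$
where the last equality holds for the normalization $v_{m,t} = \|\mathbf{g}_{m,t}\|/\sqrt{D}$, which the power limit below presumes. The average transmit power is subject to the maximum average transmit power limit $P_0$: $|a_{m,t}|^2 \le P_0$, $\forall m$, $\forall t$.
The corresponding received signal at the server for the $d$-th channel use is given by
$$\mathbf{y}_{d,t} = \sum_{m \in \mathcal{M}^s_t} \mathbf{h}_{m,t}\, q_{m,t}[d] + \mathbf{n}_{d,t},$$
where $\mathbf{n}_{d,t} \sim \mathcal{CN}(\mathbf{0}, \sigma_n^2 \mathbf{I})$ is the receiver additive white Gaussian noise for the $d$-th channel use and is i.i.d. over $t$.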
The server applies receive beamforming to process the received signal. Let $\mathbf{f}_t \in \mathbb{C}^N$ denote the receive beamforming vector in communication round $t$ with $\|\mathbf{f}_t\| = 1$, and let $\eta_t \in \mathbb{R}_+$ denote the receive scaling factor. The post-processed received signal for $\mathbf{y}_{d,t}$ is given by

We describe the sequential operations performed within each communication round in Fig. 2 and summarize the key notations used throughout this work in Table I.
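The signal chain in this subsection (normalize, weight, superimpose, beamform, rescale) can be simulated in a few lines. The sketch below is illustrative and rests on several assumptions that are not confirmed by the text above: the normalization scalar is taken as $v_{m,t} = \|\mathbf{g}_{m,t}\|/\sqrt{D}$, the post-processing is taken as $\mathbf{f}_t^H \mathbf{y}_{d,t}/\sqrt{\eta_t}$, and the transmit weights follow a channel-inversion rule that is common in the over-the-air computation literature. The function name `ota_aggregate` is hypothetical.

```python
import numpy as np

def ota_aggregate(G, K, H, f, P0, sigma_n, rng):
    """Simulate one over-the-air aggregation round.

    G: (M, D) real local gradients, K: (M,) dataset sizes,
    H: (N, M) complex channels, f: (N,) unit-norm receive beamformer.
    Returns (estimate of sum_m K_m g_m, per-device average transmit power).
    """
    M, D = G.shape
    v = np.linalg.norm(G, axis=1) / np.sqrt(D)    # assumed normalization scalar
    gain = f.conj() @ H                           # f^H h_m for every device
    # Assumed design: the largest scaling that keeps every device within P0.
    eta = P0 * np.min(np.abs(gain) ** 2 / (K * v) ** 2)
    # Channel-inversion weights so that (f^H h_m) a_m / v_m = sqrt(eta) K_m.
    a = np.sqrt(eta) * K * v * gain.conj() / np.abs(gain) ** 2
    Q = (a / v)[:, None] * G                      # transmitted symbols q_m[d]
    noise = sigma_n * (rng.normal(size=(len(f), D))
                       + 1j * rng.normal(size=(len(f), D))) / np.sqrt(2)
    Y = H @ Q + noise                             # superposition at the N antennas
    r_hat = (f.conj() @ Y) / np.sqrt(eta)         # assumed post-processing
    tx_power = np.mean(np.abs(Q) ** 2, axis=1)
    return np.real(r_hat), tx_power

rng = np.random.default_rng(1)
M, N, D = 6, 4, 10
G = rng.normal(size=(M, D))
K = rng.integers(10, 100, size=M).astype(float)
H = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
f = H[:, 0] / np.linalg.norm(H[:, 0])             # arbitrary unit-norm beamformer
r_noiseless, power = ota_aggregate(G, K, H, f, P0=1.0, sigma_n=0.0, rng=rng)
```

With `sigma_n=0.0` the estimate coincides with the ideal aggregate, and the device with the weakest effective channel transmits at full power while all others back off. This is the effect that motivates excluding weak devices.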

C. Problem Formulation
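Since learning efficiency is crucial for FL, we aim to maximize the training convergence rate. As such, our objective is to minimize the expected global loss function after $T$ communication rounds by jointly optimizing the device selection $\mathcal{M}^s_t$, the devices' transmit weights $\{a_{m,t}\}$, and the receiver processing (beamforming vector $\mathbf{f}_t$ and scaling factor $\eta_t$). This optimization problem is formulated as follows:
$$\min_{\{\mathbf{s}_t,\, \mathbf{f}_t,\, \eta_t,\, \{a_{m,t}\}\}_{t=0}^{T-1}} \ \mathbb{E}[F(\mathbf{w}_T)] \tag{7}$$
$$\text{s.t.} \quad |a_{m,t}|^2 \le P_0, \ \forall m, \forall t, \tag{8}$$
$$\|\mathbf{f}_t\| = 1, \quad \eta_t \in \mathbb{R}_+, \quad \mathbf{s}_t \in \{0,1\}^M, \ \forall t,$$
where $\mathbb{E}[\cdot]$ is the expectation taken over the receiver noise, and $\mathbf{s}_t$ is the binary device selection vector at round $t$, with its $m$-th entry $s_t[m] = 1$ indicating that device $m$ is selected to participate in the model update in communication round $t$ and $s_t[m] = 0$ otherwise. Note that $\mathbf{s}_t$ and $\mathcal{M}^s_t$ convey the same information, and we have $\mathcal{M}^s_t = \{m \in \mathcal{M} : s_t[m] = 1\}$.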

III. PROBLEM REFORMULATION BASED ON GLOBAL TRAINING LOSS ANALYSIS
Problem (7) is a finite time-horizon stochastic optimization problem, and the global loss function in the objective is not an explicit function of the optimization variables. Furthermore, it requires the knowledge of channel states $\{\mathbf{h}_{m,t}\}$ for $t = 0, \ldots, T-1$, and therefore, it cannot be solved in an online fashion as the channel state in the future is unavailable. In addition, problem (7) is a mixed-integer program as it contains a binary variable $\mathbf{s}_t$. To tackle this challenging problem, we consider a more tractable upper bound on the loss function through training loss convergence analysis and propose algorithms to minimize this upper bound.

A. Upper Bound on the Global Training Loss
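To analyze the expression of $F(\mathbf{w}_T)$, we rewrite the global model update at the server in (3) as
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \lambda\left(\nabla F(\mathbf{w}_t) - \mathbf{e}_t\right), \tag{12}$$
where $\nabla F(\mathbf{w}_t)$ is the gradient of the global loss function at $\mathbf{w}_t$, and $\mathbf{e}_t \in \mathbb{R}^D$ is the error vector representing the deviation of the updating direction from $\nabla F(\mathbf{w}_t)$. Based on (3), the error vector $\mathbf{e}_t$ can be expressed as follows:
$$\mathbf{e}_t = \underbrace{\frac{\mathbf{r}_t - \Re[\hat{\mathbf{r}}_t]}{\sum_{m \in \mathcal{M}^s_t} K_m}}_{\triangleq\, \mathbf{e}_{1,t}} + \underbrace{\nabla F(\mathbf{w}_t) - \frac{\mathbf{r}_t}{\sum_{m \in \mathcal{M}^s_t} K_m}}_{\triangleq\, \mathbf{e}_{2,t}}, \tag{13}$$
where the first component of the error, $\mathbf{e}_{1,t}$, arises from the difference between the accurate aggregation $\mathbf{r}_t$ and its server-estimated counterpart $\Re[\hat{\mathbf{r}}_t]$, due to imperfect communication stemming from receiver noise and analog aggregation across wireless channels. The second component of the error, $\mathbf{e}_{2,t}$, is due to the subset selection of devices. When all devices are chosen, this term disappears. Note that the gradient error $\mathbf{e}_t$ in (13) depends on the device selection set $\mathcal{M}^s_t$, the transmit weights of devices $\{a_{m,t}\}$, and the receiver processing $(\mathbf{f}_t, \eta_t)$ through $\hat{\mathbf{r}}_t$, as shown in (6).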
From (6), $\hat{\mathbf{r}}_t$ is a function of the selected devices, the transmit weights $\{a_{m,t}\}$ at these devices, the receive beamforming $\mathbf{f}_t$, and the receive scaling $\eta_t$ at the server. Given the selected device set $\mathcal{M}^s_t$ and receive beamforming $\mathbf{f}_t$, we want to optimize the receive scaling $\eta_t$ and transmit weights $\{a_{m,t}\}$ to minimize the expected error $\mathbb{E}[\|\mathbf{e}_{1,t}\|^2]$ under the transmit power constraint (8), which is given by

However, the optimal solution to the error minimization problem (14) is not straightforward to obtain. Thus, we adopt a suboptimal solution to this problem, which has been considered by the existing works [20]-[22]:

Substituting the above expressions into (13), we can analyze the impact of $\mathbf{e}_t$ on the expected global loss function based on (12). An error expression similar to (13) is analyzed in [22] for an RIS-assisted FL system, where an upper bound on the expected global loss function $\mathbb{E}[F(\mathbf{w}_{t+1})]$ is derived. We can apply this upper bound to our problem straightforwardly by replacing the RIS channel with our channel model. The upper bound is derived based on the following assumptions on the global loss function $F(\mathbf{w})$, which are common in the stochastic optimization literature [24]:
A3. The gradient of the global loss function is Lipschitz continuous with a positive Lipschitz constant $L$: $\|\nabla F(\mathbf{w}) - \nabla F(\mathbf{w}')\| \le L\|\mathbf{w} - \mathbf{w}'\|$, $\forall \mathbf{w}, \mathbf{w}'$.
A4. The gradient of the sample-wise training loss function is upper bounded: , $\forall k$, $\forall t$. (21)
By substituting the expressions of $\eta_t$ and $\{a_{m,t}\}$ in (17) and (18) into (6) and setting the learning rate $\lambda = \frac{1}{L}$ in (3), and based on Assumptions A1-A4, the expected difference between the global loss function at round $(t+1)$ and the optimal loss is bounded by [22]

where and $\alpha_1$ and $\alpha_2$ are given in (21). Let $\mathbf{w}_0$ be the initial model parameter vector. Applying the bound in (22) to $\mathbb{E}[F(\mathbf{w}_{t+1}) - F(\mathbf{w}^\star)]$ for $t = 0, \ldots, T-1$, we have the following upper bound after $T$ communication rounds:

B. Problem Reformulation via Training Loss Upper Bound
Note that $\mathbb{E}[F(\mathbf{w}_T)]$ is difficult to minimize directly. Instead, we minimize its upper bound in (24). Note that since $\psi_t$ is an increasing function of $d(\mathbf{f}_t, \mathbf{s}_t; \mathbf{H}_t)$, the upper bound in (24) is an increasing function of $d(\mathbf{f}_t, \mathbf{s}_t; \mathbf{H}_t)$, $t = 0, \ldots, T-1$. Thus, to minimize the upper bound, it is sufficient to minimize $d(\mathbf{f}_t, \mathbf{s}_t; \mathbf{H}_t)$ at each round $t$ w.r.t. $(\mathbf{f}_t, \mathbf{s}_t)$. This joint device selection and receiver beamforming optimization problem is given below, where we drop the subscript $t$ from the problem for notational simplicity:
$$\min_{\mathbf{f},\, \mathbf{s}} \ d(\mathbf{f}, \mathbf{s}; \mathbf{H}) \tag{25}$$
$$\text{s.t.} \quad \|\mathbf{f}\| = 1, \quad \mathbf{s} \in \{0,1\}^M.$$

IV. JOINT DEVICE SELECTION AND RECEIVER BEAMFORMING WITH ANALOG AGGREGATION
The joint device selection and receive beamforming problem (25) is a mixed-integer program and is challenging to solve. Below, we propose two approaches to find a solution. The first approach uses a greedy method to select the devices based on their channel strength and correlation and then solves the corresponding beamforming optimization problem. To reduce the computational complexity, we further propose an alternating-optimization-based approach, where we devise an efficient low-complexity algorithm for the subproblem of device selection.

A. Greedy Spatial Device Selection (GSDS) Approach
From (23), we note that the channels of the selected devices, in both their strength and correlation among each other, can affect the value of $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$. Specifically, $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$ is a decreasing function of the channel strength of each selected device. Furthermore, since the same receive beamforming vector $\mathbf{f}$ applies to the channels of all selected devices, having highly correlated channels among these devices improves the minimum beamforming gain among them, which leads to a reduced value of $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$. Based on these two factors, we propose a Greedy Spatial Device Selection (GSDS) approach. It uses a sequential procedure to add a device to the set of selected devices. We use a metric to measure channel strength and its correlation to the set of selected devices. In each step, the device with the maximum metric value is added to the set of selected devices, and then beamforming optimization is performed for (25) under this new set of selected devices. Since the beamforming optimization is performed in each step, we first describe this beamforming design problem, and then detail the device selection procedure for GSDS.
1) Receiver Beamforming Design Given Device Selection: Assume the current set of selected devices is given by $\mathbf{s}$ (or equivalently $\mathcal{M}^s$). Problem (25) is now reduced to the receiver beamforming optimization problem w.r.t. $\mathbf{f}$ at the server. Since the first term of $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$ in (23) is a function of $\mathbf{s}$ only, with given $\mathbf{s}$, problem (25) is equivalent to
$$\min_{\mathbf{f}} \ \max_{m \in \mathcal{M}^s} \ \frac{K_m^2}{|\mathbf{f}^H \mathbf{h}_m|^2} \tag{28}$$
$$\text{s.t.} \quad \|\mathbf{f}\| = 1.$$
By introducing the auxiliary variable $c$, we can further rewrite the min-max problem (28) in its equivalent epigraph form as
$$\min_{\mathbf{f},\, c} \ c \tag{30}$$
$$\text{s.t.} \quad \frac{K_m^2}{|\mathbf{f}^H \mathbf{h}_m|^2} \le c, \ m \in \mathcal{M}^s, \tag{31}$$
$$\|\mathbf{f}\| = 1. \tag{32}$$
Let $\tilde{\mathbf{f}} \triangleq \sqrt{c}\,\mathbf{f}$. We can directly optimize $\tilde{\mathbf{f}}$ instead of $\mathbf{f}$ and $c$ separately in problem (30). In this case, constraint (32) can be dropped, and the objective function is replaced by $\|\tilde{\mathbf{f}}\|^2$. Thus, we further simplify problem (30) and transform it into the following final equivalent problem:
$$\min_{\tilde{\mathbf{f}}} \ \|\tilde{\mathbf{f}}\|^2 \quad \text{s.t.} \quad |\tilde{\mathbf{f}}^H \mathbf{h}_m|^2 \ge K_m^2, \ m \in \mathcal{M}^s. \tag{33}$$
After the above transformations, we arrive at our final problem (33), which is in fact equivalent to a single-group downlink multicast beamforming quality-of-service (QoS) problem [25], [26]: The BS transmits a common message to all devices in $\mathcal{M}^s$ using the multicast beamforming vector $\tilde{\mathbf{f}}$, which is optimized to minimize the transmit power while meeting each device's SNR target. In our problem (33), the SNR target is $K_m^2$ for each device $m$. The multicast beamforming design problem has been well studied in the literature [25]-[28]. It is generally an NP-hard problem. Nonetheless, effective and efficient algorithms have been proposed in the literature to find a close-to-optimal solution [26], [28]. We adopt the SCA method to solve problem (33), which is guaranteed to converge to a stationary point [26]. Once $\tilde{\mathbf{f}}$ is obtained, the receive beamforming vector $\mathbf{f}$ can be readily computed as $\mathbf{f} = \tilde{\mathbf{f}} / \|\tilde{\mathbf{f}}\|$.

Algorithm 1 Greedy Spatial Device Selection (GSDS)
Select device
7: Compute receiver beamforming vector $\mathbf{f}_i$ by solving problem (28).
8: Compute $d(\mathbf{f}_i, \mathbf{s}_i; \mathbf{H})$.
9: end for
10: Choose $i^\star = \arg\min_{1 \le i \le M} d(\mathbf{f}_i, \mathbf{s}_i; \mathbf{H})$.
11: Output:
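The change of variables between the min-max form and the QoS form can be checked numerically. The sketch below assumes the min-max objective is $\max_m K_m^2 / |\mathbf{f}^H \mathbf{h}_m|^2$, which is consistent with the SNR target $K_m^2$ stated above. It does not implement the SCA solver; it only maps a given unit-norm beamformer to a feasible point of the QoS problem and back. The function names are hypothetical.

```python
import numpy as np

def minmax_objective(f, H, K):
    """max_m K_m^2 / |f^H h_m|^2 over the columns of H (the selected devices)."""
    return np.max(K ** 2 / np.abs(f.conj() @ H) ** 2)

def to_qos_point(f, H, K):
    """Scale a unit-norm f by sqrt(c) to obtain a feasible point of the QoS form."""
    return np.sqrt(minmax_objective(f, H, K)) * f

def to_unit_beamformer(f_tilde):
    """Recover the unit-norm receive beamformer f = f_tilde / ||f_tilde||."""
    return f_tilde / np.linalg.norm(f_tilde)

rng = np.random.default_rng(2)
N, M = 4, 5
H = (rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))) / np.sqrt(2)
K = rng.integers(10, 100, size=M).astype(float)
f = rng.normal(size=N) + 1j * rng.normal(size=N)
f /= np.linalg.norm(f)                      # arbitrary unit-norm beamformer

f_tilde = to_qos_point(f, H, K)
snr = np.abs(f_tilde.conj() @ H) ** 2       # received "SNR" of each device
```

For any unit-norm `f`, the scaled vector meets every SNR target, with equality for the worst device, and its squared norm equals the min-max objective. Minimizing the transmit power in the QoS form is therefore the same as minimizing the worst ratio in the min-max form.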
2) Greedy Selection of Devices: As shown in problem (33) above, the receiver beamforming optimization essentially is a single-group multicast beamforming problem. Since the receiver beamforming vector is applied to all device channels, the worst received SNR among devices improves if all devices have good channel conditions and similar channel directions. Based on these heuristics, we propose our greedy device selection process in GSDS.
Our GSDS for device selection is a sequential procedure. We denote the set of selected devices in step $i$ by $\mathcal{M}^s_i$, $i = 1, 2, \ldots, M$. In the initial step, GSDS selects the device with the strongest channel condition. That is, $\mathcal{M}^s_1 = \{m_1 : m_1 = \arg\max_{1 \le m \le M} \|\mathbf{h}_m\|\}$. In each subsequent step, a new device is selected based on a metric and added to the set of selected devices. Therefore, $\mathcal{M}^s_i$ contains $i$ devices. We propose a metric that measures both device channel strength and its correlation to the set of selected devices. Specifically, in step $i$, for each unselected device $m \in \mathcal{M} \setminus \mathcal{M}^s_{i-1}$, we project its channel $\mathbf{h}_m$ onto the subspace spanned by the channel vectors of selected devices. That is, denote

The set of selected devices at step $i$ is then given by $\mathcal{M}^s_i = \mathcal{M}^s_{i-1} \cup \{m_i\}$. Let $\mathbf{s}_i$ be the device selection vector corresponding to $\mathcal{M}^s_i$. Once $\mathbf{s}_i$ is obtained, we optimize the receiver beamforming vector $\mathbf{f}$ to minimize $d(\mathbf{f}, \mathbf{s}_i; \mathbf{H})$ as in problem (25) with given $\mathbf{s}_i$, which is to solve problem (28), except that $\mathcal{M}^s$ is replaced by $\mathcal{M}^s_i$. Following the approach discussed in Section IV-A1, we obtain the receiver beamforming vector, denoted by $\mathbf{f}_i$, and the value of $d(\mathbf{f}_i, \mathbf{s}_i; \mathbf{H})$.
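After performing all steps $i = 1, \ldots, M$, choose $i^\star$ as
$$i^\star = \arg\min_{1 \le i \le M} d(\mathbf{f}_i, \mathbf{s}_i; \mathbf{H}), \tag{37}$$
and take the set of selected devices $\mathcal{M}^s_{i^\star}$ (and $\mathbf{s}_{i^\star}$) and receiver beamforming $\mathbf{f}_{i^\star}$ as the output of GSDS. The details of GSDS are summarized in Algorithm 1.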
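The sequential selection can be sketched as follows. The exact metric expression is not reproduced here, so the code uses an assumed stand-in: the norm of the projection of $\mathbf{h}_m$ onto the span of the already-selected channels, which is large when a channel is both strong and well aligned with the selected set. The sketch returns only the order in which devices would be added; the per-step beamforming optimization and the final choice of $i^\star$ are omitted. The function name `greedy_selection_order` is hypothetical.

```python
import numpy as np

def greedy_selection_order(H):
    """Return device indices in the order a projection-based greedy rule adds them.

    H: (N, M) complex matrix whose columns are the device channels h_m.
    Assumed metric: norm of the projection of h_m onto the span of the
    channels already selected.
    """
    N, M = H.shape
    order = [int(np.argmax(np.linalg.norm(H, axis=0)))]   # strongest channel first
    remaining = set(range(M)) - set(order)
    while remaining:
        # Orthonormal basis of the span of the selected channels
        # (assumes these columns are linearly independent while fewer than N).
        Q, _ = np.linalg.qr(H[:, order])
        cand = sorted(remaining)
        metric = np.linalg.norm(Q.conj().T @ H[:, cand], axis=0)
        best = cand[int(np.argmax(metric))]
        order.append(best)
        remaining.remove(best)
    return order

# Four devices, three antennas: device 2 is weaker than devices 1 and 3
# but nearly parallel to the strongest device 0.
H_example = np.array([[3.0, 0.0, 1.5, 0.0],
                      [0.0, 2.0, 0.1, 0.0],
                      [0.0, 0.0, 0.0, 2.5]], dtype=complex)
order = greedy_selection_order(H_example)
```

In this example the rule prefers the well-aligned device 2 over the stronger but orthogonal devices 1 and 3 in the second step, which illustrates why correlation matters in addition to strength.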
3) Computational Complexity: Algorithm 1 involves obtaining the sets of selected devices $\{\mathcal{M}^s_i\}_{i=1}^{M}$ and computing the receiver beamforming vectors $\mathbf{f}_i$, $i = 1, \ldots, M$. For each step $i$, the computational complexity of determining the selected device $m_i$ in (35) is $O(N M^2)$. Also, the computational complexity of obtaining the receiver beamforming vector $\mathbf{f}_i$ using the SCA method is $O(I_{\max} \min(M, N)^3)$, where $I_{\max}$ is the maximum number of SCA iterations to reach the convergence threshold. Since there are $M$ steps in total to determine the set of selected devices $\mathcal{M}^s_{i^\star}$ via (37), the overall computational complexity of GSDS is $O(I_{\max} \min(M, N)^3 M + N M^3)$.

B. Alternating-optimization-based Device Selection and Beamforming (ADSBF) Approach
The computational complexity of the proposed GSDS grows with the number of devices as $O(M^3)$, which is relatively high as $M$ becomes large. In this section, we propose an algorithm, named Alternating-optimization-based Device Selection and Beamforming (ADSBF), to determine device selection and receiver beamforming with low computational complexity. ADSBF uses an alternating optimization approach to solve problem (25) w.r.t. device selection $\mathbf{s}$ and receiver beamforming vector $\mathbf{f}$ alternatingly. Specifically, ADSBF breaks problem (25) into two subproblems to solve alternatingly: one is the receiver beamforming optimization with given device selection, and the other is device selection under the provided receive beamforming vector. These two subproblems are described below.
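1) Receiver Beamforming Design Given Device Selection $\mathbf{s}$: Given the set of selected devices $\mathbf{s}$, the minimization of $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$ in problem (25) w.r.t. $\mathbf{f}$ is given by
$$\min_{\mathbf{f}} \ \max_{m \in \mathcal{M}^s} \ \frac{K_m^2}{|\mathbf{f}^H \mathbf{h}_m|^2} \quad \text{s.t.} \quad \|\mathbf{f}\| = 1, \tag{38}$$
which is the same as problem (28). Hence, the approach discussed in Section IV-A1 for transforming the problem into problem (33) is directly applicable to problem (38).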
Following this, we can use the same SCA method to obtain a solution to the problem.
2) Device Selection Design Given Receiver Beamforming $\mathbf{f}$: Given receiver beamforming vector $\mathbf{f}$, problem (25) reduces to
$$\min_{\mathbf{s} \in \{0,1\}^M} \ d(\mathbf{f}, \mathbf{s}; \mathbf{H}). \tag{40}$$
The above problem is an integer program, and it also contains a min-max optimization problem, which typically is hard to solve. However, for this problem, we are able to develop an efficient algorithm to solve it. Specifically, we first sort $K_m^2 / |\mathbf{f}^H \mathbf{h}_m|^2$ in ascending order and index the corresponding devices as $m_1, \cdots, m_M$:
$$\frac{K_{m_1}^2}{|\mathbf{f}^H \mathbf{h}_{m_1}|^2} \le \cdots \le \frac{K_{m_M}^2}{|\mathbf{f}^H \mathbf{h}_{m_M}|^2}.$$
Let $c$ denote the largest of these ratios among the selected devices, which is attained by some $m_j$, for $1 \le j \le M$. This means that $s[m_{j'}] = 0$, $j' > j$, and only the first $j$ sorted devices, $m_1, \ldots, m_j$, are candidates for selection, and the rest are not selected. Next, for fixed $c$, note that the objective function in (40) decreases as more devices from $\{m_1, \ldots, m_j\}$ are selected. As a result, the minimum objective value is attained by selecting all these devices, $m_1, \ldots, m_j$.
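Therefore, let $\mathbf{z}_j$, $1 \le j \le M$, denote the device selection vector that selects devices $m_1, \ldots, m_j$, given by
$$z_j[m] = \begin{cases} 1, & m \in \{m_1, \ldots, m_j\}, \\ 0, & \text{otherwise}. \end{cases}$$
We evaluate the objective function value $d(\mathbf{f}, \mathbf{z}_j; \mathbf{H})$ under each device selection vector $\mathbf{z}_j$, for $j = 1, \ldots, M$. Then, we obtain the optimal selection vector $\mathbf{s}$ from $\{\mathbf{z}_j\}$ that achieves the minimum value among all $d(\mathbf{f}, \mathbf{z}_j; \mathbf{H})$'s:
$$\mathbf{s} = \arg\min_{\mathbf{z}_j,\, 1 \le j \le M} d(\mathbf{f}, \mathbf{z}_j; \mathbf{H}).$$
We summarize the device selection algorithm in Algorithm 2. Below, we show that our proposed algorithm is guaranteed to find the optimal solution $\mathbf{s}$ to problem (40).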
Algorithm 2 Optimal Device Selection Given Beamforming
in ascending order: with device indices $m_1, \ldots, m_M$.
3: for $j = 1, \ldots, M$ do

Proof: Assume $\mathbf{y}$ is an arbitrary device selection vector. Let $m^\dagger$ be the device with the largest value of $K_m^2 / |\mathbf{f}^H \mathbf{h}_m|^2$ among the selected devices in $\mathbf{y}$. Assume its corresponding index in the sorted devices $\{m_1, \ldots, m_M\}$ in Algorithm 2 is $m_{j^\dagger}$ (i.e., $m^\dagger = m_{j^\dagger}$). Then, the set of selected devices in $\mathbf{y}$ is a subset of $\{m_1, m_2, \ldots, m_{j^\dagger}\}$, i.e., the devices selected in $\mathbf{z}_{j^\dagger}$ defined by Algorithm 2. From the objective function in (40), we have $d(\mathbf{f}, \mathbf{z}_{j^\dagger}; \mathbf{H}) \le d(\mathbf{f}, \mathbf{y}; \mathbf{H})$. Thus, for any $\mathbf{y} \in \{0,1\}^M$, we can find a selection vector in $\{\mathbf{z}_j\}$ with an equal or smaller objective value. Thus, the global optimal point is in $\{\mathbf{z}_j\}$.
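The prefix-search idea behind this argument can be illustrated with a small sketch that compares it against exhaustive search. Two assumptions apply. First, the devices are sorted by the ratio $K_m^2 / |\mathbf{f}^H \mathbf{h}_m|^2$. Second, `toy_objective` is a hypothetical stand-in for $d(\mathbf{f}, \mathbf{s}; \mathbf{H})$ that only mimics its structure (a term that depends on $\mathbf{s}$ alone plus a term driven by the worst selected ratio); it is not the expression used in the analysis.

```python
import itertools
import numpy as np

def toy_objective(s, ratio, K, A=4.0, B=1.0):
    """Hypothetical stand-in for d(f, s; H); NOT the paper's expression.

    First term depends on s only; second term grows with the worst ratio
    K_m^2 / |f^H h_m|^2 among the selected devices.
    """
    k_sel = K[s].sum()
    return A * (1.0 - k_sel / K.sum()) ** 2 + B * ratio[s].max() / k_sel ** 2

def select_devices(ratio, K, objective=toy_objective):
    """Evaluate the M prefix selections z_1..z_M of the sorted devices; keep the best."""
    order = np.argsort(ratio)             # ascending K_m^2 / |f^H h_m|^2
    best_val, best_s = np.inf, None
    for j in range(1, len(order) + 1):
        z = np.zeros(len(order), dtype=bool)
        z[order[:j]] = True               # z_j selects m_1, ..., m_j
        val = objective(z, ratio, K)
        if val < best_val:
            best_val, best_s = val, z
    return best_s, best_val

def brute_force(ratio, K, objective=toy_objective):
    """Exhaustive search over all non-empty selections (small M only)."""
    M = len(ratio)
    return min(objective(np.array(bits, dtype=bool), ratio, K)
               for bits in itertools.product([False, True], repeat=M) if any(bits))

rng = np.random.default_rng(3)
K = rng.integers(10, 100, size=8).astype(float)
gain = rng.exponential(size=8)            # stands in for |f^H h_m|^2
ratio = K ** 2 / gain
s_best, val_best = select_devices(ratio, K)
```

For clarity, this sketch recomputes the sums for every prefix, so it costs $O(M^2)$ after sorting. Keeping a running total of the selected $K_m$ would reduce the loop to linear time, leaving the sort as the dominant cost.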
Our proposed ADSBF alternatingly solves the two subproblems w.r.t. $\mathbf{f}$ and $\mathbf{s}$ described in Sections IV-B1 and IV-B2, respectively, to obtain a solution to problem (25). The overall ADSBF algorithm is summarized in Algorithm 3.
3) Initialization and Convergence: Note that the SCA method used for the receiver beamforming subproblem requires a feasible initial point to problem (33). At the initial iteration $l = 1$, this initial point can be found by solving problem (33) using the semi-definite relaxation (SDR) approach [26], which gives a feasible approximate solution. In each subsequent iteration $l > 1$, the receive beamforming vector $\mathbf{f}^{(l-1)}$ obtained from the previous iteration $(l-1)$ is used as the initial point for the SCA method in iteration $l$. Since the SCA is guaranteed to converge to a local minimum, and the device selection subproblem is solved optimally, the objective value in (25) is non-increasing over iterations and is non-negative. Thus, ADSBF is guaranteed to converge.
4) Computational Complexity: The computational complexity of Algorithm 2 is $O(M \log(M) + M N)$, and that for solving problem (33) by the SCA method is $O(I_{\max} \min(N, M)^3)$, where $I_{\max}$ is the number of SCA iterations. Therefore, the overall computational complexity of ADSBF is $O(J_{\max}(I_{\max} \min(N, M)^3 + M \log(M) + M N))$, where $J_{\max}$ is the maximum number of iterations of ADSBF.
In comparison, the computational complexity of the Gibbs sampling approach proposed in [22] is $O(M^4)$. Also, the computational complexity of the DC approach in [20] is O(…). We summarize in Table II the computational complexities of different methods when $M \gg N$. We see the computational advantage of GSDS and ADSBF over the existing approaches. In particular, ADSBF has substantially lower complexity than all other approaches.

Algorithm 3 Alternating-Optimization-Based Device Selection and Beamforming (ADSBF)
1: Initialization: Set $\epsilon$. Set initial $f^{(0)}$ with $\|f^{(0)}\| = 1$, and …
…
…: … ; H) by the SCA method, where $f^{(l)}$ is used as the initial point for SCA.
…

V. SIMULATION RESULTS
In this section, we evaluate the performance of our proposed approaches on an image classification task. We consider training and testing using logistic regression for the MNIST [29] dataset and convolutional neural networks (CNNs) for the CIFAR-10 [30] dataset.

We consider a scenario with $M = 200$ devices and $N = 16$ antennas. The distance between each device $m$ and the parameter server, denoted by $d_m$, is drawn from a uniform distribution: $d_m \sim U[r_{\min}, r_{\max}]$, where $r_{\min} = 10$ m and $r_{\max} = 100$ m. The path loss follows the COST Hata model, given by $PL[\text{dB}] = 139.1 + 35.22 \log_{10}(d_m[\text{km}])$. We assume the device channels are constant during the training. The channel vector for device $m$ is generated using a complex Gaussian distribution as $h_{m,t} = h_m \sim \mathcal{CN}(0, \frac{1}{PL} I_N)$, $\forall t$. The maximum permissible average transmit power for the devices is assumed to be $P_0 = 0$ dBm. For comparison, we also consider the following approaches:

1) Select all: All of the devices are selected to contribute to the FL training, i.e., $\mathcal{M}^s = \mathcal{M}$. The receive beamforming is obtained by solving problem (28) via the SCA method.
2) Top one: Only the device with the strongest channel condition is selected to contribute to the FL training, i.e., $\mathcal{M}^s = \{m^\dagger\}$, where $m^\dagger = \arg\max_{1 \le m \le M} \|h_m\|$. The receiver beamforming vector is aligned with the channel vector of the selected device, i.e., $f = h_{m^\dagger} / \|h_{m^\dagger}\|$.
3) Gibbs sampling [22]: Gibbs sampling is used for device selection, and the receiver beamforming vector is obtained by an SCA method.
4) DC approach [20]: The receiver beamforming and the device selection are jointly optimized by DC programming to maximize the number of selected devices, subject to the constraint that the MSE of the over-the-air aggregation for the global model is no larger than a threshold $\gamma$.

Note that for a fair comparison of the computational complexity, the SCA method described in Section IV-A1 is used in GSDS, ADSBF, "Select all", and Gibbs sampling methods. The maximum number of iterations for ADSBF is set to $J_{\max} = 10$. For the Gibbs sampling method, to achieve its best performance, we set the initial temperature $\beta_0 = 1$, the cooling schedule parameter $\rho = 0.1$, and the number of iterations to 40. These parameter values have been determined through hyperparameter tuning. For the DC approach, the parameter $\gamma$ must be carefully tuned to ensure that an appropriate number of devices are selected to attain its best performance. After conducting the hyperparameter tuning, we set $\gamma = 94$ dB to achieve the fastest training convergence. Note that since the channel conditions in our experiments are much weaker than those in [20], the resulting MSE is larger, and hence we need to set $\gamma$ to a larger value compared to [20]. In contrast to the Gibbs sampling and DC approaches, our proposed GSDS and ADSBF do not require extra hyperparameter tuning, which is a notable practical advantage.
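The channel model of the simulation setup can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' simulation code: the function name `generate_channels` and the random seed are hypothetical, and the path loss is converted from dB to linear scale before being used as the inverse channel variance. The last two lines form the "Top one" benchmark by picking the device with the largest channel norm.

```python
import numpy as np

def generate_channels(M=200, N=16, r_min=10.0, r_max=100.0, seed=0):
    """Draw device distances and channels h_m ~ CN(0, (1/PL) I_N)."""
    rng = np.random.default_rng(seed)
    d_m = rng.uniform(r_min, r_max, size=M)             # distances in metres
    pl_db = 139.1 + 35.22 * np.log10(d_m / 1000.0)      # path loss, distance in km
    pl_lin = 10.0 ** (pl_db / 10.0)                     # dB -> linear scale
    # Each complex entry has variance 1/PL, split evenly between the
    # real and imaginary parts.
    scale = np.sqrt(1.0 / (2.0 * pl_lin))[:, None]
    noise = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
    return d_m, pl_db, scale * noise

d_m, pl_db, H = generate_channels()                     # H has one row per device
strongest = int(np.argmax(np.linalg.norm(H, axis=1)))   # "Top one" device
f_top_one = H[strongest] / np.linalg.norm(H[strongest]) # unit-norm beamformer
```

Because the channels are assumed constant during training, a single draw of `H` is reused in every communication round of one realization.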

A. MNIST Dataset
In the MNIST dataset, each individual data sample is a labeled gray-scale image of 28 × 28 pixels, depicting a handwritten digit, denoted by $x_k \in \mathbb{R}^{784}$. The corresponding label, $y_k \in \{0, 1, \ldots, 9\}$, specifies the class to which the image belongs. The dataset consists of 60,000 training samples and 10,000 test samples, belonging to ten different classes. We consider training a multinomial logistic regression classifier with the cross-entropy loss function given by

$$l(w; x_k, y_k) = -\sum_{j=0}^{9} 1\{y_k = j\} \log \frac{\exp(u_k^T w^{(j)})}{\sum_{i=0}^{9} \exp(u_k^T w^{(i)})}$$

where $1\{\cdot\}$ is an indicator function, $u_k = [x_k^T, 1]^T$, and $w = [w^{(0)T}, \ldots, w^{(9)T}]^T$ with $w^{(j)} \in \mathbb{R}^{785}$ being the model parameter vector for class $j$, consisting of 784 weights and a bias term.
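For concreteness, the per-sample loss of such a classifier can be sketched as below, assuming the standard softmax form of the multinomial logistic regression cross-entropy. The function name `cross_entropy_loss` and the row-wise storage of the class vectors $w^{(j)}$ in a matrix `W` are illustrative choices, not taken from the paper.

```python
import numpy as np

def cross_entropy_loss(W, x, y):
    """Cross-entropy loss of multinomial logistic regression for one sample.

    W: (10, 785) array whose row j is w^(j) (784 weights and a bias term).
    x: (784,) flattened image; y: integer label in {0, ..., 9}.
    """
    u = np.append(x, 1.0)                  # u_k = [x_k^T, 1]^T
    logits = W @ u                         # u_k^T w^(j) for every class j
    logits = logits - logits.max()         # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]                   # the indicator keeps only the true class

W = np.zeros((10, 785))                    # all-zero model: uniform class probabilities
x = np.random.default_rng(0).random(784)   # stand-in for a flattened image
loss = cross_entropy_loss(W, x, y=3)
```

With an all-zero model every class receives probability 1/10, so the loss equals $\log 10$ regardless of the input, which is a convenient sanity check.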
We assume the data distribution over devices is i.i.d. Each device's local dataset contains an equal number of data samples from different classes, and the number of local data samples in each device is $K_m = 270$. Upon thorough hyperparameter tuning, we set a unified learning rate $\lambda = 0.05$ for the global model update in (3) for all approaches. Furthermore, the full local batch is used for the gradient computation in each communication round for all approaches. A (training) epoch …

Figs. 3 and 4 respectively show the average test accuracy and the average test loss with 95% confidence intervals over 20 different channel realizations and receiver noise realizations. We see from both figures that our proposed GSDS and ADSBF approaches converge after $T = 100$ rounds, which is the fastest convergence among all the considered methods. Also, they achieve the highest test accuracy and lowest test loss among the approaches considered. In particular, GSDS and ADSBF provide test accuracy above 80%, which is 15% higher than the best accuracy achieved by the benchmarks. This significant improvement demonstrates the efficacy of our proposed methods for device selection. Although ADSBF performs slightly worse than GSDS, its run time is significantly lower than that of GSDS, as discussed below. Note that due to high instability, the loss of DC is significantly larger than that of the other approaches, so we omit it in Fig. 4.

Fig. 5 shows the average number of devices selected by different approaches over 20 channel realizations. The accompanying error bar indicates the standard deviation around the average value. We see that ADSBF and GSDS select fewer devices than the Gibbs sampling and DC approaches, but more than "Top one". This result, combined with the learning performance results in Figs. 3 and 4, demonstrates the effectiveness of our methods in providing, through device selection, a proper trade-off between imperfect noisy communication and the amount of training data provided by the local devices for model training.
Table III lists the average computation time of generating the beamforming and device selection solution using different approaches prior to the start of training. As we see in the column for the MNIST dataset, our proposed GSDS and ADSBF have significantly lower computational complexity than the Gibbs sampling and DC approaches. Furthermore, we see that the run time of ADSBF is two orders of magnitude lower than that of GSDS, with a slightly worse test accuracy. This demonstrates the computational advantage of ADSBF over GSDS and the overall efficacy of ADSBF.

B. CIFAR-10 Dataset
For the CIFAR-10 dataset, each individual data sample is a colored image of 3 × 32 × 32 pixels, which is represented as $x_k \in \mathbb{R}^{3 \times 32 \times 32}$; the associated label $y_k \in \{0, 1, \ldots, 9\}$ indicates the image's corresponding class. The dataset comprises a total of 50,000 training samples and 10,000 test samples. Given the increased complexity of this dataset, we opt for a more sophisticated convolutional neural network (CNN) model: the Residual Network (ResNet) with 14 layers (ResNet-14) [31]. The training process employs the cross-entropy loss function. To enhance the training process, we implement the data augmentation technique outlined in [31]. This technique involves augmenting the data by adding a 4-pixel padding on all sides and randomly selecting a 32 × 32 crop from either the padded image or its horizontally flipped counterpart.
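The augmentation step described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' pipeline: the padding is taken to be zero-valued, the flip is applied with probability 0.5, and the function name `augment` is hypothetical.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Random crop from a padded image, with a random horizontal flip.

    image: (3, 32, 32) array in channel-first layout.
    """
    _, h, w = image.shape
    # Zero-pad the two spatial dimensions by `pad` pixels on every side.
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))
    if rng.random() < 0.5:
        padded = padded[:, :, ::-1]              # horizontal flip
    top = rng.integers(0, 2 * pad + 1)           # crop offsets in 0..2*pad
    left = rng.integers(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]

rng = np.random.default_rng(0)
image = rng.random((3, 32, 32))                  # stand-in for a CIFAR-10 sample
out = augment(image, rng)
```

The output keeps the original 3 × 32 × 32 shape, so the augmented sample can be fed to the network without further resizing.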
Note that our joint optimization problem (25) is obtained based on the training convergence analysis under Assumptions A1-A4 with a strongly convex loss function, which is not the case here due to the non-convex nature of CNNs. Nonetheless, we test our proposed approaches using this dataset to demonstrate the effectiveness of our proposed method for this application.
Following an approach similar to that for the MNIST dataset, we uniformly allocate the training samples of each class across the devices, with the total number of samples per device being $K_m = 250$. During each communication round, the devices compute their local gradient by processing a batch of 50 data samples from their respective local datasets. Consequently, each epoch comprises 5 communication rounds, and we assess the model's accuracy upon the completion of each epoch. We use the built-in SGD optimizer in PyTorch [32] with a learning rate of 0.01, a momentum of 0.9, and a weight decay factor of $10^{-4}$. We set the receiver noise power $\sigma_n^2 = -50$ dBm.

Figs. 6 and 7 show the average test accuracy and average test loss, along with 95% confidence intervals, across 20 different channel and communication noise realizations. Again, we see that our proposed GSDS and ADSBF consistently achieve superior test accuracy and lower test loss after 400 epochs as compared to other approaches. Specifically, the final test accuracy achieved by GSDS is approximately 58%, while the highest test accuracy among the benchmarks is 53%. Note that we omit the loss values for "Top one" and DC in Fig. 7, as these methods result in a very large test loss due to their low accuracy, as shown in Fig. 6.
Comparing Figs. 3 and 6, which use two different datasets, we note that "Top one" performs better than "Select all" on the MNIST dataset, while "Select all" performs better than "Top one" on the CIFAR-10 dataset. The reason for this observation is as follows: The CIFAR-10 dataset is more difficult to learn than the MNIST dataset, as the variation among different data samples is much higher in CIFAR-10 than in MNIST. Therefore, for the MNIST dataset, even if we only use the data samples stored in a single device (i.e., "Top one") to train the model, the achieved test accuracy can still be relatively high. On the other hand, "Select all" suffers from a relatively large aggregation error caused by devices having weak channels, resulting in worse performance. However, for the CIFAR-10 dataset, if we train the model exclusively with the samples from a single device, the performance will suffer as the variation among the samples cannot be adequately captured during training.

The beamforming and device selection computation time for each approach, prior to the start of training for the ResNet model, is listed in the last column of Table III. Similar to the MNIST case, our proposed methods have significantly lower computational complexity than the Gibbs sampling and DC approaches.

Fig. 8 shows the average number of selected devices and the standard deviation over 20 channel realizations obtained by different approaches. Similar to the experiments with the MNIST dataset, we see that ADSBF and GSDS select fewer devices than Gibbs sampling, but more than DC and "Top one". This again shows the effectiveness of our methods in choosing an appropriate set of devices to trade off communication imperfection and the amount of training data from devices to achieve a satisfactory learning performance.

VI. CONCLUSION
In this paper, we have jointly designed uplink receiver beamforming and device selection in over-the-air FL to minimize the global training loss after an arbitrary number $T$ of communication rounds, assuming time-varying wireless channels. To tackle this challenging stochastic optimization problem, we have obtained an upper bound for the global training loss and designed receiver beamforming and device selection to minimize this upper bound. We have proposed two approaches, GSDS and ADSBF, to obtain a solution. GSDS uses a greedy method that exploits the channel strength and correlation to sequentially add devices to the set of selected devices. In contrast, ADSBF employs the alternating optimization technique to solve the device selection and receiver beamforming subproblems alternately, where we have provided an efficient algorithm to solve the device selection subproblem optimally with low computational complexity. In both approaches, we have shown that given the selected devices, the receiver beamforming optimization problem is equivalent to downlink single-group multicast beamforming, for which existing efficient algorithms can be used to obtain a solution. The simulation results obtained from image classification experiments have demonstrated that both GSDS and ADSBF speed up the training convergence and have lower computational complexity than the state-of-the-art approaches. Furthermore, we have observed that GSDS and ADSBF offer two distinct design choices with a trade-off between learning performance and computational complexity.

Fig. 2. An illustration of the sequential operations performed within communication round t.

TABLE I
NOTATIONS

Symbol              Explanation
$w_t$               Model parameter vector in round $t$
$F_m(\cdot)$        Local loss function of device $m$
$F(\cdot)$          Global loss function
$D_m$               Local dataset of device $m$
$g_{m,t}$           Gradient of local loss function of device $m$ in round $t$
$K_m$               Size of local dataset of device $m$
$K$                 Total number of data samples over all devices
$h_{m,t}$           Channel condition of device $m$ in round $t$
$f_t$               Receive beamforming in round $t$
$\eta_t$            Receive scaling factor in round $t$
$s_t$               Device selection vector in round $t$
$\mathcal{M}$       Set of all devices
$\mathcal{M}^s_t$   Device selection set in round $t$
$a_{m,t}$           Transmit weight of device $m$ in round $t$
$M$                 Number of devices
$N$                 Number of antennas at server
$P_0$               Average transmit power limit
$n_{d,t}$           Additive noise vector in $d$-th channel use in $t$-th round
$\sigma_n^2$        The variance of each entry of the noise vector
$\lambda$           Learning rate
$I_{\max}$          Maximum number of SCA iterations
$J_{\max}$          Maximum number of ADSBF iterations

TABLE III
AVERAGE RUN TIME FOR DIFFERENT APPROACHES (SECONDS)

Fig. 6. Average test accuracy over epoch (CIFAR-10 dataset).