Multiple Parallel Federated Learning via Over-the-Air Computation

This paper investigates multiple parallel federated learning (FL) in cellular networks, where a base station (BS) schedules several FL tasks in parallel, each involving its own group of devices. To reduce communication overhead, over-the-air computation is introduced: the superposition property of the multiple access channel (MAC) is exploited to accomplish the aggregation step. Since all devices use the same radio resources to transfer their local updates to the BS, we use a zero-forcing receiver combiner to separate the received signals of different tasks and mitigate the mutual interference across groups. Besides, we analyze the impact of the receiver combiner and device selection on the convergence of our multiple parallel FL framework, and formulate an optimization problem that jointly considers receiver combiner design and device selection to improve FL performance. We address the problem by decoupling it into two sub-problems and solving them alternately: successive convex approximation (SCA) is adopted to derive the receiver combiner vector, and the device scheduling problem is then solved with a greedy algorithm. Simulation results demonstrate that the proposed framework effectively mitigates the straggler issue in FL and achieves near-optimal performance on all tasks.

Nevertheless, communication bandwidth is a key bottleneck affecting FL performance, and the straggler issue caused by system heterogeneity in computation capability and wireless channel conditions makes it even worse [7], [8], [9], [10]. A common way to overcome this difficulty is to reduce the number of participating devices via scheduling policies [11], [12], [13], [14]. Another is to reduce the number of parameters that must be uploaded from clients to the central server via quantization [15], [16], [17] or sparsification [18], [19]. Although the above methods successfully reduce communication costs, FL performance is still constrained by the communication capability of the network, especially when a large number of devices participate in the FL process with limited communication bandwidth. This is because all of these methods assume that uplink communications between the BS and devices use conventional orthogonal-access schemes, e.g., orthogonal frequency division multiple access (OFDMA) or time division multiple access (TDMA), so that the spectral resource allocated to each device drops sharply as the number of devices increases.
In order to reduce the communication costs, over-the-air computation was introduced to aggregate data in sensor networks [20]. In this setting, all users transmit their data simultaneously over the same radio resources via the multiple access channel (MAC); the computation is then accomplished by exploiting the superposition property of the wireless channel, without decoding the information of each device. Compared with conventional orthogonal-access schemes, over-the-air computation can significantly improve communication efficiency, especially when there is a large number of devices in the network, because the required communication resource does not increase with the number of devices. Note, however, that over-the-air computation is only applicable when the BS wants to obtain the uniform summation of (or a variant derived from) the data of all devices [21]. Due to its energy efficiency, over-the-air computation is particularly suitable for IoT networks. In [22], the authors proposed a framework that is robust against synchronization errors. The work [23] considered a generalized IoT network where multiple clusters of sensors independently compute different target functions. A sensor selection algorithm to improve the computation performance was proposed in [24]. In [25], the authors built an experimental platform to verify the validity of over-the-air computation.
In a typical FL process, the BS receives distributed updates (model parameters or gradients) uploaded from edge devices via the MAC and then averages them to update the global model, which is a classic application scenario of over-the-air computation.

A. RELATED WORKS
Over-the-air computation-based FL aggregation was first introduced in [26], where the authors derived two trade-offs between communication and learning to quantify the selected device population. In parallel, the work [27] considered the same trade-offs and maximized the number of devices subject to the mean squared error (MSE) of the gradient. The authors of [26] then extended their work to one-bit over-the-air computation FL in [28] and [29], proposing a new scheme featuring one-bit quantization followed by modulation at the edge devices and majority-voting-based decoding at the edge server. The works [30], [31] assumed that the model update vector is sparse and projected the resultant sparse vector into a low-dimensional vector to reduce the bandwidth of over-the-air FL. Moreover, power control for over-the-air computation FL was studied in [32] and [33]. The goal of [32] is to minimize the MSE of the gradients by optimizing the transmit power at each device subject to average power constraints. The authors of [32] further analyzed the convergence of over-the-air computation FL under any given power control policy in order to optimize the transmit power. Tractable FL convergence analyses of full gradient descent optimization were given in [34], [35]. In [36], reconfigurable intelligent surfaces (RIS) were leveraged to improve the performance of over-the-air FL. Specifically, the authors developed a convergence analysis framework for RIS-aided over-the-air computation FL and addressed the straggler issue via device scheduling.

B. MOTIVATIONS
The aforementioned works have extensively optimized the performance of FL. However, they all consider a single FL task over wireless networks. When FL becomes a service or a popular application in the network [37], the central server (e.g., the base station) may need to schedule multiple FL processes simultaneously. In this situation, the mutual interference across different FL processes needs to be considered, as it may severely degrade learning performance. Different from the recent paper [38], which focuses on multiple FL tasks over multi-cell wireless networks, in this paper we study multiple parallel FL via over-the-air computation over wireless networks where the central server schedules multiple FL processes simultaneously. We jointly consider receiver combiner design and the device selection policy, and effectively address the high communication costs and the straggler issue in FL at the cost of only a slight FL performance loss.

C. CONTRIBUTIONS
The main contribution of this paper is to propose a novel framework for the implementation of multiple over-the-air FL tasks in wireless networks by jointly taking receiver combiner design and device selection into account. To the best of our knowledge, this is the first work that considers multiple FL processes via over-the-air computation. The contributions of this paper are summarized as follows:
• We propose a novel over-the-air computation FL framework, in which one BS serves multiple groups of devices to train different FL models. In the uploading stage, all devices from different groups use the same radio resources to transmit their local updates to the BS; then, by utilizing the superposition property of the MAC, the BS receives the sum of the signals from all groups of devices. To train the machine learning model of every group accurately, we propose a zero-forcing receiver combiner design to separate the received signals of different tasks.
• We analyze the convergence of FL within our framework. Specifically, we derive an upper bound on the gap between the realistic and ideal optimal values of the global loss function with respect to the aggregation error caused by transmission distortion and device selection, and show how the combiner design and device selection policy affect FL performance, e.g., convergence and the FL loss. Based on this analysis, we formulate a mixed-integer non-convex programming problem that jointly optimizes the combiner vector and the device set.
• To solve the formulated mixed-integer non-convex programming problem, we first decouple it into two sub-problems, namely receiver combiner design and device scheduling, and then solve them alternately. Specifically, given the device selection policy, we use the successive convex approximation (SCA) method proposed in [39] to derive the receiver combiner vector. Then, based on the derived receiver combiner vector, we solve the device scheduling problem with a greedy algorithm.
• Simulation results show that our proposed framework can efficiently transfer the gradient information of all tasks. Besides, the straggler issue severely and adversely affects the FL performance of conventional FL systems, whereas our proposed framework effectively resolves it and achieves near-optimal (noiseless aggregation) performance on all processed FL tasks.

D. ORGANIZATION
The remainder of this paper is organized as follows. Section II introduces the FL model, the MAC communication model, and the multiple FL aggregation framework via over-the-air computation. In Section III, we analyze the FL expected convergence rate and formulate the optimization problem to minimize the FL training loss. The optimal receiver combiner design and user selection policy are determined in Section IV. Simulation results are analyzed in Section V. Conclusions are drawn in Section VI.

E. NOTATIONS
In this paper, scalars, vectors, and matrices are denoted by regular letters, boldface lowercase letters, and boldface uppercase letters, respectively. R and C denote the real and complex number sets, respectively. (·)^T and (·)^H denote the transpose operator and the complex conjugate transpose operator, respectively. CN(μ, σ²) represents the circularly symmetric complex Gaussian distribution with mean μ and variance σ². The l2-norm of a vector is denoted by ||·||, and the size of a set S is denoted by |S|.
diag(·) stands for a diagonal matrix whose diagonal entries are specified by the vector enclosed, and E[·] denotes expectation.

II. SYSTEM MODEL
In this paper, we consider a cellular network in which M groups of devices, denoted by the set M, perform M different FL tasks with different training models via the same BS, as shown in Fig. 1.

A. FEDERATED LEARNING MODEL
In the system, each group m (1 ≤ m ≤ M) trains a machine learning model represented by the parameter vector w_m ∈ R^{D_m×1}, with D_m denoting the model size. The learning objective of group m is to solve the following optimization problem:

min_{w_m} F_m(w_m) = (1/K_m) Σ_{k=1}^{K_m} f(w_m; x_m^k, y_m^k), (1)

where K_m is the total number of training samples of group m, and (x_m^k, y_m^k) is the kth training sample, with x_m^k and y_m^k denoting the input feature and the output label, respectively. Suppose that there is a set I_m of I_m devices in group m and that the ith device holds K_{m,i} training samples. The objective in (1) then becomes

min_{w_m} F_m(w_m) = Σ_{i∈I_m} (K_{m,i}/K_m) F_{m,i}(w_m), (2)

with

F_{m,i}(w_m) = (1/K_{m,i}) Σ_{k=1}^{K_{m,i}} f(w_m; x_{m,i}^k, y_{m,i}^k). (3)

To overcome the bottleneck of limited network bandwidth, federated averaging (FedAvg) was developed to reduce the number of communication rounds between the devices and the BS [6]. Specifically, at the t-th round in group m, the following steps are conducted in sequence by FedAvg:
• The BS selects a subset of devices I_m^t ⊆ I_m to participate in the current round;
• The BS sends the current global model w_m^t to the selected devices via multicast;
• Each device adopts a standard gradient descent method to compute its local gradient with respect to its local dataset, as specified in [40]. Specifically, the gradient of device i in group m is given by

g_{m,i}^t = (1/K_{m,i}) Σ_{k=1}^{K_{m,i}} ∇f(w_m^t; x_{m,i}^k, y_{m,i}^k); (4)

• The devices upload g_{m,i}^t to the BS, and the BS then performs FedAvg to update the global model. In this case, we can estimate r_m^t ≜ Σ_{i∈I_m^t} K_{m,i} g_{m,i}^t at the BS from the received signals. Denoting by r̂_m^t the estimate of the true value r_m^t, the global model of group m is updated by

w_m^{t+1} = w_m^t − λ r̂_m^t / Σ_{i∈I_m^t} K_{m,i}, (5)

where λ denotes the learning rate.
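The per-round FedAvg procedure above can be sketched in a few lines of plain numpy; the quadratic per-sample loss and all names below are illustrative stand-ins, not the paper's models:

```python
import numpy as np

def local_gradient(w, X, y):
    """Average gradient of the per-sample loss f(w; x, y) = 0.5*(x^T w - y)^2
    over one device's local dataset (X, y)."""
    return X.T @ (X @ w - y) / len(y)

def fedavg_round(w, devices, lam=0.1):
    """One FedAvg round: every selected device computes its local gradient;
    the server aggregates them weighted by sample counts K_i and takes a
    gradient step with learning rate lam (cf. the update using r_t)."""
    K = np.array([len(y) for _, y in devices], dtype=float)
    r = sum(k * local_gradient(w, X, y) for k, (X, y) in zip(K, devices))
    return w - lam * r / K.sum()

rng = np.random.default_rng(6)
w_true = rng.standard_normal(5)
devices = []
for k in (40, 80, 120):                      # heterogeneous sample counts K_i
    X = rng.standard_normal((k, 5))
    devices.append((X, X @ w_true + 0.01 * rng.standard_normal(k)))

w = np.zeros(5)
for _ in range(200):
    w = fedavg_round(w, devices)
print(np.linalg.norm(w - w_true))  # small: the global model recovers w_true
```

With full participation and exact aggregation this reduces to ordinary weighted gradient descent; the rest of the paper studies what happens when r_t is estimated over a noisy channel instead.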

B. COMMUNICATION MODEL
In this paper, we focus on uplink transmissions between single-antenna devices and an N-antenna BS over a MAC, based on the fact that the uploading process dominates the convergence of FL systems, and we consider over-the-air computation for fast update aggregation by exploiting the superposition property of the MAC. We assume a block fading channel, where the channel coefficients remain constant within a communication round but may change across communication rounds. Besides, we assume that the channel state information (CSI) is available at all participating entities. At the tth communication round, let h_{m,i}^t ∈ C^{N×1} denote the channel coefficient vector between the BS and the ith device in group m, and let x_{m,i}^t[d] denote the dth transmit symbol of that device. The received signal at the BS, denoted by y^t[d], is then given by

y^t[d] = Σ_{m∈M} Σ_{i∈I_m^t} h_{m,i}^t x_{m,i}^t[d] + n^t[d], (6)

where n^t[d] is an additive white Gaussian noise (AWGN) vector whose entries follow the distribution CN(0, σ_n²).
To simplify the notation, we omit the time index t. Denote the dth element of g_{m,i} by g_{m,i}[d]. In order to exploit the superposition property of the MAC to accomplish FedAvg, each device first computes its local gradient statistics, namely the mean and variance of its gradient entries:

ḡ_{m,i} = (1/D_m) Σ_{d=1}^{D_m} g_{m,i}[d],  v_{m,i} = (1/D_m) Σ_{d=1}^{D_m} (g_{m,i}[d] − ḡ_{m,i})². (7)

Then, each device transmits the normalized symbols

x_{m,i}[d] = p_{m,i} (g_{m,i}[d] − ḡ_{m,i}) / √v_{m,i}, (8)

where p_{m,i} ∈ C is the transmitter scalar used to combat channel fading and to accomplish the weighting process of over-the-air FedAvg. The normalization step in (8) makes the transmit symbols zero-mean with unit variance, so the transmit power constraint can be written as

|p_{m,i}|² ≤ P_0, (9)

with P_0 > 0 as the maximum transmit power. By substituting (8) into (6), the received signal at the BS in time slot d is given by

y[d] = Σ_{m∈M} Σ_{i∈I_m} h_{m,i} p_{m,i} (g_{m,i}[d] − ḡ_{m,i}) / √v_{m,i} + n[d]. (10)

To perform over-the-air model aggregation, the BS computes the estimate of r_m as

r̂_m[d] = f_m^H y[d] / √η_m + ḡ_m, (11)

where F = (f_1, f_2, . . . , f_M) ∈ C^{N×M} is the receiver matrix; ḡ = (ḡ_1, ḡ_2, . . . , ḡ_M), with ḡ_m ≜ Σ_{i∈I_m} K_{m,i} ḡ_{m,i}, ∀m ∈ M, is used to restore the mean value subtracted from the transmit signal in the normalization step (7); and η_m > 0 is a normalization scalar.
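A simplified end-to-end numerical sketch of this pipeline follows (single group, AWGN, and an assumed channel-inverting form for the transmit scalars so that device i's contribution arrives weighted by K_i; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
D, I, N = 1000, 4, 8          # gradient length, devices, BS antennas

# Local gradients and their statistics (the normalization step).
g = rng.standard_normal((I, D))
K = rng.integers(300, 2000, size=I).astype(float)   # sample counts K_i
g_bar, v = g.mean(axis=1), g.var(axis=1)

# Channels, a fixed unit-norm combiner f, and channel-inverting transmit
# scalars (an assumed design; P0 = 1 limits every |p_i|^2).
H = (rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))) / np.sqrt(2)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
f /= np.linalg.norm(f)
g_eff = f.conj() @ H                                 # effective channels f^H h_i
eta = np.min(np.abs(g_eff) ** 2 / (K ** 2 * v))
p = np.sqrt(eta * v) * K * g_eff.conj() / np.abs(g_eff) ** 2

# Normalize and transmit; the signals superpose over the MAC with AWGN.
x = p[:, None] * (g - g_bar[:, None]) / np.sqrt(v)[:, None]
noise = 0.001 * (rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D)))
y = H @ x + noise

# BS side: combine, rescale, and restore the subtracted means.
r_hat = np.real(f.conj() @ y) / np.sqrt(eta) + K @ g_bar
r_true = K @ g
print(np.linalg.norm(r_hat - r_true) / np.linalg.norm(r_true))  # small
```

The key point is that the BS never decodes any individual device; the weighted sum Σ K_i g_i arrives directly through superposition, at the cost of a residual noise term controlled by η.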
Taking the mth (m ∈ M) FL model as an example, the combined term in the above estimate can be expanded as

f_m^H y[d] = Σ_{n∈M} Σ_{i∈I_n} f_m^H h_{n,i} p_{n,i} (g_{n,i}[d] − ḡ_{n,i}) / √v_{n,i} + f_m^H n[d]. (12)

As we can observe from (12), the received signal at the BS after receive combining is the sum of the gradient information from all participating devices. However, when the BS targets the mth FL process, only the gradient information from the mth group is needed, while the gradient information from the other groups is treated as interference. Therefore, it is necessary to design a combining scheme that decodes the received signals of different groups separately.

C. ZERO-FORCING RECEIVER COMBINER DESIGN
The signal received by the base station is the weighted sum of the model signals of all tasks. However, machine learning based on the gradient descent method is sensitive to interference. For the purpose of training the models of group m accurately, the receiver combiner matrix needs to separate the received signals of the different tasks, which reminds us of the zero-forcing receiver. Specifically, the BS should treat the gradients of different tasks as mutual interference and force the interference to zero; that is, the receiver combiner matrix F = [f_1, f_2, . . . , f_M] should be designed to meet the following criterion:

f_m^H h_{n,i} = 0, ∀n ∈ M\{m}, ∀i ∈ I_n^t, (13)

where f_m ∈ C^{N×1} and m = 1, 2, . . . , M. Note that the proposed scheme not only reduces the communication cost but also minimizes the training delay, because it schedules multiple FL tasks in parallel. For clear presentation, we list all notations used in this paper in Table 1.
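A minimal numerical sketch of such a zero-forcing combiner, assuming perfect CSI: the null-space projection below is a simplified stand-in for the SCA-optimized design of Section IV, and all names are illustrative:

```python
import numpy as np

def zero_forcing_combiner(H_own, H_other):
    """Return a unit-norm combiner f with f^H h = 0 for every column h
    of H_other (the other groups' channels), i.e. the criterion (13).

    H_own:   N x I_m channels of the target group's devices.
    H_other: N x J channels of all interfering devices.
    The direction chosen inside the null space (dominant projection of
    the target channels) is a heuristic; the paper optimizes it via SCA.
    """
    # Rows of Vh beyond the rank of H_other^H span its null space.
    _, s, Vh = np.linalg.svd(H_other.conj().T)
    rank = int(np.sum(s > 1e-10))
    null_basis = Vh.conj().T[:, rank:]           # N x (N - rank), orthonormal
    # Project target channels into the null space, keep the top direction.
    u, _, _ = np.linalg.svd(null_basis.conj().T @ H_own)
    f = null_basis @ u[:, 0]
    return f / np.linalg.norm(f)

rng = np.random.default_rng(0)
N, I_m, J = 8, 3, 4                              # needs N > J for a null space
H_own = (rng.standard_normal((N, I_m)) + 1j * rng.standard_normal((N, I_m))) / np.sqrt(2)
H_other = (rng.standard_normal((N, J)) + 1j * rng.standard_normal((N, J))) / np.sqrt(2)
f = zero_forcing_combiner(H_own, H_other)
print(np.max(np.abs(f.conj() @ H_other)))        # interference leakage, ~0
```

Note the feasibility condition implicit in (13): the BS needs more antennas than the number of interfering devices, otherwise the null space is empty.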

III. PERFORMANCE ANALYSIS AND PROBLEM FORMULATION
In this section, we analyze how device selection and communication noise affect the performance of federated learning under the over-the-air model aggregation framework.

A. LEARNING PERFORMANCE ANALYSIS
To facilitate the analysis, we omit the task index m and first make the following assumptions on the loss function F(·):

Assumption 1: F(·) is strongly convex with positive parameter μ, such that for any w and w':

F(w) ≥ F(w') + (w − w')^T ∇F(w') + (μ/2) ||w − w'||².

Assumption 2: The gradient ∇F(·) of F(·) is Lipschitz continuous with parameter L. Hence, we have:

||∇F(w) − ∇F(w')|| ≤ L ||w − w'||.

Assumption 3: F(·) is twice-continuously differentiable.

Assumption 4: The gradient computed on each sample is bounded as a function of the true gradient as follows:

||∇f(w; x^k, y^k)||² ≤ β_1 + β_2 ||∇F(w)||²,

where β_1, β_2 ≥ 0.

Remark 1: Assumptions 1-4 are satisfied by most machine learning loss functions, such as the squared support vector machine (SVM) and linear regression [41], and are widely used in the literature on performance analysis for FL [10], [42]. Although some machine learning models, such as neural networks, might not satisfy Assumption 1, our experimental results presented later will clearly show that the proposed receiver combiner and device selection policy based on these four assumptions work well.
Assumptions 1-4 lead to an upper bound on ||∇F(w^t)||² under a proper learning rate λ. According to the analysis in [43], we have

||∇F(w^t)||² ≤ 2L (F(w^t) − F(w*)),

where the learning rate is given as λ = 1/L and F(w*) denotes the global optimum.
Based on (5), the global model at iteration t is updated by the following relation:

w^{t+1} = w^t − λ (∇F(w^t) − e^t),

where e^t = ∇F(w^t) − r̂^t / Σ_{i∈I^t} K_i denotes the gradient error caused by device selection and communication noise. According to the analysis in [43], the upper bound on E[F(w^{t+1}) − F(w*)] can be given by

E[F(w^{t+1}) − F(w*)] ≤ (1 − μ/L) E[F(w^t) − F(w*)] + (1/(2L)) E[||e^t||²],

where E[·] returns the expected value of the random quantity enclosed. In order to lessen the gap between the realistic and ideal optimal values of the global loss function, we need to reduce the magnitude of e^t. Since the gradient error e^t is determined by device selection and communication noise, given the device selection policy, the transmitter scalar is determined in the following proposition.
Proposition 1: Given the channel coefficients, receiver combiner vector, and device selection policy, the optimal transmitter scalar that minimizes the gradient error is designed as

p_{m,i} = K_{m,i} √(η_m v_{m,i}) (f_m^H h_{m,i})^H / |f_m^H h_{m,i}|². (20)

Proof: See Appendix A.
Considering the transmit power constraint formulated in (9) and Proposition 1, the optimal η that minimizes the gradient error can be computed as

η_m = P_0 min_{i∈I_m^t} |f_m^H h_{m,i}|² / (K_{m,i}² v_{m,i}). (21)

From Proposition 1, we obtain a tractable expression for the gradient error e^t, based on which we derive an upper bound on E[F(w^{t+1}) − F(w*)] in the following theorem, for any given device selection policy I^t and receiver combiner vector f.
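As a numerical illustration of Proposition 1, the following sketch assumes the common over-the-air FedAvg structure in which each device inverts its effective channel f^H h_i so that its contribution arrives weighted by its sample count, and η is set by the most power-limited device; the closed forms here are our reconstruction, not copied verbatim from the paper:

```python
import numpy as np

def transmit_design(f, H, K, v, P0):
    """Hedged sketch of the transmitter design behind Proposition 1.

    Device i's scalar p_i inverts the effective channel f^H h_i and
    scales by K_i, so its gradient arrives with weight K_i; eta is the
    largest normalization scalar for which every |p_i|^2 <= P0, i.e. it
    is determined by the most power-limited device.
    """
    g_eff = f.conj() @ H                                  # f^H h_i, one per device
    eta = P0 * np.min(np.abs(g_eff) ** 2 / (K ** 2 * v))
    p = np.sqrt(eta * v) * K * g_eff.conj() / np.abs(g_eff) ** 2
    return p, eta

rng = np.random.default_rng(1)
N, I = 8, 5
H = (rng.standard_normal((N, I)) + 1j * rng.standard_normal((N, I))) / np.sqrt(2)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
f /= np.linalg.norm(f)
K = rng.integers(300, 2000, size=I).astype(float)
v = rng.uniform(0.5, 2.0, size=I)
p, eta = transmit_design(f, H, K, v, P0=1.0)
print(np.max(np.abs(p) ** 2))  # the binding device transmits at exactly P0
```

After combining and dividing by √η, each device's coefficient f^H h_i p_i / √(η v_i) equals K_i exactly, which is what makes the superposed signal an unbiased weighted sum.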
Theorem 1: Suppose that Assumptions 1-4 hold. With p_i and η given in (20) and (21), for arbitrary {I^t, f}, we have the bound in (22), where ψ = 1 − μ/L + d(I^t, f), [ψ]^t denotes exponentiation with base ψ and power t, and d(I^t, f) is the aggregation error term determined by the device selection policy I^t and the receiver combiner vector f (see Appendix B).

B. PROBLEM FORMULATION
From Theorem 1, we observe that ψ controls the convergence rate of the FL algorithm: a smaller ψ means a faster convergence rate, and the FL algorithm does not converge when ψ ≥ 1. Therefore, in this paper we only consider the case ψ < 1. As a result, as t → ∞, we have [ψ]^t → 0, and the gap between E[F(w^{t+1})] and E[F(w*)] reduces to (β_1/L) d(I^t, f). Moreover, from the expressions of ψ and the gap, we see that both the convergence rate and the gap are monotonic functions of d(·); hence, a smaller d(·) leads to faster convergence and a smaller gap. Besides, the selected device set I_m^t and the receiver combiner vector f_m determine the value of d(·). Based on the above observations, we formulate the following minimization problem for task m:

min_{I_m^t, f_m} d(I_m^t, f_m)
s.t. C2: f_m^H h_{n,i} = 0, ∀n ∈ M\{m}, ∀i ∈ I_n^t, (23)

where C2 is the zero-forcing constraint. Obviously, the objective function in (23) is non-convex and the optimization variable I_m^t is a set; hence, problem (23) is a mixed-integer non-convex optimization problem. Due to the heterogeneity of the system, different users hold different amounts of data and experience different channel states. From the optimization problem we can see that the first term of the objective function favors selecting devices with large amounts of data, but doing so may amplify the effect of noise in the second term. In addition, the second term favors selecting devices with good channel conditions and designing the receiver vector to minimize the effect of channel fading.

IV. JOINT OPTIMIZATION OF RECEIVER COMBINER AND DEVICE SELECTION
In this section, our goal is to solve the minimization problem formulated in (23). However, (23) is a mixed-integer non-convex optimization problem with a non-convex objective and constraints, so certain tactics are necessary to make it tractable. First, we decouple it into two sub-problems, namely receiver combiner design and device scheduling, and solve them alternately. Specifically, given the device selection policy, we use SCA to derive the receiver combiner vector. Then, based on the derived receiver combiner vector, we solve the device scheduling problem with the greedy algorithm.

A. RECEIVER COMBINER DESIGN
Given the device scheduling policy I_m^t of task m at the tth round, the minimization problem (23) can be written as problem (24). Problem (24) is a min-max optimization problem with a non-convex objective, so we first transform it through the following proposition.
Proposition 2: The problem formulated in (24) is equivalent to problem (25). Problem (25) is a quadratically constrained quadratic programming (QCQP) problem with non-convex constraints. Thus, we can solve it iteratively through SCA. Specifically, at the lth iteration, we derive the optimal f_m by solving problem (26), where the first constraint is obtained by performing a second-order Taylor expansion. The problem given in (26) is convex, and we solve it through a standard convex optimization solver, e.g., CVX. Besides, we initialize c_i^(0) randomly, and the iteration stops when the difference in c_i between two consecutive iterations is less than a preset threshold ε. The algorithm for optimizing f_m is summarized in Algorithm 1.
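The key convexification step can be illustrated numerically. A standard SCA surrogate (our illustration, with hypothetical names) replaces the non-convex side of a constraint involving |f^H h_i|² by its first-order lower bound around the current iterate f^(l), which is linear in f:

```python
import numpy as np

def quad_lower_bound(f, f_l, h):
    """Linear-in-f lower bound of |f^H h|^2 around the SCA iterate f_l.

    By convexity of |.|^2, for a = f^H h and a_l = f_l^H h:
        |a|^2 >= 2 Re{conj(a_l) a} - |a_l|^2,
    with equality when f = f_l, which is exactly what makes the
    per-iteration subproblem convex while keeping the surrogate tight
    at the expansion point.
    """
    a = np.vdot(f, h)      # f^H h  (np.vdot conjugates its first argument)
    a_l = np.vdot(f_l, h)  # f_l^H h
    return 2 * np.real(np.conj(a_l) * a) - np.abs(a_l) ** 2

rng = np.random.default_rng(2)
N = 6
h = rng.standard_normal(N) + 1j * rng.standard_normal(N)
f_l = rng.standard_normal(N) + 1j * rng.standard_normal(N)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
exact = np.abs(np.vdot(f, h)) ** 2
print(quad_lower_bound(f, f_l, h) <= exact)            # True: valid lower bound
print(np.isclose(quad_lower_bound(f_l, f_l, h), np.abs(np.vdot(f_l, h)) ** 2))  # tight at f_l
```

Because the surrogate never overestimates the true quadratic, any f feasible for the convexified subproblem remains feasible for the original constraint, which is what guarantees the monotone improvement of the SCA iterations.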

B. DEVICE SELECTION
In this subsection, we adopt a greedy device selection algorithm based on (23). Specifically, at the kth iteration, with k = 1, 2, . . . , ξI_m, given the device scheduling policy I_m^(k−1), we tentatively remove each device of I_m^(k−1) in turn and perform Algorithm 1 to derive the corresponding optimal receiver combiner vector, based on which we compute the objective value of (23) corresponding to each removed device. Finally, we find the removal yielding the minimum objective value, and that device is the one deleted in this step. At initialization, all devices are selected to participate in the FL algorithm. In this algorithm, ξ is the rate of device selection and is treated as a tunable hyperparameter. The algorithm for optimizing I_m is summarized in Algorithm 2.
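A minimal sketch of this backward-greedy loop follows, with a toy surrogate objective standing in for the combination of d(I, f) and the Algorithm 1 call (all names are illustrative):

```python
import numpy as np

def greedy_device_selection(objective, all_devices, n_remove):
    """Backward greedy selection sketched from Algorithm 2.

    `objective(S)` stands in for re-running the combiner design
    (Algorithm 1) on candidate set S and evaluating d(S, f). Starting
    from the full device set, each iteration tentatively removes every
    remaining device, keeps the removal with the smallest objective,
    and repeats n_remove times.
    """
    selected = set(all_devices)
    for _ in range(n_remove):
        best_dev, best_val = None, np.inf
        for dev in selected:
            val = objective(selected - {dev})
            if val < best_val:
                best_dev, best_val = dev, val
        selected.remove(best_dev)
    return selected

# Toy surrogate (illustrative only): the first term rewards keeping much
# data; the second grows when a weak-channel device stays in the set.
rng = np.random.default_rng(3)
K = rng.integers(300, 2000, size=8).astype(float)     # per-device samples
gain = rng.uniform(0.1, 2.0, size=8)                  # per-device channel gain
obj = lambda S: 1.0 / sum(K[i] for i in S) + 1e-6 * max(K[i] / gain[i] for i in S)
chosen = greedy_device_selection(obj, range(8), n_remove=4)
print(sorted(chosen))
```

Each outer iteration costs one objective (i.e., one Algorithm 1 run) per remaining device, which is where the O((2 − ξ)ξI²) device-selection complexity quoted later comes from.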
Remark 2: From the analysis in Section III-B, we observe that as the proportion of selected devices increases, the amount of data involved in training increases, bringing the training data distribution closer to the true distribution, which is conducive to the convergence of FL. However, choosing more devices may amplify the effects of noise, adversely affecting FL performance. Conversely, when too few devices are selected, the distribution of the data involved in training may deviate significantly from the true distribution, which also seriously harms the convergence of FL. Therefore, for scenarios with poor network conditions, the proportion of selected devices should be appropriately reduced, while for scenarios where devices hold little data, it is better to involve more devices.

Algorithm 2: Greedy Device Selection Algorithm
1: Initialize I_m^(0) = I_m (all devices selected).
2: for k = 1, 2, . . . , ξI_m do
3:   for each device i ∈ I_m^(k−1) do
4:     Design the receiver combiner vector for I_m^(k−1)\{i} via Algorithm 1.
5:     Compute obj_i = d(I_m^(k−1)\{i}, f_m).
6:   end for
7:   Set I_m^(k) = I_m^(k−1)\{arg min_i obj_i}.
8: end for

Regarding complexity, (26) is a second-order cone programming problem, and, therefore, the worst-case complexity of each iteration of Algorithm 1 is O(N³). The computational complexity of device selection is thus O((2 − ξ)ξI²), where I is the total number of devices.

V. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed joint receiver combiner and device selection algorithm. The simulation setup is introduced in Section V-A, and in Section V-B we numerically demonstrate the performance of the proposed algorithm for two image classification tasks with the different settings and benchmarks described in Section V-A.

A. SIMULATION SETUP
For our simulations, we consider a square network area with one BS placed at its center, serving M × I uniformly distributed devices. The simulated channel experiences small-scale fading multiplied by large-scale fading, where the small-scale fading follows the standard independent and identically distributed (i.i.d.) Gaussian distribution and the large-scale fading follows the free-space path loss G_BS G_D ((3 × 10^8 m/s) / (4π f_c d_BD))^{PL}, where G_BS and G_D are the antenna gains of the BS and each device; PL is the free-space path loss exponent; f_c is the carrier frequency; and d_BD is the distance between the BS and the device. We consider the following two settings for the data and device location distributions:
(1) One device cluster with equal data: the M × I devices are uniformly distributed in the square area {(x, y) : −10 < x < 10, −10 < y < 10}, and each device has 1000 training samples.
(2) Two device clusters with unequal data: the M × I devices are uniformly distributed in two square areas: half of the devices are in {(x, y) : −10 < x < 10, −10 < y < 10}, and the other half are in {(x, y) : 40 < x < 60, −10 < y < 10}. Besides, the number of training samples per device is unequal: we randomly assign half of the devices [1500, 2000] training samples and the other half [300, 500] training samples.
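The channel model above can be sketched as follows; the carrier frequency, antenna gains, and path-loss exponent below are illustrative placeholders, not the values from Table 2:

```python
import numpy as np

def channel(N, d_bd, fc=2.4e9, G=1.0, PL=2.0, rng=None):
    """Sketch of the simulated channel: i.i.d. Rayleigh small-scale fading
    scaled by the free-space large-scale loss G * (c / (4*pi*fc*d))^PL.
    fc, G, and PL are assumed placeholder values."""
    if rng is None:
        rng = np.random.default_rng()
    c = 3e8
    large = G * (c / (4 * np.pi * fc * d_bd)) ** PL
    small = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
    return np.sqrt(large) * small

rng = np.random.default_rng(4)
h_near = channel(64, d_bd=10.0, rng=rng)   # device in the near cluster (~10 m)
h_far = channel(64, d_bd=60.0, rng=rng)    # straggler-like device (~60 m)
print(np.linalg.norm(h_near) > np.linalg.norm(h_far))
```

The roughly 36x power gap between the 10 m and 60 m clusters is what creates the stragglers that setting 2 is designed to expose.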
The multi-FL algorithm is simulated using PyTorch for two image classification tasks on the MNIST [44] and FMNIST [45] datasets. Since the sample size and the sample space of the two datasets are the same, we can use the same neural network structure for both classification tasks. Specifically, each device trains a CNN consisting of two convolutional layers with 5×5 kernels, each followed by a 2×2 max-pooling layer and a batch normalization layer, and ending with a fully connected layer, a ReLU activation layer, and a softmax output layer. The total number of neurons is 21921. The loss function is the cross-entropy loss.
For the purpose of comparison, we consider the following three benchmarks together with the proposed algorithm:
(a) Noiseless aggregation: the gradient information uploaded by each device to the BS is assumed to be undistorted, i.e., the BS directly uses the gradients calculated by the devices to perform the FedAvg algorithm. Meanwhile, all devices are selected to participate in the FL process.
(b) Optimized receiver combiner with random device selection: devices are randomly selected, and the receiver combiner vector f is optimized by the SCA algorithm.
(c) OFDMA scheme: an orthogonal frequency division multiple access communication scheme with the same device selection sets as the proposed algorithm.
(d) Proposed algorithm: a wireless optimization algorithm that optimizes the receiver combiner vector f via SCA and the device selection policy via the greedy algorithm.
In our simulations, we stipulate that the BS has 64 antennas and each task involves 40 devices, and we select half of all devices to participate in the FL process. We perform 1000 FL rounds for each task, and the learning rate of each device is set to 0.01. The values of the parameters used in the simulations are listed in Table 2.

B. SIMULATION RESULTS
In this section, we simulate the performance of the proposed algorithm for two image classification tasks with the different settings and benchmarks described in Section V-A.
Fig. 2 and Fig. 3 show the test accuracy of the MNIST and FMNIST classification tasks under setting 1. From the two figures, we see that the random device selection benchmark approximates the OFDMA scheme and the proposed algorithm, and all of them nearly achieve the optimal performance (noiseless aggregation). This is due to the fact that all devices are close to the BS under setting 1, so there are no significant stragglers. Besides, Fig. 2 and Fig. 3 also show that the proposed algorithm outperforms the OFDMA scheme and nearly reaches the performance of the optimal FL. The improvement stems from the fact that the proposed algorithm optimizes the receiver combiner vector based on the FL convergence speed and error. The above results verify that the proposed algorithm can not only improve the performance of multiple parallel FL but also effectively reduce the communication costs, since the spectrum resources required by the OFDMA scheme increase proportionally with the number of devices.
Fig. 4 and Fig. 5 show the test accuracy of the MNIST and FMNIST classification tasks under setting 2. From the two figures, we observe that the FL performance is significantly affected by stragglers when devices are randomly selected. However, thanks to our greedy device selection algorithm, the performance of the proposed algorithm remains near-optimal. Besides, with the same device selection policy, the OFDMA scheme also achieves relatively satisfactory performance.
Fig. 6 and Fig. 7 show the test accuracy of the FMNIST and MNIST classification tasks versus the number of BS receive antennas under setting 2. From the two figures, we can see that, as the number of BS receive antennas increases, the FL test accuracy increases on both tasks.
This is because, as the number of antennas increases, the dimension of the vector f increases, which helps the SCA algorithm find a better solution. In addition, as the dimension of f increases, the dimension of the solution space of the zero-forcing constraint given in (13) increases, which also benefits the SCA algorithm in finding a better solution.

VI. CONCLUSION
In this paper, we developed a multiple FL framework via over-the-air computation in wireless networks. We proposed the zero-forcing receiver combiner to separate the received signals of different computing tasks. Also, we analyzed the convergence of FL under our framework and derived an upper bound on the difference between the loss function and its optimal value, which reveals how the receiver combiner vector and device selection policy affect FL performance. Based on this discovery, we formulated an optimization problem that jointly considers receiver combiner vector design and device selection for improving FL performance. We addressed the problem by alternately optimizing the receiver combiner vector and the device selection policy. In particular, we adopted SCA to derive the receiver combiner vector and solved the device scheduling problem with a greedy algorithm. Simulation results show that our proposed framework effectively solves the straggler issue and achieves near-optimal performance for all processed learning tasks.

APPENDIX A PROOF OF PROPOSITION 1
Let N^t denote the complement of I^t, so that I^t ∪ N^t = I. The gradient residual in (18) is then bounded by the expression derived in (28) at the top of the next page,² where N = (n[1], n[2], . . . , n[D]) ∈ C^{N×D}, and the inequality follows from the inequality of arithmetic and geometric means. To minimize the gradient residual, the transmitter scalar p_i should satisfy K_i − f^H h_i p_i / √(η v_i^t) = 0 for i ∈ I^t; thus we obtain the p_i in (20). 2. In this expression, subtracting the scalar ḡ from the vector g means subtracting ḡ from each entry of g. Addition and subtraction operations between vectors and scalars in the other formulas of this paper obey the same convention.

APPENDIX B PROOF OF THEOREM 1
The first term on the right-hand side of (28) can be bounded by applying the triangle inequality twice and then invoking Assumption 4. Substituting (20) into the last term on the right-hand side of (28) and using (21), we have

2Dσ_n² / (η (Σ_{i∈I^t} K_i)²) = (2Dσ_n² / (Σ_{i∈I^t} K_i)²) max_{i∈I^t} K_i² v_i^t / (P_0 |f^H h_i|²).

Based on (7), the remaining term can be bounded similarly, where the last inequality is derived based on Assumption 4. Combining these bounds yields Theorem 1.

APPENDIX C PROOF OF PROPOSITION 2
By introducing an auxiliary variable τ = min_{i∈I_m^t} |f_m^H h_{m,i}|², the problem developed in (24) can be rewritten as problem (25), which completes the proof.