Scalable Hierarchical Over-the-Air Federated Learning

When implementing hierarchical federated learning over wireless networks, scalability assurance and the ability to handle both interference and device data heterogeneity are crucial. This work introduces a new two-level learning method designed to address these challenges, along with a scalable over-the-air aggregation scheme for the uplink and a bandwidth-limited broadcast scheme for the downlink that efficiently use a single wireless resource. To provide resistance against data heterogeneity, we employ gradient aggregations. Meanwhile, the impact of uplink and downlink interference is minimized through optimized receiver normalizing factors. We present a comprehensive mathematical approach to derive the convergence bound for the proposed algorithm, applicable to a multi-cluster wireless network encompassing any count of collaborating clusters, and provide special cases and design remarks. As a key step to enable a tractable analysis, we develop a spatial model for the setup by modeling devices as a Poisson cluster process over the edge servers and rigorously quantify uplink and downlink error terms due to the interference. Finally, we show that despite the interference and data heterogeneity, the proposed algorithm not only achieves high learning accuracy for a variety of parameters but also significantly outperforms the conventional hierarchical learning algorithm.


I. INTRODUCTION
With the growing pervasiveness and computational power of wireless edge devices, e.g., phones, smart watches, sensors, and autonomous vehicles, there is an increasing demand for machine learning that trains a global model from the diverse data distributed over the edge devices [2]. However, uploading such enormous amounts of data from the devices to a central server is often infeasible due to strict constraints on latency, power, and bandwidth, or due to data privacy concerns. A promising and practically feasible distributed approach is federated learning (FL), which implements machine learning directly at the wireless edge while the data never leaves the devices [3]. In this approach, the model training is performed locally at each device with the help of a parameter server, such that synchronous model updates at all devices and model aggregation at the server are repeated until convergence. Most studies on FL consider a single server. However, to support more collaborating devices and to speed up the learning process, recent studies have proposed hierarchical architectures for FL that incorporate a core server and multiple edge servers [4]-[11].
Hierarchical FL has two levels of model parameter aggregation: the intra-cluster aggregation at the edge servers (edge aggregation), followed by an inter-cluster aggregation at the core server (core aggregation). The goal of this paper is to propose a hierarchical learning scheme that is both resource efficient and resilient to data heterogeneity and to wireless interference from parallel learning processes, and to provide a modeling methodology that makes it possible to express and evaluate the interference and its effect.

A. Prior Art
The convergence properties of hierarchical FL, without considering the limitations of a wireless environment, are evaluated in [4]-[6], which propose server selection, node scheduling, and data compression solutions. Limited resources and wireless channel impairments under orthogonal transmissions and fixed network configurations are taken into account, for example, in [7]-[10]. It is recognized, however, that orthogonal transmissions lead to performance bottlenecks when a large number of devices participate in the learning. An effective and desirable approach for these scenarios is over-the-air FL, which utilizes the over-the-air computation scheme [12], [13]. This approach leverages the interference caused by simultaneous multi-access transmissions from edge devices to perform aggregation. Through the integration of communication and computation, over-the-air FL can operate with significantly fewer resources and lower latency in both communication and computation than FL using orthogonal transmissions [2]. An extensive survey of the opportunities and challenges of FL in wireless networks is presented in [14].
The majority of research on over-the-air FL focuses on single-cell learning scenarios, with an emphasis on uplink transmissions [15]-[18]. Recently, several works have included specific aspects of uplink interference. Interferers distributed according to Poisson point processes (PPPs) are considered in [19]-[21]. An abstract heavy-tailed interference model is considered in [22], where it is shown that while a heavy tail slows down the learning process, it may improve the generalization capability. In [23], the transmission power is controlled to escape from saddle points. Studies have also examined the bandwidth-limited downlink in both single-cell [24] and multi-cell [25] settings.
Hierarchical FL using over-the-air computation is studied in [11], which is the work most closely related to our paper. However, that study assumes multiple antennas at the edge servers and an ideal downlink transmission and, most importantly, does not account for inter-cell interference.
To understand the effect of interference in hierarchical FL and provide a comprehensive network analysis, we turn to stochastic geometry. Stochastic geometry has been developed as a tool to characterize interference in networks, taking the locations of the nodes into account [26], [27]. Within the field, there are two canonical approaches to characterizing wireless networks. One is based on PPPs, which assume uniform device and server placements and have applications in cellular networks [20], [21], [28]. The other is based on Poisson cluster processes (PCPs), which take non-uniformity into account and make it possible to model the correlation between the locations of the devices and servers within and among clusters. PCPs are supported by the Third Generation Partnership Project (3GPP) [29], [30] and are widely adopted for deployments where devices are frequently grouped together in specific areas known as "hotspots" [30]-[34]. The PCP fits the application area of distributed learning on hierarchical networks that are composed of explicitly defined clusters, such as transportation and other smart city sensing applications, environmental monitoring, or industrial IoT. Moreover, it permits performance models in which the effects of the network parameters are pronounced. Therefore, this is the approach we follow in this paper.

B. Key Contributions
This paper develops a learning and transmission scheme for hierarchical FL utilizing over-the-air computation, and provides a modeling solution that captures the effect of interference. The key contributions are as follows:
Learning Method: We propose a new iterative learning method with two-level aggregation, named MultiAirFed, that combines intra-cluster gradient aggregation with inter-cluster model parameter aggregation and also includes multi-step local training. The method is well suited to the hierarchical network structure with unreliable wireless links and reliable backhaul links, while also being resilient to non-i.i.d. data.
Transmission Scheme: We propose a scalable clustered over-the-air aggregation scheme for the uplink and a bandwidth-limited analog broadcast scheme for the downlink. The schemes are independent of the number of clusters and devices, and each iteration over the whole network is performed in a single resource block, for both the uplink and the downlink. We design uplink and downlink power control schemes, and propose receiver normalizing factors that minimize the distortion of the recovered gradients or model vectors, taking task diversity, data heterogeneity, wireless channel impairments, and interference into account. While the proposed transmission scheme is general, we apply it to MultiAirFed, and express the intra- and inter-cluster aggregation errors caused by the uplink and downlink interference.
Tractable Modeling and Convergence Analysis: We utilize the tools of stochastic geometry, specifically PCPs, to quantify the intra- and inter-cluster aggregation error terms. Then, we characterize the learning convergence in terms of the optimality gap for the setup, as a function of the network parameters, and present design remarks and special cases. The optimality gap is tractable and well structured, so that it is easy to identify the effects of the learning parameters, the transmission scheme, and the network topology.
System Design Insights: Our analysis reveals that, due to interference, there is a non-zero optimality gap after convergence. Increasing the number of devices in each cluster or the number of collaborating clusters increases the learning accuracy; however, the improvement is limited by the accompanying increase in interference. A higher density of cluster centers also degrades the learning performance. Our numerical results show that MultiAirFed, based on the combination of gradient and model parameter aggregations, can provide significantly better learning accuracy than the conventional hierarchical FL algorithm in the proposed wireless setup.

II. SYSTEM MODEL

A. Network Topology
We consider groups of devices clustered around edge servers, performing FL, as shown in Fig. 1. The clusters may have different or similar learning tasks. For example, a university campus is mostly interested in science learning tasks, in contrast with a farm, where the types of sensors and data are completely different. One or more core servers support the clusters in the learning process. The clusters with the same task connect to the same core server and collaborate to allow hierarchical learning. The clusters themselves are wireless cells, with the edge servers located at the base stations. The edge servers are connected to the core server by a wired backhaul link.
We model the emerging network topology with the help of a PCP [26]. A PCP Φ is formally defined as a union of offspring points in ℝ² that are located around parent points. In our case, the parent points are the edge servers, while the offspring points are the devices. The parent point process is a homogeneous Poisson point process (PPP) Φ_p with density λ_p. Also, the offspring point processes are conditionally independent. The set of offspring points of x ∈ Φ_p is denoted by the cluster N^x, such that Φ = ∪_{x∈Φ_p} N^x. The PDF of each element of N^x being at a location y + x ∈ ℝ² inside the cluster is denoted by f_{∥y∥}(y).
In wireless networks, clusters are mostly modeled as disk-shaped regions where nodes are distributed uniformly. When these clusters are employed in a PCP, the resulting point process is referred to as a Matérn cluster process (MCP) [26]. Research has demonstrated that the MCP can accurately reproduce the setups used by 3GPP [29], [30]. Also, since the edge servers are usually located inside base station buildings or integrated with antenna towers, there are protective zones around them where devices and other edge servers cannot be located. These zones are additionally motivated by the suppression of interference, since they inhibit nearby devices [33], [35]. Therefore, we further consider a modified type of MCP, named MCP with holes at the cluster centers (MCP-H) [33], [34], where the points are distributed around the cluster centers with uniform distribution inside rings with inner radius r_0 and outer radius R as

$$f_{\|y\|}(y) = \begin{cases} \dfrac{1}{\pi (R^2 - r_0^2)}, & r_0 \le \|y\| \le R, \\ 0, & \text{otherwise}. \end{cases} \quad (1)$$

The number of devices in each cluster is assumed to be M, i.e., |N^x| = M. Also, among all the devices of a cluster x, the set of active devices in a time slot is denoted by A^x ⊆ N^x. The term "active device" denotes a device that participates in the aggregation phase of the FL through its uplink transmission.
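As an illustration of the spatial model, the following minimal sketch draws one realization of servers and devices on a square window. It is an assumption-laden toy rather than the simulator used for the results: the function names, the window size, and the parameter values are hypothetical, the parent points are drawn as a plain PPP without enforcing any minimum distance between edge servers, and edge effects at the window border are ignored.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_annulus_offset(r0, R, rng):
    """Draw an offset uniformly over the ring r0 <= ||y|| <= R (inverse CDF on the radius)."""
    radius = math.sqrt(rng.random() * (R**2 - r0**2) + r0**2)
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return radius * math.cos(angle), radius * math.sin(angle)


def sample_mcp_h(lambda_p, side, M, r0, R, seed=0):
    """Return {server_xy: [device_xy, ...]} for one realization on a side x side window."""
    rng = random.Random(seed)
    # Parent points: homogeneous PPP with density lambda_p on the window.
    n_servers = sample_poisson(lambda_p * side * side, rng)
    clusters = {}
    for _ in range(n_servers):
        x = (rng.uniform(0.0, side), rng.uniform(0.0, side))
        # Offspring points: M devices placed uniformly in the ring around x.
        offsets = [sample_annulus_offset(r0, R, rng) for _ in range(M)]
        clusters[x] = [(x[0] + dx, x[1] + dy) for dx, dy in offsets]
    return clusters


# Hypothetical parameter values, chosen only for illustration.
clusters = sample_mcp_h(lambda_p=0.05, side=20.0, M=10, r0=0.5, R=3.0, seed=1)
```

The radius is drawn by inverting the CDF of the ring distribution, so the device density is constant over the ring area rather than over the radius.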

B. Channel Model
All the nodes are single-antenna units. For the wireless links between devices and edge servers, we assume single-slope power-law path loss and small-scale i.i.d. Rayleigh fading. The path loss exponent is denoted by α. The uplink fading between a device y ∈ N^x and a server at z is modeled by f^x_{yz} ∈ ℂ, such that the channel gain |f^x_{yz}|² ∼ exp(1). The downlink fading between a server x and a reference device is f^x, with the gain |f^x|² ∼ exp(σ_d²). Also, all channel gains are assumed invariant during the time slot required for one uplink or downlink transmission, while they change independently from one time slot to another. The communication between the edge servers and the respective core server is performed over high-capacity backhaul links [36]. This communication is considered error free, and its optimization is not part of this work.

III. PROPOSED LEARNING METHOD
Assume that there are C collaborating clusters, including a reference cluster with its center at the origin o, that have the same learning task. The centers of these clusters are denoted by the set 𝒞. Also, a device at a location y in a cluster x has its private dataset 𝒟^x_y, whose l-th sample is ξ_l. The learning model is parametrized by the parameter vector w ∈ ℝ^d, where d denotes the learning model size. Then, the local loss function of the model parameter vector w over 𝒟^x_y is

$$F^x_y(w) = \frac{1}{D^x_y} \sum_{\xi_l \in \mathcal{D}^x_y} \ell(w, \xi_l), \quad (2)$$

where D^x_y = |𝒟^x_y| is the dataset size and ℓ(w, ξ_l) is the sample-wise loss function that quantifies the prediction error of the model vector w on ξ_l. The global loss function F(w) on the distributed datasets ∪_{x∈𝒞} ∪_{y∈N^x} 𝒟^x_y over the clusters in 𝒞 is defined in (3). Thus, the learning process has the objective of finding the desired model parameter vector

$$w^* = \arg\min_{w \in \mathbb{R}^d} F(w). \quad (4)$$

To solve (4) for hierarchical systems, we propose a new two-level algorithm named MultiAirFed. MultiAirFed is a combination of intra-cluster gradient aggregation and inter-cluster model parameter aggregation. Gradient aggregation has been shown to be robust to noise and interference in [20], [22], and to non-i.i.d. data in [37], and is therefore a good candidate for the intra-cluster learning process over the interfering wireless links. At the same time, the model parameter aggregation at the core server allows multiple inter-cluster iterations. The resulting hierarchical learning process is shown in Fig. 1 and Algorithm 1, and proceeds as follows. Consider T global inter-cluster iterations. In a particular iteration t, consider τ intra-cluster iterations. In a particular intra-cluster iteration i, each device y in a cluster x computes the local gradient of the loss function in (2) from its local dataset, indexed by {i, t}, as

$$g^x_{y,i,t} = \nabla F^x_y(w^x_{y,i,t}, \xi^x_{y,i,t}), \quad (5)$$

where w^x_{y,i,t} is its parameter vector and ξ^x_{y,i,t} is the local mini-batch chosen uniformly at random from 𝒟^x_y.
Then, the devices upload (transmit) their local gradients to their servers for intra-cluster aggregation. The server of cluster x averages the local gradients from its active devices to generate the intra-cluster gradient

$$g^x_{i,t} = \frac{1}{|A^x_{i,t}|} \sum_{y \in A^x_{i,t}} g^x_{y,i,t}, \quad (6)$$

where |A^x_{i,t}| is the number of active devices in the cluster x for the iteration index {i, t}, i.e., |A^x_{i,t}| = Σ_{y∈N^x} 1^x_{y,i,t}, where the indicator 1^x_{y,i,t} equals 1 if the device is active and 0 otherwise. Then, the servers broadcast the intra-cluster gradients g^x_{i,t}, ∀x, to their devices. Utilizing g^x_{i,t}, each device y in any cluster x updates its local model with the one-step gradient descent in (7), where μ_t is the learning rate at the global iteration t. After completing τ intra-cluster iterations, each device performs a γ-step gradient descent locally as

$$w^x_{y,\tau,j,t} = w^x_{y,\tau,j-1,t} - \mu_t \nabla F^x_y(w^x_{y,\tau,j-1,t}, \xi^x_{y,\tau,j-1,t}). \quad (8)$$

To start the inter-cluster iteration, the devices upload their model parameters, i.e., w^x_{y,τ,γ,t}, ∀y, x, to their servers. Accordingly, each server x computes an intra-cluster model parameter vector as the average in (10). Then, the collaborating servers upload their intra-cluster model parameter vectors to the core server for a global inter-cluster model parameter aggregation as

$$w^G_{t+1} = \frac{1}{|A_{\tau,t}|} \sum_{x \in \mathcal{C}} \sum_{y \in A^x_{\tau,t}} w^x_{y,\tau,\gamma,t}, \quad (11)$$

which is the average of all model parameter vectors from the active devices in the clusters of 𝒞.
Also, |A_{τ,t}| = Σ_{x∈𝒞} |A^x_{τ,t}| denotes the number of active devices for the iteration index {τ, t}. Then, the servers broadcast w^G_{t+1} to the devices, which update their initial model parameter vector for the next global iteration t + 1 as w^x_{y,0,t+1} = w^G_{t+1}, ∀y, x. This global update synchronizes all the devices in the collaborating clusters and prevents a high deviation of the local training processes. In line with [5], [7], [15]-[17], [22], we have not included the dataset sizes across devices in the average terms in (6) and (10). However, for a weighted-average extension, one can adjust the local gradient or model vector for the intra- or inter-cluster aggregation uploads by replacing g^x_{y,i,t} with D^x_y g^x_{y,i,t} and w^x_{y,τ,γ,t} with D^x_y w^x_{y,τ,γ,t}. To normalize by the active devices in each cluster, the weighted count Σ_{y∈N^x} D^x_y 1^x_{y,i,t} can be used instead of |A^x_{i,t}|. Then, the proposed algorithm can be followed. The proposed transmission scheme in Section IV can also be readily modified using these changes.
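The unweighted average and its dataset-size-weighted extension described above can be sketched in a few lines. This is an illustrative, error-free computation (no wireless channel); the function name `aggregate` and the example numbers are hypothetical.

```python
def aggregate(vectors, active, weights=None):
    """Average the vectors of the active devices.

    weights=None gives the plain mean over active devices; passing the dataset
    sizes as weights gives the weighted extension, normalized by the sum of the
    weights of the active devices.
    """
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(w for w, a in zip(weights, active) if a)
    if total == 0:
        raise ValueError("no active device in this cluster")
    dim = len(vectors[0])
    out = [0.0] * dim
    for vec, w, a in zip(vectors, weights, active):
        if a:
            for k in range(dim):
                out[k] += w * vec[k] / total
    return out


# Three devices, the third one inactive in this time slot (made-up values).
gradients = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
active = [True, True, False]
dataset_sizes = [1, 3, 5]

plain = aggregate(gradients, active)                    # mean over active devices
weighted = aggregate(gradients, active, dataset_sizes)  # dataset-size weighting
```

The inactive device affects neither result, and with equal dataset sizes the two averages coincide.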
Compared to the other hierarchical methods in [4]-[8], [10], [11], which are based on model parameter transmissions, gradient transmission over wireless links in MultiAirFed is expected to be more robust to channel noise and interference. This is because, for each local update at a device, a noisy model parameter aggregation leads to imperfections both in the initial model parameter vector and in the evaluation of the local gradient function in (5). Moreover, the resulting errors propagate and are reinforced through multiple local steps. As learning convergence demands high accuracy in the gradient direction, particularly in the vicinity of the optimal solution, noisy model parameter aggregation may prevent the model from converging. However, when gradients are transmitted, the devices can download the aggregated gradient (6) as a common gradient term for their update without the need for local computations, and the initial model parameter vector remains unaffected by any noise. This further guarantees that the local steps taken during an inter-cluster iteration continuously approach convergence. Moreover, gradient-based aggregation has in general been shown to be resilient against heterogeneity and non-i.i.d. data when compared to approaches that transmit model parameters [37]. These points are further supported by the experimental results in Section VI. Also, even though gradient transmission does not permit multiple local iterations within each intra-cluster iteration, the proposed final gradient descent (8)-(9) serves to integrate local training into the learning process.
Finally, note that the proposed learning method is not limited to the choice of the spatial model in Subsection II.A or to the transmission scheme that follows in Section IV. It can be easily applied to different scenarios of multi-server FL.

Algorithm 1: MultiAirFed
for inter-cluster iteration t = 1, ..., T do
    Each device updates its model by w^G_t
    for intra-cluster iteration i = 1, ..., τ do
        Each device obtains its local gradient from g^x_{y,i,t} = ∇F^x_y(w^x_{y,i,t}, ξ^x_{y,i,t})
        Each edge server obtains its intra-cluster gradient from (6)
        Each device updates its local model as in (7)
    Core server obtains global model from (11)
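To make the two-level structure concrete, the following sketch runs the learning loop on a deliberately simple problem. It is a toy under stated assumptions, not the evaluated implementation: the links are error free, all devices are always active, the model is a scalar, each device has the hypothetical quadratic loss 0.5 (w - a)^2 with a private target a, and full gradients replace mini-batches. The function name `multiairfed_toy` and all values are illustrative.

```python
def multiairfed_toy(targets, T, tau, gamma, mu, w0=0.0):
    """targets[x][y] is the private target a of device y in cluster x.

    Toy local loss: F(w) = 0.5 * (w - a)**2, so the local gradient is (w - a).
    """
    w_global = w0
    for _ in range(T):                           # inter-cluster iterations
        new_models = []
        for cluster in targets:
            # Every device starts from the broadcast global model.
            w_local = [w_global] * len(cluster)
            for _ in range(tau):                 # intra-cluster iterations
                grads = [w - a for w, a in zip(w_local, cluster)]
                g_cluster = sum(grads) / len(grads)          # gradient aggregation
                w_local = [w - mu * g_cluster for w in w_local]
            for _ in range(gamma):               # gamma local descent steps
                w_local = [w - mu * (w - a) for w, a in zip(w_local, cluster)]
            new_models.extend(w_local)
        # Model aggregation over all devices of the collaborating clusters.
        w_global = sum(new_models) / len(new_models)
    return w_global


# Two clusters with clearly different (non-i.i.d.) targets.
targets = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
w_final = multiairfed_toy(targets, T=60, tau=3, gamma=2, mu=0.1)
optimum = sum(sum(c) for c in targets) / sum(len(c) for c in targets)
```

For this toy loss the distance to the minimizer of the average loss shrinks by the factor (1 - mu)^(tau + gamma) in every inter-cluster iteration, so `w_final` approaches `optimum`, the mean of all targets. With noisy links this clean contraction no longer holds, which is what the error terms of the analysis account for.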

IV. CLUSTERED TRANSMISSION SCHEME
To implement MultiAirFed, we propose a scalable transmission scheme that includes two types of analog transmissions, for the uplink and the downlink, each of which is performed simultaneously over the clusters in a single resource block, as shown in the message exchange diagram of Fig. 1. It is inspired by [24], which shows that the analog downlink approach significantly outperforms the digital one. Synchronization is required within a cluster; this is a common assumption in the literature, see, e.g., [19], [20], [25]. It is also worth noting that the transmission scheme is not constrained to the spatial model presented in Subsection II.A, but the spatial model is needed to derive analytic expressions for the gradient and model aggregations. From here on, we omit the iteration indices for simplicity of presentation.

A. Uplink
For the uplink, we propose a clustered over-the-air aggregation scheme. The term "over-the-air" stems from the facts that the devices transmit simultaneously and that the objective is to construct the aggregation vectors (6) and (11) at the edge servers based on the additive nature of wireless multiple-access channels. The term "clustered" comes from the fact that the power allocation in each cluster is distinct from that of other clusters.
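The additive principle can be illustrated with a minimal single-cluster sketch. It rests on several assumptions that are not part of the scheme as analyzed: there is no noise, no path loss, and no inter-cluster interference, and each active device is assumed to pre-compensate its fading coefficient f with the amplitude sqrt(rho)/f (a truncated channel inversion). The function name and parameter values are hypothetical.

```python
import math
import random


def over_the_air_average(vectors, rho, th1, rng):
    """Noise-free, single-cluster toy. Returns (estimate, active_flags)."""
    dim = len(vectors[0])
    received = [0j] * dim
    active = []
    for vec in vectors:
        # Rayleigh fading: f ~ CN(0, 1), so |f|^2 is exponential with unit mean.
        f = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
        is_active = abs(f) ** 2 >= th1
        active.append(is_active)
        if not is_active:
            continue                      # devices in deep fade stay silent
        p = math.sqrt(rho) / f            # assumed truncated channel inversion
        for k in range(dim):
            received[k] += f * p * vec[k]  # the channel itself adds the signals
    n_active = sum(active)
    if n_active == 0:
        return None, active
    # The server only rescales the superposed signal; it never sees single vectors.
    estimate = [(r / (math.sqrt(rho) * n_active)).real for r in received]
    return estimate, active


rng = random.Random(7)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
estimate, active = over_the_air_average(vectors, rho=2.0, th1=0.2, rng=rng)
```

In this idealized setting the estimate equals the average of the vectors of the active devices, obtained in one channel use regardless of the number of devices.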
Depending on whether the iteration is an intra- or inter-cluster one, the gradient parameters or model parameters at each device are normalized before transmission to have zero mean and unit variance. There are two advantages to normalizing the parameters. First, when the parameters have zero-mean entries, the estimates obtained in the sequel are unbiased. Second, when the entries have unit variance, the interference and consequently the error terms depend only on the power control, and do not depend on the specific values of the model or gradient parameters.
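A minimal sketch of this normalization and of its inversion at the receiver is given below. The helper names are hypothetical, and the standard deviation is assumed to be the population one (division by d); the two scalars returned by `normalize` play the role of the side information that must be shared so that the receiver can denormalize.

```python
import math


def normalize(vec):
    """Return (normalized vector, mean, std) using the population standard deviation."""
    d = len(vec)
    mean = sum(vec) / d
    std = math.sqrt(sum((v - mean) ** 2 for v in vec) / d)
    if std == 0.0:
        raise ValueError("a constant vector cannot be scaled to unit variance")
    return [(v - mean) / std for v in vec], mean, std


def denormalize(vec, mean, std):
    """Invert normalize() given the two shared scalars."""
    return [std * v + mean for v in vec]


gradient = [0.3, -1.2, 4.0, 2.5, -0.7]   # made-up values
normalized, mu_g, sigma_g = normalize(gradient)
recovered = denormalize(normalized, mu_g, sigma_g)
```

After normalization the entries have zero mean and unit variance whatever the scale of the original vector, which is why the transmitted power is set by the power control alone.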
For an intra-cluster iteration, the local gradient vector at a device y ∈ N^x, i.e., g^x_y, is normalized as ḡ^x_y = (g^x_y − μ^x_{g,y} 1)/σ^x_{g,y}, where 1 is the all-one vector, and μ^x_{g,y} and σ^x_{g,y} denote the mean and standard deviation of the d entries of the gradient, given by

$$\mu^x_{g,y} = \frac{1}{d} \sum_{i=1}^{d} g^x_y(i), \qquad (\sigma^x_{g,y})^2 = \frac{1}{d} \sum_{i=1}^{d} \big(g^x_y(i) - \mu^x_{g,y}\big)^2,$$

where g^x_y(i) is the i-th entry of the vector. Also, for an inter-cluster iteration, the normalized local model parameter vector is w̄^x_y = (w^x_y − μ^x_{w,y} 1)/σ^x_{w,y}, where the mean and variance are

$$\mu^x_{w,y} = \frac{1}{d} \sum_{i=1}^{d} w^x_y(i), \qquad (\sigma^x_{w,y})^2 = \frac{1}{d} \sum_{i=1}^{d} \big(w^x_y(i) - \mu^x_{w,y}\big)^2.$$

Then, at each device y in the cluster x, the normalized vector ḡ^x_y or w̄^x_y is analog modulated and transmitted as p^x_y ḡ^x_y or p^x_y w̄^x_y simultaneously with the other devices in all the clusters, where |p^x_y|² denotes the transmission power. The received signal at a server located at z is given in (14), where the first term is the useful signal, the second is the inter-cluster interference, and n^z_u ∈ ℂ^{d×1} is the additive white Gaussian noise (AWGN) at the receiver of the server with zero mean and variance σ²_{en} for each entry. Each device y ∈ N^x of cluster x follows the truncated power allocation [16] in (15), where ρ is the power allocation parameter and th_1 is a threshold. We assume that the device knows the uplink channel to its server. In (15), devices in deep fades do not transmit, but the channel path loss is not included in the conditions. By enabling the inclusion of devices with high path loss, the learning process can ensure fair device deployment and leverage data diversity from all devices [16], [20]. In (15), ρ is selected for all the devices so that each device meets a maximum average power P_u; the resulting expression involves the exponential integral function Ei(·). In each time slot, the activity set in the cluster x is defined as A^x = {y ∈ N^x : |f^x_{yx}|² ≥ th_1}. Its cardinality |A^x| has the binomial distribution with success probability P(|f^x_{yx}|² ≥ th_1) = e^{−th_1}, and |A^x|, ∀x, are independent. If A^x is found to be empty, the device y ∈ N^x with the highest |f^x_{yx}| is selected to transmit at P_u, and we consider |A^x| = 1. However, for the sake of simplicity and to gain more insight into the analytical results, we assume that the probability of the emptiness event, which is equal to (1 − e^{−th_1})^M, is sufficiently small that it can be disregarded, and it is not included as a condition in the subsequent analytical derivations for (20) and (43)-(50).
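How small the emptiness probability is for given M and th_1 can be checked quickly. The sketch below evaluates (1 − e^{−th_1})^M and estimates the per-device activity probability e^{−th_1} by drawing unit-mean exponential channel gains; the function names and the values M = 10, th_1 = 0.5 are hypothetical choices for illustration.

```python
import math
import random


def empty_cluster_probability(M, th1):
    """Probability that none of the M devices passes the threshold th1."""
    return (1.0 - math.exp(-th1)) ** M


def simulate_active_fraction(M, th1, trials, seed=0):
    """Monte Carlo estimate of P(|f|^2 >= th1) with |f|^2 ~ exp(1)."""
    rng = random.Random(seed)
    active = 0
    for _ in range(trials * M):
        if rng.expovariate(1.0) >= th1:
            active += 1
    return active / (trials * M)


p_empty = empty_cluster_probability(M=10, th1=0.5)
p_active = simulate_active_fraction(M=10, th1=0.5, trials=20000)
```

For these values the emptiness probability is below 10^-4, which gives a sense of when neglecting the event is reasonable; a larger threshold or a smaller cluster makes it less negligible.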

B. Downlink for intra-cluster iteration
The downlink transmission happens in parallel at each edge server via a bandwidth-limited analog broadcast. As E{v^x_u} = 0, ∀x, each server at a location x ∈ Φ_p normalizes its received signal v^x_u with its variance, which is E{∥v^x_u∥²}. Then, all the servers transmit the normalized signals simultaneously. The resulting received signal at a reference device at y_0 in the reference cluster o is given in (18), where P_d is the transmission power constraint of the servers, and n_d ∈ ℂ^{d×1} is the AWGN at the device with zero mean and variance σ²_{dn} for each entry. In general, the server can estimate E{∥v^x_u∥²} by taking measurements of the received signal over time and over its entries and calculating the average power of those samples. However, the MCP-H modeling allows us to express E{∥v^x_u∥²} as a function of the network parameters and the power control. Specifically, the expression follows from (14) and (15) and involves the term Ψ in (20); in its derivation, step (a) comes from Campbell's theorem [26] and step (b) is due to the fact that the edge servers are at least 2r_0 apart from each other.
After denormalizing the received signal, the reference device estimates the intra-cluster gradient (6) as in (21), where ϑ^o_{dy_0} is the intra-cluster receive normalizing factor at the device. For this operation, it is assumed that each device knows its downlink channel from its server and that the reference server shares the scalars (μ^o_{g,y}, σ^o_{g,y}), ∀y ∈ A^o, with its devices in an error-free manner. This is needed to support data heterogeneity. This information is, however, small compared to the gradients, and needs to be shared within a single cluster only. If the downlink channel |f^o| is lower than a threshold th_0, the device does not update its local model and does not contribute to the present inter-cluster iteration. However, retransmission strategies can be utilized in such a case. By replacing (14) in (18) and expanding the result, (21) can be rewritten as (22), where ϵ^o_u is the intra-cluster uplink error given in (23) and ϵ^o_{dy_0} is the intra-cluster downlink error given in (24). These results hold for the learning process under a general network topology. For the specific case of the MCP-H, we can select the normalizing factor ϑ^o_{dy_0} in (21) to minimize the distortion of the recovered gradient g^o_{y_0} with respect to the ground-truth gradient (1/|A^o|) Σ_{y∈A^o} g^o_y in (22), which can be measured by the mean squared error (MSE) [18], [38]; this gives the problem (25). There, the MSE separates into the contributions of the two error terms because they are independent, and the uplink contribution is expressed through Ψ, calculated in (20), using (14) and (23). For the expected term on the intra-cluster downlink error in (24), the MCP-H network topology yields a closed form, in which step (c) is due to Campbell's theorem. To solve (25), we take the derivative of the objective and set the result to zero, which yields the optimal normalizing factor.

C. Downlink for inter-cluster iteration
The core server sums and redistributes the signals received from any set of collaborating edge servers without introducing further error. Consider the C_x collaborating clusters that have the same learning task as a cluster x, denoted by the set 𝒞_x. Then, the sum of the received signals of the clusters in 𝒞_x, i.e., Σ_{z∈𝒞_x} v^z_u, is normalized with its variance. The result is simultaneously transmitted from the servers of the clusters in 𝒞_x to their devices, and the received signal at the reference device is given in (29). Then, the reference device can estimate the inter-cluster model parameter vector (11) as in (30), which exploits the symmetry of the network, and where ϑ_{dy_0} is the inter-cluster receive normalizing factor. We assume that the reference edge server receives the scalars (μ^x_{w,y}, σ^x_{w,y}), ∀x ∈ 𝒞, y ∈ A^x, from its core server and shares them among its devices.
After replacing (14) in (29), (30) can be expanded as (31), where ϵ_u is the inter-cluster uplink error given in (32) and ϵ_{dy_0} is the inter-cluster downlink error given in (33). Again, the results up to here do not depend on the network topology. For the MCP-H case, we can progress as follows.
The normalizing factor ϑ_{dy_0} is selected to minimize the distortion of the recovered model vector w^o_{y_0} with respect to the ground-truth model vector (1/|A|) Σ_{x∈𝒞} Σ_{y∈A^x} w^x_y from (31), which gives the problem (34). Its solution follows from the symmetry of the network and the error terms in (23)-(24) and (32)-(33). Note that in the intra- and inter-cluster iterations, the uplink error is due to the interference of devices in the uplink, and the downlink error comes from the interference of edge servers in the downlink. Also, they include the effect of simultaneous transmissions of all the clusters regardless of their learning tasks. Since (14), (18), and (29) use only one resource block over the whole network, containing as many frequency subchannels as the size of the learning model, i.e., d, and all nodes have a single antenna, the scheme is efficient in terms of bandwidth and antenna resources. According to the uplink and downlink schemes, the expected latency of completing the MultiAirFed algorithm is given in (36), where t_CM is the local computation latency of each device, given by t_CM = c N_b / f [40], where c is the number of CPU cycles required for computing one data sample, f is the CPU cycle frequency, and N_b is the size of the data involved in the local update. In (36), t_BC is the time needed for the uplink or downlink transmission of d model or gradient parameters, given by t_BC = d / W [20], where a total bandwidth of W is assumed. Also, t_BH ≫ t_BC denotes the backhaul latency for the inter-cluster process. As observed from (36), the latency is independent of the number of clusters and devices.
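The two per-step latency terms can be evaluated directly from their definitions. In the sketch below only `computation_latency` and `broadcast_latency` follow the formulas stated above; `round_latency` is an assumed accounting of one inter-cluster iteration introduced purely for illustration and should not be read as the expression in (36). All numerical values are hypothetical.

```python
def computation_latency(c, n_b, f):
    """t_CM = c * N_b / f: CPU cycles per sample times samples, over CPU frequency."""
    return c * n_b / f


def broadcast_latency(d, W):
    """t_BC = d / W: d parameters sent over a total bandwidth W."""
    return d / W


def round_latency(tau, gamma, t_cm, t_bc, t_bh):
    """ASSUMED accounting for one inter-cluster iteration, not the exact (36).

    Each intra-cluster iteration: one gradient computation, one uplink, one downlink.
    Then gamma local steps, one model uplink, the backhaul exchange, one downlink.
    """
    intra = tau * (t_cm + 2.0 * t_bc)
    local = gamma * t_cm
    inter = 2.0 * t_bc + t_bh
    return intra + local + inter


# Hypothetical values: 20 cycles/sample, 1000 samples, 1 GHz CPU,
# a model with 1e5 parameters, and 1 MHz of bandwidth.
t_cm = computation_latency(c=20, n_b=1000, f=1e9)
t_bc = broadcast_latency(d=1e5, W=1e6)
t_round = round_latency(tau=3, gamma=2, t_cm=t_cm, t_bc=t_bc, t_bh=1.0)
```

None of these terms takes the number of devices or clusters as an input, which mirrors the scalability property stated above: all devices share the same resource block.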

V. CONVERGENCE ANALYSIS
The convergence analysis of MultiAirFed in terms of the optimality gap is presented in the following theorem, based on the estimations in (22) and (31). The analysis relies on the following assumptions, which are common in the literature [5], [17], [18], [24], [25].
Assumption 1 (Lipschitz-Continuous Gradient): The gradient of the loss function F(w) in (3) is Lipschitz continuous with a positive constant L > 0. That is, for any model vectors w_1 and w_2, the norm of the gradient difference is at most L times the distance between w_1 and w_2.
Assumption 2 (Variance Bound): The local gradient estimate g at a device is an unbiased estimate of the ground-truth gradient ∇F(w) with bounded variance, where the bound depends on the mini-batch data size B.
Assumption 3 (Polyak-Lojasiewicz Inequality): Consider F* = F(w*) from the problem (4). There is a constant δ ≥ 0 such that the inequality (40) is satisfied.
The inequality (40) is significantly more general than the assumption of strong convexity [39].
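For reference, these three assumptions are commonly written in the following textbook forms. The constants and the exact normalization may differ from the numbered expressions of this paper; in particular, the symbol σ² for the per-sample variance constant is an assumption introduced here for illustration.

```latex
\begin{align*}
  &\text{Lipschitz-continuous gradient:} &
  \|\nabla F(w_1) - \nabla F(w_2)\| &\le L \,\|w_1 - w_2\|, \\
  &\text{Unbiasedness and variance bound:} &
  \mathbb{E}\{g\} = \nabla F(w), \quad
  \mathbb{E}\{\|g - \nabla F(w)\|^2\} &\le \frac{\sigma^2}{B}, \\
  &\text{Polyak-Lojasiewicz inequality:} &
  \|\nabla F(w)\|^2 &\ge 2\delta \,\big(F(w) - F^*\big).
\end{align*}
```

A strongly convex function with parameter δ satisfies the last inequality, whereas the inequality itself does not require convexity, which is the sense in which it is the more general condition.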
To make the analysis more manageable, we assume that the downlink channel gains from an edge server to the devices in its cluster are greater than th_0. Also, the normalizing factors ϑ^o_{dy_0,i,t} and ϑ_{dy_0,t} in any intra-cluster iteration i and inter-cluster iteration t are assumed to be lower than the constants ϑ^{bo}_{dy_0} and ϑ^b_{dy_0}, respectively.
Theorem 1: Consider a fixed learning rate µ t = µ satisfying Then, the following optimality gap holds for the local learning model of any reference device.
Proof: See Appendix A.
Considering the error term in the bound, the first four parts reflect the gradient estimation errors. These are followed by the effects of the intra- and inter-cluster uplink and downlink errors. The scaling factors of the error terms depend on the learning parameters and on the device selection, while the error terms themselves depend on the interference, which is determined by the network topology, the wireless environment, and the power control. This structure therefore supports a separate design and evaluation of the learning algorithm and the wireless communication.
In order to establish the bound, we need to characterize several expected quantities.These include the expected values of the terms related to the number of active devices, namely E . Additionally, we need to determine the upper-bounds for the intra-and inter-cluster uplink and downlink error terms, which include Binomial(M, e −th1 ), ∀x, |A| ∼ Binomial(CM, e −th1 ), and there is at least one active device in each cluster, the expected terms are computed as , and E ∥ϵ dy 0 ∥ 2 provided in Subsections IV.B and C can be upper-bounded as which is due to , and The following remarks and design insights can be concluded from Theorem 1.
Remark 1: The first term of the optimality gap decreases with the number of inter-cluster iterations T, while the second term, the error term, increases and approaches a bound.
Remark 2: There is a tradeoff for the optimality gap when th_1 increases. The term Ei(th_1)e^{−th_1} in Ψ, and hence the intra-cluster uplink error term E{∥ϵ^{bo}_u∥²}, decreases; however, the term in (43) increases. Hence, the impact of th_1 on the convergence in the general case is not evident.
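One side of this tradeoff can be illustrated numerically: a larger threshold leaves fewer active devices, so averages over the active set become noisier. The sketch below computes the expected inverse number of active devices when |A^x| is binomial with success probability e^{−th_1}. It is an illustration under assumptions: the conditioning on a non-empty cluster is a choice made here to keep the expectation finite, the helper name is hypothetical, and the quantity is not claimed to coincide with the exact expression in (43).

```python
import math


def expected_inverse_active(M, th1):
    """E[1/|A^x|] for |A^x| ~ Binomial(M, exp(-th1)), conditioned on |A^x| >= 1."""
    p = math.exp(-th1)
    p_nonempty = 1.0 - (1.0 - p) ** M
    total = 0.0
    for k in range(1, M + 1):
        pmf = math.comb(M, k) * p**k * (1.0 - p) ** (M - k)
        total += pmf / k
    return total / p_nonempty


# Sweep the threshold for a cluster of M = 10 devices (illustrative values).
sweep = [(th1, expected_inverse_active(10, th1)) for th1 in (0.0, 0.1, 0.5, 1.0)]
```

With th_1 = 0 every device is active and the value is exactly 1/M; it then grows with th_1, in contrast to the decreasing uplink error term mentioned in the remark.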
Remark 3: The scaling factors of the intra-cluster uplink and downlink error terms $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{bo}_{dy_0}\|^2$ in the optimality gap increase with $\tau$ and $\mu$ at the rate $\mathcal{O}(\mu^3\tau^2)$. Also, while the scaling factor of the inter-cluster uplink error term $\mathbb{E}\|\epsilon^{b}_{u}\|^2$ does not change with them, the scaling factor of the inter-cluster downlink error term $\mathbb{E}\|\epsilon^{b}_{dy_0}\|^2$ increases at the rate $\mathcal{O}(\mu\tau)$. Hence, as $\tau$ and $\mu$ grow, the intra-cluster error terms enlarge the optimality gap much faster than the inter-cluster error terms.
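Remark 6: The expected terms of the number of active devices decrease with the number of collaborating clusters $C$ with an order higher than one, as in (45) and (46). On the other hand, the inter-cluster error terms $\mathbb{E}\|\epsilon^{b}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{b}_{dy_0}\|^2$ increase only linearly with $C$. Hence, the optimality gap decreases with $C$.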
Remark 7: From (20), the intra- and inter-cluster uplink error terms $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{b}_{u}\|^2$ increase linearly with the number of devices per cluster, i.e., $M$. However, from (44) and (46), the expected terms of the number of active devices decrease with $M$ at orders higher than one, and they contribute to all the scaling factors of the error terms. Hence, the optimality gap decreases overall.
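The decrease with $M$ noted in Remark 7 can be checked numerically for a single cluster. The following Python sketch evaluates $\mathbb{E}[1/|\mathcal{A}_o|^2]$ exactly, assuming $|\mathcal{A}_o| \sim \mathrm{Binomial}(M, e^{-\mathrm{th}_1})$ conditioned on at least one active device. The function name, the modeling of that condition by zero-truncation, and the value $\mathrm{th}_1 = 0.5$ are illustrative assumptions and are not taken from (43)-(46).

```python
import math


def expected_inverse_square(num_devices: int, threshold: float) -> float:
    """E[1/X^2] for X ~ Binomial(M, p) conditioned on X >= 1, with p = exp(-threshold).

    Assumption: the "at least one active device" condition is modeled by
    zero-truncating the binomial distribution of a single cluster.
    """
    p = math.exp(-threshold)
    prob_nonzero = 1.0 - (1.0 - p) ** num_devices
    total = 0.0
    for k in range(1, num_devices + 1):
        pmf = math.comb(num_devices, k) * p**k * (1.0 - p) ** (num_devices - k)
        total += pmf / k**2
    return total / prob_nonzero


if __name__ == "__main__":
    th1 = 0.5  # hypothetical activation threshold
    for m in (1, 5, 10, 20, 40):
        value = expected_inverse_square(m, th1)
        # m^2 * value indicates how closely the decay follows a second-order rate.
        print(f"M={m:3d}  E[1/|A_o|^2]={value:.5f}  M^2*E={m * m * value:.3f}")
```

Printing $M^2 \cdot \mathbb{E}[1/|\mathcal{A}_o|^2]$ alongside the expectation gives a quick view of whether the decay is faster than first order under these assumptions.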
Remark 8: For a fixed total number of active devices in the collaborating clusters, reducing the number of clusters (or increasing the number of active devices in each cluster) reduces the optimality gap. This results from the constant value of $\mathbb{E}\left[\frac{1}{|\mathcal{A}|^2}\right]$ and the decreasing value of $\mathbb{E}\left[\frac{1}{|\mathcal{A}_o|^2}\right]$, which, as a scaling factor, directly reduces the terms in (42).
Remark 9: The scaling factor, i.e., the effect on the optimality gap, of the intra-cluster uplink error term $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ is larger than that of the intra-cluster downlink error term $\mathbb{E}\|\epsilon^{bo}_{dy_0}\|^2$. However, for the inter-cluster error terms, the downlink term $\mathbb{E}\|\epsilon^{b}_{dy_0}\|^2$ has a much higher scaling factor than the uplink term $\mathbb{E}\|\epsilon^{b}_{u}\|^2$. Hence, reducing uplink interference during intra-cluster iterations and reducing downlink interference during inter-cluster iterations can be an efficient way to improve the performance.
Remark 10: From (20) and (48)-(50), the uplink error terms $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{b}_{u}\|^2$ and the downlink error terms $\mathbb{E}\|\epsilon^{bo}_{dy_0}\|^2$ and $\mathbb{E}\|\epsilon^{b}_{dy_0}\|^2$ increase linearly and quadratically, respectively, with the cluster center density $\lambda_p$. Hence, the optimality gap increases at the rate $\mathcal{O}(\lambda_p^2)$.
In the following corollaries, we present special cases of Theorem 1. For the single-server case, when all edge servers work independently on different learning tasks, i.e., $C = 1$, we have the following simplified convergence result.
Corollary 1: In the case $C = 1$ and with the learning rate as in (41), the optimality gap for the local learning model of any reference device is obtained by specializing (42) to a single cluster.
Proof: It follows from Theorem 1 with $\mathcal{A} = \mathcal{A}_o$ for $C = 1$.
When all the edge servers work collaboratively on the same task, i.e., $C \to \infty$, the convergence result is simplified in the next corollary.
Corollary 2: In the case $C \to \infty$ and with the learning rate as in (41), the optimality gap for the local learning model of any reference device is obtained from (42) in the limit $C \to \infty$.
Proof: It follows from Theorem 1 when $C \to \infty$.
Remark 11: Four terms in the optimality gap given in Theorem 1, the two terms of the inter-cluster errors and [...]
Remark 12: The scaling factors of the intra-cluster error terms $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{bo}_{dy_0}\|^2$ approach equal terms when $C \to \infty$; thus, their effects on the convergence will be the same.

VI. EXPERIMENTAL RESULTS
The learning task over the collaborating clusters is classification on the standard MNIST and CIFAR-10 datasets. The classifier model for MNIST (CIFAR-10) is implemented using a CNN, which consists of two (four) 3 × 3 convolution layers with ReLU activation (the (two) first with 32 channels, the (two) second with 64), each (two) followed by a 2 × 2 max pooling; a fully connected layer with 128 units and ReLU activation; and a final softmax output layer. We consider both i.i.d. and non-i.i.d. distributions of dataset samples over the devices. The number of samples differs across devices and follows the power law distribution $\sim 110\,n^{-2}$, $100 \le n \le 1000$. For the non-i.i.d. case, each device has samples of only two classes, similar to [15], [21], [24]. The performance is measured as the learning accuracy on the test dataset over the global inter-cluster iteration count $t$. Each performance result is evaluated as the average of 10 realizations to account for random network distributions.
In Fig. 2, the accuracy is shown for different numbers of intra-cluster iterations $\tau$ in the MNIST and i.i.d. scenario. As observed, increasing $\tau$ or $t$ improves the learning performance, which supports Remarks 1 and 4. The improvement diminishes at higher $\tau$ or $t$. Also, increasing $\tau$ accelerates convergence in terms of $t$. This shows that a minimum number of intra- and inter-cluster iterations can ensure a desirable performance. The latency $T_L$ given in (36) is plotted in Fig. 3 for the different values of $\tau$ in Fig. 2 when the target learning accuracy of 95% is achieved. We assume $t_{BH} = 10\,t_{BC}$. As observed, the latency is minimized at $\tau = 17$. For higher values, the convergence rate does not increase sufficiently to compensate for the longer time spent on the intra-cluster iterations.
In Fig. 4, the accuracy is shown for different $\tau$ and local iterations $\gamma$ in the CIFAR-10 and i.i.d. scenario. The performance improves with increasing $\gamma$, which supports Remark 4. Also, by comparing the cases ($\tau = 6$, $\gamma = 12$) and ($\tau = 12$, $\gamma = 2$), the greater impact of $\tau$ compared to $\gamma$ on the performance is demonstrated. This is mainly because of the detrimental effect of one of the terms in the optimality gap in (42).
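The per-device dataset sizes used in the experiments follow a power law in $n$, which can be drawn by inverse-transform sampling. The sketch below assumes the density $110\,n^{-2}$ is treated as continuous on $[100, 1000]$ (its mass there is $0.99$, so it is renormalized) and that sizes are rounded to integers. The function names and the seed are hypothetical, and the sampling procedure actually used for the reported results is not specified in this section.

```python
import random
from typing import List

N_MIN, N_MAX = 100, 1000  # support stated in the text


def density_mass(scale: float = 110.0) -> float:
    """Integral of scale * n^-2 over [N_MIN, N_MAX]."""
    return scale * (1.0 / N_MIN - 1.0 / N_MAX)


def sample_dataset_size(rng: random.Random) -> int:
    """Draw one dataset size from the density proportional to n^-2 on [N_MIN, N_MAX]."""
    u = rng.random()
    # Inverse of the normalized CDF F(n) = (1/N_MIN - 1/n) / (1/N_MIN - 1/N_MAX).
    inv_n = 1.0 / N_MIN - u * (1.0 / N_MIN - 1.0 / N_MAX)
    return round(1.0 / inv_n)


def sample_dataset_sizes(num_devices: int, seed: int = 0) -> List[int]:
    """Draw dataset sizes for num_devices devices with a fixed seed."""
    rng = random.Random(seed)
    return [sample_dataset_size(rng) for _ in range(num_devices)]


if __name__ == "__main__":
    print("mass of 110 n^-2 on the support:", density_mass())
    print("example sizes for 15 devices:", sample_dataset_sizes(15))
```

Under this density most devices hold few samples: the median size is about 182, so the data volume is strongly unbalanced across devices.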
Figs. 5-7 evaluate the effect of $C$ and $M$ in the MNIST and non-i.i.d. scenario. In Fig. 5, we observe that multi-server collaboration can significantly improve the accuracy, which supports Remark 6. This is due to the access to a diverse set of intra-cluster learning models and the increase in the total number of active devices in the learning process over different clusters. Furthermore, as the inter-cluster uplink and downlink errors increase, the degree of improvement diminishes at higher values of $C$. Fig. 6 demonstrates that the performance improves as $M$ increases, since the number of active devices that can participate in the learning process increases. This supports Remark 7. In Fig. 7, the number of collaborating devices is kept constant at $M_c \times C = 60$ for $C = 1, 3, 6$, while $M_n = 15$ in the non-collaborating clusters. The results suggest that consolidating a greater portion of collaborating devices within fewer clusters enhances the performance. This observation aligns with Remark 8 and can be attributed to the engagement of a larger number of devices in intra-cluster iterations.
Figs. 8 and 9 study the effects of the cluster size, as a function of $R$ and $r_0$, and the cluster density $\lambda_p$ in the MNIST and non-i.i.d. scenario. In Fig. 8, as the cluster size grows, the performance diminishes. This decline can be attributed to the devices in a cluster becoming closer to other clusters, leading to amplified interference. Similarly, Fig. 9 illustrates a reduction in accuracy with an increase in $\lambda_p$, aligning with Remark 10. This behavior is the consequence of the increasing interference.
In Figs. 10 and 11, the learning performance of MultiAirFed is compared with the conventional hierarchical FL (HierFed) in [5], [7], [8] and FedSGD in [3] in the MNIST scenario. In FedSGD, the edge servers act as simple relay nodes. In each iteration, gradients are aggregated at the edge server and then directly forwarded to the core server. Thus, the gradients from all the devices in the collaborating clusters are aggregated in each iteration. HierFed differs from MultiAirFed in the intra-cluster iteration, as model parameters are uploaded to and aggregated at the edge servers, and the initial state at each local device is synchronized. This allows executing $\gamma > 1$ local descent steps at each intra-cluster iteration. In the numerical results, $\gamma = 2$. We consider the "over-the-air" transmission scheme proposed in Section IV both for MultiAirFed and for HierFed. Additionally, we implement all the methods with "orthogonal" transmissions in both uplink and downlink, which eliminates interference by assuming unlimited communication resources.
Fig. 10 considers the non-i.i.d. data distribution. The results indicate that MultiAirFed outperforms HierFed by a substantial margin for both transmission schemes. This highlights the robustness of MultiAirFed against both data heterogeneity and interference. The less demanding i.i.d. scenario is shown in Fig. 11, now also considering a denser network with $\lambda_p = 40$ km$^{-2}$. Although HierFed outperforms MultiAirFed when interference is absent, due to the multiple local steps, its performance is significantly impacted under the higher interference. While the accuracy increases fast in the first iterations, the learning does not converge. This supports our reasoning in Section III and motivates the use of MultiAirFed. In both Figs. 10 and 11, the performance of FedSGD is weak even under orthogonal transmission, as this scheme does not take advantage of the hierarchical structure. Therefore, we do not evaluate the effect of interference on FedSGD.
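The difference between aggregating gradients and aggregating locally updated models can be seen on a toy problem. The sketch below is neither MultiAirFed nor HierFed: it ignores the wireless channel, the hierarchy, and the interference, and it uses two hypothetical devices with scalar quadratic losses $F_k(w) = \frac{a_k}{2}(w - c_k)^2$ chosen only for illustration. With such heterogeneous losses, averaging models after $\gamma = 2$ local descent steps settles at a point that differs from the minimizer of the average loss, whereas averaging gradients reaches it.

```python
# Hypothetical heterogeneous devices: F_k(w) = 0.5 * a_k * (w - c_k)^2.
CURVATURES = [1.0, 3.0]
CENTERS = [0.0, 1.0]
STEP_SIZE = 0.1


def local_gradient(k: int, w: float) -> float:
    """Gradient of the k-th local loss at w."""
    return CURVATURES[k] * (w - CENTERS[k])


def gradient_aggregation_round(w: float) -> float:
    """One descent step on the average of the local gradients."""
    grads = [local_gradient(k, w) for k in range(len(CENTERS))]
    return w - STEP_SIZE * sum(grads) / len(grads)


def model_aggregation_round(w: float, local_steps: int) -> float:
    """Each device runs local_steps descent steps from w; the models are averaged."""
    models = []
    for k in range(len(CENTERS)):
        w_k = w
        for _ in range(local_steps):
            w_k -= STEP_SIZE * local_gradient(k, w_k)
        models.append(w_k)
    return sum(models) / len(models)


def run(round_fn, num_rounds: int = 500, w0: float = 0.0) -> float:
    """Apply round_fn repeatedly, starting from w0."""
    w = w0
    for _ in range(num_rounds):
        w = round_fn(w)
    return w


# Minimizer of the average loss: sum(a_k * c_k) / sum(a_k).
GLOBAL_OPTIMUM = sum(a * c for a, c in zip(CURVATURES, CENTERS)) / sum(CURVATURES)

if __name__ == "__main__":
    print("global optimum      :", GLOBAL_OPTIMUM)
    print("gradient aggregation:", run(gradient_aggregation_round))
    print("model aggregation   :", run(lambda w: model_aggregation_round(w, 2)))
```

With a single local step the two rounds coincide, so in this toy setting the gap arises only for $\gamma > 1$ combined with heterogeneous local losses.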

VII. CONCLUSIONS
In this paper, we proposed a new two-level federated learning algorithm that leverages the hierarchical network architecture and intra- and inter-cluster collaborations for a higher communication efficiency and learning accuracy. To implement the proposed algorithm over wireless distributed systems independent of their scale and with minimum resource requirements, we presented an over-the-air aggregation scheme for the uplink and a bandwidth-limited broadcast scheme for the downlink, and determined how uplink and downlink interference impacts gradient and model aggregations in the algorithm. To minimize the interference-induced distortion on the estimations, we incorporated and optimized normalizing factors. We utilized a PCP to characterize the spatial distribution of the devices and edge servers, derived a convergence bound of the learning process, and presented design remarks. Our results show that the PCP-based modeling leads to useful insights on how the network parameters affect the interference and, consequently, the learning performance. The presented experimental results confirm the analytic findings and demonstrate that the proposed gradient-based hierarchical FL outperforms existing solutions and achieves high accuracy despite interference and data heterogeneity.

APPENDIX A PROOF OF THEOREM 1
In the proof, we explicitly show the indices of the intra- and inter-cluster iterations. Then, the update of the learning model at global inter-cluster iteration $t + 1$ is represented as in (53). According to (53) and the $L$-Lipschitz continuity property in Assumption 1, we obtain (54). By taking the expectation on both sides of (54) and considering the independence of the error terms, we continue as in (55). Next, we bound the first term of the right-hand side (RHS) in (55). We can write its inner-sum term as in (56). Using the equality $\|a_1 - a_2\|^2 = \|a_1\|^2 + \|a_2\|^2 - 2a_1^\top a_2$ for any vectors $a_1$ and $a_2$, the term in the sum in (56) can be written as in (57). From Assumption 1, the last term in (57) can be bounded, which leads to the two-term expression in (59). The first term of the RHS in (59) can be upper-bounded in two steps, where (a) comes from the inequality of arithmetic and geometric means and (b) is from the convexity of the function $\|\cdot\|^2$. The second term of the RHS in (59) can be upper-bounded similarly. Next, following the same approach as in (56)-(63), we can bound the second term of the RHS in (55), where (i) is due to the independence and (j) is from Assumption 2. Following a similar approach as in (68), the third term of the RHS in (66) is bounded as in (70). Thus, under a small enough $\mu_t$ satisfying the required conditions, a one-step bound is obtained. This bound connects the inter-cluster iterative steps $t + 1$ and $t$. To get the bound of Theorem 1, we can replace $\mathbb{E}[F(w^{o}_{y_0,0,t})] - F^*$ on the RHS with the same one-step bound for $t$ and $t - 1$. Repeating the procedure over $\{t - 1, \cdots, 0\}$, and using the equality $\sum_{i=0}^{t-1} c^i = \frac{1 - c^t}{1 - c}$ for any $c < 1$, the proof is complete.
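The final unrolling step can be checked numerically. The sketch below assumes the one-step bound has the generic form $G_{t+1} \le c\,G_t + B$ with a contraction factor $0 \le c < 1$ and a constant error term $B$; the symbols $G_t$ and $B$ and all numeric values are placeholders, not the constants of Theorem 1. It confirms that iterating the recursion matches the closed form $c^t G_0 + B\,\frac{1 - c^t}{1 - c}$, whose first term decays with $t$ while the second grows toward $\frac{B}{1 - c}$, in line with Remark 1.

```python
def unroll_recursion(gap0: float, c: float, b: float, num_steps: int) -> float:
    """Iterate G_{t+1} = c * G_t + b for num_steps steps (bound taken with equality)."""
    gap = gap0
    for _ in range(num_steps):
        gap = c * gap + b
    return gap


def closed_form(gap0: float, c: float, b: float, num_steps: int) -> float:
    """c^t * G_0 + b * (1 - c^t) / (1 - c), using the geometric-sum equality."""
    decay = c**num_steps
    return decay * gap0 + b * (1.0 - decay) / (1.0 - c)


if __name__ == "__main__":
    gap0, c, b = 10.0, 0.9, 0.05  # placeholder values
    for t in (0, 1, 10, 50, 200):
        print(t, unroll_recursion(gap0, c, b, t), closed_form(gap0, c, b, t))
    print("limit of the error term:", b / (1.0 - c))
```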

Fig. 1: Representation of the FL system with three clusters, and the message exchange in the MultiAirFed method.



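Fig. 11: Test accuracy as a function of global iterations $t$ (i.i.d.).
Remark 4: When $\mu \propto \frac{1}{\tau+\gamma}$, the optimality gap has the rate $\mathcal{O}\left(\mathrm{const} + \frac{1}{\tau+\gamma}\right)$ with $\tau$ and $\gamma$. Hence, increasing $\tau$ and $\gamma$ decreases the gap.
Remark 5: In the optimality gap, the intra-cluster error terms $\mathbb{E}\|\epsilon^{bo}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{bo}_{dy_0}\|^2$ are directly scaled by the expected inverse square of the number of active devices in a single cluster, $\mathbb{E}\left[\frac{1}{|\mathcal{A}_o|^2}\right]$. Also, the inter-cluster error terms $\mathbb{E}\|\epsilon^{b}_{u}\|^2$ and $\mathbb{E}\|\epsilon^{b}_{dy_0}\|^2$ are scaled by the expected inverse square of the total number of active devices in the collaborating clusters, $\mathbb{E}\left[\frac{1}{|\mathcal{A}|^2}\right]$.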