
Federated Learning in Heterogeneous Wireless Networks With Adaptive Mixing Aggregation and Computation Reduction


Topic: Special Issue of Selected Best Papers of IEEE International Conference on Communications (ICC)

Abstract:

Despite the recent advancements achieved by federated learning (FL), its real-world deployment is significantly impeded by the heterogeneous learning environment, specifically manifesting as devices with various computing capabilities, non-I.I.D. (Independent and Identically Distributed) data distribution and dynamic wireless transmission conditions. Such learning heterogeneity greatly harms the learning performance, e.g., convergence and learning accuracy. Therefore, we introduce the AMA-FES (adaptive-mixing aggregation, feature-extractor sharing) framework with an asynchronous aggregation scheme to address these challenges. To mitigate the impact of the non-I.I.D. data, we propose the AMA scheme, which maintains training stability by compromising between the previous global model and the synchronised local model updates, avoiding abrupt changes to a completely new model. To reduce the computation load, we introduce the FES scheme, which enables computing-limited devices to update only the classifier. To address the asynchronous model updates caused by transmission delay, we perform asynchronous aggregation with staleness-based weighting. We implement the AMA-FES framework in a practical scenario where mobile UAVs act as FL training clients to conduct image classification tasks. The experimental results validate the effectiveness of the AMA-FES scheme in restoring training stability and learning accuracy without causing extra computation or communication expenditures in heterogeneous wireless networks.
Page(s): 2164 - 2182
Date of Publication: 25 March 2024
Electronic ISSN: 2644-125X


SECTION I.

Introduction

The recent data boom at the network edge, driven by advances in wireless communication, has catalysed the growing application of distributed learning (DL). At the same time, federated learning (FL) has emerged as a leading DL approach and has achieved great success in recent years. FL, first introduced by Google [2] in 2016, allows remote devices and the FL server to train the target model collaboratively, as shown in Fig. 1, without sharing the privacy-sensitive raw data. Additionally, FL achieves communication efficiency by allocating more computation to each local device. FL has since been widely used in real-life applications, e.g., smart homes and smart appliances [3], automated vehicles [4] and healthcare [5].

FIGURE 1. Federated Learning: Naive FedAvg on image classifications.

Despite the successful emergence of FL and its applications, the heterogeneous learning environment has significantly impeded its real-world implementation. This heterogeneity, inherent in DL, is especially prominent in FL and manifests in three primary forms. The first is computation heterogeneity. As the devices are distributed across the network edge, their computing resources vary greatly, including the battery capacity and the computing capability of the processing unit. Devices with limited computing resources are often referred to as “stragglers” in the FL literature. Such stragglers struggle to complete the local training in the required time frame, or are incapable of conducting the local training at all, due to the limitations of the computing unit or the battery condition. The second is the heterogeneous wireless communication environment. The wireless channels between the devices and the FL server are time-varying and can be significantly affected by environmental factors and geographical location. If the transmission rate threshold is not satisfied, the FL server may fail to receive the local model updates in time. In some FL literature, computation and communication heterogeneity are collectively referred to as system heterogeneity. Despite arising from different mechanisms, both result in delayed or failed reception of local model updates at the FL server, thereby slowing down the training process. Consequently, FL aggregation is either postponed, leading to long idle time for other devices, or proceeds without the delayed model updates, potentially decreasing the learning accuracy and stability. Finally, the third is statistical heterogeneity, a.k.a. data heterogeneity or non-I.I.D. (Independent and Identically Distributed) data. Due to the geographical dispersion, each device may see samples from a different distribution. However, learning from a balanced dataset is crucial to achieving optimal model performance in machine learning (ML), while imbalanced and non-I.I.D. data harms model performance, such as accuracy and convergence. Hence, addressing the learning heterogeneity is of vital importance to promoting the real-world applications of FL, e.g., developing augmented reality (AR) for Metaverse creation [6], employing UAVs (Unmanned Aerial Vehicles) for image classification [7] or processing sensor-collected data for further analysis [8].

A. Related Works

A significant amount of effort has been dedicated to solving the FL heterogeneity problem, as summarised in Table 1. Regarding the statistical heterogeneity, the naive FL optimisation approach, Federated Averaging (FedAvg), is first reported to be tolerant of the non-I.I.D. data distribution in work [2]. However, Zhao et al. [9] have shown that when the data distribution is extremely skewed, with each client device holding samples from only one class, FedAvg fails to maintain model performance. The accuracy drops significantly, up to 11.31% in MNIST classification [24] and 51.31% in CIFAR10 classification [25]. Additionally, FedAvg did not account for the system heterogeneity, often leading to the discarding of delayed updates. This consequently results in the loss of information, causing unstable training and reduced accuracy. Therefore, Zhao et al. [9] propose to alleviate the non-I.I.D. data distribution on each distributed device by enabling a shared public dataset, which has samples balanced across the whole feature space among the training participants. In contrast, to enrich the local data diversity, rather than directly sharing the datasets, Duan et al. [26] gather the data statistics of each local device at the server, which are then sent to each local device for local data augmentation. A similar work that performs data augmentation with a lightweight data generator can be found in work [27]. However, sharing private data, directly [9] or indirectly [26], [27], violates the privacy rule of FL, and does not apply to privacy-sensitive applications, such as healthcare.

TABLE 1 Summary of FL Heterogeneity: Cause & Consequence, Related Works and Our Solutions
TABLE 2 Key Abbreviations
TABLE 3 Variable Specifications
TABLE 4 Comparison of Approaches Addressing Non-I.I.D. Issue in FL

Nevertheless, other literature proposes to address the statistical heterogeneity from the model level. Borrowing the idea from multi-task learning, Arivazhagan et al. [10] treat the local non-I.I.D. datasets as distinct learning tasks, then propose FedPER, which performs personalised FL via combining personalisation layers and base layers on each local model to account for the data skewness. While the base layers are updated with FedAvg, the personalised layers are kept locally. Alternatively, to achieve personalisation, Li et al. [11] propose FedBN that incorporates the batch-normalisation (BN) layers as personalised layers to correct the feature shift caused by the imbalanced data distribution. Similarly, Mills et al. [12] propose Multi-Task FL (MTFL) utilising the non-federated BN layers to address the non-I.I.D. data. Similar works that address the non-I.I.D. data issue utilising model personalisations can be found in work [28], [29]. Following the idea of correcting the feature shift presented in the local model, Wang et al. [13] propose FedNOVA to tackle the non-i.i.d. data impact. Specifically, FedNOVA performs weight normalisation on the local model updates before global aggregation. Although these model-level solutions avoid direct data sharing, they induce extra computation load, which may not apply to devices with limited computing resources. On the contrary, FedProx [14] addresses statistical heterogeneity with local model regularisation, i.e., adding a penalisation term to the local loss function to restrict the local model growth, thus forcing each updated local model to be alike.

On the other hand, targeting the system heterogeneity, work [15] proposes to select participating devices with sufficient training resources, which often leads to the fairness issue. To be specific, regardless of the selection criteria, the selection scheme tends to favour a specific group of devices. In this case, the remaining devices would have less chance of participating in the training, consequently resulting in lower training accuracy on these devices. The problem becomes even worse in the non-I.I.D. scenario. Alternatively, instead of performing client selection, FedProx [14] proposes a partial work scheme that reduces the computation for computing-limited devices by allowing them to compute a reduced amount of local updates depending on their computing capability, which often hinders the model convergence. On the contrary, Cho et al. [16] propose FedET, which accommodates heterogeneous computing devices by pre-designing a variety of local models. Each device is assigned with a different local model that suits its computing capability and then the knowledge learned locally is transferred to the global model via knowledge distillation. Nevertheless, the necessity to pre-design the local models significantly limits the scalability of the application, thereby impeding its practical implementation.

In addition, the restricted power supply of wireless devices, specifically their battery life, significantly affects the performance of FL. Given the limited resources, energy consumption becomes a crucial concern for long-term FL applications and can leave devices unable to take part in FL training. Instead of selecting devices with sufficient resources as in work [15], energy harvesting, which enables wireless devices to collect and store energy from the surrounding environment, is a viable alternative. Pan et al. [17] consider both energy security and information privacy, offering an energy harvesting approach that balances these aspects. Specifically, they introduce an FL-based scheme for detecting malicious energy usage, taking into account the energy status and user behaviour. This scheme is further enhanced by an incentive mechanism that encourages the participation of each FL node in the energy-information protection strategy. A similar work can be found in Pan et al. [18].

There is also literature targeting the asynchronous model updates caused by the dynamic wireless channel condition. First, many works employ staleness-based weighting schemes to aggregate asynchronous model updates. For example, Xie et al. [19] propose a fully asynchronous aggregation scheme that aggregates the weighted local model updates upon receiving without considering other devices. Specifically, [19] provide three different weighting functions based on the model staleness, i.e., constant, hinge and polynomial, while the hinge function demonstrates the best simulation performances. However, the fully asynchronous approach may lead to excessive communication overhead. In contrast, Chen et al. [20] propose FedSA, which performs periodic asynchronous aggregation with staleness-based weighting. Specifically, the staleness weighting scheme considers not only the one-round communication delay but also the local computing delay. Then, this staleness-based weight is normalised before being applied for aggregation. Alternatively, Liu et al. [21] propose FedASMU that performs asynchronous aggregation between the local model updates and the global model with a dynamic staleness-aware weighting scheme. Specifically, the staleness-aware weight of the local model updates is expressed with a dynamic polynomial function. Then, a bi-level optimisation problem is formulated to optimise these staleness-aware weights. Similar studies utilising the staleness-based weighting can be found in work [30], [31]. On the other hand, rather than performing post aggregation of the delayed model updates, where the inclusion of the staled model into the global model may bring training instability and reduced accuracy, Li et al. [7] propose an opportunistic transmission scheme that proactively uploads the intermediate local model updates to the FL server as fallback solutions for global aggregation. 
However, additional model transmissions also signify more power consumption, which can be unfriendly to devices with limited power resources.
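The three staleness-weighting shapes named above (constant, hinge and polynomial) can be sketched as follows. This is only an illustration: the exact functional forms, the parameter names `a` and `b` and their default values are assumptions and may differ from those used in [19].

```python
def constant_weight(staleness: int) -> float:
    """Every update receives the same weight, regardless of staleness."""
    return 1.0


def polynomial_weight(staleness: int, a: float = 0.5) -> float:
    """Weight decays polynomially as (staleness + 1) ** (-a)."""
    return (staleness + 1) ** (-a)


def hinge_weight(staleness: int, a: float = 10.0, b: int = 4) -> float:
    """Full weight up to a tolerance of b rounds, then a hyperbolic decay."""
    if staleness <= b:
        return 1.0
    return 1.0 / (a * (staleness - b) + 1.0)
```

In all three cases a fresh update (staleness 0) keeps weight 1, and only the decay behaviour for older updates differs.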

Alternatively, client selection is another viable solution addressing the heterogeneous communication environment. To avoid asynchronous updates, we can select clients with sufficient communication resources. For example, Nishio and Yonetani [22] require the channel state information of each client at the start of each training round for client selection. However, as mentioned above, client selection in FL often leads to the fairness issue, where some clients are less frequently selected and thus have lower accuracy on the global model. In contrast, considering the training fairness, Zhu et al. [23] propose an online selection scheme for FL minimising the training latency while considering the client availability and long-term fairness.

B. Research Gap & Contributions

Although the existing literature has made significant advancements, challenges still exist. First, most existing solutions addressing FL heterogeneity themselves require extra effort, either through modifying the local model structure [10], [11] or pre-designing local models [16], which affects scalability. In scenarios such as UAVs acting as FL training clients, where not only the computing and communication resources but also the storage and power are limited, lightweight solutions with easy adaptability are expected. Secondly, in most practical scenarios, these three types of heterogeneity coexist, so addressing only one aspect, as most existing literature does, is inefficient.

Therefore, motivated by the need for lightweight solutions and a lack of work that provides systematic investigation in addressing FL heterogeneity, we propose the FL framework, AMA-FES featuring asynchronous aggregation, systematically addressing the FL heterogeneity without causing additional expenditures. Additionally, it only requires minor modifications of the existing FL optimising algorithm, thereby guaranteeing its scalability and easy adaptability in real-world applications. Contributions of the work are stated as follows:

  • We study the problem of heterogeneity presented in FL and particularly focus on its formation in practical scenarios, i.e., the non-I.I.D. data distribution, the computing-limited devices and the asynchronous model updates caused by the dynamic wireless conditions.

  • We propose a computation mitigation scheme for the computing-limited devices, referred to as feature-extractor sharing (FES). The CPU-friendly FES scheme can be easily integrated with other FL optimising methods and only requires minor modifications.

  • We propose the adaptive-mixing aggregation (AMA) scheme to counteract the impact brought by the data skewness. The AMA scheme greatly smooths the convergence and effectively achieves lower loss value. We further study an asynchronous aggregation scheme, which is compatible with the AMA-FES framework, to incorporate the delayed model updates into the global model.

  • We provide a convergence analysis of the proposed AMA scheme under the smooth non-convex assumption. Given T total training rounds, the AMA scheme converges to a stationary point at a rate of \mathcal{O}\left(\frac{1}{\beta \epsilon T}\right), where \epsilon and \beta represent the learning rate and the AMA parameter respectively, while the asynchronous AMA scheme converges to a stationary point at a rate of \mathcal{O}\left(\frac{1}{\overline{\gamma} \beta \epsilon T}\right), where \overline{\gamma} relates to the staleness-based asynchronous aggregation weight.

  • To assess the effectiveness of the proposed AMA-FES learning framework, we simulate a practical scenario, where mobile UAVs serve as the FL training clients and train image classifiers to facilitate the creation of AR applications. Specifically, we consider a wireless transmission model with Rician fading and additional path loss, where each mobile UAV constantly experiences transmission delay.

  • Simulation results reveal that under a moderate level of computing heterogeneity, the FES scheme can successfully restore learning performance. Under the non-I.I.D. setting, the proposed AMA-FES FL framework increases the accuracy by up to 19.77% and 2.38% on FMNIST [32] when compared with the naive FL and FedProx respectively. The results also show that the proposed asynchronous aggregation scheme achieves a modest accuracy increase of 3% on non-I.I.D. CIFAR10, compared with the naive FL.

Organisation of the remainder of the paper: In Section II, we detail the preliminaries of FL, including the naive FL optimisation method, FedAvg, and the benchmarking scheme used in this work, FedProx. We also provide a comprehensive explanation of how the non-I.I.D. data distribution deteriorates the FL performance. In Section III, we formally state the system models employed in this work. We introduce the FES, the AMA and the asynchronous aggregation schemes in Sections IV-A to IV-C respectively. We present the convergence and complexity analysis in Section V. In Section VI, we present the simulation results, and we conclude the work in Section VII.

SECTION II.

The Preliminaries of FL

In this section, we lay the groundwork for FL to facilitate a better understanding of the work. We start with the naive FL optimisation approach, FedAvg, which is used as the baseline scheme in the simulations. Next, we demonstrate how the non-I.I.D. data affects the FL training performance, providing insights into how the proposed AMA scheme addresses the problem. Finally, we outline the FedProx framework, which serves as a benchmarking scheme in the experiment simulations.

A. Naive FL Optimisation: FederatedAveraging (FedAvg)

Let k denote the device index, S_{t+1} denote the collection of selected devices for training round t+1, w denote the model weights and E denote the rounds of local updates performed on the local devices. Then, in the training round t+1, the naive FL optimising method FedAvg [2] states that each selected device updates the local model as,
\begin{equation*} \forall k \in S_{t+1},~ w_{t+1}^{k} = w_{t} - \epsilon \nabla f_{k}\left(w_{t}\right), \tag{1}\end{equation*}
where
\begin{equation*} \nabla f_{k}\left(w_{t}\right) = \sum_{e=0}^{E-1} \nabla f_{k}\left(w_{t,e}\right), \tag{2}\end{equation*}
\epsilon denotes the learning rate and f represents the objective function. Upon receiving all the local model updates, the FL server conducts the aggregation and updates the global model as,
\begin{equation*} w_{t+1} = \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|}~w_{t+1}^{k}, \tag{3}\end{equation*}
where |d_{k}| denotes the size of the local dataset of the k-th device while D = \bigcup_{k \in S_{t+1}} d_{k}. Then, the general FL optimisation problem is to minimise the following loss function,
\begin{equation*} \min_{w} \sum_{k=1}^{K} \frac{|d_{k}|}{|D|} F_{k}(w),~ F_{k}(w) = \frac{1}{|d_{k}|} \sum_{\{x, y\}_{k} \in d_{k}} f_{k}(w). \tag{4}\end{equation*}
Here f_{k}(w) denotes the local loss \ell_{k}(\{x, y\}_{k}; w) on the given model parameter w, where \{x, y\}_{k} represents a sample from d_{k}. In this work, we establish our proposed AMA-FES FL framework upon FedAvg and employ FedAvg as our comparison baseline; the detailed simulation settings can be found in Section VI.
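The server-side aggregation in equation (3) can be sketched in a few lines. This is a minimal illustration, assuming each model is flattened into a list of floats; the function name `fedavg_aggregate` and its arguments are hypothetical.

```python
from typing import List, Sequence


def fedavg_aggregate(local_weights: Sequence[Sequence[float]],
                     dataset_sizes: Sequence[int]) -> List[float]:
    """Dataset-size-weighted average of local models, as in equation (3)."""
    total = sum(dataset_sizes)  # |D| over the selected devices
    dim = len(local_weights[0])
    global_weights = [0.0] * dim
    for w_k, n_k in zip(local_weights, dataset_sizes):
        share = n_k / total  # |d_k| / |D|
        for i in range(dim):
            global_weights[i] += share * w_k[i]
    return global_weights
```

For example, two devices holding 1 and 3 samples contribute with shares 0.25 and 0.75, so the device with more data pulls the global model further towards its own update.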

B. Non-I.I.D. Data in FL: Impact & Solution

Zhao et al. [9] suggest that the degraded learning performance under non-I.I.D. setting attributes to weight divergence. Specifically, in the I.I.D. scenario, the aggregated model closely approximates the global optima. However, under the non-I.I.D. setting, each local model is updated upon samples from different classes, and consequently, the updated model differs from one to another. Then, the aggregated model diverges from the global optima and results in reduced accuracy.

Heuristically, under the non-I.I.D. setting, each device only contains samples from a limited number of classes. Being trained upon these samples, the local model only learns from these limited classes, leading to reduced accuracy on samples from the remaining ones. Similarly, during an FL training round, the selected devices may possess only samples from a few classes. Consequently, the aggregated model would be biased towards samples from these classes and perform poorly on the balanced dataset. Additionally, the random selection of participating devices for each training round further contributes to this variability in training data diversity, leading to unstable training performance.
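To make the label-skew setting concrete, the sketch below builds a partition in which each device only sees a few classes. It is an illustrative construction (sort by label, then hand out contiguous shards) and is an assumption, not necessarily the partitioning used in this work; the function name and parameters are hypothetical.

```python
from typing import Dict, List


def shard_partition(labels: List[int], num_devices: int,
                    shards_per_device: int) -> Dict[int, List[int]]:
    """Assign sample indices to devices so that each device sees few classes."""
    # Sort sample indices by label so that each shard is dominated by one class.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    num_shards = num_devices * shards_per_device
    shard_size = len(labels) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    partition: Dict[int, List[int]] = {}
    for k in range(num_devices):
        # Deterministic assignment of consecutive shards; shuffling the shards
        # first would randomise which classes end up on which device.
        own = shards[k * shards_per_device:(k + 1) * shards_per_device]
        partition[k] = [i for shard in own for i in shard]
    return partition
```

With one shard per device and as many devices as classes, every device holds a single class, which corresponds to the extreme skew discussed for [9].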

Therefore, to mitigate the impact of non-I.I.D. data, one could enhance the diversity of the local dataset by sharing public datasets [9] or performing local data augmentation [26], yet these approaches potentially lead to privacy concerns. Alternatively, we can treat learning on the non-I.I.D. local datasets as a multi-task learning problem and employ personalised models to address the data skewness [10], [11], [12], [28], [29]. However, personalisation often involves model structure alterations, which may require special aggregation rules on the server. Finally, we can restrict the local model growth, thereby preventing the local models from overfitting the limited local samples [14]. However, this may hinder the convergence speed.

C. Dealing With Heterogeneity Through FedProx

FedProx [14] is proposed to tackle both the computation and statistical heterogeneity. Targeting the devices with various computing capabilities, FedProx proposes the partial work scheme. Specifically, it allows each device to compute a variable amount of local updating, i.e., fewer rounds of local SGD (stochastic gradient descent), depending on its computation resources. On the other hand, to counteract the impact brought by the non-I.I.D. data, FedProx adds a penalisation term to the local objective function f_{k}(\cdot) in addition to the local loss \ell_{k}(\cdot) to restrict the model growth,
\begin{equation*} f_{k}\left(w_{t+1}^{k}; w_{t}\right) = \ell_{k}\left(w_{t+1}^{k}\right) + \rho \left\| w_{t+1}^{k} - w_{t} \right\|^{2}. \tag{5}\end{equation*}
Equation (5) indicates the modified local objective of the k-th device, where \rho is a hyperparameter that denotes the regularisation strength. A larger \rho indicates stronger regularisation, so the local model updates learn less from the local dataset, potentially damaging the convergence performance, while a small \rho may not provide sufficient regularisation.
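The regularised objective in equation (5) and the gradient of its penalty term can be sketched as below. This is a minimal illustration on flattened weight lists; the function names are hypothetical, and the local loss value is assumed to be computed elsewhere.

```python
from typing import List, Sequence


def fedprox_objective(local_loss: float, w_local: Sequence[float],
                      w_global: Sequence[float], rho: float) -> float:
    """Local loss plus rho * ||w_local - w_global||^2, following equation (5)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(w_local, w_global))
    return local_loss + rho * sq_dist


def fedprox_penalty_grad(w_local: Sequence[float],
                         w_global: Sequence[float], rho: float) -> List[float]:
    """Gradient of the penalty term with respect to w_local."""
    return [2.0 * rho * (a - b) for a, b in zip(w_local, w_global)]
```

The penalty gradient always points away from the global model, so subtracting it during SGD pulls the local weights back towards w_{t}; the pull grows with \rho, which matches the trade-off described above.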

We employ FedProx as a benchmark scheme to provide further comparison. Although FedProx itself is not a novel approach, it requires less modification of the model and no additional computation. This matches our setting, where computing and power resources are limited, whereas most of the existing literature induces non-negligible resource consumption.

SECTION III.

System Model

Herein, we consider the problem of image classification with FL in a dynamic wireless environment, where each client device trains the model locally and then uploads the model updates to the FL server for global aggregation. We consider a set of devices, \{u_{k}, k \in \mathcal{K}\}, where \mathcal{K} = \{1,2,\ldots,k,\ldots,K\} denotes the set of device indices, which train the target model collaboratively with the FL server over T communication rounds. Each device u_{k}, \forall k \in \mathcal{K}, owns a local dataset d_{k} with a size of |d_{k}| > 0.

At the start of the training round t+1, a subset of devices, \{u_{k}, k \in S_{t+1}\}, where S_{t+1} \subseteq \mathcal{K}, is randomly selected for training. S_{t+1} denotes the indices of the selected devices at training round t+1 and |S_{t+1}| = m denotes the number of participating devices. If k \in S_{t+1}, then the global model w_{t} is distributed to u_{k} and trained on its local dataset d_{k} for E local epochs as indicated in equation (1). After the local training, the updated local model w_{t+1}^{k} is sent back to the FL server. Upon receiving all model updates from the selected devices \{u_{k}, k \in S_{t+1}\}, the FL server conducts the global aggregation to update the global model w_{t+1}, as indicated in equation (3).

Specifically, we consider training a convolutional neural network (CNN) for image classification, where the loss function indicated in equation (4) is minimised. We use the cross-entropy [33] as the local loss function \ell(\cdot) in this work.
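For reference, the cross-entropy loss of a single sample can be sketched as follows. This is a generic illustration that assumes the classifier outputs raw logits followed by a softmax; the function name is hypothetical and the log-sum-exp shift is a standard numerical-stability device, not something specified in this work.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the true class under a softmax over logits."""
    m = max(logits)  # shift by the maximum to avoid overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]
```

With two equal logits the model is maximally uncertain and the loss equals log 2; a very confident correct prediction drives the loss towards zero.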

SECTION IV.

Methodology

In this section, we present the proposed AMA-FES framework featuring an asynchronous aggregation scheme, as shown in Fig. 2, to tackle the challenges encountered during the practical deployment of FL applications, including the computing-limited devices, the non-I.I.D. data and the asynchronous model updates caused by the transmission delay. We further summarise the proposed framework in Algorithm 1.

FIGURE 2. Demonstration of AMA-FES FL with Asynchronous Aggregation in training round t+1.

Algorithm 1: AMA-FES FL With Asynchronous Aggregation

1: Initialise: E, T, \mathcal{K} = \{1,2,3,\ldots,K\}, m, \alpha_{0}, \gamma, \eta, w_{0}
2: FL Server Executes:
3: for t=1 to T do
4:     Device selection, S_{t} \subseteq \{1,2,3,\ldots,K\}, |S_{t}| = m.
5:     Global model distribution w_{t-1} \rightarrow u_{k} \in S_{t}.
6:     for u_{k} \in S_{t} in parallel do
7:         Device u_{k} updates the local model w_{t-1}.
8:     /* Adaptive Mixing Aggregation (AMA) */
9:     Calculate \alpha_{t}, \beta_{t} as indicated in equation (9) with \alpha_{0}, \eta.
10:    Update w_{t} with the AMA scheme in equation (8).
11:    /* Asynchronous Aggregation */
12:    if asynchronous model updates received then
13:        Calculate weight \gamma_{k} as in equation (11).
14:        Conduct asynchronous model aggregation as indicated in equation (10).
15: Device Executes:
16: if device u_{k} is computing-limited then
17:     /* Feature-Extractor Sharing (FES) */
18:     Update only the classifier as in equation (7) for E epochs.
19: else
20:     Update the entire model as in equation (1) for E epochs.
21: Upload the updated model w_{t}^{k} to the FL server.

A. Feature-Extractor Sharing (FES)

Because the devices are distributed over the network edge, their computing capabilities exhibit large variations, leading to the asynchronous completion of local training. However, FL necessitates the prompt reception of the local model updates for training efficiency. On the other hand, excluding the delayed model updates from global aggregation would result in information loss and reduced accuracy, especially when the number of participating devices is limited. Thus, it is important to proceed with the training on the computing-limited devices while maintaining the training performance, e.g., accuracy and efficiency.

In transfer learning, a CNN is typically segmented into two parts, i.e., the feature extractor and the classifier. The former usually consists of several convolutional layers, while the latter generally comprises fully-connected (FC) layers, which require less computation. The feature extractor is responsible for extracting lower-level features from the raw data, while the classifier is responsible for identifying the corresponding label. A pre-trained feature extractor is often reused for similar tasks to expedite training and lessen computational demands.

Drawing inspiration from the concept of the pre-trained feature extractor, we introduce the feature-extractor sharing (FES) scheme to ease the computation burden, where the computing-limited devices update only the classifier while the devices with sufficient computing resources update the entire model. Additionally, the FES scheme is CPU-friendly since it avoids computing convolutional layers. Suppose that at the start of a communication round t+1, the global model w_{t}, consisting of a feature extractor w_{f,t} and a classifier w_{c,t} as shown in (6), is distributed to the selected devices. During local training, the computing-limited devices fix the parameters of the feature extractor and only update the classifier, as indicated in equation (7).
\begin{align*} &w_{t} = \left[\begin{array}{l} w_{f,t} \\ w_{c,t} \end{array}\right], \tag{6}\\ &w_{c,t+1}^{k} \longleftarrow w_{c,t} - \epsilon \nabla f_{k}\left(w_{c,t}\right),~ w_{t+1}^{k} = \left[\begin{array}{l} w_{f,t} \\ w_{c,t+1}^{k} \end{array}\right]. \tag{7}\end{align*}
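The difference between a full local step, equation (1), and an FES step, equation (7), can be sketched as below. This is a simplified single-step illustration on flattened weight lists; the function name and the explicit `grad_f` argument are hypothetical. In a real FES implementation the feature-extractor gradient would not be computed at all on a computing-limited device, which is where the saving comes from.

```python
from typing import List, Sequence, Tuple


def local_step(w_f: Sequence[float], w_c: Sequence[float],
               grad_f: Sequence[float], grad_c: Sequence[float],
               lr: float, computing_limited: bool
               ) -> Tuple[List[float], List[float]]:
    """One SGD step on (feature extractor, classifier) weights."""
    # The classifier is always updated.
    new_w_c = [w - lr * g for w, g in zip(w_c, grad_c)]
    if computing_limited:
        # FES: keep the shared feature extractor exactly as received.
        return list(w_f), new_w_c
    # Full update: the feature extractor is trained as well.
    new_w_f = [w - lr * g for w, g in zip(w_f, grad_f)]
    return new_w_f, new_w_c
```

Because both kinds of device return a model with the same structure, the server can aggregate their updates without any special handling.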

On the other hand, Luo et al. [34] investigate the similarities among layers across devices in the non-I.I.D. setting. A key observation is that the classifier parameters demonstrate the least similarities, whereas earlier layers show greater uniformity. This observation further endorses the logic of reusing the feature extractor under computation insufficiency. The FES scheme necessitates minor modifications of the optimising methods and can be easily integrated with other FL optimisation approaches. Thus, it provides more flexibility in reducing the computation burden and greater scalability in practical deployment.

B. Adaptive Mixing Aggregation (AMA)

As discussed in Section II-B, to mitigate the impact of non-I.I.D. data distribution, one could enrich the diversity of the local dataset, which potentially causes privacy concerns, or personalise the local model to account for the data bias, which alters the model structure and may require additional effort for aggregation at the server. Alternatively, we can also penalise the model growth forcing each local model to be close to the global model, as indicated in equation (5) in FedProx. However, the regularising strength hyperparameter \rho is very sensitive and often relates to the model structures. Additionally, inappropriate regularisation strength hinders the model convergence and thus requires careful consideration.

In contrast, we propose a novel adaptive mixing aggregation (AMA) scheme to enhance training accuracy and stability while requiring simpler parameter selection. Instead of focusing on the local training stage, we perform mixing aggregation of the previous global model and the timely local model updates received from the selected devices, \forall k \in S_{t+1}. For training round t+1, the AMA scheme is given by
\begin{equation*} w_{t+1} \overset{\Delta}{=} \alpha~w_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|}~w_{t+1}^{k},~\beta = 1 - \alpha. \tag{8}\end{equation*}
Specifically, w_{t+1}^{k} denotes the local model update from device k, while \alpha and \beta indicate the learning strength from the previous global model w_{t} and the local model updates w_{t+1}^{k}, respectively. By cautiously assimilating new information from the local model updates while avoiding abrupt transitions to an entirely new model, which can be heavily biased towards certain classes, the AMA scheme preserves the training stability.

However, a large \alpha implies learning less from the local model updates, which hurts the model convergence, whereas a small \alpha fails to provide sufficient training stability. Hence, to compromise between convergence and stability, we propose to set \alpha and \beta dynamically,
\begin{equation*} \alpha_{t} = \alpha_{0} + \eta t, ~\beta_{t} = 1 - \alpha_{t}. \tag{9}\end{equation*}
Here, \alpha_{0} denotes the initialisation value, \eta denotes the increase rate and t represents the training round index. At earlier stages of the training, \alpha is small and \beta is relatively large, warranting fast convergence. As the training progresses, \alpha gradually increases while \beta reduces, thereby enhancing stability while continuing to learn new knowledge.
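The mixing rule in equation (8) and the linear schedule in equation (9) can be sketched as follows. This is a minimal illustration on plain Python lists, not the authors' code: the function names are hypothetical, and `total_size` (standing for |D|) is passed in explicitly because the normalisation used in practice is an assumption here. The sketch also assumes the caller keeps \alpha_{t} within [0, 1].

```python
from typing import Dict, List

Vector = List[float]


def alpha_schedule(t: int, alpha0: float, eta: float) -> float:
    """Equation (9): alpha_t = alpha_0 + eta * t (beta_t is 1 - alpha_t)."""
    return alpha0 + eta * t


def ama_aggregate(
    w_prev: Vector,
    local_models: Dict[int, Vector],
    data_sizes: Dict[int, int],
    total_size: int,
    alpha: float,
) -> Vector:
    """Equation (8): w_{t+1} = alpha * w_t + beta * sum_k (|d_k| / |D|) * w_{t+1}^k."""
    beta = 1.0 - alpha
    # Start from the contribution of the previous global model.
    mixed = [alpha * w for w in w_prev]
    # Add the data-size-weighted local models, scaled by beta.
    for k, w_k in local_models.items():
        share = data_sizes[k] / total_size
        for i, value in enumerate(w_k):
            mixed[i] += beta * share * value
    return mixed


if __name__ == "__main__":
    w_t = [1.0, 1.0]
    updates = {0: [3.0, 5.0], 1: [5.0, 7.0]}
    sizes = {0: 100, 1: 100}
    print(ama_aggregate(w_t, updates, sizes, total_size=200, alpha=0.5))
    # With the values reported in Section VI-B, alpha grows slowly over the rounds.
    print(alpha_schedule(200, alpha0=0.1, eta=2.5e-3))
```

Setting `alpha=0.0` in this sketch recovers a plain data-size-weighted average of the local models, which makes the role of the mixing term easy to inspect.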

C. Asynchronous Aggregation

In addition to the computing-limited devices, which fail to provide model updates promptly due to computing inability, the wireless condition can also delay the reception of the local model updates for the FL server, owing to the transmission channel dynamics. Similarly, discarding such delayed model updates is not the optimal solution as it often results in information loss and degraded accuracy. Conventional literature in FL opts for asynchronous aggregation to handle such transmission dynamics. Following this convention, we propose a periodic asynchronous aggregation scheme, which integrates the delayed model updates with global aggregation.

Considering training round t+1, the FL server first checks whether any delayed model updates have been received. We denote the delayed model updates received in training round t+1 as w_{t_{k}}^{k}, \forall k \in S_{t+1}^{\prime}, where S_{t+1}^{\prime} denotes the set of indices of the devices whose model updates are delayed and received in training round t+1. For any device k \in S_{t+1}^{\prime}, w_{t_{k}}^{k} denotes the delayed model update sent in training round t_{k}. As detailed in Algorithm 1, if such delayed model updates are detected, then after conducting the global aggregation indicated in equation (8), the FL server performs asynchronous aggregation as indicated in equation (10),
\begin{equation*} w_{t+1} = \left( 1 - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \right) w_{t+1} + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} w_{t_{k}}^{k}, \tag{10}\end{equation*}
where \gamma_{k} represents the staleness-based weight, given by
\begin{equation*} \gamma_{k} = \frac{1}{m_{t+1} + n_{t+1}} \left[ 1 - \sigma\left(t+1 - t_{k}\right) \right]. \tag{11}\end{equation*}
Specifically, m_{t+1} and n_{t+1} indicate the numbers of synchronised and asynchronous local model updates received in training round t+1, respectively, and \sigma represents the standard sigmoid function.
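A minimal sketch of equations (10) and (11) is given below, assuming models are plain lists of floats and each delayed update arrives together with the round index t_{k} in which it was sent. The function names and the data layout are hypothetical; `current_round` plays the role of t+1.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    """Standard sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))


def staleness_weight(current_round: int, sent_round: int, num_sync: int, num_async: int) -> float:
    """Equation (11): gamma_k = [1 - sigmoid(t+1 - t_k)] / (m_{t+1} + n_{t+1})."""
    return (1.0 - sigmoid(current_round - sent_round)) / (num_sync + num_async)


def async_aggregate(
    w_global: Vector,
    delayed: Dict[int, Tuple[Vector, int]],
    current_round: int,
    num_sync: int,
) -> Vector:
    """Equation (10): blend delayed updates into the already aggregated global model."""
    num_async = len(delayed)
    if num_async == 0:
        # Nothing delayed arrived in this round, so the global model is unchanged.
        return list(w_global)
    gammas = {
        k: staleness_weight(current_round, sent_round, num_sync, num_async)
        for k, (_, sent_round) in delayed.items()
    }
    keep = 1.0 - sum(gammas.values())
    blended = [keep * w for w in w_global]
    for k, (w_k, _) in delayed.items():
        for i, value in enumerate(w_k):
            blended[i] += gammas[k] * value
    return blended


if __name__ == "__main__":
    # One update sent in round 4 arrives in round 5, alongside 9 timely updates.
    print(async_aggregate([0.0, 0.0], {7: ([1.0, 2.0], 4)}, current_round=5, num_sync=9))
```

Because the sigmoid is increasing, an update that has been delayed for more rounds receives a smaller weight in this sketch, which is the intended effect of the staleness-based weighting.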

SECTION V.

Convergence & Complexity Analysis

In this section, we address the convergence performance for the AMA and asynchronous AMA scheme, under the non-convex assumption of the loss function f(\cdot ) and the non-I.I.D. data distribution [35]. Specifically, we establish the upper bound on the convergence of both schemes, accompanied by the investigation of the algorithm complexity.

We assume the device selection in each round is independent, i.e., S_{t_{1}} and S_{t_{2}} are independent given that t_{1} \neq t_{2}. We consider the partial device selection scheme for FL training, i.e., |S_{t}| = m < K. We consider training with T communication rounds in total, and in each communication round, E local epochs are scheduled for each selected device with a learning rate \epsilon. For asynchronous AMA, to assist the proof, we bound the number of asynchronous model transmissions received in a training round, |S_{t}^{\prime}|, by C, i.e., \forall t \leq T, |S_{t}^{\prime}| \leq C. To facilitate comprehension, we first introduce the assumptions underlying the convergence analysis and then present the convergence bound.

Assumption 1:

\forall k \in \mathcal{K}, f_{k} is Lipschitz-smooth: for all \boldsymbol{\upsilon}, \boldsymbol{\omega}, f_{k}(\boldsymbol{\upsilon}) \leq f_{k}(\boldsymbol{\omega}) + (\boldsymbol{\upsilon}-\boldsymbol{\omega})^{T} \nabla f_{k}(\boldsymbol{\omega}) + \frac{L}{2} ||\boldsymbol{\upsilon}-\boldsymbol{\omega}||^{2}, where L is the Lipschitz constant.

Assumption 2:

\forall k \in \mathcal{K}, let \xi_{t}^{k} represent a sample uniformly drawn from device u_{k}; then the stochastic gradient (SG) for u_{k} is unbiased: \mathbb{E}[\nabla f(\omega_{t}^{k}, \xi_{t}^{k})] = \nabla f(\omega_{t}^{k}).

Assumption 3:

\forall k \in \mathcal{K}, the expected squared norm of the SG is uniformly bounded: \mathbb{E} ||\nabla f(\boldsymbol{\omega}_{t}^{k})||^{2} \leq G^{2}.

Theorem 1:

Let Assumptions 1–3 hold, and let L, G, T, E, m, \epsilon be as defined above. Let f^{*} denote the optimal loss value. Then, training with AMA satisfies
\begin{align*}
&\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \left\| \nabla f\left(\omega_{t}\right) \right\|^{2} \right] \leq \frac{2\left[ f\left(\omega_{0}\right) - f^{*} \right]}{E \beta \epsilon T} \\
&+ \underbrace{\frac{\epsilon L E m \beta + m \epsilon^{2} L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \Bigg|\frac{d_{k}}{D}\Bigg|^{2} G^{2}}_{A_{1}}. \tag{12}
\end{align*}

Complexity Analysis: As indicated in Theorem 1, the AMA scheme converges to a stationary point of f(\omega) at a rate of \mathcal{O}\left(\frac{1}{\beta \epsilon T}\right), which is scaled by the AMA factor \beta compared with the naive FedAvg rate \mathcal{O}\left(\frac{1}{\epsilon T}\right) [35]. Term A_{1} demonstrates the impact of the non-I.I.D. data distribution. It achieves its minimum when each selected device has the same amount of data samples, i.e., \left|\frac{d_{k}}{D}\right| = \frac{1}{m}, \forall k \in S_{t+1}. In that case, the second term in the upper bound reduces to
\begin{equation*} \frac{\epsilon L E \beta + \epsilon^{2} L^{2} (E-1)^{2}}{mT} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} G^{2}. \tag{13}\end{equation*}
It indicates that the convergence speed can be accelerated by increasing m, i.e., encouraging more devices to participate in each training round, which in turn reduces the upper bound in equation (13). Details of Theorem 1 can be found in the Appendix.
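For reference, the reduction of A_{1} to equation (13) can be written out in one step. The sketch below assumes that every selected device holds an equal share of the data, i.e., |d_{k}|/|D| = 1/m for all k \in S_{t+1}, and simply substitutes this into the definition of A_{1} in equation (12).

```latex
\begin{align*}
A_{1} &= \frac{\epsilon L E m \beta + m \epsilon^{2} L^{2} (E-1)^{2}}{T}
         \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \frac{1}{m^{2}} G^{2} \\
      &= \frac{\epsilon L E \beta + \epsilon^{2} L^{2} (E-1)^{2}}{m T}
         \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} G^{2}.
\end{align*}
```

The common factor m in the numerator cancels against one factor of m^{2} from the squared data share, which leaves the 1/(mT) prefactor of equation (13).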

Theorem 2:

Let Assumptions 1–3 hold, and let L, G, T, E, m, C be as defined above. Let \overline{\gamma} = \left( 1 - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \right). Let f^{*} denote the optimal loss value. Then, training with asynchronous AMA satisfies
\begin{align*}
&\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \left\| \nabla f\left(\omega_{t}\right) \right\|^{2} \right] \leq \frac{2\left[ f\left(\omega_{0}\right) - f^{*} \right]}{T E \overline{\gamma} \beta \epsilon} \\
&+ \frac{L m \beta \epsilon E + \epsilon^{2} m L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \Bigg|\frac{d_{k}}{D}\Bigg|^{2} G^{2} \\
&+ \underbrace{\frac{C L \epsilon E}{\overline{\gamma} \beta T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2}\left(t-t_{k}\right)^{2} + 1 \right] \gamma_{k}^{2} G^{2}}_{A_{2}}. \tag{14}
\end{align*}

Complexity Analysis: The asynchronous AMA scheme converges to a stationary point of f(\omega) at a rate of \mathcal{O}\left(\frac{1}{\overline{\gamma} \beta \epsilon T}\right), which is further scaled by the asynchronous weighting factor \overline{\gamma} compared with Theorem 1. Additionally, the upper bound in Theorem 2 contains an extra term A_{2}. A_{2} indicates that the convergence bound grows with the number of asynchronous model transmissions received, C, and with the model staleness (t-t_{k}), which matches our expectations. If C=0, meaning that no asynchronous model transmission is received and hence \overline{\gamma} = 1, then term A_{2} vanishes and Theorem 2 reduces to Theorem 1. Details of Theorem 2 can be found in the Appendix.

SECTION VI.

Simulation Investigations

In this section, we perform various experiments to examine the performance of the proposed AMA-FES FL framework featuring the asynchronous aggregation scheme. The focus of the experiments is to explore the robustness and efficiency of the proposed schemes against different types of FL heterogeneity. Specifically, we consider a practical training scenario, where mobile UAVs serve as the FL training clients. The mobile UAVs collect image samples and train the classification models collaboratively with the FL server, supporting AR applications that contribute to Metaverse environment development. As indicated in Fig. 3, each UAV operates in a designated zone, collecting training data as it randomly moves within the area.

FIGURE 3. - Image classifications with UAVs as FL training clients.

Notably, in our training scenario, we consider three types of heterogeneity. First, the computing capability of the mobile UAVs varies greatly and some UAVs may experience computing insufficiency. In the following experiments, we define such computing-limited UAVs, which cannot finish the local training promptly and thus fail to send timely updates back to the FL server, as the stragglers. Secondly, given that the UAVs are geographically distributed, they possess data samples from different distributions. Finally, each UAV experiences time-varying wireless transmission conditions. Transmission delay occurs when the transmission rate of the mobile UAV fails to meet the transmission requirement, resulting in model updates being received by the FL server in subsequent training rounds. Detailed simulation settings can be found in the following sections.

We consider three datasets, MNIST, FMNIST and CIFAR-10, to simulate images captured in the real world. As we consider training with UAV clients that have limited computing capabilities, we employ compact CNN models for the image classification tasks. Specifically, we consider a small CNN comprising two convolutional layers and three fully-connected layers. For MNIST and FMNIST, the CNN is configured with smaller feature maps, whereas for CIFAR-10, we employ larger feature maps. We consider training with K=50 UAVs, and in each training round, a subset of m=10 UAVs is randomly selected for updating the global model. For all experiments, we adopt a learning rate of \epsilon = 0.01.

To better observe the performance of the proposed learning schemes, we alter the heterogeneity setting in each experiment and specifically focus on the following algorithms, of which the details are presented below:

  • FES, where we consider that some UAVs are equipped with computing-limited devices and for such UAVs, we fix the parameters of the first two convolutional layers, i.e., the feature extractor, and only update the parameters of the classifier, as indicated in equations (6) and (7);

  • AMA-FES, where we consider not only the computing-limited UAVs but also the non-I.I.D. data samples. During local training, for computing-limited UAVs, we apply the FES scheme and update the local model as in equation (6) and (7). After receiving the local model updates, the FL server updates the global model as in equation (8) to mitigate the impact brought by non-I.I.D. data;

  • Asynchronous Aggregation, where we consider that some UAVs experience model transmission delay. When receiving the delayed model updates, the FL server first assigns a staleness-based weight for the asynchronous model updates as indicated in equation (11) and then incorporates such weighted model updates into the global model as indicated in equation (10);

  • Asynchronous Aggregation with AMA, where we consider not only the asynchronous model updates but also the non-I.I.D. data samples. At the end of a training round session, the FL server first updates the global model with the AMA scheme as indicated in equation (8) to alleviate the impact of the non-I.I.D. data. Then, the FL server updates the global model with the weighted delayed models, as indicated in equation (10) and (11).

Algorithm 1 and Fig. 2 further illustrate the details of the considered algorithms. To investigate the performance of the above schemes, we conduct the following three experiments:

  1. In Section VI-A, we experiment with the FES scheme on I.I.D. CIFAR-10 dataset and investigate its performance in alleviating the impact of computing-limited UAVs under varying levels of computing heterogeneity;

  2. In Section VI-B, to examine the performance of the proposed AMA-FES scheme when both the computation and statistical heterogeneity exist, we apply the AMA-FES FL framework on non-I.I.D. MNIST & FMNIST with varying levels of computation heterogeneity;

  3. In Section VI-C, we simulate a wireless environment with Rician fading and additional path loss, where the mobile UAVs constantly experience transmission turbulence, leading to asynchronous model updates. Then, we apply and evaluate the effectiveness of the proposed asynchronous aggregation scheme with AMA in managing these asynchronous model updates under the I.I.D. and non-I.I.D data settings.

Simulation settings of each experiment, including the considered algorithms and relative training parameters, are summarised in Table 5. Further details of each experiment can be found in the following sections. Through these experiments, our objective is to validate the effectiveness of the proposed AMA-FES framework featuring asynchronous aggregation in aiding UAV clients to restore training accuracy while achieving training efficiency amidst the heterogeneous learning environment.

TABLE 5 Summary of Experiment Settings: Considered Algorithms & Training Parameters

Baseline scheme: Throughout the experiments, we employ the naive FL as the baseline scheme, where no additional aggregation rule applies and the global model is updated by FedAvg as stated in equation (3).

Bench-marking scheme: We use FedProx as a bench-marking scheme for further performance comparison. Details of FedProx implementation can be found in the following sections.

A. Addressing Computation Heterogeneity With FES

In this experiment, we focus on the computation heterogeneity and investigate the performance of the proposed FES scheme in mitigating its impact. Specifically, we consider that a portion of the UAVs are equipped with computing-limited units and refer to such computing-limited UAVs as the stragglers in the following text. We vary the ratio of computing-limited UAVs among all the participating UAVs, p = 0.25, 0.50 and 0.75, to simulate different levels of computation heterogeneity. Additionally, we assume that for all the computing-limited UAVs, with the FES scheme, only the final three fully connected layers of the CNN are updated, while the parameters of the first two convolutional layers, which account for most of the computation, are fixed. UAVs with adequate training resources are responsible for updating the entire model.

We first simulate a naive FL, where the delayed model updates from the stragglers are dropped from aggregation, as the baseline scheme and denote it as NaiveFL(DropStraggler). We also simulate a naive FL where no straggler exists during the training and denote it as NaiveFL(NoStraggler). Through the comparison with these naive cases, we are able to observe the performance improvement offered by the FES scheme. We employ the partial work scheme from FedProx, where each computing-limited UAV randomly performs one or two rounds of local model updates, as the bench-marking scheme. We test on the I.I.D. CIFAR-10 images, where each UAV has access to 1000 image samples evenly distributed across the 10 classes.

Fig. 4 compares the performance of the FES scheme on mitigating the impact of computing-limited devices with varying degrees of computation heterogeneity. Compared with the naive FL case (indicated in the blue line) that drops the delayed model updates from training, the proposed FES scheme (indicated in the orange line) effectively improves the training performance in all three cases, as smoother convergences and lower loss values are achieved. On the other hand, we can observe that when the ratio of computing-limited UAVs is low, i.e., p = 0.25, the FES scheme manages to restore the training performance, as the FES scheme overlaps the naive FL with no straggler (indicated in green line). When the level of computing heterogeneity increases to p = 0.50, we can observe that a slight discrepancy emerges between the FES scheme and the naive FL with no straggler at convergence. This disparity grows further when p increases to 0.75, which can be attributed to the reduced efficiency of the remaining UAVs in providing an effective feature extractor due to the reduced number of samples available for learning.

FIGURE 4. - Convergence performance of the FES scheme on mitigating the impact of computation-limited UAVs on I.I.D. CIFAR-10.

Fig. 5 compares the performance of the FES scheme with the partial work scheme, using the naive FL with no straggler as a reference. It is evident that in all cases, both the FES scheme and the partial work scheme eventually reach a comparable level of converged accuracy. Notably, when p = 0.25, both schemes manage to restore the training accuracy, while as the ratio of computing-limited UAVs increases, there is a gradual decline in accuracy, yet it stays within a manageable 2.5% margin. However, it is worth noticing that the FES scheme consistently achieves faster convergence than the partial work scheme across all three scenarios. This is attributed to the fact that a certain number of UAVs in the partial work scheme undergo fewer rounds of local model updates, leading to a slower progression towards the global optimum. Table 6 summarises the number of training rounds required for each approach to achieve a reference accuracy of 80%. It shows that the FES scheme provides a convergence acceleration of up to 87.29% when p = 0.5 while offering a significant early convergence improvement of nearly 40% in the other scenarios in comparison to the partial work scheme.
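The rounds-to-accuracy metric reported in Table 6 can be computed from a per-round accuracy trace as sketched below. The helper names are hypothetical, and the definition of the relative acceleration (fraction of rounds saved with respect to a reference scheme) is an assumption, since the text does not state the exact formula behind the reported percentages.

```python
from typing import Optional, Sequence


def rounds_to_accuracy(acc_per_round: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based round at which accuracy first reaches `target`, or None."""
    for round_index, acc in enumerate(acc_per_round, start=1):
        if acc >= target:
            return round_index
    return None


def relative_acceleration(rounds_reference: int, rounds_method: int) -> float:
    """Assumed definition: fraction of rounds saved relative to the reference scheme."""
    return (rounds_reference - rounds_method) / rounds_reference


if __name__ == "__main__":
    # Made-up accuracy trace, for illustration only.
    trace = [0.50, 0.70, 0.81, 0.79, 0.90]
    print(rounds_to_accuracy(trace, target=0.80))
    print(relative_acceleration(rounds_reference=100, rounds_method=60))
```

Using the first crossing of the target means that later dips below the reference accuracy do not change the reported round count in this sketch.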

TABLE 6 Number of Training Rounds Required to Achieve 80% Accuracy on I.I.D. CIFAR-10 Datasets

FIGURE 5. - Comparison of the FES scheme with the partial work scheme on I.I.D. CIFAR-10.

In conclusion, the conducted experiments demonstrate two key findings:

  1. The FES scheme can effectively reinstate the training performance under a relatively low level of computation heterogeneity, which is p = 0.25 in our experiment settings;

  2. Although the FES and partial work schemes ultimately achieve similar levels of accuracy, the FES scheme exhibits significantly faster convergence, which serves as a proxy for lower energy consumption in real-life applications, making FES a more energy-efficient option.

B. Addressing the Non-I.I.D. Data & Computation Heterogeneity With AMA-FES FL

In this experiment, we explore a scenario where a portion of the UAVs are computing-limited and the data distribution of each UAV is non-I.I.D. We apply the AMA-FES framework for training the model to classify MNIST & FMNIST images. We employ the same settings for computing-limited UAVs as described in Section VI-A, where the stragglers update only the last three fully connected layers of the CNN while fixing the parameters of the convolutional layers. On the other hand, to impose statistical heterogeneity, we follow the non-I.I.D. data settings in [2], where each UAV has access to 300 training samples from only two classes. Regarding the AMA scheme indicated in equations (8) and (9), we adopt the following parameters: \alpha_{0} = 0.1, \eta = 2.5 \times 10^{-3}. We employ the naive FL, which drops the delayed model updates from computing-limited UAVs from aggregation and is denoted as NaiveFL(DropStraggler) in Fig. 6 & 7, as the performance baseline. Similarly, we vary the ratio of computing-limited UAVs to p = 0.25, 0.50, 0.75. We also employ FedProx, including both the partial work scheme and the weight penalisation indicated in equation (5), as the benchmarking scheme. Specifically, we set the regularisation strength \rho = 0.01. We conduct the training for 200 rounds on MNIST and 300 rounds on FMNIST. For both datasets, the local epoch E is set to 10.
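One simple way to build a label-skewed split of this kind is sketched below. It is an illustrative assumption rather than the exact procedure of [2]: each client draws two classes at random and takes an equal number of samples from each, and samples may be reused across clients. The function name and arguments are hypothetical.

```python
import random
from typing import Dict, List


def two_class_partition(
    labels: List[int],
    num_clients: int,
    samples_per_client: int,
    classes_per_client: int = 2,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Assign each client sample indices drawn from only `classes_per_client` classes."""
    rng = random.Random(seed)
    # Group sample indices by their class label.
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    all_classes = sorted(by_class)
    per_class = samples_per_client // classes_per_client
    partition: Dict[int, List[int]] = {}
    for client in range(num_clients):
        chosen = rng.sample(all_classes, classes_per_client)
        indices: List[int] = []
        for cls in chosen:
            # Sampling without replacement within a client; overlap across
            # clients is allowed in this simplified sketch.
            indices.extend(rng.sample(by_class[cls], per_class))
        partition[client] = indices
    return partition


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(5000)]  # synthetic labels, 500 per class
    parts = two_class_partition(toy_labels, num_clients=50, samples_per_client=300)
    print(sorted({toy_labels[i] for i in parts[0]}))
```

With 50 clients and 300 samples each, every client in this sketch sees exactly two labels, which reproduces the kind of class imbalance the experiment is designed to stress.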

FIGURE 6. - Convergence performance of the AMA-FES FL compared with the naive FL on non-I.I.D. MNIST & FMNIST.

FIGURE 7. - Testing accuracy and training stability: for p=0.50 and 0.75, the variance of the naive FL on MNIST is 86.96 and 173.16 respectively. For ease of visualisation, the scale of the y-axis is cut to [0, 30].

Fig. 6 demonstrates the convergence performance of the naive FL, which drops the delayed updates from the stragglers, and the proposed AMA-FES FL on MNIST & FMNIST images under varying degrees of computation heterogeneity. We can observe that for both datasets, the naive FL fluctuates more as p increases due to the information loss. Compared with Fig. 4, we can observe that while training on I.I.D. data has some tolerance for losing model updates, such loss becomes far more damaging in the non-I.I.D. scenario. On the other hand, the AMA-FES FL, indicated by the orange line, achieves a smoother convergence and converges to a lower loss for both datasets, restoring the learning performance.

Fig. 7 compares the test accuracy and the training stability under varying levels of computation heterogeneity. Note that we calculate the variance of the converged test accuracy over the last 50 rounds as a proxy for training stability; a higher variance therefore means lower training stability. AMA-FES FL outperforms the naive FL and the FedProx scheme in most settings. Specifically, compared with the naive FL, the AMA-FES FL framework improves the accuracy by up to 19.77% on MNIST. Compared with FedProx, the accuracy is increased by 2.38% on FMNIST. An exception occurs at p=0.75 for FMNIST, where FedProx achieves a higher accuracy: as the ratio of stragglers increases, the remaining UAVs lack sufficiently diverse samples, which compromises the feature extractor and diminishes the accuracy. On the other hand, AMA-FES FL exhibits much higher training stability than the naive FL and FedProx. In particular, AMA-FES FL demonstrates an increase in training stability of up to 93.10% on FMNIST over FedProx.
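The stability proxy described above (variance of the test accuracy over the last 50 rounds) can be sketched as follows. Whether the population or the sample variance is used is not stated in the text, so the population variance here is an assumption; likewise, reading a stability increase as a relative reduction in variance is an assumed interpretation, and both function names are hypothetical.

```python
from statistics import pvariance
from typing import Sequence


def training_stability(test_acc: Sequence[float], window: int = 50) -> float:
    """Variance of the test accuracy over the last `window` rounds (lower is more stable)."""
    tail = list(test_acc)[-window:]
    return pvariance(tail)


def variance_reduction(var_reference: float, var_method: float) -> float:
    """Assumed reading of a stability increase: relative reduction in variance."""
    return (var_reference - var_method) / var_reference


if __name__ == "__main__":
    # Made-up accuracy traces (in percent), for illustration only.
    steady = [90.0] * 100
    oscillating = [80.0, 90.0] * 50
    print(training_stability(steady), training_stability(oscillating))
```

A perfectly flat trace yields a variance of zero in this sketch, while a trace alternating between two accuracy levels yields the squared half-gap between them.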

In conclusion, the experiment in Section VI-B yields two findings:

  1. The proposed AMA-FES scheme is effective in mitigating the impact brought by the computation and statistical heterogeneity by achieving smoother convergence and lower loss value;

  2. The AMA-FES scheme better preserves the training stability than FedProx.

C. Addressing the Asynchronous Model Updates With Asynchronous Aggregation

In this section, we explore a scenario, where the mobile UAVs occasionally encounter model transmission delay during flight and investigate how the proposed asynchronous aggregation scheme restores the training performance. The UAVs train the CNN to classify CIFAR-10 images under both I.I.D. and non-I.I.D. conditions. Specifically, we adopt the same non-I.I.D. sampling method for CIFAR-10 images as in experiment Section VI-B, where each UAV has 1000 training samples evenly collected from two classes. For the I.I.D. setting, we train the model for 300 training rounds, while for the non-I.I.D. setting, we train the model for 700 training rounds to allow a full convergence. For both cases, the local epoch E=5 .

On the other hand, to simulate the dynamic wireless condition, we consider the wireless transmission model for UAVs described in [36], which combines Rician fading with path loss. Specifically, the path loss integrates the expectation of the Line-of-Sight (LoS) and non-LoS (NLoS) components [37]. For device k in training round t+1, the timely transmission rate r_{t+1}^{k} is given by
\begin{equation*} r_{t+1}^{k} = B_{up} \log_{2}\left( 1 + \frac{|h_{t+1}^{k}|^{2}~l_{t+1}^{k} P_{uav}}{\sigma^{2}} \right), \tag{15}\end{equation*}
where h_{t+1}^{k} indicates the Rician fading channel gain, given by
\begin{equation*} h_{t+1}^{k} = \left( \sqrt{\frac{v_{t+1}^{k}}{v_{t+1}^{k} + 1}}~\overline{h}_{t+1}^{k} + \sqrt{\frac{1}{2\left(v_{t+1}^{k} + 1\right)}}~\hat{h}_{t+1}^{k} \right), \tag{16}\end{equation*}
where v_{t+1}^{k} indicates the time-varying Rician fading factor, \overline{h}_{t+1}^{k} indicates the LoS component with |\overline{h}_{t+1}^{k}| = 1, \hat{h}_{t+1}^{k} \sim \mathcal{CN}(0, 1) indicates the complex random NLoS component, and l_{t+1}^{k} denotes the path loss. Specifically, we consider a random LoS link by integrating the probability of LoS, P_{LoS,t+1}^{k}, which solely depends on the elevation angle of the UAV with respect to the server [36], [37],
\begin{equation*} P_{LoS,t+1}^{k} = \frac{1}{1 + c_{1} \exp\left(-c_{2}\left[\theta_{t+1}^{k} - c_{1}\right]\right)}, \tag{17}\end{equation*}
where c_{1}, c_{2} are environment factors indicating the type of environment, e.g., urban or rural, and \theta_{t+1}^{k} indicates the elevation angle of the k-th UAV in training round t+1.
Then, the path loss component is given by
\begin{equation*} l_{t+1}^{k} = \left( P_{LoS,t+1}^{k}~\eta_{1} + P_{NLoS,t+1}^{k}~\eta_{2} \right) \left[ \frac{4 \pi f d_{t+1}^{k}}{c} \right]^{\alpha}, \tag{18}\end{equation*}
where the NLoS probability is P_{NLoS,t+1}^{k} = 1 - P_{LoS,t+1}^{k}, \eta_{1} and \eta_{2} are the additional path loss coefficients, f is the carrier frequency, c denotes the speed of light, d_{t+1}^{k} denotes the distance between the UAV and the server, and \alpha denotes the path loss exponent. The simulated parameter values are referenced from [38], [39]. Table 7 summarises the details of the simulation parameters of the described wireless transmission model.

TABLE 7 Wireless Transmission Parameters

During the training process, each UAV randomly moves within its target area, as shown in Fig. 3. Before uploading the local model updates, each UAV calculates its timely transmission rate r_{t+1}^{k} according to equation (15) and compares it with the transmission threshold r_{\tau } . The model transmission is conducted only if r_{t+1}^{k} > r_{\tau } , otherwise it is held until the desired transmission condition is satisfied. Note that the time index is also sent to the server alongside the model updates. When the FL server receives the model transmission, it will check the time index and calculate the staleness-based weight for such asynchronous model updates according to equation (11). Then, the FL server conducts the asynchronous aggregation according to equation (10). We employ the naive FL as the baseline, which excludes the delayed model updates caused by the dynamic transmission conditions from aggregation and is denoted as NaiveFL(DropDelayedUpdates). We further simulate naive FL with no delayed transmission encountered for further comparison and it is denoted as NaiveFL(NoDelayedUpdates).
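The transmission rule described above can be sketched as follows, covering equations (15) to (17) and the threshold test. All names are hypothetical; the path loss term l_{t+1}^{k} of equation (18) is taken as an input rather than recomputed; the random LoS phase is an assumption, since only |\overline{h}_{t+1}^{k}| = 1 is specified; and the numerical values are placeholders, not the Table 7 parameters.

```python
import cmath
import math
import random


def los_probability(theta: float, c1: float, c2: float) -> float:
    """Equation (17): LoS probability as a function of the elevation angle."""
    return 1.0 / (1.0 + c1 * math.exp(-c2 * (theta - c1)))


def rician_gain(v: float, rng: random.Random) -> complex:
    """Equation (16): unit-modulus LoS part plus a CN(0, 1) scattered part."""
    # Assumed: the LoS component has a uniformly random phase.
    h_los = cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
    # CN(0, 1): real and imaginary parts each have variance 1/2.
    h_nlos = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
    return math.sqrt(v / (v + 1.0)) * h_los + math.sqrt(1.0 / (2.0 * (v + 1.0))) * h_nlos


def uplink_rate(h: complex, path_loss: float, bandwidth: float,
                tx_power: float, noise_power: float) -> float:
    """Equation (15): rate = B_up * log2(1 + |h|^2 * l * P_uav / sigma^2)."""
    snr = (abs(h) ** 2) * path_loss * tx_power / noise_power
    return bandwidth * math.log2(1.0 + snr)


def should_transmit(rate: float, threshold: float) -> bool:
    """Upload only if the rate strictly exceeds the threshold; otherwise hold the update."""
    return rate > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    h = rician_gain(v=5.0, rng=rng)  # placeholder Rician factor
    rate = uplink_rate(h, path_loss=1e-3, bandwidth=1e6, tx_power=0.1, noise_power=1e-9)
    print(rate, should_transmit(rate, threshold=5e6))
```

When the check fails, the update would be held and later sent with its original round index, so that the server can apply the staleness-based weight of equation (11).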

As indicated in Fig. 8, in the I.I.D. scenario, the naive FL has some tolerance for dropping the delayed model updates, as it only results in a small accuracy drop, which can be observed from the gap between the green line and the blue line in the first figure. However, the proposed asynchronous aggregation scheme, indicated by the orange line, still improves the accuracy and reduces the oscillations compared with the blue line, where the delayed model updates are dropped. In the non-I.I.D. scenario, it is noticeable that the test statistics exhibit significant fluctuations, primarily because losing model updates has a more pronounced effect in the non-I.I.D. setting. We can observe that the proposed asynchronous aggregation scheme reduces the oscillations relative to the other two settings. Also, compared with the naive FL where the delayed model updates are dropped, applying the asynchronous aggregation scheme leads to a modest accuracy increase of around 3%.

FIGURE 8. - Convergence performance of the asynchronous aggregation scheme on I.I.D. CIFAR-10 & non-I.I.D. CIFAR-10 in comparison with naive FL.

The above experiments demonstrate two important observations:

  1. The I.I.D. data distribution shows a greater resilience to the exclusion of asynchronous model updates compared to non-I.I.D. data, since the balanced data samples can provide sufficient knowledge for effective learning;

  2. Under the non-I.I.D. setting, although the training statistics still exhibit oscillations, the proposed asynchronous aggregation scheme manages to increase the converged accuracy and reduce fluctuations.

SECTION VII.

Concluding Remarks

In this work, we studied the practical issues posed by the heterogeneous environment in wireless FL networks, i.e., the computing-limited devices, the non-I.I.D. data and the asynchronous model updates caused by dynamic wireless conditions. Targeting these practical challenges, we proposed the AMA-FES framework with the asynchronous aggregation scheme. We implemented the proposed scheme in a practical scenario, where mobile UAVs serve as FL training agents and conduct image classification to support edge applications. Simulation results showed that the proposed framework effectively mitigates the impact of the learning heterogeneity without introducing extra communication or computation expenditures, thereby achieving energy efficiency. The results also demonstrated that learning with I.I.D. data offers resilience to losing some delayed model updates, whether caused by the computing inability or the dynamic transmission condition, as balanced datasets provide sufficient knowledge about the parameter space for the model to learn from even when some samples are missing. In contrast, learning with non-I.I.D. data magnifies the impact of dropping delayed model updates, leading to training oscillations and reduced accuracy. This is because the non-I.I.D. data cannot fully represent the parameter space, and losing certain local model updates in such situations further prevents convergence to the global optimum.

Appendix

SECTION A.

Proof of Theorem 1

For adaptive-mixing aggregation, we have,
\begin{equation*}
\omega_{t+1} \overset{\Delta}{=} \alpha \omega_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k}, \quad \beta = 1 - \alpha. \tag{19}
\end{equation*}
Then it follows that,
\begin{align*}
\omega_{t+1} - \omega_{t} =& \left[ \alpha \omega_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k} \right] - \omega_{t} \tag{20}\\
=& \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k} - \left( 1-\alpha \right) \omega_{t} \tag{21}\\
=& \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \left( \omega_{t+1}^{k} - \omega_{t} \right) \tag{22}\\
=& -\beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right). \tag{23}
\end{align*}
Since the loss function f is L-smooth, it follows that,
\begin{align*}
& \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] - f\left( \omega_{t} \right) \\
\leq& \frac{L}{2} \mathbb{E}\left[ \left\| \omega_{t+1} - \omega_{t} \right\|^{2} \right] + \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \omega_{t+1} - \omega_{t} \right\rangle \tag{24}\\
=& \underbrace{\frac{L \beta^{2} \epsilon^{2}}{2} \mathbb{E}\left[ \left\| \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right]}_{a_{1}} \\
& \underbrace{-\beta \epsilon \, \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\rangle}_{a_{2}}. \tag{25}
\end{align*}
Since |S_{t+1}| = m, for a_{1}, with Jensen’s inequality, it holds that,
\begin{align*}
& \mathbb{E}\left[ \left\| \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
\leq& m \sum_{k \in S_{t+1}} \mathbb{E}\left[ \left\| \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \tag{26}\\
=& m \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \mathbb{E}\left[ \left\| \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \tag{27}\\
\leq& mE \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right]. \tag{28}
\end{align*}

Thus,
\begin{align*}
a_{1} =& \frac{L \beta^{2} \epsilon^{2}}{2} \mathbb{E}\left[ \left\| \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \tag{29}\\
\leq& \frac{L \epsilon^{2} \beta^{2} mE}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right]. \tag{30}
\end{align*}
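The step from (26) to (28) relies on the bound \|\sum_{k=1}^{m} x_{k}\|^{2} \leq m \sum_{k=1}^{m} \|x_{k}\|^{2}, applied first over the m devices and then over the E local steps. The following is a minimal numerical sanity check of that inequality, not part of the proof; the random vectors stand in for hypothetical per-device gradient sums, and the function name `jensen_bound` is an assumption introduced here for illustration.

```python
import numpy as np


def jensen_bound(vectors):
    """Return (lhs, rhs) of ||sum_k x_k||^2 <= m * sum_k ||x_k||^2.

    `vectors` has shape (m, dim): one row per summand x_k.
    """
    vectors = np.asarray(vectors, dtype=float)
    m = vectors.shape[0]
    lhs = float(np.sum(vectors.sum(axis=0) ** 2))  # squared norm of the sum
    rhs = float(m * np.sum(vectors ** 2))          # m times the sum of squared norms
    return lhs, rhs


# Hypothetical example: m = 5 devices, 8-dimensional gradient sums.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
lhs, rhs = jensen_bound(x)

# The bound is tight when all summands are identical.
tight_lhs, tight_rhs = jensen_bound(np.ones((3, 2)))
```

The tight case shows why the factor m (and later E) cannot be dropped in general: identical gradients on every device attain equality.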

For a_{2}, for each local step e, it holds that,
\begin{align*}
& -\beta \epsilon \, \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \nabla f_{k}\left( \omega_{t,e} \right) \right\rangle \\
=& -\beta \epsilon \, \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \nabla f_{k}\left( \omega_{t,e} \right) - \nabla f\left( \omega_{t} \right) + \nabla f\left( \omega_{t} \right) \right\rangle \tag{31}\\
=& -\beta \epsilon \, \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \nabla f_{k}\left( \omega_{t,e} \right) - \nabla f\left( \omega_{t} \right) \right\rangle \\
& -\beta \epsilon \, \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{32}\\
\leq& \frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) - \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{33}\\
=& \frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \left[ \nabla f\left( \omega_{t} \right) - \nabla f_{k}\left( \omega_{t,e} \right) \right] \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{34}\\
\leq& \frac{\beta \epsilon m}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) - \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{35}\\
\leq& \frac{\beta \epsilon m L^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \mathbb{E}\left[ \left\| \omega_{t} - \omega_{t,e}^{k} \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{36}\\
=& \frac{\beta \epsilon m L^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \mathbb{E}\left[ \left\| \sum_{j=0}^{e-1} \epsilon \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{37}\\
\leq& \frac{\beta \epsilon^{3} m L^{2} (e-1)}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{j=0}^{e-1} \mathbb{E}\left[ \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2} \right] \\
& -\frac{\beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right]. \tag{38}
\end{align*}
Thus,
\begin{align*}
a_{2} =& -\beta \epsilon \sum_{e=0}^{E-1} \mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \nabla f_{k}\left( \omega_{t,e} \right) \right\rangle \tag{39}\\
\leq& -\frac{\beta \epsilon E}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] + \frac{\beta \epsilon^{3} m L^{2} (E-1)}{2} \\
& \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{e=0}^{E-1} \sum_{j=0}^{e-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2}. \tag{40}
\end{align*}
Combining equations (25), (30) and (40), we have,
\begin{align*}
& \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] - \mathbb{E}\left[ f\left( \omega_{t} \right) \right] \\
\leq& \frac{L \beta^{2} \epsilon^{2}}{2} mE \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
& +\frac{\beta \epsilon^{3} m L^{2} (E-1)}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{e=0}^{E-1} \sum_{j=0}^{e-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2} \\
& -\frac{E \beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right]. \tag{41}
\end{align*}
Rearranging equation (41),
\begin{align*}
& \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \leq \frac{2}{E \beta \epsilon} \left( \mathbb{E}\left[ f\left( \omega_{t} \right) \right] - \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] \right) \\
& + L \beta \epsilon m \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
& + \frac{m \epsilon^{2} L^{2} (E-1)}{E} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{e=0}^{E-1} \sum_{j=0}^{e-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2}. \tag{42}
\end{align*}
Averaging equation (42) over t = 0, \ldots, T-1, it follows that,
\begin{align*}
& \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \leq \frac{2}{E \beta \epsilon T} \left( \mathbb{E}\left[ f\left( \omega_{0} \right) \right] - \mathbb{E}\left[ f\left( \omega_{T} \right) \right] \right) \\
& + \frac{L \beta \epsilon m}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
& + \frac{m \epsilon^{2} L^{2} (E-1)}{TE} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{e=0}^{E-1} \sum_{j=0}^{e-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2}. \tag{43}
\end{align*}
With Assumption 3, let f^{*} denote the optimal loss. For equation (43), it holds that,
\begin{align*}
& \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \leq \frac{2 \left[ f\left( \omega_{0} \right) - f\left( \omega_{T} \right) \right]}{E \beta \epsilon T} \\
& + \frac{\epsilon L E m \beta + m \epsilon^{2} L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \tag{44}\\
\leq& \frac{2 \left[ f\left( \omega_{0} \right) - f^{*} \right]}{E \beta \epsilon T} + \\
& \frac{\epsilon L E m \beta + m \epsilon^{2} L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2}. \tag{45}
\end{align*}
This concludes the proof for Theorem 1.
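As an illustration of the update rule (19) that this proof starts from, the following is a minimal NumPy sketch of one adaptive-mixing aggregation step. It treats \alpha as a given input, since how \alpha is adapted is not reproduced here, and it assumes that |D| is the total data size of the participating set S_{t+1}, so that the weights |d_{k}|/|D| sum to one, which is what the step from (21) to (22) requires. The function name `ama_aggregate` and the toy values are hypothetical.

```python
import numpy as np


def ama_aggregate(global_model, local_models, data_sizes, alpha):
    """One step of Eq. (19): alpha * w_t + (1 - alpha) * sum_k (|d_k| / |D|) * w_k.

    Assumption: |D| is the total data size of the participating devices,
    so the aggregation weights sum to one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    sizes = np.asarray(data_sizes, dtype=float)
    weights = sizes / sizes.sum()
    local = np.asarray(local_models, dtype=float)  # shape (m, dim)
    weighted_average = weights @ local             # data-size-weighted average of local models
    return alpha * np.asarray(global_model, dtype=float) + (1.0 - alpha) * weighted_average


# Toy round with two devices and a 3-parameter model (hypothetical values).
w_t = np.zeros(3)
local_models = np.array([[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]])
data_sizes = [1, 3]
alpha = 0.5
w_next = ama_aggregate(w_t, local_models, data_sizes, alpha)

# Identity (22): w_next - w_t = beta * sum_k (|d_k| / |D|) * (w_k - w_t).
p = np.asarray(data_sizes, dtype=float) / np.sum(data_sizes)
drift = (1.0 - alpha) * (p @ (local_models - w_t))
```

With \alpha = 1 the global model does not move, and with \alpha = 0 the step reduces to a plain weighted average of the local models; intermediate values scale the drift in (22) by \beta.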

SECTION B.

Proof of Theorem 2

For adaptive-mixing aggregation with asynchronous updates, we have,
\begin{equation*}
w_{t+1} = \left( 1 - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \right) w_{t+1} + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} w_{t_{k}}^{k}, \tag{46}
\end{equation*}
where S_{t+1}^{\prime} denotes the set of indices of devices whose model updates are delayed and received in training round t+1. For any device k \in S_{t+1}^{\prime}, w_{t_{k}}^{k} denotes the delayed model update sent from training round t_{k}. Letting \overline{\gamma} = \left( 1 - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \right) and substituting \omega_{t+1} with the AMA update scheme, we have,
\begin{align*}
\omega_{t+1} \overset{\Delta}{=} \overline{\gamma} \left( \alpha \omega_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k} \right) + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} w_{t_{k}}^{k}. \tag{47}
\end{align*}
As \alpha + \beta = 1 and \overline{\gamma} + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} = 1, it follows that,
\begin{align*}
& \omega_{t+1} - \omega_{t} \\
=& \overline{\gamma} \left( \alpha \omega_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k} \right) + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} w_{t_{k}}^{k} \\
& - \overline{\gamma} \omega_{t} - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \omega_{t} \tag{48}\\
=& \overline{\gamma} \left( \alpha \omega_{t} + \beta \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \omega_{t+1}^{k} - \omega_{t} \right) + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left( w_{t_{k}}^{k} - \omega_{t} \right) \tag{49}\\
=& -\overline{\gamma} \beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left( w_{t_{k}}^{k} - \omega_{t} \right). \tag{50}
\end{align*}
For the second term, we have,
\begin{align*}
& \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left( w_{t_{k}}^{k} - \omega_{t} \right) \\
=& \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left[ w_{t_{k}}^{k} - \omega_{t} + \omega_{t_{k}-1} - \omega_{t_{k}-1} \right] \tag{51}\\
=& -\sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left[ \omega_{t} - \omega_{t_{k}-1} \right] + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left[ w_{t_{k}}^{k} - \omega_{t_{k}-1} \right] \tag{52}\\
=& -\sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left[ -\epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \right] \\
& + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \left[ -\epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \right]. \tag{53}
\end{align*}
Hence, combining equations (50) and (53), it follows that,
\begin{align*}
\omega_{t+1} - \omega_{t} =& -\overline{\gamma} \beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \\
& + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \\
& - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right). \tag{54}
\end{align*}
Since the loss function f is L-smooth, it follows that,
\begin{align*}
& \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] - f\left( \omega_{t} \right) \\
\leq& \underbrace{\frac{L}{2} \mathbb{E}\left[ \left\| \omega_{t+1} - \omega_{t} \right\|^{2} \right]}_{b_{1}} + \underbrace{\mathbb{E} \left\langle \nabla f\left( \omega_{t} \right), \omega_{t+1} - \omega_{t} \right\rangle}_{b_{2}}. \tag{55}
\end{align*}
For b_{1}, we have,
\begin{align*}
& \frac{L}{2} \mathbb{E}\left[ \left\| \omega_{t+1} - \omega_{t} \right\|^{2} \right] \\
=& \frac{L}{2} \mathbb{E} \left[ \bigg\Vert \overline{\gamma} \beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right. \\
& \qquad - \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \\
& \qquad \left. + \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \bigg\Vert^{2} \right] \tag{56}\\
=& \underbrace{\frac{L}{2} \mathbb{E} \bigg\Vert \overline{\gamma} \beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \bigg\Vert^{2}}_{b_{1,1}} \\
& + \underbrace{\frac{L}{2} \mathbb{E} \bigg\Vert \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \bigg\Vert^{2}}_{b_{1,2}} \\
& + \underbrace{\frac{L}{2} \mathbb{E} \bigg\Vert \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \bigg\Vert^{2}}_{b_{1,3}}. \tag{57}
\end{align*}
With Jensen’s inequality, for b_{1,1}, it holds that,
\begin{align*}
& \frac{L}{2} \overline{\gamma}^{2} \beta^{2} \epsilon^{2} \mathbb{E}\left[ \left\| \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right] \\
\leq& \frac{L \overline{\gamma}^{2} \beta^{2} \epsilon^{2} mE}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \left[ \sum_{e=0}^{E-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,e} \right) \right\|^{2} \right]. \tag{58}
\end{align*}
For b_{1,2}, it holds that,
\begin{align*}
& \frac{L}{2} \mathbb{E} \bigg\Vert \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \bigg\Vert^{2} \\
\leq& \frac{LC \epsilon^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \gamma_{k}^{2} \, \mathbb{E} \bigg\Vert \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \bigg\Vert^{2} \tag{59}\\
\leq& \frac{LC \epsilon^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left( t-t_{k} \right) \gamma_{k}^{2} \sum_{\tau = t_{k}-1}^{t-1} \mathbb{E} \bigg\Vert \nabla f\left( \omega_{\tau} \right) \bigg\Vert^{2} \tag{60}\\
=& \frac{LC \epsilon^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left( t-t_{k} \right) \gamma_{k}^{2} \sum_{\tau = t_{k}-1}^{t-1} \mathbb{E} \bigg\Vert \sum_{k \in S_{\tau}} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{\tau,e} \right) \bigg\Vert^{2} \tag{61}\\
\leq& \frac{LC \epsilon^{2} mE}{2} \sum_{k \in S_{t+1}^{\prime}} \left( t-t_{k} \right) \gamma_{k}^{2} \sum_{\tau = t_{k}-1}^{t-1} \sum_{k \in S_{\tau}} \sum_{e=0}^{E-1} \mathbb{E} \bigg\Vert \nabla f_{k}\left( \omega_{\tau,e} \right) \bigg\Vert^{2} \tag{62}
\end{align*}
where C denotes the maximum number of asynchronous model transmissions received in a training round, i.e., |S_{t}^{\prime}| \leq C, \forall t \leq T.

For b_{1,3}, it holds that,
\begin{align*}
& \frac{L}{2} \mathbb{E} \bigg\Vert \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \bigg\Vert^{2} \\
\leq& \frac{LC \epsilon^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \gamma_{k}^{2} \, \mathbb{E} \bigg\Vert \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \bigg\Vert^{2} \tag{63}\\
\leq& \frac{LC \epsilon^{2} E}{2} \sum_{k \in S_{t+1}^{\prime}} \gamma_{k}^{2} \sum_{e=0}^{E-1} \mathbb{E} \bigg\Vert \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \bigg\Vert^{2}. \tag{64}
\end{align*}
With Assumption 3, combining equations (57), (58), (62) and (64), for b_{1} it holds that,
\begin{align*}
& \frac{L}{2} \mathbb{E}\left[ \big\Vert \omega_{t+1} - \omega_{t} \big\Vert^{2} \right] \\
\leq& \frac{L \overline{\gamma} \beta^{2} \epsilon^{2} m E^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& + \frac{LC \epsilon^{2} m^{2} E^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left( t-t_{k} \right)^{2} \gamma_{k}^{2} G^{2} \\
& + \frac{LC \epsilon^{2} E^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \gamma_{k}^{2} G^{2} \tag{65}\\
=& \frac{L \overline{\gamma} \beta^{2} \epsilon^{2} m E^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& + \frac{LC \epsilon^{2} E^{2} G^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2} \left( t-t_{k} \right)^{2} + 1 \right] \gamma_{k}^{2}. \tag{66}
\end{align*}

For b_{2}, it holds that,
\begin{align*}
& \mathbb{E} \big\langle \nabla f\left( \omega_{t} \right), \omega_{t+1} - \omega_{t} \big\rangle \\
=& -\mathbb{E} \Big\langle \nabla f\left( \omega_{t} \right), \overline{\gamma} \beta \epsilon \sum_{k \in S_{t+1}} \frac{|d_{k}|}{|D|} \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t,e} \right) \Big\rangle \\
& + \mathbb{E} \Big\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{\tau = t_{k}-1}^{t-1} \nabla f\left( \omega_{\tau} \right) \Big\rangle \\
& - \mathbb{E} \Big\langle \nabla f\left( \omega_{t} \right), \sum_{k \in S_{t+1}^{\prime}} \gamma_{k} \epsilon \sum_{e=0}^{E-1} \nabla f_{k}\left( \omega_{t_{k}-1,e} \right) \Big\rangle. \tag{67}
\end{align*}
The second and third terms in b_{2} vanish due to the independence of the two arguments of each inner product. For the first term, following the bound on a_{2} in the proof of Theorem 1, it holds that,
\begin{align*}
& \mathbb{E} \Big\langle \nabla f\left( \omega_{t} \right), \omega_{t+1} - \omega_{t} \Big\rangle \\
\leq& \sum_{e=0}^{E-1} \frac{\overline{\gamma} \beta \epsilon^{3} m L^{2} (e-1)}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \sum_{j=0}^{e-1} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2} \\
& -\frac{E \overline{\gamma} \beta \epsilon}{2} \mathbb{E} \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \tag{68}\\
=& \frac{\overline{\gamma} \beta \epsilon^{3} m L^{2}}{2} \sum_{e=0}^{E-1} (e-1)^{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} \mathbb{E} \left\| \nabla f_{k}\left( \omega_{t,j} \right) \right\|^{2} \\
& -\frac{E \overline{\gamma} \beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \tag{69}\\
\leq& \frac{\overline{\gamma} \beta \epsilon^{3} m L^{2} E (E-1)^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& -\frac{E \overline{\gamma} \beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right]. \tag{70}
\end{align*}
Combining equations (55), (66) and (70), it holds that,
\begin{align*}
& \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] - f\left( \omega_{t} \right) \\
\leq& \frac{L \overline{\gamma} \beta^{2} \epsilon^{2} m E^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} - \frac{E \overline{\gamma} \beta \epsilon}{2} \mathbb{E} \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \\
& + \frac{LC \epsilon^{2} E^{2} G^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2} \left( t-t_{k} \right)^{2} + 1 \right] \gamma_{k}^{2} \\
& + \frac{\overline{\gamma} \beta \epsilon^{3} m L^{2} E (E-1)^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2}. \tag{71}
\end{align*}
Rearranging equation (71),
\begin{align*}
& \frac{E \overline{\gamma} \beta \epsilon}{2} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \leq f\left( \omega_{t} \right) - \mathbb{E}\left[ f\left( \omega_{t+1} \right) \right] \\
& + \frac{L \overline{\gamma} \beta^{2} \epsilon^{2} m E^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& + \frac{LC \epsilon^{2} E^{2} G^{2}}{2} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2} \left( t-t_{k} \right)^{2} + 1 \right] \gamma_{k}^{2} \\
& + \frac{\overline{\gamma} \beta \epsilon^{3} m L^{2} E (E-1)^{2}}{2} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2}. \tag{72}
\end{align*}
Let f^{*} denote the optimal loss. Averaging equation (72) over t = 0, \ldots, T-1, it follows that,
\begin{align*}
& \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \left\| \nabla f\left( \omega_{t} \right) \right\|^{2} \right] \leq \frac{2}{T \overline{\gamma} E \beta \epsilon} \left( f\left( \omega_{0} \right) - \mathbb{E}\left[ f\left( \omega_{T} \right) \right] \right) \\
& + \frac{L m \beta \epsilon E}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& + \frac{C L \epsilon E}{\overline{\gamma} \beta T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2} \left( t-t_{k} \right)^{2} + 1 \right] \gamma_{k}^{2} G^{2} \\
& + \frac{\epsilon^{2} m L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \tag{73}\\
\leq& \frac{2 \left[ f\left( \omega_{0} \right) - f^{*} \right]}{T E \overline{\gamma} \beta \epsilon} + \frac{L m \beta \epsilon E}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2} \\
& + \frac{C L \epsilon E}{\overline{\gamma} \beta T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}^{\prime}} \left[ m^{2} \left( t-t_{k} \right)^{2} + 1 \right] \gamma_{k}^{2} G^{2} \\
& + \frac{\epsilon^{2} m L^{2} (E-1)^{2}}{T} \sum_{t=0}^{T-1} \sum_{k \in S_{t+1}} \left| \frac{d_{k}}{D} \right|^{2} G^{2}. \tag{74}
\end{align*}
This concludes the proof for Theorem 2.
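The mixing step in (46) and (47) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the staleness-based rule that produces each \gamma_{k} is not reproduced in this appendix, so the weights are passed in directly, and the function name `async_mix` and the toy values are hypothetical. The sketch only enforces the property the proof uses, namely that \overline{\gamma} and the \gamma_{k} form a convex combination.

```python
import numpy as np


def async_mix(sync_model, delayed_models, gammas):
    """Eq. (46): (1 - sum_k gamma_k) * sync_model + sum_k gamma_k * delayed_k.

    `sync_model` is the AMA output of the current round (the bracket in Eq. (47));
    `delayed_models` has one row per delayed update received in this round.
    The gamma_k values are inputs here; how they depend on staleness is not modeled.
    """
    sync = np.asarray(sync_model, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    if gammas.size == 0:
        return sync.copy()  # no delayed updates arrived in this round
    if np.any(gammas < 0.0) or gammas.sum() > 1.0:
        raise ValueError("gammas must be non-negative and sum to at most 1")
    delayed = np.asarray(delayed_models, dtype=float)  # shape (|S'|, dim)
    gamma_bar = 1.0 - gammas.sum()
    return gamma_bar * sync + gammas @ delayed


# Toy round: two delayed updates mixed into a 2-parameter model (hypothetical values).
sync_model = np.array([1.0, 1.0])
delayed_models = np.array([[0.0, 0.0], [2.0, 2.0]])
mixed = async_mix(sync_model, delayed_models, [0.1, 0.2])
unchanged = async_mix(sync_model, [], [])
```

Keeping the \gamma_{k} small for stale updates limits the third term of the bound in (74), which grows with \gamma_{k}^{2} and with the squared staleness (t-t_{k})^{2}.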
