Slingshot: Globally Favorable Local Updates for Federated Learning

Federated Learning (FL), as a promising distributed learning paradigm, is proposed to solve the contradiction between the data hunger of modern machine learning and the increasingly stringent need for data privacy. However, clients naturally present different distributions of their local data and inconsistent local optima, which leads to poor model performance of FL. Many previous methods focus on mitigating objective inconsistency. Although local objective consistency can be guaranteed when the number of communication rounds is infinite, we should notice that the accumulation of global drift and the limitation on the potential of local updates are non-negligible in those previous methods. In this article, we study a new framework for data-heterogeneity FL, in which the local updates in clients towards the global optimum can accelerate FL. We propose a new approach called Slingshot. Slingshot's design goals are twofold, i.e., i) to retain the potential of local updates, and ii) to combine local and global trends. Experimental results show that Slingshot helps local updates become more globally favorable and outperforms other popular methods under various FL settings. For example, on CIFAR10, Slingshot achieves 46.52% improvement in test accuracy and 48.21× speedup for a lightweight neural network named SqueezeNet.


I. INTRODUCTION
In each communication round of the standard FL algorithm FedAvg [1], each selected client first receives the global model from a central server and executes stochastic gradient descent (SGD) on its local data for several local epochs. The updated local model is then returned to the server for aggregation. Compared to traditional distributed learning, FL protects data privacy by exchanging models instead of the local data of each participant. Therefore, FL can be applied to areas with strict privacy restrictions such as healthcare [2], [3]. In addition, FL reduces the aggregation frequency, thereby lowering communication costs [1].
In practice, data heterogeneity (also known as statistical heterogeneity or non-identically distributed data) prevents FL from being widely applied. In real-world environments, each client often has its own data distribution because of personal preferences and attributes. Our expectation for FL is to combine as much data as possible to train an optimal global model that fits the total data distribution. However, in data-heterogeneity settings, FL has been found to converge slowly to a sub-optimal point or not at all [4], [5], particularly if the learning rate has not been specifically optimized.
The poor model performance and slow convergence of data-heterogeneity FL result from inconsistent local optima across clients [6], [7], [8], [9]. Seriously heterogeneous data means that the clients' local optima are far from each other. Thus, the global optimum (the average of all local optima) would be far from each local optimum. Such inconsistencies cause two detrimental effects on FL, i.e., i) local updates deviate from global updates (Local Drift, as shown in Fig. 1), and ii) the aggregated global model deviates from the global optimum (Global Drift, as shown in Fig. 1).
A common approach to the data heterogeneity problem in FL is to alleviate objective inconsistency. These methods can be divided into two categories. The first adds a regularization term to constrain the distance between the locally updated model and the global model [8], [10], [11]. The second introduces a correction term for Local Drift [6], [12]. However, these methods cannot eliminate objective inconsistency, because the global model in the regularization or correction term also includes drifts [13]. Although all client solutions can be aligned at the end of federated training in these methods, the drift gradually accumulates during the progressive alignment. In other words, these methods ensure faster convergence of FL, but cannot guarantee that the converged model is closer to the global optimum. The accumulation of Global Drift leads to poor model performance of FL. In addition, strict restrictions on consistency limit the potential of local updates to improve local models, thus degrading the updates of the global model [14].
In this article, we take a new perspective on data-heterogeneity FL that prioritizes global benefits over objective consistency. In particular, we no longer asymptotically reduce the Local Drift to ensure the consistency of clients, but focus on increasing local updates in a globally favorable direction. This is because local updates towards the global optimum can potentially accelerate data-heterogeneity FL. Inspired by this idea, we propose a new method called Slingshot by taking the following two design goals into account, i.e., i) to retain the potential of local updates, and ii) to combine local and global trends such that local updates are more globally favorable.
Our study includes the following contributions.
Our extensive experiments show that the improved local updates of Slingshot make it applicable to various datasets, models, and other settings in FL, with better performance and faster convergence. For example, in a severe data-heterogeneity setting, Slingshot reduces the communication rounds of FedAvg by 72% on the CIFAR10 dataset.
The remaining sections are organized as follows. In Section II, we review related work in the fields of data-heterogeneity federated learning and catastrophic forgetting. In Section III, we argue that the underlying cause of FedAvg's poor performance with heterogeneous data is ineffective local updates. We also define a new metric called MGAI to measure the value of local updates. Following that, Section IV introduces the proposed method Slingshot, detailing its motivation and designs. Experimental setups and results are presented in Section V. Finally, we conclude in Section VI with a summary of key contributions. We hope our study will stimulate more researchers to discuss what kind of local updates are more beneficial to global updates and global models.

II. RELATED WORK
A. DATA-HETEROGENEITY FEDERATED LEARNING
Data heterogeneity among clients is a key challenge in FL. A common solution to the data-heterogeneity issue is to alleviate objective inconsistency by reducing the aforementioned Local Drifts. The representative studies are reviewed as follows. FedProx [10] simply adds a proximal term to the objective. Scaffold [6] views the Local Drift as "client-variance" and uses control variates to correct for the Local Drift. Moon [8] adopts a model-level contrastive loss by comparing representations learned by global models, local models, and previous local models. FedDyn [11] adds linear and quadratic penalty terms that dynamically modify the clients' objective to ensure objective consistency in the limit. FedDC [12] decouples local training from global training and bridges the Local Drift with a local-drift variable. FedGA [15] promotes the alignment of gradients across clients with an implicit regularization.
In addition, some methods improve on the original weighted aggregation of FedAvg with a focus on the server side. These methods are designed to reduce Global Drift. For example, FedNova [13] starts from the varying amounts of local updates across clients and normalizes all local gradients before the aggregation. Considering the permutation invariance of neural network parameters, FedMa [16] matches and averages similar weights layer by layer in the aggregation phase. To mitigate feature drifts of heterogeneous data, FedBN [17] does not aggregate the parameters of local BatchNorm layers. FedOPT [18] converts various adaptive optimizers into federated versions, and applies them to the global updates on the server. However, none of the mentioned methods integrates well with Local-Drift-reduction methods. In fact, Global Drift is the vector sum of Local Drifts [9]. Thus, the Local-Drift-reduction methods are not orthogonal to the Global-Drift-reduction methods in FL. We focus on the Local-Drift-reduction methods, which are directly related to the raw data and have recently been more popular [12], [15].
From the previous studies mentioned above, we observe that paying too much attention to objective consistency could be harmful to FL. Because of the strong constraints or corrections in these methods, the gradient-descent path of the local model could be too short. When aggregating these less-updated models, the server collects less novel information per communication round, which increases the number of communication rounds [7]. In addition, the global model in the regularization or correction term also includes drifts, and the cumulative Global Drift in these methods cannot be ignored. Thus, we try to find a more general and intuitive way to solve the issue of data heterogeneity.

B. CATASTROPHIC FORGETTING IN FEDERATED LEARNING
Catastrophic forgetting refers to the problem that a neural network, given a series of tasks with different data distributions, forgets the knowledge of previous tasks while training on the current task [19]. Some recent papers attribute the poor performance of data-heterogeneity FL to catastrophic forgetting across clients [20], [21]. Because of inconsistent objectives, the global model forgets previous knowledge when each client executes local SGD, and local models forget the locally learned knowledge when the updated models are aggregated by the server. The biggest challenge in tackling data-heterogeneity FL and catastrophic forgetting is how to balance knowledge from different data distributions. Consequently, we can learn from methods for catastrophic forgetting to address data-heterogeneity FL.
The authors of [22], [23] classified the most related papers on catastrophic forgetting into three main categories, i.e., regularization [24], [25], [26], replay [27], [28], [29], and parameter-isolation approaches [30], [31], [32], [33]. Regularization-based methods are commonly applied to data-heterogeneity FL, such as the aforementioned FedProx and FedDyn. Replay-based methods have also been proposed for FL, e.g., [7] corrects the classifier of a neural network by replaying virtual representations generated from a Gaussian distribution. Inspired by the parameter-isolation approaches [32], [33], we find that a globally favorable local update in FL effectively fixes those model parameters that improve both local and global accuracy while updating other model parameters. Thus, we adopt two dynamic targets in the local training phase to help local updates improve both local and global accuracy. Moreover, these targets are not proximal and do not unduly constrain the potential of local training to update other parameters.

III. INEFFECTIVE LOCAL UPDATES
In this section, we define the optimization goals of federated learning and analyze the underlying reasons why conventional federated learning performs poorly in the case of heterogeneous data.

A. PRELIMINARIES
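Assume that $N \in \mathbb{N}^{+}$ clients participate in FL, and that client $k$ holds a local dataset $D_k$. We denote $\omega$ as the global model. Thus, the global optimum in FL is defined as follows.
$$\omega^{*} = \arg\min_{\omega} f(\omega), \qquad f(\omega) = \frac{1}{|D|} \sum_{k=1}^{N} \sum_{s \in D_k} f(\omega; s),$$
where $|D| = \sum_{k=1}^{N} |D_k|$ indicates the number of data samples in all clients, $s$ is a data sample in $D_k$, and $f(\omega; s)$ means the supervised loss function given $\omega$ and $s$, such as the common cross-entropy loss.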
In FedAvg [1], each client first sets its local model $\omega_k = \omega$, then executes mini-batch gradient descent on its device. In fact, with its local data, each client $k$ can only help the local model get closer to the local optimum by executing local SGD following FedAvg [1]. The server then averages the returned local models into $\bar{\omega}$ and uses the averaged model $\bar{\omega}$ as the new global model $\omega$.
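The round structure described above can be illustrated with a minimal sketch. It assumes toy quadratic client losses, full-batch gradient steps instead of mini-batches, and aggregation weights proportional to client data sizes; the names `local_sgd`, `fedavg_round`, and `client_sizes` are hypothetical and do not come from the paper.

```python
import numpy as np

def local_sgd(w, grad_fn, lr=0.1, steps=5):
    """Start from the global model and take a few local gradient steps."""
    w_k = w.copy()
    for _ in range(steps):
        w_k -= lr * grad_fn(w_k)
    return w_k

def fedavg_round(w, client_grads, client_sizes, lr=0.1, steps=5):
    """One round: local training on every client, then a weighted average."""
    local_models = [local_sgd(w, g, lr, steps) for g in client_grads]
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()  # assumption: weights proportional to data size
    return sum(a * w_k for a, w_k in zip(weights, local_models))

# Toy clients: loss 0.5 * ||w - c_k||^2, so the gradient is w - c_k and the
# local optimum of client k is c_k (hypothetical values).
optima = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grads = [lambda w, c=c: w - c for c in optima]
w_new = fedavg_round(np.zeros(2), grads, client_sizes=[50, 50])
```

Each local model moves toward its own optimum, and the averaged model lands between them, which is the situation the following analysis examines.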

B. HETEROGENEOUS DATA MAKES UPDATES INEFFECTIVE
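In centralized learning, after one step of global update, the global model becomes $\omega^{+} = \omega - \eta \nabla f(\omega)$. The averaged model $\bar{\omega}$ is worse than $\omega^{+}$ because there is a large angle between the directions of $-\nabla f(\omega)$ and $-\nabla f_k(\omega)$, especially when $D_k$ is significantly different from $D$.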

C. MEAN GLOBAL ACCURACY INCREASE
To better explore which local updates are globally beneficial, we integrate the findings from the previous section and define a metric called MGAI: Mean Global Accuracy Increase. A study shows that the final test accuracy of FL is greatly affected by the early phase of the training process [34]. This is because if the global model is optimized to a sub-optimal point far away from the global optimum in the early training phase, the complex loss surface in data-heterogeneity FL prevents the global model from escaping the sub-optimal point. Thus, we focus on the global impact of local updates in the early training phase. Note that measuring the accuracy of a local model only with a local test set, or measuring the accuracy of a global model only with a global test set, cannot directly account for the global impact of local updates. MGAI measures the increase in global accuracy brought by local updates. The experimental results show that MGAI is positively correlated with the final model performance. This validates the idea that globally beneficial local updates can improve the model performance of FL.
Note that MGAI can be measured directly on the server side. As the training task is initiated on the server, we can collect the test set on the server and measure MGAI in the early stages of training. In each communication round, the first measurement of test accuracy occurs before the local models are updated. At this time, each local model is equal to the global model. The second measurement of test accuracy occurs before aggregation. At this time, the updated models have been sent to the server. Although such testing consumes some additional resources, it allows users to select effective algorithms at an early phase. For example, for a task that trains for a thousand rounds, it may take only 50 rounds of MGAI measurement to select an effective algorithm, which can save a lot of training resources.

IV. THE METHOD OF SLINGSHOT
Inspired by the observation from Fig. 2 and the analysis in the previous section, we propose Slingshot to accelerate data-heterogeneity FL. Slingshot exploits the directions and the quality of local updates, based on the question of how to improve MGAI. The design goal of Slingshot is to make each local update more favorable to the update of the global model, such that local updates help the updated local model approach the global optimum. With this goal in mind, the challenge is that each client does not know where the global optimum is.
To help each client capture the globally favorable direction, Slingshot makes the following two changes compared to FedAvg. Firstly, each local update is guided by two dynamic targets. Secondly, the global model is moved back before local updates and is compensated after local updates. These two changes are elaborated in Designs 1 and 2, respectively. Before describing the two designs, we describe the motivation of Design 1 as follows.
Algorithm fragment (local update): local model $\omega_k \leftarrow \omega$; $l \leftarrow l_{ce} + l_{ss}$; $\omega_k \leftarrow \omega_k - \eta \nabla l$; end for; end for; return $\omega_k$.

A. MOTIVATION: BALANCING THE LOCAL AND GLOBAL PERFORMANCE
Ineffective local updates in data-heterogeneity FL originate from the fact that all clients share a common set of global model parameters but the local and global optima are different. Motivated by the parameter-isolation approaches [32], [33], which ensure a minimal drop in performance, we argue that the key to data-heterogeneity FL is to balance the local and global performance. However, the local updates in FedAvg are unbalanced: they aim at local optima but ignore the global optimum. On the other hand, FedProx-like methods add a strong penalty term written as $\frac{\mu}{2}\|\omega_k - \omega\|^2$, where $\mu$ controls the weight of this penalty term. This penalty term prevents the updated local model $\omega_k$ from getting too far from the global model $\omega$. In fact, such a stringent penalty term limits the effective progress of local updates towards local optima. Global updates are then constrained accordingly. Therefore, a globally favorable local update should simultaneously consider the update trends of both the local and global models, while preserving its potential of moving towards local and global optima.
and the global target is constructed as where α is a hyper-parameter controlling how far the target is from the global model.

C. DESIGN 2: MOVING GLOBAL MODEL BACK
Although the penalty terms $\frac{\mu}{2}\left(\|\omega_k - \omega_{loc}\|^2 + \|\omega_k - \omega_{glo}\|^2\right)$ help local updates become more globally favorable, they also prevent the local updates from converging quickly to the local optimum. Note that there is a trade-off in the size of the hyper-parameter $\alpha$. When $\alpha$ is too large, the dynamic targets are too far from the local model, reducing the impact of the targets on local updates. On the other hand, when $\alpha$ is too small, the dynamic targets are close to the global model. Slingshot is then equivalent to FedProx [10], which limits the potential of local updates to improve local models. To prevent the value of the hyper-parameter $\alpha$ from unduly affecting the performance of Slingshot, we further improve local updates.
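As a rough illustration of the two-target penalty at the start of this subsection, the sketch below evaluates the term and its gradient for flat parameter vectors. The targets are taken as given inputs, since their construction is not reproduced here; the function names are hypothetical, and treating this term as the extra loss added to the cross-entropy loss is an assumption.

```python
import numpy as np

def two_target_penalty(w_k, w_loc, w_glo, mu):
    """(mu / 2) * (||w_k - w_loc||^2 + ||w_k - w_glo||^2)."""
    return 0.5 * mu * (np.sum((w_k - w_loc) ** 2) + np.sum((w_k - w_glo) ** 2))

def two_target_penalty_grad(w_k, w_loc, w_glo, mu):
    """Gradient of the penalty with respect to the local model w_k."""
    return mu * ((w_k - w_loc) + (w_k - w_glo))

# Hypothetical values chosen only for illustration.
w_k = np.array([0.0, 0.0])
w_loc = np.array([1.0, 0.0])
w_glo = np.array([0.0, 2.0])
penalty = two_target_penalty(w_k, w_loc, w_glo, mu=0.01)
grad = two_target_penalty_grad(w_k, w_loc, w_glo, mu=0.01)
```

If both targets coincide with the global model, the expression collapses to a single proximal term around the global model (with twice the weight), which matches the FedProx-like behavior noted above for targets that sit close to the global model.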
In state-of-the-art data-heterogeneity FL methods, local updates are mainly affected by local datasets and regularization terms (or correction terms). These methods often design regularization terms (or correction terms) elaborately, but ignore the raw effects of local datasets. When is the direction toward a local optimum approximately equal to the direction toward the global optimum? The answer is when the global model is far away from both the local and global optima.
To let the local dataset guide the local model toward the global optimum as much as possible, we propose a method that increases the distance between the global model and the global optimum before local training but decreases the distance after training. Specifically, we let each client move the global model back by $\alpha m$ before local updates,
$$\omega \leftarrow \omega - \alpha m, \tag{9}$$
and compensate for it after local updates (as shown in line 13 of Algorithm 1), where $m$ is the global momentum, i.e., the momentum of global updates. This momentum is updated as shown in line 14 of Algorithm 1.
It is worth noting that Design 2 is fundamentally different from conventional FL methods that use momentum [18], [35]. Although we use server-side momentum, this momentum is not actually incorporated into the optimizer. This is because the momentum that we add to the global model cancels out before and after the local training, while the momentum added to the model in conventional FL methods is maintained. It is difficult to directly correct the global update: especially in the scenario of Big Data and deep models, even a minor correction to the global model can have a huge impact. Therefore, we do not directly correct the global update. The reason we introduce server-side momentum is to implicitly improve local updates, i.e., to implicitly add a gradient pointing to the global optimum to the local update. Our experiments in Section V-B show that both of our proposed designs can improve the performance of FL. Also, Slingshot's performance is less sensitive to the hyper-parameter α because of Design 2.
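The move-back step of Design 2 can be sketched as follows. The move-back rule follows the update stated in the text; the exact form of the compensation step is an assumption here (adding the same shift back), chosen so that the two shifts cancel as described. The momentum `m` is passed in as a given vector because its update rule is not reproduced in this section, and the function names are hypothetical.

```python
import numpy as np

def move_back(w_global, m, alpha):
    """Shift the received global model back along the global momentum."""
    return w_global - alpha * m

def compensate(w_local, m, alpha):
    """Assumed compensation: add the same shift back after local training."""
    return w_local + alpha * m

# Hypothetical values chosen only for illustration.
w = np.array([1.0, 2.0])
m = np.array([0.5, -0.5])
alpha = 0.1

w_start = move_back(w, m, alpha)       # starting point for local training
# ... local training would update w_start here ...
w_end = compensate(w_start, m, alpha)  # without local steps, w_end equals w
```

Because the shift is removed after local training, the momentum only changes where local training starts and does not act as an optimizer state on the global model.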

D. PROPERTY ANALYSIS
Compared with the classical FL algorithm FedAvg, most data-heterogeneity FL methods have additional resource consumption [6], [8], [10]. To the best of our knowledge, an efficient data-heterogeneity FL method that does not require preserving additional models or gradient states has not yet emerged. Slingshot also needs to maintain some historical models locally on the clients to improve the training, such as $\omega_k^{pre}$ and $\omega_k^{rec}$. These models consume additional memory resources. On the other hand, the effectiveness of some methods depends heavily on synchronizing the extra saved models. For example, the conventional data-heterogeneity FL algorithm Scaffold [6] incurs 2× communication overhead for synchronizing the global control variate c. However, Slingshot does not require this kind of synchronization, eliminating additional communication resource consumption. This means that Slingshot's communication overhead per round is equal to that of FedAvg.

2) THE PARTITION OF DATASETS
The whole test dataset is stored at the FL server, but each client has only a subset of the training dataset during the entire FL training. We adopt a popular approach from FL papers to partition datasets [7], [8], [12]. Concretely, for each class, a distribution of data samples across all clients is generated by the Dirichlet distribution Dir(β). The parameter β is the concentration of the Dirichlet distribution, and a lower concentration leads to a more heterogeneous distribution of data. To implement different degrees of data heterogeneity, we set β = 0.1, 0.3, and 0.5, respectively. In these challenging settings, the data distribution and the amount of data can vary widely from client to client.
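A minimal sketch of this kind of per-class Dirichlet partition is given below, assuming integer class labels and NumPy. The function name `dirichlet_partition`, the seed handling, and the toy label array are hypothetical; the exact procedure used in the cited papers may differ in details such as minimum client sizes.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, beta, seed=0):
    """Split sample indices across clients, class by class, using Dir(beta)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of class c, then cut them by sampled proportions.
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(beta * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices

# Toy labels: 10 classes with 100 samples each (hypothetical data).
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, num_clients=100, beta=0.1)
```

With a small β, most of each class tends to concentrate on a few clients, so both the class mix and the number of samples per client vary widely, and some clients may receive no samples at all.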

3) FEDERATED LEARNING SETTINGS
In each communication round, a central FL server randomly samples a specified number of clients, using a fixed random seed, to perform local SGD. For CIFAR10, CIFAR100, and SVHN, we set 1000 communication rounds, and 10 out of 100 clients are selected for each round. For FashionMNIST, we set 300 communication rounds, and 10 out of 200 clients are selected for each round. For EMNIST, we set 200 communication rounds, and 10 out of 100 clients are selected for each round. Unless otherwise specified, we set the number of local epochs in each round to 5, the batch size in the local update phase to 64, the initial learning rate to 0.1, the per-round decay rate of the learning rate to 0.998 [11], the momentum to 0.9, and the weight decay to $10^{-4}$. We also apply the same data-augmentation techniques for each dataset.
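The sampling and learning-rate settings above could be expressed as in the following sketch. How the fixed seed is combined with the round index and the multiplicative form of the decay are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def sample_clients(num_clients, num_selected, round_idx, seed=0):
    """Reproducibly pick clients for a round (the seeding scheme is assumed)."""
    rng = np.random.default_rng([seed, round_idx])
    return rng.choice(num_clients, size=num_selected, replace=False)

def learning_rate(round_idx, initial_lr=0.1, decay=0.998):
    """Learning rate after applying the per-round decay round_idx times."""
    return initial_lr * decay ** round_idx

selected = sample_clients(num_clients=100, num_selected=10, round_idx=0)
lr_first = learning_rate(0)    # initial learning rate
lr_last = learning_rate(999)   # last of 1000 rounds
```

Seeding on the pair (seed, round index) keeps the client selection identical across compared methods, which is one way to read "a fixed random seed".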

4) BASELINES AND HYPER-PARAMETERS
We compare Slingshot with the standard FL paradigm FedAvg, as well as three other popular methods in data-heterogeneity FL, namely FedProx [10], FedGA [15], and Moon [8]. For CIFAR100, we tune μ of FedProx from {0.1, 0.01, 0.001} and use the best value 0.01; α of FedGA is set to 0.05, tuned from {0.1, 0.05, 0.025}; μ of Moon is set to 0.01, tuned from {1, 0.1, 0.01, 0.001}; and α of Slingshot is set to 0.2, tuned from {0.2, 0.1, 0.05}. For CIFAR10, the best μ of Moon is changed to 0.001, and the best α of Slingshot is 0.1. In order to fairly compare the generalization of these methods, their specific hyper-parameters for the other tasks are set to the best values for CIFAR10. We explore the sensitivity of Slingshot's hyper-parameter α and show the results in Fig. 6. All the settings of μ in Slingshot are the same as those set in FedProx.

B. IMPROVED LOCAL UPDATES
Our key idea is to help local updates become more globally favorable in data-heterogeneity FL. Thus, we compare the global impacts of local updates of various federated learning baselines. Our proposed metric MGAI measures the increase in global accuracy of the locally updated models, which shows the effectiveness of local updates. As shown in Fig. 3, Slingshot improves the value of local updates compared to other baselines, especially with a large number of local epochs. This is because Slingshot not only appropriately guides the directions of local updates but also preserves the possibility of a large improvement of the updated model.
Effective local updates result in good performance of the global model. We compare the performance of Slingshot with that of FedAvg in the early stages (known as critical periods). The experimental results show that MGAI is positively correlated with model performance. Slingshot has better model performance than FedAvg, with a higher MGAI. Note that the number of local epochs in a communication round is a crucial parameter of FL. A simple and crude way to reduce the impact of heterogeneous data in FedAvg is to reduce the number of local epochs. However, decreasing local epochs seriously increases communication costs. It is generally believed that a large number of local epochs aggravates the challenge of data heterogeneity. Thus, we conduct five groups of experiments with 2, 5, 10, 15, and 20 local epochs, respectively, to test the performance under various degrees of data heterogeneity. Fig. 4 shows that the acceleration of Slingshot in the early stage is robust to different numbers of local epochs. This observation suggests that Slingshot works well under various degrees of data heterogeneity.

C. CONVERGENCE COMPARISON
1) NON-IID DATASET AND PARTIAL PARTICIPANTS
The early training stage is called the critical period because early training often determines the quality of the final model and the overall convergence speed of the algorithm. Our experimental results also show that the convergence rate and model performance in the early training stage imply the eventual convergence rate and model performance. In addition, the positive relationship between MGAI and the final model performance is also confirmed. Thus, researchers can predict the final performance of various FL algorithms by measuring MGAI during the early testing phase.
Table 2 shows the convergence of the mentioned methods on five benchmark datasets by comparing both the top-1 accuracy of the global model and the number of rounds needed to achieve the target accuracy. In order to better compare the convergence, we set higher target accuracies for the settings with larger β (less challenging). Compared to other baselines, Slingshot shows the best model performance and communication efficiency in almost all settings. For CIFAR10 (β = 0.1), the top-1 accuracy of Slingshot is 7.08% higher than that of FedAvg, and the number of communication rounds to achieve 63% accuracy is reduced by 72%. Slingshot is also 5.30× faster than FedAvg on EMNIST, which poses fewer convergence challenges. Slingshot's local updates are more globally beneficial and are not pulled apart by different local optima; thus, Slingshot achieves the target accuracy with fewer communication rounds.
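The two efficiency measures quoted above (rounds to reach a target accuracy, and the resulting reduction or speedup) can be computed as in the sketch below. The helper names and the example round counts are hypothetical and are not values from Table 2.

```python
def rounds_to_target(accuracies, target):
    """Return the first round (1-indexed) whose accuracy reaches the target."""
    for round_idx, acc in enumerate(accuracies, start=1):
        if acc >= target:
            return round_idx
    return None  # target never reached

def round_reduction(baseline_rounds, method_rounds):
    """Fraction of communication rounds saved relative to the baseline."""
    return 1.0 - method_rounds / baseline_rounds

def speedup(baseline_rounds, method_rounds):
    """How many times faster the method reaches the target."""
    return baseline_rounds / method_rounds

# Hypothetical round counts: a 72% reduction corresponds to about 3.57x speedup.
reduction = round_reduction(baseline_rounds=100, method_rounds=28)
factor = speedup(baseline_rounds=100, method_rounds=28)
```

The two measures express the same comparison: a reduction of r in rounds is equivalent to a speedup of 1 / (1 - r).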
Among the five dataset tasks, the most difficult one is CIFAR100. For CIFAR100 with heterogeneous data, some methods cannot even train effectively. For instance, when β = 0.1, the global accuracy of FedAvg is stuck at 1%. However, in this case of significantly heterogeneous data, Slingshot still converges best. Sometimes, other methods are no better than vanilla FedAvg, while Slingshot is always better than FedAvg with heterogeneous data. This is because Slingshot, unlike methods such as FedProx, is not overly constrained by objective consistency and not seriously affected by cumulative Global Drift; instead, it takes the more general approach of choosing the globally favorable direction to update.

TABLE 3. Top-1 Accuracy, With IID Dataset and Full Participants
The test accuracy results of training models on CIFAR10 and CIFAR100 are shown in Fig. 5. In each communication round, only a subset of clients is selected by the central server, and these clients may have little data in data-heterogeneity settings. In addition, the selected clients may have data that are useless or even have negative impacts on global training at this stage. Thus, the test accuracy results of all methods show some degree of oscillation. However, Slingshot captures the globally favorable direction even with oscillation, resulting in faster convergence.

2) IID DATASET AND FULL PARTICIPANTS
Table 3 shows that Slingshot does not degrade performance with full participants and an IID dataset, where all clients have the same data distribution and the same amount of data. Due to the feature drifts among samples and the randomness of stochastic gradient descent, Slingshot is also effective under the setting of IID data and full participants.

D. ROBUSTNESS AND EFFECTIVENESS
It is not our intention to obtain a competitive method by tuning hyper-parameters. Specific hyper-parameters, however, definitely affect the convergence of all methods. We explore the sensitivity of Slingshot's parameter α in Fig. 6(a). This sensitivity is similar to that of FedGA, while Slingshot outperforms FedGA for all α. What we want to emphasize is that the designs of Slingshot generalize in helping to solve the problem of data heterogeneity. Fig. 6(b) shows that both of Slingshot's designs can speed up data-heterogeneity FL and that Design 1 has a larger impact than Design 2. Despite using a loss function similar to that of FedProx, Slingshot is more effective than FedProx. This is because Slingshot constrains local updates to a globally favorable direction rather than to a global model affected by Global Drift.
Since deep models are transmitted in FL instead of data, FL is often associated with high communication costs. It is natural to choose a lightweight model in FL. However, experimental results show that lightweight models that are more effective in centralized training are possibly more difficult to train in data-heterogeneity FL. This is because such lightweight models are compressed. Compared with sparse models, their gradient norm is larger, so they are more susceptible to the drifts of FL. Fig. 7 shows that Slingshot can be applied to various lightweight models to tackle the problem of data heterogeneity. For example, Slingshot achieves a 46.52% improvement in test accuracy and a 48.21× speedup for a lightweight neural network named SqueezeNet-1.0.

VI. CONCLUSION
A challenge in federated learning (FL) is the data-heterogeneity problem. Previous approaches have paid plenty of attention to alleviating data heterogeneity in FL by focusing on objective inconsistency. In contrast, we find that it is a significant challenge to guarantee that local updates are globally favorable. To address this challenge, we made the following two attempts. First, we propose a new metric called Mean Global Accuracy Increase (MGAI) to evaluate what kind of local updates are globally favorable. MGAI helps researchers predict the final performance of various FL algorithms during the early testing phase. Meanwhile, this metric helps researchers understand that truly effective local updates for FL should point to the global optimum and be of appropriate length. Such local updates can greatly improve MGAI and the final performance of the global model. Second, we propose a new FL method called Slingshot that exploits the globally favorable direction and the quality of local updates. Our experiments demonstrate faster convergence of FL training under the proposed Slingshot, as well as its robustness and effectiveness. Hopefully, our study can spark more discussions about the directions and the quality of local updates in FL.

FIGURE 1. Comparison between the conventional federated learning paradigm FedAvg [1] and our proposed Slingshot. FedAvg produces two kinds of drifts when dealing with data heterogeneity. Slingshot is designed to facilitate the convergence of locally updated models toward the global optimum while preserving the quality of local updates.

FIGURE 2. The underlying cause of FedAvg's poor performance with heterogeneous data is that local updates of a client have little or even negative impact on global accuracy. In the right subgraph, the solid lines indicate the periods during which a client performs local SGD, and the dashed lines represent the intervals when this client is idle. While FedAvg's local updates can significantly improve local accuracy, their impact on the global accuracy of local models remains marginal. In contrast, Slingshot strikes a balance between local and global accuracy and improves the quality of local updates. A local update consists of multiple iterations, which are drawn as one iteration for simplicity in this figure.
There is a large angle between the directions of $-\nabla f(\omega)$ and $-\nabla f_k(\omega)$, especially when $D_k$ is significantly different from $D$. The difference between $-\nabla f(\omega)$ and $-\nabla f_k(\omega)$ is the aforementioned Local Drift. The difference between $\omega^{+}$ and $\bar{\omega}$ is Global Drift. As shown in Fig. 2, the direction of the local update deviates from the direction of the ideal local update in FedAvg. Consequently, the global model approaches the global optimum more slowly because local updates contribute little to reaching the global optimum. We demonstrate this inference by testing a client's local accuracy (using a local test set with the same distribution as the local training set) and global accuracy (using the global test set) on CIFAR10. As shown in Fig. 2, when this client executes local SGD under FedAvg, the local accuracy increases significantly, while the global accuracy increases a little or even decreases. In fact, the overall local accuracy also increases slowly. This is because, in each communication round, all local models start from the global model sent by the central server. The slowly growing global accuracy implies that the starting point of local accuracy in each round grows slowly. Therefore, the poor model-training performance and slow convergence of data-heterogeneity FL are fundamentally induced by ineffective local updates. When the locally updated model is not much closer to the global optimum than the global model before training, such a local update is ineffective.
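The angle between local and global descent directions can be made concrete with a toy computation. The sketch assumes two clients with quadratic losses, equal data sizes, and orthogonal local optima; all values and names are hypothetical and serve only to show how such an angle could be measured.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

w = np.zeros(2)                                # current global model
optima = np.array([[1.0, 0.0], [0.0, 1.0]])    # hypothetical local optima c_k
# For f_k(w) = 0.5 * ||w - c_k||^2, the local descent direction is c_k - w.
local_dirs = optima - w
# With equal client sizes, the global descent direction is their mean.
global_dir = local_dirs.mean(axis=0)
cosines = [cosine(d, global_dir) for d in local_dirs]
```

In this toy case each local direction is 45 degrees away from the global direction, so a client that follows only its own gradient makes limited progress toward the global optimum.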
An effective measurement scheme is to test the effectiveness of local updates with a global test set. Specifically, in each early critical communication round, we measure the test accuracy of local models on the global test set twice. The first measurement occurs before the first local epoch, and the second measurement is conducted after each selected client sends the updated model to the FL server. We denote the difference between the two measurements for client $k$ as $\Delta_k$: $\Delta_k \to \infty$ indicates an effective local update of client $k$, while $\Delta_k \to -\infty$ indicates an ineffective one. Then we compute the mean $\frac{1}{|S|}\sum_{k \in S} \Delta_k$ over five communication rounds, and the MGAI is obtained.
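The measurement above can be sketched as a small function. It assumes that S is the set of clients selected in a round, that every measured round has the same number of selected clients, and that the per-round means are averaged with equal weight; the function name `mgai` and the example accuracies are hypothetical.

```python
import numpy as np

def mgai(acc_before, acc_after):
    """Mean Global Accuracy Increase over the measured rounds.

    acc_before[r][k]: global-test accuracy of client k's model before local
                      training in round r (equal to the global model's accuracy).
    acc_after[r][k]:  global-test accuracy of the same client's updated model.
    """
    before = np.asarray(acc_before, dtype=float)
    after = np.asarray(acc_after, dtype=float)
    deltas = after - before           # per-client accuracy change
    per_round = deltas.mean(axis=1)   # mean over the selected clients
    return float(per_round.mean())    # mean over the measured rounds

# Hypothetical accuracies for 2 rounds with 2 selected clients each.
value = mgai(
    acc_before=[[0.10, 0.10], [0.20, 0.20]],
    acc_after=[[0.15, 0.05], [0.30, 0.20]],
)
```

A positive value means that, on average, local training improved accuracy on the global test set; a value near zero or below suggests that local updates mostly fit local data.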

B. DESIGN 1: SETTING TWO DYNAMIC TARGETS
We construct two dynamic targets for each local update (as shown in lines 6-7 of Algorithm 1), i.e., the local target $\omega_{loc}$ and the global target $\omega_{glo}$. They imply the local and global trends, respectively. For client $k \in [N]$, $\omega_k^{rec}$ is the global model last received from the central server, and $\omega_k^{pre}$ is the previous updated local model sent to the server. Both targets are constructed by adding accumulative gradients to the global model $\omega$. The local target $\omega_{loc}$ is constructed as

FIGURE 3. Improved local updates of Slingshot. We measure the metric called MGAI (Mean Global Accuracy Increase). MGAI measures the impact of local updates on global accuracy. A larger MGAI means that the locally updated model performs better on the global test set. The concentration β of the Dirichlet distribution is set to 0.3.

FIGURE 4. Comparison of the early critical stage performance between Slingshot and FedAvg under different numbers of local epochs. The number in each black bracket represents the number of local epochs. In addition, we mark the differences in performance and communication efficiency between Slingshot and FedAvg, respectively. The oscillating line is the true test accuracy curve.

FIGURE 5. Global accuracy vs. communication rounds. A smaller β represents a greater degree of data heterogeneity. FedAvg is stuck at 1% accuracy in (c). The color band represents the oscillation of the true accuracy curve of the corresponding method.

FIGURE 6. Effectiveness of Slingshot. (a) Testing the hyper-parameter α of Slingshot on EMNIST. (b) Ablation study on CIFAR10. The color band represents the oscillation of the true accuracy curve of the corresponding method.

FIGURE 7. Model performance and convergence when training different models on CIFAR10 (β = 0.1). The color band represents the oscillation of the true accuracy curve of the corresponding method.

TABLE 1. Symbols and Notations
Conventional federated learning performs poorly in the case of heterogeneous data. Finally, we propose a new metric to validate our analysis. Table 1 provides explanations for important symbols and notations.