FedCau: A Proactive Stop Policy for Communication and Computation Efficient Federated Learning

This paper investigates efficient distributed training of a Federated Learning (FL) model over a wireless network. The communication iterations of the distributed training algorithm may be substantially delayed, or even blocked, by the devices' background traffic, packet losses, congestion, or latency. We abstract these communication-computation impacts as an "iteration cost" and propose a cost-aware causal FL algorithm (FedCau) to tackle this problem. We propose an iteration-termination method that trades off the training performance and the networking costs. We apply our approach when clients use the slotted-ALOHA, carrier-sense multiple access with collision avoidance (CSMA/CA), and orthogonal frequency-division multiple access (OFDMA) protocols. We show that, given a total cost budget, the training performance degrades as either the background communication traffic or the dimension of the training problem increases. Our results demonstrate the importance of proactively designing optimal cost-efficient stopping criteria to avoid spending communication-computation resources on iterations that yield only marginal FL training improvements. We validate our method by training and testing FL over the MNIST dataset. Finally, we apply our approach on top of existing communication-efficient FL methods from the literature, achieving further efficiency. We conclude that cost-efficient stopping criteria are essential for the success of practical FL over wireless networks.


I. INTRODUCTION
The recent success of artificial intelligence and large-scale machine learning heavily relies on advances in distributed optimization algorithms [1]. The main objective of such algorithms is better training/test performance for prediction and inference tasks, such as image recognition [2]. However, the communication and computation costs of running these algorithms over a wireless network may hinder achieving the desired training accuracy. The state-of-the-art algorithms require powerful computing platforms with vast amounts of computational and communication resources. Although such resources are available in modern data centers that use wired networks, they are not easily available to wireless devices due to communication and energy resource constraints. Yet, there is a need to extend machine learning tasks to wireless communication scenarios, with use cases such as machine learning over IoT, edge computing, or public wireless networks serving many classes of traffic [3].
One of these prominent algorithms is Federated Learning (FL), a machine learning paradigm in which each individual worker contributes to the learning process without sharing its own data with the other workers or the master node. Specifically, FL methods refer to a class of privacy-preserving distributed learning algorithms in which individual workers j ∈ [M] execute some local iterations and share only their parameters with a central controller for global model aggregation [4]. The FL problem consists in optimizing a finite sum of M differentiable functions f_j, j ∈ [M], which take inputs from R^d for some positive d and give their outputs in R, i.e., {f_j : R^d → R}_{j∈[M]}, with corresponding local parameters {w_j ∈ R^d}_{j∈[M]}. The common solution to such a problem involves an iterative procedure wherein, at each global communication iteration k, the workers find their local parameters {w_k^j}_{j∈[M]} and upload them to a central controller. Then, the master node updates the model parameters as w_{k+1} and broadcasts it to all the nodes to start the next iteration [5].
The FL algorithm alleviates the computation burden and preserves privacy through parallel computations at the workers using their local private data [5]. However, such an algorithm introduces a communication cost: parameter vectors, such as weights and biases, must be communicated between the master and the workers to run a new iteration. The weights can be vectors of huge sizes whose frequent transmission and reception may deplete the battery of wireless devices. Therefore, every communication iteration of these algorithms incurs some cost, e.g., computation, latency, communication resource utilization, and energy. As we argue in this paper, the communication cost can be orders of magnitude larger than the computation cost, thus making the iterative procedure over wireless networks potentially very inefficient. Moreover, due to the diminishing return rule [6], the accuracy improvement of the final model gets smaller with every new iteration. Yet, an expensive communication cost must be paid to run every new communication iteration of marginal importance for training purposes.
In this paper, we investigate the problem of FL over wireless networks with the goal of ensuring communication-computation cost efficiency. Specifically, we define our FL over wireless networks as follows. We consider a star network topology and focus on avoiding the extra communication-computation cost paid in FL training to attain a marginal improvement. We show that a negligible improvement in training spends valuable resources and hardly results in test accuracy progress. We propose novel and causal cost-efficient FL algorithms (FedCau) for both convex and non-convex loss functions. We show the significant performance improvements introduced by FedCau through experimental results, where we train the FL model over wireless networks with slotted-ALOHA, CSMA/CA, and OFDMA protocols. We apply FedCau on top of two well-known communication-efficient methods, Top-q [7] and LAQ [8], and the results show that FedCau further improves the communication efficiency of other communication-efficient methods from the literature. Our extensive results show that the FedCau methods can save the valuable resources one would spend on unnecessary iterations of FL, even when applied on top of existing methods from the literature focusing on resource allocation problems [5], [9]-[11].

A. Literature Survey
Cost-efficient distributed training is addressed in the literature through communication efficiency [4], [12]-[19] or through a trade-off between computation and communication, primarily by resource allocation [11], [20]. There are mainly two classes of approaches for communication efficiency in the literature, focusing on 1) data compression, such as quantization and sparsification of the local parameters in every iteration, and 2) communication iteration reduction.
The first class of approaches focuses on data compression, which reduces the amount of information (in bits) exchanged among nodes, thereby saving communication resources. However, more iterations may be needed to compensate for quantization errors than in the unquantized version. Recent studies have shown that proper quantization approaches, together with some error feedback, can maintain the convergence of the training algorithm and its asymptotic convergence rate [12], [13]. However, the improved convergence rates depend on the number of iterations, thus requiring more computation resources to perform those iterations. Sparsification is an alternative to quantization for reducing the amount of data exchanged in every iteration [14]. A prominent example of this approach is top-q sparsification, where a node sends only the q most significant entries, e.g., the ones with the highest modulus, of the stochastic gradient [12], [15].
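To make the compression step concrete, the following is a minimal Python sketch of top-q sparsification; the function name is ours, and the error-feedback mechanism of [12], [15] is deliberately omitted for brevity.

```python
import numpy as np

def top_q_sparsify(grad: np.ndarray, q: float) -> np.ndarray:
    """Keep only the fraction q of gradient entries with the largest magnitude.
    A minimal sketch: real implementations also transmit the surviving indices
    and usually keep the discarded residual for error feedback."""
    k = max(1, int(q * grad.size))                # number of entries to keep
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest |entries|
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]                       # zero out everything else
    return sparse.reshape(grad.shape)

# Example: keep the top 10% of a stochastic gradient.
g = np.random.randn(784)
g_sparse = top_q_sparsify(g, q=0.1)
```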
The second class of approaches focuses on reducing the number of communication iterations by eliminating the communication between some of the workers and the master node in some iterations [16]. The work [16] has proposed lazily aggregated gradient (LAG) for communication-efficient distributed learning in master-worker architectures. In LAG, each worker reports its gradient vector to the master node only if the gradient has changed enough since the last communication iteration. Hence, some nodes may skip sending their gradients at some iterations, which saves communication resources. LAG has been extended in [17] by sending quantized versions of the gradient vectors. In [19], local SGD techniques reduce the number of communication rounds needed to solve an optimization problem. In a generic FL setting, adding more local iterations may reduce the need for frequent global aggregation, leading to a lower communication overhead [4]. Moreover, it allows the master node to update the global model with only a (randomly chosen) subset of the nodes at every iteration, which may further reduce the communication overhead and increase robustness. The work in [18] has improved the random selection of the nodes and proposed the notion of a significance filter, where each worker updates its local model and transmits it to the master node only when there is a significant change in the local parameters. Furthermore, [18] has shown that adding a memory unit at the master node and using ideas from SAGA [21] reduce the upload frequency of each worker, thus improving the communication efficiency.
The two classes mentioned above present opportunities for reducing the cost of running distributed training algorithms and adapting them to wireless communication protocols. However, these classes focus primarily on the complexity of the iterative algorithm in terms of bits per communication round or the number of communication rounds [9]. Moreover, they neglect other crucial costs associated with solving federated learning (FL) problems, such as latency [3] and energy consumption [20]. These costs can render distributed algorithms ineffective in bandwidth- or battery-limited wireless networks, where latency and energy consumption are critical factors.
Recent works have explored the co-design of optimization problems and communication networks, particularly in the context of computational offloading [10], [11], [22]. These works have addressed task offloading, resource allocation optimization, and joint learning of wireless resource allocation and user selection. In contrast to the existing literature, our approach proactively designs stopping criteria to optimize the trade-offs, rather than treating them as hyper-parameters set through cross-validation. This distinction makes our approach original and distinct from current state-of-the-art algorithms.
In our preliminary works, we characterized the overall communication-computation cost of solving a distributed gradient descent problem where the workers had background traffic and used a medium access control (MAC) protocol with random access, such as slotted-ALOHA [23] or CSMA/CA [24], in the uplink. Going beyond those papers, to achieve a cost-aware training workflow, we need to account for the diminishing return rule of optimization algorithms, which states that as the number of iterations increases, the improvement in training accuracy decreases. We then need to balance iteration cost and achievable accuracy before the algorithm's design phase. This paper constitutes a major step in addressing this important research gap. Previously, in [23], [24], we proposed a cost-efficient framework considering the cost of each iteration of gradient descent algorithms while minimizing a convex loss function. However, the theory of those papers was limited to convex loss functions, the iteration costs did not consider the FedAvg algorithm or the computation latency, and there was no adequate study of the relation between the achievable test accuracy and the iteration costs. Hence, this paper proposes a new and original study compared to our preliminary works by 1) considering the FedAvg algorithm; 2) assuming both convex and non-convex loss functions; and 3) developing a novel theoretical framework for FedAvg that includes the communication-computation costs. We apply the proposed framework to several wireless communication protocols and other communication-efficient algorithms, for which we show original training and testing results.

B. Contributions
We investigate the trade-off between the achievable FL loss and the overall communication-computation cost of running FL over wireless networks as an optimization problem. This work focuses on training a cost-efficient FedAvg algorithm in a "causal way", meaning that our approach does not require future training information to decide how much total cost, e.g., computation, latency, or communication energy, the training algorithm should spend before terminating the iterations. Different from our approach, most papers in the literature aim to train the FedAvg algorithm under resource-constrained conditions and propose the best resource allocation policies "before" performing the training [11], [20]. These approaches rely mainly on approximating the future training information by using lower and upper bounds on that information. In this work, we propose to train the FedAvg algorithm in a causal, communication- and computation-efficient way. To this end, we utilize the well-known multi-objective optimization approach according to the scalarization procedure in [25]. We thus propose FedCau to improve the FedAvg algorithm by training in a cost-efficient manner without any need to know the future training information or any upper and lower bounds on it. To the best of our knowledge, this is the first work that considers such causal approaches to train the FedAvg algorithm in a communication- and computation-efficient manner. The main contributions of this work are summarized as:
• We propose a new multi-objective cost-efficient optimization that trades off model performance and communication costs for an FL training problem over wireless networks;
• We develop three novel causal solution algorithms, named FedCau, for the multi-objective optimization above, one with a focus on original FL and the others with a focus on stochastic FL. We establish the convergence of these algorithms for FL training problems using both convex and non-convex loss functions;
• We conclude that a co-design of distributed optimization algorithms and communication protocols is essential for the success of cost-efficient FL over wireless networks, including its applications to edge computing and IoT.
The rest of this paper is organized as follows. Section II describes the general system model and problem formulation. In Section III, we derive some useful results and propose our non-causal and causal FL algorithms (FedCau), which are by design intended to run over communication networks. In the analysis, we consider both convex and non-convex loss functions. In Section IV, we apply our algorithms to slotted-ALOHA, CSMA/CA, and OFDMA. In Section V, we analyze the performance of the FedCau algorithms. We conclude the paper in Section VI. All proofs and extra materials are in the Appendix.
Notation: Normal font w, bold small-case font w, bold capital font W, and calligraphic font W denote a scalar, a vector, a matrix, and a set, respectively. We define the index set [N] = {1, 2, ..., N} for any integer N. We denote by ‖·‖ the ℓ2-norm, by |A| the cardinality of set A, by [w]_i the entry i of vector w, and by w^T the transpose of w; 1_x is an indicator function taking the value 1 if and only if x is true, and 0 otherwise.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, we present the system model and the problem formulation. First, we discuss the FedAvg algorithm; afterward, we propose the main approach of this paper.

A. Federated Learning
The FL training problem is

$$\min_{\mathbf{w} \in \mathbb{R}^d} \; f(\mathbf{w}) := \frac{1}{M} \sum_{j=1}^{M} f_j(\mathbf{w}), \qquad (1)$$

where f_j is the local loss function of worker j ∈ [M]. Optimization problem (1) applies to a large class of functions, both convex and non-convex (such as deep neural networks).
The standard iterative procedure to solve problem (1), starting from an initial vector w_0, updates the global parameter at every iteration k as

$$\mathbf{w}_{k} = \frac{1}{M} \sum_{j=1}^{M} \mathbf{w}_{k}^{j}, \qquad k = 1, 2, \ldots, \qquad (2)$$

where w_k^j denotes the local parameter of worker j at iteration k. For a differentiable loss function f(w), we choose to perform (2) by the Federated Averaging (FedAvg) algorithm.
Initializing the training process with w_0, Federated Averaging (FedAvg) is a distributed learning algorithm in which the master node sends w_{k−1} to the workers at the beginning of each iteration k ≥ 1. Every worker j ∈ [M] performs a number E of local iterations, i = 1, ..., E, of stochastic gradient descent [7] on a data subset of size ξ_k^j ≤ |D_j|, computing its local parameter as

$$\mathbf{w}_{i,k}^{j} = \mathbf{w}_{i-1,k}^{j} - \alpha \, \nabla f_j\big(\mathbf{w}_{i-1,k}^{j}; \xi_k^j\big), \quad i = 1, \ldots, E, \qquad (3)$$

with step size α > 0 and initial point w_{0,k}^j = w_{k−1} [28], for any k = 1, ..., K, where w_k^j = w_{E,k}^j. Then each worker transmits w_k^j to the master node for updating w_k according to (2). Note that when E = 1 and the exact gradient vector is used in place of the stochastic gradient, FedAvg reduces to the basic FL algorithm. Considering the FedAvg solver (3) for the updating process in (2), and without enforcing convexity of f(w), we use the following remark throughout the paper.

Remark 1. With a suitable step size, the FedAvg updates (2) and (3) satisfy the descent property, i.e., they produce a monotonically decreasing sequence (f(w_k))_k.
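For concreteness, the following is a minimal NumPy sketch of one FedAvg round in the spirit of (2) and (3); the least-squares local loss and all parameter values are illustrative choices, not the paper's setup.

```python
import numpy as np

def fedavg_round(w_global, workers, E=1, lr=0.05, batch=32, rng=None):
    """One global iteration of FedAvg, cf. (2)-(3): every worker runs E local
    SGD steps, then the master node averages the local parameters.  The
    least-squares loss below is an illustrative stand-in for f_j."""
    rng = rng or np.random.default_rng()
    local_params = []
    for X, y in workers:                            # worker j holds (X_j, y_j)
        w = w_global.copy()                         # w_{0,k}^j = w_{k-1}
        for _ in range(E):                          # E local iterations, eq. (3)
            idx = rng.choice(len(y), size=min(batch, len(y)), replace=False)
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # grad of 0.5*||Xw-y||^2/n
            w -= lr * grad
        local_params.append(w)                      # w_k^j = w_{E,k}^j
    return np.mean(local_params, axis=0)            # eq. (2): average at the master

# Toy usage: M = 4 workers, d = 10 features.
rng = np.random.default_rng(0)
workers = [(rng.normal(size=(100, 10)), rng.normal(size=100)) for _ in range(4)]
w = np.zeros(10)
for k in range(20):
    w = fedavg_round(w, workers, E=5, rng=rng)
```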
The workers use the FedAvg update (3) to compute their local parameters w_k^j, while the master node performs the iterations of (2) until a convergence criterion on f(w_k) − f(w*) is met [30]. We denote by K the first iteration at which the stopping criterion of the FedAvg algorithm is met, namely

$$K := \min\{k : f(\mathbf{w}_k) - f(\mathbf{w}^{*}) \le \epsilon\}, \qquad (4)$$

where ε > 0 is the decision threshold for terminating the algorithm at iteration K and f(w*) is the optimum of the loss function at the optimal parameter w*. The state-of-the-art literature sets the threshold ε independently before training. However, an optimal threshold should be designed to optimize the communication-computation resources spent in solving (1). Since knowing f(w*) beforehand is not realistic, we propose an alternative approach to find K in (4) without this prior knowledge. Our main contribution is determining K as a function of the communication-computation cost and the loss function of the FedAvg algorithm (3). We will substantiate this result in Section II-B. Let c_k > 0, k = 1, 2, ..., denote the cost of performing a complete communication iteration k. Accordingly, when we run FedAvg, namely an execution of (2) and (3), the complete training process costs Σ_{k=1}^{K} c_k. Some examples of c_k in real-world applications are:
• Communication cost: c_k is the number of bits transmitted in communication iteration k;
• Energy consumption: c_k is the energy needed to perform a global iteration, i.e., to receive w_k at a worker and send {w_k^j}_{j∈[M]} to the master node over the wireless channel;
• Latency: c_k is the overall delay to compute and send parameters from and to the workers and the master node over the wireless channel [11].
Considering latency as the iteration cost, the term c_k for running every training iteration of the FedAvg algorithm (3) is generally given by the sum of four latency components:
1) ℓ_{1,k}: communication latency in broadcasting parameters by the master node;
2) ℓ_{2,k}: computation latency in computing w_k^j at every worker j;
3) ℓ_{3,k}: communication latency in sending w_k^j to the master node;
4) ℓ_{4,k}: computation latency in updating parameters at the master node.
See Section IV for a more detailed modeling of the components of c_k for the slotted-ALOHA, CSMA/CA, and OFDMA protocols.

B. Problem Formulation
To solve optimization problem (1) over a wireless network, the FedAvg algorithm (3) faces two major challenges: every communication iteration incurs a non-negligible communication-computation cost c_k, and the termination iteration K in (4) strongly impacts the overall training cost of solving the optimization problem (1). Thus, selecting an appropriate value for K is crucial to prevent potentially adverse effects on communication-computation resource utilization in FedAvg (3) over wireless networks.
We propose an original optimization of the termination iteration K in the FedAvg algorithm (3) to tackle the challenges above. By explicitly considering the cost of the training iterations, we aim at obtaining an optimal stopping iteration that solves the following optimization problem:
$$\min_{K \in \mathbb{Z}^{+}} \;\; \Big[\, f(\mathbf{w}_K), \; \sum_{k=1}^{K} c_k \,\Big], \qquad (5a)$$
$$\text{s.t.} \;\; \mathbf{w}_k, \; k \le K, \text{ generated by the FedAvg updates (2) and (3)}, \qquad (5b)$$

where Σ_{k=1}^{K} c_k quantifies the overall iteration-cost expenditure for the training of loss function f(w) when transmitting over a particular wireless uplink channel. Note that (5a) is a multi-objective function, which aims at jointly minimizing the training loss function f(w) and the overall iteration cost Σ_{k=1}^{K} c_k. Note also that the values of c_k, for k ≤ K, can in general be a function of the parameter w_k, but neither c_k nor w_k are optimization variables of problem (5). Optimization problem (5) states to devote communication-computation resources as efficiently as possible while performing the FedAvg algorithm (3) to achieve an accurate training result for loss function f(w). Thus, by solving (5), we obtain the optimal number of iterations for the FedAvg algorithm (3), which minimizes the communication-computation costs while also minimizing the loss function of FedAvg.
Remark 2. We have formulated optimization problem (5) according to the "unfolding method" of iterative algorithms [31], where it is ideally assumed that the optimizer knows beforehand (before iterations (2) and (3) occur) what the cost of each communication iteration in (2) would be and when the iterations would terminate. Such an ideal formulation cannot occur in the real world since it assumes knowledge of the future, and is thus called the "non-causal setting". However, this formulation is useful because its solution gives the optimal value of the stopping iteration k*. In this paper, we show that we can convert such a non-causal solution of problem (5) into a practical algorithm in a so-called "causal setting". We will show that the solution of the causal setting given by the practical algorithm is very close to k*. Solving (5) presents several challenges: it is multi-objective, involves integer variables, and contains non-analytical objective and constraint functions with non-explicit dependencies on K. Additionally, the problem is non-causal, making it difficult to determine the optimal K without knowing the w_k's in advance. Addressing such non-explicit and non-causal optimization problems can be highly challenging [25]. In the next section, we propose a practical solution to problem (5).

III. SOLUTION ALGORITHMS
In this section, we present preliminary technical results, propose an iterative solution to (5), and demonstrate that the proposed methods achieve optimal or sub-optimal solutions while converging in a finite number of iterations.

A. Preliminary Solution Steps
In this subsection, we develop some preliminary results toward a solution of optimization problem (5). We start by transforming (5) according to the scalarization procedure of multi-objective optimization [25]. Specifically, we define the joint communication-computation cost and the loss function of the FedAvg algorithm (3) as a scalarization of the overall iteration-cost function Σ_{k=1}^{K} c_k and the loss function f(w_K). Note that such a joint cost is general in the sense that, depending on the values of c_k, it can naturally model many communication-computation costs, including a constant charge per computation and mission-critical applications.
We transform the multi-objective optimization problem (5) into its scalarized version as

$$\min_{K \in \mathbb{Z}^{+}} \; G(K), \qquad (6)$$

where G(K) and C(K) are defined as

$$G(K) := \beta C(K) + (1-\beta) f(\mathbf{w}_K), \qquad C(K) := \sum_{k=1}^{K} c_k. \qquad (7)$$

C(K) is the iteration-cost function representing all the costs the network spends from the beginning of the training until the termination iteration K, and β ∈ (0, 1) is the scalarization factor of the multi-objective scalarization method [25].
The following lemma states that if G(K) is non-increasing up to its minimizer, we can identify the iteration k* at which G(K) is minimized.
Lemma 1. Consider optimization problem (6). Let G(K) be a non-increasing function of K for all K ≤ k*. Then, k* is the index at which the sign of the discrete derivative [32] of G(K) changes for the first time, i.e.,

$$k^{*} = \min\{K \ge 1 : G(K+1) - G(K) > 0\}. \qquad (8)$$
Proof: See Appendix A-A in [33]. In the remainder of this section, we present three algorithms to solve optimization problem (6). First, we discuss the non-causal setting for characterizing the minimizer; then, we introduce a causal setting to design algorithms that achieve practical minimizers for convex and non-convex loss functions. Finally, we establish the optimality and convergence of the algorithms.
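A minimal sketch of the first-sign-change test of Lemma 1, evaluated causally, could look as follows; the cost and loss sequences are placeholders, not measured values.

```python
def causal_stop(beta, costs, losses):
    """First-sign-change test of Lemma 1: return (k_star, k_c), where k_star
    minimizes G(K) = beta*C(K) + (1-beta)*f(w_K) and k_c = k_star + 1 is the
    iteration at which the increase is first observed (causal setting).
    costs[k-1] = c_k and losses[k-1] = f(w_k)."""
    C, G_prev = 0.0, float("inf")        # G(0) = +inf: never stop at k = 1
    for k, (c, f) in enumerate(zip(costs, losses), start=1):
        C += c                           # C(k) = sum_{i<=k} c_i
        G = beta * C + (1 - beta) * f
        if G > G_prev:                   # discrete derivative changed sign
            return k - 1, k              # k* and the causal stop k_c = k* + 1
        G_prev = G
    return len(costs), len(costs)        # G non-increasing throughout: k* = K_max

# Illustrative sequences: constant per-iteration cost, geometrically decaying loss.
k_star, k_c = causal_stop(beta=0.05,
                          costs=[0.5] * 100,
                          losses=[2.0 * 0.9 ** k for k in range(1, 101)])
print(k_star, k_c)   # stops once the cost term dominates the loss decrease
```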

B. Non-causal Setting
An ideal approach to solve problem (6) is an exhaustive search over the discrete set K ∈ [0, +∞). However, this approach requires knowing in advance the sequences (f(w_k))_k and (c_k)_k for all k ∈ [0, +∞), which is not practical, as the sequence of parameters (w_k)_k, and consequently (f(w_k))_k, is not available in advance. For analytical purposes, our non-causal setting assumes that all these values are available at k = 0, enabling us to find the ultimate minimizer k*. While this approach is not feasible in practice, we investigate it to establish a benchmark for the performance evaluation of the subsequent causal solution algorithms (see Section V).

[Algorithm 1: Cost-efficient batch FedCau (pseudocode listing omitted); it returns w_{k_c}, k_c, and G(k_c), and its steps are described line by line in Section III-C.]

C. FedCau for Convex Loss Functions
Here, we propose an approximation of the optimal stopping iteration k*, referred to as k_c. Our analysis demonstrates that k_c can be computed in practice in a causal setting. Under certain conditions, we establish that k_c equals k* or k* + 1. Specifically, when k* = K_max, with K_max denoting the maximum allowable number of iterations, we have k_c = k*; otherwise, k_c = k* + 1 (see Section III-E). We develop three implementation variants of the FedAvg algorithm (3), Algorithms 1-3, with our causal termination approach, FedCau, for solving (6). Algorithms 1 and 2 are batch and mini-batch implementations for convex loss functions, while Algorithm 3 considers non-convex loss functions.
In the batch update of Algorithm 1, the workers compute their local parameters and transmit them to the master node (see lines 6-12). We assume that the local parameter of each worker consists of the value of the local FL model w_k^j and the local loss function f_k^j. The master node updates w_k and f(w_k) upon receiving all local parameters {w_k^j, f_k^j}_{j∈[M]} from the workers at each iteration k (see lines 14-15). Then, the iteration cost c_k is calculated. To prevent termination at the first iteration, we initialize G(0) = +∞; subsequently, the multi-objective cost function G(k) is updated (line 16). A comparison between G(k) and its previous value G(k−1) is made (see line 17) to determine the termination of the iterations (see lines 19-24). In FedAvg, there are many scenarios where only specific workers can upload their local parameters to the master node, resulting in implicit sub-sampling and an approximation of f(w_k), denoted by f̂(w_k). This sub-sampling results in an approximation of the joint communication-computation and FL cost function, denoted by Ĝ(K). Algorithm 2 employs mini-batch updates to avoid excessive resource consumption for marginal test accuracy improvements. It leverages the descent property of the FedAvg algorithm (3) for a monotonically decreasing loss function f(w), as described in Remark 1. Algorithm 2 aims at achieving non-increasing sequences (f(w_k))_k and (G(k))_{k≤k*}.
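Since the pseudocode listing of Algorithm 1 is omitted above, the following is a minimal sketch of its control flow as just described; the callback interfaces local_update and iteration_cost are our assumptions, not the paper's notation.

```python
import numpy as np

def fedcau_batch(w0, workers, beta, iteration_cost, local_update, K_max=1000):
    """A sketch of the control flow of Algorithm 1 (batch FedCau).  Assumed
    interfaces: local_update(w, data) -> (w_k^j, f_k^j), and
    iteration_cost(k) -> c_k (e.g., the measured latency of round k)."""
    w, C, G_prev = w0, 0.0, float("inf")        # G(0) = +inf (no stop at k = 1)
    for k in range(1, K_max + 1):
        updates = [local_update(w, d) for d in workers]    # lines 6-12: local work
        w = np.mean([u[0] for u in updates], axis=0)       # lines 14-15: aggregate
        f = float(np.mean([u[1] for u in updates]))        # global loss f(w_k)
        C += iteration_cost(k)                             # accumulate C(k)
        G = beta * C + (1 - beta) * f                      # line 16: update G(k)
        if G > G_prev:                                     # line 17: first increase
            return w, k, G                                 # k_c = k: terminate
        G_prev = G
    return w, K_max, G_prev                                # budget exhausted
```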
Algorithm 2 introduces partial worker participation and fairness into the training of FedCau. M_k^n denotes the node selection subset at each communication iteration k, and Fair-count[j] denotes the counter of the number of successfully sent local parameters of worker j ∈ M_k^n. We introduce a "Fairness-Factor" F_f ≤ K_max that restricts workers from transmitting more than F_f local parameters until all workers have completed F_f local parameter transmissions. At the first communication iteration k = 1, once a worker j ∈ M_1^n successfully transmits its local parameter w_1^j, it is removed from the selected node subset M_1^n (see lines 7-21). Thus, worker j will not transmit any packets until all workers have sent their local parameters. The master node computes the resource used to perform the first communication iteration as T_s. It considers T_s as a benchmark to determine T ≤ T_s as the maximum allowable time budget for the future iterations k = 2, ..., K (see line 25). Note that at k = 1, low-power workers have a higher probability of transmitting their local parameter, and the latency T is smaller compared to full worker participation. After completing communication iteration k = 1, partial worker participation begins at k ≥ 2, when the master node updates M_{k+1}^n (see line 26). For k ≥ 2, the selected workers j ∈ M_k^n have a time budget T to compute and transmit their local parameters. This constraint creates competition among the selected workers to communicate with the master node. However, some workers may fail to send their local parameters. To address this challenge, we introduce the set M_k^a, which contains the indexes of the successful workers j_s ∈ M_k^n that managed to transmit during iteration k (see line 31). Additionally, the fairness counter of each successful worker, Fair-count[j_s], is increased (see line 32) to influence the selections for future communication iterations. Afterward, the master node updates the global parameter using the local parameters it has received, w_k^j, j ∈ M_k^a, and replaces the missing local parameters by their values from the previous iteration (see lines 30-31). Algorithm 2 utilizes this replacement strategy to ensure the non-increasing behavior of G(k), k = 1, ..., k*, and to maintain a descent sequence of f(w_k), k = 1, ..., k*.

[Algorithm 2: Stochastic cost-efficient mini-batch FedCau (pseudocode listing omitted).]

Since Algorithm 2 considers convex loss functions, the replacement of missing parameters guarantees the descent behavior of the sequence f(w_k), k = 1, ..., k* (Lemma 2). Additionally, the master node updates the selected workers based on the fairness factor, ensuring fair worker participation in the upcoming communication iterations (see lines 37-40). This process requires the master node to retain a memory of all previous local parameters. The remaining part of Algorithm 2 (lines 43-51) handles parameter updates and checks for the potential stopping iteration k_c, similarly to lines 12-20 in Algorithm 1.
Lemma 2. Let f_k^j be the local loss function at communication iteration k for each worker j ∈ [M]. Suppose that each f_j(w) is a convex function w.r.t. w. Then, Algorithm 2 guarantees the decreasing behavior of f(w_k), ∀k.
Proof: See Appendix A-B in [33]. As explained above, Algorithm 2 allows for both full and partial participation, offering fairness in worker participation based on the parameter T. The distinction between full and partial participation lies in the fact that, in partial participation, the update of the global parameter w_k depends on the new local parameters from the subset M_k^n. However, it remains uncertain which workers within the subset successfully transmit their local parameters and which ones fail, particularly when the workers possess non-iid training data. To address this challenge, we introduce the fairness factor F_f to mitigate the impact on the global update. The value of F_f can be tailored to the specific training application, enabling customization of the partial participation scheme.
Another challenge in the partial participation of Algorithm 2 is determining an appropriate time budget T for each iteration. Algorithm 2 suggests selecting a value T < T_s by causally computing T_s in the first iteration, considering full worker participation and excluding background traffic. However, the choice of T depends on the specific learning application, such as healthcare, autonomous driving, or video surveillance. One should choose a suitable time budget T based on the requirements of the learning application. For latency-sensitive scenarios like autonomous driving, where quick decisions are crucial to prevent accidents, a smaller value of T is preferred.
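A simplified sketch of the fairness bookkeeping described above follows; the reset rule and the randomized choice among eligible workers are our assumptions.

```python
import random

def select_workers(M, fair_count, F_f, n, rng=random):
    """Sketch of the Fairness-Factor rule of Algorithm 2: a worker that has
    already made F_f successful uploads is excluded until every worker has
    caught up.  The reset rule and random sampling are our simplifications."""
    eligible = [j for j in range(M) if fair_count[j] < F_f]
    if not eligible:                       # all workers reached F_f: open a new window
        eligible = list(range(M))
    return rng.sample(eligible, min(n, len(eligible)))

# After each round k, for every worker j_s that uploaded within the budget T:
#     fair_count[j_s] += 1
# The master keeps the previous w_k^j of the workers that missed the budget
# (the replacement strategy that preserves the descent of f(w_k), Lemma 2).
```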

D. FedCau for Non-convex Loss Functions
Here, we extend Algorithms 1 and 2 to non-convex loss functions. We consider FedAvg [5], where each worker j performs E ≥ 1 local iterations over its local data subset of size ξ_k^j ≤ |D_j|. The master node updates the global parameter w_{k+1} by averaging the local parameters and broadcasting the result to the workers. Additionally, the master node calculates F(w_k) by averaging the local loss functions of the workers [5].
We design a cost-efficient algorithm that optimizes G̃(K), an estimate of the multi-objective cost function G(K), defined as G̃(K) := βC(K) + (1 − β) F̃(w_K), where C(K) is the iteration-cost function at K. We work with such an estimate because the stochastic nature of the sequence (F̃(w_k))_k, which arises from the local updates on mini-batches of size ξ_k^j ≤ |D_j|, results in a stochastic sequence (G̃(k))_k. This sequence (G̃(k))_k hinders the application of Algorithms 1 and 2 and might lead to their early stopping at some communication iteration. Thus, we need to develop an alternative algorithm.
We propose a causal approach to establish non-increasing upper and lower bounds, G_u(K) and G_l(K), for the stochastic sequence (G̃(k))_k. As this sequence is not necessarily non-increasing and may have multiple local optimum points, we aim at obtaining an interval k_c^u ≤ k_c ≤ k_c^l, where k_c^u and k_c^l are the stopping iterations for the functions G_u(K) and G_l(K), respectively. According to the definition of the function G̃(K), we define the functions G_u(K) and G_l(K) as

$$G_u(K) := \beta C(K) + (1-\beta) F_u(\mathbf{w}_K), \qquad G_l(K) := \beta C(K) + (1-\beta) F_l(\mathbf{w}_K), \qquad (9)$$

where F_u(w_K) and F_l(w_K) represent the estimates of the loss function at the upper and lower bounds, respectively.

[Algorithm 3: Stochastic non-convex cost-efficient mini-batch FedCau (pseudocode listing omitted).]

To obtain the sequences (G_u(k))_k and (G_l(k))_k, the master node computes the upper and lower bounds of F̃(w_k) while ensuring the monotonically decreasing behavior of (F_u(w_k))_k and (F_l(w_k))_k to satisfy Remark 1. In the following, we concentrate on the process of obtaining the bounds of F̃(w_k).
Algorithm 3 shows the steps required for cost-efficient FedAvg in the causal setting with a non-convex loss function F(w). Lines 3-18 summarize the local and global iterations of FedAvg. Here, we introduce M_a as the set of workers that successfully transmit their local parameters to the master node (see line 20). We initialize F_u(w_k) = F_l(w_k) = F̃(w_k), k ≤ 2, for the first two iterations (see line 25). For iterations k ≥ 3, if the new value of the loss function fulfills F̃(w_k) ≥ F̃(w_{k−1}), the algorithm sets F_u(w_k) = F̃(w_k) (see line 28). Then, the algorithm checks whether F̃(w_k), which now equals F_u(w_k), is greater than the previous value F_u(w_{k−1}) (see line 29). This check is important because we must build a monotonically decreasing sequence (F_u(w_i))_{i=1:k_c^u}. When F̃(w_k) ≥ F_u(w_{k−1}), the master node goes back through the history (F_u(w_i))_{i=1:k−1} and finds the iteration i < k at which the condition F_u(w_i) > F_u(w_k) is satisfied. Since at each communication iteration k we carefully check the monotonic behavior of F_u(w_k), we are sure that if we find the largest communication iteration i < k for which F_u(w_{i+1}) < F_u(w_k) < F_u(w_i), then F_u(w_j) > F_u(w_k) for all j < i. We define this communication iteration i as k_max^u (see line 30). Thus, it is enough to find such i and update the sequence (F_u(w_{i_1}))_{i_1=k_max^u,...,k} to obtain the monotonically decreasing upper bound. We choose a monotonically decreasing linear update because it satisfies the sufficient decrease condition (see [29], Section 11.5). Therefore, we preserve both the decreasing behavior of F_u(w_k) and the upper-bound property, meaning that the values of F̃(w_k) always stay below F_u(w_k). Thus, we update the sequence (F_u(w_i))_{i=k_max^u,...,k} by linear interpolation according to (11). Next, we need to update F_l(w_k). Let us define the difference between two consecutive values F_u(w_k) and F_u(w_{k−1}) as δ_k^u, and the difference between F_l(w_k) and F_l(w_{k−1}) as δ_k^l. Then, we update the corresponding values of (G_u(K))_{K=k_max^u,...,k} and G_l(k), respectively (see lines 34-35). Afterward, we check the condition F̃(w_k) < F_l(w_{k_max^l}), where k_max^l is the last communication iteration at which the value F̃(w_{k_max^l}) was taken as F_l(w_{k_max^l}). If F̃(w_k) < F_l(w_{k_max^l}), we update the lower-bound sequence (see lines 37-38) to avoid over-decreasing the lower-bound function F_l(w_k) through the approximation of line 34. Subsequently, we set k_max^l = k and update the corresponding values (see lines 41-43).
The last condition to check is F_l(w_{k_max^l}) < F̃(w_k) < F̃(w_{k−1}). Under this condition, the monotonically decreasing behavior of F̃(w_k) is satisfied, whereas the decreasing behavior is not guaranteed for the lower bound F_l(w_k). Thus, we set F_u(w_k) = F̃(w_k) and update G_u(k) and G_l(k) accordingly (see lines 45-48). Finally, lines 52-61 show when to stop the algorithm.
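To make the envelope construction concrete, here is a condensed sketch that maintains monotone upper and lower envelopes of a stochastic loss sequence; it compresses the linear back-interpolation of lines 29-33 into a single repair step, so it is deliberately coarser than the update rules of Algorithm 3.

```python
def update_envelopes(F_hist, Fu, Fl):
    """Append the newest loss value F_hist[-1] to the monotone envelopes Fu
    (upper) and Fl (lower).  A condensed sketch of lines 25-48 of Algorithm 3:
    the lower envelope here is a plain running minimum, and the upper envelope
    is repaired by linear back-interpolation from k_max^u whenever the new
    value breaks monotonicity (coarser than the paper's rules)."""
    F = F_hist[-1]
    Fl.append(min(Fl[-1], F) if Fl else F)      # non-increasing and <= F
    if not Fu or F <= Fu[-1]:
        Fu.append(F)                            # monotonicity already holds
        return
    # F broke monotonicity: walk back to the last envelope value above F
    # (the iteration k_max^u) and re-draw a decreasing line down to F.
    i = len(Fu) - 1
    while i > 0 and Fu[i - 1] <= F:
        i -= 1
    start = Fu[i - 1] if i > 0 else F + 1e-9    # anchor strictly above F
    n = len(Fu) - i + 1                         # number of re-drawn segments
    for m in range(len(Fu) - i):
        Fu[i + m] = start + (F - start) * (m + 1) / n
    Fu.append(F)

# Usage: losses, Fu, Fl = [], [], []
# each round k: losses.append(F_k); update_envelopes(losses, Fu, Fl)
```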

E. Optimality and Convergence Analysis
In this subsection, we investigate the existence and optimality of the solution to problem (6) and the convergence of the algorithms that return the optimal solutions. We give the following propositions, which provide the required analysis of Algorithms 1 and 2.
First, we address the monotonic behavior of G(K), K ≤ k*. In practice, we have this desired monotonically decreasing behavior, as we show in Proposition 1 (see Appendix A-C in [33] for the proof).

Remark 3. Proposition 1 implies that, without loss of generality, we can assume that max_k c_k is high enough and min_k c_k is close to zero (setting the initial cost to zero, for example). Thus, in practice, β can vary between 0 and 1 without restricting the applicability range of the multi-objective optimization.
Proposition 2. Optimization problem (6) has a finite optimal solution k*.

Proof: See Appendix A-D in [33]. Proposition 2 implies that when G(K) is monotonically decreasing in K, k* equals K_max. According to the training setup, the maximum number of iterations K_max is set at the beginning of the training. Thus, a monotonically decreasing G(K) results in k* = K_max, which means that the value of the FL loss function is dominant in G(K), and the FedCau procedure reduces to the FedAvg method.
The following Theorem clarifies an important relation between the non-causal and causal solutions of Algorithms 1 and 2.
Theorem 1. Let k* be the solution to optimization problem (6), and let k_c denote the approximate solution obtained in the causal settings of Algorithms 1 and 2. Then, the following statements hold: 1) if k* = K_max, then k_c = k*; 2) otherwise, k_c = k* + 1.

Proof: See Appendix A-E in [33].
Remark 4. Note that k* and k_c are fundamentally different: k_c is obtained from Algorithm 1 or 2, while k* is the optimal stopping iteration that we would compute if we knew beforehand the evolution of the iterations of the FedAvg algorithm (3), i.e., non-causally. Nevertheless, we show that the stopping iteration k_c computed in the causal setting of Algorithms 1 and 2 is almost identical to k*.
Theorem 1 is a central result of our paper, showing that we can develop a simple yet close-to-optimal algorithm for optimization problem (6). In other words, Algorithms 1 and 2 in the causal setting solve problem (6) by taking at most one extra iteration, compared to the non-causal setting, to compute the optimal termination iteration number.
Next, we focus on the convergence analysis of Algorithm 3. From Section III-D, we define F_u(w_k) and F_l(w_k) as the upper- and lower-bound functions of F̃(w_k), respectively, such that the inequalities F_l(w_k) ≤ F̃(w_k) ≤ F_u(w_k) hold for every k ≥ 1. The following remark highlights the important monotonic behavior of F_u(w_k) and F_l(w_k).
Remark 5. The proposed functions F_u(w_k) and F_l(w_k) are monotonically decreasing w.r.t. k, i.e., F_u(w_k) < F_u(w_{k−1}) and F_l(w_k) < F_l(w_{k−1}) for all k ≥ 1. These results hold because we use a linear, monotonically decreasing function of k to update each value of F_l(w_k) and F_u(w_k) for k ≥ 1. Since a monotonically decreasing linear function fulfills the sufficient decrease condition (see [29], Section 11.5), F_u(w_k) and F_l(w_k) are monotonically decreasing w.r.t. k.
Remark 5 indicates that we can apply the batch FedCau update of Algorithm 1 to obtain the causal stopping points of G_u(K) and G_l(K), denoted by k_c^u and k_c^l, respectively. Therefore, according to Proposition 2, there are finite optimal stopping iterations minimizing G_u(K) and G_l(K). Thus, Theorem 1 is valid for k_c^u and k_c^l, and we can guarantee the convergence of G_u(K) and G_l(K). The following proposition characterizes the relation of the causal stopping iteration k_c of G̃(K) with k_c^u and k_c^l.

Proposition 3. Let k_c, k_c^u, and k_c^l be the causal stopping iterations minimizing G̃(K), G_u(K), and G_l(K), respectively. Then, the inequalities k_c^u ≤ k_c ≤ k_c^l hold.

Proof: See Appendix A-F in [33]. Proposition 3 characterizes an interval in which k_c can take values to stop Algorithm 3. As k_c^u ≤ k_c ≤ k_c^l, it is enough to find k_c^u and terminate the algorithm; however, the maximum allowable number of iterations is k_c^l, which can be reached if the resource budget allows. Using Proposition 3, we can obtain a sub-optimal k_c by applying the FedCau update of Algorithm 3 to non-convex loss functions.
Lemma 3 (proof in Appendix A-A) specifies that the maximum distance between the upper- and lower-bound functions F_u(w_k) and F_l(w_k), k = 1, ..., K, is determined by the variations of the non-convex sequence F̃(w_k), k = 1, ..., K. In the following proposition, we investigate the tightness of the interval (k_c^u, k_c^l).

Proposition 4. Let K_max, k_c^u, and k_c^l be the maximum number of iterations and the causal stopping iterations minimizing G_u(K) and G_l(K), respectively. Recall the definition of k_max^u in line 29 of Algorithm 3. Then, the gap k_c^l − k_c^u is bounded in terms of the costs c_k and the variations of the non-convex sequence F̃(w_k), k = k_c^u + 1, ..., K_max.

Proof: See Appendix A-B. Proposition 4 states that the tightness of the interval (k_c^u, k_c^l) is mainly determined by c_k and the variations of the non-convex sequence F̃(w_k), k = k_c^u + 1, ..., K_max. To summarize, FedCau is applicable to both full and partial worker participation, as well as to both monotonically decreasing and non-monotonic f(w_k). Specifically, we have used the FedCau theory to propose Algorithm 3, which obtains a sub-optimal solution for k_c when f(w_k) is not monotonically decreasing.

F. Complexity Analysis of Algorithms 1-3
In this part, we analyze the computation complexity of Algorithms 1-3 and compare it with that of FedAvg. Recall that in the FedCau Algorithms 1-3, the stopping iteration satisfies k_c ≤ K_max, where K_max is the number of FedAvg global iterations. Assuming training with a neural network with N_l layers and at most d_N neurons per layer, the backpropagation of local gradients at each worker j over E local iterations has complexity O(E N_l d_N^3). Thus, Algorithm 1 has complexity O(k_c E(|D| d + N_l d_N^3)), which is less than or equal to the complexity of FedAvg, O(K_max E(|D| d + N_l d_N^3)), since k_c ≤ K_max. Finally, the complexity of Algorithm 3, considering the neural network setting mentioned above, is obtained analogously with the stopping iteration k_c^l in place of k_c, i.e., O(k_c^l E(|D| d + N_l d_N^3)).

IV. APPLICATION TO COMMUNICATION PROTOCOLS
We consider wireless communication scenarios with a broadcast channel in the downlink from the master node to the workers. In the uplink, we consider three communication protocols by which the workers transmit their local parameters to the master node: slotted-ALOHA [34], CSMA/CA [26] with a binary exponential backoff retransmission policy [35], and OFDMA [27]. We assume that in each communication iteration k, the local parameters are set at the head of the line of each node's queue, ready to be transmitted. Thus, upon receiving w_k, each worker j ∈ [M] computes its local parameter w_k^j and puts it at the head of the line of its transmission queue. In a parallel process, each worker may generate some background traffic, place it in the same queue, and send it according to a first-in-first-out queuing policy. We obtain the average end-to-end communication-computation latency at each iteration k, denoted by c_k, for the slotted-ALOHA and CSMA/CA protocols by taking an average over the randomness of the protocols. Hence, by the end of communication iteration K, the network has incurred a latency equal to Σ_{k=1}^{K} c_k. That is, we consider each time slot (in ms) and sum up the computation delay and time slots spent in each communication iteration k to obtain c_k, then follow Algorithms 1, 2, and 3 to solve optimization problem (6).
A critical point to consider is that we should choose a stable network in which packet saturation does not happen. We only consider the latency of transmitting the local parameters, positioned at the head of the line of the queues, which is influenced by the number of workers M, the transmission probability p_x, and the packet arrival probability p_r at each time slot. The local parameters at each iteration k are distinct from the background traffic packets, whose arrivals are governed by the probability p_r.
Recall the definition of the communication-computation cost components ℓ_{1,k}, ℓ_{2,k}, ℓ_{3,k}, and ℓ_{4,k} in Section II-A. For ℓ_{1,k}, we consider a broadcast channel with data rate R bits/s and parameter size b bits (including payload and headers), leading to a constant latency of ℓ_{1,k} = b/R s. Also, it is natural to assume that ℓ_{4,k} is a given constant for updating the parameters at the master node [36]. The computation latency ℓ_{2,k}^j at each iteration k of each worker j ∈ [M] is calculated as ℓ_{2,k}^j = a_k^j |D_k^j| / ν_k^j, where a_k^j is the number of processing cycles to execute one sample of data (cycles/sample), |D_k^j| ≤ |D_j| is the subset of the local dataset each worker uses to update its local parameter w_k^j, and ν_k^j is the central processing unit frequency (cycles/s) [37]. Without any loss of generality, we consider |D_k^j| = |D_j|, k = 1, ..., K. We assume that all the worker nodes start transmitting their local parameters simultaneously; thus, the network must wait for the slowest worker to complete its computation. Therefore, ℓ_{2,k} = max_{j∈[M]} ℓ_{2,k}^j. The third term, ℓ_{3,k}, is determined by the channel capacity, the resource allocation policy, and the network traffic. We characterize this term for both the batch and mini-batch update cases with a defined time budget. Furthermore, every specific broadcast channel imposes a particular R and b, which do not change during the optimization process. Therefore, to compute the iteration-cost function Σ_k c_k, we take into account the ℓ_{3,k} and ℓ_{2,k} terms and ignore the latency terms ℓ_{1,k} and ℓ_{4,k}, because they play no role in optimization problem (6) in the presence of a shared wireless uplink channel. Note that in this paper, without loss of generality, c_k := ℓ_{2,k} + ℓ_{3,k}, in which ℓ_{2,k} is independent of the communication channels/protocols. We wish to obtain an upper bound on the communication delay when the users in the network follow MAC protocols, such as slotted-ALOHA and CSMA/CA, to transmit the local parameters of the FedAvg algorithm (3) to the master node [35], [38]. There are many papers in the literature computing the average transmission delay of MAC protocols. However, we have the specific assumption that, at each communication iteration k, each worker puts its local parameter at the head of the line of its queue, making it ready for transmission. Note that in the FedAvg algorithm (3), the master node needs to receive all the local parameters to update the new global parameter w_k. Accordingly, we calculate the average latency of the system until all workers have successfully transmitted at least one packet to the master node. The following proposition establishes bounds on the average transmission latency ℓ_{3,k}.

Proposition 5. Consider random access MAC protocols in which the local parameters of the FedAvg algorithm (3) are head-of-line packets at each iteration k. Let M, p_x, and p_r be the number of nodes, the transmission probability at each time slot, and the background packet arrival probability at each time slot, respectively. Let each time slot have a duration of t_s seconds. Then, the average transmission delay E{ℓ_{3,k}} satisfies the bounds in (17), where p is the probability of an idle time slot.
Proof: See Appendix A-G in [33]. Proposition 5 introduces the bounds on the transmission delay, and thus on c_k, for the slotted-ALOHA and CSMA/CA communication protocols. Recall that c_k = ℓ_{2,k} + ℓ_{3,k}; then, by considering the slowest worker in the local iteration, the iteration cost c_k is bounded as in (18), which helps us design the communication-computation parameters for FedCau. Note that we consider a setup where the transmission starts simultaneously for all the workers. This is an important setup, based on which we have developed Algorithms 1-3 and the bounds on the iteration cost c_k in Proposition 5 and inequalities (18). The assumption that all workers transmit at each iteration k applies only to Algorithm 1. In Algorithm 2, we can consider either partial or full worker participation, which allows us to skip the slowest worker rather than wait for it at each iteration k. Finally, in Algorithm 3, we have developed a general approach by which FedCau can be applied to any scenario, e.g., full or partial worker participation, non-convex loss functions f(w), or any G(K) with multiple local optimum points. Thus, the assumption that the workers start their transmissions to the master node simultaneously does not contradict the cost-efficiency of FedCau, because we have covered the various scenarios, like full or partial worker participation, in Algorithms 1-3.
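Since the closed-form bounds (17) are derived in Appendix A-G of [33], a small Monte-Carlo sketch of the quantity they bound can be written as follows; the collision model and the way background arrivals (probability p_r) pre-empt the head of the queue are our simplifying assumptions.

```python
import random

def aloha_latency_all_sent(M, p_x, p_r, t_s=1e-3, max_slots=10**6, rng=random):
    """Monte-Carlo sketch of the quantity bounded in Proposition 5: the time
    (seconds) until all M head-of-line local parameters get through a
    slotted-ALOHA channel under a simplified collision model."""
    pending = set(range(M))          # workers whose parameter is not yet delivered
    backlog = [0] * M                # background packets queued ahead, per worker
    for slot in range(1, max_slots + 1):
        for j in pending:            # background traffic arrivals
            if rng.random() < p_r:
                backlog[j] += 1
        tx = [j for j in pending if rng.random() < p_x]
        if len(tx) == 1:             # success iff exactly one node transmits
            j = tx[0]
            if backlog[j] > 0:
                backlog[j] -= 1      # a background packet went through instead
            else:
                pending.discard(j)   # local parameter delivered
                if not pending:
                    return slot * t_s
    return max_slots * t_s           # unstable regime: slot budget exhausted

# E.g.: sum(aloha_latency_all_sent(10, 0.1, 0.01) for _ in range(100)) / 100
```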
Finally, for OFDMA, we consider uplink transmissions in a single-cell wireless system with s = 1, ..., S_c orthogonal subchannels [39]. Let h_l^s and p_l^s be the channel gain and the transmit power of link l on subchannel s, by which worker j sends its local parameters to the master node. The signal-to-noise ratio (SNR) of the uplink is defined as SNR(p_l^s, h_l^s) := p_l^s h_l^s / σ_l^s, and the corresponding data rate (bps/Hz) is r_l^s = log_2(1 + SNR(p_l^s, h_l^s)). The master node randomly decides at each iteration k which worker should use which subchannel link, and the remaining workers do not participate in the parameter uploading.
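For example, the per-subchannel rate and the resulting upload latency of one worker can be computed as follows (a sketch: the channel gain value is made up, while the power and bandwidth figures follow the simulation settings of Section V-A and the 14.7 Mbits payload reported in Section V).

```python
import math

def ofdma_upload_latency(b_bits, p_lin, h_gain, noise_lin, bandwidth_hz):
    """Latency for one worker to upload b bits on one OFDMA subchannel:
    r = log2(1 + SNR) bps/Hz, so latency = b / (r * bandwidth)."""
    snr = p_lin * h_gain / noise_lin
    rate = math.log2(1.0 + snr)              # spectral efficiency, bps/Hz
    return b_bits / (rate * bandwidth_hz)    # seconds

# Illustrative numbers: 23 dBm transmit power, 150 kHz subchannel,
# -170 dBm/Hz noise density, and an assumed channel gain of 1e-9.
p = 10 ** ((23 - 30) / 10)                   # 23 dBm -> watts
n0 = 10 ** ((-170 - 30) / 10) * 150e3        # noise power over 150 kHz, watts
print(ofdma_upload_latency(b_bits=14.7e6, p_lin=p, h_gain=1e-9, noise_lin=n0,
                           bandwidth_hz=150e3))
```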

V. NUMERICAL RESULTS
In this section, we illustrate our results from the previous sections. We numerically show the extensive impact of the iteration costs when running the FedAvg algorithm (3) as a training problem over a wireless network. We use a network with M workers and simulate slotted-ALOHA, CSMA/CA (both with binary exponential backoff), and OFDMA. In each of these networks, we apply our proposed Algorithms 1, 2, and 3. We train FedCau on the well-known MNIST dataset with a non-iid distribution among the workers and on the CIFAR-10 dataset in both the iid and non-iid cases. For the non-iid implementation, we first sort the dataset w.r.t. the label numbers y_i ∈ {0, 1, ..., 9}. In the MNIST dataset, the labels are the digits themselves, while in CIFAR-10, the labels denote airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Afterward, we assign an equal portion of data to each worker j ∈ {1, ..., M}, starting from the beginning of the sorted dataset. Given the sizes of the datasets, CIFAR-10 with 50000 and MNIST with 60000 data samples, the portion of each class assigned to each worker differs between the datasets. Finally, we apply our proposed FedCau on top of existing methods from the literature, namely top-q and LAQ.

A. Simulation Settings
First, we consider solving a convex regression problem over a wireless network using a real-world dataset. To this end, we extract a binary dataset from MNIST (hand-written digits) by keeping the samples of digits 0 and 1 and setting their labels to −1 and +1, respectively. We then randomly split the resulting dataset of 12600 samples among the M workers, each holding samples {(x_ij, y_ij)}, where x_ij ∈ R^784 is data sample i, a vectorized image at node j ∈ [M], with corresponding digit label y_ij ∈ {−1, +1}. We use the convex loss function of [40], where we consider that each worker j ∈ [M] holds an equal share of the samples. Second, we consider a non-convex image classification problem in which the workers use convolutional neural networks (CNNs) with a cross-entropy loss function. The architecture of the CNN consists of a convolutional layer, Conv2D(32, (3, 3)), a MaxPooling2D layer with a pool size of (2, 2), a Flatten layer, two Dense (fully connected) layers of sizes 64 and 10, and a final layer that produces probability distributions over the 10 classes of the CIFAR-10 dataset. Overall, the CNN has 462410 parameters.
We implement the network with M workers performing local updates of w_k^j, ∀j ∈ [M], imposing a computation latency of ℓ_{2,k} on the system. We assume a synchronous network in which all workers start the local iterations of w_k^j simultaneously, right after receiving w_{k−1}. Note that the latency count of c_k at each iteration k starts at the beginning of the local iterations and ends when the uplink process is complete. Regarding the computation latency, we consider ν_k^j ∈ [10^6, 3 × 10^9] cycles/s and a_k^j ∈ [10, 30] cycles/sample for j = 1, ..., M. For slotted-ALOHA, we consider a capacity of one packet per slot and a slot duration of 1 ms. For CSMA/CA, we consider a packet length of 10 kb with a packet rate of 1 k packets per second, leading to a total rate of 1 Mbps. We set the durations of SIFS, DIFS, and each time slot to 10 µs, 50 µs, and 10 µs, respectively [41], and run the network 1000 times. In the OFDMA setup, we consider the uplink of a single-cell system with a coverage radius of r_c = 1 km. There are L_p cellular links on S_c subchannels. We model the subchannel power gain as h_l^s = ζ/r^3, following Rayleigh fading, where ζ has an exponential distribution with unit mean. We consider a noise power spectral density of −170 dBm/Hz in each subchannel and a maximum transmit power of 23 dBm per link. We assume S_c = 64 subchannels, a total bandwidth of 10 MHz, and a subchannel bandwidth of 150 kHz. We define c_k as the latency caused by the slowest worker to send its local parameters to the master node.
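As a quick sanity check of the computation-latency range implied by these numbers (assuming M = 50 and an even split of the 12600-sample binary MNIST set, which are our choices for this example):

```python
# l2^j = a^j * |D_j| / nu^j, with the parameter ranges stated above.
D_j = 12600 // 50                     # samples per worker (assumed even split)
fastest = 10 * D_j / 3e9              # a = 10 cycles/sample, nu = 3e9 cycles/s
slowest = 30 * D_j / 1e6              # a = 30 cycles/sample, nu = 1e6 cycles/s
print(f"l2 per round: {fastest*1e3:.4f} ms to {slowest*1e3:.2f} ms")
# The round then waits for the slowest worker: l2,k = max_j l2,k^j.
```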

B. Performance of FedCau Update from Algorithms 1, 2 and Non-causal Approach
Fig. 1 characterizes the non-causal and causal behaviors, along with the performance of the FedCau updates of Algorithms 1 and 2, for the slotted-ALOHA and CSMA/CA protocols. The general network setup has M = 100, p_x = 1, p_r = 0.2, and a mini-batch time budget of T = 0.3 s. We observe that while the behavior of f(w_k) is similar across the protocols in Fig. 1(a), the iteration-cost function C(K) of the batch update for slotted-ALOHA is the largest among all the setups in Fig. 1(b). This behavior affects the multi-objective function G(K) in Fig. 1(c) and causes an earlier stop. However, the test accuracy is not sacrificed, as shown in Fig. 1(d). From Fig. 1, we conclude that the batch update of Algorithm 1 satisfies the causal setting and preserves the test accuracy while optimizing both the loss function f(w_k) and the latency over the communication protocols.
Fig. 2 characterizes the effect of β on the performance of the batch FedCau update of Algorithm 1 with M = 50, 100 and the CSMA/CA protocol for parameter upload. Fig. 2(a) shows that C(k_c) decreases as β sweeps the interval (0, 1). This decreasing behavior is expected, since higher values of β increase the weight of the term C(K) in the scalarized version (6). Since C(K) is an increasing function of K, a higher weight on C(K) results in stopping at a smaller causal iteration k_c. Finally, Fig. 2(c) shows the test accuracy achieved as β changes. Since k_c decreases as β increases, the corresponding test accuracy decreases. Therefore, choosing β ∈ [0.2, 0.5] gives a lower causal iteration cost and near-optimal test accuracy. Fig. 3 presents the mini-batch FedCau update of Algorithm 2 and the FedAvg baseline for CSMA/CA with p_x = 1, p_r = 0.01, and M = 50. Figs. 3(a)-3(b) show the results for M = 50 with T = 0.5, 0.9, 1.1 s. Fig. 3(a) highlights that with a smaller time budget, C(k_c) decreases, while Fig. 3(b) shows similar test accuracies. Figs. 3(c)-3(e) compare the test accuracy of FedCau in Algorithm 2 with FedAvg for the time budgets T = 0.5, 0.9, 1.1 s, respectively. For these time budgets, the test accuracy of FedAvg is lower than that of the mini-batch FedCau update of Algorithm 2 with the same time budget T. These results highlight the role of F_f combined with T, where F_f ensures participation fairness, especially for smaller T, such as T = 0.5 s. Therefore, for equal T, FedCau in Algorithm 2 outperforms FedAvg in test accuracy and in fairness of worker participation. Figs. 3(f)-3(g) show the behavior of the mini-batch FedCau causal latency and test accuracy w.r.t. F_f for M = 50 and T = 1.1 s. Fig. 3(f) demonstrates that the causal latency increases for both small and large fairness factors F_f. Meanwhile, Fig. 3(g) shows that the test accuracy decreases as F_f increases, due to the lack of participation fairness. For smaller F_f, the participation fairness results in better test accuracy, while a higher causal latency arises from the more frequent transmissions of low-power workers.

C. Impact of Communication Parameters on FedCau Performance
Fig. 4 characterizes the iteration-cost function C(K) for the same setup as in Fig. 1. The iteration-cost function for slotted-ALOHA is larger than that of CSMA/CA, as we see in Figs. 4(a) and 4(c). On the other hand, the iteration-cost function for CSMA/CA increases exponentially as the probability p_r increases, as shown in Fig. 4(b). This result also holds for the bounds on the iteration cost in Eq. (17), as Fig. 4 shows. Furthermore, the results in Fig. 4(c) show that C(k_c) increases at a slower rate than M, such that

C(k_{c2})/C(k_{c1}) < M_2/M_1, for M_2 > M_1,

where M_2 and M_1 are numbers of workers, and C(k_{c2}) and C(k_{c1}) are the total communication-computation costs at the causal stopping iterations k_{c2} and k_{c1}, respectively. Thus, considering full worker participation as the worst case when investigating scalability, we conclude that the total communication-computation cost of FedCau is scalable in M.
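The gap between the slotted-ALOHA and CSMA/CA cost curves can be reproduced qualitatively with a toy collision model. The sketch below estimates how many slots one iteration needs until all M workers deliver their packet, with background traffic arriving with probability p_r per slot. Treating each upload as a single-slot packet, modeling background traffic as one interfering node, and ignoring capture effects are our assumptions, not the paper's exact channel model.

```python
import numpy as np

rng = np.random.default_rng(1)

def aloha_iteration_slots(M, p_x, p_r, max_slots=10**6):
    """Slots until all M workers succeed in slotted-ALOHA (toy model).

    Each pending worker transmits with probability p_x per slot; background
    traffic occupies the slot with probability p_r.  A slot succeeds for a
    worker only if it is the sole transmitter."""
    pending = M
    for slot in range(1, max_slots + 1):
        tx = rng.random(pending) < p_x          # workers attempting this slot
        background = rng.random() < p_r         # interfering traffic
        if tx.sum() == 1 and not background:
            pending -= 1                        # unique transmitter: success
        if pending == 0:
            return slot
    return max_slots

# Note: with p_x = 1 and several pending workers this toy model deadlocks in
# collisions, which is why real ALOHA costs blow up; smaller p_x resolves
# contention.  Average over a few runs:
print(np.mean([aloha_iteration_slots(10, 0.1, 0.2) for _ in range(20)]))
```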

D. Performance of Non-convex FedCau from Algorithm 3
The experimental results presented in Fig. 5 investigate the impact of the number of local iterations E on the performance of the mini-batch FedCau update of Algorithm 3. The study focuses on the CIFAR-10 iid dataset and a CNN architecture, employing CSMA/CA with M = 50 and M = 100, along with p_x = 0.8 and p_r = 0.01. Fig. 5(a) reveals distinct behaviors of the causal test accuracy with respect to E for M = 50 and M = 100: while the changes in test accuracy are less pronounced for M = 50, the corresponding values are lower than for M = 100. Fig. 5(b) shows the causal stopping iterations k_c^u and k_c^l, which decrease as E increases. Additionally, the tightness of the interval (k_c^u, k_c^l) established in Proposition 4 is validated by the variations of the non-convex sequences of F(w_k). Moreover, Fig. 5(c) shows the causal iteration cost C(K) as a function of E, which increases as E increases. This observation highlights the significant impact of computation latency on the performance of FedCau. Based on the findings in Fig. 5, we recommend E = 10 as the number of local iterations, since it provides the best accuracy with a lower causal iteration cost than E > 10. These results offer insights into selecting E and into the trade-off between E, test accuracy, iteration cost, and causal stopping iterations.

Fig. 6 compares the performance of the mini-batch FedCau update of Algorithm 3 under iid and non-iid data distributions with M = 100, E = 10, and K_max = 200, averaged over 100 realizations to obtain smoother curves. Notably, FedAvg with K_max = 200 increases the total iteration cost by 55% (non-iid) and 20% (iid), while its test accuracy improvement over FedCau is only 2.2% (non-iid) and 0.65% (iid). We observe that non-iid FedCau terminates at iteration k_c^l = 93, while the FedAvg test accuracy curve flattens at iteration k = 101. Moreover, the test accuracy of non-iid FedCau at k_c^l = 93 is 1.2% higher than that of non-iid FedAvg at iteration k = 101. The communication cost of uploading the local parameters at every extra iteration is high; thus, stopping the training at a proper iteration saves a large amount of communication resources (14.7 Mbits per iteration per worker). As a result, FedCau, knowing when to terminate the training, i.e., at k_c^l = 93, is significantly superior to FedAvg in terms of saving communication-computation resources while achieving higher test accuracy.

We choose LAQ because it achieves the same linear convergence as gradient descent while effecting major savings in communication resources [8]. Among the compression methods, we choose top-q sparsification because it suffers the least from non-i.i.d. data and the training converges; moreover, when applying top-q to the logistic regression classifier trained on MNIST, convergence does not slow down [7]. In contrast to the previous numerical results, which characterize the overall latency as the iteration cost c_k, k ≥ 1, here we consider the number of bits per communication iteration as c_k, k ≥ 1. In LAQ, b denotes the element-wise number of bits of the local parameters, and we train the FedAvg algorithm over the MNIST dataset. Moreover, in the top-q method, we vary the kept fraction of the dimension of each local parameter as 0 < q ≤ 1, with each element containing 32 bits. TABLE I compares FedAvg and FedCau with and without the communication-efficient methods LAQ and top-q. In Table I, FedCau LAQ with b = 2 achieves 94.2% test accuracy using the fewest bits (a total cost of 4.56 Mbits).
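When the iteration cost is measured in bits, the per-iteration uplink load has a simple closed form. The helper below is a sketch under a 32-bit-float assumption; the index overhead we charge for top-q is our own assumption, and the paper's exact accounting may differ.

```python
import math

def bits_per_iteration(M, d, scheme="full", b=32, q=1.0):
    """Uplink bits per global iteration for M workers and model dimension d.

    scheme = "full":  every element sent as a 32-bit float.
    scheme = "laq":   b-bit quantization per element (LAQ-style).
    scheme = "topq":  a fraction q of the d elements, 32 bits each, plus
                      ceil(log2(d)) index bits per kept element (assumed)."""
    if scheme == "full":
        per_worker = 32 * d
    elif scheme == "laq":
        per_worker = b * d
    elif scheme == "topq":
        kept = int(q * d)
        per_worker = kept * (32 + math.ceil(math.log2(d)))
    else:
        raise ValueError(f"unknown scheme {scheme!r}")
    return M * per_worker

# E.g., a model with d such that 32*d matches the quoted 14.7 Mbit per
# worker per iteration, uploaded by M = 100 workers:
d = int(14.7e6 / 32)
print(bits_per_iteration(100, d) / 1e6, "Mbit per iteration")
```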
To explore the trade-off between communication cost and test accuracy in the FedAvg baseline, we examine four stopping iterations, namely 55, 56, 57, and 60, which are close to the FedCau stopping iteration k_c = 56. We can set these FedAvg stopping iterations only because we have already obtained k_c from FedCau; we choose stopping iterations close to k_c for a fair comparison and to show the superiority of FedCau in test accuracy and overall communication cost. We emphasize that these stopping iterations for FedAvg cannot be set beforehand in practice. When terminating FedAvg at iteration 55, the achieved accuracy is 2.32% lower than FedCau's, while offering a 1.82% reduction in communication cost. Similarly, FedAvg with a stopping iteration of 57 requires a 1.82% increase in communication cost to achieve a marginal improvement of 0.14% in test accuracy over FedCau. Furthermore, for FedAvg with stopping iteration 60, FedCau saves 7.2% of the total cost with only a minor reduction of 0.382% in test accuracy. These findings highlight the effectiveness of FedCau in selecting the appropriate stopping iteration: terminating the training before k_c is inefficient in terms of test accuracy, while continuing after k_c becomes costly with minimal accuracy improvements. Moreover, the results for FedAvg with stopping iteration 56, the same as FedCau's, show that FedCau with causal termination k_c outperforms FedAvg in test accuracy. Furthermore, we compare FedCau LAQ with b = 2 and k_c = 57 against FedAvg LAQ with b = 2 and stopping iterations of 55 and 60. The test accuracy results indicate that FedCau LAQ with b = 2 outperforms FedAvg, increasing the test accuracy by 3.07% at the cost of a 3.6% higher iteration cost. Thus, FedCau achieves the optimal causal stopping iteration in the context of LAQ with b = 2, considering the trade-off between test accuracy and iteration cost. Finally, comparing FedCau with FedAvg at a stopping iteration of 60, FedAvg achieves a test accuracy of 94.46% with an iteration cost of 4.8 Mbits; compared to FedCau at k_c = 57, FedAvg incurs a 5.27% increase in iteration cost while gaining only a marginal 0.26% improvement in test accuracy. This comparison highlights that beyond k_c, the increase in iteration cost becomes significantly larger than the increase in test accuracy.
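The stopping-point argument can be summarized as a marginal test: continue training only while the extra accuracy per unit of extra cost remains worthwhile. A back-of-the-envelope check using only the percentages quoted above (deltas relative to FedCau at k_c = 56):

```python
# Marginal trade-offs of FedAvg stop points relative to FedCau (k_c = 56),
# using the deltas quoted in the text: (delta_cost %, delta_accuracy %).
stops = {55: (-1.82, -2.32), 57: (+1.82, +0.14), 60: (+7.2, +0.382)}

for k, (d_cost, d_acc) in stops.items():
    print(k, "accuracy per unit cost:", round(d_acc / d_cost, 3))
# Before k_c the ratio is large (stopping early forfeits cheap accuracy);
# after k_c it collapses (extra cost buys almost no accuracy).
```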
We conclude that FedCau obtains the optimal stopping iteration with respect to the iteration cost and the achievable test accuracy, even when applied on top of existing communication-efficient methods such as LAQ and top-q.

VI. CONCLUSION
In this paper, we proposed a framework to design cost-aware FL over networks. We characterized the communication-computation cost of running iterations of generic FL algorithms over a shared wireless channel regulated by the slotted-ALOHA, CSMA/CA, and OFDMA protocols. We posed the communication-computation latency as the iteration-cost function of FL and optimized the iteration-termination criterion to minimize the trade-off between FL's achievable objective value and the overall training cost. To this end, we proposed a causal setting, FedCau, applied in two convex scenarios, for batch and mini-batch updates, as well as in non-convex scenarios.
The numerical results showed that, under the same background traffic, time budget, and network conditions, CSMA/CA incurs a lower communication-computation cost than slotted-ALOHA. We also showed that the mini-batch FedCau update can be more cost-efficient than the batch update when the time budget is chosen properly. Moreover, the numerical results for the non-convex scenario provided a sub-optimal interval around the causal optimal solution that is close to the optimal one, which opens many opportunities for non-convex FL problems. Finally, we applied the FedCau method on top of existing methods such as top-q sparsification and LAQ, characterizing the iteration cost as the number of communicated bits. We concluded that FedCau, with or without LAQ and top-q, obtains the causal termination iteration and, compared to FedAvg, achieves a significantly better trade-off between test accuracy and the total iteration cost of training.
Our future work will extend the FedCau update to broader non-convex scenarios and design communication protocols for cost-efficient FL that account for power allocation.

A. Proof of Lemma 3
The proof follows directly from the definitions of F^u(w_k), F^l(w_k), δ_k^u, and δ_k^l in Algorithm 3. Recall that F^u(w_k) = F(w_k) (see lines 24, 27, 30, and 44) or F^u(w_k) = F^u(w_{k−1}) + δ_k^l (see line 40), and the same arguments hold for F^l(w_k) = F(w_k) or F^l(w_k) = F^l(w_{k−1}) + δ_k^u (see lines 24, 33, 37, and 46). Thus, assuming a finite sequence |F(w_k)|, k = 1, ..., K, the inequality |F^u(w_k) − F^l(w_k)| ≤ F_max − F_min characterizes the tightness between the upper bound F^u(w_k) and the lower bound F^l(w_k).
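The bound can be seen in one line, assuming (as the update rules above guarantee) that both envelopes remain within the range of the observed losses:

```latex
% Both envelopes track values of F, so for every k,
%   F_min <= F^l(w_k) <= F^u(w_k) <= F_max,
% and therefore
\[
  \bigl| F^{u}(w_k) - F^{l}(w_k) \bigr|
  \;\le\; \max_{k'} F(w_{k'}) - \min_{k'} F(w_{k'})
  \;=\; F_{\max} - F_{\min}.
\]
```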

B. Proof of Proposition 4
The stopping iteration k_c given by Algorithm 3 takes the form of an interval, k_c ∈ [k_c^u, k_c^l]. The tightness of this interval depends on the scenario, as we explain in the following. Assume that Algorithm 3 has obtained k_c^u, after which several situations arise for updating F^l(w_k), according to the behavior of F(w_k) for k ≥ k_c^u + 1. There are three different scenarios; in the scenario F(w_k) > F^u(w_{k−1}) (see lines 28-34 of Algorithm 3), the updates of F^u(w_k) and F^l(w_k) follow the linear update proposed in Section III-D: δ_k^u is updated accordingly, F^l(w_k) = F^l(w_{k−1}) + δ_k^u, and F^u(w_k) = F(w_k). We then calculate G^l(k) and the corresponding stopping iteration k_c^l. Therefore, combining the scenarios, we obtain, for k ∈ [k_c^u + 1, K_max],

k_d := the first value of k such that F(w_k) > F^u(w_{k−1}). (29)

Consider a star network of M workers that cooperatively solve a distributed training problem involving a loss function f(w). Consider D as the whole dataset distributed among the workers, where each worker j ∈ [M] holds the local dataset D_j with |D_j| data samples. Let the tuple (x_{ij}, y_{ij}) denote data sample i of the |D_j| samples of worker j, and let w ∈ R^d denote the model parameter at the master node. Considering Σ_{j=1}^M |D_j| = |D| and D_j ∩ D_{j′} = ∅ for j, j′ ∈ [M], j ≠ j′, and defining ρ_j := |D_j|/|D|, we formulate the following training problem:

min_{w ∈ R^d} f(w) := Σ_{j=1}^M ρ_j f_j(w),

where f_j denotes the local empirical loss of worker j over D_j.
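To make the interval detection of Proposition 4 concrete, the sketch below is a simplified stand-in for Algorithm 3: running envelopes around the noisy loss yield two scalarized objectives, and the first increase of each gives the interval endpoints k_c^u and k_c^l. The envelope construction (running extrema over a short window, the `window` parameter) is our assumption; Algorithm 3's δ-based linear updates are more refined.

```python
import numpy as np

def causal_interval(losses, costs, beta=0.3, window=5):
    """Return (k_c^u, k_c^l) from upper/lower envelopes of a noisy loss.

    Envelopes are running max/min over the last `window` iterations -- a
    crude stand-in for the delta-based updates of Algorithm 3."""
    losses = np.asarray(losses)
    k_u = k_l = None
    G_u_prev = G_l_prev = float("inf")
    for k in range(1, len(losses) + 1):
        recent = losses[max(0, k - window):k]
        F_u, F_l = recent.max(), recent.min()   # envelopes at iteration k
        G_u = (1 - beta) * F_u + beta * costs[k - 1]
        G_l = (1 - beta) * F_l + beta * costs[k - 1]
        if k_u is None and G_u > G_u_prev:
            k_u = k - 1                         # upper-bound stop: k_c^u
        if k_l is None and G_l > G_l_prev:
            k_l = k - 1                         # lower-bound stop: k_c^l
        G_u_prev, G_l_prev = G_u, G_l
    return k_u, k_l
```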

Lemma 3. Let F^u(w_k) and F^l(w_k) be, respectively, the upper and lower bounds of F(w_k) obtained from the stochastic non-convex cost-efficient mini-batch FedCau Algorithm 3. Define F_max := max_{k∈[2,K]} F(w_k) and F_min := min_{k∈[2,K]} F(w_k). Assuming |F(w_k)| < ∞ for k = 1, ..., K, the inequality |F^u(w_k) − F^l(w_k)| ≤ F_max − F_min characterizes the tightness between the upper bound F^u(w_k) and the lower bound F^l(w_k).

Fig. 1: Illustration of the non-causal and FedCau batch update of Algorithm 1 and the FedCau mini-batch update of Algorithm 2 with T = 0.3 s in the presence of slotted-ALOHA, CSMA/CA, and OFDMA for M = 100, p_x = 1, and p_r = 0.2. a) Loss function f(w_k); b) iteration-cost function C(K); c) multi-objective cost function G(K); d) test accuracy.

Fig. 2: Illustration of the effect of β on the performance of the batch FedCau update of Algorithm 1, M = 50, 100, and CSMA/CA with p_x = 1, p_r = 0.01. a) The causal iteration cost C(k_c) decreases as β increases; b) the causal stopping iteration k_c is smaller for larger β; c) the test accuracy also decreases as β increases.

Fig. 4: Illustration of the batch FedCau update of Algorithm 1: iteration costs C(k*) and C(k_c) and the bounds of Eq. (17) vs. the transmission probability p_x, the arrival probability p_r, and the network size M.

Fig. 5: Illustration of the effect of the number of local iterations E on the performance of the mini-batch FedCau update of Algorithm 3 for non-convex loss functions with the CIFAR-10 iid dataset and CSMA/CA, M = 50, 100, p_x = 0.8, and p_r = 0.01. a) Test accuracy vs. E; b) causal stopping iterations k_c vs. E; c) iteration cost vs. E.


Fig. 6: Performance comparison of CIFAR-10 iid and non-iid data in the mini-batch FedCau of Algorithm 3 for non-convex loss functions with CSMA/CA, M = 100, E = 10, K_max = 200, p_x = 0.8, and p_r = 0.01. a) Test accuracy for the CIFAR-10 iid and non-iid datasets obtained at the lower-bound causal stopping iteration k_c^l; b) loss function F(w_k) with its upper bound F^u(w_k) and lower bound F^l(w_k) for iid data, and c) for non-iid data; d) comparison between FedCau and FedAvg.
We investigate the training and test performance of the proposed algorithms using the MNIST and CIFAR-10 datasets over the communication protocols slotted-ALOHA, CSMA/CA, and OFDMA. We consider these protocols because they are the dominant communication protocols in most wireless local area networks, such as IEEE 802.11-based products [26], or fixed-assignment access protocols such as OFDMA [27].

TABLE I: Comparison between FedCau and FedAvg with and without LAQ and top-q, M = 50, and K_max = 200.