Federated TD Learning over Finite-Rate Erasure Channels: Linear Speedup under Markovian Sampling

Federated learning (FL) has recently gained much attention due to its effectiveness in speeding up supervised learning tasks under communication and privacy constraints. However, whether similar speedups can be established for reinforcement learning remains much less understood theoretically. Towards this direction, we study a federated policy evaluation problem where agents communicate via a central aggregator to expedite the evaluation of a common policy. To capture typical communication constraints in FL, we consider finite-capacity up-link channels that can drop packets based on a Bernoulli erasure model. Given this setting, we propose and analyze QFedTD, a quantized federated temporal difference learning algorithm with linear function approximation. Our main technical contribution is to provide a finite-sample analysis of QFedTD that (i) highlights the effect of quantization and erasures on the convergence rate; and (ii) establishes a linear speedup w.r.t. the number of agents under Markovian sampling. Notably, while different quantization mechanisms and packet-drop models have been extensively studied in the federated learning, distributed optimization, and networked control systems literature, our work is the first to provide a non-asymptotic analysis of their effects in multi-agent and federated reinforcement learning.


Introduction
Is it possible to obtain statistical models of high accuracy for supervised learning problems (e.g., regression, classification, etc.) by aggregating information from multiple devices while keeping the raw data on these devices private? This is the central question of interest in the popular machine learning paradigm of federated learning (FL) [1][2][3]. When the data-generating distributions of the participating devices are identical (or sufficiently similar), several works have shown that one can reap the benefits of collaboration by exchanging locally trained models via a central aggregator (server) [4][5][6][7][8][9][10][11][12][13][14]. In practice, these models are typically high-dimensional and need to be exchanged over unreliable communication links of limited bandwidth. As such, a large body of work in FL has investigated the effects of uploading quantized models (or model-differentials, i.e., gradients) over channels prone to packet drops/erasures [15,16]. Drawing inspiration from this literature, in this paper, we ask: Can we establish collaborative performance gains for federated reinforcement learning (FRL) problems subject to similar communication challenges? As it turns out, little to nothing is known about this question from a theoretical standpoint.
Towards this direction, we study one of the most basic problems in RL, namely policy evaluation, in a federated setting. Specifically, in our problem, N agents, each of whom interacts with the same Markov Decision Process (MDP), communicate via a server to evaluate a fixed policy. While each agent can evaluate the policy on its own using Monte-Carlo sampling or temporal difference (TD) learning algorithms [17,18], the reason for communicating is the same as in the standard FL setting: to achieve an N-fold speedup in the sample-complexity of policy evaluation relative to when an agent acts alone. In the recent survey paper on FRL [19], the authors mention that the goal of the FRL framework is to achieve such speedups while respecting privacy constraints, i.e., without revealing the raw data (states, actions, and rewards) of the agents. Relative to the FL setting, proving finite-time rates for FRL is significantly more challenging since we need to deal with temporally correlated Markovian samples. Indeed, even for the single-agent setting, finite-time rates under Markovian sampling have only recently been established [20][21][22][23]. Works prior to these developments either provided a finite-time analysis under a restrictive i.i.d. sampling assumption [24,25], or only came with asymptotic guarantees [18,26]. For the multi-agent setting, almost all the prior works on TD learning make a restrictive i.i.d. sampling assumption [27,28]. The only two exceptions to this are the very recent papers [29,30] that establish linear speedups under Markovian sampling; however, none of the above works consider any communication constraints. As such, establishing linear speedups in FRL under Markovian sampling and communication constraints remains largely unexplored. In this regard, our main contributions are as follows.
Contributions. Our first contribution is to formulate a federated policy evaluation problem under two practical constraints on the communication channels: finite capacity and packet drops (lossy links). To capture these constraints, we propose and analyze QFedTD, a federated TD algorithm with linear function approximation where agents upload quantized TD update directions to the server over Bernoulli erasure channels [31,32]. While various quantization and erasure models have been extensively analyzed in the FL [15,16], distributed optimization [33][34][35][36], and networked control literature [31,32] for almost two decades, our work is the first to formally study their non-asymptotic effects in the context of multi-agent/federated RL.
Our second and most significant contribution is to provide a rigorous non-asymptotic analysis of QFedTD that clearly highlights the effects of quantization and erasures, and establishes an N-fold linear speedup in sample-complexity relative to the single-agent setting. Since RL algorithms often require several samples to achieve acceptable accuracy, our speedup result under realistic communication models is of significant practical importance. We now comment on some of the highlights of our analysis relative to [29] and [30]. Our work crucially departs from both these papers in that, in addition to correlated Markovian samples, we need to contend with two other sources of randomness: one due to randomized quantization and the other due to the Bernoulli packet-dropping processes. Even in the absence of communication challenges, our analysis has the following key benefits. Unlike [30], our work does not require any projection step to ensure the boundedness of iterates. Moreover, compared to [30], and the analysis in [29] that relies on Generalized Moreau Envelopes, our proof is significantly shorter and simpler. As a byproduct of this simpler analysis, we derive bounds that have a tighter linear dependence on the mixing time (consistent with the centralized setting) as opposed to the quadratic dependence in [29,30]. In fact, the dependence of O(τ) in our variance bounds (where τ is the mixing time) is information-theoretically optimal [38]. The other natural advantage of our simple proof template is that one can potentially build on it while trying to establish linear speedups for more involved RL settings.

System Model and Problem Formulation
We consider a setting involving N agents, where all agents interact with the same Markov Decision Process (MDP). Let us denote the shared MDP by M = (S, A, P, R, γ), where S is a finite state space of size n, A is a finite action space, P is a set of action-dependent Markov transition kernels, R is a reward function, and γ ∈ (0, 1) is the discount factor. We are interested in a policy evaluation (PE) problem where the agents exchange information via a central aggregator (server) to evaluate the value function associated with a policy µ. Here, the policy is a map from the states to the actions, i.e., µ : S → A. In what follows, we first briefly review some key concepts relevant to PE with function approximation. Then, we formally describe our communication model, objectives, and technical challenges.
Policy Evaluation with Linear Function Approximation. The policy µ to be evaluated induces a Markov Reward Process (MRP) with transition matrix P^µ and reward function R^µ : S → R. The purpose of PE is to evaluate the value function V^µ(s) for each s ∈ S, where V^µ(s) is the discounted expected cumulative reward obtained by playing policy µ starting from initial state s. Formally, we have

V^µ(s) = E[ ∑_{k=0}^{∞} γ^k R^µ(s_k) | s_0 = s ],   (1)

where s_k represents the state of the Markov chain at the discrete time-step k under the action of the policy µ. Our particular interest is in the RL setting where the Markov transition kernels and reward functions are unknown.
In several large-scale practical settings, the size n of the state space S is large, thereby creating a major computational challenge. To work around this issue, we will resort to the popular idea of linear function approximation, where V^µ is approximated by vectors in a linear subspace of R^n spanned by a set of m basis vectors {φ_ℓ}_{ℓ∈[m]}; importantly, m ≪ n. To be more precise, let us define the feature matrix Φ ≜ [φ_1, ..., φ_m] ∈ R^{n×m}. Given a weight (model) vector θ ∈ R^m, the parametric approximation V_θ of V^µ is then given by V(θ) := V_θ = Φθ. If we denote the s-th row of Φ as φ_s, then the approximation of V^µ(s), in particular, is given by V_θ(s) = ⟨θ, φ_s⟩. Throughout, we will make the standard assumption [20] that the columns of Φ are linearly independent and that the rows are normalized, i.e., ‖φ_s‖₂² ≤ 1, ∀s ∈ S.

Communication Model and QFedTD Algorithm. Given the above setup, the goal of the server-agent system is to collectively estimate the model vector θ* corresponding to the best linear approximation of V^µ in the span of Φ. To achieve this goal, we now describe a multi-agent variant of the classical TD(0) algorithm [17]. All agents start out from a common initial state s_0 ∈ S with an initial estimate θ_0 ∈ R^m. Subsequently, at each time-step k ∈ N, a global model vector θ_k is broadcast by the server to all agents. Each agent i ∈ [N] then takes an action a_{i,k} = µ(s_{i,k}), and observes the next state s_{i,k+1} ∼ P^µ(·|s_{i,k}) and instantaneous reward r_{i,k} = R^µ(s_{i,k}); here, s_{i,k} is the state of agent i at time-step k. Using the model vector θ_k and the observation tuple o_{i,k} = (s_{i,k}, r_{i,k}, s_{i,k+1}), agent i computes the following local TD update direction:

g_{i,k}(θ_k, o_{i,k}) = ( r_{i,k} + γ⟨φ_{s_{i,k+1}}, θ_k⟩ − ⟨φ_{s_{i,k}}, θ_k⟩ ) φ_{s_{i,k}}.

We will often use g_{i,k}(θ_k) as a shorthand for g_{i,k}(θ_k, o_{i,k}). Note that although all agents play the same policy µ, and interact with the same MDP, the realizations of the local observation sequences {o_{i,k}} can differ across agents. We assume that these observation sequences are
statistically independent across agents. Intuitively, based on this independence property, one can expect that exchanging agents' local TD update directions should help reduce the variance in the estimate of θ*. This is precisely where the communication-induced challenges we describe below play a role.
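To make the update concrete, the local TD(0) direction can be sketched in a few lines of Python. This is our illustration (not the authors' code), assuming the standard linear-approximation form g_{i,k} = (r_{i,k} + γ⟨φ_{s_{i,k+1}}, θ_k⟩ − ⟨φ_{s_{i,k}}, θ_k⟩)φ_{s_{i,k}}; the function name is ours:

```python
def local_td_direction(theta, phi_s, phi_next, r, gamma):
    # TD error: r + gamma * <phi_{s'}, theta> - <phi_s, theta>
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    td_error = r + gamma * dot(phi_next, theta) - dot(phi_s, theta)
    # The update direction points along the current state's feature vector
    return [td_error * x for x in phi_s]
```

A zero TD error at the sampled transition (i.e., θ already consistent with the Bellman relation there) yields a zero update direction.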
Channel Effects. We model two key aspects of realistic communication channels in large-scale FL settings: finite capacity (due to limited bandwidth) and erasures/packet drops. To account for the first issue, we will employ a simple unbiased quantizer, which is a (potentially random) mapping Q : R^m → R^m satisfying the following constraints [39].

Definition 1. (Unbiased Quantizer) We say that a quantizer Q is unbiased if the following hold for all x ∈ R^m: (i) E[Q(x)] = x, and (ii) there exists some constant ζ ≥ 1 such that E[‖Q(x)‖₂²] ≤ ζ‖x‖₂², where the expectation is w.r.t. the randomness of the quantizer.
The constant ζ captures the amount of distortion introduced by the quantizer. Using any quantizer that satisfies Definition 1, each agent i computes an encoded version Q(g_{i,k}(θ_k)) of its local TD update direction. Here, we assume that the randomness of the quantizer is independent across agents and also independent of the Markovian observation tuples.
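As a concrete instance of Definition 1, one can use per-coordinate stochastic rounding to a uniform grid. The sketch below is our illustration (the grid range and level count are arbitrary choices); unbiasedness follows because the upper grid point is chosen with probability exactly (v − lo)/step:

```python
import math
import random

def stochastic_quantize(x, num_levels=16, scale=1.0):
    # Round each coordinate to one of the two nearest points of a uniform
    # grid with step 2*scale/(num_levels - 1), picking the upper point with
    # probability proportional to the remainder, so that E[Q(x)] = x.
    step = 2.0 * scale / (num_levels - 1)
    out = []
    for v in x:
        lo = step * math.floor(v / step)
        p_up = (v - lo) / step  # in [0, 1); makes the rounding unbiased
        out.append(lo + step if random.random() < p_up else lo)
    return out
```

Averaging many quantizations of the same vector recovers the vector itself, which is exactly property (i) of Definition 1.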
Next, to capture packet drops, we assume that the encoded TD directions are uploaded to the server over Bernoulli erasure channels. Specifically, the transmission of information from an agent i to the server is over a channel whose statistics are governed by an i.i.d. random process {b_{i,k}}, where for each k, b_{i,k} follows a Bernoulli distribution. To be more precise, b_{i,k} = 0 with erasure probability (1 − p), and b_{i,k} = 1 with probability p. The packet-dropping processes are assumed to be independent of all other sources of randomness in our model.
We are now in a position to describe the global model-update rule at the server:

θ_{k+1} = θ_k + (α/N) ∑_{i=1}^{N} b_{i,k} Q(g_{i,k}(θ_k)),   (2)

where α is a constant step-size/learning rate. We refer to the overall updating scheme described above as the Quantized Federated TD learning algorithm, or simply QFedTD.
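Putting the pieces together, one global round of QFedTD at the server might look as follows. This is a hedged sketch: in particular, we assume the server normalizes by N (rather than by the number of successfully received packets), consistent with the (α/N) variance term discussed later; the function name and signature are ours:

```python
import random

def qfedtd_server_step(theta, quantized_dirs, alpha, p, rand=random.random):
    # One QFedTD round: agent i's (already quantized) TD direction arrives
    # only if b_i = 1, which happens with probability p; the server averages
    # whatever it receives over N agents and takes a step of size alpha.
    n_agents = len(quantized_dirs)
    agg = [0.0] * len(theta)
    for g in quantized_dirs:
        b = 1 if rand() < p else 0  # Bernoulli erasure channel
        if b:
            for j in range(len(theta)):
                agg[j] += g[j]
    return [theta[j] + (alpha / n_agents) * agg[j] for j in range(len(theta))]
```

In expectation over the erasures, this step moves θ along α·p times the average TD direction, which is why the convergence rate below is slackened by p.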
Objective and Challenges. The main goal of this paper is to provide a finite-time analysis of QFedTD. This is non-trivial for several reasons. Even in the single-agent setting, providing a non-asymptotic analysis of TD(0) without any projection step is known to be quite challenging due to temporal correlations between the Markov samples. To analyze QFedTD, we need to contend with three distinct sources of randomness: (i) randomness due to the temporally correlated Markov samples {o_{i,k}}_{i∈[N]}; (ii) randomness due to the quantization step; and (iii) randomness due to the Bernoulli packet-dropping processes {b_{i,k}}_{i∈[N]}. Each of these sources of randomness influences the evolution of the parameter vector θ_k. Furthermore, unlike the single-agent setting, our goal is to establish a "linear speedup" w.r.t. the number of agents under the different sources of randomness above. This necessitates a very careful analysis that we provide in Section 4.
Remark 1. We note here that both the quantization mechanism and the channel model studied in this paper are quite simple. We have chosen to stick to these models primarily because the focus of our paper is on establishing the linear speedup effect under Markovian sampling. That said, we conjecture that the analysis in Section 4 can potentially be extended to cover more involved encoding schemes (e.g., the use of error-feedback [40]), and more realistic channels with noise, interference, and non-stationary behavior. We reserve these questions for future work.

Main result
In this section, we state and discuss our main result pertaining to the non-asymptotic performance of QFedTD. First, however, we need some technical preparation. As is standard, we assume that the rewards are uniformly bounded, i.e., ∃ r > 0 such that |R^µ(s)| ≤ r, ∀s ∈ S. This ensures that the value function in (1) is well-defined. Next, we make a standard assumption that plays a key role in the finite-time analysis of TD learning algorithms [18,20,21].
Assumption 1. The Markov chain induced by the policy µ is aperiodic and irreducible.
An immediate consequence of the above assumption is that the Markov chain induced by µ admits a unique stationary distribution π [41]. Let Σ = ΦᵀDΦ, where D is a diagonal matrix with entries given by the elements of the stationary distribution π. Since Φ is assumed to be full column rank, Σ is full rank with a strictly positive smallest eigenvalue ω < 1; ω will later show up in our convergence bounds. Next, we define the steady-state local TD update direction as follows:

ḡ(θ) ≜ E_{s∼π, s′∼P^µ(·|s)} [ ( R^µ(s) + γ⟨φ_{s′}, θ⟩ − ⟨φ_s, θ⟩ ) φ_s ].   (3)

Essentially, the deterministic recursion θ_{k+1} = θ_k + αḡ(θ_k) captures the limiting behavior of the TD(0) update rule. In [20], it was shown that the iterates generated by this recursion converge exponentially fast to θ*, where θ* is the unique solution of the projected Bellman equation Π_D T^µ(Φθ*) = Φθ*. Here, Π_D(·) is the projection operator onto the subspace spanned by {φ_ℓ}_{ℓ∈[m]} with respect to the inner product ⟨·,·⟩_D, and T^µ : R^n → R^n is the policy-specific Bellman operator [18]. We now define the notion of mixing time τ that will play a crucial role in our analysis.

Definition 2. Given a precision ε > 0, let τ_ε be the minimum time such that, ∀k ≥ τ_ε, ∀θ ∈ R^m, ∀i ∈ [N], and ∀s ∈ S:

‖E[g_{i,k}(θ) | s_{i,0} = s] − ḡ(θ)‖₂ ≤ ε(‖θ‖₂ + 1).

Assumption 1 implies that the Markov chain induced by µ mixes at a geometric rate [41], i.e., the total variation distance between P(s_{i,k} = ·|s_{i,0} = s) and the stationary distribution π decays exponentially fast ∀k ≥ 0, ∀i ∈ [N], ∀s ∈ S. This immediately implies the existence of some K ≥ 1 such that τ_ε in Definition 2 satisfies τ_ε ≤ K log(1/ε) [22]. Loosely speaking, this means that for a fixed θ, if we want the noisy TD update direction to be ε-close (relative to θ) to the steady-state TD direction (where both these directions are evaluated at θ), then the amount of time we need to wait for this to happen scales logarithmically in the precision ε. For our purpose, the precision we will require is ε = α^q, where q is an integer satisfying q ≥ 2.
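The geometric mixing guaranteed by Assumption 1 can be checked numerically on a toy chain by brute force. The sketch below (ours, purely illustrative) computes the smallest k at which every row of P^k is within total-variation distance ε of π; for an aperiodic, irreducible chain this grows only logarithmically in 1/ε:

```python
def tv_mixing_time(P, pi, eps, max_steps=10000):
    # Smallest k with max_s TV(P^k(s, .), pi) <= eps, computed by brute
    # force for a small transition matrix P given as a list of rows.
    n = len(P)
    rows = [row[:] for row in P]  # rows of P^k, starting from k = 1
    for k in range(1, max_steps + 1):
        worst = max(
            0.5 * sum(abs(rows[s][j] - pi[j]) for j in range(n))
            for s in range(n)
        )
        if worst <= eps:
            return k
        # advance one step: rows <- rows @ P
        rows = [
            [sum(rows[s][i] * P[i][j] for i in range(n)) for j in range(n)]
            for s in range(n)
        ]
    return None
```

For the two-state chain P = [[0.9, 0.1], [0.2, 0.8]] with π = (2/3, 1/3), the worst-case TV distance after k steps is (2/3)·0.7^k, so halving ε only adds a constant number of steps.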
Unlike the centralized case where q = 1 suffices [20,21], to establish the linear speedup property, we will require q ≥ 2. Henceforth, we will drop the subscript ε = α^q in τ_ε and simply refer to τ as the mixing time. Let us define σ ≜ max{1, r, ‖θ*‖₂}, the "variance" of the observation model for our problem. Finally, let ζ̄ ≜ max{1, ζ}, where ζ is as in Definition 1, and δ_k ≜ ‖θ_k − θ*‖₂. We are now in a position to state the main result of this paper.
Theorem 1. Consider the update rule of QFedTD in (2). There exist universal constants C₀, C₂, C₃ ≥ 1 such that, with α ≤ ω(1−γ)/(C₀τζ̄), the following holds for T ≥ 2τ:

E[δ_T²] ≤ C₂ (1 − αω(1 − γ)p)^{T−2τ} (δ₀² + σ²) + C₃ (τζ̄σ²)/(ω(1 − γ)p) ( α/N + O(α³) ),   (4)

where δ_k ≜ ‖θ_k − θ*‖₂.

Discussion: There are several important takeaways from Theorem 1. From (4), we first note that QFedTD guarantees linear convergence (in expectation) to a ball around θ* whose radius depends on the variance σ² of the noise model. While the linear convergence rate gets slackened by the probability of successful transmission p, the "variance term", namely the second term in (4), gets inflated by the quantization parameter ζ. Both of these channel effects are consistent with what one typically observes for analogous settings in FL [15]. Next, compared to the centralized setting [21, Theorem 7], the variance term in (4) gets scaled down by a factor of N, up to a higher-order O(α³) term that can be dominated by the (α/N) term for small enough α. Before we make this point explicit, it is instructive to note that our variance bound exhibits a tighter dependence on the mixing time τ relative to [29] and [30], where similar bounds are established. In particular, while this dependence is O(τ) for us, it is O(τ²) in [29, Theorem 4.1] and in [30, Theorem 4]. Notably, the O(τ) dependence that we establish is consistent with results on centralized TD learning [20,21], and is in fact the optimal dependence on τ under Markovian data [38]. We have the following immediate corollary of Theorem 1.
Corollary 1. Consider the update rule of QFedTD in (2). Let the step-size α and the number of iterations T be chosen to satisfy the conditions in Eq. (35) below, where C₀ is as in Theorem 1. We then have the bound in Eq. (6).

To appreciate the above result, let us compare it to the result for the single-agent TD setting in [20]. Under Markovian sampling, part (c) of Theorem 3 in [20] establishes that the mean-square error for single-agent TD decays at a rate of Õ(G²τ/(ω(1−γ)T)), where G, as defined in [20], captures the effect of both the projection radius (in [20], the authors consider a projected version of TD learning) and the noise variance. The term G² can be viewed as the analog of max{δ₀², σ²} in our bound. Comparing the above bound with that in Eq. (6), we make two immediate observations. (i) The term T in the centralized bound gets replaced by NT in our bound. This is precisely what we wanted since, in our setting, each agent has access to T samples, yielding a total of NT samples. Essentially, this goes on to show that our algorithm is sample-efficient in that it makes use of all the samples from all the agents and achieves a linear speedup w.r.t. the number of agents. (ii) The channel effects are succinctly captured by a multiplicative factor in Eq. (6) that inflates the variance max{δ₀², σ²} of our noise model. When the number of agents N = 1, the probability of successful transmission p = 1, and there is no quantization effect (i.e., ζ = 1), our bound exactly recovers the bound in the centralized setting (even up to log factors). As far as we are aware, our work is the first to establish such a tight result in multi-agent/federated reinforcement learning under Markovian sampling and communication constraints.

Proof of the Main Result
In this section, we will prove Theorem 1. We start by introducing some definitions to lighten the notation, and by recalling some basic results from prior work. Let us define:

For our analysis, we will need the following result from [20].
Lemma 1. The following holds ∀θ ∈ R^m:

We will also use the fact that the random TD update directions and their steady-state versions are 2-Lipschitz [20], i.e., ∀i ∈ [N], ∀k ∈ N, and ∀θ, θ′ ∈ R^m, we have:

max{ ‖g_{i,k}(θ) − g_{i,k}(θ′)‖₂, ‖ḡ(θ) − ḡ(θ′)‖₂ } ≤ 2‖θ − θ′‖₂.   (8)

Finally, we will use the following bound from [21]:

‖g_{i,k}(θ)‖₂ ≤ 2‖θ‖₂ + 2σ, ∀θ ∈ R^m.   (9)

Equipped with the above basic results, we now provide an outline of our proof before delving into the technical details.
Outline of the proof. We start by defining:

Recalling that δ_k² ≜ ‖θ* − θ_k‖², and using (2), we obtain

The main technical burden in proving Theorem 1 is in bounding E[‖v_k‖²] and E[ψ_k] in the above recursion. Following the centralized analysis in [20,21], one can easily bound E[‖v_k‖²] using (9). However, this approach will fall short of yielding the desired linear speedup property. Hence, to bound E[‖v_k‖²], we need a much finer analysis, one that we provide in Lemma 2. Leveraging Lemma 2, we then establish an intermediate result in Lemma 3 that bounds E[‖θ_k − θ_{k−τ}‖]. This result, in turn, helps us bound E[ψ_k] in Lemma 4. We now proceed to flesh out these steps. In what follows, τ = τ_ε with ε = α^q, q ≥ 2.
Lemma 2. (Key Technical Result) For k ≥ τ, we have

and

We now proceed to bound T_1 through T_3. To that end, we first write T_1 as

and

Now using (9), we obtain

Next, to bound the cross-terms in T_{12}, we will exploit the mixing property in Definition 2. To that end, we note that since (i) ḡ(θ*) = 0 [20], (ii) the packet-dropping processes are independent of the Markovian tuples, and (iii) g_{i,k}(θ*) and g_{j,k}(θ*) are independent for i ≠ j,

Using the Cauchy-Schwarz inequality followed by Jensen's inequality, we can further bound the above inner product via E[η^{(i)}_{k,τ}(θ*)] ≤ 4σ²α^{2q}. For the last inequality, we used the mixing property by noting that k ≥ τ. Specifically, appealing to Definition 2, and recalling that σ ≜ max{1, r, ‖θ*‖₂}, we have

Clearly, the same bound also applies to η^{(j)}_{k,τ}(θ*) via an identical reasoning. Combining this analysis with the bounds on E[T_{11}] and E[T_{12}] thus yields:

Now, using (8), we see that

We now turn to bounding T_3 by writing it as

and

We now proceed to bound E[T_{31}] and E[T_{32}] as follows:

where (a) follows from the variance bound of the quantizer map Q(·), and (b) follows from (9). Next, observe that:

Using the fact that the randomness of the quantization map is independent across agents, and the unbiasedness of Q(·), we conclude that E[T_{32}] = 0. Combining the bounds on E[T_1], E[T_2], and E[T_3] above yields the desired result.
Remark 2. As the rest of our analysis will reveal, Lemma 2 is really the key technical result that will help us establish the desired linear speedup effect under Markovian sampling. One important takeaway from the proof of this result is that we do not need to exploit the fact that the TD update direction is an affine function of the parameter θ_k. As such, Lemma 2 should essentially be applicable (with potentially minor modifications) to more general stochastic approximation schemes where the operator under consideration satisfies basic smoothness properties.
Our next result is the final ingredient needed to prove Theorem 1.
Lemma 4. Let α ≤ 1/(484ζ̄τ) and k ≥ 2τ. We have

Proof. We can write

To bound T_1, observe the following inequalities:

In the above steps, (a) follows from the Cauchy-Schwarz inequality. For (b), we used the fact that given any two positive numbers x and y, the following holds for any η > 0: 2xy ≤ ηx² + y²/η. We used the above inequality with η = ατ to arrive at (b). Finally, for (c), we used the fact that ḡ(θ*) = 0; hence, ḡ_N(θ*) = 0. We now proceed to bound the expectations of each of the terms S_1 through S_3, starting with S_3. Note that using (8), i.e., the Lipschitz property of the TD update directions, we get:

Taking expectations on each side of the above inequality then yields:

In arriving at the above inequality, we used the following facts: (i) the randomness in θ_k depends on all the sources of randomness in our model up to time k−1; (ii) the Bernoulli packet-drop random variables {b_{i,k}}_{i∈[N]} are independent of all the sources of randomness up to time k−1. The term S_2 can be directly bounded using Lemma 3 in the following way:

Finally, the only term that remains to be bounded is E[‖g_N(θ_k)‖²]. Note that we can write:

and

Observe that T_1 and T_2 above correspond exactly to the terms T_1 and T_2 in the proof of Lemma 2.
Thus, they can be bounded as follows:

So, plugging (24), (25), (26), and the above bound into (23), we get the final bound on E[T_1] as follows:

Next, we bound E[T_3] and E[T_4]. Observe that:

Using δ²_{k−τ} ≤ 2δ_k² + 2δ²_{k,τ} and Lemma 3, we then obtain:

Using the same process, we can derive the exact same bound for E[T_4]. We now bound E[T_2]. For ease of notation, let us define F_{k,τ} = ({o_{i,k−τ}}^N_{i=1}, θ_{k−τ}). Observe:

where in the last step, we made use of the mixing property. Since α < 1, we have

Using q ≥ 2, we obtain:

Using δ²_{k−τ} ≤ 2δ_k² + 2δ²_{k,τ} and Lemma 3, and then simplifying yields:

Finally, to bound T_5, let

We have

Note that

With the help of the auxiliary lemmas provided above, we are now ready to prove our main result, i.e., Theorem 1.
Proof of Theorem 1. Setting α ≤ 1/(484ζ̄τ), we can apply the bounds in Lemmas 1, 2, and 4 to (11). This yields:

For α ≤ ω(1−γ)/(C₀τζ̄) with C₀ = 6446, we then obtain:

Iterating the last inequality, we have ∀k ≥ 2τ:

where ρ = (1 − αω(1 − γ)p), C₂ = 5162, C₃ = 61, and we set q = 2. It only remains to show that with our choice of α, E[δ²_{2τ}] = O(δ₀² + σ²). This follows from some simple algebra and steps similar to those in the proof of Lemma 3. We provide these steps below for completeness. Note that, defining T ≜ ‖∑^N_{i=1} b_{i,k} g_{i,k}(θ_k)‖², and using (9),

Letting T_3 be as defined in (13), note that

Plugging this inequality into (18) and iterating,

Using the same arguments used to arrive at (20), we have

where we used the fact that ατ ≤ 1/(484ζ̄). This concludes the proof.
We now provide the proof of Corollary 1.
Proof of Corollary 1. We first recall the main result of Theorem 1, i.e., the following bound:

Let us also recall the choice of step-size α and number of iterations T from Corollary 1:

To simplify the first term in Eq. (34), we use the fact that for all x ∈ (0, 1), it holds that (1 − x) ≤ e^{−x}. Using this in conjunction with the choice of α in (35) yields the following bound on T_1 in Eq. (34):

To bound T_2, we simply substitute the choice of α in Eq. (35). For T_3, we first substitute the choice of α to obtain:

From our choice of T in Eq. (35), the following hold:

Using these two inequalities, we immediately note that:

Combining the individual bounds on T_1, T_2, and T_3 leads to Eq. (6). Let us complete our derivation with a couple of other points. First, straightforward calculations suffice to check that the choice of α and T in Eq. (35) meets the requirement on α in the statement of Theorem 1. Finally, recall from the discussion following Definition 2 that the mixing time τ satisfies τ_ε ≤ K log(1/ε) for some constant K ≥ 1. Throughout our analysis, we set ε = α², and then dropped the dependence of τ on ε for notational convenience. Plugging in the choice of α from Eq. (35), we obtain:

for NT ≥ e. The point of the above calculation is to explicitly demonstrate that one can indeed meet the requirement on T in Eq. (35) for large enough T.

Numerical simulations
Basic Setup. In this section, we provide simulation results on a synthetic example to corroborate our theory. We consider an MDP with |S| = 20 states and a feature space spanned by m = 10 orthonormal basis vectors. We fix the discount factor to be γ = 0.5. In all the simulations, we fix the step size α = 0.6 for all the algorithms. In the simulations presented here, we generate the erasure channels with Bernoulli random variables, and employ uniform scalar quantization of the TD update directions, assigning a certain number of bits to the quantization of each vector component for each agent.
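The basic setup can be mimicked end-to-end in a short script. The following sketch is ours (not the authors' code): it builds a random MRP with |S| = 20, m = 10 orthonormal features, and γ = 0.5, and runs the federated TD iteration over Bernoulli erasure channels. Quantization is omitted for brevity, and the step size, horizon, and α/N aggregation rule are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 20, 10, 0.5

# Random MRP: row-stochastic transition matrix, bounded rewards, and a
# feature matrix with orthonormal columns (so every row has norm <= 1).
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = np.linalg.qr(rng.standard_normal((n, m)))[0]

def td_direction(theta, s, s_next):
    td_err = R[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    return td_err * Phi[s]

def run_qfedtd(num_agents, p, alpha=0.1, steps=500):
    theta = np.zeros(m)
    states = rng.integers(n, size=num_agents)
    for _ in range(steps):
        agg = np.zeros(m)
        next_states = np.empty_like(states)
        for i, s in enumerate(states):
            next_states[i] = rng.choice(n, p=P[s])
            if rng.random() < p:  # Bernoulli erasure: packet arrives w.p. p
                agg += td_direction(theta, s, next_states[i])
        theta += (alpha / num_agents) * agg
        states = next_states
    return theta
```

Plotting ‖θ_k − θ*‖² along such runs for different N, p, and bit budgets reproduces the qualitative behavior reported in Figures 1-3.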
Observation 1: The Linear Speedup Effect. In Figure 1, we compare the proposed QFedTD algorithm with a vanilla version of federated TD learning (i.e., QFedTD without erasures and quantization) that we refer to as FedTD. For the results shown in Figure 1, we set the probability of successful transmission p = 0.6, and we quantize the TD update directions assigning 4 bits to each vector component. In the figure, we show the performance of FedTD and QFedTD in the single-agent (N = 1) and the multi-agent (N = 40) cases. The simulation results confirm the linear speedup with the number of agents for QFedTD, as established by our theoretical findings. From Figure 1, we note two important aspects established by the theory and confirmed by the experiment: the rate of convergence of QFedTD is slackened by the probability of unsuccessful transmission 1 − p, while the size of the neighbourhood of θ* to which the algorithm converges is inflated by the quantization noise.
Observation 2: The Effect of the Bernoulli Erasure Channel. In Figure 2, we show the performance of QFedTD under different values of the transmission success probability p, while fixing the number of agents to N = 40. Once again, the effect of the successful transmission probability p on the linear convergence rate is evident, consistent with our theoretical result in Theorem 1. Indeed, lowering the probability of successful transmission slows down the rate of convergence; the ball to which the iterates converge has the same size for all the variants.
Observation 3: The Effect of Quantization. In Figure 3, similarly to what we did for Figure 2, we compare the performance of QFedTD for different quantization noise levels. In particular, we show the performance of QFedTD for three values of the number of bits assigned to each vector component of the TD update directions: 3, 4, and 5 bits per component. We fix the transmission success probability p = 0.6 and the number of agents N = 40. Consistent with Theorem 1, we see from this figure that, while the linear rate is the same for all the QFedTD variants, the size of the neighbourhood of θ* to which the algorithm converges is inflated by the quantization noise, i.e., it increases when we diminish the number of bits used to quantize the TD update directions.
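To connect the bit budget to the distortion constant ζ of Definition 1, one can empirically estimate E‖Q(x)‖²/‖x‖² for a uniform stochastic quantizer at different bit widths. This is a rough illustration under our assumed form of Definition 1(ii); the specific quantizer, grid range, and test vector are our choices:

```python
import math
import random

def stochastic_quantize(x, bits, scale=1.0):
    # bits-per-component uniform stochastic quantizer: unbiased rounding to
    # the two nearest points of a grid with step 2*scale/(2**bits - 1).
    step = 2.0 * scale / (2 ** bits - 1)
    out = []
    for v in x:
        lo = step * math.floor(v / step)
        p_up = (v - lo) / step
        out.append(lo + step if random.random() < p_up else lo)
    return out

def empirical_zeta(bits, trials=5000):
    # Monte-Carlo estimate of E||Q(x)||^2 / ||x||^2 for a fixed test vector;
    # by unbiasedness this ratio is >= 1, and it shrinks toward 1 (i.e.,
    # toward "no quantization") as the number of bits grows.
    x = [0.31, -0.77, 0.52]
    norm2 = sum(v * v for v in x)
    total = 0.0
    for _ in range(trials):
        q = stochastic_quantize(x, bits)
        total += sum(v * v for v in q)
    return total / (trials * norm2)
```

The shrinking estimate with growing bit width mirrors Observation 3: fewer bits mean a larger ζ, and hence a larger neighbourhood of θ* in Theorem 1.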

Figure 1: Comparison between vanilla FedTD and QFedTD in the single-agent (N = 1) and multi-agent (N = 40) settings. The number of bits used to quantize the TD update direction is 4 per vector component, and the transmission success probability is p = 0.6.

Figure 2: Performance of QFedTD under different values of the transmission success probability p. The number of agents cooperating in the simulations performed to obtain this figure is N = 40, and each component of the TD update directions is quantized with 4 bits.

Figure 3: Performance of QFedTD under different values of the number of bits used to quantize the TD update directions. The number of agents cooperating in the simulations performed to obtain this figure is N = 40, and the transmission success probability is p = 0.6.