Distributed Value Function Approximation for Collaborative Multi-Agent Reinforcement Learning

In this paper we propose novel distributed gradient-based temporal difference algorithms for multi-agent off-policy learning of linear value function approximations in Markov decision processes. The algorithms are composed of: 1) local parameter updates based on single-agent off-policy gradient temporal difference learning algorithms, including eligibility traces with state-dependent parameters, and 2) a linear dynamic consensus scheme over the underlying, typically sparsely connected, inter-agent communication network. The proposed algorithms differ in how the time scales are selected, how the local recursions are performed and how the consensus iterations are incorporated. The algorithms are completely decentralized, allowing applications in which all the agents may have completely different behavior policies while evaluating a single target policy. In this sense, the algorithms may be considered as a tool for either parallelization or multi-agent collaborative learning under given constraints. We provide weak convergence results, rigorously taking into account the properties of the underlying Feller-Markov processes. We prove that, under nonrestrictive assumptions on the time-varying network topology and the individual state-visiting distributions of the agents, the parameter estimates of the algorithms weakly converge to a consensus point. The variance reduction effect of the proposed algorithms is demonstrated by analyzing a limiting stochastic differential equation. Specific guidelines for network design, providing the desired convergence points, are given. The algorithms' properties are illustrated by characteristic simulation results.


Introduction
Interest in decentralized multi-agent algorithms for automatic decision making in uncertain and dynamically changing environments has recently increased dramatically, mainly due to the fundamental role of these algorithms in the design and operation of cutting-edge technologies and concepts such as Cyber-Physical Systems (CPS), Internet of Things (IoT), Smart Mobile Networking, Industry 4.0, etc. Distributed estimation, optimization and adaptation methods play an essential role in the development of these algorithms, with a large class of them being based on recursive collaboration aimed at achieving consensus on certain variables (e.g. [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15] and references therein). The main idea is to use the underlying inter-agent communication network to achieve a global consensus in a completely decentralized and distributed way, without the presence of any type of fusion center.
Reinforcement learning (RL) is a powerful methodology for decision making in uncertain environments which typically uses Markov Decision Process (MDP) modeling, providing approximate solutions to dynamic programming problems [16,17]. One of the most important ideas to emerge from the RL field is temporal difference (TD) learning, typically used to learn approximations to the value function of an MDP. This learning problem is especially acute under complex conditions, such as a very large state space and a discrepancy between the behavior policy of an agent and the policy currently targeted for evaluation (off-policy learning, e.g. [18]). In [19][20][21][22][23][24] several fast gradient-based algorithms for TD learning have been proposed which can successfully handle both of these situations of crucial importance for practice.
Distributed and multi-agent RL methods have recently received a lot of attention, mainly due to their high potential for solving essential problems within the mentioned emerging areas dealing with complex, intelligent and networked systems, such as CPS (see e.g. [25][26][27] and references therein). Typically, a specific distributed setup is adopted in which it is assumed that each agent can access (observe transitions of) a given MDP independently, without mutual interactions through the MDP environment, but with a possibly different behavior (control) policy for each agent. Similar problem setups have been adopted in several recent works [28][29][30][31][32][33][34][35]. In [28] a TD-based algorithm for policy approximation was proposed, assuming that the behavior of all the agents is the same, while in [29] a distributed version of the popular Q-learning algorithm is proposed, but assuming the presence of a global controller and perfect state knowledge. Contributions from [30,31,34] belong to the same line of thinking, but without including eligibility traces, providing convergence analyses under the assumption of independent sampling from the underlying Markov chain stationary distributions, which drastically simplifies the technical developments. In [30] a mean-square convergence analysis of a consensus-based algorithm is provided, under the assumption that the communication graph is fixed. In [35] the finite-time convergence rate of the algorithms is analyzed, while in [33] a modified algorithm with linear convergence rate is proposed. The authors of [32,36,37] developed multi-agent distributed actor-critic schemes; the approach from [36] includes an iterative consensus-based emphatic temporal difference algorithm [38].
In this paper, we propose several new decentralized and distributed algorithms for multi-agent off-policy gradient temporal difference learning of linear value function approximations in MDPs. The basis of the introduced decentralized schemes is the recently proposed single-agent off-policy gradient-based algorithms [19][20][21][22][23][39]. The main idea is to incorporate linear distributed dynamic consensus iterations over the underlying network of agents who can communicate only with their corresponding neighbors, thereby avoiding dependence on any type of fusion center. In the adopted distributed framework, it is of fundamental importance that the proposed algorithms are of off-policy type, since this allows applications in which all the agents may have different behavior policies while evaluating a single target policy. Hence, in practice, the agents are able to successfully perform an overall task together, even if individual groups of agents cannot. Another important property of the proposed algorithms is that the local recursions of each agent can be based on eligibility traces [20,23], where each agent may choose different (possibly state-dependent) λ parameters. The consensus convexification steps are applied to the parameter vector in the approximation function and not necessarily to the auxiliary parameter vectors introduced by the single-agent algorithms' construction [6,30,40]. Furthermore, each of the proposed algorithms can be realized using one or two time scales. Under nonrestrictive assumptions on the time-varying network topology and the individual behavior policies (compare e.g.
with [30,40]), we prove that the parameter estimates generated by the proposed algorithms weakly converge to consensus. These proofs are derived using the results from [6,12,23,41] and represent the main theoretical contribution of the paper. In the convergence proofs, the stochastic nature of the underlying MDPs has been rigorously taken into account, without the simplifying assumption of independence of MDP transitions sometimes used in the literature [19,30,31,34]. The effect of "denoising" (noise variance reduction) provided by the consensus scheme has been verified by a theoretical analysis of the algorithms' rate of convergence, using their limiting stochastic differential equations. We also formulate specific guidelines on how to design the network topology and weights in order to ensure the desired convergence points. Finally, selected simulation results illustrate the main concepts and demonstrate that the proposed algorithms can be considered as efficient tools for practice.
The paper is organized as follows. In Section 2 we formulate the problem and define the algorithms. The first part of Section 3 is devoted to preliminary results, including some basic properties of the Feller-Markov state-trace processes (using [23]) and of the incorporated consensus scheme (using [6]). In the second part of Section 3, a weak convergence analysis is presented for all the proposed algorithms. Section 4 is devoted to a discussion of several important issues, such as the possibility of introducing constraints on the parameter vector, the overall impact of consensus, the communication network design and the variance reduction effect. Finally, in Section 5 the results of simulations demonstrating the effectiveness of the proposed algorithms are shown.

Problem Formulation. Definition of the Algorithms
Consider N + 1 autonomous agents, each acting on a separate Markov Decision Process (MDP), denoted as MDP^{(i)}, i = 0, . . ., N, characterized by the quadruplets {S, A, p(s′|s, a), R(s, a, s′)}, where S = {s_1, . . ., s_M} is a finite set of states (|S| = M), A is a finite set of actions, p(s′|s, a) defines the probability of moving from s ∈ S to s′ ∈ S by applying action a ∈ A, and R(s, a, s′) are the one-step random rewards; let MDP^{(0)} represent a reference MDP. Each MDP^{(i)}, i = 0, 1, . . ., N, has an associated fixed stationary policy π^{(i)}(a|s) (indicating the probability of taking action a at state s), so that the resulting state processes {S_i(n)}, where n ≥ 1 denotes integer transition times, are time-homogeneous Markov chains. The goal of the agents is to learn the state value function for a given target policy π^{(0)} in MDP^{(0)}, where each agent i, i = 1, . . ., N, can observe only state transitions and receive (noisy) rewards in MDP^{(i)} with behavior policy π^{(i)}. Let P^{(i)} denote the resulting transition matrices of the Markov chains {S_i(n)}, i = 0, . . ., N, respectively. Therefore, we are dealing with a cooperative off-policy reinforcement learning problem [16,34].
Introduce the local importance sampling ratios ρ_i(s, s′) = P^{(0)}_{ss′} / P^{(i)}_{ss′} for s, s′ ∈ S (with the convention 0/0 = 0). The following assumption ensures a well defined value function and importance ratios: (A1) (Assumptions on target and behavior policies) a) P^{(0)} is such that I − P^{(0)}Γ is nonsingular; b) P^{(i)} are irreducible and such that for all s, s′ ∈ S, P^{(i)}_{ss′} = 0 ⇒ P^{(0)}_{ss′} = 0, i = 1, . . ., N. Let φ : S → R^p be a function that maps each state to a p-dimensional feature vector φ(s); let the subspace spanned by these vectors be L_φ. Our goal is to find a linear approximation v_θ = Φθ of the value function, where Φ is an M × p matrix composed of the p-vectors φ(s) as row vectors and θ ∈ R^p is a parameter vector.
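As a concrete illustration (a minimal sketch; the matrix-based function and its name are ours, not the paper's), the importance sampling ratios can be computed elementwise from the target and behavior transition matrices:

```python
import numpy as np

def importance_ratios(P_target, P_behavior):
    """Elementwise importance sampling ratios rho_i(s, s') = P^(0)_{ss'} / P^(i)_{ss'},
    with the convention 0/0 = 0; assumption (A1b) rules out x/0 with x > 0."""
    rho = np.zeros_like(P_target, dtype=float)
    mask = P_behavior > 0
    rho[mask] = P_target[mask] / P_behavior[mask]
    return rho
```

Under (A1b), every transition that the target policy can generate is also reachable under the behavior policy, so the ratios are finite.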
In order to construct a distributed algorithm for collaborative estimation of the parameter vector θ using observations from MDP^{(i)}, i = 1, . . ., N, we define the global parameter vector Θ = [θ_1^T · · · θ_N^T]^T at the network level, and formulate a constrained optimization problem: minimization of the weighted sum of the local objective functions J_{ξ_i}, with a priori defined weighting coefficients q_i > 0, subject to the consensus constraint on the local parameter vectors; here λ_i denotes the local λ-parameter and Π_{ξ_i}{·} the projection onto the subspace L_φ w.r.t. the weighted Euclidean norm ‖v‖²_{ξ_i} = Σ_{s∈S} ξ_{i;s} v(s)², for a positive M-dimensional vector ξ_i with components ξ_{i;s}, s ∈ S (see [23,34]). In accordance with [23,42], we take ξ_i to be the invariant probability distribution of the local Markov chain of MDP^{(i)}, with transition matrix P^{(i)}; Ξ_i denotes the diagonal matrix with the components of ξ_i on the diagonal, and w_i(θ) represents the unique solution (in w_i) of the corresponding linear equation [23]. Alternatively, we can reformulate (3) in an equivalent form. Let ρ_i(n) = ρ_i(S_i(n), S_i(n + 1)) and γ_i(n) = γ(S_i(n)) [23,42]. The local temporal-difference terms δ_i(n) are then defined in the standard way, where the one-step local random reward possibly contains a zero-mean white noise term.
We propose several algorithms composed of two main parts: 1) local parameter updates, based on the gradient descent methodology, using local state transition and reward observations from MDP (i) , i = 1, . . ., N , and 2) inter-agent communications (restricted to the agents' neighborhoods) of the current local parameter estimates, aimed at achieving consensus between the agents.
For the first part, we propose two algorithm types. The first is derived from (3) and denoted as D1-GTD2(λ), after the algorithm GTD2 proposed in [19]. The local updates are defined by (5) and (6), where v_{θ_i(n)} = Φθ_i(n), and e_i(n) are sequences of eligibility trace vectors generated by each agent using the recursion (7), with λ_i(n) being the λ-parameters introduced above, to be defined more precisely in the next section. The initial values θ_i(0) are chosen arbitrarily; however, w_i(0), as well as e_i(0), have to satisfy w_i(0), e_i(0) ∈ span{φ(S)} in order to achieve the desired convergence properties. The sequences {α(n)} and {β(n)} are positive step sizes, which can be either of the same order of magnitude (single time scale) or satisfy α(n) ≪ β(n) (two time scales), see [23].
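To fix ideas, a minimal single-agent sketch of one local GTD2(λ)-style update is given below. The function name, argument layout and the exact placement of the importance ratios are our assumptions, loosely following the standard off-policy GTD2(λ) recursions; the precise forms used by the proposed algorithms are those of (5)-(7) in [23].

```python
import numpy as np

def gtd2_lambda_step(theta, w, e, phi, phi_next, reward,
                     rho_prev, rho, gamma, gamma_next, lam, alpha, beta):
    """One hypothetical local GTD2(lambda) update (a sketch, not the paper's
    exact recursion): trace update, TD term, slow theta-step, fast w-step."""
    # eligibility trace with (possibly state-dependent) lambda
    e = lam * gamma * rho_prev * e + phi
    # local off-policy temporal-difference term
    delta = rho * (reward + gamma_next * phi_next @ theta - phi @ theta)
    # slow recursion for the main parameter vector (uses the current w)
    theta_new = theta + alpha * rho * (e @ w) * (phi - gamma_next * phi_next)
    # fast recursion for the auxiliary vector
    w_new = w + beta * (delta * e - (phi @ w) * phi)
    return theta_new, w_new, e
```

In the single-time-scale variant alpha and beta are of the same order; in the two-time-scale variant alpha is much smaller than beta.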
The second algorithm type, denoted as D1-TDC(λ), is derived from the algorithm TDC of [19]. Its local updates are derived directly from (4), with the recursions for w_i(n) and e_i(n) identical to those in (6) and (7), respectively.
The second part of the algorithms, aimed at achieving consensus on the approximation parameters, performs the convexification step (9) (for both D1-GTD2(λ) and D1-TDC(λ)): each agent replaces its current estimate θ_i(n) by the convex combination Σ_j a_ij(n)θ_j(n), where the a_ij(n) are random variables, elements of the random consensus matrix A(n) [12,34,43]. If one adopts that the available N agents are connected by communication links in accordance with a directed graph G = (N, E), where N is the set of nodes and E the set of arcs, then the matrix A(n) has zeros at the same places as the graph adjacency matrix A_G and is row-stochastic, i.e. a_ij(n) ≥ 0 and Σ_j a_ij(n) = 1.
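The convexification step can be sketched as follows (a minimal illustration; stacking the local p-vectors as rows of an N × p array is our convention, not the paper's notation):

```python
import numpy as np

def consensus_step(A, thetas):
    """One consensus convexification step: agent i replaces theta_i by the
    convex combination sum_j a_ij theta_j. A must be row-stochastic with a
    sparsity pattern matching the communication graph."""
    assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)
    return A @ thetas  # thetas: (N, p) array of stacked local estimates
```

Note that if all the agents already agree, row-stochasticity implies that the step leaves the common value unchanged, so consensus points are fixed points of the scheme.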
A modification of D1-GTD2(λ) and D1-TDC(λ), denoted as D2-GTD2(λ) and D2-TDC(λ), respectively, involves convexification of both θ and w, in such a way that (9) remains the same, while (10) is replaced by an analogous consensus step applied to the auxiliary vectors w_i.

In the given context, the multi-agent network as a whole can be considered as a tool for: a) parallelization, speeding up the whole learning process, and b) improvement of performance in two main directions: 1) better approximation than in the case of the local algorithms, by adequate selection of the behavior policies in the local MDPs and their proper weighting, and 2) reduction of the covariance of the estimates by averaging over the network nodes (see the Discussion section below). It is important that the proposed algorithms are fully decentralized, thereby contributing to their scalability and robustness.

Preliminaries
Before proceeding to the proof of convergence of the proposed algorithms, we will focus on several important prerequisites based on the recent results from [23,42].

Properties of the State-Trace Processes
We will consider state-dependent λ_i, with λ_i(n) = λ_i(S_i(n)) for a given function λ_i : S → [0, 1]. It can be shown that the state-trace processes (S_i(n), e_i(n)) are Markov chains with the weak Feller property (see [23,42] for details).
Let Z_i(n) = (S_i(n), e_i(n), S_i(n + 1)). According to (5), for D1-GTD2(λ) and D2-GTD2(λ), denoting z = (s, e, s′), we introduce the functions g_i(·) and k_i(·), together with their averaged counterparts ḡ_i(·) and k̄_i(·), where δ̄_i(s, s′, v) denotes the expected local temporal-difference term and r_i(s, s′) is the one-step expected reward under policy π^{(i)} when transitioning from state s to s′. Notice that δ_i(v_{θ_i(n)}; n) and δ̄_i(S_i(n), S_i(n + 1), v_{θ_i(n)}) differ by the zero-mean noise term e_i(n)ω_i(n + 1). Recalling that for any given θ_i there is a unique solution w_{θ_i} to the linear equation k̄_i(θ_i, w) = 0, w ∈ span{φ(S)} (which we denote as w_{θ_i} = w̄_i(θ_i)), we obtain the averaged function ḡ_i(θ_i, w̄_i(θ_i)). Analogous functions are obtained in the case of D1-TDC(λ) and D2-TDC(λ). Relevant properties of the state-trace process can be found in [23]. Lemma 1 represents one of the main pillars of the analysis below.

Lemma 1 ([23]) Under (A1), the stated limit relations hold for each θ_i and w_i and each compact set, together with the corresponding vector components.

Stacking the local variables into a global state X(n), we obtain the following global model (18) (at the network level), where ⊗ denotes the Kronecker product; the right-hand side takes one form for the algorithms of GTD2-type and another for the algorithms of TDC-type (where g_i(·) is defined by (17) and F_w(X(n), n) remains the same as in the case of the GTD2-type algorithms).

Convergence Proofs
(A2) There is a scalar α_0 > 0 such that a_ii(n) ≥ α_0 and, for i ≠ j, either a_ij(n) = 0 or a_ij(n) ≥ α_0. (A3) For all n there exist a scalar p_0 > 0 and an integer n_0 such that P_{F_n}{agent j communicates to agent i on the interval [n, n + n_0]} ≥ p_0, i, j = 1, . . ., N.
(A4) The digraph G is strongly connected.
Remark 1 Assumptions (A2)-(A5) are related to the "consensus part" of the proposed algorithms. Lemma 2 represents a slight generalization of the results from [6], based on [12].
Remark 2 Assumption (A7) is common in weak convergence proofs. As stated in [6], one can assume w.l.o.g. that {X(n)} is tight by simply truncating the dynamical terms in the algorithms. It is introduced at this point for the sake of placing emphasis on other structural aspects of the proposed algorithms and the underlying networks. In Section 4 we provide a related comment.
In the sequel, we will pay attention to several distinct cases, related to the proposed algorithms and their specific properties. We will assume that the step sizes are equal and constant, i.e. α_i(n) = α and β_i(n) = β. The theorems given below are concerned with the asymptotic properties of the algorithms as α, β approach zero, in such a way that α/β → ε, where ε is equal either to one (single time scale) or zero (two time scales). With standard modifications, the presented results can be applied to the case of diminishing step sizes, when α_i(n), β_i(n) → 0 as n → ∞ (see [23,41]).
Theorem 1 Let (A1)-(A7) hold. Let X^α(n) be generated by (5), (6), (9) and (10). Then the interpolated process X^α(·) is tight and converges weakly to a process X(·), where θ(·), w_1(·), . . ., w_N(·) satisfy the system of ODEs (20) with initial conditions θ_0, w_{1,0}, . . ., w_{N,0}. Moreover, for any integers n_α such that αn_α → ∞ as α → 0, there exist positive numbers {T_α} with T_α → ∞ as α → 0, such that for any ϵ > 0, lim sup_{α→0} P{X^α(k) ∉ N_ϵ(Σ)} = 0 for some k ∈ [0, T_α/α], where N_ϵ(·) denotes the ϵ-neighborhood, while Σ is the set of points (θ̄, . . ., θ̄, w̄_1, . . ., w̄_N) satisfying (22).

Proof: Part 1. Iterating (18) back, one obtains (23); comparison of (23) with the analogous relation from the proof of Theorem 3.1 in [6] shows only a slight formal difference, coming from the specific form of the model (18). Having in mind the general properties of the matrix Ψ(k), it is not difficult to conclude that the results of Theorem 3.1 from [6] hold in our case. What remains to be done is to verify the basic assumptions from [6] connected to the global model itself. Using the preliminary part of this section, we can easily conclude that the results of Lemma 1, together with the derivations from [23], imply that the assumptions C(3.2) and C(3.3') from Section 3 in [6] are satisfied, so that the corresponding supremum over α and n ≥ n_α is finite. The asymptotic ODE (20) is obtained, according to [6], by defining a suitable function M_f(t) of continuous time t for each real-valued function f(·) with compact support and continuous second derivatives. Using [6], it is possible to show that M_f(t) is a continuous martingale by applying the Skorokhod embedding to the analysis of the limit process X^α(·) → X(·). Consequently, M_f(t) = 0, having in mind that X(·) is Lipschitz continuous and that M_f(0) = 0. This implies that Ẋ = diag{Ψ̄ ⊗ I_p, I_{Np}} F̄(X). By Lemma 2 and (A2)-(A6), all the rows of Ψ̄ are equal. It follows that the p-dimensional vector components of Θ must be equal, i.e.
we obtain that Θ(·) is of the form Θ(t) = 1_N ⊗ θ(t), and that θ(·) satisfies the first ODE from (20). The remaining ODEs, related to the w_i, follow in a more conventional way ([41], Theorem 8.2.2). Part 2. In order to study the limit set of the ODE (20), denoted as L_{θ,w_1,...,w_N}, we follow the methodology from [23] (in relation to Proposition 4.1) and introduce a Lyapunov function V(θ, w_1, . . ., w_N) centered at the points θ̄ and w̄_i given by (22). A direct calculation, with ⟨·,·⟩ denoting the scalar product, gives V̇(θ, w_1, . . ., w_N) < 0 for w_i ∈ span{φ(S)} and w_i ≠ w̄_i, showing that ŵ_i = w̄_i if [θ̂^T ŵ_1^T . . . ŵ_N^T]^T ∈ L_{θ,w_1,...,w_N} and ŵ_i ∈ span{φ(S)}, i = 1, . . ., N. Similarly, we can demonstrate that if [θ̂^T w̄_1^T . . . w̄_N^T]^T ∈ L_{θ,w_1,...,w_N}, then θ̂ = θ̄. Reasoning further as in [23], we infer that for initial conditions w_i(0) ∈ span{φ(S)} the limit set of the ODE (20) is the set Σ of the points satisfying (22).
The first part of the proof is analogous to the first part of the proof of Theorem 1. We indicate here only some characteristic details related to the associated mean ODE. Namely, for the fast time scale we have (27), as a consequence of the fact that (α/β)ḡ_i(θ, w) is negligible when β, α/β → 0. Therefore, for any given θ there is a unique solution w̄_i(θ) to the linear equation k̄_i(θ, w_i) = 0, w_i ∈ span{φ(S)}, and at the slow time scale we have (28).

Case C)
Algorithm D1-TDC(λ), ε = 0. The TDC algorithm from [19] was originally formulated as a two-time-scale algorithm. We obtain the same result as in the case of D1-GTD2(λ) with two time scales.
In the same way as for Theorem 1, we define X^α_0 and X^α(·) after replacing diag{(A(n) ⊗ I_p), I_{Np}} by diag{(A(n) ⊗ I_p), (A(n) ⊗ I_p)} in the corresponding equations.
Here θ(·) and w(·) satisfy the ODE (33), with initial conditions θ_0 and w_0. Moreover, for any integers n_α such that αn_α → ∞ as α → 0, there exist positive numbers {T_α} with T_α → ∞ as α → 0, such that for any ϵ > 0, lim sup_{α→0} P{X^α(k) ∉ N_ϵ(Σ)} = 0, i = 1, . . ., N, for some k ∈ [0, T_α/α], where N_ϵ(·) denotes the ϵ-neighborhood, while Σ = Σ_θ̄ × Σ_w̄ is the set of points given by (35).

Proof: The proof follows the procedure of Theorem 1, after replacing diag{(A(n) ⊗ I_p), I_{Np}} by diag{(A(n) ⊗ I_p), (A(n) ⊗ I_p)}. In order to analyze the limit set L_{θ,w} of (33), we introduce a Lyapunov function V(θ, w) centered at the points θ̄ and w̄ given by (35); a direct calculation of V̇ then yields the result, using the methodology of Theorem 1.

Constrained Algorithms
Following [23] and the above theorems, it is straightforward to construct constrained versions of all the proposed algorithms and to prove their weak convergence. For example, the constrained form of D1-GTD2(λ) is obtained by applying the projections Π_{B_θ}{·} and Π_{B_w}{·} to the right-hand sides of (5) and (6), on the constraint sets B_θ and B_w, respectively, w.r.t. ‖·‖_2. The sets B_θ and B_w can be taken to be closed balls in R^p centered at the origin with sufficiently large radii. We will not go into the details of the convergence proof for the constrained algorithms: it is possible to directly apply the main methodological lines from [23]. We only mention at this point that, in accordance with Remark 2, assumption (A7) can be removed in the case of the constrained algorithms. Also, one has to take into account that the asymptotic ODEs now contain boundary reflection terms, influencing, in general, the definition of their limit sets [23] and imposing, in some cases, additional constraints w.r.t. B_θ and B_w (see e.g. Lemmas 3.1 and 3.2 from [23]).
For the algorithms D2-GTD2(λ) and D2-TDC(λ) we have the additional consensus w.r.t. w_i, implying the constraints w̄_1(θ) = · · · = w̄_N(θ) = w̄(θ), where w̄(θ) satisfies the corresponding averaged linear equation. It follows that the convergence points differ from those of D1-GTD2(λ) and D1-TDC(λ). Notice that in the case of equal λ-parameters and equal behavior policies for all the agents, all the proposed algorithms provide the same solution.

Inter-Agent Communications and Network Design
The proposed distributed multi-agent algorithms can be considered as: 1) a tool for organizing coordinated actions of multiple agents contributing to the value function estimation, and 2) a parallelization tool, allowing faster convergence, particularly useful in problems of large dimension. Notice that in the first case the proposed algorithms can become a part of multi-agent actor-critic schemes (see, e.g., [36,37]).
In general, the agents have specifically tailored behavior policies, as well as different ways of defining the local λ-parameters. The choice of the weighting factors in the products ψ̄_i q_i, i = 1, . . ., N, enables the user to place more emphasis on those agents that can provide a greater contribution to the overall goal. Practically, it is obvious that complementary actions with adequate weights, within possibly partially overlapping subsets of local MDP states, can contribute significantly to the overall rate of convergence. In this sense, it appears advisable to implement multi-step consensus within the time intervals between successive observations [13]. Generically, q_i is chosen a priori, while ψ̄_i depends solely on the network properties, formally expressed by the choice of the consensus matrix A(n). This implies that one of the prerequisites for the final planning of the whole system is the network design, including the network topology. There is great flexibility from this point of view, provided the network is strongly connected (Assumption (A4)). For example, if one adopts A(n) = A, the problem reduces to the definition of the elements of an N × N matrix A which provides ψ̄_i = 1/N, having in mind the initial freedom in selecting arbitrary weights q_i. In this case, one has to solve for A the standard stationarity equation, which always has a solution satisfying the given constraints (a_ij ≥ 0, Σ_j a_ij = 1) [12]. Furthermore, the adopted algorithm formulation allows random matrices A(n). This additional freedom makes it possible to account for communication dropouts and, additionally, some forms of asynchronous communication. A detailed analysis of this problem is given in [12] for gossip-type communications.
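One standard construction achieving ψ̄_i = 1/N on an undirected connected graph (our choice of illustration, not prescribed by the paper) is the Metropolis-Hastings weighting, which yields a doubly stochastic A whose normalized left Perron vector is uniform:

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis-Hastings consensus weights for a symmetric 0/1 adjacency
    matrix. The result is doubly stochastic, so repeated averaging weights
    all agents' contributions uniformly (psi_i = 1/N)."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j]:
                A[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        A[i, i] = 1.0 - A[i].sum()  # self-weight absorbs the remainder
    return A
```

The construction only uses the degrees of neighboring nodes, so each agent can compute its own row locally.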

Variance Reduction Effect
Estimation algorithms based on consensus are characterized, in general, by "denoising" properties, consisting in the reduction of the asymptotic variance of the estimates by the averaging over the network implicitly introduced by the consensus scheme. Recall that variance reduction is one of the fundamental problems in temporal difference algorithms in general, and it can represent one of the important motivations for adopting the proposed multi-agent approach to off-policy value function approximation. In order to demonstrate this effect in the context of the proposed algorithms, we provide below a concise analysis of the rate of convergence, using the methodology from [6, Section 6].
Consider D2-GTD2(λ) with a single time scale, starting from the global model (18). Define the normalized error process Y^α(n), which follows from the global model (in the context of Theorem 3), and assume it is tight for n ≥ N_T. Under appropriate conditions [6, Section 5.1], it is possible to show that, as X^α(n) converges weakly, the interpolation of Y^α(n) converges weakly to the solution of a limiting stochastic differential equation, where the matrix Q is the Jacobian matrix of (Ψ̄ ⊗ I_{2p}) F̄_Y(Ȳ) (F̄_Y(Ȳ) follows from F_Y(Ȳ, n) in the same way as F̄(X) follows from F(X, n) in the global model description) and v(·) is a Wiener process whose covariance involves ψ_1(k), . . ., ψ_N(k), the elements of each row of the row-stochastic matrix Ψ(k) (E{·} is understood in the sense of the ergodic mean [6]).
The asymptotic covariance of the limit process can be taken as a measure of the asymptotic quality of the algorithm. As in [6], we specialize to a very simple case by assuming that the local noise covariances are all equal to R. If we consider the situation in which there are no communications between the agents, we conclude that the SDE model has the same form as in (42), but with cov{v(1)} = R. The advantage of the distributed algorithm is then obvious, having in mind that Σ_{i=1}^N E{ψ_i(n)²} < 1.
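The variance reduction factor Σ_i E{ψ_i²} < 1 can be illustrated by a toy Monte Carlo experiment (a sketch under the simplifying assumptions of i.i.d. unit-variance local noises and uniform weights ψ_i = 1/N, which are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 200_000
psi = np.full(N, 1.0 / N)             # uniform network weights, sum(psi) = 1
noise = rng.standard_normal((T, N))   # i.i.d. unit-variance local noises
single = noise[:, 0]                  # no communication: one agent's noise
averaged = noise @ psi                # consensus averaging over the network
# empirical variances: close to 1 without communication,
# close to sum(psi**2) = 1/N with it
print(single.var(), averaged.var())
```

With uniform weights the asymptotic noise variance drops by a factor of N, mirroring the role of Σ_i E{ψ_i(n)²} in the limiting SDE.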

Simulation Results
In this section we illustrate the main properties of the proposed algorithms by applying them to a version of the Boyan chain, a frequently used benchmark in the literature, e.g. [19,34,40]. The diagram of the underlying Markov chain is shown in Fig. 1.
The chain has 15 states, with one absorbing state. We assume that the discount factor is γ = 0.85. The chain can be interpreted as a decision making problem when traveling on a highway, with possibilities of exiting and using alternative roads. The policy chosen by the driver at each state is characterized by the probability of selecting the exit action a_exit at state s: π(s, a_exit). The reward for exiting is r(s, a_exit, s′) = −4 for all s and s′ (it can be interpreted as the consumed fuel), but the probability of staying in the same state (jammed) is fixed to 0.2. If we choose the action a_h (to stay on the highway), the reward is r(s, a_h, s′) = −1 for all s and s′, but the probability of staying in the same state grows with the state number as 1 − 1/s, where s is the state number. The target policy is the stationary policy π(s, a_exit) = 0.8. We assume that there are 10 agents with a time-invariant communication graph, such that the agents communicate only with several randomly chosen neighbors (minimum 3 and maximum 6), with equal weights for all the neighbors. Furthermore, it is assumed that the agents are only able to obtain 7-feature Gaussian radial basis representations of the state vector, as functions of the distances to the states 1, 3, 5, 7, 9, 11 and 13: φ_i(s) = exp(−(s − z_i)²/(2σ²)), i = 1, . . ., 7, z_i ∈ {1, 3, 5, 7, 9, 11, 13}, where the "variance" parameter σ² will be specified later. Note that the chain has an absorbing state (so that it does not satisfy the conditions for convergence); hence, we run the algorithms in multiple episodes, by resetting the state back to 1 when the absorbing state is reached.
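The feature construction used in the experiments can be sketched as follows (the function name is ours):

```python
import numpy as np

def rbf_features(s, centers=(1, 3, 5, 7, 9, 11, 13), sigma2=2.0):
    """7-dimensional Gaussian radial basis feature vector for state s:
    phi_i(s) = exp(-(s - z_i)^2 / (2 * sigma^2))."""
    z = np.asarray(centers, dtype=float)
    return np.exp(-((s - z) ** 2) / (2.0 * sigma2))
```

Each feature peaks at its center z_i, so states coinciding with a center are represented most accurately, consistent with the behavior observed in the experiments below.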
In the first experiment, we demonstrate the case in which the agents, individually, are not able to estimate the value function due to their restrictive behavior policies; however, they are able to obtain convergent estimates of the value function using the proposed consensus algorithm. We assume that these policies are such that the agents can individually visit only a subset of the states, with the following agents' starting and stopping states: [(1, 3), (2, 4), (4, 7), (5, 15), (5, 15), (3, 14), (8, 15), (1, 6), (5, 10), (6, 11)], i.e. the first agent always starts in state 1 and stops in state 3, and so on. Formally, we model this situation by assuming the possibility of choosing a third action (besides a_h and a_exit). It can be observed that a better approximation of the value function is obtained for the later states (after state 5), because the behavior policies of the agents are such that, overall, they visit these states more frequently (with higher probability) and, hence, these states have higher weights in the overall criterion (2).
In the second experiment we demonstrate the denoising effect of the introduced distributed algorithms. We assume that the agents have the same stationary behavior policies as above, but that they all start in state 1 and are able to advance to the final state 15. In this example, we also assume that the agents implement the algorithms with eligibility traces, with different λ parameters: [0.6, 0.1, 0.25, 0.5, 0.05, 0.01, 0.3, 0.5, 0.4, 0.7]. In Fig. 3, the value function approximation obtained using D1-GTD2(λ) is represented for step sizes α = β = 0.5 and for σ² = 2. The true value function is depicted using the dashed line. It can be seen that the approximation is better for the states z_i ∈ {1, 3, 5, 7, 9, 11, 13}, since these are the references for the radial basis representation. The approximations at all of these states have similar precision, since they all have similar overall weights in the criterion (2) (it is not possible to converge to the true value function because of the radial basis function approximation). As can be seen from the figure, all the agents have achieved consensus: the final value function approximations are practically the same for all the agents. Fig. 4 shows the parameter estimates θ_i(n) as functions of the number of iterations n. Note that in this case 20 episodes were sufficient for the obtained approximation, which is much less than in the single agent case (Fig. 5), which also exhibits much larger variance. This demonstrates the importance of the multi-agent collaboration from the point of view of denoising. In Fig. 6, the value function approximation obtained by D2-TDC(λ) with two time scales is depicted for α = 0.01 and β = 0.5, while Fig. 7 shows the parameter estimates of D2-TDC(λ) as functions of the number of iterations. In general, the benefit of using two time scales depends on the choice of the initial conditions for the parameters w_i.

Conclusion
In this paper we proposed several novel algorithms for distributed off-policy gradient-based value function approximation in a collaborative multi-agent reinforcement learning setting. The algorithms are obtained by integrating linear dynamic consensus schemes into local recursions derived from recently proposed convergent off-policy gradient temporal difference learning schemes, including those based on eligibility traces, possibly with state-dependent parameters. The proposed algorithms differ in how these local recursions are performed and how the consensus iterations are incorporated. The algorithms are completely decentralized; the distribution of functions is realized by sparse inter-agent communication over an underlying network. The algorithms are applicable in the practically important scenarios in which all the agents have different behavior policies while evaluating a single target policy. Under nonrestrictive assumptions we have proved that the parameter estimates of all the proposed algorithms weakly converge to consensus. The proofs themselves represent the major theoretical contribution of the paper. We have also analyzed different aspects of the algorithm implementation, including the convergence rate of the algorithms, and have demonstrated their "denoising" (variance reduction) properties, which are very important for temporal difference algorithms in general. Furthermore, we have presented a discussion on how to design the network topology and the corresponding weights in order to set appropriate convergence points at consensus. Finally, the presented theoretical analysis of the properties of the algorithms has been illustrated by simulations.
Further work could be devoted to the weak convergence analysis of alternative multi-agent temporal difference schemes, including the emphatic temporal difference algorithm [38,44] and actor-critic algorithms [36,37]. Also, the proposed schemes could be extended to the case of nonlinear value function approximations (such as those using deep neural networks [27]).
respectively. It is possible to show that the vectors u(·) and v(·) asymptotically satisfy a limiting Itô stochastic differential equation (SDE).

Figure 1: Diagram of the simulated MDP

Figure 2: Value function approximation obtained using D1-GTD2(λ) in which the agents have behavior policies such that they can individually visit only a subset of the states.True value function is shown using blue line.

Figure 3: Value function approximation obtained using D1-GTD2(λ) in which all the agents have behavior policies such that they can visit all the states.True value function is shown using blue line.

Figure 6: Value function approximation obtained using D2-TDC(λ) in which all the agents have behavior policies such that they can visit all the states.True value function is shown using blue line.