Distributed Off-Policy Temporal Difference Learning Using Primal-Dual Method

The goal of this paper is to provide theoretical analysis and additional insights on a distributed temporal-difference (TD)-learning algorithm for multi-agent Markov decision processes (MDPs) from a saddle-point viewpoint. Single-agent TD-learning is a reinforcement learning (RL) algorithm for evaluating a given policy based on reward feedback. In multi-agent settings, multiple RL agents act concurrently, and each agent receives only its local reward. The goal of each agent is to evaluate a given policy with respect to the global reward, defined as the average of the local rewards, by sharing learning parameters through random network communications. In this paper, we propose a distributed TD-learning algorithm based on saddle-point frameworks, and provide rigorous analysis of its solution and finite-time convergence using tools from optimization theory. The results in this paper provide general and unified perspectives on the distributed policy evaluation problem, and theoretically complement previous works.


I. INTRODUCTION
Temporal-difference (TD)-learning [1], [2] is a reinforcement learning (RL) algorithm that evaluates a given policy without knowledge of the model of the Markov decision process (MDP); this task is called the policy evaluation problem. In a multi-agent RL setting, multiple RL agents act concurrently and evaluate a policy based on the global reward, defined as an average of the local rewards, while each agent only receives its own local reward. The learning strategies are distributed in the sense that each agent only has access to its local reward, which contains only partial information on the global reward.
The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan .
Among these works, our main focus is on distributed off-policy TD-learning algorithms based on saddle-point approaches. Here, the term 'policy' refers to a mapping from the current state to an action; 'off-policy' means that the target policy we want to evaluate differs from the behavior policy we apply to collect experiences; and a 'saddle point' is a point on the graph of a function where the partial derivatives equal zero but the function attains neither a maximum nor a minimum. Saddle-point approaches are frameworks that address the underlying problem as a saddle-point problem, i.e., the problem of finding a saddle point of a given function. This is an important mathematical subject with applications ranging from optimal control [22] and mathematical programming [23], [24] to deep learning [25], and developing algorithms to solve it has been an active research topic for decades [26].
The main advantages of the saddle-point approaches for TD-learning lie in the following facts: (a) they provide rigorous and unified mathematical structures for the analysis of solutions and convergence; (b) analysis tools from optimization, such as [26], [27], [28], [29], [30], and [31], can be readily applied to prove convergence; (c) the frameworks offer more algorithmic flexibility, such as additional cost constraints and objectives, for example, entropic measures and sparsity-promoting objectives. Saddle-point viewpoints have recently been applied to TD-learning algorithms for single-agent MDPs [1], [2] and multi-agent MDPs [7], [8], [11], [12], [13]. Although fruitful advances have been made recently, rigorous theoretical analysis of distributed off-policy TD-learning algorithms within saddle-point frameworks has not been fully developed so far.
Motivated by the above discussions, the main goal of this paper is to provide theoretical analysis templates and additional insights for distributed off-policy TD-learning algorithms with saddle-point frameworks, which provide general and unified perspectives on the distributed policy evaluation problem. To this end, we propose a new distributed TD-learning algorithm, called distributed gradient temporal-difference (DGTD) learning, for multi-agent MDPs [32], [33], which generalizes the single-agent GTD [1], [2] to the multi-agent setting. We note that DGTD is developed to provide additional insights and analysis templates rather than to improve the convergence speed of previous approaches. In this respect, we view the proposed algorithm and its analysis as a theoretical complement of previous works rather than an improvement over them. Compared to existing results in the literature, the present work systematically provides rigorous analysis of the distributed GTD in a pure saddle-point form, which is lacking in the literature to the authors' knowledge, and through this, new insights and analysis templates are proposed. These can motivate new algorithm developments and offer additional understanding of the distributed GTD. Additional features of the proposed algorithm are (a) consideration of random communication networks and (b) consideration of a pure saddle-point formulation. In the following section, some related works are briefly reviewed and compared with the proposed work.

2) POLICY EVALUATION PROBLEM
Distributed TD-learning algorithms are studied in [8], [9], [10], [11], [12], [13], and [14]. The results in [8], [9], and [34] consider central rewards under different assumptions. The result in [10] suggests a distributed TD-learning with averaging consensus steps and proves its convergence rate. The main difference is that [10] considers on-policy learning, while this work considers off-policy learning methods. Here, 'on-policy learning' is the opposite of 'off-policy learning,' i.e., the target policy we want to evaluate is identical to the behavior policy we apply to collect experiences. The TD-learning in [11] considers a stochastic primal-dual algorithm for policy evaluation with stochastic variants of the consensus-based distributed subgradient method akin to [18]. The main difference is that the algorithm in [11] introduces gradient surrogates of the objective function with respect to the local primal and dual variables, and the mixing steps for consensus are applied to both the local parameters and the local gradient surrogates. Moreover, [12] provides a new stochastic primal-dual algorithm for policy evaluation, which is off-policy. The work in [13] develops the so-called homotopy stochastic primal-dual algorithm with local actions and adaptive learning rates. A distributed gradient TD-learning with value functions that lie in reproducing kernel Hilbert spaces is proposed in [14] together with its finite-sample analysis.

B. DISCUSSIONS
Among the previous works, the approaches closest to the present work are [11], [12], and [13], which are also based on primal-dual approaches. In particular, [12] considers off-policy TD-learning as well, and [11] and [13] can also potentially be extended to the off-policy setting. However, [11] and [12] provide convergence to solutions of stochastic approximate surrogate objective functions instead of the true solutions. In contrast, we provide convergence to solutions of the true objective function. Although [13] also addresses convergence to solutions of the true objective function, it does not include a rigorous solution analysis with bounds on the solutions. Moreover, in [13] the consensus constraint is not part of the saddle-point form, while in our results the overall problem is blended into a single saddle-point problem. Therefore, the method in [13] cannot enjoy advantages of the proposed pure saddle-point framework such as the generalization to the random communication network scenario. The objective functions also differ between the current and previous works. A regularization term is used in [11], and a proximal term is introduced in [12] in the saddle-point forms to make the objective functions strongly convex-concave, while in our work no regularization term is used, in order to obtain an unbiased solution. Besides, additional features of the proposed work are summarized as follows: 1) Most aforementioned distributed TD-learning algorithms assume static network structures except for [10], which addresses deterministic time-varying networks and adopts a different assumption on graph connectivity. In contrast, we consider stochastic networks. 2) A new saddle-point formulation of the distributed policy evaluation problem is proposed with rigorous analysis.
The new formulation is a general and unified framework, which provides new insights, and can potentially open up opportunities to developments of new multi-agent reinforcement learning algorithms.
Preliminary results are included in the conference version [35], which only provides asymptotic convergence. In contrast, this work provides a more rigorous solution analysis and finite-time convergence. In addition, we consider stochastic network communications and provide a modified algorithm with improved convergence properties.

A. NOTATION AND TERMINOLOGY
The following notation is adopted: R^n denotes the n-dimensional Euclidean space; R^{n×m} denotes the set of all n × m real matrices; R_+ and R_{++} denote the sets of nonnegative and positive real numbers, respectively; A^T denotes the transpose of matrix A; I_n denotes the n × n identity matrix; I denotes the identity matrix with appropriate dimension; ‖·‖_2 denotes the standard Euclidean norm; ‖x‖_D := √(x^T D x) for any positive-definite D; λ_min(A) denotes the minimum eigenvalue of A for any symmetric matrix A; |S| denotes the cardinality of any finite set S; E[·] denotes the expectation operator; P[·] denotes the probability of an event; [x]_i is the i-th element of any vector x; [P]_{ij} indicates the element in the i-th row and j-th column of any matrix P; if z is a discrete random variable taking n values and µ ∈ R^n is a stochastic vector, then z ∼ µ stands for P[z = i] = [µ]_i for all i ∈ {1, . . . , n}; 1_n ∈ R^n denotes the n-dimensional vector with all entries equal to one; dist(S, x) denotes the standard Euclidean distance of a vector x from a set S, i.e., dist(S, x) := inf_{y∈S} ‖x − y‖_2; for any S ⊂ R^n, diam(S) := sup_{x∈S,y∈S} ‖x − y‖_2 is the diameter of the set S; for a convex closed set S, Π_S(x) is the projection of x onto the set S, i.e., Π_S(x) := arg min_{y∈S} ‖x − y‖_2; ∇f(x) denotes the gradient of a continuously differentiable function f [24, pp. 691]; f_x(x̄) is a subgradient of a convex function f : R^n → R at a given vector x̄ ∈ R^n when f(y) ≥ f(x̄) + f_x(x̄)^T(y − x̄) holds for all y ∈ R^n [28, pp. 209].

B. GRAPH THEORY
An undirected graph with node set V and edge set E ⊆ V × V is denoted by G = (E, V). We define the neighbor set of node i as N_i := {j ∈ V : (i, j) ∈ E}. The adjacency matrix of G is defined as the matrix W with [W]_{ij} = 1 if and only if (i, j) ∈ E. If G is undirected, then W = W^T. A graph is connected if there is a path between any pair of vertices.

The graph Laplacian is defined as L := H − W, where H is the diagonal degree matrix with [H]_{ii} := |N_i|, the number of neighbors of node i.
If the graph is undirected, then L is symmetric positive semi-definite. It holds that L1_{|V|} = 0. If G is connected, then 0 is a simple eigenvalue of L, i.e., 1_{|V|} is the unique eigenvector (up to scaling) corresponding to 0, and the span of 1_{|V|} is the null space of L.
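These properties can be checked numerically. The following sketch builds the Laplacian of a small hypothetical 4-node path graph and verifies that L1 = 0, that L is positive semi-definite, and that the zero eigenvalue is simple when the graph is connected.

```python
import numpy as np

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.diag(W.sum(axis=1))   # diagonal degree matrix
L = H - W                    # graph Laplacian L := H - W

ones = np.ones(4)
ev = np.linalg.eigvalsh(L)
print(np.allclose(L @ ones, 0))            # L 1 = 0
print(bool(np.all(ev >= -1e-9)))           # positive semi-definite
print(int(np.sum(np.abs(ev) < 1e-9)))      # zero eigenvalue is simple: 1
```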

C. RANDOM COMMUNICATION NETWORK
We will consider the random communication network model of [17]. In this paper, agents communicate with neighboring agents and update their estimates at discrete time instances k ∈ {0, 1, . . .} over a random time-varying network G(k). We assume that G(k) is a random graph that is independent and identically distributed over time k. A formal definition of the random graph is given below.
Assumption 1: Let F := (Ω, B, µ) be a probability space such that Ω is the set of all |V| × |V| adjacency matrices, B is the Borel σ-algebra on Ω, and µ is a probability measure on B. We assume that for all k ≥ 0, the matrix W(k) is drawn from the probability space F.
Define the expected values of the random matrices W(k), H(k), L(k) by W̄ := E[W(k)], H̄ := E[H(k)], and L̄ := E[L(k)], respectively, for all k ≥ 0. The edge set induced by the positive elements of the matrix W̄ is Ē := {(j, i) ∈ V × V : [W̄]_{ij} > 0}. Consider the corresponding graph (Ē, V), which we refer to as the mean connectivity graph [17]. We consider the following connectivity assumption for this graph.
Assumption 2 (Mean connectivity): The mean connectivity graph (Ē, V) is connected.
Under Assumption 2, 0 is a simple eigenvalue of L̄ [36, Lemma 1]. This implies that L̄1_{|V|} = 0 holds; this assumption is used later for the consensus of the learning parameters.
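Assumption 2 only concerns the mean graph: individual draws G(k) may even be disconnected as long as the expected Laplacian corresponds to a connected graph. A small numerical sketch, in which the 4-node setting and the edge probability p are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

def sample_adjacency(p=0.5):
    # Draw a random undirected adjacency matrix W(k): each potential
    # edge is present independently with probability p (hypothetical model).
    U = rng.random((N, N)) < p
    W = np.triu(U, k=1)
    return (W + W.T).astype(float)

# Estimate the mean adjacency matrix by averaging many i.i.d. draws.
W_bar = np.mean([sample_adjacency() for _ in range(2000)], axis=0)
L_bar = np.diag(W_bar.sum(axis=1)) - W_bar   # expected Laplacian

# Mean connectivity: the second-smallest eigenvalue of the expected
# Laplacian (the algebraic connectivity) must be positive.
lam = np.sort(np.linalg.eigvalsh(L_bar))
print(lam[1] > 1e-6)
```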

D. REINFORCEMENT LEARNING OVERVIEW
We briefly review a basic single-agent RL algorithm from [37] with linear function approximation. A Markov decision process (MDP) is characterized by the tuple M := (S, A, P, r, γ), where S is a finite state space (observations in general), A is a finite action space, P(s, a, s') := P[s'|s, a] represents the (unknown) state transition probability from state s to s' given action a, r̂ : S × A × S → [0, σ] with σ > 0 is the bounded random reward function, and γ ∈ (0, 1) is the discount factor. If action a is selected in the current state s, then the state transits to s' with probability P(s, a, s'), and a random reward r̂(s, a, s') ∈ [0, σ] with expectation r(s, a, s') is incurred. A stochastic policy is a map π : S × A → [0, 1] representing the probability π(s, a) = P[a|s]; P^π denotes the transition matrix whose (s, s') entry is P[s'|s] = Σ_{a∈A} P(s, a, s')π(s, a); and d : S → R denotes the stationary distribution of the state s ∈ S under the behavior policy β. We also define r^π(s) as the expected reward given the policy π and the current state s, i.e., r^π(s) := Σ_{a∈A} Σ_{s'∈S} π(s, a)P(s, a, s')r(s, a, s'). The infinite-horizon discounted value function with policy π and reward r̂ is J^π(s) := E[Σ_{k=0}^∞ γ^k r̂(s_k, a_k, s_{k+1}) | s_0 = s], where E stands for the expectation taken with respect to the state-action trajectories following the state transition P^π. Given pre-selected basis (or feature) functions φ_1, . . . , φ_q : S → R, Φ ∈ R^{|S|×q} is defined as a full-column-rank matrix whose s-th row vector is φ(s) := [φ_1(s) · · · φ_q(s)]. The goal of RL with linear function approximation is to find the weight vector w such that J_w := Φw approximates the true value function J^π. This is typically done by minimizing the mean-square Bellman error loss function [2]

min_{w∈R^q} MSBE(w) := (1/2)‖r^π + γP^π Φw − Φw‖²_D,     (1)

where D is a symmetric positive-definite matrix and r^π ∈ R^{|S|} is a vector enumerating all r^π(s), s ∈ S.
For online learning, we assume that D is a diagonal matrix with positive diagonal elements d(s), s ∈ S. In model-free learning, a stochastic gradient descent method can be applied with stochastic estimates of the gradient ∇_w MSBE(w) = (γP^π Φ − Φ)^T D(r^π + γP^π Φw − Φw). The temporal-difference (TD) learning [22], [37] with linear function approximation is a stochastic gradient descent method with stochastic estimates of the approximate gradient ∇_w MSBE(w) ≅ (−Φ)^T D(r^π + γP^π Φw − Φw), which is obtained by dropping the term γP^π Φ in ∇_w MSBE(w). If the linear function approximation is used, then this algorithm converges to an optimal solution of (1). The GTD in [2] instead solves the minimization of the mean-square projected Bellman error loss function

min_{w∈R^q} MSPBE(w) := (1/2)‖Π(r^π + γP^π Φw) − Φw‖²_D,

where Π is the projection onto the range space of Φ, denoted by R(Φ). The projection can be performed by matrix multiplication: we write Π(x) := Πx, where Π := Φ(Φ^T D Φ)^{-1}Φ^T D. Compared to the standard TD-learning, the main advantage of the GTD algorithms [1], [2] is their off-policy learning ability.
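As an illustration of the TD update just described, the following sketch runs on-policy TD(0) with linear features on a toy 3-state Markov reward process (the transition matrix, rewards, and features are all hypothetical) and compares the iterate with the TD fixed point w* solving Φ^T D(I − γP^π)Φ w = Φ^T D r^π:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, gamma = 3, 0.9
# Hypothetical on-policy transition matrix P^pi and expected rewards r^pi.
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])
r = np.array([0.0, 0.5, 1.0])
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])          # full-column-rank feature matrix

# TD(0): semi-gradient update w <- w + alpha * delta * phi(s).
w = np.zeros(2)
s = 0
for k in range(200_000):
    alpha = 0.5 / (1.0 + k / 1000.0)                      # decaying step-size
    s_next = rng.choice(nS, p=P[s])
    delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w   # TD error
    w += alpha * delta * Phi[s]
    s = s_next

# TD fixed point: Phi^T D (I - gamma P) Phi w* = Phi^T D r, where D is the
# stationary distribution (uniform here, since P is doubly stochastic).
D = np.eye(nS) / nS
A = Phi.T @ D @ (np.eye(nS) - gamma * P) @ Phi
w_star = np.linalg.solve(A, Phi.T @ D @ r)
print(np.linalg.norm(w - w_star) < 1.0)
```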
Remark 1: Although its direct application to real problems is limited, the policy evaluation problem is a fundamental problem which is a critical building block to develop more practical policy optimization algorithms such as SARSA [38] and actor-critic [39] algorithms.
Note that d depends on the behavior policy β, while P^π and r^π depend on the target policy π that we want to evaluate. This corresponds to off-policy learning. The main problem is to obtain samples (s, a, r̂, s') under π from the samples collected under β. This can be done by the importance sampling or sub-sampling techniques [1], which will be described soon.
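The importance-sampling correction can be illustrated in isolation: samples drawn under β are reweighted by the ratio ρ = π(s, a)/β(s, a), so that the weighted average matches an expectation under π. The two-action example below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([0.9, 0.1])      # target policy over two actions
beta = np.array([0.5, 0.5])    # behavior policy used to collect data
reward = np.array([1.0, 0.0])  # deterministic reward of each action

actions = rng.choice(2, size=200_000, p=beta)
rho = pi[actions] / beta[actions]           # importance weights pi/beta
estimate = np.mean(rho * reward[actions])   # off-policy estimate of E_pi[r]
true_value = np.sum(pi * reward)            # exact E_pi[r] = 0.9
print(abs(estimate - true_value) < 0.05)
```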

III. DISTRIBUTED REINFORCEMENT LEARNING OVERVIEW
In this section, we introduce the notion of distributed RL, which will be addressed throughout the paper. Consider N RL agents labelled by i ∈ {1, . . . , N} =: V. A multi-agent Markov decision process is characterized by (S, {A_i}_{i∈V}, P, {r_i}_{i∈V}, γ), where γ ∈ (0, 1) is the discount factor, S is a finite state space, A_i is a finite action space of agent i, and a := (a_1, . . . , a_N) is the joint action. In particular, if the joint action a is selected in the current state s, then the state transits to s' with probability P(s, a, s'), and each agent i observes a random reward r̂_i(s, a, s') ∈ [0, σ] with expectation R_i(s, a, s'). We assume that each agent does not have access to other agents' rewards. For instance, there may exist no centralized coordinator, so that each agent does not know other agents' rewards; in another example, each agent/coordinator may not want to reveal its own goal or the global goal for security/privacy reasons. We denote by R_i^π(s) the expected reward of agent i given the current state s, R_i^π(s) := E[r̂_i(s_k, a_k, s_{k+1}) | s_k = s, π]. Throughout the paper, a vector enumerating all R_i^π(s), s ∈ S, is denoted by R_i^π ∈ R^{|S|}. In addition, denote by P_i(s, a, s_i') the state transition probability of agent i given the joint state s and joint action a.
In this paper, we adopt the standard assumption that the MDP under the given behavior policy β has a stationary distribution.
Assumption 3: With a fixed policy β, the Markov chain P β is ergodic with the stationary distribution d with d(s) > 0, s ∈ S.
We note that Assumption 3 is a fundamental hypothesis of most TD-learning algorithms [1], [2], [22], [40], [41]. In particular, if the Markov chain is ergodic, it implies that the stationary state distribution lim k→∞ P[s k = s|β] =: d(s), ∀s ∈ S, exists and is unique. Without the ergodicity assumption, some notions in the infinite-horizon discounted Markov decision problem are not well defined because of the time-varying nature of the state distribution.
Next, we summarize additional definitions and notations for some important quantities below.
1) D is defined as the diagonal matrix with diagonal entries d(s), s ∈ S.

2) The central reward is defined as r̂(s, a, s') := (1/N) Σ_{i=1}^N r̂_i(s, a, s'), and the corresponding conditional expectation is R^π(s) := (1/N) Σ_{i=1}^N R_i^π(s).
3) J^π is the infinite-horizon discounted value function corresponding to the policy π and the central reward r̂ = (r̂_1 + · · · + r̂_N)/N, defined as J^π(s) := E[Σ_{k=0}^∞ γ^k r̂(s_k, a_k, s_{k+1}) | s_0 = s], where E stands for the expectation taken with respect to the state-action trajectories following the state transition P^π. Note that J^π satisfies the Bellman equation J^π = R^π + γP^π J^π.
Problem 1 (Multi-agent RL problem (MARLP)): In the multi-agent RL problem with linear function approximation, the goal of each agent i is to learn the weight vector w such that J_w := Φw approximates the true value function J^π corresponding to the centralized reward r̂ = (r̂_1 + · · · + r̂_N)/N by minimizing the mean-square projected Bellman error loss function [2]

min_{w∈R^q} MSPBE(w) := (1/2)‖Π(R^π + γP^π Φw) − Φw‖²_D,     (3)

where D is a symmetric positive-definite matrix and R^π ∈ R^{|S|} is a vector enumerating all R^π(s), s ∈ S. The projection can be performed by matrix multiplication: we write Π(x) := Πx, where Π := Φ(Φ^T D Φ)^{-1}Φ^T D. Firstly, we can prove that solving Problem 1 is equivalent to solving the optimization problem (4) over w ∈ C, where C ⊂ R^q is assumed to be a compact convex set which includes an unconstrained global minimum of (4). The following result formally establishes the equivalence between (4) and Problem 1.
Proposition 1: The following statements hold: 1) The solution of (4) is identical to the unique solution w* of the projected Bellman equation (5). 2) The solution of (4) is given by (6). 3) The solution of Problem 1 is identical to the solution of (4). Proof: 1) Pre-multiplying the first-order optimality condition of (4) by Φ yields the projected Bellman equation (5).
2) A solution w* of the projected Bellman equation (5) exists [22, pp. 355]. To prove the second statement, pre-multiply (5) by Φ^T D to obtain Φ^T D(I − γP^π)Φ w* = Φ^T D R^π, and pre-multiply both sides of this equation by (Φ^T D(I − γP^π)Φ)^{-1} to obtain (6). The solution is unique because the objective function in (4) is strongly convex.
3) According to the first statement, the solution of Problem 1 satisfies the projected Bellman equation (5). Moreover, we can easily prove that the solution of the projected Bellman equation (5) is identical to the solution of the optimization in (3): taking the gradient of (3) with respect to w and setting it to zero yields (5). This completes the proof.
Note that if Φ = I_{|S|}, then the results reduce to those of the tabular representation. Therefore, all the developments in this paper include both the tabular representation and the linear function approximation cases.
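Proposition 1 can be sanity-checked numerically: for a hypothetical 3-state chain, the closed-form solution w* = (Φ^T D(I − γP^π)Φ)^{-1} Φ^T D R^π makes the projected Bellman residual vanish.

```python
import numpy as np

gamma = 0.9
# Hypothetical transition matrix (doubly stochastic, so the stationary
# distribution is uniform), expected central reward, and features.
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, 0.0, 2.0])
D = np.diag([1/3, 1/3, 1/3])
Phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

# Closed-form solution of the projected Bellman equation.
A = Phi.T @ D @ (np.eye(3) - gamma * P) @ Phi
w_star = np.linalg.solve(A, Phi.T @ D @ r)

# Projection onto R(Phi): Pi := Phi (Phi^T D Phi)^{-1} Phi^T D.
Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
residual = Pi @ (r + gamma * P @ Phi @ w_star - Phi @ w_star)
print(np.allclose(residual, 0))   # projected Bellman residual vanishes
```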
Next, to develop a distributed algorithm, we first convert (4) into the equivalent distributed optimization problem [16] (the distributed optimization form of MARLP), denoted by (7), in which each agent i maintains a local copy w_i of the parameter subject to the consensus constraint

w_1 = w_2 = · · · = w_N,     (8)

where (8) enforces consensus among the N copies of the parameter w. We first establish the equivalence between (4) and (7). Proposition 2: The solution of Problem 1 is identical to that of (7).
Proof: In (7), the equality constraint can be eliminated by setting w_1 = w_2 = · · · = w_N =: w, which recovers (4). Next, according to the third statement of Proposition 1, the solution of (4) is identical to that of Problem 1. This completes the proof.
To make the problem practical, we assume that the learning parameters w_i, i ∈ V, are exchanged through a random communication network represented by the undirected graph G(k) = (E(k), V(k)). In Section V, we will make several conversions of (7) to arrive at an optimization form which can be solved using a primal-dual saddle-point algorithm [26], [28].

IV. STOCHASTIC PRIMAL-DUAL ALGORITHM FOR SADDLE-POINT PROBLEM
The proposed RL algorithm is based on a saddle-point problem formulation of the distributed optimization problem (7). In this section, we briefly introduce the definition of the saddle-point problem and a stochastic primal-dual algorithm [26] to find its solution.
Definition 1 (Saddle-point [28]): Consider a map L : X × W → R. A pair (x*, w*) ∈ X × W is called a saddle-point of L if L(x*, w) ≤ L(x*, w*) ≤ L(x, w*) for all x ∈ X and w ∈ W. The saddle-point problem is defined as the problem of finding saddle-points (x*, w*). It can also be defined as solving min_{x∈X} max_{w∈W} L(x, w) = max_{w∈W} min_{x∈X} L(x, w).
Our analysis will use the notion of approximate saddle-points in a geometric manner. In particular, the concept of the ε-saddle set is defined below.
Definition 2 (ε-saddle set): For any ε ≥ 0, the ε-saddle set is defined as H_ε := {(x, w) ∈ X × W : max_{w'∈W} L(x, w') − min_{x'∈X} L(x', w) ≤ ε}. From the definition, it is clear that H_0 is the set of all saddle-points. The goal of the saddle-point problem is to find a saddle-point (x*, w*), defined in Definition 1, over the set X × W. The stochastic primal-dual saddle-point algorithm in [26] can find a saddle-point when we have access to stochastic gradient estimates of the function L. It executes the updates

x_{k+1} = Π_X(x_k − α_k(L_x(x_k, w_k) + ε_k)),  w_{k+1} = Π_W(w_k + α_k(L_w(x_k, w_k) + ξ_k)),

where L_x(x, w) and L_w(x, w) are the gradients of L(x, w) with respect to x and w, respectively, α_k is a step-size, and ε_k, ξ_k are i.i.d. random variables with zero means. In this paper, the proposed algorithm can be seen as a member of the class of stochastic primal-dual saddle-point algorithms, and the availability of stochastic gradient estimates is a critical assumption for this purpose. Such stochastic gradient estimates can be obtained using state-action transition samples obtained by interacting with the environment.
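A minimal sketch of the projected stochastic primal-dual iteration on a toy convex-concave function L(x, w) = ½‖x‖² + wᵀ(Ax − b), where the data A, b, the box constraints, and the noise level are all hypothetical; the averaged iterates approach the unique saddle-point:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])

# Saddle-point: L_x = x + A^T w = 0 and L_w = A x - b = 0.
x_star = np.linalg.solve(A, b)
w_star = -np.linalg.solve(A.T, x_star)

proj = lambda z: np.clip(z, -5.0, 5.0)   # projection onto the box [-5, 5]^2
x, w = np.zeros(2), np.zeros(2)
x_avg, w_avg = np.zeros(2), np.zeros(2)
T = 100_000
for k in range(1, T + 1):
    alpha = 1.0 / np.sqrt(k)                                # decaying step
    eps, xi = rng.normal(0, 0.1, 2), rng.normal(0, 0.1, 2)  # zero-mean noise
    gx = x + A.T @ w + eps    # stochastic estimate of L_x(x, w)
    gw = A @ x - b + xi       # stochastic estimate of L_w(x, w)
    x = proj(x - alpha * gx)  # projected descent in the primal variable
    w = proj(w + alpha * gw)  # projected ascent in the dual variable
    x_avg += x / T
    w_avg += w / T

print(np.linalg.norm(x_avg - x_star) < 0.5)
print(np.linalg.norm(w_avg - w_star) < 0.5)
```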
To proceed, define F_k, the history of the algorithm until time k, as the σ-field generated by all iterates and noise variables up to time k. In the following result, we provide a finite-time convergence guarantee for the primal-dual algorithm with high probability.

The proof of Proposition 3 can be found in the Appendix.
Remark 2: 1) Convergence of stochastic primal-dual algorithms has been studied in previous works such as [26, Section 3.1]. The analysis given in Proposition 3 is tailored for our purposes and, to our knowledge, has not been formally reported in the literature. We note that the analysis given in Proposition 3 can be viewed as a complement rather than a replacement of the existing results. 2) We assume i.i.d. samples in Proposition 3. In practice, uniform sampling from a sufficiently large replay buffer [42] can closely approximate this scenario. Convergence of stochastic distributed primal-dual algorithms was studied in [13] with Markovian sampling. We expect that the present work can potentially be extended to the Markovian sampling scenario along similar directions, and this remains a future direction.

V. SADDLE-POINT FORMULATION OF MARLP
In the previous section, we introduced the notion of the saddle-point and a stochastic primal-dual algorithm to find it. In this section, we study a saddle-point formulation of the distributed optimization problem (7) as a next step. Once it is obtained, the MARLP can be solved by using the stochastic primal-dual algorithm. For notational simplicity, we first introduce stacked vector and matrix notations.
Using those notations, the MSPBE loss function in (7) can be compactly expressed in terms of the stacked quantities, where ⊗ denotes the Kronecker product. Note that by the mean connectivity Assumption 2, the consensus constraint (8) can be expressed as L̄w̄ = 0, since L̄ has a simple eigenvalue 0 with corresponding eigenvector 1_{|V|} [36, Lemma 1]. Motivated by the continuous-time consensus optimization algorithms in [29], [30], and [31], we convert the problem (7) into the augmented Lagrangian problem [24, Sec. 4] in (14), subject to L̄w̄ = 0, where a quadratic penalty term w̄^T L̄L̄w̄ for the equality constraint L̄w̄ = 0 is introduced. If the model is known, the above problem is an equality-constrained quadratic programming problem, which can be solved by means of convex optimization methods [23]. Otherwise, the problem can still be solved using stochastic algorithms with observations; the latter case is our main concern. To develop model-free stochastic algorithms, some issues need to be taken into account. First, to obtain a stochastic estimate of the gradient, we would need to assume that at least two independent next-state samples can be drawn from any current state, which is impossible in most practical applications. This is often called the double sampling problem [22]. Second, the inverse matrix (Φ̄^T D̄ Φ̄)^{-1} in the objective function (14) needs to be removed. In particular, the main reason we use the linear function approximation is the large size of the state space, to the extent that enumerating the entries of the value vector is computationally demanding or even impossible. The computation of the inverse (Φ̄^T D̄ Φ̄)^{-1} is not possible due to both its computational complexity and the presence of the matrix D̄, which contains the stationary state distribution, assumed to be unknown in most RL settings. In GTD [2], this problem is resolved using a dual problem [8]. Following the same direction, we convert (14) into the equivalent optimization problem (15) over (ε̄, h̄, w̄), where ε̄ and h̄ are newly introduced parameters.
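The way the inverse is removed can be isolated into a small identity check: for any symmetric positive-definite M (playing the role of Φ̄^T D̄ Φ̄), ½ yᵀM⁻¹y = max_θ (θᵀy − ½ θᵀMθ) with maximizer θ = M⁻¹y, so the weighted quadratic involving the inverse becomes a maximization over a new variable in which only M itself appears. A numerical check on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(3, 3))
M = B @ B.T + np.eye(3)      # symmetric positive definite (plays Phi^T D Phi)
y = rng.normal(size=3)

primal = 0.5 * y @ np.linalg.solve(M, y)   # 0.5 * y^T M^{-1} y
theta = np.linalg.solve(M, y)              # maximizer of the dual form
dual = theta @ y - 0.5 * theta @ M @ theta
print(np.isclose(primal, dual))
```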
The next key step is to derive the Lagrangian dual problem [23], which can be obtained using standard approaches [23]. Proposition 4: The Lagrangian dual problem of (15) is given by (16) with dual function ψ(θ̄, v̄, μ̄). Proof: The dual problem can be obtained using standard manipulations as in [23, Ch. 5]. Define the Lagrangian function L(ε̄, h̄, w̄, θ̄, v̄, μ̄), where θ̄, v̄, μ̄ are Lagrange multipliers. If we fix (θ̄, v̄, μ̄), then the problem min_{ε̄,h̄,w̄} L(ε̄, h̄, w̄, θ̄, v̄, μ̄) has a finite optimal value when θ̄^T B̄ − v̄^T L̄ − μ̄^T L̄ = 0. The optimal solutions satisfy ε̄ = (Φ̄^T D̄ Φ̄)θ̄ and h̄ = v̄. Plugging them into the Lagrangian function, the dual problem is obtained.
One can observe that the inverse matrix (Φ̄^T D̄ Φ̄)^{-1} no longer appears in the dual problem (16). To solve (16), we again construct the Lagrangian function (17) of (16) as in [8], where w̄ is the Lagrange multiplier. We further modify (17) by adding the term −(κ/2)w̄^T L̄w̄, where κ ≥ 0 is a design parameter. Note that the solution of the original problem is unchanged for any κ ≥ 0. The term −(κ/2)w̄^T L̄w̄ is added to accelerate the convergence in terms of the consensus of w̄.

VI. SOLUTION ANALYSIS
In Section V, we derived a saddle-point formulation of the distributed optimization problem (7). In this section, we analyze the set of saddle-points. In particular, we obtain characterizations of the set of saddle-points which solve (19). These characterizations will be used in subsequent sections to develop the proposed RL algorithm. According to standard results in convex optimization [23, Sec. 5.5.3, pp. 243], any saddle-point (θ̄*, v̄*, μ̄*, w̄*) satisfying (20) must satisfy the Karush-Kuhn-Tucker (KKT) condition (21), although the converse is not true in general. However, by investigating the KKT points, we can obtain useful information on the saddle-points. We first establish the fact that the set of KKT points corresponds to the set of optimal solutions of the consensus optimization problem (7)-(8).

Proposition 5: The set of all KKT points satisfying (21) is given below, where v̄* = 0, w* is given in (6) (the unique solution of the projected Bellman equation (5)), and F* is the set of all solutions of a linear equation in μ̄. Proof: The KKT condition in (21) is equivalent to a set of linear equations. Since the mean connectivity graph (Ē, V) of G(k) is connected by Assumption 2, the dimension of the null space of L̄ is one. Therefore, span(1_{|V|}) is the null space, and (26) implies the consensus w* = w*_1 = · · · = w*_N. Plugging (26) into (25), and noting from (24) that the stationary point for θ̄ satisfies (29), plugging this equation into (28) yields (30).

Multiplying (30) by (1 ⊗ I)^T on the left, and since Φ^T D(I − γP^π)Φ is nonsingular [22, pp. 300], pre-multiplying both sides of the resulting equation by ((Φ^T D(I − γP^π)Φ)^T)^{-1} yields (31). Pre-multiplying (31) with Φ^T from the left yields the projected Bellman equation in Proposition 1, and w* is any of its solutions. In particular, multiplying (24) by (1 ⊗ I)^T from the left, a KKT point for w̄* is expressed as w̄* = 1 ⊗ w*. From (30), μ̄* is any solution of the linear equation (30). Lastly, (31) can be rewritten in the following form.

Subtracting the last term from (29), we obtain θ̄*.
This completes the proof.
Since the set of saddle-points of L in (17) is a subset of the KKT points, we can infer a potential structure of the set of saddle-points. According to Corollary 1, the set of KKT points corresponding to θ̄, v̄, and w̄ is the singleton {θ̄*} × {v̄*} × {1_N ⊗ w*}. Therefore, it is the unique saddle-point corresponding to θ̄, v̄, and w̄. On the other hand, F̄* is a set. We can prove that F̄* is an affine space.
Proposition 6: We have μ̄* = L̄†Φ̄^T(I − γP̄^π)^T D̄ Φ̄ θ̄* ∈ F̄*. Proof: Since F* is the set of solutions of the linear equation L̄μ̄ = Φ̄^T(I − γP̄^π)^T D̄ Φ̄ θ̄*, F* is the set of general solutions of this linear equation, which are given by the affine space μ̄ = L̄†Φ̄^T(I − γP̄^π)^T D̄ Φ̄ θ̄* + (L̄†L̄ − I)z, where L̄† is the pseudo-inverse of L̄ and z ∈ R^{|S|N} is arbitrary. In addition, since F̄* ⊆ F* and F̄* is also affine by Lemma 1, one concludes that μ̄ = L̄†Φ̄^T(I − γP̄^π)^T D̄ Φ̄ θ̄* ∈ F̄*. For some technical reasons that will become clear later, algorithms to find a solution need to confine their search space to compact and convex sets which include at least one saddle-point in R̄ in Corollary 1. To this end, we compute a bound on at least one saddle-point (θ̄*, v̄*, μ̄*, w̄*) in the following lemma.
For the pseudo-inverse of the graph Laplacian, we can use the closed-form expression given in [43]. To prove Lemma 2, we will first prove a bound on w* ∈ W*. Claim: If w* is an optimal solution presented in Proposition 1, then w* admits explicit upper and lower bounds in terms of σ, γ, and the problem data.

Proof of Claim:
We first bound the term ‖w*‖_∞ as follows, where the first inequality follows from the triangle inequality, the second inequality uses ‖·‖_∞ ≤ ‖·‖_2, the third inequality uses ‖x‖_2 ≤ ‖x‖_D/√(min_{s∈S} d(s)), the fourth inequality comes from [22, Proposition 6.10], and the last inequality uses the bound on the rewards. On the other hand, the lower bound can be obtained using ‖v‖_2 ≤ √n ‖v‖_∞ for any v ∈ R^n. Combining the two relations completes the proof.
The first bound easily follows from w̄* = 1_N ⊗ w* and the Claim. Since v̄* = 0 by Proposition 5, the second inequality is obvious. For the third bound, we use the expression for θ* in Proposition 5, where the first inequality follows from ‖·‖∞ ≤ ‖·‖2, the third inequality follows from the nonexpansive property of the projection (see [22, Proof of Proposition 6.9, pp. 355] for details), and the fifth inequality uses ‖v‖2 ≤ √n‖v‖∞ for any v ∈ Rⁿ. A lower bound on ‖Φ̄θ̄*‖∞ is obtained similarly, and combining the two inequalities yields the third bound. For the last inequality, we use Proposition 6 and obtain a bound on L†Φ̄ᵀ(I − γPπ)ᵀD̄θ̄* ∈ F̄*, where the third inequality follows from the fact that the absolute values of all elements of (I − γPπ)ᵀD are less than one, and the fourth inequality uses the bound on ‖θ*‖∞.
In this section, we analyzed the set of saddle-points corresponding to the MARLP. In the next section, we introduce the proposed distributed RL algorithm, which solves the saddle-point problem of the MARLP in (19) by using the stochastic primal-dual algorithm.

VII. PRIMAL-DUAL DISTRIBUTED GTD ALGORITHM (PRIMAL-DUAL DGTD)
In this section, we study a distributed GTD algorithm to solve Problem 1. The main idea is to solve the saddle-point problem of the MARLP in (19) by using the stochastic primal-dual algorithm, where unbiased stochastic gradient estimates are obtained from samples of the state, action, and reward. To proceed, we first modify the saddle-point problem of the MARLP in (19) into a constrained saddle-point problem whose domains are confined to compact sets.
Lemma 2 provides rough estimates of bounds on sets that include at least one saddle-point of the Lagrangian function (17). Define the cube B_β := {x ∈ R^{|S|N} : ‖x‖∞ ≤ β} and C_θ := B_{c_θ+β_θ}, C_v := B_{c_v+β_v}, C_μ := B_{c_μ+β_μ}, C_w := B_{c_w+β_w} for β_θ, β_v, β_μ, β_w > 0. Then, the constraint sets satisfy θ̄* ∈ C_θ, v̄* ∈ C_v, w̄* ∈ C_w, and C_μ ∩ F̄* ≠ ∅. Estimating c_θ, c_v, c_μ, c_w > 0 requires additional analysis and is almost infeasible in most real applications because the model is assumed to be unknown. In practice, however, we can choose sufficiently large parameters c_θ, c_v, c_μ, c_w > 0 so that the sets include at least one solution. In this respect, we assume that sufficiently large sets C_θ, C_v, C_μ, C_w satisfy C_μ ∩ F̄* ≠ ∅. Such an assumption is frequently adopted in the literature when projection steps are required for general stochastic approximation algorithms to guarantee convergence, e.g., [4], [27], [44]. For simpler analysis, we also assume that the solutions are included in the interiors of the compact sets.
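The Euclidean projection onto a cube B_β of this form is simply a coordinatewise clip, as in the following sketch (function name is ours, not the paper's):

```python
def project_cube(x, beta):
    # Euclidean projection onto B_beta = {x : ||x||_inf <= beta}:
    # for the sup-norm ball this decouples into clipping each coordinate
    # of x to the interval [-beta, beta]
    return [min(beta, max(-beta, xi)) for xi in x]

projected = project_cube([3.0, -0.2, -7.5], 1.0)   # → [1.0, -0.2, -1.0]
```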
Assumption 4: The constraint sets satisfy θ̄* ∈ C_θ, v̄* ∈ C_v, w̄* ∈ C_w, and C_μ ∩ F̄* ≠ ∅.
Under Assumption 4, finding a saddle-point in (19) reduces to a constrained saddle-point problem. For notational convenience, we introduce compact notation so that the saddle-point problem reads min_{x∈X} max_{w∈W} L(x, w). If the gradients of the Lagrangian were available, the deterministic primal-dual algorithm [28] could be used. In this paper, however, our problem allows only stochastic gradient estimates of the Lagrangian function: the exact gradients are not accessible, and only unbiased stochastic estimates are given. In this case, the stochastic primal-dual algorithm [26] introduced in Section IV can find a solution under certain conditions, where ε_k and ξ_k are i.i.d. random variables with zero mean. In our case, stochastic estimates of the Lagrangian function (18) can be obtained by using samples of the state, action, and reward. The overall algorithm is given in Algorithm 1.
The algorithm assumes additional partial information sharing among agents, e.g., sharing of learning parameters, through random network communications, where the network structure is represented by a randomly changing undirected graph. Despite this additional communication model, the algorithm is still distributed in the sense that each agent has a local view of the overall system: it can only access the learning parameters of the neighboring RL agents in the graph. Potential applications include distributed machine learning, distributed resource allocation, and robotics, where the reward information is limited due to physical limitations or privacy constraints.
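The projected stochastic primal-dual iteration can be illustrated on a toy problem. The sketch below is not Algorithm 1 itself; the objective L(x, w) = 0.5x² + xw − 0.5w² (saddle-point at x* = w* = 0), the noise level, the constraint set [−5, 5], and all constants are our own illustrative choices:

```python
import random

def stochastic_primal_dual(T=20000, a0=0.5, seed=0):
    """Projected stochastic primal-dual iteration for
    min_x max_w L(x, w) with L(x, w) = 0.5*x**2 + x*w - 0.5*w**2,
    using noisy gradients, projection, and iterate averaging."""
    rng = random.Random(seed)

    def clip(z):                                 # projection onto C = [-5, 5]
        return max(-5.0, min(5.0, z))

    x, w = 4.0, -4.0
    x_avg, w_avg = 0.0, 0.0
    for k in range(T):
        alpha = a0 / (k + 1) ** 0.5              # alpha_k = a0 / sqrt(k + 1)
        gx = x + w + rng.gauss(0.0, 0.1)         # noisy dL/dx
        gw = x - w + rng.gauss(0.0, 0.1)         # noisy dL/dw
        x = clip(x - alpha * gx)                 # primal descent + projection
        w = clip(w + alpha * gw)                 # dual ascent + projection
        x_avg += (x - x_avg) / (k + 1)           # running averages of iterates
        w_avg += (w - w_avg) / (k + 1)
    return x_avg, w_avg

x_avg, w_avg = stochastic_primal_dual()
```

The averaged iterates (x_avg, w_avg) settle near the saddle-point (0, 0) even though the individual iterates keep fluctuating due to the gradient noise.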
Some remarks are summarized below.
Remark 3:
1) In Algorithm 1, the sets C_θ, C_v, C_μ, C_w will be defined shortly.
2) Similar to the GTD algorithms [1], [2], Algorithm 1 is an off-policy algorithm, which means that the behavior policy, β, and the target policy, π, can be different. The behavior policy β is the policy used to sample transitions, while the target policy π is the policy we want to evaluate by predicting the value function J^π.
3) An advantage of off-policy learning is the flexibility in choosing the behavior policy used to sample transitions, which provides additional flexibility and convenience in practical implementations.
4) As can be seen in Algorithm 1, the transitions (s, a, r, s′) are sampled under the behavior policy β, while transitions under the target policy can be obtained by using the importance sampling or subsampling techniques of [1], which will be described shortly.
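The importance-sampling correction mentioned in item 4) can be sketched as follows: reweighting each sample by the ratio ρ = π(a|s)/β(a|s) makes an empirical average taken under β estimate an expectation under π. The policies and function below are hypothetical, for a single fixed state:

```python
import random

def is_estimate(pi, beta, f, n=100000, seed=1):
    """Estimate E_pi[f(a)] from actions sampled under a behavior policy
    beta, reweighting each sample by rho = pi(a)/beta(a).
    pi and beta are dicts mapping action -> probability (fixed state)."""
    rng = random.Random(seed)
    actions = list(beta)
    weights = [beta[a] for a in actions]
    sample = rng.choices(actions, weights, k=n)  # a ~ beta(.|s)
    return sum(pi[a] / beta[a] * f(a) for a in sample) / n

pi = {0: 0.9, 1: 0.1}        # target policy (hypothetical)
beta = {0: 0.5, 1: 0.5}      # behavior policy (hypothetical)
est = is_estimate(pi, beta, lambda a: float(a))  # true value E_pi[a] = 0.1
```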
Remark 4: The proposed work deals with the policy evaluation problem, where we evaluate a given fixed policy using value prediction approaches. Policy evaluation is a fundamental problem that does not have direct real-world applications on its own; however, it can be applied to the policy optimization problem of finding an optimal policy. For example, the proposed model-free policy evaluation algorithm can be used in the actor-critic algorithm [39] and SARSA [38] for policy optimization. In particular, the actor-critic algorithm is a policy optimization algorithm consisting of two parts: 1) policy evaluation and 2) policy improvement. The first part evaluates the current policy, and the second part improves the policy using the value function estimate computed in the first part. The distributed TD-learning proposed in this paper can be applied to the policy evaluation part of the actor-critic algorithm in order to find an optimal policy in a distributed manner. Similarly, SARSA also consists of the two parts, 1) policy evaluation and 2) policy improvement. The main difference between actor-critic and SARSA lies in that actor-critic applies gradient steps for the policy improvement, while SARSA uses a greedy policy. In this case, the distributed RL can be used in resource management problems or multi-robot systems where the local reward cannot be shared with other agents due to physical or security constraints.

Algorithm 1 (excerpt):
4: end for
5: for k ∈ {0, . . . , T − 1} do
6:   for agent i ∈ {1, . . . , N} do
7:     Sample (s, a, s′) with s ∼ d, a_i ∼ β_i(·|s), s′ ∼ P(s, a, ·), r_i := r_i(s, a, s′).
8:     Update the primal variables, where N_i(k) is the neighborhood of node i on the graph G(k), φ := φ(s), φ′ := φ(s′), and ρ := π(s, a)/β(s, a) is the importance ratio.
9:     Update the dual variables.
10:    Project the parameters onto C_θ, C_v, C_μ, C_w.
In Line 7, each agent samples the state, action, and the corresponding local reward; Line 8 updates the primal variables according to the stochastic gradient descent step, and Line 9 updates the dual variables by the stochastic gradient ascent step. Line 10 projects the variables onto the corresponding compact sets C_θ, C_v, C_μ, C_w, and Line 13 outputs the averaged iterates over the whole run instead of the final iterates. Note that the averaged dual variables can be computed recursively [22, pp. 181].
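The recursive computation of the averaged iterates is the standard running-average update; a minimal sketch (function name is ours):

```python
def update_average(avg, new_value, k):
    # Running average of w_0, ..., w_k without storing the history:
    # bar_w_k = bar_w_{k-1} + (w_k - bar_w_{k-1}) / (k + 1)
    return avg + (new_value - avg) / (k + 1)

avg = 0.0
for k, w in enumerate([2.0, 4.0, 6.0]):
    avg = update_average(avg, w, k)
# avg == 4.0, the mean of the three iterates
```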
Remark 5: In this paper, we consider off-policy learning. In particular, to obtain a stochastic approximation of P^π from samples generated under the behavior policy β, we use the fact that reweighting each sample by the importance ratio ρ = π(s, a)/β(s, a) makes expectations taken under β match those under π. The next proposition states that the averaged dual variable converges to the set of saddle-points in terms of the ε-saddle set with a vanishing ε.
Proposition 7 provides convergence of the iterates of Algorithm 1 to the ε-saddle set, H_ε, with O(1/ε²) samples (equivalently, at an O(1/√T) rate). In particular, the proposed analysis provides a sample complexity in terms of ε, while the other dependencies are unclear due to unknown parameters such as C. Moreover, convergence to the ε-saddle set H_ε roughly implies that each w_k^(i) converges to the w* that solves the optimization problem in (3). For the specific L of our problem, we can obtain stronger convergence results with explicit rates.
The first result, (34), implies that the iterate w̄_T reaches a consensus with at most O(1/ε²) samples, i.e., at an O(1/√T) rate. Similarly, (35) implies that the squared norm of the errors of θ̄_T and v̄_T, ‖θ̄_T − θ̄*‖²₂ + ‖v̄_T‖²₂, converges at an O(1/√T) rate. However, (34) does not say anything about the convergence rates of ‖w̄_T − w̄*‖²₂ and ‖μ̄_T − μ̄*‖²₂; still, their asymptotic convergence is guaranteed by Proposition 7. The main reason is the lack of strong convexity with respect to these variables. We can resolve this issue with a slight modification of the algorithm by adding the regularization terms (ρ/2)μ̄ᵀμ̄ − (ρ/2)w̄ᵀw̄ to the Lagrangian L with a small ρ > 0, so that L is strongly convex in μ̄ and strongly concave in w̄. In this case, the corresponding saddle-points are slightly altered depending on ρ.
Remark 6: The proposed algorithm achieves an O(1/√T) convergence rate. The recent primal-dual algorithm in [13] has a faster O(1/T) rate, and can be applied to solve the saddle-point problem in (19).
Remark 7: The last line of Algorithm 1 indicates that the averaged iterates, e.g., ŵ_T, can be used as estimates of the solution. The result in [35] proves asymptotic convergence of the last iterate of Algorithm 1 by using the stochastic approximation method [46]. For that convergence, the step-sizes should satisfy α_k > 0, α_k → 0, Σ_{k=0}^∞ α_k = ∞, and Σ_{k=0}^∞ α_k² < ∞, called the Robbins-Monro rule; an example is α_k = α_0/(k + β) with α_0, β > 0. On the other hand, Proposition 7 and Proposition 8 prove convergence of the averaged iterates of Algorithm 1 with convergence rates by using tools from optimization. There, the step-size rule is α_k = α_0/√(k + β) with α_0, β > 0, which does not obey the Robbins-Monro rule.
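The distinction between the two step-size rules can be checked numerically: for α_k = α_0/(k + β) the sum of squares stays bounded, while for α_k = α_0/√(k + β) it grows without bound (like log T), violating the Robbins-Monro condition. A small sketch with our own function names and default constants:

```python
import math

def robbins_monro(k, a0=1.0, b=1.0):
    # alpha_k = a0 / (k + b): sum alpha_k = inf and sum alpha_k**2 < inf
    return a0 / (k + b)

def averaging_rule(k, a0=1.0, b=1.0):
    # alpha_k = a0 / sqrt(k + b): sum alpha_k**2 = inf, so the
    # Robbins-Monro rule fails, but averaged iterates still enjoy
    # O(1/sqrt(T)) guarantees
    return a0 / math.sqrt(k + b)

# Partial sums of alpha_k**2 over 10^5 steps: the first stays bounded
# (it converges to pi**2/6 here), the second grows like log T
s_rm = sum(robbins_monro(k) ** 2 for k in range(100000))
s_avg = sum(averaging_rule(k) ** 2 for k in range(100000))
```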
Remark 8: There exist several RL algorithms based on stochastic primal-dual approaches. The GTD algorithm can be interpreted as a stochastic primal-dual algorithm by using Lagrangian duality theory [8]. The work in [27] proposes a primal-dual reinforcement learning algorithm for the single-agent policy optimization problem, where a linear programming form of the MDP is solved. A primal-dual variant of the GTD algorithm is investigated in [47] for a single-agent RL problem.
Remark 9: When nonlinear function approximation, such as a neural network, is used, convergence of TD-learning is not guaranteed in general [41], even in the single-agent case. Similarly, with nonlinear function approximation, convergence of Algorithm 1 to a globally optimal solution is not guaranteed. In particular, for minimization problems, stochastic gradient methods converge to a local stationary point [48]; on the other hand, convergence of stochastic primal-dual algorithms to a saddle-point for general non-convex min-max problems is still an open problem [49]. In this respect, the convergence of our algorithm with general nonlinear function approximation is a challenging open question that will require significant effort in the future.

VIII. SIMULATION
In this section, we provide simulation studies that illustrate the potential applicability of the proposed approach. All numerical examples below were run in MATLAB R2020a on a PC with a 3.69 GHz AMD Ryzen CPU and 128 GB RAM.
Example 1: In this example, we provide a comparative analysis using simulations. We consider the Markov chain, the local expected rewards, and the five RL agents over the network given in Figure 1. The hyperparameters were manually tuned so that the proposed algorithm performs reasonably well.
The results show that the parameters of the five agents reach a consensus and converge to common values, which empirically demonstrates the proposed DGTD algorithm.
Example 2: Consider a 20 m × 20 m continuous space X with three robots (agent 1 (blue), agent 2 (red), and agent 3 (black)), which patrol the space with identical stochastic motion planning policies π₁ = π₂ = π₃ = π. We consider a single-integrator system for each agent i, ẋ_i(t) = u_i(t), with the control policy u_i(t) = −h(x_i(t) − r_i) adopted from [50], where t ∈ R₊ is the continuous time, h ∈ R₊₊ is a constant, and r_i is a point chosen uniformly at random from X. Under this control policy, x_i(t) globally converges to r_i as t → ∞ [50, Lemma 1]. When x_i is sufficiently close to the destination r_i, the agent chooses another destination r_i uniformly in X, so all agents randomly maneuver through the space X. The continuous space X is discretized into a 20 × 20 grid, which forms the state-space S. The collaborative objective of the three robots is to evaluate the given policy using the rewards individually collected by each robot. Note that we only consider the policy evaluation problem for a given fixed policy; the algorithm itself cannot find an optimal policy. Nevertheless, the algorithm can be applied within the actor-critic algorithm [39] to find an optimal policy. In particular, the actor-critic algorithm is a policy optimization algorithm consisting of two parts, 1) policy evaluation and 2) policy improvement: the first part evaluates the current policy, and the second part improves the policy using the value function computed in the first part. The distributed TD-learning developed in this paper can be applied to the policy evaluation part of the actor-critic algorithm.
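The random-waypoint patrol policy can be sketched as a discretized single-integrator simulation. The gain, time step, waypoint-switching threshold, and function names below are our own illustrative choices, not values from [50]:

```python
import random

RNG = random.Random(3)

def patrol_step(x, r, h=1.0, dt=0.1, space=20.0, switch_dist=0.5):
    """One step of the random-waypoint patrol policy: the control
    u = -h (x - r) drives the position x toward the waypoint r; when x
    is close to r, a new waypoint is drawn uniformly from the square
    [0, space] x [0, space]."""
    # forward-Euler step of xdot = -h (x - r)
    x = [xi + dt * (-h) * (xi - ri) for xi, ri in zip(x, r)]
    if max(abs(xi - ri) for xi, ri in zip(x, r)) < switch_dist:
        r = [RNG.uniform(0.0, space) for _ in range(2)]
    return x, r

x, r = [0.0, 0.0], [10.0, 10.0]
for _ in range(1000):
    x, r = patrol_step(x, r)
# the agent keeps wandering but never leaves the 20 x 20 space,
# since each update is a convex combination of x and a point in it
```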
We assume that each robot is equipped with a different sensor that can detect different regions, while a pair of robots can exchange their parameters when the distance between them is less than or equal to 5. We assume that the robots do not interfere with each other; thereby, we can consider three independent MDPs with identical transition models.

FIGURE 3. Three rectangular regions that can be detected by three different UAVs. The blue region is detected only by agent 1 (blue circle), the red region is detected only by agent 2 (red circle), and the black region only by agent 3 (black circle). For each agent, the detection occurs only if the UAV flies over the region, and a reward r = 100 is given in this case.

The three regions and robots are depicted in Figure 3, where the blue region is detected only by agent 1 (blue circle), the red region is detected only by agent 2 (red circle), and the black region only by agent 3 (black circle). For each agent, the detection occurs only if the UAV flies over the region, and a reward r = 100 is given in this case. Algorithm 1 is applied with γ = 0.5 and Φ = I_|S| (tabular representation). We run Algorithm 1 for 50000 iterations with the step-size α_k = 100/(k + 10000) and κ = 1. The projection sets C_θ, C_v, C_μ, C_w are set to hypercubes with side length 20. The simulation results are shown in Figure 4, where the x and y axes are the state-space, brighter regions indicate higher values than darker regions, and the three bright regions indicate that the agent's value function has higher values at states in these regions. The results suggest that all agents successfully estimate identical value functions, which are aware of the three regions despite the incomplete sensing abilities. The obtained value function can be used to design a motion planning policy in combination with the actor-critic algorithm.

IX. CONCLUSION
In this paper, we studied a distributed GTD-learning algorithm for multi-agent MDPs using a stochastic primal-dual method. Each agent receives a local reward through local processing, while information exchange over random communication networks allows the agents to learn the global value function corresponding to the sum of the local rewards. Possible future research includes extensions to actor-critic and Q-learning algorithms.

X. PROOF OF PROPOSITION 3
In this section, we will provide a proof of Proposition 3. We begin with a basic technical lemma.
Lemma 3 (Basic Iterate Relations [28]): Let the sequences (x_k, w_k)_{k=0}^∞ be generated by the stochastic subgradient algorithm in (9) and (10). Then, we have: 1) for any x ∈ X and for all k ≥ 0, and 2) for any w ∈ W and for all k ≥ 0. Proof: The result is obtained from the iterate relations in [28, Lemma 3.1] by taking expectations.
Lemma 4 (Bernstein Inequality for Martingales [51]): Let
For any x ∈ X and y ∈ Y, define
We use E[‖L_x(x_k, w_k) + ε_k‖²₂ | F_k] ≤ C² and rearrange the terms in Lemma 3. As a next step, we derive bounds on the two terms in (41). The first term is bounded through a chain of inequalities, where (49) is due to E^(1)_k(x) ≤ C² for all x ∈ X; similarly, the second term is bounded by (C²/2)α_T. Combining the last inequality with (41), plugging α_k = α_0/√(k + 1) into the first term, plugging α_k = α_0/√(k + 1) into the second term, and combining the resulting bounds yields the claim. To prove L(x̂_T, w) − L(x, ŵ_T) ≤ ε, it suffices to prove (2C²α₀⁻¹ + C²α₀)/√T ≤ ε/2 and (1/T)M_T ≤ ε/2. By simple algebraic manipulations, one can show that the first inequality holds for sufficiently large T. To prove the second inequality with high probability, we use the Bernstein inequality in Lemma 4. To this end, one easily proves that E[M_{T+1} | F_T] = M_T, and hence (M_T)_{T=0}^∞ is a martingale. Moreover, we will find constants a > 0 and b > 0 such that ΔM_{T+1} := M_{T+1} − M_T ≤ b and E[|ΔM_{T+1}|² | F_T] ≤ a. For the first two terms we obtain the bound α_kC²/2 + 2C², where (47) follows from the relation ‖a − b‖²₂ = ‖a‖²₂ + ‖b‖²₂ − 2aᵀb for any vectors a, b, (48) follows from the Cauchy-Schwarz inequality, (49) follows from the nonexpansiveness of the projection, ‖Π_X(a) − Π_X(b)‖₂ ≤ ‖a − b‖₂, and the last inequality is obtained after simplifications. Similarly, one gets (1/(2α_k))(E^(2)_{k+1} − E[E^(2)_{k+1} | F_k]) ≤ α_kC²/2 + 2C². Combining the last two inequalities gives the constant b. Next, we prove that E[|ΔM_{k+1}|² | F_k] ≤ a holds with a = (α₀ + 2)²C⁴, where the inequality follows from the fact that the variance of a random variable is bounded by its second moment.
For bounding (50), note that the increment can be decomposed into four terms. The first two terms have the bound, where (51) follows from the relation ‖a − b‖²₂ = ‖a‖²₂ + ‖b‖²₂ − 2aᵀb for any vectors a, b, (52) follows from the Cauchy-Schwarz inequality, (53) is due to the nonexpansiveness of the projection, ‖Π_X(a) − Π_X(b)‖₂ ≤ ‖a − b‖₂, and (54) comes from (11), (12), and (13). Similarly, the second two terms are bounded by α_k(α₀C² + 2C²). Combining the last two results leads to the overall bound 2α_k(α₀C² + 2C²), and plugging this bound into (50), after simplifications, we obtain E[|M_{k+1} − M_k|² | F_k] ≤ (α₀ + 2)²C⁴, which is the desired conclusion.