New Versions of Gradient Temporal Difference Learning

Sutton, Szepesvári, and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goals of this paper are (a) to propose variants of GTD with an extensive comparative analysis and (b) to establish new theoretical analysis frameworks for GTD. These variants are based on convex-concave saddle-point interpretations of GTD, which effectively unify all the GTD algorithms into a single framework and provide a simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, a numerical comparative analysis is given to evaluate these approaches.


I. INTRODUCTION
Temporal-difference (TD) learning [1] is one of the most popular reinforcement learning (RL) algorithms [2] for policy evaluation problems. However, its main limitation lies in its inability to guarantee convergence under both off-policy learning and linear function approximation, which had been an important open problem for decades. In 2009, Sutton, Szepesvári, and Maei [3], [4] introduced the first TD learning algorithms compatible with both linear function approximation and off-policy training based on gradient estimations, which are thus called gradient temporal-difference (GTD) learning.
The goal of this paper is to propose new variants of GTD, together with a new template for convergence analysis. The main pathways to these developments are convex-concave saddle-point interpretations of GTD, which were first introduced in [5] based on the Lagrangian duality and in [6] based on the Fenchel duality [7]. In particular, GTD2, proposed in [4], can be interpreted as a stochastic primal-dual gradient dynamics (PDGD) of a convex-concave saddle-point problem, and hence its convergence analysis can be approached from a different angle using optimization theory [5], [6]. These interpretations were subsequently applied to distributed RL problems in [8]–[11].
Although the saddle-point perspectives provide unified viewpoints and greater flexibility in the analysis and design of GTD and RL algorithms, to the authors' knowledge, their potential has not been fully investigated yet. Motivated by this insight, we develop new versions of GTD which are unified with GTD2 [4] in a single framework through saddle-point formulations. The main contributions of this paper are summarized as follows: 1) New algorithms: Three new versions of GTD are proposed, named GTD3, GTD4, and GTD5. These variants, especially GTD4 and GTD5, can be viewed as regularized GTD2 algorithms, where the regularization potentially improves the empirical convergence. Their convergence and performance are evaluated through simulation experiments.
Choi is with the Shinhan Bank research team, Seoul, South Korea.
2) Unified saddle-point perspectives: The proposed versions can be interpreted in a unified way based on saddle-point interpretations, derived from slightly different angles than those in [5], [6]. 3) Comparative analysis: Comprehensive numerical experiments are given to compare the convergence of the proposed GTDs with GTD2 in [4]. Empirically, it turns out that the proposed GTD4 and GTD5 tend to converge faster than the other methods on 5000 randomly generated environments. 4) General analysis templates: For existing GTD algorithms, the convergence analysis mainly exploits ODE (ordinary differential equation) model-based stochastic approximation theory [12], where the main challenge is proving the asymptotic stability of the ODE model corresponding to the underlying algorithm. This approach does not allow a general and formal analysis framework, because the asymptotic stability of the ODE model depends significantly on the specific algorithm, and it is in general hard to establish. On the other hand, the proposed analysis applies the recent asymptotic stability theory of primal-dual gradient dynamics (PDGD) [13], where control-theoretic frameworks for the stability analysis of PDGD are developed. Using this recent result, we provide a new template for the convergence analysis of RL algorithms based on saddle-point formulations. This framework leads to a simple and unified convergence analysis for linear RL algorithms based on saddle-point formulations. Moreover, the new template allows us to easily analyze the proposed new versions of GTD derived from the saddle-point formulations, and it can potentially be applied to many other RL variants in the future.
Related previous works are briefly summarized as follows. As mentioned before, saddle-point perspectives of GTD and RL were introduced in [5], [6] based on the Lagrangian duality [5] and the Fenchel duality [6]. These ideas were applied to distributed RL problems in [8]–[11]. Even though they and the proposed saddle-point framework lead to the same algorithm, the latter is derived in a slightly different way from a simple constrained convex optimization formulation, which is compatible with the techniques in [13]. In addition, we note that GTD5 proposed in this paper can be interpreted as GTD2 with a quadratic regularization term, which was also used in the distributed RLs in [8]–[10]. Compared to them, GTD5 focuses on the single-agent case, has a different algorithmic structure, and uses diminishing weights on the regularization term in the comparative analysis. The so-called TD with Regularized Corrections (TDRC) was introduced in [14], which adds to the TDC updates in [4] an additional term corresponding to l2 regularization, and [15] extends the ideas of GTD to nonlinear function approximation.

II. PRELIMINARIES

A. Markov decision process
A Markov decision process (MDP) is characterized by a tuple M := (S, A, P, r, γ), where S is a finite state space, A is a finite action space, P(s′|s, a) represents the (unknown) state transition probability from state s to s′ given action a, r : S × A × S → R is the reward function, and γ ∈ (0, 1) is the discount factor. In particular, if action a is selected at the current state s, then the state transits to s′ with probability P(s′|s, a) and incurs a reward r(s, a, s′). A stochastic policy is a map π : S × A → [0, 1] representing the probability π(a|s) of selecting action a at the current state s; P^π denotes the transition matrix under policy π, and d^π : S → R denotes the stationary distribution of the state s ∈ S under π. We also define R^π(s) as the expected reward given the policy π and the current state s. The infinite-horizon discounted value function with policy π is J^π(s) := E[Σ_{k=0}^∞ γ^k r(s_k, a_k, s_{k+1}) | s_0 = s], where E stands for the expectation taken with respect to the state-action trajectories under π. Given pre-selected basis (or feature) functions φ_1, . . ., φ_q : S → R, the feature matrix Φ ∈ R^{|S|×q} is defined as the matrix whose s-th row vector is φ(s) := [φ_1(s), . . ., φ_q(s)]. Throughout the paper, we assume that Φ has full column rank. The policy evaluation problem is the problem of estimating J^π given a policy π.
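To make the definitions concrete, the value function satisfies the linear Bellman equation J^π = R^π + γP^π J^π, so for a small MDP it can be computed exactly by a linear solve. The following sketch uses hypothetical numbers (a toy example, not from the paper):

```python
import numpy as np

# Toy 3-state Markov reward process under a fixed policy pi
# (hypothetical numbers for illustration only).
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])   # state-transition matrix under pi
R_pi = np.array([1.0, 0.0, 2.0])     # expected reward per state

# J^pi solves (I - gamma * P_pi) J = R_pi.
J_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# The solution satisfies the Bellman equation J = R + gamma * P * J.
assert np.allclose(J_pi, R_pi + gamma * P_pi @ J_pi)
```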

B. Basics of nonlinear system theory
We briefly review basic nonlinear system theory, which will play an important role in the convergence analysis and stochastic approximation methods. Consider the nonlinear system

ẋ_t = f(x_t), x_0 ∈ R^n, (1)

where x_t ∈ R^n is the state, t ≥ 0 is the time, x_0 ∈ R^n is the initial state, and f : R^n → R^n is a nonlinear mapping. For simplicity, we assume that the solution to (1) exists and is unique. In fact, this holds true as long as the mapping f is globally Lipschitz continuous.
The equilibrium point is an important concept in nonlinear system theory. In particular, a point x_∞ ∈ R^n in the state space is said to be an equilibrium point of (1) if, whenever the state of the system starts at x_∞, it remains at x_∞ [16]. For (1), the equilibrium points are the real roots of the equation f(x) = 0. The equilibrium point x_∞ is said to be globally asymptotically stable if, for any initial state x_0 ∈ R^n, x_t → x_∞ as t → ∞.

C. ODE-based stochastic approximation
Due to its generality, the convergence analysis of many RL algorithms relies on the ODE (ordinary differential equation) approach [12], [17]. It analyzes the convergence of general stochastic recursions by examining the stability of the associated ODE model, based on the fact that stochastic recursions with diminishing step-sizes approximate the corresponding ODEs in the limit. One of the most popular approaches is based on the Borkar–Meyn theorem [18]. We briefly review Borkar and Meyn's ODE approach, which analyzes the convergence of the general stochastic recursion

θ_{k+1} = θ_k + α_k (f(θ_k) + ε_{k+1}), (2)

where f : R^n → R^n is a nonlinear mapping, (α_k)_{k=0}^∞ is a step-size sequence, and ε_{k+1} is a noise term. Basic technical assumptions are given below.
Assumption 1.
1) The mapping f : R^n → R^n is globally Lipschitz continuous, and there exists a function f_∞ : R^n → R^n such that lim_{c→∞} f(cθ)/c = f_∞(θ) for all θ ∈ R^n.
2) The origin in R^n is an asymptotically stable equilibrium for the ODE θ̇_t = f_∞(θ_t).
The Borkar–Meyn theorem states that, under Assumption 1, the stochastic process (θ_k)_{k=0}^∞ generated by (2) is bounded and converges to θ_∞ with probability one. It will be used to prove the convergence of various algorithms throughout the paper.
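As a simple illustration of the ODE approach (a textbook example, not from the paper), consider estimating a mean μ from noisy samples with the recursion θ_{k+1} = θ_k + α_k(x_k − θ_k). Its ODE model θ̇ = μ − θ has μ as the globally asymptotically stable equilibrium, and with α_k = 1/(k + 1) the iterates coincide exactly with the running sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 3.0
samples = mu + rng.standard_normal(10_000)   # noisy observations of mu

theta = 0.0
for k, x in enumerate(samples):
    alpha = 1.0 / (k + 1)            # Robbins-Monro (diminishing) step-sizes
    theta += alpha * (x - theta)     # theta_{k+1} = theta_k + alpha_k (x_k - theta_k)

# With alpha_k = 1/(k+1) the recursion is exactly the running sample mean.
assert np.isclose(theta, samples.mean())
```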

D. Saddle-point problem
In this subsection, we briefly review the saddle-point problem [13], [19]. Consider a convex-concave function L : R^n × R^n → R.

Definition 1 (Saddle-point). A saddle-point is defined as a pair (θ*, λ*) ∈ R^n × R^n such that L(θ*, λ) ≤ L(θ*, λ*) ≤ L(θ, λ*) for all θ, λ ∈ R^n.
The saddle-point problem is the problem of finding a saddle-point, which arises in a number of areas such as constrained optimization duality, zero-sum games, and general equilibrium theory [19]. Moreover, a saddle-point is also known to be a solution of the corresponding min-max problem.
Note that (θ*, λ*) is a saddle-point if and only if the stationary point condition holds, i.e., ∇_θ L(θ*, λ*) = ∇_λ L(θ*, λ*) = 0. The so-called primal-dual gradient method [19] is a popular method for solving Problem 1:

θ_{k+1} = θ_k − α_k ∇_θ L(θ_k, λ_k), λ_{k+1} = λ_k + α_k ∇_λ L(θ_k, λ_k),

where (α_k)_{k=0}^∞ is a step-size sequence. This iteration will be called the discrete-time primal-dual gradient dynamics (PDGD) throughout the paper. Its continuous-time counterpart is

θ̇_t = −∇_θ L(θ_t, λ_t), λ̇_t = ∇_λ L(θ_t, λ_t),

and is called the continuous-time PDGD [13]. Both PDGDs converge to a saddle-point under some mild assumptions [13], [19]. If the gradients are not accessible, but only their stochastic approximations are available, then the stochastic counterpart replaces each gradient with a noisy estimate whose noise has zero mean. In this paper, it will be called the stochastic PDGD. The stochastic PDGD also converges to a saddle-point in probabilistic senses [20], [21]. As an application, let us consider the constrained convex optimization problem.
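A minimal sketch of the discrete-time PDGD on a toy constrained problem (hypothetical numbers, not from the paper): minimize ½x² subject to x = 1, with Lagrangian L(x, λ) = ½x² + λ(x − 1), whose unique saddle-point is (x*, λ*) = (1, −1).

```python
# Discrete-time primal-dual gradient dynamics on
#   L(x, lam) = 0.5 * x**2 + lam * (x - 1),
# i.e. min 0.5 x^2 subject to x = 1.  Saddle-point: x* = 1, lam* = -1.
x, lam = 0.0, 0.0
alpha = 0.1
for _ in range(5000):
    grad_x = x + lam         # dL/dx
    grad_lam = x - 1.0       # dL/dlam
    x -= alpha * grad_x      # primal (gradient descent) step
    lam += alpha * grad_lam  # dual (gradient ascent) step

assert abs(x - 1.0) < 1e-6 and abs(lam + 1.0) < 1e-6
```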
Problem 2. Solve for x ∈ R^n the optimization

min_x f(x) subject to Ax = b,

where f is convex and continuously differentiable.
The corresponding (convex-concave) Lagrangian function is defined as

L(x, λ) := f(x) + λ^T (Ax − b), (4)

where λ ∈ R^n is called the Lagrangian multiplier. Using standard results in convex optimization theory, if Problem 2 satisfies Slater's condition [7, Chap. 5], then its solution can be found by solving the saddle-point problem in Problem 1.

III. REVIEW OF GTD ALGORITHM
In this section, we briefly review the gradient temporal-difference (GTD) learning developed in [3], which aims to solve the policy evaluation problem. Roughly speaking, the goal of policy evaluation is to find the weight vector θ such that Φθ approximates the true value function J^π. This is typically done by minimizing the so-called mean-square Bellman error loss function [3]. The overall problem is summarized below.
Problem 3. Solve for θ ∈ R^q the optimization

min_{θ∈R^q} ‖R^π + γP^π Φθ − Φθ‖²_{D_β},

where q is the number of feature functions, D_β is a diagonal matrix with positive diagonal elements d_β(s), s ∈ S, and ‖x‖_D := √(x^T D x) for any positive-definite D. Here, d_β can be any state visit distribution under the behavior policy β such that d_β(s) > 0 for all s ∈ S.
The GTD in [4] considers another objective function called the mean-square projected Bellman error loss function.
Problem 4. Solve for θ ∈ R^q the optimization

min_{θ∈R^q} ‖Π(R^π + γP^π Φθ) − Φθ‖²_{D_β},

where Π is the projection onto the range space of Φ. The projection can be performed by a matrix multiplication: we write Π(x) := Πx, where Π := Φ(Φ^T D_β Φ)^{-1} Φ^T D_β. Note that minimizing the objective means minimizing the error of the projected Bellman equation Φθ = Π(R^π + γP^π Φθ) with respect to ‖·‖_{D_β}. Moreover, note that in the objective of Problem 4, d_β depends on the behavior policy β, while P^π and R^π depend on the target policy π that we want to evaluate. This structure allows us to obtain an off-policy learning algorithm through the importance sampling [22] or sub-sampling techniques [3]. Throughout the paper, we adopt the following standard assumption.
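The projection Π and the projected Bellman error can be formed explicitly for a small model. The sketch below (hypothetical numbers) verifies that Π is idempotent and that the solution of Φ^T D_β(R^π + (γP^π − I)Φθ) = 0 zeroes the MSPBE; both facts follow from the definitions above.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.3, 0.7],
              [0.5, 0.0, 0.5]])          # P^pi (hypothetical)
R = np.array([1.0, -1.0, 0.5])           # R^pi
D = np.diag([0.3, 0.3, 0.4])             # D_beta, positive diagonal
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])             # full-column-rank features

# D_beta-weighted projection onto range(Phi):
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
assert np.allclose(Pi @ Pi, Pi)          # Pi is idempotent (a projection)

def mspbe(theta):
    """||Pi(R + gamma P Phi theta) - Phi theta||_D^2."""
    e = Pi @ (R + gamma * P @ (Phi @ theta)) - Phi @ theta
    return e @ D @ e

# Solution of the projected Bellman equation, obtained from
# Phi^T D (R + (gamma P - I) Phi theta) = 0:
A = Phi.T @ D @ (gamma * P - np.eye(3)) @ Phi   # nonsingular here (Assumption 2)
theta_star = -np.linalg.solve(A, Phi.T @ D @ R)
assert mspbe(theta_star) < 1e-12
```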
Assumption 2. Φ T D β (γP π − I)Φ is nonsingular, where I denotes the identity matrix with an appropriate dimension.
Note that Assumption 2 is common in the literature and is adopted in [3], [4], [14] for the convergence of GTD algorithms. Some properties related to Problem 4 are summarized below for convenience and completeness.

Lemma 3. The following statements hold true:
1) A solution of Problem 4 exists, and is unique.
2) The solution of Problem 4 is given by

θ* = −(Φ^T D_β (γP^π − I)Φ)^{-1} Φ^T D_β R^π. (5)

Proof. For the first statement, the equation Π(R^π + γP^π Φθ) − Φθ = 0 can be equivalently written as Φ^T D_β (R^π + (γP^π − I)Φθ) = 0 because Φ has full column rank. Since Φ^T D_β (γP^π − I)Φ is nonsingular by Assumption 2, the last equation admits a unique solution. Moreover, the second statement follows directly from the last equation. This completes the proof.
Based on this objective function, [4] developed GTD2. The reader is referred to [4] for more details. After [4], different interpretations were developed based on saddle-point perspectives. Before proceeding, they are briefly presented in the following subsections.

A. First approach: dual representation
A saddle-point perspective of GTD2 was introduced in [5]. The main idea is to convert Problem 4 into an equivalent quadratically constrained optimization problem, where w ∈ R^q is a newly introduced vector variable. Introducing the Lagrangian function L(θ, w, λ), where λ ∈ R^q is the Lagrangian multiplier, the dual problem [7] can be derived. The main reason to consider the dual problem instead of the primal problem is that the dual formulation removes the matrix inverse in the objective. Next, we can again construct the corresponding Lagrangian function for the dual problem, where θ ∈ R^q is the Lagrangian multiplier. Then, it turns out that GTD2 is identical to a stochastic PDGD for solving the saddle-point problem min_{λ∈R^q} max_{θ∈R^q} L(θ, λ). For more details, the reader is referred to [5].

B. Second approach: Fenchel duality
GTD2 can also be interpreted in a different direction using the Fenchel dual of Problem 4, as shown in [6]. In particular, using the Fenchel duality, a conjugate form of MSPBE(θ) can be derived, so that Problem 4 can be represented by the convex-concave saddle-point problem min_{θ∈R^q} MSPBE(θ) = min_{θ∈R^q} max_{λ∈R^q} L(θ, λ). Then, GTD2 is identical to a stochastic primal-dual algorithm for solving the above saddle-point problem. In the next section, we introduce an alternative saddle-point approach to derive GTD2 from a different angle.

IV. THIRD APPROACH
In this section, we introduce a slightly different approach to derive GTD2. To this end, let us consider the following constrained optimization problem.

Problem 5. Solve for θ ∈ R^q the optimization

min_{θ∈R^q} 0 subject to Φ^T D_β (R^π + γP^π Φθ − Φθ) = 0. (6)

Note that in Problem 5, we introduce a null objective, f ≡ 0, to fit the problem into an optimization form. Problem 5 can be seen as the projected Bellman equation Φθ = Π(R^π + γP^π Φθ) written in the form of an optimization problem. We can prove that the optimization admits a unique solution, which is identical to the solution of Problem 4.

Proposition 1. A solution of Problem 5 exists and is unique, given by θ* defined in (5).
Proof. The equality constraint in (6) can be equivalently written as (5). This completes the proof.
To formulate Problem 5 as a min-max saddle-point problem, we introduce the corresponding Lagrangian function L(θ, λ) := λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ). Instead of directly deriving the corresponding dual problem, we introduce a regularization term to make it strongly concave in λ, and obtain the following modification:

L(θ, λ) := λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ) − (1/2) λ^T Φ^T D_β Φ λ. (7)

The corresponding saddle-point problem of (7) is then given as follows.
Problem 6. Solve for (θ, λ) ∈ R^q × R^q the min-max problem min_{θ∈R^q} max_{λ∈R^q} L(θ, λ) with L in (7).

Note that a quadratic penalty (or regularization) term has been added to the original Lagrangian function in (7), which is not typical in terms of the standard Lagrangian duality theory. In this sense, Problem 6 and Problem 5 are not equivalent. This additional term is introduced in order to derive GTD2 from the saddle-point viewpoint, and this process can give additional insights on GTD2. Since Problem 6 is modified, a natural question is whether the original equality constrained optimization in Problem 5 can be solved by addressing the saddle-point problem in Problem 6 for the regularized Lagrangian function (7). We can conclude that the solution of Problem 6 is indeed identical to that of Problem 4.

Proposition 2. A solution of Problem 6 exists, is unique, and is given by θ = θ* and λ = 0.
Intuitively, the additional regularization term in (7) penalizes the Lagrangian multiplier λ for being large, while this change does not affect the primal variable θ. Now, let us turn our attention to its continuous-time PDGD [13]:

θ̇_t = −Φ^T (γP^π − I)^T D_β Φ λ_t,
λ̇_t = Φ^T D_β (R^π + (γP^π − I)Φθ_t) − Φ^T D_β Φ λ_t. (8)

With the samples s_k ∼ d_β, a_k ∼ β(·|s_k), and s′_k ∼ P(·|s_k, a_k), the corresponding stochastic PDGD can be obtained. This recursion is identical to GTD2, which is summarized in Algorithm 1 for completeness of the presentation.
Algorithm 1 GTD2
1: Set the step-size (α_k)_{k=0}^∞, and initialize (θ_0, λ_0).
2: for k ∈ {0, 1, . . .} do: sample s_k ∼ d_β, a_k ∼ β(·|s_k), and s′_k ∼ P(·|s_k, a_k), and update the parameters according to

λ_{k+1} = λ_k + α_k φ(s_k)(δ_k − φ(s_k)^T λ_k),
θ_{k+1} = θ_k + α_k (φ(s_k) − ρ_k γ φ(s′_k)) (φ(s_k)^T λ_k),

where δ_k := ρ_k (r_k + γ θ_k^T φ(s′_k)) − θ_k^T φ(s_k). Note that in Algorithm 1, an importance sampling ratio, ρ_k := π(a_k|s_k)/β(a_k|s_k), is introduced for off-policy learning [22].
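A runnable sketch of the GTD2 recursion on a toy on-policy chain (tabular one-hot features, ρ_k ≡ 1; the chain, rewards, and step-size are hypothetical and not from the paper). The true values are J = [10, 10], and both θ and the secondary weights λ are driven by the TD error δ and the correction (φ − γφ′)(φ^T λ):

```python
import numpy as np

# Two-state deterministic chain: 0 -> 1 -> 0, reward 1, gamma = 0.9,
# so the true values are J = [10, 10].
gamma = 0.9
next_state = {0: 1, 1: 0}
Phi = np.eye(2)                  # one-hot (tabular) features
theta = np.zeros(2)
lam = np.zeros(2)                # secondary weights (Lagrange multiplier)
alpha = 0.05
rng = np.random.default_rng(0)

for k in range(30_000):
    s = rng.integers(2)          # s ~ d_beta (uniform here)
    s2 = next_state[s]           # s' ~ P(.|s), deterministic here
    phi, phi2 = Phi[s], Phi[s2]
    delta = 1.0 + gamma * theta @ phi2 - theta @ phi   # TD error (rho = 1 on-policy)
    # GTD2 updates (stochastic primal-dual gradient steps):
    theta = theta + alpha * (phi - gamma * phi2) * (phi @ lam)
    lam = lam + alpha * (delta - phi @ lam) * phi

assert np.linalg.norm(theta - 10.0) < 1.0
```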
Although the convergence of GTD2 was given in [4], we will provide another approach based on recent results in [13] in the next section.

Remark 1. A different algorithm can be obtained with the Lagrangian function L(θ, λ) := λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ) − (1/2) λ^T λ, which may have different convergence properties. In general, the corresponding algorithm performs better with smaller step-sizes, while GTD2 tends to converge faster.

V. CONVERGENCE OF GTD2
In this section, we provide an alternative approach to the convergence of GTD2 based on the recent results in [13], in combination with the constrained optimization perspective of GTD2 in the previous section. Before proceeding, some results of [13] are briefly summarized.

Lemma 4 ([13]). Suppose that the objective function f in Problem 2 is strongly convex, smooth, and twice continuously differentiable, and that A is full row rank. Consider the corresponding Lagrangian function (4). Then, the corresponding saddle-point (θ*, λ*) is unique, and the corresponding continuous-time PDGD converges to it exponentially fast.

In the sequel, we will apply Lemma 4 to prove the convergence of GTD2, especially the global asymptotic stability of its ODE model. The main difficulty in applying Lemma 4 to Problem 5 is that Problem 5 has a null objective, f ≡ 0, which does not satisfy the strong convexity assumption on the objective function in Lemma 4. To resolve this problem, we will consider the dual problem of Problem 5 instead of its original form.
Problem 7 (Dual problem). Solve for λ ∈ R^q the optimization

max_{λ∈R^q} λ^T Φ^T D_β R^π − (1/2) λ^T Φ^T D_β Φ λ subject to λ^T Φ^T D_β (γP^π − I)Φ = 0.

Proof. Consider the Lagrangian function (7) and the corresponding min-max problem in Problem 6. The Lagrangian function can be written as L(θ, λ) = λ^T Φ^T D_β R^π + λ^T Φ^T D_β (γP^π − I)Φθ − (1/2) λ^T Φ^T D_β Φ λ. If we fix λ, then the problem min_{θ∈R^q} L(θ, λ) has a finite optimal value only when λ^T Φ^T D_β γP^π Φ − λ^T Φ^T D_β Φ = 0. Therefore, the dual problem max_{λ∈R^q} min_{θ∈R^q} L(θ, λ) is Problem 7. This completes the proof.
Now, Problem 7 (a maximization problem) can be equivalently written as the minimization problem

min_{λ∈R^q} (1/2) λ^T Φ^T D_β Φ λ − λ^T Φ^T D_β R^π subject to Φ^T (γP^π − I)^T D_β Φ λ = 0. (9)

The corresponding Lagrangian function has θ ∈ R^q as the multiplier, and it can be easily checked that the corresponding continuous-time PDGD is identical to that of Problem 5, given in (8). Therefore, Lemma 4 can be applied to Problem 5 in place of Problem 7.
Proposition 4. The continuous-time PDGD in (8) is globally asymptotically stable.

Proof. Note that (9) has a strongly convex, smooth, and twice differentiable objective function. Moreover, Φ^T (I − γP^π)^T D_β Φ is nonsingular by Assumption 2, and hence full row rank. The other assumptions are also met. Therefore, we can apply Lemma 4 to obtain the desired conclusion.

Now, the Borkar–Meyn theorem together with Proposition 4 yields the convergence of GTD2. Details of the remaining parts can be found in [4].

Remark 2. Lemma 4 provides exponential convergence, while Lemma 5 provides asymptotic convergence. The convergence result in this paper relies on the standard ODE methods (the Borkar–Meyn theorem), which do not provide convergence rates in general, even if the corresponding ODE model's solution converges exponentially fast. Convergence rate analysis can be a potential future topic.
In this section, we proposed a saddle-point interpretation of GTD2 from a slightly different perspective, and presented a different analysis for the global stability of the corresponding ODE model. Starting from this new perspective, we will provide two new versions of GTD in the next section.

VI. GTD3
In this section, we propose a new optimization formulation for the policy evaluation problem, Problem 4. Based on this form, we will derive another version of GTD, called GTD3 in this paper.

Problem 8. Solve for θ ∈ R^q the optimization

min_{θ∈R^q} (1/2) θ^T Φ^T D_β Φ θ subject to Φ^T D_β (R^π + γP^π Φθ − Φθ) = 0.

Compared to Problem 5 for GTD2, the main difference is that it has a quadratic objective instead of the null objective. A natural question is whether this optimization admits a solution identical to that of Problem 5. The answer is indeed positive.

Proposition 5. A solution of Problem 8 exists, and is unique, given by θ* defined in (5).
Proof. It is clear from Proposition 2 that the equality constraint has the unique feasible point θ = θ*. Therefore, the optimal solution is also uniquely determined by θ*. This completes the proof.
To derive a saddle-point formulation again, let us consider the Lagrangian function for Problem 8:

L(θ, λ) := (1/2) θ^T Φ^T D_β Φ θ + λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ). (10)

Note that it is concave in λ and strongly convex in θ. Compared to the Lagrangian function of GTD2 in (7), the regularization term −(1/2) λ^T Φ^T D_β Φ λ, which is strongly concave in λ, is replaced with the term (1/2) θ^T Φ^T D_β Φ θ, which is strongly convex in θ. The corresponding min-max saddle-point formulation is given as follows.
Problem 9. Solve for (θ, λ) ∈ R^q × R^q the optimization min_{θ∈R^q} max_{λ∈R^q} L(θ, λ) with L in (10).

Again, we can conclude that the solution of Problem 9 is also identical to that of Problem 4.

Proposition 6. Problem 9 admits a unique solution given by θ = θ* and λ = λ*, where θ* is given in Lemma 3, and

λ* = −(Φ^T (γP^π − I)^T D_β Φ)^{-1} Φ^T D_β Φ θ*. (11)

The results in Proposition 6 can be easily obtained by solving the stationary point conditions for Problem 9, i.e., ∇_θ L(θ, λ) = 0 and ∇_λ L(θ, λ) = 0; the detailed proof is omitted here. Similar to the previous section, the continuous-time PDGD of Problem 8 (or, equivalently, Problem 9) is

θ̇_t = −Φ^T D_β Φ θ_t − Φ^T (γP^π − I)^T D_β Φ λ_t,
λ̇_t = Φ^T D_β (R^π + (γP^π − I)Φθ_t), (12)

and its discrete-time counterpart is obtained by the Euler discretization. With the samples s_k ∼ d_β, a_k ∼ π(·|s_k), and s′_k ∼ P(·|s_k, a_k), a stochastic approximation [20], [21] of the discrete-time counterpart can be formed. The proposed algorithm is summarized in Algorithm 2, which is equivalent to the updates in the above recursion with the importance sampling [22].

Algorithm 2 GTD3
1: Set the step-size (α_k)_{k=0}^∞, and initialize (θ_0, λ_0).
2: for k ∈ {0, 1, . . .} do: sample s_k ∼ d_β, a_k ∼ β(·|s_k), and s′_k ∼ P(·|s_k, a_k), and update the parameters according to

θ_{k+1} = θ_k + α_k (−φ(s_k) φ(s_k)^T θ_k + (φ(s_k) − ρ_k γ φ(s′_k)) (φ(s_k)^T λ_k)),
λ_{k+1} = λ_k + α_k φ(s_k) δ_k,

where δ_k := ρ_k (r_k + γ θ_k^T φ(s′_k)) − θ_k^T φ(s_k) and ρ_k := π(a_k|s_k)/β(a_k|s_k).

Note that Algorithm 2 is different from GTD2, linear TD with gradient correction (TDC) [4], and the original GTD in [3]. Moreover, the optimization problem corresponding to Algorithm 2 (Problem 8) already has the structure required in Lemma 4. Therefore, Lemma 4 can be directly applied to prove the global stability of the corresponding PDGD in (12). Note also that the PDGD in (12) is also the ODE model of Algorithm 2.
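A runnable sketch of the GTD3 updates on a toy on-policy chain (tabular features, ρ_k ≡ 1; the chain and step-size schedule are hypothetical and not from the paper). Compared with GTD2, the θ-update carries the extra descent term −φ(φ^T θ) from the quadratic objective, and the λ-update drops the −φ(φ^T λ) regularization term:

```python
import numpy as np

# Two-state deterministic chain: 0 -> 1 -> 0, reward 1, gamma = 0.5,
# so the true values are J = [2, 2].
gamma = 0.5
next_state = {0: 1, 1: 0}
Phi = np.eye(2)                    # one-hot (tabular) features
theta, lam = np.zeros(2), np.zeros(2)
rng = np.random.default_rng(0)

for k in range(50_000):
    alpha = 10.0 / (k + 100.0)     # diminishing step-sizes
    s = rng.integers(2)
    s2 = next_state[s]
    phi, phi2 = Phi[s], Phi[s2]
    delta = 1.0 + gamma * theta @ phi2 - theta @ phi   # TD error (rho = 1 on-policy)
    # GTD3: descent on the quadratic objective plus the constraint coupling,
    # ascent on lam (no lam-regularization, unlike GTD2):
    theta = theta + alpha * (-phi * (phi @ theta) + (phi - gamma * phi2) * (phi @ lam))
    lam = lam + alpha * delta * phi

assert np.linalg.norm(theta - 2.0) < 0.5
```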
Proposition 7. The PDGD of Problem 8, given in (12), is globally asymptotically stable and converges to its unique equilibrium point (θ*, λ*), where λ* is defined in (11).

Proof. Problem 8 has a strongly convex, smooth, and twice differentiable objective function. Moreover, Φ^T (I − γP^π)^T D_β Φ is nonsingular by Assumption 2, and hence full row rank. The other assumptions are also met. Therefore, the conclusion follows from Lemma 4. This completes the proof.
Based on Proposition 7, the convergence of Algorithm 2 can be proved using the Borkar–Meyn theorem.
The proof of Theorem 1 can be found in the Appendix.

Remark 3. A different algorithm can be obtained with the Lagrangian function L(θ, λ) := (1/2) θ^T θ + λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ), which has different convergence properties. In general, the corresponding algorithm performs better with smaller step-sizes, while GTD3 tends to converge faster.
In this section, we proposed a new version of GTD based on the new saddle-point formulation in Problem 9. In the next section, we propose further versions of GTD based on a Lagrangian function with a more symmetric form.

VII. GTD4 & 5
Let us recall the Lagrangian functions (7) for GTD2 and (10) for GTD3. In this section, we investigate the function

L(θ, λ) := (σ/2) θ^T Φ^T D_β Φ θ + λ^T Φ^T D_β (R^π + γP^π Φθ − Φθ) − (1/2) λ^T Φ^T D_β Φ λ, (14)

where σ ≥ 0 is a weight on the first regularization term, i.e., a design parameter. Note that with σ = 0, (14) reduces to (7). The function includes both (1/2) λ^T Φ^T D_β Φ λ and (σ/2) θ^T Φ^T D_β Φ θ, and hence is more symmetric than (7) and (10). Moreover, it is strongly convex in θ and strongly concave in λ. The corresponding min-max saddle-point problem is summarized below for convenience.

Problem 10. Solve for (θ, λ) ∈ R^q × R^q the min-max problem min_{θ∈R^q} max_{λ∈R^q} L(θ, λ) with L in (14).

Solving the stationary point conditions, ∇_θ L(θ, λ) = 0 and ∇_λ L(θ, λ) = 0, we obtain the solution of Problem 10 for the θ-coordinate. It turns out that this solution, denoted by θ_σ, is not exactly identical to θ*, but includes a bias term proportional to σ. However, since θ_σ → θ* as σ → 0, one can control the size of the error by adjusting σ. Moreover, a larger σ can further stabilize the final algorithm and speed up its convergence, because it improves the degree of strong convexity of (14) in θ. Therefore, there is a trade-off between stability and bias in choosing σ. The solution is formally stated in the following proposition for convenience.

Proposition 8. The unique saddle-point of Problem 10 for the θ-coordinate is given by

θ_σ = −(σ Φ^T D_β Φ + B^T (Φ^T D_β Φ)^{-1} B)^{-1} B^T (Φ^T D_β Φ)^{-1} Φ^T D_β R^π, where B := Φ^T D_β (γP^π − I)Φ.

The proof of Proposition 8 is a direct calculation and is omitted for brevity. Similar to the previous section, the continuous-time PDGD is

θ̇_t = −σ Φ^T D_β Φ θ_t − Φ^T (γP^π − I)^T D_β Φ λ_t,
λ̇_t = Φ^T D_β (R^π + (γP^π − I)Φθ_t) − Φ^T D_β Φ λ_t. (15)

We first establish the global asymptotic stability of the PDGD in (15).

Proposition 9. Consider the trajectory (θ_t, λ_t) of the PDGD in (15), and let (θ_σ, λ_σ) be the corresponding unique saddle-point. Then, (θ_t, λ_t) → (θ_σ, λ_σ) as t → ∞.
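The bias θ_σ and its vanishing as σ → 0 can be checked numerically. Eliminating λ from the stationary conditions of the σ-regularized Lagrangian gives, with B := Φ^T D_β(γP^π − I)Φ, C := Φ^T D_β Φ, and b := Φ^T D_β R^π, the expression θ_σ = −(σC + B^T C^{-1} B)^{-1} B^T C^{-1} b, which tends to θ* = −B^{-1} b as σ → 0. A sketch with hypothetical numbers:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.2, 0.8, 0.0],
              [0.0, 0.3, 0.7],
              [0.5, 0.0, 0.5]])
R = np.array([1.0, -1.0, 0.5])
D = np.diag([0.3, 0.3, 0.4])
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

B = Phi.T @ D @ (gamma * P - np.eye(3)) @ Phi   # nonsingular here (Assumption 2)
C = Phi.T @ D @ Phi
b = Phi.T @ D @ R

theta_star = -np.linalg.solve(B, b)             # unbiased solution

def theta_sigma(sigma):
    """theta-coordinate of the sigma-regularized saddle-point."""
    M = sigma * C + B.T @ np.linalg.solve(C, B)
    return -np.linalg.solve(M, B.T @ np.linalg.solve(C, b))

errs = [np.linalg.norm(theta_sigma(s) - theta_star) for s in (1.0, 0.1, 0.01, 0.001)]
assert errs[0] > errs[-1]                       # bias shrinks as sigma decreases
assert np.allclose(theta_sigma(1e-8), theta_star, atol=1e-5)
```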
With the samples s_k ∼ d_β, a_k ∼ π(·|s_k), and s′_k ∼ P(·|s_k, a_k), a stochastic PDGD corresponding to (15) can be formed, which is the second proposed algorithm, called GTD4 in this paper.
The overall algorithm with the importance sampling (for off-policy learning) is summarized in Algorithm 3.
Proof. By Proposition 9, the PDGD in (15) is globally asymptotically stable. The remaining parts of the proof are almost identical to those of Theorem 1, and hence are omitted for brevity.
Remark 4. The main difference between Algorithm 3 (GTD4) and Algorithm 1 (GTD2) lies in the additional regularization term in their Lagrangian functions (14) and (7). By adding the regularization term, the saddle-point problem becomes strongly convex-concave, which may potentially accelerate convergence. Intuitively, the additional regularization terms make the slopes of the Lagrangian function steeper in both the ascent and descent directions, and hence one can expect the stochastic gradient descent-ascent iterations to converge faster to the stationary point.
Finally, a modification of (14) leads to GTD5, in which the term (σ/2) θ^T Φ^T D_β Φ θ in (14) is replaced with (σ/2) θ^T θ. The algorithm is summarized in Algorithm 4.
Since σ > 0 leads to a bias in the solution, a reasonable heuristic is to diminish σ, i.e., σ → 0 as k → ∞. A comparative analysis of several GTDs is given in the next section.
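A runnable sketch of GTD4 and GTD5 with a diminishing weight σ_k on a toy on-policy chain (tabular features, ρ_k ≡ 1; the chain, step-size, and the schedule σ_k = 100/(k + 100) are illustrative). The two variants differ only in the regularization term added to the GTD2 θ-update:

```python
import numpy as np

# Two-state deterministic chain: 0 -> 1 -> 0, reward 1, gamma = 0.5,
# so the true values are J = [2, 2].
gamma = 0.5
next_state = {0: 1, 1: 0}
Phi = np.eye(2)                   # one-hot (tabular) features
alpha = 0.05
rng = np.random.default_rng(0)

def run(variant, n=50_000):
    theta, lam = np.zeros(2), np.zeros(2)
    for k in range(n):
        sigma = 100.0 / (k + 100.0)   # diminishing regularization weight
        s = rng.integers(2)
        s2 = next_state[s]
        phi, phi2 = Phi[s], Phi[s2]
        delta = 1.0 + gamma * theta @ phi2 - theta @ phi
        if variant == "GTD4":         # regularizer (sigma/2) theta^T Phi^T D Phi theta
            reg = sigma * phi * (phi @ theta)
        else:                         # GTD5: regularizer (sigma/2) theta^T theta
            reg = sigma * theta
        theta = theta + alpha * (-reg + (phi - gamma * phi2) * (phi @ lam))
        lam = lam + alpha * (delta - phi @ lam) * phi
    return theta

for variant in ("GTD4", "GTD5"):
    theta = run(variant)
    assert np.linalg.norm(theta - 2.0) < 0.5, variant
```

Because σ_k → 0, the bias induced by the regularization vanishes and both variants settle at the unregularized solution θ* = [2, 2] in this toy problem.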
VIII. COMPARATIVE NUMERICAL EXPERIMENTS

The step-sizes were selected such that all the algorithms perform reasonably well. The results show an instance where GTD4 and GTD5 outperform GTD2 and GTD3. GTD3 converges slightly faster than GTD2 in this example.
Remark 5. The diminishing weight σ_k has been selected by trial and error. Intuitively, if the weight σ_k diminishes too fast, then the algorithm quickly becomes identical to the standard GTD2 and GTD3, and the convergence speeds also become similar to those of GTD2 and GTD3. For the algorithms to be effective, the weight σ_k should not diminish too fast. On the other hand, if σ_k diminishes too slowly, then the bias induced by the regularization vanishes too slowly during learning. Therefore, there is a trade-off between convergence speed and bias, which leads to some tuning issues.

[Figure 1: Errors of GTD2, GTD3, GTD4 (green), and GTD5 (magenta) on a logarithmic scale. For GTD4 and GTD5, a diminishing weight σ_k = 100/(k + 100) was used.]
Figure 2 provides another instance where GTD2 performs slightly better than or on par with the other approaches. In our experience, GTD4 and GTD5 outperform the other two approaches more frequently. For a fair and more comprehensive analysis, we ranked the four approaches based on the performance index Σ_{k=0}^{τ} ‖θ_k − θ*‖₂/1000 for 5000 randomly generated MDPs, where τ is the total number of iterations, set to τ = 250000 in this example. In addition to the random generation scheme used in the previous two MDP instances, we also randomly select the number of states and the number of actions uniformly from {3, 4, . . ., 100} and {2, 3, . . ., 30}, respectively. The number of feature functions is chosen to be around 1/10 of the state-space size. Rankings of the different GTDs over the 5000 MDP instances are summarized in Figure 3, where each bar indicates the number of MDP instances in which the corresponding ranking is achieved by each method in terms of the performance index. We used the step-size α_k = 5/(k + 5) and the diminishing weight σ_k = 100/(k + 100) for this experiment. GTD5 takes first place most frequently (3316 times over 5000 trials), and GTD4 takes second place most frequently (3016 times over 5000 trials). GTD2 and GTD3 are comparable to each other. The results suggest that GTD5 and GTD4 outperform the other approaches in most cases.

IX. CONCLUSION
In this paper, we proposed variants of GTD based on convex-concave saddle-point interpretations, which allow a new stability analysis based on recent results [13] on the stability of PDGD. The performance of the GTDs was evaluated through numerical experiments, which suggest that GTD4 and GTD5 outperform the other methods on 5000 randomly generated MDPs. Therefore, we can conclude that the use of regularization terms with diminishing weights can potentially improve the empirical convergence speed. Convergence rate analysis is an important next step beyond asymptotic convergence, as it gives insights on how fast the iterates approach the solution. Moreover, further combinations of regularization methods can lead to more versions of GTD with different properties and performance. These topics can be potential future works.

APPENDIX

Therefore, its origin is the globally asymptotically stable equilibrium point. [...] where the second equality is due to the i.i.d. assumption on the samples. Using this identity, we have [...]. Therefore, (M_k)_{k=0}^∞ is a martingale sequence, and ε_{k+1} = M_{k+1} − M_k is a martingale difference. Moreover, it can be easily proved that the second statement of the fourth condition of Assumption 1 is satisfied by algebraic calculations. Therefore, the fourth condition is met.