We consider the framework of a set of recently proposed two-timescale actor-critic algorithms for reinforcement learning under the long-run average-reward criterion with linear, feature-based value-function approximation. The actor update follows the stochastic policy-gradient ascent rule. We derive a novel stochastic-gradient-based critic update that minimizes the variance of the policy-gradient estimator used in the actor update. We propose a novel baseline structure for minimizing the variance of an estimator and derive an optimal baseline that makes the covariance matrix the zero matrix, the best achievable. Using the optimal baseline deduced for an existing algorithm, we derive a novel actor update. We derive a further novel actor update using the optimal baseline for an unbiased policy-gradient estimator, which we deduce from the Policy-Gradient Theorem with Function Approximation. Computational results demonstrate that the proposed algorithms outperform the state of the art on Garnet problems.
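The central idea above, that subtracting a baseline from the reward term of a policy-gradient estimator leaves its expectation unchanged while reducing its variance, can be illustrated with a minimal sketch. The example below is not the paper's algorithm: it uses a toy two-armed bandit with a softmax policy and a plain mean-reward baseline (all names and settings are illustrative assumptions), rather than the average-reward actor-critic setting or the optimal baseline derived in the paper.

```python
import numpy as np

# Illustrative toy problem (not from the paper): a 2-armed bandit
# with deterministic rewards and a softmax policy over parameters theta.
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.5])      # policy parameters, one per arm
rewards = np.array([1.0, 0.2])     # deterministic reward of each arm

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a, theta):
    # Score function of the softmax policy: e_a - pi(theta).
    g = -softmax(theta)
    g[a] += 1.0
    return g

def pg_estimates(baseline, n=20000):
    # Draw n actions from the policy and form the score-function
    # gradient estimate grad log pi(a) * (r(a) - baseline) for each.
    pi = softmax(theta)
    actions = rng.choice(2, size=n, p=pi)
    return np.array([grad_log_pi(a, theta) * (rewards[a] - baseline)
                     for a in actions])

g_no_b = pg_estimates(0.0)                       # no baseline
g_b = pg_estimates(softmax(theta) @ rewards)     # mean reward as baseline

# Both estimators target the same gradient (equal sample means),
# but the baselined one has lower per-coordinate variance.
print("mean without baseline:", g_no_b.mean(axis=0))
print("mean with baseline:   ", g_b.mean(axis=0))
print("variance reduction:   ",
      g_no_b.var(axis=0).sum() - g_b.var(axis=0).sum())
```

The mean-reward baseline used here is a common heuristic; the paper's contribution is an *optimal* baseline that drives the covariance matrix of the estimator to zero, which this toy constant baseline does not achieve.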