This paper proposes a fast and efficient actor-critic reinforcement learning algorithm that is novel in at least two ways: it updates the critic only when the best action is executed and it takes full advantage of the powerful temporal difference (TD) prediction method to train a continuous-valued actor. Both actor and critic are represented separately by two adaptive neural fuzzy systems tuned by a backpropagation algorithm. While the critic adapts to the actor by minimizing the quadratic sum of TD error, the actor adapts to the critic, by not only using the TD error, but also by using the state value function. The new actor-critic architecture is applied to an inverted pendulum system, which is widely used to compare reinforcement learning architectures
Published in:
Electrical and Computer Engineering, 2000 Canadian Conference on
(Volume:2
)
Date of Conference: 2000