System Maintenance:
There may be intermittent impact on performance while updates are in progress. We apologize for the inconvenience.
By Topic

Implementation of gradient estimation to a constrained Markov decision problem

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)

Consider the problem of a constrained Markov Decision Process (MDP). Under a parameterization of the control strategies, the problem can be transformed into a non-linear optimization problem with non-linear constraints. Both the cost and the constraints are stationary averages. We assume that the transition probabilities of the underlying Markov chain are unknown: only the values of the control variables are known, as well as the instantaneous values of the cost and the constraints, so no analytical expression for the stationary averages is available. To find the solution to the optimization problem, a stochastic version of a primal/dual method with an augmented Lagrangian is used. The updating scheme uses a "measure valued" estimator of the gradients that can be interpreted in terms of a finite horizon version of the Perturbation Analysis (PA) method known as the "perturbation realization factors". Most finite horizon derivative estimators are consistent as the sample size grows, so it is common to assume that large enough samples can be observed so as to make the bias negligible. This paper deals with the actual implementations of the gradient estimators in finite horizon with small sample sizes, so that the iterates of the stochastic approximation can be performed very often, as would be required for on-line learning. We identify the asymptotic bias of the stochastic approximation for the constrained optimization method, and by so doing we propose several means to correct it. As is very common with these problems, the bias correction introduces a conflict between precision and speed: the smaller the bias, the slower the reaction time. In the sequel, we present the theoretical basis for the study of bias and learning rate. Our experimental results indicate that smoothing at a faster time scale may not be necessary at all, only at a slower time scale. We include results where the algorithms have to track changes in the environment.

Published in:

Decision and Control, 2003. Proceedings. 42nd IEEE Conference on  (Volume:5 )

Date of Conference:

9-12 Dec. 2003