The problem of controlling a finite Markov chain so as to maximize the long-run expected reward per unit time is studied. The chain's transition probabilities depend on an unknown parameter taking values in a subset [a,b] of R^n. A control policy is defined by the probabilities of selecting each control action in each state of the chain. A Taylor-like expansion formula is derived for the expected reward in terms of policy variations. Based on that result, a recursive stochastic gradient algorithm is presented for adapting the control policy at consecutive times. The gradient depends on the estimated transition parameter, which is itself updated recursively using the gradient of the likelihood function. Convergence with probability 1 is proved for both the control and the estimation algorithms.
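
To make the objective concrete, the following minimal Python sketch evaluates the long-run expected reward per unit time of a randomized policy on a small chain. It is not the algorithm of the paper: it assumes the transition probabilities are known rather than estimated, assumes the chain induced by the policy is ergodic and aperiodic, and finds the stationary distribution by plain power iteration. The names `induced_chain`, `stationary_distribution` and `average_reward`, and the two-state example data, are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only; the transition model is assumed known here,
# whereas the paper treats it as depending on an unknown parameter.

def induced_chain(P, R, policy):
    """Average the model over the policy.

    P[s][a][t]   : probability of moving from state s to t under action a
    R[s][a]      : one-step reward for taking action a in state s
    policy[s][a] : probability of selecting action a in state s
    """
    n_states, n_actions = len(P), len(P[0])
    P_pi = [[sum(policy[s][a] * P[s][a][t] for a in range(n_actions))
             for t in range(n_states)]
            for s in range(n_states)]
    r_pi = [sum(policy[s][a] * R[s][a] for a in range(n_actions))
            for s in range(n_states)]
    return P_pi, r_pi


def stationary_distribution(P_pi, iters=2000):
    """Power iteration mu <- mu * P_pi (assumes an ergodic, aperiodic chain)."""
    n = len(P_pi)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * P_pi[s][t] for s in range(n)) for t in range(n)]
    return mu


def average_reward(P, R, policy, iters=2000):
    """Long-run expected reward per unit time: sum over s of mu(s) * r_pi(s)."""
    P_pi, r_pi = induced_chain(P, R, policy)
    mu = stationary_distribution(P_pi, iters)
    return sum(m * r for m, r in zip(mu, r_pi))


# Hypothetical two-state, two-action example.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0: action 0, action 1
    [[0.5, 0.5], [0.1, 0.9]],  # state 1: action 0, action 1
]
R = [
    [1.0, 0.0],  # rewards in state 0
    [0.0, 2.0],  # rewards in state 1
]

always_0 = [[1.0, 0.0], [1.0, 0.0]]
always_1 = [[0.0, 1.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]

print(average_reward(P, R, always_0))  # close to 5/6
print(average_reward(P, R, always_1))  # close to 16/9
print(average_reward(P, R, uniform))
```

Because the average reward is a smooth function of the entries of `policy`, small variations of those entries change it smoothly, which is the quantity a gradient-based adaptation of the policy would act on. In the adaptive setting described above, the entries of `P` would come from the current parameter estimate rather than being fixed in advance.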