A new class of reinforcement schemes for learning automata is introduced that makes use of estimates of the random characteristics of the environment. Both a single automaton and a hierarchy of learning automata are considered. It is shown that, for sufficiently small values of the learning parameters, these algorithms converge in probability to the optimal choice of actions. Simulation shows that, in both cases, the algorithms converge quite rapidly. Finally, the generality of this method of designing learning schemes is pointed out, and it is shown that a very minor modification enables the algorithm to learn in a multiteacher environment as well.
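The abstract does not reproduce the update rule itself. As a rough illustration of the estimator idea, the following is a minimal sketch of one pursuit-style scheme in this family: the automaton keeps a running estimate of each action's reward probability and, at every step, shifts its action-probability vector a small amount (`lam`) toward the action whose estimate is currently best. The function name, parameter values, and the stationary binary-reward environment are all illustrative assumptions, not details taken from the paper.

```python
import random

def pursuit_learning(reward_probs, steps=5000, lam=0.01, seed=0):
    """Sketch of an estimator-based (pursuit-style) learning scheme.

    reward_probs[i] is the environment's unknown probability that
    action i is rewarded; the learner only observes the binary
    reward signal.  Returns the final action-probability vector.
    """
    rng = random.Random(seed)
    r = len(reward_probs)
    p = [1.0 / r] * r          # action-probability vector, starts uniform
    est = [0.0] * r            # running reward estimates (d-hat)
    counts = [0] * r           # how often each action was tried

    def pull(i):
        # try action i once and fold the binary reward into est[i]
        beta = 1 if rng.random() < reward_probs[i] else 0
        counts[i] += 1
        est[i] += (beta - est[i]) / counts[i]   # incremental mean

    # seed the estimates by trying every action a few times first
    for i in range(r):
        for _ in range(10):
            pull(i)

    for _ in range(steps):
        i = rng.choices(range(r), weights=p)[0]  # sample action from p
        pull(i)
        # move p a small step toward the unit vector of the action
        # with the currently highest reward estimate
        m = max(range(r), key=lambda k: est[k])
        p = [(1 - lam) * pk + (lam if k == m else 0.0)
             for k, pk in enumerate(p)]
    return p
```

With a small step size `lam`, the probability mass concentrates on the action whose estimated reward probability is highest, which mirrors the abstract's claim of convergence to the optimal action for small parameter values.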