Optimal feedback controllers are generally computed offline assuming full knowledge of the system dynamics. Adaptive controllers, on the other hand, are online schemes that effectively learn to compensate for unknown system dynamics and disturbances. Generally, direct adaptive schemes do not converge to optimal control solutions for user-prescribed performance measures. In recent years, it has been shown that reinforcement learning techniques from computational intelligence can be used to learn optimal feedback controllers online, using direct adaptive control techniques, without knowing the system dynamics. Most reinforcement learning methods, however, require full measurements of the system's internal state. In this paper we develop reinforcement learning methods that require only output feedback and yet converge to an optimal controller. Deterministic linear time-invariant systems are considered, and both policy iteration (PI) and value iteration (VI) algorithms are derived. This corresponds to optimal control for a class of partially observable Markov decision processes (POMDPs). It is shown that, as with Q-learning, the new output-feedback optimal learning methods have the important advantage that knowledge of the system dynamics is not needed for their implementation; only the system order and an upper bound on its observability index must be known. The learned output-feedback controller takes the form of a polynomial ARMA controller whose performance is equivalent to that of the optimal state-variable feedback gain.
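To make the value-iteration idea concrete, the following is a minimal sketch of the state-feedback special case: Q-learning-style VI for discrete-time LQR, where a quadratic Q-function is fitted by least squares from measured transition data, so the system matrices are never used by the learner (they appear only in the data-generating simulator). The specific system, cost weights, and sample counts below are illustrative assumptions, not taken from the paper; the paper's output-feedback methods would replace the measured state with a history of past inputs and outputs of length bounded by the observability index.

```python
import numpy as np

# Simulated plant (used ONLY to generate data; the learner never sees A, B).
# Illustrative example: a discretized double integrator.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Qc = np.eye(2)          # state cost weight
Rc = np.array([[1.0]])  # input cost weight
n, m = 2, 1

rng = np.random.default_rng(0)
N = 100
X = rng.uniform(-1, 1, (N, n))        # exploratory states
U = rng.uniform(-1, 1, (N, m))        # exploratory inputs
Xn = X @ A.T + U @ B.T                # measured next states

def features(x, u):
    # Quadratic basis for a symmetric 3x3 Q-function matrix H.
    z = np.concatenate([x, u])
    return np.array([z[0]**2, z[1]**2, z[2]**2,
                     2*z[0]*z[1], 2*z[0]*z[2], 2*z[1]*z[2]])

Phi = np.array([features(X[i], U[i]) for i in range(N)])
stage_cost = np.einsum('ij,jk,ik->i', X, Qc, X) + np.einsum('ij,jk,ik->i', U, Rc, U)

# Value iteration on the Q-function: fit z'Hz = cost + V_j(x_next),
# where V_j(x) = x'Px with P extracted from the previous H.
P = np.zeros((n, n))
for _ in range(200):
    targets = stage_cost + np.einsum('ij,jk,ik->i', Xn, P, Xn)
    t, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    H = np.array([[t[0], t[3], t[4]],
                  [t[3], t[1], t[5]],
                  [t[4], t[5], t[2]]])
    Hxx, Hxu, Huu = H[:n, :n], H[:n, n:], H[n:, n:]
    P = Hxx - Hxu @ np.linalg.solve(Huu, Hxu.T)

K_learned = np.linalg.solve(Huu, Hxu.T)  # u = -K_learned @ x

# Model-based check: iterate the discrete-time Riccati equation directly.
Pm = np.zeros((n, n))
for _ in range(200):
    Km = np.linalg.solve(Rc + B.T @ Pm @ B, B.T @ Pm @ A)
    Pm = Qc + A.T @ Pm @ A - A.T @ Pm @ B @ Km
K_star = np.linalg.solve(Rc + B.T @ Pm @ B, B.T @ Pm @ A)

print("learned K:", K_learned)
print("Riccati K:", K_star)
```

Because the stage cost and value targets are exactly quadratic in the data, the least-squares fit recovers each VI iterate exactly, and the learned gain matches the gain obtained from model-based Riccati iteration; the "model-free" property is that the learner touches only the tuples (x, u, x_next, cost).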