
Hidden Markov Model Estimation-Based Q-learning for Partially Observable Markov Decision Process


Abstract:

The objective is to study an on-line Hidden Markov model (HMM) estimation-based Q-learning algorithm for partially observable Markov decision processes (POMDPs) on finite state and action sets. When full state observation is available, Q-learning finds the optimal action-value function for each state and action (the Q-function). However, Q-learning can perform poorly when full state observation is not available. In this paper, we formulate the POMDP estimation as an HMM estimation problem and propose a recursive algorithm that estimates both the POMDP parameters and the Q-function concurrently. We also show that the POMDP estimation converges to a set of stationary points of the maximum likelihood estimate, and that the Q-function estimation converges to a fixed point satisfying the Bellman optimality equation weighted on the invariant distribution of the state belief determined by the HMM estimation process.
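To make the belief-weighted idea in the abstract concrete, the sketch below shows a Bayesian belief-state update under estimated HMM parameters followed by a Q-learning step whose credit assignment is weighted by the belief. This is a minimal illustration, not the authors' recursive algorithm (which estimates the HMM parameters on-line at the same time); the names T_hat, O_hat, belief_update, and q_update are assumptions introduced here for illustration.

```python
import numpy as np

# Illustrative sketch only. Assumed shapes:
#   T_hat[s, a, s'] : estimated transition probabilities
#   O_hat[s, o]     : estimated observation likelihoods
#   b               : current belief over hidden states (sums to 1)
#   Q[s, a]         : action-value table

def belief_update(b, a, o, T_hat, O_hat):
    """Bayesian filter: b'(s') ∝ O_hat[s', o] * sum_s T_hat[s, a, s'] * b(s)."""
    b_pred = T_hat[:, a, :].T @ b          # predict next-state distribution
    b_new = O_hat[:, o] * b_pred           # correct with observation likelihood
    return b_new / b_new.sum()             # renormalize to a distribution

def q_update(Q, b, a, r, b_next, alpha=0.1, gamma=0.95):
    """Q-learning step taken in expectation over the current belief."""
    q_sa = b @ Q[:, a]                          # E_b[Q(s, a)]
    target = r + gamma * np.max(b_next @ Q)     # greedy bootstrap on next belief
    Q[:, a] += alpha * b * (target - q_sa)      # spread the update by belief weight
    return Q
```

In a loop, each transition (a, r, o) would first refresh the belief with belief_update and then apply q_update; the convergence result in the paper is stated with respect to the invariant distribution of this belief process.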
Date of Conference: 10-12 July 2019
Date Added to IEEE Xplore: 29 August 2019
Conference Location: Philadelphia, PA, USA

I. Introduction

Reinforcement learning (RL) has been receiving significant attention due to recent demonstrations in which RL agents outperform humans at certain tasks, such as video games [1] and the game of Go [2]. Although these demonstrations show the great potential of RL, the game environments are confined and restrictive compared with what ordinary humans encounter in everyday life. One of the major differences between a game environment and real life is the presence of unknown factors, i.e., the observation of the state of the environment is incomplete. Most RL algorithms are based on the assumption that complete state observation is available and that the state transition depends only on the current state and action (the Markovian assumption). The Markov decision process (MDP) is a modeling framework built on the Markovian assumption, and the development and analysis of standard RL algorithms are based on MDPs. Applying these RL algorithms with incomplete observation may lead to poor performance. In [3], the authors showed that a standard policy evaluation algorithm can incur arbitrarily large error due to incomplete state observation.
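The failure mode described above can be seen in standard tabular Q-learning when the table is keyed directly on observations: two distinct hidden states that emit the same observation (state aliasing) share one table entry, so their statistics are mixed. The sketch below is a hypothetical illustration of this naive approach under partial observability; the env interface (reset/step returning an observation, reward, and done flag) is an assumption made for the example.

```python
import numpy as np

def q_learning_on_observations(env, n_obs, n_actions,
                               episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning keyed on observations instead of hidden states.

    Illustrative only: when observations alias several hidden states, the
    Bellman update below averages returns from those states into one entry,
    which is why such updates can be arbitrarily wrong under partial
    observability [3].
    """
    Q = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        obs, done = env.reset(), False          # assumed interface: reset() -> obs
        while not done:
            # epsilon-greedy action selection on the aliased observation
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[obs]))
            next_obs, r, done = env.step(a)     # assumed interface: step() -> (obs, r, done)
            # Bellman update keyed on the (possibly aliased) observation
            Q[obs, a] += alpha * (r + gamma * np.max(Q[next_obs]) - Q[obs, a])
            obs = next_obs
    return Q
```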

