I. Introduction
Reinforcement learning (RL) has recently received significant attention following demonstrations in which RL agents outperform humans at certain tasks, such as playing video games [1] and the game of Go [2]. Although these demonstrations show the great potential of RL, the game environments involved are confined and restrictive compared with what ordinary humans encounter in everyday life. One of the major differences between game environments and real life is the presence of unknown factors, i.e., the observation of the state of the environment is incomplete. Most RL algorithms are built on the assumption that a complete state observation is available and that the state transition depends only on the current state and action (the Markovian assumption). The Markov decision process (MDP) is the modeling framework that embodies this assumption, and the development and analysis of standard RL algorithms are based on the MDP. Applying these RL algorithms under incomplete observation may lead to poor performance; in [3], the authors showed that a standard policy evaluation algorithm can incur arbitrarily large errors due to incomplete state observation.
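For concreteness, the Markovian assumption can be stated in standard notation (the symbols here are generic, not taken from the cited works): if $s_t$ and $a_t$ denote the state and action at time $t$, the next state depends on the history only through the current state and action,
\[
\Pr\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0\right) = \Pr\left(s_{t+1} \mid s_t, a_t\right).
\]
Under incomplete observation, the agent instead receives an observation of the state, and the process seen through the observations need not satisfy this property.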
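As a minimal sketch of why such errors arise, consider a hypothetical two-state example of our own (not the construction analyzed in [3]): when two states with different values emit the same observation, any value estimate keyed on the observation alone carries an irreducible error, and that error grows with the reward scale.

```python
import random

# Toy illustration (not the construction in [3]): two hidden states that
# emit the identical observation. Their true values differ, but an
# estimator keyed on the observation can only learn one blended value.

R = {"s1": 1.0, "s2": -1.0}  # hypothetical rewards; episode ends after one step
true_V = dict(R)             # with one-step termination, V(s) = R[s]

V_obs = 0.0                  # single value for the shared observation "o"
alpha = 0.01
for _ in range(50_000):
    s = random.choice(["s1", "s2"])  # both states occur under the policy
    # Monte Carlo value update keyed on the observation, not the state
    V_obs += alpha * (R[s] - V_obs)

print(f"learned V(o) = {V_obs:+.2f}")  # approximately 0.0
print(f"true values: V(s1) = {true_V['s1']:+.2f}, V(s2) = {true_V['s2']:+.2f}")
# Per-state error is about 1.0; scaling both rewards by a factor c scales
# the error by c, which is the sense in which it can be made arbitrarily large.
```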