Online Learning with Implicit Exploration in Episodic Markov Decision Processes


Abstract:

A wide range of applications require autonomous agents that are capable of learning an a priori unknown task. Additionally, an autonomous agent may be placed in the same environment multiple times, each time having to learn a different task. Motivated by these applications, we study the problem of learning an a priori unknown and evolving task in an online manner. In particular, we consider an agent whose behavior is modeled by an episodic Markov decision process. The agent's task, captured by a loss function, is unknown to the agent and, furthermore, may change in an adversarial manner from episode to episode. However, in each episode, the agent receives bandit feedback on that episode's loss function each time it takes an action. Given a limited budget of T episodes, the objective is to learn a policy with minimum regret with respect to the best policy in hindsight. We propose a policy search algorithm that employs online mirror descent with an optimistically biased estimator of the loss function. We prove that the proposed algorithm achieves, both in expectation and with high probability, a sublinear regret of Õ(√(L T |S| |A|)), where L is the length of each episode, |S| is the number of states, and |A| is the number of actions.
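
The core mechanism named in the abstract, online mirror descent driven by an optimistically biased ("implicit exploration") loss estimator, is easiest to see in the single-state special case of the episodic MDP, where it reduces to the Exp3-IX bandit algorithm. The sketch below is illustrative only: the paper's actual algorithm runs mirror descent over occupancy measures of the full episodic MDP, and the function name, the i.i.d. stand-in losses, and the eta = 2*gamma tuning here are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def exp3_ix(losses, eta, gamma, seed=0):
    """Exponential-weights mirror descent with the implicit-exploration
    (IX) loss estimator, in the single-state (multi-armed bandit)
    special case. losses: (T, K) array of per-round losses in [0, 1]."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    L_hat = np.zeros(K)      # cumulative IX loss estimates
    total_loss = 0.0
    for t in range(T):
        # Entropic mirror-descent update == exponential weights;
        # subtracting the min keeps the exponentials numerically stable.
        w = np.exp(-eta * (L_hat - L_hat.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        loss = losses[t, a]
        total_loss += loss
        # IX estimator: the +gamma in the denominator biases the
        # estimate optimistically (downward), implicitly encouraging
        # exploration; this bias is what enables high-probability,
        # not just expected, regret bounds.
        L_hat[a] += loss / (p[a] + gamma)
    return total_loss

# Illustrative usage with i.i.d. losses standing in for an adversary.
T, K = 5000, 10
losses = np.random.default_rng(1).random((T, K))
gamma = np.sqrt(np.log(K) / (K * T))
total = exp3_ix(losses, eta=2 * gamma, gamma=gamma)
```

In the full episodic setting, the distribution p over actions is replaced by an occupancy measure over state-action pairs, and the mirror-descent step projects onto the polytope of valid occupancy measures; the IX denominator then uses the occupancy of the visited state-action pair rather than a single action probability.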
Date of Conference: 25-28 May 2021
Date Added to IEEE Xplore: 28 July 2021
Conference Location: New Orleans, LA, USA

