Reinforcement learning methods have been successfully used to optimise dialogue strategies in statistical dialogue systems. Typically, reinforcement learning techniques learn on-policy, i.e., the dialogue strategy is updated online while the system is interacting with a user. An alternative to this approach is off-policy reinforcement learning, which estimates an optimal dialogue strategy offline from a fixed corpus of previously collected dialogues. This paper proposes a novel off-policy reinforcement learning method based on natural policy gradients and importance sampling. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments indicate that the proposed method learns a dialogue strategy that significantly outperforms the baseline handcrafted dialogue policy.
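To illustrate the importance sampling idea mentioned above, the following is a minimal sketch of how the expected reward of a new (target) policy might be estimated from dialogues collected under a different (behaviour) policy. It is not the algorithm proposed in the paper: the natural policy gradient update is omitted, and the names `Dialogue`, `importance_weight`, and `weighted_is_estimate`, as well as the representation of a dialogue as state-action turns with a single final reward, are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: a dialogue is a list of (state, action) turns
# together with one scalar reward assigned at the end of the dialogue.
Turn = Tuple[str, str]
Dialogue = Tuple[List[Turn], float]
# A policy returns the probability of taking `action` in `state`.
Policy = Callable[[str, str], float]


def importance_weight(turns: List[Turn], target: Policy, behaviour: Policy) -> float:
    """Product over turns of target(a|s) / behaviour(a|s)."""
    weight = 1.0
    for state, action in turns:
        # Assumes behaviour(state, action) > 0 for every logged turn.
        weight *= target(state, action) / behaviour(state, action)
    return weight


def weighted_is_estimate(corpus: List[Dialogue], target: Policy, behaviour: Policy) -> float:
    """Weighted (self-normalised) importance sampling estimate of the
    expected dialogue reward under the target policy."""
    weights = [importance_weight(turns, target, behaviour) for turns, _ in corpus]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("target policy assigns zero probability to every logged dialogue")
    weighted_rewards = sum(w * reward for w, (_, reward) in zip(weights, corpus))
    return weighted_rewards / total
```

Dialogues that the target policy would have been more likely to produce than the behaviour policy receive a larger weight, so a fixed corpus can be reused to evaluate policies that never interacted with users. The self-normalised form is used here because it typically has lower variance than the plain average of weighted rewards, at the cost of some bias.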
Spoken Language Technology Workshop (SLT), 2012 IEEE
Date of Conference: 2-5 Dec. 2012