Hierarchical Reinforcement Learning With Guidance for Multi-Domain Dialogue Policy

Achieving high performance in a multi-domain dialogue system with low computation is undoubtedly challenging. Previous works applying an end-to-end approach have been very successful. However, the computational cost remains a major issue since a large language model based on GPT-2 is required. Meanwhile, the optimization of individual components in the dialogue system has not shown promising results, especially for the dialogue management component, due to the complexity of multi-domain state and action representation. To cope with these issues, this article presents an efficient guidance learning where imitation learning and hierarchical reinforcement learning (HRL) with human-in-the-loop are performed to achieve high performance via an inexpensive dialogue agent. Behavior cloning with auxiliary tasks is exploited to identify the important features in the latent representation. In particular, the proposed HRL is designed to treat each goal of a dialogue with the corresponding sub-policy so as to provide efficient dialogue policy learning by utilizing the guidance from a human through action pruning and action evaluation, as well as the reward obtained from the interaction with the simulated user in the environment. Experimental results on the ConvLab-2 framework show that the proposed method achieves state-of-the-art performance in dialogue policy optimization and outperforms the GPT-2 based solutions in end-to-end system evaluation.


I. INTRODUCTION
As dialogue systems have become widespread owing to their potential applications in real-world scenarios, designing a high-performance task-oriented dialogue system with low computation cost remains a crucial issue. This is especially true in multi-domain dialogue, where a large number of possible combinations of user intents, semantic slots and values must be correctly satisfied by the system. Common approaches to dialogue systems, based on the end-to-end method [1] and the pipeline method [2], [3], have been proposed with various strengths and shortcomings. In the end-to-end method, the dialogue task is formulated as a generative model where all dialogue system components, including natural language understanding [4], dialogue state tracking, dialogue management and natural language generation, are jointly optimized to provide an appropriate response to the user. Recent end-to-end strategies built on the generative pre-trained transformer 2 (GPT-2) [5] have achieved state-of-the-art (SOTA) results both in the automatic evaluation based on interaction with the bot and in the human evaluation conducted with Amazon Mechanical Turk [6]. Unfortunately, such an approach suffers from two problems. First, the required computation cost is huge because a large-scale language model, GPT-2, is used. Second, several pre- and post-processing stages must be conducted, as GPT-2 is not specially designed for solving the dialogue task. On the other hand, the modular approach, which optimizes individual components in the dialogue system, offers simpler training and lower computation. However, the performance in [7], [8] only showed sub-optimal results. Finding an ideal dialogue policy, which determines the system response to the user, is extremely difficult in component-level optimization. Recent attempts reveal that the majority of dialogue policies have been formulated as a reinforcement learning (RL) task [9], [10], [11].
Unfortunately, the high-dimensional state and action spaces may contain hundreds of entries, which easily confuse the dialogue policy in determining an appropriate action. The problem is aggravated by the fact that exploration in this task is very limited, unlike in common RL tasks such as robotic control and Atari games, where the agent needs to explore the environment more frequently to find more possible solutions. In the multi-domain dialogue task, too much exploration during training may harm the performance and lead to out-of-domain responses. As a consequence of the aforementioned constraints, the results revealed that good performance could only be obtained in the component-wise single-turn evaluation rather than in the end-to-end system evaluation using all turns of each dialogue session [7], which indicates that the trained dialogue policy is not yet suitable for implementation in a real-world setting.
This paper presents a guidance learning to handle three dialogue strategies where two learning stages are performed. The first stage is the imitation learning, which implements behavior cloning (BC) with auxiliary tasks (denoted by BCAux) to improve the generalization to unseen data, which are likely to be observed once the BC agent is applied in the real environment to interact with the user. The generalization is improved by utilizing the learned features obtained from an auxiliary network. The auxiliary tasks consist of predicting the current belief state and the user action, which are selected due to their importance in determining the system action in every dialogue turn. Furthermore, recent works [12], [13], [14] show that the introduction of auxiliary tasks can reduce the causal confusion phenomenon, which causes the inability of the agent to understand the true cause of an expert action from a given dataset of state-action pairs. In this way, the first stage implements a strategy for finding reliable pre-trained weights for the subsequent stage. The second stage is built on two strategies, where the first is the hierarchical RL (HRL) and the second is the human-in-the-loop (HITL). The hierarchical strategy is implemented to simplify the multi-domain task by treating each base domain with its corresponding sub-policy in the low-level policy. The base domain is defined as the domain that occurs first in a dialogue session, which indicates the domain with the highest priority. The low-level policy is trained by using proximal policy optimization [15] through interaction with the simulated user in a given environment. The weights of the policy network obtained from BCAux are set as the pre-trained weights for the low-level policy.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Meanwhile, the high-level policy in the hierarchical structure generates a latent vector which is used to activate a sub-policy in a given dialogue session. The hierarchical policies are trained by using the policy gradient method. Because the dialogue agent is easily trapped in confounded states, the HITL learning paradigm is utilized to guide the agent by providing an action correction for the confounded state. Furthermore, an action evaluation in each dialogue turn is proposed to handle a dialogue policy which may produce sub-optimal actions in some states. An efficient guidance learning from human and environment is developed to fulfill the dialogue policy optimization with a performance close to that of the rule policy, which serves as the human in this work. Even with low human supervision, the hierarchical scheme still enhances the learning efficiency when compared to baseline systems with maximal human guidance. The proposed work achieves a competitive result with low computation cost.
The rest of this paper is organized as follows. In Section II, the multi-domain task-oriented dialogue system and dialogue management are surveyed. Section III presents the efficient guidance learning for dialogue management with the optimization process which includes imitation learning as well as hierarchical RL. Section IV addresses the experimental settings for evaluation of dialogue management. The experimental results to illustrate the learning efficiency of the proposed method relative to the recent methods are addressed. The summary of findings from this study is provided in Section V.

II. MULTI-DOMAIN TASK-ORIENTED DIALOGUE
The recent approaches to build dialogue system and handle dialogue management are surveyed and discussed.

A. Multi-Domain Dialogue System
Many researchers have worked on realistic scenarios based on multi-domain task-oriented dialogue systems. However, providing an appropriate solution to this task is very challenging. Different from the previous dialogue tasks considering only limited domains like movie ticket booking [16], flight booking [17] and restaurant reservation [18], [19], multi-domain dialogue involves various domains in a single user goal, as shown in Fig. 1. As a result, the dialogue structure becomes complicated due to the increase of possible scenarios. In order to obtain a desirable performance, the dialogue system needs to satisfy all of the user intentions in each domain concerned in the current goal within a limited number of time steps. Among the various frameworks designed for the multi-domain dialogue task [20], [21], ConvLab-2 [22] is the most popular framework and is mainly designed for handling the MultiWOZ 2.1 dataset [21]. ConvLab-2 provides flexible dialogue system structures supporting various ways of optimization, as illustrated in Fig. 2. Therefore, researchers are allowed to build their own dialogue system either in a pipeline fashion, which requires optimization of individual components including natural language understanding (NLU), dialogue state tracking (DST), dialogue policy (POL) and natural language generation (NLG) [23], or in an end-to-end manner, which optimizes all components jointly. It is also possible to investigate joint optimization of some pipeline components, such as the word-level scenarios word-DST and word-POL, which jointly optimize NLU-DST and policy-NLG, respectively. Another important benefit of the ConvLab-2 framework is the end-to-end system evaluation, which faithfully reflects the human evaluation in real-world applications.
In this evaluation, a system-wise evaluation is performed instead of a component-wise evaluation. The component-wise evaluation merely examines a specific component of the dialogue pipeline system by using a single-turn evaluation, assuming that the model is provided with the ground truth from the other components or from the previous dialogue turn. Metrics of task success and inform rate are measured by using the current user utterance, dialogue state, and database query, as has been done in previous works [24]. On the other hand, the system-wise evaluation considers all components in the dialogue pipeline system via an end-to-end system evaluation along with the multi-turn conversation conducted with the simulated user, which represents a human user. All of the assessments in this study take a system-wise or end-to-end approach, which closely resembles a real-world scenario.

B. Multi-Domain Dialogue Policy
Many studies have been devoted to developing multi-domain dialogue policy, which is regarded as a critical component in dialogue systems. Based on the current benchmark results, both word-level [2], [3], [25] and end-to-end optimization strategies [1], [26] resulted in sub-optimal performance in the end-to-end system evaluation, although good performance was achieved in the component-wise evaluation. Accordingly, many attempts have been made to improve the dialogue policy by using reinforcement learning (RL) [27], as shown in the two most popular dialogue framework benchmarks, ConvLab-2 [22] and PyDial [20], [28]. To build an RL agent, the first important step is to train the dialogue policy using behaviour cloning (BC), which is seen as a type of imitation learning, by utilizing the state and action pairs from a dialogue dataset. Those pairs are commonly formed by using pre-defined vectorization functions that convert the sentences in the dialogue dataset to vectors that are suitable for RL training. In the case of the MultiWOZ 2.1 dataset, the vectorization process yields a state vector of size 340 consisting of six different partitions, which are user action, system action, belief state information, booking information, database pointer and state termination. In addition, the action is represented as a vectorized version of the dialogue act with a dimension of 209, which consists of four information sources including domain, action type, slot and value.
Given the state-action pairs $\mathcal{D} = \{s_n, y_n\}_{n=1}^{N}$, where $s_n$ and $y_n$ denote the state and the target corresponding to an expert action, respectively, the policy network $\pi_\theta(\cdot)$ for finding action $a$ with the optimal parameter $\theta^*$ is estimated by maximizing the log likelihood or minimizing the mean squared error or cross-entropy error $\mathcal{L}_{BC}(\cdot)$ for regression or classification, respectively, from the training data $\mathcal{D}$ via

$$\theta^* = \arg\min_{\theta} \sum_{n=1}^{N} \mathcal{L}_{BC}\big(\pi_\theta(s_n), y_n\big). \tag{1}$$

Because the dataset $\mathcal{D}$ only contains successful trajectories, the RL agent that uses BC weights is prone to produce failed trajectories: once the agent takes an incorrect action in the environment, it encounters trajectories that were never observed in the data. Therefore, while BC is a simple strategy, achieving acceptable outcomes in real-world applications is still very difficult. Some sophisticated approaches have been proposed, such as training the agent by incorporating a reward function learned from expert trajectories [30], [31], [32] based on adversarial inverse reinforcement learning [29]. The training process was similar to that of generative adversarial networks [33]. Another approach was developed with model-based RL [34], [35], where the model was trained to replicate the user behavior so that the agent could progress through the planning phase with sample efficiency. Unfortunately, such approaches either only worked in a somewhat simple setting or only performed well in component-wise evaluations that solely looked at single turns. As a result, when the learned agents were evaluated in an end-to-end system evaluation via multi-turn dialogues, the desired results were not achieved.
Due to the success of the transformer [36] in natural language processing tasks, many attempts have been made to apply it to multi-domain dialogue systems [37], [38] in an end-to-end optimization manner. Recent work has also introduced offline RL optimization to improve the performance of the transformer-based system [39]. Notably, significant results were reported in the latest DSTC9 track 2 challenge, in which end-to-end optimization based on the GPT-2 model [40], [41] achieved SOTA performance. That was the first time that an end-to-end approach outperformed component-level optimization. However, because a large-scale language model (LM) using GPT-2 was used as the default component, the end-to-end method required a high computation cost. Furthermore, enhanced data from other datasets as well as extensive pre- and post-processing were needed to make the LM operate in a task-oriented conversation system. In this paper, a new method is proposed that preserves low-cost computation while significantly improving the multi-domain dialogue policy. The proposed strategy addresses the shortcomings of previous policies by means of hierarchical RL, which can simplify the problem formulation in a multi-domain dialogue system. This strategy is efficient and does not require data enhancement during the pre- and post-processing stages.

III. HIERARCHICAL RL WITH GUIDANCE
An overview of the proposed method is depicted in Fig. 3. The first step involves data pre-processing to obtain the inputs and the corresponding labels for training the neural networks for BCAux. Next, the weights of the policy network in BCAux are used as the pre-trained weights for the low-level policy in the hierarchical RL (HRL). HRL is trained according to the hierarchical policy gradient, with a human providing additional guidance during training.

A. Imitation Learning With Auxiliary Tasks
Typically, introducing auxiliary objectives in the construction of a target model is a promising way to regularize the model when dealing with unseen data [12], [42], [43]. The main purpose of auxiliary tasks is to share the representations learned from the auxiliary objectives with the primary model to boost primary task performance. In the case of dialogue policy optimization, the auxiliary tasks help the primary model to understand the true cause of an expert action given a certain state. This study presents behavior cloning with auxiliary tasks as a specialized imitation learning to provide reliable pre-trained weights for the low-level policy in the subsequent HRL optimization. At the beginning, the input states and the corresponding targets, both for primary and auxiliary tasks, must be formed from the dialogue dataset $\mathcal{D} = \{s_n, y_n\}_{n=1}^{N}$. The input state and the primary target for system action are denoted by $s \in \mathbb{R}^{340}$ and $y_{sa} \in \mathbb{R}^{209}$, respectively. The auxiliary tasks consist of predicting the targets of belief state $y_{bs} \in \mathbb{R}^{24}$ and user action $y_{ua} \in \mathbb{R}^{78}$, which are selected due to their importance in determining an appropriate system response. Since all tasks are seen as multi-label binary classification, the binary cross-entropy losses are modified by considering the balancing parameters $\{\beta_{sa}, \beta_{bs}, \beta_{ua}\}$ due to the class imbalance between positive and negative samples. In addition to the one-hot target vector $y_n$, we calculate the ratios of the numbers of negative samples over all samples corresponding to the classification labels for system actions, belief states and user actions to determine $\beta_{sa}$, $\beta_{bs}$ and $\beta_{ua}$, respectively. Such ratios are commonly used to handle class imbalance [44] in multi-class classification, which fits the setting in this work.
Next, the architecture of the proposed BCAux is depicted in Fig. 4. The primary and auxiliary networks share a common feature extractor $s_{fea} \triangleq f_{\psi_{fea}}(s)$ with parameter $\psi_{fea}$, which is designed to provide meaningful features for the primary task by taking advantage of the auxiliary tasks. The loss function of this scheme, $\mathcal{L}_{BCAux} = \mathcal{L}_{sa} + \mathcal{L}_{bs} + \mathcal{L}_{ua}$, integrates three classification losses from a policy network and an auxiliary network, where four parameters $\psi = [\psi_{fea}, \psi_{sa}, \psi_{bs}, \psi_{ua}]$ are included. The first loss is devoted to the primary task, which predicts the expert action given an input state. The remaining two losses belong to the auxiliary tasks for prediction of belief state and user action. Given the samples of input feature $s_{fea}$ and target $y$ ($y_{sa}$, $y_{bs}$ or $y_{ua}$) from dataset $\mathcal{D}$, and the ratios $\beta$ ($\beta_{sa}$, $\beta_{bs}$ or $\beta_{ua}$), the balanced cross-entropy loss $\mathcal{L}$ ($\mathcal{L}_{sa}$, $\mathcal{L}_{bs}$ or $\mathcal{L}_{ua}$) between the true target $y$ and the predicted target $\hat{y} \triangleq f_{\psi}(s_{fea})$ ($\hat{y}_{sa}$, $\hat{y}_{bs}$ or $\hat{y}_{ua}$) with mapping parameter $\psi$ ($\psi_{sa}$, $\psi_{bs}$ or $\psi_{ua}$) is yielded by

$$\mathcal{L} = -\mathbb{E}_{(s_{fea}, y) \sim \mathcal{D}} \big[ \beta\, y \log \hat{y} + (1 - \beta)(1 - y) \log (1 - \hat{y}) \big] \tag{2}$$

which is consistently applied to all three tasks.

Fig. 4. Architecture of the proposed BCAux consisting of the policy network for handling the primary task and predicting system action $\hat{y}_{sa}$ and the auxiliary networks for predicting belief state $\hat{y}_{bs}$ and user action $\hat{y}_{ua}$.
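The balanced cross-entropy described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the helper names `negative_ratio` and `balanced_bce` are hypothetical, and the weighting (positives weighted by the negative-sample ratio $\beta$, negatives by $1-\beta$) is an assumption based on the usual class-balanced formulation.

```python
import math

def negative_ratio(labels):
    """Per-class beta: fraction of samples whose label is 0 (assumed definition)."""
    n = len(labels)
    num_classes = len(labels[0])
    return [sum(1 - row[c] for row in labels) / n for c in range(num_classes)]

def balanced_bce(y_true, y_pred, beta, eps=1e-12):
    """Balanced binary cross-entropy averaged over samples and classes.

    Positive terms are weighted by beta and negative terms by 1 - beta,
    so that rare positive labels contribute more to the loss.
    """
    total, count = 0.0, 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p, b in zip(t_row, p_row, beta):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total += -(b * t * math.log(p) + (1 - b) * (1 - t) * math.log(1 - p))
            count += 1
    return total / count

# Toy multi-label targets: 4 samples, 3 classes (illustrative data only).
labels = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
beta = negative_ratio(labels)
```

With this weighting, missing a rare positive label is penalized more heavily than raising a false alarm with the same confidence, which is the intended effect under class imbalance.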

B. Hierarchical RL With Sub-Policies
Instead of dealing with each dialogue session using standard RL, in this work, a hierarchical RL (HRL) based on the policy gradient method is proposed to handle the user's goals by treating them individually according to their base domain. The base domain is defined as the domain that is the main concern of a dialogue and always occurs at the beginning of the dialogue. Different from the common HRL, in which the high-level policy chooses an action at every pre-determined period of time or after reaching a specific sub-goal [45], the high-level policy in this work only outputs an action at the beginning of a dialogue session to activate the sub-policy in the low-level policy that corresponds to the base domain of the current dialogue. This scenario is reasonable for building a task-oriented dialogue system, where the dialogue policy needs to satisfy a user's goal within very few time steps, ideally fewer than 15 time steps on average, which is significantly lower than in standard HRL tasks like maze or robotic tasks that require hundreds to thousands of time steps to achieve the goal. By implementing HRL in this setting, the complexity of the task, which involves huge state and action spaces, can be reduced so that the dialogue policy training can be optimized. As a hierarchical approach is used, the standard PG [46] is calculated over the accumulated reward with the trajectory of states and actions $\tau = \{s_t, a_t\}_{t=0}^{T-1}$ drawn by a policy $\pi_\theta(\cdot)$ with the hierarchical parameters $\theta = \{\theta_h, \theta_l\}$ at different levels in the form of

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \big[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \big] \tag{3}$$

where $\theta_h$ denotes the parameter of the high-level policy and $\theta_l$ denotes the parameter of the low-level policy. The reward $R(\tau) = \{R_h(s_0), R_l(\tau)\}$ involves the terms $R_h$ and $R_l$ for the high-level and low-level policies using the initial state $s_0$ and the remaining trajectory $\tau$, respectively.
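Given the output of the high-level policy $z$ from $\{z_k\}_{k=1}^{K}$, the trajectory distribution of an agent with $K$ sub-policies is yielded and expanded over a trajectory of $T$ steps by

$$\pi_\theta(\tau) = p(s_0) \sum_{k=1}^{K} \pi_{\theta_h}(z_k|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z_k)\, p(s_{t+1}|s_t, a_t). \tag{4}$$

Considering (3) and (4), the hierarchical policy gradient is accordingly calculated by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Big[ \nabla_\theta \log \Big( \sum_{k=1}^{K} \pi_{\theta_h}(z_k|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z_k) \Big) R(\tau) \Big] \tag{5}$$

where the terms independent of $\theta$ are disregarded and the hierarchical setting of $\{z_k\}_{k=1}^{K}$ gives the gradient

$$\nabla_\theta \log \pi_\theta(\tau) = \frac{\nabla_\theta \sum_{k=1}^{K} \pi_{\theta_h}(z_k|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z_k)}{\sum_{k=1}^{K} \pi_{\theta_h}(z_k|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z_k)}. \tag{6}$$

Unfortunately, such a gradient is prone to be unstable during training if the high-level policy outputs a wrong $z_k$, i.e. the optimal output $z^*$ is missed as $z_k \neq z^*$, which means that the high-level policy assigns an incorrect sub-policy to deal with the user's specific goal. Suppose that $0 < \pi_{\theta_l}(a_t|s_t, z_k) < \rho$ holds for a wrong $z_k$ in trajectory $\tau$. Then, the probability of a non-optimal trajectory for each low-level policy is upper bounded by $\rho^T$. The gradient for each non-optimal low-level policy using the output of the high-level policy $z_k$ can be derived and its computational complexity can be obtained by

$$\nabla_\theta \Big[ \pi_{\theta_h}(z_k|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z_k) \Big] = O\big(T \rho^{T-1}\big). \tag{7}$$

By merging (7) in (6), the gradient is then updated by considering the calculation corresponding to the optimal latent variable $z^*$ as well as the other $K-1$ non-optimal $z_k$

$$\nabla_\theta \log \pi_\theta(\tau) = \frac{\nabla_\theta \Big[ \pi_{\theta_h}(z^*|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z^*) \Big]}{\pi_{\theta_h}(z^*|s_0) \prod_{t=0}^{T-1} \pi_{\theta_l}(a_t|s_t, z^*)} + (K-1)\, O\big(T \rho^{T-1}\big) \tag{8}$$

where the computation $(K-1)\,O(T\rho^{T-1})$ is the source of instability that must be carefully tackled in the learning process.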
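As a numerical illustration of the mixture over sub-policies discussed above, the following sketch evaluates the policy-dependent part of the trajectory likelihood for a toy case. It is only a sketch under stated assumptions: the function name `trajectory_likelihood`, the probabilities and the choice of $K = 4$, $T = 10$ and $\rho = 0.2$ are all hypothetical, and the transition terms are omitted because they do not depend on $\theta$.

```python
import math

def trajectory_likelihood(high_probs, low_step_probs):
    """Policy-dependent part of the trajectory likelihood.

    high_probs[k]        : pi_h(z_k | s_0)
    low_step_probs[k][t] : pi_l(a_t | s_t, z_k) along the observed trajectory
    Returns the mixture value and the per-sub-policy terms.
    """
    terms = [p_z * math.prod(steps)
             for p_z, steps in zip(high_probs, low_step_probs)]
    return sum(terms), terms

# Toy setting: sub-policy 0 plays the role of the optimal z*, the others
# assign at most rho to every observed action (illustrative numbers only).
T, rho = 10, 0.2
high = [0.7, 0.1, 0.1, 0.1]
low = [[0.9] * T] + [[0.15] * T for _ in range(3)]

total, terms = trajectory_likelihood(high, low)
share_optimal = terms[0] / total  # fraction of the likelihood explained by z*
```

In this toy case each non-optimal term stays below $\rho^T$, so the likelihood is dominated by the optimal sub-policy, which is the regime assumed in the bound above.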
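For the reward setting in HRL, $R_l(\tau) = \{r_l(s_t, z)\}$ is the reward from the environment, and $r_h(s_0, z^*) = 1$ and $r_h(s_0, z_k) = 0$ at the initial state $s_0$ are defined. As a result, the gradient of the low-level policy with a non-optimal $z_k$ can be eliminated by disregarding any trajectory $\tau$ with $r_h(s_0, z_k) = 0$ stored in the high-level replay buffer $\mathcal{D}_h$. In the implementation, the high-level policy $\pi_{\theta_h}(\cdot)$ is trained by the policy gradient (PG) and the low-level policy is trained by the proximal policy optimization (PPO) [15] with the clipped surrogate objective, which uses an individual buffer $\mathcal{D}_l$ and the previous parameter $\theta_l^{old}$ and is given by

$$\mathcal{L}^{clip}(\theta_l) = \mathbb{E}_{(s,a,z) \sim \mathcal{D}_l} \big[ \mathcal{L}(s, a, z, \theta_l^{old}, \theta_l) \big] \tag{9}$$

$$\mathcal{L}(s, a, z, \theta_l^{old}, \theta_l) = \min \Big( r(\theta_l) A^{\theta_l^{old}}(s, a, z),\ \mathrm{clip}\big(r(\theta_l), 1-\epsilon, 1+\epsilon\big) A^{\theta_l^{old}}(s, a, z) \Big). \tag{10}$$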
Here, the ratio $r(\theta_l) = \frac{\pi_{\theta_l}(a|s,z)}{\pi_{\theta_l^{old}}(a|s,z)}$ between the current policy $\pi_{\theta_l}$ and the old policy $\pi_{\theta_l^{old}}$ in the low-level policy is calculated with the output of the high-level policy $z$, and the advantage function using the old policy $\pi_{\theta_l^{old}}$ is estimated by [47]

$$A^{\theta_l^{old}}(s_t, a_t, z) = \sum_{i=0}^{T-t-1} (\gamma \lambda)^i\, \delta_{t+i} \tag{11}$$

where $\delta_t = r_l(s_t, z) + \gamma V_{\phi^-}(s_{t+1}, z) - V_{\phi}(s_t, z)$ is seen as the temporal difference (TD) error [48] of the value function $V$ with the updated and the current critic parameters $\phi$ and $\phi^-$ in a learning epoch, respectively. $\gamma$ is the discount factor, and $\lambda$ is a factor that adjusts the bias-variance trade-off in model construction. The learning objective is set by choosing either the weighted advantage function $r(\theta_l) A^{\theta_l^{old}}(s, a, z)$ or the function $\mathrm{clip}(r(\theta_l), 1-\epsilon, 1+\epsilon) A^{\theta_l^{old}}(s, a, z)$ with a clipping threshold $\epsilon$, whichever is smaller. This clipped surrogate function $\mathcal{L}^{clip}(\theta_l)$ is maximized to estimate the policy parameter $\theta_l$.
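The advantage estimate and the clipped objective described above can be sketched as follows. This is a generic, scalar illustration of generalized advantage estimation and the PPO clipping rule rather than the authors' code; the function names are hypothetical, and the default values of $\gamma$, $\lambda$ and $\epsilon$ follow the hyperparameters reported in the experimental settings.

```python
def gae_advantages(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one finished dialogue.

    rewards[t]     : r_l(s_t, z)
    values[t]      : critic estimate V(s_t, z)
    next_values[t] : critic estimate V(s_{t+1}, z), 0.0 at the terminal step
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]  # TD error
        running = delta + gamma * lam * running                  # discounted sum of TD errors
        advantages[t] = running
    return advantages

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For instance, with a probability ratio of 1.5 and a positive advantage of 2.0, the clipped term (1.2 × 2.0) is the smaller one, so the update cannot profit from moving the policy further than the clipping range allows.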
In addition, the PPO critic parameter $\phi$ is updated by minimizing the regression error between the predicted value function $V_\phi(s_t, z)$ and the target value function $y_t = r_l(s_t, z) + \gamma V_{\phi^{old}}(s_{t+1}, z)$, where the HRL state $(s_{t+1}, z)$ and the reward $r_l(s_t, z)$ are sampled from the low-level replay buffer $\mathcal{D}_l$. Therefore, the regression loss of the PPO critic network is yielded as a TD error of the value function

$$\mathcal{L}(\phi) = \mathbb{E}_{(s_t, z) \sim \mathcal{D}_l} \big[ \big( y_t - V_\phi(s_t, z) \big)^2 \big]. \tag{12}$$

In order to boost the training of the low-level policy $\theta_l$, the optimal weights of the policy network in BCAux $\{\psi_{fea}^*, \psi_{sa}^*\}$, as seen in Fig. 4, are used as the pre-trained weights for $\theta_l$. From the empirical investigation, the best BC agent could not be determined by its validation loss during training. Instead, the model selection could be performed according to the task success rate in the policy evaluation stage, which reflects the real implementation performance.

C. Guidance Learning With Human-in-The-Loop
Human-in-the-loop (HITL) is mainly introduced to provide guidance to the agent during training, because the agent is prone to be trapped in confounded states. Instead of using a real human, the rule-based agent $\pi_{rule}$ provided by the ConvLab-2 framework is set to act as a human during the learning process. The rule-based agent is a handcrafted agent designed by humans that serves as the upper bound in dialogue policy optimization. The guidance in HITL is employed in the designed protocol program, which is then performed in the dialogue environment as shown in Fig. 5. In general, the guidance or feedback from the human can take the form of action pruning, reward shaping or state manipulation. Unfortunately, the last two forms of feedback are hard to design in the multi-domain dialogue task due to its complexity and the need for manual tuning. Therefore, in this work, the guidance from the human, which is governed by a protocol program, is employed in two different ways based on the action pruning and evaluation scenario. The first way is to identify a confounded state, which is reflected by three consecutive repetitions of the state representation as illustrated in Table I; the human must then provide a corrective action that is executed in the environment through the protocol program. Next, the protocol program removes any trajectory that involves either a confounded state or a failed trajectory generated by the agent due to an incorrect action of the high-level policy. For the second feedback, the human must evaluate the low-level policy action in every step by assuming that the agent may choose a sub-optimal action in certain time steps. Therefore, instead of only maximizing $\mathcal{L}(s, a, z, \theta_l^{old}, \theta_l)$, the low-level policy in PPO explores the environment by further maximizing the negative cross-entropy for the evaluations of the selected actions between human $\tilde{a}$ and system $a$ stored in the low-level replay buffer $\mathcal{D}_l$

$$\mathcal{L}_{CE}(\theta_l) = \mathbb{E}_{(s, a, \tilde{a}, z) \sim \mathcal{D}_l} \big[ \pi_{rule}(\tilde{a}|s) \log \pi_{\theta_l}(a|s, z) \big] = -\mathbb{E}_{\mathcal{D}_l} \big[ H(\pi_{rule}(\tilde{a}|s)) \big] - D_{KL}\big( \pi_{rule}(\tilde{a}|s) \,\|\, \pi_{\theta_l}(a|s, z) \big) \tag{13}$$

which is expressed by the entropy $H(\cdot)$ and the Kullback-Leibler divergence $D_{KL}$.

Fig. 5. Hierarchical RL with human-in-the-loop process for dialogue policy improvement. $\tilde{a}$ denotes the action from the human or $\pi_{rule}$ as a guidance. Meanwhile, $a$ denotes the action from $\pi_{\theta_l}$. $s'$ denotes the state returned by the environment after the executed action, which is either $\tilde{a}$ or $a$.
By maximizing $\mathcal{L}_{CE}$, the action distribution of the low-level policy becomes close to the human action distribution, which eventually results in near-optimal performance. Maximizing the negative cross-entropy for guidance learning is more informative than maximizing the policy entropy in standard RL. This is because simply maximizing the policy entropy may not assure the performance of the agent, since out-of-domain responses are likely to be produced due to the heterogeneous state and action spaces in multi-domain dialogues. Combining all together, the high-level policy is learned according to the PG $\nabla_{\theta_h} J(\theta_h) = \mathbb{E}_{(s_0, z) \sim \pi_{\theta_h}} [\nabla_{\theta_h} \log \pi_{\theta_h}(z|s_0)\, r_h(s_0, z)]$, and the low-level policy is estimated by maximizing the regularized PPO objective $J(\theta_l) = \mathbb{E}_{z \sim \pi_{\theta_h},\, (s,a) \sim \pi_{\theta_l^{old}},\, \tilde{a} \sim \pi_{rule}} [\mathcal{L}^{clip}(\theta_l) + \mathcal{L}_{CE}(\theta_l)]$. The overall learning procedure of actor-critic for HRL with HITL is shown in Algorithm 1.
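The two forms of guidance in this subsection can be illustrated with a short sketch. It is not the protocol program used in this work: the function names are hypothetical, states are assumed to be comparable tuples, and the negative cross-entropy is written for a simple categorical action distribution, whereas the dialogue actions here are multi-label vectors.

```python
import math

def is_confounded(state_history, repeats=3):
    """True when the last `repeats` state representations are identical."""
    if len(state_history) < repeats:
        return False
    tail = state_history[-repeats:]
    return all(s == tail[0] for s in tail)

def guided_action(state_history, agent_action, rule_action):
    """Action pruning: the guidance action replaces the agent action
    only when the dialogue is stuck in a confounded state."""
    return rule_action if is_confounded(state_history) else agent_action

def negative_cross_entropy(guide_probs, policy_probs, eps=1e-12):
    """sum_a p_guide(a) * log p_policy(a); largest when the policy matches the guide."""
    return sum(p * math.log(max(q, eps))
               for p, q in zip(guide_probs, policy_probs) if p > 0.0)

# Illustrative usage with toy state tuples and action names.
history = [(1, 0, 0), (1, 0, 0), (1, 0, 0)]
chosen = guided_action(history, agent_action="request-area", rule_action="inform-phone")
```

Here the repeated state triggers the corrective action, while the cross-entropy term provides the per-turn evaluation signal that pulls the policy toward the guidance distribution.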

IV. EXPERIMENTS
The experiments were conducted with the ConvLab-2 framework [22], which provided the interaction between the simulated user and the dialogue agent in an environment based on the MultiWOZ 2.1 dataset [21]. MultiWOZ 2.1 was an updated version of MultiWOZ 2.0 [49], known as a multi-domain, multi-intent task-oriented dialog corpus [50] that contained 7 domains (hotel, attraction, restaurant, train, taxi, police and hospital), 13 user intents, 25 slot types, 10,483 dialog sessions, and 71,544 dialog turns. By using ConvLab-2, the end-to-end system evaluation was performed to reflect the real-world scenario convincingly. For the reward setting, the dialogue agent received −1 in every dialogue turn, +5 if the current domain was satisfied, and +40 if the task succeeded.
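The reward setting above can be written as a small function. This is a sketch under explicit assumptions: the bonuses are assumed to be added on top of the per-turn penalty in the turn where the event happens, and the names `turn_reward` and `dialogue_return` are hypothetical rather than part of ConvLab-2.

```python
def turn_reward(domain_satisfied, task_succeeded):
    """Reward for one system turn under the stated scheme (assumed additive)."""
    reward = -1              # penalty for every turn
    if domain_satisfied:
        reward += 5          # the current domain was satisfied
    if task_succeeded:
        reward += 40         # the whole user goal succeeded
    return reward

def dialogue_return(turns, gamma=1.0):
    """Discounted sum of turn rewards for (domain_satisfied, task_succeeded) pairs."""
    return sum(gamma ** t * turn_reward(d, s) for t, (d, s) in enumerate(turns))

# Three-turn toy dialogue: nothing happens, one domain is satisfied, task succeeds.
example = dialogue_return([(False, False), (True, False), (False, True)])
```

Under these assumptions the toy dialogue collects −1 + 4 + 39 = 42 without discounting, which shows how shorter successful dialogues obtain a larger return.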

A. Experimental Settings
In the first stage of optimization, which involved BCAux training, the MultiWOZ 2.1 dataset was split into training, validation and test data with 8434, 999 and 1000 dialogues, respectively. The policy network consisting of $f_{\psi_{fea}}$ and $f_{\psi_{sa}}$ was a feedforward network with two hidden layers, while the auxiliary network containing $f_{\psi_{bs}}$ and $f_{\psi_{ua}}$ was a feedforward network with one hidden layer. The activation functions in all hidden layers were ReLU, and the output layer was sigmoid. For the hierarchical RL architecture, $\pi_{\theta_h}$ was formed by a feedforward network with two hidden layers with ReLU activation and a softmax output layer. The number of domains $K$ was chosen as 5 instead of 7 because the taxi, hospital and police domains were merged together as one base domain due to their limited samples. The interpolation parameters for the three terms in $\mathcal{L}_{BCAux}$ were 1, 0.8 and 0.6. Meanwhile, considering the model size, each sub-policy in the low-level policy $\pi_{\theta_l}$ and the critic network $V_\phi$ were identical to the policy network in BCAux. The only difference was the output activation, where the critic network used a linear activation function. During HRL training, the agent collected roughly 2048 dialogue utterances that were divided into 32 batches for updating the parameters. The actor network was optimized by using Adam with an initial learning rate of 1e-4. The hyperparameters $\lambda$, $\epsilon$ and $\gamma$ were set as 0.95, 0.2 and 0.99, respectively. The critic network was optimized by RMSprop [51] with an initial learning rate of 5e-5.
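The choice of $K = 5$ sub-policies can be made concrete with a small lookup that maps the base domain of a dialogue to a sub-policy index. The ordering of the indices and the helper names are hypothetical; only the merging of taxi, police and hospital into one base domain follows the setting described above.

```python
# Hypothetical index assignment; only the grouping reflects the described setting.
SUB_POLICY_INDEX = {
    "hotel": 0,
    "attraction": 1,
    "restaurant": 2,
    "train": 3,
    # taxi, police and hospital share one sub-policy because of limited samples
    "taxi": 4,
    "police": 4,
    "hospital": 4,
}

def base_domain(domains_in_order):
    """The base domain is the first domain that occurs in the dialogue."""
    return domains_in_order[0]

def target_sub_policy(domains_in_order):
    """Index of the sub-policy that the high-level policy should activate."""
    return SUB_POLICY_INDEX[base_domain(domains_in_order)]

K = len(set(SUB_POLICY_INDEX.values()))  # number of sub-policies
```

A dialogue that starts in the restaurant domain and later books a taxi would therefore be handled by the restaurant sub-policy for the whole session.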
The experimental results were investigated by using the end-to-end system evaluation involving the simulated user, where the NLU, DST and NLG in a pipeline system were identical to the ConvLab-2 default settings, namely BERT [52] based NLU, rule-based DST and template NLG, respectively. The same configuration was also applied to the system agent, which used hierarchical RL as its dialogue policy. With this configuration, a strong policy should be learned to compensate for the imperfect state representation caused by the NLU's inability to provide faultless user dialogue acts over the whole dialogue flow. The following six main metrics were set for providing the comparative study.
− success rate: judges whether the user goals, namely the constraints (e.g., hotel location or hotel price) and the requests (e.g., hotel phone number), have been satisfied by the system
− F1 score: judges whether all requested information, such as taxi type or taxi phone number, has been informed
− complete rate: ratio of the completed user constraints
− booking rate: calculates the proportion of the successful dialogues for booking a hotel, restaurant or train
− average turn: calculates the average number of turns needed to handle user goals, for successful dialogues and for all dialogues
− computation time: measures the computation in seconds required to complete 1000 test dialogues
The proposed method was compared with two types of baseline methods. The first type consisted of methods which only optimized the dialogue policy, as follows.
− maximum likelihood estimation (MLE): a standard BC that learns to choose an action given a certain state using the supervised learning method
− policy gradient (PG): a standard policy-based method in RL where the gradient of the objective for cumulative reward is calculated to estimate π_θ(·)
− proximal policy optimization (PPO) [15]: an actor-critic implementation [53] that maximizes the clipped surrogate objective, (10), to train the actor and minimizes the regression error, (12), to train the critic
− guided dialogue policy learning (GDPL) [30]: a method based on adversarial inverse RL [29] which learns the reward by using the expert trajectories and uses it to train the dialogue policy agent sequentially in the same loop
The second type of baselines conducted the optimization in an end-to-end manner, where all components from NLU to NLG were optimized jointly.
− domain aware multi-decoder (DAMD) [1]: a multi-action data augmentation scheme to produce diverse responses by using additional state-action pairs
− minimalist transfer learning (MinTL) [37]: a transfer learning framework offering a plug-and-play approach for task-oriented dialogue systems
− UBAR [38]: a task-oriented dialogue system modeled at the dialogue session level using the distilGPT-2 model [54]. The model was fed not only with the user and response sentences, but also with the database search result and belief state from the previous steps.
− three offline RL methods including GPT-critic [39], critic regularized regression (CRR) [55] and decision transformer [56]. All of them were trained by using GPT-2.
− the first winner of DSTC9 track 2 [40]: performed five critical processes. The first was the domain adaptation using the pre-trained GPT-2, where the datasets including Schema [57], Camrest [58], Taskmaster 2019, Taskmaster 2020 [59] and MSR-e2e were used. The other four processes were multi-task fine-tuning using MultiWOZ 2.1, data pre-processing and post-processing, a fault tolerance mechanism, and rule-based post-processing for refining the agent utterances.
− the second winner of DSTC9 track 2 [41]: conducted a similar implementation to the first winner with two distinctions. First, there was no post-processing approach in this work. Second, auxiliary tasks were employed to increase the consistency in sentence generation given identical system action responses.
[Table caption fragment: "... indicate the improvement and degradation of the proposed dialogue policy that uses behavior cloning with auxiliary tasks (BCAux) as the pre-trained weights compared to the policy with original model implementation indicated by *, respectively. The dialogue test set contains 1, 2 ..."]
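To make the evaluation metrics listed above concrete, the following minimal sketch computes the success rate, complete rate, F1 score and average turn from per-dialogue records. The `DialogueResult` layout and the function names are hypothetical and are not part of ConvLab-2; the F1 score is assumed here to be micro-averaged over requested slots, and the booking rate and computation time are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class DialogueResult:
    """Outcome of one simulated dialogue (hypothetical record layout)."""
    success: bool        # constraints and requests were all satisfied
    complete: bool       # user constraints were completed
    turns: int           # number of turns in the dialogue
    requested: Set[str]  # slots the user asked for
    informed: Set[str]   # slots the system actually provided


def inform_f1(results: List[DialogueResult]) -> float:
    """Micro-averaged F1 between requested and informed slots (assumption)."""
    tp = sum(len(r.requested & r.informed) for r in results)
    fp = sum(len(r.informed - r.requested) for r in results)
    fn = sum(len(r.requested - r.informed) for r in results)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(results: List[DialogueResult]) -> Dict[str, float]:
    """Aggregate the per-dialogue records into corpus-level metrics."""
    n = len(results)
    successful = [r for r in results if r.success]
    return {
        "success_rate": len(successful) / n,
        "complete_rate": sum(r.complete for r in results) / n,
        "f1": inform_f1(results),
        "avg_turn_all": sum(r.turns for r in results) / n,
        # Average turn over successful dialogues only; NaN if there are none.
        "avg_turn_success": (
            sum(r.turns for r in successful) / len(successful)
            if successful else float("nan")
        ),
    }
```

Reporting the average turn separately for successful and for all dialogues, as the metric definition does, keeps short but failed dialogues from making a policy look more efficient than it is.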

B. End-to-End System Evaluation
A number of experiments were carried out to evaluate the performance of the proposed methods for the multi-domain dialogue task over 1000 test dialogues, of which 337, 523 and 140 dialogues contained one, two and three domains, respectively. Fig. 6 depicts the discourse percentage of individual domains during evaluation. There were seven imbalanced domains, and the domain structure was complicated by the high-dimensional semantic slots and values for different acts in high-proportion domains such as hotel and restaurant, which made the task difficult for the agent. All tests were carried out on a PC with [...]. First, the benefit of auxiliary tasks in BC was examined, as shown in Table II. The original model results were obtained by using the weights provided by the ConvLab-2 framework. The results show that introducing auxiliary tasks into BC optimization led to a significant performance improvement. In particular, compared to the original BC, the models that utilized BCAux weights gained a 3% absolute improvement in success rate and more than 4.4% in F1 score and complete rate. Furthermore, for PPO optimization, using BCAux as the pre-trained weights dramatically advanced the performance, as indicated by the large gap to the original PPO implementation. The only drawbacks of using BCAux as the pre-trained weights were observed for PG and GDPL, where the booking rate decreased. To show further advantages of BCAux, Table III compares the original model implementation and the model with BCAux on the dialogue test sets containing two and three domains. More domains imply a more challenging task and degraded performance. The NLU, DST and NLG components were set identically to Table II. The models using BCAux weights handled multi-domain dialogues much better than the original models marked by *.
A more convincing improvement is obtained in the dialogue test with three domains, which is a very challenging task.
[Table IV. Comparison of different configurations in end-to-end system evaluation for the multi-domain dialogue task. The best results except the rule policy are shown in bold. The proposed hierarchical RL with guidance (HRLG), shown in the last two rows, obtains competing performance with low computation cost compared to the end-to-end optimization approaches. * means that the models were tested by using the provided weights. ‡ means that the results were taken ...]
[...] HRLG where t-SNE [60] is used. These latent codes are successfully diverse over five domains. Therefore, the proposed learning strategy may simplify the multi-domain dialogue task into individual base-domain tasks, which allows it to outperform the other dialogue policy optimization strategies by very significant margins. Furthermore, due to this task simplification, the feedback from both the human and the environment, obtained by interacting with the simulated user, is learned efficiently. The results of the proposed hierarchical RL with guidance (HRLG) compared to the baseline methods are shown in Table IV. All of the dialogue policy optimization methods which are not built on PPO perform very poorly, with very low scores in the corresponding metrics. These methods obtained task success and completion rates of less than 50% and booking rates of less than 30%. These empirical results demonstrate the difficulty of establishing a multi-domain task-oriented dialogue policy with good performance, owing to the huge state and action spaces. On the other hand, the dialogue system configurations with a PPO-based policy showed competitive results with reasonable performance.
In particular, when HRLG was utilized, this work attained a very convincing result by performing very close to the rule policy, which serves as the human that provides guidance to the agent during training. Moreover, considering all dialogues in the test set, either successful or unsuccessful, the proposed HRLG required only 13.1 turns on average, which is 6 turns fewer than the average of the dialogue policy baselines. As a result, HRLG completed the entire dialogue test in less computation time than the dialogue policy baselines. Compared to the majority of the end-to-end optimization methods utilizing the GPT-2 model, the proposed learning strategy dominates in all metrics while taking significantly less computation time to complete the test. Furthermore, compared to the first winner of DSTC9 track 2 [40], the proposed method shows very competitive results at a considerably reduced computation cost, about 8 times cheaper, with fewer average dialogue turns for satisfying user goals. The reduction of computation cost can be explained by comparing the computation time required by each model to finish the test and the average number of turns conducted by each model during the test stage.
There are two main reasons why the end-to-end approaches required a high computation cost. The first is the pre- and post-processing steps in every sentence generation turn. The second is that, since they used one model to represent the whole system, in every turn they first needed to predict the user dialogue act and belief state using a large pre-trained model, for example the GPT-2 model. Next, the predicted belief state and the dialogue history were fed to the GPT-2 model for generating the response. In other words, inference using the large model had to be done at least twice before producing the system response. This sequential process required a very long time to complete. Meanwhile, in the pipeline system the large BERT model was used only once per turn, namely in the NLU part for predicting the user's dialogue act.
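The second reason above can be illustrated by counting large-model inferences over a test run. The sketch below is only a back-of-the-envelope model: the function names are hypothetical, the end-to-end system is assumed to make exactly two large-model calls per turn (the text states "at least twice"), and the pipeline system one. Call counts alone do not capture model size or the pre- and post-processing overhead, so the ratio is not a prediction of measured computation time.

```python
E2E_CALLS_PER_TURN = 2       # belief state / dialogue act prediction, then response generation
PIPELINE_CALLS_PER_TURN = 1  # BERT-based NLU only


def large_model_calls(avg_turns: float, calls_per_turn: int,
                      num_dialogues: int = 1000) -> float:
    """Expected number of large-model inferences over a whole test run."""
    return avg_turns * calls_per_turn * num_dialogues


def call_ratio(e2e_avg_turns: float, pipeline_avg_turns: float) -> float:
    """How many times more large-model calls the end-to-end system makes."""
    e2e = large_model_calls(e2e_avg_turns, E2E_CALLS_PER_TURN)
    pipeline = large_model_calls(pipeline_avg_turns, PIPELINE_CALLS_PER_TURN)
    return e2e / pipeline
```

Under these assumptions, two systems with the same average number of turns differ by a factor of two in large-model calls, and any additional turns taken by the end-to-end system widen the gap further.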
An interesting finding is depicted by the last row of the results table: the performance increases significantly when the dialogue acts from the user sentences can be predicted perfectly. This is because the information in DST, which is transformed into the state for the dialogue policy, then contains the true dialogue act from the current turn, which in turn helps the dialogue policy choose an appropriate decision. This evidence clearly indicates that improving the NLU part is urgently required to further improve the performance of the pipeline dialogue system.
In addition, the last column of Table IV shows the number of parameters of the policy models in the various methods. The baseline models had fewer parameters than the proposed method but required more computation time, because they needed more turns than the proposed method to complete each dialogue in the test set. The model size of the pipeline method is significantly reduced relative to that of the end-to-end approaches.
[Table VIII. Examples of success and failure of a multi-domain dialogue containing three goals where the taxi domain occurred at the last turn. The green colored text is a correct system response, indicated by successfully providing the taxi phone number. Otherwise, the red colored text is an incorrect system response. The proposed HRLG is compared to the PPO with action pruning.]

C. Ablation Study on Efficient Learning and Guidance
An ablation study is conducted by evaluating the individual components of the proposed method in order to show their impact on learning efficiency. All of the learned dialogue policies are trained by using PPO due to its dominant performance compared to the other policy optimization methods. Six configurations are built by investigating three components in HRLG: the hierarchy in HRL, the action pruning for confounded states and the additional objective L_CE for the low-level policy. The last two components are involved in HITL as guidance for the agent during training. First, the importance of the hierarchical strategy is evaluated without any guidance, as shown in the first two rows of Table V. This strategy benefits the agent during training and improves the performance in the majority of metrics. This result indicates that the hierarchical method successfully simplifies the multi-domain dialogue task into several dialogue tasks based on the base domain occurrence, so that efficient learning can be achieved. Next, various combinations of the hierarchical strategy with guidance learning are assessed. There are two points in this analysis and comparison. The first point is to examine whether the guidance is efficiently learned by the hierarchy in RL. The advantage can be observed in the third and fourth rows of Table V. Guidance that consists solely of action pruning is learned effectively by employing the hierarchical method, as evidenced by the improved metric scores. By receiving only action pruning as guidance from the human during training, the dialogue policy with hierarchy shows competitive performance compared to the previous work [42], which applied a non-hierarchical dialogue policy. The second point examines the joint advantage of action pruning and the objective L_CE. The results are shown in the last two rows.
Efficient learning from guidance was successfully achieved with the combination of the three components, as indicated by the dominant improvement over the other configurations, especially in the success rate, complete rate and booking rate, which are important metrics of the model's capability in handling user goals. Even though this setting resulted in lower precision, recall and F1 score than the setting which applied only the two guidance components, the differences are not significant.
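The two guidance components discussed above can be sketched in a few lines. This is a simplified illustration rather than the implementation used in this work: it assumes a single categorical action per turn, represents action pruning as a boolean mask over the action logits, and adds the cross-entropy term toward the guide's action to a given PPO loss with a hypothetical weight `ce_weight`. All function names are illustrative.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], allowed: Sequence[bool]) -> List[float]:
    """Softmax over actions in which pruned actions receive probability 0."""
    masked = [l if a else -math.inf for l, a in zip(logits, allowed)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("at least one action must remain after pruning")
    # exp(-inf) evaluates to 0.0, so pruned actions drop out of the sum.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], guide_action: int) -> float:
    """Negative log-likelihood of the action suggested by the guide."""
    return -math.log(max(probs[guide_action], 1e-12))


def guided_loss(ppo_loss: float, probs: Sequence[float],
                guide_action: int, ce_weight: float = 1.0) -> float:
    """PPO loss plus a weighted cross-entropy term (assumed combination)."""
    return ppo_loss + ce_weight * cross_entropy(probs, guide_action)
```

Pruning restricts exploration to actions the guide considers valid, while the cross-entropy term pulls the remaining probability mass toward the guide's choice; the ablation configurations correspond to switching these two terms on and off.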
To further demonstrate the different learning strategies and dialogue properties in HRLG, Table VI reports the test results for dialogues containing one, two and three domains, shown at the top, middle and bottom, respectively, with a different number of dialogues in each test. In the first test, which involves the 337 dialogues with one domain from the test set, very convincing performance is exhibited by all of the configurations, which achieve a success rate of more than 94%, a complete rate of more than 95%, and very high booking rates and F1 scores with very few turns. The different configurations perform comparably. The benefit of the guidance provided by the human becomes clear in the test with dialogues containing two domains, which involves 523 dialogues and takes the highest proportion of the test set. Here there is a clear gap, especially in terms of success rate and booking rate, of more than 2% absolute improvement from the first two rows to the last three rows of the configurations. Unfortunately, the PPO implementation in HRLG could not take advantage of action pruning in the third configuration, as the scores obtained in all metrics are the same as in the first two configurations. Meanwhile, in the last test, which involved the 140 dialogues from the test set containing three domains and is the most challenging task, the advantage of the hierarchical approach is shown convincingly, especially in terms of the success rate and complete rate, where the absolute improvement reaches more than 9%. The benefit of guidance learning using action pruning and L_CE is obvious. One weakness in this test is that the booking rate tends to be reduced by the hierarchical strategy, even though all other scores are very convincing.
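The stratified comparison above reports absolute improvements, i.e., differences in percentage points between configurations within each group of dialogues. A minimal sketch of this bookkeeping is given below; the record layout of (number of domains, success flag) pairs and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_domain_count(
    records: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """Success rate per group, keyed by the number of domains in a dialogue."""
    totals: Dict[int, int] = defaultdict(int)
    wins: Dict[int, int] = defaultdict(int)
    for num_domains, success in records:
        totals[num_domains] += 1
        wins[num_domains] += int(success)
    return {n: wins[n] / totals[n] for n in sorted(totals)}


def absolute_improvement(baseline_rate: float, candidate_rate: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return 100.0 * (candidate_rate - baseline_rate)
```

Grouping by the number of domains prevents the large two-domain group from masking differences that only appear in the smaller and harder three-domain group.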
Table VII reports the success rate and F1 score of the different configurations for each domain of the dialogue test set. This breakdown identifies the domains most affected by the proposed hierarchical setting and guidance learning. All of the dialogue policy configurations in the police and hospital domains earn a perfect success rate and a nearly perfect F1 score. In addition, the proposed method significantly benefits the performance for the 140 dialogues with three domains, where the taxi domain occurs last in the order of domains. Even without human assistance, introducing the hierarchical strategy improves the success rate and F1 score of the taxi domain by nearly 10%. The improvement increases further if the human guidance using action pruning and the objective L_CE is included during dialogue policy optimization, as both the success rate and F1 reach nearly perfect scores. This evidence indicates that the proposed learning strategy could properly identify the state information from DST, conditioned on the previous dialogue turns, resulting in the appropriate response. An empirical example of a dialogue is shown in Table VIII, where the proposed system successfully generates a response in the taxi domain after addressing the previous turns involving two different domains. The previous method [42] could not provide the correct information because it was unable to extract the information from the complicated state built from three different domains. Meanwhile, none of the configurations, including those which utilize the guidance, performed well in the hotel domain, which is the most challenging domain. It has 47 and 21 different possibilities of system and user dialogue acts, respectively. Even the rule policy could only achieve a success rate of around 81%. In general, the proposed method performs well in most domains for multi-domain dialogue management.
Source codes and model parameters are provided and can be accessed on https://github.com/NYCU-MLLab/.

V. CONCLUSION
A novel strategy that efficiently learns from the guidance of both the human and the environment to establish a high-performance dialogue policy with low computational cost has been proposed in this work. The strategy started with imitation learning with auxiliary tasks from the dialogue dataset to provide good pre-trained weights for the subsequent stage, which applied hierarchical reinforcement learning with human-in-the-loop (HITL). The end-to-end system evaluation indicated that the proposed learning technique outperformed most of the previous approaches and performed nearly identically to the rule-based agent that served as the human in HITL. Compared to state-of-the-art approaches that employed end-to-end optimization with the large-sized language model GPT-2 as the core model, the proposed method required a much lower computation cost while achieving competitive performance. Furthermore, based on the ablation study, the hierarchical strategy enabled the dialogue policy agent to learn feedback from the human and the environment effectively, as the problem formulation in the multi-domain dialogue task was simplified by the hierarchy. Future research will add optimization of the NLU component in order to provide an accurate state representation to the dialogue policy, allowing the dialogue pipeline system to pick appropriate actions to satisfy user goals.