Policy Reuse for Dialog Management Using Action-Relation Probability

We study the problem of policy adaptation for reinforcement-learning-based dialog management. Policy adaptation is a commonly used technique that alleviates data sparsity when training a goal-oriented dialog system for a new task (the target task) by reusing knowledge from policies learned in an existing task. Current dialog policy adaptation methods require considerable time and effort because they use reinforcement learning algorithms to train a new policy for the target task from scratch. In this paper, we show that a dialog policy can be learned without reinforcement learning training in the target task. In contrast to existing works, our proposed method learns the relation, in the form of a probability distribution, between the action sets of the source and the target tasks. Thus, we can immediately derive a policy for the target task, which significantly reduces the adaptation time. Our experiments show that the proposed method learns a new policy for the target task much more quickly. In addition, the learned policy achieves higher performance than policies created by fine-tuning when the amount of available data on the target task is limited.


I. INTRODUCTION
Reinforcement learning (RL) is a widely used framework for modeling the decision-making process in tasks such as the dialog management of conversational systems [1]-[3]. In general, training a policy with reinforcement learning helps the system learn more robust behavior, and its performance is not upper-bounded by the human performance in the training dialog samples, which is sometimes suboptimal. Nonetheless, the training process of reinforcement learning is usually tedious and needs a large number of samples to train an optimal policy. Policy adaptation, or policy transfer, is a useful technique that can tackle this problem in reinforcement learning.
Policy adaptation refers to the process of reusing knowledge, i.e., a policy learned in one or multiple source tasks, in a new target task. Studies of policy adaptation in RL have proposed various techniques that show such promising results as accelerated convergence and reduced data requirements [4], [5].
Within the scope of reinforcement learning-based dialog management, the application of policy adaptation remains very limited. Current works in dialog policy adaptation [6]- [8] follow the weight initialization strategy, which contains two steps: pre-training and fine-tuning. Pre-training refers to the process that trains a policy in the source task, where the policy is usually represented by a neural network. Some of the source policy's weight parameters are used to initialize the neural network's weights to train a policy in the target task. Next, we fine-tune the network weights by training with a reinforcement learning algorithm.
However, when we do not have enough data on the target task, this strategy does not work well because it barely uses the knowledge from the source task's policy, forcing the target task's policy to be trained essentially from scratch. Consider a situation with a dialog policy that handles restaurant reservations, which we want to adapt to manage the task of booking hotel rooms.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Obviously, we expect that this adaptation can be performed with minor adjustments to the restaurant booking task's policy. Motivated by this observation, we propose a novel method to adapt dialog policies called the Dialog Policy Reuse Algorithm (DPRA). Unlike previous works, our proposed method learns the relation between the action sets of the source and the target tasks using a mixture density network [9], and then we can quickly infer the policy for the target task without RL training. In other words, DPRA allows us to ''reuse'' the source task policy for action decision-making in the target task. The following are the primary contributions of our work:
• We propose a novel method for the policy adaptation of reinforcement-learning-based dialog management. In particular, the source task's policy can be used for the action selection procedure in the target task through a special mapping called action-relation probability. We propose a mixture density network to model this distribution.
• Since our proposed method can learn a policy for the target task without RL training, it reduces the effort to construct dialog policies in the target task.
• Experimental results show that, when only a small amount of data is available for training, our proposed method achieves higher performance than the baseline adaptation method, which is based on fine-tuning, and also requires much less time for training.
The remainder of this paper is organized as follows. Section II provides background knowledge in reinforcement learning and dialog management. In Section III, we explain our proposed method, the DPRA algorithm. Section IV describes our experiment setting, its results, and analyses. Section V explains current works in dialog policy adaptation and their drawbacks. Finally, we summarize our work and discuss future directions in Section VI.

II. REINFORCEMENT LEARNING AND DIALOG MANAGEMENT
A. REINFORCEMENT LEARNING
Reinforcement learning is a popular framework for learning autonomous behavior. We consider the standard reinforcement learning setting, where an agent interacts with an environment that follows a Markov decision process (MDP) over a number of discrete time steps [10]. At each time step $t$, the state, action, and reward are respectively denoted by $s_t \in S$, $a_t \in A$, and $r_t \in R$. The dynamics of the task (the environment) are characterized by two random variables: the state transition probabilities $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$ and the expected reward $R^a_s = E[r_{t+1} \mid s_t = s, a_t = a] = \sum_{r_{t+1}} r_{t+1} P(r_{t+1} \mid s_t = s, a_t = a)$. The agent's procedure for selecting action $a$ given state $s$ is the agent's policy, denoted by $\pi(s, a) = P(a \mid s)$. We define the return, the total reward received by the agent, as $R_t = \sum_{k=1}^{T-t} r_{t+k}$, where $T$ is the final time step. The agent's objective is to maximize the expected return $E[R_t \mid s_t, \pi]$ at each time step $t$ when following policy $\pi$. If the agent-environment interactions never stop, $T$ goes to $\infty$, and we describe the task as continuing. If the interactions eventually end when the agent reaches certain terminal states, the task is called episodic. In this setting, the interactions from the beginning until the agent reaches a terminal state are called an episode. The appropriate setting is chosen depending on the problem we want to solve with reinforcement learning.
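As a concrete illustration of the undiscounted episodic return defined above, the following sketch computes $R_t$ for every time step with a single backward pass over an episode's rewards (a hypothetical helper, not part of the paper's implementation):

```python
def returns(rewards):
    # rewards[t] holds r_{t+1}, the reward received after the action at step t.
    # out[t] is the return R_t = sum_{k=1}^{T-t} r_{t+k} (undiscounted, episodic).
    R, out = 0.0, []
    for r in reversed(rewards):
        R += r
        out.append(R)
    return list(reversed(out))
```

For example, `returns([1.0, 0.0, 2.0])` yields `[3.0, 2.0, 2.0]`: each entry is the total reward the agent still receives from that step onward.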
Given policy $\pi$, the state-action value is defined as $Q^\pi(s, a) = E[\sum_{k=1}^{T-t} r_{t+k} \mid s_t = s, a_t = a, \pi]$, which is the expectation of the return if action $a$ is chosen at time step $t$. Similarly, we define the state value of policy $\pi$: $V^\pi(s) = E[\sum_{k=1}^{T-t} r_{t+k} \mid s_t = s, \pi]$. Note that the above definitions of the state-action value and the state value fall under the undiscounted setting.
Reinforcement learning algorithms can be divided into two classes: value-based and policy-based. In value-based reinforcement learning methods, we estimate the state-action value $Q^\pi(s, a)$ or the state value $V^\pi(s)$ using a function approximator, such as a neural network or a simple value table. Classic Q-learning [11] and the deep Q-network [12] are examples of this class of algorithms. In contrast to value-based methods, policy-based RL algorithms parameterize policy $\pi$ by parameters $\theta$, which we update by performing (typically approximate) gradient ascent to maximize $E[R_t \mid \pi]$. There are various policy-based RL studies, especially with policy gradient methods such as Actor-Critic [13], [14] or REINFORCE [15], which we use in our evaluation.
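As a minimal illustration of the value-based class, here is a sketch of a single tabular Q-learning update step; the function name, data layout, and the choice gamma = 1.0 (matching the undiscounted setting above) are our assumptions for illustration, not the paper's code:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    # One tabular Q-learning step:
    #   Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    # gamma=1.0 corresponds to the undiscounted setting used in this paper.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

Starting from a zero-initialized table, observing reward 1.0 for action 1 in state 0 moves `Q[(0, 1)]` to 0.1 with the default learning rate.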

B. DIALOG MANAGEMENT USING REINFORCEMENT LEARNING
A dialog can be divided into multiple turns, where each turn contains a user utterance and a system response. We can formulate the problem of dialog management as an MDP and apply any RL algorithm to solve it. We can define a set of actions for the system to interact with the user and define a reward function based on the system's goal. Since all dialogs have only a finite number of turns (or time steps), we use the episodic and undiscounted setting as the formulation of the dialog management problem.
At each time step $t$, the required information for the action selection procedure is defined as an observation, denoted by $o_t$. The type of information included in an observation depends entirely on the task being solved. For example, in dialog management, the observation may consist of recognition results of slot information [6], [7], [16], the user's dialog actions [17], [18], or such high-level multimodal information as the user's deception [19]. Recall that the policy is a conditional probability of selecting action $a$ given state $s$, $P(a \mid s)$; thus, the state must include the information critical for making the decision. A natural approach is to represent the state by the vector of the observation or the concatenation of observations from multiple time steps. We call such a state representation an explicit state representation. In modular-based dialog systems [1], [2], [17]-[19], the state is represented by this explicit representation. In dialog management procedures, the system needs to consider the dialog history, which contains the user's utterances and the system's actions from the beginning. With such long-term tracking requirements, explicit representation becomes unsuitable if the observations are high-dimensional or the conversations are lengthy. A common solution is to use a Recurrent Neural Network (RNN) to learn a state embedding, a latent state representation that stores the dialog history and the current observation [16], [20]. Such a state representation is used in end-to-end dialog modeling [20]. Fig. 1 shows an example of this approach. The RNN plays the role of the dialog state tracker (DST), and its output, the latent state representation, is used by the dialog policy module for management. With an end-to-end approach, we are free from designing a complicated explicit representation of the dialog states.
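To make the latent-state idea concrete, the following sketch shows one step of a vanilla recurrent update that folds a new observation into the dialog state. An actual tracker would use an LSTM as in Fig. 1; the weight matrices and the plain-tanh recurrence here are illustrative assumptions:

```python
import math

def rnn_state_update(h_prev, obs, W_h, W_o):
    # One step of a vanilla recurrent update: h_t = tanh(W_h h_{t-1} + W_o o_t).
    # h_t is the latent dialog state, folding the new observation into the
    # accumulated dialog history. (A real DST would use an LSTM; this only
    # illustrates the recurrence.)
    return [math.tanh(sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev))) +
                      sum(W_o[i][k] * obs[k] for k in range(len(obs))))
            for i in range(len(h_prev))]
```

Folding this update over the turns of a dialog yields the latent state that the policy module conditions on.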
RL training requires a huge amount of agent-environment interaction to learn a good policy. In dialog management, it is practically impossible to collect enough dialog samples to fulfill this requirement. Scheffler and Young proposed a user simulator as a replacement for actual human users to train the policy [1]. This approach has become the standard for training dialog policies with RL. The user simulator is built from dialog samples by maximum likelihood or supervised learning methods to imitate the human user's behavior. The user simulator can also be viewed as an approximation of the dialog management task's true state transition $P^a_{ss'}$, which is provided by the actual human user.

C. POLICY ADAPTATION
Humans are capable of learning a task better and faster by transferring the knowledge retained from solving similar tasks. This observation motivates the idea of transferring knowledge across different but related tasks to improve the performance of machine learning (ML) algorithms. The techniques that allow such knowledge transfer are called transfer learning.
The application of transfer learning to RL algorithms began to gain the attention of the machine learning community in the mid-2000s. In reinforcement learning, the number of samples needed to learn an optimal solution is usually prohibitive, especially for dialog management, where data sparsity is a huge challenge [21]. Transfer learning builds prior information from the knowledge collected while solving a set of source tasks and uses it for learning a policy in the target task.
Many types of knowledge can be used in transfer learning for RL, such as samples, representations, or parameters [4]. Policy adaptation, or policy transfer, refers to transfer learning methods that use knowledge of the source task's policy for the transferring process. Policy adaptation methods are thus a subclass of transfer learning solutions in reinforcement learning.
We usually face the problem of data sparsity when training policies for dialog tasks. In this situation, the policy adaptation approach is a promising solution. Multiple studies investigate its application in reinforcement-learning-based dialog management [6]-[8]. All of these methods improve learning speed, reduce data requirements, or improve performance.

D. PROBLEM STATEMENT OF POLICY ADAPTATION
Current policy adaptation studies in reinforcement-learning-based dialog management follow the weight initialization approach [6]-[8]. As explained in Section I, this strategy requires us to train the target policy from scratch with an RL algorithm, which is a tedious process that requires great effort, especially for complex tasks. In addition, the construction of dialog policies usually involves user simulators. When we only have a small amount of data on the target task, the user simulator does not represent the behavior of actual human users very well. Therefore, a trained policy can achieve high performance against the simulator even though it fails to work well against actual human users.
Problem Statement: Given a source task with state space $S_A$ and action set $A$, assume that we have trained policy $\pi(s, a)$ for the source task. The target task's state space and action set are denoted by $S_B$ and $B$. Our goal is to derive a policy $\pi(s, b)$ for the target task from $\pi(s, a)$ without RL training.

III. POLICY REUSE BASED ON ACTION-RELATION PROBABILITY
In this section, we show that a policy can be adapted for the target task without RL training from scratch. Instead of training a policy through interactions with a user simulator, we establish a connection between the policies of the source and target tasks through a special mapping distribution called action-relation probability. Our proposed adaptation method, DPRA, learns this distribution from dialog samples in both tasks and can immediately derive a policy for the target task. Thus, DPRA can remarkably reduce learning time. Since our proposed method does not use a user simulator, we avoid the problem of low performance due to errors in constructing the user simulator.

A. POLICY ADAPTATION BY ACTION-RELATION PROBABILITY
We consider policy adaptation from a source task to a target task. First, we make the following assumptions, which allow the derivation of the relation between the policies of the source and the target tasks:
Assumption 1: The source and the target tasks share an identical state space $S$. This assumption is actually not restrictive: the union $S = S_A \cup S_B$ is a state space that satisfies it.
Assumption 2: The source and the target tasks have identical state representation.
We define observation as the information that is necessary for the agent's action selection in each time step. An example is the features extracted from the user's utterance at each turn. Obviously, in a policy adaptation setting, the source and the target tasks are different and they require distinctive sets of features. Thus, the state representation of the source task is also not identical to the target task's. However, we can define a unified set of features for those tasks and have the same observation and state representation across the source and the target tasks.
We establish a connection between the source and the target task's policies as follows. Denote the source task's policy as $\pi(s, a)$ and the target task's policy as $\pi(s, b)$, where $a \in A$ and $b \in B$ are actions from the source and target tasks' action sets. The target task's policy can then be written as:

$$\pi(s, b) = \sum_{a \in A} P(b \mid a, s)\, \pi(s, a) \qquad (1)$$

Equation (1) states that, given any conditional distribution $P(b \mid a, s)$, we can directly infer a policy $\pi(s, b)$ for the target task from the source task's policy $\pi(s, a)$. We call $P(b \mid a, s)$ the action-relation probability. Our proposed policy adaptation method models this distribution instead of performing RL training.
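A minimal sketch of how equation (1) turns a source policy into a target policy for a fixed state $s$ may help. The dictionary layout and the action names are illustrative assumptions, not the paper's implementation:

```python
def derive_target_policy(pi_src, rel):
    # Equation (1): pi(s, b) = sum_{a in A} P(b | a, s) * pi(s, a), for a fixed s.
    # pi_src maps source action a -> pi(s, a);
    # rel maps source action a -> {target action b: P(b | a, s)}.
    pi_tgt = {}
    for a, p_a in pi_src.items():
        for b, p_b_given_a in rel[a].items():
            pi_tgt[b] = pi_tgt.get(b, 0.0) + p_b_given_a * p_a
    return pi_tgt
```

Because each `rel[a]` is itself a distribution over the target actions, the derived target policy sums to one whenever the source policy does.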

B. ACTION-RELATION MODELING WITH MIXTURE DENSITY NETWORK
This section explains the modeling of the action-relation probability distribution based on a mixture model. First, denote the state and action at time step $t$ as $s = s_t$ and $a = a_t$, and the state at the next time step as $s' = s_{t+1}$. By the law of total probability, the state transition in the source task can be written in terms of the target task's actions:

$$P(s' \mid a, s) = \sum_{b \in B} P(b \mid a, s)\, P(s' \mid a, b, s) \qquad (2)$$

That is, the state transition of the source task has the form of a mixture model with the action-relation probability as the component weights. A mixture density network (MDN) [9] is a suitable approach for modeling the state transition $P(s' \mid a, s)$. In principle, a mixture density network is a type of Gaussian mixture model (GMM) whose mixture parameters are produced by an artificial neural network. Given multivariate random variables $x$ and $y$, an MDN models the conditional probability density $p(y \mid x)$:

$$p(y \mid x) = \sum_{m=1}^{M} w_m(x)\, N(y; \mu_m(x), \sigma^2_m(x)) \qquad (3)$$

where $M$ is the number of components, and $w_m(x)$, $\mu_m(x)$, and $\sigma^2_m(x)$ are the component weight, mean, and variance of component $m$. We assume that these mixture variables are functions of input $x$ that are approximated by neural networks. With some slight abuse of notation, we have:

$$w_m(x) = f^w_m(x; \theta_w) \qquad (4a)$$
$$\mu_m(x) = f^\mu_m(x; \theta_\mu) \qquad (4b)$$
$$\sigma^2_m(x) = f^\sigma_m(x; \theta_\sigma) \qquad (4c)$$

With an MDN, we assume that all the components of multivariate random variable $y$ are mutually independent; thus, the covariance matrix is diagonal and can be represented by a vector with the same dimension as $f^\mu_m(x)$. Denote the dataset as $D = \{(x^{(i)}, y^{(i)})\}, i = 1..N$, where $x^{(i)}, y^{(i)}$ are the observed data for random variables $x$ and $y$. The parameters are optimized using gradient descent with the following negative log-likelihood:

$$L(\theta) = -\sum_{i=1}^{N} \log \Big( \sum_{m=1}^{M} w_m(x^{(i)})\, N(y^{(i)}; \mu_m(x^{(i)}), \sigma^2_m(x^{(i)})) \Big) \qquad (5)$$

By replacing the probabilities with probability density functions, the mixture model in (2) is now given by:

$$p(s' \mid a_i, s) = \sum_{j} P(b_j \mid a_i, s)\, N(s'; \mu_{ij}(s), \sigma^2_{ij}(s)) \qquad (6)$$

An illustration of these mixture models is given in Fig. 2. In principle, the density of the state transition caused by $a_i$ is a mixture model in which each component $p_{ij}(s' \mid s) = p(s' \mid a_i, b_j, s)$ follows a Gaussian distribution $N(s'; \mu_{ij}(s), \sigma^2_{ij}(s))$. For each action $a_i$ in the source task, we can train its corresponding MDN with just the samples $(s', a_i, s)$.
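A small numerical sketch of the MDN's mixture density and negative log-likelihood (5) may help. The network outputs are replaced by a plain callable standing in for the functions of $x$; this is an illustrative toy, not the paper's MDN code:

```python
import math

def diag_gauss(y, mu, var):
    # N(y; mu, diag(var)): product of univariate Gaussians, since the
    # components of y are assumed mutually independent (diagonal covariance).
    return math.prod(math.exp(-(yi - mi) ** 2 / (2 * vi)) / math.sqrt(2 * math.pi * vi)
                     for yi, mi, vi in zip(y, mu, var))

def mdn_nll(data, mixture):
    # Negative log-likelihood of (5). `mixture(x)` stands in for the network
    # outputs: it returns (weights, means, variances) for the M components.
    return -sum(math.log(sum(w * diag_gauss(y, mu, var)
                             for w, mu, var in zip(*mixture(x))))
                for x, y in data)
```

With a single standard-normal component and one sample at the mean, the loss reduces to the Gaussian normalizing constant, $\tfrac{1}{2}\log 2\pi$.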
In particular, $s$ plays the role of the observed variable $x$ and $s'$ plays the role of the variable $y$ in (5), while $a_i$ indicates which MDN the sample $(s', s)$ is used to train. The action-relation probability $P(b_j \mid a_i, s)$ is approximated by the component weight $f^w(x; \theta^w_{ij})$, as shown in (4a). However, in that case, we cannot guarantee that $N(s'; \mu_{ij}(s), \sigma^2_{ij}(s))$ truly models the state transition $p(s' \mid a_i, b_j, s)$, since the source task's samples do not contain any information about $b_j$. A natural solution is to additionally train the components on the samples $(s', b_j, s)$ through a component matching process, which ''matches'' the distribution of component $N(s'; \mu_{ij}(s), \sigma^2_{ij}(s))$ to the transition $p(s' \mid b_j, s)$. In this work, we propose two methods of component matching. In the first, we assume that $p(s' \mid a_i, b_j, s) = p(s' \mid b_j, s)$ for all $a_i \in A, b_j \in B$. With this assumption, we can perform component matching by simply optimizing the networks' parameters using the negative log-likelihood of the target task's samples:

$$L(\theta) = -\sum_{i=1}^{N} \log N(s'^{(i)}; \mu(s^{(i)}), \sigma^2(s^{(i)}))$$
Since the training of this component matching method resembles the training process of a regression model, we call it component matching by regression. Algorithm 1 shows the pseudo code for the training process of the mixture model in (6) using component matching by regression.

Algorithm 1 Component Matching by Regression
for each epoch do
    Perform component matching: update $\theta_\mu$, $\theta_\sigma$ with the gradient of the regression loss on the target task's samples
    Compute gradient $d\theta_{MDN}$ of the mixture loss (5) on the source task's samples
    Update parameters $\theta_w$, $\theta_\mu$, $\theta_\sigma$ with $d\theta_{MDN}$
end for

In principle, the components in the column of $b_j$ in Fig. 2 form a mixture model for the density of the state transition $p(s' \mid b_j, s)$. Similarly, we can train this mixture with a loss function that resembles (5), using the target task's samples $(s', b_j, s)$:

$$L(\theta) = -\sum_{i=1}^{N} \log \Big( \sum_{m=1}^{M} w_m(x^{(i)})\, N(s'^{(i)}; \mu_m(x^{(i)}), \sigma^2_m(x^{(i)})) \Big)$$

where $x^{(i)} = s^{(i)}$.
We call this method component matching by MDN. The pseudo code for action-relation probability modeling using MDN component matching is shown in Algorithm 2. Finally, the procedure of our proposed method, the dialog policy reuse algorithm, is shown in Algorithm 3:

Algorithm 3 Dialog Policy Reuse Algorithm: DPRA
Step 1: Train policy $\pi(s, a)$ for the source task.
Step 2: Model the action-relation probability $P(b \mid s, a)$ using either Algorithm 1 or 2.
Step 3: Create a policy $\pi(s, b)$ for the target task by applying (1) with the action-relation probability learned in Step 2.
We summarize this section by showing that the resultant policy $\pi(s, b)$ of DPRA is proper:

$$\sum_{b \in B} \pi(s, b) = 1 \qquad (10)$$

Intuitively, DPRA works as follows. Given policy $\pi(s, a)$ in the source task, assume that an agent with policy $\pi$ selects action $a$ given current state $s$. DPRA finds the action $b$ in the target task that makes a transition $s \to s'$ similar to that of action $a$: in other words, $P(s' \mid s, a) \approx P(s' \mid s, b)$. Instead of making a deterministic mapping, we learn a distribution that connects $a$ to all the available actions in the target task, which is $P(b \mid a, s)$. This is why we use the term action-relation probability for this special distribution. Provided the source and target tasks are also similar in their reward dynamics, if the learned policy $\pi(s, a)$ in the source task is optimal, then the policy $\pi(s, b)$ learned by DPRA is nearly optimal as well.
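The normalization in (10) follows directly from the definition of the target policy and the fact that the action-relation probability is itself a distribution over $B$:

```latex
\sum_{b \in B} \pi(s, b)
  = \sum_{b \in B} \sum_{a \in A} P(b \mid a, s)\, \pi(s, a)
  = \sum_{a \in A} \pi(s, a) \sum_{b \in B} P(b \mid a, s)
  = \sum_{a \in A} \pi(s, a)
  = 1
```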
An illustration of DPRA is shown in Fig. 3. DPRA requires identical state spaces and state representations for the source and target tasks (Assumptions 1 and 2). Recall that all policy adaptation methods only work in cases where the source and target tasks are similar. In this situation, even if the two tasks' state spaces and state representations are not completely identical, we expect that they still share considerable similarities; thus, the proposed method can still learn a good policy for the target task without the ideal conditions of Assumptions 1 and 2.

IV. EXPERIMENTS

A. SETTING
We experimentally assessed the following hypotheses: 1) Our proposed adaptation algorithm, DPRA, requires much less training time than conventional fine-tuning methods. 2) DPRA learns policies that achieve equivalent or higher performance than those learned with current methods when limited data are available in the target task. For comparison with our proposed method, we used policy adaptation by action embedding [8]. This method uses the same network as our end-to-end dialog model and changes the last layer of the network to connect to the new action space of the target task. This simple model does not require any prior information, such as relations between actions. Since our proposed algorithm does not require such prior information either, we selected this method as the baseline.
In our evaluation, we performed policy adaptation for a multimodal goal-oriented dialog system with an end-to-end approach. We chose a multimodal dialog setting because the available corpora for such conversations are mostly small-scale [21] and thus suitable for assessing our second hypothesis. In particular, we augmented the original end-to-end dialog model (Fig. 1) with a multimodal fusion component that uses the Hierarchical Tensor Fusion Network [22]. This component's role is to efficiently combine features from multiple modalities: linguistic, visual, and acoustic. Since it is fully connected to the dialog state tracker (the LSTM layer in Fig. 1), our dialog model still adheres to the end-to-end paradigm.
We formulated the problem of dialog management with an episodic and undiscounted reward setting and trained the dialog policies using REINFORCE [15], which updates policy parameters $\theta$ with the following gradient:

$$\nabla_\theta J(\theta) = E\big[ R_t \nabla_\theta \log \pi_\theta(s_t, a_t) \big] \qquad (11)$$

As in (11), the ''vanilla'' version of REINFORCE has high variance and slow convergence. To combat this problem, a baseline technique was introduced [23], [24]. Reference [25] described how the most natural and effective baseline is the state-value function. Thus, we use $V^\pi(s)$ as the baseline, and the gradient is given by:

$$\nabla_\theta J(\theta) = E\big[ (R_t - V^\pi(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t) \big] \qquad (12)$$

The baseline's role in (12) is to reduce the variance of the gradient estimation and smooth the training. In principle, any RL algorithm can be used for training the policies; we chose REINFORCE for its simplicity and good performance.
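To make the baseline's role concrete, the following sketch computes a scalar surrogate whose gradient (under automatic differentiation) would match the baselined REINFORCE gradient. The function name and data layout are illustrative assumptions:

```python
import math

def reinforce_loss(log_probs, rewards, values):
    # Surrogate: sum_t (R_t - V(s_t)) * log pi(a_t | s_t), with R_t the
    # undiscounted return after step t (rewards[t] holds r_{t+1}).
    # Negated so that minimizing it performs gradient ascent on expected return.
    R, loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R += rewards[t]
        loss += (R - values[t]) * log_probs[t]
    return -loss
```

Subtracting the state value shrinks the magnitude of each term without changing the gradient's expectation, which is what reduces the variance of the estimate.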
The output layers of the multimodal fusion component and the DST both have 128 units. Therefore, the dialog state is represented by a vector $s \in \mathbb{R}^{128}$. The neural network that represents the dialog policy has a single hidden layer with 128 units that is fully connected to the input and output layers. We used the Adam optimizer for the optimization of the networks' parameters and initialized the learning rate at 1e-3. We trained the policies for the source and target tasks with 20,000 and 10,000 episodes respectively, each of which can be seen as a simulated dialog with the user simulator. The learning rate decreased by 10% every 1,000 episodes.
The neural networks that approximate the mixture variables in Algorithms 1 and 2 have one hidden layer with 256 units. We also used the Adam optimizer for parameter optimization. The learning rate was fixed at 1e-4, and the number of training epochs was 10. Note that in Algorithms 1 and 2, the network parameters are updated sequentially with two different gradients. Thus, the training process oscillates strongly and converges slowly. To avoid this problem, we adopted a p:q training scheme: every epoch, we performed component matching p = 2 times and trained the mixture model q = 2 times.
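The p:q scheme described above can be sketched as a simple driver loop; `match_step` and `mixture_step` are hypothetical callbacks standing in for the two gradient updates:

```python
def pq_training(epochs, match_step, mixture_step, p=2, q=2):
    # The p:q scheme: per epoch, perform p component-matching updates and then
    # q mixture-model updates, interleaving the two gradients to damp the
    # oscillation caused by sequential updates with different objectives.
    schedule = []
    for _ in range(epochs):
        for _ in range(p):
            match_step()
            schedule.append("match")
        for _ in range(q):
            mixture_step()
            schedule.append("mixture")
    return schedule
```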

B. PERFORMANCE METRICS
We measured the training time of all the methods in seconds. Only the time spent on adaptation is measured; the learning time of the source task's policy and the time for creating the simulator are not included. We ran the adaptation algorithms on identical hardware 1 to ensure a fair comparison among the evaluated methods.
To assess our second hypothesis, we used two performance metrics: the average reward per episode and the system's action selection accuracy. For the first metric, we used a user simulator created from the target dataset: we ran the learned policies against this simulator for 1,000 episodes and measured the average total reward per episode. For the second metric, several human experts read the dialog transcripts and selected the most appropriate system action for each turn. These actions are used as the ground truth for measuring the action selection accuracy of the policies. This metric shows the appropriateness of the learned policies' behavior at each turn.
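The second metric can be sketched as follows; the action labels are illustrative:

```python
def action_selection_accuracy(policy_actions, expert_actions):
    # Fraction of turns where the policy's chosen action matches the
    # expert-labeled ground-truth action for that turn.
    assert len(policy_actions) == len(expert_actions)
    correct = sum(p == g for p, g in zip(policy_actions, expert_actions))
    return correct / len(expert_actions)
```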

C. DIALOG TASK
In this evaluation setting, both the source and the target tasks are negotiation tasks in the healthcare domain [19]. The system tries to convince the users that their current lifestyle is unhealthy and suggests that they adopt its proposed living habits.
We used the previously proposed healthcare consultation dataset [19], which contains conversations on six topics: sleeping, eating, working, exercising, social media usage, and leisure activities. The conversations of the first four topics (51 dialogs) are used for training the policies of the source task. The remaining 24 dialogs on two topics are used for training the target task's policy. In each turn, we split the recorded video of the users into 30 segments and randomly sampled one frame from each segment for extracting visual features. We extracted 14 face action unit (AU) regression values and 6 AU classification values for each frame with the OpenFace toolkit [26]. The visual observation at each turn, $o_v$, is a vector that contains these extracted values. We extracted acoustic features from the audio using the openSMILE toolkit [27] with the Interspeech 2009 (IS09) emotion challenge standard feature set as our feature template [28]. We used these extracted features to create the acoustic observation $o_a$.

1 CPU: Intel Xeon CPU E5-2630 v4, GPU: GTX Titan X.

The set of system actions in the source task is identical to that of [19]: {Offer, Framing, End}. We changed the system's action set in the target task to {Offer_New, Offer_Change, Framing_Argue, Framing_Answer, End_Dialog}. The source policy never sees these actions during training.
For the RL training, we created a user simulator that generates labels of user action u and deception information d with the following intention and deception models: Recall that in each dialog turn, the system takes input features (observations) from three modalities: linguistic, visual, and acoustic. We employed the user action u as linguistic observation o l . The visual and acoustic observations, o v and o a , are sampled uniformly from the dialog corpus using u and d.

D. EXPERIMENT RESULTS
We conducted the following experiments to assess the hypotheses raised at the beginning of this section.

1) TRAINING TIME COMPARISON
Table 1 shows the training times required for all the policy adaptation methods. The numbers reported for DPRA are from a case in which all 24 dialogs of the target task are available for training; cases with less data obviously take less time for training with DPRA. With policy adaptation by action embedding [8], we chose the number of episodes (interactions with simulated users) for training the target policy based on the average rewards received per episode. As seen from Table 1, since DPRA requires significantly less time for training, our first hypothesis stands.

2) DIALOG POLICY PERFORMANCE COMPARISON
To assess the second hypothesis, we recreated scenarios in which the amount of data available for training in the target task is limited. In particular, we sampled k dialogs from the target task dataset, k ∈ {1, 2, 4, 8, 16}. We used these k dialog samples to create a user simulator for training the target policy in the ActEmb adaptation method and for modeling the action-relation probability in DPRA. For each value of k, we sampled k dialogs five times, making five different datasets. With each dataset, we performed policy adaptation ten times for each method, thus conducting 50 runs of the policy adaptation experiment for each value of k.

a: PERFORMANCE IN TERMS OF AVERAGE REWARD
Table 2 shows the performance of the dialog policies in terms of the average reward per episode. The details of the reward function for both the source and target tasks are provided in Appendix A. DPRA-MDN and DPRA-regression respectively refer to the proposed policy adaptation method with component matching by MDN and by regression. ActEmb-2k and ActEmb-10k refer to the policy adaptation by action embedding method with 2,000 and 10,000 training episodes in the target task. Finally, NoAdapt refers to a policy that is trained on the source task without adaptation, where the notation for the number of training episodes is identical to ActEmb. As seen in the table, the performance generally increases when more data are available. In Table 2, bold numbers indicate the policy with the highest average reward per episode in each scenario with k available dialogs. Policies adapted by DPRA show significantly higher performance than those from ActEmb and NoAdapt (p < 0.05) for all k, and the difference is bigger when k is small. ActEmb-2k and ActEmb-10k perform similarly; on the other hand, the performance of NoAdapt-10k is remarkably higher than that of NoAdapt-2k.

b: PERFORMANCE IN TERMS OF DA SELECTION
The performance in terms of dialog act selection accuracy for all the policies is shown in Table 3. Similarly, the dialog policies learned by DPRA outperformed those of ActEmb and NoAdapt by a large margin (p < 0.05) for all values of k. Surprisingly, when more data are available, the performance gap between DPRA and the other methods increases. In fact, there is only a subtle increase in the action selection accuracy of the policies learned by ActEmb and NoAdapt when k increases from 1 to 16. Recall that the results in Table 2 were reported under a setting where the policies were run against a simulator created from all 24 dialogs in the target dataset. Therefore, if we train a policy using a simulator created from 16 dialogs, the performance in terms of average reward per episode will be much higher than when using a simulator built from just one dialog. However, because ActEmb and NoAdapt do not use the full knowledge of the source task's policy, 16 dialogs are insufficient to train a policy with high action selection accuracy; thus, the gain from increasing the amount of available data in Table 3 is modest. In contrast, DPRA retains knowledge from the source task's policy and effectively adapts it to the target task, thus achieving high action selection accuracy, especially when more data are available.

V. RELATED WORKS
Many studies have addressed policy adaptation for reinforcement-learning-based goal-oriented dialog management.
Chen et al. [6] proposed a policy adaptation method using a multi-agent dialog policy. They used an explicit representation of the dialog state, which consists of information for multiple slots. For each slot, they built an ''agent'' that learns how to choose actions corresponding to that slot. The dialog policy is an ensemble of these agents. In the target task, for each new slot, the network weights of its corresponding agent are initialized with the weights of the agents trained in the source task. Although this adaptation method does not require the state representations to be similar, as in DPRA, it has a restriction: the state representation of both tasks must be explicit, such as slots and values. In addition, this method requires identical action sets for each agent to perform weight initialization. This makes Chen et al.'s method [6] less flexible than DPRA in terms of action space restrictions.
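The per-slot weight transfer described above can be sketched as follows; the data structures and the `slot_mapping` argument are illustrative assumptions, not the exact formulation of [6].

```python
def init_target_agents(source_agents, target_slots, slot_mapping):
    """Initialize one agent per target slot: copy a source agent's
    weights when a source slot maps to the target slot, otherwise
    start from scratch. Requires identical per-agent action sets."""
    target_agents = {}
    for slot in target_slots:
        src_slot = slot_mapping.get(slot)
        if src_slot in source_agents:
            # reuse the trained source agent's parameters
            target_agents[slot] = dict(source_agents[src_slot])
        else:
            # no matching source slot: this agent is trained from scratch
            target_agents[slot] = {"weights": None}
    return target_agents
```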
Ilievski et al. [7] introduced a policy adaptation method based on weight bootstrapping that also uses slot information for the state representation. Their method shares the slots and actions across the source and target tasks and constructs a neural network in which the number of neurons in the input layer equals the number of unique slots in both tasks. Similarly, the number of neurons in the output layer equals the total number of grouped actions. First, they train a policy on a source task by reinforcement learning. The network weights are then fine-tuned by training on the target task. This adaptation method has the same restriction on state representation as the previous work [6]. Furthermore, [7] also requires the action sets of the source and target tasks to overlap. Our proposed method is more flexible and does not have this limitation.
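The weight bootstrapping idea can be sketched as building a weight matrix over the union of slots and actions, copying the entries the source task already covers and randomly initializing the rest before fine-tuning. This is an illustrative simplification (a single linear layer), not the exact network of [7].

```python
import numpy as np

def bootstrap_weights(source_w, source_slots, source_acts,
                      all_slots, all_acts, seed=0):
    """Build a target weight matrix over the union of slots/actions;
    entries covered by the source task reuse the trained source
    weights, the rest are randomly initialized."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.1, size=(len(all_slots), len(all_acts)))
    for i, slot in enumerate(all_slots):
        for j, act in enumerate(all_acts):
            if slot in source_slots and act in source_acts:
                w[i, j] = source_w[source_slots.index(slot),
                                   source_acts.index(act)]
    return w
```

Fine-tuning on target-task dialogs then updates all entries, including the bootstrapped ones.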
Mendez et al. [8] proposed dialog policy adaptation using action embeddings. They argued that a set of action embeddings can be shared across the source and target tasks. In practice, the action embedding is represented by a hidden layer that is fully connected to both the input and output layers. After training the policies on all the source tasks, the weight parameters connecting the input and hidden layers are used to initialize the corresponding weights in the target policy's neural network. We used this method as our baseline because it requires no additional knowledge of relations between the source and target tasks and is comparable to our proposed method in terms of flexibility.
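The embedding transfer step can be sketched as follows: the input-to-hidden weights (the shared ''action embedding'') are copied from the trained source network, while the hidden-to-output layer is re-initialized for the new action set. The parameter names are illustrative assumptions.

```python
import numpy as np

def transfer_action_embedding(source_params, n_target_actions, seed=0):
    """Initialize a target policy network by reusing the source
    network's input-to-hidden weights; the hidden-to-output layer
    is freshly initialized for the target action set."""
    rng = np.random.default_rng(seed)
    w_in = source_params["w_in"]          # (input_dim, hidden_dim)
    hidden_dim = w_in.shape[1]
    return {
        "w_in": w_in.copy(),              # shared action embedding
        "w_out": rng.normal(0.0, 0.1, (hidden_dim, n_target_actions)),
    }
```

The target network is then trained by RL on the target task, which is why this baseline still requires simulated episodes (the ActEmb-2k/10k settings above).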

VI. CONCLUSION
We propose a novel method for policy adaptation in dialog management called the dialog policy reuse algorithm (DPRA). Our method uses the action-relation probability for adaptation, which allows the source task policy to be reused for action selection in the target task. DPRA learns the action-relation probability from dialog samples in both tasks using a mixture density network and can immediately derive a policy for the target task. Thus, DPRA learns the target task's policy much more quickly than conventional methods that require RL training and user simulation. Since our method does not employ a user simulator, it avoids the performance degradation caused by errors in constructing one.
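The immediate derivation of a target policy can be sketched as marginalizing the source policy's action distribution through the learned action-relation probability. This is one plausible reading of the description above, not the paper's exact formulation; the relation matrix would be estimated by the MDN or regression component.

```python
import numpy as np

def derive_target_policy(source_policy_probs, action_relation):
    """Derive target-task action probabilities by marginalizing the
    source policy through the action-relation distribution:
        p(a_t | s) = sum over a_s of P(a_t | a_s) * pi_source(a_s | s)
    source_policy_probs: (n_source_actions,) distribution over source acts.
    action_relation: (n_source_actions, n_target_actions), rows P(a_t|a_s)."""
    p_target = source_policy_probs @ action_relation
    return p_target / p_target.sum()  # renormalize for numerical safety
```

Because this is a single matrix product per turn, no further RL training is needed once the relation is estimated, which is the source of the speed-up reported in Section IV.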
Future work will conduct a deeper scrutiny of DPRA to determine in which adaptation settings it works and what performance we should expect from it. In addition, DPRA currently does not take changes in reward dynamics into consideration. We believe that if we can incorporate an estimate of such changes between the reward functions of the source and target tasks, we can further improve the performance of the policies learned by DPRA. Finally, we also want to investigate applications of our method to other settings, such as adaptation for autonomous control tasks.

APPENDIX A REWARD FUNCTIONS IN THE EXPERIMENT
Table 4 shows the reward in the source task that is received by the agent when selecting an action given the user dialog act (u) and the deception labels (d), which are generated by the user simulator. The reward definition for the target task in Section IV is shown in Table 5. Note that Offer_New gives a +10 reward only if it is selected in the first turn.

APPENDIX B PROOF OF EQUATION 1 AND EQUATION 10
First, we show the proof for (1).

Table 6 shows the details of the environment that was used for our experiments described in Section IV.