Table-to-Dialog: Building Dialog Assistants to Chat With People on Behalf of You

Artificial Intelligence (AI) personal assistants have attracted much attention from both academia and industry. Almost all existing AI personal assistants serve as service terminals that chat with human users for certain tasks. We are instead interested in building AI personal assistants for a different yet important dialogue scenario, where they chat with people to fulfill specific tasks on behalf of their human users. As the personal assistants play a requester role, instead of a service terminal role, the conversation goal becomes delivering or requesting information precisely and efficiently according to specific user requests. The challenge for the conversation policy is that every user request must be delivered precisely, while the challenge for response generation is that machine-generated responses are generally expected to cover multiple information slots, either requesting or delivering, to make the conversation efficient. In this paper, we present Table-to-Dialogue, a novel approach that addresses the above challenges when building a requester role AI personal assistant. We employ an encoder-decoder network to learn an explicit conversation policy, which generates the corresponding information slots based on the conversation context and the user request table. We further integrate a novel Multi-Slot Constrained Bi-directional Decoder (MS-CBD) into the above encoder-decoder network to generate the machine response according to the multiple slot values and their intermediate representations from the policy decoder. Different from the existing single-direction text decoder approaches, MS-CBD leverages the bi-directional context of the response when generating it, to enhance semantic coherence. The experiments show that our approach significantly outperforms the state-of-the-art conversation approaches on automatic and human evaluation metrics.


I. INTRODUCTION
Building AI personal assistants has been a fascinating research topic ever since the middle of the last century [1]-[3], [11]. Many successful personal assistant products, such as Apple Siri, Google Now, and Amazon Alexa, have emerged in the last decades [5]. However, almost all existing AI personal assistants act as service terminals, which answer users' questions, control home automation devices, play music, and even manage users' calendars via verbal commands [6]-[9], [11].
There are a large number of routine dialogues in everyday life in which people need to deliver their requests to some staff service, e.g., for hotel booking, movie ticket purchasing, and restaurant reservation. Building AI personal assistants for such scenarios requires a role exchange from the service role to the requester role, i.e., AI personal assistants are supposed to chat with real people on behalf of their users. The ultimate goal of such conversations is to deliver precise information efficiently according to the user request.
(The associate editor coordinating the review of this manuscript and approving it for publication was Sun-Yuan Hsieh.)
The existing conversation approaches [12], [14] usually employ encoder-decoder networks to generate machine responses directly from the conversation context, without explicit conversation policies. When playing a service role, the underlying conversation policies always converge to making sure that the important slots are filled. Such policies are user independent and can therefore easily be learned implicitly from conversations. When playing a requester role, the expected policies aim to deliver the specific user request precisely, without missing anything. Implicit policy learning cannot reach this goal, since request fulfillment is not an objective in the existing approaches. On the other hand, to make the conversation efficient, machine-generated responses are expected to carry multiple information slots, either requesting or delivering. This raises a new challenge for response generation, i.e., how to generate a semantically coherent machine response that carries multiple information slots.

FIGURE 1. Table-to-Dialogue generation. The personal assistant (PA) chats with the staff service (SS) based on the dialogue context and the request table. The latest PA response covers multiple slots from the table (the underlined words).
In this paper, we present Table-to-Dialogue, a novel end-to-end approach to build AI personal assistants that chat with human staff services on behalf of their users. Each user request consists of a set of information slots that are stored in a request table. As it is required to deliver the user-specific request precisely, the request table, together with the conversation context, is fed into a policy learning network, which learns explicit policies that describe which slots to address in the next round of conversation. Since the policy output is a sequence of slots, the response generation needs to handle keywords from multiple slots. We further integrate a novel Multi-Slot Constrained Bi-directional Decoder (MS-CBD) into the above policy model, which generates a machine response carrying multiple slots. Different from the existing text decoding methods, which all fall into single-direction language generation [13], [14], [17]-[19], MS-CBD leverages bi-directional context to enhance the coherence of the generated response.
We conduct experiments on the task-oriented dataset MultiWOZ [14], which contains over 10K conversations and the corresponding request tables over 6 domains. Although the MultiWOZ dataset is designed for building service agents, we can exchange the roles to leverage it for building requester role AI personal assistants. The experiments show that our approach, Table-to-Dialogue, significantly outperforms the state-of-the-art conversation approaches on both automatic and human evaluation metrics. We summarize our contributions as follows.
• We propose a policy learning model to learn explicit conversation policy to make sure that each user request is fulfilled precisely.
• We propose a novel response generation model called MS-CBD, trained together with the policy model, to generate multi-slot machine responses considering the bi-directional context of the response.
• We conduct empirical studies of the proposed approach to show its effectiveness against the state-of-the-art methods.

II. PROBLEM FORMULATION
This work aims to model a personal assistant by using a novel end-to-end Table-to-Dialogue generation method, where the personal assistant chats with staff services to deliver user request precisely. In this section, we give a brief problem definition of our task and describe the motivation of the proposed MS-CBD model.

A. PROBLEM DEFINITION
As shown in Figure 1, a user task is presented as a request table T, which is organized in a hierarchical structure. The Context in Figure 1 is a conversation between a personal assistant (PA) and a staff service (SS). Each slot of the request table T is composed of four components: Domain, Key, Intent and Value. The Domain denotes the domain information of the current dialogue, such as the restaurant domain in Figure 1. The Intent takes one of three types: Inform, Request and Fail, where Inform and Request represent delivering and requesting information, respectively. Fail indicates that the personal assistant should propose an alternative when the current request cannot be satisfied. Given the dialogue context X = (x_1, ..., x_k) and a request table T, our task is to generate a response sequence modeled as p(y | X, T), where y denotes the generated response. The response should precisely cover multiple slots S = (s_1, ..., s_m) from the request table, as shown in the PA Response of Figure 1.
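The four-component slot structure above can be sketched as a simple data type. The concrete field values below are hypothetical illustrations of the schema, not entries taken from the dataset:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Slot:
    """One slot of a request table T: (Domain, Key, Intent, Value)."""
    domain: str   # e.g. "restaurant"
    key: str      # e.g. "food"
    intent: str   # one of "Inform", "Request", "Fail"
    value: str    # may be empty for a Request slot

def build_example_table() -> List[Slot]:
    # A request table is simply an ordered collection of slots.
    return [
        Slot("restaurant", "food", "Inform", "italian"),
        Slot("restaurant", "pricerange", "Inform", "moderate"),
        Slot("restaurant", "address", "Request", ""),
    ]
```

A generated PA response would then need to cover each of these slots, e.g. stating the food type and price range while asking for the address.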

B. MOTIVATION
Our task aims to generate a conversation with staff services, in a requester role, from the request table, and to deliver information precisely and efficiently. To encode the specific user request from the request table, an explicit policy is needed to represent the request information precisely. Policy learning is simplified to multi-slot generation in this paper. After learning the policy, a machine response is generated according to the policy, which is expected to carry multiple information slots.
In addition, we observe that people generate a sentence from multiple keywords through a keyword association ability and a bi-directional thinking ability. People first associate several related word clusters with each keyword. Then they generate the response by leveraging the bi-directional context from the word clusters.
Motivated by this phenomenon, we decompose multi-slot response generation into two steps. To simulate the association ability, we first use the intermediate representations of slots s_i and s_{i+1} to generate the word cluster clu_i for the i-th segment. Then, to simulate the bi-directional thinking ability, we generate the sentence y by bi-directional attention over all word clusters to enhance semantic coherence. The whole process can be formulated as:

p(y | T, X) = p(S | T, X) · p(y | S, T, X)
p(y | S, T, X) = p(clu | S, X) · p(y | clu, X)

where p(S | T, X) denotes the policy learning phase and p(y | S, T, X) denotes the response generation phase. In the response generation phase, p(clu | S, X) denotes the word cluster association and p(y | clu, X) denotes the final response generation. The details of each phase are introduced in Section III.

III. APPROACH
In this section, we present our proposed model in detail. The model structure is illustrated in Figure 2; it consists of an encoder and a policy decoder integrated with a multi-slot constrained bi-directional decoder (MS-CBD). The encoder generates the representations of the dialogue context and the request table. In the following sub-sections, we introduce each part of our model structure.

A. ENCODER
The encoder is composed of two parts: a context encoder and a table encoder, which encode the dialogue context and the request table, respectively.

1) CONTEXT ENCODER
The context encoder encodes the dialogue context into vector representations. We obtain each input word representation from a pre-trained word embedding. As shown in the Context Encoder of Figure 2, the encoder then runs a bi-directional long short-term memory network (BI-LSTM) [20] over the word representations, which encodes the n input tokens into context representations H ∈ R^{n×d_m}, where d_m is the dimension of the encoder.

2) TABLE ENCODER
Given the request table T, the table encoder aims to generate context-aware slot representations from the context and the table, as shown in Figure 2. The initial slot embedding E ∈ R^{m×d_m} is generated by a linear layer after concatenating the constituent component embeddings, where m denotes the number of slots in table T. The table encoder further performs context-aware attention to capture the interaction between the dialogue context and the request table. The attention matrix Ê is calculated as:

Ê = softmax(E W_t H^⊤) H

where Ê has the same shape as E and W_t ∈ R^{d_m×d_m} is a trainable parameter. We finally concatenate the slot embedding and the attention matrix to generate the context-aware slot representation.
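This context-aware attention can be sketched in a few lines of numpy. The equation form Ê = softmax(E W_t H^⊤) H is a reconstruction consistent with the stated shapes (Ê the same shape as E, W_t ∈ R^{d_m×d_m}); the paper's exact formula may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_attention(E, H, W_t):
    """E: (m, d_m) slot embeddings, H: (n, d_m) context representations,
    W_t: (d_m, d_m) trainable weight. Returns E_hat, same shape as E."""
    scores = E @ W_t @ H.T           # (m, n): slot-to-token affinities
    A = softmax(scores, axis=-1)     # normalize over context tokens
    return A @ H                     # (m, d_m): context-aware slot reps
```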

B. DECODER
To generate a response covering multiple information slots, a policy decoder is used to learn the conversation policy. The policy here denotes the inferred slot sequence from the request table. Furthermore, a novel multi-slot constrained bi-directional decoder (MS-CBD), which is integrated into the decoder network, is used to generate the response according to the multiple slot values and their intermediate representations from the policy decoder.

1) POLICY DECODER
The policy decoder uses RNNs to model the generation of a slot sequence from the request table. The hidden state d_i at step i is computed by an LSTM, whose inputs are the previous hidden state d_{i−1} and the slot representation s_{i−1} ∈ S. We use D ∈ R^{l×d_h} to denote the policy decoder hidden matrix, where l denotes the length of the slot sequence and d_h is the hidden size. The probability matrix P_s ∈ R^{l×m} of slot generation is calculated via a pointer network [21], where m denotes the number of slots in the request table:

D̂ = tanh([D; C_s] W_d),  P_s = softmax(D̂ Ê^⊤)

where C_s denotes the context attention matrix, whose shape is the same as that of D, and W_d is a trainable parameter. That is, the output D̂ of the policy decoder is obtained by a non-linear layer over the concatenation of the decoder hidden matrix D and the attention matrix C_s.
We finally obtain a slot sequence S = (s_1, s_2, . . . , s_l) from the policy decoder.
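The pointer step can be sketched as follows. Scoring decoder outputs against the slot representations with a dot product is an assumed form chosen to match the stated shape P_s ∈ R^{l×m}; a pointer network produces, at each decoding step, a distribution over the input slots rather than over a fixed vocabulary:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pointer_slot_probs(D_hat, E_hat):
    """D_hat: (l, d) policy decoder outputs, E_hat: (m, d) slot
    representations. Returns P_s of shape (l, m): at each of the l
    decoding steps, a probability distribution over the m table slots."""
    return softmax(D_hat @ E_hat.T, axis=-1)
```

Decoding the slot sequence then amounts to taking the argmax (or sampling) over each row of P_s.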

2) RESPONSE DECODER
MS-CBD generates the segment between each pair of adjacent slots in a keyword-association and bi-directional-thinking manner, as illustrated in Figure 2. To simulate the keyword association ability, the intermediate representations of the policy decoder are used to generate the word cluster clu of each segment. Then, to simulate the bi-directional thinking ability, the response decoder performs bi-directional attention over all word clusters when generating the response.

The first step is to associate a word cluster using the intermediate representation of the policy decoder. We concatenate the hidden states of two adjacent slots to represent the segment state:

r_i = [d_i; d_{i+1}]

where i denotes the i-th segment of the sequence and d_i ∈ D. Then, the segment state r_i is used to generate the word cluster clu_i. We calculate the probability of selecting word w_d in the i-th segment via:

p(w_d | r_i) ∝ exp(e_d^⊤ W_r r_i)

where e_d is the embedding vector of word w_d and W_r ∈ R^{d_m×2d_h} is a trainable parameter. For the i-th segment, we use the top-k relevant words as the cluster clu_i to aggregate the distribution of words in the segment.

Then, the response decoder generates the tokens between two adjacent slots by RNNs in a bi-directional-thinking way. The decoder leverages bi-directional attention over all word clusters clu when generating the current token, to enhance the semantic coherence of the generated response. To make use of the position of each word cluster, inspired by relative position representations [22], we further incorporate a relative position vector into the word embeddings in each cluster. Specifically, when decoding the tokens in the i-th segment, the representation of the k-th word in the j-th cluster clu_j is computed as:

ê^i_k = e_k + p_{j−i}    (7)

where e_k is the embedding of the k-th word and p_{j−i} denotes a trainable relative position vector.
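The association step above can be sketched under the stated shapes. The function names and the dictionary-based relative-position table are illustrative, not the paper's implementation:

```python
import numpy as np

def segment_state(d_i, d_next):
    # r_i = [d_i; d_{i+1}]: concatenate two adjacent slot hidden states.
    return np.concatenate([d_i, d_next])

def associate_cluster(r_i, word_emb, W_r, k=5):
    """Score every vocabulary word against the segment state,
    score(w) = e_w^T W_r r_i, and keep the top-k as the cluster.
    word_emb: (V, d_m), W_r: (d_m, 2*d_h), r_i: (2*d_h,)."""
    scores = word_emb @ (W_r @ r_i)   # (V,) unnormalized scores
    return np.argsort(-scores)[:k]    # indices of the k best words

def relative_position_embed(e_k, p_table, j, i):
    # Eq. (7): e_hat = e_k + p_{j-i}; p_table maps an offset to a vector.
    return e_k + p_table[j - i]
```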
Then we calculate the bi-directional attention score between the decoder hidden state v_t at time step t and word w_k via a dot-similarity ω(·), and obtain the cluster-aware representation o_t similarly to Equation 3. Finally, the response decoder maintains the generated state by passing it in a left-to-right manner among segments via RNNs, and calculates the probability of generating word w_d via:

p(w_d) ∝ exp(e_d^⊤ tanh(W_v^⊤ [o_t; c_t; v_t]))

where c_t is the context attention vector as in Equation 4 and W_v ∈ R^{(2d_m+d_h)×d_m} is a trainable parameter. Taking Figure 2 as an example, the decoder generates the yellow tokens from the slot 'BEG' until the slot 'Italy' in purple and passes the hidden states to the next segment.
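The generation step can be sketched as follows. Since the paper does not spell out ω(·), the projection W_q that makes the dot-similarity well-defined between the d_h-dimensional decoder state and the d_m-dimensional word embeddings is an assumption; the output layer follows the stated shape of W_v:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cluster_aware_rep(v_t, cluster_embs, W_q):
    """Bi-directional attention over ALL cluster word embeddings
    (clusters left and right of the current segment).
    v_t: (d_h,), cluster_embs: (N, d_m), W_q: (d_h, d_m) assumed
    projection so the dot-similarity is dimensionally valid."""
    q = v_t @ W_q                       # (d_m,) query
    a = softmax(cluster_embs @ q)       # attention over cluster words
    return a @ cluster_embs             # (d_m,) cluster-aware o_t

def output_word_probs(o_t, c_t, v_t, word_emb, W_v):
    """Final word distribution from [o_t; c_t; v_t], whose length
    2*d_m + d_h matches the stated shape of W_v."""
    h = np.tanh(np.concatenate([o_t, c_t, v_t]) @ W_v)  # (d_m,)
    return softmax(word_emb @ h)                        # (V,)
```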

IV. EXPERIMENT
In this section, we first present the policy and response generation results of MS-CBD compared with state-of-the-art conversation approaches. Then, we evaluate the personal assistant via human rating and present case visualizations.

A. EXPERIMENT SETUP 1) DATASET
We conduct our experiments on MultiWOZ [14], the latest human-to-human multi-turn dataset for task-oriented dialogue across several domains, which contains 10,438 dialogues on 6 different domains. To build the Table-to-Dialogue task, where the personal assistant serves in a requester role, we extract the request tables from the user's dialogue acts in each conversation and split the conversation into fixed contexts with a labeled policy and response for each single round. In total, there are 30,275/4,071/4,083 samples for training, validation and testing, respectively. Specifically, 59.9% of the dialogues contain more than 6 slots in the request table, and the average slot sequence length for a response is 1.24.

2) BASELINES
We compare our proposed model against the following approaches:
• Seq2Seq: a seq2seq model with an attention mechanism [13] that generates the response from left to right. We also evaluate CopyNet [23], a seq2seq model augmented with a copying mechanism. For a fair comparison, we integrate the baseline models with our table encoder.
• MD-DAS: a multi-domain dialogue architecture [14] for Context-to-Text generation that encodes the policy implicitly and generates the response from the hidden representation and the context.
• FGSD: a field-gating seq2seq model with dual attention [24], proposed for the Table-to-Text task, which achieves state-of-the-art results on WikiBio [32]. We encode the request table with the field-gating encoder and apply the dual attention between the key and value components of each slot in the decoder.
• ABDNMT: a sequence generation model that uses a backward decoder and a forward decoder to explore bi-directional information for neural machine translation [35]. We integrate the model with our policy decoder to predict slots explicitly, for a fair comparison with our response decoder.

3) METRICS
We use three kinds of metrics to evaluate the Table-to-Dialogue task. (1) NLG Metrics. Following the original settings [14], we measure fluency with the BLEU score [26]. Considering the lack of reference labels, we further use NIST, METEOR and ROUGE [27]-[29] to evaluate the similarity between the generated output and the labeled response. (2) Policy Metrics. We compute F1 and Acc scores between the predicted and labeled slot sequences. The F1 score measures slot overlap, while Acc measures whether the predicted sequence as a whole is equivalent to the label. (3) Human Rating. A human evaluator is asked to label each conversation with a 2-level score, 0 or 1. If our system fulfills the request table (including duplication) without mistaking the request, the human evaluator labels it as 1; otherwise, the label is 0. We use Acc to measure the result.
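The policy metrics can be sketched directly. The multiset-overlap definition of slot F1 below is an assumption about how overlap is counted; Acc is exact-match of the whole sequence:

```python
from collections import Counter
from typing import List

def slot_f1(pred: List[str], gold: List[str]) -> float:
    """F1 over slot overlap, counting duplicates via multiset intersection."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred)   # precision
    r = overlap / len(gold)   # recall
    return 2 * p * r / (p + r)

def sequence_acc(preds: List[List[str]], golds: List[List[str]]) -> float:
    """Acc: fraction of predicted sequences exactly equal to the label."""
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)
```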

4) IMPLEMENTATIONS
The dimensions of the word embeddings, attention matrices and encoder are set to 300. The hidden sizes of the slot and response decoders are set to 100. Word embeddings are initialized with GloVe [33] and shared between the context and table encoders; they are fixed during training. We train the slot decoder and response decoder with teacher forcing [31] using the negative log-likelihood loss. We train the word cluster module using the words between two adjacent slots (including the slots themselves) as labels, with the bag-of-words loss [30]. In our experiments, we use the top-5 most related words as the word cluster. We optimize the models using Adam [36] with default hyperparameters, and the batch size is set to 64.
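The bag-of-words loss for the word-cluster module can be sketched as follows, assuming it sums the negative log-probabilities of each label word (the words between two adjacent slots) under the segment's cluster distribution:

```python
import numpy as np

def bow_loss(cluster_logits, label_word_ids):
    """Bag-of-words auxiliary loss for one segment.
    cluster_logits: (V,) unnormalized word scores for the segment;
    label_word_ids: ids of the words observed between the two slots."""
    z = cluster_logits - cluster_logits.max()        # stabilize
    log_probs = z - np.log(np.exp(z).sum())          # log-softmax over V
    return -sum(log_probs[w] for w in label_word_ids)
```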

B. POLICY RESULTS
To conduct a detailed analysis of policy generation, we evaluate the results at the component and sequence levels.
At the component level, we measure the F1 score on the Domain, Intent and Key components between the predicted and ground-truth sequences.
To measure the output sequence as a whole at the sequence level, we use the Acc metric. As shown in Table 1, the models that learn the policy explicitly (ABDNMT, MS-CBD) exhibit much better performance in policy generation, especially on the Key component. When we replace MS-CBD with a vanilla LSTM decoder, which generates the response with implicit policy learning, the performance drops dramatically, by about 20.3% at the sequence level. These results verify our hypothesis that an explicit policy is essential to delivering the specific user request precisely.

C. SEQUENCE GENERATION RESULTS
We first investigate the effectiveness of response generation in a single round. As shown in Table 2, MS-CBD outperforms all the baselines by a substantial margin on all the NLG metrics. It obtains a 3.7% absolute BLEU improvement over FGSD [24], the state-of-the-art model for Table-to-Text generation. To study the performance of multi-slot constrained generation, we use the ground-truth labeled slot sequence as an input, as depicted in Table 2. We find that MS-CBD still achieves the best performance among the various baselines. Moreover, we draw the following conclusions: (1) Our model with Associate performs better (+2.6% BLEU), which verifies our speculation that word clusters can simulate the human association ability and that this simulation leads to better generation. We further explore the effect of the number of words in the cluster, as shown in Figure 4. We observe that the performance of both response generation and word association drops when the cluster contains more than 7 words. On the one hand, this phenomenon demonstrates the association ability of the cluster when it contains a proper number of words. On the other hand, it shows that the word cluster has an upper bound in the improvement it brings. One underlying reason may be that feature redundancy confuses the decoder in selecting related words.
(2) When using the bi-directional attention over word clusters, the BLEU score increases by 2.2% and 1.9% in response and multi-slot constrained generation, respectively. Compared with ABDNMT [35], which explores bi-directional context with a backward decoder and is integrated with our explicit policy decoder, our model still achieves a 1.2% absolute BLEU improvement in response generation. This demonstrates that the bi-directional attention mechanism leverages richer coherence over the generated response. In addition, the relative position representation models the positional information of the word clusters, which improves the BLEU score by nearly 1.3% in response generation.

FIGURE 3. The increment of BLEU score between response groups divided according to the slot sequence length. For example, '1->2' denotes the increment between the two response groups whose slot sequence lengths are 2 and 1.
(3) In particular, MS-CBD performs better when the generated response contains more slots. As shown in Figure 3, the increment of the BLEU score between two response groups increases for MS-CBD as more slots are included, while the other baselines decline between the groups containing 4 and 3 slots. The result demonstrates that MS-CBD adapts better than the other baselines to multi-slot constrained generation.

D. END-TO-END CONVERSATION RESULTS
Beyond the policy and generation results, we also conduct an end-to-end conversation experiment. We let three human evaluators, who serve as the customer service staff, chat with our proposed system. Each human evaluator completes 30 conversations using Human Rating, and each conversation has a corresponding request table prepared in advance. The end-to-end conversation results are shown in Table 3. Compared to other conversation approaches, our proposed model performs better at delivering the user request precisely in an end-to-end dialogue manner. Our method achieves nearly a 4.2% improvement over MD-DAS, which encodes the conversation policies implicitly. We also find that our proposed model achieves task fulfillment within fewer turns, which demonstrates the efficiency of our model under multi-slot generation.

TABLE 3. End-to-end conversation results on the MultiWOZ test set by human rating. We evaluate the results with the Acc metric. We also report the average number of turns per dialogue ending with Human Rating 1.

E. VISUALIZATION OF MS-CBD
To validate that MS-CBD is able to associate related words with a keyword and select informative clusters, we visualize the attention scores between the generated tokens and the word clusters using Vig [34] in Figure 5. Figure 5 illustrates the word cluster and bi-directional thinking mechanisms. Every dotted box is the cluster associated with the slot in the pink box, and the blue lines denote attention scores. In example (a), we observe that the attention score between 'leaving' and 'after' is evidently higher than the others, suggesting that the bi-directional decoder has learned which word clusters are more informative. The association ability simulated by the word clusters also enhances generation performance. For example, the slot 'moderate' of the second cluster in example (c) associates the word 'price', which has a strong impact on the next generation step. The visualization shows the effectiveness of MS-CBD.

V. RELATED WORK

A. AI PERSONAL ASSISTANT MODELING
Research on AI personal assistants has been conducted for decades in both academia and industry [1]-[3]. Most existing AI personal assistants perform their tasks by acting as service terminals [9]. For example, CMRadar [6] is an assistant agent that manages users' calendars, while Azaria [8] proposes programmable personal assistants for detecting new emails, playing music, etc. Matsuyama et al. [9] demonstrate a socially-aware assistant that analyzes users' behaviours. Although these service-terminal assistants have achieved great success in the last decades [5], building AI personal assistants that play a requester role remains underexplored. In this paper, we focus on building personal assistants that act on behalf of the user, chatting with people according to pre-defined user request tables.

B. TABLE-TO-TEXT
The insight behind the Table-to-Dialogue approach to building personal assistants is partly inspired by the success of text generation from factual tables [10], [14], [15], [32]. Similar to the Table-to-Text generation task [24], Table-to-Dialogue also requires the model to understand the given table.
In the context of dialogue, Budzianowski et al. [14] propose a text generation task that requires generating a response from the conversation and an oracle belief state to fulfill certain tasks in a service role. Different from these previous studies, to serve as a requester role in our task, the generated response should deliver the specific user request precisely, raising new challenges for dialogue generation.

C. TEXT GENERATION
Another research area related to our work is text generation [16]-[18], [35]. Most works focus on generation without keyword constraints [16], [23], [35]. Multi-keyword constrained generation brings new challenges. Uchimoto et al. [17] first describe a system for generating text from multiple keywords by generation rules. Chen et al. [19] propose a hierarchical neural network to generate responses from implicit semantic conditions. To the best of our knowledge, MS-CBD is the first to perform multi-keyword generation using keyword association and bi-directional context in a human-like manner.

VI. CONCLUSION
In this paper we build AI personal assistants for a different yet important dialogue scenario, where they chat with people to fulfill specific tasks on behalf of their human users. We present Table-to-Dialogue to address the challenges in building a requester role AI personal assistant. In Table-to-Dialogue, explicit conversation policies are learned to decide the information slots for the next round of the dialogue, based on the conversation context and the user request table. We also integrate a novel MS-CBD model into the policy network to generate the machine response according to the multiple slot values and their intermediate representations from the policy decoder. MS-CBD leverages the bi-directional context of the response during generation to enhance semantic coherence. The experiments show that our approach significantly outperforms the state-of-the-art conversation approaches.