Adaptive Multi-Domain Dialogue State Tracking on Spoken Conversations

The main objective of a task-oriented dialogue system is to identify the intent and needs expressed in human dialogue. Many existing studies are conducted under the setting of written dialogue, and such models often struggle to cope with real-world spoken dialogues. To this end, the DSTC10 challenge organizers proposed the task of building robust dialogue state tracking (DST) models on spoken dialogues. Building on a powerful existing DST model (i.e., MinTL), this article suggests integral components for building a dialogue state tracker: 1) data augmentation effectively enhances the capability of the model to capture the entities that exist in the evaluation dataset; 2) Levenshtein post-processing prevents the distortion in model predictions caused by automatic speech recognition (ASR) errors. To validate the effectiveness of our methods, we evaluate our model on the DSTC10 dataset and conduct a qualitative analysis by ablating each component of the model. Experimental results show that our model significantly outperforms the baselines on all evaluation metrics and took 3rd place in the challenge.

I. INTRODUCTION

We participate in DSTC10 track 2, the "Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations" track. The objective of the track is to benchmark the robustness of conversational models while filling the gaps between written and spoken conversations. Task 1 in this track mainly focuses on identifying the state of the given multi-domain dialogues.
The main difficulty of this challenge lies in the fact that the training corpus is not given. The validation set is the result of spoken conversations, whereas most of the available dialogue state tracking (DST) datasets are written conversation corpora [1], [2], [3], [4]. Also, the entities in the training set and those in the evaluation set differ significantly. Under these circumstances, we set our goal as building a robust, generative, open-vocabulary dialogue system that can comfortably manage unseen values along with ASR errors. Moreover, we decide to adopt and implement additional modules to overcome the problem of generating inconsistent values.
To address these issues, this article proposes integral components for building applicable models that deal with the above real-world errors in spoken conversations. To show the effectiveness of our proposed components, we adopt the existing DST model MinTL [5], an effective transfer learning framework that shows comparable performance with generative pre-trained language models. First, we introduce a highly effective data augmentation strategy to reduce the data discrepancy between written and spoken conversations. Since the training dataset is not provided in the challenge, we augment an existing DST dataset (e.g., MultiWOZ 2.1 [4]) by replacing the names and types of several entities in the given dataset with those of the evaluation dataset. After model training, we additionally process the predicted values to obtain suitable and consistent dialogue states by exploiting Levenshtein post-processing. Lastly, we aggregate the predictions from differently initialized models by selecting the most frequently predicted value for each slot type, which is taken as the final prediction. Experimental results show that our model outperforms the baselines by about 30% in joint goal accuracy and took 3rd place in the challenge.

II. RELATED WORK

A. Open Vocabulary-Based DST
Open vocabulary-based DST is one of the main approaches to dialogue state tracking (DST). Unlike predefined ontology-based approaches, open vocabulary-based methods [6], [7], [8] generate a slot value at each turn with a generative model such as an RNN, LSTM, or GRU [9], [10], [11]. With the advent of pre-trained language models and their remarkable performance [12], [13], [14], recent studies utilize pre-trained models on DST as well [5], [15]. By exploiting pre-trained language models, the dialogue system does not suffer from task-specific design and extensive human annotations. Moreover, the models benefit from the pre-trained weights and achieve decent performance with a small fraction of the training data.

B. Handling Automatic Speech Recognition Errors
Dialogue state tracking models that receive the output of an automatic speech recognition (ASR) module inevitably face ASR errors. Previous studies on handling such errors can be divided into two approaches. One is to consider these errors within the models directly. The studies of [6], [16] explicitly utilize ASR n-best lists as additional features with extra encoders to find the correct belief state of the dialogue. Researchers in [17], [18] added a layer for correcting ASR errors during training along with the original NLU tasks. Also, one study models ASR sequences as graphs and exploits confusion networks with a neural dialogue state tracker [19]. The other approach is simply augmenting the training data for a robust DST model. The study of [20] leverages an ASR error simulator to inject noise into error-free text data and subsequently trains the dialogue models with the augmented data. [21] proposes a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the dialogue state tracker. Since we aim to build a robust model without latency in the dialogue state tracker, we apply data augmentation to the given training data directly.

III. TASK DESCRIPTION
Multi-domain dialogue state tracking is one of the tasks in DSTC10 track 2, Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations. The main objective of this track is to benchmark the robustness of conversational models against the gaps between written and spoken conversations [22]. The dataset is transcribed from human-to-human dialogues about touristic information for San Francisco. Since it is constructed from transcriptions, the data suffer from ASR errors. The task is evaluated with joint goal accuracy, and the training set is not limited to any particular dataset.

A. Problem Formulation
Given a dialogue D = {(U_1, R_1), (U_2, R_2), ..., (U_T, R_T)}, where U and R are the user utterance and system response, respectively, we define the dialogue state at each turn as B = {B_1, B_2, ..., B_T}. The slot value for each domain-slot pair is denoted as B_t(d_i, s_j) = v, where d is a domain, s is a slot type, and v is the corresponding slot value. The dialogue state tracking model aims to predict the dialogue state B_t at each turn for all domain-slot pairs, given the previous dialogue state B_{t-1} and the dialogue context C_t = {U_{t-w}, R_{t-w}, ..., R_{t-1}, U_t}, where w is the window size.
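To make the formulation above concrete, the following minimal sketch (ours, not the authors' code) represents a dialogue state B_t as a mapping from (domain, slot) pairs to values and builds the windowed dialogue context C_t; the turn strings and slot values are placeholders:

```python
# Sketch of the problem formulation: B_t maps (domain, slot) -> value,
# and the model input at turn t is the last w turns of the dialogue.
B_t = {("hotel", "name"): "fairmont san francisco",
       ("hotel", "area"): "nob hill"}

def context_window(turns, t, w=3):
    """Return the utterances feeding the model at turn index t
    (the current utterance plus the preceding w utterances)."""
    return turns[max(0, t - w):t + 1]

turns = ["U1", "R1", "U2", "R2", "U3"]
print(context_window(turns, 4, w=3))  # ['R1', 'U2', 'R2', 'U3']
```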

B. MinTL
MinTL [5] leverages generative pre-trained language models for multi-domain dialogue state tracking. The main idea of MinTL is to generate only the dialogue states that need to be changed (Levenshtein belief spans) at each turn. Specifically, we concatenate the previous dialogue state B_{t-1} and the dialogue context C_t to build the source input. The target sequence S_t consists of the newly updated slots, and each slot is updated as follows.
• Insertion: Each updated slot is formed as s_j ⊕ B_t(d_i, s_j) for each domain, and the special token for the domain, [d_i], is added to the beginning of the very first slot.

The MinTL model is fine-tuned from generative pre-trained language models, such as T5 [13] and BART [14]. The model is trained by minimizing the negative log-likelihood of S_t given B_{t-1} and C_t, which is denoted as

L = − Σ_t log P(S_t | B_{t−1}, C_t).

In particular, BART is significantly effective when fine-tuned on text generation since it also performs well on comprehension tasks. To obtain the aforementioned advantages, we employ a pre-trained generative language model and the design of [5] in this study.
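As a concrete illustration of the target-sequence construction, the following sketch (our own, with an assumed domain-token format such as "[hotel]") serializes only the slots whose value changed since the previous turn:

```python
# Illustrative sketch (not the authors' code) of building a MinTL-style
# target sequence S_t from the changed slots of the current turn.
# The "[domain]" token format is an assumption for illustration.

def build_target(prev_state, curr_state):
    """Keep only slots whose value changed since the previous turn,
    grouped by domain, serialized as '[domain] slot value' pairs."""
    parts = []
    for (domain, slot), value in sorted(curr_state.items()):
        if prev_state.get((domain, slot)) != value:
            domain_token = f"[{domain}]"
            if domain_token not in parts:
                # prepend the domain token before its first updated slot
                parts.append(domain_token)
            parts.append(f"{slot} {value}")
    return " ".join(parts)

prev = {("hotel", "area"): "nob hill"}
curr = {("hotel", "area"): "nob hill",
        ("hotel", "name"): "fairmont san francisco"}
print(build_target(prev, curr))  # [hotel] name fairmont san francisco
```

Unchanged slots are carried over from B_{t-1} at decoding time, which keeps the generated sequence short.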

C. Data Augmentation
To build a dataset that covers the entities in the evaluation set, we augment the MultiWOZ dataset by replacing entities of certain slot types in the dialogue. Examples of data augmentation are described in Table I. In order to reduce the distributional bias of entities, we refer to the MultiWOZ labels based on domain and slot types. In other words, we substitute original entities from the MultiWOZ dataset with the corresponding entities from the evaluation set according to the domain and slot types. We randomly choose the replacement entity from the candidates that have the same target domain and slot type. We conduct entity substitution only on the certain slot types described in Table I. Along with the entity substitution, only 30% of the values of the slot type day are replaced with today and tomorrow, since these values are limited to the range of Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday in the MultiWOZ dataset.
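The substitution procedure described above can be sketched as follows; the candidate table and helper names are invented for illustration and are not the authors' implementation:

```python
import random

# Hypothetical sketch of entity-substitution augmentation: replace a
# MultiWOZ entity with a random candidate of the same domain/slot type.
# The candidate table below is invented for illustration only.
CANDIDATES = {
    ("hotel", "name"): ["fairmont san francisco", "hotel nikko"],
    ("restaurant", "food"): ["korean", "dim sum"],
}

def augment(turn_text, state, seed=0):
    """Swap each replaceable entity in both the utterance and its
    state label, keeping the annotation consistent with the text."""
    rng = random.Random(seed)
    for (domain, slot), old_value in list(state.items()):
        pool = CANDIDATES.get((domain, slot))
        if pool and old_value in turn_text:
            new_value = rng.choice(pool)
            turn_text = turn_text.replace(old_value, new_value)
            state[(domain, slot)] = new_value  # keep the label in sync
    return turn_text, state

text, state = augment("i want a hotel called acorn guest house",
                      {("hotel", "name"): "acorn guest house"})
print(text, state)
```

Resampling the replacement at every epoch (as described in the implementation details) exposes the model to many evaluation-set entities from the same written dialogues.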

D. Levenshtein Post-Processing
We introduce a method, called Lev processing, that revises the predicted values depending on the given dataset and database. After obtaining the predicted value from the previous step, we conduct a replacement process utilizing the Levenshtein distance [23]. We first measure the Levenshtein distance between the predicted value and all the values from the corresponding domain and slot types in the database. After choosing the value that has the lowest score, we also apply the word error rate to find the exact matching values to the ground truth. In detail, when the distance between the predicted value and the values from the database is longer than a threshold T, we regard the case as a failure. In our qualitative analysis, we conclude that these cases

TABLE I EXAMPLES OF AUGMENTED DATASET FOR EACH DOMAIN IN THE DSTC10 DATASET
are driven by spoken conversations that contain erroneous text. To mitigate this issue, we utilize the word error rate. We extract candidate words from groups of n-gram words gathered from the previous dialogue history and compute the word error rate between each candidate and the predicted word. Afterward, we choose the lowest-scoring candidate and measure the Levenshtein distance against it. The lowest-scoring candidate is chosen as the final answer, and this process makes our model more robust to spoken conversation text regardless of the domain type.
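The core database-matching step can be sketched in pure Python as follows (the paper's implementation uses the fuzzywuzzy library; the threshold value and database entries here are illustrative assumptions):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lev_postprocess(pred, db_values, threshold=5):
    """Snap a generated value to the closest database value; distances
    beyond the threshold are treated as failures (value kept as-is)."""
    best = min(db_values, key=lambda v: levenshtein(pred, v))
    if levenshtein(pred, best) > threshold:
        return pred  # too far from any DB entry: fall back to WER step
    return best

print(lev_postprocess("fairmount san fransisco",
                      ["fairmont san francisco", "hotel nikko"]))
# fairmont san francisco
```

In the failure branch, the method described above would then score n-gram candidates from the dialogue history by word error rate instead of returning the raw prediction.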
We also suggest a consistent Levenshtein post-processing, called Lev_c, that matches slot values according to the database values of name. Once the predicted value of the name slot is obtained, we additionally find the values of the area and type slots according to the name slot. As we assume that the name slot carries the centralized information of the dialogue, we substitute the previous value with the corresponding value from the matched database entry. For example, when the predicted name slot for the hotel is "fairmont san francisco", the model predicts the value of the area slot as "embarcadero" even though the ground truth value "nob hill" is already indicated in the database. Therefore, we switch the predicted values of the other slot types with the database values according to the predicted name. By exploiting Lev_c, the consistency of the dialogue state increases empirically.
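The consistency step can be sketched as follows, with a hypothetical one-entry database built from the paper's own example:

```python
# Illustrative sketch of Lev_c: once the name slot is resolved against
# the database, overwrite dependent slots (area, type) with the matched
# entry's values. The database entry is based on the paper's example.
DB = {
    "fairmont san francisco": {"area": "nob hill", "type": "hotel"},
}

def lev_c(state):
    """Enforce slot consistency with the database item matched by name."""
    entry = DB.get(state.get("name"))
    if entry:
        for slot in ("area", "type"):
            if slot in entry:
                state[slot] = entry[slot]
    return state

print(lev_c({"name": "fairmont san francisco", "area": "embarcadero"}))
```

The trade-off discussed in the results follows directly from this sketch: if the name lookup retrieves the wrong item, every overwritten slot inherits that error.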

E. Ensemble
To boost performance, we aggregate the slot value predictions from several randomly initialized models. All slot values are first post-processed (using either Lev or Lev_c), and then we select the most frequently predicted value for each slot type as the final prediction. When more than half of the models generate a none value (empty slot), none is taken as the final prediction. On the other hand, if the majority of the models generate non-empty values, even if the values are slightly different from each other, we choose the most frequently predicted value among the non-empty values.
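The voting rule above can be sketched as follows (a minimal version of our own, applied per slot type):

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over per-model slot values: 'none' wins only when
    more than half of the models predict it; otherwise take the most
    common non-empty value."""
    n = len(predictions)
    none_count = sum(p == "none" for p in predictions)
    if none_count > n / 2:
        return "none"
    non_empty = [p for p in predictions if p != "none"]
    return Counter(non_empty).most_common(1)[0][0] if non_empty else "none"

print(ensemble_vote(["nob hill", "nob hill", "none", "knob hill", "none"]))
# nob hill
```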

IV. EXPERIMENTS

A. Experimental Setup
1) Dataset:
To train our model, we adopt a clean version of MultiWOZ [4], [24], which is a commonly used benchmark dataset for the multi-domain dialogue state tracking task. This dataset is constructed on the basis of written conversations in 7 domains (e.g., restaurant, hotel, police, and taxi). During the training phase, all of the training, validation, and test sets are used to train the model. For the evaluation, we use the DSTC10 dataset^1 provided by the challenge organizers. Unlike MultiWOZ, it is annotated from spoken conversations and covers only 3 domains (i.e., restaurant, hotel, and attraction). To reduce the domain discrepancy, we only use dialogue states belonging to these three domains, and those of the remaining domains are not considered during model training. Corpus statistics for both the MultiWOZ and DSTC10 datasets are described in Table II.
2) Evaluation Metrics: We evaluate our methods using several evaluation metrics. 1) Joint goal accuracy is commonly used as the main metric in dialogue state tracking, and it is used to rank the participants in the challenge. At the turn level, it scores 1 if all predicted slot-value pairs are exactly the same as the ground truth and 0 otherwise. 2) Slot accuracy checks whether each individual slot is correctly predicted. 3) Precision, recall, and F1 score are used for both value and none prediction.
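Joint goal accuracy, as defined above, can be computed as in this sketch (the slot-name strings are illustrative):

```python
def joint_goal_accuracy(preds, golds):
    """Turn-level joint goal accuracy: a turn scores 1 only if every
    predicted slot-value pair exactly matches the ground truth."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

preds = [{"hotel-name": "fairmont san francisco"},
         {"hotel-name": "fairmont san francisco",
          "hotel-area": "embarcadero"}]
golds = [{"hotel-name": "fairmont san francisco"},
         {"hotel-name": "fairmont san francisco",
          "hotel-area": "nob hill"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```

The second turn scores 0 despite one correct slot, which is why a single wrong slot can dominate this metric.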

B. Implementation Details
We implemented our model using the PyTorch [26] library. We employed BART-base [14] as the pre-trained backbone model, as it showed better results than T5 [13] in our experiments. The batch size is set to 32, and the window size for the previous dialogue context is set to 3. For the data augmentation, we replaced slot values with new entities at every epoch so that the model

^1 [Online]. Available: https://github.com/alexa/alexa-with-dstc10-track2dataset

TABLE III QUANTITATIVE RESULTS ON THE DSTC10 VALIDATION SET
can learn the diverse entities that exist in the database of the evaluation set. The model is trained using the Adam optimizer [27] with an initial learning rate of 2e-5. During training, the ground truth of the current turn is used as the previous state for the next turn. In evaluation, on the other hand, the prediction result of the current turn is used as the previous state for the next turn, so the result of an earlier turn is propagated to the following turns. For the Levenshtein post-processing, we used the fuzzywuzzy^2 library to calculate edit distances between the generated slot value and the database candidates.

C. Results
Table III reports quantitative results on the DSTC10 validation set. For the baseline comparison, we compare our models with the TripPy [25] model, which is the official baseline in the challenge. We also report experimental results of the vanilla MinTL [5] to show how effective our proposed method is. We perform ablation analyses based on the MinTL model to explore how each method affects the performance improvement. A brief explanation of our proposed methods is as follows.
• MinTL + DA aims to reduce the data discrepancy between the training and evaluation datasets. Slot values related to the information of each item in the database, such as name, area, and type, are replaced with values existing in the evaluation database.
• MinTL + DA + Lev finds pre-defined slot values from the database for the slot type in which the value is generated, using the Levenshtein distance.
• MinTL + DA + Lev_c finds an item in the database through the Levenshtein distance for name only. In order to ensure that all generated slot values are consistent for the item, we fill the value of each slot type with the item's pre-defined value.
• MinTL + DA + Lev_c† is an ensemble of predictions from five MinTL + DA + Lev_c models.

Compared to the other single models, data augmentation and post-processing based on the Levenshtein distance significantly improve the joint goal accuracy. Specifically, data augmentation improves joint goal accuracy from 0.85 to 11.54 and slot accuracy from 75.04 to 85.37, compared to the MinTL model. Also, Levenshtein post-processing (Lev) results in an additional performance improvement of more than 10% in joint goal accuracy. We also report an ensemble of 15 models (MinTL + DA + Lev_c) trained with different random initial seeds, which achieves the highest performance on all evaluation metrics. In addition, when comparing the models with Lev and Lev_c, we observe that Lev_c significantly increases joint goal accuracy while slightly decreasing performance on some metrics related to individual slots. Even if the model predicts almost all of the slot values correctly, the joint goal accuracy is 0 if even one slot is predicted incorrectly. Since the dialogue state of each domain contains values of consistent items, Lev_c consistently replaces them with the item information retrieved based on the name. One fatal limitation of this method is that if the retrieved item is not the correct one, all replaced values can be wrong, so the performance on individual slots is degraded (slot accuracy: -0.7%, value prediction F1: -1.92%).
Table IV lists the official results for entry submissions by participants. We only report the teams that achieved above 10% joint goal accuracy. Each team submitted up to 5 prediction results, and the result with the highest joint goal accuracy was used for the final ranking. We submitted the predictions of MinTL + DA + Lev_c (ensemble) and took 3rd place in the challenge. Even though the winning team achieves remarkable results, it is notable that we obtain significant performance improvement compared to the baseline without using any other ASR corpora and without additional fine-tuning on the DSTC10 validation set.

D. Qualitative Analysis
Table V shows qualitative results on the DSTC10 validation set. In the first example, the value of the name slot is predicted as secon one when we train MinTL using the MultiWOZ dataset. The value of the area slot is also predicted as south, since the values are limited to north, south, east, and west in MultiWOZ. With data augmentation, the model predicts the name correctly, and the area value is converted to san francisco, which is a

TABLE V QUALITATIVE RESULTS ON THE DSTC10 VALIDATION SET
value from the DSTC10 database for area slots, even though it is a wrong answer. When we exploit Lev_c, the value gains more consistency than with models utilizing Lev. Because the restaurant um ma son is in the area of outer richmond and sells korean food, the conversion of the values has a positive impact on the results. Since there is an ASR error in the food type in the conversation (i.e., currying), the model inevitably predicts the food type based on the dialogue. Even though the Lev method brings the most similar value from the database, it is difficult to predict the correct value, as the raw value is completely unrelated to the ground truth (i.e., korean). Thus, Lev_c is highly effective in that it brings values for the consistent item. Similar consequences can be seen in the second example, especially in the type slot. By augmenting the training dialogues, MinTL + DA predicts the area correctly as embarcadero instead of centre. Moreover, Lev_c also aids in having non-irregular values by switching bike rental to public market.
Jungwoo Lim received the B.S. degree in library and information science from Sungkyunkwan University, Seoul, South Korea. She is currently working toward the Ph.D. degree with the Natural Language Processing and Artificial Intelligence Lab. Her research interests include dialogue systems, question answering, and relation extraction. She was also a Reviewer for ACL and EMNLP.
Taesun Whang received the M.S. degree in computer science and engineering from Korea University, Seoul, South Korea. He is currently working toward the Ph.D. degree with the Natural Language Processing and Artificial Intelligence Lab. His research interests include natural language processing, machine learning, and artificial intelligence. He was also a Reviewer for ACL, NAACL, and EMNLP.

Dongyub Lee received the M.S. degree in computer science and engineering from Korea University, Seoul, South Korea. He is currently working toward the Ph.D. degree with the Natural Language Processing and Artificial Intelligence Lab. His research interests include natural language processing, multi-modal retrieval, and XAI. He was also a Reviewer for ACL, NAACL, and EMNLP.
Heuiseok Lim received the B.S., M.S., and Ph.D. degrees in computer science and engineering from Korea University, Seoul, South Korea, in 1992, 1994, and 1997, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Korea University. His research interests include natural language processing, machine learning, and artificial intelligence.
^2 [Online]. Available: https://github.com/seatgeek/fuzzywuzzy

TABLE IV OFFICIAL RESULTS FOR TEST SUBMISSIONS BY DSTC10 PARTICIPANTS