Overview of the Tenth Dialog System Technology Challenge: DSTC10

This article introduces the Tenth Dialog System Technology Challenge (DSTC-10). This edition of the DSTC focuses on applying end-to-end dialog technologies to five distinct tasks in dialog systems, namely 1. Incorporation of Meme images into open-domain dialogs, 2. Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations, 3. Situated Interactive Multimodal Dialogs, 4. Reasoning for Audio Visual Scene-Aware Dialog, and 5. Automatic Evaluation and Moderation of Open-domain Dialogue Systems. This article describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight the general trends in state-of-the-art technologies for the tasks.


I. INTRODUCTION
The Dialog System Technology Challenge (DSTC) is one of the leading series of research competitions in the space of dialog systems. Since its inception in 2013, DSTC has been accelerating the development of dialog technologies by bringing leading researchers and engineers together to solve important problems in dialog systems. The challenge has been evolving every year to cater to the demand and interest of the dialog community and to foster the development of technology.
The first version of the challenge (initially called the Dialog State Tracking Challenge) [1] used human-to-bot dialogs in the bus timetable domain. Dialog State Tracking Challenges 2 [2] and 3 [3] used restaurant reservation applications, which introduced more complicated and dynamic dialog states. Dialog State Tracking Challenge 4 [4] and Dialog State Tracking Challenge 5 [5] moved to tracking human-to-human dialogs in mono- and cross-language settings. From the sixth challenge [6], the DSTC rebranded itself as the "Dialog System Technology Challenge" and organized multiple tracks in parallel to address a wider variety of dialog-related problems. The tracks in DSTC-6 were focused on end-to-end conversation modeling and dialog breakdown detection. DSTC-7 [7] focused on developing end-to-end dialog technologies for noetic response selection [8], grounded response generation [9], and audio visual scene aware dialog [10]. Then, in DSTC-8 [11], the focus was on a diverse set of four tracks that included multidomain task completion, predicting responses, audio-visual scene-aware dialog, and schema-guided dialog state tracking. More recently, DSTC-9 [12] focused on unstructured knowledge access in dialogue systems, multidomain task-oriented dialogs, dialog evaluation, and situated multi-modal dialog modeling.
For the tenth edition, we received track proposals from leading research organizations and top universities. The proposals went through a formal peer review process focusing on each task's potential for (a) impact on the community, (b) novelty of the task, (c) feasibility of the proposal, and (d) potential participants. Participants in previous DSTC editions were also asked to provide their feedback on the presented track proposals through a survey, and their responses were also considered in the evaluation. Finally, we ended up with five main tracks, including two newly introduced tasks and three follow-up tasks extending previous challenges.
Track 1, MOD: Internet Meme Incorporated Open-Domain Dialog, aims to incorporate contextualized internet memes into multi-turn open dialogues. Track 2, Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations, focuses on benchmarking the robustness of conversational models against the gaps between written and spoken conversations; it extends last year's challenge on unstructured knowledge access in task-oriented dialogues. Track 3 of this year, SIMMC 2.0: Situated Interactive Multimodal Conversational AI, is a continuation of last year's track, aimed at laying the foundations for real-world assistant agents that can handle multi-modal inputs and perform multi-modal actions. Track 4, Reasoning for Audio Visual Scene-Aware Dialog, aims to promote the combination of conversation systems and multimodal reasoning algorithms into a single framework, where systems must learn to produce answers without human-created video captions. Finally, Automatic Evaluation and Moderation of Open-domain Dialogue Systems (track 5) mainly focuses on developing effective automatic evaluation metrics that perform robustly across a range of dialogue evaluation tasks. The following sections describe the details of each track.

II. TRACK 1: MOD: INTERNET MEME INCORPORATED OPEN-DOMAIN DIALOG

A. Track Overview
Internet memes have become one of the most important vehicles for expression and emotion in social media and messaging communication [13], [14], [15]. A meme, a type of content in the visual format of an image, GIF, or short video, can inject humor into conversations and create an emotional context [16]. Compared to emojis, which are limited in variety and size, memes are more expressive and engaging. Although there is increasing interest in chatbots that can converse with humans using multiple modalities [17], [18], incorporating contextualized internet memes into multi-turn open dialogues under different situations is still underexplored. This challenge therefore addresses a new task, Meme incorporated Open-domain Dialogue (MOD), where models are required to generate a vivid response in text-only, meme-only, or mixed form, provided with a multimodal dialogue context. There are three main tasks, as introduced in [19]: text response modeling, meme retrieval, and meme emotion classification, as listed in Table I. The data and baseline system are publicly available.1

B. Task and Data
1) Meme Incorporated Open-Domain Dialogue Task: Participants are expected to build multi-modal dialogue systems based on the MOD dataset. Provided with a dialogue history consisting of utterances interleaved with Internet memes, the dialogue system aims to generate an engaging response in the form of text only, meme only, or a mix of both. We further split the current scope of MOD into the following three tasks, as shown in Table I: (1) Text Response Modeling: given the multimodal history context, the task aims to generate a coherent and natural text response. (2) Meme Retrieval: given a multimodal historical context and a generated text response, the goal is to select a suitable meme as feedback. (3) Meme Emotion Classification: given the multimodal history, the goal is to predict the emotion type expressed when responding with an internet meme.
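To make the three task formats concrete, the following sketch shows a hypothetical MOD-style data instance; the field names and values are illustrative assumptions, not the official data schema.

    # A hypothetical (illustrative, not official) MOD data instance showing the
    # inputs and targets of the three tasks; field names are assumptions.
    dialogue = {
        "history": [
            {"speaker": "A", "text": "I finally passed my driving test!", "meme_id": None},
            {"speaker": "B", "text": "Congratulations!", "meme_id": 17},  # meme used in reply
        ],
    }

    # Task 1 (text response modeling): history -> next text utterance.
    target_text = "Let's go for a drive this weekend!"

    # Task 2 (meme retrieval): (history, generated text) -> best meme from the 307-meme set.
    target_meme_id = 42

    # Task 3 (meme emotion classification): predict the emotion expressed by the chosen meme.
    target_emotion = "happiness"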
2) Data Collection: a) Step 1: Pre-processing: For the Internet meme set, meme candidates are first collected from the Internet and then chosen carefully by annotators to maintain good quality. In addition, if textual information appears in the selected Internet meme content, it is also annotated manually. To avoid a model only utilizing the textual information and ignoring visual features, we control the proportion of memes without embedded text in the final set at 40%. Meanwhile, to avoid multiple appropriate memes being selectable under one dialogue condition, we filter out memes with highly similar or duplicate semantic content. Finally, we obtain a total of 307 Internet memes for the subsequent data annotation process. To facilitate the arrangement and annotation process, the Internet meme set is further split into four groups: atmosphere adjustment, basic expression, basic emotion, and common semantics.

b) Step 2: Internet meme incorporated response construction:
The annotators, who are well educated and familiar with dialogue research, perform one of two operations with the prepared Internet meme candidates: replacing part of the text of an utterance with the most suitable Internet meme, or inserting an Internet meme into the utterance to enhance the emotion of the current dialogue. In addition, we ask annotators to label their emotional state when using the current Internet meme. The annotators are instructed to follow two criteria: (i) behave naturally, so that meme usage is in line with real daily chats, and (ii) keep the number of occurrences of different Internet memes in the dataset balanced to avoid meaningless clusters and biased data.

c) Step 3: Quality control: Before formal annotation, annotators are asked to annotate training samples until their results pass our examination. During annotation, to eliminate subjective inconsistency and make the annotation reliable, several specialized workers continually monitor the collected dialogue data and perform periodic quality checks on samples. After this checking, we sample 10% of the data and manually verify the samples ourselves.
d) Dataset Statistics: Detailed statistics of the MOD dataset are summarized in Table II. The MOD dataset has an average of 13.93 turns per dialogue, and each turn contains 11.6 tokens on average. The text is tokenized by the Chinese BERT tokenizer [20], and the vocabulary size is 13,086. We also plot the usage frequency of Internet memes and the corresponding emotions in Figs. 1 and 2, respectively. Although the dialogue systems are evaluated on MOD, participants can leverage any public datasets and pretrained models to build their models. In the evaluation phase, we use two unlabeled test sets: an easy test set and a hard test set.

C. Evaluation Criteria
Each participating team submitted up to five system outputs, each of which contains the results for all three tasks on the two unlabeled test sets. We first evaluated each submission using the automatic task-specific objective metrics shown in Table III by comparing to the ground-truth labels and responses. Considering the limitations of automatic text response evaluation metrics, we selected the top-3 finalists based on the metric scores to be manually evaluated for task #1 along the following four aspects:
• Correctness: whether there are grammatical errors in the machine-generated text response.
• Relevance: whether the generated text response is related to the historical content of the conversation.
• Fluency: whether the generated response is natural and smooth, in line with human conversation habits.
• Informativeness: whether the generated text response contains sufficient information; generic replies are considered to be missing valid information.
In addition, we required annotators to give an overall score based on the above four aspects. All scores are integers ranging from 1 to 5. The annotated data consist of 2,000 history-answer pairs each for the easy and hard test sets, randomly chosen from the submitted entries of each team.

D. Results and Analysis
We received 22 entries in total from 5 participating teams, setting a new state of the art in all three subtasks. To preserve anonymity, the teams were identified by numbers from 1 to 5, while our baseline [19] was listed as team 0. Table IV presents the evaluation results of the best entry from each team on the automatic metrics for the different tasks.
a) For Task #1: Text Response Modeling: Table V presents the human evaluation results for the task #1 participating teams. Team 1 wins the text response modeling task on the easy test set, while team 3 ranks first on the hard test set. Team 1 scores higher on correctness and relevance, while Team 3 obtains the highest scores in fluency and informativeness. The large gap between teams in automatic evaluation, contrasted with the relatively small gap in human evaluation, shows that the automatic metrics are not reliable for open-domain dialogue.
b) For Task #2: Meme Retrieval: Team 3 achieves over 90% Recall_10@5 on the easy test set, and also the highest scores on the hard test set. They treat meme retrieval as a matching problem and employ a cross-encoder architecture for relevance estimation trained with negative sampling. The large performance gap between the easy and hard tests also reveals that generalization ability remains limited for meme retrieval.
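As an illustration of this kind of cross-encoder relevance estimation, the following minimal Python sketch scores each candidate meme jointly with the dialogue context; the model choice (a Chinese BERT with a regression head) and the use of each meme's annotated text are assumptions, not Team 3's actual implementation.

    # A minimal cross-encoder relevance-scoring sketch for meme retrieval,
    # inspired by (not identical to) the approach described above.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=1)
    model.eval()

    def score_memes(context: str, response: str, meme_texts: list[str]) -> list[float]:
        """Score each candidate meme's relevance to (context, response)."""
        scores = []
        for meme_text in meme_texts:
            # Cross-encoder: jointly encode the dialogue side and the meme side.
            inputs = tokenizer(context + " " + response, meme_text,
                               truncation=True, max_length=512, return_tensors="pt")
            with torch.no_grad():
                scores.append(model(**inputs).logits.squeeze().item())
        return scores

    # Training would pair each gold meme with sampled negatives from the
    # 307-meme set and optimize a ranking or binary cross-entropy loss.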
c) For Task #3: Meme Emotion Classification: Team 3 achieves the highest scores of 89.5% on the easy test and 49.9% on the hard test. In particular, they devise an auxiliary method called Emotion-Enhanced Masked LM to improve meme emotion recognition. Meanwhile, Team 2 integrated historical memes and constructed a good-quality candidate set to reduce the difficulty of model learning and advance multimodal content understanding. Again, there is a large gap between the easy and hard tests.

E. Conclusion
In this section, we described the task definition, provided datasets, and evaluation setup for the DSTC10-MOD track. The top systems are all built with transformer-based end-to-end learning and follow the pre-training and fine-tuning paradigm. Incorporating extra data for contrastive learning can effectively improve the robustness and generalization of a model, and well-designed self-supervised tasks can boost a system's multi-modal information fusion and understanding. Although the submissions advance well beyond the baseline, we believe that the MOD task is worth further exploration and can benefit the modeling of multi-modal open-domain dialogue intelligence in the future, especially regarding how to better exploit the visual features of memes.

III. TRACK 2: KNOWLEDGE-GROUNDED TASK-ORIENTED DIALOGUE MODELING ON SPOKEN CONVERSATIONS

A. Track Overview
Recently, more public data sets and benchmarks have become available for dialogue research on task-oriented conversations in various domains [21], [22], [23], [24]. However, most data sets include only written conversations collected by crowdsourcing via web interfaces, which differ from spoken conversations for the following reasons. First, there are differences between the styles of spoken and written conversations, even for the same context, intention, and semantics. Second, spoken conversations tend to have extra noise from grammatical errors, disfluencies, or barge-ins, which are rarely encountered when processing written text. Finally, speech recognition output is not perfect and contains errors, which brings additional challenges for developing spoken dialogue systems in practice.

TABLE VI: SUMMARY OF TRACK 2 TASKS

Research communities have rarely addressed these issues in more contextual dialogue tasks, including dialogue state tracking, dialogue policy learning, or end-to-end dialogue response generation, which are as important as the single-turn understanding tasks in fully working dialogue systems. This is mainly due to the lack of rich, annotated spoken data for such multi-turn dialogue tasks.
To benchmark the robustness of conversational models on spoken conversations, this challenge track introduces a new data set with spoken task-oriented dialogues for two subtasks: 1) multi-domain dialogue state tracking [23] and 2) knowledge-grounded dialogue modeling [34], as summarized in Table VI. Our new data includes the ASR output instead of manual transcripts for the user turns, which aims to evaluate how robust each model is against ASR errors. The remainder of this section presents the data details and reports the evaluation results of the entries submitted by the challenge track participants.

B. Data
To study speech-based task-oriented dialogue modeling, we collected spoken human-human dialogues about touristic information for San Francisco. Each session was collected by pairing two participants: one as a user and the other as an agent. We provided a set of specific goals to the user-side participant before each session. The agent-side participant had access to the domain database including both structured information and unstructured text snippets. We recorded 890 sessions, around 45 hours in total, and manually transcribed all the utterances. Table VII shows the statistics of the DSTC10 data. For each user turn, we provide the ASR output instead of the manual transcript. Our ASR model is based on the wav2vec 2.0 model [35] that was pre-trained on 960 hours of Librispeech [36] and then fine-tuned with 10% of our validation data. This model achieved a WER of 26.25% at 1-best and an oracle WER of 24.31% at 10-best hypotheses on the user utterances in our test set.
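For reference, WER can be computed with a standard word-level edit distance, as in the following minimal sketch (this is the textbook definition, not the track's exact scoring script):

    # Minimal word error rate (WER) via Levenshtein distance over words.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("book a table for two", "book table for too"))  # 0.4: 1 deletion + 1 substitution over 5 words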

C. Evaluation Criteria
Each participating team submitted up to five system outputs for either or both tasks. For task 1, we performed only automatic evaluations by comparing the submitted DST predictions with the ground-truth labels. We calculated the joint goal accuracy (JGA) as the main evaluation metric, as well as the slot-level scores listed in Table VI.
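A minimal sketch of JGA, which credits a turn only when the entire predicted dialogue state matches the gold state exactly (the slot naming below is illustrative):

    # Joint goal accuracy: a turn counts as correct only if the full predicted
    # dialogue state matches the gold state exactly.
    def joint_goal_accuracy(predictions: list[dict], golds: list[dict]) -> float:
        """Each state is a dict mapping 'domain-slot' -> value, one per turn."""
        correct = sum(pred == gold for pred, gold in zip(predictions, golds))
        return correct / len(golds)

    golds = [{"hotel-area": "north", "hotel-stars": "4"}, {"hotel-area": "north"}]
    preds = [{"hotel-area": "north", "hotel-stars": "4"}, {"hotel-area": "south"}]
    print(joint_goal_accuracy(preds, golds))  # 0.5: only the first turn matches fully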
For task 2, we use the same evaluation criteria and metrics as in DSTC9 Track 1 [37]. First, for each submission we calculated the task-specific objective metrics (Table VI) by comparing to the ground-truth labels and responses. Then, we aggregated the multiple scores across different tasks and metrics into a single overall score computed by the mean reciprocal rank (a minimal sketch of this aggregation follows the list below). Based on the overall objective score, we selected the finalists to be manually evaluated by two crowd-sourcing tasks:
• Appropriateness: This task asks crowd workers to score how well a system output is naturally connected to a given conversation on a scale of 1-5.
• Accuracy: This task asks crowd workers to score the accuracy of a system output based on the provided reference knowledge on a scale of 1-5.
Finally, we used the average of the Appropriateness and Accuracy scores to determine the official ranking of the submissions to task 2.
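The following sketch illustrates one plausible form of this mean-reciprocal-rank aggregation; tie handling and normalization details in the official scoring script may differ.

    # Aggregate per-metric scores into one overall score via mean reciprocal
    # rank: each entry is ranked on every metric, and its overall score is the
    # mean of 1/rank across metrics.
    def overall_scores(entries: dict[str, dict[str, float]]) -> dict[str, float]:
        """entries: entry_id -> {metric_name: score}; assumes higher is better."""
        metrics = next(iter(entries.values())).keys()
        overall = {e: 0.0 for e in entries}
        for m in metrics:
            ranked = sorted(entries, key=lambda e: entries[e][m], reverse=True)
            for rank, e in enumerate(ranked, start=1):
                overall[e] += 1.0 / rank
        return {e: s / len(metrics) for e, s in overall.items()}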

D. Results
We received a total of 99 submissions: 40 entries from 11 teams for task 1 and 59 entries from 16 teams for task 2. Six of the teams participated in both tasks. To preserve anonymity, the teams were identified as A01-A11 for task 1 and B01-B16 for task 2.
1) Task 1 Results: Table VIII shows the task 1 evaluation results of the best entries from each team, selected based on JGA. We differentiated between single-model and ensemble-based entries and categorized the core methods into value classification, span extraction, value generation, or hybrid approaches combining more than one of them. A key observation is that the generative models outperformed the classification- and extraction-based methods, consistent with findings on written conversations. We suppose this demonstrates the benefit of generation-based DST in terms of its robustness against unseen values, different styles, and noisy transcriptions in our test data. On the other hand, most span extraction models failed to predict accurate dialogue states, because many of the spans extracted from spoken dialogue contexts with lexical variations and ASR errors are not correct dialogue state values. Another finding from the highly-ranked teams is that they commonly made substantial efforts in data augmentation to account for the difference between the training and test data sets. In particular, Team A11 achieved the best performance by trying various data augmentation methods including value substitutions, synthetic data generation, and speech/ASR simulation. In addition, model ensembles also helped to boost performance over the single-model results. We observed performance gains from model ensembles for all three teams (A11, A10, and A09) that submitted entries in both settings. Notably, the ensemble-based entry from the winning team A11 was significantly better than their single model and also all the entries from other teams.
2) Task 2 Results: Table IX shows the objective evaluation results of the best entry from each team, selected based on the overall score. Most entries improve over the DSTC9 [37] and Knover [39] baseline models in all three tasks. Team B10 achieved significantly better knowledge selection results than all the other teams, which may be attributed to the large amount of augmented data they generated as well as their enhanced negative sampling methods. For response generation, the top 8 teams achieved at least two to three times higher scores than the baselines on the key automatic generation metrics. This is mainly because of their efforts on style transfer from written to spoken language in response generation. For example, Team B08 introduced a noisy channel model to guide the generated responses towards more spoken styles, which helped them obtain the best scores on all the automated generation metrics computed against the reference responses from spoken human-human conversations.
We selected 8 finalists to be manually evaluated, corresponding to the best entry from each of the top 8 teams in the overall objective score. Table X shows the official ranking of the finalists based on the human evaluation results. Team B10 won task 2 with the highest scores for both Accuracy and Appropriateness. A notable observation is that Team B10 ranked only in the middle on the automatic NLG metrics, due to the lack of style transfer mechanisms in their system. Nonetheless, their system responses were preferred by the crowd workers in the human evaluation over the other entries, even those with much higher objective scores.
Consistent with our DSTC9 track results [37], the best team on the knowledge selection task again ended up as the final winner after the human evaluation. Most participating teams adopted the pipelined system architecture of the baselines, including three models for detection, selection, and generation, each of which was fine-tuned from large-scale pre-trained language models. Notably, three of the top-4 teams introduced a separate entity tracking component for knowledge selection to narrow down the search space before document ranking. In addition, all the top-4 teams for task 2 utilized augmented data to train their models. Finally, model ensembles further improved performance.

E. Conclusion
We presented the official evaluation results of our DSTC10 track on Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations. This challenge track addressed the multi-domain dialogue state tracking and knowledge-grounded conversational modeling tasks on spoken task-oriented conversations. We released validation and test data sets including 890 dialogues collected from spoken human-human conversations. A total of 21 teams participated, with an overall number of 99 entries submitted. From the evaluation results, we learned the following two key factors for achieving high performance in both tasks: data augmentation for better generalization to unseen data, and ensembling of different model outputs.

IV. TRACK 3: SIMMC 2.0: SITUATED INTERACTIVE MULTIMODAL CONVERSATIONAL AI

A. Track Overview
The SIMMC challenge aims to lay the foundations for real-world assistant agents that can handle multimodal inputs and perform multimodal actions. We thus focus on task-oriented dialogs that encompass a situated multimodal user context in the form of a co-observed image or virtual reality (VR) environment. The context is dynamically updated on each turn based on the user input and the assistant action. Moon et al. [40] (SIMMC 1.0) and Kottur et al. [41] (SIMMC 2.0) provide more details on the datasets and baseline models.

B. Data
The SIMMC 2.0 dataset contains about 11k human-to-human dialogs (totaling about 117k utterances). We chose shopping experiences, specifically furniture and fashion, as the domains for the SIMMC datasets because of the dynamic environments they create, where rich multimodal interactions happen around visually grounded items.
SIMMC offers several key advantages over previous multimodal dialog datasets: 1) SIMMC assumes a co-observed multimodal context between a user and an assistant and records the ground-truth item references. 2) SIMMC emphasizes semantic processing of the input modalities, whereas work in this area has traditionally focused heavily on raw image processing; the proposed SIMMC annotation schema allows for a more systematic and structural approach to the visual grounding of conversations, which is essential for solving challenging problems in real-world scenarios. 3) SIMMC 2.0 provides photo-realistic scenes that change over time (via viewpoint updates), moving away from the sanitized contexts present in many multimodal datasets.

C. Evaluation Criteria
We present four subtasks primarily aimed at replicating human-assistant actions in order to enable rich and interactive shopping scenarios.
1) Subtask 1: Multimodal Disambiguation: identifying whether a given user turn contains ambiguity in referring to objects in the scene. As defined in [41], given the dialog history and the current user utterance, multimodal disambiguation requires the agent to predict a binary label, conditioned on the multimodal context, indicating the presence of a referential ambiguity in the user utterance. We use accuracy to measure and compare model performance on this task.
2) Subtask 2: Multimodal Coreference Resolution: requires the dialog system to resolve referential mentions in user utterances to their canonical object IDs as defined for each scene. These mentions can be resolved through (1) the dialog context (e.g., A: 'This shirt comes in XL and is $29.' → U: 'Please add it to cart.'), (2) the multimodal context (e.g., U: 'How much is that red shirt?'), or (3) both (e.g., U: 'How much is the one next to the one you mentioned?'). The main evaluation metrics are F1, precision, and recall.
3) Subtask 3: Dialog State Tracking (DST): aims to systematically track the dialog acts and the associated slots across multiple turns, as represented in the flexible ontology developed for the SIMMC multimodal context. We use the intent and slot prediction metrics (F1), in line with prior work in DST.
4) Subtask 4: Response Prediction: examines the relevance of the assistant response in the current turn. We evaluate in two ways: (a) as a conditional language modeling problem, where the closeness between the generated and ground-truth responses is measured using the BLEU-4 score, and (b) as a retrieval problem, where we measure the model performance when retrieving the ground-truth response from a pool of 100 candidates (randomly chosen and unique to each turn).
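The two evaluation modes can be sketched as follows; the NLTK smoothing choice and the scoring interface are assumptions rather than the official SIMMC 2.0 evaluation scripts.

    # Sketch of the two subtask-4 evaluation modes: BLEU-4 for generation and
    # recall@k for retrieval from a 100-candidate pool.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu4(reference: str, generated: str) -> float:
        return sentence_bleu([reference.split()], generated.split(),
                             weights=(0.25, 0.25, 0.25, 0.25),
                             smoothing_function=SmoothingFunction().method1)

    def recall_at_k(scores: list[float], gold_index: int, k: int = 1) -> bool:
        """scores: model scores for the 100 candidates; True if gold is in the top k."""
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return gold_index in ranked[:k]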

D. Results
The challenge saw a total of 16 model entries from 10 teams across the world, setting a new state of the art in all four subtasks (Table XI).
For each subtask, we listed metrics in priority order, and the entry with the most favorable performance on the highest-priority metric was considered the candidate winner. The winner of the multimodal disambiguation subtask (subtask 1) was the BART+ResNet model from Team 6; this model also won the MM-DST subtask (subtask 3). The winner of the multimodal coreference resolution subtask (subtask 2) and the response retrieval subtask (4b) was a BART-based multimodal model from Team 4. The joint winners of response generation (4a) were Teams 5 and 10.

V. TRACK 4: REASONING FOR AUDIO VISUAL SCENE-AWARE DIALOG

A. Track Overview
Recent artificial intelligence (AI) research has accelerated the development of technologies required for advanced human-like capabilities in machines, such as robots. For instance, current computer vision technologies can accurately perceive visual scenes, and spoken dialog systems can transcribe speech and understand a speaker's intention. However, one important piece of technology is missing: natural and context-aware human-machine interaction, where machines understand their surrounding scene from the human perspective and are able to share their understanding with humans using natural language. To invent machines that can communicate with humans about objects and events in surrounding scenes, the Audio-visual Scene-aware Dialog (AVSD) project was kicked off [6], [11], [43], [44]. An automated system that can converse with humans about video scenes via natural dialogs is a challenging research problem. The goal of AVSD in DSTC is question-answering-based conversation on videos from daily life. To this end, the AVSD challenge task was designed based on the popular Charades dataset [45], with two goals: (1) generate answers to questions about objects and events in the video clips, and (2) hold a meaningful dialog with humans about objects and events using conversational frameworks. To promote further advancement toward real-world applications of the AVSD setup, a third challenge was proposed in DSTC10, progressively extending the previous video-based scene-aware dialog tracks. The new task is to generate system responses to queries occurring during a dialog about a video, using reasoning features and without using the human-created video description. Participants used the video, audio, and dialog text data to train end-to-end models without the manual descriptions. This challenge reused the AVSD datasets collected for the previous challenges; additional datasets with temporal reasoning annotations for QA were collected for DSTC10.

B. Audio-Visual Scene-Aware Dialog Data Set
For AVSD in DSTC10, the same AVSD data collected by [43] were used. Table XII shows the size of the data used for DSTC10. For DSTC10, additional data for temporal reasoning were collected: humans watched the videos and read the dialogues, then identified segments of the video containing evidence to support each given answer. Fig. 3 shows the annotation tool for reasoning. With this tool, humans identified temporal segments based on visual and/or audio evidence and filled in the appropriate fields with begin and end timestamps to provide temporal reasoning.

C. Baseline Model
A baseline system has been built for the DSTC10 AVSD track, utilizing an AV-transformer architecture [46]. The system employs a transformer-based encoder-decoder, including a bimodal attention mechanism [47], [48] that lets it learn interdependencies between audio and visual features.

TABLE XIII: SUBMITTED SYSTEMS TO THE DSTC10-AVSD TRACK
The audio-visual encoder extracts VGGish [49] and I3D [50] features from the audio and video tracks, respectively, and encodes them using self-attention, bimodal attention, and feed-forward layers. The decoder receives the encoder outputs and the dialog history up to the current question, and then generates the answer sentence. At each step, it receives the preceding word sequence and predicts the next word by applying M decoder blocks and a prediction network. The self-attention layer converts the word vectors to high-level representations that capture temporal dependencies. The bimodal source attention layers update the word representations based on their relevance to the encoded multi-modal representations. A feed-forward layer is then applied to the outputs of the bimodal attention layers. Finally, a linear transform and a softmax operation are applied to the output of the M-th decoder block to obtain the probability distribution of the next word.
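The following PyTorch sketch shows one decoder block in this spirit; the dimensions, the additive fusion of the two attended modalities, and the post-norm layout are assumptions rather than the exact baseline implementation.

    # A minimal decoder block with bimodal source attention, loosely following
    # the AV-transformer description above.
    import torch
    import torch.nn as nn

    class BimodalDecoderBlock(nn.Module):
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

        def forward(self, words, audio_enc, video_enc, causal_mask=None):
            # 1) Self-attention over the word sequence (with a causal mask).
            x, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
            words = self.norms[0](words + x)
            # 2) Bimodal source attention: attend to encoded audio and video.
            a, _ = self.audio_attn(words, audio_enc, audio_enc)
            v, _ = self.video_attn(words, video_enc, video_enc)
            words = self.norms[1](words + a + v)  # simple additive fusion (assumed)
            # 3) Position-wise feed-forward layer.
            return self.norms[2](words + self.ffn(words))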

D. Temporal Reasoning
Temporal reasoning is the task of finding evidence supporting the generated answers, where the evidence corresponds to human-annotated time regions of the video that were identified as supporting each ground-truth answer. Human annotators were allowed to choose multiple time regions for each question-answer pair, but most annotations consist of a single region.

E. Submitted Systems and Evaluation
The AVSD task received 12 system submissions from 5 teams. This section summarizes the techniques used in the systems submitted to the AVSD challenge, including the baseline system. Table XIII lists the baseline and submitted systems with brief specifications, including the encoder-decoder model type, multimodal fusion type, audio-visual video features used, and additional techniques or data sets.
In this challenge, the quality of a system's automatically generated sentences is evaluated using objective measures of similarity between the system-generated responses and ground-truth responses provided by humans. For this purpose, we needed to collect additional human-generated responses to each test question (the original dialog contains only a single human response to each question). To collect more possible human answers to the test question for each test video, we asked 5 humans to watch the video, read the dialogue about the video (up to the test question) between a questioner and an answerer, and then provide an answer to the test question.
To evaluate the systems, we compared their outputs with six ground-truth human answers, consisting of the one original answer and the five newly collected answers. We used the MSCOCO evaluation tool for the objective evaluation of system outputs. The supported metrics include word-overlap-based metrics such as BLEU, METEOR, ROUGE_L, and CIDEr. In addition, we collected human ratings for each system response using a 5-point Likert scale, in which humans rated system responses given a dialog context. We asked the human raters to consider the correctness of the answers as well as the naturalness, informativeness, and appropriateness of the response given the context. The reasoning performance was measured by Intersection over Union (IoU), the ratio of overlap between the predicted and ground-truth time regions (higher is better). IoU-1 is the average, over ground-truth regions, of the IoU between each ground-truth region and the predicted region giving the highest IoU for that ground truth. IoU-2 is computed by frame-level matching among all predicted and ground-truth regions for each answer.
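Temporal IoU and the IoU-1 average can be sketched as follows (IoU-2's frame-level matching is analogous); this follows the definitions above, not the official evaluation code.

    # Temporal IoU between (begin, end) regions, and the IoU-1 average.
    def iou(a: tuple[float, float], b: tuple[float, float]) -> float:
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def iou1(predicted: list, ground_truth: list) -> float:
        """For each ground-truth region, take the best-matching prediction."""
        return sum(max(iou(g, p) for p in predicted) for g in ground_truth) / len(ground_truth)

    print(iou1(predicted=[(2.0, 6.0)], ground_truth=[(3.0, 7.0)]))  # overlap 3s / union 5s = 0.6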
Table XIV reports the numerical results of all qualifying submitted systems (entries) from all teams. The subjective human ratings described above are given in the rightmost column of the table; the others are objective scores computed using word-overlap metrics (BLEU, METEOR, ROUGE_L, and CIDEr) and reasoning metrics (IoU-1 and IoU-2). Fig. 4 plots the human ratings for each system in several ways; in all three panels, the systems are shown in the same order on the x-axis.
We tested our baseline model in two settings: with matched videos and with shuffled videos. As indicated in Table XV, there is some degradation in the scores between the two settings, but the performance gaps are relatively small. This result suggests that text information is dominant in the AVSD task and, at the same time, that the expressive power of the baseline video features, i.e., I3D and VGGish, is insufficient. In other words, developing more advanced video features was one of the important issues in this challenge. Indeed, some teams in the DSTC10-AVSD challenge applied more advanced video features and reported substantial improvements; for example, Team 4 introduced TimeSformer to extract video features.

F. Conclusion
The third AVSD challenge promoted further advancement toward real-world applications, where 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence in the video to support each answer. The submitted systems provided high-quality answers and reasoning even without human-generated descriptions at inference time. The DSTC10 winning system achieved 90.2% of human performance based on human ratings. This result is considerable, but the gap to human performance is larger than in DSTC8 (98.4%), showing that continued research is still needed to reach human performance. The data setup, baseline system, and evaluation tools have been released to facilitate continuous improvement by the community after DSTC10.

VI. TRACK 5: AUTOMATIC EVALUATION AND MODERATION OF OPEN-DOMAIN DIALOGUE SYSTEMS

A. Track Overview
Our track consists of two tasks: (1) automatic open-domain dialog evaluation and (2) safe chatbot development. The goal of the first task is for participants to design robust automatic dialogue evaluation metrics that correlate well with human judgements across multiple dialogue domains as well as across different dialogue evaluation dimensions, such as naturalness, appropriateness, etc. The goal of the second task is for participants to build generative models that first detect a toxic user comment and then generate an appropriate and polite response that keeps the dialogue fluid and non-toxic.

B. Data
1) Task 1 - Automatic Dialogue Evaluation: As an evaluation benchmark, we released 14 publicly available datasets for the participants to tune their proposed metrics during the development phase. During the final evaluation phase, we collected five hidden test evaluation datasets for assessing participants' submissions. The datasets and final leaderboards are publicly available on the ChatEval platform.2 Each turn-level dataset is a collection of context-response pairs. The context refers to a list of consecutive utterances extracted from a human-human conversation. The response is produced by a dialog model conditioned on the context. Each context-response pair was assessed by several human judges along different evaluation criteria. The 19 evaluation datasets cover a large number of distinct dialogue domains, such as daily chitchat [56], knowledge exchange [57], and persona-based conversations [58], and a large number of different evaluation criteria, such as naturalness, interestingness, response appropriateness, etc. More details can be found in [59].
2) Task 2 - Safe Chatbots Development: Several datasets were preprocessed and reformatted from their original sources as part of the Chat/Dialogue Modeling and Evaluation (CHANEL) task held during the 2020 Seventh Frederick Jelinek Memorial Summer Workshop.3 All selected datasets are organized into turn pairs (prompt-answer) and processed using Microsoft Azure Cognitive Services to automatically detect toxic turns. We then selected those pairs where the prompt was detected as toxic but the answer was not. To reduce false positives in the prompts and false negatives in the answers, we filtered the Azure results by passing all detected turns through a dictionary consisting of the 320 most common swear words in English. Concretely, the datasets we used include: (1) MovieDic [60], (2) the Cornell Movie Dataset [61], (3) ChatCorpus [60],4 and (4) DSTC8-Reddit [62].5 Refer to [59] for more details of the four datasets, which are anonymized. Besides the toxicity detection process, we extracted additional features: humour scores and detected emotions. Humour scores are extracted using the pretrained ColBERT model [63]. Emotion detectors were trained on four different datasets [56], [64], [65], [66], distinguishing up to 7 different emotions: happiness, sadness, fear, anger, surprise, disgust, and neutral [67]. To further assess task difficulty, we manually annotated a subset of the test data. In total, 1,290 prompt-answer pairs were annotated by 7 annotators from three different geographical zones (3 in the USA, 3 in Europe, and 1 in Asia). An annotation guideline, with no examples, was prepared to avoid biasing responses. Refer to [59] for details of the annotation guideline.
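The selection and dictionary-filtering step can be sketched as follows; the toxicity flags stand in for the Azure Cognitive Services output, and the tiny word list is a placeholder for the 320-entry dictionary, not its actual contents.

    # Illustrative sketch of the toxic-prompt / safe-answer pair selection.
    SWEAR_WORDS = {"darn", "heck"}  # placeholder entries, not the real dictionary

    def contains_swear_word(text: str) -> bool:
        return any(w in SWEAR_WORDS for w in text.lower().split())

    def select_pairs(pairs):
        """pairs: (prompt, answer, prompt_toxic, answer_toxic), flags from the API."""
        selected = []
        for prompt, answer, prompt_toxic, answer_toxic in pairs:
            # Keep pairs with a toxic prompt and a non-toxic answer, then use
            # the dictionary to reduce false positives (prompt) and false
            # negatives (answer) in the API output.
            if prompt_toxic and not answer_toxic:
                if contains_swear_word(prompt) and not contains_swear_word(answer):
                    selected.append((prompt, answer))
        return selected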

C. Baselines
For each task, we provide a baseline system. For the automatic dialogue evaluation task, we adopt the deep AM-FM framework [68], an ensemble metric.6 We modify the framework to a reference-free version whereby, for AM, we compute the cosine similarity between the sentence-level embedding of the response and that of the last sentence in the corresponding dialogue context. For FM, we use the formulation of the context-response coherence metric in HolisticEval [69].
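A minimal sketch of this reference-free AM computation is shown below; the sentence encoder used here is an assumption for illustration, not the embedding model of the actual baseline.

    # Reference-free AM sketch: cosine similarity between embeddings of the
    # response and the last utterance of the dialogue context.
    from sentence_transformers import SentenceTransformer
    from numpy import dot
    from numpy.linalg import norm

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

    def am_score(context: list[str], response: str) -> float:
        last_utterance = context[-1]
        e_ctx, e_resp = encoder.encode([last_utterance, response])
        return float(dot(e_ctx, e_resp) / (norm(e_ctx) * norm(e_resp)))

    print(am_score(["Hi!", "Any plans for the weekend?"], "I'm going hiking."))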
For the "Safe Chatbots Development" task, participants are provided with a baseline system based on DialoGPT: a GPT-2 model pretrained on 147 M multi-turn dialogues from Reddit threads [70] and finetuned on our provided training data. 7

D. Evaluation Criteria
In task 1, we adopted the Spearman correlation to assess the participants' submissions. We rank the submissions based only on their performance on the five test evaluation datasets: we compute the Spearman correlation between the submitted metric scores and the corresponding mean human annotation scores per evaluation dimension for each evaluation dataset. In task 2, we conducted both automatic and human evaluation. For automatic evaluation, we adopted four different objective metrics: a) BLEU [71], b) ROUGE-L [72], c) BERTScore [73], and d) BLEURT [74]. For human evaluation, we performed a pairwise ranking of the system-generated responses given a toxic prompt; a subset of 160 toxic prompts was randomly selected from the golden test set for the pairwise analysis.
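The per-dimension scoring for task 1 can be sketched as follows, using illustrative numbers rather than real submission data.

    # Spearman correlation between a submitted metric's scores and the mean
    # human annotation scores for one evaluation dimension of one dataset;
    # dimension-wise correlations are then averaged across datasets.
    from scipy.stats import spearmanr

    metric_scores = [0.91, 0.42, 0.77, 0.13, 0.66]  # one score per context-response pair
    human_scores = [4.8, 2.5, 4.1, 1.2, 3.9]        # mean human rating per pair

    rho, p_value = spearmanr(metric_scores, human_scores)
    print(f"Spearman rho = {rho:.3f}")  # 1.000 here: the two rankings agree perfectly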

E. Results
1) Task 1 - Automatic Dialogue Evaluation:
In task 1, we received 21 and 35 submissions from nine different teams for development and testing, respectively. Table XVI presents the main correlation results of each team on the five test datasets. For each row in the table, we show the Spearman rank correlations of the corresponding team's best submission. Each entry in row 6 is computed by averaging the 11 dimension-wise correlation scores over all five test datasets. Each dimension-wise correlation score is computed between the metric scores assigned to all data instances within a test dataset and the corresponding human-annotated scores along one evaluation criterion of that particular dataset.
Remarkably, Teams 1, 5, and 8 all rely on ensembling multiple sub-metrics for evaluation, with the weights for combining the different sub-metrics dynamically learnt from data. This finding is in line with the observation made by Yeh et al. [75], which highlights the advantage of combining multiple sub-metrics.
2) Task 2 - Safe Chatbots Development: Unfortunately, there were no submissions for this task. Hence, we decided to test the performance of three existing state-of-the-art chatbots on our annotated golden test set (described in Section VI-B2). The three chatbots are: a) the pretrained baseline released to the participants (a fine-tuned version of DialoGPT [70]), b) BlenderBot 2.0 (including its safety layer) [76], [77], and c) GPT-3 [78] (the DaVinci version).8 Table XVII shows the automatic evaluation results for each chatbot. The results for the word-overlap metrics (BLEU and ROUGE) are very low due to the large differences between the system-generated responses and the corresponding human references. On the other hand, the semantic metrics (i.e., BERTScore and BLEURT) show marginal differences between chatbots, with BlenderBot 2.0 performing slightly better.
Table XVIII and Fig. 5 show the performance of the chatbots and humans on the annotated test set.

F. Conclusion & Future Work
We conclude the track with several important points that can benefit the future development of automatic dialogue evaluation metrics and safe dialogue systems. (a) In task 1, we notice that every team's performance on the development data is much better than on the hidden test data (by around 31.4% on average), except for the baseline, which performs better on the test data (around 34.5% better), probably due to its simpler mechanism for combining different evaluation dimensions and some topic overlap between the training data and the test sets (e.g., the topical and persona datasets). Hence, future work should explore models with better generalization to out-of-distribution evaluation (i.e., robustness). This research direction towards robust and generalizable metrics is also highlighted by Mehri et al. [79]. (b) We standardize a large number of dialogue evaluation datasets and release a ready-to-use, high-quality benchmark to meta-evaluate different capabilities of automatic dialogue evaluation metrics, such as domain generalization, multi-dimensionality, and robustness. The benchmark helps dialogue researchers and practitioners holistically assess their newly proposed automatic dialogue evaluation metrics. (c) The second task only scratches the surface of how to deal with toxic users. As there are currently not enough resources on this topic, we provide data and baseline systems that can help advance the development of safe chatbots. (d) Future work may focus on more advanced techniques for detecting different types of toxicity and for addressing them. In addition, avoiding the use of toxic words is just a first step in reducing toxicity; there are other, more complex toxic scenarios to address.

Fig. 2. Histogram of the top-10 annotated emotions when memes are used in Track 1. Positive emotions (pink) occur significantly more often than negative emotions (blue).

Fig. 5. Performance of the different chatbots and human answers on the annotated test set.
VII. CONCLUSION
This article summarized the five tracks of the Tenth Dialog System Technology Challenge (DSTC10). MOD: Internet Meme Incorporated Open-domain Dialog incorporates internet memes into open-domain dialogues. Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations focuses on robustness in spoken conversations. The Situated Interactive Multimodal Conversational AI track focuses on real-world assistant agents that can handle multi-modal inputs and perform multi-modal actions. Reasoning for Audio Visual Scene-Aware Dialog promotes a multimodal reasoning task in conversational scenarios. Finally, Automatic Evaluation and Moderation of Open-domain Dialogue Systems targets new metrics, self-supervised methods, and non-toxic response generation for open-domain dialog systems. All datasets and resources introduced for each track remain publicly available after the challenge period to support future dialog system research.

TABLE I: SUMMARY OF DSTC10 TRACK 1 TASKS

TABLE II: STATISTICS OF THE TRACK 1 DATA SETS
Fig. 1. Internet meme frequency in the Track 1 dataset. Meme usage is balanced without significant bias. Meme ids greater than 274 occur only in the hard test set.

TABLE VII: STATISTICS OF THE TRACK 2 DATA SETS

TABLE XI: SUMMARY OF THE RESULTS ON THE TEST-STD SPLIT

TABLE XII: AVSD DATASET FOR DSTC10

TABLE XIV: DSTC10-AVSD EVALUATION RESULTS WITH WORD-OVERLAP-BASED OBJECTIVE MEASURES BASED ON 6 REFERENCES, A SUBJECTIVE MEASURE BASED ON 5-LEVEL RATINGS BY HUMANS (HR), AND REASONING PERFORMANCE BASED ON INTERSECTION OVER UNION (IOU)
Fig. 4. Statistics of human rating scores.

TABLE XV: COMPARISON OF ANSWER QUALITIES WITH MATCHED VERSUS SHUFFLED VIDEOS

TABLE XVI: MEAN SPEARMAN CORRELATIONS (%) FOR THE BASELINE AND EACH TEAM'S BEST SUBMISSION ON THE 5 TEST DATASETS
TABLE XVII: OBJECTIVE METRICS FOR TESTED CHATBOTS IN SUBTASK 2
TABLE XVIII: HUMAN PERFORMANCE ON THE SUBTASK 2 TEST SET. PERCENTAGES USE TOTAL ANNOTATED ITEMS FOR EACH CHATBOT