ToD4IR: A Humanised Task-Oriented Dialogue System for Industrial Robots

Despite the fact that task-oriented dialogue systems have received much attention from the dialogue research community, only a handful have been studied in a real-world manufacturing context with industrial robots. One stumbling block is the lack of a domain-specific dialogue corpus for training such systems. Another difficulty is that earlier attempts to integrate natural language interfaces (such as chatbots) into the industrial sector have primarily focused on task completion rates; user experience, which is prioritized when designing dialogue systems for social robots, has received far less attention for industrial robots. To overcome these challenges, we provide the Industrial Robots Domain Wizard-of-Oz dataset (IRWoZ), a fully-labeled dialogue dataset covering four robotics domains. It delivers simulated conversations between shop floor workers and industrial robots, with over 401 dialogues, to promote language-assisted Human-Robot Interaction (HRI) in industrial settings. Small talk principles and human-to-human conversation strategies are incorporated to support human-like answer generation and to provide a more natural and adaptable dialogue environment that increases user experience and engagement. Finally, we propose and evaluate an end-to-end Task-oriented Dialogue system for Industrial Robots (ToD4IR) using two types of pre-trained backbone models, GPT-2 and GPT-Neo, on the IRWoZ dataset. We performed a series of trials to validate ToD4IR's performance in a real manufacturing context. Our experiments demonstrate that ToD4IR performs strongly on three downstream task-oriented dialogue tasks, i.e., dialogue state tracking, dialogue act generation, and response generation, on the IRWoZ dataset. The source code of ToD4IR and the IRWoZ dataset are accessible at https://github.com/lcroy/ToD4IR for reproducible research.

Furthermore, some research has been done on designing and developing such ToD systems for the manufacturing domain, especially in human-robot interaction (HRI) for industrial robots [10], [11]. While several studies have been presented to ground natural language commands for industrial robot manipulation [12], [13], the majority do not require dialogue datasets to train a neural network since they employ keyword-matching approaches, or their dialogue datasets are not publicly available. Additionally, while adopting a language-enabled ToD system to improve HRI for industrial tasks is a novel concept, there is currently no pre-built dialogue corpus for training such a system.
The contributions of our work can be summarized as follows:
• As a new benchmark, we provide the Industrial Robots Wizard-of-Oz dataset (IRWoZ), a scalable, manufacturing-domain-centered ToD dataset that can be used to create ToD systems in HRI for industrial robots.
IRWoZ is open to the public, facilitating successful collaboration between academics and industrial partners.
• We introduce small talk principles and human-to-human conversation strategies to support human-like response generation when building IRWoZ. To the best of our knowledge, this is the first effort to produce human-like responses in HRI for industrial robots.
• We offer ToD4IR, a conversational AI, fine-tuned on our IRWoZ corpus utilizing two types of State-of-The-Art (SoTA) pre-trained backbone models (GPT-2 and GPT-Neo).
• We demonstrate the robustness of our approach in a real-world factory setup.
The remainder of this work is structured as follows: Section II summarizes the related work. Section III then outlines how we created the IRWoZ dataset, the ToD4IR system architecture, and its essential components. Section IV presents the findings of the evaluation and discussion. The paper is concluded in Section V.

II. RELATED WORK
A. TASK-ORIENTED DIALOGUE SYSTEMS
In general, a ToD system aims to accomplish specific tasks through dialogue with the user, such as booking a taxi or ordering food. Such systems are typically classified into two categories: pipeline and end-to-end [14]. Compared to the pipeline approach, partial or complete end-to-end dialogue systems have received more attention in recent years [15]: the components are trained jointly, reducing the propagation and accumulation of sub-model errors. Furthermore, passing user feedback to a model is challenging in a pipeline pattern because the modules are interdependent [16]. The evaluation of such end-to-end ToD systems mainly focuses on dialogue state tracking and response generation. Lei et al. presented a sequence-to-sequence method incorporating dialogue state tracking and response generation [2]. Li et al. proposed an end-to-end neural dialogue system for achieving targets and reaching higher accuracy [17]. Perez et al. proposed an end-to-end memory network, a memory-enhanced neural network architecture, for dialogue state tracking [18]. The Alternating Roles Dialog Model (ARDM) uses a large pre-trained language model and does not require belief states or dialogue acts from human annotations [19]. Chen et al. introduced a graph attention network to extract information from utterances and graphs and leveraged a recurrent graph attention network to control state updating [20]. Wu et al. proposed ToD-BERT, a pre-trained model that captures dialogue behavior during pre-training and improves performance on downstream tasks, e.g., response selection [21]. Unlike ToD-BERT, Minimalist Transfer Learning (MinTL) used a copy mechanism to inject the previous dialogue states into the new one and improve end-to-end response generation [22]. In SimpleTOD, Hosseini-Asl et al. cast the whole ToD pipeline as a single sequence prediction problem, leveraged transfer learning from an open-domain pre-trained language model, and improved dialogue model performance [9].
SOLOIST used task-grounded pre-training to learn tasks while keeping annotation costs low for the training dataset, and reached a higher task success rate [23]. Human feedback is considered during the training stage of the end-to-end model in [24] to improve system performance. Dhingra et al. presented KB-InfoBot, an end-to-end differentiable dialogue agent, which improved the robustness of the system and the flexibility of question types [25].
The most related works to ours are [9], [23], which leverage pre-trained language models to build end-to-end ToD systems. However, those works do not distinguish between parts of the dialogue belief state but treat it as a whole entity. Our work divides the belief state into database-related and task-related groups, each with required and optional slots, based on their source and importance. This helps generate system actions and responses with higher efficiency and accuracy.
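As a minimal sketch (with hypothetical field names, not the paper's implementation), the four-way belief-state split described above might be represented as follows:

```python
# Hypothetical sketch of the four-way belief-state split: database-related vs.
# task-related slots, each either required or optional. Field names are invented.
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    db_required: dict = field(default_factory=dict)   # must match a database entry
    db_optional: dict = field(default_factory=dict)   # refines the database query
    task_required: dict = field(default_factory=dict) # needed to execute the task
    task_optional: dict = field(default_factory=dict) # extra task details

    def db_query(self) -> dict:
        """Only database-related slots participate in retrieval."""
        return {**self.db_required, **self.db_optional}

# Example: a delivery task.
state = BeliefState(
    db_required={"destination": "warehouse"},
    db_optional={"recipient": "anna"},
    task_required={"object": "package"},
    task_optional={"color": "red"},
)
print(state.db_query())  # only destination/recipient are sent to the database
```

Separating the groups this way makes it explicit which slots drive database retrieval and which merely parameterize task execution.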

B. CHIT-CHAT DIALOGUE SYSTEMS
In comparison to task-oriented dialogue systems, chit-chat systems place a higher premium on small talk, sociability, engagement, and humanness.
He et al. investigated the effects of various pre-train/fine-tune schemes. They found that some important language generation skills can be forgotten due to data separation, and proposed a mix-review method based on the data-mixture idea that effectively alleviates this forgetting problem [26]. Adiwardana et al. employed a sequence-to-sequence model, Meena, with 2.6 billion parameters. Compared to other chatbots, Meena, an open-domain model based on a multi-turn transformer architecture, can give unique and more reasonable responses. The paper also presented a new Sensibleness and Specificity Average (SSA) metric for human evaluation [27]. Sun et al. integrated chit-chat to enhance task-oriented dialogue, making virtual assistant conversations more engaging and interactive [28]. Roller and his research team argue that good conversation requires varied skills, including engagement, knowledge, empathy, and personality [4].
Zhang et al. developed DialoGPT, trained on a massive real-world Reddit dataset [29]. Moon et al. proposed a model with lightweight social-greeting annotations for a small chit-chat dataset [30]. Another work suggests randomly sampling utterances from a chit-chat corpus to improve out-of-domain recovery performance [31]. XiaoIce's design considers intelligence and emotion together, which advances communication, engagement, and social belonging [32]. Shu et al. proposed a meta-learning framework (DAML) trained on random source domains with disparate label sets, achieving high performance on an unknown target domain [33]. Ghandeharioun et al. proposed a self-play metric in which the dialogue system talks to itself. They showed that this metric correlates with human-rated quality for a dialogue model better than other automated metrics [34]. Rashkin et al. proposed a new model for empathetic dialogue generation and created a novel dataset of 25k emotional dialogues [5]. Mazare et al. focused on personal facts to make their chit-chat dialogue model more engaging [35]. Akasaki and Kaji annotated user utterances with chat/non-chat binary labels [3].
While the aforementioned chit-chat dialogue systems are primarily concerned with the open domain or function more like personal assistants, borrowing features such as small talk and engagement, which enable users to speak naturally and complete tasks more efficiently, is also critical for developing robust and humanized task-oriented dialogue systems.

C. DIALOGUE DATASETS
A dialogue dataset is critical for building data-driven conversational AI. Such datasets may cover many different domain categories. DailyDialog is a human-written dataset based on communication intentions and emotional information and achieves good performance in multi-turn dialogues [36]. The Cornell Movie-Dialogs Corpus is generated in a social context from movies; it contains a large, metadata-rich collection of fictional conversations extracted from raw movie scripts, including 304,713 utterances in total [37]. Sun et al. bridge the two baselines, comparing ACCENTOR-SGD and ACCENTOR-MultiWOZ with the original SGD [20] and MultiWOZ [38], [39], [40] datasets [28]. ConvAI2 is based on large pre-trained transformers [41]. BlendedSkillTalk is a conversation dataset of about 7k entries explicitly designed to exhibit multiple conversation modes, focusing on personality, empathy, and knowledge [42]. BooksCorpus collects books from the original Smashwords collection, where all books are written in English and contain at least 20k words; Wolf et al. used the BooksCorpus dataset to obtain the best perplexities and F1 scores [43]. The Ubuntu Dialogue Corpus contains over seven million utterances and 100 million words for multi-turn dialogues, supporting neural language models that require large amounts of unlabeled data [44].
While new datasets have been developed constantly, none of them are specifically targeted toward industrial robots. In our study, we create a novel dialogue corpus focusing on four common topics for HRI in industrial robotics. Additionally, we seek to increase shop floor workers' engagement and user experience by offering a natural, humanized conversation environment; that is, we enable ToD4IR to communicate in natural human language. To do this, ToD4IR must be trained on a corpus of natural human-to-human conversations to learn how to generate a human-like response. While the SoTA work [8] incorporates chit-chat to augment its ToD and utilizes pre-trained generative models to produce free-form chit-chat data, it requires a hybrid classifier to filter candidate datasets and crowd workers to annotate data. In contrast, our dataset is enriched by including small talk and human-to-human conversation strategies during the collection process. No extra classification or validation of candidate responses by human workers is required, as we collect the dialogue corpus directly using ''Wizard-of-Oz,'' a human-to-human technique.

III. METHOD
This section describes the approach for producing the IRWoZ corpus, which mixes small talk and human-to-human conversation strategies with task-oriented dialogue and is used for model training. The proposed ToD4IR's overall system architecture, as well as the backbone models trained at the dialogue-history level, are then presented.

A. DATASET CREATION
Among our contributions is the development of an industrial-robot-oriented dialogue corpus, IRWoZ, a fully-labeled collection of human-to-human conversations spanning four domains (assembly and relocation tasks of an industrial manipulator; delivery and positioning tasks of mobile industrial robots). It seeks to create simulated dialogues between shop floor workers and industrial robots in order to facilitate language-assisted HRI in an industrial setting, with a total of 401 dialogues.
This section describes the four-step process by which the IRWoZ dataset was created. Firstly, we investigated the most typical scenarios for HRI with industrial robots to identify areas ideal for incorporating language-enabled ToD. Secondly, we leveraged the WoZ method to collect domain-specific conversation datasets, which receive little attention and are not readily available online or on the market. Thirdly, we examined work-related small talk principles to guide humanized response generation and boost the user experience. Finally, we explored human-to-human conversation strategies to provide more meaningful task-oriented responses while maintaining a high task completion rate.

1) SCENARIOS FOR INDUSTRIAL ROBOTS
In general, industrial robots are employed in the following scenarios [45]: • Material handling, including material transfer, which requires the robot to move materials or work parts from one location to another, and machine loading and unloading, which utilizes a robot to load and unload parts at a production machine.
• Processing operations where a robot manipulates a tool to carry out a process on the work part, e.g., assembly, inspection.
• Service operations, where a robot fulfills several operational services, e.g., repairing manufacturing equipment and cleaning up waste and scrap after manufacturing tasks. In comparison to machine loading and unloading, ToD is more involved in the material transfer scenario, e.g., internal transportation, in which the user requests a delivery task by informing a mobile robot (e.g., Mobile Industrial Robot (MIR) 1 ) of the destination of the transportation, the recipient, and optional information about the parcel (e.g., size, color). Fig. 1 (b) depicts an example dialogue of such a scenario. Human collaboration with an industrial manipulator for assembly tasks (e.g., Universal Robots, 2 Franka Emika 3 ), on the other hand, is also common in a production environment where ToD fits well. For instance, a shop floor worker may request a Franka Emika robot to assemble a product via dialogue, in which the ToD system gathers information on the product type and amount and controls the robot to accomplish the assembly operation. An example dialogue for such an assembly scenario is illustrated in Fig. 1 (a). However, service operations, such as cleaning and repairing, are too task-specific to be generalized for ToD.

2) DATA COLLECTION -A WOZ WAY
In general, most ToD datasets are built based on existing dialogue systems [38]. To our knowledge, no ToD corpus for HRI with industrial robots is available. Inspired by the Wizard-of-Oz framework (WoZ) [46], the MultiWOZ datasets [38], [39], [40], and their successful validation in [15], [47], we build a dialogue corpus via a human-to-human data collection method. To obtain a more reliable and diverse dialogue corpus, invited participants must have backgrounds and skills related to Robotics, Automation, or Computer Science, with an emphasis on expertise in using industrial robots. Furthermore, factory workers and engineers are also included, given that their regular tasks involve operating industrial robots. As a result, an 18-member team was created, consisting of three factory engineers, six Ph.D. students, two master's students, and seven professors. All students have a background in Robotics/Automation and have experience programming and manipulating industrial robots. Among the seven professors, four work on industrial robots, one focuses on automation, and the other two have a background in computer science.
As seen in Fig. 2, the simulation environment is Aalborg University's Learning Factory [48]. The robot utilized in the simulation is one of our autonomous industrial mobile manipulators, Little Helper (LH), currently in its eighth generation [49]. LH combines a MiR 200 (on the bottom) and a Franka Emika Panda collaborative Robot (on the top), as seen in Fig. 2.

a: DIALOGUE TASK
An ontology of a dialogue task is formed by a domain, dialogue act type, and dialogue slots. Table 1 shows the defined ontology of the IRWoZ dataset.
As mentioned earlier, four domains are identified from the two scenarios (see Appendix C).
• Delivery. A transportation task, where a mobile industrial robot delivers a package.
• Position. A positioning task, where a mobile industrial robot edits its position/location (e.g., add a new position) on a 2D shop floor map.
• Assembly. A product assembly task, where an industrial manipulator assembles a requested product.
• Relocation. A relocation task, where an industrial manipulator assists with the relocation of an object, e.g., grasping, moving.
There are three types of dialogue acts: database-related, task-related, and general greeting. The database search conditions are set based on extracted slots if a task requires database querying. For instance, a general delivery task requires a worker to specify the destination. The ToD system should be able to verify whether or not a particular destination (i.e., location) exists in the system database. There are two types of database acts: required (DB_request_req) and optional (DB_request_opt) (e.g., the name of the recipient for a delivery task). The task-related dialogue act denotes additional information that a worker needs to offer to complete a task, e.g., the quantity for an assembly task. It also includes required (T_inform_req) and optional (T_inform_opt) slots. The verification results of the required or optional dialogue acts are specified in the search results (i.e., search_results). Additionally, general greeting acts, like ''thank you'' and ''goodbye'', are also provided.
Slots are the required core information (e.g., product, area) extracted from a worker's utterance to accomplish a manufacturing task. There are a total of 19 slots identified across the above four domains. Domains may share slots; for example, the slot color can specify the color of a product for an assembly task or the color of a package in a delivery task.
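The ontology described above can be sketched as a nested mapping from domains to dialogue act types to slot lists. Apart from the act-type names (DB_request_req, DB_request_opt, T_inform_req, T_inform_opt) and the slots explicitly mentioned in the text (product, area, color), the slot names below are illustrative placeholders, not the actual Table 1 contents:

```python
# Illustrative fragment of a domain ontology in the spirit of Table 1.
# Slot names beyond those mentioned in the text are invented stand-ins.
ONTOLOGY = {
    "delivery": {
        "DB_request_req": ["destination"],
        "DB_request_opt": ["recipient"],
        "T_inform_req": ["object"],
        "T_inform_opt": ["color", "size"],
    },
    "assembly": {
        "DB_request_req": ["product"],
        "DB_request_opt": [],
        "T_inform_req": ["quantity"],
        "T_inform_opt": ["color"],
    },
}

def shared_slots(ontology):
    """Return slots appearing in more than one domain (e.g., 'color')."""
    seen, shared = {}, set()
    for domain, acts in ontology.items():
        for slots in acts.values():
            for slot in set(slots):
                if slot in seen and seen[slot] != domain:
                    shared.add(slot)
                seen.setdefault(slot, domain)
    return shared

print(shared_slots(ONTOLOGY))  # 'color' is reused across delivery and assembly
```

Keeping the ontology as plain data like this makes it easy to add a new domain without touching the dialogue code.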

b: ROLE OF PARTICIPANTS
A human-to-human method is leveraged to collect the dialogue corpus for the above four domains. Each participant is introduced to the LH robot and instructed on how to operate LH to perform tasks. To collect data, one participant assumes the role of an industrial robot (i.e., the wizard), while the other acts as a shop floor worker.
The shop floor worker is asked to randomly choose a task and initiate a dialogue. Following the same approach as [38], sampled task specifications (e.g., Fig. 3) crossing the four domains are distributed to the shop floor worker. The wizard needs to respond to the shop floor worker according to the required task. The maximum conversational depth and the number of continuous dialogue turns for a task are not limited, to create a natural dialogue environment [51]. Since individuals structure their utterances differently even when assigned the same task, the shop floor worker is encouraged to compose each sentence uniquely.
The wizard is given access to the back-end database and a list of the robot-controlling APIs. Once the shop floor worker initiates a task-related conversation, the wizard responds appropriately. If the task requires database verification, the wizard needs to provide ground-truth results based on the database. The wizard then extracts task-related slots from the shop floor worker's utterance. To construct humanized responses for the ToD, the wizards are urged to compose their utterances in their own words. If preferred, wizards are also supplied with work-related small talk principles and task-related human-to-human conversation strategies.

3) SMALL TALK RESPONSE GENERATION: ARE PRINCIPLES
Unlike [28], we do not employ pre-trained models (e.g., GPT-2 [52], BlenderBot [6]) to generate candidate chit-chat responses and then annotate and filter them manually. Rather, we collect individual human responses for the IRWoZ dataset during the conversations. Observation of human collaboration demonstrates that most task-related conversations may also include small talk, which can be incorporated into the ToD process to build rapport, establish trust, and increase user engagement [53], [54], [55]. Inspired by [50], the Anchor, Reveal, and Encourage (ARE) principles are introduced to assist small talk response generation. Table 2 shows the descriptions of ARE and associated examples. Anchoring facilitates mutual understanding and builds a friendly dialogue that can lead to task development. For instance, a shop floor operator may ask the mobile robot to deliver a package to the warehouse where the main storage room is located. The ToD may respond by combining a task-related and a small talk response (the italic part): ''Sure, I will do that. I know that place, it is a quite busy area.'' Establishing an engaging dialogue likewise relies on trust. Individuals trust those who share more personal information with them (i.e., revealing). Automated assembly tasks may require the operator to instruct the robot to assemble an unfamiliar product. Therefore, instead of replying ''Sorry, I cannot.'', the ToD may say: ''Sorry, I cannot. It is quite new to me. I have not learned how to do that yet.'' Another way to enhance user engagement is to encourage users to speak and involve them in the conversation. The ToD may respond ''Hey, how are you? Are you having a nice day?'' when the operator greets the robot with ''Hey robot, good morning!''.

4) TASK RELATED RESPONSE ENHANCEMENT: CONVERSATION STRATEGY
While the use of small talk helps to increase user engagement, the primary objective of the developed ToD is to assist manufacturing tasks. Three human-to-human conversation strategies [51], ending topics with a suggestion, eliciting more information, and clarifying, are introduced in order to improve the task-completion rate while maintaining more humanised, task-related responses. Table 2 lists these strategies with examples. The ToD is expected not to give a plain ''yes'' or ''no'' answer to a requested task, but to mine the operator's intention to the greatest extent and provide alternative solutions to assist in completing the manufacturing task. For example, for a transportation task, the operator may accidentally give a location that is not registered in the system. The ToD may use the end topics with a suggestion strategy and propose a Position task to the operator to mark the location first, saying: ''Well, I don't know the place. Can you register it in the system first?'' The ToD should also be able to elicit more information from the operator if an instruction is not explicit. For example, the operator may forget to mention the required task-related information for an assembly task and ask: ''Hey robot, can you assemble ten pieces of this?'' The ToD may say: ''Can you describe a bit more of the thing you want to assemble?'' Clarification and confirmation are essential for task execution. Operators may alter tasks during the conversation, e.g., aborting or switching a task. For example, if the operator decides to abort a task during its execution, the ToD needs to understand whether that is what the operator wants or just a misunderstanding. Therefore, the ToD should be able to use the clarifying strategy to ask: ''Sorry, do you mean you want to abort the task?''

5) DATA STRUCTURE
To maintain high scalability, IRWoZ uses a data structure similar to that of the popular MultiWOZ 2.0 [38]. Each dialogue is divided into two sections: the domain and the turns. The turn section includes multiple dialogue turns, each comprising user and system utterances, belief states, and system actions. The dialogue is saved in JSON format. The left side of Fig. 4 illustrates an example of IRWoZ data. To extend IRWoZ, the user needs to follow the current data structure and add new JSON elements which include the desired domain, dialogue turns, database-related slots, task-related slots, and search results.
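A hypothetical IRWoZ-style dialogue entry might look as follows; the overall shape (domain, turns, belief states, system actions) follows the structure described above, but the exact key names are illustrative rather than copied from the released dataset:

```python
# Hypothetical IRWoZ-style dialogue entry. Field names follow the described
# structure (domain, turns, belief states, system actions) but are illustrative.
import json

dialogue = {
    "domain": "delivery",
    "turns": [
        {
            "user": "Hey robot, can you bring this package to the warehouse?",
            "belief_state": {
                "DB_request_req": {"destination": "warehouse"},
                "T_inform_req": {"object": "package"},
            },
            "system_action": {"search_results": "destination_found"},
            "system": "Sure, I will do that. I know that place, it is a quite busy area.",
        }
    ],
}

# Extending IRWoZ means appending new JSON elements of the same shape.
serialized = json.dumps(dialogue, indent=2)
restored = json.loads(serialized)
print(restored["domain"])
```

Because the structure is plain JSON, a new domain can be added by appending entries of the same shape without changing any parsing code.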

B. SYSTEM ARCHITECTURE
The proposed ToD4IR comprises three services: cognitive speech, human-robot dialogue, and robot control. Fig. 5 illustrates a high-level system architecture of ToD4IR.

1) COGNITIVE SPEECH SERVICE
The ToD4IR is designed to accept human speech as input and provide a near-human voice as output, creating a natural and flexible communication environment.
Generally, most popular conversational AIs support a trigger word, e.g., ''Ok Google'' or ''Alexa'', for activating the voice service. The service terminates if no active human voice is detected within a certain period. Though most smart speakers adopt this interaction strategy, such interaction is not as natural as human-to-human communication. Real-world human-to-human interaction demonstrates that there are no universal standards for constructing dialogues: work-related dialogue among participants might be continuous or discrete, with a particular greeting at the beginning of the conversation. Furthermore, observations from our previous work [10] also show that inaccurate human intent prediction is another concern, as a portion of a human's speech may fall between two consecutive voice detection periods. Therefore, the voice interface of ToD4IR is designed to keep listening to the human voice in the background rather than requiring the service to be invoked manually via trigger words. Thus, ToD4IR can provide a near-human ability to participate in a conversation at any moment without requiring the user to employ trigger words, as long as a task-related user utterance is identified.
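The always-listening behaviour described above can be sketched as a filtering loop: transcripts arrive continuously, and the dialogue service is only invoked once a task-related utterance is detected. The keyword heuristic below is a stand-in for the actual task-relevance check, which the paper does not specify:

```python
# Sketch of continuous background listening without trigger words: every
# transcript is inspected, but only task-related utterances reach the dialogue
# service. The keyword set is a placeholder for the real relevance check.
TASK_KEYWORDS = {"deliver", "assemble", "move", "position", "bring", "grasp"}

def is_task_related(transcript: str) -> bool:
    return any(word in transcript.lower().split() for word in TASK_KEYWORDS)

def listen_loop(transcript_stream):
    """Yield only utterances that should activate the dialogue service."""
    for transcript in transcript_stream:
        if is_task_related(transcript):
            yield transcript  # forwarded to the human-robot dialogue service
        # otherwise keep listening silently; no trigger word is needed

stream = [
    "nice weather today",
    "hey robot can you deliver this package",
    "see you later",
]
print(list(listen_loop(stream)))
```

In the real system the stream would come from the speech-to-text service rather than a list, but the filtering logic is the same.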
Two Microsoft cognitive speech services, 4 speech-to-text (STT) and text-to-speech (TTS), are leveraged to convert the streamed human voice into transcripts and to generate the near-human voice response, respectively.

2) HUMAN-ROBOT DIALOGUE SERVICE
As the core service of ToD4IR, the human-robot dialogue service (HRDS) is composed of five components: encoding the dialogue context, decoding the belief state, computing the system action, encoding the dialogue history, and decoding the system response. An SQLite database, 5 which stores real-time production data, is leveraged to support grounded response generation.

3) TASKS OF THE HRDS
Four tasks run on the HRDS: data preprocessing, belief state prediction, system action generation, and system response generation.

a: TASK 0: IRWOZ DATASET PREPROCESSING
We define the dialogue context at turn i as
C_i = {U_1, Sys_1, . . . , U_{i−1}, Sys_{i−1}, U_i},
where U_i represents the user's utterance at turn i, and Sys_i is defined as
Sys_i = Tres_i ⊕ Sres_i,
where ⊕ denotes text concatenation, and Tres_i and Sres_i represent the task-related and small talk-related responses, respectively. As aforementioned, the wizard may either use a hybrid response (i.e., only Tres_i, a self-organized natural response which might mix the small talk response with the task response) or concatenate Tres_i and Sres_i (guided by the ARE principles and human-to-human conversation strategies). The dialogue belief B collects the dialogue states,
B = {b_1, . . . , b_i},
where b_i denotes the dialogue state at turn i, defined as
b_i = DB_req_i ⊕ DB_opt_i ⊕ T_req_i ⊕ T_opt_i,
where DB_req_i and DB_opt_i indicate the required and optional belief states for database retrieval at turn i, and T_req_i and T_opt_i the required and optional task-related belief states. Both DB_opt_i and T_opt_i can be NULL if the operator does not provide them. The system actions, Sys_act, collect the verified results from the operator's utterance,
Sys_act_i = DB_SR_i ⊕ T_SR_i,
where DB_SR_i and T_SR_i denote the system actions generated at turn i based on the database search results and state tracking results, respectively. Additionally, the system responses, Res, are defined as
Res = {Res_1, . . . , Res_i},
where the response Res_i at turn i becomes part of the dialogue context, Sys_{i+1}, at turn i + 1.
4 https://azure.microsoft.com/en-us/services/cognitive-services/
5 https://www.sqlite.org/index.html
We define a dialogue, D, of the HRDS as the concatenated sequence
D = C_i ⊕ B ⊕ Sys_act ⊕ Res.
Therefore, the training dataset, as raw IRWoZ (see the original format on the left side of Fig. 4), needs to be reconstructed and annotated following the above dialogue structure before being passed to the HRDS. Table 3 defines the special tokens used to identify components of the raw text. One task of data preprocessing is to extract and annotate the raw text from IRWoZ and generate the expected dialogue dataset (see the right side of Fig. 4).
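Task 0 can be sketched as flattening one annotated turn into a single delimited training sequence. The response delimiters |boTres|/|eoTres| and |boSres|/|eoSres| appear in the paper (Fig. 4); the remaining delimiters below are invented stand-ins for the entries of Table 3, which is not reproduced here:

```python
# Sketch of Task 0: flattening one annotated turn into a single training
# sequence (context, belief, system actions, response). The |boTres|/|eoTres|
# and |boSres|/|eoSres| tokens come from the paper; the others are stand-ins.
def serialize_turn(user, belief, sys_act, task_res, small_talk_res=""):
    parts = [
        "|boU|", user, "|eoU|",
        "|boB|", " ".join(f"{s}={v}" for s, v in belief.items()), "|eoB|",
        "|boA|", " ".join(sys_act), "|eoA|",
        "|boTres|", task_res, "|eoTres|",
    ]
    if small_talk_res:  # hybrid responses carry no separate small talk span
        parts += ["|boSres|", small_talk_res, "|eoSres|"]
    return " ".join(parts)

seq = serialize_turn(
    user="deliver this package to the warehouse",
    belief={"destination": "warehouse", "object": "package"},
    sys_act=["DB_SR:destination_found"],
    task_res="Sure, I will deliver the [object] to the [destination].",
    small_talk_res="I know that place, it is a quite busy area.",
)
print(seq)
```

The autoregressive model is then trained on sequences of this shape, so the delimiters let it learn where each component begins and ends.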

b: TASK 1: PREDICT BELIEF STATE
ToD4IR follows a strategy similar to [9], [23], leveraging an autoregressive model (see III-B4) to generate the belief states, the task-related response, and the small talk-related response stepwise. The joint probability, p(x), over a given text sequence x = (x_1, . . . , x_n) of a dialogue can be factorized as
p(x) = ∏_{t=1}^{n} p(x_t | x_1, . . . , x_{t−1}).
The dialogue belief, b_i, at turn i is predicted based on the dialogue context, {U_1, Sys_1, . . . , U_{i−1}, Sys_{i−1}, U_i}, from turn 1 to turn i (only the user utterance at turn i). The predicted belief state, b_i, is a sequence of text formed as
b_i = T ⊕ (b_s_1, b_v_1) ⊕ . . . ⊕ (b_s_m, b_v_m),
where T represents the predicted domain, b_s and b_v stand for a belief slot and its slot value, respectively, and m is the total number of predicted belief slots. b_s should belong to the set of predefined slots, slt_T, of the desired task T,
slt_T = {slt_1, . . . , slt_n},
where n is the total number of task slots and slt_i is a slot related to the database or the task. The training objective of belief prediction, L(B_i^T), is defined as
L(B_i^T) = − Σ_{j=1}^{o} log p_θ(b_s_j | (b_s_1, . . . , b_s_{j−1}), C_i^T), (12)
where o is the total number of slots in a belief state sequence, and θ denotes the learned neural network parameters. If slt_j is an optional slot (i.e., DB_opt or T_opt), it will not be predicted unless it is detected in the operator's utterance.
[Figure: An overview of the ToD4IR model architecture, including three tasks: predict belief state, generate system actions, and generate system response.]
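The factorized objective in Eq. (12) reduces to summing token-level negative log-probabilities. A toy numeric illustration, with made-up probabilities rather than actual model outputs:

```python
# Toy illustration of the autoregressive belief loss: the loss of a belief
# sequence is the sum of -log p of each slot token given its predecessors.
# The probabilities below are made up for illustration.
import math

def belief_loss(token_probs):
    """token_probs[j] = p(b_s_j | b_s_1..b_s_{j-1}, C) as predicted by the model."""
    return -sum(math.log(p) for p in token_probs)

# Three belief tokens predicted with decreasing confidence.
loss = belief_loss([0.9, 0.8, 0.5])
print(round(loss, 4))
```

Note that the sum of negative logs equals the negative log of the product, so confident sequences (probabilities near 1) yield a loss near 0.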

c: TASK 2: GENERATE SYSTEM ACTIONS
System actions, Sys_act, include database-related actions and task-related actions. If database querying is required based on the predicted belief states, the corresponding database-related slots of the belief states are extracted as query parameters. The extracted task-related slots and the database querying results are mapped to system actions (i.e., T_act and DB_act), in which each slot slt_j is assigned a detected value, null, or undetected, where null means the operator does not provide the slot although it could be detected from the operator's utterance (if it is a task-related slot) or matched with the database query results (if it is a database-related slot). Naturally, undetected means that the slot remains undetected.
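A rough sketch of this slot-to-action mapping is shown below. The function and label strings are illustrative, and the database check is simplified to a single-row comparison:

```python
# Sketch of Task 2: each predicted slot is mapped to a system action value --
# the detected/matched value, "null", or "undetected". Names are illustrative.
def slot_action(slot, predicted_belief, db_row=None):
    if slot not in predicted_belief:
        return "undetected"
    value = predicted_belief[slot]
    if db_row is not None:  # database-related slot: verify against the query result
        return value if db_row.get(slot) == value else "null"
    return value if value else "null"  # task-related slot

belief = {"destination": "warehouse", "object": "package"}
db_row = {"destination": "warehouse"}

db_act = {"destination": slot_action("destination", belief, db_row)}
t_act = {"object": slot_action("object", belief), "color": slot_action("color", belief)}
print(db_act, t_act)
```

Here the destination is verified against the database, the object is taken directly from the belief state, and the unmentioned color slot stays undetected.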

d: TASK 3: GENERATE SYSTEM RESPONSE
Task 3 generates the delexicalized task-related system response (e.g., the text between '|boTres|' and '|eoTres|' on the right side of Fig. 4) and the small talk response (e.g., the text between '|boSres|' and '|eoSres|' on the right side of Fig. 4). The training objectives, L(Tres_i^T) and L(Sres_i^T), are as follows:
L(Tres_i^T) = − Σ_{k=1}^{N_k} log p_θ(Tres_k | (Tres_1, . . . , Tres_{k−1}), C_i^T, b_i, Sys_act_i),
L(Sres_i^T) = − Σ_{j=1}^{N_j} log p_θ(Sres_j | (Sres_1, . . . , Sres_{j−1}), C_i^T, b_i, Sys_act_i), (15)
where N_k and N_j represent the lengths of the task-related and small talk-related response sequences, respectively. L(Sres_i^T) is not calculated if the operator mixes the small talk response with the task-related response (see Fig. 5, hybrid response).
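After decoding, a delexicalized task response must be turned into the final utterance by filling slot placeholders from the belief state and appending the small talk span. The "[slot]" placeholder syntax below is an assumption for illustration, not taken from the paper:

```python
# Sketch of lexicalizing a delexicalized task response: slot placeholders are
# filled from the belief state, then the small talk span is appended.
# The "[slot]" placeholder syntax is assumed for illustration.
import re

def lexicalize(template, belief):
    # Unknown placeholders are left intact rather than dropped.
    return re.sub(r"\[(\w+)\]", lambda m: belief.get(m.group(1), m.group(0)), template)

belief = {"object": "package", "destination": "warehouse"}
task_res = "Sure, I will deliver the [object] to the [destination]."
small_talk = "I know that place, it is a quite busy area."

response = lexicalize(task_res, belief) + " " + small_talk
print(response)
```

Delexicalizing during training and lexicalizing at generation time keeps the model from having to memorize concrete slot values such as location names.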

4) BACKBONE MODEL
ToD4IR, like previous state-of-the-art (SoTA) works (e.g., SOLOIST [23], SimpleTOD [9], and MinTL [22]), is implemented using two auto-regressive pre-trained language models, GPT-2 and GPT-Neo [56]. These models were trained on massive volumes of open Web material and learned how to complete a sentence in a given context. Such models are widely used in downstream NLP tasks (e.g., machine translation, question answering, and, in our case, text generation), where they are fed a small task-specific dataset for fine-tuning into the desired final model. We briefly introduce GPT-2 and GPT-Neo, which are used in this paper.

b: GPT-NEO
Since the latest pre-trained language model, GPT-3 (with 175B parameters), has not yet been open-sourced, EleutherAI's GPT-Neo (https://github.com/EleutherAI/gpt-neo) is leveraged as our second backbone model. GPT-Neo is an open-source transformer model that replicates the GPT-3 architecture. It is trained on the Pile [58], an 825 GB English text corpus. Evaluation results on linguistic reasoning and physical and scientific reasoning show that GPT-Neo (with 2.7B parameters) resembles GPT-3 in performance.
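Loading either backbone for fine-tuning follows the standard Hugging Face causal-LM interface. The sketch below uses the public checkpoint names for the five variants evaluated later in the paper; the `load_backbone` helper is our own illustrative wrapper, not ToD4IR's code.

```python
# Public Hugging Face checkpoint ids for the five backbone variants
# (gpt2 family and EleutherAI's GPT-Neo).
BACKBONES = {
    "gpt2": "gpt2",
    "gpt2-large": "gpt2-large",
    "gpt2-xl": "gpt2-xl",
    "gpt-neo-1.3B": "EleutherAI/gpt-neo-1.3B",
    "gpt-neo-2.7B": "EleutherAI/gpt-neo-2.7B",
}

def load_backbone(name):
    """Load a causal-LM backbone for fine-tuning on IRWoZ.  Requires
    the `transformers` package; this is the standard AutoModel API,
    imported lazily so the checkpoint table stays usable without it."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    checkpoint = BACKBONES[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    return tokenizer, model
```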

5) ROBOT CONTROL SERVICE
Based on our previous work [10], ToD4IR is designed to be agnostic of robot hardware. The robot control service is implemented as a pair of components: a robot service management (RSM) component and a robot service execution (RSE) component. The output of the HRDS (i.e., task-related information and commands) is sent to the RSE, which grounds language instructions into robot actions by invoking robot control functionalities (e.g., package delivery) via communication protocols (e.g., TCP/IP). The RSM periodically synchronizes (once per week by default) the local robot services and skills with the ToD4IR server to update the robot services/skills registered on the local client side and the robot control scripts.
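A minimal sketch of the HRDS-to-RSE hand-off might look as follows. The JSON wire format, function names, and host/port values are assumptions; the paper only specifies that grounded commands reach the RSE over a protocol such as TCP/IP.

```python
import json

def build_rse_message(task, slots):
    """Serialise HRDS output (task command plus grounded slots) into a
    payload for the Robot Service Execution (RSE) component.  The JSON
    wire format is an assumption, not the paper's protocol."""
    return json.dumps({"task": task, "slots": slots}).encode("utf-8")

def send_to_rse(payload, host="localhost", port=9000):
    """Deliver the payload over TCP/IP (one of the protocols the paper
    mentions).  Host and port are placeholders."""
    import socket
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
```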

IV. EXPERIMENTS AND DISCUSSIONS
This section evaluates the proposed approach to address two research questions: Q1: Given that ToD4IR is a conversational AI for industrial robots, how does ToD4IR perform tasks within the four defined domains? Q2: Is the ToD4IR able to augment user experience through embedded small talk?

A. EXPERIMENTAL SETUP
We trained and evaluated ToD4IR on the Aalborg University Cloud, where each server is configured with an Intel(R) Xeon(R) Gold 5118 (12 cores), 256GB of memory, and an Nvidia V100 with 32GB of VRAM. ToD4IR is implemented based on HuggingFace Transformers 4.7.0, Microsoft DeepSpeed 0.4.0, and Torch 1.7.

1) PREPARATION OF DATASETS
While the WoZ method enables a more flexible and straightforward generation of low-noise dialogue corpora, its data collection and annotation processes are time-consuming and demand high accuracy. The primary difficulty lies in checking whether the appropriate slots are provided, validating whether the slots match database search results, and manually annotating the slots. Fig. 4 shows an example of a raw IRWoZ dialogue and the respective annotations.
To address this issue, a Flask-based (https://flask.palletsprojects.com/) IRWoZ web application was built. The application features two interfaces, one for the user and one for the Wizard. The user submits an utterance via the online form, and the extracted dialogue acts (e.g., delivery, position) are automatically detected and presented to the Wizard for checking via the web form. The Wizard is given two distinct input text fields to deliver task-related and small-talk-related responses. When the user confirms that the chat is complete, the web application automatically annotates and saves the conversation. Appendix B provides an example of the dialogue collection and annotation processes using this web application.
Eighteen individuals participated in the collection of the IRWoZ dataset. The dataset contains 158/42/101/100 dialogues spanning the assembly/delivery/position/relocation domains (see Fig. 7); each dialogue has at least two turns, and over 88.7% of system responses are augmented with small talk. Table 4 summarizes the essential components of the annotated dialogues with examples. To evaluate the proposed ToD4IR on the IRWoZ dataset, we divide it into 60%, 20%, and 20% for training, validation, and testing, respectively.
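The 60/20/20 split can be reproduced with a few lines of standard code. This is a generic sketch, not the authors' preprocessing script; the shuffling seed is illustrative.

```python
import random

def split_dialogues(dialogues, seed=42):
    """Shuffle the dialogue corpus and split it 60/20/20 into
    train/validation/test, as done for IRWoZ.  The seed is an
    illustrative choice."""
    data = list(dialogues)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])
```

Applied to the 401 IRWoZ dialogues, this yields 240/80/81 dialogues for training/validation/testing.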

2) AUTOMATIC EVALUATION
To answer the first research question, we train ToD4IR on the GPT architecture with five pre-trained language models: gpt2, gpt2-large, gpt2-xl, GPT-Neo (1.3B), and GPT-Neo (2.7B). ToD4IR follows the end-to-end dialogue pattern, in which the evaluation mainly involves two aspects: dialogue state tracking, and system action and response generation [38]. To evaluate the performance of ToD4IR, we use three metrics.
• Joint goal accuracy. The output of the dialogue state tracker is compared to the ground-truth label at each dialogue turn. The joint goal accuracy is the proportion of dialogue turns in which the values of all slots are correctly predicted.
• Slot accuracy. It compares each (domain, slot, value) triplet with the corresponding ground-truth label. Compared with joint goal accuracy, its evaluation granularity is finer.
• Bilingual evaluation understudy (BLEU) [59]. It is mainly used to measure the fluency of the generated text. Among these three metrics, joint goal accuracy and slot accuracy are commonly used for dialogue state tracking, and BLEU is leveraged for response generation.
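The two state-tracking metrics above can be sketched directly from their definitions. The dict-of-triplets representation of a dialogue state is an assumption for illustration; BLEU is omitted here since standard implementations (e.g., in NLTK or sacrebleu) are typically used off the shelf.

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of dialogue turns in which the entire predicted state
    (all domain/slot/value entries) matches the ground truth."""
    hits = sum(p == g for p, g in zip(pred_states, gold_states))
    return hits / len(gold_states)

def slot_accuracy(pred_states, gold_states):
    """Fraction of individual (domain, slot, value) triplets predicted
    correctly, pooled over all turns: a finer-grained metric."""
    correct = total = 0
    for pred, gold in zip(pred_states, gold_states):
        for triplet, value in gold.items():
            total += 1
            correct += pred.get(triplet) == value
    return correct / total
```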

a: DIALOGUE STATE TRACKING
This task aims to assess ToD4IR's ability to predict the dialogue state, which includes domain, slot, and value, in the given dialogue context. Table 5 compares ToD4IR's joint goal and slot accuracy across the various backbone models on the IRWoZ dataset. ToD4IR-gpt2-xl achieves the best slot accuracy (0.964), whereas ToD4IR-gpt-neo(2.7B) achieves the best joint goal accuracy (0.866).

b: SYSTEM ACTIONS AND RESPONSE GENERATION
In this task, ToD4IR should generate system actions and responses given the ground-truth dialogue states and database search results. Due to the WoZ data collection method, the ground truth of the database search results is manually incorporated into the system actions of the IRWoZ dataset. As a result, ToD4IR eliminates the requirement for database searches during the training and validation processes. Table 6 compares the BLEU-1 to BLEU-4 scores of the five ToD4IR models on the IRWoZ dataset. ToD4IR-gpt2-large outperforms the other models, with scores of 0.6013, 0.5349, 0.5032, and 0.4763, respectively.

3) HUMAN EVALUATIONS
Human assessments provide a complete picture of response-generation performance, particularly for generating human-like responses [60]. We adopt the human evaluation questions from [8], which cover how engaging, interesting, human-like, and knowledgeable the responses are, for the response generation assessment.
As mentioned, 20% of the dialogue corpus is selected as the test set. We feed these data samples to the five models, ToD4IR-gpt2, ToD4IR-gpt2-large, ToD4IR-gpt2-xl, ToD4IR-gpt-neo(1.3B), and ToD4IR-gpt-neo(2.7B), to obtain their responses. Task-related responses and small talk responses are highlighted in each of the dialogues.
To verify the validity of the assessment findings, domain specialists in robotics, computer science, linguistics, and the humanities, as well as various end-users such as shop floor workers and lab engineers, were invited to serve as evaluators. The evaluators are asked to rate the responses on a predetermined scale (available at https://shorturl.at/cmES1; we obtained ethics approval from Aalborg University for the online questionnaire). Each number on the scale denotes a different quality level, ranging from Not at all to Absolutely. For instance, the question How much would you prefer to talk to the ToD4IR? is used to determine whether the user is engaged and willing to speak with the ToD4IR.
In some circumstances, particularly when evaluating engagingness and knowledgeability, the dialogue context is essential for judging the generated system response. However, the reference context is not required to evaluate fluency (absence of grammatical errors), interestingness, and human-likeness. We therefore provide the context for each generated response in our evaluation (Appendix C presents six examples and the web interface for online evaluation).
We report each model's individual scores in Table 7, summarizing how engaging, interesting, human-like, and knowledgeable the dialogue was. Table 8 reports the overall score of ToD4IR.

1) DATASET
Although the IRWoZ is offered as the first industrial-oriented conversation dataset for human-robot interaction in the manufacturing sector, it is a small-scale dialogue corpus with limited domain coverage. Additionally, an imbalanced data distribution (i.e., delivery occupies only 11% of the conversation corpus) affects ToD4IR's performance on the delivery task.

2) NOISE-LABELS
In comparison to other datasets, such as MultiWoZ 2.0, IRWoZ contains far fewer mis-annotations, which contributes to the high accuracy of dialogue state tracking (see Table 5). The first reason is that the IRWoZ corpus was compiled by 18 individuals, including students, academics, and shop floor workers with backgrounds ranging from robotics to computer science; the collected dialogue corpora are therefore considerably cleaner and more professional. Second, a web application (source code available on our GitHub) was created to support the data collection. The user can directly engage with the application on the four domain tasks, and its back end automatically annotates and verifies each dialogue, including annotations and database search results. Additionally, human verification is performed as the last step, ensuring that the collected dialogue corpora contain fewer mis-annotations than other datasets.

3) DIALOGUE RESPONSE
One of the objectives of our method is to train ToD4IR to generate more natural and human-like responses. As a result, natural human responses are expected during the dialogue simulation. However, on examining the collected data, we realized that when users communicate with the system, they use a less complicated version of their language. For instance, a simple response of ''Ok.'' is typically used to affirm a system-generated inquiry, whereas ''Yes, I believe so.'' or ''Yes, you are correct.'' is frequently used when speaking with a human. The participants confirmed this trend during data collection. The other observation is that ten of the eighteen participants preferred to structure their small talk responses themselves instead of directly applying the ARE principles.

4) PRE-TRAINED MODELS
ToD4IR is mainly driven by the gpt2-large and gpt-neo(2.7B) models with pre-trained weights. Each model is trained for 20 epochs. However, the largest model, ToD4IR-gpt-neo(2.7B), surpasses ToD4IR-gpt2-large by only 0.01 on joint goal accuracy while scoring around 0.05 lower on BLEU-1 to BLEU-4. Additionally, we compare ToD4IR against three other GPT-type models. The overall evaluation results indicate that shallower neural networks outperform deeper ones when ToD4IR is trained on a small-scale dataset.

V. CONCLUSION
In this study, we present ToD4IR, a humanized task-oriented dialogue system aimed at industrial robots. The first industrial-oriented dialogue corpus, IRWoZ, is constructed with 401 dialogues spanning four industrial tasks: delivery, assembly, position, and relocation. To aid ToD4IR in generating natural and humanized responses, we use small talk principles, known as ARE, along with human-to-human conversation strategies. Built on GPT-type pre-trained language models, ToD4IR can predict the dialogue state and generate system actions using real-time database search results.
Additionally, it can generate responses that support both task completion and an enhanced user experience, combining task-related responses (based on user goals) with small-talk-related responses. Experiments demonstrate that ToD4IR achieves high accuracy in dialogue state tracking and fluency in the generated responses.
We hope that the proposed IRWoZ will inspire the dialogue research community and industrial partners to continue investigating language-assisted human-robot interaction in manufacturing, contributing dialogue corpora for new industrial domains, and fine-tuning the pre-trained ToD4IR for new industrial tasks.
In the future, we intend to conduct extensive user studies to gather feedback on the naturalness and coherence of ToD4IR-generated dialogues and responses once COVID-19 constraints are removed.

APPENDIX A IDENTIFIED FOUR DOMAINS FOR ToD4IR
Fig. 8 shows the four identified domains, position, delivery, assembly, and relocation, and the corresponding scenario examples. The HRI dialogue corpus is collected based on tasks from these four domains. Fig. 9 displays the web interface designed for collecting dialogue corpora. One operator uses User Mode to start the dialogue based on the task specification; the other operator takes on the role of the Wizard and responds to inquiries in System Mode.

APPENDIX C GENERATED EXAMPLES
A. GENERATED EXAMPLES BY DIFFERENT MODELS
Table 9 shows six examples of responses generated by ToD4IR based on the five models. The responses are generated from the given context of the test dataset. Dialogue states and system actions are not shown in this table.