Robots and Wizards: An Investigation Into Natural Human–Robot Interaction

The goal of the study was to research different communication modalities needed for intuitive Human-Robot Interaction. This study utilizes a Wizard of Oz prototyping method to enable a restriction-free, intuitive interaction with an industrial robot. The data from 36 test subjects suggest a high preference for speech input, automatic path planning and pointing gestures. The catalogue developed during this experiment contains intrinsic gestures, suggesting that the two most popular gestures per action can be sufficient to cover the majority of users. The system scored an average of 74% in different user experience questionnaires, despite containing deliberately introduced flaws. These findings allow a future development of an intuitive Human-Robot Interaction system with high user acceptance.


I. INTRODUCTION
The factory of the future, the so-called smart factory, relies on intelligent, independently operating and globally networked systems [1]. As part of the fourth industrial revolution (Industry 4.0), a new possibility for production was introduced: Human-Robot Interaction (HRI). Robots are not operated behind a protective barrier as usual, but work in the same space as humans. Three levels of cooperation between humans and robots have been established: coexistence, cooperation and collaboration [2].
To enable collaboration and integrate humans into these smart surroundings, intelligent systems and sensors are required, since humans lack a direct digital interface. Human communication is complex. Only a part of a spoken message depends on the actual words, while the vocal information (tone and other sounds) and nonverbal information (body stance, facial expression, gestures), depending on the context, sometimes play the major role, as observed by Albert Mehrabian in his studies [3]. That is why the quality of interaction is crucial and a lot of testing is needed. What is the use of a great expensive system that is not safe or not applicable? The effort of programming as many scenarios as possible would not be economical. One way to overcome this issue is to use the well-established method of Wizard of Oz (WoZ) to evaluate and adapt existing, unfinished systems.
The associate editor coordinating the review of this manuscript and approving it for publication was Maurizio Tucci.

A. WIZARD OF OZ
The WoZ methodology allows the use of technology as a rapid prototype that would be too costly or time consuming to implement otherwise or is generally not available yet [4], [5]. The subjects are led to believe that they are interacting with an autonomous machine when in reality they are interacting with a human operator. For the duration of the experiment, the operator secretly controls the system's actions. The operator is concealed from the participants by working remotely from another room, or hidden in plain sight, e.g. as an experiment observer pretending to take notes, or as a fake participant.
This method has proven useful in many different Human-Machine Interaction (HMI) studies [6]-[10]. It can also be used iteratively to further improve the design of the system [11]. WoZ gives us the opportunity to take a more general approach that facilitates a natural, intuitive HMI.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. RELATED WORK
Many ideas have been developed to further improve the human interaction with industrial robots [12]-[16]. Among others, they propose augmented reality (AR) concepts to give visual feedback [12], [16] or use external tools as an interface [12], [14]. For example, Zaeh & Vogl [12] successfully presented an interactive laser-projection system for programming industrial robots. Although noise in industrial environments is a challenge for automatic speech recognition systems, Pires [13] has shown that it can be overcome with the use of headsets. Serrano and Nigay [10] propose a component-based WoZ approach for prototyping multimodal user interfaces. Another approach has been undertaken by Speicher and Nebeling [17] with ''GestureWiz'', a tool that uses the WoZ technique to quickly model gesture interfaces. Hoffman [18] describes another framework, ''OpenWoZ'', which is specifically made for HRI studies.
We implemented our own WoZ-framework in order to allow a restriction-free, multimodal HRI, since none of the mentioned techniques provided all of the necessary features (speech, gesture, head pose, gaze, posture, touch).

II. STUDY DETAILS
Experiments were conducted by an experimenter who provided guidance to the subjects for the duration of the experiment and a second experimenter who operated the robot, hereafter referred to as the ''wizard''. The wizard sat in another, well isolated chamber, invisible to the subjects, equipped with two surveillance monitors and headphones, along with a control unit. This was necessary to preserve the illusion of an independently, intelligently acting robot. To improve uniformity and standardization, the same people and roles were used for all experiments. The control unit consisted of a handheld controller (Dual Shock 4), a robot control panel, a keyboard for shortcuts and a second keyboard for direct speech output. The shortcuts triggered speech feedback, projector control or pre-saved robot coordinates for fast and precise interaction. The wizard's chamber is shown in fig. 1.
The experimental setup is based on a UR5e industrial robot with an RG6 gripper. The robot was developed by Universal Robots with the explicit goal of enabling safe HRI. It was placed on a table in front of a TV screen displaying the depth view from a time-of-flight camera together with gesture recognition output. Apart from strengthening the illusion, neither of these was used.
A projector facing down from the ceiling was used to highlight blocks or positions. Five cameras recorded the experiment and streamed the view in real time to the wizard. The instructions were presented on a separate monitor. The entire system was introduced to the subject as ''RoSA'' (Robot System Assistant).
12 cubes with letters from A-Z and numbers from 0-9 were placed in front of the robot. Half of the cubes were black with white letters and the other half white with black letters. The setup as described is shown in fig. 2.

A. CAPABILITIES AND RESTRICTIONS OF RoSA
The skills and restrictions described below were used to create a consistent set of rules for the wizard when acting as RoSA. The skills include high-level speech recognition, covering generic terms: direction (left, right, front, etc.), position (home, ''here''), distances (cm, mm), angles (degrees) and adapting to custom user vocabulary after several repetitions (''cube'' can be referred to as ''block'' or ''brick''). Furthermore RoSA is able to interpret any shown gesture at human level, filtering obvious non-HRI gestures (e.g. scratch head, check wristwatch). When speech and gestures are used simultaneously, the gestures are interpreted secondarily as an amplifier for the spoken words and their intention, as understood by the wizard. RoSA is not able to recognise the letters or numbers on the given cubes, but is able to recognise and identify the object, knowing what colors and shapes are. RoSA is able to route to specific coordinates, hand-over, pick or place objects, collision-free.
In addition to the recognition capabilities, the system is able to learn from the subject. The learning process can be separated into passive teaching by repetition and active teaching by direct commands.
If a task requires more than one step, the steps can be sequenced and the resulting program reused later. This is referred to as advanced programming.
Since RoSA is controlled by an operator using different presets and a custom controller, the system is not able to move in exact numerical steps when prompted (''move left 5 cm''). In such cases the speech input is still accepted, but the resulting distance differs due to manual control.
RoSA does not know the position between two blocks and is not able to route to this position without prior teaching. Slight ''system imperfections'' were applied in order to force diversity in the use of interaction modalities.

B. EXPERIMENT
The study was held at the Otto-von-Guericke-University Magdeburg, Germany. The experiments were conducted in German. The subjects were asked to fill in their sociodemographic information, as well as their experience with artificial intelligence (AI), industrial robots and programming. The experimenter then provided material with the instructions. These vaguely described the capabilities of the system (''RoSA is capable of: speech- and gesture-recognition''). The subjects were then given three tasks: Spell a specific word with alternating color of blocks. After fulfilling all tasks the subjects were asked to complete several questionnaires to evaluate their user experience (see chapter IV-B) [19]-[22].
Later the subjects were asked to show a set of gestures that would further improve the system's capabilities. The set contained gestures to sign in, start, stop, move XYZ and open/close/rotate gripper. During the procedure there were two enforced system failures at certain events. The first one occurred during the second task: when asked to put a block on a projected field, the system would purposely stack the cube onto another one until the subject actively intervened. The second failure occurred during the third task: right before finishing the task, the system would crash the last cube into the pyramid. Doing so would set back the progress, so steps would have to be redone. These deliberate errors were introduced in order to trigger reactions to errors or unpredictable behaviour. At the end of the experiment the wizard was revealed to the subjects.

III. ANALYSIS AND DATA SEGMENTATION
After the experiment the video data was reviewed and the different modalities of interaction summed up. The possible interactions were categorized into the following groups: speech direction [...]. A pointing gesture used to show a direction expresses a different intention than the same gesture used to show a location. It was sufficient for the subject to use a modality once for it to be registered. Furthermore, the use of teaching features, advanced programming, active gaze, touching and leading the robot, as well as time needed for a task, robot problems and number of unclear instructions were noted. The term ''instructions unclear'' was a response from RoSA when knowledge restrictions were exceeded.
To ensure inter-rater reliability, the same footage was examined by the raters, resulting in 93% agreement and a Cohen's kappa coefficient κ of 0.86, which corresponds to ''almost perfect'' agreement according to Landis and Koch [23].
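As an illustration of how the agreement figure relates to κ, the following minimal sketch computes Cohen's kappa from its standard definition, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The rater labels are hypothetical and were chosen only so that the observed agreement equals 93%; they are not the study's data, and the function and variable names are ours. Under this definition, the reported 93% agreement and κ = 0.86 are consistent with a chance agreement of about 0.5.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary ratings (1 = modality used, 0 = not used) for 100 items.
rater_a = [1] * 50 + [0] * 50
rater_b = [1] * 46 + [0] * 4 + [0] * 47 + [1] * 3  # agrees on 93 of 100 items

print(round(cohens_kappa(rater_a, rater_b), 2))
```

With these made-up labels the marginal frequencies are nearly balanced, so the chance agreement is 0.5 and the sketch yields a kappa in line with the value reported above.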
The recorded data was then expanded with the data from the questionnaires, and the full data set was examined with the help of descriptive statistics and Pearson product-moment correlation coefficients [24]. This method was chosen because the Pearson correlation coefficient computed for two binary variables equals the phi coefficient, which is used for describing the association between binary variables [25].
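The equivalence between the Pearson coefficient and the phi coefficient for binary data can be checked numerically. The sketch below is a minimal illustration, assuming two hypothetical 0/1 vectors that mark whether each of twelve imaginary subjects used a modality; the data and all names are invented for the example and do not come from the study.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def phi_coefficient(x, y):
    """Phi coefficient computed from the 2x2 contingency table of 0/1 data."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    return numerator / denominator


# Hypothetical usage flags for 12 subjects (1 = modality used at least once).
pointing = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
speech_description = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

print(pearson(pointing, speech_description))
print(phi_coefficient(pointing, speech_description))
```

Both functions return the same value up to floating-point rounding, which is why a single Pearson correlation matrix can cover binary usage flags alongside the questionnaire scores.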
Due to problems with memory cards and the unexpected failure of cameras or gimbals, we suffered data loss. Fortunately these gaps could partly be covered by other camera angles.

IV. RESULTS
We have a sample size of 36 subjects and assume the data to approximately follow a normal distribution. There were no outliers. The demographic questionnaire reveals 14 females, 22 males and 0 diverse subjects with age ranges from [20]- [24]

A. HUMAN-ROBOT INTERACTIONS
Some biases (''this is just like Alexa/Google/Jarvis'') could be observed, when subjects were completing the demographic questionnaire regarding their experience with artificial intelligence systems. This resulted in 97.2% of the subjects using any form of speech for interaction at least once. The subjects used different interaction modalities to complete the tasks. Most of the modalities were used independently, while some depended on auxiliary input to function efficiently. Speech description, for example, was used exclusively in combination with pointing gestures, whereas both of the modalities alone, while not being as effective, would have been sufficient. Most of the subjects used a single working strategy and maintained it throughout the experiment, until encountering restrictions. Only a few subjects experimented and explored the possibilities.
For instance, subjects (S) who primarily used descriptive speech for HRI faced the restriction of RoSA (R) not being able to read the letters on the blocks. These subjects either retried the same input or changed strategy, for example by switching to numeric instructions. Other users who ran into the same problem tried adding new interaction modalities. These ''hybrids'' (72.2% of the subjects) used pointing gestures in combination with speech to fulfill the given tasks:
S: ''Take this *points* and put it here.'' *points*
In rare cases the subjects switched modalities to gestures only, using either pointing gestures or directional gestures to operate the system. A built-in system flaw of not understanding the instruction ''put on two cubes'' led to another obstacle that would come up when trying to stack the pyramid. Before that, all instructions given to RoSA by the user that complied with the rule set were executed flawlessly. When asked to stack on two cubes, RoSA would choose one of the cubes and put the third one directly on top.
S: ''Put the cube in the gripper on the cube A and the cube B. On both of the cubes.''
The instruction was accepted if the letters A and B had already been taught to RoSA. Alternatively, ''this'' accompanied by a pointing gesture instead of letters would also have been accepted. Still, instead of placing the cube at the desired position (on top of the space between the cubes), RoSA would execute the command literally: placing the third cube on the first and then quickly picking it up and placing it on the second.
At this point most subjects switched to a more manual approach using only directional voice commands (91.7%) or directional gestures (36.1%). Subjects not relying on automatic path planning, i.e. the pre-programmed positions known to RoSA, but using directional commands from the beginning, did not notice the position deviation when stacking, since they were in direct control of the robot path.
The popularity of each interaction modality can be seen in table 2. It was possible to actively teach RoSA new commands by explaining the action beforehand or by summarising the actions already taken:
S: "RoSA, take the cube and place it here." *points*
S: "This cube is next to the last one."
R: "I understand, I learned: next to the last one."
S: "RoSA, take another black cube and place it next to the last one."
RoSA places the cube as expected.
R: "Is this correct?"
S: "Yes, RoSA, very good."
The passive teaching took place when RoSA was given additional information not necessarily required for the execution of the command:
S: "RoSA, give me the third cube, the one with the letter A."
RoSA moves to the cube.
R: "Processing data. I learned: letter A. Is this correct?"
S: "Yes, give me the cube [...]."
RoSA hands over the correct cube.
Although 72.2% of the subjects used teaching in a passive or active manner (58.3% / 27.8%), only 55.6% of the subjects actually (re-)used the commands learned by RoSA. RoSA hands over a black cube. All the users admitted to not noticing that RoSA was operated manually when greeted by the wizard after the study, thereby confirming the WoZ setup.

B. USER EXPERIENCE
To evaluate the user experience, subjects were asked to complete the four questionnaires SUS, PSSUQ, UMUX and ASQ after the third task, still thinking that RoSA was an AI system. To plot and compare the PSSUQ and ASQ scores (range 1 to 7) on the same graph as the SUS and UMUX scores (range 0 to 100), the conversion $\mathrm{Score}_{[0:100]} = (\mathrm{Score}_{[1:7]} - 1) \cdot \frac{100}{6}$ was used. The results can be seen below in table 3.
when compared to SUS. The moderate correlation of ASQ can be explained by the low count of questions this questionnaire uses, which leads to higher quantisation of possible score values. A total system score can be calculated by averaging the four questionnaire scores, resulting in a total system score of 74.02 out of 100, which will be further referred to as the score.
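The rescaling and averaging described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names and the example scores are hypothetical, and the sketch applies the linear conversion exactly as stated, without any handling of questionnaire polarity, which the text does not specify.

```python
def likert7_to_percent(score):
    """Map a score on the 1-to-7 scale linearly onto the 0-to-100 scale."""
    if not 1 <= score <= 7:
        raise ValueError("score must lie in the range 1 to 7")
    return (score - 1) * 100 / 6


def total_system_score(sus, umux, pssuq_1to7, asq_1to7):
    """Average of the four questionnaire scores on a common 0-to-100 scale."""
    converted = [
        sus,                             # already on 0 to 100
        umux,                            # already on 0 to 100
        likert7_to_percent(pssuq_1to7),  # rescaled from 1 to 7
        likert7_to_percent(asq_1to7),    # rescaled from 1 to 7
    ]
    return sum(converted) / len(converted)


# Hypothetical questionnaire results for one imaginary subject.
print(likert7_to_percent(4))
print(total_system_score(sus=80, umux=70, pssuq_1to7=4, asq_1to7=7))
```

The endpoints map as expected (1 to 0 and 7 to 100), so the four questionnaires contribute with equal weight to the averaged score.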

C. DATA CORRELATION
The correlation coefficients in tables 4 and 5 describe how strongly the use of two modalities is associated, i.e. how likely they are to be used together. A strong positive correlation can be seen between pointing gestures and descriptive voice commands. There is also a strong correlation between macro and continuous gestures. The pervasive use of descriptive voice commands in combination with iterative voice commands, but not with numerical voice commands, is worth noting.
A subject using the speech domain is more likely to stay in it and less likely to use directional gestures. This is especially seen in the overall negative correlation between speech instruction and gesture direction.
Time was not a significant factor in the experiment, which is supported by the evaluation (table 5), since there is no correlation with the system score. The biggest time saver, suggested by the data, was using automatic path-planning. In contrast, using voice commands or two-handed gestures had a strong correlation with time. Subjects that mainly used instructional numeric voice commands or instructional pointing gestures were usually the fastest to finish all three tasks. Subjects who were keen to experiment with different modalities needed more time to complete the assignments.
System failures did not affect the score, in contrast to ''instruction unclear'' feedback, which correlated negatively with the score (-0.4).
The factors that impacted the score negatively the most are: age, experience with AI, passive teaching, number of unclear instructions and modalities speech description and gesture pointing. The factors that had positive influence on the score are: Robot ''Freedrive-mode'', when the robot arm can be manually moved by the subject, and the use of modalities Gesture Micro and Macro. There is a weak positive correlation (0.22) between experience with AI and active teaching, as well as a moderate correlation (0.35) between experience with industrial robots and not using the knowledge taught to RoSA. Passive learning shows a negative correlation with the score.

D. GESTURE CATALOGUE
The gestures given by the subjects to ''improve the system'' were relatively consistent for the actions stop, move and operate the gripper, while the gestures for start and sign in were more individual. The summary of the two most popular gestures per action can be seen in table 6.

V. CONCLUSION
This study provides deep insight into user experience and interaction with industrial robotic assistants. In giving the subjects the illusion of a restriction-less system, we had the possibility to observe truly intuitive HRI.
As described in chapter II-A, RoSA, although being operated by a wizard, had several knowledge-based restrictions in order to comply with the experiment setup.
The feedback ''instruction unclear'' could have led to a high degree of frustration. This especially applies to subjects with more experience in AI, robotics and programming, leading to lower scores. Subjects from this group are more likely to use such a system on a daily basis and are naturally more critical.
The negative correlation between passive teaching and score indicates that RoSA actively saying: ''Processing data. I learned: [. . . ]'', or teaching basics (''This is letter A.''), could have been perceived as annoying due to disturbance in the workflow. Some subjects seemed to expect the system to be already programmed to complete the given assignments and only wanted to give orders and not program sub-tasks that would seem basic.
The average score of 74% over all questionnaires shows that the average user was pleased with the system, unaware that some of the errors were enforced on purpose and that the system was remotely operated.
Since the industrial environment of robots tends to be noisy, we did not expect the high impact of speech. As the experiment shows, under laboratory conditions, this seems to be an intuitive way to communicate with such a system.
We assume the high use of dialogue-like speech to be an effect of the subjects being biased by already existing AI systems (Alexa, Google Assistant, Jarvis, etc.) and expecting RoSA to behave in a similar manner (''OK, RoSA.'' vs ''OK, Google'').
In many cases the system was expected to be able to calculate forward kinematics and path-planning; fewer subjects operated the robot with directional speech or gestures. The high correlation between macro and continuous gestures might be due to the fact that they are very similar in execution and can be used in an auxiliary manner.
We learned that the questionnaires were likely inadequate for the given scenario, as the scores for several questions suggest. For example, one of the questions, ''The system helped to quickly fulfill the given task'', received a low score, considering that the task was ''stack a pyramid'', which could have been done faster by the subject. If the task were understood as ''[program this robot to] stack a pyramid'', the score could have been different. Alternatively, a modification of an existing questionnaire or a set of new questions, especially concerning HRI, could lead to a better and more detailed user evaluation.
With the data from the ''system improvement'' catalogue we present a set of gestures that can be applied to many contact-less user interfaces. A system using the two most popular choices would allow for an intuitive way of interaction for the actions Move XYZ, Open/Close/Rotate Gripper and Stop. The multitude of sign-in and start gestures suggests a need for a customizable interface in order to fulfill user expectations.
The interactive laser-projection system for programming industrial robots by Zaeh & Vogl could be used to provide the user with feedback to improve the precision of pointing gestures [12]. A low-cost passive stylus, ''the DodecaPen'', which allows input with 6 degrees of freedom using a single camera, as proposed by Wu et al., could work as an alternative to pointing gestures when higher precision is required [26].
For the next steps, we suggest the development and testing of a system based on pointing gestures complemented by object-oriented speech and micro/macro gestures for fine-tuning, since these modalities showed high potential and time efficiency and presented as intuitive in our experiments.
It is worth noting that the use of multimodal (speech, gestures, gaze, body pose) interaction for contact-less communication adds a layer of redundancy and thus safety, if the different modes are processed as logical conjunctions. Although our system was ''safe'' as the robot operated in a collaborative mode (with permanently reduced speed, force, and power), a combination of the interaction modalities could allow a safe integration of contact-less communication with industrial robots. Our future work will focus on applying the findings to state-of-the-art technologies.
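Processing modalities as a logical conjunction can be illustrated with a minimal sketch: a command is only released when every required modality independently reports the same command. This is an assumption-laden illustration and not part of the system used in the study; the function name, the modality keys and the command strings are hypothetical.

```python
from typing import Dict, Optional, Sequence


def fuse_by_conjunction(
    readings: Dict[str, Optional[str]],
    required: Sequence[str],
) -> Optional[str]:
    """Return a command only if all required modalities agree on it.

    `readings` maps a modality name to the command it recognised, or to
    None if nothing was recognised. Any missing or deviating modality
    blocks the command (logical AND over the modalities).
    """
    commands = [readings.get(modality) for modality in required]
    if not commands or commands[0] is None:
        return None
    first = commands[0]
    if all(command == first for command in commands):
        return first
    return None


# Hypothetical recognition results for three situations.
agree = {"speech": "stop", "gesture": "stop"}
conflict = {"speech": "move_left", "gesture": "stop"}
speech_only = {"speech": "open_gripper", "gesture": None}

for readings in (agree, conflict, speech_only):
    print(fuse_by_conjunction(readings, required=("speech", "gesture")))
```

The design choice here is deliberately conservative: a conflict or a missing modality results in no action, which trades responsiveness for the redundancy described above.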