Preferences in Voice Commands and Gesture Controls With Hands-Free Augmented Reality Among Novice Users

Hands-free augmented reality (HAR) has the potential to one day become our gateway to the Metaverse. However, the devices are still out of reach for most consumers, mainly due to their price. Novel interaction techniques can also present a challenge when interacting with holographic content in 3D space. In this article, we describe a mixed-method user study (N = 34, inexperienced HAR users) with an augmented reality (AR) map application and the HoloLens 2, in which we compare the usability, user experience (UX), and preferences for the two most common HAR interaction techniques: gestures and voice commands. Our findings show that there may not be a significant difference in overall UX between these controls when using an AR map. However, there is individual variability in how novice users prefer to interact with holographic content. Gesture controls were preferred for their playfulness and voice commands for their efficiency. This suggests that developers should focus on different interaction techniques for different contexts, from gaming to serious collaborative work.


Hands-free augmented reality (HAR) remains somewhat rare due to the high price and limited availability of smart glasses and head-mounted displays (HMDs) with the feature. Most people have no prior experience with HAR; however, as these devices become more lightweight and affordable, they have the potential to even replace our mobile phones and become the default technology for accessing the Metaverse. We are now at the stage of technology adoption where problems influencing usability and user acceptance should be researched and solutions found.1

Mobile augmented reality (MAR) is often used in applications such as chat filters or barcode readers. Many people are already unknowingly using augmented reality (AR). Some AR applications, like Pokémon Go, have been successful, but such examples are few. A major contributor to the success of the aforementioned applications is possibly the platform: the mobile phone. For MAR applications, the potential customer base is the entire world. While these MAR applications have their place, the bigger potential of AR lies in HAR. The current roadblock for HAR is the cost of devices such as the Meta Quest Pro and HoloLens 2, making them unappealing and unattainable to the broader audience. In the foreseeable future, these roadblocks may no longer exist, and HAR devices can achieve similar mobility and ubiquity as mobile phones. It can also be expected that AR functionality will complement immersive VR, allowing even broader access to digital realms such as the Metaverse.1,2,3,4

In this article, the Augmented Reality Tactical Sandbox (ARTS), a map application developed for the HoloLens 2 HMD, is used as a probe in a mixed-method user study (N = 34). The goal of this work is to scope the current and future issues influencing user acceptance of HAR devices. We approach this issue by comparing the two most common interaction techniques used in HAR, voice commands and gestures, to evaluate user preferences.

CURRENT WAYS TO ACCESS AR
There are many technologies used to access AR content: mobile-based, projection-based, see-through-based, or even PC-based. All of these technologies provide different multimodal ways to interact with AR content.5 AR can also be divided into other subcategories according to how it is accessed. The most common type of AR is MAR, where a hand-held mobile phone is used and interfaced with a 2D touch screen. A less common type of AR is HAR, where the device is worn on the user's head like glasses. This leaves the hands free for interacting with the holographic content. The ability to interact with digital items in 3D space is such a groundbreaking addition to AR that, to highlight this difference in this article, we have chosen to call these technologies hands-free AR, or HAR. The interaction techniques in HAR vary between devices, but the main techniques are gestures, voice commands, or handheld controllers. Using gestures does not require the user to have anything in their hands, so they are, in essence, hands-free, and the interactables are situated around the user in three-dimensional (3D) space. In contrast, with MAR, objects appear to be in 3D space, but interactions are mediated by a 2D surface.6,7 The gestures used mimic familiar hand movements such as pinching, pointing, and grabbing. Immersion and usability between MAR and HAR can differ substantially.2

Some HAR devices use see-through displays to visualize AR content, while others use video pass-through to display the AR content on a screen close to the user's eyes. Figure 1 illustrates the difference between MAR and HAR. Handheld controllers can be used for HAR; however, for more pervasive daily use, voice commands, gestures,9 or more ubiquitous controllers such as smart rings show more promise. UX, usability, and user preferences with different interaction techniques are widely researched, even for AR, but the main focus of current research is on interaction mediated by 2D surfaces, spanning from the mobile phone display to AR projected on car windshields.10,11,12

Interactions in HAR
There are three common techniques for interacting in HAR, and these techniques differ in usability and performance. Research has shown that the combination of voice commands and gestures is the fastest, with only a slight difference from using voice commands only, while gestures are found to be the slowest.13 However, many nuances and details are still inconclusive regarding the most efficient and intuitive control technique for HAR. Some users still prefer to use gestures instead of voice in certain tasks, even when the performance is worse.14 Some state that voice only, or voice combined with gestures, feels more natural than gestures alone.9,13 In some situations, if the hands are already in use, voice commands can enhance operation speed and efficiency.14 To summarize, in comparison to hand-held controllers,15 gestures can have better performance in some tasks and can be easier to learn, but using them can be physically tiring.

HoloLens 2 HAR Device
The very first HMD, developed by Sutherland in 1968, was an AR HMD; however, it took decades for this technology to become miniaturized into marketable products.2 In 2016, Microsoft released its first-generation HAR device, the HoloLens, which used optical see-through (Figure 1) to display the virtual content. Three years later, in 2019, the HoloLens 2 was released. It used the same kind of display technology as its predecessor and featured an improved design and more powerful hardware.
HoloLens 2 uses the Articulated HAnd Tracking (AHAT) depth camera in two different modes. For capturing hand movements to obtain hand-tracking data, the high-frequency (45 fps) near-depth sensing mode is used. The low-frequency (long-throw) (1-5 fps) far-depth sensing mode is used for spatial mapping. The obtained mapping data can then be used to place holographic content into the surrounding space. Visible-light environment-tracking cameras (grayscale) are used for head tracking and map building, while an infrared (IR) reflectivity stream is used to compute depth. The device also includes an inertial measurement unit for detecting rotations and orientation.16 The precision of the device's hand tracking is not known, though aspects influencing tracking precision have been researched.6,7,17

Maps in AR Versus Traditional Monitor
A study18 compared a monitor map application, a Microsoft HoloLens version, and the Augmented Reality Sandtable (ARES), which is a mix of a real-world sand-table terrain and a down-projected 2D map. In a situational judgment task, where the participants needed to rank different route alternatives, the AR application allowed the users to finish the task significantly more quickly than with the regular 2D application. The participants were also significantly more engaged while using the AR version. We continue where this study left off and aim to investigate which interaction technique users prefer in AR.

AUGMENTED REALITY TACTICAL SANDBOX
ARTS has been developed as a part of a bigger system for border surveillance. The requirements were defined with a user-centered design approach. The purpose of this application is to provide a 3D AR map for better situational awareness in the command and control center and in the field. The application displays a 3D map on the HoloLens 2 HMD (Figure 1). The application was developed on top of Microsoft's Bing Maps, a commercial product with high quality requirements. The basic map functionalities therefore inherit many best practices for designing digital maps. Bing Maps also allowed the addition of custom content and analytics for research purposes.
When ARTS starts up, the holographic map (Figure 3) is placed roughly 1.5 m below the user's gaze point, that is, below where the user is looking rather than below their eyes. The user can always reposition the map. The map can also be resized or rotated. Map placement can be done via voice commands or gesture controls, while resizing and rotating must be done with gestures. During the user test, the users accessed these functions to better align the map with a physical table used for reference.
The users can also move freely to a different area on the map using voice commands or gestures. Using the voice command "pan" followed by "north/south/west/east," the users can move the map slightly in the desired direction. With the "air pinch" gesture, the users can "grab" the map and then drag it in any direction.
ARTS contains two "quick travel" buttons and four voice commands that allow the user to change the position of the map to preset coordinates. In the experiment, the users used this function with both voice commands and gestures. The gesture control is simply to press the button for the function. The voice commands are either "Return (to) Home/Greece" or "Go to New York." Zoom functions are also available with buttons or voice commands. The zoom function gives the user the ability to view any location at any height. ARTS also allows the users to hide assets (boats, drones, vehicles) from the map. This can also be done with either gestures or voice commands.
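The quick-travel commands above amount to a small mapping from recognized utterances to preset map coordinates. The sketch below illustrates such a dispatcher; the command strings follow the ones listed above, but the function name and coordinates are illustrative assumptions, not taken from the ARTS implementation.

```python
# Hypothetical quick-travel dispatcher; coordinates are illustrative
# placeholders, NOT the presets used in ARTS.
PRESETS = {
    "return home": (65.012, 25.465),      # assumed "home" location
    "return greece": (37.984, 23.728),    # illustrative: Athens
    "go to new york": (40.713, -74.006),  # New York City
}


def dispatch_quick_travel(utterance):
    """Map a recognized utterance to a preset (lat, lon), or None."""
    key = utterance.lower().strip().replace("return to", "return")
    return PRESETS.get(key)
```

A speech recognizer would pass the matched phrase to this function; unrecognized utterances simply return `None`, leaving the map state untouched.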
For gestures, the application has a hovering menu bar [see Figure 3(b) and (e)]. In addition, users can "air pinch" or "physically" pinch a virtual object for direct interactions. The menu bar can be hidden or shown with a voice command and pinned down anywhere. By default, the menu will follow the user as they move.

METHOD
A mixed-method, task-based user study with a qualitative focus was conducted between November 2022 and March 2023 in a controlled lab environment. The interaction technique was the independent variable in this study. The goal was to compare the usability, UX, and preferences for voice command and gesture controls among participants (N = 34) who had no prior experience with AR smart glasses such as the HoloLens 2.

Study Procedure
The study procedure (see Figure 2) began with the participant reading and signing consent forms. As most people in the general public have no experience with HAR,1 we made this a prerequisite in our study. We only had to exclude one participant due to their prior HAR experience. Eleven out of 34 users had prior experience with MAR, but this was expected.
Next, the participants were introduced to the HoloLens 2 HMD, followed by an introduction to basic AR concepts, the testing procedure, and ARTS. The participants were divided into two groups: one group would use voice commands first and the other would use gestures first (Figure 2).
For the group using voice commands first (G1-A in Figure 2), the first task (Task 1: Warm up in Figure 2) consisted of running the eye calibration for the HoloLens 2 and a quick introduction to the basic user interface (UI) controls. For the group using gestures first (G2-G in Figure 2), this warm-up task was the same, but with a tutorial for the "air pinch" gesture. This tutorial was done using an interactable floating 3D football [see Figure 3(a)]. When the participants felt confident about using the "air pinch" gesture, the warm-up task would end. This was done to exclude the influence19 of difficulties with "air pinch" for those who take longer to learn the gesture. The G1-A group performed the "air pinch" tutorial at the beginning of task 3 (Task 3: Gesture in Figure 2).
The tasks for both groups were exactly the same but were done using different interaction techniques: voice commands or gestures. For the group using voice commands, there was a quick voice check at the beginning. The participant disabled and enabled the gesture menu with voice commands, and if everything worked, they proceeded with the task. In the gesture group, participants observed how the menu followed their head movements and learned how to disable its movement so it would not follow them. When the participants switched to the other interaction technique, the same checks were followed. The participants were shown a list of tasks, kept in view, to perform with ARTS during the experiments. Each participant was encouraged to do the tasks independently and only ask the researchers for help if there was a clear problem. The tasks (see Figure 3) were identical for both groups; the only difference was the interaction technique. The G1-A group used voice commands first (Task 2: Audio, Figure 2) and gestures second (Gesture controls, Figure 2), whereas the sequence was the opposite for the G2-G group. After answering the postquestionnaire, the participants went through the same tasks with the interaction technique they had not yet experienced. After completing the main tasks, there was a bonus task, where the participants had the option, though it was not explicitly suggested, to use both interaction techniques at the same time. They were also asked to try out the hidden coffee cup voice command. The bonus task was a way to reward the users, but it also allowed us to observe which technique they chose to use. The task sequence was as follows:
1) Use the zoom-in/out functions.
2) Pan the map.
3) Move the map to "home."
4) Use the show/hide functions for different assets.
5) Move the map to "Greece."

An additional assignment, which allowed but did not force mixed interaction techniques, followed at the end of the tasks:
1) Move the map to New York.
2) Find the Statue of Liberty.
3) Make the coffee cup appear and interact with it.
4) Use the map any way you want (optional).

Material
The collected material consisted of informed consent forms, a prequestionnaire for demographics, a postquestionnaire, and a brief transcribed interview. The postquestionnaire included the System Usability Scale (SUS)20 questionnaire and the User Experience Questionnaire (UEQ).21 It was issued after the users had experience with only one interaction technique. At the very end of the experiment, after the participants had experienced both interaction techniques, we conducted a semistructured interview. The interview targeted their favorite interaction technique, preference of use, and general feedback. The average length of an interview was 3 min 57 s.
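For reference, the SUS mentioned above yields a 0-100 score from ten 5-point Likert items using the standard scoring rule: odd-numbered (positively worded) items contribute (response − 1), even-numbered (negatively worded) items contribute (5 − response), and the sum is multiplied by 2.5. A minimal sketch:

```python
def sus_score(responses):
    """Standard SUS scoring for the ten 5-point Likert items.

    Odd-numbered items (indices 0, 2, ...) are positively worded and
    contribute (response - 1); even-numbered items are negatively
    worded and contribute (5 - response). The sum is scaled to 0-100.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses in the range 1-5")
    positive = sum(r - 1 for r in responses[0::2])
    negative = sum(5 - r for r in responses[1::2])
    return (positive + negative) * 2.5
```

For example, uniformly neutral answers (all 3s) score exactly 50, the midpoint of the scale.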

Participants
We recruited 34 participants (M = 23, F = 11) via university mailing lists and an online recruitment system that prescreened for participants with no prior experience using HAR devices. Most participants were students majoring in computer science and engineering or some other field of engineering; however, there were also archaeology and logopedics majors. The age range of the participants was 21-48 (M = 27.2, SD = 5.2). The majority (20) of the participants were of Finnish nationality.

RESULTS
Based on the SUS and UEQ, issued after the users had a chance to try either the voice command or the gesture interaction technique, there were no significant differences in basic usability and UX between these techniques. However, in the following sections, we summarize the analysis methods and the small gender-related outlier observations. The questionnaire results are between-subjects, as they were collected before the users had a chance to try both modes of interaction.

System Usability Scale
The SUS questionnaire results were analyzed using the independent-samples t test. There were no significant differences in the overall SUS score, but one significant difference was found between the groups in one of the questions. The difference was in question 6 (N = 32, mean G1-A = 1.733, mean G2-G = 2.588, p = 0.014) on inconsistency in the system. The group that used gestures (G2-G) evaluated the system as being significantly more inconsistent than the G1-A group, which used voice commands. Apart from this, one question showed a small gender difference within G1-A, namely question 7 (N = 15, M = 11, F = 4, mean rank M = 6.36, mean rank F = 12.50, p = 0.012). Here, the women in the group evaluated the system as being significantly quicker to learn than the men did.
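The independent-samples t test used for the between-group comparison can be reproduced with SciPy. The scores below are invented for illustration and are not the study's data:

```python
from scipy import stats

# Illustrative 1-5 Likert responses to a single SUS question for the
# two groups; these values are made up, NOT the study's data.
g1_voice = [2, 1, 2, 3, 1, 2, 2, 1]
g2_gesture = [3, 2, 3, 3, 2, 4, 3, 2]

# Two-sided independent-samples t test on the group means.
t_stat, p_value = stats.ttest_ind(g1_voice, g2_gesture)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A p value below 0.05, as with question 6 in the study, would indicate a significant between-group difference.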

User Experience Questionnaire
The UEQ (scale from −3 to 3) was analyzed using both one-way ANOVA and the Kruskal-Wallis H test. We found no significant differences in UX between the groups for the six main categories [p = (0.212, 0.295, 0.356, 0.442, 0.937, 0.940)]. However, between genders in the G1-A group, there was a significant difference in the attractiveness (N = 16, mean M = 1.79, mean F = 1.18, p = 0.009) of the application, with men rating it higher than women. In the G2-G group, the women rated efficiency higher than the men (N = 13, mean M = 0.67, mean F = 2.25, p < 0.001). Differences between genders were analyzed using one-way ANOVA.
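The same comparison pattern applies to the UEQ scales. With SciPy, a one-way ANOVA and its nonparametric counterpart, the Kruskal-Wallis H test, can be sketched as follows (the ratings are invented for illustration, not the study's data):

```python
from scipy import stats

# Illustrative mean ratings on the UEQ's -3..3 scale; NOT study data.
g1_voice = [1.5, 2.0, 1.0, 1.8, 2.2, 1.3]
g2_gesture = [1.2, 1.9, 0.8, 2.1, 1.7, 1.5]

# Parametric test on means and rank-based test on distributions.
f_stat, p_anova = stats.f_oneway(g1_voice, g2_gesture)
h_stat, p_kw = stats.kruskal(g1_voice, g2_gesture)
```

Running both, as done in the study, guards against the normality assumption of ANOVA not holding for short Likert-derived scales.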

IEEE Pervasive Computing
January-March 2024

THE PERVASIVE MULTIVERSE

Qualitative Findings
The qualitative findings are based on the semistructured interviews conducted after the users had a chance to try both voice command and gesture interactions. Therefore, while these observations are within-subjects, half of the users tried voice commands first and half tried gestures first.

User Preferences
The participants' initial preferences were for voice commands (17), gestures (12), or mixed (5) controls as their first choice when asked which interaction technique they found most comfortable. When asked which interaction technique they would use in practice, the preferences shifted in favor of mixed use (14), followed by gestures (11) and voice commands (9). Interestingly, none of those who initially preferred gestures chose voice commands afterward, while three users shifted from voice commands to favoring gestures in practice. Agreement testing was conducted between two observers with a sample of 100 text excerpts. For recognizing whether the user was talking about the experience in general or a specific interaction, there was a very high agreement of 96% with Kappa = 0.937. In recognizing categories for the used interaction technique, the agreement was 87% with Kappa = 0.839. There was slight disagreement in the interpretation of intuitiveness under the Experience category. The observers also disagreed on four occasions about learning requirements as opposed to having to remember voice commands or the UI layout, due to their similarity; these were also in separate categories. The thematic analysis22 of the interviews revealed potential reasons for the preferences, which can be divided into the following three main categories:
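The Cohen's Kappa values reported above correct raw percentage agreement for the agreement expected by chance. A self-contained sketch of the computation (the coder labels below are illustrative, not our actual codebook):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must label the same nonempty item set")
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(lab) / n) * (rater_b.count(lab) / n)
              for lab in labels)
    return (p_o - p_e) / (1 - p_e)
```

This is why 87% raw agreement can map to Kappa = 0.839: part of the raw agreement is attributed to chance and subtracted out before rescaling.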

Reliability
The users found voice commands to be faster, more reliable, and more efficient. To quote a user: "...I had [to] choose separate things for showing and hiding the objects. I had to press the button and select the object, but with voice command it was just simple as saying the one voice command." The issues users had with the controls varied. Neither method was problem free. With gesture controls, problems arose from the technology. The most common issues were unreactivity (i.e., inertness) and overreactivity. Some users struggled especially with the pinch gesture, while some had unintended interactions with the content, such as the floating menu or the map sticking to their hand. To quote one user on gestures: "If I knew what type of interaction would result in what type of end-results the use would be faster"... "Many times these balls and maps kind-off stuck to your finger. You stop the gesture, but the target follows you." With voice commands, the users' main concern was whether they would remember the commands. However, both issues were seen by the users as something they would be able to overcome with practice. The pinch gesture was difficult, while pointing and grabbing worked for most. Both voice command and gesture controls evoked comments about a steep learning curve. For voice commands, the users felt they needed to learn the utterances, and with gestures, finding the correct hand positions was noted to require practice as well.

Experience
One identified category was experience: specifically, how natural and fun the users found the interactions with gestures as opposed to voice commands. The gesture controls were considered more fun, natural, and intuitive and, as an interaction method, more closely resembled how one interacts with objects in physical reality. The users were, in general, more impressed with the gesture controls. There were also a few users whose initial preference was voice commands but who still stated that they would use gestures in practice. To quote one user on the gesture-related UX: "It felt, like, more controlled. I like to play games where you can move the map view. This was kind of the same and the setup was more natural... gesture controls for instance in games, it creates this bodily... like for instance, I like playing with Wii console."

Comparison With Current Technologies
The users frequently compared the AR experience to familiar technology. This is known to occur with other novel technologies as well and might influence user acceptance in serious contexts like work. The SUS scores and UEQ scales also show that no big usability or UX issues were influencing the findings. Some users compared the device to existing use cases or technology, such as their PCs or smartphones; to quote a female participant from G1-A: "You cannot yet watch Netflix with it, since the colors are not like on a laptop screen..." Participants were also asked whether they found the technology mature enough for daily use and whether they would use the device themselves: 18 people answered yes, 15 said no, and one replied maybe.

Limitations
This study has a small sample size. The differences between gesture and voice controls were not statistically significant. However, the qualitative analysis gives us more insight into how the users experienced the controls and the use of HAR. For qualitative analysis, the sample size is adequate,23 with the only issue being the brevity of the interviews due to the need to keep the study's duration under an hour per user. We mainly recruited participants from the university campus, and many of them reported an interest in the technology. Due to this potential sampling bias, one must be careful when generalizing the overall UX or SUS scores. Since none of the participants had any prior experience with HAR, we can assume that the results from the qualitative analysis of the user preferences have high generalizability, although the qualitative method leaves room for interpretation. The agreement testing, though, suggests that the interpretive variance in the results is small. There is a chance that the perceived efficiency of voice command interaction is due to technology limitations and lack of experience. However, we aimed to mitigate this issue by adding calibration, a UI introduction, and a warm-up to the study procedure.

DISCUSSION
We do not know yet if being able to use AR hands-free is the needed "killer feature" for the technology. HAR devices can display information to users as they walk the streets, but the possibility to interact with the environment is not yet there for a full Metaverse experience. In such a scenario, for instance, one could buy something simply by interacting with an advertisement they see, and then have Amazon deliver the product to their home a few days later.
Considering current smart glasses, such as the HoloLens 2, as interfaces to the Metaverse, the threshold question is: how naturally and intuitively can people, especially novice users, interact with the holographic content? Early smart glasses, like Google Glass, suffered from social rejection and privacy issues.24 Even when the technology becomes more miniaturized and affordable, the public might feel reluctant to adopt HAR due to the steep learning curve of the available controls and reliance on phones and tablets.25 In this study, the users compared the HoloLens 2 to the technologies they are more familiar with and currently use to interact with digital maps.
For high user acceptance of new technology such as HAR, more than just feasibility tests are needed. An important goal, when unlocking the potential of this technology, is to understand the users' needs and preferences beyond usability and basic UX. Our findings show that there are individual differences in user preference for interaction techniques, despite there being no statistically significant differences in the usability or UX of the two most common interaction techniques. We noticed through the interviews and observations that individual preferences can vary and are based on how reliable the users find a specific technique. Some users complained about accidental interactions with gestures, and some faced a steep learning curve. While voice commands were found to be efficient, they were not perceived as exciting. However, voice commands provided a reliable option for those who had a frustrating initial experience with gestures. Still, gestures were found to be more intuitive, as many users wondered whether they would be able to recollect all the needed voice commands. We expected a stronger novelty effect from the gestures, considering that this interaction technique is more novel and voice commands have been around for decades on many consumer devices. However, the users found the gesture controls more natural and compared using them to how one interacts with objects in the physical world.
Voice commands were the initial preference for many of our test users. However, those who had no issues with gestures preferred them. Hand tracking still seems to be an issue with gestures, but this is due to limitations in the technology and might be mitigated in the near future. Both interaction techniques had a learning curve, but even once the hand-tracking issues are ironed out, voice commands will still retain the problem of users worrying about memorizing the phrases.

Design Recommendations
What HAR-type smart glasses provide is a direct way of interacting with holographic objects, as one would in the physical world. Our findings show that the users consider this type of interaction exciting and intuitive. Therefore, we suggest that, especially when designing for games and leisure contexts, gesture interactions should be prioritized due to the "excitement" effect and because they make the player feel more involved. The popularity of exergames, fitness games, and motion controllers is a good indicator that people like moving around and using their hands while playing games. These recommendations are not exhaustive, and of course, there may be application-specific considerations that fall outside these contexts.
For serious contexts, such as work, where efficiency is the main factor, voice commands should be favored. However, one must be careful not to rely solely on voice commands, since issues will arise when working in loud environments. It is recommended to have gestures as a backup. For situations that require, for example, object manipulation or high precision, gesture interaction should be the main technique, because voice commands can become inefficient and impractical if the required action is complex.
When using separate panels for buttons or other information, it is recommended that the user has the option to disable these panels from their view, or at least that the panels are movable and can be locked in place. We observed instances where our test users had unintentional interactions with the menu buttons because the panel hovered too close to their hands. Adding many information panels can also easily clutter the AR space.

CONCLUSION
In this study, we explore the viability of two types of controls for HAR: voice commands and gestures, with the notion that in the near future, these may be the most viable interaction techniques for the persistent use of smart glasses when accessing the Metaverse.
Our findings show that gesture controls are found to be more natural and exciting, while voice commands are faster and more efficient. For this reason, we recommend that application designers and developers consider gesture interactions especially for leisure, and voice command interactions when efficiency and reliability are required. Both interaction techniques evoked some level of worry in users about how they would learn or remember how to use them, but for different reasons. With gestures, these issues will be mitigated once the technology advances and the tracking techniques improve; with voice commands, however, users might still, in the future, require some type of memory assistance or an implemented tutorial.

FIGURE 1. Examples of currently available AR technologies. Left to right: optical see-through (Microsoft HoloLens 2) HAR, video pass-through (Meta Quest Pro) HAR, and MAR. The two leftmost are mainly used hands-free, while MAR requires the use of hands for holding the device and is therefore intrinsically different.8

FIGURE 3. (a) Virtual football used for gesture training. (b) Floating user menu showing all buttons. (c) Final phase of testing at New York, plus coffee cup interaction. (d) Moving the map with the air pinch gesture. (e) Menu button press interaction. (f) Study participant using the map during the experiment.