Text Input for Non-Stationary XR Workspaces: Investigating Tap and Word-Gesture Keyboards in Virtual and Augmented Reality

This article compares two state-of-the-art text input techniques across non-stationary virtual reality (VR) and video see-through augmented reality (VST AR) use cases as XR display conditions. The developed contact-based mid-air virtual tap and word-gesture (swipe) keyboards provide established support functions for text correction, word suggestions, capitalization, and punctuation. A user evaluation with 64 participants revealed that XR displays and input techniques strongly affect text entry performance, while subjective measures are only influenced by the input techniques. We found significantly higher usability and user experience ratings for tap keyboards compared to swipe keyboards in both VR and VST AR. Task load was also lower for tap keyboards. In terms of performance, both input techniques were significantly faster in VR than in VST AR. Further, the tap keyboard was significantly faster than the swipe keyboard in VR. Participants showed a significant learning effect with only ten sentences typed per condition. Our results are consistent with previous work in VR and optical see-through (OST) AR, but additionally provide novel insights into the usability and performance of the selected text input techniques for VST AR. The significant differences in subjective and objective measures emphasize the importance of specific evaluations for each possible combination of input techniques and XR displays to provide reusable, reliable, and high-quality text input solutions. With our work, we form a foundation for future text input research and XR workspaces. Our reference implementation is publicly available to encourage replicability and reuse.


INTRODUCTION
Virtual, augmented, and mixed reality (VR, AR, MR), extended reality (XR) for short, offer a great opportunity to expand the way we work. Even though XR workspaces are often seen as a substitute for our current workspace, it is not yet clear what future workspaces mixing physical and virtual tools and places might look like, but it is safe to assume that we will see a mix where XR tools extend or enhance the physical workspace. While traditional office work has mostly taken place stationary at a desk using physical keyboards for text input, XR workspaces can free us from a fixed physical location because the working environment accompanies us virtually [25,27,42]. Text input is also an essential feature of XR workspaces: It enables us to take notes, edit documents, communicate in chat programs, comment on social media platforms, browse the Internet, or authenticate with a username and password. Therefore, working in VR and AR enables high flexibility in workspace design, but also places special demands on text input techniques [6,24,77]. Known from today's real office environments, the physical keyboard is a promising candidate for stationary XR workspaces with a desk [27,42,50,57]. For non-stationary scenarios like those presented in [25,48,62], a physical surface may not be available, or the physical keyboard is not technically supported due to the diversity of XR device features. Therefore, future XR workspaces should also support input techniques that can be used across different physical locations and interiors, XR devices, and input modalities. Virtual keyboards are a promising solution and have been extensively evaluated in either VR or optical see-through (OST) AR. However, our literature review revealed a lack of research on virtual keyboards in video see-through (VST) AR. To address this gap, we present a novel study to investigate virtual keyboards in VR and VST AR using a single XR device. We identified virtual tap and word-gesture (swipe) keyboards as state-of-the-art text input techniques and evaluated both in VR and VST AR with contact-based controller interactions.
We formulate the following research questions: First, how do different XR displays (VR, VST AR) affect usability, user experience, task load, well-being, and text entry performance? Second, how do different input techniques (Tap, Swipe) affect usability, user experience, task load, well-being, and text entry performance? Third, when transferring input techniques from VR to VST AR, what are the challenges, and how can we improve their use?
We have implemented input techniques, including capitalization, punctuation, and text revision for real-world scenarios. We used a single XR device for VR and VST AR to reduce a major confounding variable in text input studies. This also improved the comparability of our results across input techniques and XR displays. For VST AR, we found a simple but effective solution to reduce the visual conflict between the real controller/hand seen in the video stream and the virtual controller model by adding an opaque surface behind the keyboard.
Our user evaluation with 64 participants revealed that XR displays and input techniques strongly affect text entry performance, while subjective measures are only influenced by input techniques. We found significantly higher usability and user experience ratings for tap keyboards compared to swipe keyboards in both VR and VST AR. Task load was also lower for tap keyboards. In terms of performance, both input techniques were significantly faster in VR than in VST AR. Further, the tap keyboard was significantly faster than the swipe keyboard in VR. We compared our results with previous work in VR and OST AR and provide novel insights into the usability and performance of contact-based tap and swipe keyboards in VST AR. In this study, the question of which XR display enabled higher performance with contact-based mid-air virtual input techniques can be clearly answered with VR. However, the subjective results showed no difference between VR and VST AR. Therefore, we strongly recommend investigating text input techniques in both VR and VST AR to provide reusable, reliable, and high-quality text input solutions. We hope that our work forms a foundation for future text input research in VR and VST AR and fosters the development of XR workspaces. Our reference implementation is publicly available to encourage replicability and reuse in future XR workspaces.

RELATED WORK

XR Displays
We use the term extended reality (XR) to summarize virtual reality (VR), augmented reality (AR), and mixed reality (MR), and define MR according to the modified reality-virtuality continuum (mRVC) [91]: MR must include a merging of real and virtual qualities.
VR devices display purely virtual worlds. AR devices, on the other hand, enhance the real world with additional virtual content. Milgram and Colquhoun [69] classified AR devices based on their display technology into optical see-through (OST) AR and video see-through (VST) AR. With OST AR, the user sees the real world directly through the display medium, and virtual content is optically superimposed. In contrast, VST AR captures the real world via cameras on the front of the device and presents the video data to the user on a screen. Compared to OST AR, this allows virtual content to be overlaid opaquely. While OST AR currently always binds the user to the real world, VST AR displays enable a much more extended design space along the modified reality-virtuality continuum. Therefore, XR workspaces based on VST AR appear promising, and current VST AR devices are already paving the way to enable the supported design space. Currently available consumer devices support basic VST AR (e.g., Meta Quest 2 or HTC Vive Focus 3), but are limited to a low-resolution (grainy, distorted), monochrome, and/or monoscopic video stream that can compromise the overall performance and experience. Non-consumer devices for the professional market (e.g., Varjo XR-3) present the real world captured by two cameras in high resolution, color, stereo, and with low latency, making these devices appear more suitable for user studies that require high-quality VST AR. Announced XR devices (e.g., Meta Quest Pro, Pico 4, or Lenovo VRX) also promise color VST AR, but they were not available at the time of this study and have yet to be evaluated as an alternative to professional equipment.
A variety of XR devices can be used to evaluate text input techniques for future XR workspaces. However, due to differences in tracking accuracy and functionality, a prior evaluation of the selected XR device is essential to avoid potential confounding variables in user studies.

XR Controllers and Hand Tracking
XR devices are usually equipped with physical controllers, but increasingly support hand tracking as an additional, or sometimes exclusive, input method. Both controllers and hand tracking enable direct interactions with virtual objects. For consumer XR devices, the controller enables more accurate motion detection than integrated hand tracking [2,12,39,40,84]. In addition, the controllers' internal sensors enable higher tolerance to occlusions and faster response to motion. Professional systems such as OptiTrack Motion Capture provide better accuracy and faster motion detection for hand tracking [20,30,81], but they are mostly not available in mobile XR workspaces with consumer XR hardware. Kern et al. [40] showed that most XR controllers can also be used as XR styluses. While the typical controller grip (power grip) requires wrist and arm movements, holding the controller upside down in a pen-like posture (precision grip) enables fine-grained movements that are essential for accurate drawing, handwriting, and direct interactions with small keys on a virtual keyboard, resulting in fewer errors and higher performance [5,73]. Therefore, controllers are a reasonable solution to interact with virtual content using arbitrary XR devices.

Physical Keyboards
In today's real-world office environments, we work at a desk with the most common input technique, the physical keyboard [17,27]. The uniform keyboard layout allows us to reuse existing knowledge [77] and work at high input speeds [4,23,43,64]. Due to their widespread use, familiarity, and efficiency, physical keyboards have also been successfully integrated into VR and VST AR. The most common approaches are based on the alignment of a virtual representative with the physical keyboard [9,28,29,34,36,43,59,68,71,72,79,89], the segmentation of the physical keyboard and/or the user's hands [36,59,68,72], or a video camera stream of the real world [28,59,72]. However, for non-stationary XR workspaces like those presented in [25,48,50,62], a physical table may not be available and alternatives are required [10]. HawKEY [72] eliminates the need for a physical desk with a self-built construction: a lightweight tray carried by the user that provides access to a physical keyboard. This might work in XR workspaces [27,42,57,72] that expect such self-built constructions to be available and technically supported. Nevertheless, there are various scenarios where a physical table or general technical support cannot be expected due to different environments or capabilities of the XR devices. For example, Lages and Bowman [48] developed a non-stationary OST AR workspace that follows the user while walking and approaching walls. Here, the use of a physical keyboard can be complicated or even impossible due to the missing physical surface, and alternatives such as carried trays are also not supported.
Physical keyboards are a promising approach for stationary scenarios with a desk and technical support. However, alternatives are essential to address the diversity of XR devices with different features and interaction approaches (e.g., controller or hand tracking). Therefore, future XR workspaces should also implement input techniques with minimal requirements that can be used across different physical locations and interiors, XR devices, and input modalities.

Virtual Keyboards
As a promising alternative to physical keyboards, virtual keyboards can be easily integrated into any XR workspace and controlled with the user's hands, employing either controllers or hand tracking. They are also location-independent and supported by arbitrary XR devices. For VR workspaces, virtual keyboards have been extensively evaluated. Distance-based approaches include head-pointing [61,84,93,97], controller-pointing [8,13,84], and hand/finger pointing [20,22]. For a more natural interaction, researchers investigated contact-based approaches with controllers [84,94] or hands/fingers [18,21,30,84]. For OST AR, virtual keyboards are also a common input technique [15,21,53,54,63,76,92,96], but far less explored than in VR.
With virtual keyboards in VST/OST AR, perceptual issues can arise (e.g., resolution, registration errors, occlusion, and plane interference) that, along with insufficient tracking quality of controllers and hands, may considerably affect the user's performance, usability, and user experience [20,46].
Virtual keyboards as a text input technique have been well studied in VR and OST AR. However, there is a research gap regarding virtual keyboards in VST AR. Given the increasing popularity of VST AR, it is crucial to investigate virtual keyboards in VST AR, especially considering possible visual constraints.

Text Input by Tap-Typing
Probably the best-known implementations of virtual keyboards are tap keyboards. They enable the user to enter words by pressing individual keys. A variety of interaction metaphors for tap keyboards have been evaluated, including pointing towards keys using ray casting and touching virtual keys with hands/fingers or controllers.
Speicher et al. [84] explored distance- and contact-based VR text entry on an upper-case virtual keyboard to compare six different input techniques. After a warm-up phase of five minutes, five sentences were written. Users employing controller-based pointing achieved the highest input speed of 15.44 words per minute (WPM) at a 0.97% error rate (ER), followed by controller tapping (12.69 WPM, 1.95% ER). The authors attributed the relatively low results for index finger typing (9.77 WPM, 7.57% ER) to a strong impact of inaccuracies in the utilized system's hand tracking. In addition, the results revealed that user experience and workload are important factors for text entry performance. Adhikary and Vertanen [2] investigated contact-based VR text input with one or two index fingers on traditional and split keyboard designs. Participants typed four sentences for practice and 12 during the evaluation. The authors found no significant difference between one-handed (16.1 WPM, 2.9% ER) and two-handed (16.4 WPM, 2.3% ER) text entry, but a significant one for the split design using both hands (14.7 WPM, 2.4% ER). In this study, users had problems precisely targeting the keys and typing in mid-air. The authors hypothesized that users focused on visually guiding their fingers or on successfully triggering a tap, which limits their ability to plan the next character sequence. However, an autocorrection algorithm improved the faulty input sequence (9% ER) significantly, down to a 1% error rate.
For OST AR, virtual keyboards have also been explored in the past. Rickel et al. [76] studied one- and two-handed contact-based text input in OST AR using the Microsoft HoloLens 2 with the system's default keyboard. Participants entered 20 sentences, including five for practice. Typing with two hands (13.91 WPM) was significantly faster than the one-handed condition (12.07 WPM) and preferred by participants. However, the two-handed method showed significantly higher error rates; participants reported that they typed faster but also more inaccurately because they were confused about which hand to use for each key. This is also supported by a significantly increased mental demand in the two-handed condition. Using fingers other than the index finger led to unintentional text input. In this study, capitalization, punctuation, and error correction were not required, which probably also affected the word error rates for one-handed (5% WER) and two-handed (8% WER) input. Xu et al. [92] explored four pointing methods for distance-based text entry in OST AR with the Meta 2. Participants practiced with two sentences and typed eight phrases for evaluation. The controller-based approach (14.6 WPM) significantly outperformed the other techniques, i.e., head-based pointing, hand gesture, and head+hand, in terms of input speed, task load, and user experience.
In previous work, researchers have evaluated tap-typing in VR and OST AR using distance-based and contact-based approaches. Controller-based text input achieved better results than hand tracking; however, controller tracking is currently also much more accurate than hand tracking. Overall, distance-based and contact-based approaches achieved similar text entry performance, although the authors gave reasoned preferences for certain approaches. In addition, distance-based approaches can reduce physical demand, while contact-based approaches allow for more natural interaction with both controllers and hands/fingers. Therefore, the choice of interaction metaphor depends on the use case and the XR display. The use of tap keyboards in VST AR has not been explored yet.

Text Input by Swipe-Typing
As an alternative to tap-typing, word-gesture keyboards have shown promising results in terms of both input speed and error rate. Gesture-typing was first introduced in 2003 with SHARK [99]. Since then, many approaches such as SHARK2 [45] and Vulture [67] have been published. Text input by gesture-typing is also known as word-gesturing or swiping, and word-gesture keyboards as swipe keyboards. For simplicity and familiarity, we refer to word-gesture keyboards as swipe keyboards. A gesture is generated by sliding over the letter keys of a word in the desired order. In distance-based approaches, the gesture is recorded during or between key presses; contact-based techniques require the user to touch and lift the hand/finger or controller.
Zhai et al. [98] defined three key components of swipe keyboards: the visual guidance, which supports the user in tracing word paths; the language model, which expresses the user's language regularities and constraints; and a recognition system that classifies the gesture into a ranked list of possible words. The gesture recognizer should be highly error-tolerant, as users may rely on memory recall of gestures instead of visual guidance with more practice [99]. In contrast, participants in a study by Markussen et al. [67] did not seem to remember gestures but continued to rely on visual tracing. Moreover, they relied heavily on visual feedback even after extensive training and repeating the same words. Gesture-typing can also benefit from the user's prior knowledge if traditional keyboard layouts are used.
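To make these components concrete, the following minimal Python sketch shows how an ideal word gesture can be generated as the polyline through the centers of a word's letter keys and resampled to a fixed number of points for comparison with a user's gesture. The function names and structure are our own illustration, not taken from the cited implementations.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def word_template(word: str, key_centers: Dict[str, Point]) -> List[Point]:
    """Ideal gesture path of a word: the sequence of its letter-key centers."""
    return [key_centers[ch] for ch in word.lower() if ch in key_centers]

def resample(path: List[Point], n: int = 32) -> List[Point]:
    """Resample a polyline to n equidistant points so that gestures of
    different lengths become directly comparable point by point."""
    if len(path) < 2:
        return list(path) * n if path else []
    # cumulative arc length along the polyline
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1] or 1.0
    points, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while seg < len(path) - 2 and dists[seg + 1] < target:
            seg += 1
        span = (dists[seg + 1] - dists[seg]) or 1.0
        t = (target - dists[seg]) / span
        (x0, y0), (x1, y1) = path[seg], path[seg + 1]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points
```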
Markussen et al. [67] evaluated gesture-typing in mid-air and by touching a vertical 2D surface in physical space. After 54 phrases, participants were significantly faster with touch (47.5 WPM) than in mid-air (28.1 WPM). Between the first and last block, input speed improved by 26% with touch and 25% in mid-air. Yu et al. [97] proposed using head movements for distance-based gesture-typing in VR. Participants had to press a button to start and stop the gesture. The algorithm was implemented according to [45,67]. Participants were informed that error correction was supported (but not mandatory). After eight sessions of ten sentences each (60 minutes of training), participants achieved 24.73 WPM on average. The authors observed that participants developed strategies to more accurately input words whose gesture patterns were similar to others (e.g., shopping and shipping). Chen et al. [13] also investigated swipe keyboards in VR, comparing a distance-based approach using controllers with touch input on a touchscreen. They implemented a swipe keyboard based on [99]. Text corrections were not required in this study. Participants achieved 16.4 WPM with controllers and 9.6 WPM with the touchscreen. Users preferred gesture-typing in terms of usability and task load.
Gesture-typing in OST AR showed results similar to VR. Xu et al. [92] also explored gesture-typing using the Meta 2 AR device. They used a swipe keyboard based on the Damerau-Levenshtein distance [14,55]. Gesture-typing with controllers (13.68 WPM) outperformed the other techniques using pointing by head or hands. Swiping produced significantly higher temporal demand and frustration, but a mental workload similar to tap-typing.
Swipe keyboards have a high potential for text input in VR and OST AR. Contact-based swiping achieved better results than distance-based approaches. In this context, error-tolerant gesture recognition is essential, as users do not necessarily follow the intended path. Compared to tap-typing, swiping can increase workload and frustration, but may also achieve higher input speeds after a longer training period. The use of swipe keyboards in VST AR remains unexplored.

Keyboard Layouts
Virtual keyboards often adopt physical keyboard layouts [17] to take advantage of prior knowledge and minimize learning effort. The most prominent keyboard layouts used in evaluations of virtual keyboards are QWERTY [1,2,8,76] and QWERTZ [74,84]. While most researchers predefined the target layout, others let the participants decide [84]. Alternative keyboard layouts that have been evaluated are ATOMIK and OPTI II [75,93]. Virtual keyboards include at least the letters A-Z. Capitalization and punctuation are often not considered [1,21,76,84] because researchers compared pure text input rather than focusing on real-world usability. Dube and Arif [18] investigated different key dimensions and shapes on a virtual keyboard. The results showed a significantly lower error rate with 3D keys than with 2D keys. In addition, participants typed faster with 3D keys. The key shape (round, square, or hexagonal) had no effect on error rate and text input speed. However, participants preferred the square 3D keys because they are most similar to the physical keyboard's key design. Contact-based approaches also benefit from visually pressable buttons.
An established keyboard layout and individual keys with a square 3D shape and visual feedback seem preferable. The keyboard layout also influences the gesture path on the swipe keyboard, and users may lose the advantage of prior knowledge with an unfamiliar layout. Therefore, it is advisable to choose a familiar keyboard layout.

Text Revision
Bowman et al. [10] summarized text revision as editing subtasks including insertion, deletion, and replacement. The possibility of deleting the last entered characters or words enables all editing subtasks in principle. However, it may not be suitable if a complete sentence or paragraph has already been entered. Previous work has shown that text revision techniques can significantly improve text entry speed and error rates [1]. Likewise, research has also shown that users spend a significant amount of time revising text [16]. Therefore, text correction should be as fast and precise as possible [27].
On traditional physical keyboards, text revision is available via the arrow keys and the backspace key. This technique is well-known and can be easily implemented for virtual keyboards [58]. However, previous work often limits correction to the end of the input stream without the ability to control the cursor [1,8,13,21,67,76,84,97]. While this may increase internal validity, it does not reflect a real-world scenario. In their design space, Li et al. [58] define backspace granularity as the degree to which the backspace removes user input: character-level backspace deletes a single character, while word-level backspace deletes multiple characters at once. Discrete caret control is often implemented by left and right arrow keys and moves the cursor by a single character. Continuous caret control moves the cursor along multiple characters in a single smooth interaction (e.g., on a touchpad). For larger text distances (e.g., multi-line text sections), cursor positioning inside the text field could be beneficial.
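As a concrete illustration of this design space, the following sketch implements character- and word-level backspace and discrete caret control. Names and structure are our own and are not taken from Li et al. [58].

```python
class EditableText:
    """Minimal text buffer illustrating backspace granularity and
    discrete caret control."""

    def __init__(self) -> None:
        self.text: str = ""
        self.caret: int = 0  # insertion index into self.text

    def backspace_char(self) -> None:
        """Character-level backspace: delete one character before the caret."""
        if self.caret > 0:
            self.text = self.text[:self.caret - 1] + self.text[self.caret:]
            self.caret -= 1

    def backspace_word(self) -> None:
        """Word-level backspace: delete back to the previous word boundary."""
        left = self.text[:self.caret].rstrip()
        cut = left.rfind(" ") + 1  # 0 if no preceding space exists
        self.text = self.text[:cut] + self.text[self.caret:]
        self.caret = cut

    def move_caret(self, delta: int) -> None:
        """Discrete caret control via left/right arrow keys."""
        self.caret = max(0, min(len(self.text), self.caret + delta))
```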
A more advanced approach is word suggestions: users choose from a list of alternative words that replaces the current input sequence. Researchers have presented up to five word suggestions in their keyboard implementations [1,21,35,60,67,74,93,94]. A user evaluation by Markussen et al. [67] showed that word suggestions were used more frequently than backspace corrections. Since waiting for system feedback can increase frustration [17,77], suggestions should be continuously refined and visible while typing. Faulty autocorrection of spelling errors can frustrate users and create a sense of failure [3,90]. The negative impact on mental and physical effort [3] may also inevitably affect usability and user experience [41].
Text revision can significantly improve input speed and allow the user to enter text without errors. Therefore, input techniques in real-world scenarios should provide correction capabilities that work reliably to avoid frustration. Previous work therefore recommends arrow keys for discrete and continuous cursor movements, backspace, and word suggestions.

Phrase Sets
When assessing text input techniques, participants commonly enter predefined text phrases while performance data is collected [65]. Compared to free text entry, predefined phrases have the advantage that there is a reference sentence for comparison, and each person enters approximately the same number of words. The most frequently used phrase sets for text input evaluations are the MacKenzie [65] corpus and the EnronMobile [87] corpus. In addition, researchers have used text sources such as newspapers [21,70], movie subtitles [21,89], specific solutions for target groups [37,86], or self-created text phrases. The MacKenzie phrase set contains 500 English text phrases written mostly in lowercase, without numbers, and with limited punctuation. EnronMobile is based on the Enron corpus, which contains 1.1 million sentences created by employees with their BlackBerry smartphones. For the EnronMobile corpus, the authors filtered the original corpus and selected 2200 sentences, including lowercase, numbers, and punctuation. When non-native speakers write English sentences, this can lead to longer memorization times and higher transcription error rates [26,44]. Therefore, researchers have translated text phrases into the participants' native language. For multilingual text evaluations, it is important that the phrase set achieves similar performance across languages [26]; important measures are words per phrase and letters per word. Simple sentences include only alphabetic lowercase letters and require one keystroke per character. Complex sentences are more demanding and comprise modifier keys, special characters, or numbers [72]. Entering sentences also becomes more demanding when text revision using arrow keys, backspace, and word suggestions is involved.
Many studies focused on pure word-based text entry and omitted capitalization and punctuation [5,58,84]. Ignoring capitalization and punctuation can increase the comparability of different experiments, but it also limits their validity in real-world scenarios [65]. Publications should clearly define their target validity and possible sources of interference from the selected text phrases. If the evaluated input techniques are to be applicable in the real world, complex sentences seem preferable to simple ones. To achieve the best possible result, the text phrases should be displayed in the native language of the subjects. In today's world, however, multilingual input is also essential. As always, it depends on the particular use case.

CONCEPT
We believe that contact-based approaches are the most promising solution for future XR workspaces. Therefore, we decided to evaluate contact-based mid-air virtual tap and swipe keyboards in VR and VST AR (see Fig. 1). We arranged our keys according to the local keyboard layout standard (QWERTZ). For quick punctuation, we positioned the most important special characters (.,?!-) next to the letter keys. Upper and lower case can be selected with the shift keys, and spaces are inserted with the space bar. Additional keys for entering numbers or special characters could be placed on a separate page. The shift keys and the space bar are hidden on the swipe keyboard because they are not required for gesture-based text entry. As recommended by Li et al. [56], we integrated a backspace key to correct spelling errors and arrow keys to position the cursor discretely or continuously. The cursor position can also be adjusted directly in the input field. We also offer three word suggestions that allow a quick replacement of misspelled words.
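The snippet below gives a hypothetical flavor of a row-based, text-based keyboard layout description; the concrete format of our implementation differs, and the tokens shown here are purely illustrative.

```python
# Hypothetical row-based keyboard layout description (illustrative only;
# the actual format used by our implementation differs). Each line lists
# the keys of one row; tokens in angle brackets denote special keys.
QWERTZ_LAYOUT = """\
q w e r t z u i o p ü
a s d f g h j k l ö ä .
<shift> y x c v b n m , ? ! -
<left> <right> <space> <backspace>
"""

rows = [line.split() for line in QWERTZ_LAYOUT.splitlines() if line.strip()]
```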
We designed our keyboards to be compatible with arbitrary XR controllers and hand tracking. Since inaccurate hand tracking can lead to unintended key interactions, we follow the approach of previous work [5,40,73] and use the controllers upside down in a pen-like posture to allow precise interaction with the virtual keyboard keys. We decided on a virtual keyboard size of 51 cm x 13 cm (opaque surface: 53 cm x 19 cm) that neither requires excessive arm movement nor is too small for the bottom tip of the XR controllers. The dimensions of the 3D letter keys are 2.25 cm x 2.25 cm x 0.05 cm with a spacing of 1.25 cm between the keys. For active feedback during text input, the keys visually depress and trigger a vibration in the controller. In addition, for the swipe keyboard, a visual tail of the controller movement is displayed during gesture input. Although we focus on non-stationary XR workspaces, our keyboard is also suitable for alignment with physical surfaces using alignment techniques proposed by previous work [40,88,102]. We use Unity 2020.3 LTS and the XR plugin framework for our reference implementation with the Varjo XR-3 and two HTC Vive Pro controllers.

Phrase Set
In our study, subjects transcribe sentences from a phrase set. We chose the MacKenzie corpus because it is the most frequently used and has shown convincing text entry and error rates [44]. We translated the phrases into German, corrected spelling errors, and added punctuation to make our results more applicable to real-world scenarios [65]. We recognized several phrases as inappropriate because they offend the participant personally (e.g., "you are a capitalist pig" or "if you were not so stupid") or touch sensitive topics (e.g., "a psychiatrist will help you", "do you get nervous when you speak", or "a tumor is OK provided it is benign"). Some also address political or moral domains (e.g., "the accident scene is a shrine for fans", "coalition governments never work", or "everybody loses in custody battles"). In addition, it is unclear how such sentences affect user experience ratings. We progressively filtered inappropriate sentences and selected ten phrases with a length of 30 ± 10 characters, including punctuation marks (!?.,) and German umlauts (ä, ö, ü). As sentence length can influence reading time and error rate [87], the words per sentence and the letters per word should be as similar as possible, especially across languages [26]. Therefore, we put a special focus on sentence lengths comparable to the original phrase set and consistent within the chosen sentences. The original English phrases from the MacKenzie corpus [65] and our translated selection with corrected spelling and punctuation are shown in Tab. 1. We also provide a public German translation of the entire MacKenzie phrase set of 500 sentences.
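As a small illustration of the length criterion (not of the manual content screening), a filter over candidate translations could look as follows; the example phrases are placeholders, not items from our final set.

```python
# Illustrative length filter for candidate phrases: 30 +/- 10 characters,
# punctuation included. The screening of inappropriate content was a
# manual step and is not reproduced here.
def within_target_length(phrase: str, target: int = 30, tolerance: int = 10) -> bool:
    return abs(len(phrase) - target) <= tolerance

candidates = [
    "Das Wetter ist heute schön.",   # placeholder phrases, not taken
    "Wir üben jeden Tag Klavier.",   # from the actual translated set
]
selected = [p for p in candidates if within_target_length(p)]
```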

Dictionary
While individual keys are pressed for text input with tap keyboards, the algorithms used for our swipe keyboard generate gestures from words and compare them to the user's gesture path for recognition. As a word source, we used the DeReKo German reference corpus with 80,000 words, including word frequencies. We optimized the dictionary by removing entries with numbers or abbreviations. Further, we reduced the dictionary to 35,000 words based on word frequencies.
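A minimal sketch of this dictionary preparation follows, assuming the corpus is already available as (word, frequency) pairs; DeReKo's actual distribution format differs, so loading is not shown.

```python
import re
from typing import Dict, List, Tuple

def build_dictionary(entries: List[Tuple[str, int]], size: int = 35_000) -> Dict[str, int]:
    """Drop entries containing digits or abbreviation periods, then keep
    the `size` most frequent remaining words."""
    clean = [(word, freq) for word, freq in entries
             if not re.search(r"[0-9.]", word)]
    clean.sort(key=lambda wf: wf[1], reverse=True)
    return dict(clean[:size])
```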
We adopted the approach of [67] and added missing words of the translated MacKenzie phrase set to the dictionary. We used the Google Ngram Viewer to obtain their word frequencies. While this improves the recognition result, it may also affect comparability with other work that considered out-of-vocabulary handling. For our swipe keyboard, we followed previous work [92,97] and first implemented a character-based approach using the Damerau-Levenshtein distance algorithm [14,55]. However, this did not provide satisfactory results for our large 35,000-word dictionary. Therefore, we based our swipe keyboard on the algorithm proposed by Kristensson et al. [45] (SHARK2) and followed the implementation of Markussen et al. [67] (Vulture). Our swipe keyboard recognizes words from the vocabulary, but not out-of-vocabulary words or passwords. For the former, an extended algorithm might dynamically add unknown words entered via the tap keyboard to the dictionary for future recognition.
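The core matching step can be sketched as follows: the user's gesture and each word template are resampled, compared by mean point-wise distance, and re-ranked by word frequency. This is a simplified, location-only variant in the spirit of SHARK2 [45], not a faithful reproduction of the Vulture-based implementation described above; it reuses word_template() and resample() from the template sketch in the related work section.

```python
import math
from typing import Dict, List

def path_distance(a: List[Point], b: List[Point]) -> float:
    """Mean point-wise Euclidean distance between two resampled paths."""
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)

def recognize(gesture: List[Point],
              dictionary: Dict[str, int],
              key_centers: Dict[str, Point],
              top_k: int = 3) -> List[str]:
    """Rank dictionary words by similarity to the user's gesture,
    weighted by word frequency."""
    g = resample(gesture)
    scored = []
    for word, freq in dictionary.items():
        template = resample(word_template(word, key_centers))
        if template:
            # lower distance and higher frequency yield a better (smaller) score
            scored.append((path_distance(g, template) / math.log(freq + 2), word))
    return [word for _, word in sorted(scored)[:top_k]]
```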

Text Revision Techniques
We adopted the recommendations of Li et al. [58] and provide text revision techniques designed for ease of use, low physical and mental load, and robustness in handling various revision conditions. We implemented three text revision techniques optimized for tap and swipe keyboards based on the smallest adaptable unit (characters for tap keyboards and words for swipe keyboards). For both keyboards, three word suggestions are displayed in a horizontal bar, determined by the Levenshtein distance algorithm using the filtered dictionary.
Our system continuously calculates word suggestions during user input to avoid frustration caused by waiting for a system response [17,77]. As a caret control mechanism, we provide left and right arrow keys, as well as a backspace key for quick character-based (tap keyboard) or word-based (swipe keyboard) revisions. We also reduce the number of arrow keystrokes required to move the cursor by allowing the user to place the cursor directly at the desired character or word in the input field.
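A compact sketch of the suggestion mechanism: the current input token is compared against the filtered dictionary with the Levenshtein distance, and the three closest words fill the suggestion bar. Breaking ties by word frequency is an illustrative choice here; thresholds and incremental updates of the actual system are omitted.

```python
from typing import Dict, List

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggestions(token: str, dictionary: Dict[str, int], k: int = 3) -> List[str]:
    """Return the k dictionary words closest to the current input token,
    breaking ties by word frequency (an illustrative choice)."""
    ranked = sorted(dictionary, key=lambda w: (levenshtein(token, w), -dictionary[w]))
    return ranked[:k]
```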

METHODOLOGY
The experiment followed a 2 × 2 between-within subjects study design with XR display (levels: VR, VST AR) as the between-group factor and input technique (levels: Tap, Swipe) as the within-group factor. We followed the current COVID-19 regulations and assume that these did not influence the results of our study. Subjects were only required to wear a mask when answering the questionnaires; the experimenter always wore a mask. The study was conducted in a 20-square-meter room to ensure enough space between the experimenter and the subjects. After each experimental session, the room was thoroughly ventilated and the surfaces (e.g., tables and electrical devices) were disinfected. We recorded a detailed log file of each session that includes keyboard interactions and movements of the head, the eyes, and the left- and right-hand controllers. A detailed ethics proposal was submitted to the ethics committee of the Institute for Human-Computer Media (MCM) of the University of Wuerzburg and found to be ethically unobjectionable. We chose the text presentation style Transcribe [44] and constantly displayed the phrase to be entered. Further, we chose the error correction condition Recommended from Arif and Stuerzlinger [4].

Participants
We recruited 64 participants (45 female, 19 male) through the university's participant management system. They received either 10 euros (n = 37) or student credit points (n = 27) commensurate with the time spent on the experiment. The age of the participants ranged from 19 to 48 years (M = 24.8, SD = 5.6). The majority of participants were students (n = 55). 59 subjects were native speakers, and 5 had between 3 and 10 years of knowledge of the local language. We asked participants about their prior experience with VR, AR, and text input. In the real world, all participants had typed on a physical keyboard for more than 20 hours, and all but one had worked with the local keyboard layout for more than 20 hours. While tap keyboards on touchscreen displays (e.g., smartphones or tablets) had been used comparably to physical keyboards, participants had less than one hour of experience with swipe keyboards. On average, participants had one to three hours of VR experience and less than one hour of or no experience with AR. Some participants were also familiar with the controller used in this study. The majority of participants had no to very little experience with text entry in VR and AR. Subjects had no impairment in reading, writing, or speaking. We also screened for color blindness, visual and hearing impairments, and, finally, for the ability to move the wrist. We did not have to exclude any subjects based on these criteria or technical problems.

Apparatus
We chose the Varjo XR-3 as an up-to-date solution for high-quality VST AR that supports controllers, hand/finger tracking, and eye tracking. We used two HTC Vive Pro controllers. During the VR and VST AR conditions, participants stood in the center of the room, wore the XR device, and held the controllers in their hands. Motion tracking was done via Varjo Base 3.5.1 and SteamVR. We used four HTC Vive Base Stations 2.0 mounted in the corners of the room at a height of 3 meters. We used a Microsoft Windows 10 based computer system with an Intel i9-11900 CPU, 64 GB DDR4 RAM, a 1 TB Samsung 970 EVO NVMe M.2 SSD, and an NVIDIA GeForce RTX 3080 GPU.

Digital Twin
We accurately measured the physical space and used a 3D modeling tool (Blender) to hand-create a digital twin of the physical room (see Fig. 2). We took photos of real surfaces to texture virtual representatives. In VST AR, subjects saw the real physical space via the video see-through cameras of the XR device. In VR, subjects were shown the digital twin.

Procedure
Our study followed the procedure shown in Fig. 3 and took about 60 minutes. Subjects participated in either the VR or the VST AR condition (between factor); the experimenter counterbalanced the XR display condition. In the beginning, subjects were asked to read and agree to the study information. Subsequently, they answered demographic and prior knowledge questions as well as the SSQ. Participants were then instructed on how to use the XR device. The correct positioning of the XR device and calibration to the participant's eyes are essential for high-quality VR and VST AR. The first input technique (Tap or Swipe) was also counterbalanced by the experimenter (within factor). The subjects then entered the ten selected sentences in random order (see Tab. 1), using one or two controllers for text input. Participants were advised to enter the text phrases as error-free and as fast as possible. After text entry, participants took off the XR device and were asked to complete the questionnaires. We followed the same procedure for the second input technique (Tap or Swipe).
To increase the internal validity of our user evaluation (e.g., against order and sequence effects), we systematically counterbalanced the assignment of subjects to experimental conditions (a form of quasi-randomization). For this, we defined the specified biological sex as a blocking variable and alternated the between factor (VR or VST AR) and the within factor (Tap or Swipe) in a structured and predefined manner.

EVALUATION METRICS
We based our study design on the evaluation suite of Kern et al. [40] and extended it with performance measures for text input techniques. We evaluated the users' preferences with questionnaires for well-being (SSQ, [38]), task load (RTLX, [31,32]), user experience (UEQ, [52]), and usability (SUS, [11]). According to Arif and Stuerzlinger [4], we calculated subjects' average text entry rate using words-per-minute (WPM) and errors in the transcribed text using error rate (ER).
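For reference, the standard computations can be sketched as follows. This is a common formulation of the measures in [4]; the error rate here uses the minimum string distance between the presented and transcribed phrases, reusing levenshtein() from the suggestion sketch above.

```python
def words_per_minute(transcribed: str, seconds: float) -> float:
    """Entry rate: five characters count as one word; the first character
    carries no timing information and is excluded [4]."""
    return ((len(transcribed) - 1) / seconds) * (60.0 / 5.0)

def error_rate(presented: str, transcribed: str) -> float:
    """Error rate as minimum string distance (MSD) between presented and
    transcribed text, normalized by the longer string, in percent."""
    msd = levenshtein(presented, transcribed)  # from the suggestion sketch
    return 100.0 * msd / max(len(presented), len(transcribed))
```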
The assumption of homogeneity of variance-covariance matrices (Box's M test) was met by all variables except mental demand (RTLX) and words-per-minute (WPM). The Shapiro-Wilk test showed that the residuals of several variables are not approximately normally distributed. However, previous work considered the ANOVA robust to violations of normality [49,78] when group sizes are approximately equal and n > 10 [7,33]. Nevertheless, we validated our results with a cross-check using non-parametric tests (paired: Wilcoxon signed-rank test; unpaired: Mann-Whitney U test), which showed no differences in the significance of our results. Since our experiment has a group size of n = 32 per condition, we consider the requirement of normal distribution to be satisfied. Therefore, for clarity and comparability, we report the results of the parametric tests for all variables. In this study, no interaction effects were found for subjective or objective measures. Therefore, we first computed post hoc tests for our main factors and, where main factors were significant, calculated the respective post hoc tests for simple main effects. All tests were performed against an α of .05. We performed two-tailed paired-sample post hoc t-tests with Bonferroni-Holm correction when the respective main effect revealed significant differences. The descriptive results are listed in Tab. 2 and, for the SSQ, in Tab. 3.
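The non-parametric cross-check described above can be reproduced along these lines; the arrays are random placeholders standing in for the logged per-participant scores.

```python
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

rng = np.random.default_rng(0)
wpm_tap = rng.normal(12.0, 2.0, 64)    # placeholder paired scores,
wpm_swipe = rng.normal(10.0, 2.0, 64)  # one pair per participant
wpm_vr = rng.normal(12.0, 2.0, 32)     # placeholder scores of the two
wpm_ar = rng.normal(10.0, 2.0, 32)     # independent display groups

# Paired within factor (input technique): Wilcoxon signed-rank test
stat_w, p_within = wilcoxon(wpm_tap, wpm_swipe)
# Unpaired between factor (XR display): Mann-Whitney U test
stat_u, p_between = mannwhitneyu(wpm_vr, wpm_ar, alternative="two-sided")
```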

System Usability
We found no interaction effects and no main effects for XR displays on the SUS scale. However, there was a significant main effect for input techniques: tap keyboards received significantly higher usability ratings than swipe keyboards.

User Experience
For all subscales of the UEQ, we found no significant interaction effects and no main effects for XR displays. Dependability ratings revealed a significant main effect for input techniques, F(1, 62) = 28.574, p < .001, ηp² = 0.315: participants rated the dependability of tap keyboards significantly higher than that of swipe keyboards.

Figure: User Experience Questionnaire (UEQ) scores for each subscale and condition. The scales range from -3 (Bad) to 3 (Excellent); ** p < 0.01, *** p < 0.001, **** p < 0.0001.

Task Load
We found no interaction effects and no main effects for XR displays on all subscales of the RTLX. With regard to the subscale performance, we found a significant main effect for input techniques, with better perceived performance for tap keyboards than for swipe keyboards.

Figure: Raw NASA-TLX (RTLX) scores for each subscale and condition; * p < 0.05, ** p < 0.01, *** p < 0.001.

Task Performance
Based on the learning effect and in line with Xu et al. [92], we found that participants required two sentences to become familiar with the keyboards. Therefore, we define the first two sentences as training and the remaining eight sentences as validation. The results for WPM, ER, and the respective learning effect are shown in Fig. 7. For words-per-minute (WPM) in the training phase, there was no significant interaction effect, but there were significant main effects for XR displays and input techniques.

DISCUSSION
The goal of our work was to investigate how different XR displays (VR, VST AR) and input techniques (Tap, Swipe) affect performance, usability, user experience, task load, and well-being. In addition, we explored the challenges of text input in VST AR and proposed ways to improve it. Our study presents a novel evaluation that directly compares contact-based mid-air virtual tap and swipe keyboards in VR and VST AR using a single XR device (Varjo XR-3). We compared our results with previous work in VR and OST AR and form a foundation for future VST AR text input evaluations.
To answer our research questions, we first summarize our main findings and explain them in detail in the following sections: XR displays (VR, VST AR) and input techniques (Tap, Swipe) strongly affected text input speed, while the error rate remained similarly low. In our study, participants achieved significantly higher input speeds in VR than in VST AR, and in VR with tap keyboards compared to swipe keyboards. XR displays did not affect usability, user experience, task load, or well-being. We found no interaction effects between XR displays and input techniques for any subscale of the questionnaires or for the performance measures. Our results revealed a significant learning effect between the first two sentences and the following eight sentences for both input techniques and XR displays. The usability and user experience results showed that participants preferred tap keyboards over swipe keyboards. However, there was no preference in terms of the XR display for VR or VST AR.

Task Performance
For text entry speed, we found significant differences for input techniques and also for XR displays, while the error rate remained similarly low. The low error rate across all conditions suggests that subjects focused on typing without errors rather than achieving the highest input speeds. Since in reality, people usually aim for error-free text input, this corresponds to real-world behavior. Subjects performed text entry significantly faster in VR than in VST AR with both input techniques. Further, tap keyboards outperformed swipe keyboards in VR, but not in VST AR. Our participants achieved excellent results with only ten sentences entered per keyboard, while they also paid attention to capitalization, punctuation, and error correction.
We explain the significantly lower input speed of both input techniques in VST AR compared to VR by the visual conflict between the real controller/hand shown in the video stream and the virtual controller model used for keyboard interaction. The visual mismatch can be resolved by segmenting the real controller/hands in the video stream, as has been done in previous work for physical keyboards [59,68,72]. However, this requires reliably detecting the contact point between the segmented controller/hand and the virtual keyboard; such detection is not required for physical keyboards because physical keys are pressed. We found another simple but effective way to reduce the visual conflict: adding an opaque surface behind the virtual keyboard that hides the real controller/hand. In this study, we chose this second approach and used a predefined point on the virtual controller model for keyboard interaction.
We follow recent theories that define coherence [82,83] or, more specifically, congruence [51] as critical preconditions for a plausible and compelling XR exposure to further explain the overall lower text input performance in VST AR compared to VR. Our results indicate that the inherent visual mismatches lead to incongruent visual depth cues in VST AR during text entry, which ultimately may have contributed to a significant impact on performance. Notably, these specific low-level incongruencies did not affect perceived usability, user experience, task load, or well-being in our study, indicating either a lower relevance for the latter factors and/or a difference in the relative weighting of these cues for these factors [51].
The significantly lower input speed for swipe keyboards in VR can be explained as follows: Participants had less prior experience with swipe keyboards than with tap keyboards. This may increase the time to plan and execute gestures [66,99], whereas tap keyboards require only single keystrokes. Further, inaccurate gestures can lead to recognition errors [99] and thus influence input speed and error rate. Since we used a dictionary with 35,000 entries, words may have very similar gestures, and the probability of incorrectly recognized words increases. One solution would be to preselect words based on gesture length or sentence semantics. We assume that typing speed increases significantly with experience, as shown in previous studies with more than 50 sentences typed on a swipe keyboard [67,97]. Our assumptions are supported by the revealed significant learning effect, as well as by the fact that the swipe keyboard was only significantly slower in VR, but not in VST AR. This is also reflected in the higher mental demand and frustration ratings (RTLX) and the lower dependability and perspicuity ratings (UEQ) for swipe keyboards compared to tap keyboards.
Compared to previous work in VR, subjects in our study achieved similar input speeds and error rates for tap keyboards [8,84] and swipe keyboards [13,97]. For VST AR, it is difficult to compare our results with previous work, as they exclusively used OST AR and hand tracking. Nevertheless, it can be concluded that our tap and swipe keyboard in VST AR with controllers achieved higher or at least comparable results to Tap [76,92] and Swipe [92] in OST AR.
Since our contact-based input techniques enable higher or at least similar text input performance after only a few typed sentences, compared to distance-based approaches [8,84], the question arises of whether distance-based techniques have reached an upper limit, while contact-based approaches can surpass it. One can imagine how fast text input might be with accurate 10-finger tracking [20,30,81].

User Preference
In terms of usability, we found significant differences for input techniques, but not for XR displays. For user experience and task load, some subscales showed significant differences, while others did not. Neither XR displays nor input techniques showed a significant impact on the subjects' well-being.
Tap keyboards achieved very good to excellent usability, swipe keyboards only showed average results. The user experience ratings indicated that it was significantly easier for subjects to become familiar with and learn how to use tap keyboards. The results also suggest that tap keyboards provided a significantly greater sense of control. Participants had little to no prior experience with text entry in VR and VST AR, and prior knowledge of touchscreen-based text entry with tapping was significantly higher than with swiping. Therefore, we attribute the higher novelty ratings for swiping to the general novelty effect of immersive technologies and prior knowledge of input techniques.
For well-being, we observed slightly higher values for disorientation, but these differences are not significant and remain very low considering the scale. We assume that simulator sickness had no negative influence on our experiment. Our results for text input in VR are in line with previous work, which also showed similar usability and user experience for tap [8,84] and swipe keyboards [13]. Again, for text input in VST AR, it is difficult to compare with previous work, as publications used OST AR devices with hand tracking. However, compared to OST AR, we found at least partially similar results for tap keyboards [76,92] and swipe keyboards [76,92].
The task load analysis revealed that subjects experienced similar physical demand, temporal demand, and effort for both input techniques and XR displays. The relatively low physical demand, which did not differ significantly between XR displays and input techniques, suggests that head or wrist fatigue was not present. However, this may change if the text input time is increased [5,84]. For tap keyboards, participants perceived significantly higher performance in VR and significantly lower mental demand in VR and VST AR. Combined with significantly higher frustration levels for swipe keyboards, this suggests that participants felt less successful with swiping than with tapping. This assumption is supported by the dependability score of the UEQ and the objectively measured text input speed. The higher mental demand is another indicator of the greater effort required to plan and execute a swipe gesture compared to pressing individual keys on a tap keyboard, even though this is not reflected in the effort subscale of the RTLX. Compared to previous work, our tap and swipe keyboards achieved lower or at least similar task load in VR [13,84] and OST AR [76,92].
In addition, we compared the most similar input technique, controller tapping (CT), of Speicher et al. [84] with our tap keyboard in VR in more detail. The comparison showed that our tap keyboard achieved better results for all objective and subjective measures that are part of their decision support tool. Our solution is even comparable to controller pointing (CP). We consider compatibility with controllers and hands in XR workspaces to be an essential prerequisite for virtual keyboards, which is why we focused on contact-based approaches. Nevertheless, it can also be argued that minimizing task load is the key requirement. Combined with the previously raised question of upper limits for input speed, this paves the way for exciting new research.

CONTRIBUTION AND CONCLUSION
Virtual keyboards are a promising solution for non-stationary XR workspaces where a physical keyboard is not available or not technically supported. While virtual keyboards have been extensively evaluated in either VR or OST AR, there was a lack of research in VST AR. Our work fills this gap and provides novel insights into the usability and performance of contact-based mid-air virtual tap and word-gesture (swipe) keyboards in VR and VST AR. We used a single XR device to reduce a key confounding variable in text entry studies.
From our findings, we derive three conclusions: 1) XR displays (VR, VST AR) and input techniques (Tap, Swipe) have a strong impact on text input speed, while the error rate remains comparatively low, 2) as expected and consistent with previous work, input techniques highly influence usability, user experience, and task load, and 3) the question of which XR display enabled higher performance with contact-based virtual text input techniques can be clearly answered with VR. However, the subjective results showed no difference between VR and VST AR. The significant differences in subjective and objective measurements emphasize the importance of specific evaluations for input techniques and XR displays to provide reusable, reliable, and high-quality text input solutions. We have created a publicly available reference implementation that can be used in future studies and developments (https://go.uniwue.de/tap-and-word-gesture-keyboards). We hope this work forms a foundation for future research and fosters the development of XR workspaces.

LIMITATIONS AND FUTURE WORK
The results of our study showed that contact-based virtual tap keyboards performed better than swipe keyboards in VR. The question arises whether the differences are explained by the subjects' lack of prior experience or by a technical limitation of our swipe keyboard. However, a preliminary test showed that our implementation also recognized arbitrary words without errors. Future work should compare the influence of prior experience and of different swipe implementations (e.g., gesture recognition algorithms or better dictionaries) on text entry performance. A higher number of sentences written by the participants should also be considered. Although participants wrote text phrases with capitalization and punctuation, a keyboard for the XR workspace should support the full range of symbolic input. With our text-based layout description, our keyboards can also support numbers, special characters, or other keyboard layouts. We only investigated virtual tap and swipe keyboards with contact-based controller interactions. In addition, hand tracking is becoming increasingly robust and hence popular. To verify generalizability, reusability, and performance, alternative XR controller devices as well as hand tracking solutions should also be investigated as potential input sources. The integration of controllers and hand tracking is straightforward using the Unity Engine and the XR Plugin Management. The question of whether our study results can be generalized to distance-based approaches so far remains unanswered.

ACKNOWLEDGMENTS
We thank Lukas Schreiner and Jonathan Tschanter for their support in developing the reference implementation and Lukas Schreiner for conducting the user study. This work was supported by the German Federal Ministry of Education and Research in the project ViLeArn More (Reference: 16DHB2214) and by the Bavarian State Ministry For Digital Affairs in the project XR Hub (Reference: A5-3822-2-16).