Assessing Depth Perception in VR and Video See-Through AR: A Comparison on Distance Judgment, Performance, and Preference

Spatial User Interfaces along the Reality-Virtuality continuum heavily depend on accurate depth perception. However, current display technologies still exhibit shortcomings in the simulation of accurate depth cues, and these shortcomings vary between Virtual and Augmented Reality (VR and AR, or eXtended Reality (XR) for short). This article compares depth perception between VR and Video See-Through (VST) AR. We developed a digital twin of an existing office room where users had to perform five depth-dependent tasks in VR and VST AR. Thirty-two participants took part in a user study using a 1 × 4 within-subjects design. Our results reveal higher misjudgment rates in VST AR due to conflicting depth cues between virtual and physical content. Increased head movements observed in participants were interpreted as a compensatory response to these conflicting cues. Furthermore, a longer task completion time in the VST AR condition indicates a lower task performance in VST AR. Interestingly, although participants rated the VR condition as easier, and despite the increased misjudgments and lower performance with the VST AR display, a majority still expressed a preference for the VST AR experience. We discuss and explain these findings with the high visual dominance and referential power of the physical content in the VST AR condition, leading to a higher spatial presence and plausibility.


INTRODUCTION
In recent years, Spatial User Interfaces (SUIs) such as Virtual Reality (VR) and Augmented Reality (AR) have captivated many areas. These platforms promise to redefine our interaction with digital content, incorporating seamless integration into our physical world (AR) or immersing us entirely in artificial environments (VR). While the potential applications of VR and AR span diverse fields, such as entertainment, health, education, or maintenance, a fundamental understanding of how we perceive and interact within these environments remains a topic of ongoing research. In computer-generated environments, a major challenge is to place the virtual content at the right depth and to provide a coherent set of depth cues that enables users to perceive this depth correctly and make sense of it. In AR, an additional challenge lies in combining depth cues from the virtual and the real world so that they result in the perception of a congruent scenario [28,44]. In VR, users perceive one congruent scenario in which depth cues affect all virtual content similarly. In AR, contradicting depth cues between virtual and physical content can lead to misinterpretation of spatial information, potentially affecting user performance, safety, and overall immersion.
The quality of blending virtual and physical content is limited by a constellation of factors, including (1) hardware constraints such as latency, optical distortions, and tracking inaccuracies that can result in the misplacement of virtual objects within the real environment, and (2) disparities in the appearance of virtual and physical components, encompassing differences in color and illumination. There is a large body of knowledge addressing these perceptual incongruencies and how different technologies cope with them [2,4,9,10,27]. Different AR display technologies entail different incongruencies. In optical see-through (OST) AR displays, users can directly view the environment, and virtual content is added as an overlay by an optical combiner. Since the view of the environment stays undistorted, depth perception in OST AR displays shows more accurate results compared to VR [19,36]. In contrast, the video see-through (VST) AR display uses a real-time video stream with virtual content added on top of this video stream. Compared to OST AR displays, depth estimations in VST AR displays are less accurate [1,3].
While a lot of research has been conducted to examine depth perception in OST AR displays, VST AR displays remain underexplored, as empirical studies are rare. Since consumer Head-Mounted Displays (HMDs) increasingly offer VST functionality besides classic VR, new use cases arise that make use of both (VST AR and VR) and enable transitions along the Reality-Virtuality (RV) continuum between reality and virtuality without the necessity of switching the HMD. Hence, it is particularly important to determine whether (depth) perceptions at different points on the RV continuum [32] are comparable, facilitating the application of VR research findings to various other forms. To our knowledge, a direct comparison between VR and VST AR concerning depth perception has not been conducted so far. Thus, in this paper, we attempt to answer the following research question: "Is there a difference concerning depth perception between VR and VST AR?" Display-mediated visual perception is potentially influenced by a variety of display characteristics, including the ergonomics (e.g., wear comfort and weight) of HMDs. However, many comparative studies use different HMDs for the assessment, so their results on depth perception carry the differing influences of the respective HMD characteristics.
In our work, we present a comparison of depth perception between VR and VST AR with the Meta Quest Pro. Using one HMD for both conditions minimizes the possible side effects of hardware characteristics. We introduce a set of tasks to assess egocentric depth perception in VR and VST AR and collect empirical data on depth perception, task performance, and preference in a user study. Our findings enhance the understanding of depth perception within SUIs and highlight the challenges present in this domain.

RELATED WORK
VR and AR can be located on Milgram's RV continuum [32] and represent different forms of Mixed Reality (MR) [40]. VR applications are situated near the right endpoint, Virtuality. AR applications can be located between Virtuality and Reality since AR technology augments the physical environment with a virtual overlay, accounting for different real and virtual proportions. Building upon this continuum, AR displays have evolved in various forms. Handheld and projector-based devices are complemented by AR HMDs [27,33], further divided into VST and OST AR displays. VST AR displays use external cameras to capture the environment and stream the image directly into the HMD. Virtual content is added as an overlay. The main advantage of this setup is the high control over the environment. The scene is discretized and situated in the same pixel rasterization as the virtual content, enabling the adaptation of visual coherence. Disadvantages of VST AR displays are the reduced resolution due to image compression, lens distortions, and time lags, which could ultimately contribute to a wrong or distorted depth perception. OST AR displays use optical combiners to project content on collimating lenses. This technology has the main advantage of a direct view of the environment, as there is no image compression.
Most AR-related research has been conducted with OST AR HMDs. However, in recent years, VST AR HMDs (such as the Varjo XR-3, the Meta Quest Pro, and the Meta Quest 3) have reached an acceptable display quality to become a valuable technology to apply in different areas, such as education, training, health, manufacturing, or entertainment. Future release announcements appear even more promising, strengthening the need to focus more research on this display technology.
With regard to the precise interactions that are required in AR-supported operations (such as surgery or maintenance), perceiving depth correctly is essential. Depth perception is an important part of our visual sense-making [17], and depth cues such as stereo vision, motion parallax, object occlusion, or perspective support the perception of space.
For VR displays, advances in computer graphics and rendering are already well-established to provide depth information for the user [11,18]. In VST AR, virtual content can be rendered accordingly. However, the challenge remains to match the depth cues rendered for the virtual objects with those from the physical environment. Not only the visualization of objects but also a wrong registration in the environment, lens distortion, latency, etc., can lead to conflicting visual-visual stimuli that users easily detect and might be disturbed by [2].
Various tasks have been developed to assess depth perception. These include verbal reports of distances towards virtual objects, bisection tasks (where participants mark the half distance to virtual objects), or blind actions [1,16]. In blind actions, participants see virtual objects for a while. They then have to reach or walk blindly to the position where they estimate an object to be. Because the requirements of these tasks are very diverse (e.g., some require motor skills, while others only include perceptual/cognitive processing, some are continuous, and others are static), it is hard to establish a common ground controlling for task-specific confounds.
In VR, a systematic underestimation of distances was observed [16,19,20,37]. Kelly [20] conducted a literature review and summarized an average underestimation ratio of 73.48% of the actual distance (where 100% would be an accurate estimation and a value > 100% would be an overestimation). He concluded that a wider field of view, less weight, and a higher pixel density of the HMD lead to a more accurate depth estimation. Because HMDs have improved on these aspects in recent years and will further improve, depth judgments among newer HMDs are expected to become more accurate. Willemsen et al. [47] also examined the problem of distance underestimation and mechanical aspects that might influence depth estimation. To some extent, these mechanical attributes of the HMD (weight, moments of inertia) account for the distance compression. However, the authors assumed there must be other perceptual reasons why users underestimate distances in virtual environments. In another work, Kelly et al. [21] measured depth estimations in the Meta Quest and Meta Quest 2. The results from a verbal report revealed an underestimation of 82% (Meta Quest) and 75% (Meta Quest 2) compared to a real-world estimation of 94% of the actual distance.
Jones et al. [19] examined depth judgments in VR and OST AR displays with a blind walking task. While they did not detect a distance underestimation in the OST AR display, they found an underestimation effect for the VR condition. They added an additional factor of motion parallax since they expected that higher motion while viewing the object would contribute to a better estimation. In the control condition, participants were asked to remain in one position. Against their expectation, the authors could not find an effect of motion parallax on the depth judgment between VR and OST AR. Ping et al. [36] implemented a task to move a virtual bar back and forth to match the distance of a ball on a shuffleboard. In VR, the ball and the shuffleboard were virtual. In the OST AR view, a real shuffleboard and ball were provided, even though the ball was also displayed virtually. They measured a higher accuracy in the OST AR condition. Ping et al. [36] and Jones et al. [19] both measured a higher error the farther the target objects were away. Cidota et al. [8] enhanced VR and OST AR with visual effects, i.e., blur and fade effects, to investigate if these effects alter the perception and performance in their system. Participants performed grasping and sorting tasks into boxes at different depths. They did not measure a difference between VR and OST AR in their control condition. Their results further showed that the induced visual effects disturb the performance of the OST AR condition, while they contributed to better results in VR.
While these studies examine OST AR displays, only a few studies exist on depth perception in VST AR displays. Messing and Durgin [31] compared distance perception of a real-time monocular video stream in a VR HMD (Virtual Research Systems V8 HMD) and direct viewing with monocular goggles and a cardboard tube to simulate a restricted field of view. In a blind walking task, participants were asked to walk distances to targets between 2 and 7 m. While the monocular goggles almost reached 100% accuracy, there was an underestimation of 77% in the HMD. Similarly, Pfeil et al. [35] examined distance perception with a blind throwing task between a stereoscopic VST view (HTC Vive equipped with a ZED Mini pass-through camera), an unrestricted real-world view, and a restricted real-world view realized through a plastic casing from a stripped-down HMD. They found a higher underestimation in the VST view (93%) compared to the other conditions. Even though Messing and Durgin [31] and Pfeil et al. [35] examine the VST view, they do not integrate virtual objects in their applications, which does not conform with the definition of AR. Therefore, incongruencies by visual and spatial mismatch are not addressed. Vaziri et al. [42] assessed the depth perception of a virtual object in three different VST AR conditions. While in one condition, full visual detail of the environment was provided, the other conditions showed a sketch-like environment and no environment at all, respectively. Measured in a blind walking task, they discovered that the depiction of the environment has no influence on depth perception. Ballestin et al. [3] compared VST AR to the OST AR of the Meta 2 headset by MetaVision. The VST AR view was rendered on a smartphone mounted in front of the eyes. In a reaching task, participants significantly underestimated the distance to virtual objects in VST AR compared to the OST AR condition. The monocular nature of the camera image that represented the VST view might have contributed to this outcome since it omits stereoscopic depth information. Adams et al. [1] investigated OST and VST AR in combination with shadow cues and different heights of objects in space. They used the Microsoft Hololens 2 to represent OST AR and a Varjo XR-3 for VST AR. They found that the application of shadow cues has only a small effect on depth judgment. When virtual objects were floating in space, they were judged as farther away. Overall, the authors could replicate Ballestin et al.'s results of more underestimation in VST AR than in OST AR.
Differences in depth judgments between VR and VST AR remain unclear, as we found no studies that directly compare these two SUIs in the same setting. However, VST AR and VR seem to incorporate a higher underestimation than OST AR [1,3,19,36]. The reason for this might be the distortion of the display [16]. To achieve a high field of view in the VST display, camera lenses are distorted to capture more content, i.e., straight lines appear curved [27]. Other influencing factors are the field of view, the weight, and the resolution [20,47]. We conclude that the hardware specifications of HMDs seem to have a certain impact on the measured distance underestimation in HMDs. Most comparative studies on depth perception use different HMDs incorporating different hardware specifications and, thus, different influences on depth estimations. Therefore, we propose investigating aspects independently of HMD-specific characteristics, to better understand perceptual aspects and evaluate more fine-grained influences among different SUIs.
While depth judgments can be measured more or less directly with the tasks previously described, an indirect measurement is the task performance, which results from the quality of the depiction of depth cues and the correct depth perception. These performance measures (such as task completion time) also show the extent to which perception affects action. Only a few studies exist that examine task performance in VST AR displays. Krichenbauer et al. [26] examined a simple selection and placement task in nine degrees of freedom (position, rotation, scale) in VR and VST AR with a 3D input device and measured a higher completion time in the VR condition. Furthermore, they measured more head movement in the AR condition, which could be an indicator of absent or conflicting depth cues that participants then counteracted with motion parallax. Kern et al. [23] investigated different keyboard input modalities in VR and VST AR. Contrary to Krichenbauer et al.'s findings [26], Kern et al. found a significantly higher completion time in VST AR than in VR, i.e., participants typed faster in VR. Discrepancies in the results might arise from the nature of the tasks that participants had to perform. Text input might require more cognitive resources. Kern et al. [23] explain their results with the Congruence and Plausibility (CaP) model by Latoschik and Wienrich [28], which defines a manipulation space with three layers: the sensation, the perception, and the cognition layer (i.e., bottom-up to top-down). On each layer, (in)congruence can be manipulated, resulting in a condition of plausibility. In the VST AR condition, incongruencies lead to a visual mismatch that participants actively need to counteract on a cognitive level, resulting in a lower performance.
Westermeier et al. [44] manipulated the cognitive congruence of a scenario in VR and VST AR. They implemented two different effects of a power outage: the cognitively congruent power outage affected the whole scenario, while the incongruent power outage only affected virtual interaction objects. In VST AR, they manipulated the physical environment with smart lights that were triggered simultaneously with the participants' actions. They found effects on the perceived scenario plausibility and spatial presence, i.e., the feeling of "being there" [29]. In VR, they measured that the cognitively congruent power outage triggered higher plausibility and spatial presence ratings. This effect was inverted in AR. The congruent power outage (which triggered the lighting in the physical environment) performed worse than the incongruent power outage regarding the plausibility and spatial presence ratings. The authors assumed that due to the visual mismatches that VST AR contains, participants could not combine physical and virtual content into one congruent scenario. Following the CaP model's assumptions and previous findings, we thus predict that the depth perception may be violated more in VST AR than in VR due to the contradicting a priori cues causing a visual mismatch.
There is little research on both task performance and perception in VST AR, i.e., on how it may affect other evaluations of the experience, such as plausibility or the sense of presence [44]. Here, we see another research gap, as these ratings are important for good XR experiences.

SUMMARY AND PRESENT STUDY
The existing literature on depth perception lacks direct comparisons between VR and VST AR. However, previous studies [1,3,19,36] revealed distance underestimations in both display technologies. Building on the literature [28,44], we anticipate that the higher amount of incongruencies stemming from the combination of virtual and physical content might lead to visual mismatches, which further distort the depth judgment. As described by Westermeier et al. [44], AR inheres "slight reconstruction errors caused, for example, by inaccuracies or imprecisions of object tracking or unknown parameters of the current real-world light transport, given the used AR device, rendering engine, and sensory equipment" [44, p.2682]. Previous work by Azuma [2] and Kruijff et al. [27] discussed the perceptual issues of AR displays and the accompanying conflicting cues from real and virtual entities. Due to this, we believe it is harder to set and estimate a virtual object in the physical environment than it is in the virtual environment, which motivates our first hypothesis:
• H1: Distance judgments in VST AR are less accurate than those in VR.
Furthermore, we hypothesize a difference in task performance (i.e., time and error rate) between the two display technologies [23,26]:
• H2: The task performance is lower in VST AR than in VR.
Considering the visual incongruencies inherent to VST AR, we predict potential implications on the user's perceived spatial presence and scenario plausibility [44]. As such, we propose:
• H3: Users report a higher spatial presence in VR than in VST AR.
• H4: Users report a higher perceived plausibility of the scenario in VR than in VST AR.
As we hypothesize superior outcomes of VR over VST AR concerning depth perception and task performance, we expect that participants will prefer VR over VST AR:
• H5: Users will prefer VR over VST AR.
We implemented a 1 × 4 within-subjects design, utilizing a counterbalanced randomized order structured as a 4 × 4 Latin square. Our study comprises four conditions: a pure VR condition, a VST AR condition, and two additional exploratory conditions simulating AR in VR. For this simulation, we used the implementation presented in Westermeier et al.'s work [45]. We added noise and a low resolution to the simulated VST video stream (condition VAR, see Fig. 2a). We further increased the lens distortion (barrel distortion) in the simulated VST video stream (condition VAR+, see Fig. 2b). While this manipulation affected the environment, the interaction objects remained untouched by the noise and lens distortion. We intended to cause a visual mismatch and, thus, to validate the AR simulation [45]. For the purposes of this work, we focus exclusively on the VR and VST AR conditions, as the other conditions are out of scope. We decided on the Meta Quest Pro as HMD, taking advantage of its pass-through functionality to ensure consistent pixel rasterization for both VR and VST AR conditions. From the literature, we know that hardware characteristics influence depth perception. By consistently using only one HMD, we can control for possible hardware-specific effects and keep inherent display characteristics consistent, focusing on differences in visual perception only.
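The counterbalancing scheme can be illustrated with a short sketch. The text only states that a 4 × 4 Latin square was used; the balanced (Williams) construction below, the function names, and the assignment of participants to rows by index are assumptions made for illustration, not the authors' procedure.

```python
def balanced_latin_square(conditions):
    """Return a balanced Latin square (Williams design) for an even number
    of conditions. Hypothetical helper; the paper does not specify which
    Latin square construction was used."""
    n = len(conditions)
    if n % 2 != 0:
        raise ValueError("this construction needs an even number of conditions")
    # First row follows the index pattern 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Every further row shifts all indices of the first row by one (mod n).
    return [[conditions[(idx + shift) % n] for idx in first] for shift in range(n)]


def order_for_participant(participant_index, conditions):
    """Assign a row of the square by participant index (assumed scheme)."""
    square = balanced_latin_square(conditions)
    return square[participant_index % len(conditions)]


if __name__ == "__main__":
    conditions = ["VR", "VST AR", "VAR", "VAR+"]
    for row in balanced_latin_square(conditions):
        print(row)
```

In such a square, every condition appears once in each position and each condition directly follows every other condition exactly once. With 32 participants, each of the four rows would be used eight times under the assumed assignment.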
In the VR scenario, we replicated the real office room into a virtual version (see Fig. 1a).For tasks involving participant interaction, virtual objects were employed.However, in the VST AR condition, while the interaction was still with virtual objects, reference objects (relevant but non-manipulable for the tasks) were physical (see Fig. 4).

Participants
Thirty-five participants took part in the experiment. Due to technical issues, three participants were excluded, resulting in 32 remaining participants for data analyses. The participants' demographic distribution and XR experience can be seen in Tab. 1. The study was approved by the institution's ethics committee.

Apparatus
Our experiment was conducted on a high-performance computer equipped with an Intel i9-11900K CPU, an NVIDIA GeForce RTX 3080 GPU, and 64 GB of RAM. We used the Meta Quest Pro HMD. This device offers pass-through functionality and, thus, consistent visual parameters for both VR and AR modalities. An advantage of the Meta Quest Pro, compared to other VST AR devices, is its relatively small distortion in the camera pass-through. This pass-through is realized by two gray-scale cameras (enabling stereo vision) and an RGB camera, which overlays color onto the gray-scale images. For interaction, we used the Meta Quest Touch Pro controllers.
Our application was implemented in Unity (v2021.3.27f1) using the Universal Render Pipeline (v12.1.12) for rendering. To link the Meta Quest Pro with our computer setup, we connected the HMD to the computer via the Oculus Link cable and utilized the Reality Stack I/O framework developed by Kern and Latoschik [22] for HMD support.

Procedure
The study procedure can be seen in Fig. 3. It took about 1.5 hours in total. Participants began by completing the consent forms. Subsequently, they filled out pre-questionnaires covering demographics, media usage, prior experiences with VR and AR, current VR sickness status, and their aptitude in visual imagery.
Participants then put on the HMD.In the beginning, participants adjusted the interpupillary distance of the HMD lenses until they saw a clear and unimpaired image.They were placed in a black environment and saw a white cube to refer to when adjusting.
Participants completed a tutorial phase to familiarize themselves with the system.Here, participants engaged with primitive objects, following auditory instructions.A consistent black background was maintained to eliminate potential distractions or confounding factors from the physical or virtual environment.
The main experience was divided into four blocks, each representing a within-condition.Each block commenced with calibrating the HMD and controllers to the virtual space.Participants executed a series of five tasks guided by auditory cues.Upon task completion, they responded to a set of questionnaires.This structure was repeated for all four within-subjects conditions.
In the end, participants reported their VR sickness status.Additionally, they provided insights through a few retrospective questions, concluding the experimental procedure.

Tasks
Participants were required to complete five distinct tasks (see Fig. 4). These tasks were designed to reflect established methods from prior studies (tasks 1 and 2), to explore the alignment of virtual and physical objects (tasks 3 and 4), and to gain insights into task performance (task 5). One requirement for the selection of tasks was that participants did not engage too much with their own body, to avoid confounds between VST AR (the own body is visible) and VR (the own body is not visible). With this selection of tasks, we wanted to cover different aspects: they vary in depth range, motor activity, and difficulty. All tasks were situated in the same room, and only the current task was presented so that participants would focus on the task at hand. For tasks requiring object movement or rotation, participants primarily used the controller thumbstick.
Fig. 4: The different tasks in VR and VST AR. Task 1 showed virtual cubes in the same manner as depicted for task 2 (but without the red marker). In tasks 3, 4, and 5, the reference objects' positions were fixed, while the objects were randomly assigned to these fixed positions in the VR condition. In VST AR, the assignment was always fixed.
Task 1: Verbal report. Inspired by existing literature [1,11], this task required participants to verbally report depth estimations for five virtual cubes. The cubes in this task were virtual, while the depicted environment was virtual (VR) or the video stream of the real environment (VST AR). Distances to the cube positions ranged from 85 cm to 315 cm. Estimations were logged by the experimenter.
Task 2: Bisection task. Following existing literature, we conducted a bisection task [6]. Participants placed a virtual red marker midway between themselves and five sequentially appearing virtual cubes. This task followed a similar structure to task 1, with the same distance range and no reference objects (see Fig. 4a and Fig. 4e). The marker, a sphere, spawned in front of the participants. With the controller thumbstick, it could be moved between the participant's body (the endpoint was defined as the position of the HMD, but at half the height) and the virtual cube's position.
Task 3: Vertical alignment. We adopted Adams et al.'s [1] idea of examining the depth perception of floating objects. In addition, we intended to offer tasks with different levels of difficulty and believe the vertical offset made the depth judgment harder. Participants were presented with five virtual objects to position above reference markers with a vertical offset. In addition to accurate positioning, the virtual objects had to be rotated to align with the marker orientation (see Fig. 4b). The VST AR condition featured reference markers placed on the physical room's floor (see Fig. 4f).
Task 4: Depth alignment. Similarly to Ping et al.'s [36] approach, we included a depth alignment task in our set of tasks, where participants needed to move four virtual objects back and forth to align with reference objects (see Fig. 4c). Compared to task 3, this task was simpler and also positioned nearer to the participants. However, some difficulty was induced by omitting occlusion handling in this task. Styrofoam primitive forms served as the physical reference objects in the VST AR condition (see Fig. 4g).
Task 5: Hotwire game. We adopted the hotwire game initially proposed by Lugrin et al. [30] as a benchmark for tracking the quality of 3D interaction. Participants steered a sphere (attached to their controller) through holes in aligned panels. These holes varied in size and position (see Fig. 4d). In the VST AR version, a 3D-printed foot and cardboard cutouts replicated the virtual setup (see Fig. 4g). The panels had dimensions of 30 × 30 cm, and the hole sizes ranged from 4.5 cm to 7.5 cm. The panels were lined up with distances of 10 and 11 cm. Participants conducted three iterations from left to right.
Randomization. Various randomizations were employed to ensure that participants would not become accustomed to the setup over the course of the four conditions. In tasks 1 and 2, seven predefined positions existed, from which five were chosen randomly in each task iteration. For tasks 3 to 5, positions were fixed. However, the reference objects associated with each position were randomized, with the exception of the VST AR condition, where randomization was skipped for simplicity.
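The randomization scheme described above can be sketched in a few lines. Only the number of positions (seven), the number drawn (five), and the 85 cm to 315 cm range come from the text; the intermediate distances, the object names, and the function names are hypothetical placeholders.

```python
import random

# Hypothetical candidate distances in cm: the text gives only the count
# (seven) and the range endpoints (85 cm and 315 cm), not the actual values.
CANDIDATE_DISTANCES_CM = [85, 120, 160, 200, 240, 280, 315]


def draw_cube_distances(rng, k=5):
    """Tasks 1 and 2: draw k distinct positions out of the predefined ones."""
    return rng.sample(CANDIDATE_DISTANCES_CM, k)


def assign_reference_objects(objects, condition, rng):
    """Tasks 3 to 5: positions are fixed; the object shown at each position is
    shuffled, except in VST AR, where the physical layout stays fixed."""
    order = list(objects)
    if condition != "VST AR":
        rng.shuffle(order)
    # Map position index -> object placed there.
    return dict(enumerate(order))


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the sketch reproducible
    print(draw_cube_distances(rng))
    print(assign_reference_objects(["cube", "cone", "cylinder", "pyramid"], "VR", rng))
    print(assign_reference_objects(["cube", "cone", "cylinder", "pyramid"], "VST AR", rng))
```

Passing an explicit `random.Random` instance keeps the draws reproducible per participant, which is a design choice of this sketch rather than something the study reports.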

Implementation and Mitigation of Confounds
Some differences between VR and AR persist and introduce confounds of unknown number and intensity. Thus, we took some countermeasures to minimize possible confounding effects.

Digital twin and visual consistency
We used a virtual replica of the real physical room in the exact dimensions, including the same composition of furniture [34]. We developed task-specific models for our virtual scene and also created corresponding 3D-printed versions for our physical scene. Hence, we could ensure uniformity between VR and VST AR. A static lighting model illuminated the virtual scene to minimize computing time. While interactable objects had dynamic real-time lighting, their shadows were omitted due to potential inconsistencies between VR and VST AR and their negligible impact on depth perception, as indicated by Adams et al. [1]. In task 4, we manipulated the occlusion of the reference objects in both the VR and the VST AR condition to always render the interaction object in front. For the remaining tasks and reference objects in VST AR, we created an occlusion model, i.e., a replica of the virtual room with a specific material. This material is rendered at a late step in the render pipeline and replaces all the pixels with a higher depth value. Hence, the virtual object could be occluded by physical content.
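The occlusion behavior described above amounts to a per-pixel depth comparison between the virtual object and the occlusion model. The following sketch illustrates that logic in plain Python; it is not the authors' Unity material or shader, and the function name, the parameters, and the `always_in_front` flag (standing in for the task 4 manipulation) are assumptions.

```python
def composite_pixel(passthrough_rgb, virtual_rgb, virtual_depth,
                    occluder_depth, always_in_front=False):
    """Decide which color a single pixel shows (illustrative logic only).

    passthrough_rgb -- color from the camera stream at this pixel
    virtual_rgb     -- color of the virtual object at this pixel
    virtual_depth   -- distance of the virtual object, None if absent
    occluder_depth  -- distance of the occlusion model (replica of the
                       physical room), None if nothing is modeled here
    always_in_front -- assumed flag mimicking task 4, where the interaction
                       object is always rendered in front
    """
    if virtual_depth is None:
        return passthrough_rgb      # no virtual content on this pixel
    if always_in_front:
        return virtual_rgb          # occlusion handling is skipped
    if occluder_depth is not None and occluder_depth < virtual_depth:
        return passthrough_rgb      # physical content is closer: hide virtual
    return virtual_rgb


if __name__ == "__main__":
    camera, virtual = (90, 90, 90), (255, 0, 0)
    # Virtual object at 2.0 m behind a physical surface at 1.5 m is hidden.
    print(composite_pixel(camera, virtual, 2.0, 1.5))
    # The same pixel with occlusion handling disabled shows the virtual object.
    print(composite_pixel(camera, virtual, 2.0, 1.5, always_in_front=True))
```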

Calibration and spatial consistency
We introduced a room calibration procedure to align physical objects with their virtual replica. The experimenter conducted this room calibration at the beginning of each condition. We defined two virtual reference points in the virtual scene. One controller was used to define the position offset to the first reference point, which was added to the XR rig. We calculated the rotational offset angle between the second controller and the other reference point and rotated the XR rig by this angle. We maximized the distance between the two controllers to minimize a potential rotational error. We ensured the correct and consistent controller position by installing 3D-printed mounts [24] tailored for the Meta Quest Touch Pro controllers in our physical room.
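The two calibration steps (translate by the offset of the first controller, then rotate by the angle towards the second reference point) can be sketched as plain 2D geometry on the floor plane. This is an illustration under stated assumptions, not the authors' Unity code: the rotation is assumed to be about the vertical axis with the first reference point as pivot, angles follow the standard counter-clockwise math convention, and all function names are hypothetical.

```python
import math


def compute_calibration(c1, c2, r1, r2):
    """Return (offset, yaw) aligning tracked controllers c1, c2 with the
    virtual reference points r1, r2. All points are (x, z) tuples."""
    # Step 1: position offset that moves controller 1 onto reference point 1.
    offset = (r1[0] - c1[0], r1[1] - c1[1])
    c2_shifted = (c2[0] + offset[0], c2[1] + offset[1])
    # Step 2: angle between the tracked and the intended direction to point 2.
    tracked = math.atan2(c2_shifted[1] - r1[1], c2_shifted[0] - r1[0])
    intended = math.atan2(r2[1] - r1[1], r2[0] - r1[0])
    yaw = intended - tracked
    # Wrap into [-pi, pi) so the rig takes the shorter rotation.
    yaw = (yaw + math.pi) % (2 * math.pi) - math.pi
    return offset, yaw


def apply_calibration(point, offset, yaw, pivot):
    """Translate a tracked point by offset, then rotate it about pivot."""
    x = point[0] + offset[0] - pivot[0]
    z = point[1] + offset[1] - pivot[1]
    cos_a, sin_a = math.cos(yaw), math.sin(yaw)
    return (pivot[0] + cos_a * x - sin_a * z,
            pivot[1] + sin_a * x + cos_a * z)


if __name__ == "__main__":
    r1, r2 = (0.0, 0.0), (4.0, 0.0)              # virtual reference points
    c1 = (1.0, 2.0)                              # tracked controller 1
    c2 = (c1[0] + 4.0 * math.cos(math.radians(30)),
          c1[1] + 4.0 * math.sin(math.radians(30)))
    offset, yaw = compute_calibration(c1, c2, r1, r2)
    print(offset, math.degrees(yaw))
    print(apply_calibration(c2, offset, yaw, pivot=r1))
```

The sketch also shows why a large controller separation helps: a fixed positional error e at the second controller changes the computed angle by roughly atan(e / d), which shrinks as the distance d between the controllers grows.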
Body perception. Previous work has shown that a higher embodiment improves the accuracy of depth judgments [38]. To align body perception between VST AR and VR conditions as closely as possible, participants' real bodies were covered with a hairdressing cape. A virtual replica of the covered body was created, which moved in synchrony with the head movement. Movement-centric tasks were minimized, so only task 5 required active 3D controller movement in space.
Size references. Covering their own body also had the effect that participants could not relate the size of their body parts to the distances they had to judge. In addition, we removed unused physical furniture and objects from the room to prevent biases from familiar objects and their sizes. We kept the virtual and physical environments as similar as possible to reach a comparable level of detail.
Experimenter presence. The experimenter needed to stay in the same room to observe the experiment. To avoid a confounding co-presence effect in AR, the experimenter was positioned behind a poster wall and was thus not visible to the participant.

Objective Measures
We use the term "judgment" for the assessment of distances. In task 1, the judgment includes an estimation of distance. In all other tasks, the judgment additionally includes the positioning and steering of virtual objects. Thus, task 1 involves perceptive and cognitive resources. All other tasks involve motor activity as well. Here, participants make continuous readjustments while placing/steering the virtual objects. Hence, we define judgment as the result of both the estimation and (if applicable) the subsequent motor activity. We utilized the concept of a distance ratio across multiple tasks. The ratio is defined as:

$$\text{distance ratio} = \frac{\text{judged distance}}{\text{actual distance}}$$

In task 1, participants provide an estimation of the egocentric distance towards objects. In task 2, the distance to the set marker position was used as the judged distance, whereas the distance to the calculated midpoint between the object and the participant served as the actual distance. In tasks 3 and 4, we proceeded similarly by using the reference objects' positions for the calculation of the actual distance, while the objects placed by the participants provided the location for the calculation of the judged distance.
To account for possible offsets in the judgments (e.g., on average, an overestimation could cancel out an underestimation), we also compute the absolute misjudgment:

$$\text{absolute misjudgment} = \frac{\lvert \text{judged distance} - \text{actual distance} \rvert}{\text{actual distance}}$$

In task 5, we calculate an error rate by counting the frames in which a collision between the sphere and panels is detected and dividing this count by the total number of frames needed for one iteration.
As performance metrics, we measure the needed time for each task as well as motion data (i.e., participants' head and controller position and rotation).
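The objective measures described above can be sketched in a few lines. The function names are hypothetical, and the absolute misjudgment is assumed here to be the absolute deviation relative to the actual distance, which matches the percentage values reported in the results; the example also shows how an overestimation and an underestimation cancel out in the mean distance ratio but not in the mean absolute misjudgment.

```python
def distance_ratio(judged, actual):
    """Judged distance divided by actual distance (1.0 means a perfect judgment)."""
    return judged / actual


def absolute_misjudgment(judged, actual):
    """Absolute deviation relative to the actual distance (assumed definition)."""
    return abs(judged - actual) / actual


def collision_error_rate(collision_flags):
    """Share of frames with a detected collision in one iteration of task 5."""
    return sum(1 for flag in collision_flags if flag) / len(collision_flags)


# One underestimation and one overestimation of the same 1.0 m distance.
trials = [(0.8, 1.0), (1.2, 1.0)]
mean_ratio = sum(distance_ratio(j, a) for j, a in trials) / len(trials)
mean_abs = sum(absolute_misjudgment(j, a) for j, a in trials) / len(trials)

# Hypothetical frame log: 2 colliding frames out of 8.
rate = collision_error_rate([False, True, True, False, False, False, False, False])
```

In this example the mean distance ratio suggests perfect accuracy, while the mean absolute misjudgment reveals an average error of about 20 %.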

Subjective Measures
Besides objective measures, we use questionnaires to assess the participants' perception. We ask for the Spatial Situation Model (SSM), designed by Vorderer et al. [43]. It contains questions concerning the construction of a spatial model of the viewed scene, such as "Even now, I still have a concrete mental image of the spatial environment" [43]. The SSM is defined to be a precondition for spatial presence. The SSM contains eight items that are answered on a five-point Likert Scale from "I do not agree at all" (1) to "I fully agree" (5).
In the beginning, we ask participants for their Visual Spatial Imagery (VSI), which gives insights into the participants' individual preconditions concerning the recreation of spatial information [43] (e.g., "When someone describes a space to me, it's usually very easy for me to imagine it clearly."). Vorderer et al. describe in their work that it can influence the SSM. The VSI is answered on a five-point Likert Scale ranging from "I do not agree at all" (1) to "I fully agree" (5).
We measure spatial presence by using the Spatial Presence Experience Scale (SPES) with the two subscales Possible Actions and Self Location [14]. It includes eight items (four per subscale) with the endpoints "I do not agree at all" (1) and "I fully agree" (5). The SSM, the VSI, and the SPES were all developed in the broader frame of the MEC-SPQ [43].
Similarly to Westermeier et al. [44], we assess the perceived plausibility by using their proposed questions (inspired by Brübach et al. [7]) in an adapted form (e.g., "This experience was unusual for me" or "I could not anticipate what would happen next in the scenario"). The questions are answered on a seven-point Likert Scale from "I do not agree at all" (1) to "I fully agree" (7).
To assess the subjective task load, we ask the NASA TLX questions on mental and physical load [13] using a slider ranging from 0 to 20.
At the end of the experiment, participants were asked to decide which condition they preferred, which condition they perceived as most complex, and which condition they perceived as easiest.
As a control measure for VR sickness, we asked participants to fill in the Virtual Reality Sickness Questionnaire (VRSQ) [25] before the first exposure started and after the last exposure ended. The VRSQ is answered on a four-point Likert Scale ranging from "None" (0) to "Severe" (3).

Hypothesis Testing and Task-Based Analysis
For H1, we involve results from tasks 1 and 2 (the distance ratio and absolute misjudgment). Tasks 3 and 4 also provide information on the accuracy. However, they include reference objects from the physical environment and, thus, depend on the correct calibration of the room. Tasks 3 and 4 are more implicit and active and provide more relevance for real-world scenarios. H2 is measured by the time needed to fulfill the tasks. Task 5 furthermore provides an error rate, which can be used as a measure of task performance. However, again, the results of this measure highly depend on the calibration quality. We answer H3 and H4 with results from the SPES and the perceived scenario plausibility questionnaire. H5 is answered by the preference rating participants provide at the end of the experiment.

Objective Measures
Objective results were obtained from the tasks to determine the depth judgments and task performance in VR and VST AR. A comprehensive overview of these results is provided in Tab. 2.
By visual inspection, we noticed some outliers that we trace back to issues with the controller interaction (e.g., we observed that participants accidentally confirmed their choice instead of placing the object at the right location because they confused controller buttons). To mitigate these outliers, we winsorized the data of both conditions with 0.05 as the lower and 0.95 as the upper limit. We winsorized only the judgments of tasks 2, 3, 4, and 5, as no controller action was required in task 1. Some of our results did not meet the assumptions of normality. However, ANOVAs have been found to be robust to such deviations [5,15,39]. Therefore, repeated measures ANOVAs were used to compare VR and VST AR, with a significance threshold set at p < .05. T-tests were conducted to compare distance ratios against the expected value of 1.
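The outlier mitigation and the comparison against the expected ratio of 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the percentile-based clipping with linear interpolation is one common way to winsorize and may differ from the procedure of the statistics package used in the study, the function names are hypothetical, and the t statistic is returned without a p-value.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated quantile q (0..1) of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def winsorize(values, lower=0.05, upper=0.95):
    """Clip values below the lower and above the upper quantile to those quantiles."""
    ordered = sorted(values)
    low_cut = percentile(ordered, lower)
    high_cut = percentile(ordered, upper)
    return [min(max(v, low_cut), high_cut) for v in values]


def one_sample_t(values, expected=1.0):
    """t statistic and degrees of freedom for a test of the mean against `expected`."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.fmean(values) - expected) / standard_error, n - 1


# Hypothetical distance ratios with one accidental confirmation far off target.
ratios = [1.0] * 19 + [10.0]
clipped = winsorize(ratios)
t_value, dof = one_sample_t([0.8, 0.9, 1.0], expected=1.0)
```

Winsorizing keeps every observation in the sample and only limits the influence of extreme values, which preserves the balanced within-subjects data needed for the repeated measures ANOVA.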

Comparison of Depth Judgments between VR and VST AR
For task 1, we measured a significant difference in distance ratios between VR and VST AR (see Fig. 5a). According to the p-values, we did not find significant differences between VR and VST AR concerning absolute misjudgments. In general, the depth was underestimated by around 10 %. The absolute misjudgments show that participants misjudged the distance by more than 20 % on average. In addition, the standard deviation values are very high, indicating a high variance in misjudgments.
In task 2, we found a significant effect concerning the distance ratio and the absolute misjudgment between VR and VST AR. The distance ratio is significantly higher in VST AR than in VR. In VR, the participants positioned the marker wrong by 12.9 % on average. In VST AR, the misplacement reaches 16.5 %. Participants generally placed the markers farther back than the actual center between the participant and the object (see Fig. 5b).
Task 3 revealed significant differences in distance ratios (see Fig. 5c) and absolute misjudgments, indicating higher distance judgments in VST AR. However, the absolute misjudgments in VR and VST AR remain relatively small at 1.2 % (VR) and 1.7 % (VST AR).
Similarly, a significantly higher distance ratio (see Fig. 5d) and absolute misjudgment could be detected in the VST AR condition in task 4. Again, the absolute misjudgments were minimal, with 0.1 % (VR) and 1.0 % (VST AR).
In task 5, a significant effect on the error rate was measured (see Fig. 5e). While in VR the sphere collided with obstacles during 7.2 % of the whole task, it collided during 23 % of the task in VST AR.
Analyses of task completion times revealed a noticeable trend of longer durations for the VST AR condition in specific tasks and over the duration of all tasks. Regarding movement, participants in the VST AR condition consistently showed greater head movement across most tasks compared to those in VR.

Questionnaires
While the mean values were high in the SSM, the SPES, and the plausibility questionnaires, no significant differences were found between VR and VST AR.

Participant Preferences and Perceived Complexity
After concluding the tasks, participants were asked to share their preferences and perceptions regarding the simplicity and complexity of the conditions. Fifteen participants decided on VST AR, followed by ten who liked VR the most. The other conditions received fewer votes, with four and one vote, respectively. Two participants did not decide on one condition. When we asked participants about the simplicity of the conditions, 18 participants chose the VR condition, and six participants chose the VST AR condition, followed by four and two votes for the omitted conditions. Two participants did not decide on a condition. Seven participants stated that the VST AR condition was the most complex, and one participant voted for the VR condition. Twelve and nine participants rated the other conditions as the most complex. Three participants did not decide on one condition.

Depth Judgments
We can accept H1 "Distance judgments in VST AR are less accurate than those in VR." In task 1, the underestimation was less pronounced in VR. In tasks 2-4, elements were placed farther away in VST AR than in VR, and the absolute misjudgment indicates a higher variance in misjudgments in VST AR compared to VR. Although results from tasks 3 and 4 can be caused by wrong depth perception, we need to interpret the results with caution: In contrast to tasks 1 and 2, where the interaction only included virtual objects, referential objects were provided in tasks 3 and 4 that had to be matched. Thus, the correct placement of virtual interaction objects in the physical environment also depended on the calibration quality. Although we provided fixed positions for the physical controller mounts and virtual reference points (see Sec. 4.5), the Meta Quest Touch Pro controllers are subject to possible tracking inaccuracies caused by insufficient tracking camera information or problems when consolidating tracking information from the controllers and the HMD. Thus, we cannot rely on tasks 3-5 to make assumptions about the accuracy. However, tasks 1 and 2 alone show a higher misjudgment in VST AR compared to VR.
In task 1, we detected an underestimation for both conditions. This is in line with previous literature [1,3,19,36]. If we compare the results from task 1 to Kelly et al.'s work [21], which examined depth judgments in the Meta Quest and Meta Quest 2 by verbal reporting, the Meta Quest Pro performs more accurately than its predecessors. We measured a distance ratio of 94 % in our VR condition compared to 82 % (Meta Quest) and 75 % (Meta Quest 2) that were measured by Kelly et al. [21]. Their real-world condition resulted in a distance ratio of 94 %, so our VR results keep up with this measure.
In contrast to this underestimation in task 1, virtual objects in tasks 2-4 were placed farther away from the participant's perspective. A possible reason for this deviation might be the nature of tasks. Task 1 was static and required no motor interaction. Hence, it only required perceptive and cognitive resources. In contrast, the other tasks were more active, enabling the user to perform motor actions and, thus, to continuously readjust and reevaluate the object placement. We assume that participants underestimated the distances and then overcompensated by moving the virtual objects further back. Additionally, in task 2, participants might have calculated the position from the edge of their body and not the center of their body, even though they were instructed otherwise, leading to the placement farther back.
The results are especially interesting with regard to the identified causes of distance compression in HMDs from the literature. As Willemsen et al. [47] hypothesized, there are (additional) perceptual factors influencing depth perception apart from hardware characteristics. In our results, these perceptual factors appear in different manifestations in VR and VST AR, causing a significant difference between VR and VST AR.

Task Performance
Hypothesis H2 "The task performance is lower in VST AR than in VR." can be accepted. We identified a higher completion time over all tasks in VST AR. Participants also moved their heads significantly more in VST AR than in VR. This finding is in line with the work of Krichenbauer et al. [26], who also measured more head motion in VST AR. We assume participants perceived conflicting depth cues caused by visual mismatches in the VST AR condition. As a consequence, they moved their head more to exploit motion parallax. Hence, they could compensate for these conflicts and enhance their depth perception.
We did not detect significant effects in the mental or physical demand of the NASA TLX. We assume that VR/VST AR differences were too subtle to be consciously noticed by participants.
In task 5, a significant effect revealed a higher error rate in VST AR. While these findings support our hypothesis, we again need to view them with caution as the calibration quality defined the error rate to a certain extent. In task 5, this also affected the visualization of occlusion as the occlusion model matched the virtual room model. Additionally, a mismatch in lens distortions between the VST view and the virtual overlay might have caused misplacements of the occlusion model on the actual physical model (see Fig. 4h). Thus, participants might have perceived visual mismatches between the occlusion of the sphere and the actual holes in the panels.

Subjective Measures
Hypotheses H3 "Users report a higher spatial presence in VR than VST AR." and H4 "Users report a higher perceived plausibility of the scenario in VR than in VST AR." cannot be accepted for our sample. Following the CaP model [28], we would have expected that the AR-inherent a priori incongruencies would negatively affect the spatial presence and plausibility. Even though we did not find a statistically significant p-value < .05, we cannot say there was no difference between VR and VST AR as we measured small effect sizes.
Surprisingly, we have to reject H5 "Users will prefer VR over VST AR." Even though the VR condition was the easiest to conduct, most participants chose VST AR as the condition they liked the most.
Derived from the VR and AR experience (see Tab. 1), participants had less experience in AR than in VR. This could have led to a novelty effect, causing the participants to feel a higher sensation and, thus, give higher ratings for the VST AR condition.
Another reason why participants liked the VST AR condition most and also had no deduction in the questionnaires on spatial presence and perceived plausibility is the relatively small proportion of virtual content. Most of the display was covered with an undistorted view of the environment captured by a camera stream that matched the real environment perfectly. Thus, participants might have mainly focused on that. Furthermore, participants might not have taken a closer look at the composition of the environment. In an observation task, outcomes might have been more distinctive for the spatial presence and plausibility ratings.
Wienrich et al. [46] proposed a reference frame as a frame that defines the primary reference (reality/virtuality) that the experience is judged upon. It is constructed by the proportions of virtual and physical content and further weighting. Given the small proportion of virtual content, participants possibly had a reference frame of reality. Hence, everything seemed plausible and spatially intact to them, as they may have neglected the virtual content completely. We expect that a higher proportion of virtual content causes more potential for incongruencies and, thus, more deviations in spatial presence, plausibility, and personal preference.

Limitations and Future Work
Study design Our study design included a 1×4 within-subjects design. Thus, participants might have performed better and faster in the later conditions when they had more practice. Additionally, when rating one condition, participants take previous conditions as a relative anchor and adjust their new ratings accordingly. We decided to omit reporting and discussing two conditions in this paper. However, the results showed that in the VAR+ condition, tasks 1 and 2 incorporated similar results to the VR and VST AR conditions, while tasks 3-5 had worse results for the VAR+ condition. The VAR condition showed similar results to the VR condition. From these results, we can conclude that a mismatching lens distortion negatively impacts depth perception and that spatial coherence (disturbed by lens distortion in the VAR+ condition) is more relevant for the right depth estimation than visual coherence (disturbed by film grain in the VAR condition). Even though these results are insightful, we suggest that more refinement is needed to validate this AR simulation fully [45]. Thus, we will use these insights for future studies as a starting point to present a more concluding view on this simulation in the future.
Causal ambiguities Our results cannot pinpoint the underlying cause or distribution of causes of the discrepancy between VR and VST AR. This would require further investigations. In the future, it would be fruitful to apply a direct comparison of VR, the VST view as it was used by Messing and Durgin [31] and Pfeil et al. [35], and VST AR (including virtual content) as we used it in our study. Hence, we could further eliminate parameters not responsible for the misjudgment and determine if the effects result solely from the VST view or if the mismatch of virtual and physical content plays a significant role. To quantify the effects of single incongruencies leading to that mismatch, we consider an evaluation by simulating VST AR in VR [44].
Hardware and use case specificity Our findings are based on the use of the Meta Quest Pro. Since every HMD has specific technical realizations (in terms of lens distortion, resolution, etc.), we cannot generalize our findings for other VST AR HMDs. For example, other HMDs often do not offer stereoscopy for the VST content. Thus, results might differ drastically.
In addition, we acknowledge that there are distinct use cases better suited for VR and others for VST AR. Thus, our results shall not answer if an application should be implemented in VR or VST AR but rather inform about possible effects that can appear when designing SUIs along the RV continuum. Furthermore, performance, as we measured it (completion time and error rate), is not the decisive quality criterion for some use cases. Thus, other measurements, such as the usability of a system, shall also be considered in the future.
Depth range We looked into the absolute values of the distances in cm in task 1 to see if the estimation error increases when distances are bigger. Figure 6 shows a wider spread of estimations the higher the actual distance, while the overall trend shows a consistent incline. We measured depth judgments only in the near action space up to 3.5 m. It would be interesting to get further insights in other spaces > 3.5 m, which would be relevant for contexts such as driving simulations or other navigation tasks. For example, previous work of Gagnon et al. [12] found overestimations and underestimations at specific depth ranges up to 500 m. Examining the space < 0.5 m would also be interesting since this is the action space for precision tasks.
Interaction For these smaller, precise tasks requiring accurate finger interaction in the near-eye field, hand-tracking interaction could be applied. This would also enable more intuitive interactions. For now, we rely on controller interaction because tracking is more accurate than the recognition of finger gestures. We additionally wanted to prevent participants from deploying their bodies too much since it would have triggered confounding effects in VST AR in comparison to VR [38].
Participant posture Contrary to previous studies [6] on bisection tasks, we found the positioning of the marker farther away in task 2. These results might have been confounded by the body offset when participants were in a seated position and saw their knees. This makes it harder for participants to judge distances from the center of their bodies. In addition, participants have different lengths of thighs, leading to interpersonal differences in the forward offset, especially in the VST AR condition, where they see their own thighs. In VR, we approximated the participants' virtual bodies with a 3D model of the cape. To avoid confounds, we suggest applying a standing position in future experiments.
Task relevance In the future, tasks shall be tailored to relevant use cases. We plan to include more cognitive tasks to learn about the interplay between incongruencies and task load on different levels [28], i.e., perception and cognition. For now, we only concentrated on the perception part, but it would be interesting to find out how incongruencies cognitively restrict users.
Calibration accuracy For the tasks including reference objects (i.e., tasks 3-5), the VST AR condition indicated a higher variance in distance ratio and error rate, respectively. We attribute some of the variance to an incorrect depth judgment and some to inaccuracies in calibration. However, we cannot clearly separate these. In the future, one option would be to calculate a relative model. For example, if all objects were judged incorrectly by a consistent offset in one direction, we could assume this is the amount of calibration inaccuracy. Our tasks 3 and 4 were placed on opposite sides of the room. A calibration inaccuracy would imply that if there is a farther judgment in task 3, there has to be a shorter judgment in task 4 or vice versa. However, we measured a farther judgment in both tasks, leading to the assumption that objects were positioned farther back not due to calibration inaccuracies.
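The opposite-sides argument can be made concrete with a small sketch. It rests on a simplifying assumption that is not part of the study: a purely translational calibration shift along the axis between the two task locations, which adds to the signed egocentric error on one side and subtracts from it on the other, while a perceptual bias has the same sign on both sides. The function name and the example values are hypothetical.

```python
def decompose_signed_errors(error_side_a, error_side_b):
    """Split signed egocentric errors from two opposite room sides.

    Assumed model (illustrative only):
        error_side_a = judgment_bias + calibration_shift
        error_side_b = judgment_bias - calibration_shift
    Positive errors mean the object was placed farther away than the reference.
    """
    judgment_bias = (error_side_a + error_side_b) / 2
    calibration_shift = (error_side_a - error_side_b) / 2
    return judgment_bias, calibration_shift


# Farther placement on both sides: dominated by a shared judgment bias.
bias_both_far, shift_both_far = decompose_signed_errors(0.03, 0.01)

# Farther on one side, nearer on the other: consistent with a pure calibration shift.
bias_opposite, shift_opposite = decompose_signed_errors(0.02, -0.02)
```

Under this model, farther judgments on both sides yield a non-zero shared bias, which corresponds to the reasoning above that the farther placements are not explained by calibration alone. Rotational calibration errors and lens distortion are not captured by this sketch.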
Outlier We mitigated outliers in tasks 2-5 by winsorizing our dataset to cancel out possible mistakes participants made when interacting with the controllers. In comparison, the winsorized data showed significance for the same measurements as the original data and an additional significance for the distance ratio of task 2. In most cases, the winsorizing led to higher effects (with exceptions for the distance ratio of task 4 and the error rate of task 5, which still remained significant).

CONCLUSION
This work provided a comprehensive examination of depth judgments in VST AR compared to VR. Our findings indicate higher misjudgment in VST AR compared to VR. We further identified a lower task performance in the VST AR condition measured by needed time and head movement. Surprisingly, we measured no deductions in subjective ratings. The VST AR condition was preferred over the VR condition. We assume that the high proportion of real content in the VST AR condition caused the participants to neglect the visual mismatch between real and virtual content. We outlined certain challenges of VST AR, including possible confounds and hardware limitations (viewer's body perception, lens distortion, and potential inaccuracies in calibration). Although our findings indicate comparatively worse ratings in VST AR, we assess the level of inaccuracies to be within a reasonable range, particularly given the emerging nature and rapid advances of VST AR displays. This suggests a promising scope for refinement and advancement in future applications. Overall, our study offers insights into depth perception within both VR and VST AR environments. It lays a groundwork for further exploration, emphasizing the evolving capabilities and potential of VST AR despite the initial challenges observed.

Fig. 2: The omitted conditions. We lowered the resolution of the VST view and added noise. In (a), the occlusion is aligned between the sphere and the simulated VST view. Due to a discrepancy of lens distortion in (b), the occlusion is not coherent between the sphere and the simulated VST view.
Fig. 4: The different tasks in VR and VST AR. Task 1 showed virtual cubes in the same manner as it is depicted for task 2 (but without the red marker). In tasks 3, 4, and 5, the reference objects' positions were fixed, while the objects were randomly assigned to these fixed positions in the VR condition. In VST AR, the assignment was always fixed.

Fig. 5: Plots showing the distance ratios from tasks 1-4 and the error rate of task 5. All boxplots show a significant effect.

Fig. 6: Scatter plots of the absolute values in cm of the actual and estimated distances in task 1.
• Franziska Westermeier and Larissa Brübach are with the Human-Computer Interaction (HCI) Group and the Psychology of Intelligent Interactive Systems (PIIS) Group from the University of Würzburg.
• Carolin Wienrich is with the PIIS Group from the University of Würzburg.
• Marc Erich Latoschik is with the HCI Group from the University of Würzburg.

Table 1: Demographic data and XR experience of participants.

Table 2: Results from calculating repeated measures ANOVAs of the objective measures between VR and VST AR. We report the mean (M), the standard deviation (SD) as well as the test statistic F, the p-value and the partial eta squared ($\eta_p^2$). Significant p-values and the mean values of the respective condition with higher accuracy, less movement, and less time are marked in bold. Distance ratios and absolute misjudgments for tasks 2-5 were winsorized to mitigate outliers.