Analysis of Gaze, Head Orientation, and Joint Attention in Autism With Triadic VR Interviews

Effective use of gaze and head orientation can strengthen the sense of inclusion in multi-party interactions, including job interviews. Not making significant eye contact with the interlocutors, or not turning towards them, may be interpreted as disinterest, which could worsen job interview outcomes. This study aims to support the situational solo practice of gaze behavior and head orientation using a triadic (three-way) virtual reality (VR) job interview simulation. The system lets users encounter common interview questions and see how they share attention among the interviewers based on their conversational role (speaking or listening). Given the yaw and position readings of the VR headset, we use a machine learning-based approach to analyze head orientations relative to the interviewers in the virtual environment, and achieve low angular error in a low complexity way. We examine the degree to which interviewer backchannels trigger attention shifts or behavioral mirroring and investigate the social modulation of gaze and head orientation for autistic and non-autistic individuals. In both speaking and listening roles, the autistic participants gazed at, and oriented towards the two virtual interviewers less often, and they displayed less behavioral mirroring (mirroring the head turn of one avatar towards another) compared to the non-autistic participants.


Saygin Artiran, Poorva S. Bedmutha, and Pamela Cosman, Fellow, IEEE
Index Terms: Autism, job interview practice, machine learning, social modulation of gaze and head orientation, virtual reality.

I. INTRODUCTION
Virtual reality (VR) has rapidly gained traction in recent years. It is employed in numerous fields, including games, interior design, and healthcare. The immersive and interactive nature of VR lets users activate their senses to blend with the environment. The sensation of becoming physically present in a non-physical world may offer a distinct opportunity for effective experiential learning. For example, users can practice social communicative skills in a supervised setting without fear of real-world repercussions.
Gaze can be used to perceive information from others or to signal a variety of meanings, e.g., wishing to communicate [1]. Conversational roles may alter gaze characteristics, which is called the social modulation of gaze; when listening, individuals make more eye contact than when speaking [2], [3], [4]. Head rotation and gaze direction tend to align about 70% of the time [5]. Head orientation alone can also allude to the visual center of attention in a conversation [6]. Like gaze and head rotation, backchannels can also be used to show interest in the conversation. Conversational backchannels can be verbal, e.g., "uh-huh", "hmm", "yes", "wow", or nonverbal, such as head nods, shakes, and tilts.
Observations of individuals' gaze patterns and head orientations, along with their responses to backchannels, can yield valuable insights into their propensity for engagement and joint attention. Joint attention refers to the deliberate alignment of an individual's focus of attention with that of another person, resulting in both parties looking at the same subject matter. This capacity holds special significance during early developmental stages as it helps children acquire object names and object usage guidelines, contributing to later developmental milestones [7]. An individual's focus of attention can be inferred from gaze patterns when their face is visible; head or body orientation can also provide valuable clues [8]. Both gaze and head orientation serve as critical mechanisms for either enacting or evaluating joint attention. In situations involving multiple listeners, speakers can avoid excluding participants by briefly directing their gaze or orienting their bodies towards them [9]. This includes acknowledging listener input. Such actions are pivotal for ensuring that other parties feel acknowledged and integrated in the conversation.
Effective use of gaze and head orientation can boost an individual's chances of securing and maintaining employment. During job interviews, applicants who maintained direct eye contact and kept their head up were generally perceived as more confident and reliable, increasing their likelihood of being offered the job [10]. Other early research supported these findings, revealing a strong connection between higher levels of perceived self-confidence and competence in interviewees who effectively used nonverbal cues such as eye contact and head orientation [11], [12], [13]. In a more recent study, interviewees who maintained optimal eye contact with the interviewer, without excessive gaze wandering or staring, received higher interview scores [14]. Candidates who demonstrated effective head orientation towards conversational partners were selected more frequently for employment opportunities.
Autism is a multifaceted developmental condition that influences an individual's social interactions and information processing [15]. One out of every 44 children is autistic [16], and autism is linked to a high unemployment rate, as 69% of individuals with autism express a desire to work, yet only 20% of this demographic finds gainful employment [17], [18]. Researchers have delved into contrasting patterns of gaze behavior and head movements in non-autistic (NA) and autistic individuals. Variations in head movements were identified as potential early indicators of autism [19], [20]. Autistic individuals exhibited a reduced tendency to initiate and sustain eye contact, with less focus on facial features [21], [22], [23], [24], [25], [26]. A study suggested that autistic individuals might fixate on a single person while neglecting others [27], or may face challenges when attempting joint attention [28]. Differences in social communication, coupled with conventional workplace communication norms and hiring procedures, could be hindering the employment prospects of autistic individuals [29], [30], [31], [32].
In this study, our contributions are: (1) A novel triadic job interview system in VR that enables fully automatic analysis of joint attention and of social modulation of gaze and head orientation, (2) A multilayer perceptron (MLP) regressor to adjust head rotation values based on head movements to accurately detect engagement with virtual targets, (3) A novel study of joint attention and social modulation of head orientation for both autistic and NA participants using immersive VR.
The rest of this paper is organized as follows. Section II summarizes related work and Section III describes the triadic VR mock job interview application. In Section IV, we explain the virtual interviewers' backchannel behaviors during the mock job interviews, while Section V presents the system pipeline and data processing procedures. Section VI presents comparative social behavior analyses of our participants, and we conclude with Section VII.

II. RELATED WORK
In this section, we review studies where VR was employed to analyze social behaviors such as gaze, head movements, and mirroring in a variety of environments, including triadic (three-person) conversations. We also report on previous work where the technology was used as a tool to enable the practice of joint attention and social skills in job interviews.

A. Gaze Behavior Analysis
Precise tracking of eye movements can allow VR systems to understand user gaze patterns, identify objects that draw attention, and investigate the impact of stimuli on gaze behavior. VR eye tracking has found applications across a range of fields, such as gaming [33] and psychology [34]. Fathy et al. [35] built an ML architecture to forecast visual focus in VR, showing the promise of using large-scale gaze data to analyze gaze patterns in VR. Wang et al. [36] investigated visual attention in VR driving simulators.
A few studies investigated gaze behavior in the context of job interviews. The authors of [37] developed a tool to practice maintaining eye contact. During the interviews, the virtual character's level of interest was influenced by the user's gaze direction. If the user gazed towards the virtual character, it appeared interested; otherwise, it behaved as though it were not paying attention. In [38], participants acted as interviewers listening to job applicants in both computer-mediated and face-to-face mock interviews; eye tracking was used to examine how scar-like facial features influenced gaze patterns.
Contemporary eye trackers, particularly those in head-mounted displays (HMDs), encounter difficulties with calibration and data accuracy [39], [40]. Glasses, contact lenses, mascara, and physiological aspects such as eye color can impede gaze tracking. This may necessitate repeated calibration and could reduce measurement trustworthiness. Researchers have devised algorithms to mitigate issues due to calibration drift, headset movement, and blinking [4], [41], [42].

B. Head Orientation Analysis
Head movements can communicate information on emotions, intentions, and conversational engagement. Researchers have used VR headsets to understand the role of head movements in engaging with virtual environments [43], [44]. Xiao et al. [45] aimed to discern the head movements executed by participants in dyadic (two-person) interactions to acknowledge or criticize the other party. In [46], an ML model gauged conversational engagement levels based on head gestures detected using readings from an augmented reality headset.
VR technology for analyzing head orientation has shown utility across diverse fields such as spatial cognition and training simulations [47], [48]. Researchers have studied how people organically move their heads during social engagements within immersive virtual settings [49]. These studies revealed that effective head rotations can enhance the sense of inclusion among interlocutors. Beyond gaze, comprehending the patterns of head rotation in social contexts can improve the realism of simulated conversational systems [50].

C. Social Behavior Analysis in Triadic Conversations
Although they can inform about some social communicative behaviors, dyadic interactions have limitations in capturing other behaviors like social exclusion. There are some studies of triadic interactions in real-world scenarios. In [51], unequal distribution of gaze by a salesperson in triadic sales encounters was perceived as favoritism, leading to reduced trust from customers. Zima et al. [52] explored how speakers use gaze to assert or relinquish their conversational turn; gaze aversion from co-starting speakers was an effective strategy in securing speaking opportunities. In [53], head and eye movements of listeners were found to accurately indicate speaker location, regardless of background noise level, although head movements alone slightly undershot the speaker's position.
In contrast to these real-world studies, VR has seen limited use in studying triadic interactions. Hladek and Seeber [54] expanded upon [53], observing a similar undershooting behavior in a VR-based triadic conversation scenario. Using immersive VR, and head and hand tracking, Miller et al. [55] explored synchrony within triads. Synchrony, which represents the natural time-dependence of behaviors in human interactions, was influenced by the virtual environment, the dynamics of turn-taking within the triad, and gaze. Tarr et al. [56] designed an experiment in which participants, represented by virtual agents, engaged in a collaborative movement activity with two other participants. Participants in the synchrony condition reported significantly higher social closeness to their virtual co-participants compared to those in the non-synchrony condition.

D. Behavioral Mirroring and Joint Attention
Unintentional behavioral mirroring describes the phenomenon of passively and involuntarily mimicking the postures, expressions, and mannerisms of one's counterparts in social settings. In an earlier non-VR study, Chartrand and Bargh [57] examined how this unconscious mimicry can enhance the fluidity of interactions and foster a greater sense of liking between individuals. Their findings pointed at a direct correlation between the degree of behavioral mirroring and participants' self-perceived rapport. Novotny et al. [58] investigated mirroring in interviews. Following the initial interviews, participants who were paired with interviewers who engaged in mirroring showed a greater willingness to share additional information when compared to a control group in which interviewers intentionally refrained from mirroring.
In [59], researchers examined how head gestures exhibited by virtual interviewers shaped trust and liking towards them. Head gestures that appeared to follow naturalistic patterns and that were realistically mimicked led to enhanced synchronization and a stronger perception of mutual understanding. Hence, virtual interviewers capable of providing lifelike conversational backchannels might enhance the immersiveness and authenticity of VR experiences.
VR technologies have provided opportunities for groups with social communication differences to practice joint attention. In [60], a VR-based joint attention practice module helped autistic school-aged children improve their joint attention abilities, leading to more normative eye contact patterns and increased initiation of interactions. Mei et al. [61] reported that customizable virtual humans can remind participants to focus on task-relevant areas within a VR game, improving their task performance.

E. Job Interview Practice
Various tools have been developed to aid individuals in practicing skills for job interviews. Strickland et al. [62] designed a job interview practice suite that comprised multiple approaches, including VR. Other studies have demonstrated the effectiveness of VR-based job interview practice, with participants gaining familiarity with common interview questions, performing better after the practice, and reporting reduced anxiety and increased self-confidence [63], [64], [65].
In summary, previous research has demonstrated the potential of VR for studying human behavior and social interactions. VR simulations of job interviews can let individuals tailor their social communication skills to increase their employment chances while reducing the potential costs of having that same practice delivered by a human coach [66].

A. Main Design
In our simulation, the job interview was for a video game company, as video gaming is a common interest of young adults and because neurodivergent individuals are strongly represented in the video game industry [67]. Users signed a consent form before the simulations. They wore an HTC Vive Pro Eye VR headset to visualize the virtual office space. It contained two virtual interviewers along with common office objects (e.g., desk, notebooks, pens, plant, closet; Fig. 1a). The interview consists of 43 questions, such as "What are some skills you would like to gain while working with us?" and "Have you previously worked with any game engines?". To signal that they fully answered a question, the users use the controller's trigger button. If needed, a question could be repeated by pressing the trackpad. This study was approved by the UC San Diego Institutional Review Board under IRB Protocol 210775 (Date of approval 7/1/2021).
The virtual interviewers ask questions one by one, not referring back to previous questions. The users respond as they wish and can pause to think. Each interviewer asks roughly half of the questions, with assignments fixed across users. For increased realism, an interviewer could ask consecutive questions or remain silent for longer durations.
We used the Live Link Face app to record the facial animations and voices of two individuals while they read 43 predefined job interview lines to a camera. The recorded data was processed in Blender and added on head models as key points. Facial textures were created using photos of the individuals. The interviewers looked engaged by means of head nodding, shaking, and tilting animations, performed when the users talk. In addition, the interviewers provide verbal backchannels such as "uh-huh" and "hmm." An interviewer turns their head towards the other interviewer when the other asks a question. They can also turn to the other interviewer when the latter gives a backchannel during a user response. Other than backchannels and head turns, the interviewers do not engage in nonverbal communication such as arm movements. The interviewers' faces were divided into forehead, eyes, and mouth regions (Fig. 1b).
3D coordinates and angular velocities (pitch, yaw, roll) of the headset are tracked using Unity. Eye tracking uses the built-in tracker of the headset, accessible through Unity using the Vive SRanipal SDK. At each time step, the system collects the gaze origin and direction and finds the virtual object that collides with the corresponding gaze ray. The system records the object label, the intersection location, and the duration of that time step, which is not fixed across time steps. Detected speech levels, also accessible through Unity, are recorded at each time step. During each interview session, in addition to all of the previously mentioned data, the system records whether the user blinked at a time step, the question IDs, and the total time and percentage spent gazing at each face region, and collects everything in a CSV file. These data are not publicly accessible due to our IRB protocol.
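Because the time steps are not uniform, per-region gaze percentages are better weighted by the duration of each step than by a count of samples. The following is a minimal sketch of that bookkeeping, assuming each logged row reduces to a (label, duration) pair; the function name, row layout, and sample values are hypothetical and are not taken from the authors' Unity implementation.

```python
from collections import defaultdict


def gaze_percentages(samples):
    """Return the percentage of session time spent on each gazed object.

    samples: iterable of (label, dt_seconds) pairs, one per time step.
    Each label is weighted by its own dt because step lengths vary.
    """
    totals = defaultdict(float)
    for label, dt in samples:
        totals[label] += dt
    session = sum(totals.values())
    if session == 0:
        return {}
    return {label: 100.0 * t / session for label, t in totals.items()}


# Hypothetical log rows: (object label hit by the gaze ray, step duration in s)
log = [("eyes", 0.020), ("eyes", 0.025), ("mouth", 0.022),
       ("plant", 0.023), ("eyes", 0.010)]
pct = gaze_percentages(log)
```

With these illustrative rows, the "eyes" region accumulates 0.055 s out of 0.100 s of recorded time.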

B. Participatory Design
Autism researchers have recently started favoring participatory design (PD) techniques; autistic individuals coordinating with researchers can co-develop practical technologies that reflect the insights of end users [68], [69]. To improve our application, we conducted a PD session with two autistic adults (one college educated, one not) to discuss the acceptability, ethics, and design of the VR job interview simulation. Our initial design was the office environment we introduced in [70]. Although our design partners were not distracted by their surroundings and thought the space was realistic, they suggested populating the desk with more objects, which would make it look more cluttered and natural. Both design partners thought the interviewer speech was realistic and immersive, and reported that audio was in sync with mouth movements. They suggested adding subtitles as an option to better accommodate the needs of individuals with impaired hearing. They were generally pleased with the execution and timing of the head turns but reported that, although not common, some of them were slightly too fast. One design partner felt interrupted by some of the verbal backchannels as they were sometimes performed while the participant was talking. Per our design partners' suggestions, we put more objects on the virtual desk and modified our design to have the interviewers only perform physical backchannels (e.g., head nods) if the user is speaking, and both backchannel types otherwise.

IV. VOICE ACTIVITY AND BACKCHANNEL TIMING
Our system tracks user speech to control interviewer backchannel behavior. The starting point of a user's response is important as we do not want an interviewer to give a backchannel before the user starts speaking. Likewise, voice activity detection (VAD) is important to prevent the interviewers from interrupting the users by giving out-of-place verbal backchannels, as pointed out during the PD session.
The system's audio sampling rate is 48 kHz. Our application runs at around 45 time points per second, and the root mean squared (RMS) value of the audio samples over 1 second is computed at each time point. A Gaussian weighted window of RMS values centered at a time point is thresholded to decide whether there is voice activity. This approach is defined by the threshold ($t_{RMS}$), window size ($w$), and standard deviation of the Gaussian distribution ($\sigma$). To validate and tune our VAD algorithm, we had 12 individuals wear the headset and read aloud 3 scripts displayed in VR with and without additional background noise (office sounds on YouTube). Each recording took about 3 minutes; we annotated the time intervals where the reader was silent for longer than 0.5 s. In total, 1,582 silence segments were marked (average length = 1.3 s). The optimal parameter set $(\hat{t}_{RMS}, \hat{w}, \hat{\sigma}) = (0.07, 29, 21)$ maximized the sum of true positive rate, true negative rate, and intersection over union value (92.8%, 89%, and 89.8%).
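The thresholding step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the window size and standard deviation are taken to be in units of time points, the Gaussian weights are normalized so that the threshold stays in RMS units, and windows that extend past the ends of the recording are renormalized. The function names and the synthetic RMS trace are hypothetical.

```python
import math


def gaussian_weights(w, sigma):
    """Normalized Gaussian weights for an odd window of w time points."""
    half = w // 2
    raw = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def voice_activity(rms, t_rms=0.07, w=29, sigma=21.0):
    """Flag each time point as voiced when the Gaussian-weighted mean of the
    surrounding RMS values exceeds t_rms (defaults are the reported optimum)."""
    weights = gaussian_weights(w, sigma)
    half = w // 2
    n = len(rms)
    flags = []
    for i in range(n):
        acc = 0.0
        norm = 0.0
        for k, wk in zip(range(-half, half + 1), weights):
            j = i + k
            if 0 <= j < n:  # assumption: renormalize at the recording edges
                acc += wk * rms[j]
                norm += wk
        flags.append(acc / norm > t_rms)
    return flags


# Synthetic RMS trace: silence, speech, silence (60 time points each)
rms = [0.01] * 60 + [0.20] * 60 + [0.01] * 60
flags = voice_activity(rms)
```

Time points well inside the loud segment are flagged as voiced, while those well inside the quiet segments are not; the transitions are smoothed by the window.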
For our application, we had to determine when the virtual interviewers ought to provide a backchannel. We based this decision on timings from 6 in-person 10-minute professional conversations involving an interviewee and two individuals who acted as interviewers. The interviewer wait times for verbal backchannels (VB) and physical backchannels (PB) were similar. From here on, we call the interviewer who asked the most recent question IntQ, and the other interviewer IntO. On average, IntQ gives the first backchannel at t + 8.5 s (s.d. = 2.8 s), and IntO at t + 9.5 s (s.d. = 3.5 s), where t marks the start of the interviewee's response. IntQ and IntO spend about 9 s (s.d. = 2.4 s) and 13 s (s.d. = 5.5 s) between consecutive backchannels. The shortest interviewee response that received a backchannel was 2.5 s.
The virtual interviewers randomly perform backchannels based on these numbers from real conversations, while never giving a VB in the presence of user speech. If 2.5 s of voice activity is detected starting at time t, IntQ gives an initial backchannel at a time drawn uniformly at random from the interval t + 8.5 ± 2.8 s, while this interval is t + 9.5 ± 3.5 s for IntO, as long as the user does not push the trigger button to indicate the end of their answer prior to the backchannel occurrence. For IntQ, the duration between consecutive backchannels is drawn from 9 ± 2.4 s, and from 13 ± 5.5 s for IntO (Table I).
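The scheduling rule can be sketched as below. This is a minimal illustration, assuming the whole schedule for one answer is computed from the onset of voice activity and cut off at the trigger press; the real system presumably decides online, and it additionally suppresses verbal backchannels while the user is speaking, which is not modeled here. The function names and dictionary layout are hypothetical; the numeric values come from the text.

```python
import random

# (mean, half-width) in seconds for the first backchannel and for the gaps
# between consecutive backchannels; values as reported in the text.
TIMING = {
    "IntQ": {"first": (8.5, 2.8), "gap": (9.0, 2.4)},
    "IntO": {"first": (9.5, 3.5), "gap": (13.0, 5.5)},
}
MIN_VOICE_S = 2.5  # shortest response that received a backchannel


def draw(mean, half_width, rng):
    """Uniform draw from [mean - half_width, mean + half_width]."""
    return rng.uniform(mean - half_width, mean + half_width)


def schedule_backchannels(role, t_start, t_end, rng, voiced_s=MIN_VOICE_S):
    """Backchannel times for one interviewer during one answer.

    t_start: onset of voice activity; t_end: time the trigger was pressed;
    voiced_s: seconds of voice activity detected from t_start.
    """
    if voiced_s < MIN_VOICE_S:
        return []
    first = TIMING[role]["first"]
    gap = TIMING[role]["gap"]
    times = []
    t = t_start + draw(first[0], first[1], rng)
    while t < t_end:  # no backchannel once the answer has ended
        times.append(t)
        t += draw(gap[0], gap[1], rng)
    return times


rng = random.Random(0)
times_q = schedule_backchannels("IntQ", 0.0, 40.0, rng)
times_o_short = schedule_backchannels("IntO", 0.0, 5.0, rng)
```

An answer that ends before the earliest possible first backchannel (6 s for IntO) receives none, which matches the behavior described for short responses.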
Among the possibilities of interviewer backchannels and head turns towards each other, we consider four cases: QbOl: IntQ may give backchannels, IntO merely listens; QlOb: IntQ merely listens, IntO may give backchannels; QbOb: both interviewers may give backchannels; and QtOb: IntQ turns his head towards IntO when IntO gives a backchannel, but does not give backchannels to the user. For those 5 questions that were known from [4] to consistently have user responses shorter than the average initial backchannel times in Table I, the virtual interviewers do not give backchannels. The remaining 38 questions were assigned with 10 each to the first two cases, and 9 each to the other two, where questions with answers of different lengths were split evenly among the cases. These assignments were fixed across participants.

V. SYSTEM PIPELINE
The system pipeline is visualized in Fig. 2. The process begins with a VR session in which the user's voice activity is tracked in real time to determine the timing of conversational backchannels by the avatars, and all gaze data and head motion data are recorded. After each mock job interview session, the gaze data and head motion data from the headset are processed separately. The middle portion of the diagram (non-real-time) shows the data processing, which consists of gaze filtering (top branch, described in Section V-A) and head motion processing (bottom branch, described in Section V-B). Finally, the analysis portion on the right uses the sequence of gaze object labels and head orientation angles to tabulate the user's engagement under different conditions (results shown in Section VI).

A. Gaze Processing
The gaze data is processed using the algorithm from [4], which reduces tracking inaccuracies due to factors like blinking, headset slippage, and abrupt head movements. As shown on the upper branch of Fig. 2, the gaze filtering algorithm begins with Kalman smoothing to remove blinks and jitter, then clusters the gaze locations in time and space using ST-DBSCAN [71]. The algorithm has a final step in which points that were initially labeled as noise (no cluster assignment) by ST-DBSCAN get relabeled based on their spatial-temporal distance from the clusters.
The gaze filtering algorithm is defined by four parameters: maximum spatial ($\epsilon_1$) and temporal ($\epsilon_2$) distance to form/share a cluster, minimum number of gaze points to form a cluster ($minPts$), and the weight assigned to the temporal distance ($w_t$) in the spatial-temporal distance metric used to relabel initial noise gaze points. Detailed parameter explanations can be found in [4].
We tuned this algorithm for this new triadic conversation version of our simulation. Following the protocol from [4], an experimenter instructed 10 participants to look at an object (e.g., forehead, eyes, mouth, plant, notebook, mug; Fig. 1a); subjects promptly shifted their gaze and fixated on it. Each participant completed 5 sessions (average 185.6 s), which adds up to 2.6 hours of annotated gaze data for tuning. Subject-wise leave-one-out cross validation (CV) was used, so in each fold 45 recordings from 9 individuals determined the optimal hyperparameter set, which was tested on the remaining 5 recordings. The optimal hyperparameter set in each fold was the one that minimized the sum of forehead, eye, and mouth region gaze percentage errors [4] averaged over all 45 recordings.
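The subject-wise split can be sketched as follows: every recording of one participant is held out together, so no participant contributes to both tuning and testing within a fold. The helper names, the tuple layout, and the toy error function are hypothetical; the actual error is the sum of forehead, eye, and mouth gaze percentage errors defined in [4].

```python
def leave_one_subject_out(recordings):
    """Yield (held_out_subject, train, test) for subject-wise leave-one-out CV.

    recordings: list of (subject_id, recording_id) tuples.
    """
    subjects = sorted({subject for subject, _ in recordings})
    for held_out in subjects:
        train = [r for r in recordings if r[0] != held_out]
        test = [r for r in recordings if r[0] == held_out]
        yield held_out, train, test


def select_hyperparameters(candidates, train, error_fn):
    """Pick the candidate with the lowest error averaged over the training
    recordings. error_fn(params, recording) is supplied by the caller."""
    def mean_error(params):
        return sum(error_fn(params, rec) for rec in train) / len(train)
    return min(candidates, key=mean_error)


# 10 participants x 5 sessions, as in the tuning protocol
recordings = [(s, k) for s in range(10) for k in range(5)]
folds = list(leave_one_subject_out(recordings))

# Toy stand-in for the gaze percentage error: best at params == 3
best = select_hyperparameters([1, 2, 3, 4], folds[0][1],
                              lambda params, rec: abs(params - 3))
```

Each of the 10 folds therefore tunes on 45 recordings and tests on the 5 recordings of the held-out participant.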

B. Head Motion Processing and Machine Learning-Based Yaw Adjustment
As shown on the lower branch of Fig. 2, we track head movements by tracking the headset object's position and orientation. Our goal is to explore how users engage with the virtual interviewers whose locations are known with respect to the center C of the virtual chair (Fig. 3). Irrespective of headset location, a yaw angle of 0° corresponds to facing forward. However, a target placed 30° left of C will not be 30° left of the headset if the headset moves. Hence, an algorithm is needed to connect headset readings and known target locations.
The distance from C to the interviewers is 1.5 m. The one on the right of Fig. 3 is centered at 22.5° (spanning 22.5° ± 9°), and the other spans −22.5° ± 10° with respect to the z (forward) axis of the virtual chair. Since the interviewers appear as targets in the horizontal direction for a user in the virtual chair, we are interested in yaw measurements. In [70], we developed a geometric approach to compute the yaw $\beta$ around C that corresponds to facing the same virtual target the user is facing from a different position $(H_x, H_y, H_z)$ and rotation angle $\alpha$ (Fig. 4a).
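One plausible way to realize such a geometric adjustment is sketched below; the actual construction is given in [70] and may differ. The sketch assumes that x points right, z points forward, and positive yaw turns to the right; that the headset lies inside the circle of radius 1.5 m around C; and that the adjusted yaw is the angle, seen from C, of the point where the headset's forward ray crosses that circle. The function names are hypothetical; the interviewer spans are taken from the text.

```python
import math

RADIUS_M = 1.5  # distance from the chair center C to the interviewers

# Angular spans of the two interviewers around C, in degrees
TARGET_SPANS = {
    "right": (22.5 - 9.0, 22.5 + 9.0),
    "left": (-22.5 - 10.0, -22.5 + 10.0),
}


def adjust_yaw(alpha_deg, h_x, h_z, radius=RADIUS_M):
    """Yaw around C that faces the point hit by the headset's forward ray.

    alpha_deg: measured headset yaw; (h_x, h_z): headset position relative
    to C in the horizontal plane (assumed to be inside the circle).
    """
    if math.hypot(h_x, h_z) >= radius:
        raise ValueError("headset must be inside the target circle")
    a = math.radians(alpha_deg)
    d_x, d_z = math.sin(a), math.cos(a)
    # Solve |H + s * d| = radius for the positive ray length s.
    h_dot_d = h_x * d_x + h_z * d_z
    s = -h_dot_d + math.sqrt(h_dot_d ** 2 - (h_x ** 2 + h_z ** 2 - radius ** 2))
    p_x, p_z = h_x + s * d_x, h_z + s * d_z
    return math.degrees(math.atan2(p_x, p_z))


def faced_interviewer(beta_deg):
    """Name of the interviewer whose span contains beta_deg, else None."""
    for name, (low, high) in TARGET_SPANS.items():
        if low <= beta_deg <= high:
            return name
    return None


# A user seated 0.5 m to the right of C and facing straight ahead
beta = adjust_yaw(0.0, 0.5, 0.0)
```

In this example the raw yaw of 0° would suggest that the user faces neither interviewer, whereas the adjusted yaw of roughly 19.5° falls inside the right interviewer's span.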
To validate, we instructed users to face spheres placed 1.5 m from C at angles between −90° and 90°, in increments of 15° (Fig. 4b). An experimenter told 12 subjects to orient towards specific spheres in a random order. This protocol was repeated for 9 different positions: left, center, right, back-left, back, back-right, front-left, front, and front-right. The center position corresponds to sitting at C. The other positions correspond to being seated half a meter away from C in the given direction. All of the spheres were targeted from all positions.
Assuming minimal head position shifts, the center position does not require yaw adjustment. We examine the mean absolute difference (MAD) between the measured yaw angle ($\alpha$) and the horizontal angle with respect to C at which the designated sphere was located ($\beta_{GT}$) as a baseline. For the center position, $\mathrm{MAD} = \left(\sum_{i=1}^{S} |\beta_{GT,i} - \alpha_i|\right)/S$, where $S$ is the number of spheres that were faced from a position. $S > 13$ because all 13 spheres were oriented towards and because of repeated instructions. The center MAD averaged over all subjects was 3°. This small error is due to the fact that subjects cannot orient exactly towards the spheres. For positions other than center, $\mathrm{MAD} = \left(\sum_{i=1}^{S} |\beta_{GT,i} - \beta_i|\right)/S$. The average MAD for unprocessed yaw measurements over all positions was 15.1°, whereas the average MAD achieved by our previously developed geometric approach was 3°, equal to the baseline.
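The MAD metric is straightforward to compute, as the short sketch below shows. It assumes that angles are given in degrees and stay within −90° to 90°, as in the sphere protocol, so no wrap-around handling is needed; the function name and the example angles are hypothetical and are not measured data.

```python
def mean_absolute_difference(ground_truth_deg, estimated_deg):
    """Mean absolute difference between ground-truth sphere angles and
    estimated (raw or adjusted) yaw angles, both in degrees."""
    if len(ground_truth_deg) != len(estimated_deg) or not ground_truth_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(g - e) for g, e in zip(ground_truth_deg, estimated_deg))
    return total / len(ground_truth_deg)


# Illustrative values only
sphere_angles = [-30.0, 0.0, 45.0]
yaw_estimates = [-27.0, 2.0, 41.0]
mad = mean_absolute_difference(sphere_angles, yaw_estimates)
```

Passing the raw yaw for the center position and the adjusted yaw for the other positions reproduces the two variants of the formula.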
Ray casting is commonly used for accurate 3D target acquisition, especially when there is a clear path between the user and the target. Precise object selection in virtual environments is possible through head tracking-based ray casting [72], [73]. Evaluating the head tracking-based ray casting using the spheres yielded an average MAD of 3°, equal to that achieved by the lower complexity geometric approach.
Although the simple three degrees-of-freedom geometric approach works well, it cannot account for some behaviors. For some positions, the participants tended to tilt their heads or rotate their shoulders, not just turn their heads, to accurately orient towards the target sphere. The average absolute pitch angle was 4°, 4°, and 4.8° for the positions in the back, middle, and front rows, and the maximum pitch was 8° (for the sphere at 45° from the front-right position). This means participants tilted their head upwards when facing nearby spheres. The average roll angles were 2 [...]
The subjects were instructed to face a sphere for 4.4 s on average. For training and validation of the MLP model, $N$ random data points were sampled from each instruction. Our MLP model is defined by 6 potential inputs from the VR headset: yaw, pitch, roll, $H_x$, $H_y$, $H_z$. We used Random Forest Recursive Feature Elimination (RF-RFE) [74] in a subject-wise leave-one-out CV manner, using the data from 11 subjects to train the random forest (including different configurations) and testing it on the data of the held-out subject. On average, the recorded yaw value ($\alpha$) was the most important feature with an importance of 92%, followed by $H_x$ (4%) and $H_z$ (3%). As pitch, roll, and $H_y$ had average importance values less than 1%, we elected to use ($\alpha$, $H_x$, $H_z$), the same set used in the geometric approach.
Next, using the data from 12 subjects, we ran subject-wise leave-one-out CV to train and validate the MLP, while also optimizing $N$. We used an adaptive learning rate. The optimal model had 3 hidden layers with 32 nodes in each layer. Best results were achieved for $N = 45$, which corresponds to randomly sampling 1 s of data from each turn towards a sphere. For this model, the average MAD over all positions was 2.6°, compared to 3° for the geometric approach (see Table II). By exploiting the strong correlations between the pitch/roll rotations and the recorded headset positions (Pearson correlation |r| = 0.52, p = 0.045, on average), the machine learning method is able to more accurately reflect whether the subject is orienting towards one or another interviewer.

VI. RESULTS
In this section, we compare participants' gaze and head orientation tendencies in our triadic VR job interview simulation as a function of conversational role and neurodivergence status, which has not been previously attempted in the context of VR. We make use of two non-parametric significance tests. The Mann-Whitney U test assumes that the two groups are sampled from the same type of distribution (e.g., normal, right-skewed) and is valid for both normally and non-normally distributed data [75]. The two-sample Kolmogorov-Smirnov (K-S) test evaluates the cumulative distributions of two data sets with no assumptions on the distributions; statistic D represents the maximum distance between the cumulative distributions of two groups. To decide which test to use, we employ the Shapiro-Wilk test of normality, accompanied by skewness tests.
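To make the two test statistics concrete, the sketch below computes the K-S statistic D and the Mann-Whitney U statistic from their definitions. It illustrates the statistics only: p-values are not computed, and in practice a statistics package would be used (for example, `scipy.stats.ks_2samp` and `scipy.stats.mannwhitneyu`). Packages differ in which sample's U they report, so the value below is for the first sample. The function names and sample values are hypothetical and are not the study data.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic D: the largest absolute gap between the
    empirical cumulative distributions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(f_a - f_b))
    return d


def mann_whitney_u(a, b):
    """U statistic for sample a: the number of pairs (x in a, y in b) with
    x > y, counting ties as one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Illustrative samples only
group_1 = [1.0, 3.0, 5.0]
group_2 = [2.0, 4.0, 6.0]
d_stat = ks_statistic(group_1, group_2)
u_stat = mann_whitney_u(group_1, group_2)
```

The two U values of a pair of samples always sum to the product of the sample sizes, which offers a quick sanity check.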
Fifteen autistic (9 male, 4 female, 2 non-binary) and 15 NA individuals (11 male, 4 female) took part in the simulations. Participant ages ranged from 18 to 28, except for one, who was 43. Individuals were assigned to the autistic group if they had received a community autism diagnosis in the past. The NA participants were drawn from the university graduate student population with the criteria that they did not identify as autistic and had not participated previously in this VR mock job interview simulation. The autistic participants, also mostly university students, were recruited through a neurodiversity-focused technical summer internship program. We contacted all interns from two summers for participation, and accepted all those who chose to participate; the number of NA participants was selected to match the number of autistic participants. Before each session, the eye-tracker was calibrated following the headset's default calibration procedure. The average session length was 22.3 minutes (s.d.=8.3 minutes).
After a session, participants were asked to complete the Autism-Spectrum Quotient (AQ) survey, a quantitative self-evaluation of autistic traits in an individual [76]. Fig. 5 shows the reported AQ scores (one autistic participant did not disclose their score). The average AQ score for the autistic participants was 29.2 (s.d.=6.6), whereas it was 16.9 (s.d.=6.7) for the NA group. The NA participants with AQ scores higher than 25, which is the lower bound indicating autism according to [76], were kept in the NA group as they had neither received a community autism diagnosis nor self-identified as autistic. Likewise, autistic individuals with scores lower than 25 remained in the autism group. Potential reasons for such scores include familiarity with the test and its normative answers, or a decrease in autistic traits over time after an initial autism diagnosis [77].

A. Gaze and Head Orientation Analysis
Fig. 6 displays the effect of conversational role on gaze behavior. On average, both neurotypes gazed at the interviewers' faces more when they were listening to a question, compared to when they were speaking. Higher overall gaze percentages were recorded for the NA participants. The NA participants looked more at interviewers' eyes than did autistic participants (63.9% (s.d.=18.7%) versus 40.8% (s.d.=31.3%) while listening, and 54.2% (s.d.=26.1%) versus 23.4% (s.d.=19.9%) while speaking). For the listener role, the eye contact percentage distributions for both groups were normal. A Mann-Whitney U test deemed the difference between these distributions significant, U=65, p=0.05. For the speaker role, the eye contact percentage distributions for the autistic and NA participants were left-skewed and normal, respectively. A K-S test revealed the difference between these percentages to also be significant, D=0.53, p=0.01. These results comport with the findings of [4].
We also investigated the social modulation of head orientation behavior (Fig. 7). We defined 3 yaw angle regions: Interviewers, Interior, and Exterior. Both neurotypes mostly faced the interviewers regardless of conversational role. However, as speakers, participants turned more to the region between the virtual interviewers. The NA participants oriented towards the interviewers more than the autistic group did (86.5% versus 70.6% for listening, which narrowly missed significance, and 76.2% (s.d.=19.8%, right-skewed) versus 58.8% (s.d.=28.6%, left-skewed) for speaking, which was significant, D=0.53, p=0.03), whereas the autistic participants faced the Interior region more (29.1% versus 15.1% for listening, which also narrowly missed significance, and 41% (s.d.=28.7%, right-skewed) versus 23.7% (s.d.=19.7%, left-skewed) for speaking, which was significant, D=0.53, p=0.03). The participants did not turn to the Exterior region, so we exclude it hereafter.
Our design allows for a more granular analysis based on the identity of IntQ, the interviewer who asked the most recent question. We split the face regions in Fig. 6 into smaller ones, producing regions FQ, FO, EQ, EO, MQ, and MO, where F, E, and M denote forehead, eyes, and mouth, and Q and O denote IntQ and IntO. Fig. 8 shows that the NA participants mainly gazed at IntQ's eyes regardless of their conversational role. They made eye contact with IntO more in the speaker role compared to the listener role (12.4% (s.d.=10.3%, normal) versus 3.7% (s.d.=1.5%, normal), which was significant, U=56, p=0.02). The autistic participants looked at IntQ's mouth the most as listeners. Both groups gazed at the interviewers' faces less when speaking than when listening (85.3% (s.d.=8.4%, normal) while listening versus 45.2% (s.d.=25.6%, normal) while speaking for the autistic participants, which was significant, U=202, p<0.001; 93.2% (s.d.=6.3%, right-skewed) versus 66.2% (s.d.=20.5%, normal) for the NA participants, which was significant, D=0.8, p<0.001). Overall, the autistic participants tended to avert their eyes more from the interviewers.
Similarly, the spatial regions in Fig. 7 can be divided to show whether they relate to IntQ or IntO. The Interviewers region is divided into IntQ and IntO, while the Interior region is divided into Interior-Q and Interior-O, representing the yaw angles from 0° to either −12.5° or 13.5° based on which interviewer asked the most recent question. Fig. 9 shows that when listening to a question, the participants mostly dwelt around IntQ. As listeners, both groups mainly faced IntQ, and they faced the region between him and the 0° line more than IntO. As speakers, both groups oriented towards IntO more than they did when they were listening; however, this increase was not statistically significant for the autism group (5.2% (s.d.=5.6%) versus 3.4% (s.d.=3.3%)), whereas the NA group's attention shift was significant (12.2% (s.d.=11.3%, left-skewed) versus 3.7% (s.d.=1.8%, normal), D=0.6, p=0.009).
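The region-based analysis above amounts to labeling each processed yaw sample and reporting the share of samples per label. The sketch below illustrates this under several assumptions that are not taken from the paper: negative yaw is assumed to mean "left", the outer limits of the interviewer regions (±40°) are placeholders, and all names are hypothetical. Only the inner boundaries (−12.5° and 13.5°) come from the text.

```python
from collections import Counter
from typing import Dict, Sequence

INNER_LEFT_DEG = -12.5   # from the text: Interior spans 0° to −12.5° on one side
INNER_RIGHT_DEG = 13.5   # and 0° to 13.5° on the other
OUTER_LEFT_DEG = -40.0   # hypothetical outer edge of the left interviewer's region
OUTER_RIGHT_DEG = 40.0   # hypothetical outer edge of the right interviewer's region


def classify_yaw(yaw_deg: float, questioner_side: str) -> str:
    """Label one yaw sample; questioner_side is 'left' or 'right' (where IntQ sits)."""
    if yaw_deg < OUTER_LEFT_DEG or yaw_deg > OUTER_RIGHT_DEG:
        return "Exterior"
    side = "left" if yaw_deg < 0 else "right"  # assumed sign convention
    role = "Q" if side == questioner_side else "O"
    if INNER_LEFT_DEG < yaw_deg < INNER_RIGHT_DEG:
        return f"Interior-{role}"
    return f"Int{role}"


def region_percentages(yaws_deg: Sequence[float], questioner_side: str) -> Dict[str, float]:
    """Percentage of samples that fall into each region."""
    counts = Counter(classify_yaw(y, questioner_side) for y in yaws_deg)
    return {label: 100.0 * n / len(yaws_deg) for label, n in counts.items()}


if __name__ == "__main__":
    print(region_percentages([-20.0, -20.0, 5.0, 20.0], questioner_side="left"))
```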
We also measured individuals' levels of social exclusion while responding to a question. Since it is natural to look at a person (IntQ) when they ask one a question, here we examine exclusion based on whether IntO is also attended to during the response. We looked at the longest duration without interacting (gaze or head turn) with IntO while answering a question, averaged over all questions and over all participants. For the autistic participants, this duration was 17.4 s (s.d.=7.5 s), and for the NA participants, it was 13.2 s (s.d.=5.2 s); the distributions are shown in Fig. 10. There is substantial overlap between the two groups. A few participants, both autistic and NA, tended to exclude the interviewer who did not ask the question for 25-30 seconds at a time, while primarily attending to the interviewer who asked the question. This suggests that this VR tool might be useful for solo situational practice by both autistic and NA individuals, who could practice answering questions while also giving attention to both interviewers.
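The exclusion measure can be read as the longest run of consecutive samples in which the participant neither gazes at nor faces IntO. A minimal sketch of that reading follows. The per-sample boolean inputs, the function names, and the 45 Hz rate (inferred from N=45 corresponding to 1 s of data) are assumptions made for illustration.

```python
from statistics import mean
from typing import Iterable, Sequence, Tuple


def longest_exclusion_s(gaze_on_int_o: Sequence[bool],
                        head_on_int_o: Sequence[bool],
                        sample_rate_hz: float) -> float:
    """Longest stretch (seconds) with neither gaze nor head orientation on IntO."""
    longest = current = 0
    for gazing, facing in zip(gaze_on_int_o, head_on_int_o):
        if gazing or facing:
            current = 0          # any interaction with IntO resets the run
        else:
            current += 1
            longest = max(longest, current)
    return longest / sample_rate_hz


def mean_longest_exclusion_s(answers: Iterable[Tuple[Sequence[bool], Sequence[bool]]],
                             sample_rate_hz: float) -> float:
    """Average the per-answer longest exclusion over all answered questions."""
    return mean(longest_exclusion_s(g, h, sample_rate_hz) for g, h in answers)


if __name__ == "__main__":
    rate = 45.0  # assumed sampling rate in Hz
    gaze = [False] * 90 + [True] + [False] * 45
    head = [False] * len(gaze)
    print(longest_exclusion_s(gaze, head, rate))
```

The same run length could drive the automated feedback mentioned as future work, for example by alerting the user once the current run exceeds a chosen threshold.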

B. Effects of Interviewer Backchannels and Head Turns
As introduced in Section IV, this VR simulation setup can also be used to explore the influence that backchannel cues and head turns performed by the virtual interviewers have on participant behavior. We investigated 4 cases: QbOl, QlOb, QbOb, and QtOb, where Q and O denote IntQ and IntO, and b, l, and t indicate whether an interviewer can give backchannels, listens only and gives no backchannels, or turns to the other interviewer when the other gives a backchannel. We measured the maximum unsigned yaw angle changes that emerged as a reaction to interviewer backchannels or head turns, averaged over the total number of backchannels or head turns for each case. In total, for QbOl, QlOb, and QbOb, 305, 313, and 184 backchannels were given, and for QtOb, 242 head turns were performed. These backchannels and head turns (hereafter, interviewer cues) that the interviewers perform are animations with known durations. Including an extra 1 second to give users time to react, the average lengths of the animations for VBs, PBs, and head turns during the mock interviews were 3 s, 6 s, and 5 s, respectively.
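One plausible reading of the reaction measure is sketched below: for each interviewer cue, take the largest absolute deviation of the yaw from its value at cue onset within the cue's time window, then average over all cues of that case. The choice of the onset value as baseline, the function names, and the sampling rate are assumptions; the paper's exact computation may differ.

```python
from statistics import mean
from typing import Sequence


def max_unsigned_yaw_shift(yaw_deg: Sequence[float], onset_idx: int,
                           window_s: float, sample_rate_hz: float) -> float:
    """Largest |yaw - yaw at cue onset| within the cue window (degrees)."""
    n_samples = int(round(window_s * sample_rate_hz))
    end = min(len(yaw_deg), onset_idx + n_samples + 1)
    baseline = yaw_deg[onset_idx]
    return max(abs(y - baseline) for y in yaw_deg[onset_idx:end])


def mean_cue_shift(yaw_deg: Sequence[float], onset_indices: Sequence[int],
                   window_s: float, sample_rate_hz: float) -> float:
    """Average the per-cue maximum shift over all cues of one case."""
    return mean(max_unsigned_yaw_shift(yaw_deg, i, window_s, sample_rate_hz)
                for i in onset_indices)


if __name__ == "__main__":
    # Toy trace sampled at 1 Hz, with cues starting at samples 1 and 4.
    trace = [0.0, 1.0, 3.0, -6.0, 2.0, 10.0]
    print(mean_cue_shift(trace, [1, 4], window_s=3.0, sample_rate_hz=1.0))
```

In this reading, `window_s` would be set to the animation length of the cue type (3 s, 6 s, or 5 s in the text, each including the extra 1 s reaction time).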
The NA participants reacted to QtOb the most; when IntQ turns towards IntO following a backchannel by IntO, the NA participants tend to mirror IntQ's behavior. This joint attention pattern was less prominent for the autistic participants, which is consistent with a previous study involving in-person conversations [28]. For the NA participants, the maximum unsigned yaw shifts were larger in the 4 backchannel cases than in the short answer cases, which had no backchannels; this difference could be due to the interviewer backchannels. In contrast, autistic individuals turned their heads slightly less on average in the 4 backchannel cases than in the short answer cases. Across the 4 cases, autistic individuals behaved similarly regardless of interviewer cues. Mann-Whitney U tests found that the average maximum yaw shifts were significantly different for the two participant groups for QlOb and QtOb, with U=55, p=0.01, and U=53, p=0.008, respectively. Therefore, the two neurotypes tended to differ the most when IntQ did not give backchannels but rather stayed idle or acknowledged IntO's backchannels.

C. User Experience
Upon completing the VR simulation, participants were asked about the repeatability, complexity, user-friendliness, and realism of the application. Table III presents the results. They were asked to respond with their level of agreement on a 5-point Likert scale from 1 = "Not realistic at all" to 5 = "Acceptably realistic" for the last 3 items in Table III, and from 1 = "Strongly Disagree" to 5 = "Strongly Agree" for all the other items. High repeatability, realism, and user-friendliness scores, together with low complexity scores, support the suitability of our triadic self-deliverable VR mock job interview simulation for job interview practice. Open-ended questions in the survey asked what the participants liked the most, and what design components could be improved. The participants enjoyed being able to practice for a job interview in a low-stakes environment and with human-like interviewers. They found some interviewer head turns a bit fast and abrupt, and suggested having more background noise. Some wished the headset were more comfortable, and suggested the app should facilitate taking a break. Overall, 25 participants out of 30 reported that they would use the application for solo job interview practice.

VII. CONCLUSION
In professional settings such as job interviews, applicants who use gaze and head orientation effectively can improve their likelihood of employment [14]. In this study, we presented a triadic VR mock job interview that lets users familiarize themselves with popular interview questions while being informed about their attention distribution and social exclusion tendencies. Our participants favored the repeatability, user-friendliness, and realism of our design.
Using a machine learning architecture to accurately detect head turns towards the virtual interviewers, and a signal processing-based algorithm that can mitigate the common problems with eye tracking in VR [4], we showed that regardless of their conversational role or their neurotype, our participants primarily interacted with the interviewer who posed the most recent question. As listeners, the autistic participants gazed at the interviewers' mouths more than at their eyes, and more than the NA individuals gazed at the mouths. This was also reported in [4] on dyadic VR conversations, which shows that the autistic participants followed similar gaze trends in triadic cases.
Some of our findings are novel in the realm of immersive VR. NA participants engaged with the interviewer who did not pose the most recent question significantly more when speaking compared to listening, whereas the autistic participants did not. We also discovered differences in joint attention tendencies; NA participants tend to mirror an interviewer's behavior in turning to the other interviewer, whereas autistic participants do this significantly less. These behaviors have not been previously addressed using VR. Also, a few participants of both neurotypes tended to exclude the interviewer who did not ask the most recent question, suggesting that this system might be useful in the general population to practice interview and engagement skills.
Although our system can provide a useful solo practice opportunity for job interviews, it has some limitations, including the relatively small number of participants and the significant overlap between the AQ scores of the two participant groups, likely because the groups were based only on community diagnosis and self-identification as autistic or not. The design of the VR application also has limitations, in particular that some avatar head turns were too fast, and that users were not able to go back to a previous question (useful if a user accidentally skips a question).
In future work, we intend to expand our participant set. We will design multiple question sets so that users can encounter a more diverse set of questions and can use the application more than once. We plan to address the feedback we received. We will add captions to make our design more inclusive for people with hearing disabilities, and will redesign some head turn animations to make them more natural. We also aim to create an interview practice tool that offers automated feedback, such as alerting users in case of social exclusion.

Fig. 2. System pipeline, showing the VR session on the left, the data processing in the middle section, and the behavioral analysis on the right. In the figure, α is the yaw angle measured by the headset, and H_x and H_z are the x- and z-positions of the headset. Note that the analysis portion also takes as input from the VR application (not shown) the identity of IntQ/IntO and whether IntQ turned his head towards IntO at a time point or not.

Fig. 3. Position and orientation of the interviewers with respect to the center of the virtual chair.

Fig. 4. (a) C, H, and T represent the center of the virtual chair, the headset, and the target location that the user is facing. D is the projection of T on the forward z axis of the headset. The x-axis is the horizontal motion axis in Unity. α is the yaw angle measured by the headset, with respect to the forward z axis of the headset object. β is the output yaw angle of the algorithm. (b) Head rotation data collection setup. Nine chair positions were used during data collection, over the area spanned by the red arrows.
The largest roll value (5.3°) was recorded when facing the sphere at −90° from the front position. As an ML model can learn that although head positions might change due to head tilts, they still correspond to facing the same target, we developed an MLP regressor-based yaw adjustment module. An MLP is a fully-connected feed-forward artificial neural network with at least three layers (at least one hidden layer) with a nonlinear activation function.

Fig. 5. Autism-Spectrum Quotient (AQ) scores reported by participants. The broken black line marks the threshold introduced in [76]; scores higher than this point to an increased chance of autism based on this self-report questionnaire.

Fig. 11. Maximum unsigned yaw shifts for each interviewer cue type, averaged across all participants.

TABLE I
TIME TO FIRST BACKCHANNEL, AND BACKCHANNEL SPACING

TABLE II
MAD VALUES AFTER PROCESSING THE YAW VALUES WITH GEOMETRIC OR MLP-BASED APPROACH

TABLE III
SUMMARY OF REPEATABILITY, COMPLEXITY, USER-FRIENDLINESS, AND REALISM SCORES