Stress Assessment for Augmented Reality Applications Based on Head Movement Features

Augmented reality is one of the enabling technologies of the upcoming future. Its usage in working and learning scenarios may lead to a better quality of work and training by helping the operators during the most crucial stages of processes. Therefore, the automatic detection of stress during augmented reality experiences can be a valuable support to prevent consequences on people's health and foster the spreading of this technology. In this work, we present the design of a non-invasive stress assessment approach. The proposed system is based on the analysis of the head movements of people wearing a Head Mounted Display while performing stress-inducing tasks. First, we designed a subjective experiment consisting of two stress-related tests for data acquisition. Then, a statistical analysis of head movements has been performed to determine which features are representative of the presence of stress. Finally, a stress classifier based on a combination of Support Vector Machines has been designed and trained. The proposed approach achieved promising performances thus paving the way for further studies in this research direction.


Anna Ferrarotti, Student Member, IEEE, Sara Baldoni, Member, IEEE, Marco Carli, Senior Member, IEEE, and Federica Battisti, Senior Member, IEEE
Index Terms-Augmented reality, stress detection, machine learning classifier.

I. INTRODUCTION
In recent years, the success of immersive technologies has steadily increased [1], and nowadays Augmented Reality (AR) and Virtual Reality (VR) are used in a wide range of applications. While VR is often employed for entertainment [2] (e.g., gaming or streaming applications), AR systems are expected to become a key element in future work and learning scenarios (e.g., professional training, medical studies) [3], [4]. Under these circumstances, the onset of stress while using an AR application may impair the completion of the required tasks. Therefore, the detection of stress conditions while using Head Mounted Displays (HMDs) is a relevant aspect to be investigated to fully exploit the potential of these new technologies. In fact, the physiological and psychological consequences of stress can interact with each other, causing dangerous changes in the physiological systems of the human body and, at the same time, changes in a person's behavior, thus leading to an unhealthy status [5]. In addition, since the technology is still under development, the HMD itself can contribute to the occurrence of a stress condition (e.g., through heaviness, blurred vision, or an uncomfortable fit), thus hindering the adoption of HMDs.
The onset of a stress condition can be identified and measured because the human body reacts to stress with a defense mechanism, represented by physiological variations [6]. These reactions cause changes in different biosignals, such as the Electroencephalogram (EEG) or the Electrocardiogram (ECG) signals [7]. Although several studies in the literature have addressed the automatic detection of stress or workload levels [8], the design of non-invasive systems is still an open problem. Therefore, in this work, we present a stress assessment system for an AR scenario, focusing on a static AR task (e.g., reading), which exploits the head movement data recorded by the HMD. In this way, the sensors embedded in the HMD can be directly employed for stress analysis, thus offering considerable benefits with respect to physiological measurements. Unlike the approaches based on physiological parameters, the proposed solution makes it possible to detect the onset of stress with the same device used during working/learning activities, without the need for any hardware other than the AR HMD.
The main contributions of this research are the following:
- The design and implementation of a subjective experiment involving two stress-inducing tests in an AR scenario. The acquired data are available to the research community and can be found at https://muse.uniroma3.it/headdataset.
- The demonstration of the relation between head movements and the presence of stress through the recording and analysis of head movements acquired during the subjective tests.
- The design and test of a stress classifier based on the acquired data. The usage of two stress-inducing tests allowed us to verify the generalization capabilities of the proposed classifier.
The paper is structured as follows. In Section II, an overview of the related works is presented. In Section III, the proposed method for stress assessment is described. More specifically, the stress-inducing tests, the experimental protocol, and the head movement features extracted through the HMD are presented. Moreover, the architecture of the proposed stress classifier is detailed. In Section IV, the results obtained from the statistical analysis and the classification performances are presented. Finally, in Section V the limitations of the current study and the future research directions are sketched, and in Section VI the conclusions are drawn.

II. RELATED WORKS
In this section, a brief summary of the most widely adopted stress-inducing tasks is provided and the stress detection approaches presented in the literature are described.

A. Stress Inducing Tasks
Due to the complex mechanisms behind the onset of stress, it is necessary to evaluate the body response to a stressor in controlled environments before testing the designed systems in real-life scenarios. In the literature, several methods have been proposed to induce stress in a laboratory environment. Among them, one of the most used is the Stroop Color Word Test (SCWT) [9], since its capability to induce stress is well-established [7], [10]. In [11], heart rate, frequency of skin conductance responses, and self-reported anxiety were recorded while performing a SCWT to demonstrate its correlation with stress. The analysis showed that the SCWT may act as an efficient laboratory stressor. Similarly, in [12], the stress induced by the SCWT was measured using electrocardiographic and heart rate variability signals. The results of this study highlight the presence of a significant variation in these signals between the stressed and the normal conditions. The SCWT presents to the user different color names in two conditions. In the first condition, i.e., the congruous condition, the words are all written in black and the user is asked to read the words aloud. In the second condition, i.e., the incongruous condition, the color of the words and their meaning are not coherent (e.g., the word yellow may be written in red). In this case, the user is asked to pronounce the color in which the word is represented and not the word's meaning. This creates a conflict between the automatic reading process and the task of naming the color in which the word is written [13]. Variations of the SCWT have also been proposed, either combining the visual stimulus with an auditory stimulus [14] or introducing time limitations [11]. In addition, some authors have added a second congruous condition in which the word colors match their meaning, thus realizing an intermediate step between the original congruous and incongruous phases [10], [15].
Another class of stress-inducing tests, which also requires a high level of cognitive engagement from the users, is represented by Mental Arithmetic (MA) tasks. Mental arithmetic involves performing calculations and solving mathematical problems only through mental processes, without the use of external aids or tools. It requires the ability to manipulate numbers, apply mathematical operations, and accurately derive results in one's mind. Different implementations of the MA test have been proposed in the literature [16], [17], [18]. The Paced Auditory Serial Addition Test (PASAT) presents to the user a series of single-digit numbers. Each time a new number is presented, the user is asked to add it to the immediately preceding one [19]. This test is also used in the Mannheim Multi Component Stress Test (MMST), which incorporates five distinct stressors presented in combination to the users. The first stressor is represented by the PASAT, during which 44 pictures depicting negative feelings are presented in the background, serving as the second stressor. Moreover, acoustic (i.e., white and random explosion noise) and motivational stressors are included. Finally, within the category of MA-related stress-inducing tasks, the Montreal Imaging Stress Task (MIST) requires the user to solve mathematical problems in a limited amount of time, which varies depending on the cognitive capability of the user [20].
Other stress-inducing tests exploit sensory cues to elicit stress. The unpleasant pictures in the International Affective Picture System (IAPS) database can be used to induce a stress response [21]. In [22], [23], the onset of stress is caused by a horn sound played randomly during the test. In the Cold Pressor Test [24], participants are asked to immerse a hand or foot in cold water.
Another category of stress-inducing tests exploits social pressure. In the Trier Social Stress Test (TSST) [25], participants are asked to prepare, in a limited amount of time, a presentation to be given in front of an audience that does not provide any positive feedback. In addition, participants are asked to perform a MA test without previous notice. Moreover, the Maastricht Acute Stress Test (MAST) combines the TSST, the cold pressor, and the MA tests [26].
Other stress-inducing tests are based on a gamification approach [27], [28]. As an example, during the Wisconsin Card Sorting Test [29], users are asked to match the cards of a deck without knowing the sorting rule, which must be discovered during the test itself. Lastly, other stress-inducing tasks are related to specific target scenarios, e.g., stress induced during driving tasks [30], [31], [32] or during university examinations [33].
In this work, the SCWT and the MA have been chosen as stress-inducing tasks. In fact, these tests are widely accepted as highly reliable for inducing stress in participants, as highlighted in [7], [34]. They are also closer to the target scenario of this study, i.e., a system for detecting stress in workers/students performing static activities.

B. Stress Detection Techniques
In the literature, several authors have analyzed the effects of stress on people, focusing on automatic stress detection [35], [36]. One of the most accurate methods exploits the processing of biological signals, whose variation can be related to the onset of stress [7].
Among the biosignals, EEG is widely used for stress assessment [37], [38]. Although it can be analyzed in both time and frequency, different levels of cognitive engagement are often accurately differentiated through the analysis of the EEG power spectral density [39]. The ECG, which represents the heart electrical activity during contraction and relaxation, is also commonly used [40], [41]. Although different features of this signal can be exploited in order to detect the presence of stress, Heart Rate (HR), Heart Rate Variability (HRV), and blood pressure are the most widely adopted [42]. A possible substitute for HRV is the Pulse Rate Variability (PRV), which has been employed for stress detection in [43]. In addition, the Galvanic Skin Response (GSR), which measures human skin conductivity, can be employed for stress detection [44]. For instance, in [45], a multimodal system that exploits both the ECG and the GSR for stress assessment is presented. Another multimodal approach involving ECG, GSR, and an accelerometer has been presented in [46]. Other biosignals that can be exploited to detect the presence of stress are strictly related to the muscular activity of the human body. In particular, eye gaze, eye blink, and pupil dilation are often tracked or measured for stress assessment applications. In [47], these features have been measured through a specific device for eye activity tracking. Moreover, electrodermal activity has been analyzed for comparison. The results highlighted that the variation of the pupil diameter is more discriminative than electrodermal activity features.
Depending on the target application, some biosignals may be more suitable than others. The key factors for performing this choice are the type of sensor adopted for signal acquisition and the requirements in terms of detection accuracy. It is useful to notice that most of the described signals require devices that are typically not available to the consumer market, such as the electroencephalograph and the devices to measure the GSR. In addition, these devices can be difficult to operate for non-professional users. The EEG signal, for example, requires a complex setup for its acquisition. Moreover, although some consumer-grade devices have been proposed [48], they require the user to wear additional equipment for signal acquisition purposes, thus representing a burden in the performance of specific tasks.
The analysis of signals like voice, whose acquisition requires only a microphone, partially addresses this problem. In [49], for instance, a VR application aimed at detecting the presence of stress from the user's voice has been designed. Voice recording was implemented directly in the VR application. However, the voice might not be actively used during certain working and learning activities, in particular those heavily reliant on written communication or where reading is the predominant task to perform. Also, the use of voice in very noisy environments can lead to unreliable results. Therefore, other cues must be exploited to gauge the emotional state of an employee/student.
In the past years, several works have demonstrated that posture and head movements are related to the presence of stress in non-AR scenarios. In [50], the time variation of head motion, in the presence and absence of stress, has been studied. In addition, the influence of speech on head motion and its variation due to the presence of stress have been considered. Head motion has been analyzed through video recordings of different user activities. From the videos, the head position has been estimated and analyzed through its (x, y, z) components as well as the rotation angles around the three axes. In addition to these features, head velocity has been evaluated. The obtained results highlighted an overall head mobility increase, which is not related to the act of speaking. Thus, although language cognitive processes are strongly linked to head movements, the mobility increase can be attributed to a stress factor. Other studies have shown that body posture and postural sway vary with the presence of stress. In [10] and [15], these variations have been analyzed by placing pressure sensors on a commercial office chair so that a person could not sense their presence while sitting. The proposed experimental setup considered a standard 2D monitor, on which a SCWT was presented to induce stress in the participants. In [10], the authors showed that when a higher cognitive engagement level was required, participants tended to change their posture, moving closer or further away from the screen. In [15], the posture variation has been further analyzed by considering the variation of postural sway. One of the most interesting results concerns the speed of movement, which decreased when the cognitive engagement was larger, due to an increased difficulty in maintaining the balance.
Finally, in [51], stress assessment for AR or VR applications has been performed. The study was based on the analysis of the ballistic forces generated by the human heart (also referred to as ballistocardiography). More specifically, since the heart activity causes involuntary movements, these can be analyzed in order to extract heart rate information, which in turn leads to stress assessment.
In this work we investigate the applicability of head movement-based stress detection to the AR scenario. More specifically, the users' head movement is recorded to detect the presence of stress during a static task such as reading, using the data extracted from the HMD. The proposed method has the following advantages with respect to the approaches based on biosignals:
- it is easy to use, as no specific setup is required;
- it is non-invasive, as the AR HMD is employed for monitoring the user's behavior while the user is performing a task, thus avoiding skin and/or bulky sensors;
- it does not require additional equipment, as the HMD is employed by the user for other AR applications and stress detection is performed without adding burdens.
To the best of our knowledge, this work represents the first attempt to directly link head movements and stress in AR applications. In particular, we analyze whether the relation between head movements and stress, which other works already investigated in different scenarios, still holds when using an AR application. Moreover, we designed a classifier to detect stress automatically from the analysis of the head movements, thus realizing a non-invasive and easy-to-use stress assessment approach.

III. PROPOSED METHOD
This work focuses on analyzing the relationship between stress and head movements while using an AR application. Due to the foreseen spreading of AR solutions for working [52] and learning activities [53], we considered a static task (e.g., reading, informative training, tutorial watching, data analysis [54]) and analyzed several head movement features to determine whether they show a significant difference during a stress situation. The most significant features have been used to define a Machine Learning (ML) classifier able to detect the presence of stress. To this end, as will be detailed in Section III-A, two stress-inducing tests have been designed.
For each test, a Microsoft Hololens 2 application consisting of a virtual screen displaying the test content has been designed. An example is shown in Fig. 1. The dimensions of the virtual screen were chosen by taking into account the size of 40-inch commercial monitors, and its distance was set according to the ITU-R BT.500-14 recommendation for HD 16:9 resolutions [55]. During the tests, the position of the AR HMD was recorded in order to analyze the users' head movements.

A. Design of the Stress-Inducing Tests
1) Stroop Color Word Test:
The first test we considered is the SCWT [9]. For its design we selected the SCWT version used in [10], [15]. The test consists of presenting a set of slides containing different color names to the user and is composed of three phases, with an increasing level of difficulty. During the first phase, the words are all colored in black, while in the second phase, each word color matches its meaning (congruous conditions). In these two phases, the user is asked to read the words as quickly as possible. During the third phase, the color of the words and their meaning are not coherent (incongruous condition) and the user must pronounce the color in which the word is written. The third phase induces the stress condition. During the test, the number of words presented in each slide and the duration of each slide varied. These two parameters have been set in accordance with [10], [15], to allow a fair comparison of the achieved results. In the third phase, the difficulty of the task is increased both through the mismatch between the word color and its meaning and by reducing the amount of time provided to complete the task. The overall test has a duration of approximately 2 minutes. The details of the test are reported in Table I, which shows the number of words per slide and the amount of time scheduled to complete the task. Examples of the slides are provided in Fig. 2: Fig. 2(a)-(c) present a sample slide for each phase, while Fig. 2(d) shows the maximum number of words presented to the users during the third phase.
During the test, only head movements have been recorded, disregarding the correctness of the answers provided by the users. In fact, this work aims to analyze the relation between the presence of stress and head movement features, thus making the correctness of the answers not relevant to our purposes.
2) Mental Arithmetic Test: The second stress-inducing test is based on MA. The designed test comprises two phases. During the first phase, which is 1 minute long, the participants are asked to close their eyes and relax. During the second one, which lasts for approximately 3 minutes, the participants are asked to solve 19 mathematical operations. Operations of different complexity are mixed so that a participant is not discouraged by having failed to solve one of the previous problems [56]. Each operation is presented to the participant for a different time interval depending on its difficulty: 5 s for easy tasks (e.g., compute 5 × 2), 10 s for medium tasks (e.g., compute 15 + 28), and 15 s for complex tasks (e.g., compute 32 × 17). To increase the level of stress perceived by the user, the time left for solving each operation was displayed on the screen through a filling bar, and audio feedback was provided to inform the participant whether the given answer was correct or not.

B. Experimental Protocol
The designed experimental protocol consists of three sessions involving three independent groups of participants, as detailed in Section IV-A:
- Session #1: participants performed the SCWT; the acquired data were used for analyzing the head movements and selecting the features. The selected features have been used as the training set for the proposed stress classifier.
- Session #2: participants performed the SCWT; the acquired data have been used as the validation set for the proposed stress classifier.
- Session #3: the participants performed both the SCWT and the MA test; the acquired data have been used as the test set for the proposed stress classifier and to verify its generalization capabilities.
The stress-inducing tests were performed in a controlled environment. The virtual screen was superimposed on a white background in a quiet room so that external stimuli would not interfere with the execution of the different tasks.
For Session #1 and Session #2 the experimental protocol consisted of three steps: i) participant screening procedure, ii) AR test, and iii) final questionnaire. The screening comprises a Snellen Test [57] and a test for color blindness [58]. The participants were asked to wear any vision-correcting devices (glasses or contacts) that they normally wear. Before the test, the eye gaze calibration procedure for the headset was performed. During Sessions #1 and #2, after performing the SCWT, the participants were asked to answer the Simulator Sickness Questionnaire (SSQ) [59] to evaluate the presence of cybersickness due to the use of the HMD. The SSQ allows the participant to rate 16 cybersickness-related symptoms on a four-level scale: none, slight, moderate, and severe. Session #3 had a different structure. More specifically, it was organized as follows:
- participant screening procedure;
- SCWT;
- NASA-TLX questionnaire;
- 5-minute break;
- MA test;
- NASA-TLX questionnaire;
- SSQ.
The screening procedure and the SCWT were performed in the same way as during the previous sessions. The NASA-TLX (National Aeronautics and Space Administration Task Load Index) questionnaire [60] was introduced in the 1980s to measure the perceived workload and mental effort. It consists of a series of questions that aim to capture six dimensions of task load (i.e., mental demand, physical demand, temporal demand, performance, effort, and frustration). Since the SCWT is a well-known and standardized stress induction procedure, while the MA tests have not reached the same level of maturity, the NASA-TLX questionnaire has been employed to verify the similarity of stress levels induced by the two tests.

C. Feature Extraction and Statistical Analysis
The HMD used for the experiments, Microsoft Hololens 2, allows tracking the head movements and extracting the variation in time of the x, y, z coordinates, according to the coordinate space presented in Fig. 3.
From the acquired data, different head movement features were computed and used as observations for the statistical analysis. First of all, the movements along the three axes were extracted to analyze the users' movement trends. Then, the absolute value of the difference between adjacent temporal samples was evaluated. This feature, referred to as entity of displacement, allowed us to verify if there was a privileged axis for the head movement. In addition, the sign of the difference between adjacent temporal samples, referred to as sign of displacement, was computed. This feature has been considered to verify if there was, along a specific axis, a privileged direction of motion. Moreover, the total displacement has been evaluated to analyze the overall head movement. This feature has been computed as the norm of the vector with components x, y, and z. As a further analysis, the head velocity along the three axes and the total velocity, evaluated as the norm of the velocity vector, were computed.
Since further insights could be gained by combining the information coming from different axes, we included the time variation of the angles between two axes ((x, y), (x, z), (y, z)) and their respective angular velocities. Also in this case, the entity and sign of the angular displacement were evaluated.
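The feature definitions above can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library, not the authors' implementation: the function name `head_features`, the input layout (timestamps in seconds and three coordinate lists), and in particular the use of `atan2` to obtain the angle associated with each pair of axes are assumptions made for the example.

```python
import math

def head_features(t, x, y, z):
    """Per-sample head movement features (hypothetical helper).

    t: timestamps in seconds; x, y, z: head coordinates from the HMD.
    """
    feats = {"total_disp": [], "angles": [], "entity": [],
             "sign": [], "velocity": [], "total_speed": []}
    for i in range(len(t)):
        # Total displacement: norm of the vector with components x, y, z.
        feats["total_disp"].append(math.sqrt(x[i] ** 2 + y[i] ** 2 + z[i] ** 2))
        # Angles for the pairs (x, y), (x, z), (y, z); atan2 is an assumption.
        feats["angles"].append((math.atan2(y[i], x[i]),
                                math.atan2(z[i], x[i]),
                                math.atan2(z[i], y[i])))
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]  # the sampling interval is not constant
        d = (x[i] - x[i - 1], y[i] - y[i - 1], z[i] - z[i - 1])
        # Entity of displacement: absolute difference of adjacent samples.
        feats["entity"].append(tuple(abs(c) for c in d))
        # Sign of displacement: -1, 0, or +1 along each axis.
        feats["sign"].append(tuple((c > 0) - (c < 0) for c in d))
        v = tuple(c / dt for c in d)
        feats["velocity"].append(v)
        # Total velocity: norm of the velocity vector.
        feats["total_speed"].append(math.sqrt(sum(c * c for c in v)))
    return feats
```

The same differencing applied to the `angles` entries would give the entity and sign of the angular displacement and the angular velocities.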
Finally, since an oscillatory movement trend was noticed, we performed a Short-Time Fourier Transform (STFT) analysis to investigate it. Both the magnitude and phase of the STFT coefficients were analyzed in order to evaluate their frequency content. To deal with the inherent discontinuity of phase coefficients, a phase unwrapping operation was performed. Moreover, in order to obtain a frequency representation of the signals independent of the length of the time sequence, a cumulative sum of the STFT coefficients belonging to the same frequency bin was performed.
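The STFT processing chain can be illustrated as follows. This is a sketch under stated assumptions rather than the procedure actually used in the study: the window type (Hann), the window length and hop size, the choice to unwrap the phase along time within each frequency bin, and the helper names `stft`, `unwrap`, and `phase_descriptor` are all hypothetical. A naive DFT from the standard library is used so that the example is self-contained; a signal processing library would normally be preferred.

```python
import cmath
import math

def stft(signal, win_len=16, hop=8):
    """Naive one-sided STFT with a Hann window (win_len, hop are assumed values)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(win_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len))
            for k in range(win_len // 2 + 1)
        ])
    return frames

def unwrap(phases):
    """Remove 2*pi jumps so that consecutive values differ by at most pi."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        out.append(out[-1] + step)
    return out

def phase_descriptor(signal, win_len=16, hop=8):
    """Sum the unwrapped phase over all frames, one value per frequency bin."""
    frames = stft(signal, win_len, hop)
    if not frames:
        raise ValueError("signal shorter than one analysis window")
    return [sum(unwrap([cmath.phase(f[k]) for f in frames]))
            for k in range(win_len // 2 + 1)]
```

Whatever the duration of the input, `phase_descriptor` returns `win_len // 2 + 1` values, which is the length-independence property the text refers to.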
In order to determine whether the described features were significantly different among the three SCWT phases, two statistical tests were considered. An Analysis of Variance (ANOVA) test [61] was performed using the three phases of the test as levels of the independent variable. Before performing the ANOVA test, the normality of the population of interest and the homoscedasticity between the samples related to different levels of the independent variable have been verified. Normality was checked for each feature through a Kolmogorov-Smirnov test [62]. Homoscedasticity was assumed to be respected if the highest variance was not larger than twice the smallest one. The null hypothesis was defined as the absence of a significant change in the analyzed feature in the three phases. If the results of the ANOVA test on a feature allowed us to reject the null hypothesis, a Tukey Honestly Significant Difference (HSD) [61] test was also performed, to evaluate between which phases such significant difference held.
When the assumptions for the ANOVA were not respected by the feature under analysis, a Welch's ANOVA test [63] was performed. This test is the counterpart of the ANOVA that does not require homoscedasticity. Similarly to the previous case, if the results of Welch's ANOVA test on a feature allowed us to reject the null hypothesis, a Games-Howell [64] post-hoc test was performed.
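The test selection logic can be summarized with a short sketch. The example below, written with the Python standard library only, implements the variance-ratio rule for homoscedasticity and the one-way ANOVA F statistic, and returns the name of the test pair to apply. It is illustrative: the function names are hypothetical, the normality outcome is passed in as a flag rather than computed, and the p-value, Welch's statistic, and the post-hoc tests are left to a statistics package.

```python
from statistics import mean, variance

def is_homoscedastic(groups, max_ratio=2.0):
    """True if the largest sample variance is at most max_ratio times the smallest."""
    variances = [variance(g) for g in groups]
    return max(variances) <= max_ratio * min(variances)

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA (between-group over within-group mean square)."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def choose_test(groups, is_normal):
    """Pick the omnibus and post-hoc tests described in the text."""
    if is_normal and is_homoscedastic(groups):
        return "ANOVA + Tukey HSD"
    return "Welch's ANOVA + Games-Howell"
```

Here each element of `groups` would hold the observations of one feature in one SCWT phase.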

D. Proposed Stress Classifier
The features that showed a significant difference between the third phase and at least one of the other two phases have been selected as possible input features for the proposed stress classifier. In fact, these features were assumed to be more effective in discriminating between the presence and absence of stress. Since all time variations along the three axes showed a significant difference among the three phases, as will be shown in Section IV-C, the total displacement was used as a comprehensive feature. The selected features are total displacement, total speed, and the three angular displacements. All these features represent a combination of the information collected from different axes.
For all test sessions, the input data were divided into two classes: presence and absence of stress. The labeling of the input data was carried out considering the different phases of the stress-inducing tests. In particular, for the SCWT, the first two phases were labeled as non-stress, while the third one was labeled as stress. For the MA, the first phase was labeled as non-stress, while the second one was labeled as stress. In fact, the third phase of the SCWT and the second phase of the MA test require a higher level of cognitive engagement and are designed to increase the perceived level of stress.
The proposed architecture for the stress classifier is presented in Fig. 4. The computed features undergo the STFT analysis described in the previous section. This choice is motivated by a variation of the frequency content of the signal during the three phases, as will be detailed in Section IV-C. More specifically, the unwrapped phase of the STFT coefficients is chosen as the input feature for the stress classifier. In fact, the STFT phase carries more information than its magnitude and can be more representative of the head movement signal [65]. The cumulative sum of the phase coefficients allows a representation which is independent of the length of the analyzed sequence. This is fundamental since ML algorithms, such as Support Vector Machines (SVMs), do not take into account the temporal variations of the input data but rely on the variations of a limited number of predictors [66].
To account for the different characteristics of the selected features, an SVM classifier is trained for each of them. In order to obtain a more robust design, we propose to combine the outputs of the classifiers at two different stages. First, the classification decisions of the three angular displacement classifiers are combined through a majority voter. Then, the decisions concerning the total displacement, $d_1$, the total speed, $d_2$, and the angular displacement, $d_3$, are combined through a weighted sum. This approach allows, on the one hand, to fit each classifier to the peculiarities of each feature and, on the other, to overcome the flaws of the single SVM by relying on multiple features.
The outputs of the individual classifiers, $d_i$ ($i = 1, 2, 3$), are set to $-1$ if the input is classified as a non-stress observation, and to $1$ otherwise. The weights, $w_1$, $w_2$, and $w_3$, have been defined based on the accuracy of the different classifiers on the validation set, composed of the data acquired during Session #2. Indicating the accuracy values as $a_1$, $a_2$, $a_3$, and setting $A = a_1 + a_2 + a_3$, the weights are computed as follows:
$$w_i = \frac{a_i}{A}, \qquad i = 1, 2, 3.$$
The weighted sum is obtained as
$$D = \sum_{i=1}^{3} w_i d_i.$$
Since the sum of the weights is equal to 1, the value of the weighted sum can vary in the interval $[-1, 1]$. The final decision is taken by comparing the weighted sum with a threshold. In Section IV, this threshold is set to 0, to evaluate the accuracy of the proposed classifier.
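The two-stage decision fusion can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the weights are assumed to be the validation accuracies normalized by their sum (consistent with the statement that the weights sum to 1), and the handling of a weighted sum exactly equal to the threshold is an arbitrary choice.

```python
def majority_vote(decisions):
    """Combine an odd number of +1/-1 decisions into a single +1/-1 decision."""
    return 1 if sum(decisions) > 0 else -1

def accuracy_weights(accuracies):
    """Normalize validation accuracies so that the weights sum to 1 (assumed form)."""
    total = sum(accuracies)
    return [a / total for a in accuracies]

def fuse(d_total_disp, d_total_speed, d_angular, weights, threshold=0.0):
    """Return (final decision, weighted sum); +1 means stress, -1 non-stress.

    d_angular holds the three angular displacement decisions, which are first
    merged by majority voting into d_3.
    """
    d = [d_total_disp, d_total_speed, majority_vote(d_angular)]
    weighted_sum = sum(w * di for w, di in zip(weights, d))
    # A sum equal to the threshold is mapped to non-stress (arbitrary choice).
    return (1 if weighted_sum > threshold else -1), weighted_sum

# Example with made-up validation accuracies for the three classifiers.
weights = accuracy_weights([0.8, 0.6, 0.6])
decision, score = fuse(1, -1, [-1, -1, 1], weights)
```

With these illustrative values, the total displacement classifier votes for stress but is outweighed by the other two, so the weighted sum is negative and the observation is labeled as non-stress.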

A. Test Sample
During the three acquisition sessions, a total of 100 users participated in the AR tests. All participants were Italian speakers, and the words presented during the SCWT were in Italian. Before performing the tests, all participants were asked to sign a privacy agreement for data collection.

TABLE II AVERAGE NUMBER OF SAMPLES PER USER
During Session #1, the SCWT was performed by a group of 60 subjects, 32 men and 28 women, whose ages varied between 19 and 47 years (25.6 ± 4.5). Session #2 comprised a group of 20 participants, 10 men and 10 women, whose ages varied between 22 and 47 years (28.8 ± 6.2). In this case too, the SCWT was administered to the participants. Finally, during Session #3, the SCWT and MA test were performed by a group of 20 subjects, 12 men and 8 women, whose ages ranged between 21 and 31 years (24.4 ± 3.3).

B. Pre-Processing
The collected data were first divided based on the duration of each phase of the stress-inducing tests as reported in Section III-A. In order to filter out possible noise, we considered a windowed mean of the samples. To this aim, we employed a window with a fixed number of samples instead of a window covering a predefined time interval. In fact, since the sampling interval of Microsoft Hololens 2 is not constant, the same time interval may contain a slightly variable number of samples. A fixed window length makes it possible to filter noise while preserving the original signal variance. This aspect is important for the performed statistical analysis, since a variation in the signal variance may prevent the data from meeting the assumptions required by the ANOVA test. The window length has been set to 5 samples based on the average number of samples included in a temporal window of 0.1 seconds. In fact, considering the limits for head speed during reading tasks [67], the head displacement during this time interval can be deemed not significant. In Table II, the average number of samples per user along with the corresponding standard deviation is reported for both SCWT and MA, before and after pre-processing.
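As an illustration of this pre-processing step, a minimal Python sketch of a windowed mean over a fixed number of samples is given below. It assumes a sliding (overlapping) window; whether consecutive windows overlap is not specified above, and the function name is hypothetical.

```python
from typing import List, Sequence


def windowed_mean(samples: Sequence[float], window: int = 5) -> List[float]:
    """Mean over a sliding window containing a fixed number of samples.

    The window is defined by a sample count rather than by a time span,
    so an irregular sampling interval does not change the number of
    values averaged at each step.
    """
    if window < 1 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    running = sum(samples[:window])
    smoothed = [running / window]
    for i in range(window, len(samples)):
        # Slide by one sample: add the newest value, drop the oldest one.
        running += samples[i] - samples[i - window]
        smoothed.append(running / window)
    return smoothed


print(windowed_mean([1, 2, 3, 4, 5, 6, 7], window=5))  # [3.0, 4.0, 5.0]
```

With this sliding variant, a sequence of N samples yields N − 4 smoothed values for a 5-sample window; a non-overlapping variant would instead yield roughly N/5 values.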

C. Statistical Analysis
In Table III, the ANOVA and HSD test results for the different head movement features in the time domain are presented, while in Table IV the Welch's ANOVA and Games-Howell results for the frequency-domain representation of the head movement features are illustrated. The null hypothesis was that the features do not change between the presence and the absence of stress. In particular, three outcomes are possible:
• the null hypothesis is not rejected (-);
• the null hypothesis is rejected and the analyzed feature increases during the stress-related phase (↑);
• the null hypothesis is rejected and the analyzed feature decreases during the stress-related phase (↓).
It is useful to highlight that in the SCWT Phase 1 and Phase 2 represent the congruous non-stress conditions, while Phase 3 represents the incongruous and stress-related condition. For this reason, we expect the stress-discriminant features to be significantly different when comparing Phases 1 and 3 and Phases 2 and 3.

TABLE IV RESULTS OF THE WELCH'S ANOVA TESTS FOR THE FREQUENCY-DOMAIN REPRESENTATION OF THE HEAD MOVEMENT FEATURES
In Fig. 5, an example of the variation of the head movement of one participant along the three axes is presented. The temporal variation of the displacement along the axes and the total displacement have shown a significant difference among the three phases, demonstrating that the head movements vary when the complexity of the required task changes. In particular, except for the z-coordinate, the head movement decreased during the third phase. On the other hand, the total displacement has shown an increase during the third phase. Since the total displacement aggregates the information provided by the movements along the single axes, it has been selected as input for the classifiers. The results of the Tukey HSD test are represented in Fig. 6(a).
Concerning the entity of the displacement, although a significant difference between the three phases has been detected, the magnitude of the feature during the different phases is comparable, thus indicating that there is not a privileged axis of movement. A similar consideration can be made for the sign of the head movement. In this case, an oscillatory movement was observed, rather than a movement in a specific direction. As for the statistical analysis performed on the head speed, the speed along a specific axis was not significantly different during the three phases. On the other hand, it is interesting to note that the total speed, which represents a combination of the other three speed features, is significantly higher during the third phase (Fig. 6(b)).
Regarding the angular displacement, this feature is significantly different during the third phase. The only exception is the (x, y) angle, which has not shown a significant difference between the first and third phases. However, the decrease of the angular displacement between the second and third phases indicates a variation of this feature with the task complexity. The results are represented in Fig. 7. The outcomes for the entity and for the sign of the angular displacement have confirmed that the head movement was not characterized by a privileged direction and was mainly oscillatory. The angular speed was not significantly different during the three phases.
A second set of statistical tests has been performed for the STFT coefficients of the selected features: total displacement, total speed, and the three angular displacements. These coefficients were evaluated considering a flat top window of 16 samples and 50% overlap. This operation resulted in 16 × 33, 16 × 56, and 16 × 73 complex coefficients for the first, second, and third phases of the SCWT, where the first number indicates the frequency bins whereas the second represents the temporal bins. After extracting the phase and magnitude of the coefficients, and performing the phase unwrapping procedure, a cumulative sum along the temporal dimension has been computed, thus obtaining 16 coefficients for each user and phase of the test. While the other features respected the assumptions for the ANOVA test, the magnitude and phase of the STFT coefficients were not homoscedastic. Therefore, the Welch's ANOVA test was performed. The corresponding results are provided in Table IV. For both magnitude and phase, a significant difference among the three phases was detected, thus demonstrating the oscillatory nature of the head movement. These results show that the frequency content of the oscillatory movement changes during the different test phases, thus motivating the choice of performing an STFT analysis of the signal before providing it as input to the classifier.
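The feature extraction chain described above (flat top window of 16 samples, 50% overlap, phase unwrapping, cumulative sum over time) can be sketched with NumPy as follows. This is an illustrative sketch under several assumptions that are not stated in the text: the window is symmetric and uses the common five-term flat top coefficients, no padding is applied at the signal boundaries (so the number of temporal bins may differ from the values reported above), the phase is unwrapped along the temporal axis, and the retained feature is the final value of the cumulative sum. The function names are hypothetical.

```python
import numpy as np


def flattop_window(n: int) -> np.ndarray:
    """Symmetric flat top window built from the common five-term cosine sum."""
    a = [0.21557895, 0.41663158, 0.277263158, 0.083578947, 0.006947368]
    x = 2.0 * np.pi * np.arange(n) / (n - 1)
    return (a[0] - a[1] * np.cos(x) + a[2] * np.cos(2 * x)
            - a[3] * np.cos(3 * x) + a[4] * np.cos(4 * x))


def stft_phase_features(signal, win_len: int = 16, overlap: float = 0.5) -> np.ndarray:
    """Return one accumulated unwrapped-phase value per frequency bin."""
    x = np.asarray(signal, dtype=float)
    hop = int(win_len * (1.0 - overlap))  # 8 samples for 50% overlap
    n_frames = 1 + (len(x) - win_len) // hop
    if n_frames < 1:
        raise ValueError("signal is shorter than one window")
    window = flattop_window(win_len)
    # Columns are windowed frames: shape (win_len, n_frames).
    frames = np.stack(
        [x[i * hop:i * hop + win_len] * window for i in range(n_frames)], axis=1
    )
    coeffs = np.fft.fft(frames, axis=0)  # win_len frequency bins per frame
    # Assumption: the phase is unwrapped along time (axis=1).
    phase = np.unwrap(np.angle(coeffs), axis=1)
    # Final value of the cumulative sum along the temporal dimension.
    return np.cumsum(phase, axis=1)[:, -1]


# Synthetic signal used only to show the output size.
features = stft_phase_features(np.sin(0.3 * np.arange(264)))
print(features.shape)  # (16,)
```

Whatever the length of the input sequence, the output contains one value per frequency bin, which is what makes the representation suitable as a fixed-size input for an SVM.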
To summarize, the following insights can be gained from the performed analysis:
• the results indicate that the head motion is affected by stress;
• the results concerning the entity and sign of both the single coordinates and the angular displacements show that there is neither a preferred axis nor a preferred direction of motion;
• the features that aggregate information concerning different axes (i.e., total displacement, total speed and angular displacement) are representative of the presence of stress;
• the results concerning the analysis in the frequency domain suggest that oscillatory head movements are triggered during a stressful event.

D. Classification Results
TABLE V CLASSIFICATION ACCURACIES

The performances of the proposed classifier for validation and test sets are presented in Table V. More specifically, the results are reported in terms of classification accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ (True Positives) indicates the number of times a stress-related sample is correctly classified, $FP$ (False Positives) represents the number of non-stress observations classified as stress-related, $TN$ (True Negatives) is the number of times a non-stress sample is correctly classified, and $FN$ (False Negatives) corresponds to the number of stress-related observations classified as non-stress.
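For reference, the accuracy computation can be sketched in Python as follows. The sketch uses the standard convention in which a false positive is a non-stress observation labelled as stress and a false negative is a stress observation labelled as non-stress; labels are assumed to be +1 (stress) and −1 (non-stress), the function names are hypothetical, and the labels in the example are invented.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for labels in {+1 (stress), -1 (non-stress)}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified observations."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no observations")
    return (tp + tn) / total


# Invented labels, used only to illustrate the computation.
counts = confusion_counts([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
print(counts, accuracy(*counts))  # (2, 1, 1, 1) 0.6
```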
Since the MA test has a different time length compared to the SCWT, the acquired time series have been segmented to obtain sub-sequences whose length is comparable to that of the SCWT. More specifically, the data obtained during the first phase of the test were partitioned into two sub-sequences with an average sample count of 449.2 ± 1.3, which is comparable to the second phase of the SCWT. Similarly, the data collected in the second phase were segmented into three sub-sequences, with an average sample count of 645.8 ± 6.7, which is comparable to the number of samples in the third phase of the SCWT. For each participant, in order to obtain a single observation for each of the two classes, the mean of the features evaluated for each sub-sequence was computed. The obtained sequences underwent the same STFT analysis performed for the SCWT.
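The segmentation and averaging step can be illustrated with the following Python sketch. It assumes that each phase is split into contiguous sub-sequences of nearly equal length and that the per-segment feature vectors are averaged element-wise; the function names and the toy feature function are hypothetical.

```python
from typing import Callable, List, Sequence


def split_into_segments(samples: Sequence[float], n_segments: int) -> List[Sequence[float]]:
    """Split a sequence into contiguous segments of nearly equal length."""
    if n_segments < 1 or n_segments > len(samples):
        raise ValueError("n_segments must be between 1 and len(samples)")
    base, extra = divmod(len(samples), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)  # spread the remainder
        segments.append(samples[start:end])
        start = end
    return segments


def mean_feature_vector(
    segments: Sequence[Sequence[float]],
    feature_fn: Callable[[Sequence[float]], Sequence[float]],
) -> List[float]:
    """Element-wise mean of the feature vectors computed on each segment."""
    vectors = [feature_fn(seg) for seg in segments]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy example: the "features" are simply the minimum and maximum of a segment.
segments = split_into_segments(list(range(10)), 3)
print([len(s) for s in segments])  # [4, 3, 3]
print(mean_feature_vector(segments, lambda s: [min(s), max(s)]))
```

In the setting described above, the first MA phase would be split with `n_segments=2` and the second with `n_segments=3`, yielding one averaged observation per class and participant.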
Table V reports the results obtained using the analyzed features independently and the performance achieved through their combination, as represented in Fig. 4. The combination weights have been computed as described in Section III-D based on the classification performance on the data recorded during the validation session (Session #2). As can be noticed in the first column of Table V, the performances of the three classifiers are comparable, thus resulting in similar weights for the three classifiers. More specifically, the weights obtained for the SVM of the total displacement, total speed and angular displacement are $w_1 = 0.34$, $w_2 = 0.33$, and $w_3 = 0.33$, respectively.
Concerning the test session (Session #3), Table V shows that the Combined Classifier does not achieve the highest accuracy for the SCWT. In fact, since the computed weights are nearly equal, a wrong classification occurs whenever two out of three classifiers make a wrong decision on the input observation. Despite this, we believe it represents a more robust solution due to the aggregation of the information provided by different features. The lower accuracy for the SCWT is due to a mismatch in the performance of the different SVMs between the validation and the test sessions. This phenomenon could be overcome through a more extensive validation procedure.

E. SSQ Results and NASA-TLX
In Table VI, the SSQ answers obtained during all the acquisition sessions are summarized. In most cases, the participants did not present any discomfort related to the use of the HMD. In fact, for all the symptoms, most users reported no perception. The slightly more frequent symptoms were fatigue, eye strain, and difficulty in concentrating. This may be due to the cognitive engagement required to perform the different tasks. In general, it can be safely stated that, in most cases, the HMD and the stress-inducing tests did not cause unpleasant feelings to the users.

In order to determine whether there was a significant difference between the NASA-TLX scores for the SCWT and MA test, both a two-tailed and a one-tailed paired-samples t-test [61] were performed. In the first case, the null hypothesis assumed the means of the scores obtained for the SCWT and the MA test to be equal. The test rejected the null hypothesis at a 5% significance level, with a p-value of 0.017. The second test rejected, at a 5% significance level, the null hypothesis according to which the mean of the SCWT scores is larger than the mean of the MA scores. These results show that the MA test was perceived as more stressful than the SCWT, thus achieving the desired goal. In Fig. 8, the boxplots of the scores of the two tests are represented. It is possible to notice that the mean of the MA scores is slightly larger than the SCWT one. Although there is not a standardized thresholding system for the NASA-TLX score and the score interpretation is task-dependent [68], it is possible to gain interesting insights from Fig. 8. In [68] the authors analyzed the use of NASA-TLX in the literature showing the variation of the scores for different tasks. Among these, the ones which most resemble the scenario considered in this paper are: cognitive tasks (scores between 13.8 and 64.90 with a median of 46) and computer activities (scores between 7.46 and 78 with a median of 54). Moreover, they reported that the smallest and highest scores are 6.21 and 88.5, respectively, and that daily activities range between 7.20 and 37.50 with a median of 18.30. Considering daily activities as baseline, Fig. 8 shows that both the SCWT and the MA test achieve scores that are remarkably higher. Moreover, for both tasks the median and maximum values are higher with respect to cognitive and computer activities. Therefore, we can conclude that both tasks were demanding in terms of workload and mental effort.
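As an illustration of the paired-samples comparison, the following Python sketch computes the t statistic on the per-participant differences (MA minus SCWT). The scores in the example are invented and the function name is hypothetical. The p-value additionally requires the Student's t distribution with n − 1 degrees of freedom, which is available, for instance, through SciPy's `scipy.stats.ttest_rel` and its `alternative` argument.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for the paired differences second - first."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired samples with at least two observations")
    diffs = [b - a for a, b in zip(first, second)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Invented NASA-TLX scores for four participants (SCWT first, MA second).
t, dof = paired_t_statistic([50, 55, 60, 45], [55, 60, 62, 50])
print(t, dof)
```

A positive statistic indicates that the second sample (here, the MA scores) has the larger mean; the one-tailed test considers only this direction, whereas the two-tailed test considers deviations in both directions.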

F. Comparison With State-of-the-Art Approaches
The state-of-the-art review presented in Section II highlighted a wide variety of methodologies and types of test for stress detection. For this reason, it is not possible to perform a direct comparison with our approach. However, we compare the statistical analysis results with [10] and [15], where the same experimental protocol for the SCWT has been adopted, and the performance of the proposed stress classifier with [49] and [51], where immersive technologies are employed. To further investigate the achievements of the proposed approach, we compare our results with those of the methods based on physiological data that selected the SCWT and/or MA test as stress inductors [37], [40], [41], [43], [46], [47].
In [10], [15], a variation in users' movements due to the increasing level of cognitive engagement has emerged. More specifically, an increase in the overall distance covered between the first and third phases was detected. These results are consistent with what has been discussed in Section IV-C, where an increase in the total displacement and in the entity of displacement along all three axes has been identified. In [15], a significant decrease in the mean speed was also highlighted. In this study, an opposite behavior was observed, since the statistical analysis highlighted an increase of the total speed. This difference can be attributed to the chosen experimental protocols. In [15], the participants performed the test sitting on an office chair, while in our case they were in a standing position. Moreover, the different test modality (i.e., PC and AR screen) could have influenced the participants' behavior.
Moreover, we provide in Table VII the comparison with state-of-the-art approaches in terms of classification accuracy. We report the best results presented in [49], [51]. As can be noticed, the proposed approach outperforms previous research concerning non-invasive stress assessment in AR/VR.
As for the approaches using physiological measures, in [47] a stress detection method exploiting eye tracking data and electrodermal activity has been presented. The authors performed a binary classification between a "stress" and a "relaxation" phase using the SCWT and a three-level stress classification for the MA test. They achieved an accuracy of 88.43% for the former scenario, and an accuracy of 91.10% for the latter. Another multimodal stress detection system involving ECG, GSR and an accelerometer has been proposed in [46]. The authors employed both the SCWT and the MA test in a single experiment and trained a binary classifier achieving a classification accuracy of 92.4%. The ECG in combination with the SCWT has been employed also in [40]. Stress detection has been modeled as a binary classification problem achieving an accuracy of 96.41%. In [41], ECG has been used to detect stress based on an MA task to differentiate between "stress" and "rest" phases, achieving an accuracy of 82.7%. In [43], an MA task and the SCWT have also been employed to realize a stress classifier using photoplethysmogram signals. The MA test was designed to include up to five different levels of stress, while the SCWT has been employed following a three-level classification approach. The authors achieved 86% accuracy for the SCWT, and designed different detection models for the MA task corresponding to a number of classes from two to five. The presented method achieved an accuracy of 97.77% for two classes, 94.11% for three classes, 94.4% for four classes, and 94.33% for five classes. Finally, in [37] EEG has been employed to detect stress using the SCWT and MA test. The authors realized a binary classifier reaching an accuracy of 88% and 96% for SCWT and MA, respectively. Moreover, they presented a three-level classification approach considering three separate classes for rest, SCWT, and MA, under the assumption that SCWT induces milder stress than MA. In this case, they achieved an accuracy of 75%.
Based on these results, the proposed head motion-based method achieves performance that is better than or comparable to that of its physiology-based counterparts.

V. LIMITATIONS AND FUTURE DIRECTIONS
In this section we report the limitations of the proposed approach which open new interesting research paths for the design of non-invasive AR stress detection systems.
One limitation of this study lies in the procedure adopted for calculating the weights assigned to the outputs of the individual classifiers. According to the performed analysis, the choice of weights based on the validation set did not generalize to new test samples for the SCWT. The main mitigation for this problem is to increase the number of subjects in the validation phase so that the weights of the different classifiers can be more representative of their capabilities. Another aspect which could be improved is related to the usage of a hard threshold to differentiate between the stress and the non-stress conditions. Future developments could include the design of a more flexible system with a soft thresholding based on three decision regions. In this case, two thresholds can be defined, one for negative and one for positive values of the weighted sum. In this scenario, the external negative range represents the non-stress region, while the external positive range represents the stress region. The central region may represent a transition phase, which could be used to alert the user of an upcoming stress condition. The two threshold values should be based on the specific application scenario. In addition, the current system only supports stress detection in post-processing, while a relevant improvement could be the realization of a real-time stress detection system.
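The three-region decision rule envisaged above could be sketched as follows. The threshold values of ±0.5 are purely illustrative, since the text leaves them to the application scenario, and the function name and region labels are hypothetical.

```python
def three_region_decision(score: float, lower: float = -0.5, upper: float = 0.5) -> str:
    """Map a weighted sum in [-1, 1] to one of three decision regions.

    lower: threshold on the negative side (illustrative default)
    upper: threshold on the positive side (illustrative default)
    """
    if not -1.0 <= lower < 0.0 < upper <= 1.0:
        raise ValueError("expected -1 <= lower < 0 < upper <= 1")
    if score <= lower:
        return "non-stress"
    if score >= upper:
        return "stress"
    return "transition"


for value in (-1.0, 0.34, 1.0):
    print(value, three_region_decision(value))
```

With three nearly equal weights, a unanimous decision yields a weighted sum of ±1, whereas a two-against-one split yields a value of about ±1/3. With the illustrative thresholds, split decisions therefore fall in the transition region, while unanimous decisions fall in the external regions.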
Finally, in this work, we have considered a static task (e.g., reading, informative training, tutorial watching). Future developments may include the extension of the proposed approach to dynamic and interactive activities. In this direction, improved classification results could be obtained by including additional information such as eye-tracking data. The usage of a multimodal system could be beneficial for addressing data acquisition artifacts, and for isolating stress-related and task-related movements.

VI. CONCLUSION
The analysis of the effects of stress on the human body is a widely investigated research topic. The solutions proposed in the literature mainly rely on invasive systems and rarely focus on AR/VR applications. To fill this gap, this work investigates whether the head movements of a user wearing an AR HMD vary due to the presence of a stress factor while performing static tasks. In order to induce stress, the SCWT has been used. From the statistical analysis, it has emerged that several head movement features vary when a stressful situation is presented to the user. More specifically, head displacement, head velocity, and angular displacement have been shown to vary across the different phases of the stress-inducing test. Therefore, they have been selected as stress-discriminating features. The phase of the STFT coefficients of these features has been used to define a stress classifier, based on a combination of SVMs optimized for each feature. In order to test the generalization capabilities of the classifier, both the SCWT and an MA test have been employed. Therefore, the main contributions of this work concern the thorough description of how head movements vary due to a stress state in an AR scenario and the definition and realization of a stress classifier, which has shown excellent performance on both tests. Moreover, the proposed system is completely non-invasive and easy to use.

Fig. 1. Virtual screen for providing the stress-inducing test to the participants.

Fig. 2. Example of words presented to the users during the SCWT: (a) first phase; (b) second phase; (c) third phase; and (d) third phase, with the maximum number of displayed words.

Fig. 5. Example of head movement along the three axes of the (a) first, (b) second, and (c) third phases of the SCWT for one participant.

Fig. 6. Results of the HSD test obtained for (a) total displacement and (b) total speed. The filled dots represent the mean value for each phase, while the line and the associated shaded area represent the comparison interval corresponding to the 0.05 significance level. Different colors are used to indicate the phases which are significantly different.

Fig. 7. Results of the HSD test concerning the angular displacement obtained for the (a) (x, y), (b) (x, z), and (c) (y, z) angles. The filled dots represent the mean value for each phase, while the line and the associated shaded area represent the comparison interval corresponding to the 0.05 significance level. Different colors are used to indicate the phases which are significantly different.

Fig. 8. Boxplot of the NASA-TLX scores obtained during the test acquisition session. The red line in the boxes indicates the median value.

TABLE III RESULTS OF THE ANOVA TESTS FOR THE TIME-DOMAIN REPRESENTATION OF THE HEAD MOVEMENT FEATURES

TABLE VI SSQ ANSWERS REPORTING THE % OF PARTICIPANTS SHOWING EACH SYMPTOM

TABLE VII COMPARISON WITH STATE-OF-THE-ART APPROACHES FOR NON-INVASIVE STRESS DETECTION