Elicitation of Anxiety Without Time Pressure and Its Detection Using Physiological Signals and Artificial Intelligence: A Proof of Concept

Stress can be defined as a state of anxiety (or mental tension) caused by a particular situation. Everybody experiences stress to some degree, but how we respond to it significantly affects our well-being. Various events generate anxiety that leads to stress. For example, not having enough time to complete a task or being late are situations where anxiety (and stress) depends on a temporal factor: the scarcity of time. But people also slide into anxiety when living in a condition that keeps them tense, independently of time. Studies eliciting anxiety in laboratory settings have considered this variant far less widely. This paper presents a proof of concept (PoC) that investigated the possibility of stimulating anxiety without time pressure through a purposely edited horror movie trailer, giving new insights into the emotional experiences evoked by controlled audiovisual stimuli. The PoC comprised an AI-based classifier that detects whether a person feels anxiety, relaxation, or neither, based on the galvanic skin response (GSR), photoplethysmogram (PPG), and heart rate (HR), achieving an accuracy higher than 95%. Key application areas include media, marketing, and psychology: media producers could refine their content to better capture the audience, and psychologists could create tailored exposure experiences to promote gradual desensitization to stress triggers.


I. INTRODUCTION
Modern stress never stops and can impact well-being and health. People are continuously bombarded by news updates and notifications that can induce a feeling of uncertainty and apprehension. Also, modern work environments are rapidly changing, leading to uncertainty and a need for quick decisions. This creates a "perfect storm" that decreases mental well-being and spoils the quality of life. According to an article cited by the American Institute of Stress, 83% of people felt stressed at work in 2022 [1]. The American Psychological Association defines stress as the physiological or psychological response to internal or external stressful events (stressors). Stress can be acute or chronic based on the duration of the stressor [2]. It causes changes involving nearly all the body systems, impacting how people feel and behave. Acute stress arises on a single occasion, typically within a short timeframe, whereas chronic stress develops when stressors are recurrent or prolonged. Chronic stress can manifest either episodically or continuously: episodic stress arises at regular intervals and is of limited duration, while continuous stress occurs over an extended period. The number of stressors is a key predictor of an individual's health and psychological well-being [3]. Continuous exposure to stressors leads to adverse health problems, including cardiovascular diseases and cognitive decline [4], along with physiological changes, such as increased heart rate, blood pressure, and cortisol levels [5].

(The associate editor coordinating the review of this manuscript and approving it for publication was Byung-Gyu Kim.)
Various studies have used physiological changes to detect stress, such as the galvanic skin response (GSR), heart rate (HR), heart rate variability (HRV), electroencephalography (EEG), photoplethysmography (PPG), and respiration rate (RESP) [6], [7], [8]. The EEG and GSR signals led to accuracies of 60% and 92.9% using artificial neural networks [9] and support vector machines (SVMs) [10], respectively. Multiple signals were also combined to increase accuracy using SVMs and k-nearest neighbor (k-NN) classifiers. For example, simultaneously using the ECG, GSR, RESP, blood pressure, and blood oximeter, and the combination of EEG, GSR, and PPG, led to accuracies of 95.8% [11] and 96.25% [12], respectively. Another study involved construction workers, detecting stress with 80.32% precision via Gaussian SVMs based on EEG and salivary cortisol [13]. High accuracies (up to 94.78%) were also reached using inertial sensors to predict the stress levels of drivers based on steering wheel movements [14]. Stress detection was also carried out in Industry 4.0 environments using HR, GSR, and RGB videos via supervised and unsupervised techniques, whose respective precisions were 94.9% and 77.4% [15]. Recent studies have also started focusing on detecting, and potentially tracking, a person's mood in daily life, for example, detecting happiness from 14-channel EEGs via deep learning [16].
These studies highlight that physiological signals help design accurate machine-learning algorithms for stress detection. Accurate stress detection can enhance mental health monitoring and help people improve the way they deal with stress. However, most studies in the literature focused on specific contexts, limiting their generalization capabilities. Moreover, most approaches elicited stress by forcing people to deal with tight time constraints, for example, asking them to carry out tasks under time pressure [17]. But time pressure is not the only cause of stress: stress also arises when people are exposed to stressors that do not necessarily put them under time pressure. This variant of stress, common in daily life, needs to be better explored in the literature. A deeper understanding of how people respond to different types of stress could be key in various application fields. For example, it could assist in designing multimedia content that evokes stress at specific moments to capture viewers' attention, or help people cope with stress, thereby enhancing psychological well-being.
This paper describes a proof of concept (PoC) that presents a way of eliciting anxiety without putting people under time pressure, using a purposely edited horror movie trailer. The PoC also shows that this type of anxiety, which resembles long-lasting tension, can be detected using only the GSR and PPG signals. A total of 34 participants took part in an experiment in which they were shown the edited horror movie trailer, preceded and followed by a relaxing video clip, while their GSR and PPG signals were recorded. The participants' data were used to design and develop an intelligent system made up of two 2-class k-NN classifiers distinguishing between two pairs of classes: Anxiety and Another Emotion, and Relaxation and Another Emotion. The system output is Anxiety, Relaxation, or Another Emotion, i.e., a set of negative affective states such as anger and sadness. The system achieved a mean accuracy of 95.22%.
A system implemented based on our PoC could work alongside existing methods to enrich the types of anxiety elicited in lab settings in the years to come. This could help stressed people suffering from various forms of anxiety improve their quality of life via personalized interventions tailored to specific forms of anxiety. The paper is structured as follows: Section II gives a background on the physiological signals used by the PoC; Section III describes the data collection and editing procedure; Section IV describes the anxiety detection method; Section V presents the results; Section VI discusses the results and presents some application areas that could benefit from implementing the PoC; Section VII draws the conclusions.

II. PHYSIOLOGICAL SIGNALS
This section gives a background on the physiological signals used in this study.

A. GALVANIC SKIN RESPONSE (GSR)
GSR measures the electrical conductance of the skin in microsiemens (µS), which increases in proportion to micro-variations in sweat levels generated by the autonomic nervous system in reaction to external stimuli. Sweat is saline, and sodium chloride increases skin conductance: a micro-increase (or micro-decrease) in sweat generates an increase (or decrease) in electrical conductivity.
The GSR consists of two main components: phasic and tonic. The phasic component is associated with quick variations in skin conductance in response to emotional stimuli; the tonic component is the "baseline", i.e., the resting level of skin conductance. GSR is measured in body areas whose micro sweat levels are controlled by the autonomic nervous system, e.g., the palms of the hands [18]. GSR levels correlate highly with anxiety. In particular, when experiencing stress, tension, or anxiety, micro sweat secretion increases, which increases skin conductance. Conversely, in moments of relaxation, sweat secretion decreases, thereby diminishing skin conductance [19].
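The phasic/tonic split can be illustrated with a simple low-pass decomposition. This is a minimal sketch on synthetic data, not the method used in the study (which follows [28]); the 0.05 Hz cut-off and the signal shapes are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def decompose_gsr(gsr, fs=128.0, tonic_cutoff_hz=0.05):
    """Split a GSR trace into tonic (slow baseline) and phasic (fast) parts.

    A simple low-pass estimate of the tonic level; the cut-off frequency is
    an illustrative choice, not a value taken from the paper.
    """
    b, a = butter(2, tonic_cutoff_hz / (fs / 2), btype="low")
    tonic = filtfilt(b, a, gsr)   # slowly varying baseline
    phasic = gsr - tonic          # fast stimulus-driven responses
    return tonic, phasic

# Synthetic example: a drifting baseline plus two brief "responses"
fs = 128.0
t = np.arange(0, 60, 1 / fs)
baseline = 2.0 + 0.01 * t
responses = 0.5 * np.exp(-((t - 20) ** 2) / 2) + 0.3 * np.exp(-((t - 40) ** 2) / 2)
tonic, phasic = decompose_gsr(baseline + responses, fs)
```

The tonic trace then tracks the slow drift, while the phasic trace isolates the short-lived conductance responses.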

B. PHOTOPLETHYSMOGRAM (PPG)
PPG detects the blood oxygenation level (SpO2%). It involves shining a light onto the skin and measuring the percentage of light absorbed or reflected by the underlying blood vessels. PPG is commonly measured in millivolts (mV) [20] and is widely used in wearables to estimate a person's heart rate.
The analysis of PPG waveforms has proved a valuable approach to evaluating stress because the waveforms change significantly during mental stress and relaxation [21]. Also, a person's heart rate increases during physical activity or because of discomfort and a feeling of danger: it is an innate reaction that prepares them to flee [22].
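The link between PPG and heart rate can be sketched as follows. The Shimmer3's internal PPG-to-HR algorithm is proprietary, so this is only a hypothetical illustration using simple peak detection on a synthetic pulse wave.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_from_ppg(ppg, fs=128.0):
    """Estimate heart rate (bpm) from a PPG trace by detecting systolic peaks.

    Minimal sketch: peaks are required to be at least 0.5 s apart, i.e., at
    most 120 bpm, consistent with the heartbeat band used in the paper.
    """
    peaks, _ = find_peaks(ppg, distance=int(0.5 * fs))
    rr = np.diff(peaks) / fs      # inter-beat (RR) intervals in seconds
    return 60.0 / rr.mean()

# Synthetic 75-bpm pulse wave (1.25 Hz sinusoid as a stand-in for real PPG)
fs = 128.0
t = np.arange(0, 30, 1 / fs)
ppg = np.sin(2 * np.pi * 1.25 * t)
print(round(float(heart_rate_from_ppg(ppg, fs))))  # -> 75
```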

III. DATA COLLECTION
This section describes how data were collected. The University of Pisa Bioethical Committee approved the protocol (authorization no. 22/2022, protocol no. 77857/2022).

A. PARTICIPANTS
A total of 34 participants (14 women and 20 men) aged between 19 and 28 took part in the experiment. Their details are in Table 1. Before data collection, participants read and signed an informed consent form explaining the experiment's aim and procedure and stating that data would be anonymized. After the experiment, participants were asked whether they had previously watched Andy Muschietti's IT: Chapter One (2017) (or its trailer) to account for potential bias from prior exposure. Only two participants confirmed having seen that content before, leading to the exclusion of their data.

B. VIDEO CLIP
The first step in selecting the video clip was analyzing various candidate film genres. We first eliminated those less likely to generate continuous and profound anxiety stimuli, such as comedy, drama, science fiction, romance, western, adventure, and biographical films. Thriller and horror emerged as potential candidates. Then, by watching various trailers of both genres, we realized that thriller scenes could elicit anxiety but often relied on diverse emotional arcs, as they contained action, mystery, or surprise elements, leading to frequent peaks in emotional responses. Thriller trailers generally needed to be edited with frequent cuts to remove the scenes causing peaks of emotion and thereby generate continuous anxiety. Although we performed the editing carefully, trying to make the content as smooth as possible, these cuts were too frequent and hard to make unnoticeable. The viewer frequently became aware that the video had been edited, undermining the attempt to generate continuous anxiety.
In contrast, we found that some horror trailers could easily be edited with fewer cuts that were unnoticeable to the viewer. Watching the edited horror trailer resulted in a uniform and protracted state of anxiety. This characteristic made the horror movie trailer particularly suitable for our research objectives. Other film genres or scenes may be appropriate after applying our editing procedure to remove the variability of emotions characterizing those scenes. In this paper, we wanted to show that horror movie trailers are one possible genre that can generate long-lasting anxiety after being edited with the proposed technique, and that anxiety can be elicited using alternatives to time pressure based on edited movie trailers.
Table 2 summarizes the horror movie trailers considered. Those characterized by excessive violence or paranormal themes were discarded because their scenes could not be edited to elicit anxiety alone. The trailer selected for the experiment was Andy Muschietti's IT: Chapter One (2017), which featured both calm and anxiety-inducing scenes. To create a long-lasting scene in which anxiety is prevalent, each scary scene was removed from the trailer, and the parts preceding and following it were joined through a fade transition. The editing procedure also removed the sudden loud sounds accompanying each scary scene and smoothed the soundtrack volume to make the changes imperceptible to the ear. Fig. 2 shows the editing procedure. In particular, the top of the figure shows a part of the original video clip with a scary scene. Under the video clip, the figure shows its soundtrack and a plot highlighting that a sudden scary scene causes a drop in anticipation level that leads to decreased anxiety. Instead, the bottom of Fig. 2 shows that there is no chance to reduce anxiety (anticipation level) when watching the edited video clip. Viewers constantly expect something scary but do not know that it will never occur: this results in a movie trailer characterized only by suspense and anticipation.
Before and after the edited trailer, participants watched a relaxing video clip by Meditation Relax Music characterized by natural content and peaceful soundtracks. The relaxing video clip and trailer were adapted to be played using a mobile Virtual Reality (VR) headset. Fig. 3 shows a participant watching the edited trailer (emotional clip) during the experiment.

C. PHYSIOLOGICAL SIGNALS RECORDED
Two physiological signals were collected: GSR and PPG. Signals were measured using a Shimmer3 GSR+ worn as shown in Fig. 1, with a sampling rate of 128 Hz. The HR was derived from the PPG using the internal PPG-to-HR algorithm of the Shimmer3 GSR+. These signals were chosen as they are reliable for detecting stress [18], [23]. Also, they can be collected using affordable, non-invasive sensors embedded in most current smartwatches and smart bands.

D. PROTOCOL
The experiment consisted of three phases (blue rectangles in Fig. 4): a preparing phase, an emotional phase, and a labeling phase, whose details are as follows:
1) Preparing phase: the participant was asked to sit comfortably in a chair while the Shimmer3 GSR+ electrodes were positioned on the index and middle fingers (GSR) and ring finger (PPG). The participant then put on the mobile VR headset (with a smartphone inside as a display) and headphones (see Table 3 for specs). Then, the participant was induced into a resting state using dim lights and relaxing background music. The participant's physiological signals were observed in real time on a PC monitor using the Consensys software. When the HR and GSR values were low and stable [24], [25], the participant was considered relaxed, and their signal values were taken as a baseline;
2) Emotional phase: this phase comprised a relaxing clip, an emotional clip, and another relaxing clip. Both relaxing clips lasted 4 minutes to rest the participant with stress-relief music and video [26], whereas the emotional clip (i.e., the edited version of Andy Muschietti's IT: Chapter One) lasted 2 minutes;
3) Labeling phase: the participant filled out a questionnaire asking for the emotion felt during the emotional clip and whether they felt relaxed after watching the second relaxing clip.

IV. METHOD
This section describes how the datasets were generated to train a classifier to distinguish anxiety, relaxation, or another emotion based on physiological signals recorded while watching the video clips. The procedure is illustrated in Fig. 5. The section then presents the system architecture and how the classifiers were trained and tested.

A. SIGNAL LABELING
Participants reported experiencing a single predominant emotion during the emotional clip. Specifically, 20 participants reported feeling anxiety, 8 reported sadness, and 4 reported anger. All participants reported feeling relaxed when approaching the end of the post-stimulus relaxing clip. Labeling was performed as follows:
• the signals of those who reported feeling anxiety while watching the emotional clip were labeled as Anxiety;
• the signals of those who reported feeling sadness or anger while watching the emotional clip were labeled as Another emotion;
• the signals collected while watching the relaxing clip by those who said they felt relaxed were labeled as Relaxation.

B. SIGNAL PRE-PROCESSING
First, the raw GSR, PPG, and HR signals were preprocessed by filling in missing values through linear interpolation. The raw GSR signal underwent a noise-canceling procedure based on a first-order low-pass Butterworth filter with a cut-off frequency of 5 Hz [27]. Then, it was decomposed into its phasic and tonic components [28]. This led to three GSR signals per participant: preprocessed, phasic, and tonic.
The PPG and HR signals were smoothed using a third-order Savitzky-Golay filter with a span of 11 samples [29], [30].
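The preprocessing steps above (linear interpolation of missing values, a first-order 5 Hz Butterworth low-pass for the GSR, and a third-order Savitzky-Golay filter with a span of 11 for the PPG and HR) could be reproduced along these lines. The study used MATLAB, so this Python/SciPy version is an approximate sketch with illustrative helper names.

```python
import numpy as np
from scipy.signal import butter, filtfilt, savgol_filter

fs = 128.0  # sampling rate used in the experiment

# GSR: first-order low-pass Butterworth, 5 Hz cut-off
b, a = butter(1, 5.0 / (fs / 2), btype="low")

def denoise_gsr(gsr):
    return filtfilt(b, a, gsr)

# PPG / HR: third-order Savitzky-Golay filter with a span of 11 samples
def smooth(signal):
    return savgol_filter(signal, window_length=11, polyorder=3)

# Missing values are filled by linear interpolation before filtering
def fill_missing(x):
    x = np.asarray(x, dtype=float)
    nans = np.isnan(x)
    x[nans] = np.interp(np.flatnonzero(nans), np.flatnonzero(~nans), x[~nans])
    return x
```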
Each participant's emotional and post-stimulus signals were divided into 10-second non-overlapping time windows. Shorter windows resulted in higher noise and lower classification accuracy, whereas longer windows failed to capture the rapid changes in the signal. The emotional and post-stimulus signals generated 12 and 24 consecutive time windows, respectively. Then: 1) each time window coming from the emotional clip was associated with the emotion the participant declared after watching that clip; 2) each time window coming from the relaxing clip of those who declared relaxation after watching that clip was associated with relaxation. The first time window of each clip (emotional or post-stimulus) was discarded as potentially influenced by the participant's emotion during the previous clip [31]. Also, the first eight and last eight post-stimulus time windows were discarded to reduce the dataset imbalance, as the relaxing and emotional clips had different lengths.
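The windowing rule can be sketched as follows; the helper name is illustrative, and the additional trimming of the first and last eight post-stimulus windows is omitted for brevity.

```python
import numpy as np

FS = 128      # Hz, sampling rate
WIN_S = 10    # window length in seconds

def segment(signal, fs=FS, win_s=WIN_S):
    """Split a signal into non-overlapping 10-second windows, dropping the
    first window (possibly carrying over the previous clip's emotion)."""
    n = int(fs * win_s)
    windows = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return windows[1:]   # discard the first window of the clip

# A 2-minute emotional clip yields 12 windows, 11 kept after the discard
emotional = np.zeros(2 * 60 * FS)
print(len(segment(emotional)))  # -> 11
```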

C. FEATURE EXTRACTION
Three feature sets were extracted from each window in the time, frequency, and deep domains, as follows:
1) Time domain: the features extracted from the GSR, PPG, and HR signals were the maximum, minimum, ratio between maximum and minimum, root mean square, mode, crest factor, mean, median, median absolute deviation, mean absolute deviation, geometric mean, 10% trimmed mean, 2nd-, 3rd-, and 4th-order moments, 33rd percentile, 1st and 3rd quantiles, difference between maximum and minimum value (range), harmonic mean, skewness, kurtosis, interquartile range, mean of absolute values, mean square, standard deviation, and variance [32]. Additional GSR features [33] were the number of peaks, rising time, mean height of the peaks, standard deviation of the peak heights, mean of the time intervals between each pair of consecutive peaks, standard deviation of the time intervals between each pair of consecutive peaks, difference between the last and first values, normalized difference between the maximum and minimum values, and arc tangent of the normalized difference between the maximum and minimum values (angle). PPG features [34], [35], [36] were the mean pulse rate, standard deviation of the pulse rate, mean root square of the pulse rate, mean of the RR intervals, standard deviation of the RR intervals, standard deviation of successive RR interval differences, root mean square of successive differences, Poincaré standard deviation perpendicular to the line of identity (SD1), Poincaré standard deviation along the line of identity (SD2), ratio of SD1 to SD2, number of adjacent NN intervals differing by more than 50 ms, percentage of such adjacent NN intervals, and Baevsky's stress index;
2) Frequency domain: the discrete Fourier transform (DFT) of each GSR, HR, and PPG window highlighted that the spectral power distribution concentrated at frequencies lower than 8 Hz (see Fig. 6). The frequency resolution of the signals was 0.1 Hz, which was sufficient to detect physiological changes associated with emotions [37], [38]. Regarding the GSR, the highest power density was in the band ranging from 0 to 0.3 Hz (associated with sympathetic tone variations [39]); additional features were extracted from the band [0, 5] Hz. Regarding HR, features were extracted from frequencies between 0 and 0.2 Hz, widely used to assess heart rate variations related to respiration [40]. Features were also extracted from 0.5 to 2 Hz because the heartbeat typically ranges between 30 bpm (i.e., 0.5 Hz) and 120 bpm (i.e., 2 Hz) [41]. Finally, as highlighted in [42] and [43], the frequency band showing the highest PPG power density distribution was [0.5, 8] Hz: PPG features were thus extracted from this band. Additional PPG features were extracted in the frequency ranges [0, 2] Hz, [2, 3.5] Hz, and [3.5, 8] Hz to capture more detailed information. Table 4 summarizes the frequency intervals used for feature extraction. The features extracted from the frequency domain of the GSR, PPG, and HR windows were as follows: the 99% occupied bandwidth, mean frequency of the power spectrum, median frequency of the power spectrum, average power, minimum amplitude frequency, maximum amplitude frequency, ratio between the maximum and minimum amplitude frequencies (if the minimum amplitude frequency was zero, it was assigned a value of 10^-6), ratio between the minimum and maximum amplitudes, and ratio between the maximum amplitude frequency and its amplitude. Additionally, HR features (the average power in adapted frequency bands [44]) were extracted using a frequency resolution of 0.1 Hz;
3) Deep domain: each window was converted into a spectrogram, capturing time-frequency variations and spectral information. Fig. 7 shows an example spectrogram. Deep features were extracted from the spectrograms using the relu3 layer of AlexNet [45], chosen for its fast feature extraction and high classification accuracy compared with other deep models [46]. We selected the relu3 layer because it led to higher accuracy than other layers in pattern recognition problems [46], [47], [48], [49]. This layer extracted 64,896 features from each spectrogram (324,840 in total).

D. SYSTEM ARCHITECTURE
Two system versions were developed, S1 and S2, based on a three-class classifier and two binary classifiers, respectively. Fig. 8 and Fig. 9 show flowcharts of S1 and S2, respectively.

E. DATASETS
As Fig. 5 shows, the union of the features extracted from five simultaneous 10-second windows (three from the GSR, one from the PPG, and one from the HR) generated a sample. Outlier samples caused by sensor artifacts or participant movement were identified and removed from the dataset using the Isolation Forest algorithm with a contamination ratio of 10% for each class.
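Outlier removal with Isolation Forest at a 10% contamination ratio could look like this sketch. The data are synthetic stand-ins for one class's feature samples, and scikit-learn replaces the paper's MATLAB tooling.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))   # stand-in feature samples for one class

# Per-class outlier removal with a 10% contamination ratio, as in the paper
iso = IsolationForest(contamination=0.10, random_state=0)
keep = iso.fit_predict(X) == 1  # fit_predict returns -1 for outliers, 1 otherwise
X_clean = X[keep]
```

With contamination set to 0.10, roughly 10% of the samples in each class are flagged and dropped before dataset assembly.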
Three datasets were created, each in two versions: one version (denoted as TF) contained time and frequency features, whereas the other (denoted as DEEP) comprised deep features. Each dataset was split into a training set (90%) and a test set (10%); both sets contained the same proportion of samples per class. Features were preliminarily filtered using the chi-squared (χ²) test, a statistical technique designed to evaluate the association between an individual feature (F_i) and the output class (C) in a classification problem involving N classes. The null hypothesis (H_0) was that no significant relationship existed between a given feature and the output class, indicating that the two are statistically independent:

H_0: F_i and C are independent. (1)

The χ² coefficient was calculated for each feature to assess this hypothesis as follows:

χ² = Σ_{i,j} (O_ij − E_ij)² / E_ij, (2)

where O_ij is the observed frequency in the i-th category of the feature and the j-th category of the output class, and E_ij is the corresponding expected frequency assuming independence. We chose a significance level (α) of 0.05, representing the acceptable level of Type I error. The features whose chi-squared coefficient exceeded the critical value at the 0.05 significance level, with degrees of freedom corresponding to the number of classes minus one (N − 1), were considered significantly associated with the output and were retained. This statistical process ensured that the selected features had a statistically significant relationship with the output. The Synthetic Minority Oversampling Technique (SMOTE) [50] was applied to fix the class imbalance in the training set. Tables 5, 6, and 7 summarize the number of samples in the datasets before and after balancing.
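The chi-squared filter can be sketched as below. The paper does not state how continuous features were discretized before the test, so the quartile binning here is an assumption, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_keep(feature, labels, alpha=0.05, bins=4):
    """Keep a continuous feature if it is significantly associated with the
    class labels; quartile binning is an illustrative discretization choice."""
    cats = np.digitize(feature, np.quantile(feature, [0.25, 0.5, 0.75]))
    table = np.zeros((bins, len(np.unique(labels))))
    for c, y in zip(cats, labels):
        table[c, y] += 1                 # observed frequencies O_ij
    _, p, _, _ = chi2_contingency(table) # expected E_ij computed internally
    return p < alpha                     # reject H0 (independence) -> keep

# Demo: a feature whose distribution differs strongly across classes is kept
rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)
feature = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(5.0, 0.1, 100)])
```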

F. TRAINING AND EVALUATION
Each classifier (C1, C21, and C22) was designed, trained, and tested in four steps using each of its datasets, as follows:
1) Determining the best feature set: the forward sequential feature selection (SFS) [51] found the optimal feature set using, as a criterion, the minimization of the cross-entropy of an ANN. The ANN was chosen for its ability to learn complex linear and nonlinear input-output relationships [52]. We chose an ANN with a single hidden layer and as many output neurons as classes (three for C1, and two for C21 and C22). The number of hidden neurons H was set according to an equation from [53] based on the number of inputs N_i and outputs N_o. A total of 100 executions of the SFS were performed using stratified 10-fold cross-validation on the training set, thereby obtaining 100 candidate feature sets. Each fold contained randomly selected samples; folds did not share any sample. A statistical evaluation identified the most predictive feature set among the 100 candidates.
In particular, each candidate feature set (F_i) was used as input to train an ANN using a 30-run 10-fold cross-validation. At each iteration k, the performance of the ANN was assessed by measuring the cross-validation error (E_ik), representing the model's predictive accuracy, i.e., the average of the mean cross-entropies obtained for each of the 10 test folds at iteration k using feature set F_i as input. Student's t-test with a significance level (α) of 0.05, representing the statistical significance threshold, was used to find the optimal feature set by comparing the means of the cross-validation errors generated by the ANN models using distinct candidate feature sets F_i and F_j. The null hypothesis was H_0: the average mean cross-entropies of F_i and F_j do not show a statistically significant difference, suggesting that preferring one of the two feature sets over the other would not impact the model performance. The t-test produced a statistic (t) measuring the difference between the means, determined as

t = (Ē_i − Ē_j) / (s_p · √(1/n + 1/m)),

where Ē_i and Ē_j are the means of the cross-validation errors for feature sets F_i and F_j, respectively, and s_p denotes the pooled standard deviation of the errors, calculated as

s_p = √( ((n − 1)σ_i² + (m − 1)σ_j²) / (n + m − 2) ),

where σ_i and σ_j are the standard deviations of the errors for feature sets F_i and F_j, respectively, and n and m are the sample sizes of the cross-validation errors for F_i and F_j, both equal to the number of iterations K = 30. The t-test also generated a p-value measuring the probability of observing a t-statistic as extreme as the one computed if the null hypothesis (H_0) were true. If the p-value is lower than α, the observed difference in the means of the cross-validation errors is highly unlikely to have occurred by random chance; the results were then considered statistically significant, and the null hypothesis (H_0) was rejected, indicating a statistically significant difference in the performance of candidate feature sets F_i and F_j. The feature set that resulted in the highest number of rejections of the null hypothesis, indicating statistically significant differences in cross-validation errors compared with the others, was considered the most powerful for the classification task. This selection process provided statistical evidence that the chosen feature set had the best predictive power, making it an optimal choice for the classification task.
2) Optimizing models: classifiers based on Decision Trees (Tree), Discriminant Analysis, Naive Bayes, Support Vector Machines (SVM), ANNs, k-Nearest Neighbors (k-NN), Kernels, and Ensembles were optimized using a Bayesian optimizer. The optimizer used a 30-run stratified 10-fold cross-validation with repartitioning to optimize each classifier. Each model's optimal hyperparameters were identified using the dataset's training set and the corresponding optimal feature set. This common practice corresponds to fine-tuning the hyperparameters to best exploit the feature sets found by the SFS algorithm: a given feature set may have characteristics whose predictive power is best unveiled by tailoring a model's hyperparameters to maximize the synergy between the selected features and the model. This approach also improves performance efficiently, avoiding the computational burden of running the SFS method for all models independently. The optimized models were finally used for the internal and external evaluations;
3) Internal evaluation: classifiers were trained and evaluated using a 30-run stratified 10-fold cross-validation;
4) External evaluation: classifiers were trained 30 times using the dataset's training set and evaluated on the test set (see Section IV-E). The model that achieved the highest classification accuracy on the test set was selected as the best.
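The pooled t-test used to compare candidate feature sets in step 1 can be reproduced as a sketch. The synthetic error vectors stand in for the 30 cross-validation errors of two candidate feature sets; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def compare_feature_sets(errors_i, errors_j, alpha=0.05):
    """Two-sample pooled t-test on the cross-validation errors of two
    candidate feature sets, mirroring the selection criterion above."""
    n, m = len(errors_i), len(errors_j)
    sp = np.sqrt(((n - 1) * np.var(errors_i, ddof=1) +
                  (m - 1) * np.var(errors_j, ddof=1)) / (n + m - 2))
    t = (np.mean(errors_i) - np.mean(errors_j)) / (sp * np.sqrt(1 / n + 1 / m))
    # scipy's equal-variance t-test computes the same statistic and p-value
    _, p = stats.ttest_ind(errors_i, errors_j, equal_var=True)
    return t, p, p < alpha   # True -> reject H0 (the sets differ significantly)

rng = np.random.default_rng(1)
e_i = rng.normal(0.20, 0.02, 30)   # 30 cross-validation errors per feature set
e_j = rng.normal(0.25, 0.02, 30)
t, p, reject = compare_feature_sets(e_i, e_j)
```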
System S1 consisted of the C1 classifier that obtained the highest accuracy in the external evaluation (i.e., on the test set) among those trained using D_1^TF and D_1^DEEP. Analyses were performed using MATLAB 2023b running on a personal computer equipped with an Intel Core i7-12700K CPU, 32 GB of DDR4 RAM, and an NVIDIA GeForce RTX 3060 Ti GPU.

V. RESULTS
This section describes the results of the two system versions regarding classification accuracy.

A. FEATURES SELECTED
The numbers of features per signal in each dataset are summarized in Table 8.
As the table shows, 20 features were in the TF domain of the PPG, about triple the number for the other signals. Conversely, the GSR features were prevalent in the DEEP domain, where none of the PPG features was selected. The feature sets of the TF datasets were as follows:
• F1: the GSR features were the range and crest factor, along with the percentile and angle from the phasic component. The HR feature was the 3rd quantile. The PPG features consisted of the median absolute deviation, crest factor, 10% trimmed mean in the frequency range [2, 3.5] Hz, standard deviation in the frequency range [3.5, 8] Hz, occupied bandwidth in the frequency range [0, 2] Hz, and mean frequency of the power spectrum in the frequency ranges [0, 2] Hz, [2, 3.5] Hz, and [0, 8] Hz;
• F21: the GSR tonic features were the ratio between the maximum and minimum values and the geometric mean. The HR features consisted of the mean of absolute values and the 3rd quantile. The PPG features were the median absolute deviation, the standard deviation in the frequency range [3.5, 8] Hz, the occupied bandwidth in the frequency range [0, 2] Hz, and the mean frequency of the power spectrum in the frequency ranges [0, 2] Hz, [2, 3.5] Hz, and [0, 8] Hz;
• F22: the GSR tonic features were the ratio between the maximum and minimum values, the difference between the last and first values, and the variance in the frequency range [0, 0.3] Hz. The HR features were the mean of absolute values, the maximum value, and the range in the frequency range [0, 2] Hz. The PPG features comprised the median absolute deviation, the mean root square of the pulse rate, the occupied bandwidth in the frequency range [0, 2] Hz, the maximum amplitude frequency in the frequency range [0, 2] Hz, the mean frequency of the power spectrum in the frequency range [3.5, 8] Hz, and the maximum amplitude frequency in the frequency range [3.5, 8] Hz.

B. INTERNAL AND EXTERNAL EVALUATIONS
Performance was assessed by calculating the average (Avg) and standard deviation (Std) of accuracy, together with the weighted F1 score, for both internal and external evaluations. The highest levels of accuracy achieved during the internal evaluation of each dataset are summarized in Table 9. The hyperparameters selected by the Bayesian optimizer during the internal evaluation of the optimized models are reported in Table 10 and Table 11, respectively.
The average accuracy of the best models in the internal evaluation exceeded 80%, with peaks above 90% for both binary classifiers in the TF domain and for C 22 in the DEEP domain. Specifically, the C 21 model trained with the TF dataset accurately distinguished Anxiety from Another emotion, whereas the C 22 model trained with the DEEP dataset effectively distinguished Relaxation from Another emotion.
The best results obtained during the external evaluation of systems S 1 and S 2 are summarized in Table 12. As the table shows, the hybrid system S 2 (HYB) achieved the highest accuracy. This system was developed by selecting the models that reached the highest accuracy throughout the external evaluation, i.e., Subspace k-NN for C 21 and Weighted k-NN for C 22 , trained on the TF and DEEP datasets, respectively (see Tables 14 and 15 in the Appendix).

VI. DISCUSSION
This section discusses the results and presents some application areas that could benefit from implementing a system based on the proof of concept presented in the paper.

A. ROLE OF THE PHYSIOLOGICAL SIGNALS
The predictive power of the GSR, PPG, and HR signals in each optimal feature set (F 1 , F 21 , and F 22 ) was assessed using the feature permutation importance (FPI) technique based on the out-of-bag random forest [54]. FPI measures the impact of randomly shuffling a feature's values on the model's performance across multiple decision trees; the features causing a larger decline in performance are considered more important for the model's predictions. In detail, each feature was assigned an importance score by randomly permuting its values 30 times. The cumulative importance score of each signal (GSR, PPG, and HR) was calculated by summing the importance scores of all features extracted from that signal. Fig. 10 and Fig. 11 show three pie charts highlighting the cumulative importance score of each signal in the three optimal feature sets: F 1 , F 21 , and F 22 .
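The FPI idea can be sketched in a model-agnostic way: shuffle one feature column at a time and measure the resulting drop in accuracy. The snippet below is an illustrative reconstruction, not the paper's exact out-of-bag random forest procedure; the toy model and data are hypothetical:

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=30, seed=0):
    """Model-agnostic permutation importance: average drop in accuracy
    when one feature column is shuffled, over n_repeats shuffles."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # permute feature j only
            drops.append(baseline - np.mean(predict(Xp) == y))
        scores[j] = np.mean(drops)
    return scores

# Toy setting: the label depends only on feature 0
X = np.random.default_rng(1).normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda A: (A[:, 0] > 0).astype(int)
imp = permutation_importance(predict, X, y)  # feature 0 dominates
```

A per-signal cumulative score is then just the sum of the scores of all features extracted from that signal.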
In particular, Fig. 10 shows that the PPG signal obtained the highest cumulative importance score in the TF domain for all three feature sets (57% on average). Also, no signal is marginal in the feature sets used by the most accurate system version (F 21 and F 22 ): the least important signal, i.e., the HR in F 21 , has a score of 15%, suggesting that this signal is not negligible when determining a person's emotion. The pie charts in Fig. 11 highlight that the GSR signal is crucial when classifying emotions in the DEEP domain for all the feature sets: its mean cumulative importance score is 77.3%. Moreover, the PPG provides no contribution in the DEEP domain. Finally, comparing the pie charts in Fig. 10 and Fig. 11 shows that the GSR and HR signals provide higher predictive contributions in the DEEP domain than in the TF domain.
Regarding performance, Table 12 shows that system S 2 outperformed S 1 during the external evaluation. System S 2 distinguished between Anxiety, Relaxation, and none of the two (i.e., emotions like anger and sadness). Fig. 12 shows the confusion matrix of 30 systems S 2 evaluated using dataset D 21 , whereas the micro-average and weighted recall, precision, and F1 score of the system are summarized in Table 13. The weighted average weights each class in proportion to its number of samples.
The table shows that all the measures exceeded 0.88: system S 2 thus achieved a high performance. In particular, a precision of 0.9529 indicates that when the classifier predicted that an instance belonged to a particular class, it was correct over 95% of the time. The recall of 0.9522 means that the system correctly identified over 95% of the samples belonging to the corresponding class. Finally, the system achieved an F1 score of 0.9521, demonstrating high performance in terms of both precision and recall. These results highlight that using two physiological signals along with features from different domains led to an accurate assessment of emotion.
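For reference, weighted precision, recall, and F1 can be derived directly from a confusion matrix. The sketch below uses a hypothetical 3-class matrix (the values are illustrative, not those of Fig. 12):

```python
import numpy as np

def weighted_scores(cm):
    """Per-class precision/recall/F1 from a confusion matrix (rows = true
    class, columns = predicted class), averaged with weights proportional
    to each class's number of samples (its support)."""
    cm = np.asarray(cm, dtype=float)
    support = cm.sum(axis=1)
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)
    recall = tp / np.maximum(support, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    w = support / support.sum()
    return w @ precision, w @ recall, w @ f1

# Hypothetical 3-class matrix (Anxiety, Relaxation, Another emotion)
cm = [[95, 2, 3],
      [1, 96, 3],
      [4, 2, 94]]
p, r, f = weighted_scores(cm)
```

With equal supports, as in this toy matrix, the weighted average reduces to the plain macro average.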
To further assess performance, the classifiers of the system were trained using a dataset obtained by first combining time-frequency and deep features and then reducing the resulting feature set as explained in Sections IV-E and IV-F. This dataset is referred to as ''TF+DEEP''. Fig. 13 shows a histogram of the mean accuracy of each classifier trained using the TF, DEEP, and TF+DEEP datasets. As the histogram shows, the mean accuracy of each classifier obtained using the TF+DEEP dataset (yellow column) was not better than that of the corresponding classifier trained only on TF or DEEP features (blue and orange columns).
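Building the TF+DEEP dataset amounts to concatenating the two feature matrices and reducing the result. The sketch below is a minimal illustration; the variance-based ranking is a simple stand-in for the paper's actual reduction of Sections IV-E and IV-F, and the matrices are hypothetical:

```python
import numpy as np

def combine_feature_sets(X_tf, X_deep, max_features=20):
    """Concatenate TF and DEEP feature matrices (samples x features),
    then keep the max_features columns with the highest variance.
    NOTE: variance ranking is an illustrative stand-in, not the
    reduction procedure used in the paper."""
    X = np.hstack([X_tf, X_deep])
    order = np.argsort(X.var(axis=0))[::-1][:max_features]
    return X[:, np.sort(order)]

rng = np.random.default_rng(0)
X_tf = rng.normal(size=(50, 12))    # hypothetical TF features
X_deep = rng.normal(size=(50, 15))  # hypothetical DEEP features
X_comb = combine_feature_sets(X_tf, X_deep)  # shape (50, 20)
```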
In conclusion, analyzing the cumulative importance scores of the features from the TF+DEEP dataset highlighted that DEEP features achieved higher cumulative importance scores in predicting emotional states, as shown in Fig. 14.

B. ADVANTAGES OF ANXIETY INDUCTION VIA PURPOSELY EDITED VIDEO CONTENT
Generating anxiety through purposely edited horror movie trailers is an alternative way of exploring emotional responses. The precision and control this method offers allow for a finely tuned emotional experience. Manipulating narrative elements, visuals, and auditory cues generates specific emotional reactions, offering a more nuanced approach than conventional time-constrained tasks. Also, inducing anxiety within controlled environments (such as those using video content) is promising for therapeutic and training applications. The structured nature of the experience can provide a safe setting for exposure therapy or emotional management interventions, contributing to enhanced emotional regulation.
Using purposely edited videos also enables a detailed analysis of emotional responses. Continuous tracking of physiological signals such as the GSR, PPG, and heart rate offers a comprehensive understanding of the emotional journey, revealing the intricate interplay between psychological and physiological dimensions. By investigating the emotional patterns in response to video-induced anxiety, we gain insights into the complex landscape of human emotions. This approach can thus enhance our understanding of how anxiety manifests and evolves, potentially impacting various fields, including psychology, neuroscience, and media studies.
An essential distinction lies between anxiety induced by video content and that arising from time-constrained tasks. The former may elicit subjective emotional responses and imagined scenarios, whereas the latter stems from objective pressure imposed by limited timeframes. The intensity and duration of these anxiety forms also differ: video-induced anxiety may persist over a prolonged period, providing a sustained emotional experience, whereas time-constrained tasks tend to elicit a brief but intense wave of anxiety. The proof of concept presented in this paper can help improve the understanding of these variations in anxiety experiences, thereby impacting various application fields.

C. POTENTIAL IMPLICATIONS OF THE RESEARCH
The proof of concept described in this paper has several potential applications and implications that are described in the next sections.

1) MEDIA AND MARKETING
Understanding the interplay between emotional experiences and audience engagement is crucial in media and marketing. Emotional experience can influence how audiences perceive, interact with, and respond to media content. A system implementing the PoC can analyze the anxiety types induced by purposely edited video content using various video clips, letting the system investigate the emotional nuances driving audience engagement.
When viewers engage with media content that elicits anxiety triggered by suspenseful elements, their emotional responses contribute to a more immersive experience. The heightened emotional arousal can result in increased attention, cognitive processing, and enhanced memory retention [55]. Manipulating the emotional content allows the presence or absence of specific anxiety-inducing elements to vary. Media creators (but also marketers) could thus tailor their strategies by assessing the reactions to different emotional triggers and how these reactions impact engagement metrics, such as viewer retention. Implications include:
• Improved content planning: content creators and marketing professionals could use our research to understand emotional responses to various narrative elements. For example, an online streaming platform that aims to optimize viewer engagement can make data-driven decisions by analyzing viewer reactions to different emotional triggers within the content. If a specific twist in a series elicits a particular type of anxiety, the platform can use this information to recommend similar content to viewers. This keeps viewers engaged and provides a more satisfying viewing experience. People who enjoy content with specific types of suspenseful elements can receive tailored recommendations that align with their preferences. This approach extends beyond streaming services to traditional television.
• Personalized advertising: marketing professionals can use our research to measure the emotional nuances of advertisements and promotional campaigns. In particular, they can refine content and strategies to elicit specific emotional responses that drive consumer actions by assessing viewer reactions and engagement metrics. For example, an e-commerce platform may use this information to develop emotionally resonant advertisements that increase sales.

2) THERAPEUTIC CONTEXTS
In therapeutic contexts, people experience anxiety in diverse and complex ways. Anxiety can manifest in response to triggers from traumatic experiences, phobias, and generalized anxiety disorders. Implications include:
• Personalized exposure therapy: an exposure therapy that gradually introduces social scenarios within a controlled audiovisual environment may be helpful for those experiencing anxiety in social situations. Therapists could benefit from our research to adjust the narrative elements and emotional scenes, customizing the exposure experience to align with each individual's specific triggers and sensitivities and thereby allowing people to face and control their emotional responses.
• Continuous monitoring and desensitization: therapists could closely monitor the progress and comfort levels of patients. By controlling the emotional content and intensity of video-induced anxiety, a system based on our PoC could help therapists gradually desensitize patients to anxiety triggers, improving emotional regulation and resilience and promoting better emotional well-being [56]. In cases of post-traumatic stress or anxiety disorders, the methodology could also be adapted to simulate controlled environments for therapeutic desensitization. Mitigating anxiety disorders and their consequences is crucial in modern society, especially among adolescents: a study covering 82 countries with wide geographic and cultural variety estimated an anxiety prevalence of up to 12%, with these numbers dramatically increasing in recent years [57].

D. FUTURE RESEARCH DIRECTIONS AND POTENTIAL IMPROVEMENTS
There are various research directions for future work. Some of the most attractive are as follows:
1) Refinement of emotional classifiers: further research can focus on improving the accuracy and specificity of emotional classification. This may involve using other machine learning techniques, larger datasets, and more diverse emotional classes to create a model that determines more emotional states.
2) Comparison of anxiety induction methods: future studies could compare the efficacy of eliciting anxiety via purposely edited video content to other methods, such as time-constrained tasks, real-life stressors, or different audiovisual stimuli. Comparative research could provide insights into the strengths and weaknesses of each approach and deepen our understanding of emotional experiences.
3) Long-term effects of exposure therapy: exploring the long-term effects of exposure-based interventions using controlled audiovisual stimuli is essential in psychological therapy. In particular, analyzing whether individuals who undergo such treatment maintain reduced anxiety levels over time could provide insights into the sustainability of this approach in improving people's emotional management.
4) Neurological investigations: analyzing the neurological aspects of anxiety induction through audiovisual stimuli is another interesting future direction. For example, using neuroimaging techniques (e.g., functional magnetic resonance imaging), researchers could examine the neural mechanisms underlying the emotional responses and the impact of purposely edited audiovisual content on brain activity during anxiety induction.
Regarding potential improvements to our methodology, future studies will consider a more exhaustive variety of audiovisual stimuli, including different genres and narrative structures. This will help enrich the study by determining the specific elements that induce anxiety; such diversification could also lead to greater flexibility for various applications.
Another future direction will focus on investigating participants' differences. In particular, an in-depth exploration of the influence of individual differences (age, gender, and prior anxiety experiences) on emotional responses to audiovisual stimuli could lead to a more comprehensive understanding of anxiety induction. Tailoring interventions based on individual profiles could also enhance the effectiveness of therapeutic applications.
Finally, improvements in data collection methods (e.g., using less intrusive devices) and more advanced techniques for signal processing and data analysis could increase precision and reliability.
Some of these activities are already ongoing to improve the proof of concept presented in this paper.

VII. CONCLUSION
This paper has presented a proof of concept (PoC) exploring the possibility of eliciting anxiety through purposely edited horror movie trailers, offering new insights into the emotional experiences evoked by controlled audiovisual stimuli. The editing accentuated anticipatory tension and suspense while excluding explicit frightening scenes. The result was a form of anxiety similar to long-lasting tension, which is highly subjective and different from the time-pressure-induced anxiety considered by most papers in the literature.
An AI-based classifier was also designed and developed to determine a person's emotion among anxiety, relaxation, and none of the two based on the GSR, PPG, and HR physiological signals, achieving an accuracy higher than 95%.
Combining anxiety induction via purposely edited video content with that based on time-constrained tasks could shed light on the nuances of emotional experiences, having various practical applications spanning content creation in media and marketing, and therapeutic interventions in psychology.
In media and marketing, implementing the PoC could optimize content creation strategies, enhance audience connections, and achieve communication goals.
Implications in psychological therapy could lead to designing tailored interventions.For example, generating a controlled and targeted anxiety-provoking stimulus could enable therapists to develop tailored exposure-based interventions that gradually introduce social scenarios within a controlled environment.This could help achieve gradual desensitization to anxiety triggers, improving emotional management and psychological well-being.

FUNDING SOURCES AND CONFLICTS OF INTEREST
No funding sources or conflicts of interest were reported for this study.

FIGURE 1. The Shimmer3 GSR+ on the wrist. GSR electrodes are around the index and middle fingers; the PPG sensor is around the ring finger.

FIGURE 2. Overview of the editing procedure. Sudden scary scenes are cut out of the video clip and replaced with a fade-out and fade-in transition. The sudden loud sounds accompanying each scary scene were removed. This editing prevents the viewer from being suddenly scared and the accumulated tension from drifting away, thereby constantly increasing the anticipation level. The result is a continuous sense of anticipation that stresses the viewer, evoking long-lasting anxiety.

FIGURE 3. The experiment setup. The participant wears the mobile VR headset and the headphones for an immersive experience. A wrist-worn Shimmer3 GSR+ records the galvanic skin response (GSR) and photoplethysmogram (PPG). The heart rate (HR) is derived from the PPG using the internal PPG-to-HR algorithm of the Shimmer3 GSR+.

FIGURE 4. The three phases of the experiment.

FIGURE 5. Overview of the procedure that generates the samples of all datasets D X with X ∈ {TF, DEEP}. From top to bottom, the signals (GSR, PPG, and HR) recorded while watching the emotional/relaxing video clip are labeled, preprocessed, and divided in windows w i . Then, each window w i generates a sample by extracting features from the time, frequency, and deep domains (i.e., from w i , its FFT, and spectrogram, respectively). The figure shows how the process generates a sample from an example window (w 1 ). In particular, a sample of a D TF dataset contains the features extracted from all signals' windows (w GSR 1 , w GSR_P1 …).

FIGURE 7. Spectrogram of a 10-second time window of the PPG signal from 0 to 8 Hz, using 128-sample segments.

… [0, 0.2] Hz, average power in [0.1, 0.4] Hz, the sum of average powers in [0, 0.2] Hz and [0.1, 0.4] Hz, and the ratio of average powers between [0, 0.2] Hz and [0.1, 0.4] Hz; 3) Deep domain: the short-time Fourier transform (STFT) generated a spectrogram from each signal's 10-second time window. In particular, spectrograms were computed considering the frequency ranges of [0, 5] Hz for GSR, [0, 2] Hz for HR, and [0, 8] Hz for PPG signals. Segments of 128 samples, with a 50% overlap, were extracted from the time window. The Hamming window was applied to each segment to create the spectrogram. These choices were a trade-off between temporal and spectral resolution to capture short-term …
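A spectrogram with the settings above (128-sample Hamming-windowed segments, 50% overlap) can be sketched with a basic STFT; the function below and the synthetic PPG-like test signal are illustrative, and a library routine would work equally well:

```python
import numpy as np

def spectrogram(x, fs, seg_len=128, overlap=0.5):
    """Magnitude spectrogram via a basic STFT: Hamming-windowed segments
    of seg_len samples with the given fractional overlap."""
    hop = int(seg_len * (1 - overlap))
    win = np.hamming(seg_len)
    frames = [x[i:i + seg_len] * win
              for i in range(0, len(x) - seg_len + 1, hop)]
    S = np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # freq x time
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    return freqs, S

# Hypothetical 10-second PPG-like window at 128 Hz
fs = 128
t = np.arange(0, 10, 1.0 / fs)
x = np.sin(2 * np.pi * 1.2 * t)
freqs, S = spectrogram(x, fs)
band = S[freqs <= 8.0, :]  # keep the [0, 8] Hz range used for the PPG
```

With these settings a 10-second window at 128 Hz yields 19 time frames with a 1 Hz frequency resolution, and the 1.2 Hz pulse component shows up in the 1 Hz bin.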

FIGURE 8. Block diagrams of the system versions S 1 (a) and S 2 (b). Each output of classifiers C 1 , C 21 , and C 22 is 0 or 1. The right-hand side of S 2 shows three AND logic gates that generate the output of S 2 . Little white circles preceding some inputs to the AND gates denote NOT logic gates.

… D TF 1 and D DEEP 1 to design and train C 1 ; … to train classifier C 22 .
N i was set to 20 as we considered selecting a maximum of 20 features, whereas N o was equal to 3 for C 1 and 2 for C 21 and C 22 . Then, H = 7 for C 1 , and H = 6 for C 21 and C 22 .

System S 2 comprised the C 21 and C 22 classifiers that obtained the highest accuracy in the external evaluation among those trained using D TF 21 and D DEEP 21 , and D TF 22 and D DEEP 22 , respectively.

FIGURE 12. Confusion matrix of system S 2 . The matrix shows the number of correctly and incorrectly classified samples for each class.

TABLE 2. Film names and their director(s).

TABLE 3. Equipment name and producer.

TABLE 4. Frequency bands and sub-bands for feature extraction.
Fig. 8 shows their block diagrams. Version S 1 takes as input a sample represented using the features F 1 extracted from the GSR, PPG, and HR signals, considering the time, frequency, and deep domains. S 1 uses a classifier (C 1 ) that classifies a sample as Anxiety, Relaxation, or Another emotion. Version S 2 uses two classifiers, C 21 and C 22 , that take a sample represented using the features F 21 and F 22 , respectively, extracted from the time, frequency, and deep domains. The output classes of C 21 are Anxiety or Another emotion; those of C 22 are Relaxation or Another emotion. S 2 classifies a sample as:
• Anxiety: if C 21 detects anxiety;
• Relaxation: if C 22 detects relaxation;
• Another emotion: if C 21 and C 22 both detect Another emotion.
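The S 2 decision rule listed above can be sketched as a small function; the 0/1 inputs mirror the binary classifier outputs of Fig. 8, and the priority given to C 21 when both classifiers fire is an assumption of this sketch:

```python
def s2_decision(c21_anxiety: int, c22_relaxation: int) -> str:
    """Combine the binary outputs of C21 (1 = Anxiety detected) and
    C22 (1 = Relaxation detected) into the final label of system S2.
    Assumption: C21 takes priority if both outputs are 1."""
    if c21_anxiety:
        return "Anxiety"
    if c22_relaxation:
        return "Relaxation"
    return "Another emotion"  # both classifiers reported Another emotion
```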

TABLE 5. Number of samples in the training and test sets of D TF

TABLE 6. Number of samples in the training and test sets of D TF

TABLE 8. Number of features of each signal for each dataset.

TABLE 9. Summary of internal evaluation results of optimal machine learning algorithms trained with TF and DEEP datasets.

TABLE 10. Hyperparameters of the TF models trained during internal evaluation.

TABLE 11. Hyperparameters of the DEEP models trained during internal evaluation.

TABLE 12. Summary of external evaluation results of systems S 1 and S 2 .

TABLE 13. Micro-average and weighted-average scores of system S 2 .

TABLE 15. External evaluation results of machine learning algorithms trained using data from the DEEP dataset.