Stress Detection Using Eye Tracking Data: An Evaluation of Full Parameters

Stress has been a common disorder in human societies and numerous studies have been conducted on the early diagnosis of stress. Previous studies have shown that it is possible to diagnose stress using eye tracking data. This study aimed to obtain a new and significant method for detecting parameters of the eye tracker and electrodermal activity signal by discrimination of “stress” vs. “relaxation” and to achieve higher accuracy than previous research. We used a Stroop task and a mathematical stressor task in which stress elements were placed in a novel design to separate stress from relaxation in the Stroop task and evaluate three levels of stress in the mathematical task. In the present study, we recorded the eye tracking data of fifteen participants and thoroughly investigated the pupil diameter (PD) and electrodermal activity (EDA) features to discriminate different stress states. After preprocessing, several features were extracted and selected. Then, the features were used for classification by applying support vector machine, linear discriminant analysis, and k-nearest neighbor classifiers. The linear discriminant analysis classifier, for which the accuracy was 88.43% in the Stroop and 91.10% in the mathematical, showed higher accuracy than the other classifiers when using PD and EDA features. Also, PD features demonstrated more reliability and ability to differentiate stress from relaxation compared to traditional EDA.


I. INTRODUCTION
Long-term psychological stress is a crucial issue in today's societies and imposes mental and physical health problems. It is the cause of many psychosomatic-neural-cognitive disorders and can influence the cognition, perception, and decisions of individuals. Thus, the World Health Organization describes stress as an epidemic of the 21st century. Stress is among the most common health threats, increases the need for doctor visits, and imposes a heavy burden on global healthcare systems. The detection of the stress limit can help enhance health, quality of life, and well-being [1]. Today, researchers are focused on different methods for automatic stress detection and for analyzing brain processes. Since the sympathetic nervous system (SNS) activates stress effects in different target organs, physical and physiological indicators of SNS activity are efficient in stress measurement. Noninvasive approaches, e.g., eye tracking, electrodermal activity (EDA) or galvanic skin response (GSR), automated facial expression analysis (AFEA), electrocardiography (ECG), electroencephalography (EEG), electromyography (EMG), and respiratory rate, are more desirable than invasive techniques [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Vishal Srivastava.
Since an eye tracker is a non-invasive and powerful tool, it is widely used to detect stress by tracking the point of gaze and eye motion, and by measuring pupil diameter (PD) [3], [4]. The most important features of eye tracker measurements are PD, fixations, and saccades [5]. PD is dictated by two opposing sets of muscles controlled respectively by the SNS and the parasympathetic nervous system, such that PD increases in the presence of stress [4]. Moreover, skin conductance is a common physiological indicator known as EDA. Since the SNS controls sweat glands, a rise in sympathetic arousal increases the activity of sweat glands and reduces skin resistance. Therefore, skin conductance (the inverse of skin resistance) can be utilized as a mental or physiological arousal indicator and can serve as a convenient stress measurement approach [2]. In addition, EDA provides skin conductance as a measured criterion for stress evaluation. EDA is thus one of the prevalent physiological measures of stress. The tonic skin response is conventionally explained as a natural increase in conductance. Although it may seem to be the point of reference, it is not. When the subject is affected by a stressor, transient rapid peaks called phasic responses occur, and it is this phasic response that studies consider. The utilization of EDA in recent studies has been notable for two reasons: (i) unlike EEG and ECG, this method does not require complicated set-ups; and (ii) the EDA electrodes usually do not need gel, unless the subject has particularly moist skin. EDA electrodes are usually placed in the palmar and plantar areas because sweat glands are more concentrated there [2], [6]. EDA has also been measured in study participants along with eye tracking tests [3].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To induce stress in experimental situations, studies use stress-generating methods that resemble real stress stimuli, such as the Stroop color-word interference test (SCWT) [3], [7] and mathematical task [8]. The Stroop task is a classical test of response inhibition associated with brain executive functions. Whereas people usually read a word automatically, a voluntary effort is required to name the colors and images in this task [7], [9]. The performance cost of the mismatch condition is called the Stroop effect or Stroop interference [10].
The Stroop task has been used in numerous studies as a psychological or cognitive stressor to generate emotional responses and increase the physiological level of (especially autonomic) reactions; however, there is disagreement on its precise mechanisms [3], [7]. Researchers have also examined individuals' anxiety and stress using physiological characteristics such as heart rate, skin conduction response frequency, and self-reporting during the Stroop task [7]. Furthermore, driving in real-life conditions [6], [11], visual stimulators (e.g., reflection and luminous intensity) [3], auditory stimulators (e.g., unwanted sound) [12], and mental stress caused by the virtual environment (e.g., virtual reality task in virtual environment by C. Hirt et al.) [13] have been used as stress stimulators. The Stroop [3], [4] and mathematical [8] tests are efficient alternatives to induce stress in experimental conditions.
Research on stress detection through physiological data has shown that eye tracking provides important information [13]. Nonetheless, in research to identify stress versus relaxation, eye tracking indicators have been used much less than other tools. Moreover, the use of eye tracking data in the Stroop [3], [4] and mathematical tests [8] has produced promising results in stress detection [3], [4].
PD is the most commonly measured feature of the eye tracker, and its direct relation to cognitive load enables the use of PD as an efficient estimator of stress and relaxation [2], [14], [15]. In addition to its ability to diagnose stress, this criterion is proportional to the level of difficulty of the task performed by individuals [16]. Previous investigations focused on employing such PD features as mean [6] or meanmax-Walsh [3], [4] with a Stroop test as the stressor. Other research used 10 features, including 6 PD and 4 facial temperature features, with pictures as the stressor [17]. Moreover, one study used mean PD with a driving task stressor [6] to discriminate stress versus relaxation. Alternatively, some research used EDA to measure stress levels [2], [3], [6].
In sum, despite the significant research regarding some PD features and the Stroop test, previous studies did not apply all PD features in their consideration of discrimination accuracy. Moreover, while two studies used a simple mathematical stressor to apply stress [8], [18], assessment of stress using PD features by employing a mathematical stressor has not yet been investigated.
Therefore, in this study, we aimed to investigate the accuracy of discriminating stress versus relaxation by employing all the PD features of an eye-tracker system. We hypothesized that employing all PD features for evaluation would result in a higher accuracy for stress detection. To enhance the ability to detect the impact of PD features, we applied two stressors in the form of the Stroop and mathematical tests. We improved the Stroop task presented in previous articles through a novel design. Moreover, we designed a new and complex mathematical task.
For evaluating the effect of PD features in discrimination and comparing the results with previous work, an EDA device was also employed. For achieving accuracy, three different classifiers including support vector machine (SVM), linear discriminant analysis (LDA), and k-nearest neighbor (KNN) were applied. Our results highlight the effect of using all PD features in discriminating different states of stress with high accuracy.

II. METHODOLOGY
The following sections describe the methodologies used in this study. A detailed schematic description of these steps is illustrated in Fig. 1 and Supplementary File S5-9.

A. EXPERIMENT SETUP AND DATA RECORDING
The eye tracking data collection and testing phases were implemented in the Neuroscience and Neuroengineering Research Lab, Biomedical Engineering Department, School of Electrical Engineering, Iran University of Science and Technology (IUST). The participants sat on comfortable seats at a distance of 60 cm from a 24-inch monitor with their heads fixed on a chinrest. An SR Research EyeLink 1000 Plus eye tracker was employed to track and record the eye movement data.
The right eyes of all participants were tracked. Eye tracking began with calibration and validation to match the pupil position in the camera and the point of gaze on the display. Recalibration was performed in the case of a non-match or poor validity. The EDA device used in this study was developed by our laboratory, the Neuroscience and Neuroengineering Research Lab (Supplementary File S1-4). To measure and record the skin conductance level data, an EDA sensor was connected to the clean index finger of the left hand of each participant. Then skin conductance was measured continuously and recorded as the EDA output feature. To discriminate stress and relaxation phases, the outputs of the eye tracker (PD) and EDA were recorded simultaneously during two stressors. Fig. 2 illustrates our experimental setup.

B. PARTICIPANTS AND STUDY ENVIRONMENT
In this study, fifteen healthy participants, including seven females and eight males, with an average age of 26.93 ± 2.05 years old participated. They had no history of relevant diseases, concentration disorders or visual disabilities. The participants were non-native English speakers (native tongue of all the participants was Persian). They volunteered with full consent.
As the alertness of each subject is critical to proper task performance and true provocation of measurable stress, the experimenters instructed participants to be well-rested and not sleep-deprived. Importantly, no symptoms of fatigue were observed in any of the volunteers on the day of the test.
The experimental procedure for both tasks was described comprehensively to the participants before experiment initiation. These explanations included the process of each task, the visual characteristics of the task (a timer, scores, and the schematic of the task), and audio characteristics of the task (beep sound and time reduction). The participants were instructed to pay close attention to the explanations, and to request clarification of any remaining questions or ambiguities thereafter. Sufficient time for such clarification of details was given to the participants, and the high performance of the subjects reflects their full concentration during this instructional period.
During each task, two experimenters continuously monitored the participant. If the participant appeared distracted, bored, or tired, he/she was eliminated from the experiment. Moreover, after the relaxation (baseline) phase of the Stroop task, all participants proceeded to the next Stroop phase except those who did not obtain the complete score (i.e., 45). The condition for including a subject's data in the analysis was a performance above 70% in both tasks. Therefore, the remaining subjects were engaged in the task.
All experimental tasks and methods in this study involving human participants were in accordance with the ethical standards of the institutional and/or national research committee. The Ethics Committee of the Neuroscience and Neuroengineering Research Laboratory (NNRL) of Iran University of Science and Technology (IUST) reviewed and approved the experiments and methods used in this study according to the 1964 Helsinki Declaration. Informed consent was obtained from all the participants included in the study before data collection. The tests and data collection were carried out in a quiet environment without sound disturbances. Also, to reduce the effect of illumination on the pupil diameter, the lighting of the study environment and the average brightness of the stimulus system were kept constant for all subjects throughout the experiment.

C. STRESS STIMULATION AND STUDY MEASUREMENTS
Data for the different stress arousal states were elicited and compared in the present study. The experiment was designed in MATLAB using the Psychtoolbox software. Both tasks were used as real-time stress stimulators, and the same test timing was applied to all participants. To equalize the test conditions for all subjects, the Stroop test was always performed first, followed by the mathematical test. Fig. 3 shows the experiment phases.

1) TASK 1
The first test in our study was the SCWT, in which a word refers to a color while it is displayed in a different color. The classical Stroop test was modified into an interactive version, in which the participant was required to not only verbally state the correct color but also click one of the five buttons displayed (Green, Yellow, Red, Blue, and Cyan), as shown in Fig. 4. The score of the user was displayed in the top-right corner of the screen during the test (one point for each correct answer), while the remaining time of each trial (a maximum of three seconds) was displayed on the left. The Stroop task consisted of four consecutive phases as follows: 1) Introduction: a five-minute baseline phase in which a two-minute pre-recorded description of the experiment was played, followed by an opportunity for the participants to ask questions about the task. The remaining time was then spent in silence. 2) Relaxation: a congruent phase, in which no significant stress was expected to be stimulated. A total of 45 colored words referring to a color were displayed to the participant. The participant was asked to click the button corresponding to the color displayed on the screen. Each trial had a maximum length of three seconds. 3) Stress: an incongruent phase in which 30 colored words were displayed to the participant, and the color of the word was different from the written name of the color. The participant had a maximum of three seconds to say the color of the word (not the name of the word) aloud and then click the corresponding key with the cursor. 4) Rest: a buffer at the end of the incongruent phase, consisting of one minute of rest with no required tasks. The participant remained silent in this phase.

2) TASK 2
This task is a mathematical test that acts as a real-time stress stimulator and is representative of a realistic scenario. The test was implemented in three phases: 1) Introduction: a five-minute baseline phase in which a two-minute pre-recorded description of the experiment was played, followed by an opportunity for the participants to ask questions about the task. The remaining time was then spent in silence. 2) Stress: a phase with three levels of difficulty to induce different levels of stress. A total of 21 three-digit numbers were randomly displayed, and the participant was required to subtract seven from the displayed number in each trial. They first calculated the result in their mind and then announced it in English. Next, they entered the result with the keyboard embedded on the screen. This step was performed under three different timing conditions, so that the pressure of tighter time constraints and increased speed induced more stress at each level. The three levels were: (i) a fifteen-second maximum per trial for seven numbers; (ii) a thirteen-second maximum per trial for seven numbers; (iii) a ten-second maximum per trial for seven numbers. 3) Rest: one minute of rest was spent silently and without movement by the participants. In the stress phase, at the end of level one and level two of the mathematical task, a sound was played through the individual's headphones to increase stress, informing them to hurry due to the limited remaining time.
In the mathematical task, to show the remaining time in different levels, we displayed a moving circular arc around the random number, and the arc's circumference decreased in proportion to the remaining time. The starting point of the circular timer changed at each level and indicated the remaining time to modulate the participant's perception of trial speed and to cause more stress. Furthermore, the score was displayed on top of the screen.
The participant thus had different times allotted to perform the calculation. If the participant's answer was correct, a new score was displayed on the screen and the screen changed to the next page. If the participant's answer was incorrect or the allotted time elapsed, the user did not receive a new score and the screen changed to the next page. This process continued until all 21 numbers had been shown. To further stimulate stress, the participants were asked to pay attention to their score at the top of the screen. Fig. 5 illustrates this task.
To announce the transition between phases in both tasks, a beeping sound was played at the beginning of each phase and the end of each task. While performing each task, it was not possible for the participant to return to the previous one.

D. PREPROCESSING
Once the data of the participants had been recorded, raw data were extracted from the eye tracker through Data Viewer and converted into (.mat) files. Preprocessing is necessary because artifacts affect analysis results [6]. The PD and EDA signals were processed as follows. In preprocessing, blinks and other artifacts were first identified. Several methods for detecting blinks have been used in previous studies [6]. Here, each blink was replaced by linear interpolation from the last acceptable PD value before the blink. Individual subjects have been shown to have different responsive ranges due to various levels of initial excitation, so normalization is necessary before feature extraction to minimize the effect of this problem [6]. In this study, the z-score method was used for normalization [11]:
$$z = \frac{x - \mu}{\sigma}$$
where µ = mean of the sample, σ = standard deviation of the sample, z = standard score, and x = observed value.
FIGURE 5. The mathematical task. This task was performed at three levels, each with seven trials. The first level offered 15 seconds to find and select the answer using the keyboard next to the time indicator, and the second and third levels had 13 and 10 seconds, respectively. The number that is typed on the keyboard is displayed at the top of the screen, while the score is shown on the right.
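The two preprocessing steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB code: the function names, the convention that blink samples are stored as NaN or 0, and the use of the sample standard deviation (ddof=1) are all assumptions. The sketch interpolates linearly across each blink gap; holding the last valid sample instead is another possible reading of the text.

```python
import numpy as np

def interpolate_blinks(pd_signal, invalid_value=0.0):
    """Fill blink samples (NaN or `invalid_value`, an assumed convention)
    by linear interpolation between the surrounding valid samples."""
    pd_signal = np.asarray(pd_signal, dtype=float)
    bad = np.isnan(pd_signal) | (pd_signal == invalid_value)
    if bad.all():
        raise ValueError("no valid pupil samples to interpolate from")
    idx = np.arange(pd_signal.size)
    cleaned = pd_signal.copy()
    # At the edges of the recording np.interp holds the nearest valid value.
    cleaned[bad] = np.interp(idx[bad], idx[~bad], pd_signal[~bad])
    return cleaned

def zscore(x):
    """Standard score z = (x - mu) / sigma for one subject's signal."""
    x = np.asarray(x, dtype=float)
    sigma = x.std(ddof=1)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("constant signal cannot be z-scored")
    return (x - x.mean()) / sigma
```

A typical call would be `zscore(interpolate_blinks(raw_pd))` applied per participant, so that differences in baseline pupil size do not dominate the extracted features.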

E. FEATURE EXTRACTION
Feature extraction is the conversion of input data into a set of features that have the maximum mutual information with the output. The features extracted from the data must carry useful and discriminating information about stress. In the extraction stage, all PD parameters were extracted to be processed and used as features to discriminate stress from relaxation and to evaluate stress levels. These extracted parameters are listed in Table 1. The first seven parameters (I-VII) in Table 1 were directly used as part of the final set of features.
Additional features were generated based on PD and Current Fixation PD using their maximum, minimum, mean, median, skewness, kurtosis, variance, sum absolute amplitude, and range (difference between maximum and minimum). This resulted in a total of twenty-two features (features which are dissimilar) (Table 1), which were used for further analysis. Furthermore, the EDA output of the skin conductance measurement device was also recorded during the trials. As a result, the mean value of skin conductance in each trial was determined.
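As an illustration of the statistical descriptors named above, the following sketch computes them for a single trial of PD samples. The function name and dictionary keys are hypothetical. Two definitions are assumptions because the text does not specify them: "sum absolute amplitude" is taken as the sum of absolute sample values, and kurtosis is the non-excess (Pearson) form, which equals 3 for a normal distribution.

```python
import numpy as np

def pd_statistics(trial):
    """Descriptive statistics of one trial of pupil-diameter samples."""
    x = np.asarray(trial, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (population form)
    if m2 == 0:
        skewness = kurtosis = float("nan")  # undefined for a constant signal
    else:
        skewness = np.mean(centered ** 3) / m2 ** 1.5
        kurtosis = np.mean(centered ** 4) / m2 ** 2  # non-excess (assumption)
    return {
        "max": x.max(),
        "min": x.min(),
        "mean": x.mean(),
        "median": np.median(x),
        "skewness": skewness,
        "kurtosis": kurtosis,
        "variance": x.var(ddof=1),  # sample variance (assumption)
        "sum_abs_amplitude": np.abs(x).sum(),
        "range": x.max() - x.min(),
    }
```

Applying the same function to both PD and Current Fixation PD gives two groups of nine descriptors per trial, from which the paper retains the dissimilar ones.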

F. FEATURE SELECTION
To select the optimal number of extracted features, statistical evaluation was employed. The present study adopted the ladder method of feature selection, in which features are sorted in order of importance. This prioritization can be based on a variety of criteria, e.g., the t-test, entropy test, receiver operating characteristic (ROC) curve, and Wilcoxon signed-rank test. Choosing between parametric tests (e.g., the t-test) and non-parametric tests (e.g., the Wilcoxon signed-rank test) is the first step in such statistical evaluations, since the two types of tests make different assumptions about the data distribution. The Kolmogorov-Smirnov test, which investigates the normality of the data distribution, serves as the criterion for this choice. The t-test is used when the distribution is normal; otherwise, the Wilcoxon signed-rank test is employed. The Kolmogorov-Smirnov test was applied to our data and showed that they had a non-normal distribution. Therefore, the Wilcoxon signed-rank test was employed. For discrimination, we listed the 22 PD features in ascending order of their Wilcoxon signed-rank p-values, such that the feature with the lowest p-value was the most valuable feature. The first feature of the list, which thus had the highest value, was entered into the classifiers, and the accuracy was calculated. Then, the first and second features of the list were entered into the classifiers, and the accuracy was again calculated. This process continued until all 22 PD features on the list had been entered into the classifiers.
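The ranking-then-ladder procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the two conditions are assumed to be available as paired arrays of equal length, which the signed-rank test requires. `scipy.stats.wilcoxon` supplies the p-values.

```python
import numpy as np
from scipy.stats import wilcoxon

def rank_features_by_wilcoxon(relax, stress):
    """Rank features by ascending Wilcoxon signed-rank p-value.

    relax, stress: arrays of shape (n_pairs, n_features) holding paired
    observations of each feature in the two conditions (assumed layout).
    """
    relax = np.asarray(relax, dtype=float)
    stress = np.asarray(stress, dtype=float)
    p_values = np.array([
        wilcoxon(relax[:, j], stress[:, j]).pvalue
        for j in range(relax.shape[1])
    ])
    order = np.argsort(p_values)  # most discriminative feature first
    return order, p_values

def ladder_subsets(order):
    """Nested feature subsets: top-1, top-2, ..., all features."""
    return [order[:k] for k in range(1, len(order) + 1)]
```

Each subset returned by `ladder_subsets` would then be passed to the classifiers in turn, and the subset giving the highest accuracy retained.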

G. CLASSIFICATION
After selecting features in stress and relaxation states (in the Stroop test) and stress levels (in the mathematical test), the features were introduced as inputs to a machine learning system to differentiate the stress state (incongruent Stroop phase) from the relaxation state (congruent Stroop phase) in the Stroop test, and different stress levels (at different temporal allotments) in the mathematical test.
The tested classifiers were naïve Bayes, Decision Tree, Ensemble Tree, SVM (using several kernels), KNN (with different values for K), LDA, and Random Forest. The classification accuracy obtained from different classifiers was sorted based on obtained accuracy and the best three classifiers with the highest accuracy were chosen for further investigations, which were KNN (with K = 3), LDA, and SVM (with Gaussian kernel, width = 0.9821).
The present study applied three classifiers to classify the data, including SVM, LDA, and KNN. SVM is a computational machine learning system which classifies by trying to find the hyperplanes that best separate the different classes from the dataset by projecting the dataset into a high-dimensional domain [15]. SVM algorithms are widely used in data classification applications [19]. In this study, the Gaussian kernel was used for the SVM classifier by comparing the accuracies obtained by linear, polynomial, and Gaussian kernels and selecting the kernel with the highest classification accuracy.
LDA is a technique for reducing the dimensionality of a dataset. In LDA, a projection vector is sought along which the between-class scatter is maximized and the within-class scatter is minimized. Classes are therefore separated by maximizing the ratio of inter-group dispersion to intra-group dispersion. LDA finds a linear combination of inputs that can separate two classes (i.e., relaxation and stress in the Stroop test and two of the three stress levels in the mathematical task). The data were projected onto this vector, which maximizes the discrimination of the two classes [20].
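For reference, the ratio described above is commonly written as the Fisher criterion. The notation below is the standard textbook form and is not taken from the paper: w is the projection vector, m_1 and m_2 are the class means, and S_B and S_W are the between-class and within-class scatter matrices.

```latex
S_B = (m_1 - m_2)(m_1 - m_2)^{\top}, \qquad
S_W = \sum_{c \in \{1,2\}} \sum_{x \in \mathcal{C}_c} (x - m_c)(x - m_c)^{\top}

J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}, \qquad
w^{*} \propto S_W^{-1} (m_1 - m_2)
```

Maximizing J(w) yields the direction w* onto which the two classes are projected before a threshold separates them.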
KNN is a non-parametric, supervised classification method. As a classifier, it assigns a label to a sample based on the closest (neighboring) training examples in a given region. KNN is notable for its simplicity and effectiveness. The Euclidean distance is used to find the nearest neighbors. The value of K plays an important role in classifying unlabeled data [21]. To select K for the KNN classifier, several values were tested, and the best results were obtained with K = 3.
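The KNN rule described above is small enough to write out directly. The sketch below is a plain NumPy illustration with a hypothetical function name, using Euclidean distance and a majority vote over the K = 3 nearest training samples; it is not the toolbox implementation used in the study.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test sample by majority vote of its k nearest
    training samples under the Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    predictions = []
    for neighbor_idx in nearest:
        labels, counts = np.unique(y_train[neighbor_idx], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)
```

With two classes and an odd K, the vote can never tie, which is one practical reason an odd value such as 3 is convenient.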
To evaluate a classification method, the data must be divided into a training dataset and a testing dataset, and the model is constructed using the training dataset. Then, the model is validated using the testing dataset. The present study used k-fold cross-validation for the Stroop task and Leave-One-Out/each class cross-validation for the mathematical task. The k-fold method with K = 10 was applied to the Stroop test: the dataset was divided into ten subsets, and each time one subset was used as validation data while the other nine were used as training data. This process continued until all subsets had been used as validation data once. Because the mathematical task had fewer trials (seven trials in each level), Leave-One-Out was used.
To implement classification, the constructed model is used to estimate the labels of the validation data. Then, these estimates are compared to the actual labels to measure the efficiency of the model. The aforementioned steps were performed ten times for the data of each participant, and the average classification accuracy was reported.
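The repeated 10-fold evaluation can be sketched with scikit-learn. This is a minimal illustration, assuming an LDA classifier and shuffled folds; the function name, the seed handling, and the choice of plain (unstratified) `KFold` are assumptions rather than details given in the text.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

def repeated_kfold_accuracy(X, y, n_splits=10, n_repeats=10, seed=0):
    """Mean accuracy of LDA over `n_repeats` runs of k-fold cross-validation."""
    run_means = []
    for r in range(n_repeats):
        # A different shuffle per repetition (assumed; the paper only
        # states that the procedure was performed ten times).
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        fold_acc = cross_val_score(
            LinearDiscriminantAnalysis(), X, y, cv=cv, scoring="accuracy"
        )
        run_means.append(fold_acc.mean())
    return float(np.mean(run_means))
```

For one participant, `X` would hold the selected features of the congruent and incongruent Stroop trials and `y` the corresponding relaxation/stress labels.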
Leave-One-Out/each class cross-validation was used in the mathematical test, due to the small volume of data. In this method, one sample from each class (a total of two samples) was held out as validation data and the remaining data were used as the training dataset, until every pair of samples with the same index had been selected from the classes. That is, the first sample of each class was employed as the validation data in the first round, the second sample of each class was used as the validation data in the second round, and this process continued until all the data pairs had been used. Each time, the model was constructed using the training data and validated using the validation data. Finally, the highest result through repetitions was averaged and reported.
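The leave-one-sample-per-class scheme can be sketched as below, again with LDA from scikit-learn. The function names are hypothetical, and the helper that averages over the three level pairs (1 vs 2, 1 vs 3, 2 vs 3) is an assumed reading of how the pairwise accuracies are combined.

```python
from itertools import combinations

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def leave_one_pair_out_accuracy(class_a, class_b):
    """Hold out the i-th trial of each class in round i, train on the rest."""
    class_a = np.asarray(class_a, dtype=float)
    class_b = np.asarray(class_b, dtype=float)
    if len(class_a) != len(class_b):
        raise ValueError("both classes need the same number of trials")
    n = len(class_a)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i
        X_train = np.vstack([class_a[keep], class_b[keep]])
        y_train = np.array([0] * (n - 1) + [1] * (n - 1))
        X_val = np.vstack([class_a[i], class_b[i]])  # one sample per class
        model = LinearDiscriminantAnalysis().fit(X_train, y_train)
        correct += int(np.sum(model.predict(X_val) == np.array([0, 1])))
    return correct / (2 * n)

def mean_pairwise_accuracy(levels):
    """Average the two-class accuracy over every pair of stress levels."""
    pairs = combinations(range(len(levels)), 2)
    return float(np.mean([
        leave_one_pair_out_accuracy(levels[a], levels[b]) for a, b in pairs
    ]))
```

With seven trials per level, each two-class comparison runs seven rounds of twelve training samples and two validation samples.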

III. RESULTS
The present study classified stress and relaxation states in the Stroop test and stress levels in the mathematical task using PD and EDA signals. A total of eight male and seven female volunteers participated. Data were recorded under the same experimental conditions, and the data were separately analyzed.
Our goal in this study was to design a comprehensive and applied method that achieves greater classifier discrimination accuracy, addresses other aspects of stress detection, and moves closer to clinical application. It is noteworthy that in this study, all the features related to PD were considered. Therefore, we extracted nine features and, by applying the statistical variables, obtained twenty-two PD features as explained above (Table 1). To accomplish these purposes, we created an experimental design for dual-purpose stress diagnosis. First, the two main stressors, the Stroop test and the mathematical test, were used in the experiment. Second, a more complex design of these two stressors was used by adding time pressure through both visual and auditory means. Third, the Stroop task in our study was designed as one section and without using lighting as an extra stressor, in contrast to the three-section Stroop tasks in previous studies. The extra stressors were added to the two main tasks to affect cognitive function and increase the stimulated stress, in order to achieve higher discrimination accuracy. Afterward, we evaluated the effect of all PD features from the eye-tracker measurements on increasing the classification accuracy.
The twenty-two PD features are shown in Table 1. For discrimination, we listed the twenty-two PD features in ascending order of their Wilcoxon signed-rank p-values. Then, in the next stages of this study, the SVM (with Gaussian kernel), LDA, and KNN (with K = 3) classifications were performed using several different groups of features obtained from PD and skin conductance. The groups included twenty-two PD features (Set 1), one EDA feature (Set 2), and twenty-two PD features and one EDA feature (Set 3). These sets were entered into the three classifiers and the highest accuracy was reported. Tables 2 and 4 report the classification results.
To classify the stress and relaxation states and the stress levels, the different sets of above-mentioned features were applied (the Stroop task in Table 2 and the mathematical task in Table 4).
In the Stroop task, the middle two segments were used to create relaxation and stress phases. Thus, the differentiation of the data collected in these two phases was investigated using the above-mentioned sets. Fig. 6 shows the accuracy of classifying PD features by three classifiers for each participant in the Stroop task. Different participants were found to have different accuracy. The final report in Table 2 was calculated from the average accuracy of 15 subjects. It was found that the LDA classifier had the highest accuracy in the classification of relaxation and stress in the Stroop test. Moreover, the use of the PD features with the mean EDA was found to have the highest accuracy (i.e., 88.43%) under the LDA classifier. In Table 3, we introduce the six PD features with the highest contribution to maximizing accuracy in the Stroop task, as well as the three features with the least contribution to maximizing accuracy.
The participants underwent three stress levels (1, 2, and 3) in the mathematical task. To examine the discrimination of the data recorded by the eye tracker and EDA, the classifications of stress levels 1 versus 2, 1 versus 3, and 2 versus 3 were carried out. Fig. 7 shows the discrimination of different stress levels of the mathematical task by using the PD features in the LDA classifier. It was again noted that different subjects in different stress levels in the mathematical task had varying accuracy. Table 4 reports the average classification accuracies of the three levels for all 15 subjects. The average accuracy of the three classifications was assumed to be the accuracy of stress level classification in the test. The highest accuracy obtained was 91.1% using the PD features with the mean EDA under the LDA classifier.
Also, the LDA classifier had the highest accuracy in all the analyses. Fig. 8 shows the accuracy of classifying pupil features by three classifiers for each subject in the mathematical task. The average accuracy of the two-class classification in three different mathematical levels was obtained. As in the two previously mentioned datasets, accuracy varied across subjects. Also, as in the Stroop task, the LDA classifier had a better separation ability than the other classifiers. Lastly, the final accuracy reported in Table 4 was the average accuracy of 15 subjects. In Table 5, we list the 7 features that were found to have the highest contribution to maximizing accuracy in the mathematical task, along with the 3 least contributing features.
To evaluate the effectiveness of PD features, the following steps were performed. First, twenty-two PD features were used to calculate the average classification accuracy, which was 84.7% in the Stroop task for the recognition of relaxation versus stress, and 90.65% in the mathematical task in the determination of the stress level (Set 1). Then, the classification was performed using one EDA feature (mean) (Set 2). As can be seen in Tables 2 and 4, classification using only the EDA feature gave lower accuracy than Set 1 in both the Stroop task and the mathematical task. The highest accuracy was obtained by adding the PD features to the EDA feature (Set 3), which yielded 88.43% accuracy in the Stroop task and 91.1% in the mathematical task. Comparing our results using the eye-tracking and EDA data with the LDA classifier in the Stroop task reveals that the accuracy of our study is higher than in previous studies. To the best of our knowledge, the mathematical task has not been used as a stressor together with PD features in previous investigations.
Nonetheless, the present work obtained its highest accuracy, 91.1% (Set 3), in the mathematical task using the eye-tracking and EDA data with the LDA classifier. Combining the eye-tracker features with the EDA mean feature (Set 3) improved accuracy in stress detection by 3.72% in the Stroop task and 0.45% in the mathematical task on average, relative to the PD features alone (Set 1).
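The reported gains can be checked against the Set 1 and Set 3 accuracies quoted in this section. The short sketch below simply subtracts the rounded values given in the text; the dictionary layout and the function name are illustrative assumptions. Because the inputs are already rounded to two decimals, the computed difference may differ from the reported gain in the last digit.

```python
# Accuracies (%) quoted in this section.
# Set 1 = twenty-two PD features; Set 3 = PD features plus the mean EDA feature.
reported = {
    "Stroop": {"set1": 84.70, "set3": 88.43},
    "Mathematical": {"set1": 90.65, "set3": 91.10},
}


def accuracy_gain(set1: float, set3: float) -> float:
    """Return the gain of Set 3 over Set 1 in percentage points."""
    return round(set3 - set1, 2)


gains = {task: accuracy_gain(v["set1"], v["set3"]) for task, v in reported.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} percentage points (Set 3 vs. Set 1)")
```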
Finally, to explore the effect of gender on the accuracy of classifying stress versus relaxation and the stress level, the average classification accuracy was calculated separately for the female and male participants, as shown in Table 7. Since the difference in classification accuracy between the female and male participants was not statistically significant (p > 0.05), we concluded that the results of our experiment were independent of gender.
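The text does not name the statistical test used for the gender comparison. As one possible way to carry out such a comparison, the following minimal sketch applies an exact two-sample permutation test to per-subject accuracies. The choice of test, the group sizes, and all accuracy values below are hypothetical and serve only to illustrate the procedure.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means.

    Every way of relabeling the pooled values into groups of the original
    sizes is enumerated; the p-value is the share of relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point noise
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical per-subject accuracies (%), NOT the values from Table 7.
female_acc = [88.1, 90.4, 86.7, 89.9, 87.5, 91.2, 88.8, 89.0]
male_acc = [87.9, 90.1, 88.3, 86.9, 91.0, 89.4, 88.6]

p_value = permutation_p_value(female_acc, male_acc)
print(f"p = {p_value:.3f}; significant at 0.05: {p_value < 0.05}")
```

With only fifteen subjects the full enumeration is small, so an exact test is feasible without resampling.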

IV. DISCUSSION
The present study investigated PD features in the discrimination of different stress states. Deriving the eye-tracking signals that arise from human emotion in an individual is a daunting challenge [22]. The design and implementation of mental stress stimulation tests require deep insights into human psychology.
In our study, fifteen subjects participated. Although a higher number of subjects could further support the validity of our results, the included number of participants was statistically sufficient to support our findings. Importantly, this work included a larger number of subjects than some previous investigations which included fewer than 15 subjects [17], [23].
Two stressors, the Stroop task and the mathematical task, were used as the main stressors in this study. The Stroop task serves to represent a short-term, real-time mental stress stimulus with a medium intensity that is responsive to environmental conditions [24]. In these tasks, the subjects whose performance was not above 70% were excluded from the analysis. More specifically, for the Stroop task, even the participants who did not obtain a complete score (i.e., a score of 45) in the relaxation Stroop achieved a performance higher than 70%.
As shown in Fig. 4, the design of the Stroop task in our study included a one-section Stroop task instead of the three-section Stroop task employed in previous articles [3]. The advantage of our new method is that, since the Stroop task is reduced to one section, the experiment duration is shorter and, consequently, fatigue as a possible source of test error is reduced. Moreover, the one-section Stroop test decreased the computational cost. Another possible source of error in stress diagnosis is a preconceived notion held by the subject regarding the content of the test. In three-section Stroop tests, the user becomes familiar with the Stroop environment in the first section, which can serve as a source of error in the subsequent sections. This error could be eliminated by reducing the three-section Stroop task to a one-section Stroop task. According to the results in Table 2, the Stroop test created mental stress in the subjects, which had a significant effect on the measured features, thereby enabling the separation of the stress and relaxation states.
Since the use of lighting as a stimulus was controlled manually in previous articles [3], [4], there was a potential for reducing reliability and increasing experimental error. We eliminated this environmental error with our design. To this end, instead of creating more stress by light, we used the time limit pressure and showed the person's score during the test to create more stress. This strategy was implemented in our software program and therefore did not require the use of inaccurate external tools.
Meanwhile, as shown in Figs. 4 and 5, the mathematical task designed in this study consisted of several stress-increasing items such as time pressure and score display, each of which had been used as a single stressor in earlier works [25], [26]. Audio characteristics of the task, such as the beep sound and time reduction notifications, were also employed. Furthermore, to the best of our knowledge, mathematical stressors had not been studied with PD features in previous investigations.
All participants indicated Persian as their native language, and it is noteworthy that they were required to say the colors of the words in English. This condition could induce more stress due to speaking in a non-native language [27]. Furthermore, a previous investigation has indicated a correlation between voice intensity and stress level [28]. Therefore, it can be postulated that the participants' use of a loud voice in our study would increase their stress. To further increase stress, the participants were asked to pay attention to their scores at the top of the screen.
Previous studies have shown that internal factors such as negative emotions, fatigue, dyslexia, dysarthria, distraction, and lack of concentration can influence the results of stress detection, specifically in the Stroop task [29]. These factors, however, were minimized in our study by monitoring the participants and asking about their history of diseases related to our experiment. Moreover, potentially contributing external factors such as (i) light illumination, (ii) noise and sound disturbance, and (iii) head movement were eliminated by (i) keeping the environmental illumination intensity fixed during the experiment, (ii) isolating the environment from disturbing sounds, and (iii) using a chinrest to hold the subject's head steady, respectively.
Table 1 shows the twenty-two PD features, which were obtained by applying statistical variables to the nine PD features extracted from the eye-tracker. Whereas previous articles did not report examining all the PD features and used only a few, such as Mean-Max-Walsh, we investigated all the PD outputs of the eye-tracker and evaluated the effects of those features on the classification accuracy. Because of the significance of investigating all extracted PD parameters, we comprehensively studied the PD features and the effect of each feature on increasing accuracy in this discrimination.
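To illustrate how statistical variables can be applied to raw eye-tracker outputs to build a feature vector, the following minimal sketch computes a few summary statistics per channel. The channel names, the sample values, and the particular statistics (mean, maximum, minimum, standard deviation) are assumptions chosen for the example; the actual twenty-two features are those listed in Table 1.

```python
from statistics import mean, pstdev

# Illustrative statistics only; the study's own feature list is given in Table 1.
STATISTICS = {
    "mean": mean,
    "max": max,
    "min": min,
    "std": pstdev,  # population standard deviation of the segment
}


def summarize_channels(channels):
    """Apply every statistic to every channel of one recording segment.

    `channels` maps a channel name to its list of samples; the result maps
    "<channel>_<statistic>" to a single number.
    """
    features = {}
    for name, samples in channels.items():
        for stat_name, stat_fn in STATISTICS.items():
            features[f"{name}_{stat_name}"] = stat_fn(samples)
    return features


# Hypothetical pupil-diameter samples (mm) for one segment.
segment = {
    "pd_left_mm": [3.1, 3.3, 3.6, 3.4],
    "pd_right_mm": [3.0, 3.2, 3.5, 3.5],
}

features = summarize_channels(segment)
for key, value in features.items():
    print(f"{key}: {value:.3f}")
```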
In this study, we used a non-invasive EDA device to compare our results with previous studies [2], [3], [6]. The PD and EDA data of the participants were recorded simultaneously to distinguish between stress and relaxation and to demonstrate the effect of PD as a strong and important factor in this separation. To this end, the twenty-two PD features and one EDA feature (the mean conductance) were used in our experiment. In this approach, we arranged the features obtained from PD and EDA into three sets to compare our classification accuracy with previous studies, as shown in Table 2.
To select the classifier with the highest accuracy, we tested several classification algorithms in our initial investigations. The three final classifiers, which yielded higher accuracy than the others, were SVM with a Gaussian kernel and hyperparameter W = 0.9821, KNN with hyperparameter K = 3, and LDA.
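As an illustration only, the three classifiers named above can be instantiated with scikit-learn as in the following minimal sketch. The use of scikit-learn, the feature standardization, the five-fold cross-validation, and the synthetic data are all assumptions made for this example and are not taken from the study. Because the text does not state which SVM parameter W denotes, the SVM below keeps the library's default settings rather than guessing a mapping for W = 0.9821.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_classifiers():
    """Return the three classifier types named in the text."""
    return {
        # Gaussian (RBF) kernel; other SVM settings are library defaults (assumption).
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        # K = 3 as stated in the text.
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
        "LDA": LinearDiscriminantAnalysis(),
    }


def compare_classifiers(X, y, folds=5):
    """Mean cross-validated accuracy per classifier (validation scheme assumed)."""
    return {
        name: float(cross_val_score(clf, X, y, cv=folds).mean())
        for name, clf in build_classifiers().items()
    }


# Synthetic stand-in for one subject: 23 features (22 PD + 1 EDA), two classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 40, 23
relaxation = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
stress = rng.normal(0.8, 1.0, size=(n_per_class, n_features))
X = np.vstack([relaxation, stress])
y = np.array([0] * n_per_class + [1] * n_per_class)

scores = compare_classifiers(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Running the same comparison per subject and averaging the resulting accuracies would mirror the way the per-subject figures and the averaged tables are described in the text.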
As shown in Table 2, the LDA classifier in Set 3 had a higher accuracy (88.43 ± 2.0%) than previous studies [3], [4]. Moreover, Fig. 6 compares the accuracy of the three classifiers in the Stroop task (Set 1) for the 15 subjects. Each classifier discriminates differently for different subjects, but the LDA classifier shows higher accuracy than the two other classifiers, and the highest accuracy in this figure is for subject one.
Importantly, as shown in Table 3, we determined the features that contributed the most and the least to the increased accuracy of the LDA classifier in the Stroop task. Moreover, we calculated the extent of their contribution in achieving high accuracy in this task.
Furthermore, to compare the accuracy of the three levels of the mathematical task, we prepared a line graph of the three levels (Set 1) for the 15 subjects with the LDA classifier; as seen in Fig. 7, each level has a different accuracy. A two-class process was used in the mathematical task, in which the time was shortened at each successive level in order to induce more stress. Then, level 1 versus 2, level 1 versus 3, and level 2 versus 3 were compared to determine the accuracies.
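The pairwise scheme described above can be sketched as follows: the three level pairs are enumerated, the three two-class accuracies are averaged for each subject, and the result is averaged over subjects, as is done for Table 4. The function names and the accuracy values are hypothetical and are used only to show the bookkeeping.

```python
from itertools import combinations
from statistics import mean

LEVELS = (1, 2, 3)


def level_pairs(levels=LEVELS):
    """All two-class problems: level 1 vs. 2, 1 vs. 3, and 2 vs. 3."""
    return list(combinations(levels, 2))


def task_accuracy(pairwise_by_subject):
    """Average the pairwise accuracies per subject, then across subjects."""
    per_subject = [
        mean(subject_acc[pair] for pair in level_pairs())
        for subject_acc in pairwise_by_subject
    ]
    return mean(per_subject)


# Hypothetical two-class accuracies for two subjects (not the study's data).
subjects = [
    {(1, 2): 0.90, (1, 3): 0.94, (2, 3): 0.86},
    {(1, 2): 0.88, (1, 3): 0.92, (2, 3): 0.90},
]

overall = task_accuracy(subjects)
print(f"pairs: {level_pairs()}, overall accuracy: {overall:.3f}")
```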
As shown in Table 4, the LDA classifier in Set 3 had an accuracy of 91.10 ± 1.9% for the mathematical stressor, which is the highest accuracy in our study. As mentioned before, no previous study used the mathematical task with the PD features of the eye-tracker, so there is no experiment with which to compare our results. As shown in Fig. 8, the accuracy of the three classifiers differs in the mathematical task (Set 1) for the 15 subjects, and the LDA classifier shows higher accuracy than the two other classifiers. Moreover, Table 5 presents the contribution of the features in the mathematical task and the extent to which they helped achieve high accuracy.
Specifically, as shown in Table 6, the accuracy of this work is 88.43%, while the highest accuracy in the previous studies with PD and EDA features for the Stroop test was 85.43%. Also, to the best of our knowledge, previous investigations did not study PD features by employing the mathematical task as a stressor. Accordingly, relevant accuracy values have not been reported in the literature for a direct comparison with our accuracy results.
Additionally, we examined the difference in accuracy between genders and found no significant difference (p > 0.05) in the results of this experiment, as shown in Table 7.
In most studies in the field of stress and its recognition, the stress-inducing scenarios take place in a controlled laboratory environment. In this study, we similarly monitored and controlled the laboratory's parameters. However, a general limitation in this field is the need for real-life applications. To make this possible, methods that enable real-life measurement of stress need to be developed.
Overall, the presented results reveal that the PD features serve as powerful discriminators of stress from relaxation in the Stroop task and different levels of stress in the mathematical task. Real-time PD measurement was observed to be efficient in computer-human interactions to derive valuable information on emotional shifts, such as the information extracted in the Stroop stimulator (congruent versus incongruent) and mathematical stimulator. While investigating all extracted PD features and multiple classifiers, our observations indicated the highest accuracy was achieved with the LDA classifier. Finally, for further expansion and potential clinical translation of our work, our developed software program has the potential to be applied with an eye-tracker system in clinical settings for future investigations.

V. CONCLUSION
To detect stress with higher discrimination accuracy, we investigated a comprehensive set of PD features from the eye-tracker, which had not been employed previously in similar investigations. By incorporating twenty-two PD features and one EDA feature, we achieved a higher accuracy than previous studies for discriminating stress versus relaxation in the Stroop task and three incremental stress levels in the mathematical task. Accordingly, our results indicate that the use of this comprehensive set of features is a promising approach for achieving a more robust and reliable method to discriminate different states of stress.

(Mansoureh Seyed Yousefi and Farnoush Reisi contributed equally to this work.)
FARNOUSH REISI received the B.Sc. degree in biomedical engineering from the Sahand University of Technology (SUT), Tabriz, Iran, in 2015, and the M.S. degree in biomedical engineering from the Department of Biomedical Engineering, School of Electrical Engineering, Iran University of Science and Technology (IUST), Tehran, in 2020. Her main research interests include cognitive neuroscience, biomedical signal processing, the eye tracking approach, and the cognitive approach in psychology.
VAHID SHALCHYAN received the M.Sc. degree in biomedical engineering from the Amirkabir University of Technology, Tehran, Iran, in 2002, and the Ph.D. degree in biomedical science and engineering from Aalborg University, Aalborg, Denmark, in 2013. From 2011 to 2013, he was a Visiting Researcher with the University Medical Center Göttingen, Georg-August University, Göttingen, Germany. He is an Assistant Professor with the Department of Biomedical Engineering, School of Electrical Engineering, Iran University of Science and Technology (IUST), Tehran. His main research interests include biomedical signal processing and pattern recognition, with emphasis on their application to neural signals, for neuroscience, neurotechnology, and brain-computer interface research.