Methods and Measures for Mental Stress Assessment in Surgery: A Systematic Review of 20 Years of Literature

Real-time mental stress monitoring from surgeons and surgical staff in operating rooms may reduce surgical injuries, improve performance and quality of medical care, and accelerate implementation of stress-management strategies. Motivated by the increase in usage of objective and subjective metrics for cognitive monitoring and by the gap in reviews of experimental design setups and data analytics, a systematic review of 71 studies on mental stress and workload measurement in surgical settings, published in 2001–2020, is presented. Almost 61% of selected papers used both objective and subjective measures, followed by 25% that only administered subjective tools, mostly consisting of validated instruments and customized surveys. An overall increase in the total number of publications on intraoperative stress assessment was observed from the mid-2010s, along with a momentum in the use of both subjective and real-time objective measures. Cardiac activity (including heart-rate variability metrics), stress hormones, and eye-tracking metrics were the most frequently used objective measures, and electroencephalography (EEG) was the least frequently used. Around 40% of selected papers collected at least two objective measures, 41% used wearable devices, 23% performed synchronization and annotation, and 76% conducted baseline or multi-point data acquisition. Furthermore, 93% used a variety of statistical techniques, 14% applied regression models, and only one study released a public, anonymized dataset. This review of data modalities, experimental setups, and analysis techniques for intraoperative stress monitoring highlights the initiatives of surgical data science and motivates research on computational techniques for mental and surgical skills assessment and cognition-guided surgery.


I. INTRODUCTION
STRESS affects around one third of employees in the world and is prevalent among health-care professionals such as physicians, surgeons, and nurses [1], [2]. Prolonged exposure to stress can lead to burnout, substance abuse, work absence, and substantial financial loss [3]. Long working hours, time pressure, sleep deprivation, complex surgical interventions and new surgical technology, interactions with traumatizing events, noise and interruptions, multitasking, communication, and working with or supervising junior colleagues are key factors that affect cognitive states and psychomotor performance of surgeons and staff in the operating rooms (ORs) [4]-[7]. Excessive levels of cognitive impairment in the OR materialize as increased distress, frustration, and workload [5], [6], [8]. According to the descriptive models of [9], heightened levels of acute mental stress result in attention narrowing, distraction, loss in working memory, and perseveration. When they occur during surgery, elevated levels of "intraoperative" mental stress also impair decision making, judgement, teamwork, and performance of individual personnel and surgical teams [7], [10].
In their classic review of stress, appraisal, and coping, Lazarus and Folkman defined stress as a "particular relationship between the person and environment" with psychological and physiological responses affected by both external stimuli and internal or personal features [11]. In psychology, stress is defined as "a dynamic process that occurs when an individual appraises situational demands as exceeding available resources" [12]. This cognitive component is also defined as a response to demands and external stimuli that surpass the coping abilities of an organism [13]. This perspective associates stress with resource limitation, workload, and role strain [14]. Empirical research indicates that stress can increase neural activity, available working memory, and experienced cognitive load for learners and decrease task processing efficiency [15]. The conceptual overlaps and close associations of workload and stress have imposed challenges for their measurements in the past 30 years [16], [17]. Therefore, in this systematic review, we focus on real-time assessment of psychophysiological stress in the OR such that elements of cognitive workload and arousal (both tense and energetic) and physiological or autonomic arousal are covered [18]. A stressful situation can be self-evaluated and appraised as being harmful, inducing a threat, or merely acting as a challenge that provides new opportunities [9]. Stress can have negative effects on task performance; however, positive stress and successful stress adaptation can improve skills, alertness, motivation, stimulation, and sense of enjoyable task completion [6], [19]-[21]. Research and anecdotes have shown that expert surgeons efficiently control their brain activation under stress while young trainees tend to use beta-blockers to overcome anxiety [22], [23].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Therefore, studies focusing on quality and safety as two main markers of surgical procedures address the need for improvement of technical and non-technical skills [24], [25], identification of stress factors, assessment of acute stress, and implementation of stress-coping strategies using, for example, closed-loop human-computer systems for simulating surgical situations [26], [27] and technology-assisted meditation/relaxation [28], [29].
In general, no measurement tool or technique can directly assess cognitive states such as stress or workload [30], [31]. Therefore, stress is commonly estimated by measuring "its effects" through individual and subjective perceptions and objective assessment of aroused physiologic states and responses [17], [30], [32]. While using subjective questionnaires and other self-report tools has been a common method for evaluating intraoperative levels of "self-perceived" stress [17], [24], [33], objective assessment of these states from elevated sympathetic, parasympathetic, and electrophysiological activities has been less frequently exercised due to technological complications, difficulties with correct interpretation, interference with task performance, and privacy concerns [24], [34]- [36]. We believe that spatio-temporal patterns of psychophysiological stress, recorded during different stages of surgical procedures, need to be compared with those of expert surgeons for mastering surgical skill acquisition. This real-time monitoring of acute mental stress from surgeons and surgical staff in the OR may also help to: (a) predict periods of high stress, (b) analyze effects of novel image-guided and minimally invasive surgical techniques on workload, stress, anxiety, and technical skills in real and simulated surgeries [37]- [40], (c) accelerate implementation of stress-management interventions to reduce occupational stress and improve clinical outcomes and patients' well-being [41]- [44], (d) introduce mental skill training and multi-session monitoring of skill acquisition in trainees [45]- [47], (e) improve surgical automaticity and performance [7], [17], [48], and (f) develop intelligent surgical assistance in smart ORs [36], [49].
In recent years, advances in wearable neurotechnologies that enable real-time and non-invasive measurement of physiological signals without causing discomfort for users have brought a new interest in monitoring and quantifying cognitive states from engineers, athletes, and surgeons [6], [50]. Despite these advances and the surge in applying machine learning (ML) for mental stress assessment from multimodal datasets [51], the progress in implementing ML for objective stress and workload detection in the OR has been slower and limited to a few research groups [42], [44]. Challenges in quantifying mental states in real life [52], intra- and inter-subject variabilities [53], obtaining ontologies and annotating surgical stages [36], [54], privacy and data protection rights [55], and highly dynamic and case-dependent decisions and action-interactions of surgical staff that result in data interpretation challenges [56] are a few contributing obstacles. These challenges are being addressed by new academic programs in digital health, data science, and AI-assisted technologies 1 and by public forums on data science and digital healthcare 2 .
Timeliness of a New Systematic Review. Among the growing pool of systematic reviews on intraoperative nontechnical skills [25] and states such as fatigue [57] and workload [17], and effects of distractions [58] and music [59] on surgical performance, we identified four major reviews related to our scope on intraoperative assessment of acute mental stress and workload. In two 2006 and 2010 systematic reviews of morbidity syndromes in minimal access surgery [31] and effects of stress on performance [30], adopted stress measures included questionnaires, electrocardiogram (ECG), heart rate (HR), heart rate variability (HRV), skin conductance level (SCL), electromyogram (EMG), electrooculogram (EOG), salivary cortisol, performance scores, and communication frequencies. Another review of 28 studies from 1998 to 2010 reported that questionnaires, completion time, economy of motion, cognitive and psychomotor error metrics, final product quality, hand efficiency, HR, HRV, and SCL were used for assessment of technical and nontechnical skills [60].
More recently, a review covered 33 studies on acute stress measurement during real operations that had used objective metrics such as thermal and electrodermal activity (EDA) [32]. Finally, a review of 84 studies that measured surgeons' cognitive workload named NASA Task Load indeX (NASA-TLX), Surgery Task Load Index (SURG-TLX), and HRV as the most frequent subjective (post hoc) and objective measures [17]. Use of eye-tracking, electroencephalography (EEG), and functional near-infrared spectroscopy (fNIRS) was also reported in this review. Thus, it was only after 2010 that modern sensor-based measures such as pupilometrics, EEG, cortical hemodynamics, and functional connectivity were adopted for quantification of intraoperative mental stress [44], [61]-[65]. This trend seems to be accelerated by advances in wireless sensors and lightweight headsets that enable acquisition of physiological signals without imposing ergonomic restrictions on the surgical staff [27]. Data analysis has also witnessed widespread changes in the past decade [56]. However, none of the earlier systematic reviews on stress assessment addressed key aspects of experimental design or data analytics tools.
Contributions of the Current Systematic Review. The earlier and highly valuable systematic reviews on intraoperative stress measurements had limited scopes and inevitable shortcomings. Motivated by gaps in a thorough review of robust experimental designs and modern data analytic methods for stress assessment from multi-modal datasets, this study aims to present a replicable systematic review of methods and measures for objective assessment of mental and psychophysiological stress in surgical settings. This effort is also motivated by the fact that our initial search resulted in finding a large range of descriptive phrases for stress, including cognitive, mental, systematic, psychologic, physiologic, task induced, and perceived or subjective stress. Interestingly, stress, strain, anxiety, workload, and mental/cognitive demand have been mentioned and assessed interchangeably in the literature [17]. This vague definition is emphasized in discussions on unclear relationships between physiologic activation, stress, and self-reported anxiety [66] and multidimensionality and interaction of stress and mental fatigue [67].
In this work, we consider the earlier definition of stress (dynamic process related to self-appraisal of demands exceeding limited resources) [12]. In addition, we perform a literature search with two main keywords (mental stress/distress and cognitive workload) to cover studies on intraoperative assessment of these states due to their conceptual overlap. We aim to provide the following contributions: (a) A review of cognitive states and skills studied in association with mental distress/stress and workload of OR professionals in surgical settings, (b) a review of temporal patterns in adopting objective and subjective measures for intraoperative stress assessment, (c) a thorough and novel review of experimental setups such as baseline recordings, wearable systems, multi-modal datasets, processing, synchronization, and data analysis, and (d) highlighting future directions for research and dissemination of results on intraoperative stress assessments.
Research Questions. The following questions were defined to clarify the research scope [68]. What constituted the studied participants, procedures, and research topics (Sections III-A and III-B); how often subjective and objective measures were used and if they were administered separately or in conjunction with each other, what were the most frequently used features in each group, and what consensus or discord were reported in their associations with intraoperative mental stress (Section III-C); what experimental setup and recording conditions were implemented, if measurements were calibrated or compared with baseline values, if multi-modal datasets were acquired and synchronized with each other and surgical events, and if wearable sensors were utilized (Section III-D); what processing and data analysis techniques were implemented, and if data was available as open source (Section III-E).

II. MATERIALS AND METHODS
Study selection was conducted with keywords on mental stress and mental distress. Since an initial search on Google Scholar demonstrated that not all papers on mental stress assessment and surgery were retrieved by these criteria, and considering the hypotheses on associations of workload, working memory, and memory load with mental stress [17], a second search was conducted to scan papers on cognitive load or workload. This step was performed to identify the literature that measured stress as a side variable of workload during intraoperative assessments.

A. Search Strategy
The search strategy protocol followed guidelines from the PRISMA statement [69] and its extension PRISMA-S [70]. The following eligibility criteria were devised to define the structure of the current systematic review.
• Topic: "Mental stress", "mental distress", "mental workload", "cognitive workload", "mental load", and "cognitive load".
Posters and abstracts, non-English articles, clinical trials, and studies on stress assessment in patients and their caregivers were out of the scope of this review. Cohort or cross-sectional surveys from healthcare workers about general well-being, work styles, job satisfaction, occupational stress, burnout, coping strategies, perceived and moral stress [71]-[74], and papers on design and validation of new scales or training programs without near real-time assessment of psychological data [75] were excluded. Studies on physical and ergonomic stress, pain, or strain instead of mental stress, and those on mental stress of staff who monitored the patients but were not involved in surgical tasks or simulations [76] were also removed. However, papers with measurements collected from surgeons during office days versus on-call and operating days [77], [78], and papers on baseline and post-intervention measurements of surgical performance and mental stress were included [21], [46], [79], [80].

B. Information Sources and Search Terms
Cochrane Library, Ovid MEDLINE, PubMed, and Scopus were accessed through university subscriptions in fall 2020 and January 2021 to obtain the latest publications. When possible, search results were limited to publications in English. Database-specific search terms and limitations are accessible from our Online Repository. Search results were exported to spreadsheets for further analysis.

C. Selection Process
Titles and abstracts of search results were screened by two authors in line with the inclusion and exclusion criteria in Section II-A. If an article was not publicly available, its authors were contacted on ResearchGate. Articles that remained inaccessible as of February 2021 were excluded from this review.

D. Study Characteristics
The SPIDER (Sample, Phenomenon of Interest, Design, Evaluation, Research type) mnemonic [81] was used to define and limit key dimensions and study characteristics to be analyzed in this systematic review. Using this tool, the following pieces of information were extracted from abstracts and full texts. "Sample" consisted of participants' features such as surgical specialties, expertise, and roles. The hypotheses on stress/workload factors and their associations with technical and nontechnical skills and cognitive states were considered as "Phenomena of Interest". "Design" included surgical procedures, experiment conditions (e.g., silence or environmental noise, time pressure or self-paced operations) and their comparisons. "Evaluation" consisted of subjective assessment tools and their types (e.g., validated questionnaires or custom surveys), objective and physiological measures and their sub-dimensions such as computed features, their associations with stress, use of wearable systems, multimodal datasets, synchronization practices, baseline recording sessions, and data analysis techniques. Online links were also investigated to locate supplementary, open-access materials. Finally, studies were chronologically sorted to extract information on temporal evolution of subjective and objective measures.

III. RESULTS
A total of 421 and 481 citations were retrieved after searching databases on mental stress/distress and cognitive workload, respectively. The PRISMA flow diagram of the search strategy is presented in Fig. 1. Applying the exclusion criteria and removing duplicate articles resulted in the systematic selection of 71 papers for a qualitative review of mental stress assessment in real and simulated surgery. Temporal distribution of these papers is presented in Fig. 2. The Supplementary Table presents all study characteristics, extracted with the SPIDER mnemonic, that will be discussed in the rest of this section. Table I shows the surgical procedures conducted in the 71 selected papers. Distributions are not mutually exclusive: 41 studies involved a single surgical procedure while 20 studies compared performance and stress using two different procedures, and one included data from open, laparoscopic, and converted procedures [82]. Five studies did not mention the procedure used in stress-monitoring sessions [6], [78], [83]-[85]. Out of 39 studies that administered laparoscopy, 11 compared laparoscopy with other procedures or focused on different laparoscopic techniques, one paper compared laparoscopy with conventional surgery in terms of physiological correlates of mental strain [86], and seven papers analyzed differences between standard and robot-assisted laparoscopy in terms of cognitive stress, physical strain, workload, performance, attention, coping, and gaze control [38], [40], [87]-[91]. One paper compared induced stress and workload during single-incision and conventional laparoscopy [39], another administered sessions with one-port and four-port laparoscopy to quantify postural constraints, perceived workload, task difficulty, and stress [92], and one analyzed performance differences in relatively easy versus difficult laparoscopy due to different visual scanning and multi-tasking demands [93].
Simulated surgeries were performed with mannequins, physical models for mesh and knot tying tasks, or surgical simulators. Two papers addressed telerobotic operations [94], [95], and three used biofeedback devices for stress management [83], [96], [97]. Table II shows the distribution of the majority of participant groups in terms of expertise levels. Nursing and non-medical students [85], [93], [128], surgical technicians [66], [92], [118], and professionals from anesthesiology [66], [116], [126], emergency medicine [114], medicine [28], [83], nursing [66], [112], [118], [126], and primary care [83] were other included participants.

C. Objective and Subjective Measures
Stress perception induces a variety of cognitive, emotional, and behavioral responses [99] that can affect surgical outcomes depending on personality types, emotional intelligence, trait anxiety, and stress-coping capabilities [30], [66]. However, due to the inherently subjective nature of self-rated questionnaires, assessment of brain functions and cognitive skills using objective metrics is a valuable asset when surgical skills do not demonstrate large inter-subject variations [44]. Furthermore, applying both measurement categories may enhance the assessment of underlying cognitive states [17]. Fig. 3 demonstrates the number of selected studies that used objective and subjective metrics for intraoperative assessment of mental stress. Besides the generally upward trend in both modalities, a total of 25 and 31 papers used objective and subjective measures, respectively, during the 2016-2020 period.
An in-depth analysis showed that 18 papers (25.35%) had only administered subjective questionnaires and interviews, 10 papers (14.08%) only collected objective, physiological data, and 43 papers (60.56%) used both objective and subjective measures to assess the stress levels for at least one time point during the surgery. To demonstrate the variety of these measures, Fig. 4 presents the timeline of all questionnaires and physiological modalities used in the selected papers.
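The percentages above follow directly from the reported counts over the 71 selected papers. As a minimal arithmetic check (the dictionary keys and the helper name `share` are illustrative and not part of the review):

```python
# Counts reported in the text for the 71 selected papers.
TOTAL_PAPERS = 71
counts = {
    "subjective_only": 18,
    "objective_only": 10,
    "both": 43,
}


def share(count, total=TOTAL_PAPERS):
    """Return the percentage of `total` represented by `count` (2 decimals)."""
    return round(100.0 * count / total, 2)


# The three groups partition the selected papers.
assert sum(counts.values()) == TOTAL_PAPERS

shares = {name: share(n) for name, n in counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```

The three groups are mutually exclusive and sum to 71, so the shares add up to 100% up to rounding.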
In what follows, we present a categorization of utilized subjective measures in Section III-C1, identify the most frequently used instruments, and investigate the cognitive states and skills that were subjectively studied in the selected papers. We then categorize all modalities of objective measures and present findings, according to the Evaluation dimension of the SPIDER mnemonic, in Section III-C2. The selected studies administered a variety of questionnaires for self-assessments by participants or evaluations by researchers and independent observers. Several papers also recorded the demographics and prior experience with administered surgical instruments and technologies. Subjective measures used for intraoperative assessments can be divided into three categories: (a) instruments such as NASA-TLX and STAI listed in Fig. 4 that were validated, occasionally adapted to different languages, and presented to participants before or after each surgical session, (b) customized or newly developed surveys combining several dimensions and metrics such as those in [94], [98], and (c) observations and real-time reports by researchers or independent observers, and post-session evaluation of recorded videos for assessment of surgical progress, stress, and anxiety.
As shown in Table III, 36 papers administered between one and five validated instruments, and 14 papers used one or two validated instruments in addition to customized surveys. Only six studies utilized observational assessments, and four of them paired these reports with validated instruments or customized surveys. Among the 54 papers (76.06%) that used at least one validated measurement tool, the study by Stefanidis et al. [29] stood out for having administered five validated questionnaires. Overall, NASA-TLX [129] (n = 19), full STAI (n = 10), SURG-TLX [130] (n = 9), and six-item STAI [131] (n = 8) were the most frequently used validated tools.
2) Objective Measures for Intraoperative Stress Assessment: Table IV presents 12 identified categories for objective measurements, names of physiological features, and lists of corresponding studies. The majority of reviewed papers had reported heart activity, salivary stress hormones, ocular activity, blood pressure, and electrodermal activity. It should be mentioned that energy expenditure and heat flux, kinematics, respiration rate, and cortical activation were not covered by earlier reviews on assessment of acute intraoperative stress [31], [32], [132], nor by systematic reviews on impacts of mental stress and nontechnical skills on surgical performance [30], [60]. However, a review on intraoperative workload measurement covered these modalities and pointed to the interchangeability of mental stress and cognitive workload [17]. In what follows, we present a few observations on these objective measures with stress/workload markers exclusively along the boundary defined in Section II-D, and leave details on statistically significant observations to the "Significant Objective Measures" column in the Supplementary Table.
a) Cardiac activity: Cardiac activity, either calculated offline from recorded ECG signals or measured in real time with wearable devices, was the most frequently used physiological modality among selected studies (n = 40). Generally, an increase in average HR indicates elevated sympathetic activity and physical expenditure, and a decrease in HRV corresponds to smaller changes in the time interval between consecutive heart beats, more arousal and strain, and less cardiac relaxation [4]. HRV is controlled by the autonomic nervous system [125] and is calculated through temporal features (SDNN, RMSSD, pNN50) and spectral features (LF: sympathetic activity, HF: parasympathetic activity, LF/HF) [133]. An increase in LF/HF shows imbalance in the autonomic tone and an increase in mental stress [125].
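The time-domain HRV features named above (SDNN, RMSSD, pNN50) have standard definitions, and the following minimal sketch computes them from a list of RR intervals. The function name `hrv_time_domain` and the sample values are illustrative assumptions and are not taken from any reviewed study; spectral features (LF, HF, LF/HF) require resampling and spectral estimation and are omitted here.

```python
import math


def hrv_time_domain(rr_ms):
    """Time-domain HRV features from successive RR intervals in milliseconds."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    n = len(rr_ms)
    mean_rr = sum(rr_ms) / n
    # SDNN: sample standard deviation of all intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (n - 1))
    # Differences between adjacent intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    # RMSSD: root mean square of successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # pNN50: percentage of successive differences exceeding 50 ms.
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)
    return {
        "mean_hr_bpm": 60000.0 / mean_rr,  # average heart rate
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
        "pnn50_pct": pnn50,
    }


# Hypothetical RR series (ms), for illustration only.
features = hrv_time_domain([800, 810, 790, 860, 800])
print(features)
```

In practice, ectopic beats and motion artifacts are usually removed before such features are computed, and short recordings make SDNN in particular hard to compare across studies.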
The majority of reviewed studies computed or read measured heart rates, followed by spectral and temporal feature analysis for HRV. However, as summarized in Column K of the Supplementary Table, findings are divided in observing significant or consistent changes in HR and HRV parameters under perceived or designed stressful conditions and significant associations with subjective ratings, and have to be carefully interpreted considering the timeline of reading measurements [107], calculation of absolute values or relative changes from baselines [4], [6], [80], [125], different or low sample sizes [10], [108], and the type of metric (temporal vs. spectral) and body positions during HRV recording [134].
b) Stress-related hormone levels: Cortisol level was the most frequently analyzed feature in 11 studies that collected saliva samples. Associations of salivary cortisol with stressful situations were either contradictory [80], [111] or did not result in statistically significant differences among control and intervention groups [21], [83], [117]. The highest cortisol levels were observed 20-30 minutes after task completion in a high-stress scenario [114].
c) Ocular activity: Eye activities were recorded using surface EOG [98], eye-tracking glasses [38], [63], [106], [107], [115], [121], or eye trackers connected to microscope oculars [62]. Gaze and fixation metrics were computed from image frames using commercial or custom scripts [38], [107], [115], [121]. While significant associations were reported between less frequent blinking and higher self-perceived workload and frustration, lack of baseline data was a challenge for comparisons [106]. Changes in pupil dilation from the baseline were larger under noise distractions than under silent or music conditions, in line with higher subjective workload levels. Furthermore, experienced surgeons showed smaller changes in pupil dilation than less experienced operators [63].
Finally, average and SD of pupil size changes during suturing successfully classified surgical expertise while fatigue, caffeine, and illumination changes were possible reasons for no classification improvement from blink rate features [62].
d) Blood-related metrics: Among six studies that measured systolic and diastolic blood pressure (BP) or mean arterial pressure (MAP), four studies did not report any significant differences between control and intervention groups or in changes from baseline (pre-simulation) to post-task [83], [101], [111], [113]. In [107], regression analysis showed that during the baseline phase, evaluating the task as challenging was associated with higher cardiac output and lower peripheral resistance, computed from outputs of a wearable device and measured MAP. A significantly higher BP was seen during real neurosurgeries than in resting states [123], and the increase in white blood cell (WBC) count after on-call duties with respect to pre-call days was inversely proportional to the level of surgical training [78].
e) Electrodermal activity: Intraoperative sympathetic arousal or nervous response has been measured in different ways in the literature. More recent papers used wearable devices to measure galvanic skin response (GSR) with an armband [66], [95]. In a study with an ergonomic station, experienced surgeons showed less variability in SCL levels across the tasks, and SCL correlated with the reported increase in mental stress [98]. Another study reported higher GSR for surgeons of varying expertise after a crisis simulation and a negative correlation between state anxiety and maximal GSR during debriefing [66]. Finally, SCL analysis was discarded due to its large changes when participants held laparoscopic tools [111].

f) EMG activity:
In [89], surface electrodes recorded muscular activities during robotic and standard laparoscopy. The EMG-based root mean square (RMS), standardized with respect to a maximal voluntary contraction (MVC) period, was significantly larger for all muscles and both sides during standard laparoscopy than in robot-assisted surgery. In [95], the EMG-based spectral mean power (MNP) was the largest during the joint space teleoperation scenario and was interpreted as high muscle fatigue. In [96], critical surgery events were logged and a higher HR and significantly larger masseter tone, measured from EMG signals by a biofeedback device, were detected in high workload segments. However, in the endoscopic simulations of [97], the masseter tone and respiratory rate were only used for detection of artifacts and tense situations.
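Normalizing an EMG root mean square to an MVC reference, as described for [89], is commonly expressed as a percentage of the MVC amplitude. The sketch below illustrates that general idea only; the function names and sample values are hypothetical, and the filtering, rectification, and windowing steps used in the cited study are not specified here and are therefore left out.

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("empty signal")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_percent_mvc(task_emg, mvc_emg):
    """Task RMS expressed as a percentage of the RMS during the MVC period."""
    mvc_rms = rms(mvc_emg)
    if mvc_rms == 0:
        raise ValueError("MVC reference has zero amplitude")
    return 100.0 * rms(task_emg) / mvc_rms


# Hypothetical amplitudes (mV), for illustration only.
task = [0.1, -0.1, 0.1, -0.1]
mvc = [0.5, -0.5, 0.5, -0.5]
print(round(rms_percent_mvc(task, mvc), 1))
```

Expressing activity relative to each participant's own MVC makes values comparable across muscles and participants, which is what allows a comparison such as standard versus robot-assisted laparoscopy.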

g) Functional near-infrared spectroscopy (fNIRS):
Four studies quantified prefrontal cortex (PFC) activation through changes in cortical oxygenation levels using non-portable fNIRS and characterized temporal and motor demands for different surgical technologies under time pressure and normal conditions. Both cortical hemispheres were almost identically activated during a multi-week laparoscopic and motor learning program [21], but no significant correlation was found between these and other physiological markers of stress. An improved and careful operation with robotic laparoscopy under time pressure resulted in larger responses in the right ventrolateral PFC, which is associated with improved sustained attention and distraction suppression [40]. In a similar experiment, junior and intermediate residents showed more bilateral PFC activation during self-paced tasks while senior residents had consistent activation patterns in both conditions [44].
h) Respiration metrics: The breathing rate, measured via a chest belt, was not statistically different between primary and assistant surgeons and not correlated with surgery duration in [122]. Respiration events were unprocessed in [127], but used for interpreting LF and HF HRV bands [96] and locating moving and talking artifacts [97]. One study concluded that changes in breathing frequency were too slow for real-time data acquisition [96], and another pointed out that measuring respiration through expansion and during movement may introduce artifacts into the recorded data [122].
i) Energy expenditure (EE) and heat flux: HRV-based EE was larger in primary surgical fellows than in assisting senior surgeons in [4]. Wearable armbands with a heat flux sensor measured the heat convection subset of total thermal energy in [108]. The standard deviation of heat flux during simulation training was correlated with subjective stress levels.
j) Body temperature: Facial temperature (FT) was measured alone or in addition to EDA [65], [108]. When using an infrared thermal imaging camera inside an OR with controlled temperature, significantly lower mean frontal head temperature was reported for participants with a poor performance in a simulated laparoscopic training [108]. Instantaneous perinasal perspiration signal, calculated from facial thermal imagery data as a measure of sympathetic arousal response, was not linearly correlated with task completion time or proficiency scores in a multi-week training program [65].
k) Body movements and kinematics: One group measured hand and gaze movements during open surgery training using an eye tracker, and calculated the total hand movement time based on knot-tying videos [115]. Arm kinematics were collected via body inertial measurement units (IMU) during simulated robotic operations [95]. Performance metrics from 3D kinematics were improved in specific scenarios although respective physiological features from EEG, HR, and SCL were elevated.
l) EEG activity: Through a commercial package and validated framework, EEG activity recorded with a wearable headset was used to calculate probability-based engagement and workload metrics from power spectral density (PSD). These indices were higher in experimental scenarios that also demonstrated the highest mental and physical demands and effort from NASA-TLX, EMG, HRV, and SCL features.
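As a rough illustration of how spectral features can be derived from an EEG trace, the sketch below estimates band power from a naive periodogram and forms a beta/(alpha + theta) ratio. This ratio is an assumption borrowed from the broader engagement literature; it is not the probability-based metric of the commercial package used in the reviewed study. The band limits, sampling rate, and function names are likewise illustrative.

```python
import cmath
import math

# Conventional band limits in Hz (illustrative; exact limits vary across studies).
BANDS = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}


def periodogram(signal, fs):
    """Return (frequencies, power) over non-negative frequencies of a real signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        # Naive discrete Fourier transform coefficient for bin k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / (fs * n))
    return freqs, power


def band_power(freqs, power, low, high):
    """Sum periodogram bins with low <= f < high (proportional to band power)."""
    return sum(p for f, p in zip(freqs, power) if low <= f < high)


def engagement_ratio(signal, fs):
    """Beta / (alpha + theta) ratio; a hypothetical stand-in for an engagement index."""
    freqs, power = periodogram(signal, fs)
    bp = {name: band_power(freqs, power, lo, hi)
          for name, (lo, hi) in BANDS.items()}
    return bp["beta"] / (bp["alpha"] + bp["theta"])
```

The naive transform is quadratic in the number of samples, so it only suits short segments. For long recordings, a fast Fourier transform with segment averaging would be the more practical choice.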

D. Experiment Design and Data Acquisition Practices
In this section, we review the selected studies in terms of experimental design aspects and data acquisition techniques.
2) Multi-Modal Datasets: Data fusion, inspired by processing data from a variety of sensory sources in living beings, is a common practice in clinical decision making and disease prognosis [136], [137]. Relying on synchronous recordings of different signal sources, multi-modal analysis can improve detection accuracy by enhancing predictions of weaker modalities [135], [136]. Among reviewed papers, 24 studies (33.80%) recorded and analyzed data from one, 16 papers (22.54%) from two, 10 papers (14.08%) published between 2010 and 2016 from three, one 2016 paper from four [111], and one 2018 paper from five different physiological modalities [95], pointing to the rise in using multiple information sources to account for changes in intraoperative stress and workload.

E. Data Processing and Analysis Techniques
In this section, we review the techniques used for data preprocessing and analysis in the selected papers.
2) Data Analysis: Sixty-six papers (92.96%) used at least one statistical analysis technique to report measured subjective or objective metrics or analyze effects of experimental sessions, surgical technologies, roles, and expertise on intraoperative performance and stress metrics. These techniques included descriptive statistics, hypothesis testing, correlation and effect sizes, parametric and non-parametric testing, single and multivariate analysis of variance (ANOVA), and Wilcoxon signed-rank and rank-sum tests for within- and between-group comparisons. Details are provided in the Supplementary Table. Two studies applied uni- and multivariate general linear models (GLMs) [64], [80], and one used GLM-ANOVA and intra-cluster correlation for comparing different surgical procedures [92]. Hierarchical regression [107] and LOWESS linear regression [82] were other utilized models. Six papers published after 2016 applied linear mixed models (LMM), generalized linear models (GZLM) and their hierarchical versions [92], generalized estimating equations (GEE) for repeated measurements and autoregressive correlational structure [21], and generalized linear mixed models (GLMM) with mixed, fixed, and random effects [44], [65], [82], [127]. Among the less frequent methods, the area under the curve (AUC) was calculated for cortisol integration [103], and sensitivity and specificity analyses were run with HR and cortisol levels as two predictors of STAI-based subjective psychological stress [102]. Common factor analysis [83], principal factors [104], and CUSUM sequential analysis [126] were run on subjective measures. From an ML perspective, only one study used a support vector machine (SVM) after feature selection to predict surgical expertise levels [62].
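To make the AUC calculation for repeated hormone samples concrete, the following minimal sketch integrates a sampled curve with the trapezoidal rule. It is an assumption that a trapezoidal integration was used in [103]; the source does not specify the variant. The function names, sampling times, and concentrations are hypothetical.

```python
def auc_ground(times, values):
    """Area under the curve with respect to zero, via the trapezoidal rule."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired samples")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("sampling times must be strictly increasing")
        area += 0.5 * (values[i] + values[i - 1]) * dt
    return area


def auc_increase(times, values):
    """Area relative to the first (baseline) sample; can be negative."""
    return auc_ground(times, values) - values[0] * (times[-1] - times[0])


# Hypothetical samples: minutes since the first sample and concentrations.
minutes = [0, 30, 60, 90]
cortisol = [10.0, 14.0, 18.0, 12.0]
total_area = auc_ground(minutes, cortisol)
area_above_baseline = auc_increase(minutes, cortisol)
```

The first function summarizes total output over the sampling period, whereas the second isolates the change relative to the first sample. The second is therefore sensitive to how the baseline sample is chosen.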
3) Open Data: Among the 71 reviewed studies, only one paper from 2019 included a publicly accessible anonymized dataset covering all of its participants [65]. Two studies [44], [64] published average changes in channel-wise cortical haemodynamics for different groups, and four papers published the administered questionnaires [83], [92], [119], [128]. One paper released a short video of the conducted simulated surgeries [127]. Links to these additional data are provided in the Supplementary Table.

IV. DISCUSSION
Real-time monitoring of surgical personnel is an important component in smart ORs and an emerging topic in surgical data science [35], [36] that motivates new research on computational techniques for mental and surgical skills [143], cognition-guided surgery [144], and feedback systems [6]. Cognitive states such as stress and workload may negatively influence the quality and outcome of clinical care [145] or regulate positive cognitive flow [19]. The present study has systematically reviewed papers that assessed mental stress during real and simulated operations, with special attention to the temporal patterns and categorization of utilized subjective and objective measures. For the first time in the literature, a detailed systematic overview of experimental setups, wearable systems, uni- and multi-modal physiological data acquisition, feature and surgical events synchronization, multi-point and baseline recordings, preprocessing and artifact correction, data analysis methods, and open-access data is presented for intraoperative stress monitoring.
Our analysis shows that 60.56% of selected papers used both objective and subjective measures while 25.35% only administered subjective instruments. Validated instruments, notably NASA-TLX, SURG-TLX, and STAI, alone or combined with customized surveys and observations, were used in 71.83% of selected papers to assess subjective stress, workload, task demands, etc. (Table III). Not only has the number of studies on intraoperative mental stress assessment increased since the mid-2010s (Fig. 2), but the administration of discrete subjective and continuous objective measurements has also gained momentum (Fig. 3). This trend is accelerated by advances in wearable systems that enable unrestricted monitoring of physiological modalities and cognitive states. Concerns regarding limitations of subjective questionnaires [95] and the need for fine and real-time assessments that do not disrupt the operators' concentration and task flow support this growth [34], [52].
Cardiac activity is the most frequently used physiological modality for intraoperative stress assessment (Table IV), and has been acquired by wearable devices in 25 (31.21%) reviewed papers; this figure far exceeds the use of wearables for other modalities. The systematic review in [17] categorizes HRV as a real-time assessment tool; however, since reductions in HRV caused by increased mental strain are detected after 3-5 minutes [6], [91], its use raises challenges for precise and real-time detection of stress and workload across different surgical phases. The same holds true for other measures of sympathetic and parasympathetic activities [146], [147] used in 45 studies (63.38%) (Table IV). For example, stress-related changes in plasma and cortisol are not synchronized with mood variations, and their correlations rarely exceed 0.2 due to different temporal system dynamics [146], [147]. Furthermore, results summarized in Column K of the Supplementary Table should be interpreted with great care since a recent meta-analysis found a small effect size between ECG-computed HRV and outputs of portable devices, as well as a significant error that is sensitive to temporal and spectral features and recording positions [134]. In addition, sleep deprivation and caffeine consumption need to be carefully considered as confounding factors for cardiovascular and cortisol measures [148], [149].
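The latency issue can be made tangible with a small sketch of time-domain HRV computed over fixed windows: each value summarizes several minutes of beats rather than a single instant. RMSSD and SDNN are standard time-domain HRV metrics, but their use here is illustrative and not tied to any particular reviewed study. The 180 s default window simply mirrors the lower end of the 3-5 minute range mentioned above, and the function names are hypothetical.

```python
import math
import statistics


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    if not diffs:
        raise ValueError("need at least two RR intervals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def sdnn(rr_ms):
    """Sample standard deviation of RR intervals (ms)."""
    return statistics.stdev(rr_ms)


def windowed_rmssd(rr_ms, window_s=180.0):
    """RMSSD over consecutive non-overlapping windows of about `window_s` seconds.

    Elapsed time is reconstructed by accumulating RR intervals, so each value
    summarizes a whole window rather than a single instant.
    """
    results, current, elapsed = [], [], 0.0
    for rr in rr_ms:
        current.append(rr)
        elapsed += rr / 1000.0
        if elapsed >= window_s:
            results.append(rmssd(current))
            current, elapsed = [], 0.0
    return results  # a trailing partial window is dropped
```

With a 180 s window, a 20-minute surgical phase yields only about six RMSSD values. This is one way to see why window-based HRV is coarse for phase-level stress detection.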
Among the physiological modalities used, the high temporal resolution of electrophysiological signals such as EEG, EMG, and EOG is highly desirable for real-time monitoring of physiological and cognitive processes and human factors in interactive environments. However, EEG was recorded in only two reviewed papers for stress assessment, whereas more evidence exists about EEG correlates of stress in non-surgical contexts such as maritime training and construction [150]- [152]. Discrepancies between EEG-based stress studies, such as the lack of standardized protocols and variations in recorded brain regions, stressors, experiment duration, signal processing, feature extraction, and classification techniques, need to be addressed by the community [153]. Furthermore, although fNIRS cortical hemodynamics have relatively slow responses around 1 Hz, their superior spatial resolution makes hybrid EEG-fNIRS solutions attractive for multi-modal recordings [51]. The introduction of lightweight and wearable hybrid systems listed in [154] has resolved earlier concerns about long setup time in surgical environments [155]. Likewise, the slow task-evoked pupillary response with 0.5-1 s resolution, recorded by eye-tracking glasses or lenses attached to microsurgical equipment [143], can decode stress and workload without causing any discomfort for surgeons [62]. Recent progress in battery-operated wearables and the introduction of commercial solutions for synchronization, annotation, and feature extraction allow researchers to conduct long and multi-modal data acquisition for intraoperative stress monitoring, similar to experiments performed in clinical environments [156].
To the best of our knowledge, this is the first systematic review that addresses experimental design and data acquisition practices, including multi-modal data collection and synchronization, for intraoperative stress assessment. Integration of multiple behavioral, physiological, and audiovisual measurements has been encouraged in assessment of cognitive load [33], [40], intelligent decision-support and objective surgical skill assessment [35], and detecting the best time to deliver stress-reduction interventions [157]. Our analysis showed that 24 studies (33.80%) had collected one objective modality and a promising ratio of 28 papers (39.44%) had collected two or more. Multimodal data collection necessitates the implementation of data fusion techniques to overcome different sampling rates, physiological properties, spatio-temporal resolutions, sample sizes, and misaligned or non-synchronized streams [138]. Only 16 reviewed papers (22.54%) reported using manual or automated methods to synchronize and annotate videos and physiological activities with surgical events. Surgical event annotation is a time-consuming but essential task that has motivated several solutions to gather expert knowledge, annotate tools and actions, and improve reliability and model generalization [36], and should be widely exercised by the community. Finally, baseline data acquisition helps with interpreting observed effects considering inter-subject variability, and is even suggested by studies that did not incorporate baseline recordings [106]. A promising ratio of reviewed papers, 76.06%, performed baseline and multi-point collection of objective and subjective measures.
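One simple way to handle different sampling rates and misaligned streams is to resample every stream onto a shared time grid and attach event annotations to that grid. The sketch below does this with linear interpolation. It is a minimal illustration under the assumption that all streams already carry timestamps from a common clock; it is not a method reported by any reviewed study. The stream names, event labels, and function names are hypothetical.

```python
from bisect import bisect_right


def interpolate_at(times, values, t):
    """Linearly interpolate a stream at time t (times must be sorted ascending)."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the recorded interval")
    i = bisect_right(times, t)
    if i == len(times):  # t equals the last timestamp
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    return values[i - 1] + w * (values[i] - values[i - 1])


def align_streams(streams, step):
    """Resample several (times, values) streams onto one shared time grid.

    The grid spans only the interval covered by every stream.
    """
    start = max(times[0] for times, _ in streams.values())
    end = min(times[-1] for times, _ in streams.values())
    if start > end:
        raise ValueError("streams do not overlap in time")
    n = int((end - start) / step + 1e-9) + 1
    # Clamp to `end` so floating-point drift never leaves the common interval.
    grid = [min(start + k * step, end) for k in range(n)]
    aligned = {name: [interpolate_at(times, values, t) for t in grid]
               for name, (times, values) in streams.items()}
    return grid, aligned


def label_events(grid, events):
    """Attach the most recent annotated event label to each grid point."""
    event_times = [t for t, _ in events]
    labels = []
    for t in grid:
        i = bisect_right(event_times, t)
        labels.append(events[i - 1][1] if i > 0 else None)
    return labels


# Hypothetical example: heart rate at 1 Hz and skin conductance every 2 s.
streams = {
    "hr": ([0, 1, 2, 3, 4], [60, 62, 64, 66, 68]),
    "eda": ([0.5, 2.5, 4.5], [1.0, 3.0, 5.0]),
}
grid, aligned = align_streams(streams, step=0.5)
labels = label_events(grid, [(1.0, "incision"), (3.0, "suturing")])
```

Linear interpolation suits slowly varying signals. For fast electrophysiological signals, downsampling with an anti-aliasing filter would be the more appropriate choice.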
Investigation of data processing and analysis techniques constitutes another novel contribution of this systematic review. While the majority (92.96%) of selected papers used statistical methods, only 14.08% applied advanced regression models, and only one reviewed study conducted feature selection and classification for stress assessment [62]. Applying modern ML techniques in this context demands further research since, while multi-modal datasets enable information fusion from different temporal and spatial resolutions for decision making, the intercorrelation of different modalities may decrease the accuracy of their predictions [158], [159].
The majority of physiological metrics in Table IV have also been used for assessing stress and workload in aviation and transportation (e.g., fNIRS [160], cardiovascular and hormone measures, EEG, EMG, respiration, and eye movements [161], [162]). However, lack of freedom for operating in poorly optimized conditions, different regulations for simulation training, and unequal opportunities for receiving feedback and supervision from surgical peers and mentors result in important distinctions between these two fields [163], [164].
Finally, findings reported in this systematic review may be affected by a few methodological limitations. Papers that were not indexed in databases or did not include the designed search terms were inevitably not collected, resulting in the possible exclusion of studies that measured stress or workload but did not clarify this in their abstracts. Studies such as [165] were also excluded because they were not accessible until February 2021, although [165] had assessed workload in simulated and robot-assisted laparoscopy with NASA-TLX and EEG-based PSD, used a wearable set with step-wise linear discriminant analysis (LDA), and included a resting-state baseline. In addition, despite incorporating a rigorous methodology to extract study characteristics, vagueness or incompleteness in experimental procedures and variables in source papers may have resulted in missing detailed features. Lastly, study characteristics extracted using SPIDER were in line with the defined research questions, and features such as electrode types and locations, sampling frequency and precision, and duration of recordings were out of the scope of this study but will be analyzed in a follow-up meta-analysis.

V. CONCLUSION
By reviewing studies on intraoperative assessment of mental stress published in the past 20 years and analyzing their temporal and categorical developments, we have identified gaps in experimental design and data analysis to suggest potential research directions considering the current trends and upcoming requirements for the emerging field of surgical data science. The current systematic review has spanned several trending applications including wearable systems, body area/sensor networks, and informatics in biological and physiological systems and healthcare technologies. This work can attract a broad audience that includes, but is not limited to, researchers in the fields of biomedical engineering, affective computing, surgical human factors, human-computer interaction, surgical training, and smart ORs. Our work stimulates further meta-analyses that will help to present a higher-level view on stress consequences in intraoperative work.