Cognitive Load Monitoring With Wearables–Lessons Learned From a Machine Learning Challenge

To further extend the applicability of wearable sensors, methods for accurately extracting subtle psychological information from the sensor data are required. However, accessing subjective information in everyday life, such as cognitive load, remains challenging. To bring consensus on methods for cognitive load monitoring, a machine learning challenge was organized. The participants developed machine learning methods for cognitive load classification using wrist-worn physiological sensors’ data, namely heart rate, R-R intervals, skin conductance, and skin temperature. The data from subjects solving cognitive tasks of varying difficulty was used for the challenge. This article presents a systematic comparison and multi-strategic performance evaluation of the thirteen methods submitted to this challenge. A systematic comparison of preprocessing techniques, classification algorithms, and implementation techniques is presented. Performance variations for different task difficulty levels, different subjects, and different experiment periods are evaluated. The results indicate that the most robust methods used multimodal sensor data, classical classification approaches such as decision trees and support vector machines or their ensembles, and Bayesian optimization for hyperparameter tuning. The most accurate models used handcrafted features that were further selected using sequential backward floating search and evaluated using a stratified person-aware cross-validation strategy. Moreover, the results indicated better classification performance for specific test subjects and for the tasks with the highest difficulty and, in some cases, a dependence on the time elapsed since the start of the experiment. This dependency is likely due to model overfitting or to the subjective nature of the psychophysiological process. The intersubject variability in responses is difficult to capture through objective binary labels for cognitive load, thereby warranting more sophisticated annotation approaches.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
The availability of small, wearable, and low-cost sensors combined with advanced signal processing and information extraction capabilities is driving the revolution in mobile behavior monitoring for applications such as sports analytics, ambient-assisted living, and lifestyle monitoring [1]. The applicability of wearable sensors is enhanced by the extraction of subtle physiological information that can serve as the basis of psychological monitoring. However, assessing psychophysiological information in everyday life remains challenging [2] since the association of wearable sensor data to human psychophysiological states is not as explicit as it is for physical states. For instance, smartphones can count steps and distinguish human physical activities (e.g., running vs. walking), but cannot recognize emotions and other affective states (e.g., cognitive load). Additionally, the inability of humans to recognize their own psychophysiological states in a timely and accurate manner poses a challenge for the development of affect recognition systems.
The psychophysiological state addressed in this paper is cognitive load. It refers to the state of utilization of one's mental resources and is strongly related to attention. Mental resources are limited. A mentally demanding task deprives new tasks of resources. Consequently, the person cannot pay attention to these new tasks or must interrupt the current task. Wearable devices and mobile applications should be aware of the user's cognitive load when the user is occupied with a demanding task. This can prevent undesirable effects of attention-grabbing. For instance, nearly 25,000 lives are lost annually on EU roads, where the vast majority of accidents are caused by human error, often by a distracted driver. Intelligent solutions to detect cognitive load and other mental states, and provide a warning when needed, may decrease the loss of human lives, thereby contributing to the EU's goal of zero fatalities and severe injuries by 2050. Additionally, monitoring affective states can help improve mental wellbeing [4] and productivity (e.g., avoiding notifications while the user is in the optimal flow state) [5].
When humans experience a psychophysiological load in the form of a demanding task, the sympathetic nervous system is activated. Depending on the load intensity, this activation increases the heart rate, sweating rate, breathing rate, and blood pressure; the pupils dilate, the saliva flow decreases, the heartbeats become equidistant, and the blood flow is restricted from the extremities and redirected towards the vital organs. These signals can be measured accurately in controlled environments, such as hospitals, using specialized equipment. However, less obtrusive and less expensive devices are required to capture these signals in daily life through practical and large-scale experimentation [7]. Moreover, an ecological momentary assessment that reveals user experiences is necessary to infer mental states from such measurements in daily life [6]. Recent advances in sensing technology have enabled relatively unobtrusive vital sign monitoring, thereby bringing us closer to unobtrusive mental state monitoring [26]. A significant part of research in mental state recognition and monitoring with wearables focuses on mental stress. For instance, Mozos et al. [54] used wearable and sociometric sensors to detect stress using a standard stress induction protocol. Similarly, Gjoreski et al. [30] used commercially available Empatica wristbands to detect stress with up to 92% accuracy using heart rate variability, blood volume pulse, galvanic skin response (GSR), skin temperature, and acceleration. Stress often overlaps with cognitive load but can potentially be distinguished from it [27]. Inferring cognitive load from physiological signals is an important research field that has received less attention than the recognition of physical states and activities or the inference of several other psychological states (e.g., stress, affect).
To promote this field, a machine learning (ML) challenge was organized in which the participants built pipelines to infer cognitive load. Since the same dataset was used for all ML pipelines, the performances of the algorithms could be compared and the best methods for cognitive load inference could be ascertained.
This article makes the following contributions: i) it presents a systematic comparison of the approaches of the thirteen successful machine learning pipelines submitted to the aforementioned challenge, ii) it provides a detailed evaluation of their overall performance and their performance for different subjects, different tasks, and their difficulty levels, and iii) it summarizes the learnings from the challenge and presents them as suggestions for ML model development to infer cognitive load.

II. CHALLENGE DATASET DESCRIPTION
In order to collect physiological signals in situations where a subject is cognitively engaged, an experiment was conducted in which the subjects solved cognitive tasks of varying difficulty. The experiment was performed in a quiet, normal-temperature office with one subject at a time under the same circumstances. Twenty-three subjects (four female) were recruited through the institutional communication channels (e.g., mailing lists, social network posts) and personal links. Their mean age was 29.5 years. The subjects had various levels of educational qualification: high school (7), B.Sc. (6), M.Sc. (6), and Ph.D. (4). All subjects were (self-assessed) healthy adults, and no other criteria were used to limit participation. The subjects wore a commercial wristband (refer to Figure 1) on their non-dominant arm and sat on a comfortable chair in front of a computer monitor. The experiment session was recorded without any restrictions on the subject's hand gestures, thereby reproducing a sedentary work style.
The experiment protocol is depicted in Figure 2. The subjects were first briefed about the experiment. The remaining protocol comprised two sets of tests: cognitive capacity tests and cognitive load estimation tests. A demographic questionnaire was filled in between the two sets. Cognitive capacity tests consisted of n-back tasks where n ∈ {2, 3} (2B and 3B in Figure 2). An n-back task consisted of a 3 × 3 grid of cells, one of which was colored at each time step. The subjects decided whether the colored cell at a time step was the same as the one colored n steps earlier. The ratio of correct to incorrect answers indicated the cognitive capacity of the subject.
Cognitive load estimation tests comprised six elementary cognitive tasks (ECTs) (denoted by x in T_xy in Figure 2). These tasks are designed to elicit perceptual cognitive engagement and are often used to demonstrate individual differences among people [35]. Haapalainen et al. [9] developed software with these ECTs to assess visual-perception-based cognitive load factors. A variation of this software was utilized for the data collection. The six ECTs were: i) Gestalt Completion test (T_1) to identify incomplete drawings, ii) Hidden Pattern test (T_2) to identify whether a given model image is hidden in the composition of other images, iii) Finding A's test (T_3) to capture the speed of identification of the letter 'a' in a text, iv) Number Comparison test (T_4) to gauge the subject's speed of comparison of two multidigit numbers, v) Pursuit test (T_5) to visually track irregularly curved overlapping lines from the numbers on the left to the letters on the right side of a rectangle, and vi) Scattered X's test (T_6) to find the letter 'x' placed randomly and crowded by other letters. The first four ECTs were obtained from a manual for reference tests for cognitive factors [36], a popular standard for educational psychology research. The last two ECTs were originally devised by Thurstone and Thurstone [37]. Furthermore, each ECT had three variations in difficulty (easy, medium, and hard difficulty levels denoted by y in T_xy in Figure 2), which were presented in a randomized order. After each task, the subjects filled in a NASA-TLX [8] questionnaire to assess subjective cognitive load. The participants rested for three minutes after filling in each questionnaire.
The following wristband data was recorded with a 1 Hz sampling rate: R-R (or inter-beat) intervals, galvanic skin response (GSR), heart rate (HR), skin temperature (ST), barometer data, accelerometer data, and UV index data. However, the focus of the challenge was limited to the data from the following physiological sensors: R-R, GSR, ST, and HR. The data from the wristband was transmitted via Bluetooth and a mobile phone to a server for offline data analysis. Figure 3 depicts the signals for a subject in a single session. Segments of the original dataset affected by excessive noise were disregarded. The dataset used for the challenge consisted of 825 instances from 23 participants. The instances of rest were labelled as 'no load' whereas the task instances were labelled as 'cognitive load'. Each instance was composed of 30 seconds of data from four modalities: R-R, GSR, ST, and HR. The dataset was split into training and test datasets, with 632 instances from 18 subjects in the former and the remaining 193 instances from five subjects in the latter. In the training set, 49.6% of the instances had the label '0' or 'no load', hence leading to a nearly balanced dataset. Each subject's data was assigned a unique subject ID. Furthermore, the dataset is the first labeled dataset for cognitive load monitoring with a wristband and has been made publicly available following the ML challenge.

III. MACHINE LEARNING CHALLENGE
The goal of this challenge was to recognize two levels of cognitive load (cognitive load vs. no load) using four physiological signals: R-R, GSR, ST, and HR. The participants of the challenge had access to a labeled training dataset and an unlabeled test dataset. The participants developed ML pipelines that processed the sensor data, created models, and recognized the cognitive load. The problem was deliberately reduced to the binary recognition of whether a subject is engaged in a task (irrespective of whether the task is easy, medium, or hard) or resting, as previous efforts demonstrate that a fine-grained distinction among different cognitive load levels from physiological signals might be impossible [26], [32]. The results were presented at the UbiTtention workshop at the ACM UbiComp 2020 conference, and the three best-performing teams were rewarded. The following subsections describe the specifications of the ML pipelines submitted to this challenge in further detail.

A. METHODS
This subsection describes the methods adopted by the participants of the ML challenge to infer cognitive load. The challenge received thirteen submissions from nine different teams. In the following sections, each submission is regarded as a method and denoted by a Roman numeral. Further details on the teams are provided in the Appendix. Table 1 provides an overview of the methods. Nine methods involved preprocessing techniques such as standardization or normalization. Notably, more than half of them used subject-wise preprocessing. A majority, i.e., ten out of thirteen methods, are based on classical ML approaches, including tree-based algorithms (I, V, VI, X), support vector machines and their ensembles (II, III, IX), and logistic regression (IV, VIII, XIII). The remaining three are based on neural networks: a multilayer perceptron (VII), a recurrent neural network (XII), and an autoencoder based on a convolutional neural network (XI). However, only two of these three are end-to-end learning approaches. The small dataset size was noted as a major motivation for choosing classical ML approaches over approaches based on neural networks. To overcome the limitation posed by the dataset size during training, three methods adopted dataset augmentation techniques, whereas the transfer-learning-based approach in method XI used an external, yet similar, dataset to pretrain the model. Method VII utilized the Synthetic Minority Over-sampling Technique (SMOTE) to enlarge the dataset as well as to introduce variability. Meanwhile, in method XII, a particular class was upsampled to counteract the input-induced bias in the network. Method X used B-spline interpolation of instances to compensate for the effects of the low sampling frequency. All the methods considered the four modalities provided.
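The subject-wise preprocessing mentioned above can be illustrated with a minimal sketch. It is not taken from any submission: the function name `standardize_per_subject`, the toy heart-rate values, and the subject labels are assumptions made for illustration. Each value is z-scored using the mean and standard deviation of the subject it belongs to, so that differences in resting baselines between subjects are removed before classification.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_subject(values, subject_ids):
    """Z-score each value with the mean and std of its own subject."""
    by_subject = defaultdict(list)
    for value, subject in zip(values, subject_ids):
        by_subject[subject].append(value)

    stats = {}
    for subject, subject_values in by_subject.items():
        sd = pstdev(subject_values)
        # Guard against constant signals, which would give a zero std.
        stats[subject] = (mean(subject_values), sd if sd > 0 else 1.0)

    return [(v - stats[s][0]) / stats[s][1]
            for v, s in zip(values, subject_ids)]


# Hypothetical heart-rate values for two subjects with different baselines.
hr = [60.0, 62.0, 64.0, 90.0, 95.0, 100.0]
subjects = ["a", "a", "a", "b", "b", "b"]
z = standardize_per_subject(hr, subjects)
```

After this step both subjects have zero mean and unit variance, although their raw heart rates differ by about 30 beats per minute.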
A majority (eleven) of the methods involved handcrafted feature extraction. Among the extracted features, the prominent ones were time-domain statistical measures such as mean, variance, kurtosis, median, and sum, and frequency-domain measures such as the power spectral density ratio of heart rate variability. Several extracted features are modality-specific, e.g., skin conductance peak amplitudes derived from GSR, and heart rate variability in terms of the root mean square of successive differences derived from R-R intervals. The total number of extracted features varied between 4 and 129. However, six approaches did not utilize all the extracted features. Instead, feature selection techniques such as the maximal information coefficient, sequential forward or backward floating selection, and the Gini impurity method were used to select the most informative features. Method V performed feature extraction and selection using an in-house feature discovery platform. ML algorithms rely heavily on hyperparameters; hence, hyperparameter optimization plays a vital role. Three methods optimized the hyperparameters, using a grid search (IX), Bayesian optimization (I), and a combination of the two (III).
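As a rough illustration of the handcrafted features described above, the following sketch computes a few time-domain statistics and the root mean square of successive differences (RMSSD) for one instance. The function names, the selected feature set, and the sample values are assumptions for illustration; the actual submissions extracted between 4 and 129 features, and R-R intervals are assumed here to be in milliseconds.

```python
import math
from statistics import mean, median, pvariance


def rmssd(rr_intervals):
    """Root mean square of successive differences of R-R intervals."""
    diffs = [b - a for a, b in zip(rr_intervals, rr_intervals[1:])]
    return math.sqrt(mean([d * d for d in diffs]))


def window_features(hr, rr, gsr, st):
    """Illustrative feature vector for one instance (one signal window)."""
    return {
        "hr_mean": mean(hr),
        "hr_var": pvariance(hr),
        "rr_rmssd": rmssd(rr),
        "gsr_median": median(gsr),
        "gsr_sum": sum(gsr),
        "st_mean": mean(st),
    }


# Hypothetical, shortened window; a real instance holds 30 s of data.
features = window_features(
    hr=[70.0, 72.0, 74.0],
    rr=[800.0, 810.0, 790.0, 805.0],
    gsr=[0.30, 0.35, 0.32],
    st=[32.1, 32.2, 32.2],
)
```

A feature selection step, such as the sequential floating selection used by several methods, would then operate on dictionaries like `features` collected over all training instances.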

B. IMPLEMENTATION FRAMEWORK
Python is the most prominent programming language used by the participants, and the scikit-learn library is commonly used for classical ML algorithms. The hyperparameter optimization library hyperopt is utilized in two methods. The models are internally evaluated on a validation set. Twelve out of thirteen methods mention the use of a cross-validation strategy for evaluation. Most of them used the leave-k-subjects-out strategy, while others used a leave-k-folds-out strategy or a combination of both (refer to Table 2). The resulting models vary in size depending on the algorithm. The logistic regression model developed in method IV resulted in the smallest size (845 B), whereas the convolutional neural network model developed in method XI resulted in the largest size (37 MB).
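The leave-k-subjects-out strategy can be sketched in a few lines of plain Python. The generator below is an illustrative assumption, not code from any submission; `leave_k_subjects_out` and the toy subject IDs are hypothetical names. The key property is that all instances of a held-out subject land in the validation fold, so no subject contributes to both training and validation.

```python
def leave_k_subjects_out(subject_ids, k):
    """Yield (train_idx, val_idx) pairs; k subjects are held out per fold."""
    unique_subjects = sorted(set(subject_ids))
    for start in range(0, len(unique_subjects), k):
        held_out = set(unique_subjects[start:start + k])
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        val_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        yield train_idx, val_idx


# Hypothetical subject IDs, one entry per instance.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4"]
folds = list(leave_k_subjects_out(subject_ids, k=2))
```

In scikit-learn, the group-aware splitters `GroupKFold`, `LeaveOneGroupOut`, and `LeavePGroupsOut` serve a similar purpose when subject IDs are passed as the `groups` argument.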

IV. CLASSIFICATION PERFORMANCE EVALUATION
We evaluated the methods on the test dataset using various strategies: i) Overall Classification Performance: Average binary classification accuracies of the methods on the entire test dataset are computed. Further, the highest achievable performance is obtained through voting ensembles of multiple methods. ii) Subject-Related Performance: Binary classification accuracy is computed for the five test subjects. This evaluation strategy potentially depicts the user-generalization capability of the model. iii) Task-Difficulty-Related Performance: This strategy focuses on binary classification accuracy for the three task difficulty levels. This strategy depicts the variation of classification complexity based on task difficulty. iv) Experiment-Period-Related Performance: This strategy focuses on binary classification accuracy for each of the two halves of the experiment period, potentially depicting the influence of the duration of the experiment on the performance of the model.

A. EVALUATION METRICS
The methods are evaluated on the instances in the test dataset. One of the following two performance metrics is used depending on the aforementioned strategies: accuracy (Acc) for the first evaluation strategy and partial accuracy (pAcc) for the remaining strategies. Accuracy is the standard ML score defined as:

$$\mathrm{Acc} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}$$

Partial accuracy is used for the remaining evaluation strategies, and is accuracy calculated over a subset of instances x as:

$$\mathrm{pAcc}(x) = \frac{\text{number of correctly classified instances in } x}{\text{number of instances in } x}$$

Depending on the evaluation strategy, x can represent any of the following: instances from a test subject, instances from a task with specific difficulty (e.g., rest, easy, medium, or hard), or instances from a portion of the experimental period (e.g., first half vs. second half). Though the ML methods are initially developed for binary classification (rest vs. cognitive load), the partial accuracy allows for a finer granularity in the analysis of the methods. Additional evaluation scores such as precision, recall, and F1-score for overall performance are presented in the appendix.

Table 3 presents the average accuracy achieved by each method on the test dataset. The accuracies spread gradually from the baseline of 0.5 to the highest accuracy of 0.69. However, none of the methods significantly outperformed the rest. The top-ranked method resulted in an accuracy of 0.694, which is 0.15 higher than the second-best method and 0.2 higher than the third-ranked method. Table 4 presents the accuracies achieved by voting ensembles of the top-x ranked methods. The highest accuracy of 0.71 is achieved using a voting ensemble of the top-3 methods.

Table 5 presents the partial accuracy per subject in the test dataset for each of the methods. The results are seen to be subject-dependent, and most of the methods perform well for specific subjects (e.g., subjects with IDs iz3x1 and bd47a). For subjects 3caqi and f1gjp, most of the methods do not perform well. The dependency on the subjects is less obvious for the higher-ranked methods than for the lower-ranked methods. For instance, method I achieved the highest per-subject accuracy of 0.789 and the lowest per-subject accuracy of 0.615, resulting in a difference of 0.174. This difference is much higher for the rest of the methods, including the second-ranked and the third-ranked methods. This indicates good user-generalization capabilities of method I.

Table 6 presents the partial accuracy per designed task difficulty. The results show that most of the high-ranked methods perform better for the instances belonging to higher task difficulty. The exceptions to this are methods III, V, and XI. The rest periods are the most challenging to detect for all of the methods. Since the difficulty levels are presented in a random order, the rest periods are further analyzed by segregating them based on the preceding task difficulty to identify whether the prior difficulty influences the accuracy of rest detection. Table 7 presents the partial accuracy for rest periods that follow easy, medium, and hard tasks. No specific pattern is visible that would indicate an influence of task difficulty on rest period accuracies.

Table 8 presents the partial accuracy with respect to the experiment period, i.e., the first half of the experiment vs. the second half of the experiment. The results show that methods such as III and IV are sensitive to the experiment period, as they have a larger variation in the accuracies achieved for the two halves of the experiment in comparison with the other methods.
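The evaluation quantities used in this section (accuracy, partial accuracy over a subset of instances, and a majority-vote ensemble of several methods) can be sketched as follows. All labels, predictions, and names below are made up for illustration and do not reproduce any challenge results.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def partial_accuracy(y_true, y_pred, mask):
    """Accuracy restricted to the instances for which mask is True."""
    pairs = [(t, p) for t, p, m in zip(y_true, y_pred, mask) if m]
    return sum(t == p for t, p in pairs) / len(pairs)


def majority_vote(predictions):
    """Combine binary predictions of several methods (odd count avoids ties)."""
    n_methods = len(predictions)
    return [1 if sum(votes) * 2 > n_methods else 0
            for votes in zip(*predictions)]


# Toy labels (1 = cognitive load, 0 = no load) and hypothetical predictions.
y_true = [1, 0, 1, 1, 0, 0]
subject = ["a", "a", "a", "b", "b", "b"]
m1 = [1, 0, 0, 1, 1, 0]
m2 = [1, 1, 1, 1, 0, 0]
m3 = [0, 0, 1, 0, 0, 0]

ensemble = majority_vote([m1, m2, m3])
acc_m1 = accuracy(y_true, m1)
acc_ensemble = accuracy(y_true, ensemble)
pacc_m1_subject_a = partial_accuracy(y_true, m1, [s == "a" for s in subject])
```

In this toy case the three methods err on different instances, so the ensemble outperforms each of them individually. The same `partial_accuracy` call works for difficulty levels or experiment halves by changing the mask.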

C. POSSIBLE CAUSES OF OVERFITTING
Multi-strategic evaluation of the models uncovered possible influences of the training/test split on the performance. Table 5 depicted the dependency of the methods' performance on the subjects in the test dataset. The inter-subject performance variation is lower for the top-ranked methods, indicating higher generalizability. The performance variation of lower-ranked methods is likely a sign of overfitting, which researchers need to consider during model selection. One possible solution is to include the inter-subject performance variation as an additional optimization criterion during model training. Results in Table 8 depicted a higher sensitivity of low-ranked methods to the experimental period. This additionally indicates overfitting, where the experiment design influenced the ML models. Possible solutions to these problems include optimal tuning of the ML models and better feature selection methods to remove the features sensitive to the experiment period. Finally, a higher accuracy estimated on a validation set (or using cross-validation on the training data) compared to the test set accuracy indicates overfitting (refer to Figure 4). Such overfitting may appear when hyperparameter tuning is performed using the same cross-validation scheme that is used for evaluating the final models. A possible solution to this problem for small datasets could be a nested cross-validation approach. For larger datasets, the traditional train-validation-test splits are often sufficient.
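A nested cross-validation scheme of the kind suggested above can be sketched as follows. This is a minimal illustration under stated assumptions: the classifier is a toy one-dimensional k-nearest-neighbor model, the hyperparameter grid contains only k, the data are synthetic, and all function names are hypothetical. The inner loop selects the hyperparameter using training subjects only, and the outer loop reports accuracy on a subject that played no role in that selection.

```python
def knn_predict(train_x, train_y, query, k):
    """Toy 1-D k-nearest-neighbor classifier for binary labels."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if votes * 2 > k else 0


def subject_folds(subject_ids):
    """Leave-one-subject-out splits over positions in subject_ids."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield train, test


def fold_accuracy(x, y, train_idx, test_idx, k):
    train_x = [x[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx)
    return hits / len(test_idx)


def select_k(x, y, subject_ids, train_idx, k_grid):
    """Inner loop: choose k by subject-wise CV on the training indices only."""
    inner_subjects = [subject_ids[i] for i in train_idx]

    def inner_score(k):
        scores = [fold_accuracy(x, y,
                                [train_idx[j] for j in tr],
                                [train_idx[j] for j in va], k)
                  for tr, va in subject_folds(inner_subjects)]
        return sum(scores) / len(scores)

    return max(k_grid, key=inner_score)


def nested_cv(x, y, subject_ids, k_grid):
    """Outer loop: score each held-out subject with the k chosen without it."""
    outer_scores = []
    for outer_train, outer_test in subject_folds(subject_ids):
        best_k = select_k(x, y, subject_ids, outer_train, k_grid)
        outer_scores.append(fold_accuracy(x, y, outer_train, outer_test, best_k))
    return outer_scores


# Synthetic, well-separated data: four subjects with slightly shifted baselines.
x, y, subject_ids = [], [], []
for n, subject in enumerate(["s1", "s2", "s3", "s4"]):
    for base_x, label in zip([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1]):
        x.append(base_x + 0.05 * n)
        y.append(label)
        subject_ids.append(subject)

outer_scores = nested_cv(x, y, subject_ids, k_grid=[1, 3])
```

The mean of `outer_scores` is then an estimate of generalization performance that is not biased by the hyperparameter search, at the cost of repeating the search once per outer fold.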

D. METHOD SIMILARITY
Performing a statistical-significance analysis over the presented results is challenging since the methods are tested only once on the final test data. To provide some intuition about the differences between the methods, we performed hierarchical clustering using Euclidean distance and complete linkage applied over the methods' predictions (refer to Figure 5).
In Figure 5, the end-to-end learning methods (XI and XII) are separated from the homogeneous clusters, indicating that their predictions differ from those of the feature-engineering-based methods. Additionally, four of the top-5 methods (I, II, III, and V) belong to the same cluster.
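The clustering of methods by their predictions can be illustrated with a small, self-contained sketch of agglomerative clustering with Euclidean distance and complete linkage. The prediction vectors and the method names `m1` to `m4` are invented for illustration and are unrelated to the challenge submissions; libraries such as SciPy (`scipy.cluster.hierarchy.linkage` with `method="complete"`) provide the same procedure together with dendrogram plotting.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(vectors, n_clusters):
    """Merge the closest clusters until n_clusters remain.

    The distance between two clusters is the largest pairwise distance
    between their members (complete linkage).
    """
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical binary predictions of four methods on six test instances.
predictions = {
    "m1": [1, 0, 1, 1, 0, 0],
    "m2": [1, 0, 1, 1, 0, 1],
    "m3": [0, 1, 0, 0, 1, 1],
    "m4": [0, 1, 0, 0, 1, 0],
}
clusters = complete_linkage(predictions, n_clusters=2)
```

For binary predictions, the Euclidean distance between two methods is the square root of the number of instances on which they disagree, so methods end up in the same cluster when they make largely the same predictions.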

V. DISCUSSION AND LESSONS LEARNED
Our meta-analysis presented in the previous section reveals that a particular combination of data processing techniques is superior for cognitive load inference from physiological signals originating from wrist-worn devices. Namely, we observe that ensemble-based ML algorithms in conjunction with sequential backward floating search feature selection, Bayesian hyperparameter optimization, and evaluation based on stratified person-aware cross-validation outperform alternative approaches.
To move beyond the competition of different methods, and to guide future efforts in automated cognitive load inference, certain peculiarities of sensor data elicited during human cognitive engagement are listed below. They imply a particular manner in which cognitive load inference pipelines should be constructed. The inferences are as follows: i) The physiological response to increased cognitive load is relatively subtle, is represented by changes that may be symptomatic of other phenomena (e.g., a subject's health status, emotions, physical stress, etc.), and is prone to noise, especially when collected via inexpensive wearable sensors. Consequently, while deep-learning-based automatic feature extraction excels in several other domains, cognitive load inference still requires carefully handcrafted features and guided feature selection to keep the algorithm from attending to irrelevant signals. Accordingly, the three neural-network-based submissions are among the low-ranked methods. ii) The methods analyzed in this paper perform relatively well when a subject is highly cognitively engaged, yet fail when the subject is resting or engaged in an easy task. It appears that the physiological signal variation captured by commercial wearable devices is too small to allow fine-grained detection of cognitive load levels. These findings are in line with the related work [30], [32]. iii) Subject-related analysis reveals that a one-size-fits-all solution may not be feasible. Different approaches are successful when inferring the cognitive engagement of different subjects. Confounding variables likely related to a subject's demographics or personality may result in different physiological reactions. Hence, the development of a suitable ML model for a particular subject is an interesting avenue for future research. iv) The analyses demonstrate the need for a separate, well-founded evaluation set when physiological signals are considered. Despite the popularity and practicality of cross-validation, independent evaluation with well-stratified data initially separated from the training set is crucial to avoid unintentional overfitting.
Besides the observations presented so far, it should be noted that additional challenges exist for an in-the-wild cognitive load monitoring system. The dataset analyzed in this study was collected in a sedentary environment. On the other hand, Schmalfus et al. [14] explored the potential of wearable devices for mental workload detection in different physiological activity conditions. The study included 32 participants, 2 mental stressors and 4 physical stressors. The statistical analysis indicated that wearable devices are not fully capable of identifying mental workload when physical activity is present.
The tasks of our data collection experiments are geared specifically towards eliciting different levels of cognitive load. These tasks have been a part of the standard psychological toolbox since the 1940s, and their implementation (introduced by Haapalainen et al. [9]) used in this work has been considered by other studies as well (e.g., [26] and [16]), affirming that the stimulus of the experiment protocol was indeed cognitive load.
Physiological signals captured by the Microsoft Band wristband include heart activity-related signals, acceleration, skin temperature, and skin conductance. More than one confounding factor may affect the change in these signals. For instance, heart activity can increase due to a subject's health state, emotion, stress, and other factors. However, the relationship between the heart activity-related signals and cognitive load is well-documented in the existing literature (e.g., [16], [23]). To a certain extent, the relationships between cognitive load and skin conductance (e.g., [17] and [18]), as well as skin temperature ([16]), have also been researched.

VI. RELATED WORK
A variety of psychophysiological measures can be used for assessing cognitive states: electroencephalography (EEG), electrocardiogram (ECG), heart rate and heart rate variability, optical imaging, blood pressure, skin conductance, electromyography, thermal imaging, and pupillometry [10]. The majority of the efforts related to cognitive load monitoring with wearable sensors, however, have focused on EEG devices. This is a natural choice, as the brain is the most informative source for monitoring human psychological states using sensors. Usually, features are extracted from the EEG sensor data (e.g., the intensity of different frequency bands), and those features are analyzed using correlation analysis [11] or ML models (Naive Bayes, Linear Discriminant Analysis, SVM, Convolutional Neural Network (CNN), Logistic Regression) [19], [20], [25]. Moving further towards multimodal sensing, Jimenez-Molina et al. [23] explored photoplethysmography (PPG), EEG, temperature, and pupil dilation sensors to assess the mental workload of 61 participants during web browsing. In contrast to the studies based on physiological sensors, Chen and Epps [21] used gyroscope-based atomic head movement analysis for task load recognition. All of these studies involving EEG and head-mounted devices can be quite useful for cognitive load monitoring in movement-restricted environments, such as virtual-reality-based scenarios, but their application remains limited in real life.
Compared to head- and chest-mounted devices, wrist-worn devices are likely the least obtrusive because subjects are already accustomed to wearing wristwatches. Johannessen et al. [13] analyzed cognitive load in 5 physician team leaders during trauma resuscitation. They collected glasses-based eye-tracking data and wrist-based GSR and heart rate data during five trauma resuscitations. A correlation and regression analysis showed that multiple physiological measures should be employed to most accurately measure cognitive load in a real-world setting. Kohout et al. [24] proposed an approach for detecting cognitive load (relaxed vs. loaded) by collecting data from 8 participants wearing wrist sensors and additionally carrying a smartphone as a sensor in their pocket while performing a pill-sorting task. They stressed their participants by introducing a dual-task situation and used an SVM classifier to achieve 90% accuracy. Novak et al. used wristbands to infer cognitive load in a simulated driving environment [28]. Similarly, Gjoreski et al. combined physiological sensors with video-based sensors to detect increased cognitive load while driving [31]. Schaule et al. [29] used the same wristbands and an n-back task to elicit different levels of cognitive load among office workers.
Barua et al. [38] used the n-back task to assess cognitive load in drivers while measuring their physiological signals (ECG, GSR, respiration, EEG, electrooculography). The authors used various ML models, including k-nearest neighbor (k-NN), SVM, and random forest, for classifying cognitive load, and random forest outperformed the other methods. Yomna et al. [40] collected measurements of eye movements in drivers and combined them with data on braking, acceleration, and steering. Reasonable accuracies were obtained by using SVM and random forest methods for recognizing abnormal driving situations through the cognitive load of drivers. Fridman et al. [41] estimated cognitive load in real-life driving situations by employing vision-based methods applied to video recordings. The best-performing method was a 3D convolutional neural network. Appel et al. [42] experimented with participants in various game simulation environments, collecting data on interaction metrics, pupil dilation, eye-fixation behavior, and heart rate. A participant-specific random forest achieved the best accuracy in classifying cognitive load. Chen et al. [43] measured cognitive load by four methods: the subjective rating of task difficulty, task completion time, performance accuracy, and eye activity-based physiological measurement. ANOVA tests and Gaussian mixture model classification resulted in the best accuracy in classifying five levels of cognitive load. The authors noted that eye activity is the best measure for cognitive load due to its real-time accessibility. Nourbakhsh et al. [44] focused on GSR and eye blinks as their measurements of cognitive load. The participants in the study took an arithmetic test with four different difficulty levels while the measurements were taken. Naive Bayes achieved the best accuracy for binary classification, while SVM achieved the best accuracy for 4-level classification. Yin et al. [45] estimated three different levels of cognitive load from speech in a speaker-independent setting. The best accuracy was produced by a Gaussian mixture model with 256 mixtures using a background model with the maximum a posteriori estimation technique for different levels of cognitive load, using Mel-Frequency Cepstral Coefficients, prosodic features, acceleration features, and feature warping. Segbroeck [46] extracted static and dynamic features from speech to estimate three levels of cognitive load. By performing a feature-level fusion on various features (prosodic, spectral, voice quality, lexical information, speaking rate) with i-vector modelling, they produced better results than existing SVM models.
Furthermore, the least obtrusive approaches are those that infer cognitive load using remote sensing [26], [40], although such approaches are challenging. Cognitive load inference may also be beneficial in the future for people with various brain-related disorders, e.g., Parkinson's disease or multiple sclerosis [51], [52].
All of these studies demonstrate the usability of wearable sensors for monitoring cognitive load and related psychophysiological constructs (e.g., stress, distractions, etc.). Typically, in these studies, one novel approach is compared against a few baselines on a dataset that is not publicly available. In our study, thirteen novel methods were analyzed and evaluated against the same benchmark data, which is publicly available, thus allowing for reproducible and systematic advancement of the field.

VII. CONCLUSION
In this paper, we analyzed thirteen methods for cognitive load inference from wrist-worn physiological sensors that were submitted to an online ML challenge. The methods were compared and evaluated against the same benchmark data, and a systematic comparison was presented with respect to preprocessing techniques, dataset augmentation techniques, extracted features, feature selection algorithms, classification algorithms, hyperparameter optimization techniques, evaluation approaches, and technical implementation. This work also evaluated the impact of different task difficulty levels, different subjects, and different experiment periods on classification performance. Based on this performance evaluation, the most promising data processing blocks, including classification algorithms, were identified and summarized in Table 9. Weiser's vision of a computer that fully understands its subjects might appear to be wishful thinking in the early twenty-first century [48]. However, we believe that identifying the most promising approaches for cognitive load inference, as demonstrated in this paper through an unbiased analysis of solutions submitted to a global machine learning challenge, provides a sound basis for future work towards the realization of this vision.
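To make the subject-dependent evaluation mentioned above more concrete, the following minimal sketch illustrates leave-one-subject-out evaluation, one simple form of person-aware validation in which no subject contributes data to both the training and the test set. It is an illustration under stated assumptions and not the procedure of any submitted method: the function names, the single-feature threshold classifier, and the feature values are all hypothetical, the data are synthetic, and the folds are not stratified.

```python
from collections import defaultdict
from statistics import mean


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx); no subject is in both sets."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = [i for s, idxs in by_subject.items() if s != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def fit_threshold(values, labels):
    """Toy stand-in classifier: threshold halfway between the two class means."""
    low = mean(v for v, y in zip(values, labels) if y == 0)
    high = mean(v for v, y in zip(values, labels) if y == 1)
    return (low + high) / 2.0


def per_subject_accuracy(values, labels, subject_ids):
    """Train on all other subjects, then score the held-out subject."""
    scores = {}
    for subject, train_idx, test_idx in leave_one_subject_out(subject_ids):
        threshold = fit_threshold([values[i] for i in train_idx],
                                  [labels[i] for i in train_idx])
        # Predict "high load" (1) when the feature exceeds the threshold.
        correct = sum((values[i] > threshold) == bool(labels[i]) for i in test_idx)
        scores[subject] = correct / len(test_idx)
    return scores


# Synthetic, made-up values of one normalized physiological feature.
# Subject "s3" is constructed to respond atypically to the task.
values = [0.1, 0.2, 0.8, 0.9, 0.3, 0.4, 0.7, 0.6, 0.45, 0.55, 0.40, 0.60]
labels = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]
subjects = ["s1"] * 4 + ["s2"] * 4 + ["s3"] * 4

scores = per_subject_accuracy(values, labels, subjects)
print(scores)  # {'s1': 1.0, 's2': 1.0, 's3': 0.5}
```

In this toy setting, the model generalizes well to the two subjects whose responses resemble the training data and poorly to the atypical subject, which mirrors the kind of per-subject performance variation that a person-aware split is meant to expose.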

APPENDIX
See Tables 10 and 11.
He has worked with the Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia, ever since. He was a principal investigator with a number of international research projects on this topic. He is currently the Head of the Ambient Intelligence Group. His research interests include analysis of sensor and other data related to human health and behavior using machine learning. He was highly successful at several computer science competitions, such as the XPrize Tricorder competition, EvAAL competition and Sussex-Huawei Locomotion Challenge 2018-2020. He also served as the Chair for the Slovenian Artificial Intelligence Society for two terms.
MATJAŽ GAMS (Member, IEEE) received the Ph.D. degree. He is currently the Head of the Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, and a Professor of computer science with the University of Ljubljana and the Jožef Stefan Postgraduate School. His professional interests include intelligent systems, artificial intelligence, cognitive science, intelligent agents, electronic and mobile health, business intelligence, and information society. He is a member of several international program committees of scientific meetings, national and European strategic boards and institutions, and editorial boards of 11 journals, and is the Managing Director of the journal Informatica. His team won two activity recognition competitions and placed in the finals of the XPrize Tricorder Competition. He is also a member of the National Council of Slovenia, representing the field of science for the term from 2017 to 2022.
VELJKO PEJOVIĆ received the Ph.D. degree in computer science from the University of California at Santa Barbara, USA. Since 2015, he has been an Assistant Professor with the Faculty of Computer and Information Science, University of Ljubljana, Slovenia. Prior to this, he was a Research Fellow with the Department of Computer Science, University of Birmingham, U.K. His research interests include mobile computing, HCI, and resource-efficient computing. His work on mobile interruptibility won the Best Paper Nomination at ACM UbiComp 2014, while his work on epidemics modeling won the Orange D4D challenge in 2013. More about his research can be found at http://lrss.fri.uni-lj.si/Veljko/