Machine Learning and End-to-End Deep Learning for Monitoring Driver Distractions From Physiological and Visual Signals

It is only a matter of time until autonomous vehicles become ubiquitous; however, human driving supervision will remain a necessity for decades. To assess the driver’s ability to take control over the vehicle in critical scenarios, driver distractions can be monitored using wearable sensors or sensors that are embedded in the vehicle, such as video cameras. Which types of driving distractions can be sensed with which sensors is an open research question that this study attempts to answer. This study compared data from physiological sensors (palm electrodermal activity (pEDA), heart rate and breathing rate) and visual sensors (eye tracking, pupil diameter, nasal EDA (nEDA), emotional activation and facial action units (AUs)) for the detection of four types of distractions. The dataset was collected in a previous driving simulation study. The statistical tests showed that the most informative feature/modality for detecting driver distraction depends on the type of distraction, with emotional activation and AUs being the most promising. The experimental comparison of seven classical machine learning (ML) and seven end-to-end deep learning (DL) methods, which were evaluated on a separate test set of 10 subjects, showed that when classifying windows into distracted or not distracted, the highest F1-score of 79% was realized by the extreme gradient boosting (XGB) classifier using 60-second windows of AUs as input. When classifying complete driving sessions, XGB’s F1-score was 94%. The best-performing DL model was a spectro-temporal ResNet, which realized an F1-score of 75% when classifying segments and an F1-score of 87% when classifying complete driving sessions. Finally, this study identified and discussed problems, such as label jitter, scenario overfitting and unsatisfactory generalization performance, that may adversely affect related ML approaches.

take over control in difficult situations or when the vehicle requests. Thus, the detection of distracted driving would be especially valuable in future vehicles, at least until complete autonomy is realized. Affective computing (also called artificial emotional intelligence) is the ability of technical systems to recognize and process human affective states, which can be used to enrich and facilitate human-computer interaction (HCI) [1]. The recognition of the human physical state using sensors is now mature, e.g., every mobile device is now capable of recognizing activity based on acceleration sensors. However, it is rare for a device to be capable of recognizing human mental states, e.g., stress, mental health, cognitive load, and distractions. Thus, the recognition of the human mental state is the new frontier where the most important research is being conducted. It can be used for services that are directly related to the psychological state and for enhanced HCI.
An important application of affective computing is the detection of human driver distractions [3]. Regan et al. [4] consider driver distraction as a subcategory of driver inattention, which is defined as ''diversion of attention away from activities critical for safe driving toward a competing activity, which may result in insufficient or no attention to activities critical for safe driving'' (p. 1776). Hanowski et al. [5] present a list of tasks that could lead to diversion of attention. The list includes dialing and texting on the phone, reading, writing and route checking on a map [6].
Since diverted attention of the driver influences driving safety, the ability to sense the driver's mental state is crucial [3]. These data can be gathered using sensors that are worn by the driver or by using sensors that are embedded in the car, such as video cameras. The recorded data could include behavioral cues, such as facial expressions and gestures, or physiological parameters, such as the heart rate, respiration rate and electrodermal activity. By combining this information with driving parameters and contextual information, safety risks could be estimated and timely alerts could be issued to the driver to avoid accidents and save lives.
This paper presents machine-learning (ML)-based methods for detecting driver distraction using multimodal data. The main contributions of this paper are as follows:
• Statistical analysis for the identification of the best features and modalities for detecting each of four types of distraction, namely, cognitive, emotional, sensorimotor and mixed distraction.
• Comparison of classical ML and end-to-end deep learning (DL) models for driver distraction detection, including an analysis with respect to the size of the input window and the type of the input modality: AUs, emotional activation (EMO), heart rate (HR), breathing rate (BR), nasal electrodermal activity (nEDA) or palm EDA (pEDA).
• Identification and discussion of problems such as label jitter, scenario overfitting and generalization performance that may hinder related ML approaches.
The remainder of the paper is organized as follows: Section II presents the related work. Section III describes the data that are used in this work. Section IV presents the proposed ML and DL methods. Section V elaborates the experiments and the experimental results. Section VI discusses the results, and Section VII concludes the study.

II. RELATED WORK
When analyzing systems for detecting driver distraction, one should consider the distraction types, the input signals and the detection methods. Regarding the distraction types, Gomez et al. [8] argued that people differ in terms of their reactions to the same distraction during driving. Thus, the relationship among the distractions, the driver reaction and, consequently, traffic accidents is complex. The distractions can occur in visual, manual or cognitive ways [9], [10].
In this study, cognitive, emotional, sensorimotor and mixed distractions are analyzed.

A. INPUT SIGNALS
The input signals can be direct, namely, measured directly from the driver, or indirect, namely, measured from the vehicle. Vehicle acceleration, steering and braking activities are examples of indirect signals of the driver's state [1]. The related work suggests that methods that use indirect input can be informative for detecting driver distractions. Indirect detection methods rely on the vehicle behavior and are often implemented in recently produced cars. Aksjonov et al. [24] presented a method for detecting the driver's distraction by monitoring lane maintenance and speed performance on specified road segments. Saito et al. [25] proposed an assistance system for predicting the driver's state based on the lane departure duration. Apostoloff and Zelinsky [26] studied the driver's attention to the lane maintenance task.
Castignani et al. [27] developed a system, namely, Sense-Fleet, that can identify risky driving events by examining the acceleration, braking and steering activities of the driver. Similarly, Pavlidis et al. [7] presented a statistical analysis of the relation between driver distractions and the speed, acceleration, brake force, steering and lane position. Wang et al. [28] proposed a forward collision warning algorithm that depends on the driver's braking activity. However, if the vehicle is in an autonomous-driving mode, the indirect inputs will not reflect the driver's behavior; instead, they will reflect the behavior of the algorithm for autonomous driving. Additionally, cars may be easily retrofitted with systems that use direct inputs. Thus, this study focuses on input signals that are measured directly from the driver using physiological sensors and visual analysis.
The direct input signals can be divided into two subgroups: (i) visual measurements, such as eye gaze, pupil diameter, head pose, facial expressions and driving posture, and (ii) physiological signals, such as electroencephalogram (EEG), electrooculogram (EOG), electrocardiogram (ECG), electromyogram (EMG), photoplethysmogram (PPG) and electrodermal activity (EDA) signals. The physiological measurements provide important cues regarding the driver's state, such as his or her drowsiness and stress levels. Lee and Chung [11] evaluated eye tracking and PPG features with a dynamic Bayesian-network-based framework for the detection of driver drowsiness. Lin et al. [12] measured the drowsiness of the driver by using EEG signals. They decreased the number of EEG features by using the principal component analysis (PCA) method. Then, these features were fed into a linear regression model for the estimation of the drowsiness level. In addition to the EEG signals, Khushaba et al. [13] analyzed the drowsiness of the driver by using EOG and ECG signals.
Multiple modalities, such as ECG, EMG, EDA and the respiration rate, have also been used to detect the stress level [14]. In various studies, the physiological sensors were integrated into driving equipment. For example, Singh et al. [15] used ECG signals that were measured via electrodes that were placed on the seat and seatbelt. Similarly, Lee et al. [16] measured ECG signals via electrodes that were placed on the steering wheel. Additionally, they derived the respiratory rate and HR variability from the ECG signals and used PPG that was measured from the driver's finger.
Visual measurements give the driver more freedom than physiological measurements that are obtained using wearable sensors. Bergasa et al. [17] measured the degree of eye closure, the eye closure duration, the blinking and nodding frequencies, and the head pose and conducted eye tracking, and they used these data to estimate the driver's state. Omidyeganeh et al. [18] argued that yawning is an important characteristic for estimating driver drowsiness. They used face and mouth features to detect yawning. Vicente et al. [19] proposed an eyes-off/on-the-road detection system that is based on head pose and eye gaze estimation. Murphy-Chutorian and Trivedi [20] argued that the driver's head pose is a strong indicator of his or her current focus of attention. Similarly, Smith et al. [21] analyzed the driver's attention from head- and face-related features. The hand position was also proposed as an indicator for detecting driver distraction.
In the related studies, there is no consensus regarding the input signals for the detection of driver distraction. Thus, in this study, experiments with both data from physiological sensors and data from video-based sensors were made. The physiological data include pEDA, the HR and the BR. The visual data include nEDA (extracted from data that were captured using a thermal camera), eye tracking data (x-y positions and the pupil diameter), head pose, facial expressions and emotional activation.

B. DISTRACTION DETECTION METHODS
Sikander and Anwar [29] grouped the methods for detecting driver distraction into three subgroups: mathematical models, rule-based models and models that are based on ML algorithms. Most mathematical models are designed for predetermined setups, such as workplace and factory worker workloads. These models consider circadian cycles, sleep history, duration of sleep and wakefulness for the detection of fatigue and performance [30]. For example, the System for Aircrew Fatigue Evaluation (SAFE) is based on such mathematical models [31]. Regarding the rule-based systems, Lee et al. [16] derived if-then rules and applied kernel fuzzy-C-mean to detect driving distractions. Azim et al. [32] proposed two-layered rule-based systems that were based on eye and mouth state information, where each layer had its own if-then rules.
The most advanced methods for monitoring driving distractions are based on ML algorithms. These methods can be classical, deep or a combination of both classical and DL [29]. Goel et al. [10] evaluated random forest, Naïve Bayes, SVM and decision tree for the detection of driving distraction. Random forest outperformed all the other strategies. Lee et al. [23] analyzed hand movements that were detected by acceleration sensors in a smartwatch. They calculated features and fed them into a support vector machine (SVM) classifier.
In addition to the classical feature-based ML methods, end-to-end DL methods, namely, methods for which feature extraction is not required and raw inputs are fed into the models, were also proposed. Masood et al. [6] detected distractions and causes of the distractions by using CNNs. Majdi et al. [34] developed Drive-Net, which combines CNN and Random Forest for the detection of the distraction categories in images. Yan et al. [22] used CNNs to detect various driving postures in images, such as normal driving, cell phone call, eating and smoking. Hssayeni et al. [35] compared two approaches for distracted driving detection: the use of traditional handcrafted image-based features along with SVM and the use of features from three end-to-end CNNs, namely, AlexNet, VGG-16 and ResNet-152. ResNet and VGG-16 outperformed AlexNet by almost 10%. The feature-based SVM realized much lower accuracy than the CNNs. Similarly, Koesdwiady et al. [33] used VGG-19.
All the end-to-end DL approaches use image data as input and are based on available DL architectures that have been successfully applied on images (e.g., AlexNet, VGG-16 and ResNet-152), and most focus on only one architecture. The DL architectures in this study use 1D signals as inputs; thus, specialized DL architectures for multimodal time-series data were investigated. As few studies have been conducted on end-to-end learning on 1D signals, seven DL architectures were compared in this study. To the best of our knowledge, this is the first study on the detection of driver distraction that analyzes end-to-end learning on signals using 1D convolutions and long short-term memory neural networks (LSTMs). Additionally, the DL architectures were compared to state-of-the-art classical ML algorithms using an extensive set of features. Comparison among different features/modalities for detecting driver distraction with both the classical and the DL models was also made.

III. DATA DESCRIPTION
The experimental data are obtained from a study by Pavlidis et al. [7]. In the study, they analyzed the driving behaviors of 68 volunteers in a driving simulator under a variety of distractions. Each volunteer had several driving sessions, which included a normal driving session without distractions and sessions under cognitive, emotional, sensorimotor and mixed distractions. The experimental design and the specific stressors are presented in Table 1.
Pavlidis et al. [7] analyzed the relations between the distractions and various driving parameters, such as the speed, acceleration, brake force, steering and lane position. From the physiological response, only nEDA [36], [37] was analyzed. In this study, the overall physiological and affective responses in relation to the external distractions were analyzed. The physiological response includes nEDA, pEDA, HR, BR and eye tracking data. The affective response includes emotions, facial expressions and the head pose.
The physiological response, which was measured using physiological sensors, and the emotional response, which was extracted from facial videos using a software that outputs probability estimates for eight prototypical emotions, were already provided in the dataset. As an addition, the facial expressions in the form of AUs and the head pose were extracted using the facial-expression-analysis software that was presented in Hassan et al. [38], which is hereafter referred to as AUReader. AUReader estimates the intensities of 22 facial action units (AUs) using a dynamic state estimation framework that fuses viscoelastic models for facial muscle motion with facial shape and appearance information. AUs are basic facial movements that can be visually distinguished and are defined in the facial action coding system [39], [40]. AUs are produced by a single facial muscle or a group of facial muscles [39], [40], [41]. For example, AU12 represents the action of raising the lip corners (as in a smile) and is produced by the facial muscle 'zygomaticus major'; AU25 represents the mild parting of lips and is produced by either 'depressor labii inferioris' or 'orbicularis oris'; and AU27 represents the stretching or wide opening of the mouth, which is produced by the pterygoids and digastric muscles [39], [40], [41]. Images that show the expressions of AUs are available in [41]. In this study, each facial video in the dataset [7] was analyzed using AUReader to obtain the 3D head pose and AU intensity estimates for each frame in the video.

A. PREPROCESSING, FEATURE EXTRACTION AND CLASSICAL MACHINE LEARNING
After the extraction of AUs, 46 channels of information (see Table 2) were available: nEDA, pEDA, HR, BR and eye tracking data (4 channels), emotional response (8 emotions/channels) and AUReader data (30 channels). First, all channels were resampled with a sampling frequency of 1 Hz. Next, following the normalization procedure that was used by the dataset creators [1], all channels were normalized via an unsupervised person-specific approach. The normalization function is defined in terms of the following quantities: $O_{ij}$ represents the overall average value for the $i$-th person and the $j$-th channel, $S$ represents the raw data segments, and $S_n$ represents the normalized data segments. The normalized signals of each driving session were segmented into smaller windows.
FIGURE 1. Example EDA signal with two skin conductance responses (SCRs). The horizontal (red) dotted line on the first SCR represents the SCR amplitude. The vertical (red) dotted line on the first SCR represents the SCR duration.
Experiments were conducted with windows from 20 seconds up to 80 seconds with a stride of 5 seconds. The segmented data were used as the input to the DL models. For the classical ML models, features were extracted from the segmented data and were used as input to the models.
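The segmentation step can be illustrated with a minimal sketch. It assumes a single channel that was already resampled to 1 Hz and keeps only full windows; the function name `sliding_windows` and the handling of the trailing partial window are assumptions of this sketch, not details reported in the study.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], window_s: int,
                    stride_s: int = 5, fs_hz: int = 1) -> List[List[float]]:
    """Split a 1D signal into overlapping windows.

    window_s and stride_s are given in seconds; fs_hz is the sampling
    frequency (1 Hz after the resampling step described in the text).
    """
    window = window_s * fs_hz
    stride = stride_s * fs_hz
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Assumption: only full windows are kept, so a trailing partial
    # window is dropped.
    return [list(signal[start:start + window])
            for start in range(0, len(signal) - window + 1, stride)]


# Hypothetical 120-second channel sampled at 1 Hz.
channel = [float(t) for t in range(120)]
windows = sliding_windows(channel, window_s=60, stride_s=5)
print(len(windows), len(windows[0]), windows[1][0])
```

With a 60-second window and a 5-second stride, consecutive windows share 55 seconds of data, which is why neighboring instances are strongly correlated.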
For each window, the following statistical features were extracted from each channel: the mean, standard deviation, skewness, kurtosis, mean of the first derivative, mean of the second derivative, 25th and 75th percentiles, inter-quartile range, difference between the minimum and the maximum values and coefficient of variation.
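As an illustration, the per-window statistical features listed above could be computed as in the following sketch. The estimator choices (population moments, excess kurtosis, linearly interpolated percentiles, finite differences as derivatives) are assumptions, since the text does not specify them; the names `percentile` and `window_features` are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(sorted_x: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of sorted data."""
    pos = (len(sorted_x) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_x[lo] + (sorted_x[hi] - sorted_x[lo]) * (pos - lo)


def window_features(x: Sequence[float]) -> Dict[str, float]:
    """Statistical features of one window of one channel."""
    n = len(x)
    if n < 3:
        raise ValueError("at least 3 samples are needed for the second derivative")
    mean = sum(x) / n
    # Central moments, dividing by n (an assumed estimator).
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    std = math.sqrt(m2)
    # Finite differences approximate the first and second derivatives.
    d1 = [b - a for a, b in zip(x, x[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    xs = sorted(x)
    p25, p75 = percentile(xs, 25), percentile(xs, 75)
    return {
        "mean": mean,
        "std": std,
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if std > 0 else 0.0,  # excess kurtosis
        "mean_d1": sum(d1) / len(d1),
        "mean_d2": sum(d2) / len(d2),
        "p25": p25,
        "p75": p75,
        "iqr": p75 - p25,
        "range": max(x) - min(x),
        "cv": std / mean if mean != 0 else float("nan"),
    }


features = window_features([1.0, 2.0, 3.0, 4.0, 5.0])
print(features)
```

Applying `window_features` to every channel of a window and concatenating the results yields one feature vector per window for the classical ML models.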
Additional features were extracted for the pEDA and the nEDA signals using skin conductance response (SCR) analysis (see Figure 1). This type of feature/analysis is proven to be useful for the detection of stressful conditions in driving scenarios [14] and in practice [42]. The SCR features for each window were the power of the EDA signal, the number of SCRs per second, the power of the SCRs, the sum of the signals' components that have positive derivative, the ratio between the positive derivative and the negative derivative, the mean value of the derivative of the tonic component (the slowly changing EDA component), the mean value of the difference between the raw signal and the tonic component, the total spectral power of the signal in five frequency bands between 0 Hz and 0.6 Hz with a 0.1-Hz span, the amplitude increase of the largest SCR (from the SCR start time to the SCR peak), the amplitude decrease of the largest SCR peak, the largest SCR increase time, the largest SCR decrease time, the ratio of the increase time and the decrease time of the largest SCR peak, the largest SCR duration, the largest SCR peak increase and decrease slope, the average amplitude increase and decrease of all SCR peaks, and the average amplitude change of all SCRs.
For the classical models, the ML algorithms were used as implemented in the scikit-learn ML toolkit [43]. For each algorithm, parameter tuning was conducted using the following procedure: First, the parameter settings were randomly sampled from distributions that were predefined by an expert. Next, models were constructed with the specified parameters and evaluated using internal k-fold cross-validation on the training data. The search procedure was repeated 10 times. The averaged results are reported in Section V. Experiments were conducted with the following ML algorithms: decision tree [44], random forest (RF) [45], naïve Bayes [46], k-nearest neighbors (KNN) [47], support vector machine (SVM) [48], bagging [49], adaptive boosting (AdaBoost) [50] and extreme gradient boosting (XGB), which is an updated boosting algorithm. Decision trees were used as the base model for all the ensemble algorithms.

B. DEEP LEARNING
DL represents a class of ML algorithms that use a cascade of multiple layers of nonlinear processing units, which are typically neurons [51]. The first layer receives the input data, and each successive layer accepts the output from the previous layer as input.
The basic strategy dates back to 1943, when McCulloch and Pitts created the first computational model of neural networks (NNs), which was based on threshold logic [52]. Currently, large processing power and memory storage are relatively affordable, and DL models are used to solve complicated artificial intelligence (AI) tasks (e.g., in computer vision, language, biomedicine, and autonomous driving).

1) FULLY CONNECTED NEURAL NETWORKS
A fully connected (FC) NN is a cascade of multiple layers of nonlinear processing units, where each unit receives input from the previous layer. In a typical FC NN, layer $i$ computes an output vector $z_i$ as follows:
$$z_i = f(W_i z_{i-1} + b_i) \tag{2}$$
where $b_i$ (biases) and $W_i$ (weights) are the parameters for the $i$-th layer, $z_{i-1}$ is the output vector of the previous layer and $z_0$ is the input data. The activation function $f$ can be a rectified linear unit (ReLU) [53]:
$$f(x) = \max(0, x) \tag{3}$$
or another nonlinear function, such as sigmoid or tanh. For classification problems, the final output layer ($z^F_j$) typically uses a softmax activation function:
$$z^F_j = \frac{\exp(W_{F,j}\, z_{F-1} + b_{F,j})}{\sum_k \exp(W_{F,k}\, z_{F-1} + b_{F,k})} \tag{4}$$
where $j$ denotes the $j$-th row of the weights $W_F$ of the final layer. The softmax function has the following useful property:
$$\sum_j z^F_j = 1 \tag{5}$$
and it is always positive; thus, it can be used as an estimator for the probability that an input pattern $x$ belongs to the $j$-th class for a specified problem:
$$P(y = j \mid x) = z^F_j \tag{6}$$
The parameters of the network ($b$ and $W$) are learned using an optimization algorithm, such as gradient descent [54]. For a binary classification problem, the binary cross-entropy is typically used as a loss function, which is minimized over the pairs of input data/labels $(x, y)$ and predictions $p$.
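The forward pass of a small FC network and the binary cross-entropy loss can be sketched as follows. The weights, the 2-2-2 layer sizes and the helper names (`dense`, `relu`, `softmax`, `binary_cross_entropy`) are hypothetical and chosen only for illustration; the loss is written in its standard averaged form, which is an assumption because the text does not state the exact expression.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dense(W: Sequence[Sequence[float]], b: Sequence[float],
          z: Sequence[float]) -> Vector:
    """Affine map W z + b, with one output per row of W."""
    return [sum(w * v for w, v in zip(row, z)) + bias
            for row, bias in zip(W, b)]


def relu(a: Sequence[float]) -> Vector:
    """Rectified linear unit, applied element-wise."""
    return [max(0.0, v) for v in a]


def softmax(a: Sequence[float]) -> Vector:
    """Softmax; the maximum is subtracted for numerical stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(y_true: Sequence[int], p_pred: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over label/prediction pairs."""
    losses = []
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical weights for a tiny network with one hidden layer.
x = [1.0, -2.0]
hidden = relu(dense([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], x))
probs = softmax(dense([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0], hidden))
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
print(hidden, probs, loss)
```

The softmax outputs are positive and sum to one, which is the property that allows them to be read as class-probability estimates.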

2) CONVOLUTIONAL NEURAL NETWORKS
CNNs are a type of NNs that are designed with three main architectural strategies to ensure various degrees of shift-, scale- and distortion-invariance. This is realized by utilizing (i) local receptive fields, namely, each unit in a layer receives input from a set of neighboring units in the previous layers; (ii) shared weights, namely, units in a layer are organized in groups and all units in the same group share the same set of weights [57], [58]; and (iii) spatial or temporal sampling, namely, if the input is shifted, the feature map output will also be shifted [55]. In addition, due to the specified architecture (parameter sharing and local connections), the CNNs have far fewer connections and parameters to train, while their theoretical best performance is likely to be only slightly worse than that of FC NNs [56].
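The effect of local receptive fields and shared weights on a 1D signal can be illustrated with a minimal sketch. The kernel values and the input signal are hypothetical, and the operation is written as a cross-correlation without padding, which is how convolution layers are commonly implemented in DL toolkits.

```python
from typing import List, Sequence


def conv1d_valid(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D convolution layer without padding (cross-correlation form).

    Every output unit sees only len(kernel) neighboring inputs (local
    receptive field) and all output units reuse the same kernel
    (shared weights).
    """
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


# Hypothetical signal with one bump, and the same signal delayed by one sample.
signal = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]
kernel = [1.0, 0.0, -1.0]  # hypothetical weights: 3 parameters in total

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)
print(out)
print(out_shifted)
```

Shifting the input by one sample shifts the feature map by one sample, and the layer needs only three weights, whereas an FC layer that maps the same 8 inputs to 6 outputs would need 48.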

3) LONG SHORT-TERM MEMORY
Long short-term memory (LSTM) NNs are a type of recurrent neural networks (RNNs), which are networks with memory mechanisms that enable information to persist through time in the model. LSTMs were introduced by Hochreiter and Schmidhuber [67] in 1997. The main processing unit is an LSTM cell, which contains three main gates that regulate the internal cell state and the cell's output. The first gate decides what information should be forgotten (the forget gate) at time $t$. The decision is made by a sigmoid function $\sigma$, which is applied over the current input $x_t$ and the previous cell output $h_{t-1}$ (Equation 7). The output of the sigmoid function is a number that is between zero and one, where zero corresponds to no propagation.
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{7}$$
Here, $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors, and $W$ and $b$ with the corresponding subscripts denote the weights and biases of each gate.
Next, the input gate (Equation 8) decides, via another sigmoid function, what new input information will be written to the cell state. The candidate values $\hat{C}_t$ for the new cell state are calculated by a tanh layer (Equation 9). The output of the tanh layer is always between -1 and 1. The new cell state ($C_t$) is calculated by multiplying the old state $C_{t-1}$ by $f_t$ to forget some of the previous information and by adding the element-wise product $i_t * \hat{C}_t$, which consists of the candidate values $\hat{C}_t$, scaled by $i_t$ (Equation 10).
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{8}$$
$$\hat{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{9}$$
$$C_t = f_t * C_{t-1} + i_t * \hat{C}_t \tag{10}$$
Finally, the output gate decides which parts of the cell state ($C_t$) it is going to output (propagate) via another sigmoid layer (Equation 11), and the final output of the cell (Equation 12) is calculated by applying tanh on the current cell state and scaling it with $o_t$ from Equation (11).
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{11}$$
$$h_t = o_t * \tanh(C_t) \tag{12}$$
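A single time step of an LSTM cell, following the gate-by-gate description above, can be sketched as below. The standard LSTM formulation is assumed; the parameter names (`W_f`, `b_f` and so on), the dictionary layout and the all-zero example weights are hypothetical and serve only to make the computation easy to follow.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: List[Vector], b: Vector, v: Vector) -> Vector:
    """W v + b, with one output per row of W."""
    return [sum(w * u for w, u in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns the new output h_t and cell state C_t."""
    concat = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], concat)]
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], concat)]
    c_hat = [math.tanh(a) for a in affine(params["W_C"], params["b_C"], concat)]
    # Forget part of the old state and add the scaled candidate values.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], concat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical cell with one hidden unit, one input and all-zero parameters,
# so every gate outputs 0.5 and the candidate value is 0.
zero = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_C", "W_o")}
zero.update({name: [0.0] for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0], c_prev=[2.0], params=zero)
print(h_t, c_t)
```

In this example, half of the previous cell state is kept (2.0 becomes 1.0) and the output is half of tanh(1.0), which shows how the gates scale the information that flows through the cell.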

4) DEEP LEARNING ARCHITECTURES
DL has realized breakthrough performance in solving pattern recognition problems [59], especially in image processing [56], [60], [63] and natural language processing (NLP) [61], [62]. For example, DL was used to realize image super resolution [64]. In another study, DL was used for ''seeing in darkness'', which is a technique for reconstructing and brightening dark images [65]. For NLP, Google introduced BERT, a state-of-the-art method for ''language understanding'' [66]. However, DL architectures for signal processing have not yet realized such a breakthrough and designing them remains challenging, especially for problems with limited data. The layered structure of the NNs enables the construction of a variety of DL architectures by combining layers. For example, ConvLSTM combines CNN layers with LSTM layers, namely, the input is received by the CNN layers and propagated to the LSTM layers. In addition to the vertical stacking, one can also experiment with horizontal stacking. For example, for a 2-channel dataset, one can use a ConvLSTM for each channel and later fuse the outputs of the two ConvLSTMs using an FC layer. Which DL architecture is most suitable depends on the dataset; thus, extensive experimentation is required.
Figure 2 presents the two fusion approaches that are evaluated in the experiments. The early-fusion approach merges all 46 channels at the input regardless of the modality. Then, the merged input data are fed into DL layers. The DL layers can be FC layers, CNN layers or LSTM layers. The mid-fusion approach uses DL layers that are specific for each modality, and later, the outputs of the modality-specific layers are fused using general DL layers. The early-fusion approach learns shared weights for all input modalities, whereas the mid-fusion approach initially learns separate weights for each modality (represented by purple squares in Figure 2) and later learns shared weights (represented by orange squares in Figure 2).
An additional DL architecture that is evaluated in the experiments is the spectro-temporal ResNet (STRNet), which is an architecture that was successfully applied on sensor data in previous studies on human activity recognition from smartphone sensors [71], chronic heart failure detection from heart sounds [72], and blood pressure estimation from photoplethysmogram (PPG) data [73].
STRNet is a special type of mid-fusion network in which each modality is associated with two branches: one that evaluates the raw sensor signal in the time domain using residual blocks [74] and another that evaluates a spectral representation of the signal. Toward the end of the network, the two branches, namely, the spectral and the temporal branches of each modality, are merged using FC layers.
The structures of the DL architectures that are used in this study are presented in Table 3. There are three early-fusion architectures (eCNN, eLSTM and eConvLSTM) and four mid-fusion architectures (mCNN, mLSTM, mConvLSTM and STRNet). For example, the architecture ''2 x CNN(128) - FC(64)'' contains two CNN layers, each with 128 filters, and one FC layer with 64 neurons. N represents the number of input channels. All DL architectures contain batch normalization layers [75] to reduce the internal covariate shift, ReLU activation layers [53] to accelerate the training process, maximum pooling layers for dimensionality reduction, and a final softmax layer, which outputs the estimated class probability for distracted vs. not distracted driving. The DL architectures are available online at https://repo.ijs.si/martingjoreski/drivingdistractions/tree/master/DL%20architectures. All DL models were trained by minimizing the binary cross-entropy loss function using the Adam optimizer with a learning rate of $10^{-5}$ and a decay of $10^{-3}$. The batch size was set to 256 with a maximum number of training epochs of 30.

V. EXPERIMENTS
First, a statistical analysis of the input signals was conducted to analyze the relations between the modalities and the driving distractions. Next, ML analysis was conducted to compare classical ML and DL for the detection of driving distractions.
For the ML analysis, the data of the first 10 subjects were used as the test set (close to 20% of the overall data), and the data of the remaining subjects were used as the training set. Thus, the classifiers are subject-independent. Each ML algorithm was evaluated in the construction of two types of classifiers:
• A window classifier: Outputs a prediction whether distraction was detected for each input window (binary classification). This classifier would be useful for monitoring driver distractions in real time.
• A session classifier: Outputs only one prediction per driving session, namely, each driving session is classified as 'with distractions' or 'without distractions'. The decision is based on the predictions of the window classifier that are obtained using a threshold logic. The thresholds were optimized for each classifier using cross-validation on the training set. This classifier would be useful for the offline determination of whether a distraction was present during the past driving session.
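One possible form of the threshold logic of the session classifier is sketched below. The exact rule is not given in the text; this sketch assumes that a session is classified as 'with distractions' when the fraction of windows predicted as distracted reaches a threshold. The function name `classify_session` and the example threshold of 0.5 are hypothetical.

```python
from typing import Sequence


def classify_session(window_predictions: Sequence[int], threshold: float) -> int:
    """Aggregate binary window predictions into one session prediction.

    Assumed rule: return 1 ('with distractions') if the fraction of
    windows predicted as distracted is at least `threshold`, else 0.
    """
    if not window_predictions:
        raise ValueError("a session must contain at least one window")
    fraction = sum(window_predictions) / len(window_predictions)
    return int(fraction >= threshold)


# Hypothetical window predictions for two sessions.
print(classify_session([0, 1, 1, 0], threshold=0.5))
print(classify_session([0, 0, 0, 1], threshold=0.5))
```

In practice, the threshold would be selected per classifier, for example by cross-validation on the training set as described above.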
The ground-truth labels were determined using the following rules: (i) the instances for the window classifiers are labeled as positive, namely, distracted driving should be detected, if a distraction was present in at least 5 seconds of the input window and (ii) the instances for the session classifiers are labeled as positive if a distraction was present for at least 5 seconds of the overall driving session. For the window classifiers, one instance is one window (segment) that was extracted using an overlapping sliding window with a 5-second stride; thus, a prediction is output every 5 seconds.
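The window-labeling rule can be illustrated with a short sketch. It assumes that a per-second distraction flag (1 Hz) is available for each session; the function name `label_windows` and the example session are hypothetical.

```python
from typing import List, Sequence


def label_windows(distraction_flags: Sequence[int], window_s: int = 60,
                  stride_s: int = 5, min_seconds: int = 5) -> List[int]:
    """Label each sliding window of a 1 Hz per-second distraction flag.

    A window is positive (1) if a distraction was present in at least
    `min_seconds` seconds of that window, and negative (0) otherwise.
    """
    labels = []
    for start in range(0, len(distraction_flags) - window_s + 1, stride_s):
        seconds_distracted = sum(distraction_flags[start:start + window_s])
        labels.append(int(seconds_distracted >= min_seconds))
    return labels


# Hypothetical 120-second session with a distraction in seconds 100 to 119.
flags = [0] * 100 + [1] * 20
labels = label_windows(flags)
print(labels)
```

The first window that overlaps the distraction by exactly 5 seconds starts at second 45, so the first nine windows are negative and the last four are positive.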
For the session classifiers, one instance is one session. For example, Table 4 summarizes the experimental data (instances) that are produced after using an overlapping sliding window of 60 seconds with a stride of 5 seconds.
Experiments were conducted with window sizes from 20 to 80 seconds. The F1-score was used to evaluate the classifiers.
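For reference, the F1-score of the positive (distracted) class is the harmonic mean of precision and recall. A minimal sketch with hypothetical labels and predictions:

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1-score of the positive class for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions for six windows.
score = f1_score([1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 1])
print(score)
```

Here there are three true positives, one false positive and one false negative, so precision, recall and the F1-score all equal 0.75.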

A. INPUT ANALYSIS
In the initial dataset study [1], the authors showed that there is a statistically significant difference in the mean values of the nEDA when measured in the normal segments of the driving sessions, compared to the distracted segments of the same driving sessions. Inspired by that analysis, statistical tests were conducted in this study to determine whether such a statistically significant difference is present for the remaining features in the experiments. For the statistical analysis, the Wilcoxon signed-rank test was used, which is an alternative to the paired Student's t-test that does not require the t-test's normality assumption on the distribution of the paired differences. The Wilcoxon test is a non-parametric statistical hypothesis test that is used to determine whether two paired samples are sampled from the same distribution [76]. In this experimental setting, one sample contains values of a specified feature that was extracted from the normal segments of the driving sessions, and the other sample contains values for the same feature that were extracted from the distraction segments of the same driving session. Informative features should differ in terms of their distributions when conditioned on the type of the segment (with vs. without distraction). The tests showed that for 177 of the 562 features, the p-value was smaller than 0.001; these are named ''informative features''.
Table 5 presents the top three modalities for each type of driving session (ED, SD, CD, FDL and FDN) and for all driving sessions (Overall). The modalities are ranked using the ratio of informative features per modality. According to the table, nEDA is ranked among the top three for each driving session. When all driving sessions are considered and the statistical tests are conducted for normal segments vs. distraction segments, the two most informative modalities are the recognized emotions and nEDA, followed by the facial AUs in the third position.
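The ranking of modalities by their ratio of informative features can be sketched as follows. The feature names, the p-values and the naming convention (modality prefix before the underscore) are hypothetical; only the 0.001 significance threshold is taken from the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_modalities(p_values: Dict[str, float],
                    alpha: float = 0.001) -> List[Tuple[str, float]]:
    """Rank modalities by the share of their features with p < alpha.

    Assumption: each feature name has the form '<modality>_<feature>'.
    """
    total: Dict[str, int] = defaultdict(int)
    informative: Dict[str, int] = defaultdict(int)
    for feature, p in p_values.items():
        modality = feature.split("_", 1)[0]
        total[modality] += 1
        if p < alpha:
            informative[modality] += 1
    ratios = {m: informative[m] / total[m] for m in total}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical p-values of the paired test for six features.
example_p_values = {
    "nEDA_mean": 0.0001, "nEDA_std": 0.0004,
    "HR_mean": 0.2, "HR_std": 0.0002,
    "BR_mean": 0.5, "BR_p25": 0.03,
}
ranking = rank_modalities(example_p_values)
print(ranking)
```

Using a ratio rather than a raw count keeps modalities with many features, such as the AUs, from dominating the ranking only because of their size.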
Figure 3 presents the distributions of the most informative features, namely, the features with the smallest p-value for each type of driving session (ED, SD, CD, FDL, FDN) and for all driving sessions (Overall).
The distributions are represented as boxenplots (letter-value plots), which provide a better representation of the distribution of the data than boxplots when outlier values are present [77]. According to the figure, for recognizing an emotional distraction (ED), the most informative feature is the standard deviation of the activation of the emotion ''joy''. During the distraction segments, the subjects showed an increased standard deviation of this emotion. Second, for recognizing a sensorimotor distraction (SD), the most informative feature is the 25th percentile of the subjects' BR. During the distraction segments, the subjects had an increased BR. Third, for recognizing a cognitive distraction (CD), the most informative feature is the 75th percentile of intensities of AU25 ''Lips Part''. During the distraction segments, the subjects showed increased lip movement. This could be because the cognitive-distraction sessions involved speech, which, if true, may be regarded as an artifact of the dataset rather than a general finding. Fourth, for recognizing the mixed distractions in failure session FDL, the most informative feature is the standard deviation of the activation of the emotion ''joy'', which is the same as for the ED.
Fifth, for recognizing the brake failure in session FDN, the most informative feature is the first derivative of the tonic component of nEDA. An increased positive derivative corresponds to more sweating of the subjects during the brake failure. Finally, for recognizing general distractions, the most informative feature is the difference between the minimum and the maximum values of the activation of the emotion ''joy''. This may indicate that the subjects had stronger emotional responses during the distraction segments.

B. MACHINE-LEARNING ANALYSIS
In the initial experiments, seven classical ML algorithms and seven end-to-end DL algorithms were compared for the detection of driving distraction (binary classification). The eye tracking data were not used in these experiments because the data were missing for more than 50% of the sessions. An overlapping sliding window of 20 seconds with a 5-second stride was used in these experiments. The results are presented in Table 6. Column F1 presents the F1-scores that are realized by the window classifiers, and column F1-s presents the F1-scores that are realized by the session classifiers. According to the results, the highest scores are realized by the classical ML classifiers, namely, GB and XGB. The highest F1-score for the window classifiers is 73%, and the highest F1-score for the session classifiers (column F1-s) is 88%. Among the DL classifiers, eLSTM and STRNet have similar performance, with an F1-score of 67% realized by the window classifiers.
The eLSTM session classifier realized an F1-score of 75%, and the STRNet session classifier realized an F1-score of 80%. Compared to the classical classifiers, eLSTM and STRNet outperformed the KNN, NB, DT and Bagging classifiers and were outperformed by RF, GB and XGB. The experiments did not show a clear preference for the use of early or mid-fusion by the DL classifiers (denoted by the prefixes 'e' and 'm' in Table 3).
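The overlapping windowing used in these experiments (20-second windows with a 5-second stride) can be sketched as follows. The function name and the sampling-rate parameter are hypothetical; the study's preprocessing code is not reproduced here.

```python
def sliding_windows(signal, sampling_rate_hz, window_s=20, stride_s=5):
    """Yield (start_index, window) pairs for overlapping fixed-length windows.

    signal is a 1D sequence of samples; trailing samples that do not fill
    a complete window are discarded (an assumption of this sketch).
    """
    window_len = int(window_s * sampling_rate_hz)
    stride_len = int(stride_s * sampling_rate_hz)
    for start in range(0, len(signal) - window_len + 1, stride_len):
        yield start, signal[start:start + window_len]
```

With these defaults, consecutive windows share 15 seconds of data, so neighboring windows are strongly correlated.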
Next, a more detailed evaluation was conducted for the two best-performing classical classifiers and the two best-performing DL classifiers. Tests were conducted with various window sizes and input signals (modalities). The results are presented in Table 7. The first column presents the size of the input temporal segment in seconds (varied from 20 seconds to 80 seconds), the second column presents the ML algorithm, and the remaining columns present the F1-scores of the window classifiers (F1) and the F1-scores of the session classifiers (F1-s) for each of the input categories: face AUs, emotional activation (EMO), heart rate (HR), breathing (BR), nEDA and pEDA. For the column ''All'', all features/modalities were used as input to the classifiers. For the column ''Selected'', only the statistically significant features/modalities were used as input. According to Table 7, no classifiers perform well when only one of the physiological signals (pEDA, nEDA and BR) is used as input, except the session HR classifier. The classical classifiers outperform the DL classifiers overall. Regarding the window classifiers, the highest F1-score of 79% is realized by the two classical classifiers, namely, XGB and GB, using the AUs as an input with a window size of 60 seconds. Regarding the session classifiers, the highest F1-score (F1-s) of 94% is realized by XGB using the AUs as an input with a window size of 60 seconds. Hence, the visual modalities are the most informative modalities in the experimental dataset. Among the DL classifiers, the highest performance is realized by STRNet using the selected signals and a window size of 60 seconds. The F1-score of the window classifier is 75%, and the F1-score of the session classifier (F1-s) is 87%.
Regarding the size of the input windows, all classifiers perform better with longer windows (40 seconds to 80 seconds), which is probably because longer windows contain more information.
This is especially true for the DL classifiers. Figure 4 presents the precision-recall curves of the best-performing classifiers, namely, the window classifier and the session classifier that were built with XGB using AUs as input with a window size of 60 seconds. Such curves would be useful for the modification of the decision threshold. For example, in various cases, higher recall might be preferred even at the cost of lower precision, as undetected distractions (false negatives) might be more dangerous than falsely detected distractions (false positives).

TABLE 7. Evaluation results for the two best-performing classical classifiers and the two best-performing DL classifiers. The first column presents the size of the temporal segment in seconds, the second column presents the ML algorithm, and the remaining columns present the F1-scores (%) of the window classifiers (F1) and the F1-scores (%) of the session classifiers (F1-s) for each of the specified inputs.

FIGURE 4. Precision-recall curves of the best-performing window classifier (blue) and session classifier (orange) that are built with XGB. AP denotes the average precision, which is defined as $\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$, where $R_n$ and $P_n$ are the recall and precision, respectively, for the $n$th decision threshold.
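To make the use of precision-recall curves for threshold selection concrete, the following Python sketch computes the curve points, the average precision as the recall-weighted sum of precisions, and the highest decision threshold that reaches a required recall. All function names are hypothetical, and the labels and scores in any usage of this sketch are made up; it does not reproduce the results in Figure 4.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score,
    ordered from the highest threshold to the lowest.

    y_true holds binary labels (1 = distracted) and must contain at
    least one positive; scores are the classifier's confidence values.
    """
    total_pos = sum(y_true)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / (tp + fp), tp / total_pos))
    return points


def average_precision(points):
    """AP = sum_n (R_n - R_{n-1}) * P_n, with R_0 = 0."""
    ap, prev_recall = 0.0, 0.0
    for _, precision, recall in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def threshold_for_recall(points, min_recall):
    """Highest threshold whose recall reaches min_recall, or None."""
    for threshold, precision, recall in points:
        if recall >= min_recall:
            return threshold, precision, recall
    return None
```

Lowering the threshold until a target recall is reached trades false positives for fewer missed distractions, which matches the safety argument given above.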

VI. DISCUSSION
The best-performing classical ML classifiers outperformed the best-performing DL classifiers. There may be two main reasons for this: (i) The size of the dataset is not sufficient for the end-to-end learning to outperform the best-performing classical ML classifiers. According to Table 3, the models were trained on close to 20,000 instances. While this is a large number of instances compared to related affective computing studies, which typically use a few thousand instances, it is 750 times smaller than ImageNet, which is the dataset that is used to train state-of-the-art DL NNs for image processing.
(ii) DL excels in pattern recognition (e.g., image classification, object detection, and face recognition). In this use case, ''pattern recognition'', namely, emotion recognition and facial AU extraction, was conducted with other modules, and the extracted information was fed into both the classical and the DL classifiers. The access to this information probably gave the classical ML an edge as it can learn better from smaller datasets.
The STRNet consistently outperformed all other classifiers when using the breathing rate (BR) as input. This is likely because spectral-domain information is especially important in relation to BR, and STRNet is the only classifier that uses time- and spectral-domain information. Classical ML models use only statistical features (except the pEDA and nEDA features), and the other DL architectures use only signals in the time domain.
Feature selection can significantly influence the classification performance of the classical feature-based ML methods. In this study, ranking-based feature selection (also known as filter methods) was used, as it is computationally efficient and does not require a classifier for feature selection. However, filter methods estimate the quality of each feature separately; hence, they fail to consider useful feature combinations. This may be the reason why the classical ML models that were built with pre-selected features did not realize the best performance. Wrapper-based feature selection methods or combinations of filter- and wrapper-based methods [71] may be useful in this case.
Table 8 presents the percentages of correctly classified instances by the best-performing classifiers. The window classifier correctly classifies the windows from the normal driving sessions (ND and RD) with an average percentage of 92%, which is significantly higher than the average percentage of correctly classified windows from the distracted sessions (ED, SD, CD, FDL and FDN), which is 72%. This is probably due to the noise in the labels that is present in the windows from the distracted sessions. All windows from the normal driving sessions have the label ''normal''. However, to derive the labels of the windows from the distracted driving sessions, the following rule was used: if a distraction was present for at least 5 seconds of the window, the window should be classified as distracted, and it should be classified as normal otherwise. In various cases, the subject may require more than 5 seconds for the distraction to induce an affective response. Thus, due to the absence of an affective response, some distracted windows are indistinguishable from normal windows when analyzed using the physiological and the visual sensors. To mitigate this problem, one might use methods that explicitly incorporate label jitter into the model training process [78].
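The window-labeling rule described above can be sketched in Python as follows. The function name and the interval representation are hypothetical, and the sketch assumes that the 5 seconds are counted as the total overlap with non-overlapping distraction intervals; the text does not state whether the overlap must be contiguous.

```python
def label_window(window_start_s, window_end_s, distraction_intervals,
                 min_overlap_s=5.0):
    """Label a window as 'distracted' if distractions cover at least
    min_overlap_s seconds of it, and as 'normal' otherwise.

    distraction_intervals is a list of (start_s, end_s) tuples that are
    assumed not to overlap each other.
    """
    overlap = 0.0
    for start, end in distraction_intervals:
        # Length of the intersection between the window and this interval.
        overlap += max(0.0, min(window_end_s, end) - max(window_start_s, start))
    return "distracted" if overlap >= min_overlap_s else "normal"
```

For example, a 20-second window that ends 3 seconds after a distraction begins is labeled normal, while the next window (shifted by a 5-second stride) is labeled distracted even if the subject has not yet reacted, which illustrates how label jitter can arise.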
The label jitter may be why the performance on the physiological signals is worse than that on the visual signals for the detection of the driving distractions. The physiological signals may have a longer latency, namely, it may take longer for the distraction (stressor) to induce a change in the physiological signals than in the visual signals.
The session classifiers outperform the window classifiers mostly because the window classifiers must detect the exact time when the distraction occurred. In contrast, the timing is not important for the session classifiers; they need to detect only some of the distractions, which also mitigates the label-jitter problem.
For recognizing the cognitive distraction, the most informative feature was related to the driver's increased lip movement. This is expected since the cognitive-distraction sessions involved answering questions. One should be careful with using only this feature for the detection of distraction segments, as this finding may be regarded as scenario overfitting rather than a general finding. Another interesting finding is that for recognizing general distractions, the most informative feature is related to the activation of the emotion ''joy''. A more detailed analysis showed that this emotion had both higher average values and a higher standard deviation for the distracted (stressful) segments than for the normal segments. This may indicate that the normal driving sessions were more boring for the participants, whereas the driving sessions that contained distractions were more fun; as this was a driving simulation study, no distraction was regarded as dangerous by the subjects. These findings raise more general concerns regarding the generalization performance of systems that have been trained on a single dataset that was collected in a single environment. Such systems may classify any type of motion, speech, or emotional activation as a ''distraction'' because they were trained only for distraction detection.
In this study, all inputs were represented as 1D signals; thus, specialized DL architectures for time series were used. In the future, the methods should be compared with DL classifiers that detect driver distractions directly from images. Additionally, generative adversarial networks (GANs) and transfer learning may be used to improve the performance of the DL classifiers. Furthermore, since the best-performing classifier in this study was built using AUs, in the future, different fusion strategies can be tested for the extraction of higher-level semantic facial activities (e.g., speaking, listening, and concentrating).

VII. CONCLUSION
This paper analyzed which ML methods, sensors and data-capture methods perform best in detecting various driving distractions, with a focus on physiological sensors and sensors that are based on video cameras. The statistical analysis showed that the most informative feature/modality for detecting driver distraction depends on the type of distraction. Overall, the video-based modalities were most informative, and classical ML classifiers realized high performance using one of the video-based modalities. In contrast, the DL classifiers require more modalities, namely, either all modalities or pre-selected modalities, for the construction of useful classifiers. For the analyzed data, the classical ML (XGB using the AUs as an input with a window size of 60 seconds) realized high performance and outperformed DL methods; hence, the detection of driver distractions may be technically feasible with the current knowledge. A demo of the final ML classifier is available online. Finally, problems such as label jitter, scenario overfitting and unsatisfactory generalization performance were identified and discussed to provide guidance for future studies in this area.

APPENDIX
See Table 9.