Attention for Vision-Based Assistive and Automated Driving: A Review of Algorithms and Datasets

Driving safety has been a concern since the first cars appeared on the streets. Driver inattention was singled out early on as a major cause of accidents. This is hardly surprising, as drivers routinely perform other tasks in addition to controlling the vehicle. Decades of research into what causes lapses or misdirection of drivers’ attention resulted in improvements in road safety through better design of infrastructure, driver training programs, in-vehicle interfaces, and, more recently, the development of advanced driver assistance systems (ADAS) and driving automation. This review focuses on the methods for modeling and detecting spatio-temporal aspects of drivers’ attention, i.e. where and when they look, for the two latter categories of applications. We start with a brief theoretical background on human visual attention, methods for recording and measuring attention in the driving context, types of driver inattention, and factors causing it. We then discuss machine learning approaches for 1) modeling gaze for assistive and self-driving applications and 2) detecting gaze for driver monitoring. Following the overview of state-of-the-art models, we provide an extensive list of publicly available datasets that feature recordings of drivers’ gaze and other attention-related annotations. We conclude with a general overview of the remaining challenges, such as data availability and quality, evaluation methods, and the limited scope of attention modeling, and outline steps toward rectifying some of these issues. Categorized and annotated lists of the reviewed models and datasets are available at https://github.com/ykotseruba/attention_and_driving


I. INTRODUCTION
DRIVING, despite being commonplace, is a demanding activity that involves multiple concurrent tasks. Besides keeping the vehicle within the road boundaries, drivers observe other road users, anticipate potential hazards, and deal with distractions from both inside and outside the vehicle. Drivers rely primarily on vision to make decisions [1], thus understanding how drivers observe the scene, how it affects their reasoning, and what causes lapses in attention is crucial for ensuring road safety, especially given the existing evidence that temporary distractions and sub-optimal visual scanning skills increase risk of accidents [2], [3]. Technology for assistive and automated driving aims to reduce traffic accidents caused by human error, and significant progress has been made towards this goal in recent years. For example, advanced driver assistance systems (ADAS) are gradually becoming standard even in low- and mid-priced commercial vehicles. More than 30% of vehicles sold in the USA in 2016 were equipped with passive sensors, such as rearview cameras, parking proximity sensors, and blind-spot detection [4]. Active assistance features, e.g. lane departure detection, emergency braking, and adaptive cruise control, have become standard in more than 200 car models produced by major manufacturers in the past five years [5]. According to recent estimates, ADAS can potentially eliminate up to one-third of accidents caused by light vehicles on highways [6].
Although existing ADAS can detect specific hazards and automatically take measures to avoid imminent collisions, ultimately, they act independently of the drivers' state or intentions. Driver monitoring systems (DMS) offer a complementary approach to safety by estimating drivers' inattention to alert them or safely stop the vehicle if the driver is not responsive. Currently, most commercial DMS rely on vehicle measures such as steering or lateral control to assess drivers' state [7], however, the next generation monitoring systems will use in-vehicle cameras to observe drivers, analyze where they are looking, and issue warnings to direct their attention back to the road or towards critical objects/events.
Widespread deployment of vision-based DMS is necessary for partially- or highly-automated driving systems corresponding to SAE Levels 2-4 [8]. Past research shows that drivers who are not actively controlling the vehicle (e.g. when using full or partial automation) and instead perform a supervisory role are more prone to distractions [9], [10]. The safety of switching to manual control depends on whether the driver is distracted or fatigued [11], [12]; therefore, monitoring drivers' state and providing feedback is necessary. Together, ADAS and DMS are expected to offer significant improvements in road safety. For example, DMS have been included in the Euro NCAP 2025 roadmap to zero road fatalities by 2050 [13], and similar initiatives are likely to be proposed in other countries.
Finally, autonomous vehicles (AVs) are seen by many as the ultimate solution to eliminating some [14] and potentially all crashes [15] caused by driver error (as defined in [16]). Given recent successes of biologically-inspired attention mechanisms in various perceptual tasks [17], many self-driving approaches now incorporate attention to improve perceptual and decision-making abilities as well as their explainability.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In sum, vision-based assistive and autonomous driving solutions rely on sophisticated algorithms that observe and analyze drivers' behavior and relate it to the events unfolding in the traffic scenes. This review summarizes past works and current state-of-the-art in estimating and modeling drivers' attention, surveys publicly available datasets and discusses open problems. To limit the scope of the review, we focus on the algorithms that use machine learning techniques to model drivers' spatial and temporal attention allocation to objects and areas inside and outside the vehicle.
The paper is structured as follows. Section III provides a brief theoretical background on drivers' attention and inattention. In Section IV we discuss approaches for driver monitoring that rely on drivers' gaze or appearance for in-vehicle gaze estimation, inattention detection, action anticipation, and awareness estimation. Section V covers algorithms for modeling drivers' attention allocation in the traffic scene for assistive and autonomous driving applications. Section VI provides an extensive list of publicly available datasets that contain recordings of driver gaze and other attention-related annotations that enable the design, development, and evaluation of the models discussed in the previous sections. Finally, in Section VII we conclude the review with the general discussion of open problems and limitations of current research and suggest steps toward rectifying some of the issues.

II. LITERATURE SEARCH
To gather a representative set of papers for review, we conducted a thorough search using Google Scholar with the following query words: eye, gaze, fixation, glance, eye-tracker, attention, drowsiness, fatigue, inattention, distraction, and driver. We limited the search to papers published from 2010 to 2021 (inclusive) in premier intelligent transportation, robotics, and computer vision venues, including but not limited to Transactions on Intelligent Transportation Systems, Intelligent Vehicles Symposium (IV), International Conference on Intelligent Transportation Systems (ITSC), International Journal of Robotics Research (IJRR), International Conference on Intelligent Robots and Systems (IROS), International Conference on Robotics and Automation (ICRA), International Conference on Computer Vision and Pattern Recognition (CVPR), International Conference on Computer Vision (ICCV), European Conference on Computer Vision (ECCV). The choice of the past decade is motivated by growing interest in developing driving assistance and self-driving systems during this time period and recent breakthroughs in machine learning that promise to make such systems viable for broad deployment.
Since search terms include commonly used words, a large portion of the 3011 papers initially returned by the search engine was excluded as not relevant upon examining their titles and abstracts. We also excluded the following: 1) studies using modes of transportation other than cars (e.g. bicycles, motorcycles, trucks, buses, trains), 2) studies that rely only on indirect methods to assess drivers' attention (e.g. ego-vehicle sensor information), 3) studies that focused on drivers with medical issues or under the influence of alcohol or drugs, and 4) uncited papers over 5 years old. As a result, 204 papers were selected for this review.

III. THEORETICAL BACKGROUND
Due to space constraints, we cannot discuss all aspects of human visual attention. However, in the following section, we will provide a brief theoretical background helpful for understanding how attention is defined, recorded, measured, and operationalized for applications in the driving domain.
A. Drivers' Attention
1) What Is Attention, and Why Is It Needed?: Vision is a primary source of information for driving [1]. However, drivers do not process the entire scene at once and instead sequentially focus on its various elements. This is caused by the biological properties of human vision, where acuity (resolution) is highest in the center of the visual field (fovea and parafovea) and drops off towards the periphery due to the non-uniform distribution of receptors in the retina [18], [19]. Eye movements help bring portions of the scene into the central field for closer examination [20].
2) Types of Eye Movements: In the driving domain, gaze movements are commonly used as a proxy for attention. The literature we reviewed is dedicated to analyzing episodes when the gaze is held steady (fixations and glances) and transitions between them (saccades), while stabilizing eye movements and vergence were not considered. Fixations indicate gaze held at a single point and last from a fraction of a second to several seconds. Glance (termed dwell in psychology [21]) refers to gaze maintained within some area of interest (AOI). As defined in [22], a glance starts from the moment the gaze moves inside the AOI and lasts until it moves out. The durations of fixations and glances indicate which areas or objects the driver attended to [23], [24] inside the vehicle or in the traffic scene, whereas saccades are indicative of their intentions and decisions [25].
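The glance definition above can be illustrated with a short sketch. It assumes that gaze has already been coded as one AOI label per sample at a fixed sampling rate; the function name `extract_glances`, the label strings, and the 10 Hz rate are hypothetical and are not taken from [22].

```python
from itertools import groupby


def extract_glances(aoi_labels, sample_rate_hz):
    """Group consecutive samples with the same AOI label into glances.

    Returns a list of (aoi, start_time_s, duration_s) tuples.
    """
    glances = []
    index = 0
    for aoi, run in groupby(aoi_labels):
        length = len(list(run))
        # A glance spans the samples from gaze entering the AOI until it leaves.
        glances.append((aoi, index / sample_rate_hz, length / sample_rate_hz))
        index += length
    return glances


# Hypothetical per-sample AOI labels recorded at 10 Hz.
labels = ["road"] * 20 + ["mirror"] * 5 + ["road"] * 10
glances = extract_glances(labels, sample_rate_hz=10)
for aoi, start, duration in glances:
    print(f"{aoi}: start={start:.1f}s duration={duration:.1f}s")
```

With a low sampling rate, short glances may span only one or two samples or be missed entirely, which is one way the bias towards prolonged glances noted for video-based coding can arise.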
3) Types of Attention Mechanisms: Eye movements are determined by attentional control mechanisms subdivided into two groups: bottom-up and top-down [26]. The former is guided by the saliency of the objects or areas in the scene that attracts gaze [27], [28]. Top-down attention is driven by the task [29], i.e. it focuses on the objects or events relevant for the task, whereas salient but task-irrelevant stimuli have a lesser effect.
Both are likely involved in driving, but their relative contribution and interaction are still not fully understood. Experimental evidence points to the dominant role of task-based attention in driving [30]-[32]. At the same time, salient stimuli such as bright digital billboards also tend to attract drivers' involuntary attention even though they are irrelevant to driving [33], suggesting the presence of bottom-up influences.
4) Gaze Recording Equipment: Eye trackers provide the most accurate recordings of foveal vision. Tower-mounted models offer the highest precision and sampling rates [34] at the expense of significantly limiting subjects' head movements. Remote and head-mounted eye trackers have lower precision but allow normal head movement and are thus more suitable for experiments involving active control of the vehicle. At the same time, eye trackers remain expensive and susceptible to data loss due to calibration issues [35], [36]. Video cameras offer a cost-effective and nearly maintenance-free alternative to eye trackers but require labor-intensive manual coding to extract gaze information. This process involves annotating each video frame with a text label specifying the approximate drivers' gaze direction, typically subdivided into coarse areas of interest (AOIs) [37], e.g. rearview mirror, windshield, or speedometer. Multiple annotators are often employed to reduce errors caused by drivers' individual characteristics and subtle eye movements [2], [38]. Another source of error is the low sampling rate of the cameras, which may bias the data towards prolonged glances since short fixations and saccades may not be captured [37].

5) Effect of Recording Conditions: The choice of on-road versus in-lab conditions is a trade-off between realism and replicability. Recording gaze in an actual vehicle in traffic offers the most ecologically valid conditions, but driving simulators provide a more cost-effective solution that can be more reliably replicated across multiple subjects [39]. Therefore, results obtained in a driving simulator need to be verified against conclusions made using on-road data (validity). Absolute validity, i.e. the exact numerical match between measures obtained in simulation and on-road, is preferred to relative validity indicating similar trends, but both are acceptable [40].
Validity depends not only on the fidelity of the simulator (how accurately it reproduces the environment and vehicle controls) but also on the measures being considered. According to [41], most of the research focus thus far has been on validating driving performance measures (e.g. lane and speed maintenance, crash rate, etc.) and few studies examined attention-related measures. While measures such as hazard anticipation and fixation durations have been validated across different types of simulators [42], [43], comparisons between on-road vehicles and simulators have not been conclusive. For example, a study in [44] reports differences in road fixations, and [45] showed greater gaze dispersion in the simulator than on-road. In a recent experiment by Robbins et al. [46], mean fixation durations recorded in a high-fidelity driving simulation were similar to an on-road experiment but only for medium- to high-demand situations (such as turning at intersections).
The last result points to another factor affecting the validity of the results, the in-lab environment itself [47]. Numerous studies confirm that in-lab settings affect the transfer of findings to on-road conditions due to short session durations [48], [49], overexposure to rare events [50], low risk [51], small subject groups, and lack of diversity within them [52].

B. Drivers' Inattention
Due to the associated safety risks, most of the driving literature is dedicated to inattention rather than attention. According to the commonly accepted definition by Regan et al. [53], inattention during driving is operationalized as "insufficient, or no attention, to activities critical for safe driving".
1) Taxonomy of Inattention Types: Besides the definition of inattention, Regan et al. [53] provide a taxonomy of inattention types (Figure 1) that distinguishes between five subtypes of inattention: 1) restricted attention (due to physical obstructions or blinks), 2) misprioritized attention, 3) neglected attention (e.g. not checking the blind spot while changing lanes), 4) cursory attention (looking in the right direction but failing to process the information), and 5) diverted attention (distraction by driving-related or non-driving-related tasks and events). Restricted attention and attention diverted towards non-driving-related tasks are two types that have been investigated theoretically and modeled in practice (e.g. drowsiness [54]-[56] and distraction [7], [57]-[60]). Other types of inattention, such as misprioritized, cursory, or neglected attention, can only be identified in hindsight after a safety-critical situation has occurred and are less studied [61].
Figure 1 (taxonomy of inattention types, from [53]). Inattention types shown in bold are the focus of this review.
2) Types of Non-Driving-Related Tasks (NDRT): Two ways of grouping NDRTs have been proposed: by type (e.g. cell phone use, radio tuning, smoking), to determine which activities are more prevalent and pose more risk, and by demand, which includes primary modality (visual/auditory), interaction (active vs. passive), interruptibility (easy/difficult), and coding of information (verbal/spatial) [62]. Demand-based categorization is more common and better reflects what cognitive functions are affected. Based on modality, most tasks can be represented as a combination of one or more of the following [63]: 1) visual - requires averting gaze off the road (e.g. checking the speedometer); 2) cognitive - requires thinking (e.g. talking to the passenger or recalling information); 3) manual - requires taking hands off the wheel (e.g. smoking, drinking). Demand-based categorization agrees with evidence of limited attentional resources, wherein performance in multiple tasks is reduced when those tasks compete for the same resources [66]. For example, driving, as a visuo-manual activity, is affected by concurrent visual, manual, or visuo-manual tasks, although cognitive distractions can have a negative impact as well [67].

IV. ATTENTION ESTIMATION FOR DRIVER MONITORING
Vision-based DMS require an accurate estimate of drivers' attention towards areas inside the vehicle (to determine drivers' state and actions) and elements of the traffic scene (to identify what the driver is aware of). In this section, we review methods for in-vehicle gaze estimation (Section IV-A), inattention detection (Sections IV-B and IV-C), and action anticipation (Section IV-D) framed as classification problems. Methods for driver awareness estimation are discussed in Section IV-E.

A. In-Vehicle Gaze Estimation
The problem of in-vehicle gaze estimation is commonly framed as multi-class classification, i.e. categorizing features related to drivers' gaze or head position with respect to predefined areas of interest (AOIs) within the car interior. It is also possible to solve this problem analytically by determining the driver's 3D gaze direction and its intersection with the 3D model of the vehicle interior, but only a few approaches do so [68], [69]. Thus most research in this field focuses on finding the best combination of features (e.g. head pose, gaze) and classifiers. Figure 2 shows reviewed algorithms for in-vehicle gaze estimation grouped by features they use.
1) Input Sources and Feature Extraction: Specialized hardware such as eye trackers is useful for obtaining high-precision drivers' gaze direction and head pose (as in [70], [71]). Video cameras can extract similar data from images of drivers' faces but with lower precision and limited to predefined areas. At the same time, their low cost and lack of maintenance requirements make cameras more suitable for assistive and monitoring technology. Near-infrared (NIR) imaging cameras are also used in some works, alone [72] or combined with a visible light camera [73], to make the system suitable for night driving conditions. The feature extraction pipeline should ideally satisfy the following criteria: real-time runtime, a short processing chain to avoid accumulation of error, and informativeness of features for classification of gaze zone. When using camera images, the following series of steps can be followed to extract drivers' 3D gaze direction [68]: 1) detecting and tracking the driver's face; 2) detecting facial landmarks and eye features (cropped image of eyes, iris, and/or pupil location); 3) estimating head pose (roll, yaw, and pitch angles) from facial landmarks; 4) estimating the 3D gaze vector using eye and head pose models; 5) finding the intersection point between the 3D gaze direction and the 3D model of the vehicle interior. Some of these steps can be omitted or simplified by the use of machine learning techniques discussed below.
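The final step of the pipeline above, intersecting the 3D gaze direction with a model of the interior, can be sketched as a ray-plane intersection followed by a zone lookup. This is a minimal illustration under strong assumptions: the interior is reduced to a single flat plane with rectangular zones, and all names, coordinates, and zone boundaries are hypothetical rather than taken from [68].

```python
def intersect_ray_plane(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where a gaze ray hits a plane, or None if it misses."""
    denom = sum(d * n for d, n in zip(direction, plane_normal))
    if abs(denom) < eps:
        return None  # Gaze ray is parallel to the plane.
    diff = [p - o for p, o in zip(plane_point, origin)]
    t = sum(a * n for a, n in zip(diff, plane_normal)) / denom
    if t < 0:
        return None  # Plane lies behind the driver.
    return tuple(o + t * d for o, d in zip(origin, direction))


def gaze_zone(point, zones):
    """Return the name of the rectangular zone containing the hit point."""
    if point is None:
        return None
    x, y = point[0], point[1]
    for name, (x_min, x_max, y_min, y_max) in zones.items():
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return name
    return None


# Hypothetical layout: a plane at z = 1 m in front of the driver's eyes.
zones = {
    "windshield": (-1.0, 1.0, 0.0, 1.0),
    "speedometer": (0.0, 0.4, -0.5, -0.1),
}
eye_position = (0.0, 0.0, 0.0)
gaze_direction = (0.2, -0.3, 1.0)
hit = intersect_ray_plane(eye_position, gaze_direction, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
zone = gaze_zone(hit, zones)
print(hit, zone)
```

A realistic interior model would consist of many surfaces at different depths, and errors in the estimated gaze vector would propagate to the intersection point, which is part of why a short processing chain is desirable.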
2) Classifiers: In the literature, the minimal set of features used for this problem consists of facial landmarks [74], [75] or head poses extracted from the landmarks [72], [76]. These features can then be fed into a classifier to determine a gaze zone. The experimental results indicate that although their computation can be performed in real-time, facial landmarks and 3D head position features cannot reliably differentiate between neighboring zones, such as a speedometer and windshield [76] or side mirror and the window next to it [72]. In both cases, the drivers either made small head movements or did not move their heads and performed eye movements instead. Temporal filtering [74], aggregating features over time [76], and feature normalization [75] led to improved performance but ultimately did not resolve the issue, leading to the conclusion that eye features are also necessary. Fridman et al. [77] in a thorough study estimated that eye information contributes a 5.4% increase in average accuracy, and an even larger boost of 20% was reported in [78].
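One simple form of temporal filtering is a sliding-window majority vote over per-frame gaze zone predictions. The sketch below is only an assumption about what such a filter might look like; it is not the specific filter used in [74], and the function name, window size, and zone labels are hypothetical.

```python
from collections import Counter, deque


def smooth_zones(predictions, window=3):
    """Replace each per-frame zone label with the majority label
    among the most recent `window` frames (ties go to the older label)."""
    recent = deque(maxlen=window)
    smoothed = []
    for label in predictions:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# Hypothetical classifier output with a one-frame misclassification.
raw = ["road", "road", "speedometer", "road", "road",
       "mirror", "mirror", "mirror", "mirror"]
smoothed = smooth_zones(raw, window=3)
print(smoothed)
```

The filter removes isolated misclassifications but delays the detection of genuine zone changes by roughly half the window, and it cannot recover information that the features do not contain, which is consistent with the observation that filtering alone did not resolve the confusion between neighboring zones.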
Deep learning models have the advantage of combining feature extraction and classification steps. Instead of the explicit processing pipeline described above, a single convolutional neural network (CNN) pre-trained on the image classification task can be used to classify cropped frames from driver-facing videos [80]- [83]. These CNN-based models reach high accuracy and can discriminate adjacent areas better than previous methods that relied on hand-crafted features. In one study, Vora et al. [82] experimented with multiple face cropping methods and CNNs and determined that the upper half of the face provided optimal information. Another advantage of using CNNs is that they perform well even with uncropped images [81], thus saving computational costs associated with face detection and tracking.
3) Evaluation and Limitations: Since in-vehicle gaze estimation is a classification problem, metrics, such as accuracy, F1-score, and confusion matrix, are commonly used to evaluate the models. Accuracy and F1-score provide global performance assessment, while the confusion matrix shows accuracy per area of interest (AOI) and which areas are misclassified.
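The relationship between these metrics can be made concrete with a small sketch that derives accuracy and per-AOI F1-scores from the confusion matrix. The zone names and predictions below are invented for illustration and do not correspond to any reviewed model.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true AOIs, columns are predicted AOIs."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_f1(matrix):
    """F1-score for each AOI: 2TP / (2TP + FP + FN)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical ground truth and predictions for three AOIs.
labels = ["road", "speedometer", "mirror"]
y_true = ["road", "road", "road", "speedometer", "speedometer",
          "mirror", "mirror", "mirror"]
y_pred = ["road", "road", "speedometer", "speedometer", "road",
          "mirror", "mirror", "mirror"]
matrix = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(matrix)
f1 = per_class_f1(matrix)
print(matrix, acc, f1)
```

In this toy example, the global accuracy hides the fact that all errors come from confusing two neighboring zones, which only the confusion matrix and per-class scores reveal.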
In the literature, there are large differences in the number of AOIs defined inside the vehicle: from 2 zones (driving- and non-driving-related as in [71], [77]) to 18 [68], [72], with 6-8 being the most common (although the justification for the particular choice is rarely given). Naturally, more fine-grained zoning is challenging and leads to worse performance since it is more difficult to localize gaze within a smaller area [75]. As an alternative to fixed AOIs, Huang et al. [70] propose to cluster drivers' gaze into zones customized for individual drivers. The downside of this method is that some potentially important areas such as rear-view and side mirrors may be excluded if some drivers ignore them.
Although most models achieve high classification accuracy, some challenges remain. For example, some drivers attend to different zones by moving their heads while others move only their eyes, as noted in [72], [74] and more thoroughly investigated in [77]. Such individual differences can be captured by training the models on user-specific data rather than data aggregated across many drivers [70], [74], [75], but individual data may not be readily available. A more generic solution combining head pose and gaze direction information has been shown to mitigate this issue [76], [78]. Nevertheless, some adjacent zones remain difficult to distinguish, particularly windshield and speedometer, due to their proximity and the subtle eye or head movement required to switch between them [68], [69], [71], [77], [78]. More recent CNN-based models appear to suffer less from misclassifications of these kinds [82]. The presence of eyewear poses another challenge. Glasses introduce glare and occlusions, making it difficult to estimate gaze direction [81]. In [73], a pre-processing step is added to remove eyewear via a gaze-preserving generative network; however, it cannot handle thick glass frames and glare.
Figure legend: distraction types are cognitive, visual, manual, visuo-manual, and nonspecific. Algorithms marked with * use additional features, e.g. vehicle or context.

B. Distraction Detection
Timely and reliable driver inattention detection is a prerequisite for driver monitoring systems (DMS) that aim to improve road safety by alerting drivers to dangerous behaviors. Distraction detection algorithms summarized in Figure 3 exploit changes in drivers' gaze patterns caused by secondary task involvement, e.g. taking eyes off the road to look at their phone or cognitive distractions. Similar to in-vehicle gaze detection, distraction detection can be solved as a multi-class classification problem. Because of differences in behavioral changes depending on the kind of distraction, the majority of the algorithms focus either on detecting a specific distraction type or distinguishing between different distractions.
1) Types of Distractions: Cognitive and visuo-manual distractions (see Section III-B) are more commonly investigated NDRT types among the reviewed papers. Unlike visual and manual distractions, cognitive tasks are difficult to induce and verify. Furthermore, tasks used to test cognitive distractions vary significantly from study to study. Some imitate natural activities, e.g. conversations [84] and voice-based playlist retrieval [85], and some include artificial tasks such as math quizzes [65] and n-back tasks [86]. Visual-manual tasks considered in the studies are everyday activities, such as using a cell phone for reading [87], [88] or sending messages [65], [88], radio tuning [89]- [91], and selecting a song from the playlist [91]. Since NDRTs differ in how they affect the driver and how they manifest themselves visually, algorithms must be tested on a variety of tasks to ensure robust distraction detection [87], [88], [91].
2) Input Sources and Feature Extraction: Features such as gaze coordinates or coarse AOIs (see Section IV-A) are naturally indicative of visual distractions. According to the evidence from psychological studies, longitudinal [92]- [94] and lateral [95]- [99] vehicle control measures are also sensitive to various types of distractions. Therefore, ego-vehicle information, e.g. speed and steering wheel rotations, is often used in addition to visual features [65], [84], [90], [91], [100], [101]. Given that most secondary activities are not instantaneous and affect the temporal distribution of gaze differently, nearly all distraction detection algorithms use temporal data, with observation lengths ranging from 2 to 10s.
Approaches that rely on gaze data alone process raw gaze coordinates to compute various statistical functionals (e.g. mean, standard deviation, percentiles) and apply feature selection to find the optimal set of features for a specific distraction type. For instance, Wollmer et al. [91] showed that head rotation angle and its derivatives are sensitive to visual-manual tasks, whereas Liao et al. [101] determined that gaze locations are useful for detecting cognitive distractions.
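The computation of statistical functionals over an observation window can be sketched as follows. The signal, window length, and choice of functionals are hypothetical and are not taken from [91] or [101]; the percentile uses simple linear interpolation between order statistics.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def window_features(signal, window, step):
    """Compute functionals for each window of a one-dimensional gaze signal."""
    features = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = sorted(signal[start:start + window])
        features.append({
            "mean": statistics.fmean(chunk),
            "std": statistics.pstdev(chunk),
            "p10": percentile(chunk, 10),
            "p90": percentile(chunk, 90),
        })
    return features


# Hypothetical head yaw angles in degrees, one sample per time step.
yaw = [0, 0, 0, 0, 10, 20, 20, 20]
features = window_features(yaw, window=4, step=4)
for row in features:
    print(row)
```

Each window yields one feature vector, so a feature selection step or a classifier can then operate on these vectors instead of the raw coordinates.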
However, existing datasets provide only a limited set of non-driving related tasks (NDRTs), therefore, features and classification approaches tend to be optimized for a narrow range of distractions. A solution proposed in [111] is more versatile as it does not focus on specific activities, but rather relies on off-road glances to detect inattention. Motivated by evidence that long off-road glances increase crash risk [112], this model uses a 2s buffer to track distractions and issue sound alarms. The buffer shortens whenever the driver looks away from the road and lengthens when they look at the road again. Driving-related glances away from the road (e.g. towards mirrors) are treated with a latency of 1s to prevent issuing unnecessary warnings.
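The buffer mechanism described above can be sketched as follows. This is a loose illustration based only on the description in this section, not the algorithm from [111]: the refill rate, the handling of mirror glances, the alarm rule, and all names are assumptions.

```python
def run_attention_buffer(glance_targets, dt, capacity=2.0, mirror_latency=1.0):
    """Track a time buffer that drains during off-road glances.

    glance_targets: per-sample labels ("road", "mirror", or anything else).
    dt: time between samples in seconds.
    Returns (buffer value after each sample, sample indices where an alarm fires).
    """
    buffer = capacity
    mirror_time = 0.0
    history, alarms = [], []
    for i, target in enumerate(glance_targets):
        if target == "road":
            buffer = min(capacity, buffer + dt)  # Assumed refill rate: 1 s per s.
            mirror_time = 0.0
        elif target == "mirror":
            # Driving-related glance: start draining only after the latency.
            mirror_time += dt
            if mirror_time > mirror_latency:
                buffer = max(0.0, buffer - dt)
        else:
            buffer = max(0.0, buffer - dt)
            mirror_time = 0.0
        # Alarm once, at the moment the buffer becomes empty.
        if buffer == 0.0 and (not history or history[-1] > 0.0):
            alarms.append(i)
        history.append(buffer)
    return history, alarms


# Hypothetical glance sequence sampled every 0.5 s.
targets = ["road"] * 2 + ["mirror"] * 3 + ["phone"] * 4 + ["road"] * 2
history, alarms = run_attention_buffer(targets, dt=0.5)
print(history, alarms)
```

In this example, the mirror glance drains the buffer only after the 1 s latency has passed, and the alarm fires once the subsequent glance to the phone empties it.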

4) Evaluation and Limitations: Standard classification metrics, such as accuracy, precision, recall, and F1-score, are commonly used to evaluate distraction detection models. Most achieve high accuracy and F1-scores, often over 90%, some even close to 100% [88]. However, the results of different algorithms are not directly comparable due to the prevalent use of private unpublished data and the lack of public benchmarks for this problem. Although most authors specify the number of subjects, their age, gender, and driving experience, the volume and properties of the data used for the evaluation are often defined imprecisely and in different units, e.g. duration [101], [103], number of video frames [108], or number of events [107]. Direct comparisons are also affected by inconsistent recording conditions across studies, e.g. in-lab driving simulators [101], parked vehicles [87], or on-road settings [88].
Overall, despite encouraging results, the problem of distraction detection is far from being solved since drivers' gaze distribution depends on the context and is subject to individual differences. For example, different sets of features are needed for urban and highway roads [100], therefore thorough testing in various environments is desirable [91], [105]. Driving tasks, such as vehicle following and passing, also affect gaze distribution patterns and have to be taken into account when modeling distraction [110]. Although there are commonalities in how distractions manifest themselves in different drivers [89], the system should consider individual user characteristics for the best results [84], [105].
Although distraction detection algorithms are designed for use in vision-based ADAS, only a few of the reviewed systems have been verified in practice. For example, a monitoring algorithm proposed in [88] was tested in a driving simulator to validate the effect of sound alarms on engagement in non-driving related tasks (e.g. texting, reaching for objects, and eating). The subjects played a truck driving game while performing NDRTs at 30s intervals. Sound alarms reduced the number of accidents and traffic tickets; however, experimental conditions were far from realistic. A more extensive field study was conducted for the AttenD algorithm [106]. It involved 7 subjects who drove a test vehicle equipped with the system for one month. Despite the small subject pool, the overall changes to subjects' visual behavior were positive and pointed towards increased attention to the road ahead. Some issues were also exposed, such as data loss due to large head movements and excessive alarms caused by not taking into account drivers' intent to change lanes or brake (especially at lower speeds). Given that warning system acceptance by users is reduced significantly by false alarms [113], [114], it is important to consider HCI issues when developing monitoring algorithms and to conduct user studies in addition to evaluation on datasets. Refer to Section VII for further discussion.

C. Drowsiness Detection
Drowsiness detection methods rely on the driver's appearance to detect signs of fatigue such as frequent blinking, closed eyes, yawning, and nodding. Similar to distraction detection, detecting drowsiness is framed as a classification problem, either binary (drowsy/alert) or multi-class for more fine-grained alertness states.
1) Input and Feature Extraction: Most algorithms rely on driver-facing video cameras to detect drowsiness. Near-infrared (NIR) imaging cameras are often used (alone [118], [120], [128] or combined with visible imaging color cameras [130]) due to their versatility for day and night conditions and robustness to changes in illumination and low light conditions.
Detection of many drowsiness symptoms, such as slow blinking and yawning, is further improved by aggregating features across time [116]. Longer time intervals up to several minutes for the blink and eye closure features typically work best [149] but also increase the risk of missing microsleeps and "blank stares" [150], thus alerting the driver too late. Another approach is to use multiple measures, however, the design of the algorithm should account for different measures of drowsiness having different tendencies [135]. Inclusion of global context features such as continuous driving time, temperature, current time, and sleep duration has also been shown to further improve drowsiness detection [131], [137].
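Aggregating an eye closure feature across time can be sketched as a sliding-window fraction of closed-eye frames. The per-frame closure flags, window length, and function name below are hypothetical, and the sketch does not reproduce any specific measure from [116] or [149].

```python
from collections import deque


def closure_fraction(eye_closed, window):
    """For each frame, return the fraction of closed-eye frames
    among the most recent `window` frames."""
    recent = deque(maxlen=window)
    closed = 0
    fractions = []
    for flag in eye_closed:
        if len(recent) == window:
            closed -= recent[0]  # This frame is about to leave the window.
        recent.append(int(flag))
        closed += int(flag)
        fractions.append(closed / len(recent))
    return fractions


# Hypothetical per-frame eye state (1 = closed) with a short closure episode.
flags = [0, 0, 1, 1, 1, 0, 0, 0]
fractions = closure_fraction(flags, window=4)
print(fractions)
```

A longer window produces a smoother and more stable signal, but a brief closure contributes only a small fraction of the window, which illustrates why long aggregation intervals risk missing short events such as microsleeps.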
Recently, deep learning methods have been applied to drowsiness detection. For instance, Zhao et al. [146] use a deep belief network (DBN) to classify drowsy facial expressions using a concatenation of facial landmarks and features from cropped images of drivers' eyes and mouths. Instead of combining the features, Weng et al. [123] use three DBNs to encode mouth, head, and eye features, and HMMs to learn relationships between them for alert and drowsy states. To capture the temporal dimension, Shih et al. [121] aggregate per-frame features extracted via a CNN over 50 frames and feed them into a recurrent network, followed by additional temporal smoothing. Yu et al. [118] utilize 3D CNNs to extract generic spatio-temporal features in a single feed-forward pass. These features are then processed to extract specific information, such as the presence of eyewear and the condition of the head, mouth, and eyes. Specific and generic features are fused via a feed-forward network for final drowsiness classification.
3) Evaluation and Limitations: A significant limitation of research in this field is the prevalence of private datasets. The only widely used public dataset is Driver Drowsiness Detection (DDD) [123], however, as discussed later in Section VI, it is recorded in lab conditions while subjects act drowsy. Many private datasets are also recorded in artificial conditions, e.g. lab [119], [127], [133] or parked vehicles [124], [148], and only a few are captured in on-road conditions, typically highways and rural roads with little traffic [132], [137], [141]. One study used recordings of drowsy passengers instead of drivers due to safety concerns [146].
The lack of benchmarks makes it difficult to establish state-of-the-art performance for drowsiness detection, and the lack of publicly available naturalistic data makes it difficult to assess the suitability of algorithms for practical use. Despite reporting excellent results, many algorithms still struggle with certain aspects of the problem, notably extreme head angles [144], [146] and glare and occlusion from eyewear [117], [118], [143]. Individual differences across drivers can also diminish the accuracy of drowsiness detection. For example, specific signals like blink patterns are subject to considerable individual variations [141] and are difficult to detect when participants have smaller eyes [148]. Vehicle vibration and variability of driver positions with respect to cameras further exacerbate these problems [144]. Furthermore, some drivers do not show visible signs of drowsiness even when fatigued [130]. More diverse publicly available multi-modal datasets collected in naturalistic conditions are part of the solution to the problems listed above [124], [130]. On the algorithmic side, including more features and personalization can potentially improve drowsiness detection results and the overall reliability of the proposed solutions but requires additional computation resources and calibration [130], [148].
Other issues are methodological. For instance, there is little agreement in the literature on inducing drowsiness. A common approach is to ask the drivers to yawn and act drowsy [119], [124], [126], [127], [129], however, there is a risk that such data is not representative of more realistic conditions. Alternatively, drowsiness and fatigue can be induced by extended [132], [137] or monotonous [131] driving sessions, driving late at night [130], or reducing sleeping hours prior to experiment [145]. Some studies also employ night shift workers for collecting naturalistic data [144], [147].
Given a specific drowsiness scale, the next step of assigning labels to the data is not straightforward. Some studies rely on self-reported drowsiness scores [130], [143], [147], observer-rated sleepiness [135], [144], or both [137]. According to evidence from psychological experiments, neither is bias-free: observer ratings may not be reliable [153], [154] and self-rated sleepiness does not always correlate with driving performance [155]. Simulated data where drivers acted drowsy is free of issues with assigning ground truth and is the preferred choice for most studies (Figure 4). A question remains whether such data is realistic enough. For successful application in practice, user studies and validation experiments of various sleepiness assessment methods in different contexts are needed.

D. Driver Maneuver Recognition and Prediction
Recognizing and predicting drivers' actions is another valuable feature for driver monitoring and assistive technology. Knowing what the driver is doing or intends to do next can help direct their attention to the right objects and reduce unnecessary warnings. Since drivers' gaze is linked to the goal and actions being performed [156], it can be exploited to recognize and anticipate drivers' maneuvers.
1) Feature Extraction and Classification: Similar to distraction and drowsiness detection, action recognition and prediction are often framed as a multi-class classification problem: features are aggregated across the observation time, and classifiers are used to predict the upcoming maneuver.
For this task, the approximate direction of drivers' gaze is often used unless eye-tracking data is available as in [157]. The processing pipeline for obtaining gaze features usually includes face detection and tracking, followed by facial landmark detection, extraction of gaze zones [79], [84], [158], gaze duration, frequency, and blinks [79], [84]. Alternatively, an implicit representation of gaze can be used, such as tracked facial landmarks aggregated over time as proposed in [159], [160] or mirror-checking actions [84].
A variety of methods have been proposed to classify actions based on the temporal features above. For example, Li et al. [84] use boosting [161] with a combination of mirror-checking actions and vehicle dynamics features. Martin et al. [79], [158] model maneuvers using a multivariate normal distribution (MVN) of spatio-temporal descriptors that capture gaze duration towards relevant AOIs. Besides discriminative models, temporal modeling that fits the data more naturally has also been applied. Jain et al. [159] and Akai et al. [157] propose auto-regressive input-output Hidden Markov Models (HMMs) to classify driver's actions given driver gaze and vehicle dynamics. Recurrent networks are also effective for multi-modal data [160] but lack the explainability of HMMs.
2) Evaluation and Limitations: Since action prediction is typically framed as classification, common metrics such as precision, recall, and F1 are used to evaluate the results. As expected, it is more difficult to predict maneuvers several seconds in advance, thus precision and recall improve as time-to-maneuver (TTM) decreases [158]. Besides issues with face detection and tracking due to illumination changes [159], some maneuvers are generally more difficult to predict because of the overlap in behaviors (e.g. mirror checking is not always a precursor to lane changing [158]) and lack of visual cues from the driver (e.g. when they are familiar with the route or make a turn from the dedicated lane [159]). Inclusion of scene and route information may help handle such cases.
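For reference, the metrics named above can be computed per maneuver class as in the following minimal sketch. The function name and the label strings in the test data are illustrative assumptions; evaluating at different TTM values would simply mean calling it on predictions made at each horizon.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one maneuver class treated as positive.

    y_true, y_pred: equal-length sequences of class labels.
    positive:       the label of the class being evaluated.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```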

E. Driver Awareness Estimation
Inattention detection systems discussed so far do not consider the environment and what the driver is aware of. However, a better understanding of driver behavior, current task, and context is desirable for more effective driver assistance systems that could, for example, verify whether the driver attended to relevant objects and provide situation-specific warnings. This section reviews systems that make steps in this direction by associating driver attention to objects of interest in the traffic scene.
1) Visual Features and Processing: Algorithms discussed in this section determine drivers' awareness of vulnerable road users, signs, and traffic signals. Most of them follow a similar procedure: 1) detect drivers' 3D gaze direction, 2) convert gaze to vehicle's frame of reference, 3) detect objects in the scene and their properties, and 4) match drivers' gaze with objects to identify whether they were fixated.
Different strategies have been proposed for detecting what objects the driver observed. The simplest solution is to check whether the driver's gaze falls within the object's bounding box [164], [167], [170]. Since inaccurate measurement of 3D gaze direction can result in large errors, especially for objects far away, the authors of [163], [166] propose to treat attention as a cone projected from the drivers' eyes towards the windshield (Figure 5a). Some algorithms also take into account that drivers retain information about objects for some time after looking at them [164], [171], [172], as well as other properties of the scene, such as weather and proximity of other road users [172]. For example, Schwehr et al. [168], [173] model the joint probability distribution of the object states in the 2D vehicle coordinate system, object coordinates, and the driver's gaze direction in 2D to estimate which objects have been fixated or tracked. Ahlstrom et al. [172] modify the AttenD algorithm (described earlier in Section IV-B) to include elements of context via additional buffers for targets of relevance which, besides traffic ahead and behind, include intersections. Properties such as proximity to other road users and weather adaptations are also accounted for in the model. Zhu et al. [174] use SAGAT [175], a method for measuring situation awareness (SA), to associate it with various gaze-, memory-, and object-related features. A combination of these three types of features achieved over 70% agreement with human SA results.
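The two simplest matching strategies described above, the bounding-box test and the cone of attention, can be sketched as follows. This is a minimal geometric illustration, not the implementation of any reviewed system: the function names, the coordinate conventions, and the cone half-angle parameter are assumptions.

```python
import math


def gaze_in_box(gaze_xy, box):
    """True if a 2D gaze point lies inside a box (x_min, y_min, x_max, y_max)."""
    x, y = gaze_xy
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def in_gaze_cone(eye, gaze_dir, target, half_angle_deg):
    """True if `target` lies within a cone around the 3D gaze direction.

    eye, target:    3D points in the same (e.g. vehicle) frame of reference.
    gaze_dir:       3D gaze direction vector (need not be normalized).
    half_angle_deg: cone half-angle in degrees; a hypothetical tuning parameter.
    """
    to_target = [t - e for t, e in zip(target, eye)]
    norm_t = math.sqrt(sum(c * c for c in to_target))
    norm_g = math.sqrt(sum(c * c for c in gaze_dir))
    if norm_t == 0 or norm_g == 0:
        return False  # degenerate input: no direction to compare
    cos_angle = sum(a * b for a, b in zip(to_target, gaze_dir)) / (norm_t * norm_g)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= half_angle_deg
```

A wider cone tolerates more gaze measurement error but marks more objects as attended, so it trades precision of localization for robustness.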
2) Evaluation and Limitations: Evaluating driver awareness models is not trivial; however, individual modules can be evaluated quantitatively, for example, by measuring gaze estimation error [176], [177], road user trajectory prediction accuracy [162], or object detection performance [166], [178]. In contrast, there are no unified approaches for evaluating the performance of the entire system. Cross-model comparisons are virtually impossible since algorithms do not share the same definition of outputs, objects of interest, or application scenarios. Typically, a qualitative evaluation of the individual models is provided based on several illustrative scenarios [162], [165], [166], [168], [169]; however, it gives little idea of how robust, effective, and usable the system might be. Although some algorithms were tested with human subjects, many tests were not conducted in realistic driving conditions: some involved staged pedestrian crossings [179], routes on a university campus [164], and tests in driving simulators [172], [180].
In order to make a viable monitoring system that takes into account driver awareness, several fundamental issues that stem from the properties of the human visual system and limitations of recording equipment must be resolved. For instance, establishing the exact point of gaze (PoG) is difficult even with precise eye trackers, as shown in a series of experiments by Schwehr et al. [177]. The authors conclude that regardless of the choice of model for projecting the gaze into the scene, the point of gaze is always off the target by tens of pixels and that the error is primarily caused by the imprecise measurement of gaze by the eye tracker. Given that eye trackers provide the most accurate data, it is likely that vision-based systems will suffer from the same issue. A cone of gaze used in many studies instead of PoG alleviates some of the imprecision but does not localize the targets well (Figure 5b). An assumption that the drivers detect all objects within the intersection area [163], [166] may not be accurate. According to Kim et al. [181], gaze is generally correlated with lower levels of situation awareness (as defined in [182]), but gaze alone is not sufficient to predict SA. A regression model that takes into account proximity of the gaze to the target and awareness score explained only 50% of the variance in the data; therefore, other factors should be considered. Additional indicators that may be useful are vehicle control [180] and braking intention [183], as well as detectability of signs [184]-[186] and pedestrians [187] in traffic scenes depending on the visual properties of the objects and the scene, e.g. illumination, visual clutter, and visibility.

V. MODELS OF ATTENTION FOR ASSISTIVE AND AUTONOMOUS DRIVING
Algorithms discussed in this section do not detect drivers' gaze direction and state but rather model what objects and events need to be attended to for safe driving. Models intended for use in driver assistance systems rely on human data to predict where safe drivers should look in specific conditions. Algorithms for autonomous driving utilize mechanisms inspired by human attention to focus on what is important to improve decision-making and make it more transparent.

Fig. 6. Sample outputs of driver gaze estimation algorithms. a) Pixel-level saliency map for vehicle performing a left turn. b) Visualization of object-level importance scores for vehicle following. Red color indicates higher relevance/importance in both images. Sources: a) [198], b) [201].

A. Modeling Visual Attention in the Traffic Scene
In the past decade, a number of methods have been proposed for modeling the spatial distribution of drivers' gaze in traffic scenes. Given a single image or a sequence of images of the scene, these algorithms output saliency maps (or heatmaps) where higher pixel values (usually within [0, 1] range) indicate areas of interest, risk, or importance (Figure 6a). Fewer algorithms assign importance scores to specific objects. In this case, higher scores are associated with objects relevant to the ego-vehicle (e.g. lead vehicle shown in red in Figure 6b).
Various spatio-temporal features can be helpful for capturing the dynamic nature of the traffic environment and drivers' gaze changes. For instance, optical flow is useful for identifying the direction and magnitude of motion in the scene [198], [203]-[205]. More recently, following successful applications in video action recognition problems [206], 3D convolutional networks have become a popular choice for encoding spatio-temporal data [194], [199], [202], [207]. Some approaches use recurrent networks, combined with individually encoded frames [208], [209] or with a set of frames processed via 3D convolutional layers [197].
While it is possible to learn associations between individual scene images and human saliency maps without explicit task representation using features extracted via convolutional neural networks [198], [203], [204], [210], [211], the results are difficult to interpret and analyze. Approaches that use bottom-up and top-down features are more transparent as they directly control their influence on the resulting predictions.
For example, a weighted sum of bottom-up saliency maps and high-level features (vanishing point and center bias) was proposed by Deng et al. [188], [189]. They later extended this method by learning the weights for individual features using a random forest [189]. In experiments with regression models, Tavakoli et al. [203] demonstrate that bottom-up features are weakly correlated with task-driven gaze; however, an ensemble model that combines bottom-up and top-down influences leads to improved results. Borji et al. in several works [190]-[192] examined the contributions of different types of features using a Hidden Markov Model. In these experiments, top-down features (e.g. actions and previous gaze location) correlated with the human gaze better than bottom-up ones (e.g. saliency maps); however, a combination of both types of features performed best.
Two recent models, HammerDrive [194] and MEDIRL [195], demonstrate that explicitly modeling the underlying driving task is beneficial for gaze prediction performance. In HammerDrive, a separate module recognizes maneuvers (lane change and lane-keeping) from the vehicle telemetry. The result is used to reweight the output of the ensemble of bottom-up saliency predictors. MEDIRL applies inverse reinforcement learning to learn a policy for visual attention given the present agent state, which includes local and global context, and driving task (braking, lane-keeping, and merging).
2) Evaluation and Limitations: Evaluation of driver gaze models follows the procedure and metrics established in free-viewing saliency research (see [212] for a review). These metrics assess how similar predicted saliency maps are to those of human drivers in terms of saliency value magnitudes, statistical distribution properties, and salient locations [213].
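To make the notion of map similarity concrete, the sketch below implements two metrics that are widely used in saliency research: the linear correlation coefficient (CC) between a predicted and a ground-truth map, and the normalized scanpath saliency (NSS) at fixated locations. Which metrics a given reviewed paper reports is not stated here, so the choice of CC and NSS, the function names, and the list-of-lists map format are assumptions for illustration.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D list-of-lists map into a single list of values."""
    return [v for row in saliency_map for v in row]


def correlation_coefficient(pred, gt):
    """Pearson correlation between two saliency maps of equal size."""
    p, g = _flatten(pred), _flatten(gt)
    mean_p, mean_g = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    if var_p == 0 or var_g == 0:
        return 0.0  # a constant map carries no spatial information
    return cov / math.sqrt(var_p * var_g)


def normalized_scanpath_saliency(pred, fixations):
    """Mean z-scored predicted saliency at fixated (row, col) locations."""
    p = _flatten(pred)
    mean = sum(p) / len(p)
    std = math.sqrt(sum((v - mean) ** 2 for v in p) / len(p))
    if std == 0 or not fixations:
        return 0.0
    return sum((pred[r][c] - mean) / std for r, c in fixations) / len(fixations)
```

CC compares the overall shape of two continuous maps, whereas NSS rewards high predicted values exactly where humans fixated, so the two can rank models differently.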
To further verify the quality and human-likeness of the generated saliency maps, the following human experiments were proposed: subjects viewed videos with superimposed drivers' gaze or saliency maps produced by the model and were asked to choose which one came from a human driver [198] or was more consistent with good driving practices [208]. Since participants were not able to reliably discern natural and artificial gaze maps, it was interpreted by the authors as evidence in favor of the models. Whether such predicted patterns lead to safer driving remains unclear, particularly when data for training and evaluation is recorded in lab (see discussion in Section III).
A particular challenge related to driving gaze data is that it is comprised of common driving scenarios (e.g. vehicle following, driving on a straight road) with few surprising events or interactions with other road users. For example, in DR(eye)VE [198], the drivers encounter relatively few other road users and do not perform maneuvers often, resulting in center-biased gaze distributions. This has consequences for models since they learn the dominant gaze behaviors and thus fail to predict gaze for scenarios that occur rarely. Xia et al. [208] propose to mitigate the prevalence of common driving scenarios using two strategies. First, they curate the training data by focusing on abnormal events (e.g. braking). Second, they implement a weighted sampling strategy that selects frames with abnormal gaze distribution more frequently during training. To measure how well the models learn the underlying driving task, some authors compute metrics over segments of data where drivers' gaze distribution significantly differs from the mean due to maneuvers or actions of other road users [207], [208].
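The weighted sampling idea can be sketched as follows: frames whose gaze distribution deviates more from the average receive a higher probability of being drawn during training. Using KL divergence from the mean gaze map as the "abnormality" score, as well as all function names and the floor constant, are assumptions for illustration and are not taken from [208].

```python
import math
import random


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as flat lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def sampling_weights(gaze_maps, floor=1e-3):
    """One sampling weight per frame: divergence of its gaze map from the mean map.

    gaze_maps: list of flat gaze distributions, each summing to 1.
    floor:     small constant so that typical frames are still sampled sometimes.
    """
    n = len(gaze_maps)
    size = len(gaze_maps[0])
    mean_map = [sum(m[i] for m in gaze_maps) / n for i in range(size)]
    return [max(kl_divergence(m, mean_map), 0.0) + floor for m in gaze_maps]


def sample_frames(gaze_maps, k, seed=0):
    """Draw k frame indices with replacement, favoring atypical gaze maps."""
    rng = random.Random(seed)
    weights = sampling_weights(gaze_maps)
    return rng.choices(range(len(gaze_maps)), weights=weights, k=k)
```

With three center-biased frames and one frame where gaze shifts to the side, the last frame receives the largest weight and is therefore drawn more often than its share of the data would suggest.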

B. Attention for Self-Driving Vehicles
Self-driving technology aims to improve safety by eliminating human error. But to match or exceed human driver performance, AI-driven systems require solving multiple problems in many areas of computer vision and robotics. Perception alone involves overcoming significant challenges in object detection, tracking, scene segmentation, depth, and optical flow estimation (see [214]). Decision-making for motion planning, behavior selection, and vehicle control relies on precise mapping and localization [215] as well as understanding the behaviors of vulnerable road users [216].
Current self-driving systems that tackle these issues can be broadly subdivided into modular and end-to-end [217]. The former use dedicated modules for various processing stages, whereas the latter are unified systems that convert input from sensors directly to control commands. Due to the overwhelming number of self-driving approaches, we limit the review to end-to-end driving models that use attention for perception and reasoning to improve models' performance and transparency. The role of attention is to identify objects or areas in the scene that are most relevant for the current driving task and safety. The general principle of many attentional mechanisms is reweighting of the features according to a query that could be a literal question or a different set of features (e.g. hidden state of the recurrent model representing the current context) [218]. The weights themselves can be used to analyze what parts of the input had a larger influence on the output and investigate intermediate processing for better explainability of the model's decisions.
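The general reweighting principle can be illustrated with a minimal dot-product attention sketch: each feature vector is scored against a query, the scores are normalized with a softmax, and the features are combined using the resulting weights. Dot-product scoring is only one common choice and is an assumption here, as are the function names and the toy inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Reweight feature vectors by their dot-product similarity to a query.

    features: list of equal-length feature vectors (e.g. one per image region).
    query:    vector of the same length (e.g. a recurrent hidden state).
    Returns (weights, context), where context is the weighted sum of features.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]
    return weights, context
```

The returned weights are the quantity that is typically visualized to inspect which inputs contributed most to the output.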

1) Types and Uses of Attention:
Spatial attention is widespread in self-driving models because it retains the spatial arrangement of features and computes attention weights that can be traced back to the locations in the environment. Weights visualized as a heatmap can be interpreted as objects or areas in the scene that were important for the current output of the model. Spatial attention is usually applied to intermediate and final layers of the feature extraction step. For example, in [219], multiple attention modules are inserted after intermediate and last layers of CNN to gradually refine features. Kim et al. [220] insert the attentional module after the convolutional feature extraction and also condition spatial attention weights on the hidden state from the previous timestep.
Computing attention weights for image regions and specific objects in the scene instead of pixels or individual features is also possible. Cultrera et al. [221] use simple spatial attention that is trained as part of the network. The model first extracts features from the input image via pre-trained CNN, groups them into coarse regions, and passes them through the pooling layer. Then, an attention block consisting of a fully-connected layer followed by softmax activation produces weights that are element-wise multiplied with the output of the pooling layer. In [222], features corresponding to individual objects in each frame are extracted first. An object-level attention network is then applied to a concatenation of local object feature and global image features to output a scalar score indicating the object's relevance. Top-k objects are then passed to a policy network that outputs a discretized action. He et al. [223] and Wei et al. [224] propose methods for computing sparser attention weights for input features which results in a more selective and compact focus of attention and reduced computation.

Fig. 7. Visualized spatial attention weights: a) object-level, b) pixel-wise. Sources: a) [222], b) [220].
Recently, the Transformer architecture [225] has been shown to be effective for many vision tasks [226]. Transformers process sequential data without relying on recurrence, and instead use several identical blocks composed of multi-head attention, a feed-forward neural network, residual connections, and layer normalization. The attention module is a key element that reweights input according to the current task or context. Stacking several attention blocks with different initialization within a multi-head layer allows learning to focus on different parts of the input.
The flexibility of Transformers for various input modalities [226] and the inherent interpretability of learned attention scores [227], [228] lend themselves well to the self-driving domain. For example, Chitta et al. [229] use a Transformer to encode image patch features, ego-vehicle velocity, and positional embeddings. An additional Neural Attentional Field module then identifies parts of encoded input that are relevant for a query waypoint in the bird's eye view (BEV) image. Waypoints produced by the model are then used to generate vehicle control commands. Prakash et al. [230] propose TransFuser, a model composed of multiple transformer blocks for gradually fusing feature maps of different modalities (image and LiDAR BEV) at multiple resolutions. The waypoints generated from these features are shown to be effective in guiding the vehicle. Li et al. [231] make use of Transformers for developing a model for perception and prediction from multi-modal data comprised of LiDAR sweeps, images of the scene, and high-definition maps. Features of all detected road users in the scene are passed to the Interaction Transformer module that identifies for every actor all other relevant actors. Experiments on naturalistic driving data show that focusing on the most relevant agents helps reduce the number of collisions.
Some approaches leverage human data for training attention modules. For example, in [245] human gaze is used to train a foveal visual encoder that selects informative locations in the scene, crops patches, and processes them in more detail. The peripheral visual encoder extracts convolutional features from the entire image to provide global context. The two are combined via a planner to produce vehicle speed. In [246], a gaze model trained on human eye-tracking data is used to control the amount and spread of dropout to improve the accuracy of control commands during imitation learning. Gaze-modulated dropout is lower in highly salient areas and higher in irrelevant areas and offers better performance than fixed uniform or center-biased dropout. Instead of the human gaze, Kim et al., in a series of works, leverage human textual annotations using visuo-linguistic techniques. In [239] they propose an explanation module for the vehicle controller that generates a textual explanation for the action and a spatial attention map that highlights relevant regions in the scene image. In [238] the authors use textual advice to generate vehicle control commands and spatial attention maps that influenced the decision. In the most recent paper [247], they use attention for simultaneous generation of control commands using natural language and textual and visual explanations.
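The idea behind gaze-modulated dropout can be sketched as a per-location dropout probability that decreases as predicted saliency increases. The linear mapping from saliency to dropout probability, the `max_drop` parameter, and the absence of output rescaling are simplifying assumptions made for this illustration and are not taken from [246].

```python
import random


def gaze_modulated_dropout(features, saliency, max_drop=0.5, seed=None):
    """Zero out feature values with a probability that falls as saliency rises.

    features, saliency: 2D lists of equal shape; saliency values in [0, 1].
    max_drop:           dropout probability where saliency is 0 (assumed
                        linear mapping: p = max_drop * (1 - saliency)).
    seed:               optional seed for reproducible masks.
    """
    rng = random.Random(seed)
    out = []
    for f_row, s_row in zip(features, saliency):
        row = []
        for f, s in zip(f_row, s_row):
            drop_prob = max_drop * (1.0 - s)
            row.append(0.0 if rng.random() < drop_prob else f)
        out.append(row)
    return out
```

Under this mapping, fully salient locations are never dropped, while locations with zero saliency are dropped with probability `max_drop`.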
Qualitative evaluations of attention modules commonly use visualizations primarily to demonstrate that the algorithm focuses on portions of the environment relevant for safe driving. Object-level attention (Fig. 7a), where importance scores are visualized for individual objects, is easier to interpret and is more common [222], [224], [229]-[231]. Pixel-wise attention scores (Fig. 7b) used in some studies [220], [245], [249] often do not match specific objects in the scene and, in some instances, do not adequately reflect the decisions made by the system [220].

VI. DATASETS
High-quality publicly available data are crucially important for applied research, particularly for a complex and dynamic task such as driving. As discussed in previous sections, driving data must capture a wide range of scenarios and conditions, as well as a sufficiently large and diverse pool of participants. Thus large-scale data are a must for adequate evaluation of models and benchmarking the overall progress. This section covers a number of public datasets for a range of applications and properties of drivers' attention they represent (Table I).

TABLE I
PUBLIC DATASETS FOR STUDYING DRIVERS' (IN)ATTENTION WITH LINKS TO THE CORRESPONDING PROJECT PAGES AND DATA PROPERTIES. THE DATASETS ARE SORTED BY AVAILABILITY OF EYE-TRACKING DATA AND YEAR OF PUBLICATION (IN REVERSE CHRONOLOGICAL ORDER). THE FOLLOWING ABBREVIATIONS ARE USED IN THE TABLE. VIDEO DATA: S - SCENE-FACING CAMERA, D - DRIVER-FACING CAMERA, RGB - 3-CHANNEL IMAGE, IR - INFRA-RED, DEPTH - DEPTH SENSOR, MOCAP - MOTION CAPTURE. ANNOTATIONS: TL - TEXT LABELS, BB - BOUNDING BOXES, FL - FACIAL LANDMARKS, OCCL - OCCLUSION, SEM - SEMANTIC SEGMENTATION MAPS. FRAME COUNTS MARKED WITH * ARE ESTIMATED BASED ON THE LENGTHS OF THE VIDEOS AND CAMERA FRAME RATE.

Only the DR(eye)VE dataset is recorded on-road; however, due to difficulties in replicating the routes and traffic conditions across subjects, each video is associated with only one driver's gaze recording. 3DDS and C42CN, recorded in a low-fidelity driving simulator, aggregate gaze data from multiple subjects.
Eye-tracking data for the remaining datasets were recorded while subjects passively viewed driving footage on a computer monitor. As mentioned in Section III-A, such conditions lack ecological validity compared to on-road driving but are replicable across many subjects. As a result, there are measurable differences in gaze allocation between the two setups. For example, Xia et al. [208] reported that subjects who passively viewed videos from the DR(eye)VE dataset looked at more driving-related objects than the drivers whose gaze was recorded originally. Further analysis is needed to establish whether these changes are significant and how they affect safety.

B. Driving Datasets Without Eye-Tracking Data
Datasets in this group usually contain videos from driver-facing cameras from which gaze may be inferred or driving videos with driver attention annotations. Multiple techniques have been developed to associate recordings of drivers with the attended areas inside and outside the vehicle. For naturalistic driving data, the most common method is manual coding of gaze from driver-facing camera recordings. To avoid a labor-intensive and error-prone annotation process, drivers are instructed to look at specific areas or markers in the vehicle or the scene while seated in the parked vehicle. This approach is taken in LISA v2 [73], DGW [235], and DMD [236] datasets. In LISA v2, subjects were filmed under different lighting conditions (daytime, nighttime and harsh lighting) with and without eyeglasses. Additionally, the subjects were asked to rotate their head to capture head motions typical for actual driving. The DGW dataset followed a similar procedure using a much larger and more diverse pool of participants. To automate labeling, participants were asked to look at one of the 9 markers placed in the vehicle and speak the corresponding zone number, which was transcribed using a speech-to-text network. DGAZE dataset uses an in-lab setup where videos of the drivers are captured against the backdrop of the vehicle interior while they are looking at the annotated objects in the traffic scene. This makes it possible to associate drivers' appearance and the objects they attend to. However, it is yet to be determined whether such an in-lab setup will translate well to realistic on-road conditions.

Fig. 8. Types of data, data availability, and use in applications. The size of the circles reflects the number of publications using the corresponding data type for a specific application. Within each circle, the proportion of public and private data is shown.
Datasets without a driver video stream provide textual labels for drivers' actions and attention allocation in terms of objects and events in the scene deemed important by the annotators. For example, HDD [240], HAD [238], and BDD-X [239] provide causal explanations for drivers' actions, e.g. the presence of crossing pedestrians, or a vehicle ahead slowing down. Besides textual descriptions, HDD also provides bounding boxes for objects that the driver should look at when performing maneuvers. DAD [241] focuses on more extreme scenarios and contains videos of accidents recorded via dashboard cameras. The annotations include textual labels specifying the type of accident, temporal labels indicating when the dangerous situation occurs, and bounding boxes for important objects.
A number of datasets have been proposed for studying driver inattention due to drowsiness or involvement in secondary tasks. DROZY [242], YawDD [244], DDD [123], and RLDD [129] capture drowsy drivers. YawDD features recordings of a diverse set of drowsy drivers demonstrating a wide range of behaviors in varying conditions. Some drawbacks of this dataset are scripted actions and recording in a stationary vehicle. DDD is another scripted dataset where subjects were recorded laughing, talking, and looking to the sides, besides acting normal and drowsy, while playing a driving video game in a low-fidelity simulator. RLDD contains images of people captured with mobile phone cameras against neutral backgrounds. Subjects were asked to record themselves when they felt alert, low-vigilant, or drowsy, making sure that the state was authentic. Their self-assessed alertness score was used as a ground truth. Finally, DROZY is the only dataset where subjects experienced prolonged waking under the supervision and where physiological signals accompanied self-evaluated levels of drowsiness.
Large-scale naturalistic data for studying driving- and non-driving-related activities are available in Brain4Cars [159] and DMD [236]. Both contain extensive footage of driver-facing cameras synchronized with traffic views, textual labels, and associated vehicle information useful for detection and anticipation of drivers' behavior.

C. Naturalistic Driving Studies (NDS)
NDS are organized efforts to collect large-scale data on the natural behaviors of drivers over extended periods of time. For one of the first such studies, the 100-Car NDS, drivers used their private vehicles with instrumentation installed to collect rich visual and vehicle data from 2002 to 2004 in the USA [250]. The largest NDS to date, SHRP2, conducted in 2010-2013 also in the USA, involved over 3000 drivers who generated 50M miles of travel (with 372 crashes) and 2 petabytes of data, which is still being analyzed [251]. Unlike the datasets listed in Table I, NDS data is not freely accessible due to privacy concerns. For example, researchers interested in obtaining data from SHRP2 are required to complete training and provide a research proposal. Fees may also be charged depending on the data requested.

D. Data Availability and Properties
Overall, the datasets listed in Table I contain data recorded in multiple locations, with hundreds of subjects, and accompanied by rich annotations. However, much of this data has been released only recently and is not equally distributed across application domains. As shown in Figure 8, a large portion of the models covered in this survey are developed using private unpublished data. For instance, research on driver monitoring, such as in-vehicle gaze prediction, action anticipation, inattention detection, and driver awareness estimation, largely relies on private data sources, whereas scene gaze prediction and attention for self-driving are studied primarily on publicly available datasets.
Another limitation of many public datasets is that attention-related data is often recorded in laboratory conditions. In particular, for eye-tracking data, such conditions have not been validated (see Section III-A). For datasets without eye-tracking data, manually annotated events and objects often serve as substitutes for attention. However, the procedures for obtaining and verifying the correctness of such annotations are often not discussed, making their validity for on-road conditions a concern.
Finally, there are some gaps in the data types that the open datasets provide, limiting their use in specific applications. Figure 8 shows different kinds of annotations and their respective usage for various purposes. For example, to the best of our knowledge, there are no open datasets containing driving footage synchronized with the in-vehicle view and driver gaze information for applications such as in-vehicle gaze prediction, action anticipation, inattention detection, and driver awareness.

VII. GENERAL DISCUSSION AND CONCLUSIONS
Over the past decade, significant progress has been made towards detecting and modeling properties of drivers' attention for use in assistive and automated driving. For example, detecting where the driver is looking can be used to monitor drivers' alertness and attention, anticipate their maneuvers, and estimate their awareness of the surrounding traffic situation. Research on drivers' attention allocation has also benefited self-driving. Attentional mechanisms inspired by human visual attention or trained on human gaze data help autonomous vehicles focus on important objects and can also be used to explain their decision-making. Nevertheless, many challenges and open problems remain to be solved to make attention-based driving assistive systems viable for production.
A. Data Availability and Quality
1) More Public Datasets and Models Are Needed: Overall, close to 80% of all works that we reviewed in this survey relied on unpublished private datasets, and less than 10% published relevant code for the models and statistical analysis. However, as shown in Figure 8, data availability also depends on the application area. For instance, more than two-thirds of the models for driver gaze estimation in the traffic scene and for self-driving are based on public data, and many provide source code, whereas this is not the case for other applications. The lack of public data severely hinders the ability of researchers to reproduce the results of others and draw comparisons between different approaches. Moreover, without established benchmarks, estimating actual progress in the area and identifying future challenges is nearly impossible, especially since many unpublished datasets are not accompanied by information on the recording conditions, characteristics of the subjects, and tasks they performed. Although benchmarks are not without issues, much of the recent progress in computer vision and natural language processing can be attributed to high-quality open large-scale data. Similar tendencies can already be observed in some research areas discussed in this survey, e.g. scene gaze estimation, self-driving, and drowsiness detection.
2) Improving Data Diversity and Fidelity Regardless of Data Accessibility: Recording conditions can have a significant effect on the data quality and model applicability in practice. Naturalistic recordings of drivers' behaviors are generally difficult to collect and analyze due to high associated costs and lack of control over the conditions and tasks that the drivers perform. As a result, large volumes of data need to be aggregated to capture specific rarely occurring events. Thus, virtually all data used for developing models is restricted in some sense, e.g. by using predefined routes and tasks or conducting the study in the lab or in a parked vehicle. Even though laboratory conditions may be justified for potentially dangerous experiments (e.g. involving drowsiness or distractions), they nevertheless affect subjects' behaviors due to low perceived risk, overexposure to rare events, and short duration of sessions [47]. Recording in the lab or in a stationary vehicle cannot capture dynamic changes in lighting, shifts of driver's position due to changes in the road angle, and data loss caused by vibrations and road bumps.
Highway and rural road scenarios, often with low traffic volume, are more common in both on-road and in-lab experiments, whereas city driving is not as well investigated. However, when it comes to drivers' attention, urban conditions are far more challenging due to the presence of intersections and vulnerable road users. Although the speed of the vehicle is lower on the city streets, drivers interact with many other agents, which requires complex attention strategies. Furthermore, bad weather, unfamiliar environment, or heavy traffic are rarely modeled. Even in large naturalistic studies, these conditions are not well represented [37], [252].
Diversity of the participant pool is also a concern. The vast majority of the works we considered record data from no more than a dozen subjects, mostly university students, and many do not provide detailed information about the characteristics of the participants. However, given the evidence of significant individual differences between drivers (as discussed in Section IV), recruiting more subjects with diverse demographic characteristics is highly desirable.
The lack of realism in datasets extends from the environment to the drivers' actions, which are often staged. For instance, it is common practice to induce distraction by asking the drivers to perform tasks at timed intervals (see Section IV). In reality, however, secondary task engagement is voluntary and depends on many factors, including experience, environmental and situational factors, as well as characteristics of the secondary task itself [64]. Forcing the subjects to engage in meaningless tasks on demand and incentivizing high performance produces detectable changes in gaze allocation and driving performance, but such behaviors may differ from inattention occurring naturally during driving.
3) Taking Into Account the Active Nature of Driving: All available datasets consist of pre-recorded driving footage accompanied by gaze information (driver's gaze and/or gaze of passive observers) or manual annotations. As such, they are of limited use for estimating how drivers' gaze changes depending on the task. Counterfactual studies may help in testing how changes in the task or the environment affect attention allocation [253], but it is virtually impossible to estimate the effect of the drivers' actions on other road users using pre-recorded data. Simulated environments can generate the outcomes of different actions in the same scenarios but lack realistic models of road user behavior and the environment. While the quality of rendering has been steadily improving with advances in computer graphics, the problem of modeling the actions and reactions of the surrounding pedestrians and vehicles remains far from being solved [254], [255].
B. Evaluation
1) Establishing Ground Truth: There are unresolved issues related to establishing ground truth for many applications. For example, determining specific objects or areas that the driver is observing is not trivial. A recent study by Jansen et al. [256] raised concerns regarding the manual annotation of gaze from driver-facing videos. Based on their analysis, the customary practice of measuring several independent annotators' agreement may not produce good quality labels as some areas of interest (AOIs) are easily confused (e.g. accuracy for the AOIs is consistently lower on the passenger side). Other factors, such as the driver's height, may also affect the results but are rarely considered. Even with precise eye-tracking data, establishing a point of gaze, especially for small or moving objects, is prone to errors, as Schwehr et al. [177] show in a series of experiments. Due to these limitations, models that demonstrate high performance on such ground truth may not transfer to real traffic conditions. Similarly, determining the driver's cognitive state may be problematic. As discussed in Section IV-C, self-reported and observer ratings for drowsiness are often not accurate and do not correlate with driving performance. Cognitive distractions are also difficult to induce and detect (Section IV-B). Physiological indicators are more suitable for these purposes [55] but require additional sensors, making data collection more costly and the use of such systems less desirable in practice.
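To illustrate why an aggregate agreement score can hide confusion between specific areas of interest, the following minimal sketch computes Cohen's kappa for two annotators together with a per-AOI agreement rate. The label sequences, AOI names, and function names are hypothetical and chosen only for illustration; they are not taken from Jansen et al. [256] or from any dataset discussed here.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of frames on which both annotators chose the same AOI.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def per_aoi_agreement(a, b):
    """For each AOI chosen by annotator a, the fraction of frames where b agrees."""
    totals, matches = Counter(), Counter()
    for x, y in zip(a, b):
        totals[x] += 1
        matches[x] += (x == y)
    return {aoi: matches[aoi] / totals[aoi] for aoi in totals}


# Hypothetical per-frame labels from two annotators (assumed data).
annotator_1 = ["road"] * 6 + ["right_mirror", "right_mirror",
                              "passenger_side", "passenger_side"]
annotator_2 = ["road"] * 6 + ["right_mirror", "passenger_side",
                              "right_mirror", "passenger_side"]

kappa = cohens_kappa(annotator_1, annotator_2)
by_aoi = per_aoi_agreement(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
for aoi, rate in by_aoi.items():
    print(f"{aoi}: {rate:.2f}")
```

In this toy example the annotators agree perfectly on the road AOI but only half of the time on the two passenger-side AOIs, a difference that the single kappa value does not reveal.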
2) Including Safety-Focused Evaluation: Assistive and autonomous driving applications are motivated by safety concerns; however, quantitative evaluations can only assess how well they align with ground truth (which, as noted above, may not be accurate). At the same time, actual crash data is exceedingly rare. For example, in 43 thousand hours of driving data recorded in the 100-Car NDS, 82 crashes (mostly rear-end collisions), 761 near-crashes, and 8295 incidents were recorded [2].
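The rarity of crashes can be made concrete with simple arithmetic on the 100-Car NDS figures quoted above. The sketch below converts the event counts into rates per 1000 driving hours; it assumes that "43 thousand hours" can be treated as exactly 43,000 hours, and the variable and function names are illustrative only.

```python
# Figures quoted in the text for the 100-Car NDS [2].
# Assumption: "43 thousand hours" is treated as exactly 43,000 hours.
HOURS = 43_000
EVENTS = {"crash": 82, "near_crash": 761, "incident": 8295}


def rate_per_1000_hours(count, hours=HOURS):
    """Number of events per 1000 hours of recorded driving."""
    return 1000.0 * count / hours


rates = {name: rate_per_1000_hours(count) for name, count in EVENTS.items()}
hours_per_crash = HOURS / EVENTS["crash"]

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per 1000 h")
print(f"roughly one crash per {hours_per_crash:.0f} hours of driving")
```

Under this assumption, crashes occur at a rate of about 1.9 per 1000 hours, that is, roughly one crash per 524 hours of recorded driving, whereas near-crashes and incidents are about one and two orders of magnitude more frequent, respectively.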
In the literature, two approaches are commonly taken to mitigate this issue depending on the application in question. One is collecting and annotating accident videos published online (see Section VI), and another is integrating attention into vehicle control models and testing them in simulation to estimate crash risks (Section V-A2). Both methods have limitations. While accident datasets may provide information about various types of collisions and their timelines, annotations collected in lab conditions, whether eye-tracking data, textual labels, or importance scores, are difficult to verify with regard to safety. Therefore, it cannot be guaranteed that visual strategies learned from such data could have prevented the crash or reduced its severity. Simulated experiments provide both active control and the ability to replicate the same scenarios, as well as accident risk estimates, but typically are not validated in on-road conditions.
There are also assessments of the risk of prolonged off-road glance durations derived from naturalistic studies [109], [257]. Currently, they are widely used in behavioral literature and as guidelines for in-vehicle infotainment system design. Although they are relevant for the design and evaluation of inattention detection algorithms, only one model within our selection of papers uses them [106]. Incidentally, it is the only model that captures the duration of the inattention, whereas the rest focus on instantaneous detection.
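The difference between instantaneous detection and duration-aware detection can be sketched as follows. The example converts a per-frame on-road/off-road label stream into individual off-road glances with their durations and flags those exceeding a threshold. The frame rate, the label stream, and the 2.0 s threshold are placeholders chosen for illustration; they are not values taken from [106], [109], or [257].

```python
from itertools import groupby


def off_road_glances(on_road, fps):
    """Return (start_time_s, duration_s) for each run of consecutive off-road frames.

    on_road: sequence of booleans, True when the gaze is on the road in that frame.
    fps: frame rate of the label stream in frames per second.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    glances = []
    index = 0
    for is_on_road, run in groupby(on_road):
        length = sum(1 for _ in run)
        if not is_on_road:
            glances.append((index / fps, length / fps))
        index += length
    return glances


def long_glances(on_road, fps, max_duration_s):
    """Off-road glances whose duration exceeds max_duration_s."""
    return [g for g in off_road_glances(on_road, fps) if g[1] > max_duration_s]


# Hypothetical 10 Hz label stream (assumed data): True = gaze on road.
labels = [True] * 5 + [False] * 3 + [True] * 10 + [False] * 25 + [True] * 5

glances = off_road_glances(labels, fps=10)
# The 2.0 s threshold is an illustrative placeholder, not a recommended value.
flagged = long_glances(labels, fps=10, max_duration_s=2.0)
print(glances)
print(flagged)
```

A frame-level detector would mark all 28 off-road frames in this stream equally, whereas the duration-based view separates a brief 0.3 s glance from a 2.5 s one.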

3) Better Coordination Between Research Areas: Research areas covered in this survey are complementary and can benefit from coordinating their efforts. For example, taking into account drivers' actions helps better predict attention [194], [195] and detect inattention [109], [258]. Likewise, gaze and appearance features are useful for detecting both distraction and drowsiness (Section IV), but relatively few works investigate these problems together [259], [260].
Human-computer interaction (HCI) research is also very relevant for the design of algorithms intended for use in assistive and autonomous driving systems. To ensure the adoption of such systems, they should function seamlessly and help the driver rather than add to their cognitive load (e.g. by unclear or false alarms [261]). However, in the reviewed models, such considerations are rarely taken into account or verified through user studies. For example, many driver gaze prediction algorithms (Section V-A) output pixel-wise heatmaps where objects of interest or imminent hazards are highlighted. Although several studies show that target and maneuver-relevant cues can help direct drivers' gaze to those areas [262]-[264], how this guidance is realized is important. It has been shown that providing too many cues is detrimental and may obscure other important information [263], [265]. Specifically for hazards, indicating the path to avoid them [266] is more effective than pointing at the obstacle itself [267]. Another example is fatigue detection. While reliable detection is the first necessary step, it is not sufficient to provide effective countermeasures. Fatigue due to cognitive under- or overload and drowsiness caused by sleep deprivation require different interventions [268]; therefore, context must be modeled as well. Individual characteristics of the drivers discussed in Section IV or their driving preferences [114], [269] should also be factored in when setting thresholds for warnings.

C. Limitations of Attention Models
1) Using Gaze as a Proxy for Attention: As discussed in Section III, driving literature views attention as observable gaze changes and measures related to spatio-temporal properties of gaze, such as location and duration of fixations, and transitions between them. The assumption is often made that most driving-related information is processed in the fovea and is predominantly task-driven. Thus, analyzing gaze can shed light on what the driver observed at any given time and how it affected their decision-making. However, gaze as a proxy for attention also has a number of limitations. First, gaze alone does not guarantee processing. In other words, looking at something is not equal to being aware of it (e.g. looked-but-failed-to-see errors [270] and change blindness [30], [271], [272] occur during driving). Second, drivers extensively use peripheral vision for vehicle control [273], [274] and hazard detection [50], [275], [276]; however, gaze provides little insight into peripheral processing. Third, gaze is a result of a complex interplay between various attention control mechanisms, tasks being performed, and surrounding context. These caveats must be taken into account when analyzing eye-tracking data and designing models for various applications in the driving domain and beyond.
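As a minimal sketch of the spatio-temporal gaze measures mentioned above, the following example derives total dwell time per area and transition counts between areas from a sequence of fixations. The fixation sequence, AOI names, and function names are hypothetical. Consistent with the caveats above, such measures describe only where the eyes were directed and say nothing about peripheral processing or whether the fixated object was actually perceived.

```python
from collections import Counter, defaultdict


def dwell_time(fixations):
    """Total fixation duration in seconds for each AOI."""
    totals = defaultdict(float)
    for aoi, duration_s in fixations:
        totals[aoi] += duration_s
    return dict(totals)


def transition_counts(fixations):
    """Count gaze transitions between distinct consecutive AOIs."""
    aois = [aoi for aoi, _ in fixations]
    return Counter(
        (src, dst) for src, dst in zip(aois, aois[1:]) if src != dst
    )


# Hypothetical fixation sequence (assumed data): (AOI label, duration in seconds).
fixations = [
    ("road", 1.5),
    ("left_mirror", 0.5),
    ("road", 2.0),
    ("speedometer", 0.25),
    ("road", 1.0),
]

dwell = dwell_time(fixations)
transitions = transition_counts(fixations)
print(dwell)
print(dict(transitions))
```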
2) Reducing the Gap Between Behavioral Research and Implementations: Despite encouraging results, most algorithms consider only a fraction of the factors affecting drivers' attention identified in behavioral studies. For example, there is evidence that age [35], [277]-[279] and driving experience [3] affect drivers' attention allocation. Besides driver characteristics, external conditions matter. Effects of driving through intersections [280]-[282], on curved roads [283], in dense traffic [284], as well as the presence of outside distractors (e.g. billboards) [285], [286] have been investigated in numerous behavioral studies but are not taken into account in many implementations.
3) Incorporating Explicit Task Representation: As discussed in Section III-A, top-down factors play a large role in drivers' gaze allocation. High-level features, such as location and class of objects, vehicle telemetry, and optical flow, allow capturing only implicit dependencies between visual features, drivers' gaze data, and vehicle control signals. Such interpretation of attention poorly reflects biological properties of the human visual system and offers little control over algorithms that predict attention allocation in practice. Driving is not a uniform activity; different underlying tasks affect attention distribution differently. For example, when controlling the vehicle, drivers focus on the road ahead and track road boundaries, periodically fixating on other road users or scanning the intersections [156]. Visual context and gaze may be ambiguous; therefore, an explicit top-down signal with the intended action or planned route could help better direct the model.

4) Modeling Attention Beyond Selective and Explanatory Functions: In most models of drivers' attention reviewed in Section III-A, the role of attention is reduced to highlighting and ranking objects or areas in the scenes. Other aspects of attention, such as the effect of task on attentional modulation of perception, sequential nature of processing, relation to working memory, decision-making, and allocation of cognitive resources [287], are not considered. Part of the reason is the limited scope of the proposed models. More sophisticated attention mechanisms would be necessary as models' complexity increases towards incorporating the state of the driver and analysis of the interactions between road users, infrastructure, and drivers' actions.
In conclusion, the problem of modeling drivers' attention is of immense practical and theoretical importance. In this review, we discussed several research directions that analyze and model information on where the driver is looking for applications in driving assistance and automation. We hope that providing a broad overview of several inter-related research areas and identifying open problems will help guide future investigations and lead to improvements in road safety.