Estimation of Driver's Gaze Region from Head Position and Orientation using Probabilistic Confidence Regions

A smart vehicle should be able to understand human behavior and predict drivers' actions to avoid hazardous situations. Specific traits in human behavior can be automatically predicted, which can help the vehicle make decisions, increasing safety. One of the most important aspects pertaining to the driving task is the driver's visual attention. Predicting the driver's visual attention can help a vehicle understand the awareness state of the driver, providing important contextual information. While estimating the exact gaze direction is difficult in the car environment, a coarse estimation of the visual attention can be obtained by tracking the position and orientation of the head. Since the relation between head pose and gaze direction is not one-to-one, this paper proposes a formulation based on probabilistic models to create salient regions describing the visual attention of the driver. The area of the predicted region is small when the model has high confidence in the prediction, which is directly learned from the data. We use Gaussian process regression (GPR) to implement the framework, comparing the performance with different regression formulations such as linear regression and neural network based methods. We evaluate these frameworks by studying the tradeoff between spatial resolution and accuracy of the probability map using naturalistic recordings collected with the UTDrive platform. We observe that the GPR method produces the best results, creating accurate predictions with localized salient regions. For example, the 95% confidence region is defined by an area that covers 3.77% of a sphere surrounding the driver.


I. INTRODUCTION
ROAD safety is a major concern in today's world. The main cause of road accidents is the negligence of distracted drivers [1]. Therefore, monitoring the driver's actions can be useful for predicting their behaviors, creating warnings to avoid impending mistakes due to lack of awareness. Smart vehicles today are equipped with multiple sensors, which provide relevant real-time information inside and outside the vehicle. The challenge is incorporating heterogeneous information to provide high-level knowledge to understand the driver, the vehicle, and the road. Monitoring the driver's behaviors can also serve as a tool to design advanced user interfaces for infotainment and navigation systems where the drivers naturally interact with the car, without using manual resources [2] (e.g., interpreting commands such as "what is the address of this building?", while the driver briefly glances towards the target location). With semi-autonomous cars, monitoring the driver behavior can also be helpful in negotiating hand-over control from the vehicle to the driver, or vice-versa.

Visual attention is a major factor when modeling the driver's intentions. The majority of the tasks involved while driving require visual cues. The direction of the driver's gaze strongly depends on the primary driving task and the road condition. Implementing a robust gaze detection system for cars can be helpful in signaling the cognitive state [3], [4], situational awareness [5]-[7], and attention level [8], [9] of the driver. These systems can also be helpful in enhancing in-car dialog systems [2].

Sumit Jha and Carlos Busso are with the Erik Jonsson School of Engineering and Computer Science at the University of Texas at Dallas.
In human-computer interaction (HCI), the gaze of a subject is estimated by locating the pupil using various appearance based and feature based techniques [10]-[12]. However, these techniques are not practical in a vehicle environment with challenging situations such as varying lighting conditions, large head rotations, and possible occlusions [13]. Moreover, in the detection of the driver's attention, it is often more important to achieve robustness across conditions than high performance under restricted conditions. A coarse estimation of the driver's visual attention is usually enough for many applications. Following this strategy, studies have proposed the use of head pose to infer the driver's gaze [14]-[17]. Head pose has a strong correlation with gaze, but the relationship is not deterministic [18]. Keeping the eyes off the road for longer periods significantly increases the chances of accidents. Therefore, drivers tend to use short glances, which involve both head and eye movements. This relationship changes according to the driver, the primary driving task, the secondary driving task, and the traffic condition. Therefore, the head orientation cannot uniquely determine the exact gaze direction.
Instead of aiming to detect the precise gaze direction, this paper proposes to estimate a probabilistic visual map describing the region of visual attention where the driver is most likely to direct his/her gaze. Building upon our previous work [19], [20], we propose to create this probabilistic visual map using a two-dimensional Gaussian distribution that is directly learned from data. The formulation relies on Gaussian process regression (GPR) to predict the distribution of the gaze given a certain position and orientation of the driver's head. The proposed model provides not only the probabilistic visual map, but also confidence regions, which can be extremely useful for HCI applications for infotainment and navigation systems, and advanced driver assistance systems (ADAS). The size of the salient region decreases when the confidence of the model increases, and all the parameters of the models are learned directly from the data. We train and evaluate the system with recordings from real driving scenarios using affordable equipment that can be easily installed on regular cars. The experimental evaluation demonstrates the effectiveness of the proposed GPR system, analyzing the tradeoff between accuracy and spatial resolution of the probabilistic visual map. We compare our proposed solution with alternative machine learning methods to predict the visual maps, including simple regression techniques, deep neural networks, and mixture density networks (MDNs). The results indicate that our proposed model offers the best accuracy and spatial resolution in predicting the probabilistic region of the driver's gaze. For example, 95% of the target markers lie inside the probabilistic region predicted by the system, where its spatial resolution covers 3.77% of a sphere surrounding the driver's range of vision. Finally, we demonstrate the benefit of the proposed probabilistic model by mapping the probabilistic visual map to areas on the road, allowing us to identify coarse regions outside the car that the driver is directing his/her gaze to.

arXiv:2012.12754v1 [cs.CV] 23 Dec 2020

Fig. 1(b). Markers on the windshield (1)-(13), mirrors (14)-(16), side windows (17)-(18), speedometer panel (19), radio (20), and gear (21). The subjects were asked to look at these markers.
This study is organized as follows. Section II discusses related studies about the importance of visual attention when studying driver's behavior. Section III describes the data collection procedure that we followed to train and evaluate our algorithms. Section IV describes the proposed method to obtain the probabilistic salient visual map to represent visual attention. It also introduces the baseline methods. Section V discusses the results obtained from different models, comparing the tradeoff between spatial resolution and accuracy of the probabilistic salient visual maps. Finally, Section VI concludes the study, suggesting future research directions.

II. RELATED WORK

A. Visual Attention of the Driver
Maintaining visual attention while driving a vehicle is important to avoid hazardous scenarios. Drivers obtain most information through vision, which is important to maintain road awareness and to complete driving maneuvers [5]. Therefore, several studies have considered the visual patterns of the driver, creating useful automatic tools for intelligent vehicle systems. Liang and Lee [21] conducted experiments by inducing visual distraction, cognitive distraction, and a combination of both by asking subjects to perform distracting tasks while operating a driving simulator. They observed that the driving performance was worse when the subjects were performing visually distracting tasks compared to the performance when performing a combination of visual and cognitive distracting tasks.
Robinson et al. [22] studied the visual search patterns of a driver by looking at his/her head movements during lane changes and when entering a highway after a stop sign. They observed longer search times at a stop sign, where the drivers had to observe the whole scene before making a decision. In contrast, for lane change actions the search time was shorter since the driver had to make quick decisions. Underwood et al. [23] used eye trackers to study the eye movement behavior of experienced and novice drivers in three different types of roads: rural, suburban and divided highways. They analyzed the most common sequences of fixation in various regions of the road to compare the driving behavior. They observed that a novice driver tends to change his/her fixation more often, while an experienced driver tends to use peripheral vision to pick up subtle information such as the demarcation of lanes.
Understanding visual attention can also help us infer information about visual and cognitive distractions. Sodhi et al. [24] used a head-mounted eye-tracker in a vehicle, where they asked multiple subjects to drive a predetermined route while performing tasks that induce distractions. They used the eye-tracker to obtain the position and diameter of the pupil. They studied the impact of infotainment systems on driving by inducing various cognitive and visual distractions. The study observed that the eye movement patterns changed when the driver was distracted by a secondary task. Kutila et al. [25] recorded the face of the driver with stereo cameras in a naturalistic driving scenario. They used head and gaze information along with lane position and controller area network (CAN)-Bus data to detect visual and cognitive distraction. Gaze data was obtained using a gaze tracker. The driver's visual attention was inferred using eyes-off-the-road duration. The eye movement was fused with the cognitive workload inferred from the driving data to obtain cognitive distractions. Liang et al. [26] designed a support vector machine (SVM) classifier that used measures of driving performance such as steering angle and lane position, and features from eye movement data such as fixations and saccades to detect cognitive distraction. They obtained an accuracy of 96.1% in a simulated environment. Murphy et al. [27] implemented a real-time system to track the six degrees of freedom of the head pose of a driver. They designed an appearance based particle filter to build a 3D model of the face in augmented reality. Rezaei and Klette [17] monitored both the driver and the road to find possible hazard situations. They designed an asymmetric active appearance model (AAM) to predict the driver's head pose, which was used in conjunction with features extracted from the vehicles detected on the road to design a fuzzy logic based system to predict the risk level of the driving situation.
Understanding the driver's behavior is even more relevant with the advances in autonomous cars. Information and datasets derived from drivers in naturalistic conditions, including their visual attention, can be instrumental in the design of autonomous cars [28], following the ideas of behavior cloning [29]. Likewise, cars that are aware of the driver's visual attention can more effectively negotiate hand over situations. Zeeb et al. [30] compared the take-over time and quality between a distracted driver and an attentive driver. The drivers were asked to be involved in secondary tasks such as watching videos and writing emails. They observed that while a driver could quickly resume control of the car when prompted, the quality of the take over was worse when the driver was distracted.

B. Prediction of Visual Attention
Several studies have worked on predicting the visual attention of the driver, realizing the importance of visual attention in monitoring the behaviors of the driver. The most common approach is to partition the gaze region of the driver into different gaze zones. Then, the problem is formulated as a classification problem to identify the area that the driver is directing his/her attention to. Tawari and Trivedi [14] video recorded the face of the driver from two different angles in the car to capture the head pose. The two cameras increased the angular range of the head pose estimation. The task was to classify the driver's gaze into eight different gaze zones. They used annotations obtained from human experts as the ground truth for the target gaze zone, training a random forest classifier with the head pose as features. The zone predictions had high confusion between adjacent zones such as looking forward and looking at the speedometer. Lee et al. [15] suggested a robust method to predict the yaw and pitch of the head. The method relied on simple edge features from the face, making the approach robust to rotation and illumination, and fast enough to run in real time. They used the predicted yaw and pitch angles to identify one of 18 predefined gaze zones using an SVM classifier. Chuang et al. [16] designed a gaze estimation system using a smartphone camera. They placed the smartphone on the dashboard to record the driver's face. They used the location of the eyes, nose and mouth regions as features to classify the gaze among eight different zones. Vora et al. [31] proposed a generalized, subject-invariant approach to classify gaze zones. They used convolutional neural networks (CNNs) to obtain the gaze zone from the driver's facial image.
The best network achieved 93.36% accuracy on a seven-class classification task (six gaze zones plus a class for eye closure).

While the gaze zone provides useful information about the visual attention of the driver, this information is too coarse for several applications. However, it is challenging to design gaze estimation methods with high precision that work well inside a vehicle. Although there is a strong relationship between head movement and gaze direction, the relation is not one-to-one [18]. In naturalistic driving scenarios, the driver relies not only on head movements to direct his/her gaze toward a target location, but also on eye movements. The interplay between head and eye movements depends on the cognitive load of the driver and the underlying driving task. A feasible alternative to coarse gaze zone estimation, or to gaze algorithms that are unreliable inside a vehicle, is the definition of a probabilistic salient visual map describing the visual attention of the driver. This probabilistic salient visual map can be used to define spatial confidence regions describing the direction of the driver's gaze. This study pursues this novel formulation, creating models that capture the relationship between head pose and gaze, producing a probability distribution of the gaze given the orientation and position of the driver's head.

III. DATA COLLECTION
This study uses recordings from real driving scenarios collected with the UTDrive platform [32], [33], which is a vehicle equipped with multiple sensors (Fig. 1(a)). The UTDrive has been successfully used to study driver behaviors [34]-[36]. Instead of using the specific sensors from this car, we decided to only use the commercially available dash camera Blackvue DR-650GW-2ch (Fig. 2), which can be easily installed in any regular car. The device features two cameras along with a global positioning system (GPS) and accelerometer sensors. The front camera was used to record the road view, while the rear camera was used to record the face of the driver. The system is currently implemented offline. However, the dash camera has a WiFi connection, which can be utilized for real-time implementations. This setting is ideal for in-vehicle data collection.

A. Data Collection Protocol
For the analysis, we require data where we know the ground truth information about the direction of the gaze. We achieve this goal by asking the driver to look at predefined markers. We place 21 numbered markers on the windshield (#1-#13), mirrors (#14-#16), side windows (#17-#18), speedometer panel (#19), radio (#20), and gear (#21) (Fig. 1(b)). Then, we ask the subjects to look at these markers multiple times, where we carefully annotated the corresponding timing information. We recruited 16 students (10 males, 6 females) with valid US driver's licenses from the University of Texas at Dallas. We designed a three-phase protocol:

Phase 1: The first phase is recorded when the vehicle is parked. The subject is asked to sit in the driver's seat, looking at different markers. The numbers are called out in random order and the driver is asked to look at the corresponding points. Each number was repeated five times in random order. We did not provide any further instruction. The goal of this phase is to estimate and model the gaze-head relationship when our subjects are not driving. They have plenty of time to complete this task without worrying about the visual, manual and cognitive demands associated with the driving task. The drivers can also get familiar with the task in a safe environment.

Phase 2: The second phase consists of the same task while the subject is driving the vehicle. The subject is asked to drive on a straight road with low traffic. A passenger reads the numbers, pointing to the target location to reduce the cognitive demand of the task. The numbers are requested only when the driver does not have to perform any maneuver. The safety of the subject is our first priority. We do not provide any additional instruction on how to look at the markers. We use this phase to estimate and model the gaze-head relationship while the subject is driving.

Phase 3: During the third phase, we ask the driver to park the car and perform the same task again.
This time, the driver is asked to look at each marker directing his/her head toward the point. In this controlled condition, the gaze of the driver is the same as the head pose, without the bias added by the movement of the eyes. To enforce this requirement, we request the driver to wear a glasses frame with a laser mounted at the center (Fig. 3). The driver is asked to point the laser towards the marker. Each marker was requested three times in random order. This phase provides valuable data, where the gaze is exactly aligned with the head orientation. Additionally, we asked the last three of our subjects to look at specific locations on the road including billboards, street signs, and buildings to validate our systems in real-world applications. This data is used to assess the mapping between gaze detection and objects on the road.

B. Head Pose Estimation Using AprilTags
It is challenging to use computer vision algorithms in a car environment. In our previous work [13], we demonstrated that the robustness of a state-of-the-art head pose estimation algorithm was low for non-frontal faces rotated more than 45°. For this analysis, we aim to have more robust estimations of head poses regardless of the head orientation. We achieve this goal by using a headband with AprilTags (Fig. 4(a)). For future work, we can rely on depth cameras to obtain robust head pose estimation [37].
AprilTags [38] are 2D barcodes primarily used for augmented reality applications, robotics and camera calibration. Figure 4(a) shows an example of an AprilTag. The unique black and white patterns of each AprilTag are easy to automatically detect with computer vision algorithms. From the pattern, it is possible to accurately estimate the position and orientation of the tags. Instead of using a single AprilTag, we rely on many tags to robustly estimate the position and orientation of the head. For this purpose, we designed a headband with 17 square faces (2 cm × 2 cm each), separated by an angle of 12°. Each of the square faces contains a 1.6 cm × 1.6 cm unique tag. Figure 4(b) shows the headband worn by the participants. During the data collection, the subject is asked to wear the band for the entire recording. The selected design allows us to observe multiple tags for each video frame, regardless of the orientation of the driver's head. Therefore, we can robustly infer the position and orientation of the headband.
The AprilTags from the headband are used to obtain the position and orientation of the driver's head. The AprilTag toolkit provides an estimate of the position and orientation of each tag present in an image. The structure of the headband and the orientation of the visible tags help us estimate the pose of the headband. The use of the headband facilitates the analysis of head pose regardless of the orientation of the head or the environmental condition in the vehicle. For real-world applications, the head orientation will be estimated with automatic algorithms using either RGB cameras [39] or depth cameras [37].

C. Calibration of Camera and Markers
A key challenge is to define a common coordinate system. We need the location and orientation of the driver's head along with the location of each enumerated marker in a 3D space with respect to a single coordinate system. Figure 5(a) shows the view from the rear camera facing the driver, and Figure 5(f) shows the view from the road camera. It is clear from these two figures that most of the markers are not included in the view of either camera. The problem is even more challenging as we aim to map the gaze direction to areas on the road camera. We need a calibration process to find the exact target marker location in the 3D space and the transformation between the cameras to represent all the coordinates in a single reference system. The calibration process relies on AprilTags to find the location of the markers in the 3D space and to find the relative homogeneous transformation between each camera. The proposed solution consists of placing AprilTags in the vehicle. The AprilTags are used to establish a connection between the road and face cameras, which do not have any overlap in their field of view. The calibration process has two steps: create a common reference coordinate system, and create a mapping between objects outside the vehicle.
The first step in the calibration is to establish a reference coordinate system. AprilTags are placed on each of the markers (Fig. 5(d)), and some reference locations in the field of view of the face camera (Figs. 5(a)-5(d)). These tags are only used to calibrate the system, and are removed during the data collection. Then, we use a third camera to take multiple pictures containing subsets of these AprilTags (Fig. 5). This camera captures locations that are not in the field of view of either of the dash cameras. The relationship between frames containing multiple common tags is calculated. The face camera captures a subset of these additional tags. Using the location of these tags, we create homogeneous transformations to obtain the location of all the tags, including the 21 markers, with respect to the coordinate system of the face camera.
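The chained homogeneous transformations used in this step can be sketched in a few lines of code. This is an illustrative example, not the calibration code used in the study; the frame names in the comment are hypothetical.

```python
import numpy as np

def homogeneous(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def apply(T, p):
    """Map a 3D point through a 4x4 homogeneous transform."""
    return (T @ np.append(p, 1.0))[:3]

# Composing transforms maps a tag observed by the calibration camera
# into the coordinate system of the face camera (hypothetical frames):
#   T_face_from_tag = T_face_from_calib @ T_calib_from_tag
```

Because the transforms compose by matrix multiplication, every marker can be expressed in the face-camera frame once each pairwise transform in the chain is known.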
The second step in the calibration consists of estimating a mapping between the reference coordinate system and objects outside the vehicle. For this step, the third camera is fixed inside the vehicle such that it records the windshield and the road view (Figs. 5(e) and 5(f); these two images were taken simultaneously). As a result, we have two cameras facing the road: the road camera used in the data collection and the third camera used for calibration. We printed an AprilTag sign, which is placed in front of the vehicle such that it is in the field of view of both cameras. This process helps us to establish the relationship between the cameras for different points. Finally, the relation between the face camera and the third camera is established using the location of the markers. With this process, we derive all the transformations needed to map everything into the coordinate system of the face camera. The transformation matrix to relate each camera is calculated using the Kabsch algorithm [40].
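The Kabsch step admits a short numpy sketch. This is an illustration of the standard algorithm, not the study's implementation: given two sets of corresponding 3D points (e.g., AprilTag positions seen from two cameras), it recovers the rotation and translation that best align them in the least-squares sense.

```python
import numpy as np

def kabsch(P, Q):
    """Estimate rotation R and translation t such that R @ P_i + t ~ Q_i.

    P, Q: (N, 3) arrays of corresponding 3D points.
    """
    # Center both point sets on their centroids.
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean
    # SVD of the 3x3 cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Correct a possible reflection to obtain a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

With noiseless correspondences the recovered transform is exact; with noisy tag detections it is the least-squares optimum.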
Since the placement of the headband slightly varies across subjects, we need to obtain a standard reference per subject to estimate the head pose from the AprilTags. For this purpose, we assume that the long-term average of the driver's head pose is consistent across subjects. We calculate the average orientation of the headband in the quaternion space using spherical linear interpolation (Slerp). We set the origin of the coordinate system as the average pose of the driver's head by subtracting the average head position per participant and multiplying by the inverse of the average rotation matrix to normalize the orientation. We also subtract the average head position value from each of the target marker's locations to apply a similar transformation to the target gaze. Finally, the ground truth gaze vector at a given instant is obtained by subtracting the head position from the target marker's location at the given instant, calculating the horizontal and vertical gaze angles from this vector.
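The final step, converting a head-to-marker vector into a pair of gaze angles, can be sketched as follows. The axis convention here is an assumption for illustration (z pointing from the driver toward the scene, x to the right, y upward); the study does not specify its exact convention.

```python
import numpy as np

def gaze_angles(head_pos, target_pos):
    """Horizontal and vertical gaze angles (radians) for the vector
    from the head position to a target marker.
    """
    v = np.asarray(target_pos, float) - np.asarray(head_pos, float)
    theta = np.arctan2(v[0], v[2])                 # horizontal angle
    phi = np.arctan2(v[1], np.hypot(v[0], v[2]))   # vertical angle
    return theta, phi
```

A marker straight ahead of the head yields (0, 0); markers to the side or above produce nonzero horizontal or vertical angles.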

IV. METHODOLOGY
This paper aims to create a probabilistic salient visual map describing the visual attention of the driver. We project this map onto the windshield creating spatial distributions for the gaze direction. Then, we map this visual map on the road camera, defining areas on the road where we predict the driver is directing his/her gaze. We can estimate confidence regions for the driver's visual attention by creating a probabilistic map, which is an appealing method with more practical applications than methods predicting a single point for the gaze direction. This section proposes our main method as well as three alternative baselines to obtain a probabilistic distribution of the gaze angle from the position and orientation of the driver's head. These methods estimate the probabilistic salient visual map for the horizontal and vertical angles by modeling the mean and variance of the gaze angles as a function of the position and orientation of the head.
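For a bivariate Gaussian over the horizontal and vertical gaze angles, the area of a confidence region follows directly from the covariance matrix. A minimal sketch (not the study's code; the constant 5.991 is the 95% quantile of the chi-square distribution with two degrees of freedom):

```python
import numpy as np

CHI2_95_2DOF = 5.991  # 95% quantile of chi-square with 2 degrees of freedom

def confidence_ellipse_area(cov, chi2_q=CHI2_95_2DOF):
    """Area of the confidence ellipse of a 2D Gaussian with covariance cov."""
    return float(np.pi * chi2_q * np.sqrt(np.linalg.det(np.asarray(cov))))
```

A confident prediction (small covariance) yields a small salient region, while an uncertain one covers a larger area, which is exactly the tradeoff between accuracy and spatial resolution studied in Section V.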

A. Gaussian Process Regression
Our proposed model is based on the original design implemented in our preliminary work [19]. It relies on Gaussian process regression (GPR) [41], which models the outputs as a Gaussian process with the covariance defined by a kernel function. The model assumes that any subset of the outputs follows a joint Gaussian distribution. Using the ground truth of the training data in the vicinity of each point, the model learns the uncertainty in the prediction of any test data. This method provides a promising and effective approach to learn the many-to-many relationship between the head pose and the gaze. It learns a gaze distribution as a function of the different head poses presented in the training data that are in the neighborhood of the target head pose. Let x ∈ R^d be the input of the system, where d is its dimension (in our case d = 6). Let Y be a Gaussian random process representing the output. If y = {y_1, y_2, y_3, ..., y_n} ⊂ Y is a vector representing n realizations (each y_i is a realization of Y), then y follows a joint Gaussian distribution with prior distribution f_y. The parameters μ and Σ are functions of x. As shown in Equation 2, the mean provides the deterministic component of the model, where ω ∈ R^d and ω_0 ∈ R are learned while training the models.
The probabilistic component is given by Σ, the covariance matrix. The covariance of any point x, calculated jointly with the input points x′ in the training set, is given by the kernel k(x, x′). The covariance is modeled using a squared exponential kernel (Eq. 3). The correlation is learned with respect to the input data from the training set in the neighborhood of the data of interest. This kernel imposes that the outputs will be more correlated to the points in the training data that are closer to the test input data, as the covariance matrix has higher values for points that are closer.
In Equation 3, the parameter σ_f represents the amplitude of the covariance. This parameter defines the autocovariance of the data points (i.e., k(x, x) = σ_f²). The parameter l represents the length scale, which defines how much the distance between the training and predicted data affects the cross-covariance between two data points. If l is large, k(x, x′) decreases slowly as the distance between the points (‖x − x′‖²) increases. These parameters determine the size of the confidence interval of our prediction, since the covariance matrix is a function of these parameters. We also explore the use of automatic relevance determination (ARD). Using ARD, the kernel learns a different length scale parameter for each input variable. The kernel function with ARD is given in Equation 4.
Using different values for l may be useful, since the input to our models includes the position and orientation of the head, which may have different scales. We learn the values for σ_f and l (or l_i) while training the models by maximizing the log-likelihood of the ground truth data in the train set.
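The equations referenced above follow the standard GPR formulation; written out for reference (a reconstruction consistent with the surrounding description, not necessarily the paper's exact notation):

```latex
f_{\mathbf{y}} = \mathcal{N}(\boldsymbol{\mu}, \Sigma)
  % (1) Gaussian prior over the outputs

\mu(\mathbf{x}) = \boldsymbol{\omega}^{\top}\mathbf{x} + \omega_0
  % (2) deterministic mean component

k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\lVert\mathbf{x}-\mathbf{x}'\rVert^2}{2l^2}\right)
  % (3) squared exponential kernel

k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{1}{2}\sum_{i=1}^{d}\frac{(x_i-x_i')^2}{l_i^2}\right)
  % (4) squared exponential kernel with ARD
```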
To obtain the posterior distribution from the prior model, the model is conditioned on the given training data. Let X_tr ∈ R^{L×d} be the training dataset and y_tr ∈ R^L be the output Gaussian random variables, where L is the number of frames in the train set. Let y_* be the random variable we are trying to estimate for the input vector x_*. From Equation 1, the joint distribution is given by Equation 5. Using Equation 5, the posterior distribution can be calculated with the conditional probability when y_tr = y_obs: f_{y_*|y_tr=y_obs} = N(μ̂_*, Σ̂_*), where y_obs is the ground truth value (i.e., the observed y). We use four different settings for the deterministic function. The first setting is a GPR model without the deterministic component (i.e., ω = 0, ω_0 = 0). The mean of the posterior distribution is purely estimated from the kernel function (Eq. 9).
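The conditioning step follows the standard Gaussian process posterior; a sketch of the referenced equations (reconstructed from the description, with K = k(X_tr, X_tr), k_* = k(X_tr, x_*), and k_** = k(x_*, x_*)):

```latex
\begin{bmatrix} \mathbf{y}_{tr} \\ y_* \end{bmatrix}
  \sim \mathcal{N}\!\left(
    \begin{bmatrix} \boldsymbol{\mu}_{tr} \\ \mu_* \end{bmatrix},
    \begin{bmatrix} K & \mathbf{k}_* \\ \mathbf{k}_*^{\top} & k_{**} \end{bmatrix}
  \right)
  % joint distribution of training outputs and the test output

\hat{\mu}_* = \mu_* + \mathbf{k}_*^{\top} K^{-1} (\mathbf{y}_{obs} - \boldsymbol{\mu}_{tr}),
\qquad
\hat{\Sigma}_* = k_{**} - \mathbf{k}_*^{\top} K^{-1} \mathbf{k}_*
  % posterior mean and covariance

\hat{\mu}_* = \mathbf{k}_*^{\top} K^{-1} \mathbf{y}_{obs}
  % posterior mean with no deterministic component (omega = 0, omega_0 = 0)
```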
The second setting is with a constant deterministic component. The model learns ω_0 as a single constant mean for the distribution (i.e., ω = 0). The third setting estimates both ω and ω_0 during training. We refer to this setting as the linear model. The fourth setting estimates the deterministic component of the model with a neural network (NN). We implement this approach by training NN(x) using backpropagation. The network is implemented with two hidden layers, following the architecture used for our second baseline (Fig. 6(a)). Then, we estimate the residual error, r(x) = y_obs − NN(x), which is modeled with the GPR formulation without the deterministic component (ω = 0, ω_0 = 0). The conditional mean for this implementation is given by Equation 10.
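The conditional mean of this fourth setting applies the same posterior formula to the residuals; a reconstruction of the referenced equation, consistent with the description above:

```latex
\hat{\mu}_* = \mathit{NN}(\mathbf{x}_*) + \mathbf{k}_*^{\top} K^{-1}\left(\mathbf{y}_{obs} - \mathit{NN}(X_{tr})\right)
  % NN provides the deterministic mean; the GP models the residual r(x)
```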
Using this framework, we learn two separate models for the horizontal angle (θ) and the vertical angle (φ). An important feature of our formulation is modeling the output as a heteroscedastic process, where the variance of the output salient map varies depending on the input variables. Therefore, the size of the probabilistic salient visual map increases for regions with higher uncertainty, and decreases when the model is confident in its prediction.
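A compact numpy sketch of the zero-mean setting illustrates this heteroscedastic behavior. It is illustrative only: the hyperparameters are fixed here rather than learned by maximizing the log-likelihood, and a small noise term is added for numerical stability.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, length=1.0):
    """Squared exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2.0 * length**2))

def gpr_predict(X_tr, y_tr, X_new, sigma_f=1.0, length=1.0, noise=1e-3):
    """Posterior mean and variance of a zero-mean GP at the points X_new."""
    K = se_kernel(X_tr, X_tr, sigma_f, length) + noise * np.eye(len(X_tr))
    K_s = se_kernel(X_tr, X_new, sigma_f, length)
    mean = K_s.T @ np.linalg.solve(K, y_tr)
    var = sigma_f**2 - (K_s * np.linalg.solve(K, K_s)).sum(axis=0)
    return mean, var
```

Near observed head poses the posterior variance shrinks, so the salient region is small; far from the training data the variance approaches the prior σ_f², widening the predicted region.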

B. Baseline Methods
We compare the model with three methods. Two of these baselines are based on standard regression functions designed with the mean square error loss. We adapted these regression models to create a probability map as the output by assuming a Gaussian distribution. For the third baseline, we explore a variation of the mixture density network (MDN) that uses the log-likelihood as the loss function to model the conditional probability density of the gaze given the input head pose. This section provides the details of these baseline models.
1) Linear Regression: The first baseline is the most basic regression model. The gaze is obtained as a linear function of the head pose parameters (orientation and position). The independent variables are the six degrees of freedom of the head corresponding to its position (x, y, z) and orientation angles (α, β, γ). Two separate models are created for the gaze angle in the horizontal and vertical directions. Equations 11 and 12 show the models, where θ_gaze and φ_gaze are the horizontal and vertical gaze angles, respectively:

θ_gaze = a_0 + a_1 x + a_2 y + a_3 z + a_4 α + a_5 β + a_6 γ (11)
φ_gaze = b_0 + b_1 x + b_2 y + b_3 z + b_4 α + b_5 β + b_6 γ (12)

This model is similar to the one trained in Jha and Busso [18], but instead of obtaining the gaze location, we obtain the angles representing the gaze vectors. To create a probability distribution as our prediction, we consider a Gaussian distribution with the mean value provided by the regression models. The variance is obtained from the mean square error estimated on the train data. Notice that this model is homoscedastic, with a constant variance across the data.
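A hedged numpy sketch of this baseline follows, assuming a plain least-squares fit; the helper names `fit_linear_gaze_model` and `predict_gaze` are ours, not the paper's.

```python
import numpy as np

def fit_linear_gaze_model(H, g):
    """Fit g = a0 + a1*x + ... + a6*gamma by least squares (Eq. 11/12 form).
    H: (N, 6) head pose [x, y, z, alpha, beta, gamma]; g: (N,) gaze angle.
    Returns the coefficients and the homoscedastic variance (training MSE)."""
    A = np.hstack([np.ones((len(H), 1)), H])      # prepend the intercept a0
    coef, *_ = np.linalg.lstsq(A, g, rcond=None)
    mse = np.mean((A @ coef - g) ** 2)            # constant output variance
    return coef, mse

def predict_gaze(coef, mse, h):
    """Return the Gaussian (mean, variance) describing the predicted angle."""
    mean = coef[0] + h @ coef[1:]
    return mean, mse
```

One such model would be fit per output angle (θ and φ), with the training MSE serving as the constant variance of the predicted Gaussian.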
2) Regression with neural network: For the second baseline, we design a neural network to perform the regression task. Figure 6(a) shows the model. The neural network contains two fully connected layers, each implemented with twelve nodes. The hidden layers use the rectified linear unit (ReLU) activation, and the output layer uses a linear activation. The neural network is optimized to minimize the mean square error between the true gaze angle and the gaze angle predicted by the model. Similar to our previous baseline model, the probabilistic distribution is obtained by assuming a Gaussian distribution for the output, where the mean is the predicted gaze, and the variance is estimated with the mean square error on the train data. This approach is also homoscedastic.
3) Neural Network for Density Estimation: The third baseline is inspired by the MDN proposed by Bishop [42]. MDN can directly learn the standard deviation of the output as a nonlinear function of the input data. MDNs are used to model the output as a Gaussian mixture model (GMM) by optimizing the log-likelihood function in Equation 14.
L_llk(y, π_k, μ_k, σ_k) = −log(p(y)) = −log( Σ_{k=1}^{M} π_k N(y; μ_k, σ_k²) ) (14)

The network has 3 × M nodes in the output layer, where M is the number of components. The outputs represent the component weights π_k, means μ_k, and standard deviations σ_k, with k ∈ {1, . . . , M}. Since we assume that our output is a single Gaussian distribution, we design a model with one component, reducing the number of parameters to two. Therefore, the output layer has two nodes that provide the mean μ and the standard deviation σ. Our objective is to predict a Gaussian distribution that maximizes the probability of the ground truth data. To achieve this, we use as our loss function the negative log-likelihood of the ground truth gaze with respect to the predicted mean and standard deviation.
The variable σ is obtained as the exponential of the corresponding output node to prevent the standard deviation from being negative. With this formulation, we estimate not only the mean, but also the variance of each prediction, providing an appropriate scaling of the uncertainty of each output prediction. This baseline is a heteroscedastic method, where the variance changes according to the input data. Figure 6(b) shows the network architecture, which has two hidden layers implemented with 12 nodes. The network uses the Adam optimizer [43] with a learning rate of r = 0.001, using mini-batches of size 32. The neural network is implemented in Keras [44] with TensorFlow [45] as the backend. The network is trained for 1,000 epochs, and the model with the minimum validation loss is chosen as the final model to be evaluated on the test set.
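With a single component, the MDN loss reduces to the Gaussian negative log-likelihood. A minimal numpy version of this loss is shown below for illustration; the paper's actual model is a Keras network whose two output nodes provide μ and log σ, and the function name here is an assumption.

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, sigma^2). The network
    outputs log_sigma, so sigma = exp(log_sigma) is always positive."""
    sigma = np.exp(log_sigma)
    return 0.5 * np.log(2 * np.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2
```

Minimizing this quantity over the training data lets the network trade off the predicted mean against a per-sample standard deviation, which is what makes the baseline heteroscedastic.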

V. EXPERIMENTAL EVALUATION
This section evaluates the proposed solution and the baselines to estimate the probabilistic salient visual map. The models are separately trained and evaluated for data collected in phase 1 (parked vehicle) and phase 2 (driving condition). The database is partitioned into train, test, and validation sets using a leave-one-driver-out cross-validation approach. Data from one subject are used for the validation set, data from one subject are used for the test set, and data from the remaining fourteen subjects are used for the training set. This approach is repeated sixteen times, and we report the results across the 16 folds. Note that all the data are, at some point, part of the test set.
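The cross-validation loop can be sketched as follows. The paper does not state how the validation driver is paired with each test driver, so this sketch assumes the next driver in a cyclic order; the function name is hypothetical.

```python
def leave_one_driver_out_folds(driver_ids):
    """Generate (train, validation, test) driver splits. In each of the
    len(driver_ids) folds, one driver is held out for testing, the next
    driver (cyclically, an assumed pairing) for validation, and the
    remaining drivers form the training set."""
    n = len(driver_ids)
    for i in range(n):
        test = driver_ids[i]
        val = driver_ids[(i + 1) % n]
        train = [d for d in driver_ids if d not in (test, val)]
        yield train, val, test
```

With 16 drivers this produces 16 folds of 14 training subjects each, and every driver appears exactly once as the test subject.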
We evaluate and compare the effectiveness of the baseline and proposed models by analyzing the predicted probabilistic salient visual map in terms of accuracy and spatial resolution. Accuracy is measured as the percentage of the target gaze directions included in a given confidence interval. If the majority of the data does not lie within the confidence interval, the model is not accurate. The spatial resolution determines how large the confidence interval is: if the confidence region is too large, the prediction is not very useful even if most of the data lies within the interval. To evaluate the spatial resolution of the system, we measure the size of the confidence interval created by each model. The outputs of the model are horizontal (θ) and vertical (φ) angles. Therefore, we express the area of the confidence region as the fraction of a sphere surrounding the driver's head. An ideal approach creates a confidence interval that is both accurate and small. To analyze the tradeoff between accuracy and spatial resolution, we present plots with the accuracy of our models at different spatial resolutions (Figs. 7, 8, 12).
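These two quantities can be computed as sketched below. The sketch assumes independent Gaussian predictions over (θ, φ) and a small-angle approximation that treats the confidence ellipse area in angular coordinates as a fraction of the 4π steradians of the sphere; the paper's exact area computation may differ, and the function name is ours.

```python
import numpy as np

def region_accuracy_and_area(theta_err, phi_err, sigma_theta, sigma_phi, p=0.95):
    """For Gaussian predictions over (theta, phi), compute (i) the fraction
    of ground-truth angles inside the p-confidence ellipse (accuracy) and
    (ii) the mean ellipse area as a fraction of the full sphere (spatial
    resolution). Errors and sigmas are in radians; sigmas may be arrays
    for heteroscedastic models."""
    # For a 2-D Gaussian, the p-quantile of the squared Mahalanobis
    # distance follows chi^2 with 2 dof: r^2 = -2 ln(1 - p).
    r2 = -2.0 * np.log(1.0 - p)
    m2 = (theta_err / sigma_theta) ** 2 + (phi_err / sigma_phi) ** 2
    accuracy = np.mean(m2 <= r2)
    # Ellipse semi-axes are r*sigma_theta and r*sigma_phi (small angles).
    ellipse_area = np.pi * r2 * sigma_theta * sigma_phi
    area_fraction = np.mean(ellipse_area) / (4.0 * np.pi)
    return accuracy, area_fraction
```

Sweeping p traces out the accuracy-versus-spatial-resolution curves of the kind shown in Figures 7, 8, and 12.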
The first evaluation considers different implementations of the GPR model with different parameters to establish the best method for our purpose (Sec. V-A). Then, we compare the best performing GPR models with the three alternative baselines (Sec. V-B). Next, we demonstrate the features of the model by projecting the confidence regions onto the windshield (Sec. V-C) and the road camera (Sec. V-D). Finally, we study the performance when we have the orientation of the driver's head, but limited information about the head's position, which is a possible scenario if regular cameras are used to estimate the head information (Sec. V-E). Figure 7 shows the accuracy of our model within different confidence intervals for the different GPR models. Figure 7(a) reports the results for phase 1 (parked condition) and Figure 7(b) reports the results for phase 2 (driving condition). We zoom into these figures between the 75% and 95% confidence intervals for better visualization. We observe that different models work better for the parked and driving conditions. We have shown that the relationship between head movements and gaze changes when a person is driving [18], which explains the difference in patterns across phases 1 and 2. During the parked condition, the best GPR model is the one whose deterministic part is implemented with a neural network, with a kernel implemented without ARD. We consistently observe lower performance when using ARD, regardless of the implementation of the deterministic function. During the driving condition, implementing the deterministic component with a linear model leads to the best performance. For this case, the use of ARD in the kernel function leads to improvements for all four conditions. Since the relationship between head pose and gaze is more ambiguous when driving, as noted in Jha and Busso [18], it is beneficial to add more flexibility to the model in phase 2.
For phase 1, adding a more powerful deterministic function is enough to achieve good performance.

A. GPR Model Selection
For the rest of the evaluation, we will consider the two GPR models that led to the best performance for phase 1 (GPR with neural network model without ARD) and phase 2 (GPR with linear model with ARD).

B. Comparison with Baselines
This section compares our proposed models with the three baselines described in Section IV-B: linear regression (LR), neural network regression (NN), and mixture density network (MDN).
1) Accuracy versus Spatial Resolution: Figure 8 shows the accuracy of our models at different spatial resolutions, compared with the curves of the different baseline models. We observe that both GPR models perform better than all the baseline models. They are consistently above the other curves, showing not only higher accuracies, but also smaller regions. The linear regression baseline has the highest performance among the baselines. Its values are consistently below our two implementations of the GPR models.
To quantify the spatial resolution of the models, Table I lists the area of the confidence interval at 50%, 75% and 95% accuracies for the baseline and GPR models. We observe that the areas of the confidence interval for the GPR models are smaller than the areas for the baseline models. In phase 1, GPR NN has the smallest area for the 95% confidence interval (3.76%). In phase 2, GPR Linear has the smallest area for the 95% confidence interval (3.77%). The ability to provide high accuracy within a small region makes the GPR models more efficient.
To quantify the accuracy of the models, Table II lists the accuracy observed when the fraction of a sphere surrounding the driver's head is 1%, 2% and 4%. This analysis quantifies the performance of the proposed and baseline models when their confidence intervals have the same area. We observe that we can get 86.2% accuracy in phase 2 within an area of 2% with the GPR Linear model. Similarly, in phase 1, we can obtain an accuracy of 83.7% with the GPR NN model.
2) Theoretical versus Empirical Cumulative Density Function: Since our proposed and baseline models assume that the predicted gaze follows a Gaussian distribution, it is important to analyze how well this Gaussian assumption holds with respect to the empirical distribution of the ground truth data around the predictions. For this analysis, we plot the fraction of the data observed within the confidence region (y-axis) as a function of the theoretical cumulative density function (CDF) of the region (x-axis). Figure 9 shows the results for the parked and driving conditions. Ideally, the curves should be as close as possible to the reference diagonal curve (black curve). We measure the absolute area between each curve and the reference diagonal curve using Equation 16. The legend in Figure 9 reports the results.
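The absolute area between the empirical calibration curve and the ideal diagonal (the role played by Equation 16) can be approximated with the trapezoidal rule. This is a sketch, not the paper's exact formula, and the function name is an assumption.

```python
import numpy as np

def calibration_area(nominal_cdf, empirical_fraction):
    """Absolute area between the empirical calibration curve and the
    ideal diagonal, via the trapezoidal rule. Smaller is better: 0 means
    the Gaussian assumption matches the observed data exactly."""
    nominal_cdf = np.asarray(nominal_cdf, dtype=float)
    gap = np.abs(np.asarray(empirical_fraction, dtype=float) - nominal_cdf)
    dx = np.diff(nominal_cdf)
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * dx))
```

A perfectly calibrated model yields an area of 0, while a degenerate curve pinned at 1.0 everywhere yields 0.5, so the metric is bounded and easy to compare across models.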
In the parked condition, the LR model is the closest to the reference diagonal curve. Since the distribution of the data is structured, a simple linear regression model with constant mean square error is enough to properly match the theoretical distribution for the confidence intervals, although with lower accuracies and spatial resolutions than our proposed models (Tables I and II). In the driving condition, the two GPR models are very close to the theoretical curve. Their absolute areas from the reference diagonal curve are smaller than the corresponding absolute areas for the baseline models. Therefore, the GPR models not only provide a better tradeoff between accuracy and spatial resolution, but also offer confidence intervals that are closer to the theoretical confidence intervals for the most important condition (phase 2).

C. Mapping the Confidence Regions onto the Windshield
This section projects the predicted confidence regions onto the windshield. We have the marker positions, which are used as the ground truth for the gaze. We only use the target markers #1 to #13 for this purpose (Fig. 1(b)). We model the windshield as a plane by fitting the best plane containing these thirteen points. Small errors are introduced because the windshield is slightly curved, so the points do not exactly lie on a plane. Therefore, when we project the original points back to the camera, they do not exactly match the target marker locations (Fig. 10). From the gaze angles (θ and φ), the gaze direction is obtained by estimating the line from the position of the head ([x_hp, y_hp, z_hp]) towards the direction provided by the gaze vector. We then estimate the point where this line meets the windshield plane; Equation 17 provides the projection used in the study. The probability density function at each point is calculated based on the probabilistic salient visual map created by the models. Figure 10 shows two examples of the confidence regions created with the GPR model. While these are just two examples, they are representative of the probabilistic salient visual maps created by the models. These figures also demonstrate how the predicted angles can be mapped onto real-world coordinates. The figure shows that the confidence regions in front of the driver are smaller than the confidence regions on the side of the windshield, signaling more uncertainty at the sides. The size of the regions is learned from the data.
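The core geometric step, intersecting the gaze ray with the fitted windshield plane, can be sketched as follows. This is an assumed helper for illustration, not the paper's Equation 17.

```python
import numpy as np

def project_gaze_to_plane(head_pos, gaze_dir, plane_point, plane_normal):
    """Intersect the gaze ray (starting at the head position, along the
    gaze direction) with the plane defined by a point and a normal.
    Returns the 3-D intersection point, or None when the ray is parallel
    to the plane or the plane lies behind the driver."""
    head_pos = np.asarray(head_pos, dtype=float)
    gaze_dir = np.asarray(gaze_dir, dtype=float)
    plane_point = np.asarray(plane_point, dtype=float)
    plane_normal = np.asarray(plane_normal, dtype=float)
    denom = np.dot(plane_normal, gaze_dir)
    if abs(denom) < 1e-9:
        return None                      # ray parallel to the plane
    t = np.dot(plane_normal, plane_point - head_pos) / denom
    if t < 0:
        return None                      # plane behind the driver
    return head_pos + t * gaze_dir
```

Evaluating the predicted probability density along such intersections yields the windshield-projected confidence regions of Figure 10.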

D. Mapping Confidence Regions onto the Road
We also projected the predicted confidence regions onto the road view. As explained in Section III-A, we asked three of the subjects to look at multiple targets on the road. For these cases, we approximate the gaze distribution on the road by projecting the confidence regions at different distances from the car, ranging from 10 to 200 meters in increments of 10 meters (i.e., 20 different projections). Then, we calculate the unweighted average of the probabilities for each pixel, creating a 2D visual map projected on the road camera. Figure 11 gives three examples, showing the driver's face, the road view, and the estimated salient visual map created with the GPR models. The target object is highlighted with a black ellipse. We observe that the GPR models perform reasonably well, providing a prediction around the target regions attracting the attention of the driver. Notice that in this study we only consider the position and orientation of the head.
We observe that the predicted probabilistic salient visual maps do not always include the true gaze target. These cases are useful to identify some limitations of our model when projecting the region onto the road. First, we add some distortion during the projections, as discussed before. Second, some subjects may rely on subtle eye movements that our models do not capture. Notice that our models do not use eye information, which most gaze detection systems designed for HCI in controlled environments rely on. Third, inter-driver variability can impact the results, as differences in height and driving behaviors can affect the relationship between head movements and gaze. In spite of these limitations, the results in this paper demonstrate that our models effectively capture the visual attention of the drivers by just modeling their head pose. We include a video with the results as a supplemental document.

E. Predicting Gaze Angle with Limited Head Pose Information
RGB cameras are the most common sensors used to capture driver data in the car. Since regular cameras lack depth information, it is not possible for algorithms to reliably estimate the head position in all three degrees of freedom. Our GPR models require this information to estimate the probabilistic salient visual maps. Therefore, we retrain our GPR models using only head orientation, or augmenting head orientation with partial head position. We consider two conditions. The first condition only considers the head orientation (i.e., a 3D vector). The second condition is head orientation plus the x and y position of the head, estimated with the AprilTag-based headband. These models are compared with the GPR models trained with the 6D vector, including the full orientation and position of the head. Figure 12 presents the results for the GPR models on the test set in phase 2 (driving condition). The GPR linear model with orientation and partial position information achieves results that are very close to the results achieved by the full model (Fig. 12(a)). The GPR NN model implemented with partial head position even outperforms the model with full information (Fig. 12(b)). We conclude that the distance between the driver and the camera is not critical to build an effective model. We hypothesize that this result is due to the reduced head movements along the z-direction observed while a driver is operating a vehicle. The performance of both models drops when they are exclusively trained with head orientation, indicating that some information about the head position is needed.

VI. CONCLUSIONS
This paper proposed a novel probabilistic model based on GPR to define a salient visual map that predicts the driver's gaze location. The proposed method predicts confidence regions containing the gaze direction of the driver using only the position and orientation of his/her head. The size of the confidence region is determined by the uncertainty of the model in the predicted region (heteroscedastic model). To demonstrate the potential use of the proposed method, we projected this salient visual map onto the windshield and the road images. The results demonstrated reasonable performance, suggesting that the proposed approach can play an important role in vehicle applications for safety, infotainment, and navigation. The study relies on a commercial dash camera that can be easily installed in regular vehicles. The device supports a WiFi connection, providing a perfect platform for real-time implementations.
The paper opens new areas of research. Once the salient visual map is created and projected onto the road scene, we can leverage computer vision algorithms to detect target objects within the highlighted area. We can look more closely at the region and predict the objects at which the driver is directing his/her gaze (other vehicles, pedestrians, or billboards). We can also determine important objects that a driver fails to look at, creating an appropriate warning. For example, an ADAS can identify cases where the driver is performing a maneuver without paying attention to other vehicles. The salient visual map can also be helpful in designing smart infotainment systems. We can incorporate this information to resolve ambiguous queries, reducing the cognitive load of the driver.
There are open challenges in accurately predicting the six degrees of freedom of the driver's head in real driving conditions. We use AprilTags for this purpose in our analysis. Using a single RGB camera, it can be difficult to track the head pose in a vehicle when the rotation is higher than a given threshold (e.g., when the face is not completely visible [13]). We also need to predict the distance of the driver's head from the camera, which affects the gaze angle. To address this challenge, we are working on using depth sensors to reliably estimate the orientation and position of the head [37]. One of the limitations of the headband used in this study is that it covers parts of the face. We are working on an alternative design that addresses this limitation [46]. This type of data collection protocol can serve as a valuable resource to train and evaluate head pose algorithms in real driving scenarios, advancing algorithm development in this area.