Attention Span Prediction Using Head-Pose Estimation With Deep Neural Networks

Automated human pose estimation is an emerging research area in human activity detection. It underpins applications such as malpractice detection in examinations, distracted-driving detection, and gesture detection, all of which require robust and reliable pose estimation techniques. These applications map the attention of the user from head pose estimation (HPE) metrics, supported by emotion and gaze analysis. This paper addresses the problem of attention score estimation with HPE. The proposed method is easy to implement and estimates head pose from 68 facial landmarks. To attain reliability and precision, head pose estimation is formulated as a regression task. The coordinate pair angle method (CPAM) is combined with deep neural network (DNN) regression and elastic net regression. The DNN improves precision on low-light, distorted, or occluded images. CPAM leverages facial landmark detection and angular differences to estimate head pose. Experimental results show that the proposed model can handle large datasets, real-time data processing, significant pose variations, partial occlusions, and diverse facial expressions with a mean absolute error (MAE) of 3° or less. The proposed system was evaluated on three standard databases: the 300W across large poses (300W-LP) dataset, the annotated facial landmarks in the wild (AFLW2000) dataset, and the National Institute of Mental Health Child Emotional Faces Picture Set (NIMH-ChEFS) dataset. The results are on par with recent state-of-the-art methodologies such as anisotropic angle distribution learning (AADL), the joint head pose estimation and face alignment algorithm (JFA), and the rotation axis focused attention network (RAFA-Net), which report MAEs ranging up to 6°. The paper achieves promising results for attention span prediction using head pose estimation, with many possible future applications.


I. INTRODUCTION
The need to map the attention span of users has become pressing in recent decades in almost all spheres of industry, including education [1], [2], medicine [3], [4], advertising [5], marketing [6], and many more. To face these real challenges of the digital world, research should focus on rapid estimation of head pose angles and overcome the problems of lighting conditions [7], blurring [8], occlusion [9], and environmental conditions [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Junhua Li.
Detecting the malpractices in an e-learning based environment [11], estimating the attention of students in a traditional classroom environment [12], and driver distraction estimation are the recent popular case studies of head pose applications [13].
Recently proposed HPE methods are not robust enough to handle real-world datasets and applications. Testing in a real-world environment involves a diversified collection of images captured under different environmental conditions and differing in age, ethnicity, culture, and more. It is claimed that a convolutional neural network (CNN) is one of the best algorithms for working on real-world datasets [14].
Usually, the dataset is assumed to be precise, and the images used for evaluation are assumed to be well aligned. Practical applications prove these assumptions invalid. To overcome these limitations, a coping methodology is proposed in [15], which uses a deep convolutional neural network (DCNN) to handle under-sampling and uncertainty. That method uses dense sampling intervals with multivariate labeling distributions (MLDs) to represent the head pose angles of the input face image.
The unstructured behavior of a dynamic environment makes focus-of-attention (FOA) estimation challenging; it gives rise to poor quality, occlusion, low-resolution images, disturbance, and a non-linear relationship between head pose angle and ground truth value [16]. This research gap encouraged us to incorporate HPE-based attention mapping. To make the model generic across multiple applications of head pose estimation, [17] proposed applying transfer learning to existing models with minor tweaks, making model training easier and more diverse.
Some methods have used embedded attention model systems [18] to enhance the performance of feature expression in multi-level classification with soft stage-wise regression, reducing the number of neurons in each subsequent layer. The study [21] suggests that head pose estimation requires a robust solution to deal with real-world data. The selection of an accurate training dataset contributes substantially to the efficiency of the model. Head pose estimation and facial landmark detection have revolutionized the 3D modeling of image datasets [19]-[21].
Head movement has become an essential part of studies involving behavior monitoring. This paper proposes a novel way to estimate head pose for video surveillance and innovative human pose applications using state-of-the-art techniques and to map the estimates to the user's attention score, which is visualized as a concentration map towards the end of the paper. The proposed methodology was evaluated against classical open-access datasets, namely 300W-LP, AFLW2000, and NIMH-ChEFS. Finally, the resultant attention map was visualized on the LIRIS Children Spontaneous Facial Expression Video Database.
The main challenges commonly faced in HPE datasets are extreme pose variations, overlapping key points, blending of the head with the environment, video quality deterioration and noise, and blurred or occluded images with poor lighting conditions. The state-of-the-art models proposed in the literature require a huge number of parameters for training and hence do not provide robust solutions.
This research focuses on giving a mathematically inclined, lightweight solution to estimate head pose in a real-time environment and produce immediate attention span results over a period of time. The proposed methodology is robust, can detect faces at a distance, and is compatible with current devices.
This research paper makes the following significant contributions:
1. A module to map the user's attention using a robust regression-based model merged with CPAM for HPE.
2. A model that relies on 68 facial landmark locations and estimates head pose through angular difference.
3. A comparative study of the proposed method with traditional regression-based approaches with extensive training parameters, providing a sophisticated solution to estimate head pose angles on benchmark datasets, i.e., 300W-LP, AFLW2000, and NIMH-ChEFS.
4. A baseline approach that extracts facial keypoint locations and employs ElasticNet and sequential models for regression of pose angles.
5. A model to map the user's concentration level over a time frame and produce an attention score.
The rest of the paper is divided into six sections: Section 2 presents related work, Section 3 describes the proposed methodology, and Section 4 discusses the datasets. Section 5 describes experiments and results, and Section 6 explains the attention span calculation with a discussion of the findings. Section 7 presents the conclusion and some future research directions.

II. RELATED WORK
Head pose estimation has been a highly researched topic for the past 50 years. It is fascinating how the research has evolved into shaping this study with innumerable methodologies and presented us with several critical findings in this domain. Head pose estimation can be represented by Euler angles θx, θy, and θz denoting pitch, yaw, and roll, respectively, which is the head's rotation in the X, Y, and Z-axis [21] as presented in figure 1. HPE involves the movement of the jaw and facial muscles in all three degrees of freedom [22].
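As an illustration of the Euler-angle representation, the following minimal Python sketch builds a rotation matrix from pitch, yaw, and roll. The composition order Rz·Ry·Rx, the right-handed axis convention, and the function names are assumptions made for this example; the text above does not state which convention is used.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_euler(pitch_deg, yaw_deg, roll_deg):
    """Build a rotation matrix from Euler angles given in degrees.

    Assumed convention: pitch about X, yaw about Y, roll about Z,
    composed as R = Rz(roll) * Ry(yaw) * Rx(pitch).
    """
    p, y, r = (math.radians(a) for a in (pitch_deg, yaw_deg, roll_deg))
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(p), -math.sin(p)],
          [0.0, math.sin(p), math.cos(p)]]
    ry = [[math.cos(y), 0.0, math.sin(y)],
          [0.0, 1.0, 0.0],
          [-math.sin(y), 0.0, math.cos(y)]]
    rz = [[math.cos(r), -math.sin(r), 0.0],
          [math.sin(r), math.cos(r), 0.0],
          [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))


# Under this convention, a pure 90-degree yaw maps the X axis onto -Z.
R = rotation_from_euler(0.0, 90.0, 0.0)
x_axis_after = [row[0] for row in R]  # first column = image of (1, 0, 0)
```

Other composition orders are equally common in the HPE literature, so angles produced under one convention should not be compared directly with those produced under another.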
The review work is divided into five parts based on the research papers studied for this work.

A. REGRESSION-BASED METHODS
Regression is used to map out the relationship between dependent and one or more independent variables. Regression focuses on fitting the regression model on the labeled dataset and predicting pitch, yaw, and roll angles after training, whereas classification focuses on classifying data in discrete bins.
Several studies use regression models for head pose estimation. Many methods have incorporated head pose estimation without any training of neural networks. In [24], the web-based model is combined with the regression model to estimate head pose. In [25], the model stores facial features in a quad-tree-based representation to estimate HPE angles. The tree data structure is adapted to store a landmark-based representation of face orientation and subdivides it into quadrants to estimate head pose.

B. FACIAL FEATURE RECOMMENDATIONS
Many approaches estimate head pose from facial feature extraction. Head pose estimation and landmark localization are highly correlated, and many studies explore the optimization of one with respect to the other.
In [26], the face is partitioned into seven different facial features, and accurate head pose estimation is done using DCNN, as shown in figure 2. Facial landmarks have also been evaluated through a heatmap generator in a feedforward neural network [27]. The five facial key points of the face, mainly the eyes, ears, and nose, as depicted in figure 3, are processed in 2D soft localization heatmap images. The results are passed to a convolutional neural network to predict the head-pose using regression in [28].
A ConvNet model, RealHePoNet, is trained on low-resolution grayscale input images to predict tilt and pan angles without using any facial landmarks in [29]. Pose-invariant face recognition (PIFR) proposed in [31] uses a geometric approach for HPE. Frontal face images are constructed using these estimations. In [32], local CNN features for shape and residual pose are predicted by a local neural network (LNet), whereas a global neural network (GNet) does the landmark localization for pose estimation in the proposed Joint Head Pose Estimation and Face Alignment Framework (JFA) algorithm. It produces output for both head pose estimation and face alignment by exploiting the global and local CNN features. More work in this domain suggests combining the HPE angles with Kalman filters, giving high accuracy in handling extreme head poses [33].
In separate work, [34] used a similar approach by proposing an application of cascaded random forests on nine sub-space classifications of the head pose, with each specific space trained with a global shape constraint. This classification-based method efficiently handled the problem of significant pose variations.

C. GEOMETRICAL HPE
Some methods incorporate a geometrical and mathematical way of estimating head pose. Anisotropic angle distribution learning (AADL) in [35] suggested that, while increasing the fixed central pose and angle interval, the image variations of yaw and pitch angle increased and then started decreasing for yaw angle, creating variations in the two angles. This method worked well on blurry images caused by movement, missing-pose images, and occluded images.
Rotation Axis Focused Attention Network (RAFA-Net) [36] explores the significance of spatial structures (fine-grained to coarse) using a self-attention layer and concentrated spatial pooling, and combines the results to form detailed semantic information of a given rotation axis to overcome the limitations of facial landmark localization modules.
Vectors in a rotation matrix are represented as HPE angles and are used to develop a new neural network-based representation. This vector-based approach [37] also suggests using Mean Absolute Error of Vectors (MAEV) to evaluate the precision, as it reflects the actual behavior of profile views better than MAE.

D. 3D TRACKING FOR HPE
3D tracking-based methods have gained recent popularity in head pose estimation, mostly in combination with the evaluation of synthetic data and images. Single images can't estimate head pose, as the images need to be mapped using 2D and 3D spaces. In [38], 3D feature points are mapped with depth information to get features from color images using the optimization extended LM (Levenberg-Marquardt) method. This method overcomes the limitations of using RGBD images for direct HPE estimation.
A CNN is used to reconstruct the input image of the head into a personalized 3D model; a critical point loss is combined with a Euclidean asymmetric loss and applied in the reconstruction phase. An optimization algorithm is then proposed to map the 3D face model to a 2D input head image [39]. The basic idea is to reconstruct a 3D model from the 2D input image to make the estimation efficient. Monocular system and optical flow analysis [39] reconstruct a 3D head pose from 2D salient points.
The main essence of consumer technology (CT) is to map accurate 3D pose estimation from the given 2D image.
Experiences like user engagement, immersive audio, and user attentiveness can be supported seamlessly. Annotating the training or ground truth data is the essential part of the process and cannot be compromised on [40].

E. LOW QUALITY AND OPTIMIZATION
To deal with the limitations of low-quality images like noise, poor resolution, and occlusion [41], the Restricted Boltzmann Machine (RBM) model is proposed to map landmark locations to a 3D model by projection module and later predicts head pose with KL-divergence method and a gradient method. For optimizing the existing HPE methods, a key observation focuses on increasing the scale of bounding boxes in the captured image dataset to get more information from the input data and provide better results, as proposed in [41]. Some poor-quality images are shown in figure 4.
They also suggest that choosing the correct loss function can majorly impact accuracy, as they have incorporated a pretrained ResNet50 as the backbone of their CNN. Pose from Orthography and Scaling with Iterations (POSIT) and weighted POSIT (wPOSIT) algorithms are used to optimize existing state-of-the-art methodologies for estimating head pose [42].

III. PROPOSED METHODOLOGY
In this paper, head pose estimation is carried out by the sequential and ElasticNet regression models. The overview diagram is depicted in figure 5, where facial landmark determination is done initially. Then the CPAM model and sequential and ElasticNet regression models were implemented to calculate the desired outcomes for the head pose. Figure 6 gives a detailed layout of the workflow of head pose estimation for attention span.
The sequential regression model is a linear stack model customized to give optimal results for HPE. The ElasticNet model primarily focuses on bridging L1 and L2 losses and overcoming dependency on the given sample data.

A. REGRESSION-BASED METHODS
Head pose estimation is the task of detecting, from a camera image or video sequence, the position of a head in 3D space with respect to its surroundings. The space is relative to the camera [43]. The main goal of head pose estimation is to obtain the three-dimensional Euler angles: pitch, roll, and yaw.

B. HEAD POSE ESTIMATION WITH CPAM
The Coordinate Pair Angle Method (CPAM), shown in figure 7, uses 68 facial landmark coordinates, which are coordinate points (x, y) on the given input image that outline the structure of the facial features given in Table 3. From the landmark coordinates, we calculate the angular difference with respect to the x-axis for each pair.
The functional diagram for CPAM and the proposed methodology followed for the implementation are shown in figure 7.
Hence, the proposed method consists of two steps:
• Extraction of 68 facial landmark coordinates from the given input face.
• Calculation of the angular difference between each pair of coordinates, taken w.r.t. the x-axis (θ = 0).
Consider a point A with coordinates (x1, y1) and a point B with coordinates (x2, y2), as in equations 1 and 2. The angle between these two points, calculated in equation 3, provides the base metric of HPE in CPAM w.r.t. the x-axis. Likewise, the angular difference was calculated for every pair. As a result, a total of (68 * 67/2) = 2278 features were obtained.
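The second step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cpam_features, the use of atan2 (which stays defined when two landmarks share the same x value), the output in degrees, and the dummy landmark grid are all choices made for this example rather than details taken from the paper.

```python
import math
from itertools import combinations


def cpam_features(landmarks):
    """Return, for every unordered landmark pair, the angle (in degrees)
    that the line from the first point to the second makes with the x-axis."""
    features = []
    for (x1, y1), (x2, y2) in combinations(landmarks, 2):
        # atan2 is an assumed choice; it avoids division by zero when x1 == x2.
        features.append(math.degrees(math.atan2(y2 - y1, x2 - x1)))
    return features


# Hypothetical stand-in for a landmark detector's output: 68 distinct points.
dummy_landmarks = [(float(i % 10), float(i // 10)) for i in range(68)]
features = cpam_features(dummy_landmarks)
print(len(features))  # 68 * 67 / 2 = 2278 pairwise angles
```

The resulting 2278-element vector is the kind of input that the regression models described later would consume.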

C. CPAM FOR HPE
The output array obtained through the CPAM is compared with the results of the regression model using DCNN and ElasticNet. Each of the features present in the array corresponds to an angular difference between the two coordinates.
The vector extracted from the input image gives the pose, i.e., yaw, pitch, and roll. Three regression models are built, one for each axis.
Regression model prediction results for pitch, yaw, and roll are as follows:
The AFLW2000 dataset shown in figure 9 offers images and corresponding annotations for about 2000 identities in AFLW (Annotated Facial Landmarks in the Wild). The 3D model is used for reannotating 68 3D landmarks. AFLW contains 25,000 RGB photos of faces taken from the social network Flickr. The sample images are highly diverse and comprise different lighting conditions, postures, occlusions, attributes, and emotions. Since it is highly diverse, we consider this dataset for testing our model. Table 1 depicts the standard spread of pitch, yaw, and roll angles on the 300W-LP dataset, AFLW2000 dataset, and NIMH-ChEFS dataset.

The National Institute of Mental Health Child Emotional Faces Picture Set (NIMH-ChEFS) dataset, shown in figure 10, is a high-quality dataset of 482 color photographs depicting the emotions of a wide range of children. The gaze directions are either direct or averted.

A. MODEL SPECIFICATIONS
This methodology focuses on mapping 3-dimensional points from 2D image data. Both the sequential and ElasticNet models are evaluated to form a comparative study. Table 2 gives the parameters used for training.
The input is a face image given to the predictor; the output is 68 facial landmark locations. Pi is the pair of Cartesian coordinates (xi, yi), where i ranges over [1, 68]. The distribution of the coordinates over various face classes is depicted in Table 3. The output coordinates are the pixel locations of the landmarks. The obtained coordinate points are based on an ensemble of regression trees [47]. Traditionally, the mean human head model is used to map 2D facial key points to the 3D model of the targeted input image. In contrast, [48] suggests that landmark detection is a fragile method and instead implements a multi-loss convolutional neural network with both classification and regression models.

B. METHODOLOGY
1) COLOR/GRAYSCALE FIGURES
Supervised machine learning uses "labeled" datasets for training. It aims to map an input variable x to the output variable y through a function f(x) that best fits the given input data. Classification and regression are the two types of supervised learning [49]. This paper uses the regression technique. A regression model usually works on continuous numeric data to produce a continuous output [50]. Here, the proposed approach leverages the sensitivity of the regression model to build a best-fit prediction model and estimate the unknown parameters for the three degrees of freedom: the pitch, yaw, and roll head pose angles.

2) DEEP NEURAL NETWORK REGRESSION
Neural Networks aim to mimic the human brain to understand the underlying association and patterns between a set of parameters [51], [52].
As shown in figure 11, a deep neural network uses layers to process the input data and identify underlying patterns. The number of nodes in the input layer equals the number of input features. The hidden layers contain a variable number of nodes. The output layer has one node for a regression task and several for a classification task. The essential function of every node, shown in figure 12, is to apply a weight to each incoming feature and add a bias. The result is then passed through an activation function such as ReLU, sigmoid, or tanh and forwarded to the nodes of the next layer. With its strong computational ability and robust design, the deep learning model keeps the features with maximum weight and contribution and discards the rest. It gives results based on a series of non-linear operations [33].
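The node computation described above (weighted sum, bias, activation) can be illustrated with a small pure-Python forward pass. The layer sizes, weights, and biases below are arbitrary toy values chosen for this sketch and are not the parameters of the paper's network; the single linear output node reflects the regression setting.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def identity(z):
    """Linear activation, typical for a regression output node."""
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each node computes activation(sum(w * x) + b) over the incoming features."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Toy network: 3 input features -> 2 hidden nodes (ReLU) -> 1 output node.
x = [1.0, -2.0, 0.5]
hidden_w = [[0.5, -0.25, 1.0], [-1.0, 0.5, 2.0]]
hidden_b = [0.0, 0.5]
out_w = [[2.0, 1.0]]
out_b = [-0.5]

hidden = dense_layer(x, hidden_w, hidden_b, relu)    # [1.5, 0.0]
output = dense_layer(hidden, out_w, out_b, identity)  # [2.5]
print(hidden, output)
```

The second hidden node receives a negative pre-activation and is zeroed by ReLU, which shows how a node's contribution can be suppressed before it reaches the next layer.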

3) ELASTICNET REGRESSION
ElasticNet, known for incorporating the penalties from both L1 and L2 regularization [55], is one of the most used models for regression tasks. This model gives reliable output when features are highly correlated and makes a precise selection through cross-validation [56].
In equation 4, wj is the weight of the j-th feature, n is the number of features in the dataset, and λ1 and λ2 are the regularization parameters for the L1 and L2 penalties, respectively.
The terminology is derived from "The Elements of Statistical Learning": a hyper-parameter "alpha" (α) assigns the weight given to each of the L1 (lasso) term and the L2 (ridge) term [57]. ElasticNet can handle on the order of a million parameters in training. Because the penalty reduces to a pure L1 or a pure L2 penalty when alpha is 0 or 1, the model can be tuned between lasso-style feature selection and ridge-style shrinkage, which gives it an edge. For instance, if α is 0, the penalty function is purely L1, and if α is 1, it is purely L2. Thus, optimizing the function depends on the value of alpha [58].
For example, α = 0.5 gives each penalty a 50% contribution to the loss function. This may shrink some coefficients to exactly 0, giving a sparse result.
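The role of α can be made concrete with a short Python sketch of the mixed penalty term alone. The function name, the single strength parameter lam, and the toy weights are assumptions for illustration; the direction of α follows the text above (α = 0 is pure L1, α = 1 is pure L2). Note that some libraries use the opposite direction: in scikit-learn's ElasticNet, for instance, l1_ratio = 1 corresponds to a pure L1 penalty.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Mixed penalty: lam * (alpha * L2 term + (1 - alpha) * L1 term).

    Convention assumed here, following the surrounding text:
    alpha = 0 -> pure L1, alpha = 1 -> pure L2.
    """
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return lam * (alpha * l2_term + (1.0 - alpha) * l1_term)


weights = [0.5, -2.0, 0.0, 1.5]  # toy coefficients, not from the paper

print(elastic_net_penalty(weights, lam=1.0, alpha=0.0))  # 4.0  (pure L1)
print(elastic_net_penalty(weights, lam=1.0, alpha=1.0))  # 6.5  (pure L2)
print(elastic_net_penalty(weights, lam=1.0, alpha=0.5))  # 5.25 (equal mix)
```

This penalty is added to the data-fit term of the loss; only the L1 part can drive coefficients exactly to zero, which is why intermediate values of α can still yield sparse models.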

C. RESULTS
The proposed methodology was tested with the 300W-LP, AFLW2000, and NIMH-ChEFS datasets to obtain a comparative study. Each dataset was split into training and test sets in a 70:30 ratio, as described in the implementation section. Mean absolute error (MAE) is chosen as the evaluation metric for the proposed approach. The MAE measures the average absolute difference between the ground truth poses and the predicted values. In equation 5, yj is the true angular value of the pose, and ŷj is the predicted pose.
Table 4 and Table 5 show a comparative study of our proposed model, i.e., the CPAM-regression method, evaluated on the 300W-LP, AFLW2000, and NIMH-ChEFS datasets. Table 6 displays the results on the NIMH-ChEFS dataset. The NIMH-ChEFS dataset is mainly used to evaluate emotion detection; hence, this paper's HPE evaluation on NIMH-ChEFS is unique.
Figure 13 displays a graphical comparison of our model with other state-of-the-art methodologies on the AFLW2000 dataset. We observe that CPAM-DNNR and CPAM-ENR have exceptionally low MAE and hence are the best-fit models compared to Hyperface, Multi-Loss Resnet, QuatNet, and more. Figure 14 shows the estimated head pose of images from the NIMH-ChEFS dataset using CPAM-DNNR regression (left) and CPAM-ENR regression. Evaluations suggest the error to be less than 3°.
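For reference, the MAE metric used in this evaluation can be computed as in the following Python sketch. The angle values are made-up numbers for illustration only and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_j - y_hat_j| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground-truth and predicted yaw angles, in degrees.
true_yaw = [10.0, -25.0, 40.0, 0.0]
pred_yaw = [12.0, -22.0, 37.0, 1.0]

print(mean_absolute_error(true_yaw, pred_yaw))  # (2 + 3 + 3 + 1) / 4 = 2.25
```

In an HPE setting the same function would be applied separately to pitch, yaw, and roll, and the three values are often averaged into a single figure.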

D. DISCUSSION
The result comparison in Figure 13 shows that our model outperforms most of the optimal state-of-the-art methods, such as QuatNet [14], [59], Hyperface [60], and Multi-Loss Resnet [61], [62], for HPE on the AFLW2000 dataset. The NIMH-ChEFS dataset considers the emotions of children along with HPE. Our novel method has thus been tested on a wide range of diverse real-world images with the challenging limitations of occlusion, low quality, and distorted data, and yet produces a result with an MAE of 3° or less. This work underlines the importance of considering facial key points, which are a crucial indicator of the head pose [43].

VI. CALCULATION OF ATTENTION SPAN
A. ATTENTION MODULE METHODOLOGY
The concentration level of an individual at a given time while looking at a screen depends on their head orientation. Hence, concentration level and head orientation are correlated. While looking at a screen, pitch and yaw are the two main axial angles that play a vital role in predicting the concentration level. The attention span of an individual while looking at a screen can be determined with the average concentration level over a span of time. Hence, first, it is essential to formulate a methodology for getting the concentration level of an individual at a given frame with given yaw and pitch value. After getting the concentration level of each frame, its attention span was determined, as shown in figure 15.

B. ATTENTION SCORE IMPLEMENTATION
First, the score factors of the yaw angle and the pitch angle are calculated with the get_score function. Then the concentration level (Ci) at a given frame i is calculated from the yaw and pitch scores. The score ranges over [0, 2] for each angle; hence, the concentration level at each frame lies in the range [0, 4]. The proposed function to map the user's attention span from HPE can be seen in figure 16.
After getting the concentration level for each of the n frames, the attention span score is determined with equation 6. Next, the concentration level is classified based on the attention score into four categories: no concentration, low concentration, medium concentration, and high concentration. Refer to Table 7 for mapping details.
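The pipeline described in this subsection can be sketched in Python as follows. Everything numeric here is a hypothetical placeholder: the piecewise-linear shape of get_score, its 10° and 40° thresholds, the summation of the two scores, the averaging over frames, and the equal-width category boundaries are assumptions made for illustration, since the actual function of figure 16 and the mapping of Table 7 are not reproduced in the text.

```python
def get_score(angle_deg, full=10.0, cutoff=40.0):
    """Hypothetical score in [0, 2]: 2 near frontal, falling linearly to 0."""
    magnitude = abs(angle_deg)
    if magnitude <= full:
        return 2.0
    if magnitude >= cutoff:
        return 0.0
    return 2.0 * (cutoff - magnitude) / (cutoff - full)


def concentration_level(yaw_deg, pitch_deg):
    """Per-frame concentration C_i in [0, 4], assumed to be the sum of both scores."""
    return get_score(yaw_deg) + get_score(pitch_deg)


def attention_span_score(frames):
    """Assumed form of the attention span score: mean of C_i over the n frames."""
    levels = [concentration_level(yaw, pitch) for yaw, pitch in frames]
    return sum(levels) / len(levels)


def classify(score):
    """Hypothetical equal-width bins standing in for the paper's Table 7."""
    if score < 1.0:
        return "no concentration"
    if score < 2.0:
        return "low concentration"
    if score < 3.0:
        return "medium concentration"
    return "high concentration"


# Hypothetical (yaw, pitch) estimates in degrees for four video frames.
frames = [(0.0, 5.0), (25.0, 0.0), (60.0, 45.0), (10.0, -10.0)]
score = attention_span_score(frames)  # (4 + 3 + 0 + 4) / 4 = 2.75
print(score, classify(score))
```

Plotting the per-frame concentration levels against the frame index would give a concentration-over-time curve of the kind discussed in the visualization subsection.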

C. VISUALIZATION OF RESULTS
To build intuition for the methodology discussed above, the LIRIS children dataset was used for visualization. The LIRIS dataset contains 208 videos of 12 students between the ages of 8 and 12. For visualization, the concentration level for each frame (Ci) and the attention span score were first calculated from these data. Then, the concentration level versus frame graph was plotted to visualize the student's attention span.
Consequently, three videos were taken to visualize concentration in different scenarios while they showed different emotions. Figure 18 shows the attention graph of these children.

VII. CONCLUSION
Estimating attention span in an online setting using head pose estimation is a challenging research problem. In this paper, ElasticNet and DCNN were separately used for HPE; the results obtained were then combined with the proposed methodology, i.e., the coordinate pair angle method (CPAM), to provide high-precision results. The CPAM methodology exploits geometry to estimate head poses. When applied to standard datasets, the proposed methodology showed results on par with present state-of-the-art methodologies, with an MAE of 3° or less in contrast to a standard of 6° MAE. Furthermore, the research formulated a method for predicting the attention span of an individual using head pose estimation, which can be used to visualize the concentration level over the entire period. The future scope of this project targets an eye gaze prediction module, building on these findings and on the estimation of roll and yaw orientations with image processing methods like the Viola-Jones algorithm. The authors intend to integrate head pose estimation with gaze direction determination to propose an attention score. This research can be used in educational setups to estimate the understanding of the user; it can also be used in assistive driving environments to map the user's attention on the road. Further applications include estimating concentration scores of users in medical setups for the prediction of anxiety, mental disturbance, and depression, and mapping attention in the workplace environment. HPE and user emotion can help gauge engagement in social media advertising settings and improve sales strategies. Hence, this research is dynamic and can play a vital role in many applications.