Towards Robust and Unconstrained Full Range of Rotation Head Pose Estimation

Estimating the head pose of a person is a crucial problem for numerous applications, yet it is mainly addressed as a subtask of frontal pose prediction. We present a novel method for unconstrained end-to-end head pose estimation to tackle the challenging task of full range of orientation head pose prediction. We address the issue of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This allows the network to efficiently learn full rotation appearance and to overcome the limitations of the current state-of-the-art. Together with newly accumulated training data that provides full head pose rotation data and a geodesic loss approach for stable learning, we design an advanced model that is able to predict an extended range of head orientations. An extensive evaluation on public datasets demonstrates that our method significantly outperforms other state-of-the-art methods in an efficient and robust manner, while its advanced prediction range allows the expansion of the application area. We open-source our training and testing code along with our trained models: https://github.com/thohemp/6DRepNet360.


INTRODUCTION
Head pose estimation follows the objective of predicting the human head orientation from images and is a crucial step in many computer vision algorithms. Applications are wide-ranging and include attention estimation [1], [2], [3], face recognition [4], [5], and the estimation of facial attributes [6], [7], which in turn are vital features in driver assistance systems [8], [9], [10], augmented reality [11], [12], and human-robot interaction [13], [14], [15]. The vast majority of present methods [16], [17], [18], [19], [20], [21], [22], [23] narrow down the research issue to the estimation of solely frontal poses with a limited rotation range. This allows them to leverage the feature-richness of the face and suitable, widely available training datasets. However, in uncontrolled application scenarios [24], [25], [26], head orientations are likely to surpass the narrow angle range that most methods are trained for, which consequently leads to random and inaccurate head pose predictions. With a view to extending the prediction to the full range of rotation, the current state of research is challenged by two key limitations. The first is the absence of comprehensive datasets that cover the full range of head orientations [27]. The second, equally decisive and often neglected, factor is an appropriate rotation representation, as it significantly impacts the model's ability to effectively learn the connection between visual pose appearance and the corresponding parameterization [28]. For instance, the commonly used Euler angle and quaternion representations suffer from ambiguity and discontinuity problems that lead to an unstable training process and a mediocre prediction performance if plainly applied [16], [19], [23], [29]. This behavior even intensifies for stronger rotations in the narrow range spectrum.
We overcome these limitations by proposing a rotation matrix-based 6D representation for efficient and unconstrained network training that we further enhance with a geodesic-based loss. Additionally, we take up the ambitious challenge of predicting the full range of rotation by agglomerating new training data with enhanced pose variation. For this purpose, we utilize the CMU Panoptic [30] dataset and apply an automatic head pose labeling process to generate head pose samples with a focus on the back of the head. We combine these samples with the popular 300W-LP [31] head pose dataset and thereby obtain a large-scale dataset with greatly expanded head rotation variations. Finally, training our proposed model on this new agglomerated data enables us to predict a significantly extended range of head orientations. We examine our approach in multiple experiments on public datasets that attest to our method's state-of-the-art accuracy and remarkable robustness in predicting challenging poses. At the same time, it is able to handle a many times greater range of head pose orientations compared to current methods from the literature. Fig. 1 shows examples of orientation predictions from this model for versatile head poses. To the best of our knowledge, we are the first to tackle the full range of head pose estimation in this extensive and conclusive way. In summary, we make the following contributions:

• We introduce a simplified and efficient 6-parameter rotation matrix representation for regressing accurate head orientations without suffering from ambiguity problems.
• We propose a geodesic distance approach for network penalizing to encapsulate the training loss within the Special Orthogonal Group SO(3) manifold geometry.
• We utilize the CMU Panoptic dataset [30] to expand the traditional 300W-LP [31] head pose dataset with full rotation head pose appearance.
• We create a new head pose prediction model that surpasses the prediction range of current methods and at the same time achieves lower errors on common test datasets.
• We demonstrate the superiority of our approach in accuracy and robustness in multiple experimental setups.
• We conduct an ablation study to evaluate the impact of each component of our model on the achieved results.

Fig. 2 shows an overview of our proposed method. Each component will be explained in detail in the following sections. Inspired by the 6D representation that is used in our approach, we call our network 6DRepNet. An earlier version of this work was published in [32] (accepted to ICIP 2022), where we presented an initial approach for 6D-based narrow angle prediction. In this version, we enhance this previous work with an improved training procedure, propose an approach for tackling the prediction of the full range of orientation, and provide a more detailed evaluation of the model including an extensive comparison with the state-of-the-art, error analysis, and ablation studies.
Our training and testing code as well as our trained models are made publicly available to facilitate research experimentation and practical application development.

RELATED WORK
In recent years, facial analysis along with vision-based head orientation prediction emerged with the rise of neural networks. Current methods are commonly divided into landmark-based and landmark-free approaches. Landmark-based methods [33], [34], [35], [36] detect facial landmarks as a primary step and subsequently recover the 3D head pose by aligning the predicted landmarks with a standardized 3D head model [37], [38]. Under ideal circumstances, this approach can lead to very accurate head orientation estimations, but it is highly dependent on the precise prediction of the landmark positions. Also, it requires the target head to be shaped similarly to the head model to achieve an accurate alignment. Finally, as the target landmarks are only located in the facial area, poses with occlusion and strongly rotated heads with too little or no visible face cannot be estimated [39], [40]. Landmark-free approaches overcome these limitations by directly estimating the head pose from the images in an end-to-end fashion. These methods commonly use deep neural networks to formulate the orientation prediction as an appearance-based task. As one of the first of its kind, HopeNet [16] presented an RvC [41] approach by binning the target angle range to combine a cross-entropy and a mean squared error loss function for Euler angle prediction. Along with this classification approach, they at the same time restricted the predictable rotation range to [-99°, +99°] for yaw, pitch, and roll. Later, QuatNet [19] adapted the cross-entropy paradigm with limited prediction range and proposed to split classification and regression into separate network branches. One branch is used for classifying the Euler angles and the second one regresses the pose in quaternion representation. Similarly, HPE [18] treats classification and regression separately and averages the outputs as a pose regression subtask. WHENet [29] keeps the single branch strategy, switches to an EfficientNet [42] backbone, and increases the number of bins for the yaw network branch to extend the predictable angle range. In contrast, FSA-Net [17] proposes a network with a stage-wise regression and feature aggregation scheme for predicting Euler angles. TriNet [43] adapts this method, but estimates the three unit vectors of the rotation matrix instead of Euler angles and incorporates an additional orthogonality loss to stabilize the predictions. MFDNet [21] likewise follows the rotation matrix representation but uses its Fisher distribution to model rotation uncertainty and to find its maximum likelihood. Another probabilistic approach was proposed by Liu et al. [44], who train on Gaussian label distributions. FDN [20], in turn, targets optimized feature extraction by proposing a feature decoupling method to explicitly learn discriminative features of different head orientations. DDD-Pose [22] seeks to diversify the training data by proposing an advanced augmentation scheme. The current state-of-the-art results are achieved by RankPose [23], closely followed by MNN [45]. RankPose uses paired training samples to introduce a ranking loss for penalizing incorrect ordering of the Euler pose estimation. MNN and img2pose [46] predict the rigid transformation between the head and the camera.
In general, common approaches in the area of head pose estimation have achieved continuous improvement over recent years, yet they still lack comprehensive solutions for predicting the full range of head pose rotation. First, it became a common convention to split up the continuous rotation variables into bins to convert the problem into a classification task in order to stabilize the predictions [16], [18], [19], [20], [29]. However, this is problematic, as grouping segments of angles into bins consequently leads to a loss of information. Apart from that, this constraining approach is commonly combined with reducing the target space [16], [18], [19], [20], [29], which eliminates the opportunity of tackling full rotation estimation. A few works overcome these limiting factors by using the rotation matrix as a more suitable rotation representation [21], [43], [47], but they neither deal with more efficient ways of regression nor address its potential for expanding the prediction range. As a consequence, the area of full head pose prediction is still rarely explored. WHENet [29] was one of the first to approach full yaw prediction by extending the bin range for the yaw angle and proposing a wrapping loss to handle the influence of the gimbal lock. However, their method still tightly restricts pitch and roll to [-99°, +99°]. The same restriction is applied by Viet et al. [47] in their multitask approach, where they combine face detection and head pose estimation. As rotation representation, they use the rotation matrix and follow the same computationally expensive approach as TriNet [43] to obtain orthogonality.

METHOD
In the following, we give details about our proposed method. We start with preliminary information about different rotation representations. Based on these insights, we propose a rotation parametrization scheme to overcome the limitations of the related works. As an accompanying measure, we introduce a geodesic distance-based loss to make the network penalty for training more precise and stable.

Preliminaries
In general, the orientation of a rigid body in three-dimensional space can be described by multiple kinds of mathematical representation. The most common and widely used one is the Euler angle representation, which describes the rotation around each axis of the coordinate system (typically denoted as yaw, pitch, and roll). Despite their intuitiveness, Euler angles face limitations when it comes to the specific orientation state where the second elemental rotation reaches 90 or -90 degrees. Given this setup, the first and third rotation axes align and create infinitely many solutions for the same rotation state. This behavior is known as gimbal lock, as the first and third axis are locked under this particular condition. The gimbal lock represents the extreme case for the limitations of Euler angles. However, the dependency between the first and third angle is a fundamental property of Euler angles that just becomes stronger the closer the pitch gets to the gimbal lock state. As a consequence, the Euler angle representation does not behave in the same continuous form as its visual appearance counterpart, which has a detrimental impact on the performance of neural networks.
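To make the gimbal lock concrete, the following minimal sketch composes a rotation matrix from three Euler angles and compares two different label triples. The axis order Rz·Ry·Rx and the helper names `matmul`, `euler_to_matrix`, and `max_abs_diff` are assumptions chosen for illustration; they are not taken from any dataset convention. Under this assumption, only the difference between the third and the first angle matters once the second angle reaches 90 degrees.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_matrix(first_deg, second_deg, third_deg):
    """Compose R = Rz(first) * Ry(second) * Rx(third).

    The axis order is an assumption made for this illustration only.
    """
    a, b, c = (math.radians(x) for x in (first_deg, second_deg, third_deg))
    rz = [[math.cos(a), -math.sin(a), 0.0],
          [math.sin(a), math.cos(a), 0.0],
          [0.0, 0.0, 1.0]]
    ry = [[math.cos(b), 0.0, math.sin(b)],
          [0.0, 1.0, 0.0],
          [-math.sin(b), 0.0, math.cos(b)]]
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(c), -math.sin(c)],
          [0.0, math.sin(c), math.cos(c)]]
    return matmul(matmul(rz, ry), rx)


def max_abs_diff(a, b):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))


# At second = 90 deg, two different label triples give the same rotation.
locked = max_abs_diff(euler_to_matrix(10, 90, 30), euler_to_matrix(40, 90, 60))
# Away from the lock, the same two label triples describe different rotations.
free = max_abs_diff(euler_to_matrix(10, 45, 30), euler_to_matrix(40, 45, 60))
print(f"difference at 90 deg: {locked:.2e}, at 45 deg: {free:.2e}")
```

Both triples share the same difference between third and first angle (20 degrees), which is why they collapse to one rotation in the locked state, while the rotation matrix itself stays unique.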
Another type of orientation parameterization is the axis-angle representation, which consists of a unit vector $v = (\tilde{x}, \tilde{y}, \tilde{z})$ that defines the axis of the rotation and an angle $\theta$ that describes the magnitude of the rotation. Closely related to the axis-angle representation is the rotation quaternion $q$, which also has four parameters $q_0, q_1, q_2, q_3$ that can be derived by $q_0 = \cos(\theta/2)$, $q_1 = \tilde{x}\sin(\theta/2)$, $q_2 = \tilde{y}\sin(\theta/2)$, $q_3 = \tilde{z}\sin(\theta/2)$. Quaternions and the axis-angle representation are not affected by the gimbal lock, but they still have an ambiguity that is introduced by their antipodal symmetry: $(v, \theta)$ and $(-v, -\theta)$ as well as $q$ and $-q$ describe the same rotation, respectively. As a result, every orientation can be described by two different representations that are maximally far apart. A more comprehensive notation is the rotation matrix $R \in \mathbb{R}^{3 \times 3}$, which consists of 9 parameters. Despite its increased number of parameters, it comes with the crucial advantage that it provides a continuous representation with a unique parameterization for each rotation. Fig. 3 shows an example of two dataset samples with similar pose appearances. Yet, their Euler angle and quaternion ground truths are parameterized very differently. Only the rotation matrices reflect the similarity in the pose appearance. In SO(3), the matrix representation $R$ is of size 3 × 3 with the orthogonality constraint $RR^T = I$, where $R^T$ is the transposed matrix and $I$ the identity matrix.
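The antipodal ambiguity of quaternions can be checked numerically. The sketch below builds a unit quaternion from an axis and an angle using the relations stated above and converts it to a rotation matrix with the standard unit-quaternion formula. The function names `axis_angle_to_quaternion` and `quaternion_to_matrix` are hypothetical helpers introduced for this illustration.

```python
import math


def axis_angle_to_quaternion(axis, theta):
    """Return (q0, q1, q2, q3) for a rotation of theta radians about axis."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm  # make sure the axis is a unit vector
    s = math.sin(theta / 2.0)
    return (math.cos(theta / 2.0), x * s, y * s, z * s)


def quaternion_to_matrix(q):
    """Convert a unit quaternion to a 3x3 rotation matrix (nested lists)."""
    q0, q1, q2, q3 = q
    return [
        [1 - 2 * (q2 * q2 + q3 * q3), 2 * (q1 * q2 - q0 * q3), 2 * (q1 * q3 + q0 * q2)],
        [2 * (q1 * q2 + q0 * q3), 1 - 2 * (q1 * q1 + q3 * q3), 2 * (q2 * q3 - q0 * q1)],
        [2 * (q1 * q3 - q0 * q2), 2 * (q2 * q3 + q0 * q1), 1 - 2 * (q1 * q1 + q2 * q2)],
    ]


q = axis_angle_to_quaternion((0.0, 0.0, 1.0), math.pi / 2)
neg_q = tuple(-c for c in q)

# q and -q are far apart as 4D labels but yield the same rotation matrix,
# because every matrix entry is quadratic in the quaternion components.
r_from_q = quaternion_to_matrix(q)
r_from_neg_q = quaternion_to_matrix(neg_q)
```

A regression target expressed as a quaternion therefore has two valid labels per pose, whereas the matrix target is unique.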
One could now try to regress the rotation matrix directly, but this would require finding all nine parameters such that they at the same time satisfy the orthogonality constraint. The orthogonality can also be enforced in a subsequent step by using either the Gram-Schmidt process or the singular value decomposition (SVD). The SVD is a computationally expensive approach for finding those orthogonal vectors that are nearest to the predictions. The Gram-Schmidt method requires discarding one vector in order to recreate the orthogonal matrix from the remaining two.

6D Representation
In section 3.1 we showed that a key aspect for tackling direct orientation predictions is the use of an appropriate rotation representation that is unambiguously interpretable by neural networks. For this purpose, we use the rotation matrix representation as a superior alternative to Euler angles, quaternions, and axis-angles. Inspired by Zhou et al. [28], we satisfy the orthogonality constraint by performing the Gram-Schmidt mapping inside the representation itself, which avoids expensive post-processing. We simply drop the last column vector of the rotation matrix, which reduces the 3 × 3 matrix to a 6D rotation representation that has been reported to introduce smaller errors for direct regression [28]. Let $a_1$ and $a_2$ denote the two predicted column vectors. Then, the predicted 6D representation matrix is mapped back into SO(3) with

$$ f_{GS}\left( \begin{bmatrix} | & | \\ a_1 & a_2 \\ | & | \end{bmatrix} \right) = \begin{bmatrix} | & | & | \\ b_1 & b_2 & b_3 \\ | & | & | \end{bmatrix}, $$

where the resulting column vectors are defined as

$$ b_1 = \frac{a_1}{\lVert a_1 \rVert}, \qquad b_2 = \frac{u_2}{\lVert u_2 \rVert}, \qquad u_2 = a_2 - (b_1 \cdot a_2)\, b_1. $$

Hereby, the last column vector is simply determined by the cross product, which ensures that the orthogonality constraint is satisfied for the resulting 3 × 3 matrix:

$$ b_3 = b_1 \times b_2. $$

As a result, our network only has to predict 6 parameters that are mapped into a 3 × 3 rotation matrix in a subsequent transformation process that incorporates the orthogonality constraint as well.
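The mapping from six predicted values to a rotation matrix can be sketched in a few lines. This is a minimal plain-Python illustration of the Gram-Schmidt construction described above (normalize the first vector, remove its component from the second, take the cross product for the third), not the authors' PyTorch implementation; the names `rotation_from_6d`, `normalize`, `dot`, and `cross` are assumptions introduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def rotation_from_6d(a1, a2):
    """Map two raw 3D vectors (the 6 predicted values) to a rotation matrix.

    a1 and a2 must not be parallel, otherwise u2 has zero length.
    """
    b1 = normalize(a1)
    # Remove the component of a2 that points along b1, then normalize.
    proj = dot(b1, a2)
    u2 = [a2[i] - proj * b1[i] for i in range(3)]
    b2 = normalize(u2)
    # The third column follows from the cross product, so no prediction is needed.
    b3 = cross(b1, b2)
    # b1, b2, b3 are the columns of the resulting matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# Arbitrary, non-orthogonal network outputs still yield a valid rotation matrix.
R = rotation_from_6d([0.3, -1.2, 0.5], [0.9, 0.4, -0.7])
```

Because orthogonality is built into the mapping, any pair of non-parallel output vectors corresponds to exactly one valid rotation, which is what allows unconstrained regression of the six values.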

Geodesic loss
The ℓ2-norm is the commonly used loss function for head pose related tasks. However, using the Frobenius norm for measuring distances between two matrices would break with the SO(3) manifold geometry. Instead, the shortest path between two 3D rotations is geometrically interpreted as the geodesic distance. Let $R_p$ and $R_{gt} \in SO(3)$ be the estimated and the ground truth rotation matrices, respectively. Then the geodesic distance between both rotation matrices is defined as:

$$ L_g = \cos^{-1}\left( \frac{\operatorname{tr}(R_p R_{gt}^T) - 1}{2} \right) \tag{5} $$

In the following, we use this metric as a loss function for our neural network to compute accurate distance information between the predicted and ground truth orientation.
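As a minimal sketch, the geodesic distance between two rotation matrices can be computed from the trace of their relative rotation, using the standard relation angle = arccos((tr(R_p R_gt^T) − 1) / 2). The function name `geodesic_distance` and the clamping of the cosine value are assumptions added here for numerical safety; a training implementation in an autograd framework may handle the boundary values differently.

```python
import math


def geodesic_distance(r_p, r_gt):
    """Rotation angle (radians) that takes r_gt to r_p; inputs are 3x3 nested lists."""
    # tr(R_p R_gt^T) equals the sum of element-wise products of both matrices.
    trace = sum(r_p[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation by 90 degrees about the z axis.
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

angle = geodesic_distance(rot_z_90, identity)
print(math.degrees(angle))
```

Unlike an element-wise distance, the result is a single rotation angle between 0 and π, so equal visual pose differences produce equal penalties regardless of where on the manifold they occur.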

EXPERIMENTS
We perform an extensive evaluation of our method. We begin with the specification of the datasets, evaluation metrics, and implementation setup, followed by a comprehensive comparison with other state-of-the-art methods in cross-dataset and intra-dataset tests. Further analysis includes a detailed error analysis and ablation studies on the loss functions and backbones used.

Datasets
We conduct our evaluation with the aid of different kinds of data. The most common publicly available datasets are 300W-LP [31], AFLW2000 [48], and BIWI [49].
300W-LP: 300W-LP consists of 66,225 face samples collected from multiple databases including LFPW [50], AFW [51], HELEN [52], and iBUG [53] that are further
BIWI: The BIWI dataset includes 15,678 images that were created in a lab environment with 20 participants. In this dataset, the head takes up only a small area in the images. Hence, we use the MTCNN [54] face detector to loosely crop the heads from the images.
All of the above listed datasets provide, due to the nature of their annotation, only samples with a frontal view of the faces (mostly within the -99° to +99° range of yaw). Therefore, they cannot be used for training on the entire head orientation range.
CMU Panoptic: To cover the remaining range, we utilize another dataset called CMU Panoptic [30], which makes it possible to generate annotated head images with full rotation appearances. In this dataset, a variety of subjects perform arbitrary tasks inside a dome that is equipped with 31 evenly arranged HD cameras. The main focus of this dataset is to capture the subjects' poses, but it also provides 3D facial landmark annotations as well as camera intrinsics and extrinsics. This makes it possible to extract head pose annotations from all the different camera angles, which was initially harnessed by Zhou et al. [29]. There are 30 publicly available sequences with multiple subjects per scene, who are standing in a ring with each subject oriented towards the center of the dome. When extracting the head crops, we only accepted those with a minimum size of 320 for both axes, which gives us a dataset with 113,914 samples in total. Because of the subjects' spatial setup, the majority of the samples show the back of the head. Samples with a frontal face view are more likely to be discarded for being too small, as these face images were taken from a greater distance. Therefore, we create a combination of the 300W-LP and the CMU Panoptic dataset that includes 236,364 data samples spanning the entire range of yaw rotation. The range of pitch is slightly expanded, as we also use the samples that are generated from cameras attached to the ceiling of the CMU Panoptic dome. The distribution of this new training data is shown in Fig. 4. It should be noted that we use Euler angles for presentation purposes, which cannot exactly represent the distribution of visual appearance in the dataset, as discussed in section 3.1.

Evaluation Metrics
We use two different evaluation metrics to quantify the head pose estimation error. The first one is the most common Mean Absolute Error (MAE) of the Euler angles,

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left\lvert x_g^{(i)} - x_p^{(i)} \right\rvert, $$

where $N$ is the number of face images and $x_g^{(i)}$ and $x_p^{(i)}$ represent the ground truth and predicted pose parameters of the $i$-th image, respectively. Secondly, we calculate the Mean Absolute Error of the vectors (MAEV) of the rotation matrix. This metric was introduced by [43] in order to overcome the limitations of the Euler representation and to provide a more meaningful picture of the appearance differences between predicted and ground truth orientation. The MAEV measures the angle error between the three vectors of the rotation matrix, averaged over the $N$ face images in the dataset, where $v_g$ and $v_p$ are the ground truth and the predicted head orientation vectors.
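Both metrics can be illustrated with a short sketch. The MAE follows the description above directly. For the MAEV, the sketch assumes that the error of one sample is the mean angle (in degrees) between corresponding column vectors of the ground truth and predicted rotation matrices; this reading of "angle error between the three vectors" is an assumption, and the function names are hypothetical.

```python
import math


def mean_absolute_error(gt, pred):
    """MAE over paired scalar values, e.g. one Euler angle per image."""
    return sum(abs(g - p) for g, p in zip(gt, pred)) / len(gt)


def vector_angle_deg(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


def mean_absolute_error_of_vectors(gt_mats, pred_mats):
    """Assumed MAEV: mean column-wise angle error over all samples."""
    total = 0.0
    for r_g, r_p in zip(gt_mats, pred_mats):
        for j in range(3):  # compare column j of both matrices
            v_g = [r_g[i][j] for i in range(3)]
            v_p = [r_p[i][j] for i in range(3)]
            total += vector_angle_deg(v_g, v_p)
    return total / (3 * len(gt_mats))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

yaw_mae = mean_absolute_error([10.0, -20.0, 30.0], [12.0, -25.0, 30.0])
maev = mean_absolute_error_of_vectors([identity], [rot_z_90])
```

In the second example, two of the three column vectors are off by 90 degrees and the third is unchanged, so the vector-based error reflects which parts of the orientation actually differ.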

Implementation details
We implement our proposed network using PyTorch [55]. As backbone, we choose ResNet50 [56] to enable a fair comparison with other methods [16], [19], [22], [23], [43] that chose the same feature extractor. The backbone's weights are pretrained on the ImageNet [57] dataset. For the final layers, we choose a single fully connected layer with 6 outputs. The network is trained for 80 epochs with a batch size of 80 using the Adam optimizer with a learning rate of 1e-4. To exploit the full generalization potential, we also extensively augment our training data using Albumentations [58] by applying random horizontal flipping, random scaling and cropping, random rotation within [-45°, +45°], random occlusions, and further image color operations including random blur, random brightness and contrast changes, and random RGB shifts.

Comparison with state-of-the-art
In this section, we conduct a comprehensive comparison with the state-of-the-art. We start with a cross-dataset evaluation to analyze our model's generalization capabilities, followed by an intra-dataset experiment and a detailed error analysis for further performance assessment.

Cross-dataset evaluation
In our first experiment, we want to evaluate our approach against the state-of-the-art methods. To this end, we train two models. The first model (6DRepNet) strictly follows the common training convention by using the synthetic 300W-LP dataset for training and the two real-world datasets AFLW2000 and BIWI for testing. This provides comparable information about our method's performance of directly regressing a reduced rotation matrix. Table 1 shows the results from the two model setups along with the results from other methods from the recent literature. For better interpretation, we added an extra column (R) to show which methods are trained to predict a larger range of rotations and which ones restrict their predictions to frontal poses. Of the 15 listed methods, only two attempt to go beyond narrow angle range head pose estimation.

6DRepNet
The table demonstrates that our model that was solely trained on the 300W-LP dataset outperforms all other methods on the AFLW2000 test dataset, surpassing the current top performer RankPose in both Euler and vector errors. Besides the overall error rate, our model achieves top performing results for the pitch and roll error and results equal to the best reported yaw error. This indicates very stable network learning, resulting in robust prediction properties. On the BIWI dataset, it achieves competitive results with respect to MAE and the best results with respect to MAEV. The latter ought to be considered with caution, as there are no MAEV results reported for the MAE top performers.

6DRepNet360
Our second model, 6DRepNet360, achieves very competitive results on AFLW2000 and even new state-of-the-art results on BIWI by surpassing WHENet-V by 3%. Notably, this model only differs in its training data, where the added data aims to expand the predictable detection range of the yaw rotation. Yet, these samples include numerous stronger pitch rotations than 300W-LP (see Fig. 4). We argue that these samples benefit the model's performance for processing the challenging poses from the BIWI dataset, as the error for the pitch is reduced by 33% compared to our model trained solely on 300W-LP. Remarkably, WHENet is also trained for wide yaw predictions and is therefore the most suitable method to compare with our 6DRepNet360. While WHENet is reported to perform even worse than its 300W-LP equivalent WHENet-V, our 6DRepNet360 model achieves over 20% lower error rates on AFLW2000 and over 10% higher accuracy on BIWI. We believe that our choice of the 6D rotation matrix as rotation representation instead of WHENet's Euler angles has a major impact on our superior results. In terms of rotation representation, TriNet is the most similar method to ours. But in contrast to our 6-parameter approach, they predict the entire 9-parameter rotation matrix and use an SVD to find an orthogonality-constrained solution. We argue that our more efficient approach leads to the higher reported accuracy.
Fig. 5 shows qualitative results from our 6DRepNet360 model. The first row illustrates predictions on test images from the AFLW2000 dataset with strong variety in background, lighting, and camera angle. The second row shows test results with very strong head rotations from the CMU Panoptic test set that exceed the common [-99°, +99°] restrictions. In contrast to AFLW2000, it is captured in a laboratory environment with consistent lighting conditions and background. Nevertheless, 6DRepNet robustly predicts the head poses from varying camera angles. A very noteworthy example is the rightmost test image, as it presents a very challenging instance. While for frontal faces even strongly rotated poses provide meaningful features, visual cues in this example are mainly restricted to the head's shape. Yet, our model is able to predict reliable orientations even for these challenging kinds of head poses.

Intra-dataset evaluation BIWI
In a second experiment, we follow the convention by FSA-Net [17] and randomly split the BIWI dataset in a ratio of 7:3 for training and testing, respectively. Table 2 and Table 5 show our results compared with other state-of-the-art methods that followed the same testing strategy. We retested those models for which source code is available in order to additionally report the MAEV. The remaining results are as claimed by the authors. The results demonstrate that our method outperforms all other methods by a margin of more than 10%. In terms of the individual rotation angles, our approach produces very consistent results by achieving the best results on yaw and roll, and results equal to the state-of-the-art DDD-Pose for the pitch angle. This supports the observed robustness in the cross-dataset evaluation and demonstrates that achieving stable and accurate results for all three angles does not only depend on the training dataset, but rather on our proposed method itself. This is also reflected in Table 5, where our approach achieves the best overall MAEV results as well as the best results for each single vector.

CMU Panoptic + 300W-LP
In a final experiment, we evaluate our model in an intra-dataset test on our combined dataset that comprises the data from the CMU Panoptic and the 300W-LP dataset. To this end, we randomly split the dataset into 70% training data and 30% test data. To the best of our knowledge, Viet et al. [47] and WHENet are the only methods that published test results on CMU Panoptic. However, the prediction pipeline of [47] additionally includes face detection, and their test set comprises solely samples from CMU Panoptic. Therefore, the comparison ought to be considered with caution. More similar to our experimental approach, WHENet tests on a combination of CMU Panoptic and 300W-LP, but the size and composition of this test set are not specified. Thus, our results are mainly for future reference, and we will publish our test list to enable other methods to make a precise comparison.

Error Analysis
To gain a more detailed impression of our model's performance, we conduct an error analysis with four other state-of-the-art methods (HopeNet, FSA-Net, RankPose, TriNet), where we split up the errors on AFLW2000 over the range [-99°, 99°] into intervals of 33°. All models were solely trained on 300W-LP. The results are shown in Fig. 6, where each Euler angle is illustrated in a separate graph. They show that, in general, the prediction error for all methods increases with stronger rotations. It is conspicuous, though, that this error increase is much lower for 6DRepNet compared to all the other methods, especially for pitch and roll.
While Table 1 shows that our model overall outperforms RankPose by 3%, this detailed error analysis illustrates that our 6DRepNet achieves over 60% smaller error rates for extreme pitch and roll rotations. This is yet another confirmation that our model not only achieves state-of-the-art results, but at the same time provides very robust predictions even in extremely challenging test cases.
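The interval-wise error analysis described above can be sketched as follows: each sample is assigned to a 33° interval according to its ground truth angle, and the absolute errors are averaged per interval. This is only an illustrative sketch with made-up sample values; the function name `binned_mae` and its parameters are hypothetical and do not reflect the evaluation script used for Fig. 6.

```python
def binned_mae(gt_angles, pred_angles, low=-99.0, high=99.0, width=33.0):
    """Return a list of (bin_start, bin_end, mean_abs_error or None) tuples."""
    n_bins = int(round((high - low) / width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for g, p in zip(gt_angles, pred_angles):
        if not low <= g <= high:
            continue  # ignore samples outside the analysed range
        # The upper boundary (g == high) is assigned to the last bin.
        idx = min(int((g - low) // width), n_bins - 1)
        sums[idx] += abs(g - p)
        counts[idx] += 1
    return [
        (low + i * width, low + (i + 1) * width,
         sums[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]


# Illustrative values only, not results from the paper.
gt = [-90.0, -10.0, 0.0, 95.0]
pred = [-80.0, -12.0, 3.0, 75.0]
for start, end, err in binned_mae(gt, pred):
    print(f"[{start:6.1f}, {end:6.1f}): {err}")
```

Reporting the error per interval rather than as one average makes it visible whether a model degrades towards the boundaries of the angle range.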

Ablation Study
In the following, we analyze how each of our model's remaining components impacts our reported results. This includes the backbone, which is responsible for the feature extraction, and our proposed loss function, which differs from other methods in the literature.

Loss function
Most current methods use the Mean Squared Error (MSE) for calculating the loss in the training procedure. We argue that the geodesic distance gives better feedback about the distance between prediction and ground truth and, thus, is better suited to be used as a loss function. To verify this, we conduct another experiment where we repeat our previous tests, but this time we train our network with the MSE distance loss and with a combination of MSE and the geodesic loss $L_g$ (see Eq. 5). Table 3 shows these results compared to our models trained with the geodesic distance loss. It shows that the network with the geodesic loss penalty performed significantly better than the one that used MSE and slightly better than the combination of MSE and $L_g$.

Backbone
In a final experiment, we analyze the impact of the chosen backbone on the results. Our results from Table 1 already demonstrated the superiority of our 6D rotation matrix approach over other methods using the same backbone. Nevertheless, we want to evaluate the impact of the number of parameters on our results. In Table 6 we compare our previous results with a model that was trained with the smaller ResNet18. It is remarkable that our model trained on the 50% smaller backbone ResNet18 still achieves better results on the AFLW2000 dataset than all other methods from Table 4 except one. For the BIWI dataset, the accuracy compared to ResNet50 is reduced only by a very small margin. This confirms that our model's overall performance is predominantly due to our 6D rotation representation and hardly due to the backbone used. Moreover, it shows that the commonly used ResNet50 is not necessary for achieving proper accuracy, as the more efficient ResNet18 reports similar performance. This becomes an important aspect when head pose estimation is used in settings with limited computational resources.

Limitations
Our model achieves accurate and robust predictions for an extended range of rotation. This especially applies to the yaw angle, which encounters the strongest rotations in common application scenarios. However, the roll and pitch can also reach strong rotations, which are only marginally represented in our training data (see Fig. 4). This can lead to reduced robustness and accuracy in application scenarios with unusual camera angles and head poses. To analyze this, we calculated the error of our 6DRepNet360 model per degree on the test set of our CMU Panoptic + 300W-LP 70/30 split from section 4.4.2. The results are shown in Fig. 7 and illustrate that the error rate for the yaw angle is consistently low, while the roll and pitch error rates increase with stronger rotations. This demonstrates that there is still a lack of training and also test data for this extended range of rotations. In our test set, only 3 samples exceed [-100°, +100°] in roll and only 5 samples exceed [-100°, +100°] in pitch. In our experiments, we approached this limitation by performing image rotation augmentation that synthetically expands the roll and pitch range. Further, the CMU Panoptic dataset is recorded in laboratory settings with similar background and lighting conditions. Additional data with stronger variation could therefore benefit the generalization performance as well.

CONCLUSION
In this paper, we tackle the major challenge of unconstrained full rotation head pose estimation, which is still a rarely explored research subject. First, we formulate a continuous 6D rotation matrix representation for an unambiguous and continuous appearance parameterization. This approach forms the basis for stable and precise network training that we further optimize by introducing a geodesic distance-based loss. With the use of the CMU Panoptic dataset, we accumulate a more comprehensive head pose dataset that exceeds the common public datasets in variety and size and allows us to create a model that is able to predict full head pose rotations. We evaluate our approach in multiple experiments that demonstrate that our 6D rotation representation achieves superior performance compared to the state-of-the-art and is able to efficiently learn the full range of head pose orientation. We complete our work with an ablation study to analyze the impact of the backbone and loss function on our results.

Fig. 1: Example images of predicted orientations of various rotated heads.

Fig. 5: Example images with converted Euler angle visualization from the AFLW2000 dataset (first row) and the CMU Panoptic test dataset (second row).
TABLE 1: Comparisons with the state-of-the-art methods on the AFLW2000 and BIWI datasets. All models are trained on the 300W-LP dataset. Results from methods with positive I are generated by our own tests. Methods with negative I are not or only partially open-source. Their results are claims from the authors. Methods with positive R target the prediction of a wider range of rotation. Column headers: I, R, Yaw, Pitch, Roll, MAE (Euler errors) and Left, Down, Front, MAEV (vector errors).

TABLE 2: Euler error comparisons with the state-of-the-art methods on the 70/30 BIWI dataset. Results from methods with positive I are generated by our own tests. Methods with negative I are not or only partially open-source. Their results are claims from the authors.

TABLE 3: Vector error comparisons with the state-of-the-art methods on the 70/30 BIWI dataset. Results from methods with positive I are generated by our own tests. Methods with negative I are not or only partially open-source. Their results are claims from the authors.

TABLE 4: Model performance on the CMU Panoptic + 300W-LP combined dataset. 70% of the dataset is used for training and the remaining 30% for testing. Results from methods with positive I are generated by our own tests. Methods with negative I are not or only partially open-source. Their results are claims from the authors.

TABLE 5: Analysis of the influence of the different loss functions $L_{MSE}$ and geodesic loss $L_g$ on the MAE.

TABLE 6: Comparison of the MAE between the different backbones.