Domain Adaptation for Head Pose Estimation Using Relative Pose Consistency

Head pose estimation plays a vital role in biometric systems related to facial and human behavior analysis. Typically, neural networks are trained on head pose datasets. Unfortunately, manual or sensor-based annotation of head pose is impractical. A solution is synthetic training data generated from 3D face models, which can provide an infinite number of perfect labels. However, computer generated images only provide an approximation of real-world images, leading to a performance gap between training and application domain. Therefore, there is a need for strategies that allow simultaneous learning on labeled synthetic data and unlabeled real-world data to overcome the domain gap. In this work, we propose relative pose consistency, a semi-supervised learning strategy for head pose estimation based on consistency regularization. Consistency regularization enforces consistent network predictions under random image augmentations, including pose-preserving and pose-altering augmentations. We propose a strategy to exploit the relative pose introduced by pose-altering augmentations between augmented image pairs, to allow the network to benefit from relative pose labels during training on unlabeled data. We evaluate our approach in a domain-adaptation scenario and in a commonly used cross-dataset scenario. Furthermore, we reproduce related works to enforce consistent evaluation protocols and show that we outperform the state of the art in both scenarios.


I. INTRODUCTION
Head pose estimation (HPE) describes the problem of predicting the orientation (head pose) of the human head. Head pose is a key factor in many biometric systems related to facial analysis and human behavior analysis. Accurate estimation of head pose can bring many benefits.
In driver assistance systems [1] HPE can be used for automatic assessment of the focus of attention. Head pose is used for visually assessing human-object and human-human interaction [2]. Further, it is the starting point for many gaze estimation methods [3]. In facial recognition systems head pose is important for pre-processing [4] or during training for data augmentation and to obtain pose-invariant recognition models [5]. Similar use cases apply for facial expression recognition [6], as pose-invariance is a desired property for facial analysis systems. Motivated by the many possible applications, there has been a lot of progress in this field of research, most recently through deep learning methods. Collecting the required training data is still a challenging task for several reasons. Manual annotation is a problem, because humans cannot accurately annotate a 3D head pose from a 2D image. This has led to the creation of head pose datasets using devices like depth sensors and 3D head scans [1], [7], or special tracking equipment attached to the head [8], [9], [10]. However, with these recording setups, it is cumbersome and costly to reach a high diversity in subjects, environments and poses. Therefore, it is still an open challenge to acquire suitable training data.
A solution is to use synthetic (rendered, computer generated) face images to provide inexpensive and virtually unlimited quantities of perfectly labeled data. Several methods train on synthetic [11], [12], [13], [14], [15], [16] or synthetically extended (warped) images (e.g., [17], [18] on 300W-LP dataset [19]). Unfortunately, learning-based approaches trained only on synthetic data (source domain) tend to perform poorly on real-world data (target domain) compared to methods trained on real-world data. This can be explained by the difference between domains (domain gap). In [16] we addressed this issue for HPE by introducing a method for domain adaptation (DA). DA methods typically use unlabeled data from a target domain during training, to overcome the domain gap. However, the performance of methods trained on synthetic head pose data is still inferior to methods trained on real-world head pose data.
In this work, an extended version of [20], our goal is to improve the performance of HPE on real-world data using only labels from a synthetic dataset in combination with an unlabeled real-world dataset. In [16] an adversarial training approach based on domain adversarial neural networks [21] is used to force the extraction of domain-invariant features. In contrast, we propose to tackle the problem using consistency regularization and relative pose labels.
Consistency regularization is a semi-supervised learning (SSL) technique. Semi-supervised learning utilizes labeled and unlabeled data simultaneously during training. This property was exploited successfully for domain adaptation using consistency regularization in [22]. Consistency regularization forces network outputs for the same input under different perturbations to be consistent. For visual tasks, these perturbations are typically implemented as various image augmentations, e.g., spatial transforms. However, head poses are not invariant to spatial image transforms, like flipping and rotation. If the ground truth pose is known, the pose label can be adjusted; however, the ground truth pose is unknown for our target-domain data. In [20] and this extended work, we therefore propose to take advantage of relative pose.

Fig. 1. From an image with unknown head pose (picture from [7]) two different augmented versions A and B are created. Augmentations can be color distortions, blurring, etc., but also pose-altering transforms like rotations ($r_A$, $r_B$) and flipping. In addition to a supervised loss, our method allows to train a network on unlabeled data with relative pose consistency between A and B. Relative pose consistency provides an unsupervised loss to enforce consistent predictions under different augmentations, and also to enforce predictions that comply with a relative pose.
The relative pose, which we store in a relative pose label, is the pose difference between two realizations of the same input (see Fig. 1 for an example). Recalling that training with consistency regularization requires different realizations of the same input, we implement relative pose labels in a consistency training framework (see Fig. 2 for an overview of our method). This extends the consistency supervision from static augmentations to relative pose labels. As a consequence, the network is trained not only to make consistent predictions, but also predictions that comply with the relative poses. Our consistency-enforcing method does not require absolute pose information and can therefore be used with unlabeled data samples in semi-supervised or domain-adaptation scenarios. We show the effectiveness of our approach on the popular BIWI Kinect Head Pose estimation benchmark [7].
Our main contributions are as follows:
• We show, for the first time, that consistency regularization can be used for pose regression problems.
• We propose relative pose consistency, a novel extension to consistency regularization.
• We present an improved training framework compared to [20].
• We uncover effects of preprocessing on performance during reproduction of related work.
• We improve state-of-the-art results for two challenging HPE scenarios (domain-adaptation and cross-dataset).

II. RELATED WORK

A. Head Pose Estimation
In recent years, traditional approaches based on facial landmarks and 3D face models are mostly superseded by deep learning methods. These methods estimate the pose directly from an image. In addition to images, different modalities such as depth images [7] or temporal information [14] can be used. In this review we will focus only on deep-learning methods for HPE from a single RGB image.
The first convolutional neural networks (CNN) to directly regress head pose from an image are presented by Anh et al. [23] and Patacchiola and Cangelosi [24]. Recent work proposes variations of loss functions and network architectures. Ruiz et al. [17] combine a regression loss with binned pose classification, by assigning continuous poses to discrete pose categories (bins). Shao et al. [25] use a similar combined loss but also evaluate the effect of adjusting the margin around the face image that is fed into a CNN. Similarly, Lathuilière et al. [26] evaluate various factors of deep regression like hyperparameter selection or image preprocessing in the context of head pose estimation. Wang et al. [15] present a coarse-to-fine approach, where head pose is coarsely classified in bins and later refined by regression. An attention-based network structure for HPE is proposed by Yang et al. [18]. Their goal is to extract a set of representative features by learning a fine-grained structure mapping before a feature aggregation step. Zhou and Gregson [27] extend the work of [17] to full-range HPE by proposing a wrapped loss that allows training with the full range of yaw angles (−180°, 180°]. They further show that a small model, EfficientNet-B0 [28], can reach SOTA HPE performance. Hempel et al. [29] propose a continuous 6D rotation representation for full-range HPE and a geodesic distance-based loss. In contrast, Gu et al. [14] present an approach for temporal prediction of facial features. They propose to use a recurrent neural network (RNN) on top of a VGG16 network [30] for joint estimation and tracking of head pose in videos. The above methods can be seen as orthogonal to our approach, because we are not trying to improve supervised performance with new losses or network architectures for HPE. For simplicity and comparability we focus on the Mean Squared Error (MSE) loss and the ResNet [31] network architecture. Nevertheless, our method can be applied to other loss functions or network architectures as well.
Another approach to HPE is multi-task learning [32], [33], [34], [35], [36]. In this setting multiple tasks like HPE, landmark detection, age estimation, etc., are solved simultaneously. A benefit of multiple tasks is that multiple data sources can be used for training, which considerably increases the amount of training data. In contrast, our method does not use additional labeled tasks or datasets.
An interesting unsupervised approach is presented by Mustikovela et al. [37]. In their work, a viewpoint estimation network is trained purely via self-supervision with an analysis-by-synthesis framework using a network similar to HoloGAN [38]. Similar to our work, they enforce flip consistency by applying a flip consistency loss. In contrast, their loss forces images synthesized from a flipped latent code to be consistent. Our loss forces the predicted labels from flipped images to be consistent.
Lastly, it is a common approach to use synthetic face datasets from 3D models for HPE [11], [12], [13], [14], [15], [16], [20]. This has the advantage of learning from a high amount of diverse images with perfect labels. To date, [14] and [13] are publicly available datasets. These related works train on synthetic or mixed datasets and evaluate on real-world datasets, however, most of them do not explore any domain adaptation or semi-supervised techniques to overcome the domain gap. This was only explored in our previous works [16], [20].
In [16], we improve HPE for an unlabeled target dataset by enforcing a network to extract domain-invariant features using a domain discriminator and an adversarial training loss. To account for an only partially-shared label space, we apply a weighted resampling of the source domain to filter out samples with poses that are outside the target domain. To find poses outside the target domain, a distance between source and target domain samples is estimated by a network trained on the source domain. The method has two drawbacks. First, in addition to the pose estimation network, a domain-discriminator network needs to be trained simultaneously during training. This leads to additional computations, hyperparameters and complexity. Second, the resampling of training data assumes the availability of a pose distance metric to find suitable training samples. It is not guaranteed that a network trained only on the source domain provides an accurate distance metric to select appropriate training samples.
In [20], we tackle the same problem but choose a completely different approach to avoid the aforementioned drawbacks. One goal of [20] is to find a method that can handle partially-shared label spaces like [16] but does not need an explicit filtering of training data. As a result, the approach presented in [20] does not require resampling the source data. Furthermore, the approach does not need an additional discriminator network with adversarial training. These factors and improved performance compared to [16] make the proposed solution an exciting new solution for HPE in domain-adaptation scenarios. In this iteration of [20] we further refine the solution and provide a more detailed evaluation of the method. We introduce a modification to the training framework that increases the performance relative to [20]. In addition, we apply our method to a cross-dataset task to verify that our method can be applied in different scenarios. In an attempt to enable meaningful performance comparisons to related work, we also investigate the reproducibility of head pose estimation experiments in the context of face detection (see Section IV-E).

B. Consistency Regularization
Consistency-enforcing methods provide state-of-the-art performance for semi-supervised learning. During training, consistent network predictions for unlabeled data under input and network perturbations are enforced. Although one can find many terms and variants like self-ensembling, consistency regularization, self-training, temporal ensembling, or pseudo-labels, the core principle of enforcing consistent outputs is similar. Consistency-enforcing methods have also been successfully applied to domain-adaptation scenarios, where the unlabeled data is from another domain. While first used as semi-supervised methods, these principles are now popular for unsupervised pre-training of neural networks and paved the way for modern contrastive (self-supervised) methods like SimCLR [39], MoCo [40] and BYOL [41].
Laine and Aila [42] proposed two self-ensembling methods, the Π-Model and temporal ensembling. Both methods enforce consistent network predictions for the same input under different stochastic input augmentations and network perturbations. In this case, dropout was used to provide network perturbations. The Π-Model randomly augments the same input twice during an iteration and forces consistent predictions. In contrast to the Π-Model, temporal ensembling forces network predictions over multiple previous training epochs to be consistent with the current prediction. Self-training and training with pseudo-labels, e.g., [43], [44], can be seen as a variant of temporal ensembling. The Mean Teacher method by Tarvainen and Valpola [45] adapted this idea, but instead of reusing previous predictions, they added a teacher network whose weights are an average of previous network weights. The teacher network's predictions and the current model's (named student) predictions are forced to be consistent. French et al. [22] apply the Mean Teacher method to domain adaptation and propose modifications to improve DA performance. As we are the first to introduce consistency regularization to head pose estimation, we opt for simplicity and base our work on the Π-Model.

III. METHOD
Semi-supervised learning is typically used to learn from a large dataset, which is only partially labeled. We take up this idea for domain adaptation to learn from labeled synthetic images (source domain) and unlabeled real-world images (target domain). On the one hand, synthetic data provides perfect labels for a wide variety of poses. On the other hand, it only provides an approximation of real-world image features. Real data provides real-world features but lacks annotation quantity and quality. Combining them in a training scheme, where both datasets can be used simultaneously, is a promising way to improve performance on real-world images.
This concept is not limited to scenarios with synthetic and real-world data. A domain gap can decrease performance if the source data distribution is different from the target data distribution. For image datasets, different recording settings like lighting, camera, or subjects can already provide enough differences to cause reduced performance across domains. Our method can be applied in cross-dataset scenarios as well.
We will first introduce the required notations and baseline supervised learning. Then, we will describe the consistency regularization framework. Subsequently, we will describe the concept of relative pose labels and how these are embedded into the training framework. Finally, we discuss how we avoid degenerate solutions with consistency regularization and the effects of batch normalization while training with two datasets.
In a semi-supervised or domain-adaptation scenario, data is available from the labeled source domain $D_s$, where $n_s$ is the number of data samples $x_i^s \in X^s$ and associated labels $y_i^s \in Y^s$. For head pose estimation, $x$ is an image of a head and $y$ is a vector of the three corresponding Euler angles of the head. We are interested in utilizing the unlabeled target domain $D_t$, which includes only samples $x_i^t \in X^t$ but no labels. A network $f$ can be trained using the source data ($X^s$ and $Y^s$) and a supervised loss. For head pose estimation the supervised loss is a measure of how similar two poses are; typically the Mean Squared Error is used as pose loss,

$$L_{sup} = \|\hat{y} - y\|_2^2,$$

with the predicted Euler angles $\hat{y} = f(x)$ and the ground truth angles $y$.

Fig. 2. Overview of our training framework. Source data follows the supervised path (green) to train the pose estimator $f$. Target data is copied before the stochastic augmentation module, which creates two different augmented versions of the target input. Note that even though the ground truth pose is unknown for $X^t$, the relative pose between the augmented versions $X^{t'}$ and $X^{t''}$ can differ and is stored in a relative pose label. The relative pose label and predictions are fed into the consistency loss. The consistency loss provides supervision from consistency and relative pose labels. $f$ is trained jointly on both losses.

Fig. 3. Illustrations of the studied image augmentations. Each augmentation transforms the input image with random transformation parameters. The left images in each square show the inputs and the right images randomly transformed outputs. For synthetic images, a random background is added. Top row images from [7] and bottom row images from [14].

A. Consistency Regularization Framework
Stochastic input perturbations are a central aspect of consistency-based models. In practice standard image augmentations like blurring, translation and scaling (usually implemented as random cropping), horizontal flipping, rotation, and color distortions provide appropriate image perturbations (see Fig. 3).
Given a sample $x$ we create two randomly perturbed (augmented) inputs $x'$ and $x''$, which are fed into the network $f$ to produce predictions $f(x')$ and $f(x'')$.
A consistency loss $L_{cons}$ enforces that both predictions are similar. This consistency loss is typically the Mean Squared Error or the KL divergence [46]. We formulate our total loss as

$$L = L_{sup} + \lambda L_{cons}, \qquad L_{cons} = \|f(x') - f(x'')\|_2^2,$$

with $\lambda$ controlling the relative effect of the consistency term in the overall loss.
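To make the training objective concrete, the following is a minimal PyTorch sketch of one training step under this framework, restricted to pose-preserving augmentations. The names `model`, `augment`, and the batch variables are placeholders for the components described above, not the authors' released code.

```python
import torch.nn.functional as F

def training_step(model, augment, x_src, y_src, x_tgt, lam=0.2):
    """One step of consistency training with pose-preserving augmentations.

    x_src, y_src: labeled source images and Euler-angle labels (N, 3)
    x_tgt:        unlabeled target images
    lam:          weight of the consistency term (lambda)
    """
    # Supervised MSE pose loss on the labeled source domain.
    loss_sup = F.mse_loss(model(augment(x_src)), y_src)

    # Two independently augmented views of the same target images.
    pred_a = model(augment(x_tgt))
    pred_b = model(augment(x_tgt))

    # Consistency loss: both views should yield the same pose prediction.
    loss_cons = F.mse_loss(pred_a, pred_b)

    # Total loss L = L_sup + lambda * L_cons.
    return loss_sup + lam * loss_cons
```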
The same stochastic perturbations are applied to both source and target images. Note that in SSL the consistency loss is typically applied to samples from both $D_s$ and $D_t$ [42], [45], [46]. Following [22], who use consistency regularization for domain adaptation, we apply the consistency loss only to samples from the target domain $D_t$. Our framework is shown in Fig. 2.
Unfortunately, flipping and rotation will change the ground truth label of a source-domain sample and produce target-domain inputs that break the consistency assumption that $x'$ and $x''$ share the same label. We therefore need to distinguish between pose-preserving and pose-altering augmentations and need to redefine our loss functions for pose-altering augmentations. As shown in Fig. 3, pose-preserving augmentations are random color distortion, blurring, translation, and scaling, and pose-altering augmentations are flipping and rotation. The required changes for pose-altering augmentations will be described in the next section.

B. Relative Pose Consistency
Pose-altering augmentations change the head pose. Knowing the spatial transformation and the true pose, an augmented image can be relabeled. However, this is not possible if the true pose is unknown. We create a new consistency loss based on the relative pose between augmented samples to benefit from pose-altering augmentations on our real-world unlabeled target data.
We will first give a short recap on pose representation and then describe how image rotation and flipping change the orientation of the head pose, along with the required adaptations to the loss functions. Both augmentations require that the pose is stored in Euler angles (Tait-Bryan angles) that describe intrinsic rotations around Z-Y'-X''. These are known as roll, yaw, and pitch. This means that the rotation is performed by three successive rotations around the Z, Y' and X'' axes. Recall that for intrinsic rotations the first rotation around Z will create a new coordinate system in which Y' will be used for the second rotation, and so on. For this representation a rotation around Z can be carried out independently from the Y' and X'' rotations. That means that any image rotation will result in an additive rotation term to the roll label.
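This roll additivity is easy to check numerically. The following sketch (illustrative only, using SciPy rather than the paper's code) composes an in-plane rotation with an intrinsic Z-Y'-X'' pose and shows that only the roll angle changes:

```python
from scipy.spatial.transform import Rotation as R

# Head pose as intrinsic Z-Y'-X'' Euler angles (roll, yaw, pitch).
pose = R.from_euler("ZYX", [10.0, 25.0, -5.0], degrees=True)

# An in-plane image rotation by r_A acts as an additional rotation
# around the camera's Z axis, composed in front of the pose.
r_A = 15.0
augmented = R.from_euler("Z", r_A, degrees=True) * pose

# Only roll changes, by exactly r_A: prints [25. 25. -5.]
print(augmented.as_euler("ZYX", degrees=True))
```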
Augmenting an image with (unknown) rotation $r$ with two random rotations $r_A$ and $r_B$ results in images A and B with rotations $r + r_A$ and $r + r_B$, respectively. One can easily see that the difference in rotation between the two augmented images is $r_B - r_A$, which is the relative pose difference between the images. To account for this difference we can change the consistency loss to

$$L_{cons} = \left\| \left( s_A \odot f(x_A) + d \right) - s_B \odot f(x_B) \right\|_2^2, \qquad d = (0,\ 0,\ r_B - r_A)^T,$$

where $\odot$ is the element-wise product, $d$ is the relative pose label shifting the roll angle, and the sign vectors $s_A$, $s_B$ handle flipping. This complete example is also illustrated in Fig. 1. Cases where B, or A and B, are flipped are handled by negating B's, or A's and B's yaw and roll angles, respectively. Note that the information provided by a relative pose only influences the yaw and roll angles, as the pitch angle is untouched by rotation and flipping.
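As a concrete illustration, a relative pose consistency loss along these lines could be implemented as follows. This is a sketch based on the formulation above; the (pitch, yaw, roll) output ordering, the variable names, and the flip handling via sign vectors are assumptions of this sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def relative_pose_consistency(pred_a, pred_b, r_a, r_b, flip_a, flip_b):
    """Consistency loss between two augmented views A and B.

    pred_a, pred_b: (N, 3) predicted angles, assumed (pitch, yaw, roll)
    r_a, r_b:       (N,) image rotation angles applied to each view
    flip_a, flip_b: (N,) bool, True where a view was horizontally flipped
    """
    ones = torch.ones_like(r_a)
    sa = 1.0 - 2.0 * flip_a.float()  # -1 where view A was flipped
    sb = 1.0 - 2.0 * flip_b.float()

    # Flipping negates yaw and roll; pitch is unaffected.
    sign_a = torch.stack([ones, sa, sa], dim=1)
    sign_b = torch.stack([ones, sb, sb], dim=1)

    # Relative pose label: image rotation only shifts the roll angle.
    rel = torch.zeros_like(pred_a)
    rel[:, 2] = r_b - r_a

    # Predictions must agree after undoing flips and the relative roll.
    return F.mse_loss(sign_a * pred_a + rel, sign_b * pred_b)
```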

C. Avoiding Degenerate Solutions
Several works report difficulties when training with consistency regularization. In contrast to previous works, we apply consistency regularization to a regression problem and therefore use a different loss combination. For this reason, instead of class logits, we regularize the predicted pose angles. In the following we will address these difficulties and how we dealt with them.
The first difficulty is the selection of λ. Reference [42] found that the network can get easily stuck in a degenerate solution if the unsupervised loss component (L cons ) is too high in the beginning of the training. As a solution, they ramp-up λ from 0 to 1 during training. The same procedure was also adopted in [45]. In contrast, French et al. [22] replaced the ramp-up with a confidence threshold. They utilized the predicted class activations as probabilities and the loss of all samples with activations below the threshold is set to 0.
In our case we found that high λ values usually yield degenerate solutions, regardless of ramp-up. Our explanation is that in order to minimize the consistency loss the network can learn to output only a constant. However, we found that a good indication on how to set λ comes from the supervised loss. As a simple rule, the regularization feedback should not be stronger than the supervised loss. Preliminary experiments showed that consistency training is quite robust and λ values in the range [0.1, 0.4] converge to similarly performing networks. For λ > 0.5 the consistency loss became larger than the supervised loss and the overall performance decreased for both source and target data.
Another issue with consistency regularization arises if the labels of the source and target domain do not come from the same underlying distribution [47]. This is ignored in many works, because it is assumed that the unlabeled data contains the same class distribution as the labeled data. As described in [16], this assumption usually does not hold for regression tasks like HPE. For classification, [22] introduced a class balance loss term that forces the network's mean class predictions to be uniform. This helped to avoid a degeneration to the most dominant class.
Following this approach, we introduce a weighted relative consistency regularization for HPE. To enforce a more evenly distributed feedback of the consistency loss, we re-weight the consistency loss based on the pose predictions. Poses that are found often in a batch should be weighted down, whereas rarely appearing poses should be weighted up. In most natural image collections of faces, the poses are usually distributed around the pose that is facing the camera. The same holds true for most head pose datasets. Therefore, assuming a normal distribution of poses in a batch, we formulate our weighting as

$$w_p = 1 - \exp\left( -\frac{(p - \mu_P)^2}{2\sigma_P^2} \right),$$

where $w_p$ is the weight given to a pose angle $p$ (pitch, yaw, roll) and $\mu_P$ and $\sigma_P$ are the mean and standard deviation of all $p \in P$ in a batch. In particular, we apply this weight to all angles independently. To keep λ constant between experiments, we rescale $w_p$ with $\left(\sum_{p \in P} w_p\right)^{-1}$ so that the overall weight in a batch sums to one. We compare the results for weighted and unweighted predictions in our experiments in Section IV. Both λ and re-weighting are associated with the same underlying problem: an effective setting of the regularization strength. Although we have made two proposals, we think that uncertainty or curriculum approaches like [48] are paths worth looking into for future improvements.
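A sketch of this re-weighting, assuming the Gaussian-based weight given above; the function name and the per-angle normalization details are illustrative:

```python
import torch

def pose_weights(pred):
    """Per-angle consistency weights for a batch of pose predictions.

    pred: (N, 3) predicted angles. Poses near the batch mean (frequent
    poses) receive small weights, rare poses receive large ones. Weights
    are rescaled per angle so they sum to one over the batch.
    """
    mu = pred.mean(dim=0, keepdim=True)
    sigma = pred.std(dim=0, keepdim=True).clamp_min(1e-6)
    w = 1.0 - torch.exp(-0.5 * ((pred - mu) / sigma) ** 2)
    return w / w.sum(dim=0, keepdim=True).clamp_min(1e-6)
```

The resulting weights would then multiply the per-sample consistency errors before summation, replacing the uniform averaging in the consistency loss.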

D. Batch Normalization During Cross-Domain Training
Batch normalization (BN) [49], among other normalization methods, is a popular technique for training deep neural networks. BN normalizes activations in a network across the mini batch. Each feature is normalized by the mean and standard deviation of that feature in the mini batch during training. For inference/evaluation, typically a running estimate of each feature's mean and standard deviation is used. When working with multiple datasets (possibly from different domains) it is not realistic to assume that features in these datasets share the same statistics. Adapting feature statistics to different domains using batch normalization has been investigated, e.g., in [50], [51]. Regardless, here we will discuss the effects of BN in our framework.
Using BN on the target domain implicitly affects training and evaluation performance. Similar to previous works like [16], [20], [22], we process mini batches of source and target data sequentially. The other possible method is to combine the data in every batch, e.g., half and half. We found that sequential processing provides stable training. With sequential processing, BN uses domain-dependent statistics during training but mixed (running) statistics during evaluation. However, this implicit mixing of domain feature statistics can have beneficial or detrimental effects on the evaluation performance for each domain. Assuming that domain statistics are very different, the mean statistics of two domains, which are typically used during evaluation, might not provide good results. Aside from mixed statistics, it is possible to use only target-domain statistics [51] during target evaluation. In our experiments, we found that mixed or target-only statistics decrease performance on the target dataset. The effect gets worse the more the distribution of the label space, and thus the distribution of features, differs between source and target.
We therefore propose to use BN in a way that does not use target-domain statistics during training or evaluation at all. If data from the target domain is fed through the network, the running statistics of the source domain are used. The same applies for evaluation. This way, computation of BN statistics is completely independent of target-domain data. The advantage of this method is that we avoid any deteriorating effects from BN. The only drawback of this method is decreased training stability. In our experiments this is alleviated by using gradient clipping during training. We evaluate the performance of using "no target BN" in our experiments in Section IV-D.
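"No target BN" can be realized by temporarily switching all BN layers to evaluation mode whenever target-domain batches are processed, so the running source statistics are used and left untouched. A minimal PyTorch sketch (a helper of our own devising, not the authors' released code):

```python
import torch.nn as nn
from contextlib import contextmanager

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

@contextmanager
def no_target_bn(model):
    """Run target-domain forward passes with frozen BN statistics.

    All BN layers are put into eval mode so the running (source-domain)
    statistics are used and not updated; the rest of the network keeps
    its current training mode.
    """
    bn_layers = [m for m in model.modules() if isinstance(m, BN_TYPES)]
    states = [m.training for m in bn_layers]
    for m in bn_layers:
        m.eval()
    try:
        yield model
    finally:
        for m, state in zip(bn_layers, states):
            m.train(state)

# Usage: source batches update BN as usual; target batches do not.
# with no_target_bn(model):
#     pred_a, pred_b = model(x_tgt_a), model(x_tgt_b)
```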

IV. EXPERIMENTS
In the following, we will analyze the performance of our method. First, we will give a description of the used datasets and the implementation details. We then compare our results with related work. In the following section, our ablation study, we show the effects of different components of our method. Finally, we conduct a series of experiments to reproduce related work. In doing so, we gain insight into some factors that influence the reproducibility of results and also obtain results with which we can compare our work more fairly.

A. Data
To validate our method we use the revised datasets SynHead++ and Biwi+ proposed by [16]. These datasets are extensions of the popular face pose datasets Biwi Kinect Head Pose Database (Biwi) [7] and NVIDIA Synthetic Head Dataset (SynHead) [14]. For both datasets, [16] provides labels in Z-Y'-X''-angle representation and face bounding boxes. SynHead was artificially extended to include more poses, so that SynHead++ is a superset of Biwi+ with regard to pose labels. Here, we give a brief overview of the datasets.
Biwi+ is used as the real-world, target-domain dataset. It contains 24 sequences of 20 different subjects recorded with a Kinect sensor. SynHead++ is used as the synthetic, source-domain dataset. It contains images of 10 different rendered 3D head models. The total number of images is 15678 for Biwi+ and 653910 for SynHead++. All images are cropped to the given bounding boxes and scaled to 224 × 224 pixels. Exemplary images and illustrative augmentations are shown in Fig. 3.
To make our work more comparable with previous works, we also perform experiments on the popular 300W-LP [19] dataset. 300W-LP uses the images from 300W [52], a dataset that combined multiple datasets containing faces in unconstrained, "in-the-wild" conditions. These images have been re-annotated with facial landmarks. Landmarks and image features are used by [19] to fit a 3D morphable model (3DMM) to the face images. With the image and fitted 3DMM they synthesize (render) face images with new head poses. 300W-LP includes 5488 real images and 55737 synthesized images. In addition, all images are flipped to create a dataset with 122450 images. We processed the data the same way as Biwi+ in [16].

B. Implementation Details
For all our experiments, the pose estimator $f$ is a ResNet18 as provided by PyTorch [24], with the last linear layer replaced by a new linear layer with 512 inputs and 3 outputs for Euler angle estimation. This is consistent with previous work [16], [20].
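Setting up this estimator takes only a few lines; a sketch using torchvision (the weights argument follows the current torchvision API):

```python
import torch.nn as nn
from torchvision import models

# ResNet18 backbone, ImageNet-pretrained, with the last linear layer
# replaced by a new 512 -> 3 head regressing the three Euler angles.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(512, 3)
```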
We use different combinations of augmentations throughout the experiments, with parameters provided in Table I. If an augmentation would produce a roll angle > ±89°, the rotation is performed in the opposite direction. We used the code and parameters from [40] for color distortions and Gaussian blur. Similar to [22], we process mini batches of source and target data sequentially. For optimization we use [53], a batch size of 84, and a learning rate set to $10^{-4}$. For our "supervised only" baselines $f$ is initialized with the default PyTorch pretrained ResNet18. The learning rate is ramped up to warm-start the optimization. During baseline training λ is set to 0. The baselines are trained for 35000 iterations, which is equivalent to ≈5 epochs for SynHead++ or ≈24 epochs for 300W-LP.
For all consistency regularization experiments we fine-tune a baseline model. To make the comparisons fair for all runs, we select one (supervised only) baseline model that is trained with all augmentations (full). From the 10 runs, the baseline model that performs most similarly to the average performance is selected. For the consistency regularization experiments, all models are fine-tuned for 16000 iterations. λ is ramped up to 0.2 to avoid deterioration from too strong regularization. For all experiments the performance at the end of training is reported, i.e., no early stopping is used. The performance for Biwi+ is reported without any augmentations.
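The exact ramp-up schedule is not critical; one common choice, used by [42], is a sigmoid-shaped exponential ramp. A sketch (the schedule shape is an assumption of this sketch; only the target value of 0.2 is specified above):

```python
import math

def lambda_schedule(step, ramp_steps, lam_max=0.2):
    """Ramp the consistency weight from 0 to lam_max over ramp_steps,
    using the exponential schedule exp(-5 * (1 - t)^2) from [42]."""
    if step >= ramp_steps:
        return lam_max
    t = step / ramp_steps
    return lam_max * math.exp(-5.0 * (1.0 - t) ** 2)
```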
We timed the execution of our approach and found that for one image it takes less than 3 ms on a consumer GPU (NVIDIA GTX 1080 TI). For the methods we reproduced (see Section IV-E) we found similar or slightly higher times but all of them would be suitable for real-time processing at 30 fps.

C. Head Pose Estimation Results
In this section, we compare our results to groups of similar experimental settings that we clustered from the related work. In Table II we measure the mean absolute error (MAE) for every angle and the overall MAE across the three angles. We report the mean and standard deviation of these values over 10 runs using different random seeds. Please note that this is not the standard deviation of the pose errors, but the standard deviation of the mean errors over all runs.
In the upper part of Table II we collected the results of related work. The lower part shows the results of reproduced related work (Section II-A) and our work, all using Biwi+ as a common test set. Most work trains with real-world images and typically focuses on improving head pose estimation by improved network structures or supervised loss functions. In contrast, our main goal is to learn from synthetic images and unlabeled real-world images using a semi-supervised method. In addition to the differences in models and learning strategies, it is not always straightforward to compare head pose estimation results, even if the results are reported for the same dataset. For example, the evaluation protocol and processing of the Biwi dataset is not universally the same for all works. A deeper analysis of a factor that makes the comparability of results problematic is given later in Section IV-E. However, even though not always directly comparable, we think it can be valuable to discuss our results in a broader context of related work.

Cross-Validation: This section of Table II shows the performance for methods trained and evaluated on different splits of the same dataset. Compared to our results and other experimental settings, higher performance is likely explained by having no domain gap between training and test set. Both sets are from the relatively homogeneous Biwi dataset, which is recorded under lab conditions. The performance in this section is also dependent on the training/test split. Unsurprisingly, a random split offers the best results, as splitting by persons introduces a small domain gap.
Self-Supervised: This section shows that learning head pose can even be accomplished self-supervised. No pose labels are used during training. To obtain pose predictions, a linear regressor is trained on 100 random test set samples to map network outputs to pose labels. While the results are not on par with our or recent work, it shows the potential of self-supervised pose estimation. Training with our proposed relative pose labels can be seen as a self-supervised approach.
Cross-Dataset: Sometimes called inter-domain or cross-domain evaluation, these sections show results where the training set and test set are taken from different datasets. Most commonly, 300W-LP [19], a dataset created from real-world images, is used for training. As 300W-LP uses real images, the domain gap to Biwi should be small compared to using fully synthetic images from SynHead++. However, the distribution of poses in 300W-LP does not match the one found in Biwi, which makes it challenging to gain high performance when training on 300W-LP only.
The section in the upper part of Table II shows results from related work. Similar to us, [15] uses rendered synthetic images, but also a part of the Biwi dataset to train a HPE model, which makes a fair comparison difficult. Only [27] (WHENet and WHENet-V) publish better results than ours. WHENet-V uses additional data from the (real-world) Panoptic Studio dataset [54] to better match the pose distribution of Biwi.
In an effort to create comparable results, we reproduced related work on Biwi+ in the section provided in the lower part of Table II. More details can be found in Section IV-E. Using the proposed batch processing scheme, our method weighted relative consistency regularization (RCRw) trained on 300W-LP outperforms all reproduced work evaluated on Biwi+.
Domain Adaptation: This section shows results for synthetic-to-real (SynHead++ to Biwi+) domain adaptation. Our proposed method RCRw outperforms the partial domain adaptation method proposed in [16]. Although we get worse pitch performance, we achieve better yaw, roll, and average performance. For the roll error, the improvement over [16] is nearly one degree. We also report our results from [20]. In comparison to [20], improved handling of batch normalization boosted our average and pitch performance but also slightly deteriorated yaw and roll accuracy. Nevertheless, the additional gain is so significant that this deterioration can be accepted. A detailed analysis of the effects of different components of our method can be found in Section IV-D.
Looking at Table II, the pitch error is higher than the roll or yaw error for all reported results. We suspect that pitch estimation is a harder problem. In our experiments, pitch estimation gained the least from our method. Presumably, pitch estimation cannot benefit from the relative pose labels, as the pitch angle is constant for all our augmentations. In contrast, roll benefits the most from our relative pose labels. Probably for this reason, we outperform almost all other work in terms of roll error.

D. Ablation Studies
In this section we analyze the different components of our framework in relation to head pose estimation performance. Even though we are mainly interested in the synthetic-to-real scenario, we include experiments with 300W-LP, as this database is widely used in the HPE community.
We first establish a number of baselines in the first section, "Supervised Only", of Table III. These experiments are trained only with the supervised loss on a source dataset and evaluated on the Biwi+ dataset. The experiments only differ in the use of different augmentation combinations and the selected source dataset.
Experiments trained on SynHead++ reveal the effects of augmentations during training on synthetic data. It is clearly visible that augmentations help to improve target performance. Interestingly, rotation and flipping augmentations give only a small or no performance boost. This can be explained by the structure of the datasets. SynHead++ already covers all poses found in the test set Biwi+. Any additional pose augmentations to SynHead++ will likely make the training set more different from the test set. Color, scale, translation and blur augmentations create images that might look more similar to the real-world test set. However, augmentations are not sufficient to reach the performance of related work on the Biwi dataset, which suggests that additional methods might increase performance. These findings demonstrate two of our key assumptions. First, training on a synthetic image dataset does not automatically provide good results for a real-world image dataset. Second, the tested augmentations improve the performance but are not sufficient to force the network to learn features that generalize well to real-world images.
Experiments trained on 300W-LP show that augmentations also increase the performance if an "in-the-wild" training set such as 300W-LP is used. The performance gain from augmentation is smaller than for SynHead++. We think this is explained by a smaller domain gap between 300W-LP and Biwi+, because both contain real-world images. Comparing the absolute performance on Biwi+, training with 300W-LP and full augmentations is inferior to training on SynHead++ with full augmentations. This again can be explained by the structure of the datasets. 300W-LP does not include all poses of Biwi+. So even if we have very similar (real-world) image features, some poses can simply not be learned from 300W-LP, as they can only be found in Biwi+ or SynHead++.
Consistency regularization uses our proposed consistency framework. Again, different augmentations are evaluated. Using only pose-preserving augmentations for consistency regularization (CR) already improves the results over the baselines. Adding rotation and flipping further increases the performance. The best performance for yaw and roll is gained when using the full augmentation scheme with relative pose (RCR). However, the best overall performance is gained when using all augmentations without flipping. In the RCR setting, using "no target BN" boosts the overall performance and is beneficial for pitch estimation. Although a general increase in performance can be observed, yaw and roll estimation is slightly deteriorated by using "no target BN". We theorize that this is caused by the distribution differences between Biwi+ and SynHead++. Compared to pitch, the yaw and roll distributions are more similar between the datasets, and therefore the estimation might even benefit from an implicit alignment by a shared batch norm. Weighted consistency regularization includes the proposed weighting of angles during training.
Compared to CR and RCR, regardless of the augmentation scheme, weighting gives a slight performance boost. The observations for applying "no target BN" are the same as for RCR. Again, the best overall performance is gained when using all augmentations without flipping and "no target BN".
We also applied the RCRw with full augmentations to 300W-LP. Interestingly, the gain for "no target BN" is very high. We conjecture that this effect is created by a high difference in pose labels between 300W-LP and Biwi+. The different poses also create very different feature distributions, which causes detrimental effects if batch normalization statistics are averaged from source and target dataset. Possibly for the same reason, the overall gain of using consistency regularization is lower compared to the SynHead++ experiments.

E. Influence of Face Detection on Performance
A phenomenon evident from the upper part of Table II is that the community has not settled on a standardized evaluation protocol for head pose estimation. Even though all of these works report results on the same dataset, the dataset preprocessing and splitting is often different. In an attempt to improve the comparability of results, we try to recreate results on Biwi+ for related work. Luckily, code and models for the works of Ruiz et al. [17], Yang et al. [18] and Zhou and Gregson [27] are available online. However, we cannot simply feed the Biwi+ data to the models, because every method uses different image preprocessing. As suggested in previous work [25], [26], how the face is cropped before pose estimation is a crucial factor for performance. Furthermore, the models are trained to recognize poses that come from crops similar to those used during training, which might depend on the used face detector. We therefore conduct a small study on preprocessing and ask two questions. What role does face detection play in HPE performance? How important is a predictor-specific bounding box, i.e., is a bounding box of one detector replaceable by another? While there are certainly more differences (a full analysis is available with our published code), here we focus on face detection and transferability of detections (bounding boxes), two steps that have not been analyzed before.
In Figure 4 we visualize the bounding boxes that are used for cropping in different methods. Note that our methods use the crops from Biwi+ [16], which use a CNN-based face detector from Dlib [56] and manual labeling. Hopenet [17] uses Dockerface [55], a Faster R-CNN-based detector inspired by [59]. FSA [18] uses MTCNN [57], and WHENet [27] utilizes a YOLOv3 [58] model trained to detect faces. Using the aforementioned detectors on the Biwi dataset, we found that face detection provides three sources of errors: finding the wrong person in the image; finding no person in the image; and finding a bad crop of the face, e.g., where not all parts of the face are visible. All of these cases happen on Biwi, and the latter two cases can be seen in Figure 4. To avoid detecting the wrong person, detections are usually selected by size or by distance to the image center. We did this in our reproduction experiments for all evaluated face detectors except for MTCNN, where the selection procedure is provided by the authors' code [18]. In Table IV we show our results of recreating related work. "Used Biwi Images" reports the number of images with face detections for each face detector, which is also the number of images used for our evaluation. One can see that this number ranges from 13219 (MTCNN) to 15678 (Dlib and manual detections). From our experiments it seems likely that only images with detected faces are used for evaluation in related work. This makes reported results on Biwi variants highly incomparable. Not only do the numbers of compared images differ, e.g., ≈16% of images are omitted for FSA, but especially faces with large head rotations, and therefore difficult samples, are usually the ones not found by face detectors. Similar results can be found for methods like [29] that utilize FSA's data processing code.
As a consequence, we tried to use the boxes provided by Biwi+ for all methods, to gain equally sized test sets. As it would be unfair to use a crop that is dissimilar from the one a network is trained on, we calculated a transformation from the Biwi+ box to the box associated with the method in question. The transformation minimizes the mean difference in box translation and box scale over the intersection of detections found by two detectors. Although straightforward, this transformation works very well for FSA and Hopenet, as the performance is nearly identical to the original boxes. This provides evidence that the predicted face position from a face detector has only a small influence on the final performance. That is, we can interchange face detectors and get similar results. Surprisingly, Biwi+ boxes work even better than the ones created by MTCNN for FSA. Unfortunately, we could not recreate the results reported for WHENet. Another conclusion of Table IV is the fact that changing the face detection method can change the evaluation dataset and therefore the performance from state of the art to irrelevant.
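Such a box transfer can be estimated from the images where both detectors fire. The following sketch, with hypothetical variable names and assuming (x, y, w, h) boxes, fits a mean scale ratio and a size-normalized center offset and applies them to new boxes:

```python
import numpy as np

def fit_box_transform(boxes_a, boxes_b):
    """Estimate a mean translation/scale mapping from detector A to B.

    boxes_a, boxes_b: (N, 4) arrays of (x, y, w, h) boxes for the same
    images, restricted to images where both detectors found the face.
    """
    centers_a = boxes_a[:, :2] + boxes_a[:, 2:] / 2
    centers_b = boxes_b[:, :2] + boxes_b[:, 2:] / 2
    # Mean scale ratio and mean center offset, normalized by box size.
    scale = np.mean(boxes_b[:, 2:] / boxes_a[:, 2:])
    offset = np.mean((centers_b - centers_a) / boxes_a[:, 2:], axis=0)
    return scale, offset

def apply_box_transform(box, scale, offset):
    """Map an A-style box into a B-style box."""
    x, y, w, h = box
    cx = x + w / 2 + offset[0] * w
    cy = y + h / 2 + offset[1] * h
    nw, nh = w * scale, h * scale
    return np.array([cx - nw / 2, cy - nh / 2, nw, nh])
```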
In summary, the role of the face detector seems less important as long as the same subset of images is used during evaluation. Furthermore, this section showed that a much stricter evaluation protocol should be used to compare head pose estimation results. There are even additional differences to the Biwi+ evaluation protocol, which we neglected here; a full analysis is available with our published code. Even though preprocessing is discussed in the literature [16], [25], [26] as an important factor, by using non-standardized preprocessing many works indirectly assume preprocessing to have negligible influence on performance comparisons. On the contrary, we believe the contribution of preprocessing might even conceal the effects of novel methods on observed performances. This calls for a more unified evaluation protocol. We make the code to recreate the results of Table IV publicly available. There we also investigate additional factors, such as different rotation representations, and their effects on performance.

V. CONCLUSION AND FUTURE WORK
We propose relative pose consistency, a new approach to improve deep head pose estimation performance with semi-supervised learning. Our method allows pose-altering augmentations, like rotation, to be incorporated into a consistency regularization framework. In addition, we introduce two extensions, a weighting scheme and a batch processing scheme, to improve performance. We evaluate our method in two scenarios: domain-adaptation and cross-dataset evaluation. For domain-adaptation, our method outperforms previous work by 7%. To enable direct comparisons for the cross-dataset evaluation, we reproduced results from related work. Thereby, we uncover that inconsistent preprocessing seems to prevent comparability of head pose estimation results and suspect that the effects of data preprocessing can conceal methodological contributions. This raises the question of whether stricter evaluation protocols are needed. In this context, our method improves the state of the art for the cross-dataset evaluation by 6%. Furthermore, our approach trained only on labels from synthetic data outperforms previous work trained on real-world images (300W-LP) by 20%. We thereby demonstrate that state-of-the-art performance on real-world images can be achieved when using only labels from synthetic training data. Ultimately, however, there is still a gap to methods trained with real-world datasets that are very similar to the target domain.
In future work, our framework could also be combined with other methods. Among the many consistency regularization frameworks, we based our work on the simple Π-Model [42]. Therefore, another direction could be to extend our framework to consistency frameworks like self-ensembling with teacher-student models [45]. Lastly, the concept of relative pose consistency could be applied to other pose estimation tasks such as hand or body pose estimation, or scale and translation estimation methods like [60]. An example of another pose estimation application has been proposed for gaze estimation in [61].