Cross-Domain Multitask Model for Head Detection and Facial Attribute Estimation

Extracting specific attributes of a face within an image, such as emotion, age, or head pose, has numerous applications. As one of the most widely used vision-based attribute extraction models, HPE (Head Pose Estimation) models have been extensively explored. In spite of the success of these models, the pre-processing step of cropping the region of interest from the image before it is fed into the network is still a challenge. Moreover, a significant portion of the existing models are problem-specific models developed solely for HPE. In response to the wide application of HPE models and the limitations of existing techniques, we developed a multi-purpose, multi-task model that parallelizes face detection and pose estimation (along both the yaw and pitch axes). This model is based on the Mask-RCNN object detection model, which computes a collection of mid-level shared features in conjunction with some independent neural networks for the detection of faces and the estimation of poses. We evaluated the proposed model using two publicly available datasets, Prima and BIWI, and obtained MAEs (Mean Absolute Errors) of 8.0 ± 8.6 and 8.2 ± 8.1 for yaw and pitch estimation on Prima, and 6.2 ± 4.7 and 6.6 ± 4.9 on BIWI. The generalization capability of the model and its cross-domain effectiveness were assessed on the publicly available UTKFace dataset for face detection and age estimation, resulting in a MAE of 5.3 ± 3.2. A comparison with the published results of state-of-the-art models on the domains tested shows that the proposed model compares favorably with them. We provide the source code of our model for public use at: https://github.com/kahroba2000/MTL_MRCNN.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Li Zhang.
HPE (Head Pose Estimation) is an open research area that has drawn the attention of specialists in different domains. The wide applications of HPE in assistive systems, human-computer interface systems, virtual reality, etc., have brought it to the center of attention of the research community. For instance, HPE is one of the most efficient UIs (User Interfaces) for paralyzed patients who are suffering from complete quadriplegia [1], [2]. The patients in this group have little control over their four limbs, so head movement is one of the few ways for them to interact with computers and electronic devices. For example, several studies have used head movements in the yaw and pitch directions to control an EPW (Electric Powered Wheelchair) [3]-[5]. Another application of HPE lies in vehicle-related technologies, where HPE is used to examine the attention of drivers [6]-[8]; it has similarly been used to examine students' attention in class [9], [10]. In addition, the recent success of VR-related technologies has motivated researchers to use HPE for estimating the users' gaze and FOV (Field of View) via head pose information [11]. Having a fast and reliable HPE model for all the aforementioned applications is critical, and to this end the research community has been focusing on two main HPE approaches: sensor-based and vision-based methods. Though sensor-based approaches (IMU, tilt sensors, etc.) are regarded as promising solutions, they impose an unwelcome level of discomfort and distraction on the users, because the sensors must be attached to the users' heads.
In contrast, vision-based techniques enable us to calculate Euler angles from 2D scans of a user's head, without requiring physical contact. Vision-based HPE is not a new idea, and various studies with different levels of success have been carried out to tackle this problem [12]- [18]. Despite the promising success of these models in pose estimation, they often suffer from the lack of an integrated head detection mechanism. Therefore, before feeding the image to the model for estimation, a preprocessing step needs to be introduced for cropping the ROI (region of interest; in this study, faces). This can be achieved either manually or via existing face detection modules [19]- [21]. Face detection algorithms, used in conjunction with the HPE, can adversely affect accuracy, speed, and efficiency [22]. Also, a non-integrated head detection mechanism would introduce significant processing demands and delays in a multi-face pose estimation task. Following the recent success of neural networks in performing concurrent face detection and landmark detection [23]- [25] through a set of shared features, some works have proposed the idea of using multi-task learning models for parallelizing the face detection and the pose estimation process [22], [26], [27].
Inspired by the wide applications of HPE, but taking into account the limitations of the existing models and their lack of extensibility to other domains, we developed a multipurpose, multitask object detection model. The proposed model localizes objects of interest (in this case faces) while concurrently estimating attributes of those objects. In other words, the cross-domain, multitask object detection model can be used for simultaneous face detection and pose estimation, with adequate generalization capability to also be used for the estimation of other facial attributes (e.g., age). To this end, we developed an improved version of the current Mask-RCNN model [19] to detect the face and estimate its attributes (head pose in this case). Motivated by practical applications of HPE, as in assistive technologies for head-operated wheelchairs [5], we developed our model for estimating the head orientation in the yaw and pitch axes. The model is built on top of the Mask-RCNN object detection model and has been tested on two public datasets: BIWI and Prima. The cross-domain aspect of the model is also validated on the public dataset UTKFace, for face detection and age estimation. In the following sections, we first explore the existing studies in this area (Section II), followed by an in-depth explanation of the proposed model (Section III), and its testing and evaluation (Section IV). Section V discusses the limitations of the proposed model, while the concluding remarks summarizing the findings of this study are provided in Section VI.

II. RELATED WORKS
Due to the various applications of HPE, a number of vision-based techniques for this purpose have been proposed by the research community. In this section, we first discuss the existing vision-based HPE techniques, followed by an exploration of the multitask learning models developed for parallelizing several tasks in neural networks.

A. HEAD POSE ESTIMATION
Vision-based techniques for pose estimation have gained momentum in the computer vision area. Several advantages over sensor-based methods make them more attractive to developers and end users: they require minimal equipment, are contactless, and can be set up cost-effectively (using just an RGB camera, for example). Two approaches have been used to develop vision-based HPE, geometry-based and learning-based: the geometry-based ones analyze geometrical features (such as facial landmarks) to estimate the head pose, while the learning-based ones estimate it with machine learning techniques. Geometry-based approaches are mainly built upon two individual modules: i) performing landmark detection and ii) processing the geometry of the landmarks to estimate the pose. Amongst the first attempts, [28] analyzed the geometry of five facial landmarks to estimate the head pose. In [29], the authors analyzed the facial landmark geometry to estimate the head pose via two cascaded steps: first, they identified the facial landmarks, and then they processed the landmarks with respect to a virtual web-shaped network for head pose estimation in all three axes of yaw, pitch, and roll.
In a more advanced approach, [30] developed a novel face ellipsoidal model to estimate the yaw pose of drivers' heads with the aid of some facial landmarks. Similarly, [31] utilized a set of modified facial feature extractors, including adaptive Hough transform [32], template matching, active contour model, and projective geometry properties, to detect facial landmarks and consequently estimate the yaw angles. In line with the geometry-based approaches, some other works attempt to estimate head pose from the correspondence between the features extracted from a 2D image and a 3D facial model [11]. This technique analyzes the projection relationship between a 3D facial model and 2D features to calculate the rotation matrix [33]-[36]. Another geometry-based approach [3] estimated the head pose with the aid of a Kinect sensor and three landmarks on the head. Given the high cost of Kinect, other studies have used inexpensive RGB web cameras, which appear to be more cost-effective, to capture facial frames [4], [37], [38].
54704 VOLUME 10, 2022
The second family consists of the learning-based HPE models. Learning-based techniques aim to train a model for estimating the spatial head pose via appearance features. This spatial pose estimation can be either a classification approach, which assigns the input head images to specific (discrete) position intervals, or a regression approach, which estimates the head pose continuously. The features in this technique are mainly extracted automatically by convolutional neural networks that need to be trained with a large, annotated face dataset, although hand-crafted features have also been used. For instance, [37] deployed a set of Gabor features (a Gabor filter is a linear filter used for texture analysis in image processing), along with a machine learning model (a random forest algorithm), for face image classification. Due to the classification nature of this model, it is considered a discrete HPE model. In another study, [38] proposed a new model for yaw and facial landmark estimation in the wild. Similar to [37], they classified facial images into several classes of yaw angles with intervals of 15°. Following the great success of CNNs in the extraction of features, [39] trained a deep neural network to learn the mapping function between the visual appearance and the 3D head orientation angles. The authors developed their model as a regression model that finds the correlation of features extracted by a CNN. Similarly, [16] developed an HPE model based on a multi-loss neural network with a function to estimate each of the Euler angles.

B. MULTITASK LEARNING
For all the HPE models discussed in the previous subsection, the presence of a face in the input image is an assumption. This means that, in practice, the face must first be detected and then cropped from the original image before being fed into the HPE network. Multitask Learning (MTL) models can simplify this process by integrating a head detection step into an HPE model. Given that the early layers of a deep CNN tend to learn generic features of an image, which can also be useful for other tasks, the idea of sharing learned features between different tasks formed the first multitask learning models [40], [41]. Sharing features in MTL models not only increases the processing speed, but also yields features that are less biased towards the data of a particular task [42]. Despite the recent success of MTL models, their application in computer vision and object detection is still in its infancy.
One of the few examples is [43], which introduced a multitask learning object detection model for the detection of dangerous objects, by detecting an object and estimating its distance from the camera. The authors used a number of convolutional layers for the extraction of features, which were shared between object detection and distance estimation. Similarly, [44] developed an MTL model for object detection and saliency estimation, trained from a non-jointly annotated dataset. In the context of HPE, considerable effort has also been put into developing MTL head pose estimation models [22], [26], [27], [42], [45], [46]. Some of them tried to jointly estimate the head pose along with facial landmarks [27], [47], while others tried to detect the head along with estimating its pose [42]. For instance, [48] used a Mask-RCNN model in a multitask learning setup for joint position estimation, orientation estimation, and body segmentation, by sharing the global features among all tasks. However, the effectiveness of such a network for head pose estimation remains unknown. In a similar way, [42] developed a multitask learning approach to improve the performance of previous work by integrating a face detection step with feature extraction and pose estimation. Their model outputs continuous values of head orientation and demonstrated a MAE of less than 4°. One issue with the existing multitask learning HPE models is that most of them are problem-specific, with the sole goal of face detection and head pose estimation.
The ideas and challenges discussed above have led us to develop a general-purpose MTL model, whose use is not restricted to the HPE domain (hence "cross-domain"), but which can be trained to determine an attribute of choice (e.g., other facial attributes, such as age) while performing object detection. In the next section, we present the proposed model architecture, the methodology of its implementation, as well as the training considerations.

III. APPROACH
This section discusses the architecture of the proposed model, the hyper parameters, and evaluation metrics.

A. ARCHITECTURE
The overview of the proposed model is presented in Figure 2. It is important to mention that the backbone of this algorithm is adopted from Mask-RCNN [19]. The overall idea of the proposed network is as follows. The input images are fed to both an RPN (Region Proposal Network) and a feature descriptor (Resnet50, which extracts features from the input image). The RPN is a network that identifies the prospective objects (also known as ROIs; Regions of Interest) within images. ROIs are coordinates of rectangles (known as bounding boxes) that are likely to contain an object; they are fed to another classifier to determine the class of the bounded object. In Mask-RCNN [19], the researchers have developed a novel, lightweight neural network that performs a preliminary object detection to extract the ROIs. The RPN needs to be trained simultaneously with the object detection and the attribute estimation model.
For training the RPN, a window slides over the image with a certain stride (sliding step). At each step, windows (called anchors in [20]) of three different sizes, each with three different aspect ratios (9 anchors in total), are created. RPN_ANCHOR_RATIOS and RPN_ANCHOR_SCALES are two hyper parameters of the RPN, representing the width-to-height ratios of the anchors and their sizes, respectively. For instance, for a stride of one, in an image with dimensions w × h, the model generates w × h × 9 anchors. As defined by [20], the anchors with an IOU (Intersection over Union; a metric that measures the overlap between two windows) greater than 70% with the GT (ground truth) bounding box are flagged as positive (foreground), and the ones with an IOU below 30% are flagged as negative (background). These positive and negative target anchors are then used to train the RPN. During the training process, the positive ROIs generated by the RPN with an IOU greater than a threshold (usually 50%) are selected for training the object detector (a classifier that identifies the class of the object) and the other headers (e.g., the attribute estimation model); this technique is known as NMS (Non-Max Suppression). A certain number of the positive and negative ROIs generated by the RPN (specified by the Train_ROIS_Per_Image parameter), with a positive/negative ratio of ROI_POSITIVE_RATIO, are then selected for training the headers.
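The anchor generation and labelling steps described above can be illustrated with a minimal, pure-Python sketch. It is not taken from the released code: the function names, the example scales and ratios, and the "ignored" label for anchors between the two thresholds are assumptions made for illustration; only the 70% and 30% IOU thresholds and the count of 9 anchors per position come from the text.

```python
from itertools import product

def make_anchors(cx, cy, scales, ratios):
    """Return one (x1, y1, x2, y2) box per (scale, ratio) pair, centred on (cx, cy)."""
    boxes = []
    for scale, ratio in product(scales, ratios):
        # ratio is width / height; the anchor area stays close to scale**2.
        w = scale * ratio ** 0.5
        h = scale / ratio ** 0.5
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_box, pos_thr=0.7, neg_thr=0.3):
    """1 = foreground, 0 = background, -1 = neither (assumed to be ignored)."""
    overlap = iou(anchor, gt_box)
    if overlap > pos_thr:
        return 1
    if overlap < neg_thr:
        return 0
    return -1

# Three scales x three ratios = 9 anchors at a single sliding-window position.
anchors = make_anchors(64, 64, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0))
gt = (32, 32, 96, 96)  # hypothetical ground-truth face box
labels = [label_anchor(a, gt) for a in anchors]
```

In this toy setting only the square 64-pixel anchor coincides with the ground-truth box and is flagged as foreground; the remaining anchors are either background or fall between the two thresholds.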
The positive ROIs are then cropped from the feature map and converted to two fixed-size feature maps with a technique called ROIAlign (see [20] for more information). The feature map that is the input for the bounding box and classifier network has a size of 7 × 7, according to [19]. The cropped ROI for the pose estimation is resized to the fixed size of 28 × 28. The feature maps are then connected to three sets of head networks: a network for classifying objects within the proposals and fine-tuning the bounding box coordinates, a network for generating masks, and a third one for attribute (i.e., pose) estimation. The fixed-size 28 × 28 feature map is fed to a network that contains a series of convolutional layers, activation functions, and dense layers (see Figure 3).
In our pose estimation convolutional network, a fixed-size feature map is passed through two blocks, each consisting of a convolutional layer (kernel size: 3 × 3), a max-pooling layer (window size: 2 × 2), batch normalization, and an activation function (ReLU), followed by one more convolutional layer and some dense layers, as shown in Figure 3. In the last layer, a linear activation function generates an output in the range of 0 to 1, which is then linearly mapped to the range of 0 to ϕ_max.

B. MULTI-LOSS
The different tasks in neural networks result in different losses, making it necessary for multitask learning algorithms to have a multi-loss function. For our model, we propose a multi-loss that combines the losses of the bounding box, classifier, segmentation, and pose estimation regressors. For the bounding box detection, the L1 loss function is implemented as below:

$$L_{bbox} = \sum_{j \in \{x, y, w, h\}} \mathrm{Smooth}_{L1}\left(BB_{prediction}^{j} - BB_{True}^{j}\right) \tag{1}$$

where the Smooth L1 is defined as below:

$$\mathrm{Smooth}_{L1}(d) = \begin{cases} 0.5\, d^{2} & \text{if } |d| < 1 \\ |d| - 0.5 & \text{otherwise} \end{cases} \tag{2}$$

Here, $BB_{prediction}$ is the vectorized tensor of the predicted bounding box with a length of 4 (x, y, w, h) and $BB_{True}$ is the true bounding box. In [19], the L1 loss function has been implemented to limit the adverse effect of potential outliers in bounding boxes. However, due to the restricted pose labels in our datasets, we implemented an L2 loss for training the pose regressor as:

$$L_{pose} = \frac{1}{N} \sum_{i=1}^{N} \left(GT_{pose}^{i} - f(x)_{pose}^{i}\right)^{2} \tag{3}$$

where $GT_{pose}^{i}$ and $f(x)_{pose}^{i}$ are the real and predicted pose values of the $i$-th instance. To train the classifier to distinguish between face and non-face ROIs, the difference between the prediction and the GT is minimized by computing the softmax cross-entropy loss as:

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right] \tag{4}$$

In Eq. 4, $y_i \in \{0, 1\}$ is the real class of the $i$-th instance, and $x_i$ is the predicted probability that the region proposed by the RPN contains a face. The individual loss functions explained in this section are combined and jointly used for training the model. The training process and the various hyper parameters are described in Section IV.B.
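As an illustration of the three loss terms discussed in this subsection, the following pure-Python sketch evaluates them on plain lists of numbers. It is a sketch under stated assumptions rather than the released implementation: the function names are hypothetical, the pose and classification losses are averaged over the instances, and the usual Smooth L1 breakpoint of 1 is assumed.

```python
import math

def smooth_l1(d):
    """Quadratic near zero, linear for large residuals (breakpoint at 1, assumed)."""
    ad = abs(d)
    return 0.5 * d * d if ad < 1.0 else ad - 0.5

def bbox_loss(pred, true):
    """Smooth L1 loss summed over the four box coordinates (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, true))

def pose_l2_loss(pred, true):
    """Mean squared error between predicted and ground-truth pose values."""
    return sum((t - p) ** 2 for p, t in zip(pred, true)) / len(true)

def face_cross_entropy(probs, labels, eps=1e-12):
    """Binary cross-entropy; probs[i] is the predicted face probability of ROI i."""
    total = 0.0
    for x, y in zip(probs, labels):
        x = min(max(x, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(x) + (1 - y) * math.log(1.0 - x)
    return total / len(labels)

# Toy values, chosen only to exercise both branches of smooth_l1.
example_bbox = bbox_loss((0.0, 0.0, 10.0, 10.0), (0.5, 0.0, 12.0, 10.0))
example_pose = pose_l2_loss([10.0, -5.0], [12.0, -5.0])
example_cls = face_cross_entropy([0.5], [1])
```

The quadratic pose loss penalizes a large angular error much more strongly than the Smooth L1 term penalizes a comparable box error, which is the behavioural difference between the two choices.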

C. DATA AUGMENTATION
Due to factors such as clarity, brightness, resolution, and occlusion, images taken in controlled environments are fundamentally different from those taken in the wild. This discrepancy can be detrimental to the performance of a model trained on a controlled-environment dataset when it is used in real-world scenarios, due to the lack of generalization. On the other hand, if the training dataset covers a variety of possible imaging conditions (i.e., is well-diversified), the trained model will be well-generalized, and automatic translation invariance will be guaranteed [42]. However, the generation of such a diversified dataset is tedious and expensive. To bridge this gap, we applied a set of augmentation filters to the input images to enhance the dataset, both in quantity and quality, and to improve the resulting model's generalization capability. Figure 4 demonstrates the augmentation of an original image with contrast, blur, Gaussian noise, pixelation, fog, rain, and snow filters. These filters were chosen because varying weather conditions and camera vibration are among the most prevalent factors that affect the quality of images captured in the wild. Vertical and horizontal flipping of images is one of the most common augmentation practices in the computer vision domain. However, this technique does not apply to this study, because the datasets already contain the same angles on both sides of yaw; flipping the images would therefore add little to no variability to the dataset, due to its symmetric nature. Moreover, an important and easily overlooked consideration about flipping augmentation is that flipping the image also requires the GT to be changed accordingly, to account for the reversed angle. See Figure 5 for further clarification.
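The point about adjusting the GT when flipping can be made concrete with a short sketch. It assumes a sign convention in which yaw is measured symmetrically around the frontal pose, so that a horizontal mirror negates yaw and leaves pitch unchanged; the function name and the list-of-rows image representation are hypothetical and chosen only to keep the example dependency-free.

```python
def hflip_with_labels(image, box, yaw, pitch):
    """Mirror an image left-to-right and update its labels accordingly.

    image: list of rows, each row a list of pixel values
    box:   (x1, y1, x2, y2) in pixel coordinates
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    x1, y1, x2, y2 = box
    flipped_box = (width - x2, y1, width - x1, y2)
    # Assumed convention: yaw is signed around the frontal pose, so mirroring
    # reverses its sign; pitch is unaffected by a horizontal flip.
    return flipped, flipped_box, -yaw, pitch

image = [[1, 2, 3],
         [4, 5, 6]]
flipped, flipped_box, new_yaw, new_pitch = hflip_with_labels(
    image, (0, 0, 1, 2), yaw=30.0, pitch=-15.0)
```

Omitting the sign change would pair a mirrored face with its original yaw label, which silently corrupts the training targets.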

A. DATASETS
The public datasets Prima and BIWI were used for training and testing the proposed model. The Prima dataset contains images of 15 participants; each participant's images were taken in two different conditions (i.e., different clothes, different hairstyle, with or without glasses, etc.), with 93 images per participant in each condition. The images were taken at 13 different yaw angles (15° intervals) and 9 different pitch angles. The dataset contains close-up images of participants, with mostly gray backgrounds. The BIWI dataset, on the other hand, contains facial images from 20 participants (14 males, 6 females) with a head pose distribution of ±75° and ±60° in the yaw and pitch directions, respectively. Figure 6 shows some sample images of the two datasets. For training, we generated a jointly annotated dataset¹ for multitask learning. The generated datasets have been annotated in the COCO format [49], which provides the GT our model requires for training: the faces' bounding boxes, the heads' masks, the class label of the instance (face/non-face), and, most importantly, the yaw and the pitch.
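To make the joint annotation more tangible, the sketch below builds one COCO-style record in Python. The `images`, `annotations`, and `categories` sections and their standard keys follow the COCO object detection format; the `yaw` and `pitch` keys, the file name, and all numeric values are assumptions made for illustration and may differ from the annotation files actually used.

```python
import json

annotation_file = {
    "images": [
        {"id": 1, "file_name": "example_face.jpg", "width": 384, "height": 288}
    ],
    "categories": [{"id": 1, "name": "face"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120.0, 60.0, 96.0, 110.0],  # COCO convention: [x, y, width, height]
            "area": 96.0 * 110.0,
            "iscrowd": 0,
            # Polygon mask as a flat list of x, y pairs (here simply the box corners).
            "segmentation": [[120.0, 60.0, 216.0, 60.0, 216.0, 170.0, 120.0, 170.0]],
            # Hypothetical extra keys carrying the attribute labels in degrees.
            "yaw": 30.0,
            "pitch": -15.0,
        }
    ],
}

def pose_labels(coco):
    """Map each image id to its (yaw, pitch) label."""
    return {a["image_id"]: (a["yaw"], a["pitch"]) for a in coco["annotations"]}

# Round-trip through JSON to mimic writing and re-reading the annotation file.
restored = json.loads(json.dumps(annotation_file))
labels = pose_labels(restored)
```

Keeping the attribute values inside each annotation entry means that one file supplies every head of the network with its target for the same instance.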

B. TRAINING
In order to evaluate our method, we trained one individual model for each of the two training datasets (Prima and BIWI). We used 70% of each dataset to train the models, while the rest was equally split between the testing and validation sets (15% for test and 15% for validation). The final global loss function for convergence of the model is declared as below:

$$L_{global} = \lambda_{1} L_{bbox} + \lambda_{2} L_{cls} + \lambda_{3} L_{mask} + \lambda_{4} L_{pose} \tag{5}$$

where $L_{bbox}$, $L_{cls}$, $L_{mask}$, and $L_{pose}$ are the bounding box, classification, segmentation, and pose estimation losses of Section III.B, and $\lambda$ denotes the weight for each loss term. Throughout the training process, each batch of data contains the raw images, the GT for the bounding boxes, the segmentations, and the attribute values (i.e., yaw and pitch). The model is trained for 10 epochs with 1000 iterations per epoch. The learning rate is set to 0.001. For achieving a high processing speed, the image meta-size is set to 128 × 128. Table 1 summarizes the hyper parameter values and the training pipeline.
¹ www.ai-console.com
Given that not all of the regions proposed by the RPN contain a face, an NMS (non-max suppression) technique is implemented to eliminate negative (non-face) regions, as explained in Section III.A. The NMS technique computes the overlap between the ROIs and the GT bounding boxes, as measured by IOU, and removes the proposed regions with an overlap below the threshold. As with most object detection approaches, the threshold of IOU for NMS is 50%. When it comes to joint face detection and pose estimation, the 50% threshold might degrade the performance, since a proposed region with an overlap of just over 50% would be detected as positive even though a lot of information and face components might be lost in the remaining part of the face [42]. Various techniques have been proposed to overcome this problem. For instance, [16] used a Kinect depth sensor to locate the face area in the input images as a pre-processing step. In our approach, to avoid this issue, we set the NMS threshold to 80% to ensure that the proposed anchors cover the majority of the face's characteristics.
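The data split and the weighted combination of losses described above can be sketched as follows. This is an illustrative sketch only: the helper names, the fixed random seed, and the dictionary-based interface are assumptions, and the unit weights in the example are placeholders rather than the values used in training.

```python
import random

def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Shuffle and split into train / validation / test (the remainder is test)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def global_loss(losses, weights):
    """Weighted sum of the per-task losses; both arguments are dicts keyed by task."""
    return sum(weights[name] * value for name, value in losses.items())

train_set, val_set, test_set = split_dataset(range(100))
total = global_loss(
    {"bbox": 1.0, "cls": 2.0, "mask": 3.0, "pose": 4.0},
    {"bbox": 1.0, "cls": 1.0, "mask": 1.0, "pose": 1.0},  # placeholder weights
)
```

Fixing the seed keeps the three subsets reproducible across runs, which matters when two models trained on different datasets are later compared.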
The models were then trained with 70% of the datasets, while the rest was used for evaluation. Figure 7 shows some of the learned features from the different layers of the Resnet50 descriptor. For training the model, given the limited number of faces within each image and in order to keep the training time reasonable, we set the hyper parameters as follows: Post_NMS_ROIS_Training = 1000, Train_ROIS_Per_Image = 100, RPN_NMS_Threshold = 0.8, and ROI_Positive_Ratio = 0.33. This means that 100 of the 1000 ROIs generated by the RPN, with a score above 0.8 and with a ratio of ROIs⁺/ROIs⁻ = 0.33, are selected for training the headers. We reduced the number of ROIs (Train_ROIS_Per_Image) from 1000 (as suggested by [19]) to 100, since we already knew that there are very few faces per image in the current datasets. Practitioners might need to increase these values if they want to train the model for crowded images. As shown in Figure 8, the errors for bounding box detection and pose estimation, as well as the global loss, converged exponentially.
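The ROI selection described above can be illustrated with a small sampling routine. It follows the reading given in the text, in which 0.33 is the ratio of positive to negative ROIs; the function name, the tuple representation of an ROI, and the fallback of filling the batch with negatives when positives are scarce are assumptions of this sketch.

```python
import random

def sample_training_rois(rois, n_total=100, pos_neg_ratio=0.33, seed=0):
    """Pick n_total ROIs with roughly pos_neg_ratio positives per negative.

    rois: list of (roi_id, is_positive) tuples
    """
    rng = random.Random(seed)
    positives = [r for r in rois if r[1]]
    negatives = [r for r in rois if not r[1]]
    # positives / negatives = ratio  =>  positives = total * ratio / (1 + ratio)
    wanted_pos = round(n_total * pos_neg_ratio / (1.0 + pos_neg_ratio))
    n_pos = min(len(positives), wanted_pos)
    # Assumed fallback: if positives are scarce, the batch is filled with negatives.
    n_neg = min(len(negatives), n_total - n_pos)
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

# 1000 proposals, of which the first 200 are marked as positive (toy data).
proposals = [(i, i < 200) for i in range(1000)]
batch = sample_training_rois(proposals)
n_positive = sum(1 for _, is_pos in batch if is_pos)
```

With 100 ROIs per image this yields 25 positives and 75 negatives; for crowded images both the batch size and the pool of proposals would need to grow.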
The performance of the trained model in terms of head pose estimation is presented in the next subsection.

C. RESULTS
The performance of our proposed model is reported in this section. The performance is evaluated by the MAE (Mean Absolute Error) metric as:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{p}_{i} - p_{i} \right| \tag{6}$$

where $N$ is the number of images, and $\hat{p}_{i}$ and $p_{i}$ represent the GT and the predicted pose, respectively. Given the importance of real-time inference for such algorithms in real-world scenarios, and the fact that there is only one face per image, we set the detection parameters to Detection_Max_Instances = 5 and Post_NMS_ROIS_Inference = 5 in order to reduce the inference time. In this case, the model selects the 5 ROIs (generated by the RPN) with the highest confidence scores for the detection of up to 5 instances. Post_NMS_ROIS_Inference has been set to 5 because the lower this number, the higher the speed. For detection, Min_Detection_Confidence was also set to 70%, meaning that any instances detected by the model with a confidence score above 70% were considered positive.
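A minimal sketch of the metric is given below. It assumes that the "±" figures reported alongside the MAE are the standard deviation of the per-image absolute errors, which the text does not state explicitly; the function name and the toy angles are hypothetical.

```python
from statistics import mean, pstdev

def mae_with_std(gt, pred):
    """Return (mean, population standard deviation) of the absolute errors."""
    errors = [abs(g - p) for g, p in zip(gt, pred)]
    return mean(errors), pstdev(errors)

# Toy yaw angles in degrees, for illustration only.
gt_yaw = [0.0, 15.0, 30.0, -15.0]
pred_yaw = [2.0, 15.0, 26.0, -17.0]
mae, std = mae_with_std(gt_yaw, pred_yaw)
```

Reporting the spread together with the mean distinguishes a model with uniformly moderate errors from one that is usually accurate but occasionally far off.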
The results of our models' performance on both datasets are shown in Table 2. Given the wide range of the standard deviation, we plot the distribution of detections on the yaw and pitch axes for both datasets in Figure 9, where the blue dots represent the GT and the red ones show the values predicted by the algorithm. Due to the smaller intervals between the labels of the BIWI dataset, its plot is more scattered. According to Figure 9, apart from some outliers, the model appears to perform well on both datasets. Comparing our model's performance with the state-of-the-art algorithms, as shown in Table 2, revealed that the proposed model can estimate the pose on par with the current state-of-the-art models. Figure 1 demonstrates some successful cases of head detection and pose estimation from the Prima and BIWI datasets. We deployed our trained model on two machines, 1) an NVidia Xavier development board (for mobile robot applications) and 2) an NVidia GT2070 GPU, achieving ∼4.5 and ∼16 FPS (frames per second), respectively. The model was also tested on its effectiveness in detecting faces. Performance was measured as the F1-score at the minimum detection confidence of 70%, as defined below:

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{7}$$

where TP (true positive), FP (false positive), and FN (false negative) represent the number of correctly detected faces, the number of mistakenly detected faces, and the number of missed faces, respectively. Not surprisingly, the model achieved high F1-scores of 98.7% and 97.2% for the Prima and BIWI datasets, respectively. One potential explanation for the high F1-scores is the similarity between the visual characteristics of the images in the datasets, which enables the models to detect most of the faces with only a small number of missed or mistakenly detected faces. Another factor contributing to the high accuracy is that the images in the two datasets were taken in a controlled way, with a relatively clean background.
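The face detection score can be computed as in the sketch below. The F1 formula is the standard one; the way detections are represented, and the simplification that each ground-truth face is matched by at most one detection, are assumptions made to keep the example short.

```python
def count_detections(detections, n_gt_faces, min_conf=0.7):
    """Return (TP, FP, FN) after discarding low-confidence detections.

    detections: list of (confidence, matches_a_gt_face) tuples; each GT face is
    assumed to be matched by at most one detection.
    """
    kept = [d for d in detections if d[0] > min_conf]
    tp = sum(1 for _, hit in kept if hit)
    fp = len(kept) - tp
    fn = n_gt_faces - tp
    return tp, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of the raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: two faces in total, one found, one spurious box, one weak hit dropped.
tp, fp, fn = count_detections([(0.95, True), (0.80, False), (0.60, True)], n_gt_faces=2)
score = f1_score(tp, fp, fn)
```

Raising the confidence threshold trades false positives for missed faces, so the F1-score should always be quoted together with the threshold at which it was measured.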

D. GENERALIZATION TEST
CNNs have shown promising results in extracting meaningful features for a wide range of facial attribute estimations, including gender, age, or hairstyle [52]. Therefore, we believe that our proposed model can also be applied to other domains, as its backbone is the standard Resnet-50 feature descriptor. To investigate the generalization capability of the proposed model, we trained it on the public dataset UTKFace for age estimation. This dataset contains ∼20,000 facial images of people in the range of 0 to 116 years old, with wide variation in terms of illumination, pose, facial expression, etc. We used a randomly sampled portion of the dataset (∼30%) for face detection, segmentation, and age estimation. As with the head pose estimation model, we jointly annotated the dataset, so that the final annotation file contains the bounding boxes, masks, class labels, and the corresponding ages. Both output nodes of the attribute estimation header, which were initially developed for yaw and pitch estimation in the HPE problem (see Section III), were now assigned to age prediction. Figure 10 shows some example images of the dataset that are used both in training and in testing. As with the pose estimation, 70% of the annotated images were used for training the model (see Section IV.B) with the same parameters. We then evaluated the performance of the trained model as measured by the MAE (Mean Absolute Error). As shown in Table 3, the model achieved a MAE of 5.3 ± 3.2 on the evaluation dataset. Comparing the results with the state of the art revealed that our model performs comparably, although it did not achieve the best result. Figure 11 presents some successful examples of age estimation via our model. Table 3 shows that [53] is the only model that performs better than our proposed model; however, its requirement to manually crop the face before feeding the image to the network can act as a deterrent to its practical applicability.

V. LIMITATIONS
We have developed a novel multitask cross-domain object detection model and tested its ability to detect faces and estimate facial attributes, including head pose or age. Although the proposed model has shown promising results and good generalization capability, there are still ways in which it can be improved. Practical deployment of the HPE models on an NVidia Xavier development board, tested on snapshots of a webcam stream, revealed that the HPE model is very sensitive to several factors, such as the distance between the camera and the face of the user, as well as the background of the snapshots. It is fair to assume that this is a matter of the training datasets' limited diversity, and we believe that a collective effort is needed to generate a richer dataset that enables training a well-generalized model suited for real-world applications. In addition, we recommend that future systems implement GANs [57] to generate cost-effective, diversified synthetic images in order to train a well-generalized HPE model for real-world applications.
Apart from the limitation discussed above, the test of the model on the NVidia Xavier also showed some outlier estimations, which can be problematic if the system is intended to be used in sensitive real-world scenarios like head-controlled EPWs [1], [2]. These outliers can be dampened by techniques such as a moving average or a Kalman filter; however, such techniques can adversely affect the FPS of the system. The FPS of ∼4.5 that was achieved on this specific platform may not be fast enough for some real-world applications. Therefore, optimizing the model speed needs to be a point of focus and further exploration in future studies. Using a shallower feature descriptor than Resnet50, or optimizing the headers by reducing their size and complexity, are some potential solutions that can be explored in forthcoming studies. Furthermore, while determining the yaw and pitch may be adequate for some applications like head-controlled EPWs, roll estimation may also be required in some circumstances, and thus needs to be taken into account in the relevant implementations.
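As a hedged illustration of the moving-average option mentioned above, the following sketch smooths a stream of per-frame yaw estimates. The class name, the window length of 3, and the toy values are assumptions; this filter was not part of the evaluated system.

```python
from collections import deque

class MovingAverage:
    """Running mean over the most recent `window` pose estimates."""

    def __init__(self, window=5):
        self.buffer = deque(maxlen=window)  # old samples drop out automatically

    def update(self, value):
        self.buffer.append(value)
        return sum(self.buffer) / len(self.buffer)

# A single outlier of 40 degrees in an otherwise steady 10 degree yaw stream.
smoother = MovingAverage(window=3)
smoothed = [smoother.update(v) for v in [10.0, 10.0, 40.0, 10.0, 10.0]]
```

The outlier is reduced from 40° to 20° but lingers for as many frames as the window is long, so a longer window gives stronger damping at the cost of a slower response to genuine head movements.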

VI. DISCUSSION AND CONCLUSION
Inspired by the wide application of HPE models, we have presented a cross-domain multitask learning (MTL) model for object (head) detection, segmentation, and attribute estimation (pose estimation). Our model is developed on top of the state-of-the-art MRCNN [19] object detection model, in which a Resnet50 feature descriptor is used for the extraction of high-level features. After extraction, the features are converted into two fixed-size feature maps (sizes 7 × 7 and 28 × 28), which are then passed to the classifier/regressors for head detection, bounding box estimation, and pose estimation. The performance of our proposed model has been evaluated on two public datasets, BIWI and Prima, for pose estimation. Our model achieved a MAE of 6.2 ± 4.7 and 6.6 ± 4.9 for the yaw and pitch on the BIWI dataset, and 8.0 ± 8.6 and 8.2 ± 8.1 on the Prima dataset (see Table 2). Compared to the state-of-the-art models for HPE, our model appears to have an equally strong performance, or one that is just marginally lower in a few cases. Moreover, our model's smaller standard deviation demonstrates better consistency (i.e., less uncertainty) in terms of estimation (see Table 2). We also evaluated the generalization capability of our model by testing it on a problem from a different domain, age estimation. For this evaluation, our model was trained and tested on the public dataset UTKFace for head detection and age estimation, where we achieved a MAE of 5.3 ± 3.2.
The proposed multitask learning model parallelizes the processes of object detection (i.e., head detection) and attribute estimation (pose, age), which eliminates the need for manual cropping of the images or for access to expensive equipment like depth camera sensors (e.g., Kinect). The proposed model shows promising results and the potential to be used in various domains, while it maintains an advantage over the problem-specific state-of-the-art models by merging a two-stage process into a single one.