MR-CapsNet: A Deep Learning Algorithm for Image-Based Head Pose Estimation on CapsNet

Head pose estimation from a single image is challenging because of complex background conditions and the characteristics of the human face. In this report, we propose a Multi-stage Regression Capsule Network (MR-CapsNet) to predict head pose from a single input image. We use a residual attention block and a squeeze-and-excitation block to extract features at three levels. CapsNet overcomes the shortcomings of the traditional convolutional neural network and implements module aggregation to describe the spatial relationship of features after aggregation, while a multi-stage regression scheme keeps the model compact and robust. We tested our method on the AFLW2000 and BIWI datasets, obtaining mean absolute errors of 4.26° and 3.95°, respectively. In addition, we discuss the accuracy of our method in the case of eye or mouth occlusion. The results of comprehensive experiments show that our method can accurately predict head pose.


I. INTRODUCTION
The development of a variety of perceptual devices has served as the basis for recent advancements in personalized entertainment. Head pose estimation is an essential part of human-computer interaction, as it provides information on the direction of human attention. The prediction of head pose from a single image is still a challenging task. The head pose can be represented by a three-dimensional vector that comprises the pitch, roll, and yaw angles [1]. To extract head pose information from images, it is necessary to determine the feature mapping between two- and three-dimensional space. The head pose estimation task involves inferring the head pose direction from images acquired using a camera. In a driving system, it is possible to ascertain the driver's attention and consciousness based on position information [2]. Head pose information is also important for human-computer interaction [3]. The system can interact with the user's head monitoring software [4] [5] to estimate the level of interest.
The process of head pose estimation and related tasks is often associated with many challenges, such as imaging problems due to the camera system, complex backgrounds, blurred targets caused by different light sources, and target occlusion problems [6]. In real space, the use of human vision to obtain information often results in significant challenges. Human vision is often too fuzzy to perceive distant objects, and in the case of dim or poor lighting, head pose estimation may result in failure. Therefore, in the field of computer vision, the face alignment method is used in many face detection algorithms [7] [8], which places the target information in the same semantic domain as that of the simplified object being detected. In previous studies, the influence of the background on the target was effectively eliminated using facial localization and image clipping, and the impact of noise caused by an image-independent target was reduced [9].
The prevailing workflow is based on deep learning, especially convolutional neural networks (CNNs) [10]. These networks have a wide range of learning capabilities, but they also have some key shortcomings. For example, the lack of local equivariant features leads to weak generalization, so additional parameters are needed to construct a deep network, in which the positional relationship between local and global features is no longer well maintained [11] and robustness is low. To overcome the shortcomings of CNNs, CapsNet was proposed by Sabour et al. [12]. Each capsule in CapsNet is a group of neurons that can represent the instantiation parameters of a target and its probability of existence. There has been significant interest in the use of CapsNet in different application fields and in the development of different variants. CapsNet has a particularly important feature: a unique "routing" process that can effectively handle transformations. A child capsule is routed to a parent capsule only when its prediction is consistent with that parent. Recently, a technique was incorporated into CapsNet to enhance its robustness to transformation. CapsNet is highly sensitive to the image background, which contributes to the accuracy of head pose estimation and classification, because detailed information on the position and pose of the object is retained. This in turn is useful for learning relations, determining the exact position of the extracted features, and representing the object as a part-whole hierarchy [13].
The classic head pose estimation methods include machine learning [14][15], appearance template [16][17], geometric model-based [18][19], depth image-based [20], and landmark-based methods. To estimate the head pose from an image, it is necessary to perform a mapping from two- to three-dimensional space. Compared to traditional RGB images, depth images supply the 3D information missing from 2D images and provide additional cues for estimating head pose. At present, however, depth cameras are not widely available and can only be used in certain fixed places. Moreover, the computational burden and memory required are too large for small servers. Among the landmark-based methods, Adrian [21] proposed converting 2D landmark annotations into 3D to enlarge and unify the existing datasets. By studying how different factors affect face alignment, the trained neural network model achieved excellent accuracy. Other methods include the component-based discriminative method proposed by Lin [22], which uses a discriminative search algorithm to identify the shape of the face from its components. The classifier detects the facial components within the face configuration to effectively improve the accuracy and efficiency of face detection. These methods first locate the facial landmarks and then use them to predict the head pose. Among the model-based methods, Martins [23] proposed a framework to automatically estimate the pose of the human head in a single-view image. This method uses a 3D rigid model as an approximation of the human head, combined with an active appearance model. With respect to facial feature extraction and tracking, Krinidis [24] proposed a method to estimate head pose in a single-view video sequence.
First, a face detector is used to detect the face; then a deformable surface model approximates the facial image intensity for tracking; and finally, a feature vector is used to estimate the head pose. Such estimation methods use key points of the face to construct three-dimensional head models and then obtain the result by training the appearance model. FSA-Net [25] uses a hierarchical coarse-to-fine classification strategy with a soft stagewise regression scheme: intermediate features are extracted, aggregated, and then regressed to predict the final head pose. In the deep learning method of [26], a CNN-based model is constructed to estimate the pose of the human head from low-resolution multi-modal RGB-D data. Kumar et al. [27] proposed a method to correlate the trajectory of key points with the trajectory of the head pose, which changes the prediction results in accordance with the transformation of landmarks. Yang [28] proposed RS-CapsNet, an advanced capsule network that improves on the original architecture and addresses its weak feature extraction ability and large number of training parameters, achieving good performance in image classification. Xia et al. [29] proposed a landmark-assisted pose estimation method, in which landmark-based face images are combined with channel-level grayscale images for head pose prediction [30]. Ranjan et al. [31] regularized the shared parameters of the CNN and established a synergy between different fields and tasks such as smile detection, age estimation, and face recognition. Gu et al. [32] proposed a face feature tracking algorithm based on an RNN. HyperFace [33] uses a CNN to learn common features in the middle layers, which are then fed into a multitask learning network for face detection, head pose estimation, and gender recognition.
FacePoseNet [34] uses a CNN to perform 3D head pose regression, based on camera positioning, as auxiliary information for target recognition to improve precision. HopeNet [8] calculates the yaw, pitch, and roll angles by combining ResNet50 [35] with multiple loss functions. Zhao used multi-feature fusion for head pose estimation. Wu used HOG and pyramid settings to describe the local gradient features and global shape features of the face image, which facilitates head pose estimation under local occlusion [36]. Abate [37] proposed the Web-shaped Model algorithm to encode the pose of the face and then applied regression to predict the face pose. This method improves the sensitivity and prediction accuracy of head pose estimation. Recent studies have shown that multitask learning [38] can achieve better results than single-task learning.
Hence, the main contributions of this paper are: 1. We propose a head pose prediction model based on a Multi-stage Regression CapsNet (MR-CapsNet). We build a detection model based on feature extraction, feature aggregation, and multi-stage regression. The model can obtain multi-stage feature information. The probability vectors of the features at different stages are then dynamically combined to improve the accuracy of head pose estimation.
2. We create an accurate feature extraction network, which uses an efficient attention mechanism that combines the residual attention block [38] and the squeeze-and-excitation (SE) block [39]. The network not only enhances the ability to extract feature information, but also highlights useful features while suppressing useless ones. This structure can better explain the spatial relationship of target features and estimates the head pose more accurately.
3. We are the first to apply the capsule neural network to the head pose estimation task. We apply the capsule structure during the feature aggregation stage of head pose estimation, construct intermediate capsules using the "vertical and horizontal sliding windows" method to select feature information, and finally use a linear combination between capsules to enhance their representational ability. Compared with a traditional CNN, our method can better discern the spatial relationship of features and improve the prediction accuracy for partially occluded faces.
The structure of this paper is as follows: the second section introduces the theoretical basis of the model in our algorithm, the third section provides the training details and experimental results, and the fourth section presents the conclusion.

II. METHOD
The flowchart in Figure 1 illustrates our head pose estimation algorithm based on the capsule neural network and multi-stage regression. The algorithm can be divided into three parts: (1) The feature extraction network performs the main feature extraction; (2) Capsule network performs feature aggregation on feature information; (3) Multi-stage regression obtains the probability vector of each stage.
First, we preprocess the input image to detect the head region. Then we feed the detected region into the feature extraction network. In this network, we divide the feature extraction into three stages. Each stage is processed by a residual attention block and an SE block to improve feature processing, strengthen the feature weights of key information, and enhance facial feature extraction. The stages are connected so that feature extraction is enhanced layer by layer. Then, the feature maps obtained in these three stages are fed into the feature aggregation network. We construct the intermediate capsules through feature selection so that our capsule neural network is more sensitive to spatial information. The capsule neural network linearly combines the feature maps and passes them through a dynamic routing algorithm to obtain richer feature information, which enhances the network's ability to understand the extracted facial features and reduces the impact of missing facial feature information on the prediction results. Finally, we combine the feature maps of the three stages to perform multi-stage regression and obtain the required probability vectors, which improves our prediction accuracy.

A. FEATURE EXTRACTION
Our network is based on the network proposed by Song et al. [38], which is a compact model for age estimation from a single image. Our feature extraction network has three branches. Each branch consists of convolution, weight normalization, activation, three basic residual blocks, a pooling layer, and SE blocks. In addition, residual attention blocks are embedded into each stage. The residual attention block is composed of convolution, weight normalization, channel attention, spatial attention, and a fusion layer, similar to the structure depicted in Fig. 1. Different filter kernels and down-sampling methods are used for the residual unit. The feature maps with different kernel sizes are combined by element-wise multiplication of the two feature maps generated by channel attention. Then, the feature maps are fed into the aggregation stage, which focuses on modeling the head rotation. This is illustrated in Fig. 2.

[Figure panels: "Linear combination between capsules"; "Proposed routing process"]
where $R$ indicates the range of the head pose angle, and $p_i$ indicates the probability of the 3D vector $l_i$. In addition, to ensure the accuracy of the algorithm, we use the mean absolute error (MAE) as the evaluation standard to reduce the error between the predicted head pose angle and the ground truth label, where $\tilde{y}_n$ is the predicted pose for the training image $x_n$.

2) FACE DETECTION
In the unconstrained case, the human head may undergo large angle changes and appear at low resolution in a distant image; therefore, a relatively stable head detector is required. We chose MTCNN [40] as our detector, which can achieve real-time head detection at different scales and angles against a complex background. MTCNN combines face region detection with face key point detection in a cascaded framework. It can be divided into three stages, P-Net, R-Net, and O-Net, which together yield robust detection.

3) RESIDUAL ATTENTION BLOCK
The residual attention block is a type of attention unit that promotes facial feature extraction via transformations, where $F(\cdot)$ can be regarded as a standard convolution operation along the channel and spatial dimensions. For the channel dimension, we use multi-scale kernels and pooling operations to map features into distinguishable vectors, and the results are then fused by channel-wise multiplication. The spatial dimension is computed in the same way as the channel dimension.

4) SQUEEZE-AND-EXCITATION(SE) BLOCK
The SE block is a type of attention block that operates on feature map channels. Its core idea is to learn the feature weights according to the loss, increasing the weights of effective feature maps and reducing the weights of ineffective or less useful ones to achieve better results. It has been demonstrated that SE blocks can improve the performance of a network at minimal computational cost. The architecture of the SE block is shown in Fig. 4. An SE block can be attached to any transformation in the network that maps an input to a feature map.
Here, $X$ is the input and $U$ is the extracted feature.
To establish the dependence between channels, we squeeze the feature $U$ by aggregating each feature map across its spatial dimensions, where the subscript $c$ denotes the channel and $u_c$ is the two-dimensional matrix of channel $c$ in $U$.
Next, the aggregated information obtained from the squeeze operation is used to fully capture the dependencies along the channel dimension. To achieve this, we use an excitation operation in which $\sigma$ is the ReLU function and $W_1$ and $W_2$ are the weights of two fully connected layers. The second fully connected layer is followed by the sigmoid function. After these operations, the weights of the feature maps are obtained, and they are fused with the original features by channel-wise scaling. The function of the two fully connected layers is to fuse the feature map information of each channel. After the excitation operation, a set of channel weights $S'$ is generated, which represents the weight of each channel's feature map. The enhanced feature map is then obtained by multiplying $S'$ with the input feature map.
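The squeeze, excitation, and scaling steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch that follows the standard SE formulation of [39], not the authors' implementation; the function name `se_block`, the weight shapes, and the reduction to 4 hidden units are assumptions made for the example.

```python
import numpy as np

def se_block(u, w1, w2):
    """Illustrative SE block on one feature map u of shape (H, W, C).

    w1 has shape (C, C_reduced) and w2 has shape (C_reduced, C); both are
    hypothetical fully connected weights (biases omitted for brevity).
    """
    z = u.mean(axis=(0, 1))                    # squeeze: global average pool -> (C,)
    hidden = np.maximum(z @ w1, 0.0)           # first FC layer followed by ReLU
    s = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # second FC layer followed by sigmoid
    return u * s, s                            # scale each channel by its weight

rng = np.random.default_rng(0)
u = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1
w2 = rng.standard_normal((4, 16)) * 0.1
out, s = se_block(u, w1, w2)
```

Because the channel weights come out of a sigmoid, each lies strictly between 0 and 1, so the block can only attenuate channels relative to one another rather than amplify them without bound.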

B. FEATURE AGGREGATION
The role of the aggregation module is to aggregate a small number of representative features of the computed feature maps into local maps. For the aggregation module, we considered both CNN and CapsNet. A CNN is well suited to capturing the existence of features because its convolutional structure is designed for this purpose. However, a CNN is not optimal for exploring the relationships between feature attributes, because the exact positional information of the detected features is lost. As a result, a CNN often fails to identify an object under rotation or similar transformations. In head pose estimation, the human head often has a large rotation angle, so we adopt a CapsNet-based method to overcome this limitation. The CapsNet in this work was inspired by the RS-CapsNet architecture, which is designed for feature fusion, and we use it as our feature aggregation module. In addition, to reduce the amount of computation and the number of capsules, we use a 1×1 convolution layer to reduce the number of channels. We reshape all the resulting feature maps into capsules, use the linear relationship between capsules to fuse features, and halve the capsule to enhance its ability to express features. We obtain different types of capsules for different local feature maps, apply a dynamic routing algorithm to them, and construct capsules that can represent most of the objects. Each local feature map yields $N_3$ capsules, where each capsule has size $D_2$. Finally, the intermediate capsules constructed from the local feature maps and the original capsules obtained by feature mapping are used to obtain the classification capsules.

1) FEATURE SELECTION
We first divide the feature map generated by the last convolution operation on the input image into small local feature maps, which are then used to construct the "intermediate capsule." This capsule can represent most of the detected objects. The intermediate capsule and the original capsule obtained by feature mapping are used to obtain the classification capsule. Regarding the problem of how to slice the feature map, we recommend using vertical and horizontal sliding windows, as illustrated in Fig. 5. There are two reasons for selecting the vertical and horizontal sliding window method. First, for objects with horizontal, vertical, or other symmetrical structures, the "vertical and horizontal sliding window" method is more conducive to maintaining their integrity; second, we expect to use the maximum number of small local feature maps. Compared with the traditional sliding window method shown in Fig. 5, the "improved sliding window" method allows for more local feature maps.
Fig. 5. (a) Traditional sliding windows; (b) improved sliding windows.
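To make the slicing idea concrete, the following sketch cuts a feature map into vertical and horizontal strips. The exact window geometry of Fig. 5 is not recoverable from the text, so this example assumes that a vertical window spans the full height of the map and a horizontal window spans its full width; the function name `strip_windows` and the `size` and `stride` parameters are hypothetical.

```python
import numpy as np

def strip_windows(fmap, size, stride):
    """Slice a (H, W, C) feature map into vertical and horizontal strips.

    Assumption: vertical strips cover the full height and horizontal strips
    cover the full width, which keeps symmetric structures intact.
    """
    h, w = fmap.shape[:2]
    vertical = [fmap[:, x:x + size] for x in range(0, w - size + 1, stride)]
    horizontal = [fmap[y:y + size, :] for y in range(0, h - size + 1, stride)]
    return vertical, horizontal

fmap = np.arange(8 * 8 * 4).reshape(8, 8, 4)
vertical, horizontal = strip_windows(fmap, size=2, stride=2)
```

Under these assumptions an 8×8 map with a window size and stride of 2 yields four vertical and four horizontal local feature maps; a smaller stride would produce overlapping strips and therefore more local maps.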

2) LINEAR COMBINATION BETWEEN CAPSULES
To address the problem of the presence of redundant information in the background of the input image, we use a linear relationship in the capsule, and remodel the feature map into capsules such that each capsule represents the detection object in the input image. We then construct a connection between the capsules in the same position, and finally use the linear relationship of the capsule in the input image. The aforementioned linear combination method is utilized to flatten the capsules, maintain their length in [0,1], cause their direction to be constant, and provide a more nonlinear relationship for the entire network. Fig. 6 shows the linear combination method between capsules with the same pixel location.

3) DYNAMIC ROUTING ALGORITHM
In the capsule network, the length of the capsule represents the probability that the target is correctly detected. Dynamic routing based on EM uses the maximum likelihood estimator and clustering technology to group capsules into a part-whole relationship.
The coupling system of higher capsules is calculated by estimating their activation degree and their probability values. In network regularization training, routing is not combined with image reconstruction. By converting convolution and routing to a specific computing domain, the number of parameters can be significantly reduced to achieve better results.
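For readers unfamiliar with capsule routing, the sketch below shows the simpler routing-by-agreement procedure of [12], together with the squash nonlinearity that keeps capsule lengths in [0, 1) while preserving direction. It is not the EM-based routing discussed above and not the authors' implementation; the names `squash` and `route`, the tensor layout, and the three iterations are assumptions for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping its direction."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, iterations=3):
    """Routing-by-agreement over predictions u_hat of shape (n_in, n_out, dim)."""
    b = np.zeros(u_hat.shape[:2])                  # routing logits
    for _ in range(iterations):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)       # coupling: softmax over parents
        s = np.einsum("ij,ijd->jd", c, u_hat)      # weighted sum per parent capsule
        v = squash(s)                              # parent capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # reward agreeing predictions
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((6, 3, 4))
v, c = route(u_hat)
```

Each child capsule distributes its output across the parents with coupling coefficients that sum to 1, and the coefficients grow for parents whose output agrees with the child's prediction.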
Based on the results of the comparison, EM-based dynamic routing is more suitable when the image size changes.
In traditional age estimation, one year is usually used as the classification interval to improve the accuracy and simplicity of age classification. However, given the large number of network parameters and the need for a large amount of computing resources, training the network becomes both complex and time-consuming. To address this shortcoming while maintaining the accuracy of age prediction, the scale of the deep neural network is reduced to produce a more compact and effective network, which transforms the regression into a multi-stage process.
As shown in Fig. 7, the multi-stage regression module is structured as follows: the main branch is composed of a 1×1 convolution, a ReLU activation function, and a pooling layer, and three quantities are output through three branches. The first branch outputs θ directly through a fully connected layer and a tanh activation function. The second branch outputs p through a dropout layer, a fully connected layer, a tanh activation function, a fully connected layer, and a softmax function.
The third branch outputs β via a dropout layer, a fully connected layer, a tanh activation function, a fully connected layer, and a tanh activation function. We then output our predicted value $\tilde{y}$. The multi-stage regression formula can be applied to any regression problem. In this study, we apply multi-stage regression to head pose estimation. Unlike the age estimation problem, the pose estimation problem produces a vector instead of a scalar.
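As a rough illustration of how a multi-stage regression combines per-stage outputs into one prediction, the sketch below follows the soft stagewise regression formula of SSR-Net for a single scalar angle. The regression formula of this paper is not recoverable from the text, so the correspondence between the arguments and the outputs θ, p, and β is not claimed; the names `ssr_aggregate`, `probs`, `offsets`, `scales`, and `value_range` are hypothetical.

```python
import numpy as np

def ssr_aggregate(probs, offsets, scales, value_range):
    """Soft stagewise regression for one scalar output (SSR-Net style).

    probs[k]   : softmax distribution over the bins of stage k
    offsets[k] : per-bin shift of stage k (assumed to lie in [-1, 1])
    scales[k]  : scalar that adjusts the bin width of stage k
    """
    total, denom = 0.0, 1.0
    for p, eta, delta in zip(probs, offsets, scales):
        bins = np.arange(len(p))
        denom *= len(p) * (1.0 + delta)        # finer bin width at each stage
        total += np.sum(p * (bins + eta)) / denom
    return value_range * total

# Two stages with three bins each, no shift, no scaling.
probs = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
offsets = [np.zeros(3), np.zeros(3)]
scales = [0.0, 0.0]
angle = ssr_aggregate(probs, offsets, scales, value_range=90.0)
```

The first stage selects a coarse bin and each later stage refines the estimate within it, which is why a few small stages can replace one classifier with many fine bins. For head pose, the same aggregation would be applied to each of the three angles.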

Ⅲ. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we describe our experimental process in detail, as shown in Fig. 8. The experiment is divided into four parts. In the first part, we introduce the evaluation criterion for the experiment. In the second part, we describe the basic experimental settings. In the third part, we provide the details of the experimental training. Finally, we present the results of head pose prediction using different assessment schemes.

[Fig. 8. Experimental process: experimental criterion, experimental setting, training details, experimental results and analysis]

A. EXPERIMENTAL CRITERION
In the following experiments, we evaluate the experimental results using the MAE, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{l}_i - l_i\right|$, where $\hat{l}_i$ and $l_i$ are the ground truth label and the final predicted value of the yaw, pitch, and roll angles of the $i$-th image, respectively, and $N$ is the total number of images in the test set.
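A minimal sketch of this criterion is given below. It computes the MAE separately for yaw, pitch, and roll and then averages the three values; the function name `pose_mae` and the small example arrays are made up for illustration and are not results from the paper.

```python
import numpy as np

def pose_mae(pred, truth):
    """Per-angle and overall MAE for (N, 3) arrays of yaw, pitch, roll in degrees."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float))
    per_angle = diff.mean(axis=0)              # average over the N test images
    return per_angle, float(per_angle.mean())  # mean of yaw, pitch, roll errors

# Hypothetical predictions and labels for two images.
pred = np.array([[10.0, 5.0, -2.0], [-20.0, 0.0, 4.0]])
truth = np.array([[12.0, 4.0, -2.0], [-26.0, 3.0, 2.0]])
per_angle, overall = pose_mae(pred, truth)
```

The sketch treats angles as plain real numbers and does not handle wrap-around at ±180°, which is acceptable only when the evaluated poses stay well inside that range.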

1) CAPSNET SETTING
Let $N_3$ be the number of intermediate capsules generated by each feature map. This value cannot be too large, because a capsule, unlike a CNN feature, is generated by the routing process, already represents the characteristics of the target object well, and does not contain much superfluous information. Nor can the value be too small, because the classification capsule is obtained from the weights of the child capsules, and too few of them cannot achieve a good effect. Therefore, we set $N_3$ to 16. Because the original capsule of size $D_2$ is constructed directly from the feature map, it carries significant redundant information, and the routed intermediate capsule represents the target better. Therefore, we set the size $D_2$ of the original capsule to 8.
We trained the IN-THE-WILD model using the 300W-LP dataset. 300W-LP is a simulation dataset based on the 300W dataset and the 3DMM dataset. A 3D model is constructed using a 2D image to simulate head pose estimation. The model is then gradually flipped to further enhance the effect of the dataset. The dataset contains large-angle images, and 122450 flipped images are expanded on this basis. It is a good dataset for training head pose estimation models.
AFLW2000 is a challenging dataset, which consists of a large-scale face database with multiple poses and multiple angles. It provides real 3D facial pose angle landmarks for the first 2000 images of the AFLW dataset, including pose changes of different people under different scenes and lighting conditions. We use the AFLW2000 dataset to test the model, which verifies the generalization ability of the model.

3) LABORATORY MODEL SETTING
We trained the laboratory model using the BIWI dataset. The BIWI dataset was created using Kinect sensors. It consists of 24 sequences of 20 different people with a total of about 15.6K frames, captured with RGB-D cameras, and includes 1000 high-quality 3D face pose data samples; the head pose range covers approximately ±75° yaw and ±60° pitch. Each frame provides not only an RGB image but also a depth image and annotations. Unlike the other two datasets, which were collected in the wild, all BIWI images were taken indoors, so the dataset can verify the detection ability of the model in an indoor environment.

4) EXPERIMENTAL PLATFORM
In this work, all experiments were conducted on a platform with the Windows 10 operating system, an NVIDIA GeForce RTX 2060 with 8 GB of graphics memory, and an Intel Core i7-4790K with 16 GB of memory. The software platform is Python 3.7.3, based on the Keras and TensorFlow frameworks.

C. TRAINING DETAILS
We used Adam [42] as the optimizer, with an initial learning rate of 0.001. The learning rate was multiplied by 0.1 every 30 epochs. To improve robustness to blurred and zoomed images, random cropping and random scaling were applied to the training images to augment the training data. The 3D rotation about the Z-axis in the X-Y-Z convention is consistent with the 2D rotation, thus the rotation of the head along the X-Y-Z axes was fixed. Therefore, to establish a better relationship between image and head pose, we converted the Euler angles from Z-Y-X to X-Y-Z to reduce the average prediction error.
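The learning rate schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from 0 and that the decay is applied in discrete steps; the function name `step_decay` and its parameter names are hypothetical.

```python
def step_decay(epoch, initial_lr=0.001, factor=0.1, step=30):
    """Return the learning rate for a 0-based epoch index.

    The rate starts at initial_lr and is multiplied by factor once every
    `step` epochs, matching the schedule described in the text.
    """
    return initial_lr * factor ** (epoch // step)

# Learning rates at a few selected epochs.
schedule = {epoch: step_decay(epoch) for epoch in (0, 29, 30, 60)}
```

In a Keras training loop, a function of this form could be passed to the `LearningRateScheduler` callback, which calls it with the epoch index at the start of each epoch.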
For the IN-THE-WILD model, training was performed on the 300W-LP dataset, whereas the AFLW2000 and BIWI datasets were used for testing. When using the BIWI dataset for evaluation, we only considered images with rotation angles in [−99°, 99°]. The batch size for training and testing was 16. For the laboratory model, 70% of the BIWI dataset was used for training, and the rest was used for testing. The training and test batch sizes were set to 8.

1) EVALUATION OF THE IN-THE-WILD MODEL
The IN-THE-WILD model was trained using the 300W-LP dataset. Tables 1 and 2 compare our method with the latest methods on the AFLW2000 and BIWI datasets, using MAE as the evaluation standard. Our method achieved excellent results compared to other advanced approaches. HopeNet [8] uses ResNet50 to predict yaw, roll, and pitch separately, and uses MAE and cross-entropy to estimate the fine-grained head pose. FSA-Net [25] builds on SSR-Net and uses an attention module to aggregate features for soft stagewise regression. 3DDFA [30] uses a CNN on RGB images to estimate shape-related parameters and fits a dense 3D model to the head, which facilitates detection even under occlusion. FAN [21] is a landmark detection method that solves 2D-3D problems by merging landmark features across multiple layers. Figure 9 displays the ground truth, the results of HopeNet [8], the results of FSA-Net [25], and our results. The blue line indicates the direction the subject is facing; the green line indicates the downward direction; and the red line represents the side. The performance of landmark-based methods depends on the underlying face alignment algorithm, whereas our method does not rely on such auxiliary information.
For further analysis, we evaluated two additional variants of our algorithm (without the CapsNet block and without the SE block); the CapsNet block is the part responsible for feature fusion, and the SE block is the attention mechanism added to feature extraction. All results were computed according to the MAE standard on the IN-THE-WILD model.
As shown in Table 1, our method performs best when tested on the AFLW2000 dataset, achieving the lowest errors on yaw, pitch, and roll, with a mean absolute error of 4.26°. Compared with the other methods, the error of our method is noticeably lower, so it achieves the best detection performance among the compared methods. As shown in Table 2, our method also performs best when tested on the BIWI dataset, achieving the lowest errors on yaw, pitch, and roll, with a mean absolute error of 3.95°. This confirms that our method achieves the best detection performance among the compared methods on both datasets.

2) EVALUATION OF THE LABORATORY MODEL
The laboratory model was trained using 70% of the BIWI dataset. This dataset contains a variety of model assessment information. In addition to RGB color information, depth image information and time information can also be used. As shown in Table 3, Martin et al. [19] estimated head posture using a depth camera to obtain a depth image. Drouard et al. [14] used the hybrid method of linear regression to acquire a high-dimensional feature vector to determine the head posture. The table also records the time spent by each method to test the image.
As shown in Table 3, the performance of the proposed method on the BIWI dataset was relatively good. The deviation angle on yaw is 2.64°, which is the lowest value among all methods, and the average deviation angle is slightly higher than that of HopeNet [8] with 3.31 degrees. However, the proposed method has the shortest run time, 0.53 ms, and therefore the highest test efficiency. The experimental results thus demonstrate that this method has certain advantages with respect to detection error and test time in the indoor environment.

3) EVALUATION IN THE PARTIALLY OCCLUDED CASE
To verify the performance of the method when the head is partially covered, we tested the accuracy of our algorithm under different occlusion conditions. As shown in Fig. 10, we divided the facial region into two areas: eyes and mouth. The two regions were then occluded separately to calculate the accuracy for the non-occluded face area. The occlusion rate of the entire face, from top to bottom as well as in the opposite direction, was 0%, 12.5%, 25%, 37.5%, and 50%, corresponding to the two important feature intervals of the eye area and the mouth area, respectively. Table 4 displays the relative accuracy rate for head poses with occlusion of the eye area. When the occlusion rate reaches 50%, the eye area is blocked and the mouth area is visible; the relative accuracy rate is 82.97%. Table 5 displays the relative accuracy rate for head poses with occlusion of the mouth area. When the occlusion rate reaches 50%, the mouth area is blocked and the eye area is visible; the relative accuracy rate is 89.81%. Table 6 displays the relative accuracy rate of each algorithm with occlusion of the eye area. The results show that the proposed algorithm performs best when the eye area is occluded. Table 7 displays the relative accuracy rate of each algorithm with occlusion of the mouth area. The results show that the proposed algorithm performs best when the mouth area is occluded.
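The occlusion protocol can be illustrated with a small masking helper. This is a sketch that assumes the occluder is a solid block covering a fraction of the face crop's rows from the top (eye area) or from the bottom (mouth area); the function name `occlude`, the fill value, and the 64×64 placeholder image are assumptions, since the text does not specify how the occlusion was rendered.

```python
import numpy as np

def occlude(face, rate, from_top=True, fill=0):
    """Return a copy of a (H, W, C) face crop with a fraction of rows masked."""
    out = face.copy()
    rows = int(round(face.shape[0] * rate))
    if rows > 0:                    # guard: out[-0:] would select every row
        if from_top:
            out[:rows] = fill       # covers the eye area first
        else:
            out[-rows:] = fill      # covers the mouth area first
    return out

face = np.ones((64, 64, 3), dtype=np.uint8)   # placeholder face crop
rates = [0.0, 0.125, 0.25, 0.375, 0.5]
eye_occluded = [occlude(face, r, from_top=True) for r in rates]
mouth_occluded = [occlude(face, r, from_top=False) for r in rates]
```

Each occluded image would then be passed through the trained model, and the error on the occluded set compared with the error on the unoccluded set to obtain a relative accuracy.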
Compared to the eye region, the mouth region contributes less to head pose estimation. This suggests that our method can cope with the problems associated with wearing masks or head coverings. In the occlusion experiments, the proposed method has the highest accuracy rate among the compared algorithms, which demonstrates the advantage of our model under occlusion.

IV. CONCLUSION AND FUTURE WORK
In this study, we developed a deep neural network model, MR-CapsNet, to predict head pose. Our method can infer the head pose from a single image without additional information such as a depth map or facial landmarks. First, MTCNN [40] was used to detect the target, and features were then extracted at three levels using a residual attention block and an SE block. CapsNet is an emerging network that is more sensitive to pose information, as reflected in facial expressions, than a traditional CNN. Therefore, we combined the extracted feature maps with CapsNet to obtain more accurate pose information. Finally, a multi-stage regression function was used to predict the head pose. The MAE of our model is lower than that of other advanced methods.
In the future, we will continue to improve our model. At present, the detection ability in outdoor environments is not ideal. To further improve the pertinence and accuracy of prediction, additional low-resolution datasets need to be integrated. Currently, capsules are emerging; however, there are no relevant application examples of CapsNet in the field of head posture estimation, which requires further attention. In this study, although only the estimation of head posture was considered, the overall framework is widely applicable.

V. AUTHOR CONTRIBUTIONS
Hao Fang conceived the algorithms; Jun-Qing Liu designed the experiments; Kai Xie reviewed the paper; Chang Wen conducted the comparative experiment; Peng Wu and Xin-Yu Zhang were responsible for data collection; Jian-Biao He checked the spelling and made suggestions.