Animal Pose Estimation Algorithm Based on the Lightweight Stacked Hourglass Network

Pose estimation has been a hot topic in machine vision in recent years. Animals are widespread in nature, and the analysis of their shape and movement is important in many fields and industries. To improve detection accuracy in pose estimation tasks, existing models often consume large amounts of computing and memory resources. A key problem for pose estimation methods is therefore to build a lightweight model and reduce computational overhead while preserving accuracy. In this paper, we focus on the structure of the convolutional neural network in animal pose estimation, construct a lightweight and efficient stacked hourglass network model designed to balance model computation and accuracy, and design an application algorithm based on it. To address the large number of parameters in deep convolutional neural networks, a lightweight residual module, ICCW-Bottle, is proposed, based on a conditional channel-weighted method improved with lightweight efficient channel attention; it reduces the weight of the network while capturing feature information at different scales. To address the loss of a large amount of feature information after pooling operations, a lightweight dual-branch fusion module is proposed that fully integrates high-level semantic information and low-level detail features with only a small number of parameters. Finally, as in the CC-SSL method, the model is trained jointly on synthetic and real animal datasets; however, CC-SSL does not consider the computational cost of the model and consumes a large amount of time and memory. Experiments show that, compared with CC-SSL, the PCK@0.05 of our method is 5.5% higher on the TigDog dataset.
The model in this paper reduces the number of parameters and calculations of the network while ensuring less information loss and model accuracy. The ablation experiment verifies the advancement and effectiveness of the overall network.


I. INTRODUCTION
In recent years, human activities have caused serious damage to the natural ecological environment, directly leading to the extinction of a large number of species. Ex situ conservation is an important part of biodiversity conservation and an important supplement to in situ conservation. Grasping the behavior information of ex situ conservation animals can better evaluate the physical and mental health and animal welfare of captive animals. In the study of animal health and
welfare, quantitative analysis of behavior plays an important role. Continuous monitoring of animals and pose estimation of the monitored behavior can accurately reflect how the dynamic pattern of animal behavior changes with physiological health. This is an important step toward making animal behavior research scientific and rigorous, and it is also an important basis for mathematical modeling of behavior. The behavior of animals is closely related to their physical state and health: different behaviors convey different health information. Behavior is composed of different poses, and pose estimation is an integral part of behavior analysis; obtaining the pose is the basis for understanding behavior. Therefore, there is an urgent need for new technologies to quickly and intelligently identify the behavior of animals and provide technical support for auxiliary assessment of their health status.
With the rapid development of machine vision, pose estimation [1] has gradually become an important part of many computer vision fields. It is widely used in areas such as human-computer interaction [2], intelligent surveillance, and pose tracking. Pose estimation mainly refers to the process of identifying and estimating the position of each part or joint point of a detection target in an image. Applications such as pose tracking and gait analysis require accurate pose estimation as support, so in-depth research on pose estimation is particularly important. The earliest pose estimation methods were mainly graph-structure model matching and K-means clustering, but these traditional methods are susceptible to interference from complex backgrounds and perform repeated calculations when parts of the body are occluded, resulting in low-accuracy features and suboptimal results. With the rapid development of computer vision, convolutional neural network (CNN) methods came into use. Compared with traditional methods, a CNN requires fewer weight parameters in its convolutional layers. Moreover, when an object moves to another position in the image, a CNN can still recognize it well, which greatly improves the accuracy of pose estimation.
Currently, there are two main problems in pose estimation. First, when finding the detection target in an image, the result is vulnerable to the camera angle and to occlusion of the target. Second, pose estimation algorithms mainly focus on improving performance while ignoring computational efficiency. Toshev et al. [3] proposed the DeepPose network to perform regression learning on the key points of the detection target, which is superior to traditional human pose estimation methods. Lin et al. [4] proposed the feature pyramid network (FPN), which can detect precise small targets but lacks contextual information, resulting in low detection performance and difficulty identifying key points in occluded parts. Chen et al. [5] proposed the cascaded pyramid network (CPN), which improves keypoint detection performance, but its generalization to multi-scale pose estimation is poor owing to the lack of structural information between joints. Sun et al. [6] proposed the high-resolution network (HRNet), which always maintains high-resolution feature maps and solves the problem of low detection accuracy of human key points at medium and low resolutions; however, under serious occlusion, HRNet's estimates have large errors. Lan [7] applied the generative adversarial network (GAN), which improves the generalization performance and accuracy of the network through adversarial training between the generator and the discriminator, but it has problems: for example, the loss function cannot indicate the training progress, and the generated samples lack diversity. Newell et al. [8] proposed the stacked hourglass network (SHN), which builds the feature layers by running the network from high resolution to low resolution and then from low resolution back to high resolution.
The layers are superimposed and the features of all layers are retained, so joint point information is learned via heat-map detection, and the detection results are inferred over the entire image, thereby improving the resolution of the image features. In computer vision, a general CNN structure can classify objects well: the computer knows what is in the picture but not where the subject is, because the feature map output by a CNN contains little spatial information. To accurately locate the joints of animals and human bodies, the computer must know not only what is in the picture but also where the subject is. The SHN can use multi-scale features to identify postures and predict the positions of animal and human joints in RGB images at the pixel level. Therefore, SHN is used as the basic network in this paper.
In recent years, with the continuous development of deep learning, animal pose estimation has been widely used in many fields. Feng et al. [9] proposed using spatiotemporal networks to combine skeleton features with contour features to automatically identify the actions of cats, thereby assisting in the protection of wild cats. Li et al. [10] designed a multi-scale domain adaptation module and proposed a way to learn from synthetic animal data. Zhou et al. [11] proposed a structured context-enhanced network based on a graphical model to estimate the posture of mice and analyze their behavior. Qi et al. [12] introduced the ASPP module into HRNet to enhance the network's ability to capture multi-scale information; however, the giant panda's posture varies and is affected by self-occlusion and lighting changes, so the effectiveness of the pose estimation still needs further evaluation on larger datasets. Zhou et al. [13] proposed a temporal-spatial consistency semi-supervised learning method based on the smoothness assumption and a twin network structure for the rhesus monkey pose estimation task. At present, although there are datasets containing animals, most are built for classification and detection, and only a small number annotate animal key points [14], [15], [16]. Because labeling large animal datasets is extremely expensive, existing work has mostly addressed this problem by employing synthetic animal datasets [17], [18], [19], [20]. Therefore, the requirement for a large-scale animal dataset remains a problem for animal pose estimation research.
In recent years, most researchers have used deeper CNNs to improve detection accuracy while ignoring the growing number of parameters and the computational complexity of the network, which seriously hinders use in resource-constrained situations. Based on the above research, this paper uses the stacked hourglass network as the basic network. Because the SHN model is large and slow, this paper adopts the lightweight conditional channel weighting (CCW) unit proposed by Yu et al. [21] in 2021, in which the Shuffle block from ShuffleNet is embedded into the convolutional neural network to obtain a lightweight scheme with better performance than the plain CNN. To address the computational bottleneck of the Shuffle block, the lightweight CCW unit replaces its convolutions, yielding a high-performance lightweight network, and the residual module is improved accordingly: a lightweight residual module, ICCW-Bottle, replaces the Bottleneck module to lighten the network and obtain feature information at different scales from feature maps with receptive fields of different sizes. To address the loss of a large amount of feature information after pooling operations, a lightweight dual-branch fusion module is proposed that fuses high-level semantic information and low-level detail features to reduce information loss. At the same time, unlike most studies, to address the lack of large animal datasets, this paper trains the model by combining synthetic and real animal datasets. While reducing the size and complexity of the model parameters, the accuracy of the network model is improved, minimizing cost while achieving the best effect.
Following this introductory Section I, Section II introduces the proposed network setup and the idea of combining several methods for pose estimation, which improves accuracy while reducing part of the network's complexity and parameters. Section III presents the experimental results and analysis, discusses the results, and compares the model's performance with the CC-SSL method. Finally, Section IV concludes the paper and presents future work.

II. THE PROPOSED NETWORK SETUP
In this paper, SHN is used as the basic network architecture, and a residual module based on the efficient channel attention improved conditional channel-weighted (ECA-ICCW) method, ICCW-Bottle, is proposed to replace the bottleneck module, which retains and accumulates the feature maps of different receptive fields while making the network more lightweight. Then, a lightweight dual-branch feature fusion module is added to the improved lightweight network, and the context information is aggregated under the condition that the parameter amount does not change much. This, in turn, reduces the amount of model parameters and improves the accuracy of pose estimation. The overall network model is shown in Fig. 1.

A. STACKED HOURGLASS NETWORK
The backbone network of this paper is the SHN for joint point localization. SHN uses the residual module Bottleneck as the basic module to construct the entire network. The module has two paths. The main path includes three convolutional layers to extract high-level features: the first layer uses a 1 × 1 convolution kernel, the second a 3 × 3 kernel, and the third a 1 × 1 kernel. Before each convolutional layer, the input passes through a batch normalization (BN) layer and a ReLU layer; the structure is shown in Fig. 2. A branch is split off from the feature map before each pooling layer. To preserve the spatial information of the picture, joint point information is extracted at the original size (Outsize), 1/2 of the original size (Outsize/2), 1/4 of the original size (Outsize/4), and 1/8 of the original size (Outsize/8). Each pass through a pooling layer reduces the resolution and computational complexity of the feature map, and image features are also extracted through the residual module. The feature map is then upsampled by nearest-neighbor interpolation to recover the higher resolution. The features extracted by the network incorporate multi-scale and contextual information, retaining the information of all layers while keeping the same size as the original image; therefore, the network does not change the image size. Its structure is shown in Fig. 3.
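The downsample-upsample pattern described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' implementation: the module names, recursion depth, and channel counts are our assumptions, chosen so that one hourglass pools down to 1/8 of the input size and restores the original resolution with nearest-neighbor interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Simplified Bottleneck: BN-ReLU-Conv chain (1x1 -> 3x3 -> 1x1) plus a skip path."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(c), nn.ReLU(inplace=True), nn.Conv2d(c, c // 2, 1),
            nn.BatchNorm2d(c // 2), nn.ReLU(inplace=True), nn.Conv2d(c // 2, c // 2, 3, padding=1),
            nn.BatchNorm2d(c // 2), nn.ReLU(inplace=True), nn.Conv2d(c // 2, c, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class Hourglass(nn.Module):
    """One hourglass: branch off before pooling, recurse at half resolution, then upsample and fuse."""
    def __init__(self, depth, c):
        super().__init__()
        self.skip = Residual(c)                                   # keeps the current resolution
        self.down = Residual(c)                                   # processes the pooled features
        self.inner = Hourglass(depth - 1, c) if depth > 1 else Residual(c)
        self.up = Residual(c)

    def forward(self, x):
        skip = self.skip(x)                                       # branch split off before pooling
        y = F.max_pool2d(x, 2)                                    # halve the resolution
        y = self.up(self.inner(self.down(y)))
        y = F.interpolate(y, scale_factor=2, mode="nearest")      # nearest-neighbor upsampling
        return skip + y                                           # fuse multi-scale features

hg = Hourglass(depth=3, c=64)                                     # depth=3 reaches 1/8 of the input size
out = hg(torch.randn(1, 64, 64, 64))
print(out.shape)                                                  # same spatial size as the input
```

Because every pooling step is paired with an upsampling step and a skip branch, the output feature map has exactly the input's spatial size, matching the "network does not change the image size" property described above.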
Existing studies have shown that adding attention mechanisms such as SENet [22], CBAM [23], and CA [24] to CNNs can greatly improve the performance of the network. However, many existing methods are devoted to designing ever more complex attention mechanisms to achieve better performance, which inevitably increases the number of parameters and the computation of the model.
Therefore, Wang et al. [25] proposed the ECA module, shown in Fig. 4. The ECA module is a new method to capture local cross-channel information interaction. It performs cross-channel interaction without reducing the channel dimension, balancing computational performance and model complexity. Avoiding dimensionality reduction is important for learning channel attention, and proper cross-channel interaction can significantly reduce complexity while maintaining performance. Suppose the output of a convolutional block is χ ∈ R^(W×H×C), where W, H, and C represent the width, height, and number of channels, respectively, and let y = GAP(χ), where GAP denotes global average pooling. The ECA module realizes the information interaction between channels through a fast one-dimensional convolution with kernel size k, as follows:

ω = σ(C1D_k(y)) (1)

where σ is the sigmoid function, C1D is the 1D convolution, k is the kernel size of the 1D convolution, and ω is the resulting channel attention weight.
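A minimal PyTorch sketch of this ECA operation, following the published ECA-Net design (the kernel size k = 3 is an illustrative choice, and the class name is ours):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv across channels -> sigmoid gate."""
    def __init__(self, k=3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        # 1D conv over the channel dimension: only k parameters, no dimensionality reduction
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        y = self.gap(x)                          # (B, C, 1, 1): per-channel descriptor
        y = y.squeeze(-1).transpose(-1, -2)      # (B, 1, C): treat channels as a 1D signal
        y = self.conv(y)                         # local cross-channel interaction
        y = y.transpose(-1, -2).unsqueeze(-1)    # back to (B, C, 1, 1)
        return x * self.sigmoid(y)               # reweight the input channels

x = torch.randn(2, 64, 32, 32)
print(ECA()(x).shape)                            # torch.Size([2, 64, 32, 32])
```

Note that the whole module has only k learnable parameters, which is what makes ECA attractive for a lightweight design.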

2) CONDITIONAL CHANNEL WEIGHTING
In existing research, pose estimation has usually required high-resolution feature maps to achieve high performance, which inevitably increases the number of parameters and the computation of the model. Yu et al. proposed the lightweight CCW unit, which embeds the Shuffle block from ShuffleNet into the CNN to obtain a lightweight scheme with better performance than the plain CNN. Because the 1 × 1 convolutions in the Shuffle block cause heavy computation, the lightweight conditional channel weighting unit is introduced to replace them, producing a high-performance lightweight network, as shown in Fig. 5.
Assume that the number of channels of the input and output feature maps of a convolution is C. The per-position time complexity of a 1 × 1 convolution is then C², while that of a 3 × 3 depthwise convolution is 9C, i.e., linear in C. In the Shuffle block, the complexity of the two 1 × 1 convolutions is much higher than that of the depthwise convolution: 2C² > 9C when C ≥ 5. Therefore, to further reduce the computational complexity of the network, the lightweight CCW unit is used to replace the 1 × 1 convolutions of the Shuffle block:

Y_s = W_s ⊙ X_s (2)

where W_s is a 3D weight tensor of size W_s × H_s × C_s computed from the input, and ⊙ is the element-wise multiplication operator. The complexity of this computing unit is linear in the number of channels C, which is much lower than that of the 1 × 1 convolution in the Shuffle block, while still exchanging information across channels and resolutions.
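The complexity comparison above can be checked numerically. The helper below counts per-position multiplications for the two 1 × 1 convolutions in a Shuffle block against the 3 × 3 depthwise convolution; it is a back-of-envelope sketch (helper names are ours) that ignores constant factors such as spatial size.

```python
def pointwise_cost(c):
    """Multiplications per spatial position of a 1x1 conv with c input and c output channels."""
    return c * c

def depthwise_cost(c, k=3):
    """Multiplications per spatial position of a kxk depthwise conv over c channels."""
    return k * k * c

# The two 1x1 convs dominate the Shuffle block once c >= 5, since 2*c^2 > 9*c.
for c in (5, 32, 256):
    assert 2 * pointwise_cost(c) > depthwise_cost(c)

# CCW replaces each 1x1 conv with an element-wise channel weighting whose
# cost is linear in c: one multiplication per channel per position.
def ccw_cost(c):
    return c

print(pointwise_cost(256), depthwise_cost(256), ccw_cost(256))  # 65536 2304 256
```

At C = 256, the element-wise weighting is 256 times cheaper per position than a 1 × 1 convolution, which is the source of the lightweight gain claimed above.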

3) ECA-ICCW
According to the structure of the SHN, there are many residual modules in the network, which makes the network large and the data processing redundant. Each residual module requires one standard 3 × 3 convolution and two 1 × 1 convolutions. When the network input is a high-resolution image, the standard convolution easily brings a huge number of parameters and computations, and it also lacks information exchange across channels. This motivated convolution variants such as dilated (atrous) convolution [26], grouped convolution [27], and depthwise separable convolution [28].
Therefore, this paper proposes ECA-ICCW, an improved conditional channel weighting based on efficient channel attention (ECA), to replace the 3 × 3 convolution in the residual module Bottleneck, as shown in Fig. 6. In ECA-ICCW, the channels are not split; the feature map is input directly into two branches. Both branches use a 3 × 3 convolution with a stride of 2 to reduce the height (H) and width (W) of the feature map, thus reducing the amount of computation. A Concat operation is then performed on the outputs of the two branches. The resulting number of channels is twice that of the original input, which widens the network and increases the number of channels without significantly increasing FLOPs, strengthening its feature extraction ability. Finally, a channel shuffle is performed to exchange information between different channels. Since the module is an improvement of the lightweight CCW unit, it involves only a small number of parameters, and appropriate cross-channel interaction can bring significant performance gains while keeping the network lightweight and significantly reducing model complexity. The ReLU activation function truncates negative inputs to 0. Negative values are generally considered to represent background or noise information, so this truncation excludes interference and increases network sparsity. However, the sparsity brought by ReLU may also hinder training: negative values do not necessarily carry interference information, and truncating them to 0 shields the feature, which may cause the model to miss some key features during learning. With a large learning rate, most neurons may enter the truncated state, and truncated neurons cannot learn again in the current training.
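Before turning to the activation function, the dual-branch downsampling and channel shuffle described above can be sketched as follows. This is an illustrative reconstruction from the text (layer choices and names are our assumptions), showing the stride-2 branches, the channel concatenation, and the shuffle that mixes information across branches.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Interleave channels across groups so information crosses branch boundaries."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()           # swap group and channel axes
    return x.view(b, c, h, w)

class DualBranchDown(nn.Module):
    """Two stride-2 3x3 branches -> concat (doubles channels) -> channel shuffle."""
    def __init__(self, c):
        super().__init__()
        self.b1 = nn.Conv2d(c, c, 3, stride=2, padding=1, bias=False)
        self.b2 = nn.Conv2d(c, c, 3, stride=2, padding=1, bias=False)

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x)], dim=1)  # halve H and W, double C
        return channel_shuffle(y, groups=2)

m = DualBranchDown(32)
print(m(torch.randn(1, 32, 64, 64)).shape)       # torch.Size([1, 64, 32, 32])
```

The shuffle costs no parameters at all; it is a pure memory permutation, which is why it suits a lightweight design.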
To solve the problem of negative truncation, this paper uses the Mish activation function [29] instead of ReLU in Bottleneck. As shown in Fig. 7, the function has no upper or lower bound, which avoids the saturation caused by capping. Moreover, Mish is a smooth, non-monotonic activation function, which makes gradient descent smoother and allows information to flow deeper into the neural network, yielding better generalization ability and recognition accuracy. It also helps optimize back-propagation, reduces the parameter and computation requirements of the model, and improves information flow, making the network easier to train. Its expression is shown in Formula (3):

Mish(x) = x · tanh(ς(x)) (3)

where ς(x) = ln(1 + e^x) is the softplus activation function. Mish realizes self-gating, which makes it convenient to replace an activation function such as ReLU, which takes a single scalar input, without changing the network parameters. When compute unified device architecture (CUDA) is enabled, Mish can reduce graphics processing unit (GPU) transmission time and effectively improve training efficiency.
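The Mish function above can be written directly in code; a pure-Python sketch (the numerically stable softplus form is our choice):

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def mish(x):
    """Mish(x) = x * tanh(softplus(x)): smooth, non-monotonic, unbounded above."""
    return x * math.tanh(softplus(x))

print(mish(0.0))   # 0.0 -- but unlike ReLU there is no hard truncation at zero:
print(mish(-0.3))  # a small negative output, so negative inputs are not zeroed out
```

The second call illustrates the key difference from ReLU discussed above: a mildly negative input still produces a (small) nonzero response, so the corresponding feature is attenuated rather than shielded entirely.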

2) ICCW-BOTTLE
To keep the number of network parameters constant, transmit the required information flow completely, and minimize the input loss of the next module, a residual module, ICCW-Bottle (ECA-ICCW-Bottleneck), based on the lightweight ECA-ICCW unit is proposed to improve the accuracy of the feature map, as shown in Fig. 8. In the ICCW-Bottle module, a skip connection joins the shallow convolution output with the convolution output of the last layer. Pose estimation needs to detect key points at different scales, so it is necessary to obtain visual information at different scales from feature maps with receptive fields of different sizes. The feature maps of the first few layers are connected by channel concatenation, which retains and accumulates the feature maps of multiple receptive fields and improves the feature representation ability of the module. In addition, the ICCW-Bottle module uses the Mish activation function in place of ReLU to further improve its optimization and generalization ability. The ICCW-Bottle module extracts features better than the original Bottleneck module.

D. DUAL-BRANCH FEATURE FUSION MODULE
1) DEPTHWISE SEPARABLE CONVOLUTION
To solve the problem of large network parameter counts, Chollet [28] proposed depthwise separable convolution; its use is the main reason the MobileNet series reduces parameter and computation costs. Depthwise separable convolution is derived from the traditional convolution structure by splitting it into two convolutions: depthwise convolution (DW) and pointwise convolution (PW). DW first splits the multi-channel feature maps from the previous layer into single-channel feature maps, performs single-channel convolution on each, and then stacks them together again, adjusting only the spatial size of the feature map without changing the number of channels. The resulting feature map is then convolved by PW, whose kernel size is 1 × 1; each filter contains kernels for the same number of channels as the previous layer and outputs one feature map. This convolution method greatly reduces the amount of calculation compared with conventional convolution; compared with directly using 3 × 3 convolution, it is more efficient and significantly reduces the number of parameters and calculations of the model. Standard and depthwise separable convolutions are shown in Fig. 9. Assume that the input feature is C × H × W and, after a convolution kernel of size K × K, the output feature is N × H × W.
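The DW + PW factorization can be expressed in PyTorch as follows (a standard sketch; class and variable names are ours). The `groups=c_in` argument is what makes the first convolution depthwise: each input channel gets its own K × K filter.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DW: one KxK filter per input channel (groups = C); PW: 1x1 conv that mixes channels."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

conv = DepthwiseSeparableConv(32, 64, k=3)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 3*3*32 + 32*64 = 2336, vs. 3*3*32*64 = 18432 for a standard 3x3 conv
```

For this 32-to-64-channel example the factorized version needs roughly an eighth of the parameters of the standard convolution, consistent with the formulas below.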
Then, the parameter count (P1) of standard convolution is given by

P1 = K² × C × N (4)

The computational cost (F1) of standard convolution is given by

F1 = K² × C × N × H × W (5)

The parameter count (P2) of depthwise separable convolution is given by

P2 = K² × C + C × N (6)

The computational cost (F2) of depthwise separable convolution is given by

F2 = K² × C × H × W + C × N × H × W (7)

Then, the ratio γ of depthwise separable convolution to standard convolution is given by

γ = P2 / P1 = F2 / F1 = 1/N + 1/K² (8)

Analysis of the above formulas shows that depthwise separable convolution reduces the number of parameters and the computation of the model by roughly a factor of K² (when N is much larger than K²), lowering the time and space complexity of the network model and greatly shortening the convolution operation time. Therefore, this method can be used to save computation cost and lighten the network structure.
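The parameter and cost ratios above can be verified numerically (illustrative layer sizes; helper names are ours):

```python
def standard_params(K, C, N):
    """P1 = K^2 * C * N: one KxK filter per (input channel, output channel) pair."""
    return K * K * C * N

def separable_params(K, C, N):
    """P2 = K^2 * C + C * N: depthwise KxK filters plus a 1x1 pointwise conv."""
    return K * K * C + C * N

K, C, N, H, W = 3, 64, 128, 56, 56
P1, P2 = standard_params(K, C, N), separable_params(K, C, N)
F1 = P1 * H * W                    # F1 = K^2 * C * N * H * W
F2 = P2 * H * W                    # F2 = (K^2 * C + C * N) * H * W
gamma = P2 / P1

print(P1, P2)                      # 73728 8768: roughly an 8x parameter reduction
assert abs(gamma - (1 / N + 1 / (K * K))) < 1e-12   # gamma = 1/N + 1/K^2
assert F2 / F1 == gamma            # the same ratio holds for the computational cost
```

With K = 3 and large N, γ approaches 1/9, i.e., the factorized convolution costs about one ninth of the standard one.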

2) DUAL-BRANCH FEATURE FUSION MODULE
Lin et al. [30] noted that feature layers at different depths have large semantic differences. The shallow features of a CNN contain more location information but insufficient semantic information; the deep features contain richer semantic information, which benefits regression prediction of the heat-map center point, but their location information is coarse. Moreover, the shallow localization information of the SHN is easily lost. Considering the detection accuracy of large-scale animal targets, deep and shallow feature fusion is introduced. In this paper, the preliminarily extracted features of the network are passed into the last hourglass network through depthwise separable convolution. As shown in Fig. 1, with the number of parameters essentially unchanged, the high-level semantic information and low-level detail features are fully integrated to obtain more accurate feature information and effectively improve the detection accuracy of difficult samples.

E. LOSS FUNCTION
At present, only a small number of animal datasets annotate animal key points, and it is expensive to annotate large animal datasets. Therefore, in this paper, a cheaper synthetic animal dataset rendered with ground-truth annotations is used to train the model jointly with the real animal dataset. According to the study of Mu et al. [31], models trained jointly on synthetic and real animal datasets can achieve better results than models trained only on real animal datasets. We first trained the model using only the synthetic dataset and obtained the initial model f(0). Then, we used the synthetic and real datasets to train f(0), repeating the iteration n times. In the nth iteration, we used (Xs, Ys) and (Xt, Ŷt(n)) combined with L(n) to train the model. This allows the network to distinguish different pixels, makes training on foreground pixels converge faster, and gives higher overall sensitivity to small errors on foreground pixels. In this paper, the loss function L(n) is defined as the mean square error of the heat maps of the source dataset (Xs, Ys) and the target dataset Xt. The loss function structure and algorithmic complexity are low, and it can accurately predict the heat maps of the joint points. The formula is shown in (9):

L(n) = MSE(f(n)(Xs), Ys) + Σj MSE(f(n)(Xjt), Ŷjt(n)) (9)

where f(n) is the trained model, Xjt is the jth image in the target dataset, and Ŷt(n) is the pseudo-label generated in the nth training round.
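An illustrative PyTorch version of this joint heat-map loss (function and variable names are our own, and in practice the pseudo-labels for the real images would come from the previous iteration's model):

```python
import torch
import torch.nn.functional as F

def joint_heatmap_loss(model, x_syn, y_syn, x_real, y_pseudo):
    """MSE on synthetic ground-truth heat maps plus MSE on real-image pseudo-label heat maps."""
    loss_syn = F.mse_loss(model(x_syn), y_syn)       # source term: (Xs, Ys)
    loss_real = F.mse_loss(model(x_real), y_pseudo)  # target term: (Xt, pseudo-labels)
    return loss_syn + loss_real

# toy check with a 1x1-conv stand-in "model" and 4-joint 64x64 heat maps
model = torch.nn.Conv2d(4, 4, 1, bias=False)
x = torch.randn(2, 4, 64, 64)
loss = joint_heatmap_loss(model, x, torch.rand(2, 4, 64, 64), x, torch.rand(2, 4, 64, 64))
assert loss.item() >= 0
```

In an iterative scheme, one would regenerate `y_pseudo` from the current model after each round and retrain, matching the f(0) → f(n) procedure described above.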

III. EXPERIMENTAL RESULTS AND ANALYSIS
This paper combines the advantages of CCW and ECA and performs dual-branch feature fusion to form the proposed model. The specific network structure is shown in Table 1, where H and W represent the height and width of the input image, respectively. The lightweight ECA-ICCW unit replaces the 3 × 3 convolution of the residual module, and the residual module is improved into ICCW-Bottle, which lightens the network while retaining the feature maps of different receptive fields. At the same time, depthwise separable convolution is used for dual-branch feature fusion to obtain deep and shallow information. Compared with the original SHN, the lightweight SHN reduces the parameter size of the network far more effectively.

A. DATASETS
The SHN was used as the basic network of the experiment, and the experiment used the synthetic animal dataset and the TigDog [32] real animal dataset to train, validate, and test the network. The TigDog dataset provides key point annotations for horses and tigers. The images of horses were from YouTube, with a training-to-test ratio of 5:1: 8380 images were used as the training set and the remaining 1772 as the test set. The images of tigers came from National Geographic documentaries, with a training-to-test ratio of 4:1: 6523 images were used as the training set and the remaining 1765 as the test set. The synthetic animal dataset contained images of five animal types: horses, tigers, sheep, dogs, and elephants. Each animal category had 10,000 images: 8000 used for training and 2000 for testing.
To verify the generalization performance of the network model, we also tested it on the VisDA2019 dataset. The dataset has six domains: real, sketch, clipart, painting, infograph, and quickdraw. We mainly used sketch, painting, and clipart to test the generalization performance of the network and verify its advanced nature.

B. EVALUATION
The Percentage of Correct Keypoints (PCK) is the most commonly used pose estimation evaluation standard. PCK refers to the proportion of correctly estimated key points, that is, the proportion of detected key points whose normalized distance to the corresponding ground truth is less than a set threshold. A detected joint is considered correct if the distance between the predicted joint and the true joint is within a certain threshold. PCK@0.05 refers to the percentage of correct key points at a threshold of 0.05. The specific formula is as follows:

PCK_i^k = (1/P) Σ_p 1(d_pi / d_p^def ≤ T_k)
PCK_mean^k = (1/N) Σ_i PCK_i^k

where PCK_i^k represents the PCK of key point i under threshold T_k, PCK_mean^k represents the mean PCK of the algorithm over the N key points under threshold T_k, k indexes the threshold, i is the id of the key point, p indexes the P animals evaluated, d_pi is the Euclidean distance between the predicted position of key point i in the pth animal and its manually labeled position, d_p^def is the scale factor of the pth animal, and T_k is a manually set threshold.
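A reference implementation of PCK on toy data (pure Python; the function and variable names are ours, and `scale` plays the role of the per-animal normalization factor d_p^def):

```python
import math

def pck(pred, gt, scale, thresh=0.05):
    """Fraction of keypoints whose normalized distance to ground truth is <= thresh.

    pred, gt: lists of (x, y) keypoint coordinates for one animal;
    scale: normalization factor (e.g., a bounding-box or torso size).
    """
    correct = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) / scale <= thresh
    )
    return correct / len(gt)

gt = [(10.0, 10.0), (50.0, 50.0), (90.0, 10.0)]
pred = [(11.0, 10.0), (50.0, 58.0), (90.0, 10.5)]
print(pck(pred, gt, scale=100.0, thresh=0.05))  # 2/3: the second joint is off by 8 px (> 5% of scale)
```

Averaging this per-keypoint over all animals, and then over keypoint ids, yields the PCK_mean^k quantity defined above.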

C. IMPLEMENTATION DETAILS
The PyTorch framework was used to implement the network architecture. To obtain accurate network parameters and effectively train and optimize the network, a high-performance NVIDIA GeForce RTX 3090 was used for training. The software platform was Python 3.8. The number of stacked hourglass modules was 4. The RMSProp [33] optimizer was selected to optimize the model. Training ran for 200 epochs with a batch size of 10; the initial learning rate was 2.5 × 10−4, with a learning rate decay coefficient of 0.1 applied at epochs 120 and 180 (dividing the learning rate by 10 each time).
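This schedule maps directly onto standard PyTorch components; the sketch below is a configuration illustration (the stand-in model and the bare loop are our assumptions), not the authors' training script.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)          # stand-in for the pose network
optimizer = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)
# multiply the learning rate by 0.1 at epochs 120 and 180
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[120, 180], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over the joint synthetic + real batches would go here ...
    optimizer.step()                       # placeholder step (no gradients in this sketch)
    scheduler.step()

print(scheduler.get_last_lr())             # learning rate after both decays (about 2.5e-6)
```

`MultiStepLR` applies the decay exactly at the listed milestones, reproducing the "divide by 10 at epochs 120 and 180" behavior described above.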
To evaluate the advancement and effectiveness of our method, we improved and optimized the original network and then conducted experiments on the TigDog dataset and the synthetic dataset. The input image was cut to 256 × 256, and the image was randomly rotated and flipped to enhance the data. Finally, it was compared with other advanced animal pose estimation networks. Fig. 10 shows the heatmap results obtained when training on images.

D. MODEL COMPARISON EXPERIMENT
The experimental results are shown in Table 2, which lists the results of training different networks on the TigDog dataset. On this dataset, the proposed method reduces the model parameters and computation by 57.6% and 53.0%, respectively, compared with CC-SSL, while improving PCK@0.05 by 5.5%. Compared with OpenPose, it reduces the parameters and computation by 54.8% and 79.9% and improves PCK@0.05 by 12.3%; compared with BlazePose, by 4.8% and 65.1% with a 9.5% PCK@0.05 gain; and compared with MobileNet, by 37.2% and 19.7% with a 9.2% PCK@0.05 gain. These results show that the parameters and computational complexity of the proposed method are significantly reduced, indicating the effective performance of the proposed model.

Table 3 shows the experimental results on the TigDog dataset for horses, comparing the PCK@0.05 accuracy of the method proposed in this paper with existing methods. ''Real'' means the model was trained only with the real animal pose dataset, while ''Syn'' means the model was trained only with synthetic data. The results in Table 3 show that, compared with the CC-SSL method, the method in this paper reduced the model parameter size in the horse experiment while simultaneously improving the PCK@0.05 accuracy of pose estimation by 5.5%; the results of training directly on real images were close. Table 4 shows the experimental results on the TigDog dataset for tigers, comparing the PCK@0.05 accuracy of the proposed method with other advanced methods.
From the experimental results in Table 4, the PCK@0.05 of the method in this paper was 3.9% higher than that of CC-SSL in the tiger experiment. Tigers usually live in forests and are often occluded by their surroundings, whereas the synthetic animal dataset used for training contains no such occlusion, so it was difficult for the model to adapt to heavily occluded scenes. Thus, none of the methods in Table 4 achieved the same accuracy as on horses.
As shown in Tables 3 and 4, the proposed method improved PCK@0.05 accuracy compared with CycleGAN [34], BDL [35], CyCADA [36], and CC-SSL [31] while reducing the model parameter size. Fig. 11 shows the visualization results of pose estimation and local segmentation; our method produces more accurate predictions. Fig. 12 shows that our model can also accurately predict some difficult keypoints: the trained model accurately handles blurry pictures, overlapping limbs, and lying tigers (first row); it still estimates accurately when people appear in the scene captured by the camera (second row); it accurately predicts animal keypoints in complex environments (third row); and it accurately captures animal posture when animals move fast in their natural habitat. In addition, as shown in Fig. 13, the method also performs good pose estimation for other animal categories, such as sheep, elephants, and dogs.

The detection accuracy for horses reached 74%, while the accuracy for tigers in their natural habitat was only 66%. This is because tigers are often shielded by the surrounding environment: although our model obtains good detection results for animals in general environments, the results decline for animals in severely occluded environments. Fig. 15 shows the confusion matrix for the joint points of horses. The accuracy for the eyes and chin reached above 90%, and the mean accuracy was 74%. This shows that our model has a strong ability to cope with pose estimation and achieves good recognition results.

E. GENERALIZATION TEST ON VISDA2019
To verify the generalization performance of the model, we also used images from the Visual Domain Adaptation Challenge dataset (VisDA2019) for testing. The dataset contains six domains: real, sketch, clipart, painting, infograph, and quickdraw. This paper mainly used the sketch, painting, and clipart domains to test the generalization performance of the network, as shown in Fig. 13 (from left to right for each animal: clipart, painting, and sketch).
Experiments showed that both the CC-SSL method and the method in this paper outperformed the model trained only on real images, thus proving the feasibility of jointly training the model with synthetic and real datasets, as shown in Table 5.

F. MODEL ABLATION EXPERIMENT
Based on CC-SSL, this paper proposes a lightweight SHN algorithm. To address the large number of parameters in deep CNNs, a lightweight residual module is proposed to obtain feature information at different scales, and a lightweight dual-branch fusion module is proposed to address the problem that a large amount of feature information is easily lost after the network pooling operation. To prove the effectiveness and advancement of each key module of the proposed model, we performed ablation experiments on the TigDog dataset and the synthetic animal dataset with horses as the experimental subject, and compared the results with CC-SSL. A ''√'' indicates that the model contains the corresponding module; the experimental results are shown in Table 6. Table 6 shows that, under the PCK@0.05 index and compared with CC-SSL, replacing the original Bottleneck with the proposed ICCW-Bottle residual module (without ECA-ICCW) improved accuracy by 3.1%; replacing the convolution of the residual module with the lightweight ECA-ICCW unit improved accuracy by 3.4%; and adding the dual-branch feature fusion module improved accuracy by 1.5%. As the table shows, the final model was 5.5% more accurate than the baseline, fully demonstrating the feasibility of the method proposed in this paper.
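The dual-branch fusion idea can be illustrated with a minimal sketch. The assumptions here are ours, not the paper's exact module: the pooled high-level semantic branch is upsampled back to the resolution of the low-level detail branch and fused by element-wise addition, so detail lost in pooling can be recovered from the skip branch (the real module would use learned convolutions rather than plain addition):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def dual_branch_fusion(low_detail, high_semantic):
    """Illustrative fusion: upsample the pooled semantic branch to the
    detail branch's resolution and merge by element-wise addition."""
    return low_detail + upsample2x(high_semantic)

detail = np.ones((64, 32, 32))     # low-level, high-resolution features
semantic = np.ones((64, 16, 16))   # high-level, pooled features
fused = dual_branch_fusion(detail, semantic)
```

The key property the sketch demonstrates is that the fused map keeps the detail branch's spatial resolution while incorporating the semantic branch's content.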

IV. CONCLUSION
Animal pose estimation is very important for animal behavior detection, motion analysis, and medical rescue. Although there are many deep-learning-based animal studies, animal pose estimation is rarely addressed. In this paper, based on the stacked hourglass network, we design a lightweight ECA-ICCW unit whose 3 × 3 convolution replaces that of the residual module, thereby improving the residual module; this was verified in the ablation experiment (the PCK@0.05 index reaches 73.15%), effectively reducing the model parameters and improving the model's ability to obtain information at different scales. Secondly, to address the problem that a large amount of feature information is easily lost after the network pooling operation, a dual-branch feature fusion method is proposed that enables the network to fully extract and fuse contextual feature information (the PCK@0.05 index of this method reached 71.80%). The experimental results on the TigDog dataset show that the method in this paper improved pose estimation accuracy by 5.5% compared with CC-SSL, while reducing the number of parameters and the computation of the network and achieving a balance of accuracy and speed.
However, the work in this paper still needs improvement: (1) the problem of animal occlusion remains to be solved.
(2) At present, the research on animal pose estimation in this paper is limited to a single animal. In the future, we will carry out further research on pose estimation for multiple animal individuals, which will be of great significance for researchers to accurately understand animal behavior.