Fast and Lightweight Human Pose Estimation

Although significant progress has been achieved in pose estimation, most top-performing methods adopt complex architectures and incur large computational cost to reach higher performance. Owing to the limited resources of edge devices, such top-performing methods are difficult to run at fast inference speed in practice. To address this issue, we propose a fast and lightweight human pose estimation method that maintains high performance at a much lower computational cost. Specifically, the proposed method consists of two parts, i.e., a fast and lightweight pose network (FLPN) for pose estimation and a novel lightweight bottleneck block for reducing computational cost, which together integrate a simple network and lightweight bottlenecks into an efficient method for accurate pose estimation. In the lightweight bottleneck block, we introduce the structural similarity measurement (SSIM) to determine the appropriate ratio of intrinsic feature maps and reduce the model size. Furthermore, an attention mechanism is adopted in our lightweight bottleneck block to model contextual information. We demonstrate the performance of the proposed method with extensive experiments on two standard benchmark datasets, comparing our method with state-of-the-art methods. On the COCO keypoint detection dataset, our proposed method attains accuracy similar to these state-of-the-art methods, while their computational cost is more than 7 times that of ours.


I. INTRODUCTION
The goal of estimating human pose from input images can be stated simply: precisely localize human anatomical keypoints (elbows, wrists, knees, etc.). Human pose estimation, a fundamental task in computer vision, is extensively adopted for action recognition [24], [25], pose tracking [26], and human-computer interaction [27].
Recently, multiple tasks related to human pose estimation have been extensively studied in various fields [28]-[30], [33]. We focus on single-person pose estimation, which is the basis of related vision tasks such as multi-person pose estimation, video-based pose estimation, and pose tracking.
We contend that applying a lightweight model for real-time human pose estimation is one of the major unaddressed issues. To the best of our knowledge, there have been quite few works on lightweight human pose estimation methods. Yet lightweight human pose estimation networks, with small model size, low computational consumption, and high accuracy, are suitable for direct deployment on resource-limited devices such as mobile phones and smart laptops. The majority of state-of-the-art methods that reach higher performance levels rely on complex networks with massive parameters and numerous floating-point operations (FLOPs). Despite their top performance, inference delay is one of the major drawbacks of these complex models with large computation. Besides, complex models with huge numbers of parameters also demand high memory.
Intuitively, if we aim at designing lightweight pose estimation networks, it is reasonable to focus on simple pose estimation networks and efficient bottleneck blocks. Among top-performing networks, SimpleBaseline [13] has provided prior knowledge on designing a simple yet efficient network for pose estimation and on exploring how simple an efficient model could be. Inspired by its graceful design, the lightweight bottleneck of the Lightweight Pose Network (LPN) [10] exploits depth-wise convolution for a low-memory network architecture. At the same time, many lightweight bottleneck blocks adopted for image classification have been put forward to replace standard bottleneck blocks, such as the MobileNetV3 bottleneck [6] and the GhostNet bottleneck [9]. In practice, these methods can significantly reduce model size and computational complexity without too much performance degradation. To design an efficient lightweight network for human pose estimation, we need to explore the best balance between accuracy and computational cost. The major difficulty lies in how to trade off the performance and size of the network. We address this problem by using a simple network with a novel lightweight bottleneck. As shown in Figure 1, SSIM is introduced to compare the similarity among feature maps and determine the ratio of intrinsic feature maps. A novel bottleneck block is proposed to reduce computational cost while maintaining efficient performance. (The method is described in greater detail in Section III.) To demonstrate the effectiveness and efficiency of the proposed fast and lightweight human pose estimation method, extensive experiments were conducted on two benchmark datasets: the COCO keypoint detection dataset [11] and the MPII Human Pose dataset [14].
The experimental results confirm that our proposed method has much smaller model size and computational complexity than existing state-of-the-art methods [12], [13].
The contributions of this paper are as follows.
• Observing that most state-of-the-art methods adopt standard bottlenecks with heavy computational cost in their networks, we design a novel bottleneck that drastically reduces the parameters and floating-point operations (FLOPs). This allows us to deploy networks with complex architectures on resource-limited computational platforms.
• We propose a lightweight human pose estimation method by redesigning a quite simple network with surprising effectiveness. Further, a series of lightweight bottlenecks is trained, following a stem block with two convolutional layers, to learn high-to-low resolution representations for predicting accurate heatmaps.
The remainder of this paper is organized as follows. Section II describes related works on top-performing and lightweight human pose estimation networks, lightweight blocks for various vision tasks, and attention mechanisms. Section III illustrates the proposed lightweight bottleneck and simple network. The detailed implementation and experimental results are presented in Sections IV and V, respectively. Finally, Section VI summarizes the paper.

II. RELATED WORKS
A. TOP-PERFORMING HUMAN POSE ESTIMATION
With the introduction of DeepPose by Toshev and Szegedy [20], the problem of human pose estimation transformed from pictorial structure models to DNN-based keypoint regression. Since then, a mass of studies in the human pose estimation field have achieved significant improvements by adopting DCNNs [10], [12], [13], [15], [18]-[20], [24]-[27]. Two mainstream approaches, keypoint regression [20] and keypoint heatmaps [10], [12], [13], have become dominant in this field. Compared with keypoint regression, the keypoint heatmap approach is extensively adopted in human pose estimation tasks owing to its overwhelming superiority in performance.
Newell et al. [32] proposed a dominant approach, the Stacked Hourglass Network, on the MPII benchmark [14], which is widely adopted by superior methods. Features are processed in a multi-stage architecture in which repeated bottom-up, top-down processing and skip-layer connections are critical to capturing the various spatial relationships between body parts. Chen et al. [34] proposed the Cascaded Pyramid Network (CPN), which integrates all levels of feature representations to relieve the problem of invisible keypoints. To obtain rich high-resolution representations for accurate and precise human pose estimation, Sun et al. [12] proposed the High-Resolution Network (HRNet), which achieved state-of-the-art performance by connecting multi-resolution subnetworks in parallel. HRNet starts with a high-resolution subnetwork as the first stage, gradually adds high-to-low resolution subnetworks one by one to form more stages, and connects the multi-resolution subnetworks in parallel. By repeatedly performing multi-scale fusions among these parallel multi-resolution subnetworks, HRNet obtains rich high-to-low resolution representations from the other parallel representations over and over, and finally yields rich high-resolution representations. Through its superior pose estimation results on two benchmark datasets, Sun et al. empirically demonstrated the effectiveness of the multi-resolution subnetworks and the repeated multi-scale fusions.
Most of these prior works mainly focus on how to design a top-performing pose estimation method by adopting complex architectures or computationally expensive models, ignoring the limitations of edge devices such as inference latency and high memory demand.

B. LIGHTWEIGHT POSE ESTIMATION
Lightweight design for pose estimation has attracted little attention from researchers; research on lightweight designs that improve the efficiency of pose estimation remains rare. For model compression and execution speedup, Bulat and Tzimiropoulos [35] binarized the network architecture to adapt to edge devices. However, it suffers from a performance drop by a large margin. Observing these complex top-performing pose estimation algorithms, Xiao et al. [13] proposed the SimpleBaseline method, which is based on a residual backbone network followed by several deconvolutional layers. It shows that a simple network architecture can also achieve excellent performance on the COCO2017 benchmark [11]. Inspired by the design principles of SimpleBaseline [13], Zhang et al. [10] provided the Lightweight Pose Network (LPN), which has obvious superiority in terms of model size, computational complexity, and inference speed. Further, Zhang et al. empirically demonstrated the efficiency and effectiveness of their lightweight network on the challenging COCO2017 benchmark [11].

C. LIGHTWEIGHT BLOCK
The depth of representations is of crucial significance for the human pose estimation task. Based on comprehensive empirical evidence, He et al. [23] demonstrated that residual networks are easier to optimize and can gain accuracy from considerably increased depth. Therefore, most of the aforementioned methods adopt the ResNet series as their backbone network, substantially deeper than those used previously. However, top-performing methods based on ResNet are not suitable for direct deployment on resource-limited devices because of their heavy computation. In recent years, a series of compact networks [4]-[9] have been proposed with the increasing demand for lightweight models. Based on a streamlined architecture, MobileNet [4] adopts depth-wise separable convolutions to build lightweight deep neural networks and efficiently trades off between latency and accuracy. MobileNetV2 [5] introduced a new mobile architecture consisting of inverted residuals and linear bottlenecks. Through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm, MobileNetV3 [6] takes advantage of a novel architecture to further improve accuracy. Besides, it explores how automated search algorithms and network design can work together to improve the overall state of the art in mobile classification, detection, and segmentation. ShuffleNet [7] primarily introduces pointwise group convolutions and channel shuffle operations to extensively decrease computational cost. Ma et al. [8] presented a new architecture called ShuffleNetV2, and their work derives several practical guidelines for efficient network design; comprehensive experiments have demonstrated that it achieves state-of-the-art performance in terms of the speed-accuracy tradeoff. Han et al. [9] proposed a novel Ghost module to generate more feature maps from cheap operations. They applied a series of cheap linear transformations on intrinsic feature maps to generate many relevant feature maps, capturing the primary information at low computational cost.

D. ATTENTION MECHANISM
Recently, works based on the attention mechanism have achieved great success in various computer vision tasks such as image classification [9], [37], object recognition [38], and lightweight human pose estimation [10]. Chu et al. [36] first introduced the attention mechanism into pose estimation models. They proposed a method that incorporates convolutional neural networks with a multi-context attention mechanism in an end-to-end framework. The non-local network proposed by Wang et al. [3] employs self-attention mechanisms to model pixel-level pairwise relations for capturing long-range dependencies. Although gaining some performance in human pose estimation, the non-local operation computes the response at each query position at extensive computational cost. Hu et al. [2] proposed a novel architectural unit, termed the Squeeze-and-Excitation (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. These blocks can be stacked to form SENet, which significantly improves the performance of top-performing models at slight additional computational cost. Based on rigorous empirical analysis, Cao et al. [1] found that the global contexts modeled by the non-local network are almost the same for different query positions within an image. They designed a better instantiation, called the global context block (GCB), and constructed the global context network (GCNet), which matches the performance of the non-local network at significantly less computational cost. Therefore, it is applied in each bottleneck block of our model, which can increase the performance of our network without too much computational cost.

E. STRUCTURAL SIMILARITY
Generally, feature maps are highly structured in that their pixels exhibit strong relationships, especially when they are spatially and temporally proximate. These relationships usually carry extremely significant information about the structure of objects in visual scenes. Wang et al. [39] proposed the structural similarity (SSIM) formulation for measuring quality from the perspective of image formation. Their measure comprises three components: luminance, contrast, and structure.
Unlike [39], we take two channels of the feature maps as the input of the method. SSIM is then adopted to evaluate the structural similarity among the feature maps derived from one original image. Finally, the value computed by SSIM indicates the similarity between the two input feature maps; it ranges from 0 to 1, and a larger value means a stronger relationship. These output values determine the compression ratio of intrinsic feature maps in our module.
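As a concrete illustration, a simplified single-window SSIM between two feature maps can be sketched as below. This is an illustrative sketch only: Wang et al. [39] compute SSIM over local sliding windows, and the stabilizing constants `c1` and `c2` here are assumed values.

```python
import numpy as np

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM combining luminance, contrast, and
    structure terms; values near 1 indicate strong similarity."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.arange(16.0).reshape(4, 4)
print(global_ssim(a, a))  # identical maps give SSIM = 1.0
```

In our module, such similarity scores between channel pairs guide the choice of the compression ratio: the more similar (redundant) the maps, the fewer intrinsic maps are kept.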
The aforementioned lightweight blocks aim to reduce computational cost without too much accuracy decrease. However, such methods (e.g., LPN) have not been extensively used for human pose estimation, and their performance has not been demonstrated by extensive experiments. Therefore, we propose a simple but powerful human pose estimation model with a lightweight bottleneck block that can significantly reduce the computational cost. In Section V, we analyze performance by comparing parameter size, GFLOPs, and inference time among state-of-the-art methods.

III. APPROACH
As described in Section I, a lightweight but powerful pose estimator is extremely hard to design owing to complex architectures and vast computational cost. To overcome this limitation, we propose a novel lightweight human pose estimation method by redesigning a simple network (FLPN) with several groups of lightweight bottleneck (smart bottleneck) blocks. The smart bottleneck is mainly composed of two stacked smart modules and a global context (GC) [1] block. The smart module utilizes cheap operations to generate more feature maps from intrinsic feature maps. The structural similarity (SSIM) [39] measurement is adopted in the smart module to determine the appropriate proportion of intrinsic feature maps among the total feature maps. At the same time, we also append the GC block, which can effectively model the global context by capturing long-range dependencies with little increase in computational cost. To achieve an extremely efficient architecture and high performance, we propose the FLPN network with a simple architecture.
We first explain the architecture of the novel lightweight module, the bottleneck, and the efficient network, and then analyze the computational cost of the proposed method.

A. SMART MODULE FOR MORE FEATURES
The success of GhostNet [9] proposed by Han et al. has provided the prior knowledge that intermediate feature maps computed by mainstream CNNs often contain considerable redundancy, and some of them are similar in many respects. Inspired by this idea, we point out that it is necessary to compare the feature maps of different channels for one input image and determine the ratio of intrinsic feature maps in the module.
Top-performing human pose estimation models [10], [13], [34] often adopt ResNet as their backbone, with a large number of convolution layers that result in extremely massive computational cost. Given the redundancy that pervades the feature maps computed by these high-performance models, as shown in Figure 1, Han et al. [9] proposed the Ghost module to reduce the demanded resources. The Ghost module adopts a handful of intrinsic feature maps to generate more feature maps with cheap transformations. However, a question remains: why should the intrinsic feature maps occupy exactly half of the output maps? In this section, we further explore the choice of compression ratio.
In practice, X ∈ R^(c×h×w) is the input data, where c is the number of input channels, and h and w represent the height and width of the input data, respectively. Y ∈ R^(n×h'×w') is the output map with n channels. The operation of the convolutional layer, which transforms the input data X into the output map Y, can be formulated as:

Y = f ∗ X + b,    (1)

where ∗ is the convolution operation, b is the bias term, and f ∈ R^(c×k×k×n) denotes the convolution filters of the convolutional layer. Besides, h' and w' are the height and width of the output feature maps, and k × k is the kernel size of the convolution filters f.
During the convolutional process, a standard convolutional layer can be parameterized by a convolution kernel of size D_k × D_k × c × n, where D_k is the spatial dimension of the kernel, assumed to be square. Correspondingly, the number of FLOPs can be formulated as n × h' × w' × c × k × k, which often results in massive computational cost because of the large number of filters n and the abundant channel number c.
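For concreteness, the parameter and FLOP counts above can be checked with a few lines of Python (a minimal sketch; the layer sizes in the example are illustrative and not taken from our network):

```python
def conv2d_cost(c, n, k, h_out, w_out):
    """Parameters (n*c*k*k, bias omitted) and FLOPs
    (n*h'*w'*c*k*k) of a standard k x k convolution."""
    params = n * c * k * k
    flops = n * h_out * w_out * c * k * k
    return params, flops

# A 3x3 convolution mapping 64 -> 256 channels on a 64x48 map:
params, flops = conv2d_cost(c=64, n=256, k=3, h_out=64, w_out=48)
print(params, flops)  # 147456 parameters, ~0.45 GFLOPs
```

Both counts scale with the product n × c, which is why reducing the effective number of standard convolution filters is the main lever for a lightweight design.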
Based on Eq. 1, the large number of parameters in f and b to be optimized can be reduced by shrinking the dimensions of the input and output feature maps. We point out that the ratio of intrinsic feature maps should change dynamically with the number of redundant feature maps among the total feature maps. Therefore, we introduce SSIM to estimate the similarity among these feature maps and find the appropriate ratio of intrinsic feature maps. Specifically, m_d intrinsic feature maps Y_d ∈ R^(h'×w'×m_d) are produced by a primary convolution:

Y_d = f' ∗ X,

where f' ∈ R^(c×k×k×m_d) denotes the convolution filters, m_d is much lower than n, and the bias is omitted for simplicity. To keep the spatial size of the output feature maps consistent, a series of cheap linear operations is applied to each intrinsic feature map in Y_d to generate t relevant feature maps:

y_{i,j} = φ_{i,j}(y_i),  ∀ i = 1, . . . , m_d,  j = 1, . . . , t,

where y_i is the i-th intrinsic feature map in Y_d and φ_{i,j} is the j-th linear operation generating the j-th ghost feature map y_{i,j}. Obviously, each y_i can have t ghost feature maps. The final φ_{i,t} is an identity mapping that preserves the intrinsic feature map. Finally, we integrate these intrinsic feature maps and relevant feature maps into the output feature maps with a consistent spatial size. In terms of computational cost, these linear operations φ are much cheaper than the standard convolution. As shown in Figure 3, the blue block represents the SSIM procedure, which determines the ratio of intrinsic feature maps. Under the guidance of SSIM, we increase the number of linear operations, which reduces the computational cost of our proposed method. The identity branch represents the intrinsic feature maps, and the others, generated by the cheap operations (φ), represent relevant features. The standard convolution above is shown for comparison with ours.
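The feature-map generation scheme can be sketched in a few lines of bookkeeping. Here `cheap_op` is a toy stand-in for the cheap linear operations φ (the module uses depth-wise convolutions in practice), and all names are illustrative:

```python
import numpy as np

def assemble_output(intrinsic_maps, cheap_op, t):
    """Build n = m_d * t output maps: each intrinsic map y_i yields
    t - 1 cheap feature maps phi_{i,j}(y_i) plus itself (the
    identity mapping phi_{i,t})."""
    out = []
    for y in intrinsic_maps:              # y: one (h, w) intrinsic map
        for j in range(t - 1):
            out.append(cheap_op(y, j))    # ghost (relevant) maps
        out.append(y)                     # identity keeps the intrinsic map
    return np.stack(out)

cheap = lambda y, j: 0.5 * (j + 1) * y   # toy linear operation
Y = assemble_output(np.ones((4, 8, 8)), cheap, t=2)
print(Y.shape)  # (8, 8, 8): 4 intrinsic maps expanded to 8 channels
```

The only learned convolution acts on the m_d intrinsic maps; the remaining n − m_d channels come from the cheap per-map operations, which is where the savings originate.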
As the module can be easily integrated into top-performing human pose estimation networks to reduce computational cost, we further analyze the gain in the theoretical speed-up ratio and the total number of parameters. Per intrinsic feature map there is one identity mapping, so there are m_d × (t − 1) = (n/t) × (t − 1) linear operations in total, and the averaged kernel size of each linear operation equals d × d. For simplicity, we take the same kernel size for the linear operations and the ordinary convolution layer within one module. The total number of parameters of an ordinary convolutional layer is n · k · k · c. Comparatively, the parameters of our module comprise the primary convolution, m_d · k · k · c, and the linear operations, m_d · (t − 1) · d · d. The compression ratio of parameters can be calculated as:

r_c = (n · c · k · k) / ((n/t) · c · k · k + (n/t) · (t − 1) · d · d) ≈ (t · c) / (c + t − 1) ≈ t,

where d × d has a magnitude similar to k × k. Similarly, the theoretical speed-up ratio can be formulated as:

r_s = (n · h' · w' · c · k · k) / ((n/t) · h' · w' · c · k · k + (n/t) · (t − 1) · h' · w' · d · d) ≈ t,

since t ≪ c. In our paper, t ≥ 2, which leads to a large decrease in computational cost.
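The two ratios can be verified numerically (a small sketch; the factor n/t cancels, so only c, k, d, and t matter):

```python
def compression_ratio(c, k, d, t):
    """r_c = n*c*k*k / ((n/t)*c*k*k + (n/t)*(t-1)*d*d); dividing
    numerator and denominator by n/t leaves
    t*c*k*k / (c*k*k + (t-1)*d*d). The speed-up ratio r_s has the
    same form because the spatial factor h'*w' also cancels."""
    return (t * c * k * k) / (c * k * k + (t - 1) * d * d)

# With d = k the ratio reduces to t*c / (c + t - 1), close to t:
print(compression_ratio(c=256, k=3, d=3, t=2))  # ~1.992
```

For any t ≪ c the ratio stays close to t, confirming that doubling the share of cheap operations roughly halves both parameters and FLOPs.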

B. BUILDING LIGHTWEIGHT BOTTLENECK BLOCK
The bottleneck was first introduced in ResNet [23]. As shown in Figure 4, a bottleneck block consists of several convolutional layers and a shortcut connection. Correspondingly, the total number of parameters of a standard bottleneck block (a 1 × 1, a 3 × 3, and a 1 × 1 convolution) can be represented as

P_std = N · M + 3 · 3 · M · M + M · N = 2 · N · M + 9 · M².

For a bottleneck, the number of input channels N is consistent with the number of output channels and N = M × expansion, where M represents the hidden dimension and expansion is a hyperparameter with a default value of 4. Therefore, the above equation can be simplified as

P_std = 17 · M².

Based on three modifications of the standard bottleneck block, we introduce a novel bottleneck (the smart bottleneck) specially designed for lightweight networks. Taking advantage of the lightweight module, we first replace the standard convolutions with our smart modules, which significantly reduces the computational cost. Then, the standard 3 × 3 convolution is replaced by a 3 × 3 depth-wise convolution, which can generate features with far fewer parameters. Finally, as illustrated in Figure 5, we also adopt the global context (GC) block in the lightweight bottleneck, which captures long-range dependencies without too much computational cost. As shown in Figure 4, the structure of the bottleneck is similar to the bottleneck in ResNet. The most obvious difference from the Ghost bottleneck [9] is the application of the two stacked smart modules. Except for downsampling between stages, our bottleneck maintains the same number of channels within a stage. To a certain extent, in designing a lightweight bottleneck, we aim to reduce the operations between channels rather than increase the number of channels. The number of parameters of the smart bottleneck is

P_smart = (2 · N · M)/t + 9 · M = (8 · M²)/t + 9 · M,

where t is the compression ratio of a module. Thus, the final reduction in parameters is

P_std / P_smart = 17 · M² / ((8 · M²)/t + 9 · M),

where t is determined by SSIM in the range [2, 16].
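Under the simplifications above (GC-block and shortcut parameters omitted), the parameter counts of the two bottlenecks can be compared directly. The helper names below are ours, and the smart-bottleneck count follows the simplified formula in the text rather than the released code:

```python
def std_bottleneck_params(M, expansion=4):
    """1x1 (N->M) + 3x3 (M->M) + 1x1 (M->N) with N = expansion*M,
    i.e. 2*N*M + 9*M^2 = 17*M^2 for expansion = 4."""
    N = expansion * M
    return N * M + 9 * M * M + M * N

def smart_bottleneck_params(M, t, expansion=4):
    """The two 1x1 convolutions become smart modules (parameters
    divided by t) and the 3x3 convolution becomes depth-wise
    (9*M parameters): 8*M^2/t + 9*M."""
    N = expansion * M
    return (N * M + M * N) / t + 9 * M

M = 64
print(std_bottleneck_params(M))         # 69632 = 17 * 64^2
print(smart_bottleneck_params(M, t=2))  # 16960.0, roughly 4x fewer
```

Even at the smallest ratio t = 2, the smart bottleneck cuts parameters by about a factor of four, and larger t (up to 16, as chosen by SSIM) pushes the reduction further.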

C. FAST AND LIGHTWEIGHT NETWORK
The simple and widely adopted pipeline [10], [13] for estimating human pose consists of a stem that decreases the size of input images, a main body that learns features while continuously reducing the resolution, and a regressor that estimates heatmaps by transforming low-resolution heatmaps into full-resolution heatmaps and choosing the precise keypoint positions. Following this simple design principle, SimpleBaseline [13], which achieves top performance in human pose estimation, adopts a series of standard bottlenecks as the main body and employs several deconvolutional layers as the regressor. Inspired by its simple architecture, we basically follow the architecture of SimpleBaseline [13] for its superiority and replace the standard bottleneck blocks used in the backbone with our smart bottleneck blocks. In the stem, we use two successive convolutional layers with a small kernel size (3 × 3) to reduce the resolution of input images, rather than a convolution layer with a large kernel size (7 × 7) followed by a max-pooling layer. The main body consists of a series of bottlenecks with gradually increasing channel numbers and decreasing feature-map resolution. These bottleneck blocks are grouped into four stages according to the input size of their feature maps. Under the guidance of the SSIM method, the appropriate ratio of intrinsic feature maps in the module varies with the depth of the network. We experimentally find that, during downsampling, the correlation among feature maps fades gradually: large feature maps contain many redundant features, so a low ratio there can drastically reduce the number of parameters and FLOPs. Throughout the network, we replace convolution layers with group convolution layers wherever possible to reduce parameters while preserving the quality of the feature maps. Finally, the group size of the group convolutions is chosen as the greatest common divisor of the input and output channel numbers. The architecture of our network is illustrated in Figure 2.
VOLUME 9, 2021
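The group-size rule at the end of the paragraph is a one-liner (a sketch using `math.gcd` from the standard library; the helper name is ours):

```python
from math import gcd

def group_count(c_in, c_out):
    """Choose the number of groups for a group convolution as the
    greatest common divisor of the input and output channel
    counts, so both divide evenly into the groups."""
    return gcd(c_in, c_out)

print(group_count(64, 256))  # 64
print(group_count(48, 36))   # 12
```

Choosing the greatest common divisor maximizes the grouping (and thus the parameter savings) while keeping every group's input and output channel counts integral.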

IV. IMPLEMENTATION
In this section, we describe the training setup, the two publicly available benchmark datasets used for human pose estimation, and the evaluation protocol. Moreover, we introduce the evaluation metrics for each dataset.

A. TRAINING SETUP
The same parameters and settings as SimpleBaseline [13] and LPN [10] were adopted to guarantee a fair comparison between the two methods [10], [13] and ours. Our network and the two networks mentioned above were all initialized by pre-training on the ImageNet classification task [21]. The Adam optimizer [22] was adopted. As in the two methods, the base learning rate was initialized at 1e-3 and dropped to 1e-4 at 90 epochs and to 1e-5 at 120 epochs; the networks were trained for 140 epochs in total. In addition to a network similar to SimpleBaseline, we also use the novel lightweight module and SSIM to fine-tune the proposed network. Following [10], [12], [13], the input image is cropped to a fixed-aspect-ratio bounding box around the person. Then, we resize the bounding box to 256 × 192 to train our model. Moreover, data augmentation, composed of scaling, rotation, and flipping, was applied to train the baseline methods [10], [13] and our proposed method. For the COCO2017 dataset [11], random rotations in [−40, 40] degrees, random scalings in [0.7, 1.3], and horizontal flips were adopted. For the MPII dataset [14], random rotations in [−30, 30] degrees, random scalings in [0.75, 1.25], and horizontal flips were adopted. In the testing phase, we use human body detection bounding boxes on COCO2017 to crop the images and feed them into our model to evaluate its performance. During actual inference, a human body detector finds the human body box, which is fed into our model to generate human poses.
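The step learning-rate schedule described above can be written as a small helper (a sketch; whether each drop applies exactly at or just after the boundary epoch is an assumption):

```python
def learning_rate(epoch):
    """Base LR 1e-3, dropped to 1e-4 at epoch 90 and to 1e-5 at
    epoch 120; training runs for 140 epochs in total."""
    if epoch < 90:
        return 1e-3
    if epoch < 120:
        return 1e-4
    return 1e-5

print([learning_rate(e) for e in (0, 89, 90, 120, 139)])
```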

B. DATASETS
1) COCO KEYPOINT DETECTION DATASET
The COCO dataset [11], widely used for human pose keypoint detection, contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. The train2017/val2017/test-dev2017 splits, containing 57K, 5K, and 20K images respectively, are used for training our model, evaluating our approach locally, and evaluating our approach on the online platform. Most existing methods [12], [13] evaluate performance on 256 × 192 input images by cropping the height and width in a 4 : 3 ratio; therefore, we trained our network with 256 × 192 input images to ensure a fair comparison.
The mean average precision (AP) and average recall (AR), based on object keypoint similarity (OKS), were adopted as evaluation metrics. The standard evaluation metric is based on Object Keypoint Similarity (OKS):

OKS = Σ_i exp(−d_i² / (2 s² k_i²)) δ(v_i > 0) / Σ_i δ(v_i > 0),

which converts the Euclidean distance d_i between the ground-truth keypoint and the estimated keypoint to a value between 0 and 1. Here v_i indicates the visibility of the ground truth, s indicates the object scale, and k_i is a per-keypoint constant that controls falloff. We report AP^50 (the average precision at OKS = 0.50), AP^75 (the average precision at OKS = 0.75), AP (the mean of AP scores at OKS = 0.50, 0.55, . . . , 0.95), and AR (the mean of average recall scores at OKS = 0.50, 0.55, . . . , 0.95). Further, AP^M for medium objects (object area between 32² and 96²) and AP^L for large objects (object area larger than 96²) are reported.
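The OKS computation can be sketched directly from the formula (an illustrative sketch; the keypoint constants in the example are made-up values, not COCO's official per-keypoint constants):

```python
import numpy as np

def oks(d, s, k, v):
    """Object Keypoint Similarity: d = per-keypoint Euclidean
    distances, s = object scale, k = per-keypoint falloff
    constants, v = visibility flags; only labeled keypoints
    (v > 0) contribute."""
    d, k, v = map(np.asarray, (d, k, v))
    vis = v > 0
    return np.exp(-d[vis] ** 2 / (2 * s ** 2 * k[vis] ** 2)).sum() / vis.sum()

# Two visible keypoints predicted exactly on the ground truth:
print(oks([0.0, 0.0], 1.0, [0.5, 0.5], [2, 2]))  # 1.0
```

Thresholding this score at 0.50, 0.55, . . . , 0.95 and averaging precision over the thresholds yields the AP metric reported in our tables.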

2) MPII HUMAN POSE ESTIMATION DATASET
The MPII Human Pose dataset [14] consists of real-world images taken from various human daily activities, with full-body pose annotations. The dataset contains over 25K images and 40K subjects, of which 12K subjects are used for testing and the remainder for training. The data augmentation and training setup are the same as for COCO2017, except that input images were cropped to 256 × 256 to provide a fair comparison with other methods.
The PCKh (head-normalized probability of correct keypoint) score is adopted as the standard evaluation metric on MPII. A joint is correct if it falls within αl pixels of the ground-truth position, where α is a constant and l is the head size, corresponding to 60% of the diagonal length of the ground-truth head bounding box. PCKh@0.5 (α = 0.5) is used for evaluating joint localization accuracy: a prediction counts as correct when its distance to the ground truth is less than 0.5 times the head segment length.
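PCKh@0.5 can likewise be computed directly (a sketch with illustrative inputs):

```python
def pckh(preds, gts, head_sizes, alpha=0.5):
    """Fraction of joints whose predicted position lies within
    alpha * l pixels of the ground truth, l being the per-sample
    head size."""
    correct = 0
    for (px, py), (gx, gy), l in zip(preds, gts, head_sizes):
        dist = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
        correct += dist <= alpha * l
    return correct / len(preds)

# One exact hit and one 10-pixel miss with head size 10:
print(pckh([(0, 0), (10, 0)], [(0, 0), (0, 0)], [10, 10]))  # 0.5
```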

C. EVALUATION PROTOCOL
General accuracy evaluation metrics were applied to the proposed method for a fair comparison with other methods. Apart from that, we designed experiments to compare the performance of our proposed lightweight method with state-of-the-art human pose estimation methods. These experiments were divided into the following three parts for detailed analysis.
• Experiment 1: To be general and fair, we compare our method with state-of-the-art methods on two publicly available benchmarks: COCO2017 and MPII. Besides, another major task was to identify an efficient network that occupies few resources while achieving high accuracy. The selected simple method then takes part in the next experiment.
• Experiment 2: Many lightweight models designed for image classification can make human pose estimation networks viable on mobile devices. In terms of inference time, not all top-performing methods are suitable for mobile edge devices. Considering low computational cost and high performance, we use a lightweight bottleneck block to replace the bottleneck of the selected model and compare its performance with ours.
• Experiment 3: Intuitively, model size is an extremely significant factor in evaluating a model. Therefore, we adopted SSIM to fine-tune the proposed network and find the appropriate ratio that balances accuracy and model size.
Under this evaluation protocol, we can fairly compare our proposed method with others in terms of performance, computational cost, and parameter size. In the next section, we use quantitative and qualitative results to demonstrate the performance of our method.

V. EXPERIMENTS

A. EVALUATING PERFORMANCE
We compare our method with various top-performing methods on the COCO2017 and MPII datasets. For a fair comparison, we adopt the same person detector provided by HRNet [12], so that the inference time of all methods is evaluated under a uniform criterion. It is thus reasonable to compare our method with these top-performing methods under the evaluation protocol of Section IV. For lightweight models whose official code is not available, we directly adopt their published results for comparison.

1) COMPARISONS WITH SOTA METHODS
The performance comparisons under the aforementioned protocol are summarized in Table 1 and Table 2. Notably, iterative training on one's own pre-trained model can increase accuracy, but at the cost of considerable training time. The lightweight pose method LPN has not released its code online; we therefore adopt a reproduced version of LPN without iterative training for comparison, whose results are much lower than those of the top-performing methods.
As shown in Table 1, our method achieves performance comparable to the SimpleBaseline and HRNet series. On the COCO2017 validation set, our method surpasses SimpleBaseline with various backbones (e.g., ResNet50, ResNet101, and ResNet152). In particular, the parameter count and FLOPs of our method with a ResNet50 backbone are less than one-third and one-eighth of those of SimpleBaseline (ResNet50), respectively. Although our method achieves lower accuracy than the HRNet series, its parameters and FLOPs are much lower than theirs: the parameter count of HRNet-W32 is more than three times that of our method (ResNet50), and its FLOPs are six times ours. On the COCO2017 test-dev set, our method achieves results comparable to these top-performing methods at the same input size of 256 × 192. Even though LPN adopts an iterative training strategy, our lightweight method still matches the performance of the LPN series.
In summary, our proposed method significantly outperforms all state-of-the-art methods in parameter size and computational cost while maintaining accuracy similar to these top-performing methods. Compared with LPN, our method has more parameters and similar computational cost, but higher accuracy.
To further compare our method with these top-performing methods, we trained our model on the MPII dataset and evaluated its performance. The results are reported in Table 3 and Table 4. Unlike the experiments above, the input image size here is 256 × 256. On the MPII validation set, our method achieves similar performance with much less computational cost. The small parameter and FLOP counts combined with high accuracy demonstrate the efficiency of our method.

2) COMPARISONS ON VARIOUS BOTTLENECK BLOCKS
Considering the computational cost and inference time of the top-performing methods in Table 1 and Table 3, we chose SimpleBaseline (ResNet50) as the base model that best balances accuracy and real-time performance. To further analyze the performance of various bottleneck blocks, we conduct experiments on the COCO2017 dataset: SimpleBaseline with the ghost bottleneck block and SimpleBaseline with the MobileNetV3 bottleneck block are trained and compared with our method in Table 5 under the same experimental setting.
Ideally, integrating SimpleBaseline with a lightweight bottleneck is an effective recipe for human pose estimation. We therefore adopted ResNet50 as the backbone and evaluated the MobileNetV3 bottleneck under the same experimental setting. SimpleBaseline with the GhostNet bottleneck performs better than SimpleBaseline with the MobileNetV3 bottleneck and the original SimpleBaseline. Notably, our method outperforms all of these variants with an accuracy of 71.3. Our method has a parameter size similar to SimpleBaseline with the GhostNet bottleneck, while its FLOPs are considerably lower than those of the variants with the other bottlenecks.

3) COMPARISONS ON INFERENCE SPEED
To compare inference performance, we conducted all experiments on the same platform, which consists of an Intel 2.8 GHz CPU and one NVIDIA GeForce GTX 1080Ti GPU. In this section, we mainly compare the inference time of these top-performing methods. The inference time consists of detecting human body boxes and estimating human keypoints. In our experiments, we adopt the same human detector for all methods; its detection time is about 2.5 seconds, while the keypoint estimation time varies by method. As shown in Figure 6, our method achieves the fastest speed among these methods. Following [40]-[42], we also connect all the predicted keypoints in Figure 7. The results confirm the efficiency of the proposed method.
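A timing comparison of this kind is usually averaged over several runs after a few warm-up iterations, so that one-off costs (model loading, caching) do not distort the measurement. The sketch below shows this pattern; the function name and parameters are our own illustrative choices:

```python
import time

def average_inference_time(model_fn, inputs, warmup=3, runs=10):
    """Average wall-clock time per call of model_fn, after warm-up.

    model_fn: callable performing one inference (e.g., keypoint estimation).
    inputs: iterable of pre-loaded inputs (e.g., cropped person images).
    """
    # Warm-up passes: exclude caching and lazy-initialization overhead.
    for _ in range(warmup):
        for x in inputs:
            model_fn(x)
    # Timed passes.
    start = time.perf_counter()
    for _ in range(runs):
        for x in inputs:
            model_fn(x)
    return (time.perf_counter() - start) / (runs * len(inputs))
```

When a GPU is involved, an explicit synchronization (e.g., `torch.cuda.synchronize()`) would be needed before reading the clock, since GPU kernels execute asynchronously.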

B. ABLATION STUDY
An ablation study is conducted to analyze the effect of each component of our method, including the lightweight bottleneck block, the redesigned network, and the SSIM method. Under the protocol described in Section IV, our method was trained under varying conditions, and we conducted extensive experiments on the COCO2017 dataset for a detailed component analysis.
(Figure: methods of Table 1 on a non-GPU platform, with the same input size 256 × 192 for all experiments; colors denote the same backbone with various bottleneck blocks; circle area represents FLOPs.)

1) LIGHTWEIGHT BOTTLENECK BLOCK
To demonstrate the superiority of the lightweight bottleneck block, we build our model with the lightweight bottleneck block both with and without the GC block, and compare the variants on the same platform. The following experiments were conducted on the COCO2017 validation set: the Smart bottleneck block and the Smart bottleneck block without a GC block are used in both training and inference to estimate the final heatmaps (denoted as ''Ours'' and ''Ours w/o GC block'', respectively).
As reported in Table 8, in terms of parameters and floating-point operations (FLOPs), our method shows only a small increase over ours w/o GC block (about 2.6818M and 0.1277G). These results justify the contribution of the GC block at little extra computational cost. Compared with the standard convolution bottleneck block (ours w/ standard bottleneck), our method is about 0.7 percent lower in accuracy; however, the parameter size and FLOPs of the standard bottleneck version are 2.3 times and 3.1 times those of our method. These results confirm that our method better balances accuracy and computational cost.
To further assess the lightweight bottleneck block, we directly employ two state-of-the-art lightweight bottleneck blocks in place of our proposed block, denoting the resulting models as ''ours w/ ghost bottleneck'' and ''ours w/ mobilenet-v3'', respectively. Table 8 shows that our method performs better than ours w/ ghost bottleneck (71.3 percent versus 69.7 percent) with similar parameters (10.1922M versus 10.1594M) and fewer FLOPs (1.2638G versus 1.7198G). As shown in Table 6, our method also significantly outperforms ours w/ mobilenet-v3 bottleneck (69.6 percent versus 68.1 percent) while requiring far less computational cost. Considering both accuracy and computational cost, our method achieves the best balance, as illustrated in Table 8.
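As a concrete illustration of the kind of block compared above — ghost-style cheap operations for intrinsic feature maps, followed by a global-context (GC) attention block — the following PyTorch sketch shows one possible design. The module names, the reduction factor, and the ratio of intrinsic maps are our own illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    """Simplified GCNet-style global context block: a softmax spatial
    attention pools a global context vector, which is transformed and
    added back to every position."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, kernel_size=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.LayerNorm([channels // reduction, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        attn = self.mask(x).view(n, 1, h * w).softmax(dim=-1)     # (n,1,hw)
        context = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2))
        context = context.view(n, c, 1, 1)
        return x + self.transform(context)

class LightweightBottleneck(nn.Module):
    """Ghost-style block: a pointwise conv produces the intrinsic maps,
    a cheap depthwise conv generates the remaining maps, and a GC block
    models contextual information."""
    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        intrinsic = out_ch // ratio
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, 1, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, out_ch - intrinsic, 3, padding=1,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(out_ch - intrinsic), nn.ReLU(inplace=True))
        self.gc = GCBlock(out_ch)

    def forward(self, x):
        y = self.primary(x)                        # intrinsic maps
        out = torch.cat([y, self.cheap(y)], dim=1) # + cheap maps
        return self.gc(out)
```

The `ratio` argument corresponds to the proportion of intrinsic feature maps discussed later in the SSIM experiments: a larger ratio means fewer expensive convolutions and more cheap operations.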

2) PROPOSED NETWORK
Different versions of the proposed network in the training and testing phases are illustrated in Table 6. Note that the max-pool layer may discard information useful for human pose estimation. Hence, we use a beginning block containing two sequential convolutional layers to replace the original max-pool layer, and denote the version without it as ''ours w/o beginning block''. Compared with our model, the max-pool layer reduces performance by around 0.3 percent. To evaluate the effect of SSIM, we also discard the SSIM method from our network, denoted as ''Ours w/o SSIM''. As illustrated in Table 6, our network with SSIM achieves higher performance than ''Ours w/o SSIM''; most importantly, SSIM also reduces the FLOPs. From Table 6 we infer that expanding the ratio of cheap operations in the basic module can reduce computational cost and increase performance. To further reduce computational cost while maintaining high performance, group deconvolution is applied in the regression phase.
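The two structural changes above — a beginning block of two sequential convolutions in place of the stem max-pool, and grouped deconvolution in the regression head — can be sketched as follows. The channel sizes, kernel sizes, and group count are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def beginning_block(in_ch=64, out_ch=64):
    """Two sequential 3x3 convolutions (stride 2, then stride 1) that
    halve the resolution like a stride-2 max-pool would, but learn the
    down-sampling instead of discarding information."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def group_deconv(in_ch=256, out_ch=256, groups=4):
    """Grouped 4x4 stride-2 deconvolution for up-sampling in the
    regression head; grouping divides the parameter count by the
    group factor."""
    return nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1,
                              groups=groups, bias=False)
```

Both pieces preserve the channel count while changing resolution by a factor of two, so they can drop into a SimpleBaseline-style stem and head without reshaping the rest of the network.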

3) STRUCTURAL SIMILARITY MEASUREMENT
In our earlier experiments, we found that the similarity among channel feature maps from one image changes with the stage of our network. The structural similarity decreases substantially in down-sampling stage0 and stage1, while the other two stages remain at a lower level. Therefore, we apply fixed ratios to the first two stages to explore the ideal model. Under the guidance of SSIM, we employ ten groups of data to evaluate the performance of our model.
As illustrated in Table 7, to simplify the compression process and make the compression rate suit our network, we use group numbers (i.e., 2, 4, 8, 16) to denote the proportion of intrinsic feature maps. Thanks to the proposed structural similarity measurement mechanism, our model effectively leverages the lightweight bottleneck block to reduce computational cost while maintaining high performance. In Table 7, we compare ten versions of our model to determine the best one.
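The channel-similarity measurement behind this choice of ratios can be sketched as follows, using a simple single-window SSIM over all channel pairs of a feature map; the exact windowing and constants used in the paper may differ:

```python
import numpy as np

def ssim_global(a, b, c1=1e-4, c2=9e-4):
    """Single-window SSIM between two 2-D maps (no sliding window).
    c1, c2 are the usual stabilization constants for unit dynamic range."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def mean_channel_ssim(fmap):
    """Average SSIM over all channel pairs of a (C, H, W) feature map.
    High values indicate redundant channels that cheap operations could
    generate, suggesting a larger ratio of cheap maps at that stage."""
    c = fmap.shape[0]
    scores = [ssim_global(fmap[i], fmap[j])
              for i in range(c) for j in range(i + 1, c)]
    return float(np.mean(scores))
```

Measuring this statistic per stage is what motivates applying larger compression ratios only where channel redundancy is actually high.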

VI. CONCLUSION
This paper presents a fast and lightweight method consisting of the FLPN network for accurate pose estimation, a Smart bottleneck block for reducing computational cost, and the SSIM method for refining the ratio of intrinsic feature maps, which shrinks the block size while maintaining high accuracy. Extensive experiments on the above datasets demonstrate that our method achieves accuracy similar to top-performing methods at a far lower computational cost. Considering inference time and computational cost, our method is well suited for deployment on edge devices. Finally, we hope our method inspires further work in the field of real-time and lightweight pose estimation.